DataTree.to_zarr with append_dim fails on empty groups #9858
Comments
Thanks for reporting - this seems like a bug with
I would expect the solution to have to consider #9106 and #9778.
I'm not sure exactly what you mean. A
As for the actual error, I think first we should figure out what the expected behaviour would be for

ds1 = xr.Dataset()
ds2 = xr.Dataset()
ds1.to_zarr(store)
ds2.to_zarr(store, append_dim='time')

because a datatree with a single empty group should behave the same as that. Concatenating two empty datasets along a new dimension does not raise:

In [6]: ds1 = xr.Dataset()
In [7]: ds2 = xr.Dataset()
In [8]: xr.concat([ds1, ds2], dim='time')
Out[8]:
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Following up on this,
I got the same error as in

Traceback (most recent call last):
File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 84, in <module>
tom_save()
File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 11, in tom_save
ds2.to_zarr(store, append_dim='time')
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/dataset.py", line 2595, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 2184, in to_zarr
dump_to_store(dataset, zstore, writer, encoding=encoding)
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 1920, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/zarr.py", line 889, in store
raise ValueError(
ValueError: append_dim='time' does not match any existing dataset dimensions {}
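
The message in this traceback suggests a simple membership test: the requested append_dim has to be one of the dimensions listed at the end of the message, and an empty Dataset lists none. A minimal pure-Python sketch of that kind of check follows. `check_append_dim` and `fails` are hypothetical helpers written for illustration; they only imitate the wording of the error and are not xarray's actual implementation.

```python
from collections.abc import Mapping


def check_append_dim(append_dim: str, existing_dims: Mapping[str, int]) -> None:
    """Raise if ``append_dim`` is not among ``existing_dims``.

    Hypothetical helper: it imitates the wording of the error in the
    traceback and is not taken from xarray's source.
    """
    if append_dim not in existing_dims:
        raise ValueError(
            f"append_dim={append_dim!r} does not match any existing "
            f"dataset dimensions {dict(existing_dims)}"
        )


def fails(append_dim: str, existing_dims: Mapping[str, int]) -> bool:
    """Return True when the check above would raise."""
    try:
        check_append_dim(append_dim, existing_dims)
    except ValueError:
        return True
    return False


# An empty Dataset has no dimensions, so no append_dim can ever match.
empty_fails = fails("time", {})
# A Dataset that already has a "time" dimension passes the check.
timed_fails = fails("time", {"time": 10, "lat": 25})
```

Under this reading, the empty-Dataset case cannot succeed for any choice of append_dim, which is why the expected behaviour has to be decided first.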
That's interesting. I would have thought that if "append" is supposed to be synonymous with "concat" here then the behaviour of
tagging @jhamman as zarr-in-xarray representative (and FYI @abarciauskas-bgse)
Another thing I noticed, @TomNicholas: let's suppose we can avoid the issue with the empty Dataset

for node in dt.subtree:
    at_root = node is dt
    if node.is_empty and node.is_root:  # New lines
        continue  # to avoid empty datasets here

Even if we avoid the empty Dataset issue at the root level, we will still encounter another challenge related to inheritance, as at the root only inherited coordinates are retained. Thus, the

ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()
# Split dataset in two
ds_first_half = ds.isel(time=range(0, int(len(ds.time)/2)))
ds_second_half = ds.isel(time=range(int(len(ds.time)/2), len(ds.time)))
ds_daily_fh = ds_first_half.resample(time="D").mean("time")
ds_weekly_fh = ds_first_half.resample(time="W").mean("time")
ds_monthly_fh = ds_first_half.resample(time="ME").mean("time")
# second half dtree
ds_daily_sh = ds_second_half.resample(time="D").mean("time")
ds_weekly_sh = ds_second_half.resample(time="W").mean("time")
ds_monthly_sh = ds_second_half.resample(time="ME").mean("time")
# first half dtree
dt_fh = xr.DataTree.from_dict(
    {
        "/temp": ds.drop_dims("time"),
        "/temp/daily": ds_daily_fh.drop_vars(["lat", "lon"]),
        "/temp/weekly": ds_weekly_fh.drop_vars(["lat", "lon"]),
        "/temp/monthly": ds_monthly_fh.drop_vars(["lat", "lon"]),
    }
)
print(dt_fh)
<xarray.DataTree>
Group: /
└── Group: /temp
│ Dimensions: (lat: 25, lon: 53)
│ Coordinates:
│ * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│ * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
├── Group: /temp/daily
│ Dimensions: (time: 365, lat: 25, lon: 53)
│ Coordinates:
│ * time (time) datetime64[ns] 3kB 2013-01-01 2013-01-02 ... 2013-12-31
│ Data variables:
│ air (time, lat, lon) float64 4MB 241.9 242.3 242.7 ... 295.5 294.8
├── Group: /temp/weekly
│ Dimensions: (time: 53, lat: 25, lon: 53)
│ Coordinates:
│ * time (time) datetime64[ns] 424B 2013-01-06 2013-01-13 ... 2014-01-05
│ Data variables:
│ air (time, lat, lon) float64 562kB 245.3 245.2 245.0 ... 296.2 295.6
└── Group: /temp/monthly
Dimensions: (time: 12, lat: 25, lon: 53)
Coordinates:
* time (time) datetime64[ns] 96B 2013-01-31 2013-02-28 ... 2013-12-31
Data variables:
air (time, lat, lon) float64 127kB 244.5 244.7 244.7 ... 297.4 297.4

Now, we can save the first half:

store = "dtree_inherited.zarr"
dt_fh.to_zarr(
    store,
    consolidated=True,
    write_inherited_coords=True,
)

Then, when appending new information, it generates the following error:

dt_sh = xr.DataTree.from_dict(
    {
        "/temp": ds.drop_dims("time"),
        "/temp/daily": ds_daily_sh.drop_vars(["lat", "lon"]),
        "/temp/weekly": ds_weekly_sh.drop_vars(["lat", "lon"]),
        "/temp/monthly": ds_monthly_sh.drop_vars(["lat", "lon"]),
    }
)
dt_sh.to_zarr(
    store,
    mode="a-",
    consolidated=True,
    append_dim="time",
    write_inherited_coords=True,
)
Traceback (most recent call last):
File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 85, in <module>
main()
File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 75, in main
dt_sh.to_zarr(
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/datatree.py", line 1699, in to_zarr
_datatree_to_zarr(
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/datatree_io.py", line 123, in _datatree_to_zarr
ds.to_zarr(
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/dataset.py", line 2595, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 2184, in to_zarr
dump_to_store(dataset, zstore, writer, encoding=encoding)
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 1920, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/zarr.py", line 889, in store
raise ValueError(
ValueError: append_dim='time' does not match any existing dataset dimensions {'lat': 25, 'lon': 53}

Let me know your thoughts.
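
One way to think about a workaround for both failures is to decide, per group, whether append_dim applies before writing that group. The sketch below does this on plain dictionaries that map a group path to the group's own dimension sizes. `split_groups_for_append` is a hypothetical helper, and the sizes are copied from the tree printed earlier in this comment. This is not what `DataTree.to_zarr` does today.

```python
from collections.abc import Mapping


def split_groups_for_append(
    group_dims: Mapping[str, Mapping[str, int]], append_dim: str
) -> tuple[list[str], list[str]]:
    """Partition group paths by whether they have ``append_dim``.

    Returns ``(appendable, skipped)``. Hypothetical helper for illustration.
    """
    appendable: list[str] = []
    skipped: list[str] = []
    for path, dims in group_dims.items():
        # Empty groups (dims == {}) and groups without append_dim are skipped.
        if append_dim in dims:
            appendable.append(path)
        else:
            skipped.append(path)
    return appendable, skipped


# Own (non-inherited) dimension sizes of each group in the example tree.
group_dims = {
    "/": {},
    "/temp": {"lat": 25, "lon": 53},
    "/temp/daily": {"time": 365, "lat": 25, "lon": 53},
    "/temp/weekly": {"time": 53, "lat": 25, "lon": 53},
    "/temp/monthly": {"time": 12, "lat": 25, "lon": 53},
}

appendable, skipped = split_groups_for_append(group_dims, "time")
```

With a real tree, the mapping could presumably be built from `dt.subtree`, using each node's `path` and the `sizes` of `node.to_dataset(inherit=False)`. The appendable groups could then be written one at a time with `Dataset.to_zarr(store, group=path, mode="a", append_dim="time")`, leaving the skipped groups untouched. Whether skipping is the right semantics is the open question in this thread.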
The newly created issue (#9892) is to capture specifically the
From talking with @TomNicholas and @jhamman, zarr currently does not support appending a new dimension with
Borrowing sample data from our unit tests:

import numpy as np
import xarray as xr

time = xr.DataArray(data=["2022-01", "2023-01"], dims="time")
stations = xr.DataArray(data=list("abcdef"), dims="station")
lon = [-100, -80, -60]
lat = [10, 20, 30]
# Set up fake data
wind_speed = xr.DataArray(np.ones((2, 6)) * 2, dims=("time", "station"))
pressure = xr.DataArray(np.ones((2, 6)) * 3, dims=("time", "station"))
air_temperature = xr.DataArray(np.ones((2, 6)) * 4, dims=("time", "station"))
dewpoint = xr.DataArray(np.ones((2, 6)) * 5, dims=("time", "station"))
infrared = xr.DataArray(np.ones((2, 3, 3)) * 6, dims=("time", "lon", "lat"))
true_color = xr.DataArray(np.ones((2, 3, 3)) * 7, dims=("time", "lon", "lat"))
sample_tree = xr.DataTree.from_dict(
    {
        "/": xr.Dataset(
            coords={"time": time},
        ),
        "/weather": xr.Dataset(
            coords={"station": stations},
            data_vars={
                "wind_speed": wind_speed,
                "pressure": pressure,
            },
        ),
        "/weather/temperature": xr.Dataset(
            data_vars={
                "air_temperature": air_temperature,
                "dewpoint": dewpoint,
            },
        ),
        "/satellite": xr.Dataset(
            coords={"lat": lat, "lon": lon},
            data_vars={
                "infrared": infrared,
                "true_color": true_color,
            },
        ),
    },
)
store = 'test_data_append_works.zarr'
sample_tree.to_zarr(store)
sample_tree.to_zarr(store, mode='a', append_dim="time")

This works. But then when you drop the time dimension from /weather:

sample_tree_drop = xr.DataTree.from_dict(
    {
        "/": xr.Dataset(
            coords={"time": time},
        ),
        "/weather": xr.Dataset(
            coords={"station": stations},
            data_vars={
                "wind_speed": wind_speed,
                "pressure": pressure,
            },
        ).drop_dims('time'),
        "/weather/temperature": xr.Dataset(
            data_vars={
                "air_temperature": air_temperature,
                "dewpoint": dewpoint,
            },
        ),
        "/satellite": xr.Dataset(
            coords={"lat": lat, "lon": lon},
            data_vars={
                "infrared": infrared,
                "true_color": true_color,
            },
        ),
    },
)
store = 'test_data_append_works.zarr'
sample_tree_drop.to_zarr(store)
sample_tree_drop.to_zarr(store, mode='a', append_dim="time")

You get a
But the

sample_tree_drop['/weather/temperature'].air_temperature.dims

returns ('time', 'station'). Instead of raising a

EDIT: TL;DR: #9892 might be on hold, but this issue can be fixed before that one.
What happened?
I found out that when dealing with nested DataTree and storing/saving in zarr (dt.to_zarr()), it creates an empty Dataset at the root level. I was trying to append an isomorphic datatree to an existing datatree stored in Zarr. It fails when passing the mode and dim parameters since the Dataset at the root level has no dimension or coordinates.
What did you expect to happen?
I wondered why we need an empty Dataset at the root level without data or coordinates. However, I am not sure if this is the best way to merge/concat/append two DataTrees.
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Environment