Parallel computing with dask¶
xarray integrates with dask to support parallel computations and streaming computation on datasets that don’t fit into memory.
Currently, dask is an entirely optional feature for xarray. However, the benefits of using dask are sufficiently strong that dask may become a required dependency in a future version of xarray.
For a full example of how to use xarray’s dask integration, read the blog post introducing xarray and dask.
What is a dask array?¶
Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.
Unlike NumPy, which evaluates operations eagerly, operations on dask arrays are lazy. They queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask for values (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.
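For example, arithmetic on a dask array returns immediately without touching the data, and only an explicit request for values triggers evaluation. A minimal sketch using dask.array directly, assuming dask is installed (the shape and chunk sizes are arbitrary):

import dask.array as da

# Build a 10000 x 10000 array of ones, split into 1000 x 1000 chunks.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# This only records tasks in a graph; no arithmetic happens yet.
y = (x + 1).sum(axis=0)

# Evaluation happens here, streaming over the chunks.
result = y.compute()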
The actual computation is controlled by a multiprocessing or thread pool, which allows dask to take full advantage of the multiple processors available on most modern computers.
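The choice of pool can be controlled through dask's scheduler setting. A minimal sketch, assuming a reasonably recent version of dask (the scheduler names below are dask's standard ones, and y is the lazy array from the sketch above):

import dask

# Use the multi-threaded scheduler globally (dask's default for arrays).
dask.config.set(scheduler='threads')

# Or pick a scheduler for a single computation only:
result = y.compute(scheduler='processes')    # pool of worker processes
result = y.compute(scheduler='synchronous')  # single-threaded; handy for debugging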
For more details on dask, read its documentation.
Reading and writing data¶
The usual way to create a dataset filled with dask arrays is to load the data from a netCDF file or files. You can do this by supplying a chunks argument to open_dataset() or by using the open_mfdataset() function.
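The example below reads a file named example-data.nc. A minimal sketch that creates such a file, so the snippet can be run end-to-end; the variable name, sizes, and dates are illustrative, not anything xarray requires:

import numpy as np
import pandas as pd
import xarray as xr

# Build a small dataset with a 'time' dimension and write it to netCDF.
temperature = xr.DataArray(
    np.random.randn(30, 180, 180),
    dims=('time', 'latitude', 'longitude'),
    coords={'time': pd.date_range('2015-01-01', periods=30)},
    name='temperature',
)
temperature.to_dataset().to_netcdf('example-data.nc')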
In [1]: ds = xr.open_dataset('example-data.nc', chunks={'time': 10})
In [2]: ds