Indexing and selecting data

Like pandas objects, xray objects support both integer-based and label-based lookups along each dimension. However, xray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.

Thus in total, xray supports four different kinds of indexing, as described below and summarized in this table:

Dimension lookup   Index lookup   DataArray syntax               Dataset syntax
----------------   ------------   ----------------------------   ----------------------------
Positional         By integer     arr[:, 0]                      not available
Positional         By label       arr.loc[:, 'IA']               not available
By name            By integer     arr.isel(space=0) or           ds.isel(space=0) or
                                  arr[dict(space=0)]             ds[dict(space=0)]
By name            By label       arr.sel(space='IA') or         ds.sel(space='IA') or
                                  arr.loc[dict(space='IA')]      ds.loc[dict(space='IA')]

Positional indexing

Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:

In [1]: arr = xray.DataArray(np.random.rand(4, 3),
   ...:                      [('time', pd.date_range('2000-01-01', periods=4)),
   ...:                       ('space', ['IA', 'IL', 'IN'])])
   ...: 

In [2]: arr[:2]
Out[2]: 
<xray.DataArray (time: 2, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) |S2 'IA' 'IL' 'IN'

In [3]: arr[0, 0]
Out[3]: 
<xray.DataArray ()>
array(0.12696983303810094)
Coordinates:
    time     datetime64[ns] 2000-01-01
    space    |S2 'IA'

In [4]: arr[:, [2, 1]]
Out[4]: 
<xray.DataArray (time: 4, space: 2)>
array([[ 0.26047601,  0.96671784],
       [ 0.33622174,  0.37674972],
       [ 0.12310214,  0.84025508],
       [ 0.44799682,  0.37301223]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IN' 'IL'

xray also supports label-based indexing, just like pandas. Because we use a pandas.Index under the hood, label based indexing is very fast. To do label based indexing, use the loc attribute:

In [5]: arr.loc['2000-01-01':'2000-01-02', 'IA']
Out[5]: 
<xray.DataArray (time: 2)>
array([ 0.12696983,  0.89723652])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
    space    |S2 'IA'

You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label based indexing in xray is inclusive of both the start and stop bounds.
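To make the label-to-position translation concrete, here is a minimal sketch that uses pandas directly rather than xray. It assumes only that a coordinate behaves like a plain pandas.Index, as the text above states; the space index is rebuilt by hand for illustration.

```python
import pandas as pd

# Stand-in for the 'space' coordinate used in this section.
space = pd.Index(['IA', 'IL', 'IN'])

# A single label maps to one integer position.
print(space.get_loc('IL'))               # 1

# A label slice maps to a positional slice whose stop is one past the
# stop label, which is why both bounds end up included.
print(space.slice_indexer('IA', 'IL'))   # slice(0, 2, None)

# An array of labels maps to an array of positions.
print(space.get_indexer(['IN', 'IL']))   # [2 1]
```

Once labels have been turned into positions this way, the rest of the lookup is ordinary integer indexing.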

Setting values with label based indexing is also supported:

In [6]: arr.loc['2000-01-01', ['IL', 'IN']] = -10

In [7]: arr
Out[7]: 
<xray.DataArray (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'

Indexing with labeled dimensions

With labeled dimensions, we do not have to rely on dimension order and can use the dimension names explicitly to slice data. There are two ways to do this:

  1. Use a dictionary as the argument for positional or label-based array indexing:

    # index by integer array indices
    In [8]: arr[dict(space=0, time=slice(None, 2))]
    Out[8]: 
    <xray.DataArray (time: 2)>
    array([ 0.12696983,  0.89723652])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
        space    |S2 'IA'
    
    # index by dimension coordinate labels
    In [9]: arr.loc[dict(time=slice('2000-01-01', '2000-01-02'))]
    Out[9]: 
    <xray.DataArray (time: 2, space: 3)>
    array([[  0.12696983, -10.        , -10.        ],
           [  0.89723652,   0.37674972,   0.33622174]])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
      * space    (space) |S2 'IA' 'IL' 'IN'
    
  2. Use the sel() and isel() convenience methods:

    # index by integer array indices
    In [10]: arr.isel(space=0, time=slice(None, 2))
    Out[10]: 
    <xray.DataArray (time: 2)>
    array([ 0.12696983,  0.89723652])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
        space    |S2 'IA'
    
    # index by dimension coordinate labels
    In [11]: arr.sel(time=slice('2000-01-01', '2000-01-02'))
    Out[11]: 
    <xray.DataArray (time: 2, space: 3)>
    array([[  0.12696983, -10.        , -10.        ],
           [  0.89723652,   0.37674972,   0.33622174]])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
      * space    (space) |S2 'IA' 'IL' 'IN'
    

The arguments to these methods can be any objects that could index the array along the dimension given by the keyword, e.g., labels for an individual value, Python slice() objects or 1-dimensional arrays.
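One way to picture indexing by dimension name is as a translation from a {name: indexer} mapping to an ordinary positional key. The sketch below uses plain numpy; named_to_positional is a hypothetical helper written for this illustration and is not xray's implementation.

```python
import numpy as np

def named_to_positional(dims, indexers):
    """Build a positional key from {dimension name: indexer} (hypothetical helper)."""
    unknown = set(indexers) - set(dims)
    if unknown:
        raise ValueError('unknown dimensions: %s' % sorted(unknown))
    # Dimensions that are not mentioned are kept whole with slice(None).
    return tuple(indexers.get(dim, slice(None)) for dim in dims)

dims = ('time', 'space')
data = np.arange(12).reshape(4, 3)

# The keyword order does not matter; the dimension order of the array does.
key = named_to_positional(dims, dict(space=0, time=slice(None, 2)))
print(key)        # (slice(None, 2, None), 0)
print(data[key])  # [0 3]
```

Because the key is assembled from the array's own dimension order, the caller never needs to know whether time or space comes first.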

Note

We would love to be able to do indexing with labeled dimension names inside brackets, but unfortunately, Python does not yet support indexing with keyword arguments like arr[space=0].

Warning

Do not try to assign values through the result of isel or sel:

# DO NOT do this
arr.isel(space=0)[:] = 0

Depending on whether the underlying numpy indexing returns a copy or a view, the assignment may never reach the original array, and when it fails, it fails silently. Instead, you should use normal index assignment:

# this is safe
arr[dict(space=0)] = 0
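The silent failure comes from numpy itself, so it can be reproduced without xray. The following sketch uses a plain numpy array as a stand-in for the data behind a DataArray; the variable names are illustrative only.

```python
import numpy as np

data = np.zeros((4, 3))

# Integer indexing returns a view, so assigning through the intermediate
# result happens to reach the original array.
column = data[:, 0]
column[:] = 1
print(data[:, 0])          # [1. 1. 1. 1.]

# Array indexing returns a copy, so the same pattern changes nothing in
# the original and raises no error.
columns = data[:, [1, 2]]
columns[:] = 1
print(data[:, 1:].sum())   # 0.0

# Assigning on the original object works in both cases.
data[:, [1, 2]] = 1
print(data.sum())          # 12.0
```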

Dataset indexing

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:

In [12]: ds = arr.to_dataset()

In [13]: ds.isel(space=[0], time=[0])
Out[13]: 
<xray.Dataset>
Dimensions:  (space: 1, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA'
Data variables:
    None     (time, space) float64 0.127

In [14]: ds.sel(time='2000-01-01')
Out[14]: 
<xray.Dataset>
Dimensions:  (space: 3)
Coordinates:
    time     datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    None     (space) float64 0.127 -10.0 -10.0

Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with labeled dimensions:

In [15]: ds[dict(space=[0], time=[0])]
Out[15]: 
<xray.Dataset>
Dimensions:  (space: 1, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA'
Data variables:
    None     (time, space) float64 0.127

In [16]: ds.loc[dict(time='2000-01-01')]
Out[16]: 
<xray.Dataset>
Dimensions:  (space: 3)
Coordinates:
    time     datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    None     (space) float64 0.127 -10.0 -10.0

Using indexing to assign values to a subset of a dataset (e.g., ds[dict(space=0)] = 1) is not yet supported.
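To see why positional indexing is ambiguous for a dataset while indexing by name is not, consider variables that store the same dimensions in different orders. The sketch below models a dataset as a plain dictionary of (dims, values) pairs; both this layout and the isel_all helper are assumptions made for illustration, not xray internals.

```python
import numpy as np

# Hypothetical stand-in for a dataset: name -> (dims, values).
variables = {
    'a': (('time', 'space'), np.arange(12).reshape(4, 3)),
    'b': (('space', 'time'), np.arange(12).reshape(3, 4)),
    'c': (('time',), np.arange(4)),
}

def isel_all(variables, **indexers):
    """Apply the same named integer indexers to every variable."""
    result = {}
    for name, (dims, values) in variables.items():
        # Each variable builds its own positional key from its own dims.
        key = tuple(indexers.get(dim, slice(None)) for dim in dims)
        # A dimension indexed by a single integer disappears from the result.
        kept = tuple(d for d in dims if not isinstance(indexers.get(d), int))
        result[name] = (kept, values[key])
    return result

subset = isel_all(variables, space=0)
for name in sorted(subset):
    dims, values = subset[name]
    print(name, dims, values)
# a ('time',) [0 3 6 9]
# b ('time',) [0 1 2 3]
# c ('time',) [0 1 2 3]
```

A positional key such as [:, 0] would mean "first space" for a, "first time" for b, and would be invalid for c, whereas space=0 has the same meaning for all three.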

Dropping labels

The drop() method returns a new object with the listed index labels along a dimension dropped:

In [17]: ds.drop(['IN', 'IL'], dim='space')
Out[17]: 
<xray.Dataset>
Dimensions:  (space: 1, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA'
Data variables:
    None     (time, space) float64 0.127 0.8972 0.4514 0.543

drop is both a Dataset and DataArray method.
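Conceptually, dropping labels amounts to keeping every position whose label is not in the list. The sketch below shows that idea with pandas and numpy only; it is an illustration under that assumption, not the code behind drop().

```python
import numpy as np
import pandas as pd

space = pd.Index(['IA', 'IL', 'IN'])
values = np.arange(12).reshape(4, 3)   # hypothetical (time, space) data

# Mark the positions whose labels are not being dropped.
keep = ~space.isin(['IN', 'IL'])

new_space = space[keep]
new_values = values[:, keep]

print(list(new_space))    # ['IA']
print(new_values.shape)   # (4, 1)
print(list(space))        # ['IA', 'IL', 'IN']
```

The last line shows that the original index is left untouched, matching the "returns a new object" behavior described above.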

Indexing details

Like pandas, whether array indexing returns a view or a copy of the underlying data depends entirely on numpy:

  • Indexing with a single label or a slice returns a view.
  • Indexing with an array of labels returns a copy.
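Since the view-or-copy rule is inherited from numpy, it can be checked on a plain numpy array. This is a small sketch using np.shares_memory; the array stands in for the values behind a DataArray, with integer positions in place of labels.

```python
import numpy as np

data = np.arange(12.0).reshape(4, 3)

row = data[0]            # single integer
block = data[1:3]        # slice
picked = data[[0, 2]]    # array of integers

print(np.shares_memory(data, row))      # True
print(np.shares_memory(data, block))    # True
print(np.shares_memory(data, picked))   # False
```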

Attributes are persisted in array indexing:

In [18]: arr2 = arr.copy()

In [19]: arr2.attrs['units'] = 'meters'

In [20]: arr2[0, 0].attrs
Out[20]: OrderedDict([('units', 'meters')])

Indexing with xray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s advanced broadcasting. This means you can do indexing like this, which would require slightly more awkward syntax with numpy arrays:

In [21]: arr[arr['time.day'] > 1, arr['space'] != 'IL']
Out[21]: 
<xray.DataArray (time: 3, space: 2)>
array([[ 0.89723652,  0.33622174],
       [ 0.45137647,  0.12310214],
       [ 0.5430262 ,  0.44799682]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IN'

This is a much simpler model than numpy’s advanced indexing, and is basically the only model that works for labeled arrays. If you need numpy-style array indexing, you can always index .values directly instead:

In [22]: arr.values[arr.values > 0.5]
Out[22]: array([ 0.89723652,  0.84025508,  0.5430262 ])
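For comparison, here is how the orthogonal behavior described above looks in plain numpy, where it has to be requested explicitly with np.ix_. The array and indexers are made up for this sketch; the boolean masks mirror the day > 1 and space != 'IL' conditions used earlier.

```python
import numpy as np

values = np.arange(12).reshape(4, 3)
rows = [1, 2]
cols = [0, 2]

# numpy advanced indexing pairs the indexers element by element:
# it picks the elements at (1, 0) and (2, 2).
paired = values[rows, cols]
print(paired)    # [3 8]

# np.ix_ applies each indexer along its own axis: rows 1 and 2 crossed
# with columns 0 and 2.
crossed = values[np.ix_(rows, cols)]
print(crossed)
# [[3 5]
#  [6 8]]

# Boolean masks work the same way, one mask per axis.
day_gt_1 = np.array([False, True, True, True])
not_il = np.array([True, False, True])
masked = values[np.ix_(day_gt_1, not_il)]
print(masked.shape)   # (3, 2)
```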

Align and reindex

xray’s reindex, reindex_like and align impose a DataArray or Dataset onto a new set of coordinates corresponding to dimensions. Values whose labels also appear in the new coordinates are kept, and new labels not found in the original object are filled with NaN.

To reindex a particular dimension, use reindex():

In [23]: arr.reindex(space=['IA', 'CA'])
Out[23]: 
<xray.DataArray (time: 4, space: 2)>
array([[ 0.12696983,         nan],
       [ 0.89723652,         nan],
       [ 0.45137647,         nan],
       [ 0.5430262 ,         nan]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'CA'
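The same keep-or-fill rule can be seen in pandas, whose reindex follows the same convention. The sketch below uses a one-dimensional pandas.Series as a stand-in for a single row of arr; the values are made up.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0], index=['IA', 'IL', 'IN'])

# 'IA' exists and keeps its value; 'CA' is new and is filled with NaN;
# 'IL' and 'IN' are not requested and are left out.
reindexed = series.reindex(['IA', 'CA'])
print(list(reindexed.index))        # ['IA', 'CA']
print(reindexed['IA'])              # 1.0
print(np.isnan(reindexed['CA']))    # True

# NaN is a float, so integer data is converted to float when it is filled.
counts = pd.Series([1, 2, 3], index=['IA', 'IL', 'IN'])
print(counts.reindex(['IA', 'CA']).dtype)   # float64
```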

The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:

In [24]: foo = arr.rename('foo')

In [25]: baz = (10 * arr[:2, :2]).rename('baz')

In [26]: baz
Out[26]: 
<xray.DataArray 'baz' (time: 2, space: 2)>
array([[   1.26969833, -100.        ],
       [   8.97236524,    3.76749716]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) |S2 'IA' 'IL'

Reindexing foo with baz selects out the first two values along each dimension:

In [27]: foo.reindex_like(baz)
Out[27]: 
<xray.DataArray 'foo' (time: 2, space: 2)>
array([[  0.12696983, -10.        ],
       [  0.89723652,   0.37674972]])
Coordinates:
  * space    (space) object 'IA' 'IL'
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02

The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:

In [28]: baz.reindex_like(foo)
Out[28]: 
<xray.DataArray 'baz' (time: 4, space: 3)>
array([[   1.26969833, -100.        ,           nan],
       [   8.97236524,    3.76749716,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) object 'IA' 'IL' 'IN'

The align() function lets us perform more flexible database-like 'inner', 'outer', 'left' and 'right' joins:

In [29]: xray.align(foo, baz, join='inner')
Out[29]: 
(<xray.DataArray 'foo' (time: 2, space: 2)>
 array([[  0.12696983, -10.        ],
        [  0.89723652,   0.37674972]])
 Coordinates:
   * space    (space) object 'IA' 'IL'
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02,
 <xray.DataArray 'baz' (time: 2, space: 2)>
 array([[   1.26969833, -100.        ],
        [   8.97236524,    3.76749716]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02
   * space    (space) object 'IA' 'IL')

In [30]: xray.align(foo, baz, join='outer')
Out[30]: 
(<xray.DataArray 'foo' (time: 4, space: 3)>
 array([[  0.12696983, -10.        , -10.        ],
        [  0.89723652,   0.37674972,   0.33622174],
        [  0.45137647,   0.84025508,   0.12310214],
        [  0.5430262 ,   0.37301223,   0.44799682]])
 Coordinates:
   * space    (space) object 'IA' 'IL' 'IN'
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04,
 <xray.DataArray 'baz' (time: 4, space: 3)>
 array([[   1.26969833, -100.        ,           nan],
        [   8.97236524,    3.76749716,           nan],
        [          nan,           nan,           nan],
        [          nan,           nan,           nan]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
   * space    (space) object 'IA' 'IL' 'IN')
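The four join types decide which labels survive along each shared dimension. As a rough illustration, pandas.Index.join computes the same kinds of joins on the labels alone; the two indexes below mirror the space coordinates of foo and baz, and the results are sorted only to make the printed order predictable.

```python
import pandas as pd

foo_space = pd.Index(['IA', 'IL', 'IN'])
baz_space = pd.Index(['IA', 'IL'])

joined = {}
for how in ['inner', 'outer', 'left', 'right']:
    # Sort for display; the join itself decides which labels are present.
    joined[how] = sorted(foo_space.join(baz_space, how=how))
    print(how, joined[how])
# inner ['IA', 'IL']
# outer ['IA', 'IL', 'IN']
# left ['IA', 'IL', 'IN']
# right ['IA', 'IL']
```

Each object is then reindexed onto the joined labels, which is where the NaN values in the 'outer' result for baz come from.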

Both reindex_like and align work interchangeably between DataArray and Dataset objects, and with any number of matching dimension names:

In [31]: ds
Out[31]: 
<xray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    None     (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 0.4514 0.8403 ...

In [32]: ds.reindex_like(baz)
Out[32]: 
<xray.Dataset>
Dimensions:  (space: 2, time: 2)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) object 'IA' 'IL'
Data variables:
    None     (time, space) float64 0.127 -10.0 0.8972 0.3767

In [33]: other = xray.DataArray(['a', 'b', 'c'], dims='other')

# this is a no-op, because there are no shared dimension names
In [34]: ds.reindex_like(other)
Out[34]: 
<xray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    None     (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 0.4514 0.8403 ...