GroupBy: split-apply-combine

xray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

  • Split your data into multiple independent groups.
  • Apply some function to each group.
  • Combine your groups back into a single data object.
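
As a rough illustration of these three steps, independent of xray, here is a minimal sketch in plain Python. The names `split`, `values`, and `labels` are hypothetical and are not part of the xray API; xray's own implementation works on labeled arrays rather than lists.

```python
from collections import defaultdict


def split(values, labels):
    """Split: collect the values that share a label into one group."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return dict(groups)


values = [1.0, 2.0, 4.0, 3.0]
labels = list('abba')

# Split the data into independent groups keyed by label.
groups = split(values, labels)

# Apply a function (here, the mean) to each group, and
# combine the per-group results into a single object.
means = {label: sum(group) / len(group) for label, group in groups.items()}
```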

Group by operations work on both Dataset and DataArray objects. Currently, you can only group by a single one-dimensional variable (eventually, we hope to remove this limitation).

Split

Let’s create a simple example dataset:

In [5]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
   ...:                    coords={'x': [10, 20, 30, 40],
   ...:                            'letters': ('x', list('abba'))})
   ...: 

In [6]: arr = ds['foo']

In [7]: ds
 Out[7]: 
<xray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) int64 10 20 30 40
    letters  (x) |S1 'a' 'b' 'b' 'a'
  * y        (y) int64 0 1 2
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...

If we group by the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back an xray.GroupBy object:

In [8]: ds.groupby('letters')
 Out[8]: <xray.core.groupby.DatasetGroupBy at 0x7fce2230f9d0>

This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:

In [9]: ds.groupby('letters').groups
 Out[9]: {'a': [0, 3], 'b': [1, 2]}
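
To make the meaning of this mapping concrete, the following plain-Python sketch builds the same kind of dictionary from a list of labels. The helper `group_indices` is hypothetical and only mimics the shape of the result; it is not how xray computes `groups` internally.

```python
def group_indices(labels):
    """Map each unique label to the integer positions where it occurs."""
    groups = {}
    for position, label in enumerate(labels):
        groups.setdefault(label, []).append(position)
    return groups


letters = list('abba')
indices = group_indices(letters)
print(indices)  # {'a': [0, 3], 'b': [1, 2]}
```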

You can also iterate over groups in (label, group) pairs:

In [10]: list(ds.groupby('letters'))
Out[10]: 
[('a',
  <xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 40
    letters  (x) |S1 'a' 'a'
  * y        (y) int64 0 1 2
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
 ('b',
  <xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 20 30
    letters  (x) |S1 'b' 'b'
  * y        (y) int64 0 1 2
Data variables:
    foo      (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]

Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.

Apply

To apply a function to each group, you can use the flexible xray.GroupBy.apply() method. The resulting objects are automatically concatenated back together along the group axis:

In [11]: def standardize(x):
   ....:      return (x - x.mean()) / x.std()
   ....: 

In [12]: arr.groupby('letters').apply(standardize)
Out[12]: 
<xray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.23 ,  1.937, -0.726],
       [ 1.42 , -0.46 , -0.607],
       [-0.191,  1.214, -1.376],
       [ 0.339, -0.302, -0.019]])
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) int64 10 20 30 40
    letters  (x) |S1 'a' 'b' 'b' 'a'
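
Note that the result above keeps the original order of x, even though each group was processed separately. A minimal plain-Python sketch of that behavior, assuming one-dimensional data, is shown here. The helper `grouped_apply` is hypothetical, and `pstdev` is used because it matches NumPy's default population standard deviation (ddof=0).

```python
from statistics import mean, pstdev


def standardize(group):
    """Rescale one group to zero mean and unit (population) standard deviation."""
    center = mean(group)
    spread = pstdev(group)
    return [(value - center) / spread for value in group]


def grouped_apply(values, labels, func):
    """Apply func to each group, then put results back in their original positions."""
    # Split: record the positions belonging to each label.
    positions = {}
    for i, label in enumerate(labels):
        positions.setdefault(label, []).append(i)

    # Apply and combine: write each transformed value back where it came from.
    result = [None] * len(values)
    for label, idx in positions.items():
        transformed = func([values[i] for i in idx])
        for i, value in zip(idx, transformed):
            result[i] = value
    return result


values = [1.0, 2.0, 6.0, 3.0]
labels = list('abba')
standardized = grouped_apply(values, labels, standardize)
```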

GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:

In [13]: arr.groupby('letters').mean(dim='x')
Out[13]: 
<xray.DataArray 'foo' (letters: 2, y: 3)>
array([[ 0.335,  0.67 ,  0.354],
       [ 0.674,  0.609,  0.23 ]])
Coordinates:
  * y        (y) int64 0 1 2
  * letters  (letters) object 'a' 'b'

Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:

In [14]: ds.groupby('x').std()
Out[14]: 
<xray.Dataset>
Dimensions:  (x: 4)
Coordinates:
    letters  (x) |S1 'a' 'b' 'b' 'a'
  * x        (x) int64 10 20 30 40
Data variables:
    foo      (x) float64 0.3684 0.2554 0.2931 0.06957

First and last

There are two special aggregation operations that are currently only found on groupby objects: first and last. These return the first or last value of each group along the grouped dimension:

In [15]: ds.groupby('letters').first()
Out[15]: 
<xray.Dataset>
Dimensions:  (letters: 2, y: 3)
Coordinates:
  * y        (y) int64 0 1 2
  * letters  (letters) object 'a' 'b'
Data variables:
    foo      (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362

By default, they skip missing values (control this with skipna).
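
The effect of skipna can be pictured with a plain-Python sketch on a single group. The functions below are hypothetical stand-ins, not xray's implementation, and returning NaN for a group with no valid values is an assumption made for this sketch.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first(group, skipna=True):
    """Return the first value in the group, optionally skipping missing values."""
    for value in group:
        if not (skipna and is_missing(value)):
            return value
    return float('nan')  # assumption: no valid value found


def last(group, skipna=True):
    """Return the last value in the group, optionally skipping missing values."""
    return first(list(reversed(group)), skipna=skipna)


group = [float('nan'), 2.0, 3.0, float('nan')]
first_valid = first(group)               # 2.0
last_valid = last(group)                 # 3.0
first_raw = first(group, skipna=False)   # NaN, the literal first entry
```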

Grouped arithmetic

GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:

In [16]: alt = arr.groupby('letters').mean()

In [17]: alt
Out[17]: 
<xray.DataArray 'foo' (letters: 2)>
array([ 0.453,  0.504])
Coordinates:
  * letters  (letters) object 'a' 'b'

In [18]: ds.groupby('letters') - alt
Out[18]: 
<xray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) int64 10 20 30 40
    letters  (x) |S1 'a' 'b' 'b' 'a'
Data variables:
    foo      (x, y) float64 -0.3261 0.5137 -0.1926 0.3931 -0.1274 -0.1679 ...

This last line is roughly equivalent to the following:

results = []
for label, group in ds.groupby('letters'):
    # alt is indexed by 'letters', so select the value for this group's label
    results.append(group - alt.sel(letters=label))
xray.concat(results, dim='x')
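
Another way to picture grouped arithmetic is as a per-element lookup: each value is paired with the entry of the second operand that matches its group label. The sketch below uses plain Python lists and a dictionary in place of xray objects; the names `values`, `letters`, and `alt` here are illustrative stand-ins, and `alt` is filled in by hand with the group means.

```python
values = [1.0, 2.0, 6.0, 3.0]
letters = list('abba')

# One value per unique label (here, the mean of each group).
alt = {'a': 2.0, 'b': 4.0}

# Subtract the value for each element's own label; original order is kept.
anomalies = [value - alt[label] for value, label in zip(values, letters)]
```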

Squeezing

When grouping over a dimension, you can use the squeeze parameter to control whether that dimension is squeezed out of each group or kept with length one:

In [19]: next(iter(arr.groupby('x')))
Out[19]: 
(10,
 <xray.DataArray 'foo' (y: 3)>
array([ 0.127,  0.967,  0.26 ])
Coordinates:
  * y        (y) int64 0 1 2
    x        int64 10
    letters  |S1 'a')

In [20]: next(iter(arr.groupby('x', squeeze=False)))
Out[20]: 
(10,
 <xray.DataArray 'foo' (x: 1, y: 3)>
array([[ 0.127,  0.967,  0.26 ]])
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) int64 10
    letters  (x) |S1 'a')

Although xray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.