Frequently Asked Questions

Why is pandas not enough?

pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier Python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?

Sometimes we really want to work with collections of higher-dimensional arrays (ndim > 2), or with arrays for which the order of dimensions (e.g., columns vs. rows) shouldn’t really matter. For example, climate and weather data are often natively expressed in 4 or more dimensions: time, x, y and z.

pandas does support N-dimensional panels, but the implementation is very limited:

  • You need to create a new factory type for each dimensionality.
  • You can’t do math between NDPanels with different dimensionality.
  • Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.), but the dimension names refer to order, not meaning. You can’t specify that an operation should be applied along the “time” axis.

Fundamentally, the N-dimensional panel is limited by its context in pandas’s tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.
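
As a rough illustration of what named dimensions buy you, here is a minimal sketch. It assumes the `xarray` package (the name xray was later renamed to) is installed and imported as `xr`; the array `temps` and its values are made up for the example.

```python
import numpy as np
import xarray as xr  # assumption: xarray, the later name of xray

# Hypothetical 3D data with named dimensions.
temps = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "x", "y"),
)

# Reduce along a dimension by name instead of by position.
time_mean = temps.mean(dim="time")

# Math between arrays of different dimensionality (3D minus 2D):
# operands are matched up by dimension name.
anomaly = temps - time_mean

# The order of dimensions does not matter for arithmetic.
reordered = temps.transpose("y", "x", "time")
difference = temps - reordered

print(time_mean.dims)
print(anomaly.dims)
```

None of these operations needs to know that "time" happens to be stored as the first axis.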

When should I use xray instead of pandas?

It’s not an either/or choice! xray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.

That said, you should only bother with xray if some aspect of your data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.
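
The round trip between pandas and multi-dimensional arrays might look like the sketch below. It assumes the `xarray` package (the name xray was later renamed to) and uses its `DataArray.from_series` and `to_series` methods; the series contents are invented for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr  # assumption: xarray, the later name of xray

# A tabular (long-form) pandas Series indexed by two labeled levels.
index = pd.MultiIndex.from_product(
    [["2000-01-01", "2000-01-02"], [10, 20, 30]],
    names=["time", "x"],
)
series = pd.Series(np.arange(6.0), index=index, name="temperature")

# pandas -> multi-dimensional array: each index level becomes a dimension.
array = xr.DataArray.from_series(series)

# ... and back to a pandas Series with a MultiIndex.
roundtrip = array.to_series()

print(array.dims, array.shape)
```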

How can I use xray with heterogeneous data?

All items in a DataArray must have a single (homogeneous) data type. To work with heterogeneous or structured data types in xray, put separate DataArray objects in a single Dataset.

The Dataset object allows for most of the flexibility of heterogeneous data without the complexity or performance cost, because each of its constituent arrays has a single dtype.
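
A minimal sketch of this pattern, assuming the `xarray` package (the name xray was later renamed to); the variable names and values are hypothetical:

```python
import numpy as np
import xarray as xr  # assumption: xarray, the later name of xray

# Three variables with different dtypes that share dimensions.
ds = xr.Dataset(
    {
        "temperature": (("time", "station"), np.array([[11.5, 12.0], [13.2, 12.8]])),
        "station_name": (("station",), np.array(["north", "south"])),
        "is_valid": (("time",), np.array([True, False])),
    }
)

# Each variable keeps its own homogeneous dtype.
for name in ds.data_vars:
    print(name, ds[name].dtype)
```

The Dataset as a whole is heterogeneous, but every array inside it stays a plain single-dtype array.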

What is your approach to metadata?

We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xray supports arbitrary metadata in the form of global (Dataset) and variable-specific (DataArray) attributes (attrs).

Automatic interpretation of labels is powerful but also reduces flexibility. With xray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to netCDF with cf_conventions=True.)

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xray usually drops conflicting attrs when combining arrays and datasets instead of raising an exception, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
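
The effect of the keep_attrs option can be sketched as follows. This assumes the `xarray` package (the name xray was later renamed to); because the default may differ between versions, the flag is passed explicitly in both calls. The array and its units attribute are invented for the example.

```python
import numpy as np
import xarray as xr  # assumption: xarray, the later name of xray

temps = xr.DataArray(
    np.array([[280.0, 281.5], [282.0, 283.5]]),
    dims=("time", "x"),
    attrs={"units": "K"},
)

# Without propagation, the result carries no attrs.
dropped = temps.mean(dim="time", keep_attrs=False)

# With keep_attrs=True, the attrs are copied onto the result.
kept = temps.mean(dim="time", keep_attrs=True)

print(dropped.attrs)
print(kept.attrs)
```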