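I have a dataset where I store replicates for different classes/subtypes (not sure what to call them), along with attributes for each one. Essentially, there are 5 subtypes/classes, 4 replicates per subtype/class, and 100 measured attributes.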
Is there a method like np.ravel or np.flatten that can merge 2 dimensions in xarray? In this case, I want to merge the dims subtype and replicates so that I end up with a 2D array (or pd.DataFrame) of attributes vs. subtype/replicates.
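The result doesn't need to use a "coord_1|coord_2" format or anything like that. It would be useful if it kept the original coord names. Maybe something like groupby could do this? Groupby always confuses me, so if there is something native to xarray, that would be awesome.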
import xarray as xr
import numpy as np
# Set up xr.DataArray
dims = (5,4,100)
DA_data = xr.DataArray(np.random.random(dims), dims=["subtype","replicates","attributes"])
DA_data.coords["subtype"] = ["subtype_%d"%_ for _ in range(dims[0])]
DA_data.coords["replicates"] = ["rep_%d"%_ for _ in range(dims[1])]
DA_data.coords["attributes"] = ["attr_%d"%_ for _ in range(dims[2])]
# DA_data.coords
# Coordinates:
# * subtype (subtype) <U9 'subtype_0' 'subtype_1' 'subtype_2' ...
# * replicates (replicates) <U5 'rep_0' 'rep_1' 'rep_2' 'rep_3'
# * attributes (attributes) <U7 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
# DA_data.dims
# ('subtype', 'replicates', 'attributes')
# Naive way to collapse the replicate dimension into the subtype dimension
desired_columns = list()
for subtype in DA_data.coords["subtype"]:
    for replicate in DA_data.coords["replicates"]:
        desired_columns.append(str(subtype.values) + "|" + str(replicate.values))
desired_columns
# ['subtype_0|rep_0',
# 'subtype_0|rep_1',
# 'subtype_0|rep_2',
# 'subtype_0|rep_3',
# 'subtype_1|rep_0',
# 'subtype_1|rep_1',
# 'subtype_1|rep_2',
# 'subtype_1|rep_3',
# 'subtype_2|rep_0',
# 'subtype_2|rep_1',
# 'subtype_2|rep_2',
# 'subtype_2|rep_3',
# 'subtype_3|rep_0',
# 'subtype_3|rep_1',
# 'subtype_3|rep_2',
# 'subtype_3|rep_3',
# 'subtype_4|rep_0',
# 'subtype_4|rep_1',
# 'subtype_4|rep_2',
# 'subtype_4|rep_3']
Yes, this is exactly what .stack is for:
In [33]: stacked = DA_data.stack(desired=['subtype', 'replicates'])
In [34]: stacked
Out[34]:
<xarray.DataArray (attributes: 100, desired: 20)>
array([[ 0.54020268, 0.14914837, 0.83398895, ..., 0.25986503,
0.62520466, 0.08617668],
[ 0.47021735, 0.10627027, 0.66666478, ..., 0.84392176,
0.64461418, 0.4444864 ],
[ 0.4065543 , 0.59817851, 0.65033094, ..., 0.01747058,
0.94414244, 0.31467342],
...,
[ 0.23724934, 0.61742922, 0.97563316, ..., 0.62966631,
0.89513904, 0.20139552],
[ 0.21157447, 0.43868899, 0.77488211, ..., 0.98285015,
0.24367352, 0.8061804 ],
[ 0.21518079, 0.234854 , 0.18294781, ..., 0.64679141,
0.49678393, 0.32215219]])
Coordinates:
* attributes (attributes) |S7 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
* desired (desired) object ('subtype_0', 'rep_0') ...
The resulting stacked coordinate is a pandas.MultiIndex, whose values are given by tuples:
In [35]: stacked['desired'].values
Out[35]:
array([('subtype_0', 'rep_0'), ('subtype_0', 'rep_1'),
('subtype_0', 'rep_2'), ('subtype_0', 'rep_3'),
('subtype_1', 'rep_0'), ('subtype_1', 'rep_1'),
('subtype_1', 'rep_2'), ('subtype_1', 'rep_3'),
('subtype_2', 'rep_0'), ('subtype_2', 'rep_1'),
('subtype_2', 'rep_2'), ('subtype_2', 'rep_3'),
('subtype_3', 'rep_0'), ('subtype_3', 'rep_1'),
('subtype_3', 'rep_2'), ('subtype_3', 'rep_3'),
('subtype_4', 'rep_0'), ('subtype_4', 'rep_1'),
('subtype_4', 'rep_2'), ('subtype_4', 'rep_3')], dtype=object)
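If you want the pd.DataFrame mentioned in the question, something along these lines should work. This is a minimal sketch that rebuilds the example array from the question and assumes a reasonably recent xarray and pandas. The names df, df_flat and restored are just illustrative, and the "|" separator is only there to mimic the naive loop in the question.

```python
import numpy as np
import xarray as xr

# Rebuild the example array from the question
dims = (5, 4, 100)
DA_data = xr.DataArray(
    np.random.random(dims),
    dims=["subtype", "replicates", "attributes"],
    coords={
        "subtype": ["subtype_%d" % i for i in range(dims[0])],
        "replicates": ["rep_%d" % i for i in range(dims[1])],
        "attributes": ["attr_%d" % i for i in range(dims[2])],
    },
)

# Merge the two dimensions into one stacked dimension named "desired"
stacked = DA_data.stack(desired=["subtype", "replicates"])

# 2D DataFrame: rows are attributes, columns are a MultiIndex that keeps
# the original coord names ("subtype", "replicates") as its level names
df = stacked.to_pandas()
print(df.shape)  # (100, 20)

# Optional: flatten the MultiIndex columns into "subtype|replicate" strings
df_flat = df.copy()
df_flat.columns = ["|".join(pair) for pair in df.columns]

# unstack reverses the operation and recovers the separate dimensions
restored = stacked.unstack("desired")
```

Keeping the MultiIndex (df rather than df_flat) is probably the more convenient choice, since you can still select by subtype or replicate, for example df["subtype_0"] for all replicates of one subtype.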