First, we create a large dataset with a MultiIndex whose first record contains the missing value NaN.
In [200]: import numpy as np
     ...: import pandas as pd
     ...: data = []
     ...: val = 0
     ...: for ind_1 in range(3000):
     ...:     if ind_1 == 0:
     ...:         # the single record with a missing ind_2 and val
     ...:         data.append({'ind_1': 0, 'ind_2': np.nan, 'val': np.nan})
     ...:     else:
     ...:         for ind_2 in range(3000):
     ...:             data.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
     ...:             val += 1
     ...: df = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
In [201]: df
Out[201]:
                    val
ind_1 ind_2
0     NaN           NaN
1     0.0           0.0
      1.0           1.0
      2.0           2.0
      3.0           3.0
...                 ...
2999  2995.0  8996995.0
      2996.0  8996996.0
      2997.0  8996997.0
      2998.0  8996998.0
      2999.0  8996999.0

[8997001 rows x 1 columns]
I want to select all rows where ind_1 < 3 and ind_2 < 3.
First, I create a MultiIndex i1 where ind_1 < 3:
In [202]: i1 = df.loc[df.index.get_level_values('ind_1') < 3].index
In [203]: i1
Out[203]:
MultiIndex([(0, nan),
(1, 0.0),
(1, 1.0),
(1, 2.0),
(1, 3.0),
(1, 4.0),
(1, 5.0),
(1, 6.0),
(1, 7.0),
(1, 8.0),
...
(2, 2990.0),
(2, 2991.0),
(2, 2992.0),
(2, 2993.0),
(2, 2994.0),
(2, 2995.0),
(2, 2996.0),
(2, 2997.0),
(2, 2998.0),
(2, 2999.0)],
names=['ind_1', 'ind_2'], length=6001)
Then I create a MultiIndex i2 where ind_2 < 3:
In [204]: i2 = df.loc[~(df.index.get_level_values('ind_2') > 2)].index
In [205]: i2
Out[205]:
MultiIndex([( 0, nan),
( 1, 0.0),
( 1, 1.0),
( 1, 2.0),
( 2, 0.0),
( 2, 1.0),
( 2, 2.0),
( 3, 0.0),
( 3, 1.0),
( 3, 2.0),
...
(2996, 2.0),
(2997, 0.0),
(2997, 1.0),
(2997, 2.0),
(2998, 0.0),
(2998, 1.0),
(2998, 2.0),
(2999, 0.0),
(2999, 1.0),
(2999, 2.0)],
names=['ind_1', 'ind_2'], length=8998)
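Note that i2 is built with `~(ind_2 > 2)` rather than `ind_2 < 3`. The two are not equivalent when NaN is present, because any ordered comparison against NaN evaluates to False. A minimal sketch of the difference, where the array `level` is a hypothetical stand-in for the values of the ind_2 level:

```python
import numpy as np

# Hypothetical stand-in for df.index.get_level_values('ind_2')
level = np.array([np.nan, 0.0, 1.0, 2.0, 3.0])

# NaN < 3 is False, so this mask drops the NaN entry
less_than_3 = level < 3

# NaN > 2 is also False, so negating it keeps the NaN entry
not_greater_2 = ~(level > 2)

print(less_than_3)
print(not_greater_2)
```

This is why (0, nan) shows up in i2 above even though NaN is not literally less than 3.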
Logically, the solution should be the intersection of these two sets:
In [206]: df.loc[i1 & i2]
Out[206]:
                val
ind_1 ind_2
1     0.0       0.0
      1.0       1.0
      2.0       2.0
2     0.0    3000.0
      1.0    3001.0
      2.0    3002.0
Why is the first record (0, nan) filtered out?
Use boolean arrays i1 and i2 instead of indexes:
In [27]: i1 = df.index.get_level_values('ind_1') < 3
In [28]: i2 = ~(df.index.get_level_values('ind_2') > 2)
In [29]: i1
Out[29]: array([ True, True, True, ..., False, False, False])
In [30]: i2
Out[30]: array([ True, True, True, ..., False, False, False])
In [31]: df.loc[i1 & i2]
Out[31]:
                val
ind_1 ind_2
0     NaN       NaN
1     0.0       0.0
      1.0       1.0
      2.0       2.0
2     0.0    3000.0
      1.0    3001.0
      2.0    3002.0
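The same approach can be tried on a scaled-down dataset that builds in a fraction of a second. This is only a sketch: the names `rows`, `small`, `m1`, `m2` and `result` are made up for illustration, and the frame covers ind_1 and ind_2 in 0..3 instead of 0..2999. A boolean mask selects rows by position, so it never has to match the NaN label against anything, which is presumably why the (0, nan) row survives here.

```python
import numpy as np
import pandas as pd

# Scaled-down, hypothetical version of the dataset from the question
rows = [{'ind_1': 0, 'ind_2': np.nan, 'val': np.nan}]
val = 0
for ind_1 in range(1, 4):
    for ind_2 in range(4):
        rows.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
        val += 1
small = pd.DataFrame(rows).set_index(['ind_1', 'ind_2'])

# One boolean per row; combining them with & is an element-wise AND
m1 = small.index.get_level_values('ind_1') < 3
m2 = ~(small.index.get_level_values('ind_2') > 2)

# Positional selection: rows where both masks are True
result = small.loc[m1 & m2]
print(result)
```

On this small frame the result has seven rows: the (0, nan) record plus ind_2 in 0..2 for ind_1 equal to 1 and 2.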