pandas: splitting a dataframe based on a condition in specific columns and rows



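I have a dataframe (much larger than this example) like the one below, where each value in the first two columns repeats over 5 consecutive rows.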

import pandas as pd
df = pd.DataFrame({'text':['the weather is nice','the weather is nice','the weather is nice','the weather is nice','the weather is nice',
'the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful',
'the day is long','the day is long','the day is long','the day is long','the day is long'],
'reference':['weather','weather','weather','weather','weather',
'house','house','house','house','house',
'day','day','day','day','day'],
'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})

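I want to split this pandas dataframe into two dataframes: the first two consecutive rows of each group go into one dataframe, and the remaining three go into the second, as shown below.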

Expected output:

first df:
text reference  id
0      the weather is nice   weather   1
1      the weather is nice   weather   2
3   the house is beautiful     house   6
4   the house is beautiful     house   7
5         the day is long       day  11
6         the day is long       day  12
second df:
text reference  id
0      the weather is nice   weather   3
1      the weather is nice   weather   4
2      the weather is nice   weather   5
3   the house is beautiful     house   8
4   the house is beautiful     house   9
5   the house is beautiful     house  10
6         the day is long       day  13
7         the day is long       day  14
8         the day is long       day  15

Obviously, selecting every nth row does not work (e.g. df.iloc[::3, :] or df[df.index % 3 == 0]), so I am wondering how the output above can be achieved.

If you want to group on the values of reference (first 2 items vs. the rest):

mask = df.groupby('reference').cumcount().gt(1)
groups = [g for k,g in df.groupby(mask)]
# or manually
# df1 = df[~mask]
# df2 = df[mask]
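
As a minimal sketch of the same cumcount idea, the snippet below wraps it in a helper so the number of leading rows is a parameter. The name split_head_tail and the reduced two-column frame are assumptions made for illustration; they are not part of the question or of pandas.

```python
import pandas as pd

# Small stand-in for the question's frame: each reference repeats 5 times.
df = pd.DataFrame({
    'reference': ['weather'] * 5 + ['house'] * 5 + ['day'] * 5,
    'id': range(1, 16),
})

def split_head_tail(frame, key, n):
    """Hypothetical helper: first n rows of each key group vs. the rest."""
    # cumcount numbers the rows 0, 1, 2, ... within each group,
    # and the result stays aligned with the original row order.
    in_tail = frame.groupby(key).cumcount() >= n
    return frame[~in_tail], frame[in_tail]

df1, df2 = split_head_tail(df, 'reference', 2)
print(df1['id'].tolist())
print(df2['id'].tolist())
```

Because the mask is computed per group, this variant should also cope with groups of unequal size, as long as "first n rows" is meant in the order the rows appear.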

Using position:

import numpy as np
# True for positions 2, 3, 4 of every block of 5 rows (the "rest")
mask = (np.arange(len(df)) % 5) > 1
# or with a range index
# mask = (df.index % 5) > 1
# then same as above using groupby or slicing
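
A short sketch of the positional variant, assuming every block has exactly 5 rows and the first 2 of each block belong to the first frame. The names block and head, and the letter index, are chosen here only to show that the mask depends on row position and not on index labels.

```python
import numpy as np
import pandas as pd

# Non-numeric index on purpose: the mask below never looks at the labels.
df = pd.DataFrame({'id': range(1, 16)}, index=list('abcdefghijklmno'))

block, head = 5, 2                    # assumed block length and leading rows
pos = np.arange(len(df)) % block      # 0, 1, 2, 3, 4, 0, 1, ...
mask = pos >= head                    # True for the trailing rows of a block

first = df[~mask]
second = df[mask]
print(first['id'].tolist())
print(second['id'].tolist())
```

The df.index % 5 shortcut only gives the same result when the index is a default 0..n-1 range, which is why the np.arange form is the safer default after filtering or sorting.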

Build a mask m:

import numpy as np
m = np.tile([True, True, False, False, False], len(df) // 5)
df1 = df[m]
df2 = df[~m]
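
Note that np.tile with len(df) // 5 repetitions only yields a mask of the right length when the row count is an exact multiple of 5. As a hedged alternative for frames where that may not hold, np.resize repeats the pattern until it fills the requested length. The 17-row frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 18)})   # 17 rows: the last block is incomplete

pattern = [True, True, False, False, False]
# np.resize cycles through the pattern until len(df) entries are filled
m = np.resize(pattern, len(df))

df1 = df[m]
df2 = df[~m]
print(df1['id'].tolist())
print(df2['id'].tolist())
```

Whether a trailing partial block should be split this way is a judgment call that depends on the data; raising an error when len(df) % 5 != 0 may be the better choice if incomplete blocks signal a problem upstream.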
