I have a DataFrame (much larger than this example) like the one below, in which each row of the first two columns is repeated 5 times.
import pandas as pd
df = pd.DataFrame({'text':['the weather is nice','the weather is nice','the weather is nice','the weather is nice','the weather is nice',
'the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful',
'the day is long','the day is long','the day is long','the day is long','the day is long'],
'reference':['weather','weather','weather','weather','weather',
'house','house','house','house','house',
'day','day','day','day','day'],
'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
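I want to split this pandas DataFrame into two DataFrames, so that the first two consecutive rows of each group end up in one DataFrame and the remaining three end up in the second, as shown below.
Desired output: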
first df:
text reference id
0 the weather is nice weather 1
1 the weather is nice weather 2
3 the house is beautiful house 6
4 the house is beautiful house 7
5 the day is long day 11
6 the day is long day 12
second df:
text reference id
0 the weather is nice weather 3
1 the weather is nice weather 4
2 the weather is nice weather 5
3 the house is beautiful house 8
4 the house is beautiful house 9
5 the house is beautiful house 10
6 the day is long day 13
7 the day is long day 14
8 the day is long day 15
Obviously, selecting every n-th row does not work (e.g. df.iloc[::3, :] or df[df.index % 3 == 0]), so I would like to know how the output above can be achieved.
If you want to group by the values of reference (first 2 items vs. the rest):
mask = df.groupby('reference').cumcount().gt(1)
groups = [g for k,g in df.groupby(mask)]
# or manually
# df1 = df[~mask]
# df2 = df[mask]
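To make the role of cumcount concrete, here is a minimal sketch on a smaller stand-in frame (two groups of five rows, an assumption made only to keep the output short). The variable name pos and the reset_index calls are illustrative additions, not part of the answer above.

```python
import pandas as pd

# Small stand-in for the question's frame: 2 groups of 5 rows each
df = pd.DataFrame({
    'reference': ['weather'] * 5 + ['house'] * 5,
    'id': range(1, 11),
})

# cumcount numbers the rows inside each group: 0, 1, 2, 3, 4, 0, 1, ...
pos = df.groupby('reference').cumcount()

# True from the 3rd row of each group onward
mask = pos.gt(1)

# reset_index is optional; it renumbers each result from 0
df1 = df[~mask].reset_index(drop=True)
df2 = df[mask].reset_index(drop=True)

print(df1['id'].tolist())  # [1, 2, 6, 7]
print(df2['id'].tolist())  # [3, 4, 5, 8, 9, 10]
```

Because the counter restarts for every distinct reference value, this variant should keep working when groups have different sizes, as long as "first two rows per group" is what is wanted.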
Using position:
import numpy as np
# True for positions 2, 3, 4 inside each block of 5 rows
mask = (np.arange(len(df)) % 5) > 1
# or, with a default RangeIndex
# mask = (df.index % 5) > 1
# then same as above using groupby or slicing
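The positional variant depends only on row order, not on the index labels or column values. A rough sketch of that idea wrapped in a function follows; split_by_position and its block and head parameters are hypothetical names introduced here for illustration, and the sketch assumes the rows are already ordered in blocks of equal size.

```python
import numpy as np
import pandas as pd

def split_by_position(df, block=5, head=2):
    """Hypothetical helper: split on row position inside fixed-size blocks."""
    # Position of each row within its block: 0 .. block-1, repeating
    pos = np.arange(len(df)) % block
    # True for every row after the first `head` rows of a block
    mask = pos >= head
    return df[~mask], df[mask]

# Non-default index, to show that only the row order matters
demo = pd.DataFrame({'id': range(1, 11)}, index=list('abcdefghij'))
first, second = split_by_position(demo)

print(first['id'].tolist())   # [1, 2, 6, 7]
print(second['id'].tolist())  # [3, 4, 5, 8, 9, 10]
```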
Build a mask m:
import numpy as np
m = np.tile([True, True, False, False, False], len(df) // 5)
df1 = df[m]
df2 = df[~m]
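Note that np.tile with len(df) // 5 repetitions produces a mask of the right length only when len(df) is a multiple of 5. If the last block might be incomplete, one possible alternative (an assumption about the data, not something stated in the question) is np.resize, which repeats the pattern cyclically up to the requested length:

```python
import numpy as np
import pandas as pd

pattern = [True, True, False, False, False]

# 12 rows: the last block is incomplete (only 2 rows)
demo = pd.DataFrame({'id': range(1, 13)})

# np.resize fills the new length with repeated copies of the pattern
m = np.resize(pattern, len(demo))

df1 = demo[m]
df2 = demo[~m]

print(df1['id'].tolist())  # [1, 2, 6, 7, 11, 12]
print(df2['id'].tolist())  # [3, 4, 5, 8, 9, 10]
```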