I have a pandas.DataFrame of the form:
index df df1
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 2 111
6 3 111
7 0 111
8 2 111
9 3 111
10 0 111
11 1 111
12 2 111
13 3 111
14 0 111
15 1 111
16 2 111
17 3 111
18 1 111
19 2 111
20 3 111
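I want to create a DataFrame in which the column df cycles through 0, 1, 2, 3. However, some values are missing from the data, and I am trying to fill the gaps by adding the missing rows. Here is my expected result: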
index df df1
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 1 0
6 2 111
7 3 111
8 0 111
9 1 0
10 2 111
11 3 111
12 0 111
13 1 111
14 2 111
15 3 111
16 0 111
17 1 111
18 2 111
19 3 111
20 0 0
21 1 111
22 2 111
23 3 111
How can I do this?
Edit: what should I do if my input looks like this?
index df1 df2
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 3 111
6 1 111
7 2 111
This is my expected result:
index df1 df2
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 1 0
6 2 0
7 3 111
8 0 0
9 1 111
10 2 111
11 3 0
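Using @Mozway's idea, combined with some helper functions from pyjanitor, you can make the missing values explicit and then fill them. Again, this is just another option: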
# pip install pyjanitor
import pandas as pd
import janitor as jn
(df.assign(temp = df.df.diff().le(0).cumsum())
.complete('df', 'temp') # helper function
.fillna(0)
# relevant if you care about the order
.sort_values('temp', kind='mergesort')
# helper function
.select_columns('df*') # or .drop(columns='temp')
)
df df1
0 0 111.0
6 1 111.0
12 2 111.0
18 3 111.0
1 0 111.0
7 1 0.0
13 2 111.0
19 3 111.0
2 0 111.0
8 1 0.0
14 2 111.0
20 3 111.0
3 0 111.0
9 1 111.0
15 2 111.0
21 3 111.0
4 0 111.0
10 1 111.0
16 2 111.0
22 3 111.0
5 0 0.0
11 1 111.0
17 2 111.0
23 3 111.0
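The `temp` helper column in the snippet above is what makes the missing rows detectable, so it may help to look at it in isolation. The following is a minimal pandas-only sketch (no pyjanitor needed); the DataFrame is rebuilt from the sample input in the question, and `block_sizes` is a name introduced here for illustration.

```python
import pandas as pd

# Sample input from the question.
df = pd.DataFrame({
    "df": [0, 1, 2, 3, 0, 2, 3, 0, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3],
    "df1": [111] * 21,
})

# A new block starts wherever "df" stops increasing (difference <= 0).
temp = df["df"].diff().le(0).cumsum()
print(temp.tolist())
# [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5]

# Blocks with fewer than 4 rows are the ones that need filling.
block_sizes = df.groupby(temp).size()
print(block_sizes.tolist())
# [4, 3, 3, 4, 4, 3]
```

Blocks 1, 2 and 5 each have three rows, which matches the three rows with a value of 0 in the expected result.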
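You can build a custom grouper that detects where the increasing "df" values reset to a lower (or equal) value.
Then reindex with the product of the unique groups and the unique values of "df".
Finally, rework the output with a combination of fillna / reset_index / rename_axis: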
# uncomment below if "index" is not the index
# df = df.set_index('index')
# find positions where "df" resets and make groups
groups = df['df'].diff().le(0).cumsum()
(df.set_index([groups, 'df'], drop=True) # set custom groups and "df" as index
.reindex(pd.MultiIndex.from_product([groups.unique(), # reindex with all
range(4), # combinations
], names=['group', 'df']))
.fillna(0).astype(int) # set missing values to zero and restore integers (fillna's downcast argument is deprecated in recent pandas)
.reset_index('df') # all below to restore a range index
.reset_index(drop=True)
.rename_axis('index')
)
Output:
df df1
index
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 1 0
6 2 111
7 3 111
8 0 111
9 1 0
10 2 111
11 3 111
12 0 111
13 1 111
14 2 111
15 3 111
16 0 111
17 1 111
18 2 111
19 3 111
20 0 0
21 1 111
22 2 111
23 3 111
Output for the second example:
df1 df2
index
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 1 0
6 2 0
7 3 111
8 0 0
9 1 111
10 2 111
11 3 0
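To reuse the reindexing approach for both inputs, it could be wrapped in a small function. This is only a sketch under some assumptions: `fill_cycle_gaps` and its parameters are hypothetical names that do not appear in the answer, the key column is assumed to increase strictly within each block, and `fill_value` is passed to `reindex` directly instead of calling `fillna` afterwards.

```python
import pandas as pd

def fill_cycle_gaps(frame, key, values, fill_value=0):
    """Hypothetical helper: insert the rows missing from each cycle of `key`."""
    # New group wherever `key` resets to a lower (or equal) value.
    groups = frame[key].diff().le(0).cumsum().rename("group")
    # Every (group, value) combination that should exist.
    full = pd.MultiIndex.from_product([groups.unique(), values],
                                      names=["group", key])
    return (frame.set_index([groups, key])
                 .reindex(index=full, fill_value=fill_value)
                 .reset_index(key)          # bring `key` back as a column
                 .reset_index(drop=True))   # drop the group level

# Second example from the question.
df = pd.DataFrame({"df1": [0, 1, 2, 3, 0, 3, 1, 2], "df2": [111] * 8})
out = fill_cycle_gaps(df, "df1", range(4))
print(out)
```

Passing the expected values explicitly (here `range(4)`) avoids relying on every value appearing somewhere in the data.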
You can set up groups based on the increasing order of column df, then use .unstack() and .stack(), as follows:
group = df['df'].le(df['df'].shift()).cumsum() # new group if column `df` <= `df` last entry
df_out = (df.set_index([group, 'df']) # set `group` and column `df` as index
.unstack(fill_value=0) # unstack `df` and fill missing entry of `df` in [0,1,2,3] as 0 for `df1`
.stack() # stack back to original shape
.droplevel(0) # drop `group` from index
.reset_index() # restore `df` from index back to data column
)
Result:
print(df_out)
df df1
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 1 0
6 2 111
7 3 111
8 0 111
9 1 0
10 2 111
11 3 111
12 0 111
13 1 111
14 2 111
15 3 111
16 0 111
17 1 111
18 2 111
19 3 111
20 0 0
21 1 111
22 2 111
23 3 111
For the edited input, use similar code:
group = df['df1'].le(df['df1'].shift()).cumsum()
df_out2 = (df.set_index([group, 'df1'])
.unstack(fill_value=0)
.stack()
.droplevel(0)
.reset_index()
)
Result:
print(df_out2)
df1 df2
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 1 0
6 2 0
7 3 111
8 0 0
9 1 111
10 2 111
11 3 0
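One point worth keeping in mind with the unstack/stack approach: `unstack` only creates columns for values that occur in at least one group, so a value that is missing from every group would not be filled in. The sketch below illustrates this with a hypothetical input (not from the question) in which 1 never appears, and reindexes the unstacked columns to the full range before stacking. It works on the single column `df1` as a Series, which differs slightly from the code above.

```python
import pandas as pd

# Hypothetical input in which the value 1 never appears.
df = pd.DataFrame({"df": [0, 2, 3, 0, 2, 3], "df1": [111] * 6})
group = df["df"].le(df["df"].shift()).cumsum().rename("group")

# One row per group, one column per value of "df" seen in the data.
wide = df.set_index([group, "df"])["df1"].unstack(fill_value=0)
print(wide.columns.tolist())  # [0, 2, 3]: no column for 1

# Force the full 0..3 range before stacking back to long form.
wide = wide.reindex(columns=list(range(4)), fill_value=0).rename_axis(columns="df")
df_out = (wide.stack()
              .rename("df1")
              .reset_index("df")
              .reset_index(drop=True))
print(df_out)
```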
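First, assign a group number to each run of df (the input DataFrame is named df1 here):
df11 = df1.assign(group=(df1.df.diff() < 1).cumsum())
Second, create a new DataFrame holding the complete 0-3 cycle for every group (group numbers start at 0, so there are max + 1 groups):
df2 = pd.DataFrame({'df': [0, 1, 2, 3] * (int(df11.group.max()) + 1)}).assign(col2=lambda dd: (dd.df == 0).cumsum() - 1)
Third, merge and fill:
(df2.merge(df11, how='left', left_on=['df', 'col2'], right_on=['df', 'group'])
 .loc[:, ['df', 'df1']].fillna(0).astype(int))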
df df1
0 0 111
1 1 111
2 2 111
3 3 111
4 0 111
5 1 0
6 2 111
7 3 111
8 0 111
9 1 0
10 2 111
11 3 111
12 0 111
13 1 111
14 2 111
15 3 111
16 0 111
17 1 111
18 2 111
19 3 111
20 0 0
21 1 111
22 2 111
23 3 111
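If it is useful to know which rows were added by the merge, `DataFrame.merge` accepts `indicator=True`, which adds a `_merge` column marking rows found only in the left frame. A minimal end-to-end sketch on the sample input follows; the names `template`, `merged` and `added_rows` are introduced here for illustration.

```python
import pandas as pd

# Sample input from the question.
df1 = pd.DataFrame({
    "df": [0, 1, 2, 3, 0, 2, 3, 0, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3],
    "df1": [111] * 21,
})

# Number each run of increasing "df" values.
df11 = df1.assign(group=(df1["df"].diff() < 1).cumsum())
n_groups = int(df11["group"].max()) + 1  # group numbers start at 0

# Template with the full 0..3 cycle for every group.
template = pd.DataFrame({"df": [0, 1, 2, 3] * n_groups})
template["col2"] = (template["df"] == 0).cumsum() - 1

merged = template.merge(df11, how="left",
                        left_on=["df", "col2"], right_on=["df", "group"],
                        indicator=True)

# Rows that exist only in the template are the ones that were missing.
added_rows = merged.index[merged["_merge"] == "left_only"].tolist()
print(added_rows)  # [5, 9, 20]

result = merged[["df", "df1"]].fillna(0).astype(int)
print(result)
```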