I have a df defined as:
df = pd.DataFrame([["A1", "B1", "C1", "P"],
                   ["A2", "B2", "C2", "P"],
                   ["A3", "B3", "C3", "P"]],
                  columns=["col_a", "col_b", "col_c", "col_d"])
col_a col_b col_c col_d
A1 B1 C1 P
A2 B2 C2 P
A3 B3 C3 P
...
The result I need is basically every unique row repeated, with col_d expanded to P, Q and R for each of them:
col_a col_b col_c col_d
A1 B1 C1 P
A1 B1 C1 Q
A1 B1 C1 R
A2 B2 C2 P
A2 B2 C2 Q
A2 B2 C2 R
A3 B3 C3 P
A3 B3 C3 Q
A3 B3 C3 R
...
All I know so far is that:
new_df = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
repeats the values, but leaves col_d unchanged.
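Edit:
I have now stumbled upon another requirement: for each unique combination of col_a and col_b, I also need to add an "S" row in col_d.
The result should look like this: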
col_a col_b col_c col_d
A1 B1 C1 P
A1 B1 C1 Q
A1 B1 C1 R
A1 B1 T S
A2 B2 C2 P
A2 B2 C2 Q
A2 B2 C2 R
A2 B2 T S
Thanks a lot for your help!
Use DataFrame.assign with numpy.tile to set the values of the col_d column:
L = ['P','Q','R']
new_df = (pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
.assign(col_d = np.tile(L, len(df))))
print (new_df)
col_a col_b col_c col_d
0 A1 B1 C1 P
1 A1 B1 C1 Q
2 A1 B1 C1 R
3 A2 B2 C2 P
4 A2 B2 C2 Q
5 A2 B2 C2 R
6 A3 B3 C3 P
7 A3 B3 C3 Q
8 A3 B3 C3 R
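The reason the two calls line up is that np.repeat repeats each element in place, while np.tile repeats the whole sequence. A minimal sketch of just that pairing, using made-up row labels (`rows`, `labels` and `pairs` are illustrative names, not part of the answer):

```python
import numpy as np

rows = np.array(["r1", "r2", "r3"])   # stand-ins for the DataFrame rows
labels = ["P", "Q", "R"]

# Each row is repeated len(labels) times in a block: r1 r1 r1 r2 r2 r2 ...
repeated = np.repeat(rows, len(labels))

# The label list is cycled once per row: P Q R P Q R ...
tiled = np.tile(labels, len(rows))

# Position by position, every row meets every label exactly once.
pairs = list(zip(repeated.tolist(), tiled.tolist()))
print(pairs)
```

Because both arrays have length len(rows) * len(labels), the tiled labels can be assigned directly as the new col_d.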
Another similar idea is to repeat the index and select the repeated rows with DataFrame.loc:
L = ['P','Q','R']
new_df = (df.loc[df.index.repeat(3)]
.assign(col_d = np.tile(L, len(df)))
.reset_index(drop=True))
print (new_df)
col_a col_b col_c col_d
0 A1 B1 C1 P
1 A1 B1 C1 Q
2 A1 B1 C1 R
3 A2 B2 C2 P
4 A2 B2 C2 Q
5 A2 B2 C2 R
6 A3 B3 C3 P
7 A3 B3 C3 Q
8 A3 B3 C3 R
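One practical difference between the two variants shows up once the frame is not all strings: going through df.values produces a single object array, so numeric columns come back as object, whereas repeating the index keeps each column's dtype. A small sketch, assuming a hypothetical frame `df_num` with a numeric `qty` column (not part of the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a numeric column, only to show the dtype effect.
df_num = pd.DataFrame({"col_a": ["A1", "A2"], "qty": [1, 2], "col_d": ["P", "P"]})
L = ["P", "Q", "R"]

# Variant 1: round-trip through a NumPy array (mixed columns become object).
via_values = (pd.DataFrame(np.repeat(df_num.values, len(L), axis=0),
                           columns=df_num.columns)
              .assign(col_d=np.tile(L, len(df_num))))

# Variant 2: repeat the index and select rows, keeping per-column dtypes.
via_loc = (df_num.loc[df_num.index.repeat(len(L))]
           .assign(col_d=np.tile(L, len(df_num)))
           .reset_index(drop=True))

print(via_values.dtypes)
print(via_loc.dtypes)
```

For the all-string frame in the question both variants give the same result, so this only matters if other columns are added later.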
Edit:
L = ['P','Q','R','S']
new_df = (pd.DataFrame(np.repeat(df.values, len(L), axis=0), columns=df.columns)
.assign(col_d = np.tile(L, len(df)),
col_c = lambda x: x['col_c'].mask(x['col_d'].eq('S'), 'T')))
print (new_df)
col_a col_b col_c col_d
0 A1 B1 C1 P
1 A1 B1 C1 Q
2 A1 B1 C1 R
3 A1 B1 T S
4 A2 B2 C2 P
5 A2 B2 C2 Q
6 A2 B2 C2 R
7 A2 B2 T S
8 A3 B3 C3 P
9 A3 B3 C3 Q
10 A3 B3 C3 R
11 A3 B3 T S
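The snippet above adds one S row per original row, which matches the requirement as long as every (col_a, col_b) pair occurs only once. If a pair could occur several times with different col_c values (an assumption, the question's data does not show this), one possible sketch is to build the P/Q/R rows and the S rows separately and then interleave them. The input frame and the helper column `_grp` below are hypothetical:

```python
import pandas as pd

# Hypothetical input: the pair (A1, B1) occurs twice with different col_c values.
df = pd.DataFrame(
    [["A1", "B1", "C1", "P"],
     ["A1", "B1", "C9", "P"],
     ["A2", "B2", "C2", "P"]],
    columns=["col_a", "col_b", "col_c", "col_d"],
)

# P/Q/R rows: one block of three per original row.
pqr = df.drop(columns="col_d").merge(
    pd.DataFrame({"col_d": ["P", "Q", "R"]}), how="cross"
)

# S rows: exactly one per unique (col_a, col_b) pair, with col_c set to "T".
s_rows = df[["col_a", "col_b"]].drop_duplicates().assign(col_c="T", col_d="S")

# Stack both parts, then move each S row behind its own group.
combined = pd.concat([pqr, s_rows], ignore_index=True)
group_id = combined.groupby(["col_a", "col_b"], sort=False).ngroup()
new_df = (combined.assign(_grp=group_id)
          .sort_values("_grp", kind="stable")   # stable sort keeps P, Q, R before S
          .drop(columns="_grp")
          .reset_index(drop=True))
print(new_df)
```

With unique pairs, as in the question, this gives the same rows as the mask-based version.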
If you already have the first DataFrame, you can use assign and explode:
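l = ['P','Q','R']
# ignore_index=True renumbers the rows 0..n-1; without it, explode keeps the repeated original index
new_df = df.assign(col_d=[l]*len(df)).explode('col_d', ignore_index=True)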
Or a cross merge (how='cross' requires pandas 1.2 or later):
new_df = df.drop(columns='col_d').merge(pd.Series(l, name='col_d'), how='cross')
Output:
col_a col_b col_c col_d
0 A1 B1 C1 P
1 A1 B1 C1 Q
2 A1 B1 C1 R
3 A2 B2 C2 P
4 A2 B2 C2 Q
5 A2 B2 C2 R
6 A3 B3 C3 P
7 A3 B3 C3 Q
8 A3 B3 C3 R
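As a quick check that the two routes agree, here is a self-contained sketch that builds the question's frame (with four columns) and compares the results. Note that explode keeps the original index by default, so the rows are labelled 0, 0, 0, 1, 1, 1, ... unless ignore_index=True is passed. The names `exploded`, `merged` and `raw_index` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [["A1", "B1", "C1", "P"],
     ["A2", "B2", "C2", "P"],
     ["A3", "B3", "C3", "P"]],
    columns=["col_a", "col_b", "col_c", "col_d"],
)
values = ["P", "Q", "R"]

# Route 1: put the full list in every row, then explode it into separate rows.
exploded = df.assign(col_d=[values] * len(df)).explode("col_d", ignore_index=True)

# Route 2: cross join the remaining columns with the list of values.
merged = df.drop(columns="col_d").merge(pd.DataFrame({"col_d": values}), how="cross")

# Without ignore_index, explode repeats the original row labels.
raw_index = df.assign(col_d=[values] * len(df)).explode("col_d").index.tolist()

print(exploded.values.tolist() == merged.values.tolist())
print(raw_index)
```

The cross merge never looks at the existing col_d, which is why that column is dropped first.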
You can easily generate the combinations with complete from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete(['col_a', 'col_b', 'col_c'], {'col_d': ['P','Q','R']})
col_a col_b col_c col_d
0 A1 B1 C1 P
1 A1 B1 C1 Q
2 A1 B1 C1 R
3 A2 B2 C2 P
4 A2 B2 C2 Q
5 A2 B2 C2 R
6 A3 B3 C3 P
7 A3 B3 C3 Q
8 A3 B3 C3 R
Basically, you combine ['col_a', 'col_b', 'col_c'] with {'col_d': ['P','Q','R']}; using a dictionary lets you introduce new values that are not yet in the data.
For the scenario where S needs to be introduced, you can break it down into the following steps:
(df
.complete(['col_a', 'col_b'], {'col_d': ['P','Q','R', 'S']})
.assign(col_c = lambda df: np.where(df.col_d.eq('S'), 'T', df.col_c))
.ffill()
)
col_a col_b col_c col_d
0 A1 B1 C1 P
1 A1 B1 C1 Q
2 A1 B1 C1 R
3 A1 B1 T S
4 A2 B2 C2 P
5 A2 B2 C2 Q
6 A2 B2 C2 R
7 A2 B2 T S
8 A3 B3 C3 P
9 A3 B3 C3 Q
10 A3 B3 C3 R
11 A3 B3 T S
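To see why the ffill step is needed, the same steps can be sketched in plain pandas. This is only an approximation of the idea, not pyjanitor's implementation, and `keys`, `levels`, `grid` and `completed` are illustrative names: the (col_a, col_b) pairs are crossed with the four col_d values, the original rows are joined back, and col_c is then missing for every combination that was not in the data.

```python
import pandas as pd

df = pd.DataFrame(
    [["A1", "B1", "C1", "P"],
     ["A2", "B2", "C2", "P"],
     ["A3", "B3", "C3", "P"]],
    columns=["col_a", "col_b", "col_c", "col_d"],
)

# All existing (col_a, col_b) pairs crossed with the desired col_d values.
keys = df[["col_a", "col_b"]].drop_duplicates()
levels = pd.DataFrame({"col_d": ["P", "Q", "R", "S"]})
grid = keys.merge(levels, how="cross")

# Join the original rows back: col_c is only known where col_d == "P".
completed = grid.merge(df, on=["col_a", "col_b", "col_d"], how="left")

# Set col_c to "T" on the S rows, then fill Q and R from the P row above them.
completed["col_c"] = (completed["col_c"]
                      .mask(completed["col_d"].eq("S"), "T")
                      .ffill())
completed = completed[["col_a", "col_b", "col_c", "col_d"]]
print(completed)
```

The forward fill relies on the P row coming first within each pair, and on the S rows being filled with T before ffill runs so that T does not get overwritten.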