Suppose I have a DataFrame, base_df, which looks like this:
   0    1    2    3
0  2  'A'  'B'  NaN
1  2  'A'  'C'  NaN
2  2  'A'  NaN  'D'
3  2  'A'  NaN  'E'
4  2  'A'  NaN  'F'
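For anyone who wants to reproduce this, a frame like the one above can be built as follows. This is only a sketch: it assumes the quoted cells are plain Python strings, the empty cells are np.nan, and the column labels are the integers 0 to 3.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; with no `columns=` argument pandas labels
# the columns with the integers 0..3, as in the table above.
base_df = pd.DataFrame(
    [
        [2, "A", "B", np.nan],
        [2, "A", "C", np.nan],
        [2, "A", np.nan, "D"],
        [2, "A", np.nan, "E"],
        [2, "A", np.nan, "F"],
    ]
)
print(base_df)
```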
How can I expand it across cells and columns, preferably without iterating, to produce:
    0    1    2    3
0   2  'A'  'B'  NaN
1   2  'A'  'C'  NaN
2   2  'A'  NaN  'D'
3   2  'A'  NaN  'E'
4   2  'A'  NaN  'F'
5   3  'A'  'B'  'D'
6   3  'A'  'C'  'D'
7   3  'A'  'B'  'E'
8   3  'A'  'C'  'E'
9   3  'A'  'B'  'F'
10  3  'A'  'C'  'F'
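I can handle column 0 easily with base_df.count(axis=1), but every solution I come up with forces me to iterate over the rows with .iterrows(). Is there a better way to do this in pandas?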
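To illustrate the count(axis=1) step: the sketch below assumes column 0 is meant to hold the number of non-missing values in the other columns, which matches the 2s and 3s in the example. The frame name rows is made up for the illustration.

```python
import numpy as np
import pandas as pd

# Three sample rows holding only the value columns 1..3.
rows = pd.DataFrame(
    [["A", "B", np.nan], ["A", np.nan, "D"], ["A", "B", "D"]],
    columns=[1, 2, 3],
)

# count(axis=1) counts the non-NaN cells in each row;
# insert the result as a new leading column labelled 0.
rows.insert(0, 0, rows.count(axis=1))
print(rows[0].tolist())  # [2, 2, 3]
```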
Edit: I managed to solve it, although the result is hardly fast enough to be an advantage:
import pandas as pd

DF = pd.DataFrame
in_df = <A STRING-NAN DF>  # placeholder: a DataFrame of strings and NaN
# d.PG_LANGS is defined elsewhere; its length sets the number of value columns.
colspan = len(d.PG_LANGS) + 1
cols = range(1, colspan)
for keep_len in range(3, len(d.PG_LANGS) + 1):
    out_df: DF = DF(columns=range(colspan))
    print('KEEP LEN:', keep_len)
    for dex_a in cols:
        for dex_b in cols:
            if dex_a == dex_b:
                continue
            # Rows that have a value in dex_a, split by whether dex_b is filled.
            a_df: DF = in_df[in_df[dex_a].notna()]
            sansb_df: DF = a_df[a_df[dex_b].isna()]
            withb_df: DF = a_df[a_df[dex_b].notna()]
            shared_as: set[str] = (
                set(sansb_df[dex_a]) & set(withb_df[dex_a]))  # type: ignore
            for sha in shared_as:
                sansb: DF = sansb_df[sansb_df[dex_a] == sha]  # type: ignore
                withb: DF = withb_df[withb_df[dex_a] == sha]  # type: ignore
                # print('SANS', sansb.shape[0])
                # print('WITH', withb.shape[0])
                if sansb.shape[0] == 0:
                    continue
                if withb.shape[0] == 0:
                    continue
                # Tile both frames, then fill dex_b by index alignment.
                sansb = pd.concat([sansb] * withb.shape[0],  # type: ignore
                                  axis=0, ignore_index=True)
                withb = pd.concat([withb] * sansb.shape[0],  # type: ignore
                                  axis=0, ignore_index=True)
                sansb[dex_b] = withb[dex_b]
                sansb.drop_duplicates(ignore_index=True, inplace=True)
                # print(sansb)
                out_df = pd.concat([out_df, sansb], axis=0,  # type: ignore
                                   ignore_index=True, sort=False)
    out_df.reset_index()  # result is discarded, so this has no effect
    out_df[0] = out_df.count(axis=1)  # type: ignore
    out_df.drop_duplicates(ignore_index=True, inplace=True)
    print(out_df)
    in_df = out_df
You can use itertools.product and concat:
from itertools import product

import pandas as pd

# Assumes df has its columns labelled 'A'..'D' (0..3 in the question).
cols = ['B', 'C', 'D']
out = pd.concat([
    df,
    pd.DataFrame(
        product([3], *[df[col].dropna().unique() for col in cols]),
        columns=['A'] + cols,
    ),
])
print(out)
Output:
   A  B    C    D
0  2  A    B  NaN
1  2  A    C  NaN
2  2  A  NaN    D
3  2  A  NaN    E
4  2  A  NaN    F
0  3  A    B    D
1  3  A    B    E
2  3  A    B    F
3  3  A    C    D
4  3  A    C    E
5  3  A    C    F
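Here is a hedged sketch of the same product idea applied to the integer column labels from the question, with a continuous index. The names value_cols, levels and combos are made up for the illustration. Note that product combines every unique value of every column, so this assumes all rows belong to a single group (one value in column 1); with several groups it would also pair values that never shared a row key.

```python
from itertools import product

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [2, "A", "B", np.nan],
        [2, "A", "C", np.nan],
        [2, "A", np.nan, "D"],
        [2, "A", np.nan, "E"],
        [2, "A", np.nan, "F"],
    ]
)

value_cols = [1, 2, 3]
# Unique non-missing values of each value column, in order of appearance.
levels = [df[col].dropna().unique() for col in value_cols]
# Every combination of those values, prefixed with the filled-cell count.
combos = pd.DataFrame(
    list(product([len(value_cols)], *levels)),
    columns=[0] + value_cols,
)
# ignore_index=True renumbers the rows 0..10 instead of restarting at 0.
out = pd.concat([df, combos], ignore_index=True)
print(out)
```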
This should work:
# how='cross' requires pandas 1.2 or later.
new_lines = (
    df.loc[df[2].notna(), [0, 1, 2]]
    .merge(df.loc[df[2].isna(), [3]], how='cross')
)
new_lines[0] += 1
df = pd.concat([df, new_lines])
Output:
   0  1    2    3
0  2  A    B  NaN
1  2  A    C  NaN
2  2  A  NaN    D
3  2  A  NaN    E
4  2  A  NaN    F
0  3  A    B    D
1  3  A    B    E
2  3  A    B    F
3  3  A    C    D
4  3  A    C    E
5  3  A    C    F
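If the exact row order and running index from the question matter, one possible variation (a sketch, not part of the answer above) is to put the column-3 rows on the left of the cross merge, so that column 3 varies slowest, and to pass ignore_index=True to concat. It relies on a cross merge preserving the order of the left rows, and needs pandas 1.2 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [2, "A", "B", np.nan],
        [2, "A", "C", np.nan],
        [2, "A", np.nan, "D"],
        [2, "A", np.nan, "E"],
        [2, "A", np.nan, "F"],
    ]
)

left = df.loc[df[2].isna(), [3]]            # rows that only fill column 3
right = df.loc[df[2].notna(), [0, 1, 2]]    # rows that only fill column 2
# Cross merge: each left row is paired with every right row, left-major.
new_lines = left.merge(right, how="cross")[[0, 1, 2, 3]]
new_lines[0] += 1                           # one more filled cell per row
out = pd.concat([df, new_lines], ignore_index=True)
print(out)
```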