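I have a DataFrame containing similar rows that differ only in one column's value. If any rows share the same combination of values, I need to concatenate the unique values from each of those rows into a single column.
Sample data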
| program | subject | course | title |
|:------- |:------- |:------ |:----- |
|music | eng | 101 | 000 |
|music | math | 101 | 123 |
|music | eng | 102 | 000 |
|music | math | 101 | 456 |
|art | span | 201 | 123 |
|art | hst | 101 | 000 |
|art | span | 201 | 456 |
|art | span | 202 | 000 |
Desired data
| program | subject | course | title |
|:------- |:------- |:------ |:----- |
|music | eng | 101 | 000 |
|music | math | 101 | 123-456 |
|music | eng | 102 | 000 |
|music | math | 101 | 456-123 |
|art | span | 201 | 123-456 |
|art | hst | 101 | 000 |
|art | span | 201 | 456-123 |
|art | span | 202 | 000 |
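The first three columns match in rows 2 and 4, and again in rows 5 and 7. I would like to concatenate the titles so that each row contains the combination of titles from its matching rows.
Let's try groupby with transform: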
df['title'] = df.groupby(
['program', 'subject', 'course'], as_index=False, sort=False
)['title'].transform('-'.join)
print(df)
Output:
program subject course title
0 music eng 101 000
1 music math 101 123-456
2 music eng 102 000
3 music math 101 123-456
4 art span 201 123-456
5 art hst 101 000
6 art span 201 123-456
7 art span 202 000
Experimenting with networkx to match the exact expected output, which may be overengineering:
import networkx as nx
u = df.assign(k=df.groupby(['program','subject','course']).ngroup())
G = nx.from_pandas_edgelist(u,'title','k',create_using=nx.DiGraph())
l = [f"{a}-{'-'.join(sorted(b.difference([a])))}".rstrip("-")
for a,b in zip(u['title'],u['k'].map(lambda x: nx.ancestors(G,x)))]
df['new_title'] = l
print(df)
program subject course title new_title
0 music eng 101 000 000
1 music math 101 123 123-456
2 music eng 102 000 000
3 music math 101 456 456-123
4 art span 201 123 123-456
5 art hst 101 000 000
6 art span 201 456 456-123
7 art span 202 000 000
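For comparison, the same row-specific ordering can probably be reached with pandas alone. The following is a minimal sketch, not the answerer's method: it assumes `title` holds strings, that the wanted order is the row's own title first followed by the other distinct titles of the group in order of appearance, and it uses a hypothetical helper named `own_title_first`.

```python
import pandas as pd

# Sample data from the question; course and title are kept as strings
# so that leading zeros in "000" survive.
df = pd.DataFrame({
    "program": ["music", "music", "music", "music", "art", "art", "art", "art"],
    "subject": ["eng", "math", "eng", "math", "span", "hst", "span", "span"],
    "course": ["101", "101", "102", "101", "201", "101", "201", "202"],
    "title": ["000", "123", "000", "456", "123", "000", "456", "000"],
})

keys = ["program", "subject", "course"]


def own_title_first(titles: pd.Series) -> pd.Series:
    """Hypothetical helper: for each row of a group, put its own title
    first, then the group's other distinct titles in order of appearance."""
    distinct = list(dict.fromkeys(titles))  # de-duplicate, keep order
    return pd.Series(
        ["-".join([t] + [o for o in distinct if o != t]) for t in titles],
        index=titles.index,
    )


# transform returns a result aligned to the original row index
df["new_title"] = df.groupby(keys, sort=False)["title"].transform(own_title_first)
print(df)
```

Because the helper runs as a Python function once per group, it may be slower than the built-in join on very large frames.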
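You can concatenate the two DataFrames and then drop the duplicate rows:
import pandas as pd
frames = [df1, df2]
result = pd.concat(frames)
# keep=False drops every row that occurs more than once, not just the repeats
result = result.drop_duplicates(keep=False)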
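To make the effect of `keep=False` concrete, here is a minimal sketch with two small made-up frames (the contents of `df1` and `df2` are assumptions for illustration, not data from the question). Rows present in both frames are removed entirely, leaving only the rows that occur once.

```python
import pandas as pd

# Hypothetical frames sharing one identical row ("math", "123")
df1 = pd.DataFrame({"subject": ["eng", "math"], "title": ["000", "123"]})
df2 = pd.DataFrame({"subject": ["math", "span"], "title": ["123", "456"]})

result = pd.concat([df1, df2], ignore_index=True)
# keep=False removes all copies of a duplicated row
result = result.drop_duplicates(keep=False)
print(result)
```

Note that this drops the repeated rows rather than joining their titles.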