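I have a CSV file with 4 columns: A, B, C, and D. I want to:
- Find all duplicate rows, i.e. rows where columns A, B, and C have the same values
- For each group of duplicates, take the D values and produce a single row whose D column is the union of the D values from all rows in that group
Example CSV input: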
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
Expected result:
John,Yes,123,street 1 street 2
Tom,No,345,street 1 street 2 street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
Full code:
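from io import StringIO
import pandas as pd

# Sample data, with a header row added so the columns can be referenced by name
csv_text = """A,B,C,D
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1"""
df = pd.read_csv(StringIO(csv_text))
# Group on A, B, C and join each group's D values with a space
result = df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
print(result)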
Output:
 | A | B | C | D
---|---|---|---|---
0 | Jason | Yes | 567 | street 1
1 | Jess | No | 999 | street 1
2 | John | Yes | 123 | street 1 street 2
3 | Thomas | No | 123 | street 1
4 | Tom | No | 345 | street 1 street 2 street 3
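Note that groupby sorts the group keys by default, so the rows come back in alphabetical order rather than in the input order shown in the expected result. The following is a minimal sketch of one way to get closer to that expected result, under a few assumptions that are not in the original answer: the file has no header row (so the names A to D are supplied via `names`), `sort=False` is passed to keep groups in order of first appearance, and `dict.fromkeys` is used to drop repeated D values within a group so that the joined string is a true union.

```python
from io import StringIO

import pandas as pd

# Sample data from the question. It has no header row, so the
# column names A-D are supplied explicitly (an assumption).
csv_text = """John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
"""

df = pd.read_csv(StringIO(csv_text), header=None, names=["A", "B", "C", "D"])

merged = (
    # sort=False keeps groups in the order they first appear in the input
    df.groupby(["A", "B", "C"], sort=False)["D"]
    # dict.fromkeys drops repeated values while preserving their order
    .agg(lambda s: " ".join(dict.fromkeys(s)))
    .reset_index()
)

# Serialize back to CSV text without the index or a header row
output = merged.to_csv(index=False, header=False)
print(output)
```

To write the result to a file instead of a string, a path can be passed as the first argument to `to_csv`.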