Pandas: groupby and concatenate strings with a condition

I have a dataset:

id   category   description   status
11   A          Text_1        Finished
11   A          Text_2        Pause
11   A          Text_3        Started
22   A          Text_1        Pause
33   B          Text_1        Finished
33   B          Text_2        Finished
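
For anyone who wants to reproduce the examples in this thread, the table above could be rebuilt roughly as follows. This is only a sketch: the variable name `data` follows the question's own code, and treating `id` as an integer column is an assumption.

```python
import pandas as pd

# Reconstruction of the question's table; dtypes are assumed, not stated in the post.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})
print(data)
```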

I want to group the data by id and concatenate description only for the rows with status = 'Finished'.

So the expected output is:

id    category   description
11    A          Text_1
22    A          
33    B          Text_1 Text_2

I can concatenate all descriptions with:

data.groupby(['id', 'category'])['description'].apply(' '.join).reset_index()

But how can I use a condition in this expression?

You can filter before the groupby, then reindex to bring back the groups that the filter removed entirely:

import pandas as pd

out = (
    data.loc[data.status == 'Finished']
    .groupby(['id', 'category'])['description']
    .apply(' '.join)
    # restore the groups that lost all of their rows in the filter
    .reindex(pd.MultiIndex.from_frame(data[['id', 'category']].drop_duplicates()),
             fill_value='')
    .reset_index()
)
Out[70]: 
id category    description
0  11        A         Text_1
1  22        A               
2  33        B  Text_1 Text_2
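
To see why the reindex step is needed, here is a small sketch that splits the one-liner into two steps and inspects the intermediate result. The DataFrame is rebuilt from the question's table (integer `id` is an assumption), and filling with an empty string is my choice to match the expected output.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})
key = ["id", "category"]

# Step 1: filter, then group. Group (22, 'A') has no 'Finished' rows,
# so it does not appear in the result at all.
joined = (
    data.loc[data["status"] == "Finished"]
    .groupby(key)["description"]
    .apply(" ".join)
)
print(joined.index.tolist())

# Step 2: take every (id, category) pair from the unfiltered data and
# reindex, so the missing group comes back with a default value.
all_groups = pd.MultiIndex.from_frame(data[key].drop_duplicates())
out = joined.reindex(all_groups, fill_value="").reset_index()
print(out)
```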

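Alternatively, you can use groupby.apply with a condition and a default value for groups that are empty after filtering (df here is the same DataFrame the question calls data):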

out = (df
   .groupby(['id', 'category'])
   .apply(lambda g: ' '.join(d['description'])
                    if len(d := g[g['status'].eq('Finished')])
                    else '')
   .reset_index(name='description')
)

Output:

id category    description
0  11        A         Text_1
1  22        A               
2  33        B  Text_1 Text_2
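
The lambda above relies on the walrus operator (`:=`), which requires Python 3.8+. A possible variant with a named helper is sketched below; `join_finished` is a hypothetical name, not part of the original answer. It also selects the two needed columns before `apply`, so the helper never touches the grouping columns, and it drops the explicit `else ''` branch because `' '.join` on an empty selection already returns an empty string.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})


def join_finished(group: pd.DataFrame) -> str:
    """Join the descriptions of the 'Finished' rows of one group."""
    finished = group.loc[group["status"].eq("Finished"), "description"]
    # Joining an empty selection yields '', which serves as the default value.
    return " ".join(finished)


out = (
    data.groupby(["id", "category"])[["description", "status"]]
    .apply(join_finished)
    .reset_index(name="description")
)
print(out)
```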

Here is another way:

key = ['id', 'category']
df2 = data[key].drop_duplicates().join(
    data.query("status == 'Finished'").groupby(key).description.apply(' '.join),
    on=key).fillna('').reset_index(drop=True)

Explanation:

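Use query() to keep only the rows whose status is 'Finished', use groupby() to group by the key [id, category], then use str.join() on the description values within each group.

Use a de-duplicated version of the key columns together with DataFrame.join() to expand the filtered result so that it contains every key, and use fillna() to replace the NaN in the description column with an empty string for the keys that were filtered out.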

id category    description
0  11        A         Text_1
1  22        A
2  33        B  Text_1 Text_2
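
The same approach, unrolled into named steps so the intermediate NaN is visible. This is an illustrative sketch: `finished`, `all_keys` and `joined` are names introduced here, and the DataFrame is rebuilt from the question's table with `id` assumed to be an integer column.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})
key = ["id", "category"]

# Joined descriptions for the groups that have at least one 'Finished' row.
finished = (
    data.query("status == 'Finished'")
    .groupby(key)["description"]
    .apply(" ".join)
)

# Every distinct (id, category) pair, including pairs with no 'Finished' rows.
all_keys = data[key].drop_duplicates()

# DataFrame.join is a left join by default: pairs missing from `finished` get NaN.
joined = all_keys.join(finished, on=key)
print(joined)

# Replace the NaN with an empty string and tidy the index.
df2 = joined.fillna("").reset_index(drop=True)
print(df2)
```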
