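I first need to group by a column, remove the values I don't want, and then explode (unstack) the result back into one row per value.
My dataset looks like this: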
Text tag
drink coke mic
eat pizza mic
eat fruits yes
eat banana yes
eat banana mic
eat fruits mic
eat pizza no
eat pizza mic
eat pizza yes
drink coke yes
drink coke no
drink coke no
drink coke yes
I used this to do the grouping:
df = pd.DataFrame(df.groupby(['Text'])['tag'].apply(lambda x: list(x.values)))
Text labels
eat pizza [mic,no,mic,yes]
eat fruits [yes,mic]
eat banana [yes,mic]
drink coke [mic,yes,no,no,yes]
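If the labels column for a group contains both a "no" and a "yes", I need to remove those values from it and then explode it back into rows.
The output should look like this: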
Text tag
drink coke mic
eat pizza mic
eat fruits yes
eat banana yes
eat banana mic
eat fruits mic
eat pizza mic
Try this:
# Does the group contain both yes and no?
contains_both = (df.groupby('Text')['tag']
                   .transform(lambda x: all(i in x.values for i in ('yes', 'no'))))
# Keep a row if its group doesn't contain both yes and no.
# If the group does contain both, drop only the yes and no rows.
df = df[~contains_both | ~df.tag.isin(['yes', 'no'])]
print(df)
Output:
Text tag
0 drink coke mic
1 eat pizza mic
2 eat fruits yes
3 eat banana yes
4 eat banana mic
5 eat fruits mic
7 eat pizza mic
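For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample data in the question, and the `.astype(bool)` cast and the `result` variable name are my own additions (the cast is just a precaution so that `~` is always applied to a boolean mask).

```python
import pandas as pd

# Sample data copied from the question, in the original row order.
df = pd.DataFrame({
    "Text": ["drink coke", "eat pizza", "eat fruits", "eat banana", "eat banana",
             "eat fruits", "eat pizza", "eat pizza", "eat pizza",
             "drink coke", "drink coke", "drink coke", "drink coke"],
    "tag": ["mic", "mic", "yes", "yes", "mic", "mic", "no", "mic", "yes",
            "yes", "no", "no", "yes"],
})

# For every row: does its Text group contain both "yes" and "no"?
# The scalar returned per group is broadcast to all rows of that group.
contains_both = (
    df.groupby("Text")["tag"]
      .transform(lambda x: all(i in x.values for i in ("yes", "no")))
      .astype(bool)  # precaution (assumption: not needed on most pandas versions)
)

# Keep rows whose group lacks one of the two tags,
# or rows whose own tag is neither "yes" nor "no".
result = df[~contains_both | ~df["tag"].isin(["yes", "no"])]
print(result)
```

Because this filters the original rows directly, the original index and row order are preserved, which is why index 6 is missing from the output.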
FYI, your df computation can be shortened to:
df = df.groupby('Text', as_index=False)['tag'].agg(list)
# Output:
Text tag
0 drink coke [mic, yes, no, no, yes]
1 eat banana [yes, mic]
2 eat fruits [yes, mic]
3 eat pizza [mic, no, mic, yes]
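If you would rather stay with the group-into-lists approach from the question, the sketch below cleans each list and then uses `explode` to get back to one row per tag. The helper name `drop_yes_no_if_both` and the variable names are hypothetical, and `sort=False` is my choice to keep groups in order of first appearance.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Text": ["drink coke", "eat pizza", "eat fruits", "eat banana", "eat banana",
             "eat fruits", "eat pizza", "eat pizza", "eat pizza",
             "drink coke", "drink coke", "drink coke", "drink coke"],
    "tag": ["mic", "mic", "yes", "yes", "mic", "mic", "no", "mic", "yes",
            "yes", "no", "no", "yes"],
})

def drop_yes_no_if_both(tags):
    """Remove 'yes' and 'no' from a list of tags, but only if both occur."""
    if "yes" in tags and "no" in tags:
        return [t for t in tags if t not in ("yes", "no")]
    return tags

# One row per Text value, with all of its tags collected into a list.
grouped = df.groupby("Text", as_index=False, sort=False)["tag"].agg(list)

# Clean each list, then expand the lists back into one row per tag.
grouped["tag"] = grouped["tag"].apply(drop_yes_no_if_both)
exploded = grouped.explode("tag", ignore_index=True)
print(exploded)
```

Two caveats: the rows come back grouped by Text rather than in the original order, and a group that contained only "yes" and "no" would end up as an empty list, which `explode` turns into a single NaN row. The boolean-mask version avoids both.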