Remove words from tuples that appear only once in the dataset



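I have a dataset with many rows (>1000), each holding a tuple of words. I want to remove from every tuple the words that appear only once across all rows. Here is a sample of the data: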

before_cleaning    after_cleaning
0                [cool]            [cool]
1            [gooooood]                []
2  [we, love, it, cool]  [love, it, cool]
3            [love, it]        [love, it]

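The column before_cleaning is the initial data, and the column after_cleaning is what I expect the data to look like after the removal. As you can see in this example, "gooooood" and "we" are removed, because each of these words appears only once across rows 0 to 3.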

Use collections.Counter, itertools.chain, a set, and a list comprehension:

from collections import Counter
from itertools import chain

# Words that occur more than once across all rows
keep = {k for k, v in Counter(chain.from_iterable(df['before_cleaning'])).items()
        if v > 1}
# {'cool', 'it', 'love'}

# Keep only those words in each row, preserving their order
df['after_cleaning'] = [[x for x in l if x in keep]
                        for l in df['before_cleaning']]

Output:

before_cleaning    after_cleaning
0                [cool]            [cool]
1            [gooooood]                []
2  [we, love, it, cool]  [love, it, cool]
3            [love, it]        [love, it]
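For anyone who wants to try the idea without a DataFrame, here is a minimal sketch of the same Counter-based approach on plain Python data. The helper name remove_singletons is my own and not part of any library, and the sample rows are rebuilt from the question's example.

```python
from collections import Counter
from itertools import chain


def remove_singletons(rows):
    """Drop every word that occurs only once across all rows.

    Hypothetical helper for illustration. `rows` must be re-iterable
    (a list or a pandas Series, not a generator), because it is
    traversed twice: once to count and once to filter.
    """
    counts = Counter(chain.from_iterable(rows))
    keep = {word for word, n in counts.items() if n > 1}
    # Each cleaned row is returned as a list, whatever the input type was.
    return [[word for word in row if word in keep] for row in rows]


rows = [
    ("cool",),
    ("gooooood",),
    ("we", "love", "it", "cool"),
    ("love", "it"),
]
cleaned = remove_singletons(rows)
print(cleaned)
# [['cool'], [], ['love', 'it', 'cool'], ['love', 'it']]
```

With the DataFrame from the question, this would presumably be called as df['after_cleaning'] = remove_singletons(df['before_cleaning']).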

A pandas alternative for building the set:

keep = set(df['before_cleaning'].explode().value_counts().loc[lambda x: x>1].index)
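To see the pandas variant end to end, here is a rough self-contained sketch. It assumes a pandas version that has Series.explode (0.25 or later) and rebuilds the sample DataFrame from the question; the intermediate name counts is mine.

```python
import pandas as pd

# Sample data rebuilt from the question's example.
df = pd.DataFrame({
    "before_cleaning": [
        ["cool"],
        ["gooooood"],
        ["we", "love", "it", "cool"],
        ["love", "it"],
    ]
})

# explode() puts one word per row; value_counts() tallies each word.
counts = df["before_cleaning"].explode().value_counts()

# Keep only the words seen more than once.
keep = set(counts[counts > 1].index)

# Filter every row against the set, preserving word order.
df["after_cleaning"] = [
    [word for word in row if word in keep]
    for row in df["before_cleaning"]
]
print(df)
```

Building the set once keeps each membership test cheap, so the filtering step stays roughly linear in the total number of words.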

You can use a lambda function that loops over each row's list and keeps only the words whose count is greater than 1.

Code:

df['after_cleaning'] = df['before_cleaning'].apply(lambda row: [i for i in row if sum(list(df['before_cleaning']),[]).count(i)>1])
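One caveat with this one-liner: it rebuilds the flattened word list with sum(..., []) and rescans it with count() for every single word, which can get slow on more than 1000 rows. A possible sketch that keeps the apply/lambda style but counts only once up front is shown here; the column names follow the question's example, and the sample data is rebuilt from it.

```python
from collections import Counter

import pandas as pd

# Sample data rebuilt from the question's example.
df = pd.DataFrame({
    "before_cleaning": [
        ["cool"],
        ["gooooood"],
        ["we", "love", "it", "cool"],
        ["love", "it"],
    ]
})

# Count every word a single time, outside the lambda.
counts = Counter(word for row in df["before_cleaning"] for word in row)

# The lambda now only does a dictionary lookup per word.
df["after_cleaning"] = df["before_cleaning"].apply(
    lambda row: [word for word in row if counts[word] > 1]
)
print(df)
```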
