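I have a dataset with many rows (>1000), each containing a tuple of words. I want to remove from each tuple the words that appear only once across all rows. Here is an example of the data: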
before_cleaning after_cleaning
0 [cool] [cool]
1 [gooooood] []
2 [we, love, it, cool] [love, it, cool]
3 [love, it] [love, it]
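Column before_cleaning is the initial data, and column after_cleaning is what I expect the data to look like after the removal. As you can see in this example, "gooooood" and "we" are removed because each of those words appears only once across rows 0 to 3.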
Using collections.Counter and itertools.chain, together with a set and a list comprehension:
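from collections import Counter
from itertools import chain

# Count every word across all rows and keep those seen more than once.
keep = {k for k, v in Counter(chain.from_iterable(df['before_cleaning'])).items()
        if v > 1}
# {'cool', 'it', 'love'}

# Filter each row against the set, preserving the original word order.
df['after_cleaning'] = [[x for x in l if x in keep]
                        for l in df['before_cleaning']]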
Output:
before_cleaning after_cleaning
0 [cool] [cool]
1 [gooooood] []
2 [we, love, it, cool] [love, it, cool]
3 [love, it] [love, it]
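To try the idea without a DataFrame at hand, the same logic can be sketched on plain lists. This is only a minimal illustration: the function name drop_singletons and the sample rows are assumptions made for the example, not part of the answer above.

```python
from collections import Counter
from itertools import chain


def drop_singletons(rows):
    """Return a copy of rows without the words that occur only once overall.

    drop_singletons is a hypothetical helper name used for illustration.
    """
    # Count each word over the flattened rows (duplicates within a row count too).
    counts = Counter(chain.from_iterable(rows))
    # Words seen at least twice are the ones to keep.
    keep = {word for word, n in counts.items() if n > 1}
    # Rebuild every row, preserving word order; rows may end up empty.
    return [[word for word in row if word in keep] for row in rows]


rows = [["cool"], ["gooooood"], ["we", "love", "it", "cool"], ["love", "it"]]
cleaned = drop_singletons(rows)
```

Building the set once keeps the filtering step at one membership test per word, which should matter more as the number of rows grows.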
A pandas alternative for creating the set:
keep = set(df['before_cleaning'].explode().value_counts().loc[lambda x: x>1].index)
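For readers who want to see the pandas variant end to end, here is a small self-contained sketch. It assumes the column holds Python lists and rebuilds the example DataFrame by hand; the intermediate name counts is introduced only for readability.

```python
import pandas as pd

# Example data, assumed to match the question's before_cleaning column.
df = pd.DataFrame({
    "before_cleaning": [["cool"], ["gooooood"], ["we", "love", "it", "cool"], ["love", "it"]]
})

# explode() puts one word per row; value_counts() then tallies each word.
counts = df["before_cleaning"].explode().value_counts()

# Keep only the words that occur more than once.
keep = set(counts[counts > 1].index)

# Filter each original list against the set.
df["after_cleaning"] = [[w for w in row if w in keep] for row in df["before_cleaning"]]
```

An empty list explodes to a missing value, which value_counts() skips by default, so empty rows should not add a spurious entry to keep.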
You can use a lambda function that loops over each row's list and checks whether a word's overall count is greater than 1.
Code:
df['after_cleaning'] = df['before_cleaning'].apply(lambda row: [i for i in row if sum(list(df['before_cleaning']), []).count(i) > 1])
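One thing to be aware of with this one-liner is that the sum(..., []) flattening and the .count() scan are re-evaluated for every word of every row, which can get slow with more than 1000 rows. A possible variation, sketched here under the assumption that the column is named before_cleaning as in the question, computes the counts once and reuses them inside the lambda:

```python
from collections import Counter

import pandas as pd

# Example data, assumed to match the question's before_cleaning column.
df = pd.DataFrame({
    "before_cleaning": [["cool"], ["gooooood"], ["we", "love", "it", "cool"], ["love", "it"]]
})

# Tally all words a single time, outside the lambda.
counts = Counter(word for row in df["before_cleaning"] for word in row)

# The lambda now only does a dictionary lookup per word.
df["after_cleaning"] = df["before_cleaning"].apply(
    lambda row: [word for word in row if counts[word] > 1]
)
```

The result should be the same as with the original lambda; only the counting work is moved out of the per-word loop.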