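I am trying to count how many times the contents of one column occur in a second column in pandas, and put that frequency count in a new column called Frequency.
That is, for each row I want the number of times the string in the [keyword] column appears in the [Description] column, stored in a new column named [Frequency].
Desired output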
[keyword] [Description] [Frequency]
car car dog car car 3
car car dog dog dog 1
new car old car car dog 0
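Code I have tried
I tried the following code, but ran into two problems: the frequency counts are inaccurate, and the format of the result is completely wrong.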
s = df['Keyword']
pat = r'\b{}\b'.format('|'.join(s))
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df_new = pd.DataFrame(mlb.fit_transform(df['Description'].str.findall(pat)),
columns=mlb.classes_,
index=df.index).reindex(columns=s, fill_value=0)
If you want exact word matches, use this:
import re
df['frequency'] = [len(re.findall(rf'\b{k}\b', d)) for k, d in zip(df['keyword'], df['Description'])]
print(df)
Output
keyword Description frequency
0 car car dog car car 3
1 car car dog dog dog 1
2 new car old car car dog 0
A better alternative, suggested by @jezrael:
df['frequency'] = [len(re.findall(rf'\b{k}\b', d)) for k, d in df[['keyword', 'Description']].to_numpy()]
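To check the word-boundary approach end to end, here is a minimal self-contained sketch. The sample rows are illustrative and only loosely modelled on the question's table, the helper name `count_whole_word` is hypothetical, and `re.escape` is an addition that was not in the original answers; it is assumed to be useful when a keyword may contain regex metacharacters such as `+` or `.`.

```python
import re

import pandas as pd

# Illustrative data (an assumption, not the asker's actual DataFrame).
df = pd.DataFrame({
    "keyword": ["car", "car", "new"],
    "Description": ["car dog car car", "car dog dog dog", "car old car car dog"],
})


def count_whole_word(keyword, text):
    """Count occurrences of keyword in text as a whole word (hypothetical helper)."""
    # re.escape makes the keyword literal; \b anchors the match at word boundaries.
    pattern = rf"\b{re.escape(keyword)}\b"
    return len(re.findall(pattern, text))


# Variant 1: iterate over the two columns with zip.
df["frequency"] = [
    count_whole_word(k, d) for k, d in zip(df["keyword"], df["Description"])
]

# Variant 2: iterate over the rows of a NumPy array; the column order
# must match the unpacking order (keyword first, then description).
df["frequency_np"] = [
    count_whole_word(k, d) for k, d in df[["keyword", "Description"]].to_numpy()
]

print(df)
```

Both variants should produce the same counts; the only difference is how the rows are iterated.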
If exact matching is not important, use count. Note that this means carito in a description will also match car. If you need to avoid that, use @Dani Mesejo's answer.
df['new'] = df.apply(lambda x: x['Description'].count(x['keyword']), axis=1)
print(df)
keyword Description Frequency new
0 car car dog car car 3 3
1 car car dog dog dog 1 1
2 new car old car car dog 0 0
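The difference between substring counting and whole-word matching only shows up when the keyword is embedded in a longer word. The sketch below uses made-up rows containing carito to illustrate this; the column names `substring` and `whole_word` are hypothetical, and `re.escape` is an assumption added for safety rather than part of the original answers.

```python
import re

import pandas as pd

# Made-up rows (an assumption) where "car" also occurs inside "carito".
df = pd.DataFrame({
    "keyword": ["car", "car"],
    "Description": ["car carito car", "carito carito"],
})

# str.count counts non-overlapping substring occurrences,
# so the "car" inside "carito" is counted as well.
df["substring"] = df.apply(
    lambda x: x["Description"].count(x["keyword"]), axis=1
)

# Word boundaries (\b) restrict matches to the standalone word "car".
df["whole_word"] = [
    len(re.findall(rf"\b{re.escape(k)}\b", d))
    for k, d in zip(df["keyword"], df["Description"])
]

print(df)
```

On this data the two columns disagree in every row, which is the case where the choice between the two approaches matters.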