Count matching words from a predefined list in a pandas string column



I have a DataFrame with index and text columns.

For example:

index | text
1     | "I have a pen, but I lost it today."
2     | "I have pineapple and pen, but I lost it today."

Now I have a long list, and I want to match every word in text against that list.

Say:

long_list = ['pen', 'pineapple']

I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, if there are matches, returns the count.

index | text                                             | count
1     | "I have a pen, but I lost it today."             | 1
2     | "I have pineapple and pen, but I lost it today." | 2

This is what I tried:

def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1

    df['count'] = count
    return df
count_word = FunctionTransformer(count_words, validate=False)

Here is an example of another FunctionTransformer I developed:

def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df
convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)

Pandas has str.count:

# match any of the words as whole words (\b is a regex word boundary)
pattern = r'\b(?:{})\b'.format('|'.join(long_list))
df['count'] = df.text.str.count(pattern)

Output:

index                                              text  count
0      1              "I have a pen, but I lost it today."      1
1      2  "I have pineapple and pen, but I lost it today."      2

Inspired by @Quang Hoang's answer:

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

y = ['pen', 'pineapple']

def count_strings(X, y):
    # count whole-word matches of any word in y
    pattern = r'\b(?:{})\b'.format('|'.join(y))
    return X['text'].str.count(pattern)

string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
df['count'] = string_transformer.fit_transform(X=df)

which results in

   text                                              count
1  "I have a pen, but I lost it today."              1
2  "I have pineapple and pen, but I lost it today."  2

For the following df2:

#df2
text
1     "I have a pen, but I lost it today. pen pen"
2     "I have pineapple and pen, but I lost it today."

we get

string_transformer.transform(X=df2)
#result
1    3
2    2
Name: text, dtype: int64

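This shows that we have turned the function into an sklearn-style object. To generalize this further, we could pass the column name to count_strings as a keyword argument.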
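
A minimal sketch of passing the column name as a keyword argument could look like the following. The parameter names words and column are illustrative assumptions, not part of the original answer, and re.escape is added here so that list entries containing regex metacharacters are matched literally.

```python
import re

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings(X, words, column="text"):
    # Escape each word and wrap the alternation in word boundaries
    # so that only whole-word occurrences are counted.
    pattern = r"\b(?:{})\b".format("|".join(map(re.escape, words)))
    return X[column].str.count(pattern)


# Same data as df2 above.
df2 = pd.DataFrame(
    {
        "text": [
            "I have a pen, but I lost it today. pen pen",
            "I have pineapple and pen, but I lost it today.",
        ]
    },
    index=[1, 2],
)

# "words" and "column" are hypothetical keyword names chosen for this sketch.
string_transformer = FunctionTransformer(
    count_strings,
    kw_args={"words": ["pen", "pineapple"], "column": "text"},
)
counts = string_transformer.fit_transform(df2)
```

With this arrangement the same function can be reused for a differently named column by changing only kw_args.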

Join the elements of the list with |. Find the matching elements with .str.findall() and apply .str.len() to count them.

p = '|'.join(long_list)
df = df.assign(count=(df.text.str.findall(p)).str.len())

                                               text  count
0              "I have a pen, but I lost it today."      1
1  "I have pineapple and pen, but I lost it today."      2
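
One caveat worth illustrating: a pattern joined with | alone also matches inside longer words. The sketch below compares it with a word-bounded variant; the sample sentence and the variable names are made up for illustration and do not come from the question.

```python
import re

import pandas as pd

long_list = ["pen", "pineapple"]

# Hypothetical sample text: "pencil" contains "pen" as a substring.
df = pd.DataFrame({"text": ["I have a pen and a pencil."]})

# Plain alternation: counts substring hits as well.
plain = "|".join(long_list)

# Word-bounded alternation: counts whole words only.
bounded = r"\b(?:{})\b".format("|".join(map(re.escape, long_list)))

substring_count = df["text"].str.findall(plain).str.len()
whole_word_count = df["text"].str.findall(bounded).str.len()
```

Here substring_count is 2 for the sample row (pen, plus the pen inside pencil), while whole_word_count is 1. Which behavior is wanted depends on the data.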
