I have a DataFrame with index and text columns.
For example:
index | text
1 | "I have a pen, but I lost it today."
2 | "I have pineapple and pen, but I lost it today."
Now I have a long list, and I want to match every word in text against that list.
Let's say:
long_list = ['pen', 'pineapple']
I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, if there are matches, returns the count.
index | text | count
1 | "I have a pen, but I lost it today." | 1
2 | "I have pineapple and pen, but I lost it today." | 2
This is what I did:
def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1
    df['count'] = count
    return df
count_word = FunctionTransformer(count_words, validate=False)
Here is an example of how I developed another FunctionTransformer:
def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df
convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)
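Pandas has str.count:
# match any of the words as a whole word; the non-capturing group
# makes the \b anchors apply to every alternative, not just the first and last
pattern = r'\b(?:{})\b'.format('|'.join(long_list))
df['count'] = df.text.str.count(pattern)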
Output:
index text count
0 1 "I have a pen, but I lost it today." 1
1 2 "I have pineapple and pen, but I lost it today." 2
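If the entries of long_list could contain regex metacharacters (for example "c++" or "a.b"), the joined pattern would no longer match them literally. The following is a minimal sketch, assuming the sample data from the question, that escapes each word with re.escape before building the pattern; it is one possible hardening, not part of the original answer.

```python
import re

import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "text": [
        "I have a pen, but I lost it today.",
        "I have pineapple and pen, but I lost it today.",
    ]
})
long_list = ["pen", "pineapple"]

# Escape each word so regex metacharacters are matched literally, and wrap
# the alternation in a non-capturing group so \b applies to every word.
pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in long_list))

counts = df["text"].str.count(pattern)
print(counts.tolist())  # [1, 2]
```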
Inspired by @Quang Hoang's answer:
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
y = ['pen', 'pineapple']
def count_strings(X, y):
    # count whole-word occurrences of any word in y
    pattern = r'\b(?:{})\b'.format('|'.join(y))
    return X['text'].str.count(pattern)
string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
df['count'] = string_transformer.fit_transform(X=df)
which results in
text count
1 "I have a pen, but I lost it today." 1
2 "I have pineapple and pen, but I lost it today." 2
For the following df2:
#df2
text
1 "I have a pen, but I lost it today. pen pen"
2 "I have pineapple and pen, but I lost it today."
we get
string_transformer.transform(X=df2)
#result
1 3
2 2
Name: text, dtype: int64
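This shows that we have turned the function into a sklearn-style object that can be reused on new data. To generalize it further, we could pass the column name to count_strings as a keyword argument.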
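A minimal sketch of passing the column name as a keyword argument could look like the following. The parameter names words and column are hypothetical choices made for this illustration, and the data is the df2 shown above.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings(X, words, column="text"):
    """Count whole-word occurrences of any entry of `words` in X[column]."""
    # `words` and `column` are illustrative parameter names, not from the answer.
    pattern = r"\b(?:{})\b".format("|".join(words))
    return X[column].str.count(pattern)


df2 = pd.DataFrame(
    {
        "text": [
            "I have a pen, but I lost it today. pen pen",
            "I have pineapple and pen, but I lost it today.",
        ]
    },
    index=[1, 2],
)

# kw_args forwards both the word list and the column name to count_strings.
string_transformer = FunctionTransformer(
    count_strings,
    kw_args={"words": ["pen", "pineapple"], "column": "text"},
)

result = string_transformer.fit_transform(df2)
print(result.tolist())  # [3, 2]
```

With this shape, the same function can be pointed at a different column (for instance tweet_text) by changing only kw_args.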
Join the elements of the list with |. Find the matching elements with .str.findall() and apply .str.len() to count them:
p = '|'.join(long_list)
df = df.assign(count=df.text.str.findall(p).str.len())
text count
0 "I have a pen, but I lost it today." 1
1 "I have pineapple and pen, but I lost it today." 2
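Note that a pattern built with a plain '|'.join matches substrings as well as whole words. The sketch below, using a made-up sentence that is not from the question, compares it with a word-boundary variant; which behaviour is wanted depends on the use case.

```python
import pandas as pd

long_list = ["pen", "pineapple"]

# Hypothetical sentence chosen so that "pen" also appears inside "pencil".
s = pd.Series(["I bought a pencil and a pen."])

substring_pattern = "|".join(long_list)
whole_word_pattern = r"\b(?:{})\b".format(substring_pattern)

# Counts "pen" inside "pencil" as well as the standalone "pen".
substring_counts = s.str.findall(substring_pattern).str.len()

# Counts only the standalone "pen".
whole_word_counts = s.str.findall(whole_word_pattern).str.len()

print(substring_counts.tolist())   # [2]
print(whole_word_counts.tolist())  # [1]
```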