How to count words in a DataFrame using a word list



I have a question about counting words with Python.

The DataFrame has three columns (id, text, word).

First, here is a sample table.

[DataFrame]

import pandas as pd

df = pd.DataFrame({
"id":[
"100",
"200",
"300"
],
"text":[
"The best part of Zillow is you can search/view thousands of home within a click of a button without even stepping out of your door.At the comfort of your home you can get all the details such as the floor plan, tax history, neighborhood, mortgage calculator, school ratings etc. and also getting in touch with the contact realtor is just a click away and you are scheduled for the home tour!As a first time home buyer, this website greatly helped me to study the market before making the right choice.",
"I love all of the features of the Zillow app, especially the filtering options and the feature that allows you to save customized searches.",
"Data is not updated spontaneously. Listings are still shown as active while the Mls shows pending or closed."
],
"word":[
"[best, word, door, subway, rain]",
"[item, best, school, store, hospital]",
"[gym, mall, pool, playground]",
]
})

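I have already broken the text down into a word list for each row.

So, for each row, I want to check the word list against that row's text.

This is the result I want.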

| id  | word dict                                            |
| --- | ---------------------------------------------------- |
| 100 | {best: 1, word: 0, door: 1, subway: 0, rain: 0}      |
| 200 | {item: 0, best: 0, school: 0, store: 0, hospital: 0} |
| 300 | {gym: 0, mall: 0, pool: 0, playground: 0}            |

Please take a look at this problem.

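We can use re to extract all the words from the list string. Note that the pattern \w+ matches runs of letters, digits, and underscores, so the brackets, commas, and spaces are skipped.

Then we write a function that returns a dict with the count of each word in the list, and apply that function row by row to create a new column in df.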

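import re

def count_words(row):
    # Extract the individual words from a string such as "[best, word, door]"
    words = re.findall(r'\w+', row['word'])
    # Count how often each word occurs in the text (substring match)
    return {word: row['text'].count(word) for word in words}

df['word_counts'] = df.apply(count_words, axis=1)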

Output


id  ...                                        word_counts
0  100  ...  {'best': 1, 'word': 0, 'door': 1, 'subway': 0,...
1  200  ...  {'item': 0, 'best': 0, 'school': 0, 'store': 0...
2  300  ...  {'gym': 0, 'mall': 0, 'pool': 0, 'playground': 0}
[3 rows x 4 columns]
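
One caveat with the approach above: `str.count` counts substrings, so a word such as "best" would also be counted inside "bestseller", and the match is case-sensitive. If whole-word, case-insensitive counts are wanted (an assumption, since the question does not say), a minimal sketch could look like the following. The helper name `count_whole_words` is hypothetical and the sample sentence is made up for illustration.

```python
import re
from collections import Counter

def count_whole_words(text, word_list_str):
    """Count whole-word, case-insensitive occurrences of each listed word."""
    # Pull the individual words out of a string such as "[best, word, door]".
    words = re.findall(r"\w+", word_list_str)
    # Tokenize the text once and tally every lowercased token.
    tokens = Counter(re.findall(r"\w+", text.lower()))
    # Look each listed word up in the tally; absent words count as 0.
    return {word: tokens[word.lower()] for word in words}

# Illustrative input: "bestseller" is not counted as "best".
sample = count_whole_words(
    "The best part is the bestseller by the door.",
    "[best, door, rain]",
)
print(sample)
```

It can be applied per row in the same way as before, for example `df.apply(lambda row: count_whole_words(row['text'], row['word']), axis=1)`.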

Since the word column is of string type, first convert it to a list:

df['word'] = df['word'].str[1:-1].str.split(',')

Now you can use apply with axis=1 to run the counting logic for each word:

df[['text', 'word']].apply(lambda row: {item:row['text'].count(item) for item in row['word']}, axis=1)

Output

Out[32]: 
0    {'best': 1, ' word': 0, ' door': 1, ' subway':...
1    {'item': 0, ' best': 0, ' school': 0, ' store'...
2    {'gym': 0, ' mall': 0, ' pool': 0, ' playgroun...
dtype: object
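
Note that the keys in this output keep a leading space (' word', ' door') because the string was split on ',' alone. Those keys then only match occurrences preceded by a space, so a word at the very start of the text would be missed. A minimal sketch of a parser that trims each item, assuming the column always has the form "[a, b, c]" (the helper name `parse_word_list` is hypothetical):

```python
def parse_word_list(raw):
    """Turn a string such as "[best, word, door]" into a clean list of words."""
    # Drop the surrounding brackets, then split on commas.
    inner = raw.strip()[1:-1]
    # Trim whitespace around each item and skip empty pieces.
    return [item.strip() for item in inner.split(",") if item.strip()]

parsed = parse_word_list("[best, word, door, subway, rain]")
print(parsed)
```

With pandas this could replace the split step via `df['word'] = df['word'].map(parse_word_list)`, after which the same `apply` call produces keys without leading spaces.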
