How to count the rows of string data that contain a certain number of words



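I have a dataframe df with a column of text, df['text'] (in this case, newspaper articles). How can I get the number of rows in df['text'] whose word count exceeds a threshold of n words?

A sample of df is shown below. Each article can contain any number of words.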

print(df['text'].head(10))
0    Emerging evidence that Mexico economy was back...
1    Chrysler Corp Tuesday announced million in new...
2    CompuServe Corp Tuesday reported surprisingly ...
3    CompuServe Corp Tuesday reported surprisingly ...
4    If dining at Planet Hollywood made you feel li...
5    Hog prices fell Tuesday after government slaug...
6    Blue chip stocks rallied Tuesday after the Fed...
7    Sprint Corp Tuesday announced plans to offer I...
8    Shoppers are loading up this year on perennial...
9    Kansas and Arizona filed lawsuits against some...
Name: text, dtype: object

My goal with this data is to count the articles that contain more than n words. See the pseudocode below for an example.

n = 250 # number of words cutoff for counting
counter = 0
for row in df['text']:
    if df['text'].wordcount >= n: # wordcount is some function on a df that counts the words in a string for one row
        counter += 1
print(counter)

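The desired output is the number of articles that contain more than n words (here n is arbitrarily set to 250). In the pseudocode above, wordcount is some function that counts the words in one row (that is, in one article). So for row x, if N (the number of words in the article) is 340, it is greater than n, the threshold of 250. The if statement is then triggered and counter increases by 1.

Ideally I would like to do this in a vectorized way, because the dataframe is large. If that is not possible, apply works fine.

Sample input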

import pandas as pd
from io import StringIO

d="""text
Emerging evidence that Mexico economy was back...
Chrysler Corp Tuesday announced million in new...
CompuServe Corp Tuesday reported surprisingly ...
CompuServe Corp Tuesday reported surprisingly ...
If dining at Planet Hollywood made you feel li...
Hog prices fell Tuesday after government slaug...
Blue chip stocks rallied Tuesday after the Fed...
Sprint Corp Tuesday announced plans to offer I...
Shoppers are loading up this year on perennial...
Kansas and Arizona filed lawsuits against some..."""
df=pd.read_csv(StringIO(d))
df

If we only want the number of rows with more than n words

n=7 # replace it with 250
df[df['text'].str.split().str.len() > n].count()

Output

text    4
dtype: int64

If we want the rows themselves whose word count is greater than n

n=7 # replace it with 250
df[df['text'].str.split().str.len() > n]

Output

text
4   If dining at Planet Hollywood made you feel li...
6   Blue chip stocks rallied Tuesday after the Fed...
7   Sprint Corp Tuesday announced plans to offer I...
8   Shoppers are loading up this year on perennial...

If we want the word count of each row

df['len'] = df['text'].str.split().str.len()
df

Output

text                                               len
0   Emerging evidence that Mexico economy was back...   7
1   Chrysler Corp Tuesday announced million in new...   7
2   CompuServe Corp Tuesday reported surprisingly ...   6
3   CompuServe Corp Tuesday reported surprisingly ...   6
4   If dining at Planet Hollywood made you feel li...   9
5   Hog prices fell Tuesday after government slaug...   7
6   Blue chip stocks rallied Tuesday after the Fed...   8
7   Sprint Corp Tuesday announced plans to offer I...   8
8   Shoppers are loading up this year on perennial...   8
9   Kansas and Arizona filed lawsuits against some...   7
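
If the column can contain missing values, str.split().str.len() yields NaN for those rows. The following is a minimal sketch of one way to handle that, under the assumption that a missing article should count as zero words. The three-row frame and the names word_counts and num_long are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the second article is missing
df = pd.DataFrame({'text': ['Hog prices fell Tuesday', None, 'Blue chip stocks']})

# str.split().str.len() gives NaN for the missing row;
# assumption: a missing article is treated as having zero words
word_counts = df['text'].str.split().str.len().fillna(0).astype(int)

n = 2  # replace it with 250
# Summing the boolean mask gives the count as a single integer
num_long = int((word_counts > n).sum())

print(word_counts.tolist())
print(num_long)
```

Summing the mask returns one integer, whereas calling count() on the filtered dataframe returns one value per column.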

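Assuming "words" are separated by single spaces, one approach is to count the spaces between the words and add 1, then compare the result with n.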

import pandas as pd
df = pd.DataFrame({
    'text': {0: 'Emerging evidence that Mexico economy was back',
             1: 'Chrysler Corp Tuesday announced million in new',
             2: 'CompuServe Corp Tuesday reported surprisingly',
             3: 'CompuServe Corp Tuesday reported surprisingly',
             4: 'If dining at Planet Hollywood made you feel li',
             5: 'Hog prices fell Tuesday after government slaug',
             6: 'Blue chip stocks rallied Tuesday after the Fed',
             7: 'Sprint Corp Tuesday announced plans to offer I',
             8: 'Shoppers are loading up this year on perennial',
             9: 'Kansas and Arizona filed lawsuits against s'}
})
n = 8
# Words are 1 more than the number of spaces
# Compare greater than equal to n
m = df['text'].str.count(' ').add(1).ge(n)
filtered_df = df[m]
print(filtered_df)

df

text
0  Emerging evidence that Mexico economy was back  # 7
1  Chrysler Corp Tuesday announced million in new  # 7
2   CompuServe Corp Tuesday reported surprisingly  # 5
3   CompuServe Corp Tuesday reported surprisingly  # 5
4  If dining at Planet Hollywood made you feel li  # 9
5  Hog prices fell Tuesday after government slaug  # 7
6  Blue chip stocks rallied Tuesday after the Fed  # 8
7  Sprint Corp Tuesday announced plans to offer I  # 8
8  Shoppers are loading up this year on perennial  # 8
9     Kansas and Arizona filed lawsuits against s  # 7

filtered

text
4  If dining at Planet Hollywood made you feel li  # 9
6  Blue chip stocks rallied Tuesday after the Fed  # 8
7  Sprint Corp Tuesday announced plans to offer I  # 8
8  Shoppers are loading up this year on perennial  # 8

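If you only need the number of matching rows, use sum on the mask. True counts as 1 and False as 0, so there is no need to build the filtered dataframe at all to get the count: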

m = df['text'].str.count(' ').add(1).ge(n)
print(m.sum())

Output:

4
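
Counting spaces overcounts when the text has consecutive spaces or leading or trailing whitespace. A possible alternative, shown here only as a sketch, is to count runs of non-whitespace characters with the regular expression \S+. The sample strings and variable names below are hypothetical.

```python
import pandas as pd

# Hypothetical rows with irregular spacing
df = pd.DataFrame({'text': ['two  spaces here', ' leading space', 'plain text row']})

# Spaces plus one: inflated by the double space and the leading space
by_spaces = df['text'].str.count(' ').add(1)

# Runs of non-whitespace characters: one match per word
by_tokens = df['text'].str.count(r'\S+')

n = 3
print(by_spaces.tolist())
print(by_tokens.tolist())
print(int(by_tokens.ge(n).sum()))
```

On clean, single-spaced text both counts agree, so the space-counting version stays valid when that assumption holds.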

If you only want to filter by word count, split the text on whitespace and compare the length of the resulting list

>>> df[df['text'].str.split().apply(len)>=8]
text
4  If dining at Planet Hollywood made you feel li
6  Blue chip stocks rallied Tuesday after the Fed
7  Sprint Corp Tuesday announced plans to offer I
8  Shoppers are loading up this year on perennial

If you want to filter by unique word count, you may need to convert the list produced by split into a set

>>> df[df['text'].str.split().apply(set).apply(len)>=8]
text
4  If dining at Planet Hollywood made you feel li
6  Blue chip stocks rallied Tuesday after the Fed
7  Sprint Corp Tuesday announced plans to offer I
8  Shoppers are loading up this year on perennial
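
Note that a set treats "The" and "the" as different words. If case should not matter, one option is to lowercase the text before splitting. This is a sketch under that assumption, with made-up sample rows and variable names.

```python
import pandas as pd

# Hypothetical rows containing repeated words in different case
df = pd.DataFrame({'text': ['The cat and the dog', 'a a a b']})

# Case-sensitive unique word count: 'The' and 'the' are counted separately
unique_cs = df['text'].str.split().apply(lambda words: len(set(words)))

# Case-insensitive: lowercase first, then split and deduplicate
unique_ci = df['text'].str.lower().str.split().apply(lambda words: len(set(words)))

print(unique_cs.tolist())
print(unique_ci.tolist())
```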
