


0    Emerging evidence that Mexico economy was back...
1    Chrysler Corp Tuesday announced million in new...
2    CompuServe Corp Tuesday reported surprisingly ...
3    CompuServe Corp Tuesday reported surprisingly ...
4    If dining at Planet Hollywood made you feel li...
5    Hog prices fell Tuesday after government slaug...
6    Blue chip stocks rallied Tuesday after the Fed...
7    Sprint Corp Tuesday announced plans to offer I...
8    Shoppers are loading up this year on perennial...
9    Kansas and Arizona filed lawsuits against some...
Name: text, dtype: object


n = 250  # number of words cutoff for counting
counter = 0
for row in df['text']:
    if df['text'].wordcount >= n:  # wordcount is some function on a df that counts the words in a string for one row
        counter += 1

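The desired output is the number of articles that contain at least n words (here n is set arbitrarily to 250). In the pseudocode above, wordcount is some function that counts the words in one row (in this case, one article). So for row x, if N (the number of words in the article) is 340, it exceeds the threshold n of 250, the if statement fires, and counter is incremented by 1.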
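For reference, the loop described above can be written as runnable Python. This is only a sketch: `wordcount` is a hypothetical helper, it assumes a word is any whitespace-separated token, and the three short strings stand in for the real articles (so the threshold is lowered from 250 to 5).

```python
import pandas as pd

# Shortened stand-in data; the real column holds full articles.
df = pd.DataFrame({'text': [
    'Hog prices fell Tuesday',
    'Blue chip stocks rallied Tuesday after the Fed',
    'Sprint Corp Tuesday announced plans',
]})

def wordcount(text):
    """Hypothetical helper: count whitespace-separated tokens in one string."""
    return len(text.split())

n = 5  # threshold; the question uses 250
counter = 0
for row in df['text']:
    # Count the row when it has at least n words.
    if wordcount(row) >= n:
        counter += 1
```

A plain loop like this is easy to follow, but on a large frame a vectorized approach is usually preferable.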



Emerging evidence that Mexico economy was back...
Chrysler Corp Tuesday announced million in new...
CompuServe Corp Tuesday reported surprisingly ...
CompuServe Corp Tuesday reported surprisingly ...
If dining at Planet Hollywood made you feel li...
Hog prices fell Tuesday after government slaug...
Blue chip stocks rallied Tuesday after the Fed...
Sprint Corp Tuesday announced plans to offer I...
Shoppers are loading up this year on perennial...
Kansas and Arizona filed lawsuits against some...

If we only want the number of rows with more than n words:

n=7 # replace it with 250
df[df['text'].str.split().str.len() > n].count()


text    4
dtype: int64

If we want the rows themselves whose word count is greater than n:

n=7 # replace it with 250
df[df['text'].str.split().str.len() > n]


4   If dining at Planet Hollywood made you feel li...
6   Blue chip stocks rallied Tuesday after the Fed...
7   Sprint Corp Tuesday announced plans to offer I...
8   Shoppers are loading up this year on perennial...


df['len'] = df['text'].str.split().str.len()


text                                               len
0   Emerging evidence that Mexico economy was back...   7
1   Chrysler Corp Tuesday announced million in new...   7
2   CompuServe Corp Tuesday reported surprisingly ...   6
3   CompuServe Corp Tuesday reported surprisingly ...   6
4   If dining at Planet Hollywood made you feel li...   9
5   Hog prices fell Tuesday after government slaug...   7
6   Blue chip stocks rallied Tuesday after the Fed...   8
7   Sprint Corp Tuesday announced plans to offer I...   8
8   Shoppers are loading up this year on perennial...   8
9   Kansas and Arizona filed lawsuits against some...   7

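Assuming "words" are separated by single spaces, one approach is to count the spaces in each string and add 1, then compare the result against n.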

import pandas as pd

df = pd.DataFrame({
    'text': {0: 'Emerging evidence that Mexico economy was back',
             1: 'Chrysler Corp Tuesday announced million in new',
             2: 'CompuServe Corp Tuesday reported surprisingly',
             3: 'CompuServe Corp Tuesday reported surprisingly',
             4: 'If dining at Planet Hollywood made you feel li',
             5: 'Hog prices fell Tuesday after government slaug',
             6: 'Blue chip stocks rallied Tuesday after the Fed',
             7: 'Sprint Corp Tuesday announced plans to offer I',
             8: 'Shoppers are loading up this year on perennial',
             9: 'Kansas and Arizona filed lawsuits against s'}
})

n = 8
# Words are 1 more than the number of spaces
# Compare greater than or equal to n
m = df['text'].str.count(' ').add(1).ge(n)
filtered_df = df[m]


0  Emerging evidence that Mexico economy was back  # 7
1  Chrysler Corp Tuesday announced million in new  # 7
2   CompuServe Corp Tuesday reported surprisingly  # 5
3   CompuServe Corp Tuesday reported surprisingly  # 5
4  If dining at Planet Hollywood made you feel li  # 9
5  Hog prices fell Tuesday after government slaug  # 7
6  Blue chip stocks rallied Tuesday after the Fed  # 8
7  Sprint Corp Tuesday announced plans to offer I  # 8
8  Shoppers are loading up this year on perennial  # 8
9     Kansas and Arizona filed lawsuits against s  # 7


4  If dining at Planet Hollywood made you feel li  # 9
6  Blue chip stocks rallied Tuesday after the Fed  # 8
7  Sprint Corp Tuesday announced plans to offer I  # 8
8  Shoppers are loading up this year on perennial  # 8
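The spaces-plus-one count depends on the assumption that words are separated by exactly one space with no leading or trailing whitespace. The sketch below compares it with `str.split().str.len()` on a few made-up strings (they are illustrative and not part of the original data) to show where the two counts diverge.

```python
import pandas as pd

# Hypothetical strings with irregular whitespace.
s = pd.Series([
    'Hog prices fell',      # single spaces
    'Hog  prices   fell',   # repeated spaces
    ' Hog prices fell ',    # leading and trailing spaces
])

by_spaces = s.str.count(' ').add(1)  # number of spaces + 1
by_split = s.str.split().str.len()   # whitespace-separated tokens

comparison = pd.DataFrame({'by_spaces': by_spaces, 'by_split': by_split})
print(comparison)
```

If the text may contain irregular whitespace, the split-based count is probably the safer choice; on clean input both give the same result.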

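If you only need the number of matching rows, call sum on the mask. True counts as 1 and False as 0, so you can get the count without building the filtered DataFrame at all: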

m = df['text'].str.count(' ').add(1).ge(n)
m.sum()
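The mask-and-sum idea can be wrapped in a small function so the threshold is easy to change. This is a minimal sketch: `count_long_articles` and `df_demo` are hypothetical names, and it counts whitespace-separated tokens with `str.split()` rather than counting spaces.

```python
import pandas as pd

def count_long_articles(df, n, column='text'):
    """Return how many rows of df[column] have at least n whitespace-separated words."""
    word_counts = df[column].str.split().str.len()
    # Summing a boolean mask counts the True values.
    return int(word_counts.ge(n).sum())

df_demo = pd.DataFrame({'text': [
    'Hog prices fell Tuesday after government',        # 6 words
    'Blue chip stocks rallied Tuesday after the Fed',  # 8 words
    'Shoppers are loading up this year on perennial',  # 8 words
]})

result = count_long_articles(df_demo, n=8)
print(result)
```

For the articles in the question, the call would be `count_long_articles(df, n=250)`.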




>>> df[df['text'].str.split().apply(len)>=8]
4  If dining at Planet Hollywood made you feel li
6  Blue chip stocks rallied Tuesday after the Fed
7  Sprint Corp Tuesday announced plans to offer I
8  Shoppers are loading up this year on perennial

If you want to filter by unique word count instead, convert the list produced by split into a set before taking its length:

>>> df[df['text'].str.split().apply(set).apply(len)>=8]
4  If dining at Planet Hollywood made you feel li
6  Blue chip stocks rallied Tuesday after the Fed
7  Sprint Corp Tuesday announced plans to offer I
8  Shoppers are loading up this year on perennial
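On the sample rows the unique count equals the total count because no word repeats. The sketch below uses made-up strings (not from the original data) to show how the two differ, and how lowercasing first changes the result if "The" and "the" should count as the same word. Whether to normalize case is an assumption about what "unique" should mean here.

```python
import pandas as pd

# Hypothetical strings containing repeated words.
s = pd.Series([
    'the cat saw the other cat',  # 6 tokens, 4 distinct
    'The the THE',                # 3 tokens, 3 distinct as written
])

total = s.str.split().str.len()
unique = s.str.split().apply(lambda words: len(set(words)))
# Case-insensitive variant: lowercase before splitting.
unique_ci = s.str.lower().str.split().apply(lambda words: len(set(words)))

print(pd.DataFrame({'total': total, 'unique': unique, 'unique_ci': unique_ci}))
```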
