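I have a DataFrame df with a column of text, df['text'] (in this case, newspaper articles). How can I get the number of rows in df['text'] whose word count exceeds a threshold of n words?
A sample of df is shown below. Each article can contain any number of words.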
print(df['text'].head(10))
0 Emerging evidence that Mexico economy was back...
1 Chrysler Corp Tuesday announced million in new...
2 CompuServe Corp Tuesday reported surprisingly ...
3 CompuServe Corp Tuesday reported surprisingly ...
4 If dining at Planet Hollywood made you feel li...
5 Hog prices fell Tuesday after government slaug...
6 Blue chip stocks rallied Tuesday after the Fed...
7 Sprint Corp Tuesday announced plans to offer I...
8 Shoppers are loading up this year on perennial...
9 Kansas and Arizona filed lawsuits against some...
Name: text, dtype: object
My goal with this data is to count the articles that contain more than n words. See the pseudocode below for an example.
n = 250  # word-count cutoff for counting
counter = 0
for row in df['text']:
    # wordcount is some function that counts the words in the string for one row
    if wordcount(row) >= n:
        counter += 1
print(counter)
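The desired output is the number of articles containing more than n words (here n is arbitrarily set to 250). In the pseudocode above, wordcount is some function that counts the words in one row (that is, one article). So for row x, if N (the number of words in the article) is 340, it is greater than n, the threshold of 250; the if statement is triggered and counter is incremented by 1.
Ideally I would like to do this in a vectorized way, since the DataFrame is large. If that is not possible, apply works fine.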
Sample input
import pandas as pd
from io import StringIO

d="""text
Emerging evidence that Mexico economy was back...
Chrysler Corp Tuesday announced million in new...
CompuServe Corp Tuesday reported surprisingly ...
CompuServe Corp Tuesday reported surprisingly ...
If dining at Planet Hollywood made you feel li...
Hog prices fell Tuesday after government slaug...
Blue chip stocks rallied Tuesday after the Fed...
Sprint Corp Tuesday announced plans to offer I...
Shoppers are loading up this year on perennial...
Kansas and Arizona filed lawsuits against some..."""
df=pd.read_csv(StringIO(d))
df
If we only want the number of rows with more than n words
n=7 # replace it with 250
df[df['text'].str.split().str.len() > n].count()
Output
text 4
dtype: int64
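Note that .count() on the filtered DataFrame returns one count per column (a Series) rather than a plain integer. Below is a minimal sketch of getting a scalar instead by summing the boolean mask, assuming the same whitespace-based definition of a word. The sample data and the names mask and count are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"text": [
    "one two three",
    "one two three four five",
    "one two three four five six",
]})

n = 4  # word-count threshold

# Boolean mask: True where the row has more than n whitespace-separated words
mask = df["text"].str.split().str.len() > n

# Summing a boolean Series counts the True values
count = int(mask.sum())
print(count)
```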
If we want the rows whose word count is greater than n
n=7 # replace it with 250
df[df['text'].str.split().str.len() > n]
Output
text
4 If dining at Planet Hollywood made you feel li...
6 Blue chip stocks rallied Tuesday after the Fed...
7 Sprint Corp Tuesday announced plans to offer I...
8 Shoppers are loading up this year on perennial...
If we want the word count for each row
df['len'] = df['text'].str.split().str.len()
df
Output
text len
0 Emerging evidence that Mexico economy was back... 7
1 Chrysler Corp Tuesday announced million in new... 7
2 CompuServe Corp Tuesday reported surprisingly ... 6
3 CompuServe Corp Tuesday reported surprisingly ... 6
4 If dining at Planet Hollywood made you feel li... 9
5 Hog prices fell Tuesday after government slaug... 7
6 Blue chip stocks rallied Tuesday after the Fed... 8
7 Sprint Corp Tuesday announced plans to offer I... 8
8 Shoppers are loading up this year on perennial... 8
9 Kansas and Arizona filed lawsuits against some... 7
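If the column may contain missing values, str.split().str.len() yields NaN for those rows. A small sketch of one way to handle that, assuming a missing article should be treated as having zero words (that choice is an assumption added here, and the sample data is hypothetical):

```python
import pandas as pd

# Hypothetical data with one missing article
df = pd.DataFrame({"text": ["a b c", None, "d e"]})

# The missing row produces NaN; fill it with 0 and cast back to integers
df["len"] = df["text"].str.split().str.len().fillna(0).astype(int)
print(df)
```

Without the fillna step, a comparison such as df['len'] > n evaluates to False for the missing rows, so they would be silently left out of the count.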
Assuming "words" are separated by single spaces, one approach is to count the spaces between words and add 1, then compare the result with n.
import pandas as pd
df = pd.DataFrame({
'text': {0: 'Emerging evidence that Mexico economy was back',
1: 'Chrysler Corp Tuesday announced million in new',
2: 'CompuServe Corp Tuesday reported surprisingly',
3: 'CompuServe Corp Tuesday reported surprisingly',
4: 'If dining at Planet Hollywood made you feel li',
5: 'Hog prices fell Tuesday after government slaug',
6: 'Blue chip stocks rallied Tuesday after the Fed',
7: 'Sprint Corp Tuesday announced plans to offer I',
8: 'Shoppers are loading up this year on perennial',
9: 'Kansas and Arizona filed lawsuits against s'}
})
n = 8
# The number of words is 1 more than the number of spaces
# Compare: greater than or equal to n
m = df['text'].str.count(' ').add(1).ge(n)
filtered_df = df[m]
print(filtered_df)
df (word counts shown as comments):
text
0 Emerging evidence that Mexico economy was back # 7
1 Chrysler Corp Tuesday announced million in new # 7
2 CompuServe Corp Tuesday reported surprisingly # 5
3 CompuServe Corp Tuesday reported surprisingly # 5
4 If dining at Planet Hollywood made you feel li # 9
5 Hog prices fell Tuesday after government slaug # 7
6 Blue chip stocks rallied Tuesday after the Fed # 8
7 Sprint Corp Tuesday announced plans to offer I # 8
8 Shoppers are loading up this year on perennial # 8
9 Kansas and Arizona filed lawsuits against s # 7
filtered_df:
text
4 If dining at Planet Hollywood made you feel li # 9
6 Blue chip stocks rallied Tuesday after the Fed # 8
7 Sprint Corp Tuesday announced plans to offer I # 8
8 Shoppers are loading up this year on perennial # 8
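If you only need the number of matching rows, use sum on the mask. True counts as 1 and False as 0, so there is no need to build the filtered DataFrame to get the count: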
m = df['text'].str.count(' ').add(1).ge(n)
print(m.sum())
Output:
4
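Counting spaces relies on there being exactly one space between words and no leading or trailing whitespace. A hedged sketch of an alternative that counts runs of non-whitespace characters with the regex \S+ instead; the sample strings and the names by_spaces and by_tokens are made up for illustration:

```python
import pandas as pd

# Hypothetical strings with irregular spacing
s = pd.Series(["two  words", " padded text here ", "plain text"])

# Spaces + 1: inflated by double, leading, or trailing spaces
by_spaces = s.str.count(" ").add(1)

# Runs of non-whitespace characters: one per word
by_tokens = s.str.count(r"\S+")

print(pd.DataFrame({"by_spaces": by_spaces, "by_tokens": by_tokens}))
```

On clean, single-spaced text the two approaches agree; they only diverge when the spacing is irregular.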
If you only want to filter by word count, split the text on whitespace and compare the length of the resulting list.
>>> df[df['text'].str.split().apply(len)>=8]
text
4 If dining at Planet Hollywood made you feel li
6 Blue chip stocks rallied Tuesday after the Fed
7 Sprint Corp Tuesday announced plans to offer I
8 Shoppers are loading up this year on perennial
If you want to filter by unique word count, you may want to convert the list produced by split into a set:
>>> df[df['text'].str.split().apply(set).apply(len)>=8]
text
4 If dining at Planet Hollywood made you feel li
6 Blue chip stocks rallied Tuesday after the Fed
7 Sprint Corp Tuesday announced plans to offer I
8 Shoppers are loading up this year on perennial
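As with the plain word count, summing the boolean mask gives the number of rows rather than the rows themselves. A sketch under one added assumption: words that differ only in case count as the same word (drop str.lower() to keep the case-sensitive behaviour shown above). The sample data and the name unique_counts are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"text": [
    "The cat saw the other cat",
    "Every word here is different",
]})

n = 5  # unique-word threshold

# Lowercase, split on whitespace, then count distinct words per row
unique_counts = (
    df["text"].str.lower().str.split().apply(lambda words: len(set(words)))
)

# Number of rows with at least n unique words
count = int((unique_counts >= n).sum())
print(count)
```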