How to remove rows from a data frame in R that contain only a few words



I am trying to remove rows that contain fewer than 5 words from a data frame. For example:

mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))
head(mydf)
     NO    ARTICLE
1    34    The New York Times reports a lot of words here.
2    12    Greenwire reports a lot of words.
3    31    Only three words.
4     2    The Financial Times reports a lot of words.
5     9    Greenwire short.
6    13    The New York Times reports a lot of words again.

I would like to delete the rows with 5 or fewer words. How can I do that?

Here are two ways:

mydf[sapply(gregexpr("\\W+", mydf$ARTICLE), length) > 4, ]
#   NO                                          ARTICLE
# 1 34  The New York Times reports a lot of words here.
# 2 12                Greenwire reports a lot of words.
# 4  2      The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.

mydf[sapply(strsplit(as.character(mydf$ARTICLE)," "),length)>5,]
#   NO                                          ARTICLE
# 1 34  The New York Times reports a lot of words here.
# 2 12                Greenwire reports a lot of words.
# 4  2      The Financial Times reports a lot of words.
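# 6 13 The New York Times reports a lot of words again.

The first finds, for each row, the start position of every run of non-word characters (the spaces between words, plus the trailing period) and then takes the length of that vector of positions.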
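
To see what the first expression is counting, it may help to look at the raw gregexpr() result for a few strings. The sketch below uses two ARTICLE values from the question plus a hypothetical one-word string ("Word") that is not in the original data. It illustrates that the matches are separators rather than words, and that a string with no match returns -1, which still has length 1.

```r
# Two strings from the question's ARTICLE column, plus a hypothetical edge case
x <- c("Only three words.", "Greenwire short.", "Word")

# One integer vector of match start positions per input string
m <- gregexpr("\\W+", x)

# What sapply(..., length) in the answer computes: the number of elements per vector
n_len <- sapply(m, length)

# A no-match result is the single value -1, so counting only positive
# positions avoids treating it as one separator
n_sep <- sapply(m, function(p) sum(p > 0))

print(n_len)
print(n_sep)
```

For the sample data every row ends in a period, so the edge case does not affect the result shown above; it would only matter for cells with no spaces or punctuation at all.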

The second splits the ARTICLE column into vectors of its component words and takes the length of each vector. This is probably the better approach.
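
As a sketch of the second approach on self-contained data, the block below rebuilds the sample rows by hand (an assumption standing in for the Excel file), splits on runs of whitespace rather than a single space so that repeated spaces do not create empty "words", and uses lengths() in place of sapply(..., length). The name word_count is purely illustrative.

```r
# Sample data typed in by hand to mirror the question's head(mydf) output
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Trim leading/trailing whitespace, split on one or more whitespace
# characters, and count the pieces in each row
word_count <- lengths(strsplit(trimws(mydf$ARTICLE), "\\s+"))

# Keep rows with more than 5 words
result <- mydf[word_count > 5, ]
print(result)
```

Changing the comparison to word_count >= 5 would also keep rows with exactly five words, which matters because the question states the cutoff both as "fewer than 5" and as "5 or fewer".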

The word count (wc) function in the qdap package can also help with this:

# Load qdap first: read.transcript(), qcv() and wc() all come from it
library(qdap)
dat <- read.transcript(text="34    The New York Times reports a lot of words here.
12    Greenwire reports a lot of words.
31    Only three words.
2    The Financial Times reports a lot of words.
9    Greenwire short.
13    The New York Times reports a lot of words again.", 
    col.names = qcv(NO, ARTICLE), sep="   ")
dat[wc(dat$ARTICLE) > 4, ]
##   NO                                          ARTICLE
## 1 34  The New York Times reports a lot of words here.
## 2 12                Greenwire reports a lot of words.
## 4  2      The Financial Times reports a lot of words.
## 6 13 The New York Times reports a lot of words again.
