r - Get unique strings from a vector of similar strings



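I'm not quite sure how to phrase this question. I've just started working with a bunch of tweets, and after some basic cleaning, some of them now look like this: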

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

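Basically, I want to remove duplicates by checking whether the leading part of the strings matches, and keep only the longest string of each group. In this case, my result should be: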

[1] "stackoverflow is a great site"
[2] "omg it is friday and so sunny"
[3] "arggh how annoying"

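All the other strings are truncated duplicates of these. I tried the unique() function, but it doesn't return the result I want, because it compares the strings over their full length. Any suggestions?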
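To illustrate the point about unique(), here is a minimal sketch using the sample vector from the question. It only shows that unique() drops exact matches, so truncated copies all survive.

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

# unique() compares whole strings, so each truncated copy counts as a distinct value
n_unique <- length(unique(x))          # 6: nothing is removed
unchanged <- identical(unique(x), x)   # TRUE
```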

I'm using R version 3.1.1 on Mac OSX 10.7…

Thanks!

Here's another option. I added one more string to your sample data.

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")
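# Keep y only if no other string starts with y
Filter(function(y) {
    # Truncate every other string to the length of y
    x2 <- sapply(setdiff(x, y), substr, start=1, stop=nchar(y))
    # y is flagged as a duplicate when one of the truncated strings equals it
    ! duplicated(c(y, x2), fromLast=TRUE)[1]
}, x)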

# [1] "stackoverflow is a great site" "stackoverflow is an OK site"   "omg it is friday and so sunny"
# [4] "arggh how annoying"

Here's my attempt:

library(stringr)
x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
[1] "stackoverflow is a great site" "omg it is friday and so sunny" "arggh how annoying" 

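Basically, I exclude every string that is already contained in another string. This may differ slightly from what you described, since it matches anywhere in the string rather than only at the start, but the effect is roughly the same and it is very simple.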
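For comparison, here is a minimal sketch of a prefix-only variant in base R. It is not from the original answers: `drop_prefixes` is a hypothetical helper name, `y` is made-up data, and `startsWith()` requires R 3.3.0 or later, which is newer than the R 3.1.1 mentioned in the question.

```r
# Hypothetical helper: drop every string that is a leading prefix of
# some other string in the vector. Exact duplicates are collapsed first,
# otherwise two identical strings would each count as a prefix of the other.
drop_prefixes <- function(x) {
  x <- unique(x)
  is_prefix <- vapply(seq_along(x), function(i) {
    any(startsWith(x[-i], x[i]))
  }, logical(1))
  x[!is_prefix]
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")
res_x <- drop_prefixes(x)

# Made-up case where substring and prefix matching disagree:
# "great site" occurs inside the second string, but not at its start,
# so the prefix-only version keeps both strings.
y <- c("great site", "this is a great site")
res_y <- drop_prefixes(y)
```

Whether the substring or the prefix behaviour is preferable depends on how the tweets were truncated; for strings cut off at the end, the prefix test matches the question's description more closely.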

@tonytonov's solution is good, but I'd suggest using the stringi package :)

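library(stringi)
library(stringr)
library(microbenchmark)

# Literal (fixed) substring matching with stringi
stringi <- function(x){
  x[!sapply(seq_along(x), function(i) any(stri_detect_fixed(x[-i], x[i])))]
}
# Regex-based matching with stringr, as in the answer above
stringr <- function(x){
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}
microbenchmark(stringi(x), stringr(x))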
Unit: microseconds
       expr     min       lq   median       uq      max neval
 stringi(x)  52.482  58.1760  64.3275  71.9630  120.374   100
 stringr(x) 538.482 551.0485 564.3445 602.7095 1736.601   100
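
One caveat when reading this benchmark: the two functions do not perform quite the same match. str_detect() treats its pattern as a regular expression by default, while stri_detect_fixed() matches it literally, which can matter for tweets containing characters such as "." or "?". The sketch below shows the same distinction using base R's grepl(), so that it runs without either package; the strings are made up for illustration.

```r
# Regex matching: "." stands for any single character
regex_hit <- grepl("a.c", "abc")                  # TRUE

# Literal matching: "." must be an actual dot
fixed_miss <- grepl("a.c", "abc", fixed = TRUE)   # FALSE
fixed_hit  <- grepl("a.c", "a.c", fixed = TRUE)   # TRUE
```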
