优化正则表达式 - 匹配、提取、脱妆



我使用以下 2 个函数在字符串中查找国家/地区名称,匹配名称,将其放入数据帧中的新列中,然后从原始字符串中删除国家/地区名称:

library("stringr")
ListofCountries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")
CoffeeTable <- data.frame(Product=c("Kenya Ndumberi", "Kenya Ndumberi", "Finca Nombre de Dios", "Finca La Providencia", "Las Penidas", "Las Penidas", "Las Penidas", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Progresso", "Progresso", "Progresso", "Progresso", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "nEl Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Geisha", "Geisha", "Geisha", "Pacamara", "Pacamara", "Pacamara", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Brazil yellow bourbon pea berry", "Finca El Vintilador", "nWashed Yirgacheffe", "Finca El Vintilador", "Washed Yirgacheffe", "Washed Yirgacheffe", "Washed Yirgacheffe", "Leza", "Finca La Libertad", "Pacamara", "Pacamara", "Pacamara", "Finca La Bolsa", "Thunguri Kenya", "Thunguri Kenya", "Thunguri Kenya", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Pedregal", "Pedregal", "Barrel Aged", "Pedregal", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "Amigo de Buesaco", "Amigo de Buesaco", "Amigo de Buesaco", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "nToarco Jaya Peaberry Sulawesi", "El Cypress", "El Cypress", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro"))
CoffeeTable$Country <- str_trim(str_match(tolower(CoffeeTable$Product), 
                                            tolower(paste(ListofCountries, collapse="|")))[,1])

CoffeeTable$Product <- str_trim(gsub(tolower(paste(ListofCountries, collapse="|")), replacement="", 
                          CoffeeTable$Product, ignore.case=T))

问题 1 - 这很慢。如何使这些功能更快?

问题 2 - 这只捕获国家的正式名称。有谁知道一个好的常见国名列表?(例如"中国"与"中华人民共和国")

谢谢!


编辑:这是90个咖啡名称的列表,使其成为可复制的示例;我想补充一点,在我的实际应用程序中,CoffeeTable 已经存在并且有 ~2,000 行和 45 列。我不是在寻找更快的方法来构建 data.frame/等。

谢谢!

编辑 2:问题 2 已得到解答,现在我只是尝试优化 2 个功能,这样它们就不会花费 5 - 10 秒来运行!

对于您的第二个问题,这里有一个广泛的选项列表。试试这个:

countries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")

编辑:回应OP的评论。

给定您提供的示例数据,并复制 25 倍以创建与实际数据中大致相同的行数,您的代码在大约 1.6 秒内运行。很难相信你的系统和我的系统之间有8倍的差异,所以一定还有其他事情发生。

我唯一能推荐的是查看gsubfn包中的strapplyc(...)。这应该非常有效,但在我的系统上实际上比你的代码慢。

有关示例和基准测试,请参阅下面的代码。对不起,我不能提供更多帮助...

library(stringr)
df <- CoffeeTable
df$Product=as.vector(df$Product)
df=rbind(df,df,df,df,df)    # replicate 25X
df=rbind(df,df,df,df,df)    # total rows = 2250
pattern    <- tolower(paste(ListofCountries$name,collapse="|"))
f1 = function(df){
  df$Country <- str_trim(str_match(tolower(df$Product), pattern)[,1])
  df$Product <- str_trim(gsub(pattern, "",df$Product, ignore.case=T))
  return(df)
}
library(gsubfn)
library(tcltk2)
f2 = function(df){
  df$Country <- strapplyc(tolower(df$Product),pattern)
  df$Product <- str_trim(gsub(pattern,"", df$Product, ignore.case=T))
  return(df)
}
library(microbenchmark)
microbenchmark(df1<-f1(df),df2<-f2(df),times=10)
# Unit: seconds
#           expr      min       lq   median       uq      max neval
#  df1 <- f1(df) 1.365222 1.506017 1.611458 1.689611 1.722626    10
#  df2 <- f2(df) 2.006162 2.055963 2.148158 2.249707 2.285955    10

好的,回到第一个问题。这可能不是最有效的解决方案,但它有效。

我建议的第一件事是在最初生成 CoffeeTable 数据框时指定stringsAsFactors = FALSE。否则,您最终会得到因素。我还将此表中的初始数据列重命名为"复合",以便您可以看到分离的结果。

match <- gregexpr(tolower(paste(ListofCountries$name, collapse="|")),
    tolower(CoffeeTable$Composite))
CoffeeTable$Country <- sapply(regmatches(CoffeeTable$Composite, match),
    function(m) {ifelse(length(m), m, "")})
CoffeeTable$Product <- sapply(regmatches(CoffeeTable$Composite, match, invert = TRUE),
    function(m) {paste0(m, collapse = "")})

结果如下所示:

> head(CoffeeTable, 10)
              Composite Country              Product
1        Kenya Ndumberi   Kenya             Ndumberi
2        Kenya Ndumberi   Kenya             Ndumberi
3  Finca Nombre de Dios         Finca Nombre de Dios
4  Finca La Providencia         Finca La Providencia
5           Las Penidas                  Las Penidas
6           Las Penidas                  Las Penidas
7           Las Penidas                  Las Penidas
8         Panama Duncan  Panama               Duncan
9         Panama Duncan  Panama               Duncan
10        Panama Duncan  Panama               Duncan

最新更新