I have a data.table with company names and address information. I want to remove legal-entity suffixes and the most common words from the company names, so I wrote a function and applied it to my data.table.
library(data.table)
library(stringr)

search_for_default <- c("inc", "corp", "co", "llc", "se", "&", "holding", "professionals",
                        "services", "international", "consulting", "the", "for")

clean_strings <- function(string, search_for = search_for_default) {
  clean_step1 <- str_squish(str_replace_all(string, "[:punct:]", " "))           # remove punctuation
  clean_step2 <- unlist(str_split(tolower(clean_step1), " "))                    # split into tokens
  clean_step2 <- clean_step2[!str_detect(clean_step2, "^american|^canadian")]    # clean up geographical names
  res <- str_squish(str_c(clean_step2[!clean_step2 %in% search_for], sep = "", collapse = " "))  # remove legal entities and common words
  res <- paste(unique(unlist(str_split(res, " "))), collapse = " ")              # paste string back together
  return(res)
}

datatable[, COMPANY_NAME_clean := clean_strings(COMPANY_NAME), by = COMPANY_NAME]
The script works fine, but on a large dataset (>3 billion rows) it takes very long. Is there a more efficient way?
Example:
Input:
Company_Name <- c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.", "American Test Company for Consulting")
Expected:
Company_name_clean <- c("walmart", "amazon.com", "apple", "test company")
Here is how I would do it:
library(stringr)

# "\b" must be written as "\\b" in an R string, and the native-pipe
# placeholder `_` cannot be spliced into `...`, so wrap paste0() in a lambda
words <- c("corp", "co", "llc", "se", "&", "holding", "professionals", "services",
           "international", "consulting", "the", "for", "american", "canadian") |>
  (\(w) paste0("\\b", w, "\\b", collapse = "|"))()

others <- c("inc\\.", ",") |>
  paste(collapse = "|")

Company_Name |>
  tolower() |>
  str_remove_all(pattern = paste0(words, "|", others)) |>
  str_squish()  # also collapses the double spaces left by removed inner words
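To make the expected output easy to verify, the pipeline above can be wrapped in a small self-contained function. This is a sketch: the function name `clean_company_names` is my own, and I moved `&` out of the word-boundary alternation into the literal patterns, because `\b` only anchors next to word characters and would never match a free-standing `&`.

```r
library(stringr)

clean_company_names <- function(x) {
  stop_words <- c("corp", "co", "llc", "se", "holding", "professionals",
                  "services", "international", "consulting", "the", "for",
                  "american", "canadian")
  # whole-word alternation: "\bcorp\b|\bco\b|..."
  words  <- paste0("\\b", stop_words, "\\b", collapse = "|")
  # literal fragments; "&" lives here because \b needs an adjacent word character
  others <- paste(c("inc\\.", ",", "&"), collapse = "|")
  x |>
    tolower() |>
    str_remove_all(pattern = paste0(words, "|", others)) |>
    str_squish()  # collapse the double spaces left by removed inner words
}

Company_Name <- c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.",
                  "American Test Company for Consulting")
clean_company_names(Company_Name)
#> "walmart" "amazon.com" "apple" "test company"
```

Note that `str_squish()` rather than `str_trim()` is what turns `"test company  "` (with the inner double space left by removing `"for"`) into `"test company"`.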
A few notes:
- In your code you match all punctuation. You don't actually want that, because it would strip the dot from amazon.com. Only match what you need.
- 3 billion rows is a lot! It is unrealistic to expect any script to run fast on a data frame of that size, although this should be faster than what you currently have.
- If you can somehow reduce the number of rows (e.g. by removing duplicates), that would help a lot.
- If you search for "co" without boundaries, you get false positives, such as matching "Starbucks Coffee". Match on word boundaries ("\b") to avoid this problem.
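The deduplication idea from the notes can be sketched with data.table: clean each distinct name exactly once in a lookup table, then join the result back, instead of applying the regex row by row. This is a sketch with a toy three-row table and a deliberately simplified cleaning pattern; your real table and full pattern go in their place.

```r
library(data.table)
library(stringr)

# toy stand-in for the real billion-row table
datatable <- data.table(
  COMPANY_NAME = c("Walmart Inc.", "Apple Inc.", "Walmart Inc.")
)

# clean every distinct name exactly once ...
lookup <- unique(datatable[, .(COMPANY_NAME)])
lookup[, COMPANY_NAME_clean := str_squish(
  str_remove_all(tolower(COMPANY_NAME), "inc\\.|,")
)]

# ... then join the cleaned values back onto the full table
datatable <- lookup[datatable, on = "COMPANY_NAME"]
datatable$COMPANY_NAME_clean
#> "walmart" "apple" "walmart"
```

Your original call already groups with `by = COMPANY_NAME`, which evaluates the function once per distinct name, but it still pays the grouping overhead over all rows; a small lookup table joined back should scale better when names repeat heavily.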