r - Find all possible phrase matches between a string and a lookup table

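I have a data frame containing a bunch of text strings. In a second data frame, I have a list of phrases that I use as a lookup table. I want to search the text strings for every phrase in the lookup table that could possibly match.

My problem is that some of the phrases have overlapping words. For example: "eggs" and "green eggs".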

library(udpipe)
library(dplyr)
library(stringr)  # provides str_count(), used below
# Download the English model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Create example data
sample <- data.frame(doc_id = 1, text = "the cat in the hat ate green eggs and ham")
phrases <- data.frame(phrase = c("cat", "hat", "eggs", "green eggs", "ham", "the cat"))
# Tokenize text
x <- udpipe_annotate(ud_model, x = sample$text, doc_id = sample$doc_id)
x <- as.data.frame(x)
x$token <- tolower(x$token)
test_results <- x %>% select(doc_id, token)
test_results$term <- txt_recode_ngram(test_results$token,
                                      compound = phrases$phrase,
                                      ngram = str_count(phrases$phrase, '\\w+'),
                                      sep = " ")
# Remove any tokens that don't match a phrase in the lookup table
test_results <- filter(test_results, term %in% phrases$phrase)

In the results, you can see that "the cat" is returned rather than "cat", and "green eggs" is returned but not "eggs".

> test_results$term
[1] "the cat"    "hat"        "green eggs" "ham"

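How can I find all possible phrase matches between the text string and the lookup table?

I should add that I'm not wedded to any particular package. I'm only using udpipe here because it's the one I'm most familiar with.

I think you can simply use grepl to check whether one string occurs inside another. From there, you can apply grepl to every pattern in the lookup table: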

# Create example data
sample <- data.frame(doc_id = 1, text = "the cat in the hat ate green eggs and ham")
phrases <- data.frame(phrase = c("cat", "hat", "eggs", "green eggs", "ham", "the cat"))
# Test each phrase (one per row) as a pattern against the text
apply(phrases, 1, grepl, sample$text)

If you want the matching phrases themselves, you can do:

phrases[apply(phrases, 1, grepl, sample$text), ]

But a data frame may not be the most suitable type for the phrases; a plain character vector is probably enough.
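
One thing to watch with plain grepl is that it matches substrings, so "ham" would also hit "hamster". Below is a minimal sketch of the same idea with the phrases held in a character vector and each one wrapped in word boundaries. It assumes the phrases contain no regex metacharacters, and the names `patterns`, `hits` and `matches` are only illustrative.

```r
# Example data from the question; phrases kept as a plain character vector
sample <- data.frame(doc_id = 1,
                     text = "the cat in the hat ate green eggs and ham",
                     stringsAsFactors = FALSE)
phrases <- c("cat", "hat", "eggs", "green eggs", "ham", "the cat")

# Wrap each phrase in word boundaries so "ham" does not match "hamster".
# Assumption: the phrases contain no regex metacharacters.
patterns <- paste0("\\b", phrases, "\\b")

# Test every pattern against every document.
# Reshape so there is always one row per document and one column per phrase.
hits <- sapply(patterns, grepl, x = sample$text, perl = TRUE)
hits <- matrix(hits, nrow = nrow(sample),
               dimnames = list(as.character(sample$doc_id), phrases))

# Long format: one row for every (doc_id, phrase) pair that matched
idx <- which(hits, arr.ind = TRUE)
matches <- data.frame(doc_id = sample$doc_id[idx[, "row"]],
                      phrase = phrases[idx[, "col"]],
                      stringsAsFactors = FALSE)
print(matches)
```

Because every phrase is tested independently, overlapping phrases such as "eggs" and "green eggs" are both reported. If `sample` has several rows, `matches` should get one row per document and matching phrase.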
