R - Running a "match" between two different datasets



I have two different datasets.

The first is called people and is structured as follows:

people <- structure(list(userID = c(175890530, 178691082, 40228319, 472555502, 
1063565418, 242983504, 3253221155), bio = c("Living in Atlanta", 
        "Born in Seattle, resident of Phoenix", "Columbus, Ohio", "Bronx born and raised", 
        "What's up Chicago?!?!", "Product of Los Angeles, taxpayer in St. Louis", 
        "Go Dallas Cowboys!")), class = "data.frame", row.names = c(NA, 
                                                                    -7L))

The second is called location and is structured as follows:

location <- structure(list(city = c("Atlanta", "Seattle", "Phoenix", "Columbus", 
"Bronx", "Chicago", "Los Angeles", "St. Louis", "Dallas"), state = c("GA", 
                                 "WA", "AZ", "OH", "NY", "IL", "CA", "MO", "TX")), class = "data.frame", row.names = c(NA, 
                                                                                                                       -9L))

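What I am trying to do is run a "match" against the bio field in the people dataset, matching its strings against the city field in the location dataset.

In theory I could do something like this: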

mutate(city = str_extract_all(bio, "Atlanta|Seattle|Phoenix|Columbus|Bronx|Chicago|Los Angeles|St. Louis|Dallas"))

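This will not work in practice, because I will be working with much more data and many more possible cities, so it cannot be hard-coded. I am looking for output structured like this: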

complete <- structure(list(userID = c(175890530, 178691082, 40228319, 472555502, 
1063565418, 242983504, 3253221155), bio = c("Living in Atlanta", 
"Born in Seattle, resident of Phoenix", "Columbus, Ohio", "Bronx born and raised", 
"What's up Chicago?!?!", "Product of Los Angeles, taxpayer in St. Louis", 
"Go Dallas Cowboys!"), city_return = c("Atlanta", "Seattle, Phoenix", 
"Columbus", "Bronx", "Chicago", "Los Angeles, St. Louis", "Dallas"
)), class = "data.frame", row.names = c(NA, -7L))

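The idea is to go through each row of people$bio, "match" it against all the possibilities in location$city, and create a new data frame called complete that contains the userID and bio fields from the people dataset plus a new column called city_return that holds the matches we are looking for.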

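library(tidyverse)

people %>%
  separate_rows(bio) %>%                            # split each bio into one row per word
  left_join(location, by = c("bio" = "city")) %>%   # look each word up as a city name
  filter(!is.na(state))                             # keep only the words that matched a city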

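This mostly works, but there are two problems:

A misspelling such as "Atlanna" will not match "Atlanta". A fuzzy join (for example with the fuzzyjoin package) might handle this, but it could also produce false positives. The output below, which has no row for user 175890530, appears to reflect such a misspelling in the data it was run on.

Los Angeles is not matched, because this approach only matches single words. See below for a way to handle two-word city names. You can run both approaches and combine the results.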

# A tibble: 6 × 3
userID bio      state
<dbl> <chr>    <chr>
1  178691082 Seattle  WA    # two places mentioned for this user
2  178691082 Phoenix  AZ    # two places mentioned for this user
3   40228319 Columbus OH   
4  472555502 Bronx    NY   
5 1063565418 Chicago  IL    
6 3253221155 Dallas   TX 
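On the misspelling problem, here is a minimal sketch of approximate matching. It uses base R's agrepl() rather than the fuzzyjoin package mentioned above, so it is a swapped-in technique, not the answer's method. The misspelled bio and the names bios, cities and fuzzy_hits are hypothetical, introduced only for illustration.

```r
# Hypothetical input: the first bio contains the misspelling "Atlanna".
bios   <- c("Living in Atlanna", "Go Dallas Cowboys!")
cities <- c("Atlanta", "Dallas")

# For each bio, keep the cities that occur in it approximately.
# max.distance = 1L allows a small edit difference; this tolerance is an
# assumption and would need tuning on real data.
fuzzy_hits <- lapply(bios, function(b) {
  is_hit <- vapply(
    cities,
    function(ct) agrepl(ct, b, max.distance = 1L, fixed = TRUE),
    logical(1)
  )
  cities[is_hit]
})

print(fuzzy_hits)
```

The looser the tolerance, the more likely short city names are to match unrelated words, which is the false-positive risk noted above.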

If we want to capture two-word cities, we can do something like this:

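left_join(
  # every two-word sequence found in each bio
  people %>% tidytext::unnest_ngrams(bio, bio, n = 2),
  # two-word city names, put in a column also named bio so the join key matches
  location %>% tidytext::unnest_ngrams(bio, city, n = 2) %>%
    filter(!is.na(bio))) %>%
  filter(!is.na(state))   # keep only the bigrams that matched a city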

Result

Joining, by = "bio"
userID         bio state
1 242983504 los angeles    CA
2 242983504    st louis    MO

You can use bind_rows( [first code], [second code] ) to get the complete output.
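As an alternative to combining the two joins, the hard-coded pattern from the question can be built from location$city instead. The sketch below uses only base R and is not part of the answer above. The pattern variable, the \Q...\E quoting, the word boundaries and the longest-first ordering are assumptions added for illustration; the result is stored in complete with a city_return column, as the question asks.

```r
# Sample data from the question
people <- data.frame(
  userID = c(175890530, 178691082, 40228319, 472555502,
             1063565418, 242983504, 3253221155),
  bio = c("Living in Atlanta", "Born in Seattle, resident of Phoenix",
          "Columbus, Ohio", "Bronx born and raised", "What's up Chicago?!?!",
          "Product of Los Angeles, taxpayer in St. Louis", "Go Dallas Cowboys!")
)
location <- data.frame(
  city  = c("Atlanta", "Seattle", "Phoenix", "Columbus", "Bronx",
            "Chicago", "Los Angeles", "St. Louis", "Dallas"),
  state = c("GA", "WA", "AZ", "OH", "NY", "IL", "CA", "MO", "TX")
)

# Put longer names first so they win over any shorter name they contain.
cities <- location$city[order(nchar(location$city), decreasing = TRUE)]

# \Q...\E (PCRE) quotes each name literally, so the "." in "St. Louis"
# is not treated as a wildcard; \b restricts matches to whole words.
pattern <- paste0("\\b(?:", paste0("\\Q", cities, "\\E", collapse = "|"), ")\\b")

# Extract every city mentioned in each bio, in order of appearance.
matches <- regmatches(people$bio, gregexpr(pattern, people$bio, perl = TRUE))

# Collapse the matches into one comma-separated string per user.
complete <- people
complete$city_return <- vapply(matches, paste, character(1), collapse = ", ")

print(complete)
```

Because the pattern is generated from the lookup table, adding cities to location needs no code changes. Matching is exact and case-sensitive, so misspellings are still missed.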
