Joining two large datasets in R using exact and fuzzy matching



I'm trying to inner join two datasets. df1, with 50,000 observations, looks like this:

Name              | Line.1           | Line.2     | Town       | County       | Postcode 
-------------------|------------------|------------|------------|--------------|---------- 
ACME Inc          | 63 Long Street   |            | Fakeington | Lincolnshire | PA4 8QU  
BETA LTD          | 91a              | Main Drove | Cloud City | Something    | BN1 6LD  
The Giga          | 344 Lorem Street |            | Ipsom      | Dolor        | G2 8LY   

df2, with 500,000 observations, looks like this:

Name              | AddressLine1   | AddressLine2     | AddressLine3 | AddressLine4 | Postcode | RatingValue 
-------------------|----------------|------------------|--------------|--------------|----------|------------- 
ACME              |                | 63 Long Street   | Fakeington   | Lincolnshire | PA4 8QU  | 1           
Random Company    |                | Rose Ave         | Fakeington   |              | AB2 51GL | 5           
BETA Limited      | Business House | 91a Main Drove   | Something    |              | BN1 6LD  | 3           
Giga Incorporated |                | 344 Lorem Street | Ipsum        | Dolor        | G2 8LY   | 5           

I want to end up with df_final:

Name              | Postcode | RatingValue 
-------------------|----------|------------- 
ACME Inc          | PA4 8QU  | 1           
BETA LTD          | BN1 6LD  | 3           
Giga Incorporated | G2 8LY   | 5           

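These are one-to-one matches, and every value in df1 should exist in df2. Postcode is an exact match, whereas the address is split across several lines with no regular pattern, so I think my best option is to match on Name.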

I tried the fuzzyjoin package, but I got Error: cannot allocate vector of size 120.6 Gb, so I guess I have to use another method that works with larger datasets.

Any ideas on the best way to solve this?

df1 <- data.frame(
stringsAsFactors = FALSE,
Name = c("ACME Inc", "BETA LTD", "Giga Incorporated"),
Line.1 = c("63 Long Street", "91a", "344 Lorem Street"),
Line.2 = c(NA, "Main Drove", NA),
Town = c("Fakeington", "Cloud City", "Ipsom"),
County = c("Lincolnshire", "Something", "Dolor"),
Postcode = c("PA4 8QU", "BN1 6LD", "G2 8LY")
)
df2 <- data.frame(
stringsAsFactors = FALSE,
Name = c("ACME", "Random Company","BETA Limited","Giga Incorporated"),
AddressLine1 = c(NA, NA, "Business House", NA),
AddressLine2 = c("63 Long Street", "Rose Ave","91a Main Drove","344 Lorem Street"),
AddressLine3 = c("Fakeington", "Fakeington", "Something", "Ipsum"),
AddressLine4 = c("Lincolnshire", NA, NA, "Dolor"),
Postcode = c("PA4 8QU", "AB2 51GL", "BN1 6LD", "G2 8LY"),
RatingValue = c(1L, 5L, 3L, 5L)
)

Maybe the following does what the question asks for. It uses the stringdist package rather than fuzzyjoin.

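First, since the Postcode match is exact, merge on Postcode only. Then compute the similarity between the two Name columns, and keep only the rows whose similarity is at or above a predetermined threshold.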
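Before settling on a threshold, it may help to look at the similarity scores for a few name pairs. The snippet below is only a sketch: it uses the three name pairs from the question's sample data, and the lower-casing step is an assumption of mine rather than part of the approach described above.

```r
library(stringdist)

# Name pairs taken from the sample data in the question
a <- c("ACME Inc", "BETA LTD", "Giga Incorporated")
b <- c("ACME", "BETA Limited", "Giga Incorporated")

# stringsim() is vectorised and returns values in [0, 1]; 1 means identical
sim_raw <- stringsim(a, b, method = "jw")

# Optional normalisation (an assumption, not in the original answer):
# comparing lower-cased names so that "LTD" and "Limited" share more characters
sim_lower <- stringsim(tolower(a), tolower(b), method = "jw")

scores <- data.frame(a, b, sim_raw, sim_lower)
print(scores)
```

Printing the scores for a sample of real rows is a cheap way to check whether a cut-off such as 0.75 separates true matches from false ones in your data.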

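thresh <- 0.75  # minimum name similarity (stringdist method "jw") to count as a match
# Exact join on Postcode; df2's Name keeps its name, df1's Name becomes Name.y
df_final <- merge(df2[c("Name", "Postcode", "RatingValue")], df1[c("Name", "Postcode")],
                  by = "Postcode", suffixes = c("", ".y"))
# stringsim() is vectorised, so no row-wise apply() is needed
i <- stringdist::stringsim(df_final$Name, df_final$Name.y, method = "jw") >= thresh
df_final <- df_final[i, c("Name", "Postcode", "RatingValue")]
df_final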
#               Name Postcode RatingValue
#1      BETA Limited  BN1 6LD           3
#2 Giga Incorporated   G2 8LY           5
#3              ACME  PA4 8QU           1
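Since the question says the matches are one-to-one, a postcode shared by several businesses in df2 could still leave more than one candidate per df1 row. One possible extension, sketched below, keeps only the best-scoring candidate for each df1 row. The `id` column, the extra "Other Shop" row, and the lower-casing step are hypothetical additions for illustration; they are not part of the original data or answer.

```r
library(stringdist)

df1 <- data.frame(
  Name = c("ACME Inc", "BETA LTD", "Giga Incorporated"),
  Postcode = c("PA4 8QU", "BN1 6LD", "G2 8LY"),
  stringsAsFactors = FALSE
)
# "Other Shop" is a made-up row sharing a postcode with "Giga Incorporated"
df2 <- data.frame(
  Name = c("ACME", "Random Company", "BETA Limited", "Giga Incorporated", "Other Shop"),
  Postcode = c("PA4 8QU", "AB2 51GL", "BN1 6LD", "G2 8LY", "G2 8LY"),
  RatingValue = c(1L, 5L, 3L, 5L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical row id so that duplicate names in df1 stay distinguishable
df1$id <- seq_len(nrow(df1))

# Exact join on Postcode: only names sharing a postcode are ever compared
cand <- merge(df1, df2, by = "Postcode", suffixes = c(".df1", ".df2"))
cand$sim <- stringsim(tolower(cand$Name.df1), tolower(cand$Name.df2), method = "jw")

# For each df1 row, keep the candidate with the highest similarity
cand <- cand[order(cand$id, -cand$sim), ]
best <- cand[!duplicated(cand$id), ]

# Drop rows whose best candidate is still a weak match
thresh <- 0.75
best <- best[best$sim >= thresh, c("Name.df1", "Postcode", "RatingValue")]
print(best)
```

Because names are only compared within a postcode, the number of comparisons grows with the size of the merged table rather than with every df1 by df2 pair, which is presumably what exhausted memory in the fuzzyjoin attempt.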
