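I have a data frame containing a long, messy string of home amenities. I want to split the string into its unique amenities, create a new column in the data frame for each unique amenity, and record in each new column whether that amenity is present or absent in the string. Using nested `for` loops, I found a way to do this. What I would like to know, however, is how to use the `apply` family of functions or a `dplyr` approach to avoid the loops and get the same result.
Reproducible data: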
df <- data.frame(
  id = 1:4,
  amenities = c('{"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating}',
                '{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed"}',
                '{"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector",Carbon monoxide}',
                '{Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}'))
What I have done so far:
amenities_clean <- gsub('[{}"]', '', df$amenities) # remove unwanted stuff
amenities_split <- strsplit(amenities_clean, ",") # split rows into individual amenities
amenities_unique <- unique(unlist(strsplit(amenities_clean, ","))) # get a list of unique amenities
df[amenities_unique] <- NA # set up the columns for each amenity
To record in the new columns whether each individual amenity is present in the string, I used `str_detect` and nested `for` loops:
# record presence/absence of individual amenities in each new column:
library(stringr)
for(i in 1:ncol(df[amenities_unique])){
  for(j in 1:nrow(df)){
    df[amenities_unique][j,i] <-
      ifelse(str_detect(amenities_split[j], names(df[amenities_unique][i])), 1, 0)
  }
}
Although this produces warnings, they seem harmless, since the result looks fine:
df
id amenities Wireless Internet
1 1 {"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating} 1
2 2 {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed"} 1
3 3 {"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector",Carbon monoxide} 0
4 4 {Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"} 0
Wheelchair accessible Kitchen Elevator Buzzer/wireless intercom Heating TV Cable TV Internet Air conditioning Smoking allowed
1 1 1 1 1 1 0 0 1 0 0
2 0 1 0 0 0 1 1 1 1 1
3 0 0 0 1 1 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
Pets allowed Family/kid friendly Smoke detector Carbon monoxide Washer Dryer Essentials Shampoo Hangers Laptop friendly workspace
1 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0
3 0 1 1 1 0 0 0 0 0 0
4 0 0 0 0 1 1 1 1 1 1
Given the warnings and the complexity of the nested loops, how can I get the same result using a function from the `apply` family or using `dplyr`?
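A note on the loop above, offered as a sketch rather than a definitive diagnosis: the warnings most likely come from passing `amenities_split[j]` (a one-element list) to `str_detect`, where `amenities_split[[j]]` would pass the character vector itself. Separately, pattern matching treats each column name as a regular expression, so `Internet` is also found inside `Wireless Internet`. That is why row 1 shows a 1 under Internet in the output above even though it only lists Wireless Internet. The base R snippet below uses `grepl` as a stand-in for `str_detect` to contrast substring matching with exact matching via `%in%`; the names `row1`, `substring_hit` and `exact_hit` are only illustrative.

```r
# Amenities of row 1 after cleaning and splitting
row1 <- c("Wireless Internet", "Wheelchair accessible", "Kitchen",
          "Elevator", "Buzzer/wireless intercom", "Heating")

# Substring/regex matching: "Internet" is found inside "Wireless Internet"
substring_hit <- any(grepl("Internet", row1))

# Exact matching: "Internet" is not one of the elements
exact_hit <- "Internet" %in% row1

print(c(substring = substring_hit, exact = exact_hit))
```

If exact matching is what you want, the approaches that compare whole amenity names will differ from the loop's result in cells like this one.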
Once the amenities are cleaned, you can use `cSplit_e` from `splitstackshape`.
df$amenities_clean <- gsub('[{}"]', '', df$amenities)
splitstackshape::cSplit_e(df, "amenities_clean", type = "character", fill = 0)
To solve it with one of the `apply` functions, we can do:
temp <- strsplit(df$amenities_clean, ",")
amenities_unique <- unique(unlist(temp))
cbind(df, t(sapply(temp, function(x)
  table(factor(x, levels = amenities_unique)))))
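To unpack the `sapply` approach: `factor(x, levels = amenities_unique)` fixes the set of possible amenities, so `table()` returns a count for every amenity, including zeros for the absent ones, and `t()` turns the result into one row per listing. A closely related sketch, assuming each amenity appears at most once per row, uses `vapply` with `%in%` to build the same kind of 0/1 matrix with exact matching. The two-row data frame and the names `flags` and `result` are made up for illustration.

```r
# Small illustrative data set (not the one from the question)
df <- data.frame(
  id = 1:2,
  amenities = c('{TV,"Cable TV",Internet,Kitchen}',
                '{"Wireless Internet",Kitchen}'),
  stringsAsFactors = FALSE
)

amenities_clean  <- gsub('[{}"]', '', df$amenities)
amenities_split  <- strsplit(amenities_clean, ",")
amenities_unique <- unique(unlist(amenities_split))

# One row per listing, one column per amenity: 1L if present, 0L otherwise
flags <- t(vapply(
  amenities_split,
  function(x) as.integer(amenities_unique %in% x),
  integer(length(amenities_unique))
))
colnames(flags) <- amenities_unique

result <- cbind(df, flags)
print(result)
```

`vapply` is used here instead of `sapply` only because it checks the type and length of each result; the logic is otherwise the same idea as the answer above.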
I believe this gives you the output you need:
library(tidyverse)
df %>%
  mutate(amenities = str_replace_all(amenities, '["{}]', '')) %>%
  separate_rows(amenities, sep = ",") %>%
  pivot_wider(names_from = amenities, values_from = amenities, values_fn = list(amenities = is.character)) %>%
  mutate_all(replace_na, 0)
The result is:
# A tibble: 4 x 22
id `Wireless Inter~ `Wheelchair acc~ Kitchen Elevator `Buzzer/wireles~ Heating TV `Cable TV` Internet `Air conditioni~ `Smoking allowe~
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1 0 0 0 0 0
2 2 1 0 1 0 0 0 1 1 1 1 1
3 3 0 0 0 0 1 1 0 0 0 0 0
4 4 0 0 0 0 0 0 0 0 0 0 0
# ... with 10 more variables: `Pets allowed` <dbl>, `Family/kid friendly` <dbl>, `Smoke detector` <dbl>, `Carbon monoxide` <dbl>, Washer <dbl>,
# Dryer <dbl>, Essentials <dbl>, Shampoo <dbl>, Hangers <dbl>, `Laptop friendly workspace` <dbl>
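The same reshaping can also be written without `values_fn` and `mutate_all`. The following is a sketch under the assumption of a reasonably recent tidyr, not something verified against the output above: it adds a helper column (called `present` here, a hypothetical name) and lets `pivot_wider()` fill the gaps through `values_fill`. The two-row data frame is made up for illustration. Because amenities are compared as whole strings, `Internet` is not counted for a row that only has `Wireless Internet`.

```r
library(dplyr)
library(tidyr)

# Small illustrative data set (not the one from the question)
df <- data.frame(
  id = 1:2,
  amenities = c('{TV,"Cable TV",Internet,Kitchen}',
                '{"Wireless Internet",Kitchen}'),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(amenities = gsub('[{}"]', '', amenities)) %>%  # strip braces and quotes
  separate_rows(amenities, sep = ",") %>%               # one row per (id, amenity)
  distinct(id, amenities) %>%                           # guard against repeated amenities
  mutate(present = 1L) %>%                              # helper flag column
  pivot_wider(names_from = amenities,
              values_from = present,
              values_fill = list(present = 0L))         # absent amenities become 0

print(wide)
```

The `distinct()` step is optional for this data; it is included so that a repeated amenity in one string would not produce duplicate cells when widening.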