Suppose we have a list lis:
chicago = data.frame('city' = rep('chicago'), 'year' = c(2018,2019,2020), 'population' = c(100, 105, 110))
paris = data.frame('city' = rep('paris'), 'year' = c(2018,2019,2020), 'population' = c(200, 205, 210))
berlin = data.frame('city' = rep('berlin'), 'year' = c(2018,2019,2020), 'population' = c(300, 305, 310))
bangalore = data.frame('city' = rep('bangalore'), 'year' = c(2018,2019,2020), 'population' = c(400, 405, 410))
lis = list(chicago = chicago, paris = paris, berlin = berlin, bangalore = bangalore)
Now I have a new data frame df containing the latest data for each city:
df = data.frame('city' = c('chicago', 'paris', 'berlin', 'bangalore'), 'year' = rep(2021), 'population' = c(115, 215, 315, 415))
I want to append each row of df to the matching element of lis, based on city.
This is what I currently do:
# convert to data frame
lis = dplyr::bind_rows(lis)
#rbind
lis = rbind(lis, df)
#again convert to list
lis = split(lis, lis$city)
This is inefficient for large datasets. Are there any efficient alternatives for large datasets? Thanks.
Edit
My original list contains 2239 data frames, and each data frame has dimensions 310x15.
Estimated execution times,
Best performing:
library(data.table)
rbindlist(c(lis, list(df)))[, .(split(.SD, city))]$V1
Unit: milliseconds
expr min lq mean median uq max neval
av() 823.2123 850.56 933.109 865.7741 921.9321 1268.007 100
followed by,
lis = dplyr::bind_rows(lis)
#rbind
lis = rbind(lis, df)
#again convert to list
lis = split(lis, lis$city)
Unit: seconds
expr min lq mean median uq max neval
ac() 1.893728 2.032478 2.323619 2.285914 2.325451 4.304177 100
then,
Map(rbind, lis, split(df, df$city)[names(lis)])
Unit: seconds
expr min lq mean median uq max neval
az() 2.29919 2.444761 2.749236 2.688349 2.887123 4.205997 100
and finally,
imap(lis, ~ .x %>%
    bind_rows(df %>%
        filter(city == .y)))
Unit: seconds
expr min lq mean median uq max neval
ax() 4.9921 5.072752 5.178707 5.121748 5.183845 6.069612 100
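Timings like those above could be approximated on synthetic data with a sketch such as the following. It uses only base R and system.time, not whichever benchmarking tool produced the figures above. The frame count is scaled down from 2239, and the names make_frame, big_lis, new_rows, round_trip and per_element are made up for illustration, so absolute numbers will differ.

```r
# Synthetic stand-in for the real data; all sizes here are assumptions.
set.seed(1)
n_frames <- 50   # the question mentions 2239; scaled down to run quickly
n_rows   <- 310
n_extra  <- 13   # city + year + 13 numeric columns = 15 columns

# Hypothetical helper: one data frame per city id
make_frame <- function(id) {
  extra <- as.data.frame(matrix(rnorm(n_rows * n_extra), nrow = n_rows))
  cbind(data.frame(city = id, year = seq_len(n_rows)), extra)
}

ids     <- sprintf("city%04d", seq_len(n_frames))
big_lis <- setNames(lapply(ids, make_frame), ids)

# One new row per city, with the same columns as the list elements
new_rows <- do.call(rbind, lapply(ids, function(id) {
  cbind(data.frame(city = id, year = n_rows + 1L),
        as.data.frame(matrix(rnorm(n_extra), nrow = 1)))
}))

# Approach 1: bind everything, append, split again (base R version)
round_trip <- function() {
  all_rows <- rbind(do.call(rbind, big_lis), new_rows)
  split(all_rows, all_rows$city)
}

# Approach 2: split the new rows once, then rbind element by element
per_element <- function() {
  Map(rbind, big_lis, split(new_rows, new_rows$city)[names(big_lis)])
}

print(system.time(res_a <- round_trip()))
print(system.time(res_b <- per_element()))
```

A single system.time call is noisy, so repeating each call several times and comparing medians gives a fairer picture.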
We can use imap to loop over the list and filter df by the name of each list element, appending the matching rows to that element
library(dplyr)
library(purrr)
lis2 <- imap(lis, ~ .x %>%
    bind_rows(df %>%
        filter(city == .y)))
-output
> lis2
$chicago
city year population
1 chicago 2018 100
2 chicago 2019 105
3 chicago 2020 110
4 chicago 2021 115
$paris
city year population
1 paris 2018 200
2 paris 2019 205
3 paris 2020 210
4 paris 2021 215
$berlin
city year population
1 berlin 2018 300
2 berlin 2019 305
3 berlin 2020 310
4 berlin 2021 315
$bangalore
city year population
1 bangalore 2018 400
2 bangalore 2019 405
3 bangalore 2020 410
4 bangalore 2021 415
Or use base R with Map and rbind
Map(function(x, nm) rbind(x, df[df$city == nm,]), lis, names(lis))
Or use rbindlist from data.table
library(data.table)
rbindlist(c(lis, list(df)))[, .(split(.SD, city))]$V1
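As a variation on the one-liner above, the bind-then-split idea can also be sketched with the by argument of data.table's split method. This is an alternative spelling, not the code that was benchmarked in the question, and it assumes the data.table package is installed. The cities and base vectors are just a compact way to rebuild the example data.

```r
library(data.table)

# Rebuild the example data from the question
cities <- c("chicago", "paris", "berlin", "bangalore")
base   <- c(100, 200, 300, 400)
lis <- setNames(lapply(seq_along(cities), function(i) {
  data.frame(city = cities[i],
             year = c(2018, 2019, 2020),
             population = base[i] + c(0, 5, 10))
}), cities)
df <- data.frame(city = cities, year = 2021, population = base + 15)

# Stack all list elements plus the new rows into one data.table
combined <- rbindlist(c(lis, list(df)), use.names = TRUE)

# Split back into one table per city; with the default sorted = FALSE
# the groups come out in order of first appearance
parts <- split(combined, by = "city")

# Optional: convert each piece back to a plain data.frame
parts_df <- lapply(parts, as.data.frame)
```

The pieces returned by split are data.tables, so the last step is only needed if downstream code expects plain data frames.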
Or, slightly more efficiently, split df first
Map(rbind, lis, split(df, df$city)[names(lis)])
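One difference between the approaches is worth checking on the example data: Map keeps the elements in the original order of lis, whereas a bind-and-split round trip returns them in the sorted order of the city values. The sketch below uses base R only, with do.call(rbind, ...) standing in for dplyr::bind_rows, and the names updated and round_trip are made up for illustration. It also resets row names, since rbind may carry over row labels from df.

```r
# Rebuild the example data from the question
cities <- c("chicago", "paris", "berlin", "bangalore")
base   <- c(100, 200, 300, 400)
lis <- setNames(lapply(seq_along(cities), function(i) {
  data.frame(city = cities[i],
             year = c(2018, 2019, 2020),
             population = base[i] + c(0, 5, 10))
}), cities)
df <- data.frame(city = cities, year = 2021, population = base + 15)

# Split the new rows once, align them to lis by name, then rbind pairwise
updated <- Map(rbind, lis, split(df, df$city)[names(lis)])

# Reset row names so each element is numbered 1..n
updated <- lapply(updated, function(x) {
  rownames(x) <- NULL
  x
})

# Base R version of the round trip from the question, for comparison
all_rows   <- rbind(do.call(rbind, lis), df)
round_trip <- split(all_rows, all_rows$city)

names(updated)     # original order: chicago, paris, berlin, bangalore
names(round_trip)  # sorted order:   bangalore, berlin, chicago, paris
```

If a city in lis has no row in df, the corresponding second argument is NULL and rbind returns that element unchanged.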