这个问题在这个问题的基础上稍作修改。
我有一个长表格式的数据帧,如下所示:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name value
1 a broad
1 c 50
1 a mangrove
1 c 50
1 a mangrove
1 c 50
2 a coniferous
2 c 50
关于数据:第二行的值50对应于第一行的值wide。类似地,第四行50的值对应于第三行红树林,依此类推。简单地说,名称c与名称a相关。
我想以这样一种方式组合值,即我可以为每个名称获得相应的值,这也将聚合具有类似名称的值:
df2 <- data.frame(ID=c(1,1,2),
name=c("c_broad","c_mangrove","c_coniferous"),
value=c(50,100,50))
应该是这样的:
ID name value
1 c_broad 50
1 c_mangrove 100
2 c_coniferous 50
使用整形2:
library(reshape2)
df1$grp = cumsum(df1$name == "a")
df2 = dcast(df1, ID + grp ~ name)
df2$c = as.numeric(df2$c)
aggregate(c ~ ID + a, df2, sum)
ID a c
1 1 broad 50
2 2 coniferous 50
3 1 mangrove 100
列名可以根据需要改变;c";可以通过粘贴添加到名称中。
使用tidyverse:
value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
df1 %>%
dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
dplyr::mutate(a = value_a,
name = stringr::str_c(name, "_" ,a)) %>%
dplyr::select(-a) %>% # to here
dplyr::group_by(ID, name) %>%
dplyr::summarise(value=sum(as.numeric(value)))
# A tibble: 3 x 3
# Groups: ID [2]
ID name value
<dbl> <chr> <dbl>
1 1 c_broad 50
2 1 c_mangrove 100
3 2 c_coniferous 50
您在数据帧中发现的一个主要问题是,单个列包含名称和值,这是您应该解决的第一件事。我的建议是始终将原始数据帧修改为整洁的格式(https://tidyr.tidyverse.org/articles/tidy-data.html)并从那里利用所有的tidyverse力量,或data.table或您选择的框架。
请注意,时间变量value_a可以直接包含在管道中,为了清楚起见,我没有这样做。其主要思想是在不同的列中分离值和种类,即管道中的前三个调用,然后应用通常的tidyverse操作。
可能不是最优雅的,但它很有效:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
)
df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
summarize(
ID = ID[1],
name = paste0( name[2], "_", value[1] ),
value = as.numeric(value[2])
) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
mutate( value = sum(value) ) %>%
unique
这里有一些改进的东西,实际上是人类可读的:
i <- seq( 1, nrow(df1), 2 )
df1 %>% summarise(
ID = ID[i],
name = paste0( name[i+1], "_", value[i] ),
value = as.numeric(value[i+1])
) %>% group_by(name) %>%
summarize(
ID=ID[1], value = sum( value )
) %>% arrange(ID)
基本R解决方案:
# Nullify numeric values belonging to a grouping category: grps => character vector
grps <- gsub("\d+", NA, df1$value)
# Interpolate NA values using prior string value: a => character vector
df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
# Split-Apply-Combine aggregation: data.frame => stdout(console)
data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
y <- transform(subset(x, !grepl("\D+", value)), value = as.numeric(value))
setNames(
aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
c("ID", "a", "c")
)
}
)
),
row.names = NULL
)
附加选项
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
library(tidyverse)
df1 %>%
pivot_wider(ID, names_from = name, values_from = value) %>%
unnest(c("a", "c")) %>%
group_by(ID, name = a) %>%
summarise(value = sum(as.numeric(c), na.rm = T), .groups = "drop")
#> # A tibble: 3 x 3
#> ID name value
#> <dbl> <chr> <dbl>
#> 1 1 broad 50
#> 2 1 mangrove 100
#> 3 2 coniferous 50
创建于2021-04-12由reprex包(v2.0.0(