R:根据另一数据帧中的条件对数据帧进行条件行和



我有一个数据框架dfcompanies,其中约有6000家公司上市,其识别码为indentifier,其行业industry和其得分(NA, 0, or 1)分为132个不同类别,例如category1-4:

> dfcompanies
identifier  industry            category1   category2   category3   category4
1   MBWS.PA     Alcoholic Beverages 1           0           1           1
2   KAER.VI     Commercial Banks    0           1           0           1
3   TIRO.VI     Commercial Banks    1           NA          1           0
4   EVNV.VI     Power Generation    0           1           1           1
5   MANV.VI     Processed Foods     NA          0           NA          NA
6   LENV.VI     Chemical            1           NA          1           1

我还有第二个数据帧,dfmatcategories,它定义了industry中存在的所有行业;列industrymat;它们的材料类别在categorymat中,并在第三列中描述type(strength, or concern)类别:

> dfmatcategories
industrymat         categorymat type
1   Alcoholic Beverages category1   concern
2   Alcoholic Beverages category2   strength
3   Alcoholic Beverages category3   strength
4   Insurance           category100 strength
5   Insurance           category99  strength
6   Chemical            category1   concern

理想情况下,我希望得到dfcompanies中的每个公司在categorymat中定义的材料类别中获得的积分总和,这些类别将typestrength中的sum_mat添加到dfcompanies中。

> dfcompanies
identifier  industry            category1   category2   category3   category4   sum_mat
1   MBWS.PA     Alcoholic Beverages 1           0           1           1           1
2   KAER.VI     Commercial Banks    0           1           0           1           5
3   TIRO.VI     Commercial Banks    1           NA          1           0           3
4   EVNV.VI     Power Generation    0           1           1           1           4
5   MANV.VI     Processed Foods     NA          0           NA          NA          1
6   LENV.VI     Chemical            1           NA          1           1           0

例子:

dfcompanies中,拥有indentifierMBWS.PA的公司属于industryAlcoholic Beverages,category1中的1,category2中的0,category3中的1,category4中的1

IndfmatcategoriesunderindustrymatforAlcoholic Beveragescategorymat列中的1-3行只列出category1category2category3作为材料类别。

只有category2category3属于typestrength

因此MBWS.PAsum_mat => 1=0(scorecategory2)+1(scorecategory3)。

不幸的是,我是R的新手,完全不知道如何处理这个问题。

有人能帮帮我吗?

致以最亲切的问候托马斯。

下面的基于tidyverse的解决方案通过旋转数据来实现这一功能。

library(dplyr)
library(tidyr)
dfcompanies %>%
full_join(dfmatcategories, by = c('industry' = 'industrymat')) %>%
filter(!identifier %in% NA) %>%
pivot_longer(category1:category4) %>%
mutate(sum_mat = if_else(categorymat == name & 
type =='strength' &
value == 1, 1, 0)) %>%
pivot_wider(names_from = name, values_from = value) %>%
group_by(identifier) %>%
mutate(sum_mat = sum(sum_mat)) %>%
slice(1) %>%
ungroup() %>%
select(colnames(dfcompanies), sum_mat)
# # A tibble: 6 x 7
#   identifier industry            category1 category2 category3 category4 sum_mat
#   <chr>      <chr>                   <dbl>     <dbl>     <dbl>     <dbl>   <dbl>
# 1 EVNV.VI    Power Generation            0        NA        NA        NA      NA
# 2 KAER.VI    Commercial Banks            0        NA         0        NA      NA
# 3 LENV.VI    Chemical                    1        NA         1         1       0
# 4 MANV.VI    Processed Foods            NA        NA        NA        NA      NA
# 5 MBWS.PA    Alcoholic Beverages         1         1         1         1       2
# 6 TIRO.VI    Commercial Banks            1        NA         1        NA      NA

dfcompanies <- structure(list(identifier = c("MBWS.PA", "KAER.VI", "TIRO.VI", 
"EVNV.VI", "MANV.VI", "LENV.VI"), industry = c("Alcoholic Beverages", 
"Commercial Banks", "Commercial Banks", "Power Generation", "Processed Foods", 
"Chemical"), category1 = c(1, 0, 1, 0, NA, 1), category2 = c(1, 
1, NA, 1, 0, NA), category3 = c(1, 0, 1, 1, NA, 1), category4 = c(1, 
1, 0, 1, NA, 1)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))
dfmatcategories <- structure(list(industrymat = c("Alcoholic Beverages", "Alcoholic Beverages", 
"Alcoholic Beverages", "Insurance", "Insurance", "Chemical"), 
categorymat = c("category1", "category2", "category3", "category100", 
"category99", "category1"), type = c("concern", "strength", 
"strength", "strength", "strength", "concern")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

最新更新