我有一个数据框架dfcompanies
,其中约有6000家公司上市,其识别码为indentifier
,其行业industry
和其得分(NA, 0, or 1)
分为132个不同类别,例如category1-4
:
> dfcompanies
identifier industry category1 category2 category3 category4
1 MBWS.PA Alcoholic Beverages 1 0 1 1
2 KAER.VI Commercial Banks 0 1 0 1
3 TIRO.VI Commercial Banks 1 NA 1 0
4 EVNV.VI Power Generation 0 1 1 1
5 MANV.VI Processed Foods NA 0 NA NA
6 LENV.VI Chemical 1 NA 1 1
我还有第二个数据帧,dfmatcategories
,它定义了industry
中存在的所有行业;列industrymat
;它们的材料类别在categorymat
中,并在第三列中描述type
(strength, or concern)
类别:
> dfmatcategories
industrymat categorymat type
1 Alcoholic Beverages category1 concern
2 Alcoholic Beverages category2 strength
3 Alcoholic Beverages category3 strength
4 Insurance category100 strength
5 Insurance category99 strength
6 Chemical category1 concern
理想情况下,我希望得到dfcompanies
中的每个公司在categorymat
中定义的材料类别中获得的积分总和,这些类别将type
strength
中的sum_mat
添加到dfcompanies
中。
> dfcompanies
identifier industry category1 category2 category3 category4 sum_mat
1 MBWS.PA Alcoholic Beverages 1 0 1 1 1
2 KAER.VI Commercial Banks 0 1 0 1 5
3 TIRO.VI Commercial Banks 1 NA 1 0 3
4 EVNV.VI Power Generation 0 1 1 1 4
5 MANV.VI Processed Foods NA 0 NA NA 1
6 LENV.VI Chemical 1 NA 1 1 0
例子:
在dfcompanies
中,拥有indentifier
MBWS.PA
的公司属于industry
Alcoholic Beverages
,category1
中的1
,category2
中的0
,category3
中的1
,category4
中的1
。
Indfmatcategories
underindustrymat
forAlcoholic Beverages
categorymat
列中的1-3
行只列出category1
、category2
和category3
作为材料类别。
只有category2
和category3
属于type
strength
。
因此MBWS.PA
sum_mat => 1
=
0
(scorecategory2
)+
1
(scorecategory3
)。
不幸的是,我是R的新手,完全不知道如何处理这个问题。
有人能帮帮我吗?
致以最亲切的问候托马斯。
下面的基于tidyverse的解决方案通过旋转数据来实现这一功能。
library(dplyr)
library(tidyr)
dfcompanies %>%
full_join(dfmatcategories, by = c('industry' = 'industrymat')) %>%
filter(!identifier %in% NA) %>%
pivot_longer(category1:category4) %>%
mutate(sum_mat = if_else(categorymat == name &
type =='strength' &
value == 1, 1, 0)) %>%
pivot_wider(names_from = name, values_from = value) %>%
group_by(identifier) %>%
mutate(sum_mat = sum(sum_mat)) %>%
slice(1) %>%
ungroup() %>%
select(colnames(dfcompanies), sum_mat)
# # A tibble: 6 x 7
# identifier industry category1 category2 category3 category4 sum_mat
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EVNV.VI Power Generation 0 NA NA NA NA
# 2 KAER.VI Commercial Banks 0 NA 0 NA NA
# 3 LENV.VI Chemical 1 NA 1 1 0
# 4 MANV.VI Processed Foods NA NA NA NA NA
# 5 MBWS.PA Alcoholic Beverages 1 1 1 1 2
# 6 TIRO.VI Commercial Banks 1 NA 1 NA NA
dfcompanies <- structure(list(identifier = c("MBWS.PA", "KAER.VI", "TIRO.VI",
"EVNV.VI", "MANV.VI", "LENV.VI"), industry = c("Alcoholic Beverages",
"Commercial Banks", "Commercial Banks", "Power Generation", "Processed Foods",
"Chemical"), category1 = c(1, 0, 1, 0, NA, 1), category2 = c(1,
1, NA, 1, 0, NA), category3 = c(1, 0, 1, 1, NA, 1), category4 = c(1,
1, 0, 1, NA, 1)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfmatcategories <- structure(list(industrymat = c("Alcoholic Beverages", "Alcoholic Beverages",
"Alcoholic Beverages", "Insurance", "Insurance", "Chemical"),
categorymat = c("category1", "category2", "category3", "category100",
"category99", "category1"), type = c("concern", "strength",
"strength", "strength", "strength", "concern")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))