R - Finding the average number of project members within and across states



(Other questions and answers elsewhere on this forum do not seem to cover the cross-state issue raised in this post.)

Suppose I have the following data:

df <- data.frame(id=c("Eric", "John", "Sarah", "Simon", "Abdul", "Charlotte", "Alex", "Susan"),
                 state=c("CA", "AK", "NY", "NY", "NJ", "GA", "CA", "CA"),
                 project=c(1, 2, 2, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
> df
         id state project
1      Eric    CA       1
2      John    AK       2
3     Sarah    NY       2
4     Simon    NY       2
5     Abdul    NJ       3
6 Charlotte    GA       4
7      Alex    CA       5
8     Susan    CA       5

I want to get the average number of project members for each state, counting members from other states as well.

To get the average for in-state members only, I did the following:

dfy <- data.frame()
for (j in unique(df$state)) {
  # Rows for the members who live in state j
  h <- subset(df, state == j)
  # Number of in-state members on each of those projects
  counts <- plyr::count(h, 'project')
  average_members <- mean(counts$freq)
  # Append this state's average to the result
  dfy <- rbind(dfy, data.frame(state = j, average_members = average_members))
}
> dfy
  state average_members
1    CA             1.5
2    AK             1.0
3    NY             2.0
4    NJ             1.0
5    GA             1.0

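In my desired output, AK and NY should both get 3, because each ID in those states works with two other IDs on the same project (even though they live in different states).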

> desired
  state average_members
1    CA             1.5
2    AK             3.0
3    NY             3.0
4    NJ             1.0
5    GA             1.0

Does anyone know how to code this?

library(data.table)
setDT(df)
df[, .(num_proj = .N), by = .(state, project)][, .(average_members = mean(num_proj)), by = state]

Result:

   state average_members
1:    CA             1.5
2:    AK             1.0
3:    NY             2.0
4:    NJ             1.0
5:    GA             1.0

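For the second case, take state out of the grouping in the first step, so that members are counted per project across all states.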

unique(df[, .(state, num_proj = .N), by = project])[, .(average_members = mean(num_proj)), by = state]
   state average_members
1:    CA             1.5
2:    AK             3.0
3:    NY             3.0
4:    NJ             1.0
5:    GA             1.0
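
The one-liner above is easier to follow when split into its three steps. The following is only a sketch of that reasoning: the names `dt`, `sizes`, `pairs` and `result` are made up for illustration, and the data is rebuilt as a fresh data.table so the block runs on its own.

```r
library(data.table)

# Same data as in the question, rebuilt here so the example is self-contained
dt <- data.table(
  id      = c("Eric", "John", "Sarah", "Simon", "Abdul", "Charlotte", "Alex", "Susan"),
  state   = c("CA", "AK", "NY", "NY", "NJ", "GA", "CA", "CA"),
  project = c(1, 2, 2, 2, 3, 4, 5, 5)
)

# Step 1: group by project only, so .N is the project size across all states;
# it is repeated on every member row together with that member's state
sizes <- dt[, .(state, num_proj = .N), by = project]

# Step 2: keep one row per (project, state) pair, so a state with several
# members on the same project counts that project only once
pairs <- unique(sizes)

# Step 3: average the project sizes within each state
result <- pairs[, .(average_members = mean(num_proj)), by = state]
print(result)
```

Without step 2, NY would still average to 3, but a state with two members on one large project and one member on a small project would weight the large project twice.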

You can do this with the dplyr library. The within-state question can be answered with:

library(dplyr)
df %>% count(state, project) %>% 
group_by(state) %>% summarize(avg=mean(n))
#   state       avg
# 1    AK       1.0
# 2    CA       1.5
# 3    GA       1.0
# 4    NJ       1.0
# 5    NY       2.0

The cross-state result can be obtained with:

df %>% distinct(project, state) %>% 
  inner_join(df %>% count(project), by = "project") %>% 
  group_by(state) %>% summarize(avg=mean(n))
#   state       avg
# 1    AK       3.0
# 2    CA       1.5
# 3    GA       1.0
# 4    NJ       1.0
# 5    NY       3.0
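
A base R loop, adapted from the one in the question, also gives the cross-state result:

df <- data.frame(id=c("Eric", "John", "Sarah", "Simon", "Abdul", "Charlotte", "Alex", "Susan"),
                 state=c("CA", "AK", "NY", "NY", "NJ", "GA", "CA", "CA"),
                 project=c(1, 2, 2, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)

dfy <- data.frame()
for (j in unique(df$state)) {
  # Rows for the members who live in state j
  h <- subset(df, state == j)
  # Projects that have at least one member in state j
  thisStatesProjects <- unique(h$project)
  # Every member of those projects, whichever state they live in
  h2 <- subset(df, project %in% thisStatesProjects)
  # Total members divided by number of projects = mean project size
  average_members <- nrow(h2) / length(thisStatesProjects)
  dfy <- rbind(dfy, data.frame(state = j, average_members = average_members))
}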
