I have the following dataset, in which each value in the "value" column is valid from the start date through the end date:
library(data.table)
test <- data.table(company = c("A", "A", "B", "B"), person = c("a", "b", "b", "c"), value = c(2,3,5,5), start_date = c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06"), end_date = c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
   company person value start_date   end_date
1:       A      a     2 2015-01-01 2015-01-06
2:       A      b     3 2015-01-04 2015-01-07
3:       B      b     5 2015-01-02 2015-01-07
4:       B      c     5 2015-01-06 2015-01-07
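From this data I want to calculate three things:
- the mean value per company per date
- the number of companies per date
- the number of people per company per date
I tried the approach below. It works like a charm on my test sample, but it fails miserably on the real dataset because it needs far too much computing power. I know this is caused by building a dataset with a separate row for every company, person and date, but I don't know how else to solve this in R.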
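To make the size of the problem concrete, here is a minimal sketch that counts how many rows the full expansion would produce without actually building it. The names `days_per_row` and `expanded_rows` are only illustrative, and the sample data is the same as above with the dates already converted to `Date`.

```r
library(data.table)

# Same sample data as in the question, with the dates stored as Date
test <- data.table(
  company    = c("A", "A", "B", "B"),
  person     = c("a", "b", "b", "c"),
  value      = c(2, 3, 5, 5),
  start_date = as.Date(c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06")),
  end_date   = as.Date(c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
)

# Days covered by each row, counting both endpoints
days_per_row <- as.integer(test$end_date - test$start_date) + 1L

# Total number of rows the one-row-per-day table would contain
expanded_rows <- sum(days_per_row)
expanded_rows
```

On the real data the same sum gives a rough idea of whether the expanded table can fit in memory at all.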
Attempted code:
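# Convert the character columns to Date
test$start_date = as.Date(as.character(test$start_date), format = "%Y-%m-%d")
test$end_date = as.Date(as.character(test$end_date), format = "%Y-%m-%d")
# Expand each row into one row per day of its validity window
# (grouping by 1:nrow(test) yields a group column named "nrow")
indxtest = test[, .(Date = seq(from = start_date, to = end_date, by = "day")), by = 1:nrow(test)]
# Add the same row index to the original table and join the other columns back
test[, nrow := 1:nrow(test)]
test = merge(indxtest, test, by = "nrow", all.x = TRUE)
# Sort by Date (the original setDT(test, "company", "Date") call only set this key)
setkey(test, Date)
# Mean value and head count per company and date, number of companies per date
test[, mean_value := mean(value, na.rm = TRUE), by = c("company", "Date")]
test[, Number_people := .N, by = c("company", "Date")]
test[, number_companies := uniqueN(company), by = "Date"]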
My current result looks like this:
    nrow       Date company person value start_date   end_date mean_value Number_people number_companies
 1:    1 2015-01-01       A      a     2 2015-01-01 2015-01-06        2.0             1                1
 2:    1 2015-01-02       A      a     2 2015-01-01 2015-01-06        2.0             1                2
 3:    3 2015-01-02       B      b     5 2015-01-02 2015-01-07        5.0             1                2
 4:    1 2015-01-03       A      a     2 2015-01-01 2015-01-06        2.0             1                2
 5:    3 2015-01-03       B      b     5 2015-01-02 2015-01-07        5.0             1                2
 6:    1 2015-01-04       A      a     2 2015-01-01 2015-01-06        2.5             2                2
 7:    2 2015-01-04       A      b     3 2015-01-04 2015-01-07        2.5             2                2
 8:    3 2015-01-04       B      b     5 2015-01-02 2015-01-07        5.0             1                2
 9:    1 2015-01-05       A      a     2 2015-01-01 2015-01-06        2.5             2                2
10:    2 2015-01-05       A      b     3 2015-01-04 2015-01-07        2.5             2                2
11:    3 2015-01-05       B      b     5 2015-01-02 2015-01-07        5.0             1                2
12:    1 2015-01-06       A      a     2 2015-01-01 2015-01-06        2.5             2                2
13:    2 2015-01-06       A      b     3 2015-01-04 2015-01-07        2.5             2                2
14:    3 2015-01-06       B      b     5 2015-01-02 2015-01-07        5.0             2                2
15:    4 2015-01-06       B      c     5 2015-01-06 2015-01-07        5.0             2                2
16:    2 2015-01-07       A      b     3 2015-01-04 2015-01-07        3.0             1                2
17:    3 2015-01-07       B      b     5 2015-01-02 2015-01-07        5.0             2                2
18:    4 2015-01-07       B      c     5 2015-01-06 2015-01-07        5.0             2                2
Apart from the solution I came up with myself, I could not find anything relevant on this site. If there is a reference, that would be a great help.
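You really must avoid this kind of join, because it blows up with larger data. You could try whether this loop is fast enough (the number of dates is probably not very large; I would expect no more than three or four thousand).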
library(data.table)
DT <- data.table(company = c("A", "A", "B", "B"),
                 person = c("a", "b", "b", "c"),
                 value = c(2, 3, 5, 5),
                 start_date = c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06"),
                 end_date = c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
# Convert both date columns from character to Date
DT[, c("start_date", "end_date") := lapply(.(start_date, end_date), as.Date)]
# Every calendar day covered by the data
dates <- DT[, seq(from = min(start_date), to = max(end_date), by = "day")]
# For each day, aggregate per company and per "is this row valid on that day" flag,
# then keep only the groups where the flag (column x) is TRUE
res <- lapply(dates, function(x) {
  d <- x
  DT[, .(date = d, mean_EPS = mean(value, na.rm = TRUE), .N),
     by = .(company, x >= start_date & x <= end_date)][x == TRUE]
})
# Stack the per-day results into one table
res <- rbindlist(res)
#    company    x       date mean_EPS N
# 1:       A TRUE 2015-01-01      2.0 1
# 2:       A TRUE 2015-01-02      2.0 1
# 3:       B TRUE 2015-01-02      5.0 1
# 4:       A TRUE 2015-01-03      2.0 1
# 5:       B TRUE 2015-01-03      5.0 1
# 6:       A TRUE 2015-01-04      2.5 2
# 7:       B TRUE 2015-01-04      5.0 1
# 8:       A TRUE 2015-01-05      2.5 2
# 9:       B TRUE 2015-01-05      5.0 1
#10:       A TRUE 2015-01-06      2.5 2
#11:       B TRUE 2015-01-06      5.0 2
#12:       A TRUE 2015-01-07      3.0 1
#13:       B TRUE 2015-01-07      5.0 2
res[, .N, by = date]
#         date N
#1: 2015-01-01 1
#2: 2015-01-02 2
#3: 2015-01-03 2
#4: 2015-01-04 2
#5: 2015-01-05 2
#6: 2015-01-06 2
#7: 2015-01-07 2
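If the loop turns out to be too slow, a non-equi join aggregated with `by = .EACHI` might be worth trying, since it computes the summaries per company and day without first materialising one row per person per day. This is only a sketch, not something benchmarked on large data. It assumes data.table 1.12.0 or later (for `nomatch = NULL`), and the names `all_days`, `grid`, `res2`, `n_people` and `n_companies` are made up for illustration.

```r
library(data.table)

DT <- data.table(
  company    = c("A", "A", "B", "B"),
  person     = c("a", "b", "b", "c"),
  value      = c(2, 3, 5, 5),
  start_date = as.Date(c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06")),
  end_date   = as.Date(c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
)

# Every calendar day covered by the data
all_days <- seq(min(DT$start_date), max(DT$end_date), by = "day")

# All company/day combinations (one row per company per day, not per person)
grid <- CJ(company = unique(DT$company), date = all_days)

# Non-equi join: for each company/day, match the rows whose interval contains
# that day and aggregate them; nomatch = NULL drops company/days with no match
res2 <- DT[grid,
           .(mean_value = mean(value, na.rm = TRUE), n_people = .N),
           on = .(company, start_date <= date, end_date >= date),
           by = .EACHI, nomatch = NULL]

# After a non-equi join both interval columns hold the joined day,
# so keep one of them and call it `date`
res2[, end_date := NULL]
setnames(res2, "start_date", "date")

# Number of companies with at least one valid row on each day
res2[, n_companies := .N, by = date]
res2
```

The result has one row per company and date, so it can be merged back onto other tables by `company` and `date` if the per-person rows are needed.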
Here is a tidyverse solution:
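library(tidyverse)
# Assumes `df` holds the data from the question; make sure the dates are Date first
df <- df %>%
  as_tibble() %>%
  mutate(across(c(start_date, end_date), as.Date)) %>%
  # One list element per row holding every day in its validity window
  transmute(Date = map2(start_date, end_date, seq, by = "day"), company, person, value) %>%
  unnest(Date)
# Mean value and head count per company and date, joined back onto the expanded rows
df1 <- df %>%
  group_by(Date, company) %>%
  summarize(mean_value = mean(value), Number_people = n_distinct(person)) %>%
  right_join(df, by = c("company", "Date"))
# Number of distinct companies per date
df2 <- df %>%
  group_by(Date) %>%
  summarize(companies = n_distinct(company)) %>%
  right_join(df1, by = "Date") %>%
  arrange(Date)
df2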
   Date       companies company mean_value Number_people person value
   <date>         <int> <chr>        <dbl>         <int> <chr>  <dbl>
 1 2015-01-01         1 A              2               1 a          2
 2 2015-01-02         2 A              2               1 a          2
 3 2015-01-02         2 B              5               1 b          5
 4 2015-01-03         2 A              2               1 a          2
 5 2015-01-03         2 B              5               1 b          5
 6 2015-01-04         2 A              2.5             2 a          2
 7 2015-01-04         2 A              2.5             2 b          3
 8 2015-01-04         2 B              5               1 b          5
 9 2015-01-05         2 A              2.5             2 a          2
10 2015-01-05         2 A              2.5             2 b          3
11 2015-01-05         2 B              5               1 b          5
12 2015-01-06         2 A              2.5             2 a          2
13 2015-01-06         2 A              2.5             2 b          3
14 2015-01-06         2 B              5               2 b          5
15 2015-01-06         2 B              5               2 c          5
16 2015-01-07         2 A              3               1 b          3
17 2015-01-07         2 B              5               2 b          5
18 2015-01-07         2 B              5               2 c          5