R - Efficiently filling in values between dates



I have the following dataset, where each value in the "value" column is valid from the start date through the end date:

library(data.table)
test <- data.table(company = c("A", "A", "B", "B"), person = c("a", "b", "b", "c"), value = c(2,3,5,5), start_date = c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06"), end_date = c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
   company person value start_date   end_date
1:       A      a     2 2015-01-01 2015-01-06
2:       A      b     3 2015-01-04 2015-01-07
3:       B      b     5 2015-01-02 2015-01-07
4:       B      c     5 2015-01-06 2015-01-07

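From this data I want to compute three things:

  1. The mean value per company per date
  2. The number of companies per date
  3. The number of people per company per date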

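I tried the approach below. It works like a charm on my test sample, but it fails miserably on the real dataset because it needs a huge amount of computing power. I know this is caused by building a dataset with a separate row for every company, person and date, but I don't know how to solve this with some kind of function in R.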
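To see why the expanded table gets large: it holds one row per person per day of validity, so its row count is the sum of the interval lengths. A minimal sketch on the sample intervals (the name `n_rows` is only illustrative):

```r
# Sample intervals from the question, as Date vectors
start_date <- as.Date(c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06"))
end_date   <- as.Date(c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))

# Each input row expands into (end - start + 1) daily rows
n_rows <- sum(as.integer(end_date - start_date) + 1L)
n_rows
```

For the sample this gives 18, the number of rows in the expanded result. On real data the count grows with the total number of person-days, which is what makes the expansion expensive.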

The code I tried:

# convert the date columns from character to Date
test$start_date = as.Date(as.character(test$start_date), format = "%Y-%m-%d")
test$end_date = as.Date(as.character(test$end_date), format = "%Y-%m-%d")
# indexing per row: expand every row into one row per day of its interval
indxtest = test[, .(Date = seq(from = min(start_date), to = max(end_date), by = "day")), by = 1:nrow(test)]
test = test[, nrow := 1:nrow(test)]
test = merge(indxtest, test, by = "nrow", all.x = TRUE)
# sort by date and company (setkey, not setDT, sets the key)
setkey(test, Date, company)
# the three requested quantities, added as columns
test = test[, mean_value := mean(value, na.rm = TRUE), by = c("company", "Date")]
test = test[, Number_people := .N, by = c("company", "Date")]
test = test[, number_companies := uniqueN(company), by = "Date"]

My current result looks like this:

nrow       Date company person value start_date   end_date mean_value Number_people number_companies
1:    1 2015-01-01       A      a     2 2015-01-01 2015-01-06      2.0             1                1
2:    1 2015-01-02       A      a     2 2015-01-01 2015-01-06      2.0             1                2
3:    3 2015-01-02       B      b     5 2015-01-02 2015-01-07      5.0             1                2
4:    1 2015-01-03       A      a     2 2015-01-01 2015-01-06      2.0             1                2
5:    3 2015-01-03       B      b     5 2015-01-02 2015-01-07      5.0             1                2
6:    1 2015-01-04       A      a     2 2015-01-01 2015-01-06      2.5             2                2
7:    2 2015-01-04       A      b     3 2015-01-04 2015-01-07      2.5             2                2
8:    3 2015-01-04       B      b     5 2015-01-02 2015-01-07      5.0             1                2
9:    1 2015-01-05       A      a     2 2015-01-01 2015-01-06      2.5             2                2
10:    2 2015-01-05       A      b     3 2015-01-04 2015-01-07      2.5             2                2
11:    3 2015-01-05       B      b     5 2015-01-02 2015-01-07      5.0             1                2
12:    1 2015-01-06       A      a     2 2015-01-01 2015-01-06      2.5             2                2
13:    2 2015-01-06       A      b     3 2015-01-04 2015-01-07      2.5             2                2
14:    3 2015-01-06       B      b     5 2015-01-02 2015-01-07      5.0             2                2
15:    4 2015-01-06       B      c     5 2015-01-06 2015-01-07      5.0             2                2
16:    2 2015-01-07       A      b     3 2015-01-04 2015-01-07      3.0             1                2
17:    3 2015-01-07       B      b     5 2015-01-02 2015-01-07      5.0             2                2
18:    4 2015-01-07       B      c     5 2015-01-06 2015-01-07      5.0             2                2

Apart from the solution I came up with myself, I couldn't find anything relevant here. If there is a reference, though, it would be a great help.

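You really have to avoid that join, because it blows up with larger data. You could try whether this loop is fast enough (the number of dates is probably not very large; I would expect no more than three or four thousand).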

library(data.table)
DT <- data.table(company = c("A", "A", "B", "B"),
                 person = c("a", "b", "b", "c"),
                 value = c(2,3,5,5),
                 start_date = c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06"),
                 end_date = c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
# convert both date columns from character to Date
DT[, c("start_date", "end_date") := lapply(.(start_date, end_date), as.Date)]
# every calendar day covered by the data
dates <- DT[, seq(from = min(start_date), to = max(end_date), by = "day")]
# for each day, aggregate per company over the rows whose interval contains that day
res <- lapply(dates, function(x) {
  d <- x
  # the unnamed grouping expression becomes a logical column named "x"; keep only the TRUE groups
  DT[, .(date = d,  mean_EPS = mean(value, na.rm = TRUE), .N), by = .(company, x >= start_date & x <= end_date)][x == TRUE]
})
res <- rbindlist(res)
#    company    x       date mean_EPS N
# 1:       A TRUE 2015-01-01      2.0 1
# 2:       A TRUE 2015-01-02      2.0 1
# 3:       B TRUE 2015-01-02      5.0 1
# 4:       A TRUE 2015-01-03      2.0 1
# 5:       B TRUE 2015-01-03      5.0 1
# 6:       A TRUE 2015-01-04      2.5 2
# 7:       B TRUE 2015-01-04      5.0 1
# 8:       A TRUE 2015-01-05      2.5 2
# 9:       B TRUE 2015-01-05      5.0 1
#10:       A TRUE 2015-01-06      2.5 2
#11:       B TRUE 2015-01-06      5.0 2
#12:       A TRUE 2015-01-07      3.0 1
#13:       B TRUE 2015-01-07      5.0 2
# number of companies per date
res[, .N, by = date]
#         date N
#1: 2015-01-01 1
#2: 2015-01-02 2
#3: 2015-01-03 2
#4: 2015-01-04 2
#5: 2015-01-05 2
#6: 2015-01-06 2
#7: 2015-01-07 2
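If the loop over dates is still too slow, one alternative is a running-total ("sweep") approach. This is a different technique from the loop above, sketched here under assumptions: each row adds its value on `start_date` and removes it the day after `end_date`, and cumulative sums per company then give the head count and value total on any day. The names `events`, `grid`, `d_n`, `d_sum`, `n_people`, `total` and `n_companies` are illustrative, and the sketch assumes a person has at most one active row per company and date.

```r
library(data.table)

DT <- data.table(
  company = c("A", "A", "B", "B"),
  person = c("a", "b", "b", "c"),
  value = c(2, 3, 5, 5),
  start_date = as.Date(c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06")),
  end_date = as.Date(c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
)

# One "enter" event on the start date and one "leave" event the day after the end date
events <- rbind(
  DT[, .(company, date = start_date, d_n = 1L, d_sum = value)],
  DT[, .(company, date = end_date + 1L, d_n = -1L, d_sum = -value)]
)

# Net change per company and date, sorted by company and date
events <- events[, .(d_n = sum(d_n), d_sum = sum(d_sum)), keyby = .(company, date)]

# Running head count and running value total within each company
events[, `:=`(n_people = cumsum(d_n), total = cumsum(d_sum)), by = company]

# All company/date combinations; this grid has companies x dates rows,
# not persons x dates rows
dates <- DT[, seq(min(start_date), max(end_date), by = "day")]
grid <- CJ(company = unique(DT$company), date = dates)

# Rolling join: carry the most recent running totals forward to every grid date
res <- events[grid, on = .(company, date), roll = TRUE]

# Drop dates on which a company has nobody active, then compute the mean
res <- res[!is.na(n_people) & n_people > 0,
           .(company, date, mean_value = total / n_people, n_people)]

# Number of companies with at least one active person per date
res[, n_companies := .N, by = date]
res
```

On the sample data this yields 13 company/date rows, the same combinations the loop produces. Because the mean is derived from running sums, it can pick up small floating-point differences on real-valued data, so rounding or a tolerance may be needed when comparing results.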

Here is a tidyverse solution:

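library(tidyverse)
# df is assumed to hold the question's data, with start_date and end_date already of class Date
# expand each row into one row per day of its interval
df = df %>% as_tibble() %>%
  transmute(Date = map2(start_date, end_date, seq, by = "day"), company, person, value) %>%
  unnest(Date)
# mean value and number of people per company and date, joined back onto the expanded rows
df1 = df %>% group_by(Date, company) %>%
  summarize(mean_value = mean(value), Number_people = n_distinct(person)) %>%
  right_join(df, by = c("company", "Date"))
# number of companies per date
df2 = df %>%
  group_by(Date) %>%
  summarize(companies = n_distinct(company)) %>%
  right_join(df1, by = "Date") %>%
  arrange(Date)
df2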
Date       companies company mean_value Number_people person value
<date>         <int> <chr>        <dbl>         <int> <chr>  <dbl>
1 2015-01-01         1 A              2               1 a          2
2 2015-01-02         2 A              2               1 a          2
3 2015-01-02         2 B              5               1 b          5
4 2015-01-03         2 A              2               1 a          2
5 2015-01-03         2 B              5               1 b          5
6 2015-01-04         2 A              2.5             2 a          2
7 2015-01-04         2 A              2.5             2 b          3
8 2015-01-04         2 B              5               1 b          5
9 2015-01-05         2 A              2.5             2 a          2
10 2015-01-05         2 A              2.5             2 b          3
11 2015-01-05         2 B              5               1 b          5
12 2015-01-06         2 A              2.5             2 a          2
13 2015-01-06         2 A              2.5             2 b          3
14 2015-01-06         2 B              5               2 b          5
15 2015-01-06         2 B              5               2 c          5
16 2015-01-07         2 A              3               1 b          3
17 2015-01-07         2 B              5               2 b          5
18 2015-01-07         2 B              5               2 c          5
