R - Creating multiple variables by id from the same dataset in dplyr



I have a dataset like this:

df <- data.frame(
  year    = c("2000", "2000", "2000", "2002", "2000", "2002", "2007"),
  id      = c("X", "X", "X", "X", "Z", "Z", "Z"),
  product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon"),
  market  = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL"),
  value   = c(1, 2, 3, 4, 5, 6, 7)
)

I want to create the following variables by id:

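  1. years_PM = number of years for that product and market (up to and including year t-1)
  2. value_PM = total value for that product and market (up to and including year t-1)
  3. years_OPM = number of years for other products in other markets (up to and including year t-1)
  4. years_SP_OM = number of years for the same product in other markets (up to and including year t-1)
  5. history = takes the value 1 if the given id has a history (up to and including year t-1)
  6. year_id = number of years for the same id (up to and including year t-1)
  7. year_id_consecutive = number of years for the same id (up to and including year t-1). If the same id is not observed for more than 2 consecutive years, it is assigned 0, i.e. the count restarts from 0 as if it were a new observation.
  8. n_id_PM = number of ids for that product and market (in year t-1), other than the id being observed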

So the new dataset would look like:

df_new <- data.frame(year = c("2000", "2000", "2000", "2002", "2000", 
"2002", "2007"), id = c("X", "X", "X", "X", "Z", "Z", "Z"), product = c("apple", 
"orange", "orange", "orange", "cake", "cake", "bacon"), market = c("CHN", 
"USA", "USA", "USA", "SPA", "CHL", "CHL"), value = c(1, 2, 3, 
4, 5, 6, 7), years_PM = c(0, 0, 0, 1, 0, 0, 0), value_PM = c(0, 
0, 0, 5, 0, 0, 0), years_OPM = c(0, 0, 0, 1, 0, 0, 0), years_SP_OM = c(0, 
0, 0, 0, 0, 1, 0), 
history = c(0, 0, 0, 1, 0, 1, 1), year_id = c(0, 0, 0, 1, 
0, 1, 2), year_id_consecutive = c(0, 0, 0, 1, 0, 1, 0), n_id_PM = c(0, 
0, 0, 0, 0, 0, 0))

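I tried summarise, but it collapses the data, and I would rather not merge multiple datasets afterwards. mutate did not give me the result I wanted either.

Any idea how to create these variables more directly with dplyr?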

Don't use summarize (this has been said many times); it (almost) always reduces your data.
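
To make the difference concrete, here is a minimal sketch on a made-up three-row data frame (`toy` is hypothetical, not the question's `df`). It only illustrates that `summarise()` returns one row per group while `mutate()` keeps every row.

```r
library(dplyr)

# Hypothetical toy data, not the question's df
toy <- data.frame(id = c("X", "X", "Z"), value = c(1, 2, 5))

# summarise() collapses each group to a single row
collapsed <- toy %>%
  group_by(id) %>%
  summarise(total = sum(value))

# mutate() keeps all rows and repeats the group-level result on each one
kept <- toy %>%
  group_by(id) %>%
  mutate(total = sum(value)) %>%
  ungroup()

nrow(collapsed)  # 2
nrow(kept)       # 3
```

With `mutate()` the group total sits next to each original row, so no merge back onto the data is needed.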

Here's an attempt, covering the various variables you've asked about across the three iterations of this question.

library(dplyr)

df %>%
  mutate(year = as.integer(year)) %>%
  # same product and same market
  group_by(product, market) %>%
  mutate(
    FPFM = +(year == min(year)),
    years_PM = sapply(year, function(y) n_distinct(year[year < y])),
    value_PM = sapply(year, function(y) sum(value[year < y])),
    n_id_PM = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  # same product, any market
  group_by(product) %>%
  mutate(
    FP = +(year == min(year)),
    years_P = sapply(year, function(y) n_distinct(year[year < y])),
    value_P = sapply(year, function(y) sum(value[year < y])),
    n_id_P = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  # same market, any product
  group_by(market) %>%
  mutate(
    FM = +(year == min(year)),
    years_M = sapply(year, function(y) n_distinct(year[year < y])),
    value_M = sapply(year, function(y) sum(value[year < y])),
    n_id_M = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  # ungrouped: compare each row against all other rows
  ungroup() %>%
  mutate(
    years_OPM   = mapply(function(y, p, m) n_distinct(year[year < y & product != p & market != m]),
                         year, product, market),
    years_SP_OM = mapply(function(y, p, m) n_distinct(year[year < y & product == p & market != m]),
                         year, product, market),
    years_OP_SM = mapply(function(y, p, m) n_distinct(year[year < y & product != p & market == m]),
                         year, product, market)
  ) %>%
  # same id
  group_by(id) %>%
  mutate(
    history = +(lengths(sapply(year, function(y) year[year < y])) > 0),
    year_id = sapply(year, function(y) n_distinct(year[year < y])),
    year_id_consecutive = sapply(year, function(y) {
      years <- year[year < y]
      if (length(years)) {
        # 1 if fewer than 2 years are missing between the first and last prior year
        +(length(setdiff(seq(min(years), max(years)), years)) < 2)
      } else 0L
    })
  ) %>%
  ungroup()
# # A tibble: 7 × 23
#    year id    product market value  FPFM years_PM value_PM n_id_PM    FP years_P value_P n_id_P    FM years_M value_M n_id_M years_OPM years_SP_OM years_OP_SM history year_id year_id_consecutive
#   <int> <chr> <chr>   <chr>  <dbl> <int>    <int>    <dbl>   <int> <int>   <int>   <dbl>  <int> <int>   <int>   <dbl>  <int>     <int>       <int>       <int>   <int>   <int>               <int>
# 1  2000 X     apple   CHN        1     1        0        0       0     1       0       0      0     1       0       0      0         0           0           0       0       0                   0
# 2  2000 X     orange  USA        2     1        0        0       0     1       0       0      0     1       0       0      0         0           0           0       0       0                   0
# 3  2000 X     orange  USA        3     1        0        0       0     1       0       0      0     1       0       0      0         0           0           0       0       0                   0
# 4  2002 X     orange  USA        4     0        1        5       1     0       1       5      1     0       1       5      1         1           0           0       1       1                   1
# 5  2000 Z     cake    SPA        5     1        0        0       0     1       0       0      0     1       0       0      0         0           0           0       0       0                   0
# 6  2002 Z     cake    CHL        6     1        0        0       0     0       1       5      1     1       0       0      0         1           1           0       1       1                   1
# 7  2007 Z     bacon   CHL        7     1        0        0       0     1       0       0      0     0       1       6      1         2           0           1       1       2                   1

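Some of the values differ from yours, but I think that may be due to errors in your expected output, or to my misunderstanding the intent of each column.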

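The pattern should be clear: group_by the relevant variables and, where necessary, iterate over year or some other variable (to restrict to previous years), then count/sum/whatever.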
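
The pattern can be factored into a small helper. The sketch below is only illustrative: `over_prior_years` and the `toy` data are hypothetical names introduced here, not part of dplyr or of the question. It applies an arbitrary function to the values from strictly earlier years within each group.

```r
library(dplyr)

# Hypothetical helper: for each row, apply `f` to the values of `x`
# that fall in strictly earlier years within the current group.
over_prior_years <- function(year, x, f) {
  sapply(year, function(y) f(x[year < y]))
}

# Hypothetical toy data
toy <- data.frame(
  year  = c(2000L, 2001L, 2003L, 2000L),
  id    = c("A", "A", "A", "B"),
  value = c(10, 20, 30, 40)
)

res <- toy %>%
  group_by(id) %>%
  mutate(
    years_before = over_prior_years(year, year, n_distinct),  # count of prior years
    value_before = over_prior_years(year, value, sum)         # sum of prior values
  ) %>%
  ungroup()

res$years_before  # 0 1 2 0
res$value_before  # 0 10 30 0
```

Each `sapply` call rescans the whole group for every row, which is fine for small groups but may be slow on large ones.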

I took the liberty of converting year to an integer.
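
As a small illustration of why integer years are convenient, the sketch below lists the years missing between the first and last observation, which is the same `seq`/`setdiff` idea used for year_id_consecutive. The vector `year_chr` is made up for the example. One assumption worth flagging: if year were a factor (the default for strings in `data.frame()` before R 4.0), `as.integer(as.character(year))` would be needed, because `as.integer()` on a factor returns the level codes.

```r
# Hypothetical character years, as in the question's data
year_chr <- c("2000", "2002", "2007")
year_int <- as.integer(year_chr)

# With integers, the full range of years can be generated and compared
gaps <- setdiff(seq(min(year_int), max(year_int)), year_int)

gaps          # 2001 2003 2004 2005 2006
length(gaps)  # 5
```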
