I have a dataset like this:
df <- data.frame(year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007"), id = c("X", "X", "X", "X", "Z", "Z", "Z"), product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon"), market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL"), value = c(1, 2, 3, 4, 5, 6, 7))
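I want to create the following variables by id:
- years_PM = number of years for this product and market (up to and including year t-1)
- value_PM = total value for this product and market (up to and including year t-1)
- years_OPM = number of years of other products in other markets (up to and including year t-1)
- years_SP_OM = number of years of the same product in other markets (up to and including year t-1)
- history = 1 if the given id has a history (up to and including year t-1)
- year_id = number of years for the same id (up to and including year t-1)
- year_id_consecutive = number of years for the same id (up to and including year t-1). If the same id has not been observed for more than 2 consecutive years, it is assigned 0; that is, the count restarts from 0 (as if it were a new observation).
- n_id_PM = number of ids in this product and market up to year t-1 (ids other than the observed one)
So the new dataset would look like: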
df_new <- data.frame(year = c("2000", "2000", "2000", "2002", "2000",
"2002", "2007"), id = c("X", "X", "X", "X", "Z", "Z", "Z"), product = c("apple",
"orange", "orange", "orange", "cake", "cake", "bacon"), market = c("CHN",
"USA", "USA", "USA", "SPA", "CHL", "CHL"), value = c(1, 2, 3,
4, 5, 6, 7), years_PM = c(0, 0, 0, 1, 0, 0, 0), value_PM = c(0,
0, 0, 5, 0, 0, 0), years_OPM = c(0, 0, 0, 1, 0, 0, 0), years_SP_OM = c(0,
0, 0, 0, 0, 1, 0),
history = c(0, 0, 0, 1, 0, 1, 1), year_id = c(0, 0, 0, 1,
0, 1, 2), year_id_consecutive = c(0, 0, 0, 1, 0, 1, 0), n_id_PM = c(0,
0, 0, 0, 0, 0, 0))
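I used summarise, but it collapses the data, and I would rather not merge several datasets back together afterwards. mutate did not give me the result I wanted either.
Any idea how to create these variables more directly with dplyr?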
Don't use summarize (as has been said many times): it (almost) always reduces your data to one row per group.
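To make the difference concrete, here is a minimal sketch on a made-up three-row data frame (the data and the names collapsed and kept are illustrative, not from the question). It shows only the row counts: summarise returns one row per group, while mutate keeps every original row.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(id = c("X", "X", "Z"), value = c(1, 2, 3))

# summarise() collapses each group to a single row
collapsed <- toy %>%
  group_by(id) %>%
  summarise(total = sum(value))

# mutate() computes the same group total but keeps all rows
kept <- toy %>%
  group_by(id) %>%
  mutate(total = sum(value)) %>%
  ungroup()

nrow(collapsed)  # 2
nrow(kept)       # 3
```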
Here's an attempt, covering the various variables you've asked about across the three iterations of this question.
library(dplyr)

df %>%
  mutate(year = as.integer(year)) %>%
  # per product-and-market statistics, using strictly earlier years only
  group_by(product, market) %>%
  mutate(
    FPFM = +(year == min(year)),
    years_PM = sapply(year, function(y) n_distinct(year[year < y])),
    value_PM = sapply(year, function(y) sum(value[year < y])),
    n_id_PM = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  # per product
  group_by(product) %>%
  mutate(
    FP = +(year == min(year)),
    years_P = sapply(year, function(y) n_distinct(year[year < y])),
    value_P = sapply(year, function(y) sum(value[year < y])),
    n_id_P = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  # per market
  group_by(market) %>%
  mutate(
    FM = +(year == min(year)),
    years_M = sapply(year, function(y) n_distinct(year[year < y])),
    value_M = sapply(year, function(y) sum(value[year < y])),
    n_id_M = sapply(year, function(y) n_distinct(id[year < y]))
  ) %>%
  ungroup() %>%
  # comparisons across products and markets need the whole (ungrouped) data
  mutate(
    years_OPM = mapply(function(y, p, m) n_distinct(year[year < y & product != p & market != m]),
                       year, product, market),
    years_SP_OM = mapply(function(y, p, m) n_distinct(year[year < y & product == p & market != m]),
                         year, product, market),
    years_OP_SM = mapply(function(y, p, m) n_distinct(year[year < y & product != p & market == m]),
                         year, product, market)
  ) %>%
  # per id
  group_by(id) %>%
  mutate(
    history = +(lengths(sapply(year, function(y) year[year < y])) > 0),
    year_id = sapply(year, function(y) n_distinct(year[year < y])),
    year_id_consecutive = sapply(year, function(y) {
      years <- year[year < y]
      if (length(years)) {
        # 1 if fewer than 2 years are missing between the first and last earlier year
        +(length(setdiff(seq(min(years), max(years)), years)) < 2)
      } else 0L
    })
  ) %>%
  ungroup()
# # A tibble: 7 × 23
# year id product market value FPFM years_PM value_PM n_id_PM FP years_P value_P n_id_P FM years_M value_M n_id_M years_OPM years_SP_OM years_OP_SM history year_id year_id_consecutive
# <int> <chr> <chr> <chr> <dbl> <int> <int> <dbl> <int> <int> <int> <dbl> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
# 1 2000 X apple CHN 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 2 2000 X orange USA 2 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 3 2000 X orange USA 3 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 4 2002 X orange USA 4 0 1 5 1 0 1 5 1 0 1 5 1 1 0 0 1 1 1
# 5 2000 Z cake SPA 5 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 6 2002 Z cake CHL 6 1 0 0 0 0 1 5 1 1 0 0 0 1 1 0 1 1 1
# 7 2007 Z bacon CHL 7 1 0 0 0 1 0 0 0 0 1 6 1 2 0 1 1 2 1
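Some values differ from yours, but I think that is due either to errors in your expected output or to my misunderstanding of what each column is meant to capture.
The pattern should be clear: group_by the relevant variables and, where necessary, iterate over year or some other variable (to restrict to earlier years), then count, sum, or apply whatever aggregation you need.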
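As a stripped-down sketch of that pattern, the following computes just one of the variables, value_PM, on a small made-up data frame (the data and the name res are illustrative). For each row it sums value over the rows of the same product and market whose year is strictly earlier.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  year    = c(2000L, 2000L, 2002L, 2002L),
  product = c("orange", "orange", "orange", "cake"),
  market  = c("USA", "USA", "USA", "CHL"),
  value   = c(2, 3, 4, 6)
)

res <- toy %>%
  group_by(product, market) %>%
  # for each row's year y, aggregate over the group's rows with year < y
  mutate(value_PM = sapply(year, function(y) sum(value[year < y]))) %>%
  ungroup()

res$value_PM  # 0 0 5 0
```

Swapping sum(value[...]) for n_distinct(year[...]) or n_distinct(id[...]) gives the counting variants used above.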
I took the liberty of converting year to integer.
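For year_id_consecutive, one other possible reading of the question is that the count restarts whenever the gap between two observed years of the same id exceeds 2. The sketch below implements that reading; it is an assumption about the intent, not a confirmed definition. The helper consecutive_years and its max_gap argument are hypothetical names introduced here, not part of dplyr. On the question's year and id columns it reproduces the year_id_consecutive column of df_new.

```r
library(dplyr)

# Only the columns needed for this variable, taken from the question's data
df_id <- data.frame(
  year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007"),
  id   = c("X", "X", "X", "X", "Z", "Z", "Z")
)

# Hypothetical helper: number of earlier distinct years in the current "run",
# where a new run starts when the gap to the previous year exceeds max_gap
consecutive_years <- function(year, max_gap = 2) {
  yrs <- sort(unique(year))
  run <- cumsum(c(TRUE, diff(yrs) > max_gap))            # run label per distinct year
  pos <- ave(seq_along(yrs), run, FUN = seq_along) - 1L  # earlier years within the run
  pos[match(year, yrs)]                                  # map back to the original rows
}

res_consec <- df_id %>%
  mutate(year = as.integer(year)) %>%
  group_by(id) %>%
  mutate(year_id_consecutive = consecutive_years(year)) %>%
  ungroup()

res_consec$year_id_consecutive  # 0 0 0 1 0 1 0
```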