R - Bin a column's values into ranges and aggregate dates by month to count the frequency of each range per month



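I have a data frame that contains a date column of integer type. I also want to split price into ranges of 10,000 and then count how often each range occurs in each month.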

> df
date        values  price
11/25/18   a        10000
11/30/18   b        30500
12/4/18    a        20000
12/5/18    b        65000
12/5/18    a        50000
12/6/18    b        35000
12/6/18    c        40000
12/6/18    a        45000
12/6/18    a        30000
12/7/18    b        80000
12/7/18    c        85000
12/7/18    a        90000
12/9/18    b        20000
12/12/18   a        32500
12/12/18   c        40200
12/13/18   b        56000
1/9/19     a        82000
1/9/19     c        63000
1/9/19     b        20000
1/10/19    d        25000
1/10/19    d        34000
1/10/19    d        13020
1/10/19    a        50000
1/11/19    c        24300
1/11/19    d        40000
2/1/19     a        95000
2/10/19    a        20000
2/13/19    b        10000
3/14/19    d        30000
3/17/19    c        45000
5/4/19     d        18000
5/5/19     c        12000
5/6/19     d        90000
5/31/19    a        90000

I am trying this code, but I cannot aggregate by month:

df %>% 
group_by(date) %>%
count(values)

From this, I get the frequency for each day.

df %>%
group_by(month = month(date)) %>% 
count(values)

When I try this code to aggregate the dates by month, I get the following error:

Error in as.POSIXlt.character(as.character(x), ...) : character string is not in a standard unambiguous format
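
The error appears because `month()` is handed m/d/y strings (stored as a factor) rather than a date object, and that format cannot be parsed unambiguously without being told the layout. A minimal base R sketch of the conversion, using a hypothetical three-element vector `dates` in place of the `date` column:

```r
# Hypothetical miniature of the date column: a factor of m/d/y strings.
dates <- factor(c("11/25/18", "12/4/18", "1/9/19"))

# Convert to character first, then parse with an explicit format string.
parsed <- as.Date(as.character(dates), format = "%m/%d/%y")

# Build a year-month key for grouping; "%Y-%m" also sorts chronologically.
month_key <- format(parsed, "%Y-%m")
print(month_key)
# [1] "2018-11" "2018-12" "2019-01"
```

Grouping on a year-month key rather than the month number alone keeps the same month from different years in separate groups.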

To group by steps of 10,000 (in the price column), I use the following code:

tally(group_by(df, values,
price = cut(price, breaks = seq(10000, 200000, by = 10000)))) %>%
ungroup() %>% 
spread(price, n, fill = 0)

Problem:

I cannot combine this with the code that aggregates the dates by month and then spreads the data by price group.

Expected output:

date  values 10k-20k 20k-30k 30k-40k 40k-50k 50k-60k 60k-70k 70k-80k 80k-90k
11/18  a       1
11/18  b                        1
12/18  a                1       1       1      1                        1
12/18  b                1       1              1         1     
12/18  c                        1       1                               1
...

We can extract the month-year from the date column, use cut to break price into different buckets, count the frequency, and then spread to wide format.

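library(dplyr)
# Bucket boundaries: 10,000 to 200,000 in steps of 10,000
cut_group <- seq(10000, 200000, by = 10000)
df %>%
  mutate(date = as.Date(date, "%m/%d/%y"),        # parse the m/d/y strings
         month_year = format(date, "%m-%y"),      # month-year grouping key
         groups = cut(price, cut_group, include.lowest = TRUE,
                      labels = paste(cut_group[-length(cut_group)], cut_group[-1], sep = "-"))) %>%
  count(values, month_year, groups) %>%           # frequency per value, month and bucket
  tidyr::spread(groups, n, fill = 0)              # one column per bucket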

#  values month_year `10000-20000` `20000-30000` `30000-40000` `40000-50000`
#   <fct>  <chr>            <dbl>         <dbl>         <dbl>         <dbl> 
# 1 a      01-19             0             0             0             1
# 2 a      02-19             1             0             0             0
# 3 a      05-19             0             0             0             0
# 4 a      11-18             1             0             0             0
#.....
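
One detail worth checking is which bucket a price on a boundary lands in. By default `cut` builds right-closed intervals, so 20000 is counted under "10000-20000"; the expected output in the question appears to place it under 20k-30k, which corresponds to `right = FALSE`. A small sketch of both behaviours, with hypothetical `prices` and a shortened set of breaks:

```r
# Shortened, hypothetical breaks and prices to show boundary handling.
breaks <- seq(10000, 50000, by = 10000)
labels <- paste(head(breaks, -1), breaks[-1], sep = "-")
prices <- c(10000, 20000, 20001, 50000)

# Default: intervals are (lo, hi]; include.lowest also admits the first break.
right_closed <- cut(prices, breaks, include.lowest = TRUE, labels = labels)
print(as.character(right_closed))
# [1] "10000-20000" "10000-20000" "20000-30000" "40000-50000"

# right = FALSE: intervals are [lo, hi); include.lowest now admits the last break.
left_closed <- cut(prices, breaks, right = FALSE, include.lowest = TRUE, labels = labels)
print(as.character(left_closed))
# [1] "10000-20000" "20000-30000" "20000-30000" "40000-50000"
```

Which convention is right depends on how the ranges are meant to be read, so it is worth choosing `right` explicitly.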

Data

df <- structure(list(date = structure(c(4L, 5L, 8L, 9L, 9L, 10L, 10L, 
10L, 10L, 11L, 11L, 11L, 12L, 6L, 6L, 7L, 3L, 3L, 3L, 1L, 1L, 
1L, 1L, 2L, 2L, 13L, 14L, 15L, 16L, 17L, 19L, 20L, 21L, 18L), .Label = 
c("1/10/19", "1/11/19", "1/9/19", "11/25/18", "11/30/18", "12/12/18", "12/13/18", 
"12/4/18", "12/5/18", "12/6/18", "12/7/18", "12/9/18", "2/1/19", 
"2/10/19", "2/13/19", "3/14/19", "3/17/19", "5/31/19", "5/4/19", 
"5/5/19", "5/6/19"), class = "factor"), values = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 3L, 2L, 1L, 
3L, 2L, 4L, 4L, 4L, 1L, 3L, 4L, 1L, 1L, 2L, 4L, 3L, 4L, 3L, 4L, 
1L), .Label = c("a", "b", "c", "d"), class = "factor"), price = c(10000L, 
30500L, 20000L, 65000L, 50000L, 35000L, 40000L, 45000L, 30000L, 
80000L, 85000L, 90000L, 20000L, 32500L, 40200L, 56000L, 82000L, 
63000L, 20000L, 25000L, 34000L, 13020L, 50000L, 24300L, 40000L, 
95000L, 20000L, 10000L, 30000L, 45000L, 18000L, 12000L, 90000L, 
90000L)), class = "data.frame", row.names = c(NA, -34L))

If it helps, I can offer a data.table + lubridate solution:

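library(data.table)
library(lubridate)
setDT(df)
# date is stored as a factor of m/d/y strings, so parse it before flooring
df[, date := mdy(as.character(date))]
df[, .N, by = floor_date(date, "month")]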

Edit: I missed the whole "groups of 10,000" part:

df2 <- df[, .N, by = .(date = floor_date(date, "month"), range = cut(price, seq(0, 100e3, 10e3)))]

Then you can use dcast to turn it into wide format:

dcast(df2, date ~ range, value.var = "N")
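
The snippets above count per month and range only, while the expected output in the question also breaks the counts down by `values`. A hedged sketch of how that could look with data.table, assuming the package is installed; the four-row table `dt` and the column names `month` and `range` are made up for illustration, and base `as.Date` is used here in place of lubridate:

```r
library(data.table)

# Hypothetical four-row excerpt of the question's data.
dt <- data.table(
  date   = c("11/25/18", "11/30/18", "12/4/18", "12/5/18"),
  values = c("a", "b", "a", "b"),
  price  = c(10000, 30500, 20000, 65000)
)

# Year-month key and price bucket; dig.lab = 6 avoids labels such as "1e+04".
dt[, month := format(as.Date(date, "%m/%d/%y"), "%Y-%m")]
dt[, range := cut(price, seq(0, 100e3, 10e3), dig.lab = 6)]

# Count rows per month, value and bucket, then reshape to one column per bucket.
counts <- dt[, .N, by = .(month, values, range)]
wide <- dcast(counts, month + values ~ range, value.var = "N", fill = 0L)
print(wide)
```

Only buckets that actually occur become columns here; empty buckets are dropped by default.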
