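I asked a similar question before but realized I was not specific enough. I am currently analyzing Twitter data in R. The tweets come from different users and were written over different time periods (for each user, data was collected over one year). I would like to plot the data using a dictionary, but to do that I first need to unify the time range of the data.
For simplicity, I created two data frames to explain what I am looking for. This is what my data frame looks like at the moment (only with much more data):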
Author <- rep(c("Person1"), times = 7)
Text <- c("A","B","C", "D", "E", "F", "G")
Date <- as.Date(c('2015-01-15','2015-01-23','2015-02-14','2015-02-20', '2015-02-25', '2015-03-04', '2015-04-20'))
Pers1 <- data.frame(Author,Text,Date)
Author <- rep(c("Person2"), times = 7)
Text <- c("H","I","J", "K", "L", "M", "N")
Date <- as.Date(c('2020-08-10','2020-08-15','2020-09-05','2020-09-20', '2020-09-30', '2020-10-15','2020-10-25'))
Pers2 <- data.frame(Author,Text,Date)
library(dplyr)  # provides bind_rows()
DF <- bind_rows(Pers1, Pers2)
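For example, I am looking at Person 1's tweets from January 15, 2015 to January 5, 2016. The first month of observation (January 15 to February 15) should be called the first month, and so on (up to the twelfth month).
For Person 2, observation starts on August 10 (first month until September 10, second month from September 10 to October 10, ...).
In the end, I would like the data frame to look like this: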
> DF
Author Text Date Period
1 Person1 A 2015-01-15 First Month
2 Person1 B 2015-01-23 First Month
3 Person1 C 2015-02-14 First Month
4 Person1 D 2015-02-20 Second Month
5 Person1 E 2015-02-25 Second Month
6 Person1 F 2015-03-04 Second Month
7 Person1 G 2015-04-20 Third Month
8 Person2 H 2020-08-10 First Month
9 Person2 I 2020-08-15 First Month
10 Person2 J 2020-09-05 First Month
11 Person2 K 2020-09-20 Second Month
12 Person2 L 2020-09-30 Second Month
13 Person2 M 2020-10-15 Third Month
14 Person2 N 2020-10-25 Third Month
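Maybe I have to prepare each data frame before combining them into one big data frame, but I do not know how to do that. Thanks in advance for all suggestions.
Code
library(dplyr)      # group_by(), mutate(), first(), %>%
library(lubridate)  # interval(), months()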
DF %>%
group_by(Author) %>%
mutate(Period = 1 + (interval(first(Date), Date) %/% months(1)))
Result
Author Text Date Period
<fct> <fct> <date> <dbl>
1 Person1 A 2015-01-15 1
2 Person1 B 2015-01-23 1
3 Person1 C 2015-02-14 1
4 Person1 D 2015-02-20 2
5 Person1 E 2015-02-25 2
6 Person1 F 2015-03-04 2
7 Person1 G 2015-04-20 4
8 Person2 H 2020-08-10 1
9 Person2 I 2020-08-15 1
10 Person2 J 2020-09-05 1
11 Person2 K 2020-09-20 2
12 Person2 L 2020-09-30 2
13 Person2 M 2020-10-15 3
14 Person2 N 2020-10-25 3
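Note that tweet G lands in period 4, not 3: Person1 has no tweet between 2015-03-15 and 2015-04-14, so the third calendar month is simply empty, whereas the expected output in the question labels G as "Third Month". If consecutive numbering of the months that actually contain tweets is what is wanted, one possible sketch is to renumber the periods per author with dplyr's `dense_rank()`. The `periods` tibble and the `PeriodDense` column are names made up for this illustration, and the period values are copied from the result above.

```r
library(dplyr)

# Period numbers copied from the result above (illustrative input only).
periods <- tibble(
  Author = rep(c("Person1", "Person2"), each = 7),
  Period = c(1, 1, 1, 2, 2, 2, 4, 1, 1, 1, 2, 2, 3, 3)
)

# dense_rank() renumbers the values that occur within each author as 1, 2, 3, ...
# so gaps left by months without tweets disappear.
renumbered <- periods %>%
  group_by(Author) %>%
  mutate(PeriodDense = dense_rank(Period)) %>%
  ungroup()
```

Whether this is desirable depends on the plot: with dense numbering, period 3 for Person1 no longer means "the third month after the first tweet".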
You can do it like this:
library(dplyr)
months_since_start <- function(dates, start_date) {
  # 4.33 weeks approximates the average length of a month
  floor(as.numeric(difftime(dates, start_date, units = "weeks")) / 4.33) + 1
}
DF %>%
group_by(Author) %>%
mutate(month = months_since_start(Date, first(Date)))
#> # A tibble: 14 x 4
#> # Groups: Author [2]
#> Author Text Date month
#> <chr> <chr> <date> <dbl>
#> 1 Person1 A 2015-01-15 1
#> 2 Person1 B 2015-01-23 1
#> 3 Person1 C 2015-02-14 1
#> 4 Person1 D 2015-02-20 2
#> 5 Person1 E 2015-02-25 2
#> 6 Person1 F 2015-03-04 2
#> 7 Person1 G 2015-04-20 4
#> 8 Person2 H 2020-08-10 1
#> 9 Person2 I 2020-08-15 1
#> 10 Person2 J 2020-09-05 1
#> 11 Person2 K 2020-09-20 2
#> 12 Person2 L 2020-09-30 2
#> 13 Person2 M 2020-10-15 3
#> 14 Person2 N 2020-10-25 3
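Dividing by 4.33 weeks treats every month as roughly 30.3 days, so over a full year the period boundaries can drift a few days away from the calendar date on which the observation started. If the boundaries should fall on the same day of each month (the 15th for Person1, the 10th for Person2), a base R sketch along the following lines might work. `months_elapsed` is a hypothetical helper, not part of any package, and its behavior for start days 29 to 31 has not been checked against lubridate.

```r
# Whole calendar months elapsed between `start` and each element of `dates`.
months_elapsed <- function(dates, start) {
  d <- as.POSIXlt(dates)
  s <- as.POSIXlt(start)
  # POSIXlt stores years since 1900 and months as 0-11.
  total <- (d$year - s$year) * 12 + (d$mon - s$mon)
  # Subtract one month when the day of month has not yet reached the start day.
  total - as.integer(d$mday < s$mday)
}

dates <- as.Date(c("2015-01-15", "2015-02-14", "2015-02-15", "2015-04-20"))
period <- months_elapsed(dates, dates[1]) + 1
```

Inside the pipeline above, this could replace the helper as `mutate(month = months_elapsed(Date, first(Date)) + 1)`.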
Using MESS::cumsumbinning
library(dplyr)
DF %>%
group_by(Author) %>%
mutate(Month = MESS::cumsumbinning(c(0, diff(Date - first(Date))), 30, cutwhenpassed = FALSE))
Author Text Date Month
<chr> <chr> <date> <int>
1 Person1 A 2015-01-15 1
2 Person1 B 2015-01-23 1
3 Person1 C 2015-02-14 1
4 Person1 D 2015-02-20 2
5 Person1 E 2015-02-25 2
6 Person1 F 2015-03-04 2
7 Person1 G 2015-04-20 3
8 Person2 H 2020-08-10 1
9 Person2 I 2020-08-15 1
10 Person2 J 2020-09-05 1
11 Person2 K 2020-09-20 2
12 Person2 L 2020-09-30 2
13 Person2 M 2020-10-15 3
14 Person2 N 2020-10-25 3
To get the expected result, you can use english::ordinal:
library(english)
library(tidyverse)
library(MESS)
DF %>%
group_by(Author) %>%
mutate(Month = MESS::cumsumbinning(c(0, diff(Date - first(Date))), 30, cutwhenpassed = FALSE) %>%
ordinal() %>%
paste(., "Month") %>%
stringr::str_to_title()
)
Author Text Date Month
<chr> <chr> <date> <chr>
1 Person1 A 2015-01-15 First Month
2 Person1 B 2015-01-23 First Month
3 Person1 C 2015-02-14 First Month
4 Person1 D 2015-02-20 Second Month
5 Person1 E 2015-02-25 Second Month
6 Person1 F 2015-03-04 Second Month
7 Person1 G 2015-04-20 Third Month
8 Person2 H 2020-08-10 First Month
9 Person2 I 2020-08-15 First Month
10 Person2 J 2020-09-05 First Month
11 Person2 K 2020-09-20 Second Month
12 Person2 L 2020-09-30 Second Month
13 Person2 M 2020-10-15 Third Month
14 Person2 N 2020-10-25 Third Month
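If installing the english package is not an option, the labels can also be built from a plain lookup vector, since the observation window is at most twelve months. The sketch below uses base R only; `ordinal_words`, `period_levels` and `label_period` are names invented for this example. Returning a factor with explicit levels is a design choice that assumes the labels will be used for plotting, where alphabetical ordering ("Eighth Month" before "First Month") would otherwise get in the way.

```r
# Ordinal words for a twelve-month observation window (hypothetical helper vector).
ordinal_words <- c("First", "Second", "Third", "Fourth", "Fifth", "Sixth",
                   "Seventh", "Eighth", "Ninth", "Tenth", "Eleventh", "Twelfth")
period_levels <- paste(ordinal_words, "Month")

label_period <- function(period) {
  # Indexing by number picks the matching label; the levels fix the plotting order.
  factor(period_levels[period], levels = period_levels)
}

labels <- label_period(c(1, 1, 2, 4))
```

Applied to a numeric period column, this would be `mutate(Period = label_period(Period))`.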