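I currently have a data frame with two columns: arrest date and number of arrests. The date column covers almost every day from 2006 to 2020. Instead of the number of arrests per day, I would like the number of arrests per month and per year.
The data frame will be converted to an xts object for time series analysis, so I need a date column that contains the year and month.
Here are the first 6 months of the dataset: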
structure(list(ARREST_DATE = structure(c(13149, 13150, 13151,
13152, 13153, 13154, 13155, 13156, 13157, 13158, 13159, 13160,
13161, 13162, 13163, 13164, 13165, 13166, 13167, 13168, 13169,
13170, 13171, 13172, 13173, 13174, 13175, 13176, 13177, 13178,
13179, 13180, 13181, 13182, 13183, 13184, 13185, 13186, 13187,
13188, 13189, 13190, 13191, 13192, 13193, 13194, 13195, 13196,
13197, 13198, 13199, 13200, 13201, 13202, 13203, 13204, 13205,
13206, 13207, 13208, 13209, 13210, 13211, 13212, 13213, 13214,
13215, 13216, 13217, 13218, 13219, 13220, 13221, 13222, 13223,
13224, 13225, 13226, 13227, 13228, 13229, 13230, 13231, 13232,
13233, 13234, 13235, 13236, 13237, 13238, 13239, 13240, 13241,
13242, 13243, 13244, 13245, 13246, 13247, 13248, 13249, 13250,
13251, 13252, 13253, 13254, 13255, 13256, 13257, 13258, 13259,
13260, 13261, 13262, 13263, 13264, 13265, 13266, 13267, 13268,
13269, 13270, 13271, 13272, 13273, 13274, 13275, 13276, 13277,
13278, 13279, 13280, 13281, 13282, 13283, 13284, 13285, 13286,
13287, 13288, 13289, 13290, 13291, 13292, 13293, 13294, 13295,
13296, 13297, 13298, 13299, 13300, 13301, 13302, 13303, 13304,
13305, 13306, 13307, 13308, 13309, 13310, 13311, 13312, 13313,
13314, 13315, 13316, 13317, 13318, 13319, 13320, 13321, 13322,
13323, 13324, 13325, 13326, 13327, 13328, 13329), class = "Date"),
num_of_arrests = c(550L, 617L, 895L, 1224L, 1379L, 1246L,
893L, 635L, 889L, 1316L, 1223L, 1264L, 1258L, 852L, 478L,
710L, 1131L, 1190L, 1309L, 1085L, 910L, 704L, 852L, 1278L,
1322L, 1250L, 1128L, 967L, 686L, 812L, 998L, 1350L, 1356L,
1292L, 1006L, 568L, 867L, 1296L, 1428L, 1327L, 1182L, 821L,
233L, 618L, 915L, 1370L, 1391L, 1237L, 992L, 649L, 888L,
1167L, 1369L, 1126L, 1071L, 888L, 615L, 831L, 1019L, 1364L,
1109L, 1239L, 962L, 720L, 930L, 1233L, 1413L, 1350L, 1258L,
1034L, 629L, 954L, 1181L, 1421L, 1332L, 974L, 924L, 680L,
958L, 1232L, 1389L, 1289L, 1189L, 931L, 672L, 824L, 1188L,
1332L, 1194L, 1005L, 1011L, 653L, 822L, 1252L, 1421L, 1316L,
1231L, 902L, 740L, 811L, 1184L, 1362L, 1401L, 1144L, 860L,
383L, 775L, 1143L, 1296L, 1271L, 1056L, 729L, 593L, 836L,
1264L, 1341L, 1298L, 1127L, 771L, 548L, 908L, 1290L, 1398L,
1297L, 1127L, 878L, 663L, 928L, 1258L, 1389L, 1300L, 1135L,
937L, 600L, 851L, 1173L, 1366L, 1211L, 958L, 912L, 602L,
843L, 1274L, 1368L, 1332L, 1068L, 823L, 589L, 482L, 1076L,
1217L, 1194L, 1020L, 822L, 628L, 895L, 1225L, 1116L, 1264L,
1254L, 829L, 747L, 911L, 1241L, 1291L, 1267L, 1182L, 924L,
438L, 826L, 1228L, 1361L, 1255L, 1095L, 763L, 594L, 860L,
1056L, 1157L, 1073L, 898L)), row.names = c(NA, 181L), class = "data.frame")
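To get the number of arrests per month, you can do the following: use the lubridate functions month() and year() to extract the month and year, group by them (in your example the year could be omitted, since there is only 2006), and summarize() to take the sum().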
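As requested, to get a single column containing both year and month, paste() them together, ungroup() to remove the grouping, select() to drop the helper columns, and relocate() yearmonth to the front. Note that yearmonth is then a character column, not a Date.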
Code
library(dplyr)
library(lubridate)
# data is the data frame from the question
result <- data %>%
  mutate(year = year(ARREST_DATE), month = month(ARREST_DATE)) %>%
  group_by(year, month) %>%
  summarise(arrests_per_month = sum(num_of_arrests)) %>%
  mutate(yearmonth = paste(year, month, sep = "-")) %>%
  ungroup() %>%
  select(-c(year, month)) %>%
  relocate(yearmonth)
Output
> result
# A tibble: 6 x 2
  yearmonth arrests_per_month
  <chr>                 <int>
1 2006-1                31051
2 2006-2                28872
3 2006-3                33910
4 2006-4                30541
5 2006-5                32253
6 2006-6                30414
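Since the end goal is an xts object, a possible variation is to keep a real Date as the month key instead of a pasted string, because xts needs a time-based index. The following is only a sketch under some assumptions: it uses lubridate's floor_date() to map every day to the first day of its month, the names month_start, monthly, yearly and monthly_xts are made up for illustration, and the four rows are taken from the sample data above (30 January to 2 February 2006) just to keep the example self-contained.

```r
library(dplyr)
library(lubridate)
library(xts)

# Four rows copied from the sample data in the question (illustration only)
data <- data.frame(
  ARREST_DATE = as.Date(c("2006-01-30", "2006-01-31", "2006-02-01", "2006-02-02")),
  num_of_arrests = c(812L, 998L, 1350L, 1356L)
)

# Monthly totals, keyed by the first day of each month (still a Date)
monthly <- data %>%
  group_by(month_start = floor_date(ARREST_DATE, unit = "month")) %>%
  summarise(arrests_per_month = sum(num_of_arrests), .groups = "drop")

# Yearly totals, keyed by the calendar year
yearly <- data %>%
  group_by(year = year(ARREST_DATE)) %>%
  summarise(arrests_per_year = sum(num_of_arrests), .groups = "drop")

# Build the xts object from a one-column matrix and the Date index
monthly_xts <- xts(
  cbind(arrests_per_month = monthly$arrests_per_month),
  order.by = monthly$month_start
)
```

On the full data set the same pipeline should work unchanged. If a year-month label is still wanted for display, format(month_start, "%Y-%m") gives a zero-padded string such as 2006-01, which sorts correctly as text, unlike 2006-1.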