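I'm working through a textbook exercise that asks me to split my dataset into two groups. The groups are based on the ordered values of the fitted column. I sorted the data in ascending order of fitted with arrange() from the tidyverse. Note that rowid is no longer in sequence after sorting, so I can't use it as a filter criterion.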
structure(list(rowid = c(24, 23, 28, 25, 35, 30, 39, 33, 40,
31, 32, 7, 27, 11, 18), Total_Labour_hrs = c(4314, 4114, 4178,
4289, 4016, 4226, 4146, 4475, 4555, 4121, 3998, 4110, 4347, 4401,
4195), Cases_Shipped = c(248328, 227996, 245743, 249894, 252225,
256506, 270051, 269121, 265239, 271854, 293225, 269334, 273848,
269189, 293880), Labour_Hrs_Cost = c(8.5, 7.22, 8.12, 8.08, 7.85,
7.79, 8.19, 8.01, 7.55, 7.89, 9.01, 7.23, 7.39, 7.05, 8.38),
Holiday = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
fitted = c(4233.43014593002, 4234.27973216581, 4236.39863043405,
4240.19244186657, 4245.05531064968, 4249.21476291559, 4254.60935901285,
4256.24725771141, 4259.24818049515, 4259.97827069745, 4262.05302404833,
4266.68440079867, 4268.13071857239, 4268.94015759698, 4270.8631537863
), residuals = c(80.5698540699768, -120.279732165811, -58.398630434046,
48.8075581334297, -229.055310649683, -23.2147629155879, -108.609359012852,
218.752742288589, 295.751819504853, -138.978270697446, -264.053024048329,
-156.684400798672, 78.869281427611, 132.05984240302, -75.863153786303
)), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"
))
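Specifically, the question asks me to split the dataset in half based on the order of the fitted values. The real dataset I'm using has only 52 rows, so I could simply go to row 26, read off the value of fitted in that row, and split the data with a logical condition such as filter(.data = Grocery_Retailier_arranged_fitted, fitted <= value).
However, since I'm building skills that I want to apply in more challenging settings and to larger datasets, I'd like to know how to do this with a dataset that has millions of rows. I suppose I could still do it by hand, but that becomes a problem if, say, I need to split the data into several smaller datasets.
What is the best practice or approach for doing this properly?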
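You can split the data in half based on the number of rows. Here df is the data frame from the question, already sorted by fitted.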
n <- nrow(df)
# TRUE for the first floor(n/2) rows, FALSE for the remaining rows
data <- split(df, seq_len(n) <= n/2)
#$`FALSE`
# A tibble: 8 x 7
# rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 33 4475 269121 8.01 0 4256. 219.
#2 40 4555 265239 7.55 0 4259. 296.
#3 31 4121 271854 7.89 0 4260. -139.
#4 32 3998 293225 9.01 0 4262. -264.
#5 7 4110 269334 7.23 0 4267. -157.
#6 27 4347 273848 7.39 0 4268. 78.9
#7 11 4401 269189 7.05 0 4269. 132.
#8 18 4195 293880 8.38 0 4271. -75.9
#$`TRUE`
# A tibble: 7 x 7
# rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 24 4314 248328 8.5 0 4233. 80.6
#2 23 4114 227996 7.22 0 4234. -120.
#3 28 4178 245743 8.12 0 4236. -58.4
#4 25 4289 249894 8.08 0 4240. 48.8
#5 35 4016 252225 7.85 0 4245. -229.
#6 30 4226 256506 7.79 0 4249. -23.2
#7 39 4146 270051 8.19 0 4255. -109.
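The same row-count idea extends to more than two pieces, which is the "smaller datasets" case raised in the question. The following is a minimal sketch in base R, not part of the original answer: the toy data frame, the number of chunks k, and the names df_sorted, chunk_id and chunks are all assumptions made for illustration.

```r
# Hypothetical toy data; `fitted` stands in for the column from the question.
set.seed(1)
df <- data.frame(id = 1:10, fitted = runif(10))

# Sort by fitted first, so that the chunks follow the ordered fitted values.
df_sorted <- df[order(df$fitted), ]

k <- 3                 # assumed number of chunks
n <- nrow(df_sorted)

# Map row positions 1..n onto k consecutive chunks of near-equal size.
chunk_id <- ceiling(seq_len(n) * k / n)

# A named list of k data frames, one per chunk.
chunks <- split(df_sorted, chunk_id)

# Number of rows in each chunk.
sapply(chunks, nrow)
```

Because the chunk label is computed from the row position alone, nothing here depends on the size of the data, so the same lines should work unchanged on a much larger data frame, memory permitting.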
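We can use gl to build the grouping variable after arrange. Here the data (df1) are sorted by 'rowid' before splitting; to split on the ordered fitted values instead, pass fitted to arrange.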
library(dplyr)
df1 %>%
  arrange(rowid) %>%
  # gl() labels the first ceiling(n/2) rows 1 and the remaining rows 2
  group_split(grp = as.integer(gl(n(), ceiling(n()/2), n())),
              .keep = FALSE)
This gives the output:
[[1]]
# A tibble: 8 x 7
rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7 4110 269334 7.23 0 4267. -157.
2 11 4401 269189 7.05 0 4269. 132.
3 18 4195 293880 8.38 0 4271. -75.9
4 23 4114 227996 7.22 0 4234. -120.
5 24 4314 248328 8.5 0 4233. 80.6
6 25 4289 249894 8.08 0 4240. 48.8
7 27 4347 273848 7.39 0 4268. 78.9
8 28 4178 245743 8.12 0 4236. -58.4
[[2]]
# A tibble: 7 x 7
rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30 4226 256506 7.79 0 4249. -23.2
2 31 4121 271854 7.89 0 4260. -139.
3 32 3998 293225 9.01 0 4262. -264.
4 33 4475 269121 8.01 0 4256. 219.
5 35 4016 252225 7.85 0 4245. -229.
6 39 4146 270051 8.19 0 4255. -109.
7 40 4555 265239 7.55 0 4259. 296.
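If the goal is the split on fitted described in the question rather than on rowid, one option that avoids looking up a row by hand is to compute the cutoff programmatically. The sketch below is an illustration under assumptions, not part of the original answers: the toy tibble and the names cutoff, lower and upper are hypothetical, and the median is used as the threshold.

```r
library(dplyr)

# Hypothetical toy data standing in for the data from the question.
df1 <- tibble(
  rowid  = 1:7,
  fitted = c(4.2, 3.1, 5.6, 2.8, 4.9, 3.7, 5.0)
)

# The median of fitted plays the role of the hand-picked `value`
# in the question's filter() call.
cutoff <- median(df1$fitted)

# Rows at or below the cutoff form the lower half, the rest the upper half.
lower <- filter(df1, fitted <= cutoff)
upper <- filter(df1, fitted > cutoff)
```

This does not require sorting the data first. One caveat: rows whose fitted value ties with the cutoff all land in the lower group, so the two halves can be unequal in size when there are many ties.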