How to split a large dataset based on the values of a specific variable in R



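I'm working through a textbook exercise that asks me to split my dataset into two groups. The groups are based on the ordered values of the fitted column. I sorted the data in ascending order with the arrange() function from the Tidyverse. After sorting, rowid is no longer in sequential order, so I can't use it as a filter condition.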

structure(list(rowid = c(24, 23, 28, 25, 35, 30, 39, 33, 40, 
31, 32, 7, 27, 11, 18), Total_Labour_hrs = c(4314, 4114, 4178, 
4289, 4016, 4226, 4146, 4475, 4555, 4121, 3998, 4110, 4347, 4401, 
4195), Cases_Shipped = c(248328, 227996, 245743, 249894, 252225, 
256506, 270051, 269121, 265239, 271854, 293225, 269334, 273848, 
269189, 293880), Labour_Hrs_Cost = c(8.5, 7.22, 8.12, 8.08, 7.85, 
7.79, 8.19, 8.01, 7.55, 7.89, 9.01, 7.23, 7.39, 7.05, 8.38), 
Holiday = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
fitted = c(4233.43014593002, 4234.27973216581, 4236.39863043405, 
4240.19244186657, 4245.05531064968, 4249.21476291559, 4254.60935901285, 
4256.24725771141, 4259.24818049515, 4259.97827069745, 4262.05302404833, 
4266.68440079867, 4268.13071857239, 4268.94015759698, 4270.8631537863
), residuals = c(80.5698540699768, -120.279732165811, -58.398630434046, 
48.8075581334297, -229.055310649683, -23.2147629155879, -108.609359012852, 
218.752742288589, 295.751819504853, -138.978270697446, -264.053024048329, 
-156.684400798672, 78.869281427611, 132.05984240302, -75.863153786303
)), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"
))

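Specifically, the problem asks me to split the dataset into two halves according to the order of the fitted values. The real dataset I'm using has only 52 rows, so I could simply go to row 26, read off the value of fitted in that row, and then split the data with a logical condition such as filter(.data = Grocery_Retailier_arranged_fitted, fitted <= value).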

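However, I'm building skills that I want to use in more challenging settings and on larger datasets, so I'd like to know how I would do this with a dataset of millions of rows. I suppose I could still do it by hand, but that becomes a problem if, say, I need to split the data into several smaller datasets.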

What is the best practice or method for doing this properly?

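You can split the data into two halves based on the number of rows. This assumes df is already sorted by fitted, as in the question.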

n <- nrow(df)
data <- split(df, seq(n) <= n/2)
#$`FALSE`
# A tibble: 8 x 7
#  rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
#  <dbl>            <dbl>         <dbl>           <dbl>   <dbl>  <dbl>     <dbl>
#1    33             4475        269121            8.01       0  4256.     219. 
#2    40             4555        265239            7.55       0  4259.     296. 
#3    31             4121        271854            7.89       0  4260.    -139. 
#4    32             3998        293225            9.01       0  4262.    -264. 
#5     7             4110        269334            7.23       0  4267.    -157. 
#6    27             4347        273848            7.39       0  4268.      78.9
#7    11             4401        269189            7.05       0  4269.     132. 
#8    18             4195        293880            8.38       0  4271.     -75.9
#$`TRUE`
# A tibble: 7 x 7
#  rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
#  <dbl>            <dbl>         <dbl>           <dbl>   <dbl>  <dbl>     <dbl>
#1    24             4314        248328            8.5        0  4233.      80.6
#2    23             4114        227996            7.22       0  4234.    -120. 
#3    28             4178        245743            8.12       0  4236.     -58.4
#4    25             4289        249894            8.08       0  4240.      48.8
#5    35             4016        252225            7.85       0  4245.    -229. 
#6    30             4226        256506            7.79       0  4249.     -23.2
#7    39             4146        270051            8.19       0  4255.    -109. 
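
The same idea extends to more than two pieces and to data that has not been sorted yet. Below is a minimal base R sketch; `split_by_rank` is a hypothetical helper name (not part of any package), and the toy data frame is made up purely for illustration. It sorts by the chosen column and then assigns each sorted row position to one of `k` roughly equal chunks, so no cutoff value has to be looked up by hand.

```r
# Hypothetical helper (not from the answers above): split a data frame into
# k roughly equal chunks according to the sorted order of one column.
split_by_rank <- function(df, col, k = 2) {
  stopifnot(k >= 1, nrow(df) >= k)
  sorted <- df[order(df[[col]]), , drop = FALSE]  # ascending sort by `col`
  n <- nrow(sorted)
  # Map sorted row positions 1..n onto chunk ids 1..k
  chunk <- ceiling(seq_len(n) * k / n)
  split(sorted, chunk)
}

# Toy data for illustration only; the values are made up
toy <- data.frame(rowid = 1:7, fitted = c(5, 3, 9, 1, 7, 2, 8))
halves <- split_by_rank(toy, "fitted", k = 2)
sapply(halves, nrow)  # number of rows in each chunk
```

Because the chunk ids are computed from row positions rather than from a hard-coded value, the same call should work unchanged on a much larger data frame, and changing `k` gives thirds, quarters, and so on.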

We can use gl after arranging by 'rowid' to do the split

library(dplyr)
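df1 %>% 
  # sorts by rowid; use arrange(fitted) to split on the order of fitted instead
  arrange(rowid) %>% 
  # gl() labels the first ceiling(n/2) rows as group 1 and the rest as group 2
  group_split(grp = as.integer(gl(n(), ceiling(n()/2), n())),
              .keep = FALSE)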

Output:

[[1]]
# A tibble: 8 x 7
rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
<dbl>            <dbl>         <dbl>           <dbl>   <dbl>  <dbl>     <dbl>
1     7             4110        269334            7.23       0  4267.    -157. 
2    11             4401        269189            7.05       0  4269.     132. 
3    18             4195        293880            8.38       0  4271.     -75.9
4    23             4114        227996            7.22       0  4234.    -120. 
5    24             4314        248328            8.5        0  4233.      80.6
6    25             4289        249894            8.08       0  4240.      48.8
7    27             4347        273848            7.39       0  4268.      78.9
8    28             4178        245743            8.12       0  4236.     -58.4
[[2]]
# A tibble: 7 x 7
rowid Total_Labour_hrs Cases_Shipped Labour_Hrs_Cost Holiday fitted residuals
<dbl>            <dbl>         <dbl>           <dbl>   <dbl>  <dbl>     <dbl>
1    30             4226        256506            7.79       0  4249.     -23.2
2    31             4121        271854            7.89       0  4260.    -139. 
3    32             3998        293225            9.01       0  4262.    -264. 
4    33             4475        269121            8.01       0  4256.     219. 
5    35             4016        252225            7.85       0  4245.    -229. 
6    39             4146        270051            8.19       0  4255.    -109. 
7    40             4555        265239            7.55       0  4259.     296. 
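
The question's own idea of filtering on a cutoff value can also be automated by computing the cutoff instead of reading it from a row. The sketch below is one possible way to do that in base R, assuming the split column has no missing values; `split_at_median` is a hypothetical helper name and the toy data is made up. Unlike a split by row position, rows with identical values always end up in the same half, so the two halves can differ in size when there are ties.

```r
# Hypothetical helper: split a data frame at the median of one column.
# Assumes the column contains no NA values.
split_at_median <- function(df, col) {
  cutoff <- median(df[[col]])
  list(
    lower = df[df[[col]] <= cutoff, , drop = FALSE],  # values up to the median
    upper = df[df[[col]] >  cutoff, , drop = FALSE]   # values above the median
  )
}

# Toy data for illustration only; the values are made up
toy <- data.frame(rowid = 1:5, fitted = c(4, 1, 3, 5, 2))
res <- split_at_median(toy, "fitted")
res$lower
res$upper
```

This version does not need the data to be sorted first, since the comparison against the cutoff is applied row by row.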
