Struggling with a for loop in R



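I want to divide my dataset into several categories according to "time_passed" in the company (people with 0 to 5 years, people with 6 to 10 years, people with 11 to 15 years, and so on: 4 years each time). I think I could do this without a for loop, but I would like to be able to do it both with a for loop and with split (or subset, or any other R function).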

Here is the structure of my dataset:

structure(list(sex = c("F", "H", "F", "F", "H", "F"), age = c("24", 
"33", "53", "32", "38", "21"), time_passed = c("0", "3", "4", 
"0", "2", "0"), level = c("N7  ", "N7  ", "N9  ", "N7  ", "N8  ", 
"    "), wage = c("2605", "4931", "11123", "3750", "6180", "858.31"
)), row.names = c(NA, 6L), class = "data.frame")

And here is the for loop I tried, without success:

list_tranches <- c()
for (i in seq(from = 5, to = 40, by=5)) {
for (j in 1:nrow(data_2021)){
if(data_2021[j,4] %in% seq(i-5+1:i))
tranche_i <- data_2021[j,]
list_tranches <- c(list_tranches, tranche_i)
}
}

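Ultimately, I would like a variable "tranche" added to my dataset df, indicating each person's category of time spent in the company (0 to 5 years, 6 to 10 years, etc.). How should I proceed?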

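Obviously, it is faster to do this without a loop. The following one-liner does the same thing you are trying to achieve (it needs time_passed to be numeric; the conversion is shown further down):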

split(data_2021, data_2021$time_passed %/% 5)

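However, if you want to do this with a for loop, there are a few problems with your code. First, if you are trying to compare numbers, you need to make sure your column is numeric. Your dput shows that the time_passed column is a character column, so you need to start with: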

data_2021$time_passed <- as.numeric(data_2021$time_passed)

Second, list_tranches should be defined as a list, not a vector.

list_tranches <- list()
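
To see why the container matters, here is a small sketch. The toy data frame df is made up for illustration and is not the asker's data. Appending a data frame row with c() flattens it into its columns, while assigning into a list keeps each piece as a data frame.

```r
# Toy data, hypothetical and only for illustration
df <- data.frame(sex = c("F", "H"), time_passed = c(0, 7))

# c() treats a data frame as a plain list of columns and flattens it
flat <- c()
flat <- c(flat, df[1, ])
flat <- c(flat, df[2, ])
length(flat)                 # 4 elements: two columns from each row
is.data.frame(flat)          # FALSE

# Assigning into a list keeps every piece as a data frame
pieces <- list()
pieces[[1]] <- df[1, ]
pieces[[2]] <- df[2, ]
length(pieces)               # 2
is.data.frame(pieces[[1]])   # TRUE
```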

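There are several problems in your loop. First, you don't need the nested loop at all, because indexing is vectorized in R. Second, time_passed is the third column of the data frame, but you are looking for values in the fourth column. Third, your seq syntax is wrong. In i-5+1:i the colon is evaluated before the addition, so seq receives a whole vector and returns its indices, which always start at 1.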
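
The seq problem is easier to see on one concrete value. The sketch below picks i <- 10 arbitrarily; the names shifted, wrong, and right are only illustrative.

```r
i <- 10

# ":" binds tighter than "+" and "-", so this is (i - 5) + (1:i)
shifted <- i - 5 + 1:i
shifted            # 6 7 8 ... 15

# seq() on a vector of length > 1 returns its indices, starting at 1
wrong <- seq(shifted)
wrong              # 1 2 3 ... 10

# Parenthesising the lower bound gives the intended range
right <- (i - 5 + 1):i
right              # 6 7 8 9 10
```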

Putting these together, we have:

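for (i in seq(from = 5, to = 40, by = 5)) {
  # rows whose time_passed is one of the five values i-5, ..., i-1
  j <- which(data_2021$time_passed %in% (i - 5:1))
  # only store groups that actually contain rows
  if (length(j) > 0) list_tranches[[i / 5]] <- data_2021[j, ]
}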

list_tranches
#> [[1]]
#>   sex age time_passed level   wage
#> 1   F  24           0  N7     2605
#> 2   H  33           3  N7     4931
#> 3   F  53           4  N9    11123
#> 4   F  32           0  N7     3750
#> 5   H  38           2  N8     6180
#> 6   F  21           0       858.31

Of course, the example here isn't a great one, because all the values fall in the same tranche.
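
For a less degenerate illustration, here is a sketch on made-up data (the demo data frame and its values are hypothetical) where several tranches are populated. Note that integer division by 5 produces the brackets 0-4, 5-9, 10-14, which is close to but not exactly the 0-5, 6-10 split described in the question.

```r
# Made-up data, chosen so that several tranches are populated
demo <- data.frame(
  id = 1:6,
  time_passed = c(0, 3, 7, 12, 4, 9)
)

# Integer division by 5 maps 0-4 to 0, 5-9 to 1, 10-14 to 2, ...
groups <- split(demo, demo$time_passed %/% 5)

names(groups)              # "0" "1" "2"
groups[["1"]]$time_passed  # 7 9
```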

Created on 2022-08-04 by the reprex package (v2.0.1)

Are you looking for findInterval, or cut followed by split?

data_2021 <-
structure(list(
sex = c("F", "H", "F", "F", "H", "F"), 
age = c("24", "33", "53", "32", "38", "21"), 
time_passed = c("0", "3", "4", "0", "2", "0"), 
level = c("N7  ", "N7  ", "N9  ", "N7  ", "N8  ", "    "), 
wage = c("2605", "4931", "11123", "3750", "6180", "858.31")), 
row.names = c(NA, 6L), 
class = "data.frame")
data_2021$time_passed <- as.integer(data_2021$time_passed)
breaks <- seq(0, 49, by = 5)
ff <- findInterval(data_2021$time_passed, breaks)
split(data_2021, ff)
#> $`1`
#>   sex age time_passed level   wage
#> 1   F  24           0  N7     2605
#> 2   H  33           3  N7     4931
#> 3   F  53           4  N9    11123
#> 4   F  32           0  N7     3750
#> 5   H  38           2  N8     6180
#> 6   F  21           0       858.31
cc <- cut(data_2021$time_passed, breaks = breaks, include.lowest = TRUE)
cc <- droplevels(cc)
split(data_2021, cc)
#> $`[0,5]`
#>   sex age time_passed level   wage
#> 1   F  24           0  N7     2605
#> 2   H  33           3  N7     4931
#> 3   F  53           4  N9    11123
#> 4   F  32           0  N7     3750
#> 5   H  38           2  N8     6180
#> 6   F  21           0       858.31

Created on 2022-08-04 by the reprex package (v2.0.1)
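
The two functions agree on the sample data, but they treat values that sit exactly on a break differently. The sketch below uses made-up values (the vector x is hypothetical) to show this: findInterval uses left-closed intervals, while cut defaults to right-closed ones, which matches the 0 to 5, 6 to 10 brackets from the question.

```r
breaks <- seq(0, 45, by = 5)
x <- c(0, 5, 6, 10)   # made-up values sitting on or near the break points

# findInterval() uses left-closed intervals: [0,5), [5,10), ...
fi <- findInterval(x, breaks)
fi                    # 1 2 2 3

# cut() defaults to right-closed intervals: [0,5], (5,10], ...
ct <- cut(x, breaks = breaks, include.lowest = TRUE)
as.character(ct)      # "[0,5]" "[0,5]" "(5,10]" "(5,10]"
```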


To add the new column tranche, use cut/split and the names attribute of the result.

cc <- cut(data_2021$time_passed, breaks = breaks, include.lowest = TRUE)
cc <- droplevels(cc)
sp <- split(data_2021, cc)
res <- lapply(seq_along(sp), function(i) {
  sp[[i]]$tranche <- names(sp)[i]
  sp[[i]]
})
rm(sp)
res <- do.call(rbind, res)
res
#>   sex age time_passed level   wage tranche
#> 1   F  24           0  N7     2605   [0,5]
#> 2   H  33           3  N7     4931   [0,5]
#> 3   F  53           4  N9    11123   [0,5]
#> 4   F  32           0  N7     3750   [0,5]
#> 5   H  38           2  N8     6180   [0,5]
#> 6   F  21           0       858.31   [0,5]

Created on 2022-08-04 by the reprex package (v2.0.1)
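
If the split pieces themselves are not needed, a shorter route may be to assign the result of cut as a column directly. This is only a sketch, not the method from the answer above; it uses a trimmed copy of the question's data with time_passed already numeric.

```r
# Trimmed copy of the question's data, with time_passed already numeric
data_2021 <- data.frame(
  sex = c("F", "H", "F", "F", "H", "F"),
  time_passed = c(0, 3, 4, 0, 2, 0)
)

breaks <- seq(0, 45, by = 5)

# cut() returns one label per row, so it can be assigned as a column directly
data_2021$tranche <- cut(data_2021$time_passed,
                         breaks = breaks,
                         include.lowest = TRUE)

# Count rows per tranche, dropping the empty levels
table(droplevels(data_2021$tranche))
```

The column is a factor here; wrapping the call in as.character would give plain strings, as in the split-based version.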
