r - How can I join a row of a data frame to the previous row if a condition is met?



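I'm an intermediate R user with a dataset of ~850,000 rows that was edited in Stata and saved as a csv, but about .01% of the rows were split onto the next line after column 11. I'm trying to restore the file to its original form, with no split rows. I was using the "type" of column 4 as the required condition, but someone pointed out below that this won't work. I tested this, and indeed all object types in the data frame are "integer". Perhaps this would work if I changed the "type" of the column for this problem, but here is what I have tried: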

wages <- for (i in wages) {
if(typeof(wages[i,4]) == "integer") {
cat(i-1, i)
}
}

All I get is NA.

When I try:

for (i in wages) {
if(typeof(i[ ,4]) == "integer") {
append(i-1, i, after = length(i-1))
}
}

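it says: Error in `[.default`(i, , 4) : incorrect number of dimensions

I've spent hours searching for a solution and trying different approaches, with no success. Thanks in advance for any help.

A snippet of the data: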

WD County_Name State_Name   Cons_Code constructiondescription    wagegroup Rate_Effective_Date  hourly
113352                      CO20190006       Adams   Colorado     Highway                         SUCO2011-001           9/15/2011   22.67
113353                      CO20190004       Adams   Colorado Residential                         PLUM0058-011            7/1/2018   32.75
113354 (pipefitters exclude hvac pipe)   SOUTHWEST         CO        8001                   METRO         1352              100335 plumber
113355                      CO20190004       Adams   Colorado Residential                         PLUM0145-005            8/1/2016   24.58
fringe Rate_Type                                                            Craft_Title    region st_abbr stcnty_fips    mr supergrp
113352   8.73      Open                       power equipment operator: broom/sweeper arapahoe SOUTHWEST      CO        8001 METRO     1352
113353  14.85       CBA                        plumber/pipefitter (plumbers include hvac pipe)                            NA             NA
113354      1                                                                                                             NA             NA
113355  10.47       CBA plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe) SOUTHWEST      CO        8001 METRO     1352
group key_craft key
113352 100335  operator   1
113353     NA            NA
113354     NA            NA
113355 100335   plumber   1

Reproducible data:

data <- data.frame(c("CO20190006","CO20190004","(pipefitters exclude hvac pipe)","CO20190004"), #1
c("Adams","Adams","SOUTHWEST","Adams"), #2
c("Colorado","Colorado","CO","Colorado"), #3
c("Highway","Residential","8001","Residential"), #4
c("","","METRO",""), #5
c("SUCO2011-001","PLUM0058-011","1352","PLUM0145-005"), #6
c("9/15/2011","7/1/2018","100335","8/1/2016"), #7
c("22.67","32.75","plumber","24.58"), #8
c("8.73","14.85","1","10.47"), #9
c("Open","CBA","","CBA"), #10
c("power equipment operator: broom/sweeper arapahoe","plumber/pipefitter (plumbers include hvac pipe)","",
"plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe)"), #11
c("SOUTHWEST","","","SOUTHWEST"), #12
c("CO","","","CO"), #13
c("8001",NA,NA,"8001"), #14
c("METRO","","","METRO"), #15
c("1352",NA,NA,"1352"), #16
c("100335",NA,NA,"100335"), #17
c("operator","","","plumber"), #18
c("1",NA,NA,"1")) #19
colnames(data) <- c("WD","County_Name","State_Name","Cons_Code","constructiondescription","wagegroup","Rate_Effective_Date",
"hourly","fringe","Rate_Type","Craft_Title","region","st_abbr","stcnty_fips","mr","supergrp","group",
"key_craft","key")

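The following solution should do the job:

new_data <- NULL
i <- 1
while (i <= nrow(data)) {
  # Copy the current row into the cleaned data frame
  new_data <- rbind(new_data, data[i, ])
  # A row is treated as split when columns 14, 16, 17 and 19 are all NA
  if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
    # Register the merged string as a factor level if it is not one yet
    if (!paste(data[i,11], data[i+1,1]) %in% levels(new_data[,11])) {
      levels(new_data[,11]) <- c(
        levels(new_data[,11]), paste(data[i,11], data[i+1,1])
      )
    }
    # Complete column 11 with the first cell of the next row
    new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
    # Fill columns 12 to 19 from cells 2 to 9 of the next row
    new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
    # Skip the continuation row, which has now been absorbed
    i <- i + 2
  } else {i <- i + 1}
}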

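Note that the data frame (DF) stores strings as factors, because stringsAsFactors = TRUE is one of the default settings when a DF is created with the data.frame() function. This holds for R versions before 4.0.0; from R 4.0.0 onward the default is FALSE, so the strings stay character unless you pass stringsAsFactors = TRUE explicitly. For details on factors and their levels, see the R help page ?factor.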
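As a small illustration of the point about factors, the sketch below uses a made-up one-column data frame (not the poster's data; `df` and `craft` are illustrative names) to show why typeof() reports "integer" for such a column: a factor is stored as integer codes plus a set of levels.

```r
# Hypothetical example data, not taken from the question
df <- data.frame(craft = c("plumber", "operator", "plumber"),
                 stringsAsFactors = TRUE)  # explicit, so it behaves the same on any R version

class(df$craft)       # "factor"
typeof(df$craft)      # "integer": factors are integer codes under the hood
levels(df$craft)      # "operator" "plumber" (sorted alphabetically by default)
as.integer(df$craft)  # 2 1 2, the codes that index into the levels
```

This is also why testing typeof(...) == "integer" cannot tell a split row from an intact one: every cell of a factor column has the same storage type.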

In the code above, we first add a new row to the clean new_data:

new_data <- rbind(new_data, data[i, ])

Then we test whether the row was split by checking whether columns 14, 16, 17 and 19 all contain NA:

if (all(is.na(data[i, c(14, 16, 17, 19)])))
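The same test can be run on every row at once, which is handy for counting the affected rows before looping. This is only a sketch on a made-up five-column frame (`toy`, `na_cols` and `flagged` are illustrative names); on the real data `na_cols` would presumably be c(14, 16, 17, 19).

```r
# Hypothetical toy data: row 2 was split after "desc", row 3 holds the spilled cells
toy <- data.frame(
  id     = c("A1", "A2", "(tail of desc)", "A3"),
  name   = c("x", "y", "WEST", "z"),
  desc   = c("first", "head of desc", "42", "third"),
  region = c("EAST", NA, NA, "NORTH"),
  code   = c("7", NA, NA, "9"),
  stringsAsFactors = FALSE
)

na_cols <- c(4, 5)  # columns that are NA when a row is incomplete
flagged <- rowSums(is.na(toy[, na_cols, drop = FALSE])) == length(na_cols)
which(flagged)      # 2 3
```

Note that the leftover continuation row (row 3 here) is flagged as well, which is why the loop advances by two rows after a merge.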

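If the row is split, then in order to merge the cell in column 11 of the split row with the first cell of the next row, we first need to check whether that level already exists in the column, and if it does not: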

if (!paste(data[i,11], data[i+1,1]) %in% levels(new_data[,11]))

it needs to be added to the list of levels before the merge:

levels(new_data[,11]) <- c(levels(new_data[,11]), paste(data[i,11], data[i+1,1]))
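To see why the level has to be registered first, here is a minimal sketch on a throwaway factor (`f` and `g` are illustrative names): assigning a value that is not among the levels yields NA with a warning, whereas extending levels() beforehand lets the assignment go through.

```r
f <- factor(c("plumber", "operator"))

# Without the level: R warns about an invalid factor level and stores NA
g <- f
suppressWarnings(g[1] <- "plumber (exclude hvac pipe)")
is.na(g[1])         # TRUE

# With the level added first, the assignment works
levels(f) <- c(levels(f), "plumber (exclude hvac pipe)")
f[1] <- "plumber (exclude hvac pipe)"
as.character(f[1])  # "plumber (exclude hvac pipe)"
```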

Then, finally, the merge can be done (to complete the cell in column 11 of the split row):

new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])

After that, the remaining missing columns are added to the split row in question:

new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])

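LITE version

Now, I suspect that all this checking of factor levels and adding new ones takes some extra time, so I suggest a new version of the code that converts the column involved, column 11, to character instead of factor. I think this makes sense for this particular dataset, since that column does not seem to be intended as a factor anyway. This way, all the factor checking and adding can be skipped: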

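# Store column 11 as plain strings so no factor levels need managing
data[,11] <- as.character(data[,11])
new_data <- NULL
i <- 1
while (i <= nrow(data)) {
  new_data <- rbind(new_data, data[i, ])
  # A row is treated as split when columns 14, 16, 17 and 19 are all NA
  if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
    # Complete column 11, then fill columns 12 to 19 from the next row
    new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
    new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
    i <- i + 2
  } else {i <- i + 1}
}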

Let me know if this improves the speed!
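If the loop is still slow on ~850,000 rows, one likely cause is rbind() growing new_data one row at a time. A possible alternative is sketched below. It relies on the same assumptions as the code above: a split row and its continuation both have NA in the check columns, the continuation's first cell completes the split column, and its following cells fill the remaining columns. The loop only records where the split rows are, and the merging is then done with vectorised assignments. The function name `merge_split_rows`, its arguments and the toy data are hypothetical; for the data in the question the call would presumably be merge_split_rows(data, na_cols = c(14, 16, 17, 19), join_col = 11, n_spill = 8).

```r
merge_split_rows <- function(df, na_cols, join_col, n_spill) {
  df[] <- lapply(df, as.character)  # plain strings: no factor levels to manage
  flagged <- rowSums(is.na(df[, na_cols, drop = FALSE])) == length(na_cols)

  # Walk the flags once: a flagged row starts a split, the next row is its tail
  starts <- integer(0)
  i <- 1
  n <- nrow(df)
  while (i < n) {
    if (flagged[i]) {
      starts <- c(starts, i)
      i <- i + 2
    } else {
      i <- i + 1
    }
  }
  if (length(starts) == 0) return(df)

  tails <- starts + 1
  # Complete the split cell with the first cell of the continuation row
  df[[join_col]][starts] <- paste(df[[join_col]][starts], df[[1]][tails])
  # Move the spilled cells back into the columns they belong to
  for (k in seq_len(n_spill)) {
    df[[join_col + k]][starts] <- df[[1 + k]][tails]
  }
  # Drop the continuation rows, which have now been absorbed
  df[-tails, , drop = FALSE]
}

# Hypothetical toy data: row 2 was split after "desc", row 3 holds the spilled cells
toy <- data.frame(
  id     = c("A1", "A2", "(tail of desc)", "A3"),
  name   = c("x", "y", "WEST", "z"),
  desc   = c("first", "head of desc", "42", "third"),
  region = c("EAST", NA, NA, "NORTH"),
  code   = c("7", NA, NA, "9"),
  stringsAsFactors = FALSE
)

fixed <- merge_split_rows(toy, na_cols = c(4, 5), join_col = 3, n_spill = 2)
fixed
```

The sketch converts every column to character, so numeric columns would need converting back afterwards (for example with as.numeric()). It is worth comparing its output with the loop version on a small sample before trusting it on the full file.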
