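I'm an intermediate R user with a dataset of ~850,000 rows that was edited in Stata and saved as a csv, but about .01% of the rows were split onto the next line after column 11. I'm trying to restore the file to its original form, with no split rows. I was using the type of column 4 as the necessary condition, but someone pointed out below that this won't work. I tested it, and every object type in the data frame is indeed "integer". Maybe this would work if I changed the column types for this problem, but here is what I tried: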
wages <- for (i in wages) {
if(typeof(wages[i,4]) == "integer") {
cat(i-1, i)
}
}
All I get is NA.
When I try:
for (i in wages) {
if(typeof(i[ ,4]) == "integer") {
append(i-1, i, after = length(i-1))
}
}
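it says:
Error in `[.default`(i, , 4) : incorrect number of dimensions
I have spent hours searching for a solution and trying different approaches, with no success. Thanks in advance for any help.
Data snippet: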
WD County_Name State_Name Cons_Code constructiondescription wagegroup Rate_Effective_Date hourly
113352 CO20190006 Adams Colorado Highway SUCO2011-001 9/15/2011 22.67
113353 CO20190004 Adams Colorado Residential PLUM0058-011 7/1/2018 32.75
113354 (pipefitters exclude hvac pipe) SOUTHWEST CO 8001 METRO 1352 100335 plumber
113355 CO20190004 Adams Colorado Residential PLUM0145-005 8/1/2016 24.58
fringe Rate_Type Craft_Title region st_abbr stcnty_fips mr supergrp
113352 8.73 Open power equipment operator: broom/sweeper arapahoe SOUTHWEST CO 8001 METRO 1352
113353 14.85 CBA plumber/pipefitter (plumbers include hvac pipe) NA NA
113354 1 NA NA
113355 10.47 CBA plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe) SOUTHWEST CO 8001 METRO 1352
group key_craft key
113352 100335 operator 1
113353 NA NA
113354 NA NA
113355 100335 plumber 1
Reproducible data:
data <- data.frame(c("CO20190006","CO20190004","(pipefitters exclude hvac pipe)","CO20190004"), #1
c("Adams","Adams","SOUTHWEST","Adams"), #2
c("Colorado","Colorado","CO","Colorado"), #3
c("Highway","Residential","8001","Residential"), #4
c("","","METRO",""), #5
c("SUCO2011-001","PLUM0058-011","1352","PLUM0145-005"), #6
c("9/15/2011","7/1/2018","100335","8/1/2016"), #7
c("22.67","32.75","plumber","24.58"), #8
c("8.73","14.85","1","10.47"), #9
c("Open","CBA","","CBA"), #10
c("power equipment operator: broom/sweeper arapahoe","plumber/pipefitter (plumbers include hvac pipe)","",
"plumber (plumbers include hvac pipe) & pipefitters (exclude hvac pipe)"), #11
c("SOUTHWEST","","","SOUTHWEST"), #12
c("CO","","","CO"), #13
c("8001",NA,NA,"8001"), #14
c("METRO","","","METRO"), #15
c("1352",NA,NA,"1352"), #16
c("100335",NA,NA,"100335"), #17
c("operator","","","plumber"), #18
c("1",NA,NA,"1")) #19
colnames(data) <- c("WD","County_Name","State_Name","Cons_Code","constructiondescription","wagegroup","Rate_Effective_Date",
"hourly","fringe","Rate_Type","Craft_Title","region","st_abbr","stcnty_fips","mr","supergrp","group",
"key_craft","key")
The following solution should do the job:
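new_data <- NULL
i <- 1
while (i <= nrow(data)) {
  # Copy the current row into the clean data frame
  new_data <- rbind(new_data, data[i, ])
  # A row is treated as split when columns 14, 16, 17 and 19 are all NA
  if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
    # Register the merged string as a factor level if it is not one yet
    if (!paste(data[i,11], data[i+1,1]) %in% levels(new_data[,11])) {
      levels(new_data[,11]) <- c(
        levels(new_data[,11]), paste(data[i,11], data[i+1,1])
      )
    }
    # Complete column 11, then fill columns 12 to 19 from the next row
    new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
    new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
    # Skip the continuation row that was just merged
    i <- i + 2
  } else {i <- i + 1}
}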
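Note that the data frame (DF) stores strings as factors, because stringsAsFactors = TRUE is one of the default settings when a DF is created with the data.frame() function (this was the default before R 4.0.0; from 4.0.0 on the default is FALSE and strings stay character). It is worth reading up on factors and their levels in data frames if they are unfamiliar.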
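To illustrate why the levels matter, here is a minimal sketch on a made-up factor (the vector and the value "Union" are hypothetical, not taken from the wage data). It shows that assigning a value that is not an existing level produces NA, and that registering the level first avoids this:

```r
# Hypothetical toy factor, not taken from the wage data
f <- factor(c("CBA", "Open"))

# Assigning a value that is not an existing level yields NA
# (R also emits an "invalid factor level" warning, silenced here)
g <- f
suppressWarnings(g[1] <- "Union")
is.na(g[1])       # TRUE

# Registering the level first makes the same assignment work
h <- f
levels(h) <- c(levels(h), "Union")
h[1] <- "Union"
as.character(h)   # "Union" "Open"
```

This is the reason the loop extends levels(new_data[,11]) before writing the merged string into column 11.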
So, in the code above, we first add a new row to the clean new_data:
new_data <- rbind(new_data, data[i, ])
Then we test whether the row was split by checking whether columns 14, 16, 17 and 19 all contain NA:
if (all(is.na(data[i, c(14, 16, 17, 19)])))
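As a quick illustration of this test, the sketch below applies it to a small made-up data frame (the names toy, code1 and code2 are hypothetical). It also shows a vectorized form that flags every row at once, which may be handy for counting how many rows are affected before repairing anything:

```r
# Toy data frame (hypothetical): two key columns that are NA only in a split row
toy <- data.frame(
  id    = c("a", "b", "c"),
  code1 = c("8001", NA, "8001"),
  code2 = c("1", NA, "1"),
  stringsAsFactors = FALSE
)

# Row-by-row form, as used inside the loop
row2_split <- all(is.na(toy[2, c("code1", "code2")]))
row2_split                # TRUE

# Vectorized form: one logical value per row
is_split <- rowSums(is.na(toy[, c("code1", "code2")])) == 2
unname(which(is_split))   # 2
```

Note that in the reproducible data above, the continuation row (row 3) also has NA in columns 14, 16, 17 and 19, so a vectorized flag marks both halves of a split row; the loop avoids treating the second half as a new split by advancing i by 2.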
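If so, we want to merge the cell in column 11 of the split row with the first cell of the next row. To be able to do that, we first need to check whether the merged string already exists as a level of that column, and if it does not: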
if (!paste(data[i,11], data[i+1,1]) %in% levels(new_data[,11]))
it has to be added to the list of levels before merging:
levels(new_data[,11]) <- c(levels(new_data[,11]), paste(data[i,11], data[i+1,1]))
Then, finally, the merge can be done (to complete the cell in column 11 of the split row):
new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
After that, the remaining missing columns are added to the split row in question:
new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
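LITE version
Now, I suspect that all this checking and adding of factor levels costs some extra time, so I suggest a new version of the code that converts column 11 to character instead of factor. I think this makes sense for this particular dataset, since that column does not seem to be meant as a factor anyway. This way, all the level checking/adding can be skipped: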
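# Store column 11 as character so no factor levels need to be managed
data[,11] <- as.character(data[,11])
new_data <- NULL
i <- 1
while (i <= nrow(data)) {
  # Copy the current row into the clean data frame
  new_data <- rbind(new_data, data[i, ])
  # A row is treated as split when columns 14, 16, 17 and 19 are all NA
  if (all(is.na(data[i, c(14, 16, 17, 19)]))) {
    # Complete column 11, then fill columns 12 to 19 from the next row
    new_data[nrow(new_data),11] <- paste(data[i,11], data[i+1,1])
    new_data[nrow(new_data),] <- cbind(new_data[nrow(new_data),1:11], data[i+1,2:9])
    # Skip the continuation row that was just merged
    i <- i + 2
  } else {i <- i + 1}
}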
Let me know if this improves the speed!
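If the loop is still slow on ~850,000 rows (growing a data frame with rbind() on every iteration copies it each time), a vectorized variant may be worth trying. The following is only a sketch under stated assumptions, not a tested replacement for the code above: the function name repair_split_rows and its parameters last_col and na_cols are made up for illustration, all columns are converted to character, and a row is assumed to be split exactly when every column in na_cols is NA, with its continuation directly below it.

```r
# Hypothetical helper: repairs rows that were split after column `last_col`.
# `na_cols` are the columns that are all NA in both halves of a split row.
repair_split_rows <- function(df, last_col, na_cols) {
  df[] <- lapply(df, as.character)          # avoid factor-level issues entirely
  n_tail <- ncol(df) - last_col             # number of columns pushed to the next row
  flagged <- unname(rowSums(is.na(df[, na_cols, drop = FALSE])) == length(na_cols))
  # Within each run of consecutive flagged rows, odd positions are first halves
  pos_in_run <- sequence(rle(flagged)$lengths)
  starts <- which(flagged & pos_in_run %% 2 == 1)
  if (length(starts) == 0) return(df)
  conts <- starts + 1                       # second halves
  # Guard: every first half must be followed by a flagged continuation row
  stopifnot(all(conts <= nrow(df)), all(flagged[conts]))
  # Complete the cut-off cell, then move the displaced cells into place
  df[starts, last_col] <- paste(df[starts, last_col], df[conts, 1])
  df[starts, (last_col + 1):ncol(df)] <- df[conts, 2:(n_tail + 1)]
  df <- df[-conts, , drop = FALSE]          # drop the continuation rows
  rownames(df) <- NULL
  df
}

# Toy example (hypothetical data): row 2 was split after column 3
toy <- data.frame(
  id   = c("A1", "A2", "second half", "A3"),
  name = c("x", "y", "20", "z"),
  desc = c("full text", "first half", "2", "other"),
  code = c("10", NA, NA, "30"),
  key  = c("1", NA, NA, "1"),
  stringsAsFactors = FALSE
)
fixed <- repair_split_rows(toy, last_col = 3, na_cols = 4:5)
```

For the wage data the corresponding call would presumably be repair_split_rows(data, last_col = 11, na_cols = c(14, 16, 17, 19)). The stopifnot() guard makes the function stop rather than silently merge if a flagged row has no flagged row below it, for instance when a genuine row happens to have NA in all of those columns.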