I have a very messy data frame that looks like this:
df <- data.frame(Job = c("casual", "part time", "full time", "Level A total" , "casual","full time","Level B total"), institute1 = c(1,2,2,5,0,1,1))
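In the rows above, "Level B total" refers to Level B for every row going upward until you reach "Level A total", where the rows now refer to Level A. The data is >500 rows long, so cleaning it by hand is an option, but an unpleasant one. I can't figure out how to code this so that I can add the information that tells R which level each cell refers to.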
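We can create a new column Level and put all the "Level" values in it. Then fill the NA values with the non-NA value below them. Finally, clean up the Level column by adding the text from Job to it.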
library(dplyr)
df %>%
  mutate(Level = replace(Job, !grepl('Level', Job), NA)) %>%
  tidyr::fill(Level, .direction = 'up') %>%
  mutate(Level = ifelse(grepl('total', Job),
                        Job, paste0(sub('total', '', Level), Job)))
#            Job institute1             Level
#1        casual          1    Level A casual
#2     part time          2 Level A part time
#3     full time          2 Level A full time
#4 Level A total          5     Level A total
#5        casual          0    Level B casual
#6     full time          1 Level B full time
#7 Level B total          1     Level B total
Base R solution:
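# Reverse the rows so each "total" row comes before the rows it labels,
# flag the level on the "total" rows, carry it forward, then restore the order.
transform(within(df[rev(seq_len(nrow(df))), ],
  {
    Level <- ifelse(grepl("Level\\s*[A-Z]", Job),
                    gsub("\\s*total", "", Job), NA_character_)
  }
), Level = na.omit(Level)[cumsum(!(is.na(Level)))])[rev(seq_len(nrow(df))), ]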
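
If the double reversal is hard to follow, the same cumsum idea can be sketched without reversing anything. This is only a minimal sketch, assuming every block of rows ends with its own "Level X total" row. The names is_total, labels and block are my own and are not part of either answer above.

```r
# Rebuild the example data from the question.
df <- data.frame(
  Job = c("casual", "part time", "full time", "Level A total",
          "casual", "full time", "Level B total"),
  institute1 = c(1, 2, 2, 5, 0, 1, 1)
)

# Rows that carry a level label (the "Level X total" rows).
is_total <- grepl("^Level", df$Job)

# Level labels in order of appearance, with the " total" suffix removed.
labels <- sub("\\s*total$", "", df$Job[is_total])

# Each row belongs to the first total row at or below it. Counting the
# total rows strictly above a row gives a 0-based block number, so add 1
# to turn it into an index into labels.
block <- cumsum(c(0, head(is_total, -1))) + 1

# Assumption: no rows trail after the last total row; any that did would get NA.
df$Level <- labels[block]
df
```

Here Level holds only the level name ("Level A" or "Level B"), so Job stays untouched and the two columns can be combined later with paste() if a single label is preferred.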