R - Split a string into an unknown number of new columns



I have a dataset that looks like this:

library(tibble)

data = tibble(emp = c(1:4), 
              idstring = c("PER20384|PER49576|PER10837|PER92641",
                           "PER20384|PER49576|PER03875|PER72534", 
                           "PER20384|PER98642|PER17134", 
                           "PER20384|PER98623|PER17134|PER01836|PER1234"))

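I want to split idstring on "|" into separate columns. However, I need the rightmost element (e.g. "PER92641") to always end up in the column labeled "Level_1", while the leftmost elements shift depending on how many elements the row contains.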

I tried some basic steps, such as:

library(stringr)

# "|" is a regex metacharacter, so it has to be escaped as "\\|"
data_split = str_split(data$idstring, "\\|", simplify = TRUE)
colnames(data_split) = paste0("Level_", ncol(data_split):1)

But I get the wrong output, shown below:

     Level_5    Level_4    Level_3    Level_2    Level_1
[1,] "PER20384" "PER49576" "PER10837" "PER92641" ""       
[2,] "PER20384" "PER49576" "PER03875" "PER72534" ""       
[3,] "PER20384" "PER98642" "PER17134" ""         ""       
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234" 

It should look like this:

     Level_5    Level_4    Level_3    Level_2    Level_1
[1,] NA         "PER20384" "PER49576" "PER10837" "PER92641"
[2,] NA         "PER20384" "PER49576" "PER03875" "PER72534"
[3,] NA         NA         "PER20384" "PER98642" "PER17134"
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"

Note that, ideally, I would also like the blanks replaced with NA where applicable.

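I feel I could somehow reverse the order of each row and then replace the blanks with NA before adding the column names, but I am hoping to find a more elegant solution here.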

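This can be done by ordering on the NA values. We split 'idstring' at | into a list and take the max of the lengths of the list elements ('mx'). We use it to pad each element with NA via length<- (by default this pads at the end rather than the beginning), then order each vector on whether its elements are NA so the NAs move to the front, and rbind the list elements.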

lst1 <- strsplit(data$idstring, "|", fixed = TRUE)
mx <- max(lengths(lst1))
out <- do.call(rbind, lapply(lst1, function(x) {
  length(x) <- mx          # pad with NA at the end
  x[order(!is.na(x))]      # move the NAs to the front
}))
colnames(out) <- paste0("Level_", ncol(out):1)

-output

#    Level_5    Level_4    Level_3    Level_2    Level_1   
#[1,] NA         "PER20384" "PER49576" "PER10837" "PER92641"
#[2,] NA         "PER20384" "PER49576" "PER03875" "PER72534"
#[3,] NA         NA         "PER20384" "PER98642" "PER17134"
#[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234" 
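
To see why this works, here is a minimal sketch of the two steps on a single vector (the vector `x` and the names `padded` and `aligned` are made up for illustration). Assigning a longer length pads with NA at the end, and ordering on `!is.na(x)` moves those NAs to the front while leaving the remaining elements in their original order.

```r
# Hypothetical three-element vector, padded to length 5
x <- c("PER20384", "PER98642", "PER17134")
length(x) <- 5            # NAs are appended at the end
padded <- x

# !is.na(x) is FALSE for NA and TRUE otherwise; FALSE sorts first,
# and order() leaves ties in their original order
aligned <- x[order(!is.na(x))]
```

`padded` ends with two NAs, while `aligned` starts with them, so the last real element lands in the last position.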

Or another option is to read the columns with read.table and then modify the row values by moving the NA elements to the front

d1 <- read.table(text = data$idstring, sep = "|", header = FALSE, 
                 fill = TRUE, na.strings = c(""), col.names = paste0('Level_', 5:1))
# d1[] assigns into every column while keeping d1 a data.frame
d1[] <- t(apply(d1, 1, function(x) c(x[is.na(x)], x[!is.na(x)])))
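
Since the number of columns is not known in advance, the steps above could also be wrapped in a small base R helper that derives the column count from the data instead of hard-coding 5. This is only a sketch: `split_right_aligned`, its arguments and the sample vector `ids` are hypothetical names, not part of the original answer, and it assumes a non-empty input vector.

```r
# Hypothetical helper (not from the original answer): split each string on
# `sep` and right-align the pieces so the last piece is always in <prefix>1.
split_right_aligned <- function(x, sep = "|", prefix = "Level_") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  mx <- max(lengths(pieces))              # widest row decides the column count
  out <- do.call(rbind, lapply(pieces, function(p) {
    c(rep(NA_character_, mx - length(p)), p)   # pad with NA on the left
  }))
  colnames(out) <- paste0(prefix, mx:1)
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Sample input for illustration only
ids <- c("PER20384|PER49576|PER10837|PER92641",
         "PER20384|PER98642|PER17134",
         "PER20384|PER98623|PER17134|PER01836|PER1234")
res <- split_right_aligned(ids)
```

Padding on the left directly avoids the separate reordering step; the result can then be combined with the other columns, for example with `cbind(data["emp"], split_right_aligned(data$idstring))`.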
