R: split function returns no observations with a large dataset



I have a data frame like this:

seqnames       pos     strand    nucleotide     count
id1         12        +          A            13
id1         13        +          C            25
id2         24        +          G            10
id2         25        +          T            25
id2         26        +          A            10
id3         10        +          C            5

But it has more than 100,000 rows in total, and seqnames has 3138 levels. I want to split it into a list of data frames by seqnames, so I used the split function:

data_list <- split(data,data$seqnames)

But it only returns something like this:

List of 3138
$ id1:'data.frame':    0 obs. of  6 variables:
..$ seqnames  : Factor w/ 3138 levels "id1","id2",..: 
..$ pos       : int(0) 
..$ strand    : Factor w/ 3 levels "+","-","*": 
..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 
..$ count     : int(0) 
..$ sample_id : chr(0) 
$ id2:'data.frame':    0 obs. of  6 variables:
..$ seqnames  : Factor w/ 3138 levels "id1","id2",..: 
..$ pos       : int(0) 
..$ strand    : Factor w/ 3 levels "+","-","*": 
..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 
..$ count     : int(0) 
..$ sample_id : chr(0) 

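I don't understand why this happens, because I used split on a made-up data frame containing only numbers (with far fewer rows than this one, of course) and it worked fine. How can I fix this?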

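This happens simply because the column "seqnames" is a factor that has many unused levels. split has a drop argument (drop = TRUE; the default is FALSE) that removes the list elements for those levels. Otherwise, they are returned as data.frames with 0 rows. If we want to replace those elements with NULL instead, find the elements whose number of rows (nrow) is 0 and assign NULL to them. With the default drop = FALSE, the result looks like this: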

data_list <- split(data,data$seqnames)
> str(data_list)
List of 5
$ id1:'data.frame':    2 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 1 1
..$ pos       : int [1:2] 12 13
..$ strand    : chr [1:2] "+" "+"
..$ nucleotide: chr [1:2] "A" "C"
..$ count     : int [1:2] 13 25
$ id2:'data.frame':    3 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
..$ pos       : int [1:3] 24 25 26
..$ strand    : chr [1:3] "+" "+" "+"
..$ nucleotide: chr [1:3] "G" "T" "A"
..$ count     : int [1:3] 10 25 10
$ id3:'data.frame':    1 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 3
..$ pos       : int 10
..$ strand    : chr "+"
..$ nucleotide: chr "C"
..$ count     : int 5
$ id4:'data.frame':    0 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 
..$ pos       : int(0) 
..$ strand    : chr(0) 
..$ nucleotide: chr(0) 
..$ count     : int(0) 
$ id5:'data.frame':    0 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 
..$ pos       : int(0) 
..$ strand    : chr(0) 
..$ nucleotide: chr(0) 
..$ count     : int(0) 

Assign NULL to the empty elements:

data_list[sapply(data_list, nrow) == 0] <- list(NULL)

Check again:

> str(data_list)
List of 5
$ id1:'data.frame':    2 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 1 1
..$ pos       : int [1:2] 12 13
..$ strand    : chr [1:2] "+" "+"
..$ nucleotide: chr [1:2] "A" "C"
..$ count     : int [1:2] 13 25
$ id2:'data.frame':    3 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
..$ pos       : int [1:3] 24 25 26
..$ strand    : chr [1:3] "+" "+" "+"
..$ nucleotide: chr [1:3] "G" "T" "A"
..$ count     : int [1:3] 10 25 10
$ id3:'data.frame':    1 obs. of  5 variables:
..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 3
..$ pos       : int 10
..$ strand    : chr "+"
..$ nucleotide: chr "C"
..$ count     : int 5
$ id4: NULL
$ id5: NULL
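
If the empty elements are not needed at all, the drop argument mentioned earlier avoids creating them in the first place. The following is a minimal sketch on a small stand-in for the example data; the names toy and toy_list are made up for illustration and are not part of the original question.

```r
# Small stand-in data frame (hypothetical): seqnames is a factor
# with two levels, "id4" and "id5", that never occur in the rows.
toy <- data.frame(
  seqnames = factor(c("id1", "id1", "id2", "id2", "id2", "id3"),
                    levels = c("id1", "id2", "id3", "id4", "id5")),
  pos = c(12L, 13L, 24L, 25L, 26L, 10L),
  count = c(13L, 25L, 10L, 25L, 10L, 5L)
)

# drop = TRUE skips factor levels that do not occur in the data,
# so no 0-row data frames end up in the list.
toy_list <- split(toy, toy$seqnames, drop = TRUE)

names(toy_list)         # names of the groups that were kept
sapply(toy_list, nrow)  # number of rows in each group
```

Note that drop = TRUE only removes the empty list elements; the seqnames column inside each piece still carries all of the original factor levels.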

Data

data <- structure(list(seqnames = structure(c(1L, 1L, 2L, 2L, 2L, 
3L), .Label = c("id1", 
"id2", "id3", "id4", "id5"), class = "factor"), pos = c(12L, 
13L, 24L, 25L, 26L, 10L), strand = c("+", "+", "+", "+", "+", 
"+"), nucleotide = c("A", "C", "G", "T", "A", "C"), count = c(13L, 
25L, 10L, 25L, 10L, 5L)), row.names = c(NA, -6L), class = "data.frame")
