R - Reading in a list of files with inconsistent delimiter/fixed-width formatting



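I am trying to find a more efficient way to import a list of awkwardly structured data files. The files are generated by a software program and look as though they were meant to be printed and viewed rather than exported and used. Each file contains a list of "compounds", each followed by some associated data. After a line reading "Compound X: XXXX" there are several lines of tab-delimited data. Within a file, the number of rows per compound is constant, but it may differ from file to file.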

Here is some example data:

# Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
    "\nPrinted Mon March 28 14:54:39 2022\n",
    "\nCompound 1: One\n",
    "\tName\tID\tResult",
    "\n1\tA1234\tQC\t25.2",
    "\n2\tA4567\tQC\t26.8\n",
    "\nCompound 2: Two\n",
    "\tName\tID\tResult",
    "\n1\tA1234\tQC\t51.1",
    "\n2\tA4567\tQC\t48.6\n",
    file = "test1.txt")
cat("Quantify Compound Summary Report\n",
    "\nPrinted Mon March 28 14:54:39 2022\n",
    "\nCompound 1: One\n",
    "\tName\tID\tResult",
    "\n1\tC1234\tQC\t25.2",
    "\n2\tC4567\tQC\t26.8",
    "\n3\tC8910\tQC\t25.4\n",
    "\nCompound 2: Two\n",
    "\tName\tID\tResult",
    "\n1\tC1234\tQC\t51.1",
    "\n2\tC4567\tQC\t48.6",
    "\n3\tC8910\tQC\t45.6\n",
    file = "test2.txt")

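What I ultimately want is a list of data frames, one per "compound", each containing all rows of data associated with that compound. To get there I have a rather convoluted approach of functions hacked together that gives me what I want, but in a very unwieldy way.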

library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = "\\.txt$",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4) 
## Step 3: Identify the "compounds" in the data file output  
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: " 
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13))) 
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre 
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list) 
## Step 7: rbind the data based on the compound 
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))

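The challenges here are script efficiency and run time (I may import hundreds or thousands of data files with hundreds of lines each, which can take quite a while) as well as "cleanliness", which is why I tagged the question accordingly. I would also like this to be highly generalizable, since the "compounds" may change over time. If anyone can come up with a clean and efficient way to do all of this, I will be forever in your debt.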

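See one approach below. The whole pipeline may look intimidating at first glance. You can insert a head() or tail() call after any step (%>%) to display the current stage of the data transformation. The gsub calls do some cleanup with regular expressions: modify them as required.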

intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## drop the rowwise grouping so that later steps (e.g. fill) work across rows:
ungroup() %>%
## separate raw file contents into rows 
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\n\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\tName','Index\tName',raw))
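
To see what the two regex-based steps do in isolation, here is a minimal base R sketch run on a few hand-written lines. The vector `raw_lines` is a made-up stand-in for the `raw` column, not data from the real files.

```r
# Made-up sample lines standing in for the 'raw' column (an assumption)
raw_lines <- c("Compound 1: One",
               " \tName\tID\tResult",
               "1\tA1234\tQC\t25.2")

# Extract "Compound <n>" from header lines, NA everywhere else
compound <- ifelse(grepl("^Compound", raw_lines),
                   gsub(".*(Compound .*):.*", "\\1", raw_lines),
                   NA)

# Give the unlabeled first column a header so header and data rows
# have the same number of tab-separated fields
header <- gsub(" *\tName", "Index\tName", raw_lines[2])

print(compound)
print(header)
```

The NA values are what the later fill() step replaces with the most recent compound label.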

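The steps above produce a data frame with a column 'raw' that holds the cleaned data as strings suitable for conversion into tabular data (tab-delimited, one record per line). From there, we can either keep the future individual tables inside the parent table as a so-called list column (variant A), or split on column 'raw' and map over the pieces (variant B, credits to @Dorton).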

Variant A produces a column of data frames inside a data frame:

intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)

Variant B produces a list of data frames named after the respective compound:

intermediate_result %>%
split(f = as.factor(.$compound)) %>% 
lapply(function(x) x %>%
separate(raw,
into = unlist(
str_split(x$raw[1], pattern = "\t")),
## split on tabs only, so decimal points in the values are left intact:
sep = "\t")
)
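
For comparison, the same idea (tag every line with the most recent "Compound" header, then parse each group as a table) can be sketched in base R only. This is a minimal sketch under stated assumptions, not the approach of either post above: `parse_report` and the temporary sample files are hypothetical, and it assumes that data lines contain tabs, header lines start with "Compound ", and no data row ends in an empty field.

```r
# Hypothetical helper (not from the original posts): parse one report file
# into a named list of data frames, one per compound.
parse_report <- function(path) {
  lines <- readLines(path, warn = FALSE)
  is_header <- grepl("^Compound ", lines)
  compound_names <- trimws(sub("^Compound [^:]*:", "", lines[is_header]))
  # Running count of header lines: 0 before the first compound, then 1, 2, ...
  group <- cumsum(is_header)
  # Keep only tab-delimited lines that follow a compound header
  keep <- group > 0 & grepl("\t", lines)
  blocks <- split(trimws(lines[keep]), group[keep])
  names(blocks) <- compound_names[as.integer(names(blocks))]
  lapply(blocks, function(block) {
    # The header row has no label for the first column, so add one
    block[1] <- paste0("Index\t", block[1])
    read.table(text = block, sep = "\t", header = TRUE,
               stringsAsFactors = FALSE)
  })
}

# Two small sample files written to a temporary directory (illustrative data)
make_file <- function(ids, res1, res2) {
  rows <- function(res) paste(seq_along(ids), ids, "QC", res, sep = "\t")
  path <- tempfile(fileext = ".txt")
  writeLines(c("Quantify Compound Summary Report", "",
               "Printed Mon March 28 14:54:39 2022", "",
               "Compound 1: One", "\tName\tID\tResult", rows(res1), "",
               "Compound 2: Two", "\tName\tID\tResult", rows(res2), ""),
             path)
  path
}
files <- c(make_file(c("A1234", "A4567"), c(25.2, 26.8), c(51.1, 48.6)),
           make_file(c("C1234", "C4567", "C8910"),
                     c(25.2, 26.8, 25.4), c(51.1, 48.6, 45.6)))

# Parse every file, then row-bind the tables compound by compound
parsed <- lapply(files, parse_report)
compounds <- unique(unlist(lapply(parsed, names)))
merged <- lapply(setNames(compounds, compounds), function(cmp) {
  tables <- Filter(Negate(is.null), lapply(parsed, `[[`, cmp))
  do.call(rbind, tables)
})

str(merged)
```

Because the compound names are taken from the header lines of each file, nothing is hard-coded, so new compounds should be picked up without changes. Whether this is faster than the tidyverse pipeline on thousands of files would need to be measured.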
