R - Most efficient way to read key-value pairs where values span multiple lines



What is the fastest way to parse a text file such as the example below into a two-column data.frame and then convert it to wide format?

FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
   Chiesa, Luca Maria
   Brizzolari, Andrea
   Santaniello, Enzo
   Passero, Elena
   Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
   chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
   AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015

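Using readLines is problematic because the continuation lines of multi-line fields have no key. Reading the file as a fixed-width table does not work either. Suggestions? If it were not for the multi-line issue, this would be easy with a function that operates on each line/record, like this: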

x <- "FN Thomson Reuters Web of Science"
## Backslashes must be doubled inside R string literals
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
  key                          value
1  FN Thomson Reuters Web of Science

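Note: the field keys are always uppercase and two characters long. The whole title and the whole author list can each be concatenated into a single cell.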

This should work:

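library(zoo)
## Read the 2-character key and the rest of the line as two fixed-width columns
x <- read.fwf(file="tempSO.txt",widths=c(2,500),as.is=TRUE)
## Continuation lines have a blank key: mark it NA, then carry the last key forward
x$V1[x$V1=="  "] <- NA
x$V1 <- na.locf(x$V1)
## Paste together all pieces that share a key (one row per key)
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")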
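
The same fill-down idea can also be sketched without zoo. This is only an illustration, not part of the answer above: it assumes every record starts with a key line, uses a shortened inline copy of the example instead of a file, and the names is_key, group, long and wide are made up for the sketch.

```r
# Shortened inline sample; in practice use lines <- readLines("tempSO.txt")
lines <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "TI Improved determination of malonaldehyde by high-performance liquid",
  "   chromatography with UV detection as 2,3-diaminonaphthalene derivative",
  "PY 2015"
)

lines  <- lines[nzchar(trimws(lines))]  # drop blank lines
is_key <- grepl("^\\S\\S", lines)       # key lines start with two non-space characters
group  <- cumsum(is_key)                # continuation lines fall into the preceding key's group

key   <- substr(lines[is_key], 1, 2)
# Strip the 2-character key column, trim, and join the pieces of each group with a space
value <- tapply(trimws(substring(lines, 3)), group, paste, collapse = " ")

long <- data.frame(key = key, value = as.vector(value), stringsAsFactors = FALSE)

# Wide format: one column per key, one row for this single record
wide <- as.data.frame(as.list(setNames(long$value, long$key)), stringsAsFactors = FALSE)
```

Unlike aggregate, this keeps the keys in file order. It also assumes each key occurs only once per record.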

Here is another idea, which may be useful if you want to stay in base R:

parseEntry <- function(entry) {
    ## Split at beginning of each line that starts with a non-space character
    ll <- strsplit(entry, "\\n(?=\\S)", perl=TRUE)[[1]]
    ## Clean up empty characters at beginning of continuation lines
    ## (use " " as the replacement to keep a space between the joined pieces)
    ll <- gsub("\\n(\\s){3}", "", ll)
    ## Split each field into its two components
    read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}
## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)

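Use readLines to read the lines of the file into a character vector and append a colon to each key. The result is then in DCF format, so we can read it with read.dcf, the function used to read R package DESCRIPTION files. The result of read.dcf is wide, a matrix with one column per key. Finally we create long, a long data frame with one row per key: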

L <- readLines("myfile.dat")
## Turn "AU ..." into "AU: ..."; continuation lines start with spaces and are left alone
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
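
Because DCF treats blank lines as record separators, the same approach should extend to a file holding several records. The sketch below assumes the records are separated by blank lines, which the question does not state. The two shortened records are built from values in the example purely for illustration. The whitespace normalisation step is an addition, since read.dcf keeps a line break between joined continuation lines.

```r
# Two shortened, illustrative records separated by a blank line
L <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "PY 2015",
  "",
  "PT J",
  "VL 976",
  "PY 2015"
)

# Append a colon to each two-character key so the text is valid DCF
L <- sub("^(\\S\\S)", "\\1:", L)

con <- textConnection(L)
wide <- read.dcf(con)   # character matrix: one row per record, one column per key
close(con)

# Collapse any whitespace left between joined continuation lines into one space
wide[] <- gsub("\\s+", " ", wide)
```

Keys missing from a record come back as NA in that record's row, so wide stays rectangular even when records carry different fields.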
