Hi, I'm new here and also a beginner in R.
My question: if I have multiple files (test1.dat, test2.dat, ...) to process in R, I use this code to read them in:
filelist <- list.files(pattern = "*.dat")
df_list <- lapply(filelist, function(x) read.table(x, header = FALSE, sep = ","
,colClasses = "factor", comment.char = "",
col.names = "raw"))
Now I've run into a problem because the data are very large, and I found a solution that uses the sqldf package to speed things up:
sql <- file("test2.dat")
df <- sqldf("select * from sql", dbname = tempfile(),
file.format = list(header = FALSE, row.names = FALSE, colClasses = "factor",
comment.char = "", col.names ="raw"))
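It works well for a single file, but I can't adapt the code to read multiple files the way the first snippet does. Can someone help me? Thanks a lot. Momo
This seems to work (but I assume there is a faster SQL way to do it):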
sql.l <- lapply(filelist , file)
df_list2 <- lapply(sql.l, function(i) sqldf("select * from i" ,
dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)))
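For anyone who wants to try the list-of-files pattern without touching real data, here is a minimal base R sketch. The directory `dat_demo`, the names `demo_files`, `demo_list` and `combined`, and the tiny test files are all made up for illustration. It also shows that the `pattern` argument of `list.files()` is a regular expression, so `glob2rx()` is one way to pass a glob such as `*.dat` explicitly.

```r
# Work in a throwaway directory so no real files are touched (hypothetical path)
tmp_dir <- file.path(tempdir(), "dat_demo")
dir.create(tmp_dir, showWarnings = FALSE)

# Write two tiny comma-separated files with a header row
for (i in 1:2) {
  write.table(data.frame(a = 1:3, d = c("foo", "bar", "baz")),
              file.path(tmp_dir, paste0("test", i, ".dat")),
              sep = ",", row.names = FALSE, quote = FALSE)
}

# `pattern` is a regular expression; glob2rx() converts the glob "*.dat"
demo_files <- list.files(tmp_dir, pattern = glob2rx("*.dat"), full.names = TRUE)

# Read every file and name each list element after its file
demo_list <- lapply(demo_files, read.table, header = TRUE, sep = ",")
names(demo_list) <- basename(demo_files)

# Stack into one data frame, keeping the source file as a column
combined <- do.call(rbind, Map(function(df, f) cbind(file = f, df),
                               demo_list, names(demo_list)))
```

Naming the list elements after the files makes it easier to trace a row back to its source once everything is stacked.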
Have a look at the speeds. The test data are partly taken from mnel's post in "Quickly reading very large tables as dataframes in R":
library(data.table)
library(sqldf)
# test data
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
# write 5 files out
lapply(1:5, function(i) write.table(DT,paste0("test", i, ".dat"),
sep=",",row.names=FALSE,quote=FALSE))
Read: data.table
filelist <- list.files(pattern = "*.dat")
system.time(df_list <- lapply(filelist, fread))
# user system elapsed
# 5.244 0.200 5.457
Read: sqldf
sql.l <- lapply(filelist , file)
system.time(df_list2 <- lapply(sql.l, function(i) sqldf("select * from i" ,
dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE))))
# user system elapsed
# 35.594 1.432 37.357
Check: the results seem OK, except for the attributes
all.equal(df_list , df_list2)
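If only the values matter, one option is to compare the two lists element by element and tell `all.equal()` to ignore attributes. This is a rough sketch on made-up data: `list_a` and `list_b` are hypothetical stand-ins for `df_list` and `df_list2`, and I have not run it on the 1e6-row files above.

```r
# Two hypothetical lists of data frames with identical values
list_a <- list(data.frame(a = 1:3, d = c("foo", "bar", "baz")))
list_b <- list_a
attr(list_b[[1]], "note") <- "extra attribute"  # same values, one more attribute

# Compare each pair as plain data frames, ignoring attributes
same_values <- mapply(
  function(x, y) isTRUE(all.equal(as.data.frame(x), as.data.frame(y),
                                  check.attributes = FALSE)),
  list_a, list_b
)
same_values
```

Wrapping the call in `isTRUE()` matters because `all.equal()` returns a character vector describing the differences rather than `FALSE` when the objects differ.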
Somehow, lapply() did not work for me.
map_df() worked for combining 7000+ .dat files. The code below also skips the first line of each file and filters on column "V1":
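library(purrr)  # map_dfr()
library(dplyr)  # %>%, mutate_all(), filter()
rawDATfile.list <- list.files(pattern="*.DAT")
# Skip the first line of each file, convert every column to character,
# then keep only the rows where V1 is "B"
data <- rawDATfile.list %>%
  map_dfr(~read.delim(.x, header = FALSE, sep=";", skip=1, quote = "\"'") %>%
  mutate_all(as.character)) %>%
  filter(V1=="B")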
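The same idea can be sketched in base R if purrr and dplyr are not available. Everything here is illustrative: the directory `DAT_demo`, the two small files, and the names `read_one`, `all_rows` and `b_rows` are made up, and quoting is simply switched off with `quote = ""`, which may not suit every file.

```r
# Create two small semicolon-separated files in a throwaway directory
demo_dir <- file.path(tempdir(), "DAT_demo")  # hypothetical location
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("line to skip", "A;1", "B;2"), file.path(demo_dir, "one.DAT"))
writeLines(c("line to skip", "B;3", "C;4"), file.path(demo_dir, "two.DAT"))

# Regular expression matching names that end in ".DAT"
raw_files <- list.files(demo_dir, pattern = "\\.DAT$", full.names = TRUE)

# Read one file: skip the first line and read every column as character
read_one <- function(path) {
  read.delim(path, header = FALSE, sep = ";", skip = 1, quote = "",
             colClasses = "character")
}

# Stack all files, then keep the rows where V1 is "B"
all_rows <- do.call(rbind, lapply(raw_files, read_one))
b_rows <- all_rows[all_rows$V1 == "B", ]
```

Reading with `colClasses = "character"` plays the role of `mutate_all(as.character)`: it keeps column types consistent across files so the row-binding step does not trip over mismatched types.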