How can I merge files from different directories in R?



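I have a folder named simulations that contains 100 subfolders, each holding the results of one simulation. Within each subfolder, the results are saved in four separate files named seq[1].nex, seq[2].nex, seq[3].nex, and seq[4].nex. The files all share the same format, shown below: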

#NEXUS
Begin data;
Dimensions ntax=5 nchar=55;
Format datatype=Standard symbols="01" missing=? gap=-;
Matrix
L1   1100110010010100010110000110000010000100001011010010110
L2   1101110110011010010000010111000010010000001001010110110
L3   0111111100010100010011000001100011010100010010110011110
L4   1101110110011010010000010111000010010000001001010110110
L5   1101110100110100010110010110001010010100001011010110100
;
End;

The seq files all have the same number of rows (L1-L5), but the rows differ in length from file to file. For example, seq[2].nex looks like this:

#NEXUS
Begin data;
Dimensions ntax=5 nchar=20;
Format datatype=Standard symbols="012" missing=? gap=-;
Matrix
L1   10000012202011210001
L2   10002112212010210012
L3   10002112212210220022
L4   10002112212010220012
L5   10001112212010222012 
;
End;

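For each of the 100 subfolders, I want to merge seq[1].nex, seq[2].nex, seq[3].nex, and seq[4].nex into a single file, seq.nex. Starting from seq[1].nex, I want to append the data from the later files (2-4) to the corresponding rows of the first file. Using the two examples above, the output I want looks like this: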

#NEXUS
Begin data;
Dimensions ntax=5 nchar=55;
Format datatype=Standard symbols="01" missing=? gap=-;
Matrix
L1   110011001001010001011000011000001000010000101101001011010000012202011210001
L2   110111011001101001000001011100001001000000100101011011010002112212010210012
L3   011111110001010001001100000110001101010001001011001111010002112212210220022
L4   110111011001101001000001011100001001000000100101011011010002112212010220012
L5   110111010011010001011001011000101001010000101101011010010001112212010222012
;
End;

I then want to repeat this merge for each of the 100 subfolders. Is there a way to do this in R?

Here is one approach:

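library(data.table)

# Path to the simulations folder
pth_to_simulations = "simulations"

# List all subfolders, with full paths
fldrs = dir(pth_to_simulations, full.names = TRUE)

# Read the seq[1].nex ... seq[4].nex files in one subfolder and concatenate
# the sequences row by row
read_sims <- function(fldr) {
  # Match only the four input files; dir() returns them in alphabetical order
  files = dir(fldr, pattern = "^seq\\[[1-4]\\]\\.nex$", full.names = TRUE)
  sims = lapply(seq_along(files), function(i) {
    # skip = 6 must equal the number of lines before the L1 row in your files;
    # colClasses = "character" keeps leading zeros in short sequences
    dt = fread(files[i], skip = 6, nrows = 5, header = FALSE, colClasses = "character")
    # Give each sequence column a unique name so repeated merges do not clash
    setnames(dt, c("V1", paste0("seq", i)))
  })
  # merge() joins two tables at a time, so fold it over the list
  sims = Reduce(function(x, y) merge(x, y, by = "V1"), sims)
  sims[, .(V2 = paste0(c(.SD), collapse = "")), V1]
}

# Apply the function to each subfolder of `simulations`
lapply(fldrs, read_sims)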

If the example files are located in simulations/sim1, the result is:

[[1]]
V1                                                                          V2
1: L1 110011001001010001011000011000001000010000101101001011010000012202011210001
2: L2 110111011001101001000001011100001001000000100101011011010002112212010210012
3: L3 011111110001010001001100000110001101010001001011001111010002112212210220022
4: L4 110111011001101001000001011100001001000000100101011011010002112212010220012
5: L5 110111010011010001011001011000101001010000101101011010010001112212010222012

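This output is a list of length 1, because there is only one folder ('sim1'). Your output will be a list of length 100, with each element containing the concatenated data for one subfolder.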
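
The question also asks for a seq.nex file in every subfolder, which the list above does not produce on its own. The following is a minimal base R sketch of that last step, not part of the original answer. The helper name `merge_nexus` and its `out_name` argument are hypothetical, and the sketch assumes each file has a single `Matrix` line followed by the data rows and a line containing only `;`.

```r
# Hypothetical helper: merge seq[1].nex ... seq[4].nex in one subfolder into seq.nex.
# Assumes every file contains one "Matrix" line, then the data rows, then a ";" line.
merge_nexus <- function(fldr, out_name = "seq.nex") {
  # Match only the four input files; dir() returns them in alphabetical order
  files <- dir(fldr, pattern = "^seq\\[[1-4]\\]\\.nex$", full.names = TRUE)

  parsed <- lapply(files, function(f) {
    lines <- trimws(readLines(f))
    lines <- lines[nzchar(lines)]                      # drop blank lines
    start <- which(tolower(lines) == "matrix") + 1     # first data row
    end   <- which(lines == ";")[1] - 1                # last data row
    rows  <- lines[start:end]
    labels <- sub("\\s+.*$", "", rows)                 # e.g. "L1"
    seqs   <- sub("^\\S+\\s+", "", rows)               # the sequence itself
    list(header = lines[seq_len(start - 1)],
         footer = lines[(end + 1):length(lines)],
         seqs   = setNames(seqs, labels))
  })

  first  <- parsed[[1]]
  labels <- names(first$seqs)

  # Look each row up by label, so a label missing from any file raises an error
  merged <- vapply(labels, function(lab) {
    paste0(vapply(parsed, function(p) p$seqs[[lab]], character(1)), collapse = "")
  }, character(1))

  # Header and footer are copied unchanged from the first file
  out <- c(first$header, paste0(labels, "   ", merged), first$footer)
  writeLines(out, file.path(fldr, out_name))
  invisible(merged)
}
```

Under these assumptions, `invisible(lapply(list.dirs("simulations", recursive = FALSE), merge_nexus))` would write one seq.nex per subfolder. The Dimensions and Format lines are copied as-is from seq[1].nex, which matches the desired output shown in the question; if downstream software checks nchar or the symbols list, those two lines would need updating as well.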
