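I am trying to write a program that opens a large number of .mol files, copies specific information from each one, and saves it to a spreadsheet (a tab-delimited file, "\t").
I have 10000 mol files on my computer, named SN00000001, SN00000002, SN00000003 ... SN00010000.
(Download link => http://bioinf-applied.charite.de/supernatural_new/src/download_mol.php?sn_id=SN00000001)
I have two questions:
- I have tried the functions load.molecules (rcdk) and loadsdf (ChemmineR), but I did not manage to open the .mol files in R.
- Is it possible to open each .mol file and use R to save specific information, such as "ID", "Name" and "Molecular Formula", into a single spreadsheet?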
OK, I'll post the code for you:
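library(tidyverse)
library(ChemmineR)
# get the full path of your mol files
# the folder is already an absolute path, so it must not be joined with getwd()
mol_files <- list.files(path = "/Users/189919604/Desktop/Download SuperNatural II/SN00000001", # specify your folder here
                        pattern = "\\.mol$", # regular expression: file names ending in ".mol"
                        full.names = TRUE)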
# create tibble, with filenames (incl. the full path)
df <- tibble(filenames = mol_files)
# create function to extract all the information
extract_info <- function(sdfset) {
  # function to extract information from a sdfset (ChemmineR)
  # this only works if there is one molecule in the sdfset
  ID <- sdfset@SDF[[1]]@datablock["SNID"]
  Name <- sdfset@SDF[[1]]@header["Molecule_Name"]
  Molecular_Formula <- sdfset@SDF[[1]]@datablock["Molecular Formula"]
  sdf_info <- tibble(SNID = ID,
                     Name = Name,
                     MolFormula = Molecular_Formula)
  return(sdf_info)
}
# read all files and extract info
df <- df %>%
  mutate(sdf_data = map(.x = filenames,
                        .f = ~ read.SDFset(sdfstr = .x)),
         info = map(.x = sdf_data,
                    .f = ~ extract_info(sdfset = .x)))
# make a nice tibble with only the info you want
# (the list column created above is called info; there is no column named molecule)
all_info <- df %>%
  select(info) %>%
  unnest(info)
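# write to file; the delimiter must be the tab escape "\t", not the letter "t"
# (the path argument was renamed to file in readr 1.4.0)
write_delim(x = all_info,
            file = file.path(getwd(), "test.tsv"),
            delim = "\t")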
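This works, although I only tested it with 2 mol files. I use read.SDFset from the ChemmineR package to read all the mol files. I use the tidyverse package to work with tibbles. Tibbles are essentially data frames with some extra properties and features.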
library(tidyverse)
library(ChemmineR)
# get the full path of your mol files
# specify your folder here; on Windows also add your drive letter, e.g. "c:/users/path/to/my/mol_files"
mol_files <- list.files(path = "/home/rico/r-stuff/temp",
                        pattern = "\\.mol$", # regular expression: file names ending in ".mol"
                        full.names = TRUE)
# create tibble, with filenames (incl. the full path)
df <- tibble(filenames = mol_files)
# create function to extract all the information
extract_info <- function(sdfset) {
  # function to extract information from a sdfset (ChemmineR)
  # this only works if there is one molecule in the sdfset
  ID <- sdfset@SDF[[1]]@datablock["SNID"]
  Name <- sdfset@SDF[[1]]@header["Molecule_Name"]
  Molecular_Formula <- sdfset@SDF[[1]]@datablock["Molecular Formula"]
  sdf_info <- tibble(SNID = ID,
                     Name = Name,
                     MolFormula = Molecular_Formula)
  return(sdf_info)
}
# read all files and extract info
df <- df %>%
  mutate(sdf_data = map(.x = filenames,
                        .f = ~ read.SDFset(sdfstr = .x)),
         info = map(.x = sdf_data,
                    .f = ~ extract_info(sdfset = .x)))
# make a nice tibble with only the info you want
all_info <- df %>%
  select(info) %>%
  unnest(info)
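# write to file; the delimiter must be the tab escape "\t", not the letter "t"
# (the path argument was renamed to file in readr 1.4.0)
write_delim(x = all_info,
            file = file.path(getwd(), "temp", "test.tsv"),
            delim = "\t")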
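If the chemistry packages refuse to load the files at all (the first problem in the question), the same three fields can probably be read with base R alone. The following is only a rough sketch: read_mol_fields is a hypothetical helper, not part of ChemmineR or rcdk, and it assumes each file holds a single molecule in the usual MDL layout, with the molecule name on the first line and data items introduced by header lines such as "> <SNID>". I have not checked this against the actual SuperNatural II downloads.

```r
# Hypothetical helper (not from ChemmineR or rcdk): read the name line and the
# tagged data fields of a single-molecule .mol/.sdf file using base R only.
read_mol_fields <- function(path, fields = c("SNID", "Molecular Formula")) {
  lines <- readLines(path, warn = FALSE)
  # Assumption: the first line of the file holds the molecule name.
  name <- if (length(lines) > 0) trimws(lines[1]) else NA_character_
  values <- vapply(fields, function(field) {
    # A data item starts with a header line such as "> <SNID>";
    # the value is taken from the line directly below it.
    hit <- which(startsWith(lines, ">") &
                   grepl(paste0("<", field, ">"), lines, fixed = TRUE))
    if (length(hit) == 0 || hit[1] == length(lines)) return(NA_character_)
    trimws(lines[hit[1] + 1])
  }, character(1))
  # Named character vector: Name, then one entry per requested field.
  c(Name = name, values)
}

# Apply the helper to many files and write one tab-separated table.
write_mol_table <- function(paths, out_file) {
  rows <- do.call(rbind, lapply(paths, read_mol_fields))
  write.table(rows, file = out_file, sep = "\t",
              quote = FALSE, row.names = FALSE)
  invisible(rows)
}
```

With the mol_files vector from above, write_mol_table(mol_files, "test.tsv") would produce one row per file. Fields that are missing from a file come back as NA instead of stopping the whole run, which may matter with 10000 files.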