I'm seeing more and more CSV files that contain multiple sections, each holding its own table. For example, this file from 10x Genomics:
[gene-expression]
reference,/path/to/transcriptome
[libraries]
fastq_id,fastqs,feature_types
gex1,/path/to/fastqs,Gene Expression
mux1,/path/to/fastqs,Multiplexing Capture
[samples]
sample_id,cmo_ids
sample1,CMO301
sample2,CMO303
Sometimes the section headers are even padded out into full rows of their own (with trailing commas), e.g.
[gene-expression],,
reference,/path/to/transcriptome,
[libraries],,
fastq_id,fastqs,feature_types
gex1,/path/to/fastqs,Gene Expression
mux1,/path/to/fastqs,Multiplexing Capture
[samples],,
sample_id,cmo_ids,
sample1,CMO301,
sample2,CMO303,
Is there a Python module that handles this kind of sectioning directly? I can't see how to do it with pandas or the csv module. For example, from either of the two examples above, I would like to get a dictionary with one entry per section, where each section is a list of lists.
Some sections also have a header row; it would be great if that could be handled as well, e.g. something along the lines of csv.DictReader.
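To make this concrete, for the first example I would expect something roughly like this (section name mapped to its rows):

{'gene-expression': [['reference', '/path/to/transcriptome']],
 'libraries': [['fastq_id', 'fastqs', 'feature_types'],
               ['gex1', '/path/to/fastqs', 'Gene Expression'],
               ['mux1', '/path/to/fastqs', 'Multiplexing Capture']],
 'samples': [['sample_id', 'cmo_ids'],
             ['sample1', 'CMO301'],
             ['sample2', 'CMO303']]}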
While it isn't particularly hard to write a solution that parses this specific example, producing something that works in the general case is hard. For instance, parsing a plain CSV file is easily done with split, yet the csv module is 400+ lines of Python plus even more lines of C. So what I'm really looking for here is a module that handles this problem in general.
PS: This question is related, but unfortunately its answers don't address the csv-parser part of the problem.
You can use the configparser module to read your file:
from configparser import ConfigParser
import io
import pandas as pd

cfg = ConfigParser(allow_no_value=True)
cfg.optionxform = str  # keep the original case of the "keys" (i.e. the csv rows)
cfg.read('data.csv')

dfs = {}
for section in cfg.sections():
    buf = io.StringIO()
    buf.writelines('\n'.join(row.rstrip(',') for row in cfg[section]))
    buf.seek(0)
    dfs[section] = pd.read_csv(buf)
Output:
>>> dfs['gene-expression']
Empty DataFrame
Columns: [reference, /path/to/transcriptome]
Index: []
>>> dfs['libraries']
fastq_id fastqs feature_types
0 gex1 /path/to/fastqs Gene Expression
1 mux1 /path/to/fastqs Multiplexing Capture
>>> dfs['samples']
sample_id cmo_ids
0 sample1 CMO301
1 sample2 CMO303
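This works because, with allow_no_value=True, a row such as sample1,CMO301 contains neither = nor :, so configparser stores the whole line as an option name with no value (and optionxform = str keeps its case); iterating over a section therefore yields the original rows, and rstrip(',') removes the padding commas of the second format. A quick illustration, assuming data.csv contains the first example:

>>> list(cfg['samples'])
['sample_id,cmo_ids', 'sample1,CMO301', 'sample2,CMO303']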
You can now also extract just a single section:
cfg = ConfigParser(allow_no_value=True)
cfg.optionxform = str
cfg.read('data.csv')
def read_data(section):
    buf = io.StringIO()
    buf.writelines('\n'.join(row.rstrip(',') for row in cfg[section]))
    buf.seek(0)
    return pd.read_csv(buf)
df = read_data('samples')
Output:
>>> df
sample_id cmo_ids
0 sample1 CMO301
1 sample2 CMO303
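If you would rather have plain dictionaries (closer to the csv.DictReader behaviour mentioned in the question) than DataFrames, the same per-section rows can be fed to csv.DictReader instead of pd.read_csv. A minimal sketch along those lines (read_section_dicts is just an illustrative name):

import csv

def read_section_dicts(section):
    # each option name in the section is an original csv row; strip padding commas
    rows = [row.rstrip(',') for row in cfg[section]]
    return list(csv.DictReader(rows))

>>> read_section_dicts('samples')
[{'sample_id': 'sample1', 'cmo_ids': 'CMO301'}, {'sample_id': 'sample2', 'cmo_ids': 'CMO303'}]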
Here is a proposal using pandas that handles both formats:
import pandas as pd

df = (pd.read_fwf("input.txt", header=None, names=["data"])
        .assign(section=lambda x: x["data"].str.extract(r"\[(.*)\]").ffill())
      )

d_dfs = {  # type hint: Dict[str, pd.DataFrame]
    k: (g.iloc[1:, 0].str.split(",", expand=True)
         .pipe(lambda df_: df_.rename(columns=df_.iloc[0])
                              .drop(df_.index[0])))
    for k, g in df.groupby('section')
}
Output:
>>> print(d_dfs["libraries"])
fastq_id fastqs feature_types
4 gex1 /path/to/fastqs Gene Expression
5 mux1 /path/to/fastqs Multiplexing Capture
>>> print(d_dfs["samples"])
sample_id cmo_ids
8 sample1 CMO301
9 sample2 CMO303
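One caveat: with the second (comma-padded) format, the trailing commas survive the str.split, so every DataFrame ends up with one extra column whose name is the empty string. If that matters, a small post-processing step (not part of the proposal above) can drop it:

# drop the empty-named column created by the trailing commas, if any
d_dfs = {k: v.loc[:, v.columns != ""] for k, v in d_dfs.items()}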
Using the csv module from the standard library together with itertools.groupby(), handling the parsing of a file with sections like this is fairly easy:
import csv
import io
import itertools
s = """
[gene-expression]
reference,/path/to/transcriptome
[libraries]
fastq_id,fastqs,feature_types
gex1,/path/to/fastqs,Gene Expression
mux1,/path/to/fastqs,Multiplexing Capture
[samples]
sample_id,cmo_ids
sample1,CMO301
sample2,CMO303
"""
def is_header(l):
    return l.strip().startswith("[") and l.strip().endswith("]")
f = io.StringIO(s.strip())  # strip the blank lines around the triple-quoted string
grouped = itertools.groupby(f, is_header)
try:
    while True:
        # groups alternate: a section header line, then that section's rows
        _, header = next(grouped)
        header = list(csv.reader(header))[-1][0]
        _, section = next(grouped)
        section = list(csv.reader(section))
        print(header)
        print(section)
except StopIteration:
    pass
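If you want the dictionary the question asks for (and can't rely on the Python 3.10 pairwise variant shown next), the same loop can collect into a dict instead of printing; a sketch based on the code above:

data = {}
f = io.StringIO(s.strip())
grouped = itertools.groupby(f, is_header)
try:
    while True:
        _, header = next(grouped)
        name = list(csv.reader(header))[-1][0].strip("[]")
        _, section = next(grouped)
        data[name] = list(csv.reader(section))
except StopIteration:
    pass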
If you're on Python 3.10 or later, you can combine itertools.groupby() with itertools.pairwise(), which makes this even simpler:
s = """
[gene-expression]
reference,/path/to/transcriptome
[libraries]
fastq_id,fastqs,feature_types
gex1,/path/to/fastqs,Gene Expression
mux1,/path/to/fastqs,Multiplexing Capture
[samples]
sample_id,cmo_ids
sample1,CMO301
sample2,CMO303
"""
import csv
import io
import itertools
f = io.StringIO(s.strip())  # strip the blank lines around the triple-quoted string

def is_header(l):
    return l.strip().startswith("[") and l.strip().endswith("]")

grouped = itertools.groupby(f, is_header)
paired = itertools.pairwise(list(g) for k, g in grouped)
data = {
    header[-1].strip("[]\n"): list(csv.reader(section))
    for header, section in paired
    if is_header(header[-1])  # keep only (header, section) pairs, skip (section, header) ones
}
print(data)
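And if you also want the per-section header rows applied, as csv.DictReader does, swapping csv.reader for csv.DictReader in the comprehension is enough, since DictReader accepts any iterable of lines. A sketch (rebuilding the iterators, as paired is consumed above):

f = io.StringIO(s.strip())
grouped = itertools.groupby(f, is_header)
paired = itertools.pairwise(list(g) for k, g in grouped)
data = {
    header[-1].strip("[]\n"): list(csv.DictReader(section))
    for header, section in paired
    if is_header(header[-1])
}

Note that the gene-expression section, whose only row then acts as the field names, comes out as an empty list, mirroring the empty DataFrame from the configparser-based approach above.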