Getting CParserError: does pandas impose a limit on the maximum size of a value in a cell?



I have been trying to use pandas to analyze some genomic data. When reading the csv I get CParserError: Error tokenizing data. C error: out of memory, and I have narrowed it down to the specific row that causes it, namely 43452. As shown below, the error does not occur until the parser reads beyond row 43451.

I have also pasted the relevant rows from the less output further down, with the long sequences truncated; the second column (seq_len) shows the length of each sequence. As you can see, some of the sequences are quite long, several million characters (i.e., bases, in genomics terms). I am wondering whether the error is the result of a value in the csv being too large. Does pandas have a limit on the length of the value in a cell? If so, how large is it?

BTW, data.csv.gz is about 9G in size and uncompresses to just under 2 million rows. My system has over 100G of memory, so I doubt that physical memory is the cause.
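One way to pin a failure like this down to a single row is to bisect on nrows. A minimal sketch (the helper name first_failing_nrows is mine, and note that every probe re-parses the file from the start, so this is slow on a 9G archive):

import pandas as pd

def first_failing_nrows(path, lo, hi):
    # Invariant: reading lo rows succeeds and reading hi rows fails.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            pd.read_csv(path, compression='gzip', header=None,
                        names=['accession', 'seq_len', 'tax_id', 'seq'],
                        nrows=mid)
            lo = mid   # mid rows still parse; the failure is later
        except Exception:
            hi = mid   # mid rows already fail; the failure is at or before mid
    return hi          # the smallest nrows that fails

Given the two sessions below, first_failing_nrows('data.csv.gz', 0, 2000000) would come back with 43452.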

Reading 43451 rows succeeds

In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43451)

Reading 43452 rows fails

In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43452)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-1-036af96287f7> in <module>()
----> 1 import pandas as pd; df = pd.read_csv('filtered_gb_concatenated.csv.gz', compression='gzip', header=None, names=['accession', 'seq_len', 'tax_id', 'seq'], nrows=43452)
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473
--> 474         return _read(filepath_or_buffer, kwds)
    475
    476     parser_f.__name__ = name
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    254                                   " together yet.")
    255     elif nrows is not None:
--> 256         return parser.read(nrows)
    257     elif chunksize or iterator:
    258         return parser
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720
--> 721         ret = self._engine.read(nrows)
    722
    723         if self.options.get('as_recarray'):
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
   1168
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7952)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()
CParserError: Error tokenizing data. C error: out of memory

Output of less -N -S for rows 43450-43455, with the long sequences truncated. The first column is the line number, followed by the comma-separated csv content. The columns are ['accession', 'seq_len', 'tax_id', 'seq'].

43450 FP929055.1,3341681,657313,AAAGAACCTTGATAACTGAACAATAGACAACAACAACCCTTGAAAATTTCTTTAAGAGAA....
43451 FP929058.1,3096657,657310,TTCGCGTGGCGACGTCCTACTCTCACAAAGGGAAACCCTTCACTACAATCGGCGCTAAGA....
43452 FP929059.1,2836123,717961,GTTCCTCATCGTTTTTTAAGCTCTTCTCCGTACCCTCGACTGCCTTCTTTCTCACTGTTC....
43453 FP929060.1,3108859,245012,GGGGTATTCATACATACCCTCAAAACCACACATTGAAACTTCCGTTCTTCCTTCTTCCTC....
43454 FP929061.1,3114788,649756,TAACAACAACAGCAACGGTGTAGCTGATGAAGGAGACATATTTGGATGATGAATACTTAA....
43455 FP929063.1,34221,29290,CCTGTCTATGGGATTTGGCAGCGCAATGCAGGAAAACTACGTCCTAAGTGTGGAGATCGATGC....
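To check the raw field sizes independently of pandas, the same rows can be measured with the standard library alone. A minimal sketch, written against Python 2.7 to match the traceback above:

from __future__ import print_function
import gzip

with gzip.open('data.csv.gz') as fh:
    for i, line in enumerate(fh, start=1):
        if 43450 <= i <= 43455:
            # naive split; fine here since the fields contain no quoted commas
            fields = line.rstrip().split(',')
            # columns: accession, seq_len, tax_id, seq
            print(i, fields[0], fields[1], len(fields[3]))
        if i >= 43455:
            break

If the printed lengths agree with the seq_len column, the cells really do hold strings of a few million characters each.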

The last line says it all: there was not enough memory to tokenize the chunk of data. I am not sure how chunked reading of the archive works or how much data it loads into memory at a time, but clearly you have to control the chunk size somehow. I found a solution here:

pandas-read-csv-out-of-memory

and here:

out-of-memory-error-when-reading-csv-file-in-chunk
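Following those answers, the idea is to give pandas a fixed chunk size instead of letting it read everything in one go. A minimal sketch (the chunk size of 1000 rows is an arbitrary starting point; replace the per-chunk work with your own):

import pandas as pd

reader = pd.read_csv('data.csv.gz', compression='gzip', header=None,
                     names=['accession', 'seq_len', 'tax_id', 'seq'],
                     chunksize=1000)   # yields DataFrames of up to 1000 rows

for chunk in reader:
    # e.g. just look at the sequence lengths; substitute your own handling
    print(chunk['seq_len'].max())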

Also try reading the plain (uncompressed) file line by line, and see whether that works.
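A sketch of that line-by-line check with the csv module, so quoting is still honored. csv.field_size_limit is raised first, since its default of 128K is far below these multi-megabyte fields; the file name data.csv (the decompressed archive) is my assumption:

from __future__ import print_function
import csv
import sys

csv.field_size_limit(sys.maxsize)   # the default of 131072 would reject the long seq fields

with open('data.csv') as fh:        # assumed name of the decompressed file
    for i, row in enumerate(csv.reader(fh), start=1):
        if len(row) != 4:           # flag any row that does not tokenize cleanly
            print(i, len(row))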
