How to parse a large TSV file with Node and streams

I have retrieved a dump of IMDB data (thanks to http://www.omdbapi.com/ and a small donation) as a TSV file containing 1,111,073 lines. Each line represents one movie and looks like this:

ID  imdbID  Title   Year    Rating  Runtime Genre   Released    Director    Writer  Cast    Metacritic  imdbRating  imdbVotes   Poster  Plot    FullPlot    Language    Country Awards  lastUpdated
1   tt0000001   Carmencita  1894    NOT RATED   1 min   Documentary, Short      William K.L. Dickson        Carmencita      5.8 1100    http://ia.media-imdb.com/images/M/MV5BMjAzNDEwMzk3OV5BMl5BanBnXkFtZTcwOTk4OTM5Ng@@._V1_SX300.jpg    Performing on what looks like a small wooden stage, wearing a dress with a hoop skirt and white high-heeled pumps, Carmencita does a dance with kicks and twirls, a smile always on her face.   Performing on what looks like a small wooden stage, wearing a dress with a hoop skirt and white high-heeled pumps, Carmencita does a dance with kicks and twirls, a smile always on her face.       USA     2015-12-10 01:09:33.043000000

My goal is to visualize how film length has changed over time. To do that I need to build two arrays, one with the min/max values and one with the average per year (this is the format the Highcharts "area range and line" chart type expects). So I wrote a script that works fine on a small subset of the data, but unexpectedly throws an error when I try to read the whole file.
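
For concreteness, the two arrays I'm after would look roughly like this (the values here are purely illustrative, not taken from the data):

const ranges   = [[1894, 1, 120], [1895, 1, 95]];  // [year, min, max] per year
const averages = [[1894, 12.4], [1895, 9.8]];      // [year, average] per year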

It seems clear that streams should be able to help with this, but my experience with them is limited; this little project is really meant to help me dig into streams...

Here is the script as it currently stands:

https://gist.github.com/jfix/f79f011ce99d2049613c

If it would be better to show the whole script inline in my question, I can obviously add it.

This is the error that gets thrown:

$ node each.js
buffer.js:382
    throw new Error('toString failed');
    ^
Error: toString failed
    at Buffer.toString (buffer.js:382:11)
    at StringDecoder.write (string_decoder.js:129:21)
    at Parser._transform (/Users/jakob/Projects/imdb-film-length/node_modules/csv-parse/lib/index.js:154:26)
    at Transform._read (_stream_transform.js:167:10)
    at Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:292:12)
    at writeOrBuffer (_stream_writable.js:278:5)
    at Writable.write (_stream_writable.js:207:11)
    at /Users/jakob/Projects/imdb-film-length/node_modules/csv-parse/lib/index.js:46:14
    at doNTCallback0 (node.js:419:9)

Thanks for any pointers in the right direction...

I tried to reproduce your situation, and I got the same error just by running:

const csv = require('csv-parse');
csv(file, {delimiter: '\t', relax: true, columns: true}, (err, out) => { });

So it appears that the csv-parse module makes the process run out of memory, because the callback API accumulates the entire parsed result in one huge array. You probably want to use the stream API of the csv-parse module instead. An example is described here: http://csv.adaltas.com/parse/examples/
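
To make that concrete, here is a minimal sketch of the streaming approach, written against the classic csv-parse stream API. The file name movies.tsv, the per-year accumulator, and the output shape are my own illustrative assumptions, not part of your original script; the column names (Year, Runtime) come from the header shown above:

// Sketch: stream the TSV through csv-parse instead of buffering it all.
// 'movies.tsv' is a hypothetical file name -- adjust to your dump.
const fs = require('fs');
const parse = require('csv-parse');

const stats = {}; // per-year accumulator: { year: { min, max, sum, count } }

const parser = parse({ delimiter: '\t', relax: true, columns: true });

parser.on('readable', () => {
  let record;
  while ((record = parser.read()) !== null) {
    const year = parseInt(record.Year, 10);
    const runtime = parseInt(record.Runtime, 10); // "1 min" -> 1
    if (isNaN(year) || isNaN(runtime)) continue;  // skip malformed rows
    const s = stats[year] ||
      (stats[year] = { min: Infinity, max: -Infinity, sum: 0, count: 0 });
    if (runtime < s.min) s.min = runtime;
    if (runtime > s.max) s.max = runtime;
    s.sum += runtime;
    s.count += 1;
  }
});

parser.on('error', (err) => console.error(err.message));

parser.on('end', () => {
  // Build the two series: [year, min, max] for the range, [year, avg] for the line.
  const years = Object.keys(stats).sort((a, b) => a - b);
  const ranges = years.map((y) => [Number(y), stats[y].min, stats[y].max]);
  const averages = years.map((y) => [Number(y), stats[y].sum / stats[y].count]);
  console.log(JSON.stringify({ ranges, averages }));
});

fs.createReadStream('movies.tsv').pipe(parser);

Because the parser is a Transform stream, records are consumed as they arrive and only the small per-year summary stays in memory, so the 1.1-million-line file never has to fit into a single string or array.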
