I have 100 files, each of them 10 GB. I need to reformat the files and combine them into a more usable table format so that I can group, sum, average, etc. the data. Reformatting the data with Python would take more than a week. Even once I have reformatted it into a table, I don't know whether it will be too big for a dataframe, but one problem at a time.
Can anyone suggest a faster way to reformat the text files? I would consider anything: C++, Perl, etc.
Sample data:
Scenario: Modeling_5305 (0.0001)
Position: NORTHERN UTILITIES SR NT,
" ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",
"2017/12/31",16.0137 T,4.4194 % SEMI 30/360,0.4980 % SEMI 30/360,"6,934,452.0000 USD","6,884,052.0000 USD","7,000,000.0000 USD",371.6160 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8548 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,
"2018/01/12",15.5666 T,4.8499 % SEMI 30/360,0.4980 % SEMI 30/360,"6,477,803.7492 USD","6,418,163.7492 USD","7,000,000.0000 USD",356.9428 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8219 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,
Scenario: Modeling_5305 (0.0001)
Position: OLIVIA ISSUER TR SER A (A,
" ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",
"2017/12/31",1.3160 T,19.0762 % SEMI 30/360,0.2990 % SEMI 30/360,"3,862,500.0000 USD","3,862,500.0000 USD","5,000,000.0000 USD",2.3811 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.3288 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,
"2018/01/12",1.2766 T,21.9196 % SEMI 30/360,0.2990 % SEMI 30/360,"3,815,391.3467 USD","3,815,391.3467 USD","5,000,000.0000 USD",2.2565 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.2959 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,
I want to reformat it into this CSV table so that I can import it into a dataframe:
Position, Scenario, TimeSteps, THEO/Value
NORTHERN UTILITIES SR NT, Modeling_5305, 2018/01/12, 6477803.7492
OLIVIA ISSUER TR SER A (A, Modeling_5305, 2018/01/12, 3815391.3467
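To make the transformation concrete, here is a minimal sketch of it in C. Assumptions: the layout is exactly as in the sample above, THEO/Value is the fifth column (as in the sample header), and every timestep row is emitted, not only 2018/01/12. It uses plain fgets for clarity; the answers below discuss faster I/O.

#include <stdio.h>
#include <string.h>

#define MAXLINE 65536

/* Split a CSV line in place, honoring double quotes; returns the field count. */
static int split_csv(char *line, char **fields, int max_fields)
{
    int n = 0, in_quotes = 0;
    char *p = line, *start = line;
    while (*p && n < max_fields - 1) {
        if (*p == '"')
            in_quotes = !in_quotes;
        else if (*p == ',' && !in_quotes) {
            *p = '\0';
            fields[n++] = start;
            start = p + 1;
        }
        p++;
    }
    fields[n++] = start;
    return n;
}

/* Copy src to dst, dropping quotes and thousands separators and stopping at
   the first space, so "6,934,452.0000 USD" becomes 6934452.0000. */
static void clean_field(const char *src, char *dst)
{
    for (; *src; src++) {
        if (*src == '"' || *src == ',')
            continue;
        if (*src == ' ')
            break;
        *dst++ = *src;
    }
    *dst = '\0';
}

int main(void)
{
    char line[MAXLINE], scenario[256] = "", position[256] = "";
    char *fields[32], date[64], value[64];
    size_t len;

    puts("Position,Scenario,TimeSteps,THEO/Value");
    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\r\n")] = '\0';
        if (strncmp(line, "Scenario: ", 10) == 0) {
            sscanf(line + 10, "%255s", scenario);      /* keep just the name */
        } else if (strncmp(line, "Position: ", 10) == 0) {
            snprintf(position, sizeof position, "%s", line + 10);
            len = strlen(position);
            if (len && position[len - 1] == ',')       /* drop the trailing comma */
                position[len - 1] = '\0';
        } else if (line[0] == '"' && line[1] != ' ') { /* data row, not header row */
            if (split_csv(line, fields, 32) > 4) {
                clean_field(fields[0], date);          /* "2017/12/31" -> 2017/12/31 */
                clean_field(fields[4], value);         /* THEO/Value, number only */
                printf("%s,%s,%s,%s\n", position, scenario, date, value);
            }
        }
    }
    return 0;
}

Run it per file and append, e.g. ./reformat < file001.txt >> combined.csv (the file names are made up).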
When you have to manipulate large files, or a lot of files, there are two big bottlenecks. One is your file system, which is limited by the storage medium (HDD or SSD), the connection to that medium, and the operating system. Usually you cannot change this, but you have to ask yourself: what is my maximum speed? How fast can the system read and write? You will never be faster than that. A rough estimate of the maximum speed is the time it takes to read all the data plus the time it takes to write it all.
The other bottleneck is the library you use to make the changes. Not all Python packages are created equal, and there are huge speed differences between them. I suggest trying a few approaches on a small test sample until you find one that works for you.
Keep in mind that most file systems prefer large reads and writes. So you should try to avoid situations where you alternate between reading one line and writing one line. In other words, it is not only the library that matters, but also how you use it.
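For example, here is what block-wise I/O looks like in C (a minimal sketch; the block size and file names are placeholders, and the same pattern applies when driving a Python library):

#include <stdio.h>
#include <stdlib.h>

/* Copy/transform a file in large blocks instead of line by line.
   8 MB is an arbitrary example size; measure what works on your system. */
#define BLOCK (8 * 1024 * 1024)

int main(void)
{
    FILE *in = fopen("input.txt", "rb");     /* placeholder names */
    FILE *out = fopen("output.csv", "wb");
    char *buf = malloc(BLOCK);
    size_t n;

    if (!in || !out || !buf)
        return 1;
    while ((n = fread(buf, 1, BLOCK, in)) > 0) {
        /* ... transform buf[0..n) here (mind records that straddle blocks) ... */
        fwrite(buf, 1, n, out);              /* one big write per big read */
    }
    free(buf);
    fclose(out);
    fclose(in);
    return 0;
}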
Different programming languages, while they may offer a good library for this task and can be a good idea, will not speed up the process in any meaningful way (so you will not get a 10x speedup or anything like that).
I would use C/C++ with memory mapping.
With memory mapping you can walk over the data as if it were one big byte array (this also avoids copying the data from kernel space to user space, at least on Windows; I'm not sure about Linux).
For very large files you can map one chunk (e.g. 10 GB) at a time.
For writing, use a buffer (e.g. 1 MB) to collect the results, and write that buffer to the file with fwrite() each time it fills up.
Whatever you do, do not use streaming I/O or readline().
The whole process should not take longer (or at least not much longer) than copying the files on disk (or over the network, in case you are using network file storage).
If you have the option, write the data to a different (physical) disk than the one you are reading from.
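Below is a minimal skeleton of this approach, assuming Linux/POSIX mmap() (on Windows the equivalent would be CreateFileMapping/MapViewOfFile). The file names are placeholders and the actual reformatting is left as a stub; the point is that reads come straight out of the mapping and writes leave in 1 MB batches.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define OUTBUF (1024 * 1024)      /* 1 MB write buffer, as suggested above */

static char buf[OUTBUF];          /* output buffer (static: too big for the stack) */

int main(void)
{
    int fd = open("input.txt", O_RDONLY);            /* placeholder name */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* Map the whole file read-only; for files larger than your address
       space or RAM budget, map one chunk at a time via the offset argument. */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    FILE *out = fopen("output.csv", "wb");           /* placeholder name */
    if (!out)
        return 1;

    /* Walk the mapping as one big byte array, record by record. */
    size_t used = 0;
    const char *p = data, *end = data + st.st_size;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        size_t len = nl ? (size_t)(nl - p) + 1 : (size_t)(end - p);

        /* ... reformat the line [p, p + len) here instead of copying it ... */
        if (used + len > OUTBUF) {                   /* flush when the buffer fills;   */
            fwrite(buf, 1, used, out);               /* assumes no line exceeds OUTBUF */
            used = 0;
        }
        memcpy(buf + used, p, len);                  /* stub: pass the line through */
        used += len;
        p += len;
    }
    fwrite(buf, 1, used, out);                       /* flush the remainder */

    fclose(out);
    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}

The pass-through stub is where the actual field extraction would go; everything else is just the mmap-in, buffered-fwrite-out pattern described above.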