Replacing text in the first line of a huge tab-delimited txt file

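I have a huge text file (19 GB in size); it is a genetic data file with variables and observations.
The first line contains the variable names, structured as follows: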

id1.var1 id1.var2 id1.var3 id2.var1 id2.var2 id2.var3

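I need to swap id1, id2, etc. with the corresponding values from another text file (this file has about 7k lines). The ids are not in any particular order, and the file is structured as follows: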

oldId newIds
id1 rs004
id2 rs135

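I have done some googling, but could not really find a language that can do the following:

Read the first line
Replace the ids with the new ids
Remove the first line from the original file and replace it with the new one

Is this a good approach, or is there a better one?
Which language is best for accomplishing this?
We have people with experience in Python, VBScript and Perl.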

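The whole "replace" thing is possible in almost any language (I am sure about Python and Perl), as long as the replacement line has the same length as the original, or can be made the same length by padding with whitespace (otherwise, you will have to rewrite the whole file).

Open the file for reading and writing (r+ mode; w+ would truncate the file), read the first line, prepare the new line, seek to position 0 in the file, write the new line, close the file.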
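The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a tested tool for the real 19 GB file: the function name `overwrite_header_in_place` is made up for illustration, and it assumes the header is UTF-8 (or plain ASCII) and that the new header is no longer than the old one.

```python
def overwrite_header_in_place(path, new_header):
    """Overwrite the first line of `path` without touching the rest of the file.

    Hypothetical helper: only works when the encoded new header is no longer
    than the old one; a shorter header is padded with trailing spaces.
    """
    with open(path, "r+b") as f:              # r+b: read/write, no truncation
        old = f.readline()                    # first line, including its terminator
        if old.endswith(b"\r\n"):
            newline = b"\r\n"
        elif old.endswith(b"\n"):
            newline = b"\n"
        else:
            newline = b""                     # single-line file without terminator
        old_len = len(old) - len(newline)
        new = new_header.encode("utf-8")      # assumption: UTF-8/ASCII header
        if len(new) > old_len:
            raise ValueError("new header is longer than the old one")
        f.seek(0)
        # Pad with spaces so the byte length of the line stays identical.
        f.write(new.ljust(old_len) + newline)
```

Note that the padding ends up at the end of the last column name, so whatever reads the file afterwards has to tolerate (or strip) trailing spaces there. If the new ids are longer than the old ones, as in the rs004 example, this approach does not apply and the file has to be rewritten.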

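I suggest you use the Tie::File module, which maps the lines of a text file to a Perl array and makes rewriting the lines after the header a simple job.

This program demonstrates. It first reads all the old/new IDs into a hash, and then maps the data file using Tie::File. The first line of the file (in $file[0]) is modified using a substitution, and then the array is untied to rewrite and close the file.

You will need to change the file names I have used. Note also that I have assumed the IDs are always "word" characters (alphanumerics plus underscore) followed by a dot, with no spaces. Of course you should back up the file before modifying it, and you should test the program on a smaller file before updating the real one.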

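use strict;
use warnings;
use Tie::File;

# Read the old -> new ID pairs into a hash
my %ids;
open my $fh, '<', 'newids.txt' or die $!;
while (<$fh>) {
  my ($old, $new) = split;
  $ids{$old} = $new;
}

# Map the data file to an array, one element per line
tie my @file, 'Tie::File', 'datafile.txt' or die $!;

# Replace each ID (word characters followed by a dot) in the header line;
# IDs without a mapping are left unchanged
$file[0] =~ s<(\w+)(?=\.)><$ids{$1} // $1>eg;

# Untying flushes the changes and closes the file
untie @file;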

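This should be easy. I would use Python, because I am a Python fan. Outline:

Read the mapping file and save the mapping (in Python, use a dictionary).
Read the data file one line at a time, remap the variable names, and output the edited line.

You really can't edit the file in place... well, I suppose you could if every new variable name were always exactly the same length as the old one. But for ease of programming and safety at run time, it is best to always write a new output file and then delete the original. This means you will need at least 20 GB of free disk space before running this, but that should not be a problem.

Here is a Python program that shows how to do it. I used your sample data to make test files, and this seems to work.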

#!/usr/bin/python
import re
import sys
try:
    fname_idmap, fname_in, fname_out = sys.argv[1:]
except ValueError:
    print("Usage: remap_ids <id_map_file> <input_file> <output_file>")
    sys.exit(1)
# pattern to match an ID, only as a complete word (do not match inside another id)
# match start of line or whitespace, then match non-period until a period is seen
pat_id = re.compile(r"(^|\s)([^.]+)\.")
idmap = {}
def remap_id(m):
    before_word = m.group(1)
    word = m.group(2)
    if word in idmap:
        return before_word + idmap[word] + "."
    else:
        return m.group(0)  # return full matched string unchanged
def replace_ids(line, idmap):
    return re.sub(pat_id, remap_id, line)
with open(fname_idmap, "r") as f:
    next(f)  # discard first line with column header: "oldId newIds"
    for line in f:
        key, value = line.split()
        idmap[key] = value
with open(fname_in, "r") as f_in, open(fname_out, "w") as f_out:
    for line in f_in:
        line = replace_ids(line, idmap)
        f_out.write(line)
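Since only the first line contains variable names, a variant of the program above could remap just the header and copy the remaining lines unchanged in large binary chunks, which avoids running the regular expression over 19 GB of data. The following is a sketch under assumptions, not a drop-in replacement: the function names and the 16 MiB buffer size are illustrative, the header is assumed to be UTF-8 (or plain ASCII), and the pattern is tightened slightly so that an ID cannot contain whitespace.

```python
import re
import shutil

# An ID is a run of non-dot, non-whitespace characters at the start of the
# line or after whitespace, followed by a literal dot.
ID_PATTERN = re.compile(r"(^|\s)([^.\s]+)\.")

def load_idmap(path):
    """Read 'oldId newId' pairs, skipping the header line of the mapping file."""
    idmap = {}
    with open(path, "r") as f:
        next(f, None)                      # discard "oldId newIds"
        for line in f:
            parts = line.split()
            if len(parts) == 2:            # ignore blank or malformed lines
                idmap[parts[0]] = parts[1]
    return idmap

def remap_header(header, idmap):
    """Replace every known ID in the header; unknown IDs are left unchanged."""
    def repl(m):
        return m.group(1) + idmap.get(m.group(2), m.group(2)) + "."
    return ID_PATTERN.sub(repl, header)

def remap_file(fname_idmap, fname_in, fname_out):
    idmap = load_idmap(fname_idmap)
    with open(fname_in, "rb") as f_in, open(fname_out, "wb") as f_out:
        header = f_in.readline().decode("utf-8")   # assumption: UTF-8 header
        f_out.write(remap_header(header, idmap).encode("utf-8"))
        # Copy everything after the header as raw bytes, 16 MiB at a time.
        shutil.copyfileobj(f_in, f_out, 16 * 1024 * 1024)
```

Usage would look like `remap_file("newids.txt", "datafile.txt", "datafile_new.txt")`. As with the original program, this still needs enough free disk space for a second copy of the data file.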
