Reading and exporting a single column from a tab-separated file in Python

I have many large tab-separated files saved as .txt, each with seven columns and the following column titles:

#column_titles = ["col1", "col2", "col3", "col4", "col5", "col6", "text"]

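I simply want to extract the last column, named text, and save it to a new file, where each row is a row from the original file and every value is a string.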

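Edit: this is not a duplicate of a similar question, because splitlines() is not needed in my case. Only the order of the steps needed to be improved.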

Based on several other posts, this is my current attempt:

import csv
# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"
## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    reader = csv.reader(to_read, delimiter = "\t")
    desired_column = [6]        # text column
    for row in reader:
    myColumn = list(row[i] for i in desired_column)
with open(output_file, "wb") as tmp_file:
    writer = csv.writer(tmp_file)
for row in myColumn:
    writer.writerow(row)

All I get is the text field from row 2624 of my input file, with every letter of that string separated out:

H,o,w, ,t,h,e, ,t.e.a.m, ,d,i,d, ,T,h,u,r,s,d,a,y, ,-, ,s,e,e , ,h,e,r,e

I know that very little in the world of programming is random, but this is definitely strange!

This post is very similar to what I need, but it leaves out the writing and saving part, which I am also unsure about.

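I have looked into using the pandas toolbox (as suggested in one of the links above), but I cannot use it because of my Python installation, so please only post solutions that use csv or other built-in modules!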

You have to process the file one row at a time: read, parse, and write.

import csv
# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"
## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    # "wb" is the Python 2 idiom; on Python 3 use open(output_file, "w", newline="")
    with open(output_file, "wb") as tmp_file:
        reader = csv.reader(to_read, delimiter = "\t")
        writer = csv.writer(tmp_file)
        desired_column = [6]        # text column
        for row in reader:     # read one row at a time
            myColumn = list(row[i] for i in desired_column)   # build the output row (process)
            writer.writerow(myColumn) # write it
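
On Python 3, the same row-at-a-time idea might look like the sketch below. The function name extract_column and its parameters are made up for illustration and are not part of the original answer; both files are opened in text mode with newline="", as the csv module documentation recommends.

```python
import csv


def extract_column(input_path, output_path, column_index=6):
    """Copy one column of a tab-separated file to a new file, row by row.

    Hypothetical helper for illustration; assumes Python 3.
    """
    with open(input_path, newline="") as to_read, \
            open(output_path, "w", newline="") as tmp_file:
        reader = csv.reader(to_read, delimiter="\t")
        writer = csv.writer(tmp_file)
        for row in reader:
            # writerow expects a sequence of fields, so wrap the value in a list
            writer.writerow([row[column_index]])
```

Calling extract_column(input_file, output_file) with the two names defined above should produce the same file as the nested with blocks. Only one row is held in memory at a time, which matters for large inputs.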

I would go with this simple solution:

    text_strings = [] # empty list to store the last-column text
    with open('my_file') as ff:
        ss = ff.readlines() # read all lines into a list of strings
    for s in ss:
        # strip the trailing newline, then keep the last tab-separated field
        text_strings.append(s.rstrip('\n').split('\t')[-1])

    with open('out_file', 'w') as outf: # 'w' is required to open the file for writing
        outf.write('\n'.join(text_strings)) # write everything to the output file

With a list comprehension, you can turn the last columns of the ss strings into text_strings faster, and in a single line:

    text_strings = [k.rstrip("\n").split("\t")[-1] for k in ss]

Other simplifications are possible, but you get the idea.
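
Since the files are large, one such simplification might be to skip readlines() and iterate over the file object directly, so that only one line is in memory at a time. This is a minimal sketch assuming Python 3; last_column and copy_last_column are hypothetical names, not part of the answer above.

```python
def last_column(line, sep="\t"):
    """Return the last sep-separated field of a line, without its newline."""
    return line.rstrip("\n").split(sep)[-1]


def copy_last_column(in_path, out_path):
    """Stream the last column of in_path into out_path, one line at a time."""
    with open(in_path) as ff, open(out_path, "w") as outf:
        for line in ff:  # file objects yield their lines lazily
            outf.write(last_column(line) + "\n")
```

Note that a plain split, unlike the csv module, does not understand quoting, so this only works if no field contains an embedded tab.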

The problem in your code is in the following two lines:

        for row in reader:
        myColumn = list(row[i] for i in desired_column)

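First, the loop body is not indented, so nothing happens inside the loop. In fact, on my computer it raises an error, so this may just be a typo in the post. Even with the indentation fixed, each step of the for loop overwrites myColumn with the value from the new row, so in the end you are left with a single string from the last row of the file. Second, list applied to a string (as in your code) converts the string into a list of characters: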

    In [5]: s = 'AAAA'
    In [6]: list(s)
    Out[6]: ['A', 'A', 'A', 'A']

This is exactly what you see in your output.
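
The same splitting happens whenever a string is iterated, and csv's writerow iterates over whatever it is given. The sketch below (Python 3, writing to an in-memory io.StringIO buffer purely for illustration) contrasts passing a bare string with passing a one-element list:

```python
import csv
import io

text = "How"

# Passing a bare string: writerow treats each character as a separate field
buf_chars = io.StringIO()
csv.writer(buf_chars).writerow(text)
chars_output = buf_chars.getvalue()

# Passing a one-element list: the whole string is written as a single field
buf_field = io.StringIO()
csv.writer(buf_field).writerow([text])
field_output = buf_field.getvalue()

print(repr(chars_output))  # 'H,o,w\r\n'
print(repr(field_output))  # 'How\r\n'
```

So in the question's final loop, where each row taken from myColumn is itself a string, wrapping it as writer.writerow([row]) would keep the text in one field.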
