Python/IPython: strange, non-reproducible "list index out of range" error



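I've recently been learning some Python and how to apply it to my work. I've written a few scripts successfully, but I've hit a problem I can't figure out.

I'm opening a file with ~4000 lines, each with two tab-separated columns. While reading the input file, I get an IndexError saying the list index is out of range. However, although I get the error every time, it doesn't happen on the same line each time (as in, it throws the error on a different line every time!). So, for some reason, it works most of the time but then (seemingly) fails at random.

Since I only started learning Python last week, I'm stumped. I've looked around for the same problem but haven't found anything similar. I also don't know whether this is a language-specific or an IPython-specific problem. Any help would be greatly appreciated!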

input = open("count.txt", "r")
changelist = []
listtosort = []
second = str()
output = open("output.txt", "w")
for each in input:
    splits = each.split("\t")
    changelist = list(splits[0])
    second = int(splits[1])
    print second
    if changelist[7] == ";":
        changelist.insert(6, "000")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[8] == ";":
        changelist.insert(6, "00")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[9] == ";":
        changelist.insert(6, "0")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    else:
        #output.write(str("".join(changelist)))
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
output.close()

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/a/Desktop/sharedfolder/ipytest/individ.ins.count.test/<ipython-input-87-32f9b0a1951b> in <module>()
     57     splits = each.split("\t")
     58     changelist = list(splits[0])
---> 59     second = int(splits[1])
     60 
     61     print second
IndexError: list index out of range

Input:

ID=cds0;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C  50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C  36

Expected output:

ID=cds0000;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr  12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C  50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C  36

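The reason you're getting the IndexError is that your input file is apparently not entirely tab-delimited. That's why there is nothing at splits[1] when you try to access it.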
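To make the failure concrete, here is a minimal sketch. Both sample lines are made up for illustration (they are not taken from count.txt); the second one uses a space where the tab should be.

```python
tabbed = "ID=cds0;Name=NP_414542.1\t12\n"    # hypothetical well-formed line
untabbed = "ID=cds1;Name=NP_414543.1 12\n"   # hypothetical line with a space, no tab

tabbed_parts = tabbed.split("\t")
untabbed_parts = untabbed.split("\t")

print(len(tabbed_parts))    # 2
print(len(untabbed_parts))  # 1

# With only one element in the list, index 1 does not exist.
try:
    untabbed_parts[1]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)    # list index out of range
```

When the separator is missing, split() still returns a list, just one with a single element, so the error only appears at the point where index 1 is read.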

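Your code could use some refactoring. First of all, you're repeating yourself in the if-checks, which is unnecessary. The version below just pads cds0 to 7 characters, which is probably not exactly what you want. I threw it together to demonstrate how you could refactor your code to be more Pythonic and DRY. I can't guarantee it will work with your dataset, but I hope it helps you see how to do things differently.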

    to_sort = []
    # We can open two files using the with statement. This will also handle
    # closing the files for us when we exit the block.
    with open("count.txt", "r") as inp, open("output.txt", "w") as out:
        for each in inp:
            # Split at ';' so you don't have to worry about whether or not
            # the file is tab-delimited.
            changed = each.split(";")
            # Get the value you want. This is called unpacking.
            # The value before '=' will always be 'ID', so we don't really care about it.
            # _ is generally used as a variable name when the value is discarded.
            _, value = changed[0].split("=")
            # 0-pad the desired value to 7 characters. Python string formatting
            # makes this very easy. This will replace the current value in the list.
            # Note: '<' left-aligns the value, so the zeros are appended on the right.
            changed[0] = "ID={:0<7}".format(value)
            # Join the changed list with the original separator and
            # append it to the sort list.
            to_sort.append(";".join(changed))
        # Write the results to the file all at once. Your test data already
        # provides the newlines, so you can just write it out as it is.
        out.writelines(to_sort)
        # Do whatever else you need to do. Maybe to_sort.sort()?

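You'll notice that this reduces the code to 8 lines, yet it produces the same result on your sample input, doesn't repeat itself, and is much easier to follow.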
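One caveat on the padding: `{:0<7}` appends zeros on the right, whereas the original if/elif chain inserts them in front of the number. The two agree for cds0 but not for an ID such as cds12. A possible sketch of left-padding only the numeric part follows; the helper name `pad_id`, the regular expression, and the fixed `cds` prefix are assumptions based on the sample input, not part of the original code.

```python
import re

# Matches the leading "ID=cds<digits>" field. The "cds" prefix is an
# assumption taken from the sample input.
ID_PATTERN = re.compile(r"^(ID=cds)(\d+)")


def pad_id(line, width=4):
    """Left-pad the numeric part of the leading ID field with zeros."""
    return ID_PATTERN.sub(
        lambda m: m.group(1) + m.group(2).zfill(width), line, count=1
    )


print(pad_id("ID=cds0;Name=NP_414542.1\t12"))  # ID=cds0000;Name=NP_414542.1<tab>12
print(pad_id("ID=cds12;Name=x\t7"))            # ID=cds0012;Name=x<tab>7
print("ID={:0<7}".format("cds12"))             # ID=cds1200 (zeros on the right)
```

Lines that do not start with the expected ID field are returned unchanged, so the helper can be applied to every line without an extra check.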

Please read PEP 8 and the Zen of Python, and work through the official tutorial.

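This happens when there is a line in count.txt that does not contain a tab. When you split such a line on tabs, there is no splits[1], hence the "index out of range" error.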

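To find out which line causes the error, just add a print(each) after the splits assignment on line 57. The line printed right before the error message is your culprit. If your input file keeps changing, the error will show up at different positions. Change your script to handle such malformed lines.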
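As an alternative to printing every line, the check can be automated. The sketch below reports each line that does not parse, together with its line number. The function names `split_line` and `find_bad_lines` and the sample data are hypothetical, and whether a malformed line should be skipped, repaired, or treated as fatal depends on your data.

```python
def split_line(line):
    """Return (identifier, count), or None if the line is malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        return None
    try:
        return parts[0], int(parts[1])
    except ValueError:
        # The second column is present but is not an integer.
        return None


def find_bad_lines(lines):
    """Return (line_number, line) pairs for every line that fails to parse."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if split_line(line) is None
    ]


sample = [
    "ID=cds0;Name=NP_414542.1\t12\n",
    "ID=cds1;Name=NP_414543.1 7\n",  # hypothetical: space instead of a tab
    "\n",                            # hypothetical: blank line
]

for number, line in find_bad_lines(sample):
    print("line {}: {!r}".format(number, line))
```

For the real file, the same function can be fed the open file object, for example `with open("count.txt") as inp: bad = find_bad_lines(inp)`. Printing with `{!r}` makes tabs, spaces, and trailing newlines visible, which helps when the difference cannot be seen in an editor.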
