Python: accessing a "field" in a line

I have the following .txt file (a modified report generated in bash; the original report is in seqtable format):

  Start     End  Strand Pattern                                                     Sequence
  43392   43420       + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
  52037   52064       + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
 188334  188360       + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC

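I only want to access the elements under "Sequence", compare them with some variables, and delete the whole line if the comparison does not give the desired result (the comparison uses Levenshtein distance).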

But I can't even get started... :(

I am looking for something like the Linux -F option, to get directly to the "field" in the line for my comparison.

I came across re.split:

import re

with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)

which results in:

['  Start     End  Strand Pattern                                                     Sequence\n']
['\n']
['  43392   43420       + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
['  52037   52064       + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334  188360       + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']

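That is the closest I have come to "splitting the line into elements". I feel like I am heading in completely the wrong direction, but searching Stack Overflow and Google turned up nothing :(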

I have never worked with the seqtable format before, so I am still trying to figure out how to handle it.

Python is the main language I am learning and I am not that confident in bash, but a bash solution to this would also be fine.

I appreciate any hint/link/help :)

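The format itself seems to use blank lines as a delimiter between rows, and your r'\t' pattern has no effect here: it matches tab characters, but judging by the data you pasted the file does not use tabs as delimiters at all. It uses a varying number of spaces to pad the table.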
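To see the difference between the two splits, here is a minimal sketch. The sample row is copied from the question's data; the variable names are only illustrative.

```python
import re

# One data row copied from the question; the columns are padded with spaces, not tabs.
line = "  43392   43420       + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n"

# r"\t" matches a tab character; the row contains none, so it stays in one piece.
by_tab = re.split(r"\t", line)

# str.split() with no argument splits on runs of whitespace and drops empty strings.
fields = line.split()

print(len(by_tab))  # 1
print(fields[-1])   # TGATCGCACGCCGAATGGAAACACGTTTT
```

Because the pattern column contains no whitespace, the plain split() yields exactly five fields per data row.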

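To address both, you can read the file, treat the first line as the header (if you need it), then read the rest line by line, strip the leading and trailing whitespace, check whether any data is left and, if so, split it further on whitespace to get your row elements: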

with open("your_data", "r") as f:
    header = f.readline().split()  # read the first line as a header
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()  # split the data on whitespace to get your elements
            print(elements[-1])  # print the last element

This prints:

TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC

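As a bonus, since you have a header, you can turn it into a map and then use "proxied" access to get the element you want, so you don't have to worry about element positions: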

with open("your_data", "r") as f:
    # read the header and turn it into a value:index map
    header = {v: i for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()
            print(elements[header["Sequence"]])  # print the Sequence element

You can also use the header map to convert your rows into a dict structure for even easier access.

UPDATE: here is how to create a header map and then use it to build a dict for each row:

with open("your_data", "r") as f:
    # read the header and turn it into an index:value map
    header = {i: v for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            # split the line, iterate over it and use the header map to create a dict
            row = {header[i]: v for i, v in enumerate(line.split())}
            print(row["Sequence"])  # ... or you can append it to a list for later use

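As for how to delete the lines you don't want: you have to create a temporary file, loop through the original file, run your comparison on each row, write the rows you want to keep to the temporary file, delete the original file, and finally rename the temporary file to match your original file. For example: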

import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process


def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead


# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line-by-line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or is it an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use

shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file

This produces the same file minus the second data row, since its sequence ends in TC and our compare_func() returns False in that case.
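Since the question asks for a Levenshtein comparison, here is one way compare_func() could be swapped out. This is only a sketch in pure Python: REFERENCE and MAX_DISTANCE are hypothetical placeholders, not values from the question, and you would replace them with your own sequence and threshold.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (costs 0 if equal)
            ))
        previous = current
    return previous[-1]


REFERENCE = "TGATCGCACGCCGAATGGAAACACGTTTT"  # hypothetical sequence to compare against
MAX_DISTANCE = 5  # hypothetical threshold, pick what suits your data


def compare_func(seq):
    """Keep a row only if its sequence is within MAX_DISTANCE edits of REFERENCE."""
    return levenshtein(seq, REFERENCE) <= MAX_DISTANCE
```

The function has the same signature as the endswith() version, so it can be dropped into the temporary-file loop unchanged.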

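For less complexity, instead of using a temporary file you can load the whole source file into working memory and then overwrite it. That only works for files that fit in your working memory, though, whereas the approach above works for files as large as your free storage space.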
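A rough sketch of that in-memory variant might look like the following. The function name filter_in_memory and its keep parameter are made up for illustration; the blank-line handling mirrors the temporary-file version, and a "Sequence" column in the header is assumed.

```python
def filter_in_memory(path, keep):
    """Rewrite the file at `path`, dropping data rows whose Sequence fails keep()."""
    with open(path, "r") as f:
        lines = f.readlines()  # the whole file is held in memory at once
    if not lines:
        return  # nothing to do for an empty file

    header = {v: i for i, v in enumerate(lines[0].split())}  # value:index map
    kept = [lines[0]]  # always keep the header line
    blank = ""  # last blank separator line seen, re-emitted before each kept row
    for line in lines[1:]:
        row = line.split()
        if row:  # a data row
            if keep(row[header["Sequence"]]):
                kept.append(blank)
                kept.append(line)
        else:  # an empty separator line
            blank = line

    with open(path, "w") as f:  # overwrite the source file in place
        f.writelines(kept)


# Example call, assuming a compare_func like the one above:
# filter_in_memory("your_data", compare_func)
```

Note that the source file is truncated as soon as it is reopened for writing, so an interruption at that point loses data; the temporary-file approach does not have that weakness.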
