I have the following .txt file (an EMBOSS report modified with bash; the original report is in seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
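I only want to access the elements under "Sequence", compare them against some variables, and delete the whole line if the comparison does not give the desired result (the comparison uses Levenshtein distance). But I can't even get started... :(
I am looking for something like the Linux -F option that takes me directly to a "field" in the line so I can do my comparison.
I came across re.split: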
import re

with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
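That is the closest I got to "splitting the line into elements". I feel like I am heading in completely the wrong direction, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I am just trying to find a way to handle it.
Python is the main language I am learning and I am not that strong in bash, but a bash solution would be fine too. I appreciate any hint/link/help :)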
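The format itself seems to use blank lines as row delimiters, and your r'\t' does nothing here: it tells Python to split on tab characters, but judging by the data you pasted there are none. The table is padded with a variable number of spaces instead.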
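To see the difference on a single row, here is a small sketch using one line copied from the question. It assumes the row really is padded with spaces only, as the pasted data suggests.

```python
import re

# One data row as it appears in the question: space-padded, no tab characters.
line = " 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n"

tab_split = re.split(r"\t", line)  # no tabs in the line, so nothing is split
space_split = line.split()          # no argument: split on any run of whitespace

print(len(tab_split))    # 1
print(len(space_split))  # 5
print(space_split[-1])   # TGATCGCACGCCGAATGGAAACACGTTTT
```

Calling split() without arguments also discards the leading space and the trailing newline, so no separate cleanup is needed for a single row.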
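To solve both, read the file, treat the first line as a header (if you need it), then read the rest line by line: strip the leading and trailing whitespace, check whether any data is left and, if so, split it on whitespace to get your row elements: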
with open("your_data", "r") as f:
    header = f.readline().split()  # read the first line as a header
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()  # split the data on whitespace to get your elements
            print(elements[-1])  # print the last element

For the data above this prints:
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
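As a bonus, since you have a header, you can turn it into a map and then use "proxied" access to get the element you need, so you don't have to worry about element positions: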
with open("your_data", "r") as f:
    # read the header and turn it into a value:index map
    header = {v: i for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()
            print(elements[header["Sequence"]])  # print the Sequence element
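You can also use the header map to turn your rows into a dict structure for even easier access.
UPDATE: Here is how to create a header map and then use it to build a dict from each line: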
with open("your_data", "r") as f:
    # read the header and turn it into an index:value map
    header = {i: v for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            # split the line, iterate over it and use the header map to create a dict
            row = {header[i]: v for i, v in enumerate(line.split())}
            print(row["Sequence"])  # ... or you can append it to a list for later use
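If you want the rows for later use rather than printing them, the same idea can be wrapped in a generator. This is only a sketch: read_rows is a hypothetical helper name, and the sample lines are a shortened stand-in for the real file (the pattern column is abbreviated to "regex:x").

```python
def read_rows(lines):
    """Yield one dict per non-empty data row, keyed by the header names."""
    lines = iter(lines)
    header = next(lines).split()      # first line holds the column names
    for line in lines:
        fields = line.split()         # an empty list for blank lines
        if fields:
            yield dict(zip(header, fields))

# Sample text mimicking the layout in the question (blank lines between rows).
sample = [
    " Start End Strand Pattern Sequence\n",
    "\n",
    " 43392 43420 + regex:x TGATCGCACGCCGAATGGAAACACGTTTT\n",
    "\n",
    " 52037 52064 + regex:x TGACCCTGCTTGGCGATCCCGGCGTTTC\n",
]
rows = list(read_rows(sample))
print([row["Sequence"] for row in rows])
```

Because the function only iterates over its argument, an open file object can be passed in place of the sample list.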
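As for removing the lines you don't want, you have to create a temporary file, loop over your original file, run your comparison, write the lines you want to keep to the temporary file, delete the original file and finally rename the temporary file to match your original file, e.g.: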
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line-by-line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or is it an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use
shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This produces the same file minus the second row, since its sequence ends in TC and our compare_func() returns False in that case.
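Since the question asks for Levenshtein distance, here is one way compare_func() might look with it. This is a minimal pure-Python sketch, not a tuned implementation. REFERENCE and MAX_DISTANCE are hypothetical values chosen for illustration; substitute your own variables and threshold.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))         # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        current = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

REFERENCE = "TGATCGCACGCCGAATGGAAACACGTTTT"  # hypothetical reference sequence
MAX_DISTANCE = 3                              # hypothetical threshold

def compare_func(seq):
    """Keep a row only if its sequence is close enough to REFERENCE."""
    return levenshtein(seq, REFERENCE) <= MAX_DISTANCE

print(levenshtein("kitten", "sitting"))  # 3
print(compare_func(REFERENCE))           # True
```

The function keeps only two rows of the distance table at a time, so memory use grows with the length of one sequence rather than with the product of both lengths.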
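For less complexity, instead of using a temporary file you can load the whole source file into working memory and then overwrite it, but that only works for files that fit into your working memory, whereas the approach above works for files as large as your free storage space.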
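A rough sketch of that in-memory variant follows. filter_in_memory is a hypothetical helper name, and the code assumes the file has at least a header line containing a "Sequence" column. Unlike the temporary-file version, it keeps every blank line, including the ones around a removed row. The demo writes a small stand-in file to a temporary directory so that nothing real is overwritten.

```python
import os
import tempfile

def filter_in_memory(path, keep):
    """Rewrite the file at path, keeping only rows whose Sequence passes keep()."""
    with open(path, "r") as f:
        lines = f.readlines()                  # the whole file is held in memory
    header = {v: i for i, v in enumerate(lines[0].split())}
    kept = [lines[0]]                          # always keep the header line
    for line in lines[1:]:
        fields = line.split()
        # blank lines are kept as-is; data rows only if keep() approves
        if not fields or keep(fields[header["Sequence"]]):
            kept.append(line)
    with open(path, "w") as f:
        f.writelines(kept)

# Demo on a throwaway file with a shortened, made-up table.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "your_data")
    with open(path, "w") as f:
        f.write(" Start End Sequence\n\n 1 2 AAGG\n\n 3 4 AATC\n")
    filter_in_memory(path, lambda seq: not seq.endswith("TC"))
    with open(path) as f:
        result = f.read()
print(repr(result))
```

Note that the source file is overwritten in place, so an exception raised during the final write can leave it truncated. The temporary-file approach avoids that risk.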