What is the best way to search a text file for consecutive repeats and return the text with the highest number of repeats?



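I am very new to programming and have only been learning Python for a week.

For a class, I have to analyze a text DNA sequence, something like this: CTAGATAGATAGATAGATAGATGACTA

for these specific keys: AGAT, AATG, TATC

I have to keep track of the largest number of consecutive repeats for each key, disregarding all but the highest number of repeats.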

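I've been poring over previous Stack Overflow answers, and I saw groupby() suggested as a way to do this. I'm not exactly sure how to use groupby for my specific implementation needs, though.

It seems like I would have to read the text sequence from a file into a list. Can I import what is essentially a text string into a list? Do I have to separate all of the characters with commas? Will groupby work on a string?

It looks like groupby would give me the highest number of consecutive repeats, but in the form of a list. How would I get the highest result out of that list and store it somewhere else, without a programmer having to look at the result? Will groupby return the group with the most consecutive repeats first in the list, or will the groups be placed in the order in which they occurred?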

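Is there a function I can use to isolate and return the sequence with the highest number of repeats, so that I can compare it to the dictionary file I've been provided with?

Frankly, I could really use some help breaking down the groupby function.

My assignment suggested possibly using slicing to accomplish this, which somehow seems more daunting to try. If that is the way to go, please let me know; I wouldn't turn down pointers on how to do that.

Thanks in advance for any and all wisdom on this.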

Here is a solution similar to the previous post, but possibly more readable.

# The DNA Sequence
DNA = "CTAGATAGATAGATAGATAGATGACTAGCTAGATAGATAGATAGATAGATGACTAGAGATAGATAGATCTAG"
# All Sequences of Interest
elements = {"AGAT", "AATG", "TATC"}
# Add Elements to A Dictionary
maxSeq = {}
for element in elements:
    maxSeq[element] = 0
# Find Max Sequence for Each Element
for element in elements:
    i = 0
    curCount = 0
    # Ensure DNA Length Not Reached
    while i+4 <= len(DNA):
        # Sequence Not Being Tracked
        if curCount == 0:
            # Sequence Found
            if DNA[i: i + 4] == element:
                curCount = 1
                i += 4
            # Sequence Not Found
            else: i += 1

        # Sequence Is Being Tracked
        else:
            # Sequence Found
            if DNA[i: i + 4] == element:
                curCount += 1
                i += 4
            # Sequence Not Found
            else:
                # Check If Previous Max Was Beat
                if curCount > maxSeq[element]:
                    maxSeq[element] = curCount

                # Reset Count
                curCount = 0
                i += 1

    # Check If Sequence Was Being Tracked At End
    if curCount > maxSeq[element]: maxSeq[element] = curCount
# Display
print(maxSeq)

Output:

{'AGAT': 5, 'TATC': 0, 'AATG': 0}
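
The slicing idea used above can also be wrapped in a small reusable function. The following is only a sketch, not the answer author's code: the name `longest_run` is made up for illustration, and it assumes the sequence has already been read into an ordinary string. It tries every start position, so it also copes with keys of any length and with runs that do not begin where an earlier run was abandoned.

```python
def longest_run(sequence, key):
    """Return the largest number of back-to-back copies of key in sequence.

    Hypothetical helper for illustration; not part of the original answer.
    """
    if not key:
        return 0  # an empty key would otherwise match forever
    step = len(key)
    best = 0
    for start in range(len(sequence) - step + 1):
        count = 0
        pos = start
        # Slice out one key-sized window at a time while it keeps matching
        while sequence[pos:pos + step] == key:
            count += 1
            pos += step
        best = max(best, count)
    return best


DNA = "CTAGATAGATAGATAGATAGATGACTA"
counts = {key: longest_run(DNA, key) for key in ("AGAT", "AATG", "TATC")}
print(counts)  # {'AGAT': 5, 'AATG': 0, 'TATC': 0}
```

This trades speed for simplicity: each start position is checked independently, which is fine for short sequences but repeats work on long ones.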

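This doesn't look like a groupby problem, since you need multiple groups of the same key. It would be easier to just scan the list and count the keys.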

# all keys (keys are four chars each)
seq = "CTAGATAGATAGATAGATAGATGACTAGCTAGATAGATAGATAGATAGATGACTAGAGATAGATAGATCTAG"
# split key string into list of keys: ["CTAG","ATAG","ATAG","ATAG", ....]
lst = [seq[i:i+4] for i in (range(0,len(seq),4))]
lst.append('X')  # the while loop only tallies when next key found, so add fake end key
# these are the keys we care about and want to store the max consecutive counts
dicMax = { 'AGAT':0, 'AATG':0, 'TATC':0, 'ATAG':0 }  #dictionary of keys and max consecutive key count
# the while loop starts at the 2nd entry, so set variables based on first entry
cnt = 1
key = lst[0] #first key in list
if (key in dicMax): dicMax[key] = 1  #store first key in case it's the max for this key
ctr = 1 # start at second entry in key list (we always compare to previous entry so can't start at 0)
while ctr < len(lst):   #all keys in list
    if (lst[ctr] != lst[ctr-1]):  #if this key is different from previous key in list
        if (key in dicMax and cnt > dicMax[key]):  #if we care about this key and current count is larger than stored count
            dicMax[key] = cnt  #store current count as max count for this key
        #set variables for next key in list
        cnt = 0
        key = lst[ctr]
    ctr += 1  #list counter
    cnt += 1  #counter for current key

print(dicMax)  # max consecutive count for each key
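
For comparison, and since the question asks how groupby behaves, here is a rough sketch of the same fixed-chunk tally written with `itertools.groupby`. The function name `max_chunk_runs` and its `size` parameter are invented for this example. groupby accepts any iterable (a string or a list), and it yields one (key, group) pair per run in the order the runs appear; it does not sort them or put the longest first, so the maximum still has to be tracked explicitly.

```python
from itertools import groupby


def max_chunk_runs(seq, keys, size=4):
    """Longest run of each key among fixed-size chunks of seq.

    Hypothetical helper; like the loop above, it only sees keys that
    line up with chunk boundaries (0, size, 2*size, ...).
    """
    chunks = [seq[i:i + size] for i in range(0, len(seq), size)]
    best = {key: 0 for key in keys}
    # groupby emits one group per run of equal neighbouring chunks
    for chunk, group in groupby(chunks):
        run_length = sum(1 for _ in group)  # the group is an iterator, so count it
        if chunk in best and run_length > best[chunk]:
            best[chunk] = run_length
    return best


seq = "CTAGATAGATAGATAGATAGATGACTAGCTAGATAGATAGATAGATAGATGACTAGAGATAGATAGATCTAG"
print(max_chunk_runs(seq, ['AGAT', 'AATG', 'TATC', 'ATAG']))
```

Because the same key can form several separate runs, each key may show up in more than one group, which is why the result is kept in a dictionary of running maxima.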
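
Raiyan Chowdhury pointed out that the sequences may overlap, so splitting the base sequence into four-character strings may not work. In that case we need to search for each string separately. Note that this algorithm is not efficient, but it is readable for a new programmer.
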
seq = "CTAGATAGATAGATAGATAGATGACTAGCTAGATAGATAGATAGATAGATGACTAGAGATAGATAGATCTAG"
dicMax = { 'AGAT':0, 'AATG':0, 'TATC':0, 'ATAG':0 }  #dictionary of keys and max consecutive key count
for key in dicMax: #each key, could divide and conquer here so all keys run at same time
    for ctr in range(1,9999):  #keep adding key to itself ABC > ABCABC > ABCABCABC
        s = key * ctr  #create string by repeating key  "ABC" * 2 = "ABCABC"
        if (s in seq):   # if repeated key found in full sequence
            dicMax[key]=ctr   # set max (repeat) count for this key
        else:
            break  # exit inner for #done with this key

print(dicMax)  #max consecutive key counts
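
As an alternative to building ever-longer strings, the same per-key search could be expressed with a regular expression. This is a different technique from the answer above, shown only as a sketch; the function name `longest_run_regex` is made up. It uses a lookahead so that a run is measured from every possible start position, which keeps the result correct even for keys that overlap with themselves.

```python
import re


def longest_run_regex(sequence, key):
    """Largest number of consecutive copies of key found anywhere in sequence.

    Hypothetical helper for illustration only.
    """
    if not key:
        return 0
    # (?=...) is a zero-width lookahead, so finditer visits every position;
    # the inner group captures the greedy run of whole keys starting there.
    pattern = re.compile(r'(?=((?:%s)+))' % re.escape(key))
    return max(
        (len(m.group(1)) // len(key) for m in pattern.finditer(sequence)),
        default=0,  # no match at all means zero repeats
    )


seq = "CTAGATAGATAGATAGATAGATGACTA"
for key in ("AGAT", "AATG", "TATC"):
    print(key, longest_run_regex(seq, key))
```

`re.escape` is not strictly needed for plain DNA letters, but it keeps the sketch safe if a key ever contains characters that are special in regular expressions.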
