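I have been working on this problem for a long time, but I can't seem to get it right. I have tried everything I know, and I really need help finding out what is wrong.
I have two files. File 1 is a CSV listing names and sequence counts, i.e.: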
name,AGAT,AATG,TATC
Jake,28,42,14
Chris,17,22,19
Anne,36,18,25
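File 2 is a DNA string: "gctaaatttttcagccagatgtaggctacaatcaagctgtccgctcggcacggcctaccacgt"
The idea is to implement a program that identifies a person from their DNA: read file 2 and count how many times each sequence listed in file 1 occurs in it. If the counts match a row of file 1, return that row's name. Unfortunately, I can't seem to get the correct totals for the second file.
This is what I have so far: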
```python
import csv
import re
from collections import Counter
from sys import argv

with open(argv[1], 'r') as csvfile:
    csvfile_data = csv.reader(csvfile)
    next(csvfile_data)  # skip the header line
    for row in csvfile_data:
        list_temp = row
        # copy elements into a new list
        temp = []
        temp.extend(list_temp)
        # remove the first element, because it's the name
        name = temp.pop(0)
        # the values attached to the name
        csvlist = temp
        # change strings in list to integers
        csvlist = [int(i) for i in csvlist]
        # open the DNA sequence as well
        with open(argv[2], 'r') as dnafile:
            dnafile_data = dnafile.read()
        # use regular expressions to find each sequence's occurrences in the file
        patterns = re.compile(r'AGATC|TTTTTTCT|AATG|TCTAG|GATA|TATC|GAAA|TCTG')
        result = re.findall(patterns, dnafile_data)
        # count each sequence's occurrences
        dictionary = Counter(result)
        # split the keys and values into a new list
        dnalist = dictionary.values()
        print(dnalist)
        if Counter(csvlist) == Counter(dnalist):
            print(name)
        else:
            print("No match")
```
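For comparison, here is a minimal sketch of the matching described in the question. It is not the asker's code: it takes the sequences from the CSV header instead of a hard-coded pattern, and it keeps each count tied to its sequence rather than comparing bare lists of numbers. The helper names `count_sequences` and `find_match` are hypothetical, the CSV is passed as text so the example is self-contained, and upper-casing the DNA is an assumption made because the sample string is lowercase while the headers are uppercase.

```python
import csv
import io


def count_sequences(dna, sequences):
    """Count non-overlapping occurrences of each sequence in dna."""
    dna = dna.upper()  # assumption: letter case should not matter
    return {seq: dna.count(seq) for seq in sequences}


def find_match(csv_text, dna):
    """Return the first name whose counts all equal those found in dna, else None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sequences = reader.fieldnames[1:]  # every header except the name column
    found = count_sequences(dna, sequences)
    for row in reader:
        if all(int(row[seq]) == found[seq] for seq in sequences):
            return row["name"]
    return None


people = "name,AGAT,AATG,TATC\nJake,28,42,14\nChris,17,22,19\nAnne,36,18,25\n"
dna = "gctaaatttttcagccagatgtaggctacaatcaagctgtccgctcggcacggcctaccacgt"

counts = count_sequences(dna, ["AGAT", "AATG", "TATC"])
match = find_match(people, dna)
print(counts)
print(match if match is not None else "No match")
```

With real files, the same functions could be fed `open(argv[1]).read()` and `open(argv[2]).read().strip()`.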
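You can use the logic behind a simple product recommendation engine, for example:
import pandas as pd

def sequence(string):
    # count non-overlapping occurrences of each sequence
    count_AGAT = string.count('AGAT')
    count_AATG = string.count('AATG')
    count_TATC = string.count('TATC')
    print(count_AGAT)
    print(count_AATG)
    print(count_TATC)
    # one-row DataFrame for the unknown sample, labelled '01'
    data_dna = {'Name': ['01'],
                'AGAT': [count_AGAT],
                'AATG': [count_AATG],
                'TATC': [count_TATC]}
    df_dna = pd.DataFrame(data_dna)
    print(df_dna)
    return df_dna  # returned so the row can be used outside the function

df_dna = sequence('TCATCTAGGAGGCGCGCGTAGGATAAATAATTCAATTAAGATGTCGTTTTGC...')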
This prints the three counts followed by the one-row DataFrame, for example:
40
31
42
  Name  AGAT  AATG  TATC
0   01    40    31    42
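Then append the new row to the DataFrame that already holds the data from file 1:
# df is assumed to already hold the rows of file 1 (for example, loaded with pd.read_csv)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df_dna], ignore_index=True)
print(df)
df = df.drop(columns='Name')
print(df)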
The output is:
    Name  AGAT  AATG  TATC
0   Jake    28    42    14
1  Chris    17    22    19
2   Anne    36    18    25
3     01    40    31    42
   AGAT  AATG  TATC
0    28    42    14
1    17    22    19
2    36    18    25
3    40    31    42
Save the rows in separate variables:
df_jake = df.iloc[0]
df_chris = df.iloc[1]
df_anne = df.iloc[2]
df_sequence = df.iloc[3]
print(df_jake)
This prints:
AGAT    28
AATG    42
TATC    14
Name: 0, dtype: int64
Then build a recommendation engine based on collaborative filtering (help is available here: https://realpython.com/build-recommendation-engine-collaborative-filtering/) using spatial.distance.euclidean:
from scipy import spatial
# Euclidean distance between the unknown sample and each known person
diff_jake = spatial.distance.euclidean(df_sequence, df_jake)
diff_chris = spatial.distance.euclidean(df_sequence, df_chris)
diff_anne = spatial.distance.euclidean(df_sequence, df_anne)
print('Jake: ', diff_jake)
print('Chris: ', diff_chris)
print('Anne: ', diff_anne)
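In this example, the output is:
Jake:  32.38826948140329
Chris:  33.74907406137241
Anne:  21.77154105707724
The smallest distance is Anne's, so the provided DNA sequence is most similar to hers. Note that this is a nearest-neighbour result, not an exact match of the counts.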
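The per-person comparison can also be written as a loop that picks the smallest distance. The following is only a sketch: it uses `math.dist` from the standard library (Python 3.8+) in place of SciPy so that it runs without extra dependencies, and the names `profiles`, `sample`, and `closest` are made up for the illustration. The numbers are the counts shown in the tables above.

```python
import math

# Counts per person, copied from the tables above.
profiles = {
    "Jake": [28, 42, 14],
    "Chris": [17, 22, 19],
    "Anne": [36, 18, 25],
}
sample = [40, 31, 42]  # counts found in the unknown sequence

# Euclidean distance from the sample to every known profile.
distances = {name: math.dist(sample, counts) for name, counts in profiles.items()}

# The person with the smallest distance is the nearest neighbour.
closest = min(distances, key=distances.get)

for name, value in distances.items():
    print(f"{name}: {value:.4f}")
print("Closest:", closest)
```

Keeping the profiles in a dictionary means that adding a fourth person only requires one more entry, not another variable and another `print`.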
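You can use scipy.spatial.distance.euclidean to calculate the distance between two points. Using it to calculate the distances from A, B, and D to C shows that, in terms of distance, C's ratings are closest to B's: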
>>> spatial.distance.euclidean(c, a)
2.5
>>> spatial.distance.euclidean(c, b)
0.5
>>> spatial.distance.euclidean(c, d)
2.23606797749979
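The interactive session above does not show how `a`, `b`, `c`, and `d` were defined. As an illustration only, the coordinates below are assumed values chosen because they reproduce those three distances; they are not taken from this post. The sketch uses `math.dist` from the standard library, which computes the same Euclidean distance.

```python
import math

# Assumed coordinates (hypothetical, chosen to reproduce the distances above).
a = [1, 2]
b = [2, 4]
c = [2.5, 4]
d = [4.5, 5]

print(math.dist(c, a))  # distance from C to A
print(math.dist(c, b))  # distance from C to B
print(math.dist(c, d))  # distance from C to D
```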
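You can also use the cosine distance between the vectors.
To calculate similarity using the angle, you need a function that returns a higher similarity (or smaller distance) for a smaller angle and a lower similarity (or larger distance) for a larger angle. The cosine of an angle is such a function: it decreases from 1 to -1 as the angle increases from 0 to 180 degrees.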
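A minimal sketch of that idea, written with the standard library only. The functions `cosine_similarity` and `cosine_distance` are illustrative helpers, not part of any library; SciPy offers `scipy.spatial.distance.cosine`, which returns the distance (1 minus the cosine) directly. The count vectors are the ones from the tables above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 for the same direction, -1 for opposite."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance that grows as the angle grows: 0 for the same direction, 2 for opposite."""
    return 1.0 - cosine_similarity(u, v)


profiles = {
    "Jake": [28, 42, 14],
    "Chris": [17, 22, 19],
    "Anne": [36, 18, 25],
}
sample = [40, 31, 42]

cos_distances = {name: cosine_distance(sample, counts) for name, counts in profiles.items()}
closest_by_angle = min(cos_distances, key=cos_distances.get)

for name, value in cos_distances.items():
    print(f"{name}: {value:.4f}")
print("Closest by angle:", closest_by_angle)
```

Unlike the Euclidean distance, this measure ignores the length of the vectors, so two profiles with the same proportions but different totals count as identical.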