Calculating the mean coverage for each gene



I have two files: file 1 (below) has the bp start and stop coordinates

Chrom  Gene start (bp)  Gene end (bp)
1      50902700         50902978
1      103817769        103828355

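This should solve your problem. It relies heavily on the data being consistent, meaning that every entry (chrom, pos) in file 2 has a corresponding (chrom, s, e) in file 1. If that is not the case, you will have to add extra checks in the inner while loop.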

import numpy as np
import pandas as pd

# Import the gene start/end file
df_gene = pd.read_csv('gene_list.csv')
# Import the exome coverage file (tab-separated)
df_data = pd.read_csv('exomes.coverage.summary.tsv', sep='\t')

# Sorted (chrom, (pos, mean)) tuples from file 2
chroms_f2 = df_data['Chrom'].to_list()
positions = df_data['pos'].to_list()
means = df_data['mean'].to_list()
f2_as_list = sorted(zip(chroms_f2, zip(positions, means)))

# Sorted (chrom, (start, end)) tuples from file 1
starts = df_gene['Gene start (bp)'].to_list()
ends = df_gene['Gene end (bp)'].to_list()
chroms_f1 = df_gene['Chrom'].to_list()
f1_as_list = sorted(zip(chroms_f1, zip(starts, ends)))

### Looping:
rows = []  # one [chrom, start, end, mean coverage] row per gene
i1 = 0
c1, (s, e) = f1_as_list[i1]
list_mean = []
for c2, (p, m) in f2_as_list:
    if not (c1 == c2 and s <= p <= e):
        # The position has left the current gene: store its mean ...
        rows.append([c1, s, e, np.mean(list_mean)])
        list_mean = []
        # ... then advance to the gene that contains the position
        while not (c1 == c2 and s <= p <= e):
            i1 += 1
            c1, (s, e) = f1_as_list[i1]
    list_mean.append(m)
# Store the last gene, which the loop above never flushes
rows.append([c1, s, e, np.mean(list_mean)])

# Rows are collected in a list because DataFrame.append was removed in pandas 2.0
key_cols = ['Chrom', 'Gene start (bp)', 'Gene end (bp)']
df_mean = pd.DataFrame(rows, columns=key_cols + ['mean coverage'])

### Add mean coverage to gene dataframe
# Merge on the gene coordinates: df_mean is in sorted order, df_gene may not be
df_gene = df_gene.merge(df_mean, on=key_cols, how='left')
df_gene.to_csv('gene_out.csv', index=False)
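
If file 2 can contain positions that fall outside every gene, the extra checks mentioned above might look roughly like the following. This is only a minimal sketch: the function name `mean_coverage_per_gene` is hypothetical, the column names are taken from the code above, and it assumes that genes on the same chromosome do not overlap. Positions in a gap between genes are ignored, and genes without any positions get NaN.

```python
import numpy as np
import pandas as pd


def mean_coverage_per_gene(df_gene, df_data):
    """Return a copy of df_gene with a 'mean coverage' column.

    Assumption: genes on the same chromosome do not overlap.
    """
    genes = sorted(zip(df_gene['Chrom'],
                       df_gene['Gene start (bp)'],
                       df_gene['Gene end (bp)']))
    points = sorted(zip(df_data['Chrom'], df_data['pos'], df_data['mean']))

    values = {}  # (chrom, start, end) -> list of coverage values
    i = 0
    for c2, p, m in points:
        # Skip genes that end before this position (bounds-checked)
        while i < len(genes) and (genes[i][0], genes[i][2]) < (c2, p):
            i += 1
        if i == len(genes):
            break  # no genes left; the remaining positions match nothing
        c1, s, e = genes[i]
        if c1 == c2 and s <= p <= e:
            values.setdefault(genes[i], []).append(m)
        # otherwise the position lies in a gap between genes and is ignored

    means = {gene: float(np.mean(v)) for gene, v in values.items()}
    out = df_gene.copy()
    # Look up each gene by its coordinates so the original row order is kept
    out['mean coverage'] = [
        means.get(gene, np.nan)
        for gene in zip(out['Chrom'],
                        out['Gene start (bp)'],
                        out['Gene end (bp)'])
    ]
    return out


# Small made-up example (not real coverage data)
genes = pd.DataFrame({
    'Chrom': [1, 1],
    'Gene start (bp)': [103817769, 50902700],
    'Gene end (bp)': [103828355, 50902978],
})
coverage = pd.DataFrame({
    'Chrom': [1, 1, 1, 1],
    'pos': [50902700, 50902800, 60000000, 103820000],
    'mean': [10.0, 20.0, 99.0, 30.0],
})
result = mean_coverage_per_gene(genes, coverage)
print(result)
```

Because the result is looked up by gene coordinates rather than by row position, the gene file does not need to be sorted beforehand.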
