Computing pairwise Pearson correlations with p-values across multiple variables using R or Python



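I am working with a large microarray expression dataset. I have expression values for 27,000 probes representing 5,500 genes across 14 different data points (variables D1 to D14). Some of these 5,500 genes are represented by multiple probes (that is, different probes for the same gene); the number of probes per gene ranges from 1 to 5. I want to compute, for every possible pair of probes belonging to the same gene, the pairwise Pearson correlation coefficient and its associated p-value across the 14 data points (14 variables), and export the results in a one-dimensional (long) format. A small part of the input data table, in CSV format, is shown below.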

ProbeName Gene  D1    D2    D3    D4    D5    D6    D7    D8    D9    D10   D11   D12   D13   D14
A1        A     9.1   6.6   8.2   9.3   9.0   8.8   9.9   7.5   10.8  9.0   8.3   11.6  9.3   10.9
A2        A     3.9   3.7   5.8   2.2   2.9   2.8   2.9   3.8   3.3   1.7   3.2   3.5   5.9   3.7
A3        A     4.6   4.8   6.8   2.8   4.3   3.5   4.2   5.3   4.5   3.3   4.0   4.3   6.9   4.7
A4        A     3.8   3.9   5.8   3.2   4.0   2.8   3.7   4.6   3.6   2.2   3.8   4.3   5.6   3.9
A5        A     6.3   6.6   7.7   5.9   5.9   5.6   6.2   6.4   5.8   4.9   5.4   6.1   7.7   6.9
B1        B     7.5   5.5   7.1   10.2  7.2   8.6   8.3   7.1   6.1   7.0   9.2   6.4   6.4   9.4
B2        B     4.6   4.8   5.6   4.3   4.7   4.3   4.0   5.5   4.0   3.3   3.8   5.0   5.7   4.7
B3        B     5.1   3.9   5.1   6.5   5.0   5.4   4.9   5.3   4.5   4.5   5.9   5.0   4.6   5.6
B4        B     7.6   6.1   7.5   10.9  8.0   9.2   8.5   7.1   6.3   7.4   10.0  6.9   6.9   10.2
C1        C     3.1   6.1   3.4   2.5   3.7   3.3   2.7   5.0   2.3   3.1   2.0   3.8   2.6   3.3
C2        C     3.8   7.1   4.8   4.1   4.9   4.5   3.8   5.9   4.0   4.7   4.4   5.1   2.9   4.8
C3        C     3.8   6.1   5.5   5.4   6.3   3.9   3.4   7.8   5.3   5.7   4.8   4.0   3.5   4.3
D1        D     12.2  11.7  11.4  10.5  11.5  11.4  10.7  12.0  11.3  10.5  9.9   11.7  10.5  10.2
D2        D     12.0  11.5  11.3  10.4  11.4  11.4  10.7  11.9  11.2  10.6  9.9   11.7  10.3  10.2
E1        E     2.4   3.3   7.5   3.4   5.8   3.6   1.2   3.5   0.9   2.2   3.1   4.7   7.5   4.0

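This assumes that your input data is in a tab-delimited CSV file and that the rows for each gene are contiguous. Under those assumptions, the following should be enough.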

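from scipy.stats import pearsonr as P
import pandas as pd

def combos(n, s):
    # All index pairs (i, j) with i < j among n rows, offset by start row s.
    r = []
    for i in range(n):
        for j in range(i + 1, n):
            r.append((i + s, j + s))
    return r

def process(df, s, e, ad):
    # Correlate every pair of probe rows in the inclusive row range s..e.
    # A gene with a single probe (e == s) produces no pairs.
    if (e - s) > 0:
        _, c = df.shape
        for r1, r2 in combos(e - s + 1, s):
            # Columns 2 onward hold the expression values (D1..D14).
            x = df.iloc[r1, 2: c].astype(float)
            y = df.iloc[r2, 2: c].astype(float)
            r, p = P(x, y)
            ad.append([df.iloc[r1, 0], df.iloc[r2, 0],
                       df.iloc[r1, 1], r, p])

def main(csvFile):
    # The file is tab-delimited, so the separator must be '\t'.
    df = pd.read_csv(csvFile, sep='\t')
    r, _ = df.shape
    gene = df.iloc[0, 1]
    startRow = 0
    endRow = 0
    allData = []
    for _r in range(1, r):
        _g = df.iloc[_r, 1]
        if _g == gene:
            endRow = _r
        else:
            # Gene changed: process the finished block, then start a new one.
            process(df, startRow, endRow, allData)
            gene = _g
            startRow = _r
            endRow = _r
    # Process the last block of rows.
    process(df, startRow, r - 1, allData)
    newDF = pd.DataFrame(data=allData, columns=[
        'ProbeName_1', 'ProbeName_2', 'Gene', 'Pearson', 'Pvalue'])
    with pd.option_context('display.float_format', '{:0.4f}'.format):
        print(newDF)

if __name__ == '__main__':
    main('genes.csv')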
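
If the rows for a gene are not guaranteed to be contiguous, a groupby-based variant may be easier to reason about. The following is only a minimal sketch and not part of the answer above: the function name probe_correlations and the output file name in the closing comment are made up for illustration, and it assumes the columns are named ProbeName and Gene as in the sample table, with every other column holding numeric expression values.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr


def probe_correlations(df, gene_col="Gene", probe_col="ProbeName"):
    """Return one row per within-gene probe pair with Pearson r and p-value."""
    # Assumption: every column other than the two label columns is numeric.
    value_cols = [c for c in df.columns if c not in (gene_col, probe_col)]
    rows = []
    # groupby collects all probes of a gene even if their rows are not adjacent.
    for gene, group in df.groupby(gene_col, sort=False):
        values = group[value_cols].to_numpy(dtype=float)
        names = group[probe_col].tolist()
        # A gene with a single probe yields no pairs, so it adds no rows.
        for i, j in combinations(range(len(names)), 2):
            r, p = pearsonr(values[i], values[j])
            rows.append((names[i], names[j], gene, r, p))
    return pd.DataFrame(
        rows,
        columns=["ProbeName_1", "ProbeName_2", "Gene", "Pearson", "Pvalue"],
    )


# Possible usage (file names are placeholders):
# result = probe_correlations(pd.read_csv("genes.csv", sep="\t"))
# result.to_csv("probe_correlations.csv", index=False)
```

A gene with k probes contributes k(k-1)/2 rows, so for the sample table this would give 10 + 6 + 3 + 1 + 0 = 20 rows. Writing the result with to_csv gives the one-dimensional export asked for in the question, which the original script only prints.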
