Generate all permutations of a dataset

I have a DataFrame that looks like this:

df1 = pd.DataFrame({'Gene':['TP53', 'COX5', 'P16'], 'test':[1,3,0], 'Healthy':[0,0,2]})
Gene    test    Healthy
0   TP53    1       0
1   COX5    3       0
2   P16     0       2

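I have been trying to create every possible pairing of these values. The idea is to take the first gene, "TP53", together with its value in the "test" column, and pair it with each gene's value in the "Healthy" column.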

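For example, TP53 is first mapped to itself: TP53:TP53:1:0. Then TP53 is mapped to COX5's value in the "Healthy" column: TP53:COX5:1:0, followed by the next gene: TP53:P16:1:2. Next, the gene COX5 is taken with its value from the "test" column and compared against the "Healthy" column: COX5:TP53:3:0, then COX5:COX5:3:0.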

So the end result would be a table like this:

All_combinations
TP53:TP53:1:0
TP53:COX5:1:0
TP53:P16:1:2
COX5:TP53:3:0
COX5:COX5:3:0
COX5:P16:3:2
P16:TP53:0:0
P16:COX5:0:0
P16:P16:0:2

I have tried the code below, but I am having trouble with it.

import pandas as pd
df1 = pd.DataFrame({'Gene':['TP53', 'COX5', 'P16'], 'test':[1,3,0], 'Healthy':[0,0,2]})
df2 = df1.transpose()
df2.columns = df2.iloc[0]
df2 = df2.iloc[1:]
from itertools import product
uniques = [df1[i].unique().tolist() for i in df1.iloc[:,[1,2]]]
pd.DataFrame(product(*uniques), columns = df2.iloc[:,])

The real dataset has more than 32,000 rows, so something that runs fast would be great. Thanks for your help.

Does this code solve your problem?

import pandas as pd
df1 = pd.DataFrame({'Gene':['TP53', 'COX5', 'P16'], 'test':[1,3,0], 'Healthy':[0,0,2]})
# Create all the combinations as tuples.
# Note that test is taken from gene1 but Healthy from gene2
# The enumerate is used to get the row number related to that gene
row_list = []
for i, gene1 in enumerate(df1.Gene):
    for j, gene2 in enumerate(df1.Gene):
        row_list.append((gene1, gene2, df1.iloc[i].test, df1.iloc[j].Healthy))
# Now create a new dataframe with the results
df2 = pd.DataFrame(row_list, columns=['Gene1', 'Gene2', 'test', 'Healthy'])

This produces:

Gene1 Gene2  test  Healthy
0  TP53  TP53     1        0
1  TP53  COX5     1        0
2  TP53   P16     1        2
3  COX5  TP53     3        0
4  COX5  COX5     3        0
5  COX5   P16     3        2
6   P16  TP53     0        0
7   P16  COX5     0        0
8   P16   P16     0        2
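Since the question mentions speed, here is a rough vectorized sketch of the same pairing that avoids the Python-level double loop. It is only one possible approach, not necessarily what the answer above intended: it uses `np.repeat` for the "left" gene and `np.tile` for the "right" gene. The names `pairs` and `All_combinations` are just illustrative. Keep in mind that a full pairing of 32,000 rows has 32,000 × 32,000 = 1,024,000,000 rows, so memory will probably be the limit before speed is.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Gene': ['TP53', 'COX5', 'P16'], 'test': [1, 3, 0], 'Healthy': [0, 0, 2]})
n = len(df1)

# np.repeat holds each "left" gene for n consecutive rows;
# np.tile cycles through all "right" genes n times.
pairs = pd.DataFrame({
    'Gene1': np.repeat(df1['Gene'].to_numpy(), n),
    'Gene2': np.tile(df1['Gene'].to_numpy(), n),
    'test': np.repeat(df1['test'].to_numpy(), n),       # test comes from Gene1
    'Healthy': np.tile(df1['Healthy'].to_numpy(), n),   # Healthy comes from Gene2
})

# Optional: build the colon-separated strings shown in the question.
pairs['All_combinations'] = (
    pairs['Gene1'] + ':' + pairs['Gene2'] + ':'
    + pairs['test'].astype(str) + ':' + pairs['Healthy'].astype(str)
)
print(pairs['All_combinations'].tolist())
```

For the real data it may be more practical to process the genes in chunks, or to write results out as they are produced, rather than holding the whole table in memory.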

Since a pandas solution has already been given, this just shows how product works:

a=[1,3,0]
b=[0,0,2]
from itertools import product
list(product(*[a]+[b]))
[(1, 0), (1, 0), (1, 2), (3, 0), (3, 0), (3, 2), (0, 0), (0, 0), (0, 2)]
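To get from those value pairs to the format in the question, the gene names can be carried along by zipping them with their values before calling `product`. This is just a sketch under that assumption; `iter_combinations` is a made-up helper name, and it is written as a generator so the strings can be consumed one at a time instead of all being held in memory.

```python
from itertools import product

genes = ['TP53', 'COX5', 'P16']
test = [1, 3, 0]
healthy = [0, 0, 2]

# Pair each gene with its own value so the name travels with the number.
left = list(zip(genes, test))       # (gene, test value)
right = list(zip(genes, healthy))   # (gene, Healthy value)

def iter_combinations(left, right):
    """Yield 'Gene1:Gene2:test:Healthy' strings lazily (hypothetical helper)."""
    for (g1, t), (g2, h) in product(left, right):
        yield f"{g1}:{g2}:{t}:{h}"

combos = list(iter_combinations(left, right))
print(combos)
```

On the small example, calling `list(...)` is fine; for a large dataset you would probably loop over the generator and write each string out instead.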
