I want to compute the same results with statsmodels that I get from R's lm.fit, but I am running into trouble. I get the error ValueError: The indices for endog and exog are not aligned.
Here is my data, ready to copy and paste:
import pandas as pd
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
design_str = """(Intercept) celltypeB
1 1 0
2 1 0
3 1 1
4 1 1"""
logcounts_str = """A_1.bam A_2.bam B_1.bam B_2.bam
653635 10.620219493055 10.6004806223872 10.2558655809612 10.2517277004859
729737 7.67498885090687 7.79906883969567 6.03916081991983 5.75239067063238
100507658 5.08044030135652 4.92805528041574 4.06340836616651 4.59011924173351
100132287 7.74946706686615 7.44831209174908 5.71110662223563 5.87022716092624
100131754 14.9504967198295 15.046928067964 16.615773268211 16.6324137588478
100133331 8.14394324366268 7.97474851586378 6.75643061274512 6.85601330170654
100288069 4.96014606763881 4.38056748511325 5.20091188991645 4.84165800872947
100287934 3.60650911302411 3.00205586185952 3.61594938919529 3.36772682039706
79854 5.72198633044404 5.63432407735903 5.51406977517608 5.4072551845837
643837 7.0659407316614 7.21928657808019 7.23333336760882 7.31674289167825
100506327 5.29456510670937 4.92805528041574 3.3264427720003 3.00515674101235
284600 6.31922716094364 5.88958113260111 2.47844586544535 2.51972991384211
100130417 5.48097823094025 5.80941078391712 0.156517770557991 3.36772682039706
148398 7.78531226581287 7.79906883969567 4.06340836616651 4.84165800872947
26155 10.6251729572891 10.6034552523913 9.36353209073552 9.29846415795994
339451 8.95622036442501 8.90412944117026 7.29606912295679 7.16358610361683
84069 6.54510856835996 6.90894645746804 0.156517770557991 1.7827643196759
84808 5.48097823094025 5.80941078391712 0.156517770557991 0.197801818954747
57801 8.52934125250165 8.40122695567934 7.39492250988307 7.33735317135354
9636 7.40236839624388 7.21928657808019 6.49636777344262 6.85601330170654
375790 12.319609180022 12.2617991255503 10.3720507703036 10.2076304363229
401934 4.34347470719031 3.85005276841447 0.156517770557991 0.197801818954747
54991 8.19813534402627 7.93751560966481 7.05133553386594 7.11666505622934
8784 7.25676307399698 7.4215947533733 4.24398061180833 3.00515674101235
51150 11.167054248766 11.132369007403 10.1802721238574 10.1822202777559
126792 8.95622036442501 8.64012969904024 8.18442376712788 7.80513213270436
388581 4.13702382972289 4.58701836258068 3.85695748869908 2.51972991384211
118424 9.37616332589649 9.50985050205822 8.78222661362246 8.66132619222593
6339 7.2309999779319 6.74621695742993 8.17332605824454 8.16934537290552
116983 10.4038103704178 10.3718712861434 11.6438553857652 11.6941563075194
126789 8.14394324366268 7.81967911937095 6.56590870669569 6.73696063006278
54973 10.1318124030775 10.0702967231723 9.92965697727768 9.95769000217658
80772 8.69867895695197 8.60494027057794 8.28063908238718 8.40237296320395
83756 4.52404695283213 2.26509026769331 3.61594938919529 1.7827643196759
1855 10.4266880754393 10.3365526302499 10.653371547946 10.5739272082309
54587 8.13007106908112 7.83999910375055 7.6878992310743 7.69764770603795
54998 10.0568039370893 9.88714208714969 9.48967312086861 9.47159741816901
81669 10.162802270914 10.2095583211183 9.81294263383577 9.64695046433018
148413 8.28833315299785 7.97474851586378 7.07538100783259 7.31674289167825
55052 8.96406111764219 8.95159079487653 8.54453505590313 8.65312903925931
441869 4.13702382972289 4.58701836258068 8.23866681191186 8.34246006178663
643965 0.436584111581795 0.680127766972157 7.89122739078383 7.49242256784637
64856 8.61150979408247 8.78341557538418 10.2212605353082 10.1091938067982
219293 5.92843720791147 6.23471661864979 4.80037396033272 5.32708483589971
83858 9.25995135162803 9.31312296411511 7.05133553386594 7.31674289167825
55210 9.6533299697771 9.27631752311657 7.31638910733638 7.681617596219
339453 3.60650911302411 3.85005276841447 7.74897480782607 7.87728191846019
29101 10.2615428521103 10.1938553629246 10.0558746934811 9.87552146059575
643988 8.32532736048005 8.61676570597473 8.20636632000855 7.84885351013367
142678 9.62393618478229 9.39093420067151 10.0828137653391 9.97752117409815
984 8.89191133188636 8.53187680838822 8.02070391521227 8.23672080824705
728661 9.41672368922095 9.49711139022754 9.71685060477043 9.87199408710043
728642 8.53987191999382 8.13133887880449 7.98306625784891 7.74469627884238
100506482 7.44781136700505 7.4215947533733 8.37083689135876 8.2584977506423
9906 6.21794382510645 5.88958113260111 6.30626489006267 6.30632627573292
65220 9.79633367190412 9.96321611999616 8.72257180872908 8.36270874563044
2782 12.228967726517 12.2071159724042 12.8462973999353 12.8611375424331
339456 5.64603747721074 4.92805528041574 3.85695748869908 4.28526466020509
85452 3.60650911302411 3.85005276841447 6.03916081991983 5.55535382357283
2563 0.436584111581795 3.48748268902976 11.4562981734165 11.3993131634549
5590 7.71270851685603 7.86003685698709 10.7758207259737 10.7243010580913
100506504 5.56586712852676 5.63432407735903 7.64033354782225 7.33735317135354
199990 8.43093754844065 8.27258480424024 7.48743464867261 7.5106847742391
100128003 7.15082962924792 6.82987488647684 5.64837086688767 5.92572227351795
6497 10.1422164989432 10.1168393091094 11.6548683784876 11.7700283317484
79906 5.29456510670937 5.53810876209973 7.25454985351852 7.43620655827983
100129534 5.08044030135652 4.58701836258068 6.49636777344262 6.26389100941252
11079 10.1931404341059 10.1251426128406 9.61389864963053 9.51747393990174
5192 7.40236839624388 8.02885592120323 7.21180020605918 7.56412403320056
9651 4.96014606763881 4.58701836258068 10.2532329250465 10.2104263578198
55229 8.65090323238256 8.33117945815109 8.81115379908596 8.62825437062028
388585 0.436584111581795 0.680127766972157 4.40444528400158 4.10469241456327
115110 5.92843720791147 5.43501526913563 0.156517770557991 2.51972991384211
100133445 5.39078042196867 5.07244518975092 2.9638726926156 2.51972991384211
8764 8.98733089696504 9.0420715407074 6.72637337888894 6.46458835964965
127281 8.85864887775461 8.70803376354204 8.98624050564405 9.11964275602924
440556 2.02154661230295 0.680127766972157 7.09903227589723 6.96598614373167
63976 5.08044030135652 4.58701836258068 7.0268824901414 7.01798078136993
27237 7.92039988884605 7.86003685698709 5.28580078750296 5.2421959383132
1953 9.405250904777 9.4249616044717 6.81472925330979 6.57284125030167
127262 8.52934125250165 8.84503469364785 10.6947067026104 10.7837032686455
49856 8.44220866077567 8.58099457495291 7.52283998480381 7.209029074378
7161 6.10900945355329 5.72452188633061 6.08725510812088 6.46458835964965
57212 8.641155255831 8.71904675626446 8.76014411554418 8.62825437062028
388588 6.10900945355329 5.63432407735903 3.3264427720003 2.51972991384211
57470 9.7010267118084 9.67730724790978 9.65037321979881 9.76575789437021
9731 8.78973093707988 8.78341557538418 8.59530962313625 8.89476934518903
1677 6.66540280207768 6.562770816334 6.56590870669569 6.53765182183937
339448 8.33745091956254 8.28745808072177 7.0268824901414 7.25308425445594
100133612 4.13702382972289 4.38056748511325 1.74148027127915 1.7827643196759
55966 7.95228394986584 8.11475599460888 7.29606912295679 6.96598614373167
261734 8.4755031008741 8.31675238751581 7.90471062014745 7.69764770603795
8514 9.47824326321901 9.3701257383916 11.7096670461565 11.5852806847971
26038 5.08044030135652 4.38056748511325 11.3604774737975 11.2792852570484
6146 11.5006531919673 11.5604765751282 10.6046340759675 10.7780603864545
388591 7.65575263204396 8.01104464508678 7.6563636576412 7.18648650572691
23463 10.4713830741591 10.3113048226761 9.90638719795484 9.80698055709673
387509 7.69397195427445 7.86003685698709 6.87076328822411 6.99221768530485
11332 9.91028986120121 9.79126343720687 10.3913352016753 10.5365382015287
"""
design = pd.read_table(StringIO(design_str), sep=" ")
logcounts = pd.read_table(StringIO(logcounts_str), sep=" ")
Running R's lm.fit(design, logcounts.T)
on this data gives me the following result (only the head and tail are shown here):
(Intercept) celltypeB
653635 10.6 -0.36
729737 7.7 -1.84
100507658 5.0 -0.68
100132287 7.6 -1.81
100131754 15.0 1.63
100133331 8.1 -1.25
(Intercept) celltypeB
26038 4.7 6.59
6146 11.5 -0.84
388591 7.8 -0.41
23463 10.4 -0.53
387509 7.8 -0.85
11332 9.9 0.61
How can I get the same results with Python's statsmodels?
P.S. This is trivial with sklearn,
but sklearn
doesn't give me the residuals and a lot of other things that I need:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(design, logcounts)
print(clf.coef_[0:3,1])
print(clf.coef_[-3:,1])
print(clf.intercept_)
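(sklearn does at least expose predictions, so the residuals can be recovered by hand; a minimal sketch on synthetic data, with illustrative variable names:)

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(4), [0.0, 0.0, 1.0, 1.0]])  # design: intercept + celltype indicator
Y = rng.normal(size=(4, 3))                              # three fake response columns

clf = linear_model.LinearRegression(fit_intercept=False)  # intercept is already a column of X
clf.fit(X, Y)

fitted = clf.predict(X)  # fitted values
resid = Y - fitted       # residuals, computed manually
```

But standard errors, p-values, and the rest still have to be computed by hand, which is why I'd prefer statsmodels.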
P.P.S. I am a non-statistician, so there may be completely trivial errors/misunderstandings here; I am just porting some R code to Python.
The error in question occurs because the two DataFrames/Series do not share the same row index.
We can assign the matching index to design
(although I don't know whether this is the official pandas way):
>>> design.index = logcounts.T.index
>>> design
(Intercept) celltypeB
A_1.bam 1 0
A_2.bam 1 0
B_1.bam 1 1
B_2.bam 1 1
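An equivalent sketch (assuming a reasonably recent pandas; whether this is any more "official" I'm not sure) uses set_axis instead of assigning to the index attribute:

```python
import pandas as pd

# Illustrative reconstruction of the design matrix from the question
design = pd.DataFrame({"(Intercept)": [1, 1, 1, 1],
                       "celltypeB": [0, 0, 1, 1]})
sample_names = ["A_1.bam", "A_2.bam", "B_1.bam", "B_2.bam"]

# set_axis returns a new DataFrame with the row labels replaced
design = design.set_axis(sample_names, axis=0)
print(design)
```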
The main problem is that statsmodels OLS is designed only for a univariate dependent variable (endog
). Computing the covariance matrix of the parameter estimates fails because the full matrix would be three-dimensional (or an appropriately reshaped block-diagonal matrix).
However, thanks to default numpy broadcasting, some things do work (though this is neither unit-tested nor officially supported):
>>> import statsmodels.api as sm
>>> res = sm.OLS(logcounts.T, design).fit()
>>> # print(res.summary())  # raises an exception
>>> res.params.T.head()
(Intercept) celltypeB
0 10.610350 -0.356553
1 7.737029 -1.841253
2 5.004248 -0.677484
3 7.598890 -1.808223
4 14.998712 1.625381
>>> res.resid.T.head()
A_1.bam A_2.bam B_1.bam B_2.bam
653635 0.009869 -0.009869 0.002069 -0.002069
729737 -0.062040 0.062040 0.143385 -0.143385
100507658 0.076193 -0.076193 -0.263355 0.263355
100132287 0.150577 -0.150577 -0.079560 0.079560
100131754 -0.048216 0.048216 -0.008320 0.008320
>>> res.fittedvalues.T.head()
A_1.bam A_2.bam B_1.bam B_2.bam
653635 10.610350 10.610350 10.253797 10.253797
729737 7.737029 7.737029 5.895776 5.895776
100507658 5.004248 5.004248 4.326764 4.326764
100132287 7.598890 7.598890 5.790667 5.790667
100131754 14.998712 14.998712 16.624094 16.624094
>>> logcounts.head()
A_1.bam A_2.bam B_1.bam B_2.bam
653635 10.620219 10.600481 10.255866 10.251728
729737 7.674989 7.799069 6.039161 5.752391
100507658 5.080440 4.928055 4.063408 4.590119
100132287 7.749467 7.448312 5.711107 5.870227
100131754 14.950497 15.046928 16.615773 16.632414
With a multivariate dependent variable, most of the other results will not work, and to get the full set of results you currently need to loop over the response columns one at a time.