This is my dummy gene-expression data frame.
**FRESHUPDATE**
This is a small data frame taken from the larger gene-expression data set that I want to filter.
My subset is `mat2`:
Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1 A2ML1 4.627857365 5.369632 6.700112904 5.6232636 4.75680637 5.8050996 6.2077827 5.2683007 5.232384
2 A4GALT 5.550918500 5.572321 4.569849528 6.2627817 5.25197103 6.4728585 3.8088796 5.5766959 6.458113
3 AACSP1 -0.004394347 1.195122 -0.004562859 0.1343311 -0.01469569 0.2808245 0.2881929 0.3270398 0.708931
4 ABCA9 5.652068819 5.579944 7.787378888 4.9460252 4.77917651 5.5384349 5.6242293 5.8726373 8.846332
5 ABCA9-AS1 0.557163318 1.701202 3.343076301 0.4203761 1.04232725 0.5324808 1.3794852 1.9304208 3.594210
6 ABCC13 4.077316070 8.840604 2.340835263 3.0782108 2.32162741 4.0645558 3.3683787 4.0456838 3.129047
7 ABLIM1 9.696391499 11.988791 9.873324476 10.5111442 10.81262360 9.0651002 10.6804131 9.4307673 11.879929
8 ABLIM3 5.292492658 5.979259 3.623770183 3.5016803 6.74841153 4.9092703 3.7786797 3.9352033 4.406261
9 ABO 10.631505004 6.859666 5.505456740 10.1379316 6.39110235 10.2743712 9.9307084 6.3601978 11.161422
10 ACOT12 1.648498344 3.762861 2.098422076 1.1439361 2.39612635 2.0490598 0.8765957 2.6902788 2.896370
To flag values by standard deviation (anything more than 2 SDs from its column mean becomes `NA`), I did this:
mat1 <- mat2
mat1[, -1] <- lapply(mat1[, -1],
                     function(x) replace(x, abs(scale(x)) > 2, NA))
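To make explicit what this replacement does: `lapply()` walks over the columns, so `scale(x)` standardizes each sample across genes, not each gene across samples. A minimal sketch on a made-up vector (`x` is hypothetical and not taken from the data above):

```r
# Hypothetical values for one sample (one column); the last value is extreme
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# scale() centers on the mean and divides by the standard deviation (z-scores)
z <- as.numeric(scale(x))

# Values with |z| > 2 are replaced by NA, as in the lapply() call above
flagged <- replace(x, abs(z) > 2, NA)
flagged
```

If the intent is to flag outliers per gene instead, the standardization would have to run over rows, which is a different computation from the one shown here.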
To find the rows that contain any `NA`, I did this:
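library(dplyr)   # provides %>%
library(tibble)  # provides remove_rownames() and column_to_rownames()
mat_rown <- mat1 %>% remove_rownames() %>% column_to_rownames(var = "Symbol")
which(is.na(mat_rown), arr.ind = TRUE)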
This gave me the following data frame (`mat1`, with the flagged values replaced by `NA`):
Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1 A2ML1 4.627857 5.369632 6.700113 5.623264 4.756806 5.805100 6.207783 5.268301 5.232384
2 A4GALT 5.550918 5.572321 4.569850 6.262782 5.251971 6.472859 3.808880 5.576696 6.458113
3 AACSP1 NA NA NA NA NA NA NA NA NA
4 ABCA9 5.652069 5.579944 NA 4.946025 4.779177 5.538435 5.624229 5.872637 8.846332
5 ABCA9-AS1 NA NA 3.343076 NA NA NA 1.379485 NA 3.594210
6 ABCC13 4.077316 8.840604 2.340835 3.078211 2.321627 4.064556 3.368379 4.045684 3.129047
7 ABLIM1 NA NA NA NA NA NA NA NA NA
8 ABLIM3 5.292493 5.979259 3.623770 3.501680 NA 4.909270 3.778680 3.935203 4.406261
9 ABO NA 6.859666 5.505457 NA NA NA NA 6.360198 NA
10 ACOT12 1.648498 3.762861 2.098422 1.143936 2.396126 2.049060 NA 2.690279 2.896370
Here we can see that several genes have one or more `NA` values in different columns, so my goal is to get rid of those rows.
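One possible way to drop such rows is base R's `complete.cases()`. The sketch below uses a made-up stand-in for `mat1` (`mat1_toy`, its gene names and its values are all hypothetical):

```r
# Toy stand-in for mat1: same layout (Symbol + sample columns), made-up values
mat1_toy <- data.frame(
  Symbol = c("G1", "G2", "G3"),
  S1 = c(4.6, NA, 5.6),
  S2 = c(5.3, 1.2, NA),
  S3 = c(6.7, 0.1, 7.8),
  stringsAsFactors = FALSE
)

# complete.cases() returns TRUE for rows that contain no NA at all
mat_clean <- mat1_toy[complete.cases(mat1_toy), ]
mat_clean$Symbol
```

Only `G1` survives here, because `G2` and `G3` each have an `NA` in at least one sample column.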
So when I try to index the rows that contain `NA`, such as AACSP1, ABCA9-AS1, ABLIM1, ABO, ACOT12,
I get this:
row col
AACSP1 3 1
ABCA9-AS1 5 1
ABLIM1 7 1
ABO 9 1
AACSP1 3 2
ABCA9-AS1 5 2
ABLIM1 7 2
AACSP1 3 3
ABCA9 4 3
ABLIM1 7 3
AACSP1 3 4
ABCA9-AS1 5 4
ABLIM1 7 4
ABO 9 4
AACSP1 3 5
ABCA9-AS1 5 5
ABLIM1 7 5
ABLIM3 8 5
ABO 9 5
AACSP1 3 6
ABCA9-AS1 5 6
ABLIM1 7 6
ABO 9 6
AACSP1 3 7
ABLIM1 7 7
ABO 9 7
ACOT12 10 7
AACSP1 3 8
ABCA9-AS1 5 8
ABLIM1 7 8
AACSP1 3 9
ABLIM1 7 9
ABO 9 9
So my simple idea is to keep the rows (genes) that contain `NA` in another data frame or object, which I can then use downstream for different analyses, to check
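Keeping the `NA`-containing genes in their own object could look roughly like this. It is a sketch on made-up data; `mat1_toy`, `has_na`, `na_genes` and `clean_genes` are hypothetical names:

```r
# Toy stand-in for mat1: same layout (Symbol + sample columns), made-up values
mat1_toy <- data.frame(
  Symbol = c("G1", "G2", "G3", "G4"),
  S1 = c(4.6, NA, 5.6, 1.1),
  S2 = c(5.3, NA, 4.9, 1.3),
  S3 = c(6.7, NA, NA, 1.2),
  stringsAsFactors = FALSE
)

# Count NAs per gene over the sample columns only (drop the Symbol column)
na_count <- rowSums(is.na(mat1_toy[, -1]))
has_na <- na_count > 0

# Genes with at least one NA go into one frame, the rest into another
na_genes <- mat1_toy[has_na, ]
clean_genes <- mat1_toy[!has_na, ]

na_genes$Symbol
clean_genes$Symbol
```

Keeping `na_count` around also makes it possible to tell genes flagged in every sample (like `G2` here) from genes flagged in only one.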
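If you simply want to split the original frame into one frame with low row-wise standard deviation and one with high row-wise standard deviation, you can do this: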
library(dplyr)   # provides %>%
library(tibble)  # provides rownames_to_column()
# move the row names (gene symbols) into a 'gene' column
rld2 <- as.data.frame(mat) %>% rownames_to_column('gene')
# set the threshold that defines "high" deviation (I've picked a relatively low one here; you might choose something like 3)
sd_threshold <- 0.6
# get the row-specific standard deviation, using `apply()`
row_sds <- apply(rld2[, -1], 1, sd)
# split into a list of two frames: "FALSE" holds the low-SD rows, "TRUE" the high-SD rows
low_high_split <- split(rld2, f = row_sds > sd_threshold)
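Because `split()` is given a logical vector, the resulting list has elements named `"FALSE"` and `"TRUE"`. Here is a small self-contained sketch of pulling the two frames out, using a made-up matrix (`mat_toy` and its values are hypothetical):

```r
# Toy expression matrix: genes in rows, samples in columns (made-up values)
mat_toy <- matrix(
  c(5.0, 5.1, 4.9,
    1.0, 4.0, 8.0,
    2.0, 2.2, 2.1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("G1", "G2", "G3"), c("S1", "S2", "S3"))
)

# Row-wise standard deviation and the same kind of threshold as above
row_sds <- apply(mat_toy, 1, sd)
sd_threshold <- 0.6

# Split into a list keyed by "FALSE" (low SD) and "TRUE" (high SD)
parts <- split(as.data.frame(mat_toy), f = row_sds > sd_threshold)
low_sd <- parts[["FALSE"]]
high_sd <- parts[["TRUE"]]

rownames(low_sd)
rownames(high_sd)
```

If every row falls on the same side of the threshold, the list has only one element and the other lookup returns `NULL`, so it may be worth checking `names(parts)` before indexing.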