Filter rows containing NA into a new data frame in R



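My dummy gene expression data frame is this:

**FRESH UPDATE**

This is a small data frame, taken from the larger gene expression set, that I want to filter.

My subset is this, `mat2`: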

Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1      A2ML1  4.627857365     5.369632  6.700112904    5.6232636   4.75680637    5.8050996    6.2077827    5.2683007     5.232384
2     A4GALT  5.550918500     5.572321  4.569849528    6.2627817   5.25197103    6.4728585    3.8088796    5.5766959     6.458113
3     AACSP1 -0.004394347     1.195122 -0.004562859    0.1343311  -0.01469569    0.2808245    0.2881929    0.3270398     0.708931
4      ABCA9  5.652068819     5.579944  7.787378888    4.9460252   4.77917651    5.5384349    5.6242293    5.8726373     8.846332
5  ABCA9-AS1  0.557163318     1.701202  3.343076301    0.4203761   1.04232725    0.5324808    1.3794852    1.9304208     3.594210
6     ABCC13  4.077316070     8.840604  2.340835263    3.0782108   2.32162741    4.0645558    3.3683787    4.0456838     3.129047
7     ABLIM1  9.696391499    11.988791  9.873324476   10.5111442  10.81262360    9.0651002   10.6804131    9.4307673    11.879929
8     ABLIM3  5.292492658     5.979259  3.623770183    3.5016803   6.74841153    4.9092703    3.7786797    3.9352033     4.406261
9        ABO 10.631505004     6.859666  5.505456740   10.1379316   6.39110235   10.2743712    9.9307084    6.3601978    11.161422
10    ACOT12  1.648498344     3.762861  2.098422076    1.1439361   2.39612635    2.0490598    0.8765957    2.6902788     2.896370

To mask values that lie more than 2 standard deviations from the mean, I did this:

mat1 <- mat2
mat1[,-1] <- lapply(mat1[,-1],
function(x) replace(x,abs(scale(x))>2,NA))
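
Note that `lapply()` over `mat1[,-1]` walks the columns, so `scale(x)` standardizes each sample (column) across genes, not each gene across samples. A minimal sketch of what the `replace()` call does to a single column, using a made-up vector `x` (hypothetical, not taken from the data above):

```r
# Hypothetical toy column: nine ordinary values and one extreme value
x <- c(5, 5.2, 4.8, 5.1, 4.9, 5, 5.3, 4.7, 5.1, 12)

# scale() centers by the mean and divides by the standard deviation (z-scores)
z <- as.numeric(scale(x))

# Values more than 2 standard deviations from the column mean become NA
x_flagged <- replace(x, abs(z) > 2, NA)
x_flagged
```

If per-gene masking is what is wanted instead, the same idea would have to be applied to rows rather than columns.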

To find the rows that have any NA, I did this:

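library(dplyr)   # provides the %>% pipe
library(tibble)  # provides remove_rownames() and column_to_rownames()
mat_rown <- mat1 %>% remove_rownames %>% column_to_rownames(var="Symbol")
which(is.na(mat_rown),arr.ind = TRUE)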

This gave me this data frame:

Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1      A2ML1     4.627857     5.369632     6.700113     5.623264     4.756806     5.805100     6.207783     5.268301     5.232384
2     A4GALT     5.550918     5.572321     4.569850     6.262782     5.251971     6.472859     3.808880     5.576696     6.458113
3     AACSP1           NA           NA           NA           NA           NA           NA           NA           NA           NA
4      ABCA9     5.652069     5.579944           NA     4.946025     4.779177     5.538435     5.624229     5.872637     8.846332
5  ABCA9-AS1           NA           NA     3.343076           NA           NA           NA     1.379485           NA     3.594210
6     ABCC13     4.077316     8.840604     2.340835     3.078211     2.321627     4.064556     3.368379     4.045684     3.129047
7     ABLIM1           NA           NA           NA           NA           NA           NA           NA           NA           NA
8     ABLIM3     5.292493     5.979259     3.623770     3.501680           NA     4.909270     3.778680     3.935203     4.406261
9        ABO           NA     6.859666     5.505457           NA           NA           NA           NA     6.360198           NA
10    ACOT12     1.648498     3.762861     2.098422     1.143936     2.396126     2.049060           NA     2.690279     2.896370

Here we can see that these genes have an NA in one column or another, so my goal is to take these rows out.

So when I try to index the rows with NA (AACSP1, ABCA9-AS1, ABLIM1, ABO, ACOT12)

I get this:

row col
AACSP1      3   1
ABCA9-AS1   5   1
ABLIM1      7   1
ABO         9   1
AACSP1      3   2
ABCA9-AS1   5   2
ABLIM1      7   2
AACSP1      3   3
ABCA9       4   3
ABLIM1      7   3
AACSP1      3   4
ABCA9-AS1   5   4
ABLIM1      7   4
ABO         9   4
AACSP1      3   5
ABCA9-AS1   5   5
ABLIM1      7   5
ABLIM3      8   5
ABO         9   5
AACSP1      3   6
ABCA9-AS1   5   6
ABLIM1      7   6
ABO         9   6
AACSP1      3   7
ABLIM1      7   7
ABO         9   7
ACOT12     10   7
AACSP1      3   8
ABCA9-AS1   5   8
ABLIM1      7   8
AACSP1      3   9
ABLIM1      7   9
ABO         9   9
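
The index matrix above has one row per NA cell, so the same gene appears many times. One way to collapse it to a list of affected genes is sketched below on a small made-up frame `m` (hypothetical, standing in for `mat_rown`):

```r
# Hypothetical toy frame standing in for mat_rown (gene symbols as row names)
m <- data.frame(
  s1 = c(1, 4, NA),
  s2 = c(NA, 5, NA),
  s3 = c(3, 6, 9),
  row.names = c("geneA", "geneB", "geneC")
)

# One row per NA cell, with its row and column index
na_idx <- which(is.na(m), arr.ind = TRUE)

# Collapse to the distinct genes that have at least one NA
na_genes <- unique(rownames(m)[na_idx[, "row"]])

# Count NA cells per gene, which shows how badly each gene is affected
na_per_gene <- rowSums(is.na(m))
```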

So my simple idea is to keep the rows (genes) that contain NA in another data frame or object, which I can then use downstream for different analyses, to check
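
A minimal base R sketch of that idea, assuming the first column holds the gene symbol and all remaining columns are samples. The frame `df` and its values are made up for illustration:

```r
# Hypothetical toy frame standing in for mat1 (first column holds gene symbols)
df <- data.frame(
  Symbol = c("geneA", "geneB", "geneC", "geneD"),
  s1 = c(1.2, NA, 3.4, 5.6),
  s2 = c(2.1, 4.3, NA, 6.5),
  stringsAsFactors = FALSE
)

# TRUE for rows where at least one sample column is NA
has_na <- rowSums(is.na(df[, -1])) > 0

# Rows with at least one NA, kept for downstream analysis
na_rows <- df[has_na, ]

# Rows with no NA at all
complete_rows <- df[!has_na, ]
```

`complete.cases(df[, -1])` returns the negation of `has_na` and could be used instead. Because `mat1` and `mat2` share the same row order, the same logical vector should also pull the original, unmasked values for those genes out of `mat2`.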

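If you just want to split the original frame into one frame with low row standard deviation and one with high row standard deviation, you can do this: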

library(dplyr)   # provides the %>% pipe
library(tibble)  # provides rownames_to_column()
rld2 <- as.data.frame(mat) %>% rownames_to_column('gene')
# set your threshold that defines "high" deviation (I've picked a relatively low one here; you might choose something like 3)
sd_threshold <- .6
# get the row-specific standard deviation, using `apply()`
row_sds <- apply(rld2[,-1], 1, function(r) sd(r))
# split into a list of two frames, named "FALSE" (low) and "TRUE" (high)
low_high_split <- split(rld2, f = row_sds > sd_threshold)
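
The result of `split()` is a list whose elements are named after the grouping values. A small sketch of pulling the two frames back out, using a made-up frame `toy` (hypothetical, standing in for `rld2`):

```r
# Hypothetical toy frame standing in for rld2
toy <- data.frame(
  gene = c("g1", "g2", "g3"),
  s1 = c(1, 5, 2),
  s2 = c(1.1, 9, 2.2),
  s3 = c(0.9, 1, 2.1),
  stringsAsFactors = FALSE
)

sd_threshold <- 0.6

# Row-wise standard deviation over the sample columns
row_sds <- apply(toy[, -1], 1, sd)

# split() names the list elements after the grouping values: "FALSE" and "TRUE"
parts <- split(toy, f = row_sds > sd_threshold)

low_sd  <- parts[["FALSE"]]  # rows at or below the threshold
high_sd <- parts[["TRUE"]]   # rows above the threshold
```

If every row falls on the same side of the threshold, the list has only one element, so it is worth checking `names(parts)` before indexing.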

Latest update