r-试图使用单列中所有行值的计数来对大型表进行子集设置

  • 本文关键字:大型 设置 子集 单列中 r phyloseq
  • 更新时间 :
  • 英文 :


在R中处理基因组数据。

我试图对一个非常大的融化的门表进行子集划分,其中包括一列门ID,以便删除表中出现次数少于100000次的包含门的行。我可能错过了一个";"容易";但我最终还是尝试了自己的功能。

功能:

phylum_subset <- function(x = melt.ALKSS_few, #melted physeq object 
Count = melt.ALKSS$Phylum, #counting phyla
Value = 1000   #minimum number of OTUs
){
phyla.table <- table(x$Count) 

for(Count in x){if(phyla.table[Count]<=100000)
subset(x,Phylum != Count)
}
}

我承认这是我第一次写函数,我真的不知道自己在做什么。

我的函数输入和由此产生的错误输出结果如下:

melt.ALKSS_few.count <- phylum_subset(x = melt.ALKSS_few,Count = melt.ALKSS_few$Phylum,Value = 100000)
Error in if (phyla.table[Count] <= 1e+05) subset(x, Count != Phylum_col) : 
the condition has length > 1

因为我试图用一列中所有事件的总和来子集,所以我不能只使用filter((或其他东西一次(除非我想这样做500次(。肯定有人以前做过这样的事吗?

编辑:好的,试着提供一个可复制的数据集块。请注意,它有超过808k个obs的47个变量,因为在生态数据集上做基因组学是一团糟。我已经删除了一些变量,这些变量是之前步骤(引物序列等(元数据的残余,我不会在分析中使用这些变量,只是为了保留代码块。。。质量较小。

> dput(droplevels(head(melt.ALKSS_few)))
structure(list(OTU = c("44c21e29adae97a53247abbd73978395", "0f18144d308ada95632ab5193d92073f", 
"d829bee4984f82ffc2453212157caf96", "0f18144d308ada95632ab5193d92073f", 
"0ddcd311e02f742e2e0e61ce02cf9c29", "120eba657e42a11a5c29f97b90f02035"
), Sample = c("S438", "S680", "S437", "S345", "S454", "S513"), 
Abundance = c(10755, 9568, 8186, 7621, 7506, 7501), BarcodeSequence = c("CATTTTAGGACT", 
"CGGAATAGAGTA", "CATTTTAGAGTA", "TATAATGGACCA", "CGGAATTGGCAT", 
"GACGACGGACCA"), PrimerDesc = c("16S", 
"16S", "16S", "16S", "16S", "16S"), SampleName = c("06222021KC-2-R", 
"09292021KC-2-R", "06222021KC-1-R", "06032021KC-1-R", "06292921KC-3-R", 
"06302021KC-3-R"), Project = c("16SLBSKR1-", "16SLBSKR2-", 
"16SLBSKR1-", "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR2-"), 
Number = c("456", "694", "455", "363", "471", "491"), Date = c("6_22_2021", "9_29_2021", "6_22_2021", "6_3_2021", 
"6_29_2021", "6_30_2021"), Year = c(2021L, 2021L, 2021L, 
2021L, 2021L, 2021L), Season = c("Summer", "Fall", "Summer", 
"Summer", "Summer", "Summer"), sample_Species = c("Little_Bluestem", 
"Little_Bluestem", "Little_Bluestem", "Little_Bluestem", 
"Little_Bluestem", "Little_Bluestem"), SoloOrMixed = c("Solo", 
"Mixed", "Solo", "Mixed", "Mixed", "Solo"), Location = c("Tyler_SP", "Hy_180", "Tyler_SP", "Roadside_Hy67", 
"Copper_Breaks_SP", "Caprock_Canyons_SP"), Ecoregion = c("South_Central_Plains", 
"South_Central_Plains", "South_Central_Plains", "Edwards_Plateau", 
"Southwestern_Tablelands", "Southwestern_Tablelands"), Habitat = c("Forest", 
"Roadside", "Forest", "Roadside", "Roadside", "AridRock"), 
Source = c("Root", "Root", "Root", "Root", "Root", "Root"
),  PrecipMonth = c(96.65, 
37.45, 96.65, 125.94, 125.01, 153.94), PrecipDaysSince = c(1L, 
1L, 1L, 1L, 1L, 0L), pH = c(6.8, 6.7, 6.8, 8, 8, 7.8), EC = c(139L, 
182L, 139L, 161L, 125L, 2370L), NO3 = c(0, 4.4, 0, 0.2, 2.2, 
3.4), P = c(16L, 17L, 16L, 14L, 5L, 6L), K = c(145L, 84L, 
145L, 114L, 160L, 65L), Ca = c(3918L, 2159L, 3918L, 27256L, 
6609L, 16508L), Mg = c(166L, 130L, 166L, 188L, 148L, 95L), 
S = c(10L, 16L, 10L, 24L, 24L, 14299L), Na = c(4L, 3L, 4L, 
4L, 4L, 4L), Fe = c(19.76, 17, 19.76, 2.31, 1, 0), Zn = c(2.28, 
15.1, 2.28, 7.01, 0.8, 0.1), Mn = c(64.16, 19, 64.16, 27.01, 
15, 6), Cu = c(0.16, 0.2, 0.16, 0.16, 0.2, 0.2), Kingdom = c("d__Bacteria", 
"d__Bacteria", "d__Bacteria", "d__Bacteria", "d__Bacteria", 
"d__Bacteria"), Phylum = c("Proteobacteria", "Proteobacteria", 
"Proteobacteria", "Proteobacteria", "Proteobacteria", "Actinobacteriota"
), Class = c("Gammaproteobacteria", "Gammaproteobacteria", 
"Alphaproteobacteria", "Gammaproteobacteria", "Gammaproteobacteria", 
"Actinobacteria"), Order = c("Xanthomonadales", "Pseudomonadales", 
"Rhizobiales", "Pseudomonadales", "Pseudomonadales", "Streptomycetales"
), Family = c("Rhodanobacteraceae", "Pseudomonadaceae", "Xanthobacteraceae", 
"Pseudomonadaceae", "Pseudomonadaceae", "Streptomycetaceae"
), Genus = c("Rhodanobacter", "Pseudomonas", "Bradyrhizobium", 
"Pseudomonas", "Pseudomonas", "Streptomyces"), Species = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_)), row.names = c(2352002L, 511171L, 7348565L, 
510815L, 468295L, 621043L), class = "data.frame")

请参阅下面的tidyverse解决方案。请注意,我使用了phyloseq中的GlobalPatterns来创建一个可复制的示例。

require("phyloseq")
require("tidyverse")
# Load the data and melt it
data(GlobalPatterns)
psdf <- psmelt(GlobalPatterns)
# Function to subset a dataframe based on the size of each group
# in a grouping variable
subset_by_freq <- function(df, grouping_var, threshold){
df %>%
group_by(!!sym(grouping_var)) %>%
filter(n() >= threshold) %>%
ungroup()
}
# Filter out taxa with less than 1e5 counts
psdf_sub <- subset_by_freq(psdf, "Phylum", 1e5)
# Sanity check: count the number of rows per taxon
psdf_sub %>%
group_by(Phylum) %>%
tally()
#> # A tibble: 2 x 2
#>   Phylum              n
#>   <chr>           <int>
#> 1 Firmicutes     113256
#> 2 Proteobacteria 166816

创建于2022-08-12由reprex包(v2.0.1(

最新更新