在R中处理基因组数据。
我试图对一个非常大的融化的门表进行子集划分,其中包括一列门ID,以便删除表中出现次数少于100000次的包含门的行。我可能错过了一个";"容易";但我最终还是尝试了自己的功能。
功能:
phylum_subset <- function(x = melt.ALKSS_few, #melted physeq object
Count = melt.ALKSS$Phylum, #counting phyla
Value = 1000 #minimum number of OTUs
){
phyla.table <- table(x$Count)
for(Count in x){if(phyla.table[Count]<=100000)
subset(x,Phylum != Count)
}
}
我承认这是我第一次写函数,我真的不知道自己在做什么。
我的函数输入和由此产生的错误输出结果如下:
melt.ALKSS_few.count <- phylum_subset(x = melt.ALKSS_few,Count = melt.ALKSS_few$Phylum,Value = 100000)
Error in if (phyla.table[Count] <= 1e+05) subset(x, Count != Phylum_col) :
the condition has length > 1
因为我试图用一列中所有事件的总和来子集,所以我不能只使用filter((或其他东西一次(除非我想这样做500次(。肯定有人以前做过这样的事吗?
编辑:好的,试着提供一个可复制的数据集块。请注意,它有超过808k个obs的47个变量,因为在生态数据集上做基因组学是一团糟。我已经删除了一些变量,这些变量是之前步骤(引物序列等(元数据的残余,我不会在分析中使用这些变量,只是为了保留代码块。。。质量较小。
> dput(droplevels(head(melt.ALKSS_few)))
structure(list(OTU = c("44c21e29adae97a53247abbd73978395", "0f18144d308ada95632ab5193d92073f",
"d829bee4984f82ffc2453212157caf96", "0f18144d308ada95632ab5193d92073f",
"0ddcd311e02f742e2e0e61ce02cf9c29", "120eba657e42a11a5c29f97b90f02035"
), Sample = c("S438", "S680", "S437", "S345", "S454", "S513"),
Abundance = c(10755, 9568, 8186, 7621, 7506, 7501), BarcodeSequence = c("CATTTTAGGACT",
"CGGAATAGAGTA", "CATTTTAGAGTA", "TATAATGGACCA", "CGGAATTGGCAT",
"GACGACGGACCA"), PrimerDesc = c("16S",
"16S", "16S", "16S", "16S", "16S"), SampleName = c("06222021KC-2-R",
"09292021KC-2-R", "06222021KC-1-R", "06032021KC-1-R", "06292921KC-3-R",
"06302021KC-3-R"), Project = c("16SLBSKR1-", "16SLBSKR2-",
"16SLBSKR1-", "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR2-"),
Number = c("456", "694", "455", "363", "471", "491"), Date = c("6_22_2021", "9_29_2021", "6_22_2021", "6_3_2021",
"6_29_2021", "6_30_2021"), Year = c(2021L, 2021L, 2021L,
2021L, 2021L, 2021L), Season = c("Summer", "Fall", "Summer",
"Summer", "Summer", "Summer"), sample_Species = c("Little_Bluestem",
"Little_Bluestem", "Little_Bluestem", "Little_Bluestem",
"Little_Bluestem", "Little_Bluestem"), SoloOrMixed = c("Solo",
"Mixed", "Solo", "Mixed", "Mixed", "Solo"), Location = c("Tyler_SP", "Hy_180", "Tyler_SP", "Roadside_Hy67",
"Copper_Breaks_SP", "Caprock_Canyons_SP"), Ecoregion = c("South_Central_Plains",
"South_Central_Plains", "South_Central_Plains", "Edwards_Plateau",
"Southwestern_Tablelands", "Southwestern_Tablelands"), Habitat = c("Forest",
"Roadside", "Forest", "Roadside", "Roadside", "AridRock"),
Source = c("Root", "Root", "Root", "Root", "Root", "Root"
), PrecipMonth = c(96.65,
37.45, 96.65, 125.94, 125.01, 153.94), PrecipDaysSince = c(1L,
1L, 1L, 1L, 1L, 0L), pH = c(6.8, 6.7, 6.8, 8, 8, 7.8), EC = c(139L,
182L, 139L, 161L, 125L, 2370L), NO3 = c(0, 4.4, 0, 0.2, 2.2,
3.4), P = c(16L, 17L, 16L, 14L, 5L, 6L), K = c(145L, 84L,
145L, 114L, 160L, 65L), Ca = c(3918L, 2159L, 3918L, 27256L,
6609L, 16508L), Mg = c(166L, 130L, 166L, 188L, 148L, 95L),
S = c(10L, 16L, 10L, 24L, 24L, 14299L), Na = c(4L, 3L, 4L,
4L, 4L, 4L), Fe = c(19.76, 17, 19.76, 2.31, 1, 0), Zn = c(2.28,
15.1, 2.28, 7.01, 0.8, 0.1), Mn = c(64.16, 19, 64.16, 27.01,
15, 6), Cu = c(0.16, 0.2, 0.16, 0.16, 0.2, 0.2), Kingdom = c("d__Bacteria",
"d__Bacteria", "d__Bacteria", "d__Bacteria", "d__Bacteria",
"d__Bacteria"), Phylum = c("Proteobacteria", "Proteobacteria",
"Proteobacteria", "Proteobacteria", "Proteobacteria", "Actinobacteriota"
), Class = c("Gammaproteobacteria", "Gammaproteobacteria",
"Alphaproteobacteria", "Gammaproteobacteria", "Gammaproteobacteria",
"Actinobacteria"), Order = c("Xanthomonadales", "Pseudomonadales",
"Rhizobiales", "Pseudomonadales", "Pseudomonadales", "Streptomycetales"
), Family = c("Rhodanobacteraceae", "Pseudomonadaceae", "Xanthobacteraceae",
"Pseudomonadaceae", "Pseudomonadaceae", "Streptomycetaceae"
), Genus = c("Rhodanobacter", "Pseudomonas", "Bradyrhizobium",
"Pseudomonas", "Pseudomonas", "Streptomyces"), Species = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), row.names = c(2352002L, 511171L, 7348565L,
510815L, 468295L, 621043L), class = "data.frame")
请参阅下面的tidyverse
解决方案。请注意,我使用了phyloseq
中的GlobalPatterns
来创建一个可复制的示例。
require("phyloseq")
require("tidyverse")
# Load the data and melt it
data(GlobalPatterns)
psdf <- psmelt(GlobalPatterns)
# Function to subset a dataframe based on the size of each group
# in a grouping variable
subset_by_freq <- function(df, grouping_var, threshold){
df %>%
group_by(!!sym(grouping_var)) %>%
filter(n() >= threshold) %>%
ungroup()
}
# Filter out taxa with less than 1e5 counts
psdf_sub <- subset_by_freq(psdf, "Phylum", 1e5)
# Sanity check: count the number of rows per taxon
psdf_sub %>%
group_by(Phylum) %>%
tally()
#> # A tibble: 2 x 2
#> Phylum n
#> <chr> <int>
#> 1 Firmicutes 113256
#> 2 Proteobacteria 166816
创建于2022-08-12由reprex包(v2.0.1(