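I want to generate a column in a large data frame that summarises information from the other columns. Here is a very small reproducible example: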
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c("NA" ,"Pinus", "NA", "Lilia"),
Species = c("NA" ,"Pinus_sylvestris", "NA", "Calochortus nuttallii"))
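I want to create a column called tax_rank. Rows whose classification goes down to species should get the value species; rows that only reach a higher rank should get genus or family, as in the following output: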
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c("NA" ,"Pinus", "NA", "Lilia"),
Species = c("NA" ,"Pinus_sylvestris", "NA", "Calochortus nuttallii"),
tax_rank = c("family" ,"species", "family", "species"))
But I would like to do this automatically on a large dataset, using dplyr. Is that possible? Thanks.
In base R, you can apply max.col to the matrix of non-NA indicators and set ties.method = "last" so that the last (right-most) non-NA column in each row is kept.
names(tax)[max.col(!is.na(tax), ties.method = "last")]
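To see why this works, it can help to look at the intermediate steps. The sketch below assumes the version of tax with real NA values (the Data block at the end of this answer); the names present and last_col are illustrative only.

```r
tax <- data.frame(
  Family  = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
  Genus   = c(NA, "Pinus", NA, "Lilia"),
  Species = c(NA, "Pinus_sylvestris", NA, "Calochortus nuttallii"))

# Logical matrix: TRUE where a rank is filled in, FALSE where it is NA
present <- !is.na(tax)

# Within a row, every TRUE ties for the maximum; ties.method = "last"
# returns the index of the right-most one, i.e. the most specific rank
last_col <- max.col(present, ties.method = "last")

# Map the column indices back to column names
names(tax)[last_col]
```

One caveat: a row in which every column is NA is all FALSE, so the tie rule would still return the last column. If such rows can occur, they would need to be handled separately.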
This translates to dplyr as follows:
library(dplyr)
tax %>%
mutate(tax_rank = names(tax)[max.col(!is.na(tax), ties.method = "last")])
#        Family Genus               Species tax_rank
# 1 Brassicacae  <NA>                  <NA>   Family
# 2    Pinaceae Pinus      Pinus_sylvestris  Species
# 3    Rosaceae  <NA>                  <NA>   Family
# 4   Liliaceae Lilia Calochortus nuttallii  Species
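The call above looks up tax by name inside mutate, and it returns capitalised column names rather than the lowercase values asked for in the question. A possible variant is sketched below, assuming dplyr 1.1.0 or later for pick() (on older versions across(Family:Species) should play the same role); ranks and ranked are illustrative names.

```r
library(dplyr)

tax <- data.frame(
  Family  = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
  Genus   = c(NA, "Pinus", NA, "Lilia"),
  Species = c(NA, "Pinus_sylvestris", NA, "Calochortus nuttallii"))

ranked <- tax %>%
  mutate(tax_rank = {
    # pick() returns the selected columns of the data being mutated,
    # so the pipeline no longer depends on the object being called tax
    ranks <- pick(Family:Species)
    # tolower() turns "Family"/"Species" into "family"/"species"
    tolower(names(ranks)[max.col(!is.na(ranks), ties.method = "last")])
  })

ranked
```

Selecting Family:Species explicitly also keeps any unrelated columns in a larger data frame out of the rank calculation.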
Data (note that I converted "NA" to NA):
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c(NA ,"Pinus", NA, "Lilia"),
Species = c(NA ,"Pinus_sylvestris", NA, "Calochortus nuttallii"))
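If the real data arrive with the literal string "NA", as in the question, one way to do that conversion in base R is sketched below. The names tax_raw and tax_clean are hypothetical and used only for illustration.

```r
# Data as posted in the question, with "NA" stored as a character string
tax_raw <- data.frame(
  Family  = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
  Genus   = c("NA", "Pinus", "NA", "Lilia"),
  Species = c("NA", "Pinus_sylvestris", "NA", "Calochortus nuttallii"))

# Replace every cell equal to the string "NA" with a real missing value
tax_clean <- tax_raw
tax_clean[tax_clean == "NA"] <- NA

is.na(tax_clean$Genus)
```

When the data come from a file, it is usually simpler to let the reading function mark missing values on import (for example via the na.strings argument of read.csv).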
First, the data frame should contain real NA values, not the character string "NA":
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c(NA ,"Pinus",NA, "Lilia"),
Species = c(NA,"Pinus_sylvestris", NA, "Calochortus nuttallii"))
Then the column you want can be built like this:
library(dplyr)
tax %>% mutate(tax_rank = ifelse(!is.na(Species), "species", ifelse(!is.na(Genus), "genus", "family")))
This is the output:
       Family Genus               Species tax_rank
1 Brassicacae  <NA>                  <NA>   family
2    Pinaceae Pinus      Pinus_sylvestris  species
3    Rosaceae  <NA>                  <NA>   family
4   Liliaceae Lilia Calochortus nuttallii  species
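Nested ifelse calls get harder to read as more ranks are added. As a sketch of the same logic, dplyr's case_when evaluates the conditions in order and uses the first one that matches; ranked is an illustrative name.

```r
library(dplyr)

tax <- data.frame(
  Family  = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
  Genus   = c(NA, "Pinus", NA, "Lilia"),
  Species = c(NA, "Pinus_sylvestris", NA, "Calochortus nuttallii"))

ranked <- tax %>%
  mutate(tax_rank = case_when(
    !is.na(Species) ~ "species",  # most specific rank is checked first
    !is.na(Genus)   ~ "genus",
    TRUE            ~ "family"    # fallback when neither is filled in
  ))

ranked
```

Adding another rank (for example a hypothetical Order column) would then only need one more condition line in the right position.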
Using base R:
tax$tax_rank <- apply(tax, 1, function(x) tail(names(x)[!is.na(x)], 1))
Output:
> tax
       Family Genus               Species tax_rank
1 Brassicacae  <NA>                  <NA>   Family
2    Pinaceae Pinus      Pinus_sylvestris  Species
3    Rosaceae  <NA>                  <NA>   Family
4   Liliaceae Lilia Calochortus nuttallii  Species