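I am writing a very simple if-else block to create a new variable that bins another variable into quartiles. This seems like it should be straightforward, but the block puts all of my data into a single group, between the median and the third quartile, which contradicts the definition of quartiles.
Here is the structure of my data: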
> str(tmp)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 435 obs. of 12 variables:
$ CD112FP : chr "01" "02" "03" "04" ...
$ State : chr "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
$ Year : num 2011 2011 2011 2011 2011 ...
$ Alignment : num 0 0 0 0 0 0 1 0 0 0 ...
$ State_Aligned : num 0 0 0 0 0 0 0 1 0 0 ...
$ PercentFunding : num 0.0658 0.29 0.6764 0.0174 0.047 ...
$ fips : chr "01" "01" "01" "01" ...
$ ssa : int 1 1 1 1 1 1 1 NA 3 3 ...
$ region : int 3 3 3 3 3 3 3 NA 4 4 ...
$ division : int 6 6 6 6 6 6 6 NA 8 8 ...
$ abb : chr "AL" "AL" "AL" "AL" ...
$ PercentFundingBinned: chr "0.0625-0.1799" "0.0625-0.1799" "0.0625-0.1799" "0.0625-0.1799" ...
Here is the head of my data:
head(tmp)
# A tibble: 6 x 12
CD112FP State Year Alignment State_Aligned PercentFunding fips ssa region division abb PercentFundingBinned
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int> <int> <chr> <chr>
1 01 ALABAMA 2011 0 0 0.0658 01 1 3 6 AL 0.0625-0.1799
2 02 ALABAMA 2011 0 0 0.290 01 1 3 6 AL 0.0625-0.1799
3 03 ALABAMA 2011 0 0 0.676 01 1 3 6 AL 0.0625-0.1799
4 04 ALABAMA 2011 0 0 0.0174 01 1 3 6 AL 0.0625-0.1799
5 05 ALABAMA 2011 0 0 0.0470 01 1 3 6 AL 0.0625-0.1799
6 06 ALABAMA 2011 0 0 0.0440 01 1 3 6 AL 0.0625-0.1799
I am using the following if-else block:
tmp$PercentFundingBinned <- NULL
if (tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.75)) {
tmp$PercentFundingBinned <- paste0(round(quantile(tmp$PercentFunding, 0.75), 4), "-",
round(max(tmp$PercentFundingBinned), 4))
} else if (tmp$PercentFunding >= median(tmp$PercentFunding)){
tmp$PercentFundingBinned <- paste0(round(median(tmp$PercentFunding),4), "-",
round(quantile(tmp$PercentFunding, 0.75),4))
} else if (tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.25)){
tmp$PercentFundingBinned <- paste0(round(quantile(tmp$PercentFunding, 0.25),4), "-",
round(median(tmp$PercentFunding),4))
} else {
tmp$PercentFundingBinned <- paste0(round(min(tmp$PercentFunding),4), "-",
round(quantile(tmp$PercentFunding, 0.25),4))
}
It returns the following categories:
unique(tmp$PercentFundingBinned)
[1] "0.0625-0.1799"
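I'm not sure what to do or how to fix this. It seems like it should be a very simple process. Any suggestions would help, thanks!
I'd suggest that you don't need ifelse at all.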
tmp <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
CD112FP State Year Alignment State_Aligned PercentFunding fips ssa region division abb PercentFundingBinned
1 01 ALABAMA 2011 0 0 0.0658 01 1 3 6 AL 0.0625-0.1799
2 02 ALABAMA 2011 0 0 0.290 01 1 3 6 AL 0.0625-0.1799
3 03 ALABAMA 2011 0 0 0.676 01 1 3 6 AL 0.0625-0.1799
4 04 ALABAMA 2011 0 0 0.0174 01 1 3 6 AL 0.0625-0.1799
5 05 ALABAMA 2011 0 0 0.0470 01 1 3 6 AL 0.0625-0.1799
6 06 ALABAMA 2011 0 0 0.0440 01 1 3 6 AL 0.0625-0.1799 ")
quants <- quantile(tmp$PercentFunding, c(0, 0.25, 0.5, 0.75, 1))
quants
# 0% 25% 50% 75% 100%
# 0.01740 0.04475 0.05640 0.23395 0.67600
cuts <- cut(tmp$PercentFunding,
quants, include.lowest = TRUE, dig.lab = 4,
labels = sprintf("%0.04f-%0.04f", head(quants, n = -1), quants[-1]))
cuts
# [1] 0.0564-0.2339 0.2339-0.6760 0.2339-0.6760 0.0174-0.0447 0.0447-0.0564 0.0174-0.0447
# Levels: 0.0174-0.0447 0.0447-0.0564 0.0564-0.2339 0.2339-0.6760
Admittedly, the result is a factor, but it can easily be converted with as.character if needed.
tmp$PercentFundingBinned <- as.character(cuts)
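The quantile-and-cut steps above can be wrapped into a small reusable function. This is only a sketch: the name quartile_bins, the digits argument, and the use of na.rm = TRUE are assumptions added for illustration and are not part of the answer above. It also assumes the quartiles are distinct, since cut rejects duplicated breaks.

```r
# Hypothetical helper (not from the original answer): bin a numeric vector
# into quartile ranges and return the range labels as character strings.
quartile_bins <- function(x, digits = 4) {
  # Quartile break points. na.rm = TRUE is an added assumption so that
  # missing values do not make quantile() fail.
  quants <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

  # Build labels such as "0.0174-0.0447" from consecutive break points.
  fmt <- paste0("%0.", digits, "f-%0.", digits, "f")
  labels <- sprintf(fmt, head(quants, n = -1), quants[-1])

  # include.lowest = TRUE keeps the minimum value inside the first bin.
  as.character(cut(x, breaks = quants, labels = labels, include.lowest = TRUE))
}

# The six PercentFunding values shown in the question's head(tmp) output.
funding <- c(0.0658, 0.290, 0.676, 0.0174, 0.0470, 0.0440)
bins <- quartile_bins(funding)
table(bins)
```

With a helper like this, the assignment would read tmp$PercentFundingBinned <- quartile_bins(tmp$PercentFunding).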
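I strongly recommend that you always pay attention to warnings.
The if statement should not be used with a vector condition because, as the warning shows, only the first element is used: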
> if(c(TRUE, FALSE)) 1 else 2
[1] 1
Warning message:
In if (c(TRUE, FALSE)) 1 else 2 :
the condition has length > 1 and only the first element will be used
> if(c(FALSE, TRUE)) 1 else 2
[1] 2
Warning message:
In if (c(FALSE, TRUE)) 1 else 2 :
the condition has length > 1 and only the first element will be used
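Here is what happens in your case: the first value is 0.0658, so if decides that it falls in the bin 0.0625-0.1799. Because you then assign a single value to a whole column, that value is given to every element of the vector.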
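The effect can be reproduced on a few values. The sketch below uses a made-up data frame df and made-up labels ("upper half", "lower half"). It takes the first element of the condition explicitly with [1], because since R 4.2.0 a condition of length greater than one is an error rather than a warning.

```r
# Made-up values for illustration; 0.0658 is the first value in the question.
df <- data.frame(x = c(0.0658, 0.290, 0.676, 0.0174))

# `if` needs a single TRUE/FALSE. Taking the first element explicitly mimics
# what older R versions did implicitly (with a warning).
if ((df$x >= median(df$x))[1]) {
  df$bin <- "upper half"   # a length-1 value is recycled to every row
} else {
  df$bin <- "lower half"
}

# Only one label appears, even though 0.290 and 0.676 lie above the median.
unique(df$bin)
```

Only the first row was ever tested, and the single string on the right-hand side was recycled across the whole column.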
Instead, you can use ifelse:
# Vectorised version: ifelse() tests every element, not just the first.
tmp$PercentFundingBinned <- ifelse(
  tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.75),
  # Top bin: the upper bound is the maximum of the numeric column.
  paste0(round(quantile(tmp$PercentFunding, 0.75), 4), "-",
         round(max(tmp$PercentFunding), 4)),
  ifelse(
    tmp$PercentFunding >= median(tmp$PercentFunding),
    paste0(round(median(tmp$PercentFunding), 4), "-",
           round(quantile(tmp$PercentFunding, 0.75), 4)),
    ifelse(
      tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.25),
      paste0(round(quantile(tmp$PercentFunding, 0.25), 4), "-",
             round(median(tmp$PercentFunding), 4)),
      paste0(round(min(tmp$PercentFunding), 4), "-",
             round(quantile(tmp$PercentFunding, 0.25), 4))
    )
  )
)
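One detail worth checking when choosing between the two answers: the nested ifelse tests use >=, so a value exactly equal to a quartile goes into the higher bin, whereas cut by default closes its intervals on the right and puts that value into the lower bin. The sketch below uses made-up data (1 to 5, so the quartiles coincide with data values) to show the difference and how right = FALSE aligns cut with the >= logic.

```r
# Made-up data whose quartiles fall exactly on data values.
x <- c(1, 2, 3, 4, 5)
breaks <- unname(quantile(x, c(0, 0.25, 0.5, 0.75, 1)))  # 1, 2, 3, 4, 5

# Default cut(): intervals are closed on the right, so 2 lands in bin 1.
right_closed <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)

# right = FALSE: intervals are closed on the left, so 2 lands in bin 2.
left_closed <- cut(x, breaks, include.lowest = TRUE, right = FALSE,
                   labels = FALSE)

# Bin numbers produced by the same >= tests as the nested ifelse() above.
nested <- ifelse(x >= breaks[4], 4L,
          ifelse(x >= breaks[3], 3L,
          ifelse(x >= breaks[2], 2L, 1L)))

right_closed  # 1 1 2 3 4
left_closed   # 1 2 3 4 4
nested        # 1 2 3 4 4
```

Either convention is a valid quartile binning; what matters is applying the same one consistently if results from both approaches are compared.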