R - How do I convert height measurements to a uniform format?



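I have a data set that includes many height measurements stored as character variables. Some are written as "5ft 7in", some as "170cm", some as "1.7m". Others are just "170".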

I want to change them so that they all appear as numeric variables without units of measurement (for example, just 170).

Data wrangling is a very interesting business, involving a fair amount of stumbling around and edge cases:

heights <- c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8")
heights
[1] "5ft 7"  "170cm"  "1.7m"   "6' 7"   "150"    "5' 2\"" "5ft8"

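But it offers the chance to explore many tools. Aiming for a uniform measure, say centimeters, we first index the unit symbols we were given: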

b4meas <- gsub('[0-9. ]', '', heights)
b4meas
[1] "ft"  "cm"  "m"   "'"   ""    "'\"" "ft"

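In the gsub call, '[0-9. ]' matches every digit, dot, and space; replacing those with '' leaves everything that is not a digit, dot, or space. We may then want to index the different cases that need conversion: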

which(b4meas== 'ft')
[1] 1 7
which(b4meas== '')
[1] 5
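
One way to see every notation's positions at once, sketched here rather than taken from the answer itself, is to let `split()` group the positions by unit symbol. The name `idx_by_unit` is made up for illustration.

```r
heights <- c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8")
b4meas <- gsub("[0-9. ]", "", heights)

# Group the positions 1..7 by the unit symbol found at each position
idx_by_unit <- split(seq_along(heights), b4meas)

idx_by_unit[["ft"]]  # positions written with "ft"
idx_by_unit[["cm"]]  # positions written with "cm"
table(b4meas)        # how many entries use each notation
```

This gives the same information as the separate `which()` calls, just in a single list.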

And explore the numbers:

char_num <- gsub("[a-z']", '', heights, perl=TRUE)
char_num
[1] "5 7"   "170"   "1.7"   "6 7"   "150"   "5 2\"" "58"
> which(nchar(char_num) == 2 & b4meas=='ft')
[1] 7
> which(nchar(char_num) == 3 & b4meas=='ft')
[1] 1
> which(nchar(char_num) == 3 & b4meas=="'")
[1] 4
> which(b4meas=="'\"")
[1] 6
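
The `nchar()` tests above are tied to this particular sample: an entry such as "5ft 10" would be four characters long after cleaning and slip through. A hedged alternative is to describe the feet-and-inches shape with a pattern. The regular expression below is an assumption about what such an entry looks like, and `ft_in_pattern` and `is_ft_in` are illustrative names.

```r
heights <- c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8")

# Assumed shape: digits, optional spaces, "ft" or an apostrophe,
# optional spaces, then more digits (the inches)
ft_in_pattern <- "^[0-9]+ *(ft|') *[0-9]+"

is_ft_in <- grepl(ft_in_pattern, heights)
which(is_ft_in)                  # positions that look like feet and inches
grepl(ft_in_pattern, "5ft 10")   # a two-digit inch value still matches
```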

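So our heterogeneous foot notations can be indexed too. Our centimeter-based measurements need no unit conversion and can be indexed the same way (here combined with the `'` case):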

which(nchar(char_num) == 3 & b4meas=="'" | b4meas == 'cm')
[1] 2 4

So, let's see what we have here:

split_char <- strsplit(char_num, ' ')
> split_char
[[1]]
[1] "5" "7"
[[2]]
[1] "170"
[[3]]
[1] "1.7"
[[4]]
[1] "6" "7"
[[5]]
[1] "150"
[[6]]
[1] "5"   "2\""
[[7]]
[1] "58"

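So, [[2]] & [[5]] need no conversion and can be written straight to another column. [[3]] just needs * 100, [[1]] & [[4]] can be computed directly, [[6]] needs further cleaning, and [[7]] needs an extra split.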

sum(as.numeric(split_char[[1]][1])*12 * 2.54, as.numeric(split_char[[1]][2]) * 2.54)
[1] 170.18
# for [[6]]
sum(as.numeric(split_char[[6]][1]) * 12 * 2.54, eval(as.numeric(gsub('\"', '', split_char[[6]][2])) * 2.54))
[1] 157.48
# either `eval` or `force` can be used to avoid
# Error in gsub( non-numeric argument to binary operator
# for [[7]]
sum(as.numeric(strsplit(split_char[[7]], '')[[1]][1])*12 *2.54, as.numeric(strsplit(split_char[[7]],'')[[1]][2]) * 2.54)
[1] 172.72
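
The three calculations above share the same arithmetic, (feet * 12 + inches) * 2.54, and differ only in how the two numbers are pulled out. As a sketch under that assumption, `regmatches()` with `gregexpr()` can extract the digit runs from the raw strings, which sidesteps both the stray quote in [[6]] and the missing space in [[7]]. The helper `ft_in_to_cm` and the vector `heights_ft` are illustrative, not part of the original answer.

```r
# Illustrative helper: feet and inches to centimetres
ft_in_to_cm <- function(feet, inches = 0) (feet * 12 + inches) * 2.54

# Only the feet-and-inches entries from the sample data
heights_ft <- c("5ft 7", "6' 7", "5' 2\"", "5ft8")

# Each list element holds the runs of digits found in one string
parts <- regmatches(heights_ft, gregexpr("[0-9]+", heights_ft))

feet   <- as.numeric(vapply(parts, `[`, character(1), 1))
inches <- as.numeric(vapply(parts, `[`, character(1), 2))

ft_in_to_cm(feet, inches)
```

This assumes every entry really contains two numbers; an entry like "6ft" would give an `NA` for the inches and would need its own handling.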

OK, so we can convert, but wait, we have a data.frame! So we'll use our indexes and conversions to fill it in. One hopes...

> physio_df <- data.frame(heights)
> physio_df[['heights_cm']] <- NA_real_ # add column to convert to
> physio_df
heights heights_cm
1   5ft 7         NA
2   170cm         NA
3    1.7m         NA
4    6' 7         NA
5     150         NA
6   5' 2"         NA
7    5ft8         NA

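What a miracle: some of our cases got simpler just by moving to a data frame. But it also means it is worth recomputing b4meas to reflect this (if you are already working in a data.frame, you won't need to do this).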

# [[5]] just take to numeric
physio_df$heights_cm[which(nchar(physio_df$heights) ==3)] <- physio_df$heights[as.numeric(which(nchar(physio_df$heights) ==3))] 
# [[3]] metres to centimetres
physio_df$heights_cm[b4meas== 'm'] <- as.numeric(char_num[b4meas == 'm'])* 100
b4meas2 <- gsub('[0-9. ]', '', physio_df$heights)
> b4meas2
[1] "ft"  "cm"  "m"   "'"   ""    "'\"" "ft"
physio_df$heights[[6]]
[1] "5' 2\""

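Oh, so it wasn't actually a miracle; b4meas is still a valid index. The nice thing about indexing is that if several cases meet the condition, all of them get handled at once.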
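
To make that last point concrete, here is a small sketch on made-up data (the vector `sample_heights` is hypothetical and contains two entries per unit). One assignment per unit fills every matching position, however many there are.

```r
# Hypothetical sample with two entries per unit
sample_heights <- c("170cm", "1.7m", "165cm", "1.82m")

units  <- gsub("[0-9. ]", "", sample_heights)           # "cm" "m" "cm" "m"
values <- as.numeric(gsub("[a-z]", "", sample_heights)) # numeric part only

cm <- rep(NA_real_, length(sample_heights))
cm[units == "cm"] <- values[units == "cm"]        # both cm entries at once
cm[units == "m"]  <- values[units == "m"] * 100   # both m entries at once
cm
```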

#let's make an index for [[1]] & [[4]] but not [[6]]
one_four_type <- setdiff(which(sapply(split_char, function(x) length(x) == 2)), which(b4meas == "'\""))
# and use in a `for` loop, should `sapply`, data has killed brain
for(i in 1:length(one_four_type)){
+ physio_df$heights_cm[one_four_type[i]] <-
+ sum(as.numeric(split_char[[one_four_type[i]]][1])*12 * 2.54,
+ as.numeric(split_char[[one_four_type[i]]][2]) * 2.54)
+ }
physio_df
heights heights_cm
1   5ft 7     170.18
2   170cm       <NA>
3    1.7m        170
4    6' 7     200.66
5     150        150
6   5' 2"       <NA>
7    5ft8       <NA>
# physio_df$heights_cm[2]
physio_df$heights_cm[which(b4meas=='cm')] <- as.numeric(char_num[b4meas=='cm'])
# physio_df$heights_cm[6]
> physio_df$heights_cm[which(b4meas == "'\"")] <-
+ sum(as.numeric(split_char[[6]][1]) * 12 * 2.54, eval(as.numeric(gsub('\"', '', split_char[[6]][2])) * 2.54))
# physio_df$heights_cm[7]
physio_df$heights_cm[7] <- sum(as.numeric(strsplit(split_char[[7]], '')[[1]][1])*12 *2.54, as.numeric(strsplit(split_char[[7]],'')[[1]][2]) * 2.54)
> physio_df
heights heights_cm
1   5ft 7     170.18
2   170cm        170
3    1.7m        170
4    6' 7     200.66
5     150        150
6   5' 2"     157.48
7    5ft8     172.72
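
As a closing sketch, and not the method used above, the same rules can be folded into one per-entry function and applied with `vapply()`, which the comment about the `for` loop hints at. The function name `to_cm_one` is made up, and treating a bare number such as "150" as centimetres is an assumption carried over from the walkthrough. Unlike the step-by-step version, the resulting column stays numeric.

```r
# Hypothetical helper: convert one height string to centimetres
to_cm_one <- function(h) {
  unit <- gsub("[0-9. ]", "", h)                                   # unit symbol, as in b4meas
  nums <- as.numeric(regmatches(h, gregexpr("[0-9.]+", h))[[1]])   # numeric runs
  if (unit %in% c("cm", "")) {
    nums[1]                          # assumed to be centimetres already
  } else if (unit == "m") {
    nums[1] * 100                    # metres to centimetres
  } else if (unit %in% c("ft", "'", "'\"")) {
    inches <- if (length(nums) > 1) nums[2] else 0
    (nums[1] * 12 + inches) * 2.54   # feet and inches to centimetres
  } else {
    NA_real_                         # unknown notation: leave missing
  }
}

physio_df <- data.frame(
  heights = c("5ft 7", "170cm", "1.7m", "6' 7", "150", "5' 2\"", "5ft8"),
  stringsAsFactors = FALSE
)
physio_df$heights_cm <- vapply(physio_df$heights, to_cm_one, numeric(1),
                               USE.NAMES = FALSE)
physio_df
```

Any notation not listed in the function comes back as `NA`, so new edge cases show up as missing values instead of being silently misconverted.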
