I have a large data frame in R (>250k rows). One column, land_use, holds land-use categories, each with a percentage value:
df <- data.frame(id = c(1:10), land_use = c("Parks/Open Space 100%",
"Single-Family 100%",
"Educational 26% Vacant 74%",
"Educational 50 % Retail 50%",
"Undevelopable 50% Mobile Homes 50%",
"Retail 100%" ,
"Single-Family 12% Vacant 88%",
"Single-Family 71% Agriculture 29%",
"Single-Family 10% Agriculture 20% Vacant 30% Parks/Open Space 40%",
"Parks/Open Space 40% Single-Family 60%"))
In the end, I need each unique category as its own column, filled with the value that sits next to that category:
new_df = structure(list(id = 1:10, land_use = c("Parks/Open Space 100%",
"Single-Family 100%", "Educational 26% Vacant 74%", "Educational 50 % Retail 50%",
"Undevelopable 50% Mobile Homes 50%", "Retail 100%", "Single-Family 12% Vacant 88%",
"Single-Family 71% Agriculture 29%", "Single-Family 10% Agriculture 20% Vacant 30% Parks/Open Space 40%",
"Parks/Open Space 40% Single-Family 60%"), `Parks/Open Space` = c(100,
NA, NA, NA, NA, NA, NA, NA, 40, 40), `Single-Family` = c(NA,
100, NA, NA, NA, NA, 12, 71, 10, 60), Educational = c(NA, NA,
26, 50, NA, NA, NA, NA, NA, NA), Vacant = c(NA, NA, 74, NA, NA,
NA, 88, NA, 30, NA), Undevelopable = c(NA, NA, NA, NA, 50, NA,
NA, NA, NA, NA), Retail = c(NA, NA, NA, 50, NA, 100, NA, NA,
NA, NA), Agriculture = c(NA, NA, NA, NA, NA, NA, NA, 29, 20,
NA), `Mobile Homes` = c(NA, NA, NA, NA, 50, NA, NA, NA, NA, NA
)), row.names = c(NA, -10L), class = "data.frame")
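As one approach, I was thinking of extracting the unique categories, adding a new column for each of them, and then filling every column in a loop. To start: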
categories <- gsub('[[:digit:]]+', '', df$land_use) # remove the numeric values first
categories_split <- strsplit(x = categories, split = '%') # split using the '%' sign
categories_split <- unique(unlist(categories_split)) # there will be some extra-spaces in the column names after that
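However, looping over 260k rows would be slower than I'd like, so I'm wondering whether someone more experienced knows a better/simpler solution (e.g. a vectorized function or dplyr).
Thanks!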
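First use tidyr::separate_rows() to split the multiple entries for each id into rows, then tidyr::separate() to split the category from the percentage, then readr::parse_number() to coerce the percentage to numeric: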
library(tidyverse)
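df_new1 <- df %>%
  # first fix values with extra space before "%" (e.g. "50 %" -> "50%");
  # str_remove_all() handles more than one such case per string
  mutate(land_use = str_remove_all(land_use, "\\s+(?=%)")) %>%
  # one row per "category pct%" entry: split on whitespace that follows a "%"
  separate_rows(land_use, sep = "(?<=%)\\s+") %>%
  # split the category from its trailing percentage
  separate(land_use, into = c("land_use", "pct"), sep = "\\s+(?=\\d+%$)") %>%
  # "100%" -> 100
  mutate(pct = parse_number(pct))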
This gives you a nice tidy data frame which, as @Limey commented, is probably easier to work with:
#> df_new1
# A tibble: 19 × 3
id land_use pct
<int> <chr> <dbl>
1 1 Parks/Open Space 100
2 2 Single-Family 100
3 3 Educational 26
4 3 Vacant 74
5 4 Educational 50
6 4 Retail 50
7 5 Undevelopable 50
8 5 Mobile Homes 50
9 6 Retail 100
10 7 Single-Family 12
11 7 Vacant 88
12 8 Single-Family 71
13 8 Agriculture 29
14 9 Single-Family 10
15 9 Agriculture 20
16 9 Vacant 30
17 9 Parks/Open Space 40
18 10 Parks/Open Space 40
19 10 Single-Family 60
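As a rough illustration of why the long format can be convenient, here is a minimal sketch that summarises each id directly from it. The data frame `long` is a small hand-built stand-in for `df_new1`, and the summary columns (`n_categories`, `dominant`, `total_pct`) are names made up for this example, not part of the answer above.

```r
library(dplyr)

# Hypothetical stand-in for df_new1: one row per id/category pair
long <- data.frame(
  id       = c(1L, 3L, 3L, 9L, 9L, 9L, 9L),
  land_use = c("Parks/Open Space", "Educational", "Vacant",
               "Single-Family", "Agriculture", "Vacant", "Parks/Open Space"),
  pct      = c(100, 26, 74, 10, 20, 30, 40)
)

# One row per id: how many categories it has, which one has the
# largest share, and the total of its shares (a quick sanity check)
summary_by_id <- long %>%
  group_by(id) %>%
  summarise(
    n_categories = n(),
    dominant     = land_use[which.max(pct)],
    total_pct    = sum(pct),
    .groups      = "drop"
  )
```

A check such as `filter(summary_by_id, total_pct != 100)` would then flag any id whose shares do not add up, which is harder to express once the categories are spread over separate columns.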
However, if you do want a column per land-use category, it is just one more step with tidyr::pivot_wider():
df_new2 <- df_new1 %>%
  pivot_wider(names_from = land_use, values_from = pct)
#> df_new2
# A tibble: 10 × 9
id `Parks/Open Space` `Single-Family` Educational Vacant Retail
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 100 NA NA NA NA
2 2 NA 100 NA NA NA
3 3 NA NA 26 74 NA
4 4 NA NA 50 NA 50
5 5 NA NA NA NA NA
6 6 NA NA NA NA 100
7 7 NA 12 NA 88 NA
8 8 NA 71 NA NA NA
9 9 40 10 NA 30 NA
10 10 40 60 NA NA NA
# ℹ 3 more variables: Undevelopable <dbl>, `Mobile Homes` <dbl>,
# Agriculture <dbl>
Setting values_fill = 0 in pivot_wider() fills the unobserved categories with 0 instead of NA.
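A minimal sketch of the `values_fill` option, using a small hand-built long table (`long` is a stand-in for this example, not the full data):

```r
library(tidyr)

# Hypothetical long-format input: two ids, three distinct categories
long <- data.frame(
  id       = c(3L, 3L, 4L, 4L),
  land_use = c("Educational", "Vacant", "Educational", "Retail"),
  pct      = c(26, 74, 50, 50)
)

# Categories an id does not have become 0 instead of NA
wide <- pivot_wider(
  long,
  names_from  = land_use,
  values_from = pct,
  values_fill = 0
)
```

Whether 0 or NA is the better choice depends on the analysis: 0 states that the category is absent, while NA leaves open that it was simply not recorded.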
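Create a temporary column name equal to land_use and split it into rows, filtering out the empty leftovers. Then separate each element of name into a name and a value, and convert to wide form. Note that pivot_wider uses name and value as the defaults for its names_from and values_from arguments.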
library(dplyr)
library(tidyr)
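df %>%
  mutate(name = land_use) %>%
  # split on "%" (with any surrounding spaces); this leaves a trailing ""
  separate_rows(name, sep = " *% *") %>%
  filter(name != "") %>%
  # split at the space that precedes the number
  separate(name, c("name", "value"), sep = " (?=\\d)") %>%
  pivot_wider()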
giving:
# A tibble: 10 × 10
id land_use Parks…¹ Singl…² Educa…³ Vacant Retail Undev…⁴ Mobil…⁵ Agric…⁶
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Parks/Op… 100 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 2 Single-F… <NA> 100 <NA> <NA> <NA> <NA> <NA> <NA>
3 3 Educatio… <NA> <NA> 26 74 <NA> <NA> <NA> <NA>
4 4 Educatio… <NA> <NA> 50 <NA> 50 <NA> <NA> <NA>
5 5 Undevelo… <NA> <NA> <NA> <NA> <NA> 50 50 <NA>
6 6 Retail 1… <NA> <NA> <NA> <NA> 100 <NA> <NA> <NA>
7 7 Single-F… <NA> 12 <NA> 88 <NA> <NA> <NA> <NA>
8 8 Single-F… <NA> 71 <NA> <NA> <NA> <NA> <NA> 29
9 9 Single-F… 40 10 <NA> 30 <NA> <NA> <NA> 20
10 10 Parks/Op… 40 60 <NA> <NA> <NA> <NA> <NA> <NA>
# … with abbreviated variable names ¹`Parks/Open Space`, ²`Single-Family`,
# ³Educational, ⁴Undevelopable, ⁵`Mobile Homes`, ⁶Agriculture
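The value columns in the output above are character (`<chr>`). If numbers are wanted, one option is the `convert = TRUE` argument of `separate()`, which type-converts the new columns. The sketch below shows this on a small hand-built input; `pieces` is a stand-in for the rows produced by the `separate_rows()` and `filter()` steps.

```r
library(tidyr)

# Hypothetical intermediate rows, as they look after separate_rows() + filter()
pieces <- data.frame(
  id   = c(1L, 5L, 5L),
  name = c("Parks/Open Space 100", "Undevelopable 50", "Mobile Homes 50")
)

# convert = TRUE turns the all-digit "value" column into integers,
# while "name" stays character
out <- separate(
  pieces, name, c("name", "value"),
  sep     = " (?=\\d)",
  convert = TRUE
)
```

Adding `convert = TRUE` to the `separate()` call in the pipeline should then yield integer columns after `pivot_wider()`.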
As @Limey suggested, keeping the data tidy will help you in most cases. You could do it like this:
library(tidyr)
library(dplyr)
df |>
  ## replace the last blank with an arbitrary separator (here: ;)
  ## note the doubled backslash: '\\1' is the regex back-reference
  mutate(land_use = land_use |> gsub(pattern = '(^.*) ', replacement = '\\1;')) |>
  separate(land_use,
           into = c('cat', 'val'),
           sep = ';'
  )
Truncated output:
id cat val
1 1 Parks/Open Space 100%
2 2 Single-Family 100%
3 3 Educational 26% Vacant 74%
4 4 Educational 50 % Retail 50%
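If your columns have a regular layout, check out the separate_wider_* functions in recent versions of {tidyr}: separate_wider_delim splits on a delimiter, and separate_wider_position splits at fixed widths.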
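A minimal sketch of `separate_wider_delim()`, assuming tidyr 1.3.0 or later and a made-up input (`pairs`) in which the category and value are already joined by `;`:

```r
library(tidyr)  # separate_wider_delim() needs tidyr >= 1.3.0

# Hypothetical input: category and value joined by ";"
pairs <- data.frame(
  id       = 1:2,
  land_use = c("Retail;100%", "Single-Family;100%")
)

# Split land_use at ";" into two new columns; the original column is dropped
out <- separate_wider_delim(
  pairs, land_use,
  delim = ";",
  names = c("cat", "val")
)
```

Unlike `separate()`, the delimiter here is a fixed string rather than a regular expression, and a row that splits into the wrong number of pieces raises an error unless `too_few` / `too_many` are set.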
However, if you really do need one-hot encoding (one column per feature category), see the following overview on statistics.