R - Parsing values within rows into new columns



I have a large data frame in R (>250k rows). It has a column land_use that holds land-use categories together with their values (%):

df <- data.frame(id = c(1:10), land_use = c("Parks/Open Space 100%", 
"Single-Family 100%", 
"Educational 26%  Vacant 74%", 
"Educational 50 % Retail 50%", 
"Undevelopable 50% Mobile Homes 50%",
"Retail 100%" , 
"Single-Family 12%  Vacant 88%", 
"Single-Family 71% Agriculture 29%", 
"Single-Family 10% Agriculture 20% Vacant 30% Parks/Open Space 40%",
"Parks/Open Space 40% Single-Family 60%"))

In the end, I need to append each unique category as a separate column and insert the adjacent value into that column:

new_df = structure(list(id = 1:10, land_use = c("Parks/Open Space 100%", 
"Single-Family 100%", "Educational 26%  Vacant 74%", "Educational 50 % Retail 50%", 
"Undevelopable 50% Mobile Homes 50%", "Retail 100%", "Single-Family 12%  Vacant 88%", 
"Single-Family 71% Agriculture 29%", "Single-Family 10% Agriculture 20% Vacant 30% Parks/Open Space 40%", 
"Parks/Open Space 40% Single-Family 60%"), `Parks/Open Space` = c(100, 
NA, NA, NA, NA, NA, NA, NA, 40, 40), `Single-Family` = c(NA, 
100, NA, NA, NA, NA, 12, 71, 10, 60), Educational = c(NA, NA, 
26, 50, NA, NA, NA, NA, NA, NA), Vacant = c(NA, NA, 74, NA, NA, 
NA, 88, NA, 30, NA), Undevelopable = c(NA, NA, NA, NA, 50, NA, 
NA, NA, NA, NA), Retail = c(NA, NA, NA, 50, NA, 100, NA, NA, 
NA, NA), Agriculture = c(NA, NA, NA, NA, NA, NA, NA, 29, 20, 
NA), `Mobile Homes` = c(NA, NA, NA, NA, 50, NA, NA, NA, NA, NA
)), row.names = c(NA, -10L), class = "data.frame")

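As an approach, I was thinking of extracting the unique categories, adding new columns based on them, and then looping to fill each column. As a start: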

categories <- gsub('[[:digit:]]+', '', df$land_use) # remove the numeric values first
categories_split <- strsplit(x = categories, split = '%') # split using the '%' sign
categories_split <- unique(unlist(categories_split)) # there will be some extra-spaces in the column names after that

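However, looping over 260k rows will be slower than I would like, so I wonder whether someone more experienced can suggest a better/simpler solution (e.g. a vectorized function or dplyr).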

Thanks!

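First use tidyr::separate_rows() to split the multiple entries for each id, then tidyr::separate() to split the category from the percentage, and finally readr::parse_number() to coerce the percentage to numeric: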

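library(tidyverse)
df_new1 <- df %>%
  # first fix values with extra space before "%"
  mutate(land_use = str_remove(land_use, "\\s+(?=%)")) %>%
  # one row per "category NN%" entry: split on whitespace that follows a "%"
  separate_rows(land_use, sep = "(?<=%)\\s+") %>%
  # split category from percentage at the whitespace before the trailing "NN%"
  separate(land_use, into = c("land_use", "pct"), sep = "\\s+(?=\\d+%$)") %>%
  mutate(pct = parse_number(pct))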

This gives you the following nice, tidy data frame which, as @Limey commented, will probably be easier to work with:

#> df_new1
# A tibble: 19 × 3
      id land_use           pct
   <int> <chr>            <dbl>
 1     1 Parks/Open Space   100
 2     2 Single-Family      100
 3     3 Educational         26
 4     3 Vacant              74
 5     4 Educational         50
 6     4 Retail              50
 7     5 Undevelopable       50
 8     5 Mobile Homes        50
 9     6 Retail             100
10     7 Single-Family       12
11     7 Vacant              88
12     8 Single-Family       71
13     8 Agriculture         29
14     9 Single-Family       10
15     9 Agriculture         20
16     9 Vacant              30
17     9 Parks/Open Space    40
18    10 Parks/Open Space    40
19    10 Single-Family       60
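
To illustrate why the long layout can be convenient, here is a minimal sketch that checks whether the percentages of each id add up to 100. It rebuilds a few rows of df_new1 by hand; the names tidy_sample and totals are made up for this illustration and are not part of the answer above.

```r
library(dplyr)

# Hand-built sample in the same long layout as df_new1 (illustration only)
tidy_sample <- tibble(
  id       = c(1L, 3L, 3L, 9L, 9L, 9L, 9L),
  land_use = c("Parks/Open Space", "Educational", "Vacant",
               "Single-Family", "Agriculture", "Vacant", "Parks/Open Space"),
  pct      = c(100, 26, 74, 10, 20, 30, 40)
)

# One row per id: total share and number of categories
totals <- tidy_sample %>%
  group_by(id) %>%
  summarise(total = sum(pct), n_categories = n())

totals
```

In the wide layout the same check would need a row-wise sum across all category columns, which is one reason to keep the long version around.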

However, if you do want one column per land-use category, it only takes one more step with tidyr::pivot_wider():

df_new2 <- df_new1 %>%
  pivot_wider(names_from = land_use, values_from = pct)
#> df_new2
# A tibble: 10 × 9
      id `Parks/Open Space` `Single-Family` Educational Vacant Retail
   <int>              <dbl>           <dbl>       <dbl>  <dbl>  <dbl>
 1     1                100              NA          NA     NA     NA
 2     2                 NA             100          NA     NA     NA
 3     3                 NA              NA          26     74     NA
 4     4                 NA              NA          50     NA     50
 5     5                 NA              NA          NA     NA     NA
 6     6                 NA              NA          NA     NA    100
 7     7                 NA              12          NA     88     NA
 8     8                 NA              71          NA     NA     NA
 9     9                 40              10          NA     30     NA
10    10                 40              60          NA     NA     NA
# ℹ 3 more variables: Undevelopable <dbl>, `Mobile Homes` <dbl>,
#   Agriculture <dbl>

By setting values_fill = 0 in pivot_wider(), unobserved categories are filled with 0 instead of NA.
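
A minimal sketch of the values_fill argument, using a small hand-built long-format sample (long_sample and wide_sample are illustrative names, not objects from the answer):

```r
library(tidyr)

# Small long-format sample shaped like df_new1 (illustration only)
long_sample <- data.frame(
  id       = c(1L, 2L, 2L),
  land_use = c("Retail", "Educational", "Vacant"),
  pct      = c(100, 26, 74)
)

wide_sample <- pivot_wider(
  long_sample,
  names_from  = land_use,
  values_from = pct,
  values_fill = 0   # categories missing for an id become 0 instead of NA
)

wide_sample
```

Whether 0 or NA is the better choice depends on whether "not listed" really means a 0% share in your data.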

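Create a temporary column name, equal to land_use, and split it into rows, filtering out the empty leftovers. Then separate each element of name into name and value and convert to wide form. Note that pivot_wider uses name and value as the defaults for its names_from and values_from arguments.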

library(dplyr)
library(tidyr)
df %>%
  mutate(name = land_use) %>%
  # split on "%" including any surrounding spaces
  separate_rows(name, sep = " *% *") %>%
  filter(name != "") %>%
  # split at the space that precedes the number
  separate(name, c("name", "value"), sep = " (?=\\d)") %>%
  pivot_wider()

giving:

# A tibble: 10 × 10
      id land_use  Parks…¹ Singl…² Educa…³ Vacant Retail Undev…⁴ Mobil…⁵ Agric…⁶
   <int> <chr>     <chr>   <chr>   <chr>   <chr>  <chr>  <chr>   <chr>   <chr>
 1     1 Parks/Op… 100     <NA>    <NA>    <NA>   <NA>   <NA>    <NA>    <NA>
 2     2 Single-F… <NA>    100     <NA>    <NA>   <NA>   <NA>    <NA>    <NA>
 3     3 Educatio… <NA>    <NA>    26      74     <NA>   <NA>    <NA>    <NA>
 4     4 Educatio… <NA>    <NA>    50      <NA>   50     <NA>    <NA>    <NA>
 5     5 Undevelo… <NA>    <NA>    <NA>    <NA>   <NA>   50      50      <NA>
 6     6 Retail 1… <NA>    <NA>    <NA>    <NA>   100    <NA>    <NA>    <NA>
 7     7 Single-F… <NA>    12      <NA>    88     <NA>   <NA>    <NA>    <NA>
 8     8 Single-F… <NA>    71      <NA>    <NA>   <NA>   <NA>    <NA>    29
 9     9 Single-F… 40      10      <NA>    30     <NA>   <NA>    <NA>    20
10    10 Parks/Op… 40      60      <NA>    <NA>   <NA>   <NA>    <NA>    <NA>
# … with abbreviated variable names ¹`Parks/Open Space`, ²`Single-Family`,
#   ³Educational, ⁴Undevelopable, ⁵`Mobile Homes`, ⁶Agriculture
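
Note that the category columns in this output are character (<chr>). As a hedged sketch, passing convert = TRUE to separate() should turn the value column into a number before pivoting. The two-row sample_df and the name wide_num are made up for this illustration:

```r
library(dplyr)
library(tidyr)

# Two rows taken from the question's data (illustration only)
sample_df <- data.frame(
  id = 1:2,
  land_use = c("Retail 100%", "Educational 26%  Vacant 74%")
)

wide_num <- sample_df %>%
  mutate(name = land_use) %>%
  separate_rows(name, sep = " *% *") %>%
  filter(name != "") %>%
  # convert = TRUE runs type conversion on the new columns,
  # so value is stored as a number instead of text
  separate(name, c("name", "value"), sep = " (?=\\d)", convert = TRUE) %>%
  pivot_wider()

wide_num
```

With numeric columns you can also add values_fill = 0 to the pivot_wider() call if zeros are preferred over NA.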

As @Limey suggested, keeping the data tidy will serve you well in most cases. You could do this:

library(tidyr)
library(dplyr)
df |>
  ## replace the last blank with arbitrary separator (here: ;)
  mutate(land_use = land_use |> gsub(pattern = '(^.*) ', replacement = '\\1;')) |>
  separate(land_use,
           into = c('cat', 'val'),
           sep = ';'
  )

Truncated output:

   id                                                           cat  val
1   1                                              Parks/Open Space 100%
2   2                                                 Single-Family 100%
3   3                                       Educational 26%  Vacant  74%
4   4                                       Educational 50 % Retail  50%

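If your columns have a fixed width or a consistent delimiter, have a look at the separate_wider_* functions in recent {tidyr} versions (separate_wider_position for fixed widths, separate_wider_delim for delimiters).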
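
As a minimal sketch of separate_wider_delim, assuming tidyr 1.3.0 or later (where the separate_wider_* family was introduced): the example below reuses the ";" trick from above on a few single-category rows. The names sample_df and split_df are illustrative only.

```r
library(tidyr)

# Single-category rows only, so one split per string is enough (illustration only)
sample_df <- data.frame(
  id = 1:3,
  land_use = c("Parks/Open Space 100%", "Single-Family 100%", "Retail 100%")
)

# Replace the last blank with ";" and then split on that delimiter
sample_df$land_use <- gsub("(^.*) ", "\\1;", sample_df$land_use)

split_df <- separate_wider_delim(
  sample_df,
  land_use,
  delim = ";",
  names = c("cat", "val")
)

split_df
```

Unlike separate(), this function raises an error by default when a string yields too few or too many pieces, which can help catch malformed rows in a large data set.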

However, if you really do need one-hot encoding (one column per feature category), see the following overview on statistics.
