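I have a dataset with 63,000 rows in R. One of the columns is formatted like "Fireplace", "Garage", "First floor with balcony", "On street parking", etc. These are property features taken from listings on a sales website.
I want to extract words from that column to create new columns that hold '1' or '0' depending on whether the word is present or not (creating dummy variables for a regression). Once that's done, I'd like to be able to merge these columns (for example, merging the "Parking", "parking", "Garage" and "garage" columns into one column covering all parking and garages). I assume R is case sensitive, but even if it isn't, I'd still need to be able to merge e.g. 'parking' and 'garage'.
This is for a hedonic pricing method, so I need as many property characteristic variables as possible.
I don't know how to create the new dummy variables, or how to merge them into one column once I have them, so I'm struggling. Any help is much appreciated.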
Is this what you're looking for?
library(tidyverse)
data.frame(txt) %>%
  # tidy up `txt`: drop every non-word character except commas and spaces
  mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
  # split into rows
  separate_rows(txt, sep = ",") %>%
  # extract the first keyword matched (case-insensitive):
  mutate(keywords = str_extract(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony"))
# A tibble: 19 × 2
   txt                                                                keywords
   <chr>                                                              <chr>
 1 "Stunning seaside location"                                        NA
 2 " 24hour emergency call system and secure video entry"             NA
 3 " Mature landscaped gardens with large terraces and seating areas" garden
 4 " Walk out balconies to selected apartments"                       NA
 5 " Beautifully decorated homeowners8099 lounge"                     NA
 6 " Parking spaces and car ports are available via an annual permit" Parking
 7 " Wheelchair access"                                               NA
 8 " Lifts to all floors"                                             NA
 9 " Fire detection"                                                  NA
10 " Intruder alarm"                                                  NA
11 " Village Location"                                                NA
12 " 4 Bedrooms"                                                      NA
13 " Gardens"                                                         Garden
14 " Balcony"                                                         Balcony
15 " On streetresidents parking"                                      parking
16 " Central heating"                                                 NA
17 " Double glazing"                                                  NA
18 " Fireplace"                                                       Fireplace
19 " Ruralsecluded"                                                   NA
Data:
txt <- '"["Stunning seaside location", "24-hour emergency call system and secure video entry", "Mature landscaped gardens with large terraces and seating areas", "Walk out balconies to selected apartments", "Beautifully decorated homeownersâ200231 lounge", "Parking spaces and car ports are available via an annual permit", "Wheelchair access", "Lifts to all floors", "Fire detection", "Intruder alarm"]", "["Village Location, 4 Bedrooms, Garden(s)"]", "["Balcony", "On street/residents parking", "Central heating", "Double glazing", "Fireplace", "Rural/secluded"]"'
If a substring can contain more than one keyword, use str_extract_all like this:
data.frame(txt) %>%
  mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
  separate_rows(txt, sep = ",") %>%
  # str_extract_all returns a list column with every match per row
  mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
  # one row per match; rows without any match are kept as NA
  unnest(where(is.list), keep_empty = TRUE)
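To see why the distinction matters, here is a minimal base R sketch. The string `x` is made up for illustration and is not from the data above. `regexpr()` finds only the first match, which is analogous to what str_extract returns, while `gregexpr()` finds all of them, analogous to str_extract_all:

```r
# Hypothetical feature string containing two of the keywords
x <- "Garage and on street parking"
pattern <- "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony"

# First match only (comparable to str_extract)
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# Every match (comparable to str_extract_all)
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

print(first_match)
print(all_matches)
```

With only the first match, the parking mention in a string like this would be lost.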
EDIT:
If the OP wants a separate variable for each keyword, then this works:
data.frame(txt) %>%
  # keep commas, spaces and slashes; drop all other non-word characters
  mutate(txt = gsub("(?![, /])\\W", "", txt, perl = TRUE)) %>%
  separate_rows(txt, sep = ", ") %>%
  mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
  # unnest listed items:
  unnest(where(is.list), keep_empty = TRUE) %>%
  # capitalize initial letter, so "parking" and "Parking" end up in the same column:
  mutate(keywords = sub("^(.)", "\\U\\1", keywords, perl = TRUE)) %>%
  # cast each keyword into its own column:
  pivot_wider(names_from = keywords, values_from = keywords,
              values_fn = function(x) 1, values_fill = 0)
# A tibble: 19 × 6
   txt                                                              `NA` Garden Parking Balcony Fireplace
   <chr>                                                           <dbl>  <dbl>   <dbl>   <dbl>     <dbl>
 1 Stunning seaside location                                           1      0       0       0         0
 2 24hour emergency call system and secure video entry                 1      0       0       0         0
 3 Mature landscaped gardens with large terraces and seating areas     0      1       0       0         0
 4 Walk out balconies to selected apartments                           1      0       0       0         0
 5 Beautifully decorated homeowners8099 lounge                         1      0       0       0         0
 6 Parking spaces and car ports are available via an annual permit     0      0       1       0         0
 7 Wheelchair access                                                   1      0       0       0         0
 8 Lifts to all floors                                                 1      0       0       0         0
 9 Fire detection                                                      1      0       0       0         0
10 Intruder alarm                                                      1      0       0       0         0
11 Village Location                                                    1      0       0       0         0
12 4 Bedrooms                                                          1      0       0       0         0
13 Gardens                                                             0      1       0       0         0
14 Balcony                                                             0      0       0       1         0
15 On street/residents parking                                         0      0       1       0         0
16 Central heating                                                     1      0       0       0         0
17 Double glazing                                                      1      0       0       0         0
18 Fireplace                                                           0      0       0       0         1
19 Rural/secluded                                                      1      0       0       0         0
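The question also asked how to merge related columns such as parking and garage. Case differences are already taken care of by the (?i) flag and the capitalization step, so what remains is combining distinct keywords. Below is a minimal sketch in base R. The small `dummies` data frame is hypothetical and only stands in for the wide result (the sample data has no listing mentioning a garage, so no Garage column shows up in the output above), and the column names `ParkingOrGarage` and `ParkingOrGarage2` are my own choice:

```r
# Hypothetical stand-in for the wide 0/1 result
dummies <- data.frame(
  txt       = c("On street/residents parking", "Double garage", "Fireplace"),
  Parking   = c(1, 0, 0),
  Garage    = c(0, 1, 0),
  Fireplace = c(0, 0, 1)
)

# Option 1: combine existing dummy columns (1 if either feature is present)
dummies$ParkingOrGarage <- as.integer(dummies$Parking == 1 | dummies$Garage == 1)

# Option 2: build the merged dummy straight from the text, ignoring case
dummies$ParkingOrGarage2 <- as.integer(
  grepl("parking|garage", dummies$txt, ignore.case = TRUE)
)

print(dummies[, c("txt", "ParkingOrGarage", "ParkingOrGarage2")])
```

Option 2 skips the intermediate columns entirely, which may be simpler if you already know which keywords belong together.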