R - Search for strings and state whether they are present in a new column



I have a dataset with 63,000 rows in R. One of the columns contains a list of words in the following format:
("["Stunning seaside location", "24-hour emergency call system and secure video entry", "Mature landscaped gardens with large terraces and seating areas", "Walk out balconies to selected apartments", "Beautifully decorated homeowners’ lounge", "Parking spaces and car ports are available via an annual permit", "Wheelchair access", "Lifts to all floors", "Fire detection", "Intruder alarm"]", "["Village Location, 4 Bedrooms, Garden(s)"]", "["Balcony", "On street/residents parking", "Central heating", "Double glazing", "Fireplace", "Rural/secluded"]")

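These are property features as listed on a sales website.

I want to extract words from this column and create new columns that hold '0' or '1' depending on whether the word is absent or present, i.e. dummy variables for a regression. Ideally, I would be able to group several property features into one column and state whether any of them is present. I am also aware that R can be sensitive to capitalization and plurals, so I would like one column to cover multiple versions of a word. That is, I want the different spellings of 'parking' (capitalized or not, singular or plural) to go into the same column, since they all represent the same feature but may be written differently in the text.

This is for the hedonic pricing method, so I need as many property-feature variables as possible.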

In the (continued) absence of exact data, I have taken the liberty of creating my own toy data based on the OP's example strings:

df <- data.frame(
  Location = 1:3,
  Description = c('["Interesting seaside location", "24-hour emergency call system and secure video entry", "Mature landscaped gardens with large terraces and seating areas", "Walk out balconies to selected apartments", "Beautifully decorated homeowners’ lounge", "Parking spaces and car ports are available via an annual permit", "Wheelchair access", "Lifts to all floors", "Fire detection", "Intruder alarm"]", "["Village Location, 4 Bedrooms, Garden(s)"]", "["On street/residents parking", "Central heating", "Double glazing", "Rural/secluded"]',
                  '["Stunning city location", "24-hour emergency call system and secure video entry", "Mature landscaped gardens", "Walk out balconies to selected apartments", "Beautifully decorated homeowners’ lounge", "Parking spaces and car ports are available via an annual permit", "Lifts to all floors", "Fire detection", "Intruder alarm"]", "["Village Location, 4 Bedrooms, Garden(s)"]", "["Balcony", "On street/residents parking", "Central heating", "Double glazing", "Fireplaces", "Rural/secluded"]',
                  '["Nice off-shore location", "12-hour emergency call system", "Marine gardens with large terraces and seating areas", "Swimming pool", "Beautifully decorated homeowners’ lounge", "Parkings and car ports are available via an annual permit", "Wheelchair access", "No Lifts", "Fire detection", "Intruder alarm"]", "["Village Location, 4 Bedrooms, Garden(s)"]", "["Balcony", "On street/residents parking", "Central heating", "Double glazing", "Fireplace", "Rural/secluded"]')
)

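The OP only says that he/she wants to extract "words" from the property descriptions and gives a single example, namely "parking" and its various spellings. I assume there will be more such keywords, so I have put together this list. Feel free to trim or extend it as needed: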

keywords <- "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcon(y|ies)"

Note that (?i) is used to make the match case-insensitive.
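To see what the inline flag does, here is a small check in base R. The sample strings in `x` are made up for illustration; only the pattern is taken from above.

```r
# Pattern copied from above; (?i) switches on case-insensitive matching
keywords <- "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcon(y|ies)"

# Hypothetical sample strings (not from the OP's data)
x <- c("On street/residents parking", "PARKING spaces",
       "Walk out balconies", "Central heating")

# perl = TRUE selects the PCRE engine, which supports the inline (?i) flag
hits <- grepl(keywords, x, perl = TRUE)
print(hits)
```

The first three strings match regardless of capitalization or plural form; "Central heating" contains none of the keywords.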

Now, if the aim is to count the occurrences of the `keywords` in the `Description`s, this procedure works:

library(dplyr)
library(stringr)  # str_extract_all()
library(tidyr)    # unnest(), pivot_wider()

df %>%
  # extract all occurrences of `keywords` into a new list column:
  mutate(keywords = str_extract_all(Description, keywords)) %>%
  # unnest the listed items in the new column:
  unnest(where(is.list), keep_empty = FALSE) %>%
  # prepare `keywords` for `pivot_wider` below:
  mutate(
    # capitalize initial letter:
    headings = sub("^(.)", "\\U\\1", keywords, perl = TRUE),
    # standardize "Balconies" to "Balcony":
    headings = sub("Balconies", "Balcony", headings)) %>%
  # cast each `heading` into its own column, 1 where it occurs and 0 elsewhere:
  pivot_wider(names_from = headings, values_from = headings,
              values_fn = function(x) 1, values_fill = 0)
# A tibble: 21 × 7
Location Description                                                                            keywords Garden Balcony Parking Fireplace
<int> <chr>                                                                                  <chr>     <dbl>   <dbl>   <dbl>     <dbl>
1        1 "["Interesting seaside location", "24-hour emergency call system and secure video … garden        1       0       0         0
2        1 "["Interesting seaside location", "24-hour emergency call system and secure video … balconi…      0       1       0         0
3        1 "["Interesting seaside location", "24-hour emergency call system and secure video … Parking       0       0       1         0
4        1 "["Interesting seaside location", "24-hour emergency call system and secure video … Garden        1       0       0         0
5        1 "["Interesting seaside location", "24-hour emergency call system and secure video … Balcony       0       1       0         0
6        1 "["Interesting seaside location", "24-hour emergency call system and secure video … parking       0       0       1         0
7        1 "["Interesting seaside location", "24-hour emergency call system and secure video … Firepla…      0       0       0         1
8        2 "["Stunning city location", "24-hour emergency call system and secure video entry… garden        1       0       0         0
9        2 "["Stunning city location", "24-hour emergency call system and secure video entry… balconi…      0       1       0         0
10        2 "["Stunning city location", "24-hour emergency call system and secure video entry… Parking       0       0       1         0
# … with 11 more rows
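
If one row per property is wanted instead, as a dummy-variable regression would need, a minimal base R sketch could look like the following. The data frame `df2`, the `features` vector, and the choice to group "garage" and "car port" under Parking are illustrative assumptions, not part of the OP's data.

```r
# Hypothetical toy data: one description string per property
df2 <- data.frame(
  Location = 1:3,
  Description = c(
    '["Parking spaces and car ports", "Mature landscaped gardens"]',
    '["Balcony", "On street/residents parking", "Fireplaces"]',
    '["Central heating", "Double glazing"]'
  ),
  stringsAsFactors = FALSE
)

# Each name becomes a column; alternation groups several spellings
# (or several related features) into that one column
features <- c(
  Parking   = "parking|car ?port|garage",
  Garden    = "garden",
  Balcony   = "balcon(y|ies)",
  Fireplace = "fireplace"
)

# One 0/1 column per feature: 1 if the pattern occurs anywhere in the description
dummies <- sapply(features, function(p) {
  as.integer(grepl(p, df2$Description, ignore.case = TRUE))
})

result <- cbind(df2["Location"], dummies)
print(result)
```

Because the patterns are unanchored, plural forms such as "gardens" or "Fireplaces" are caught by the singular stem, and `ignore.case = TRUE` takes care of capitalization.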
