Extract values and units from a character column into multiple new columns (using extract in R: dplyr?)



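I have looked at a lot of different questions but have not found one that solves my problem.

I have a very messy and irregular character column in a data frame, cost$k1, which looks like this: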

"33p/kWh"
"40p/kWh on bp pulse50 units. 50p/kWh on bp pulse150 units."
"42p/kWh"
"Free"
"30p/kWh ( min £1.50 )"
"42p/kWh"
"Polar members: 12p/kWh"
"Polar Subscription - 27p/kWh Polar Free Membership - 42p/kWh Contactless - 42p/kWh"
"47p/kWh"
"47p/kWh"
"25p/kWh"
"25p/kWh"
"50kW: 43p/kWh"

I need to pull out up to 4 prices (including the units) and put them into new columns, for example:

cost$k2[2] would be "40p/kWh"
cost$k3[2] would be "50p/kWh" 
cost$k2[8] would be "27p/kWh"
cost$k3[8] would be "42p/kWh" 
cost$k4[8] would be "42p/kWh"

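Looking around, it was suggested that dplyr::extract() should be ideal, but I am having trouble getting it to work.

Three examples, which I have spent a long time reordering and moving around, are: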

cost %>% extract(k1,c("k2","k3","k4"),"([0-9]+p/kWh)-([0-9]+p/kWh)-([0-9]+p/kWh)",remove=FALSE)

This gives no results.

cost %>%  extract(k1,"([[0-9]+]p/kWh)",remove=FALSE)

This just gives me the first set of digits, without the units, and not even the correct ones.

cost %>%  extract(k1,into = c("k2","k3","k4"),regex="([0-9]+p/kWh)*([0-9]+p/kWh)*([[0-9]+p/kWh)",remove=FALSE)

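This gets the number and unit for rows that have a single cost, but writes it only to column k4 (and in row 4 it picks up 40p/kWh rather than 50p/kWh).

Any ideas?

Data here: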

structure(list(ID = c(194597L, 194510L, 193430L, 191632L, 190347L, 
190056L, 189724L, 189630L, 189350L, 189349L, 188842L, 188841L, 
188046L, 176130L, 175867L, 175683L, 175682L, 175526L, 175354L, 
175323L, 175034L, 174985L, 173800L, 173795L, 173794L, 173713L, 
173668L, 173518L, 173027L, 173026L, 173025L, 173018L, 172194L, 
172008L, 171158L, 171137L, 171136L, 170768L, 170767L, 170764L, 
170763L, 170701L, 170372L, 170368L, 170366L, 170365L, 170364L, 
170362L, 170359L, 170356L), k1 = c("20p/kWh", "25p/kWhContactless Card ", 
"33p/kWh; other tariffs available", "40p/kWh on bp pulse50 units. 50p/kWh on bp pulse150 units.", 
"42p/kWh; other tariffs available", "33p/kWh; other tariffs available", 
"35p/kWh; other tariffs available", "16p/kWh; other tariffs available", 
"Free", "42p/kWh; other tariffs available", "26p/kWh; other tariffs available", 
"26p/kWh; other tariffs available", "35p/kWh. Overstay £10.00/hour after 90 mins; other tariffs available", 
"30p/kWh ( min £1.50 ); other tariffs available. Parking fees apply", 
"33p/kWh; other tariffs available", "33p/kWh; other tariffs available", 
"33p/kWh; other tariffs available", "42p/kWh; other tariffs available", 
"42p/kWh; other tariffs available", "47p/kWh; other tariffs available", 
"42p/kWh; other tariffs available", "Polar members: 12p/kWh; instant: 18p/kWh (£1.20 min payment)", 
"Polar Subscription - 27p/kWh Polar Free Membership - 42p/kWh Contactless - 42p/kWh", 
"47p/kWh; other tariffs available", "47p/kWh; other tariffs available", 
"25p/kWh; other tariffs available", "25p/kWh; other tariffs available", 
"50kW: 43p/kWh; 150kW: 42p/kWh; other tariffs available", "Rapids Polar Card 15p/kWh, Contactless 20p/kWh", 
"Polar members: 12p/kWh; instant: 18p/kWh (£1.20 min payment)", 
"Rapids Polar Card 15p/kWh, Contactless 20p/kWh", "25p/kWh; other tariffs available", 
"Polar members: 12p/kWh; instant: 18p/kWh (£1.20 min payment)", 
"Polar members: 12p/kWh; instant: 18p/kWh (£1.20 min payment)", 
"30p/kWh Contactless; Overstay £10/hour after 90 minutes; Other tariffs available", 
"Rapids Polar Card 15p/kWh, Contactless 20p/kWh", "Rapids Polar Card 15p/kWh, Contactless 20p/kWh", 
"Rapids Polar Card 15p/kWh, Contactless 20p/kWh", "Rapids Polar Card 15p/kWh, Contactless 20p/kWh", 
"Rapids Polar Card 15p/kWh, Contactless 20p/kWh", "Rapids Polar Card 15p/kWh, Contactless 20p/kWh", 
"40p - 20p/kWh", "Rapids Polar Card 15p/kWh, Contactless 20p/kWh", 
"Rapids Polar Card 15p/kWh, Contactless 20p/kWh", "Rapids Polar Card 15p/kWh, Contactless 20p/kWh", 
"Rapids Polar Card 15p/kWh, Contactless 20p/kWh", "40p - 20p/kWh", 
"40p - 20p/kWh", "40p - 20p/kWh", "Polar Plus 20p/kWh Polar Instant 35p/kWh Contactless 40p/kWh £1.50 connection charge"
)), row.names = c(NA, -50L), class = c("data.table", "data.frame"
)) # the .internal.selfref pointer from dput() is omitted because it cannot be parsed

Use str_extract_all (backslashes must be doubled inside an R string):

library(stringr)
str_extract_all(df$k1, '\\b\\d+p/kWh')
# [[1]]
# [1] "33p/kWh"
# 
# [[2]]
# [1] "40p/kWh" "50p/kWh"
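
To see how the pattern behaves on some of the trickier strings in the question, here is a minimal sketch in base R (regmatches() with gregexpr(), chosen so that it runs without extra packages; it is not the stringr call above, and the names tricky, pattern and prices are only illustrative). Note that a bare "40p" with no "/kWh" is not picked up, which may or may not be what you want for the "40p - 20p/kWh" rows.

```r
# A few awkward strings copied from the question's data.
tricky <- c("50kW: 43p/kWh; 150kW: 42p/kWh; other tariffs available",
            "40p - 20p/kWh",
            "25p/kWhContactless Card ",
            "Free")

# \\b = word boundary, \\d+ = one or more digits, then the literal unit.
pattern <- "\\b\\d+p/kWh"

# gregexpr() locates every match in each string; regmatches() extracts them.
# Strings with no match come back as character(0).
prices <- regmatches(tricky, gregexpr(pattern, tricky, perl = TRUE))
prices
```

The "50kW" and "150kW" charger ratings are skipped because the digits there are followed by "kW" rather than "p/kWh".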

Or within a data frame:

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  mutate(col = str_extract_all(k1, '\\b\\d+p/kWh')) %>% 
  unnest_wider(col, names_sep = "")
# # A tibble: 2 × 3
#   k1                                                         col1    col2   
#   <chr>                                                      <chr>   <chr>  
# 1 33p/kWh                                                    33p/kWh NA     
# 2 40p/kWh on bp pulse50 units. 50p/kWh on bp pulse150 units. 40p/kWh 50p/kWh
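
If you want the columns named k2 onwards and always padded to four prices, as the question asks, something along these lines might work. It is only a sketch in base R, assuming that at most four prices per row are wanted; max_prices, found and padded are illustrative names, and the three sample rows are copied from the question, so the row numbers differ from the original data.

```r
# Three strings copied from the question's data.
cost <- data.frame(
  k1 = c("33p/kWh",
         "40p/kWh on bp pulse50 units. 50p/kWh on bp pulse150 units.",
         "Polar Subscription - 27p/kWh Polar Free Membership - 42p/kWh Contactless - 42p/kWh"),
  stringsAsFactors = FALSE
)

max_prices <- 4  # assumption: the question asks for up to 4 prices

# One character vector of matches per row.
found <- regmatches(cost$k1, gregexpr("\\b\\d+p/kWh", cost$k1, perl = TRUE))

# Indexing past the end of a vector yields NA, which pads short rows
# (and drops anything beyond max_prices matches).
padded <- t(vapply(found, function(m) m[seq_len(max_prices)],
                   character(max_prices)))
colnames(padded) <- paste0("k", 1 + seq_len(max_prices))  # k2, k3, k4, k5

# Bind the new price columns onto the original data frame.
cost <- cbind(cost, as.data.frame(padded, stringsAsFactors = FALSE))
cost[, -1]
```

Rows with fewer than four prices get NA in the remaining columns, so every row ends up with the same set of columns regardless of how many prices it contains.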

Data:

df = data.frame(k1 = c("33p/kWh",
"40p/kWh on bp pulse50 units. 50p/kWh on bp pulse150 units."))
