我在R中有示例数据帧df,在sparklyr中有rd_3。我想在火花数据帧中创建visit_category列。我知道我们可以在 R 中使用 Cut 函数来创建相同的列,但是如何在 sparklyr 中创建相同的列?
用于可重复目的
df<-data.frame(visit_duration=c(12,20,70,100),city=c("X","X","X","X"),visit_category=c("0-15","15-25","25-80","80-120"))
rd_3<-copy_to(sc,df)
我不能使用 ifelse 语句,因为箱数超过 50 个。我在 sparklyr 中使用了ft_bucketlizer,但它显示了一个错误,如下所示
rd_3 %>%
ft_bucketizer("visit_duration", "Visit_Category", splits = c(0, 15, 25, 80 , 120)) %>%
mutate(Visit_Category = factor(Visit_Category, labels = c("0-15","15-25","25-80","80-120")))
这是我得到的错误
Error: org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'AS' expecting {')', ','}(line 1, pos 98)
== SQL ==
SELECT `new_col`, `visit_duration`, FACTOR(`Visit_Category`, ("0-15",
"15-25", "25-80", "80-120") AS "labels") AS `Visit_Category`
In addition: Warning message:
Named arguments ignored for SQL FACTOR
Spark SQL中没有factors
或等效类型。相反,如果需要,Spark ML 转换器会添加特殊的列元数据。
结果factor
调用被解释为远程函数并通过 SQL 翻译引擎传递,呈现完全乱码。
现在,假设你真的想使用桶化器,你将不得不桶化
splits <- c(0, 15, 25, 80, 120)
bucketized <- rd_3 %>%
ft_bucketizer("visit_duration", "Visit_Category", splits = splits)
创建引用表:
ref <- copy_to(sc, tibble(
Visit_Category = seq_along(splits[-1]) - 1,
label = paste0(
splits[-length(splits)],
"-",
splits[-1]
)
))
并加入:
bucketized %>% left_join(ref, by = "Visit_Category")
# Source: spark<?> [?? x 4]
visit_duration city Visit_Category label
<dbl> <chr> <dbl> <chr>
1 12 X 0 0-15
2 20 X 1 15-25
3 70 X 2 25-80
4 100 X 3 80-120
尽管像这样构造CASE WHEN
表达式可能更容易:
library(rlang)
expr <- purrr::map2(
splits[-length(splits)], splits[-1],
function(lo, hi)
glue::glue("visit_duration %BETWEEN% {lo} %AND% {hi} ~ '{lo}-{hi}'")
) %>%
glue::glue_collapse(sep=",n") %>%
paste("case_when(n", ., ")")
rd_3 %>% mutate(result = !!parse_quo(expr, env = caller_frame()))
# Source: spark<?> [?? x 4]
visit_duration city visit_category result
<dbl> <chr> <chr> <chr>
1 12 X 0-15 0-15
2 20 X 15-25 15-25
3 70 X 25-80 25-80
4 100 X 80-120 80-120
或者简单地参考笛卡尔积并过滤结果:
ref2 <- copy_to(sc, tibble(
lo = splits[-length(splits)],
hi = splits[-1]
))
cross_join(rd_3, ref2, explicit=TRUE) %>%
filter(visit_duration >= lo & visit_duration < hi) %>%
mutate(label = paste0(lo, "-", hi)) %>%
select(-lo, -hi)
# Source: spark<?> [?? x 6]
visit_duration city visit_category lo hi label
<dbl> <chr> <chr> <dbl> <dbl> <chr>
1 12 X 0-15 0 15 0.0-15.0
2 20 X 15-25 15 25 15.0-25.0
3 70 X 25-80 25 80 25.0-80.0
4 100 X 80-120 80 120 80.0-120.0