r语言 - 如何删除远程(postgresql)表上的排序?



在远程源中的表上调用dplyr::arrange()会添加"排序者:..."旗。是否有后续函数可以删除远程表上的此"排序者:"标志?

考虑示例数据:

tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

为此:

glimpse(tmp_cars_sdf)
# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26...

考虑:

tmp_cars <-
cars
tmp_cars <-
tmp_cars %>%
arrange(speed, dist)
glimpse(tmp_cars)
# Observations: 50
# Variables: 2
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...

然而:

tmp_cars <-
tmp_cars_sdf %>%
arrange(speed, dist)
glimpse(tmp_cars)
# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3 
# Ordered by: speed, dist
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...

dbplyr倾向于通过添加命令来嵌套子查询。因此,当您添加更多命令时,早期arrange可能会出现在子查询中。这似乎是根本问题。

删除这些查询的一个选项是直接呈现和编辑基础 SQL 查询。也许像下面这样:

unarrange = function(table, cols_prev_ordered_by){
db_connection = table$src$con
order_text = paste0("ORDER BY "",
paste0(cols_prev_ordered_by, collapse = ", ""),
""")
query_text = table %>% sql_render() %>% as.character()
new_query_text = gsub(order_text, "", query_text)
sql_query = build_sql(con = db_connection, new_query_text)
return(tbl(db_connection, sql(sql_query)))
}
# example:
tmp_cars <-
tmp_cars_sdf %>%
arrange(speed, dist)
unarrange(c("speed", "dist"))

当然,有比gsub更可靠的方法来识别和删除查询的排序依据部分。如果这很重要,您可能需要查看?select_query因为它具有明确的order_by参数。

受 Simon 对 OP 的回答和评论的启发,以下函数是一种解决方法,它删除了所有排序(但保留了作为排序结果计算的任何新列(。这可能不是最有效或最低级别/直接的方法,我将在本答案结束时回到这个问题,但如果他们认为合适,我会让dbplyr团队解决我的问题。

unarrange <-
function(remote_df) {
existing_groups <- groups(remote_df)

remote_df <-
remote_df %>%
compute()
remote_df <-
tbl(remote_df$src$con, 
sql_render(remote_df))

remote_df <-
group_by(remote_df, !!!existing_groups)

return(remote_df)
}

为什么有效

使用输入数据:

tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

考虑

str(tmp_cars_sdf)
# ..$ con <truncated>
# ..$ disco <truncated>
# $ ops:List of 2
# ..$ x   : 'ident' chr "tmp_cars_sdf"
# ..$ vars: chr [1:2] "speed" "dist"
# ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

tmp_cars_sdf <-
tmp_cars_sdf %>%
arrange(speed, dist)
str(tmp_cars_sdf)
# $ ops:List of 4
# ..$ name: chr "arrange"
# ..$ x   :List of 2
# .. ..$ x   : 'ident' chr "tmp_cars_sdf"
# .. ..$ vars: chr [1:2] "speed" "dist"
# .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# ..$ dots:List of 2
# .. ..$ : language ~speed
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260> 
#     .. ..$ : language ~dist
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260> 
#     ..$ args:List of 1
# .. ..$ .by_group: logi FALSE
# ..- attr(*, "class")= chr [1:3] "op_arrange" "op_single" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

很明显,通过arrange添加排序实际上修改了 R 对象的结构,因为远程表无法进行固有排序(或分组(,顺序和分组信息必须存储在本地,并且仅在构建最终查询时传输。

因此,解决方法使用三个技巧:首先,使用compute()生成一个临时表。请注意,这样做不会重置组和本地排序。其次,使用 Simon 的技巧拉取与此新表对应的简单选择查询,并覆盖现有的表结构,以便丢失所有分组和排序信息。为了保留组,该函数将原始组重新添加到此表中。

为什么这很有用?

虽然 OP 中提供的示例用于显示问题,但它出现的原因是依赖于表上的某些(分组(排序的突变。一旦构建了新列,旧的排序就不再需要了,事实上,由于 github 上的链接问题,有时是一个障碍。下面给出了这样一个例子:

tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

cars_df <-
cars %>%
arrange(speed, dist) %>%
group_by(speed) %>%
mutate(diff_dist_up = dist - lag(dist)) %>%
arrange(speed, desc(dist)) %>%
mutate(diff_dist_down = dist - lag(dist)) %>%
ungroup() %>%
arrange(speed, dist) %>%
data.frame()

因此:

head(cars_df)
# speed dist diff_dist_up diff_dist_down
# 1     4    2           NA             -8
# 2     4   10            8             NA
# 3     7    4           NA            -18
# 4     7   22           18             NA
# 5     8   16           NA             NA
# 6     9   10           NA             NA

使用新功能,我们可以远程复制它:

cars_df_2 <-
tmp_cars_sdf %>%
arrange(speed, dist) %>%
group_by(speed) %>%
mutate(diff_dist_up = dist - lag(dist)) %>%
# unfortunately the next line is needed
# because of https://github.com/tidyverse/dbplyr/issues/345
unarrange() %>%
arrange(speed, desc(dist)) %>%
mutate(diff_dist_down = dist - lag(dist)) %>%
ungroup() %>%
unarrange() %>%
collect() %>%
arrange(speed, dist) %>%
data.frame()

检查一下,我们看到:

identical(cars_df, cars_df_2)
# [1] TRUE

此修复程序可能存在的问题

第一个问题是调用使用资源的compute()的必要性。第二个问题是必须能够修改对远程表进行编码的 R 对象的结构,但我不知道如何从此结构构建查询,因此无法执行此操作。

最新更新