In SparkR, I need to count the number of occurrences of the column STORE in my data frame df3.
>df3 <- select(df_1,"DC_NAME","STORE", "ITEM_DESC","ITEM")
>head(df3)
DC_NAME STORE ITEM_DESC ITEM
1 Kerala 1216 Nambisan Ghee 200 Ml Pet Jar 100050222
2 Kerala 1216 Nambisan Ghee 100 ml Pet Jar 100149022
3 Kerala 1216 Nambisan Ghee 50 ml Pet Jar 100149024
4 Kerala 1219 Nambisan Ghee 500 Ml Pet Jar 100050210
5 Kerala 1219 Nambisan Ghee 200 Ml Pet Jar 100050222
6 Kerala 1219 Nambisan Ghee 50 ml Pet Jar 100149024
To count the number of times each STORE value occurs, I used:
df_count <- count(groupBy(df3, "STORE"))
STORE count
1 1216 3
2 1219 3
3 3154 1
4 3049 3
5 1990 3
6 3107 4
But I need the result in the following form, with the columns DC_NAME, ITEM_DESC and ITEM included. Is there code that does this?
DC_NAME STORE ITEM_DESC ITEM count
1 Kerala 1216 Nambisan Ghee 200 Ml Pet Jar 100050222 3
2 Kerala 1216 Nambisan Ghee 100 ml Pet Jar 100149022 3
3 Kerala 1216 Nambisan Ghee 50 ml Pet Jar 100149024 3
4 Kerala 1219 Nambisan Ghee 500 Ml Pet Jar 100050210 3
5 Kerala 1219 Nambisan Ghee 200 Ml Pet Jar 100050222 3
6 Kerala 1219 Nambisan Ghee 50 ml Pet Jar 100149024 3
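One straightforward way to get this shape (a sketch, not tested on a cluster, assuming the df3 frame and column names above) is to compute the per-store counts and merge them back onto the original columns:

```r
# Sketch: per-STORE counts, merged back onto the original frame.
counts <- count(groupBy(df3, "STORE"))       # columns: STORE, count
result <- merge(df3, counts, by = "STORE")   # inner join on STORE
head(result)
```

If merge is not available in your SparkR version, join(df3, counts, df3$STORE == counts$STORE) behaves similarly, though it keeps a duplicate STORE column that you would drop with select. This does require a join, which the answer below avoids.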
If you want to avoid a join,
you can use a window function with an unbounded frame. Assuming your data has the following structure:
df <- structure(list(DC_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = " Kerala ", class = "factor"),
STORE = c(1216L, 1216L, 1216L, 1219L, 1219L, 1219L),
ITEM_DESC = structure(c(2L,
1L, 4L, 3L, 2L, 4L), .Label = c(" Nambisan Ghee 100 ml Pet Jar",
" Nambisan Ghee 200 Ml Pet Jar", " Nambisan Ghee 500 Ml Pet Jar",
" Nambisan Ghee 50 ml Pet Jar"), class = "factor"), ITEM = c(100050222L,
100149022L, 100149024L, 100050210L, 100050222L, 100149024L
)), .Names = c("DC_NAME", "STORE", "ITEM_DESC", "ITEM"),
class = "data.frame", row.names = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 "))
Create a Spark DataFrame using a Hive context:
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
Register it as a temporary table:
registerTempTable(sdf, "sdf")
Prepare the query:
query <- "SELECT *, SUM(1) OVER (
            PARTITION BY STORE
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
          ) AS count
          FROM sdf"
Execute it with the sql function:
sql(hiveContext, query)
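As a side note: in later SparkR releases (2.0+), the same window can be expressed without raw SQL through the window-function API. A sketch, assuming those functions exist in your version:

```r
# Sketch for SparkR 2.0+: count over a partition window instead of raw SQL.
ws <- windowPartitionBy("STORE")
sdf2 <- withColumn(sdf, "count", over(count(sdf$STORE), ws))
head(sdf2)
```

This keeps everything in SparkR function calls and avoids registering a temporary table.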