Filtering key-value pairs in a Hadoop reducer function in R



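I would like to know how to conditionally filter the key-value pairs emitted by a Hadoop reducer function. For example, in the word count example below, how can I keep only the words whose count is greater than some threshold (say, 3)?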

library(rmr2)
library(rhdfs)
# initialize the rhdfs package
hdfs.init()
map <- function(k,lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}
## read text files from folder example/wordcount/data
hdfs.root <- 'example/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
## save result in folder example/wordcount/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit job
out <- wordcount(hdfs.data, hdfs.out) 
## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) <- c('word', 'count')
head(results.df)
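
Answer: have the reducer return a key-value pair only when the total exceeds the threshold. When the condition is false, the function returns NULL, and rmr2 emits nothing for that word.

reduce <- function(word, counts) {
  total <- sum(counts)
  # Emit the pair only above the threshold; otherwise return NULL (no output)
  if (total > 3)
    keyval(word, total)
}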
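
The filtering logic of the reducer above can be checked locally, without a Hadoop cluster. The snippet below is only a minimal sketch in base R, not rmr2 code: `make_reduce` and `run_local` are hypothetical helpers introduced here for illustration, a plain list stands in for `keyval`, and the sample `lines` are made up. It assumes, as in the reducer above, that a NULL result means "emit nothing for this key".

```r
# Local, Hadoop-free sketch of a filtering reducer (base R only).
# make_reduce and run_local are hypothetical helpers, not part of rmr2.

# Build a reducer that keeps a word only if its total exceeds `threshold`.
make_reduce <- function(threshold) {
  function(word, counts) {
    total <- sum(counts)
    # A plain list stands in for rmr2's keyval(); NULL means "no output".
    if (total > threshold) list(key = word, val = total) else NULL
  }
}

# Simulate the map, shuffle and reduce phases on an in-memory character vector.
run_local <- function(lines, reduce) {
  words <- unlist(strsplit(lines, "\\s+"))
  words <- words[nzchar(words)]                   # drop empty tokens
  groups <- split(rep(1, length(words)), words)   # shuffle: group the 1s by word
  out <- Map(reduce, names(groups), groups)       # reduce: one call per key
  out <- Filter(Negate(is.null), out)             # NULL results are dropped
  data.frame(
    word  = vapply(out, function(p) p$key, character(1), USE.NAMES = FALSE),
    count = vapply(out, function(p) p$val, numeric(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

lines <- c("a b a c", "a b a b", "c a b")
result <- run_local(lines, make_reduce(3))
print(result)
```

Passing the threshold through a closure avoids hard-coding 3 in the reducer body. The same pattern could be applied to the rmr2 reducer by replacing the list with `keyval(word, total)`.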
