mrjob: setting up logging on EMR



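I'm trying to use mrjob to run Hadoop on EMR, but I can't figure out how to set up logging (the logs that user code generates in the map/reduce steps) so that I can access them after the cluster terminates.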

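I have tried setting up logging with the logging module, print, and sys.stderr.write(), but so far without success. The only option that works for me is to write the logs to a file, then SSH into the machine and read it, which is cumbersome. I would like my logs to go to stderr/stdout/syslog and be collected to S3 automatically, so that I can view them after the cluster terminates.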

Here is the word_freq example with logging added:

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re
import logging
import logging.handlers
import sys
WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper_init(self):
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)
        self.logger.addHandler(logging.FileHandler("/tmp/mr.log"))
        self.logger.addHandler(logging.StreamHandler())
        self.logger.addHandler(logging.StreamHandler(sys.stdout))
        self.logger.addHandler(logging.handlers.SysLogHandler())
    def mapper(self, _, line):
        self.logger.info("Test logging: %s", line)
        sys.stderr.write("Test stderr: %s\n" % line)
        print("Test print: %s" % line)
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)
    def combiner(self, word, counts):
        yield (word, sum(counts))
    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()

Of all these options, the only ones that actually work are writing directly to stderr (sys.stderr.write) or using a logger with a StreamHandler that points to stderr.
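As a minimal sketch of the second option, using only the standard library: the helper name make_stderr_logger and the log format are my own choices for illustration, not part of mrjob. It attaches a single StreamHandler to a named logger, so calling it from mapper_init more than once does not produce duplicate lines.

```python
import logging
import sys


def make_stderr_logger(name, stream=None):
    """Return a logger that writes to stderr (or to `stream`, if given)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not pass records on to the root logger, to avoid duplicate lines.
    logger.propagate = False
    # Attach the handler only once, even if this runs in every task init.
    if not logger.handlers:
        # StreamHandler(None) falls back to sys.stderr.
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_stderr_logger("wordfreq")
    log.info("Test logging: %s", "hello world")
```

Inside a job, this would presumably be called as self.logger = make_stderr_logger("wordfreq") in mapper_init, replacing the four addHandler calls from the question.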

After the job finishes (successfully or with an error), the logs can be retrieved from:

[s3_log_uri]/[jobflow-id]/task-attempts/[job-id]/[attempt-id]/stderr
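To make the layout concrete, here is a small sketch that assembles that location from its parts. The function name task_stderr_uri and every ID in the usage example are made-up placeholders, and it assumes the directory is named task-attempts, as in EMR's log layout. Substitute the values from your own job flow.

```python
def task_stderr_uri(s3_log_uri, jobflow_id, job_id, attempt_id):
    """Build the S3 location of one task attempt's stderr log."""
    parts = [
        s3_log_uri.rstrip("/"),  # tolerate a trailing slash on the base URI
        jobflow_id,
        "task-attempts",
        job_id,
        attempt_id,
        "stderr",
    ]
    return "/".join(parts)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(task_stderr_uri("s3://example-bucket/logs/",
                          "j-EXAMPLE",
                          "job_EXAMPLE_0001",
                          "attempt_EXAMPLE_0001_m_000000_0"))
```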

Be sure to keep the logs in the runners.emr.cleanup configuration.
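A hedged sketch of what such a setting could look like: it assumes a recent mrjob release in which the cleanup choices include TMP (delete temp files but not logs) and LOGS; older releases used different names, so check the documentation for your version. The function name is hypothetical. The output is JSON, which is also valid YAML and so should be usable as the contents of mrjob.conf.

```python
import json


def emr_conf_keeping_logs():
    """Return an mrjob.conf-style dict whose cleanup setting leaves logs alone."""
    # Assumption: "TMP" removes temp files only; "ALL" and "LOGS" would
    # delete the logs, so they are deliberately not listed here.
    return {"runners": {"emr": {"cleanup": ["TMP"]}}}


if __name__ == "__main__":
    # Print the configuration so it can be saved as mrjob.conf.
    print(json.dumps(emr_conf_keeping_logs(), indent=2))
```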

Here is an example of logging to stdout (Python 3):

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.util import log_to_stream
import re
import logging

log = logging.getLogger(__name__)
WORD_RE = re.compile(r'[\w]+')


class MostUsedWords(MRJob):

    @classmethod
    def set_up_logging(cls, quiet=False, verbose=False, stream=None):
        # Route both mrjob's logger and this module's logger to the stream.
        log_to_stream(name='mrjob', debug=verbose, stream=stream)
        log_to_stream(name='__main__', debug=verbose, stream=stream)

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_get_words,
                   reducer=self.reduce_get_words),
            MRStep(reducer=self.reducer_find_max)
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner_get_words(self, word, counts):
        yield (word, sum(counts))

    def reduce_get_words(self, word, counts):
        # counts is a one-shot iterator: materialize it before logging,
        # otherwise sum() below would see it already exhausted.
        counts = list(counts)
        log.info(word + "\t" + str(counts))
        yield None, (sum(counts), word)

    def reducer_find_max(self, key, value):
        # value is an iterable of (count, word) pairs
        yield max(value)


if __name__ == '__main__':
    MostUsedWords.run()
