How to use a pickled scikit-learn model in a Jython UDF for Pig

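I have trained a MultinomialNB model with scikit-learn, and now I want to unleash it across many JSON text files on S3, on a cluster. I pickled the model (call it "nb.spickle"). How do I load it into a Pig script and use it? Suppose I have a file with several lines of text, each of which needs to be classified as spam or ham: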

    "im bored tonight, come chat with me",
    "hi good looking msg me sometime",
    "I'm walking the dog",
    "check me out",
    "I went to the store earlier",
    "here much at all but im always on there at i get on there alot more, my id is orangewolf77",
    "I like to play baseball",
    "what are you doing?",
    "i had a picture on my profile did u not see it?",
    "look at my b00bs",
    "go to my website http://we.scam.u
    "you are so pretty"

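Jython cannot use numpy, scipy, or scikit-learn, because all of them rely on natively compiled extensions that Jython does not support. As a result, you can neither run scikit-learn models in Jython nor load them from pickle files there.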

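What you can do is inspect the code of the MultinomialNB class to see which parameters need to be exported (for instance to a JSON file), then reimplement a predict method that computes predictions from those fixed parameters in Jython.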
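As a rough illustration of the export-and-reimplement idea, here is a minimal sketch in plain Python with no compiled dependencies, so it should be usable from Jython. It assumes the model was trained on counts from a `CountVectorizer` with default settings, and that the fitted attributes `class_log_prior_`, `feature_log_prob_`, `classes_` and the vectorizer's `vocabulary_` were dumped to JSON from CPython beforehand. The JSON key names, the `predict` helper, and the toy parameters are made up for this example; the tokenizer regex is only an approximation of the vectorizer's default behaviour.

```python
import math
import re

# Export side (run once under CPython, shown as comments only):
#   params = {
#       "classes": clf.classes_.tolist(),
#       "class_log_prior": clf.class_log_prior_.tolist(),
#       "feature_log_prob": clf.feature_log_prob_.tolist(),
#       "vocabulary": dict((t, int(i)) for t, i in vec.vocabulary_.items()),
#   }
#   json.dump(params, open("nb_params.json", "w"))   # hypothetical file name

# Assumption: lowercase text, tokens of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def predict(params, text):
    """Return the class with the highest joint log-likelihood for one text."""
    vocab = params["vocabulary"]            # token -> feature index
    log_prior = params["class_log_prior"]   # one entry per class
    log_prob = params["feature_log_prob"]   # [n_classes][n_features]
    classes = params["classes"]

    # Count occurrences of known tokens; unknown tokens are ignored.
    counts = {}
    for tok in TOKEN_RE.findall(text.lower()):
        idx = vocab.get(tok)
        if idx is not None:
            counts[idx] = counts.get(idx, 0) + 1

    best_label, best_score = None, None
    for c in range(len(classes)):
        # score = log P(class) + sum over tokens of count * log P(token | class)
        score = log_prior[c]
        for idx, n in counts.items():
            score += n * log_prob[c][idx]
        if best_score is None or score > best_score:
            best_label, best_score = classes[c], score
    return best_label


# Toy parameters, invented for illustration only (not a trained model).
TOY_PARAMS = {
    "classes": ["ham", "spam"],
    "class_log_prior": [math.log(0.5), math.log(0.5)],
    "feature_log_prob": [
        [math.log(0.2), math.log(0.8)],  # ham:  P(chat), P(dog)
        [math.log(0.8), math.log(0.2)],  # spam: P(chat), P(dog)
    ],
    "vocabulary": {"chat": 0, "dog": 1},
}
```

In a Jython UDF the parameters would be loaded from the JSON file once, and `predict` would be called per input line. If the vectorizer was configured differently (n-grams, TF-IDF, custom tokenizer), the tokenization step would have to be adapted to match.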

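Alternatively, you can install CPython, numpy, scipy, and scikit-learn on the Hadoop nodes (for example with the Anaconda distribution) and call scikit-learn through the Hadoop Streaming interface.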
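A streaming mapper for that second route might look roughly like the sketch below. It assumes that "nb.spickle" has been shipped to each node next to the script, that it contains an object whose `predict` accepts a list of raw strings (for example a scikit-learn Pipeline of vectorizer plus MultinomialNB, which the question does not confirm), and that each input line holds one text. The `run_mapper` and `main` names are hypothetical.

```python
import pickle
import sys


def run_mapper(lines, predict_batch, out):
    """Write 'label<TAB>text' for every non-empty input line."""
    for line in lines:
        text = line.strip()
        if not text:
            continue
        # predict_batch takes a list of texts and returns a list of labels.
        label = predict_batch([text])[0]
        out.write("%s\t%s\n" % (label, text))


def main():
    # Assumption: the pickle sits in the task's working directory.
    with open("nb.spickle", "rb") as f:
        model = pickle.load(f)
    run_mapper(sys.stdin, model.predict, sys.stdout)
```

The actual mapper script would end with a call to `main()`. Predicting one line at a time keeps the sketch short; batching several lines per `predict` call would likely be faster on large inputs.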
