Sentiment Analysis of Twitter Data Using Hadoop and Pig



The tweets from Twitter are stored in HDFS on Hadoop, and they need to be processed for sentiment analysis. The tweets in HDFS are in AVRO format, so a JSON loader is needed to process them, but the tweets from HDFS are not being read by the Pig script. After changing the JAR files, the Pig script shows a failure message.

With the following JAR files, the Pig script fails:

REGISTER '/home/cloudera/desktop/elephant-bird-hadoop-compat-4.17.jar';

REGISTER '/home/cloudera/desktop/elephant-bird-pig-4.17.jar';

REGISTER '/home/cloudera/desktop/json-simple-3.1.0.jar';

This is another set of JAR files with which the script does not fail, but the data is not read either:

REGISTER '/home/cloudera/desktop/elephant-bird-hadoop-compat-4.17.jar';

REGISTER '/home/cloudera/desktop/elephant-bird-pig-4.17.jar';

REGISTER '/home/cloudera/desktop/json-simple-1.1.jar';

These are all the Pig script commands I used:

tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH tweets GENERATE myMap#'id' as id ,myMap#'tweets' as tweets;
tokens = foreach B generate id, tweets, FLATTEN(TOKENIZE(tweets)) As word;
dictionary = load ' /user/cloudera/OutputData/AFINN.txt' using PigStorage('t') AS(word:chararray,rating:int);
word_rating = join tokens by word left outer, dictionary by word using 'replicated';
describe word_rating;
rating = foreach word_rating generate tokens::id as id,tokens::tweets as tweets, dictionary::rating as rate;
word_group = group rating by (id,tweets);
avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;
positive_tweets = filter avg_rate by tweet_rating>=0;
DUMP positive_tweets;
negative_tweets = filter avg_rate by tweet_rating<=0;
DUMP negative_tweets;

Error when running DUMP on the tweets alias above with the first set of JAR files:

Input(s): Failed to read data from "/user/cloudera/OutputData/tweets"

Output(s): Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp/temp-1614543351/tmp37889715"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0001

2019-05-03 09:59:09,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2019-05-03 09:59:09,427 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
Details at logfile: /home/cloudera/pig_1556902594207.log

Output when running DUMP on the tweets alias above with the second set of JAR files (the job succeeds but reads 0 records):

Input(s): Successfully read 0 records (5178477 bytes) from: "/user/cloudera/OutputData/tweets"

Output(s): Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp/temp-1614543351/tmp479037703"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0002

2019-05-03 10:01:05,417 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-05-03 10:01:05,418 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-05-03 10:01:05,418 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2019-05-03 10:01:05,428 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2019-05-03 10:01:05,428 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

The expected output is the tweets sorted into positive and negative, but instead these errors occur. Please help. Thank you.

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException

This usually indicates a syntax error in the Pig script.

The AS keyword in a LOAD statement normally expects a schema, and myMap is not a valid schema in your LOAD statement.

See https://stackoverflow.com/a/12829494/8886552 for an example of using JsonLoader.
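For illustration, a minimal sketch of the LOAD with an explicit map schema (this assumes the input really is line-delimited JSON, that the elephant-bird JARs above are registered, and it reuses the field names 'id' and 'tweets' from your script):

-- Declare the loaded field as a map instead of the bare alias myMap
tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- Dump a few records first to confirm the loader can actually parse the file
sample_tweets = LIMIT tweets 5;
DUMP sample_tweets;

-- Then reference fields through the map, casting to concrete types
B = FOREACH tweets GENERATE (chararray)json#'id' AS id, (chararray)json#'tweets' AS tweets;

If DUMP sample_tweets already fails with the same ParseException, the problem is in the input data or the loader rather than in the later joins.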
