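I'm trying to do sentiment analysis with word lists, counting the positive and negative words in a column of a PySpark DataFrame. The same approach successfully counts the positive words; that list has about 2k words. The negative list is roughly twice as long (about 4k words), and with it the job fails with the error below. What causes this, and how can I fix it?
I don't think the code itself is at fault, since it works for the positive words, but I can't tell whether the list I'm searching with is too long or whether I'm missing something. Here is an example (not the exact list):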
stories.show()
+--------------------+
| words|
+--------------------+
|tom and jerry went t|
|she was angry when g|
|arnold became sad at|
+--------------------+
neg = ['angry','sad','sorrowful','angry']
#doing some counting manipulation here
df3.show()
Error:
spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
1308 answer = self.gateway_client.send_command(command)
1309 return_value = get_return_value(
-> 1310 answer, self.gateway_client, self.target_id, self.name)
1311
1312 for temp_arg in temp_args:
/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "<ipython-input-6-97710da0cedd>", line 17, in countNegatives
File "/usr/lib/python3.7/re.py", line 225, in findall
return _compile(pattern, flags).findall(string)
File "/usr/lib/python3.7/re.py", line 288, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.7/sre_parse.py", line 932, in parse
p = _parse_sub(source, pattern, True, 0)
File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
not nested and not items))
File "/usr/lib/python3.7/sre_parse.py", line 648, in _parse
source.tell() - here + len(this))
re.error: multiple repeat at position 5
Expected output:
+--------------------+--------+
| words|Negative|
+--------------------+--------+
|tom and jerry went t| 45|
|she was angry when g| 12|
|arnold became sad at| 54|
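Your neg list contains characters that have a special meaning in regular expressions, so the pattern built from it is not a valid regex and cannot be parsed (hence "multiple repeat at position 5"). The length of the list is not the problem.
You can use the re.escape() function to escape the special characters in each word before building the pattern.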
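As a minimal sketch of that fix: the question does not show the body of countNegatives, so the function below is an assumed stand-in, and the entry 'd**n' is a hypothetical example of the kind of word that triggers "multiple repeat". Each word is passed through re.escape() before the alternation is joined, and the pattern is compiled once rather than on every row.

```python
import re

# Hypothetical list: 'd**n' stands in for whatever entries in the real
# ~4k-word list contain regex metacharacters such as '*', '+', '(' or '?'.
neg = ['angry', 'sad', 'sorrowful', 'angry', 'd**n', 'two-faced']


def build_pattern(words):
    """Build one case-insensitive regex that matches any word literally."""
    # Deduplicate, and put longer entries first so they win over shorter ones.
    unique = sorted(set(words), key=len, reverse=True)
    # re.escape() turns 'd**n' into 'd\*\*n', so '*' is matched literally.
    escaped = [re.escape(w) for w in unique]
    # Lookarounds instead of \b, because \b misbehaves next to non-word
    # characters such as '*'.
    return re.compile(r'(?<!\w)(?:' + '|'.join(escaped) + r')(?!\w)',
                      re.IGNORECASE)


NEG_PATTERN = build_pattern(neg)


def count_negatives(text):
    """Return the number of negative-word occurrences in text."""
    if text is None:  # Spark passes None for null cells
        return 0
    return len(NEG_PATTERN.findall(text))


print(count_negatives('she was angry and sad, d**n it'))
```

In PySpark this plain function could then be wrapped with pyspark.sql.functions.udf (with an IntegerType return type) and applied to the words column; that wiring is left out here because the original counting code is not shown.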