PySpark error at `return _compile(pattern, flags).findall(string)`: how to troubleshoot



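I am trying to do sentiment analysis with word lists, counting the positive and negative words in a column of a PySpark DataFrame. The same approach works for the positive words; that list contains about 2k words. The negative list is roughly twice as long (about 4k words), and counting with it fails. What is causing this problem, and how can I fix it?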

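I don't think the code is at fault, because it works for the positive words. I can't tell whether the second list is simply too long to search with or whether I am missing something. Here is an example (not the exact list):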

stories.show()
+--------------------+
|               words|
+--------------------+
|tom and jerry went t|
|she was angry when g|
|arnold became sad at|
+--------------------+

neg = ['angry','sad','sorrowful','angry']

#doing some counting manipulation here
df3.show()

Error:

spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
1308         answer = self.gateway_client.send_command(command)
1309         return_value = get_return_value(
-> 1310             answer, self.gateway_client, self.target_id, self.name)
1311 
1312         for temp_arg in temp_args:
/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
115                 # Hide where the exception came from that shows a non-Pythonic
116                 # JVM exception message.
--> 117                 raise converted from None
118             else:
119                 raise
PythonException: 
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "<ipython-input-6-97710da0cedd>", line 17, in countNegatives
File "/usr/lib/python3.7/re.py", line 225, in findall
return _compile(pattern, flags).findall(string)
File "/usr/lib/python3.7/re.py", line 288, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.7/sre_parse.py", line 932, in parse
p = _parse_sub(source, pattern, True, 0)
File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
not nested and not items))
File "/usr/lib/python3.7/sre_parse.py", line 648, in _parse
source.tell() - here + len(this))
re.error: multiple repeat at position 5

Expected output:

+--------------------+--------+
|               words|Negative|
+--------------------+--------+
|tom and jerry went t|      45|
|she was angry when g|      12|
|arnold became sad at|      54|
+--------------------+--------+

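Your neg list contains characters that have a special meaning in regular expressions, so the pattern built from it is not a valid regex and cannot be parsed. The `multiple repeat` message in the traceback is what `re` reports when one repetition operator directly follows another (for example `**`). It is a parse error, not a limit on the length of the list.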
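
One way to locate the offending entries is to compile each word on its own and collect the ones that fail. This is only a sketch: the list below is a hypothetical stand-in for the real one, the entry "d**n" is an assumed example of a censored word, and `find_uncompilable` is a helper name made up for illustration.

```python
import re

# Hypothetical stand-in for the real negative list; "d**n" is an assumed entry.
neg = ["angry", "sad", "sorrowful", "d**n"]

def find_uncompilable(words):
    """Return (word, error message) pairs for words that are not valid regexes."""
    bad = []
    for word in words:
        try:
            re.compile(word)
        except re.error as exc:
            bad.append((word, str(exc)))
    return bad

for word, message in find_uncompilable(neg):
    print(word, "->", message)
```

Words that compile but still contain metacharacters (a `.` for instance) are not reported by this check, although they can match more than intended.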

You can use the re.escape() function to escape the special characters in the pattern.
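
A minimal sketch of that fix, assuming the pattern is built by joining the words with `|` (the question does not show the original `countNegatives` code, so `count_negatives` and the sample list are hypothetical):

```python
import re

# Hypothetical stand-in for the real negative list; "d**n" is an assumed entry.
neg = ["angry", "sad", "sorrowful", "d**n"]

# re.escape makes every word match literally, so * and + lose their regex meaning.
NEG_PATTERN = re.compile("|".join(re.escape(word) for word in neg))

def count_negatives(text):
    """Count occurrences of any listed word in text (plain substring matching)."""
    return len(NEG_PATTERN.findall(text))

print(count_negatives("she was angry and sad, d**n it"))  # prints 3
```

In PySpark, a function like this could be wrapped with `pyspark.sql.functions.udf` and an `IntegerType` return type before being applied to the `words` column. Note that this sketch also matches inside longer words ("sad" in "saddle"), so word boundaries may need to be added depending on the data.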
