Checking text for special characters with a Python function

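I'm writing a custom function that converts a text file into a DataFrame of features. I'd like the function to detect question marks and exclamation marks in the text input and record the result as a value in a df column. The relevant part of my function looks like this: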

discount = ['[%]','[€]','[$]','[£]','korting','deal','discount','reduct','remise','voucher',
            'descuento', 'rebaja', 'скидка', 'sconto','rabat','alennus','kedvezmény',
            '할인','折扣','ディスカウント','diskon']
data = [text_input.split()]
for word in data:
    if any(char in discount for char in word):
        df['discount'] = 1
    else:
        df['discount'] = 0
for word in data:
    if any(char == '!' for char in word):
        df['exclamation'] = 1
    else:
        df['exclamation'] = 0
for word in data:
    if any(char == '?' for char in word):
        df['question'] = 1
    else:
        df['question'] = 0

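The problem is that if the text input contains, for example, "discount!", neither the "!" nor the word "discount" is recognized, so both columns end up as 0. If I remove the "!" from "discount!", the word is recognized.

So I'd like to know how to split my text_input so that the "!" is separated from the word. Or is there a more efficient way to find these characters?

Thanks in advance!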

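For example, you can use a regular expression to split text_input on either a space or a "!". It is also easy to add other special characters to the regular expression.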

import re
data = re.split('[! ]', text_input)
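
As a quick illustration of what this split produces, here is a minimal sketch. The sample string is made up for the example. Note that the split discards the delimiters, so the "!" no longer appears in the result and would have to be detected separately, and that two delimiters in a row leave an empty string behind.

```python
import re

text_input = "Big discount! Buy now"  # hypothetical sample input

# Split on either "!" or a space; the delimiters themselves are dropped.
data = re.split('[! ]', text_input)
print(data)   # ['Big', 'discount', '', 'Buy', 'now']

# Drop the empty string left where "!" is directly followed by a space.
words = [w for w in data if w]
print(words)  # ['Big', 'discount', 'Buy', 'now']
```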

Managed to solve it. Here is my updated code:

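# Tokens split on "*", "?", "!" or a space; wrapped in a list so each loop runs once.
data_str = [re.split('[*?*! ]', text_input)]
# Every character that is not an ASCII letter or digit.
data_chr = [re.findall('[^A-Za-z0-9]', text_input)]
for word in data_str:
    # word is the full token list, so this checks for an exact token match.
    if any(phrase in word for phrase in discount):
        df['discount'] = 1
    else:
        df['discount'] = 0
for word in data_chr:
    if '!' in word:
        df['exclamation'] = 1
    else:
        df['exclamation'] = 0
for word in data_chr:
    if '?' in word:
        df['question'] = 1
    else:
        df['question'] = 0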
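
For comparison, the same three flags can also be computed without splitting at all. The following is only a sketch under some assumptions, not the poster's method: the helper name text_flags is hypothetical, the currency and percent signs are written as plain characters instead of '[%]'-style entries, and matching is done by substring rather than by whole token. Substring matching lets stems such as 'reduct' match "reduction", but it also means 'deal' would match "ideal", so it may need tightening for real data.

```python
# Keyword list from the question, with symbols as plain characters (assumption).
DISCOUNT_TERMS = ['%', '€', '$', '£', 'korting', 'deal', 'discount', 'reduct',
                  'remise', 'voucher', 'descuento', 'rebaja', 'скидка', 'sconto',
                  'rabat', 'alennus', 'kedvezmény', '할인', '折扣',
                  'ディスカウント', 'diskon']


def text_flags(text_input):
    """Return 0/1 flags for discount terms, '!' and '?' in text_input."""
    lowered = text_input.lower()
    return {
        # Substring search, so "discount!" and "Discounts" both match.
        'discount': int(any(term in lowered for term in DISCOUNT_TERMS)),
        'exclamation': int('!' in text_input),
        'question': int('?' in text_input),
    }


print(text_flags("Big discount! Buy now"))
# {'discount': 1, 'exclamation': 1, 'question': 0}
print(text_flags("How are you?"))
# {'discount': 0, 'exclamation': 0, 'question': 1}
```

Each value in the returned dictionary could then be assigned to the matching df column, for example df['discount'] = flags['discount'].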
