Combining many regex operations together

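I am working on a text-processing NLP project in Python, and I need to clean the data before extracting features. I use regex operations to strip special characters and to separate numbers from the characters attached to them, but each step runs as its own operation, which makes the cleaning slow. I would like to do it in as few operations as possible, or in some faster way.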

My code is as follows:

import re

def remove_special_char(x):
    if type(x) is str:
        x = x.replace('-', ' ').replace('(', ',').replace(')', ',')
        x = re.compile(r"\s+").sub(" ", x).strip()
        x = re.sub(r'[^A-Z a-z 0-9-,.x]+', '', x).lower()
        x = re.sub(r"([0-9]+(\.[0-9]+)?)", r" \1 ", x).strip()
        x = x.replace(",,", ",")
        return x
    else:
        return x

Can anyone help me?

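Besides compiling the patterns once outside the function, you can gain some performance by using translate for all of the one-to-one and one-to-nothing character conversions: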

import string
mappings     = {'-':' ', '(':',', ')':','}            # add more mappings as needed
mappings.update({ c:' ' for c in string.whitespace }) # white spaces become spaces
mappings.update({c:c.lower() for c in string.ascii_uppercase}) # set to lowercase
specialChars = str.maketrans(mappings)
def remove_special_char(x):
    x = x.translate(specialChars)
    ...
    return x
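As a rough sketch of how the elided part might be filled in, the version below assumes the remaining steps are simply the question's regex substitutions, moved to patterns compiled once at module level. The names `_TABLE`, `_WHITESPACE_RE`, `_DISALLOWED_RE`, `_NUMBER_RE`, and `clean_text` are hypothetical and not part of the answer above.

```python
import re
import string

# Hypothetical names; the mappings and patterns are taken from the thread.
_mappings = {'-': ' ', '(': ',', ')': ','}
_mappings.update({c: ' ' for c in string.whitespace})             # ASCII whitespace -> space
_mappings.update({c: c.lower() for c in string.ascii_uppercase})  # ASCII upper -> lower
_TABLE = str.maketrans(_mappings)

_WHITESPACE_RE = re.compile(r"\s+")                  # collapse runs of whitespace
_DISALLOWED_RE = re.compile(r"[^A-Z a-z 0-9-,.x]+")  # same character class as the question
_NUMBER_RE = re.compile(r"([0-9]+(\.[0-9]+)?)")      # integers and decimals

def clean_text(x):
    """Clean one value; non-string values are returned unchanged."""
    if not isinstance(x, str):
        return x
    x = x.translate(_TABLE)                  # one pass for all single-character mappings
    x = _WHITESPACE_RE.sub(" ", x).strip()
    x = _DISALLOWED_RE.sub("", x)
    x = _NUMBER_RE.sub(r" \1 ", x).strip()   # put spaces around each number
    return x.replace(",,", ",")

print(clean_text("Hello-World (v2.5)"))
```

Lowercasing moves into the translation table here, so the separate `.lower()` call is dropped. Whether this is measurably faster depends on the data, so it is worth timing on a real sample.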

The various operations use different replacement strings, so you can't really merge them into a single substitution.

You can, however, precompile all the regexps up front, though I suspect it won't make much of a difference:

import re

paren_re = re.compile(r"[()]")
whitespace_re = re.compile(r"\s+")
# The space stays in the class, as in the question's pattern, so words remain separated.
ident_re = re.compile(r"[^A-Za-z 0-9-,.x]+")
number_re = re.compile(r"([0-9]+(\.[0-9]+)?)")

def remove_special_char(x):
    if isinstance(x, str):
        x = x.replace("-", " ")
        x = paren_re.sub(",", x)
        x = whitespace_re.sub(" ", x)
        x = ident_re.sub("", x).lower()
        x = number_re.sub(r" \1 ", x).strip()
        x = x.replace(",,", ",")
    return x  # non-string values pass through unchanged

Have you profiled your program to confirm that this is actually the bottleneck?
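One way to check is to time the candidates directly with the standard library's `timeit` module. The snippet below is only a minimal sketch: the sample text is made up, and the function names `inline_pattern` and `precompiled_pattern` are hypothetical. It compares passing the pattern string on every call against using a precompiled pattern object.

```python
import re
import timeit

WHITESPACE_RE = re.compile(r"\s+")

def inline_pattern(text):
    # Pattern given as a string on every call; re resolves it via its internal cache.
    return re.sub(r"\s+", " ", text)

def precompiled_pattern(text):
    # Pattern object compiled once at import time.
    return WHITESPACE_RE.sub(" ", text)

# Made-up sample data; substitute a representative slice of the real corpus.
sample = "some   text\twith  irregular \n whitespace " * 100

if __name__ == "__main__":
    for fn in (inline_pattern, precompiled_pattern):
        seconds = timeit.timeit(lambda: fn(sample), number=1000)
        print(f"{fn.__name__}: {seconds:.4f} s for 1000 calls")
```

For a whole-program view, running the script under `python -m cProfile -s cumtime` shows whether the cleaning function dominates the runtime at all.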
