Replacing substrings in a string using a list or dictionary in Python

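I am currently working on an NLP model and am optimizing the preprocessing steps. Because I am using a custom function, Polars cannot parallelize the operation.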

I have tried a few things with Polars, along with some "when.then.otherwise" expressions, but I have not found a solution yet.

In this example I am doing "expand contractions" (e.g. I'm -> I am).

I currently use this:

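import re
import polars as pl

# These are only a few example contractions that I use.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}
# One alternation pattern that matches any of the keys.
c_re = re.compile("(%s)" % "|".join(cList.keys()))

def expandContractions(text, c_re=c_re):
    # Look up each matched contraction in the dict.
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)

df = pl.DataFrame({"Text": ["i'm i've, isn't"]})
# Note: newer Polars releases rename Series.apply to map_elements.
df["Text"].apply(expandContractions)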

Output:

shape: (1, 1)
┌─────────────────────┐
│ Text                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘
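As a side note on the regex approach above, here is a minimal sketch of a slightly more defensive way to build the pattern. It assumes a hypothetical table in which one key is a prefix of another ("can't" and "can't've"); the names build_pattern and expand are made up for illustration. Keys are escaped and sorted longest first, so the alternation does not stop at the shorter key.

```python
import re

# Hypothetical table: "can't" is a prefix of "can't've".
contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "i'm": "i am",
}

def build_pattern(mapping):
    # Longest keys first so "can't've" wins over its prefix "can't";
    # re.escape guards against keys containing regex metacharacters.
    keys = sorted(mapping, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in keys))

pattern = build_pattern(contractions)

def expand(text):
    # Replace every matched key with its expansion in a single pass.
    return pattern.sub(lambda m: contractions[m.group(0)], text)

print(expand("i'm sure you can't've seen it"))
```

With the keys in their original order, "can't" would match first and leave a stray "'ve" behind.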

But I would like to use the full performance benefits of Polars, because the datasets I process are quite large.

Performance test:

import functools
import re
import polars as pl

# This dict has 100+ key/value pairs in my test case.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not"
}

# Base case: Python regex applied element by element.
def base_case(sr: pl.Series) -> pl.Series:
    c_re = re.compile("(%s)" % "|".join(cList.keys()))
    def expandContractions(text, c_re=c_re):
        def replace(match):
            return cList[match.group(0)]
        return c_re.sub(replace, text)
    sr = sr.apply(expandContractions)
    return sr

# Loop case: one literal replace_all call per key.
def loop_case(sr: pl.Series) -> pl.Series:
    for old, new in cList.items():
        sr = sr.str.replace_all(old, new, literal=True)
    return sr

# Iter case: the same chain of calls, built with functools.reduce.
def iter_case(sr: pl.Series) -> pl.Series:
    sr = functools.reduce(
        lambda res, x: getattr(getattr(res, "str"), "replace_all")(
            x[0], x[1], literal=True
        ),
        cList.items(),
        sr,
    )
    return sr

They all return the same result. Here are the average times over 15 loops for ~10,000 samples with a sample length of ~500 characters:

Base case: 16.112362766265868
Loop case: 7.028670716285705
Iter case: 7.112465214729309
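The measurement code itself is not shown, so the following is only a sketch of how an average over 15 loops could be taken with the standard-library timeit module. The helper average_time and the stand-in workload upper_all are hypothetical, and whether the figures above are per-loop or total times is not stated.

```python
import timeit

def average_time(func, arg, loops=15):
    # Total wall-clock time for `loops` calls, divided by the number of calls.
    total = timeit.timeit(lambda: func(arg), number=loops)
    return total / loops

# Stand-in workload (hypothetical); swap in base_case, loop_case or iter_case
# together with a pl.Series to time the real functions.
def upper_all(samples):
    return [s.upper() for s in samples]

# 1,000 strings of roughly 500 characters each.
samples = ["i'm i've, isn't " * 30] * 1_000
avg = average_time(upper_all, samples)
print(f"Stand-in case: {avg:.6f} s per loop")
```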

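So either of these two approaches is more than twice as fast as the base case, mainly thanks to the Polars API call "replace_all". I ended up using the loop case, since it needs one fewer module import. See the answer to this question: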

(
    df['Text']
    .str.replace_all("i'm", "i am", literal=True)
    .str.replace_all("i've", "i have", literal=True)
    .str.replace_all("isn't", "is not", literal=True)
)

Or:

functools.reduce(
    lambda res, x: getattr(
        getattr(res, "str"), "replace_all"
    )(x[0], x[1], literal=True),
    cList.items(),
    df["Text"],
)
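One caveat when moving from the single-pass regex to chained replacements: the two only agree as long as no replacement produces text that is itself a key. The sketch below uses a deliberately contrived, hypothetical table to show the difference. Plain str.replace stands in for the literal replace_all calls so that the example runs without Polars.

```python
import functools
import re

# Hypothetical table where one replacement produces text that is another key.
table = {"won't": "will not", "will not": "shall not"}

def chained(text, mapping):
    # Sequential literal replacements, applied in dict order;
    # str.replace stands in for replace_all(..., literal=True).
    return functools.reduce(
        lambda res, kv: res.replace(kv[0], kv[1]), mapping.items(), text
    )

def single_pass(text, mapping):
    # One regex pass: replaced text is never scanned again.
    pattern = re.compile("|".join(re.escape(k) for k in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(chained("i won't go", table))      # i shall not go
print(single_pass("i won't go", table))  # i will not go
```

A typical contraction table is unlikely to contain such overlaps, which would be consistent with all three cases returning the same result in the test above.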
