I'm trying to customize Gensim's WikiCorpus corpus processing by setting the tokenizer_func parameter to a custom tokenize function:
# Set tokenizer to our custom tokenizer
wiki = WikiCorpus(input, tokenizer_func=tokenize)
But PyParsing is taking far too long to process the text (for example, a single article still hadn't been processed after running for a day). In my case, I want to clean the Wikipedia corpus as usual, except that any word matching a word list I have (the words may contain digits, underscores, or ampersands) should be left untouched.
Assume a variable-length list of words/phrases, phrase_list, which includes: 81, mcdonalds, twenty one, happy 10, sams car, ham & eggs
Here is some example input that needs cleaning:
Here's some example text to be converted to cleaned text! 8 * 81 is a number, McDonalds is a fast food chain. How about twenty one? That's a number too. Here are some tenses: run, ran, running. I don't know how it feels to be Happy 10, but Sam's car is nice. In-N-Out is a classic of course, and sometimes people write In N Out. 7-11 anyone? Or is it 7-11? Ham & eggs is also a fun book, I guess --
And the desired output:
heres some example text to be converted to cleaned text 81 is number mcdonalds is fast food chain how about twenty_one thats number too here are some tenses run ran running dont know how it feels to be happy_10 but sams_car is nice in_n_out is classic of course and sometimes people write in_n_out 7_11 anyone or is it 7_11 ham_&_eggs is also fun book guess
Note that processing this sample text is very fast, but processing the actual Wikipedia corpus is not (I'm following this tutorial, but customizing it: https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html).
Here is the custom tokenizer (and some helper functions) I wrote using PyParsing:
from typing import List
from pyparsing import *
from gensim.utils import to_unicode
from gensim.corpora import WikiCorpus
import string
import re
TOKEN_MIN_LEN = 2
TOKEN_MAX_LEN = 30
def pre_phrase_tokenize_processing(sentence):
    """
    Helper: For cleaning sentence before phrases have been combined into single underscored tokens. Apply to entire sentence.
    Removes hyphens, and punctuation except ampersands.
    """
    # replace all hyphens with spaces since some phrases use them; just consider as multiword so they can be combined with underscore later
    sentence = sentence.translate(str.maketrans('-', ' '))
    # remove all punctuation except ampersands, since some phrases use them
    remove_punct = string.punctuation.replace("&", "")
    sentence = sentence.translate(str.maketrans('', '', remove_punct))
    return sentence
def turn_phrases_into_tokens(phrases, sentence):
    """
    Helper: Turns all individual phrases in a sentence into single underscored tokens according to a provided phrase dictionary.
    """
    regex = re.compile("|".join([r"\b{}\b".format(phrase) for phrase in phrases]))
    sentence = regex.sub(lambda m: phrases[m.group(0)], sentence)
    return sentence
#@traceParseAction
def post_phrase_tokenize_processing(toks):
    """
    For cleaning sentence after phrases have been combined into single underscored tokens. Apply to each non-phrase word.
    Removes numeric characters and punctuation. Note toks is a list passed by pyparsing.
    """
    # remove numeric characters, since only non-brand words are passed in
    word = re.sub(r'\d+', '', toks[0])
    # remove all punctuation (including &), since only non-brand words are passed in
    word = word.translate(str.maketrans('', '', string.punctuation))
    return word
# our phrase dictionary - actual list may contain many more phrases
phrases = {"81": "81", "mcdonalds": "mcdonalds", "twenty one": "twenty_one", "happy 10": "happy_10", "sams car": "sams_car", "ham & eggs": "ham_&_eggs"}
def tokenize(content: str, token_min_len=TOKEN_MIN_LEN, token_max_len=TOKEN_MAX_LEN, lower=True) -> List[str]:
    """Overrides original tokenize method in wikicorpus.py
    Tokenize a piece of text from Wikipedia.

    Parameters
    ----------
    content : str
        String without markup (see :func:`~gensim.corpora.wikicorpus.filter_wiki`).
    token_min_len : int
        Minimal token length.
    token_max_len : int
        Maximal token length.
    lower : bool
        Convert `content` to lower case?

    Returns
    -------
    list of str
        List of tokens from `content`.
    """
    content = to_unicode(content, encoding='utf8', errors='ignore')
    if lower:
        content = content.lower()
    content = pre_phrase_tokenize_processing(content)
    # Combine any phrases into single tokens
    content = turn_phrases_into_tokens(phrases, content)
    # Match either one of our phrases, or any other nonwhitespace word (in which case we process)
    phrase_list = list(phrases.values())
    parser = Combine(
        OneOrMore(
            oneOf(phrase_list, asKeyword=True)
            | Word(alphas)
            | Word(printables).setParseAction(post_phrase_tokenize_processing)
        ),
        joinString=' ',
        adjacent=False
    )
    content = parser.transformString(content)
    return [
        to_unicode(token) for token in content.split()
        if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
    ]
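A minimal way to exercise the tokenizer on a short snippet (just a sketch, assuming the code above lives in one script; sample_text is an illustrative string, not the exact example text) looks like this:

# quick sanity check: run the custom tokenizer on a short snippet
sample_text = "8 * 81 is a number, McDonalds is a fast food chain. How about twenty one?"
print(tokenize(sample_text))
# should print something like:
# ['81', 'is', 'number', 'mcdonalds', 'is', 'fast', 'food', 'chain', 'how', 'about', 'twenty_one']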
Also, just for reference, this is the actual code I use to invoke the tokenizer when cleaning the Wikipedia corpus (as opposed to the sample text); more can be found in the same tutorial linked above:
def make_corpus(in_f, out_f):
    """Convert Wikipedia xml dump file to text corpus"""
    output = open(out_f, 'w')
    # Set tokenizer to our custom tokenizer
    wiki = WikiCorpus(in_f, tokenizer_func=tokenize)
    i = 0
    for text in wiki.get_texts():
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        i = i + 1
        if (i % 100 == 0):
            print('Processed ' + str(i) + ' articles')
    output.close()
    print('Processing complete!')
So far I'm fairly sure the problem is the PyParsing part (the section where I call parser = Combine(...)): instead of matching every non-whitespace word, I should only be matching the words that actually need cleaning, but I'm a bit stuck on how to do that since I don't have much experience with this library. I also had the problem that the whitespace between words was removed when they were put back together, which is why I had to call Combine with joinString=' ', so any suggestions on that would be greatly appreciated!
After commenting out the content-cleaning calls to pre_phrase_tokenize_processing and turn_phrases_into_tokens (since there isn't enough code here to run them), and commenting out the parse action post_phrase_tokenize_processing, this runs in about 1 second on my system. What exactly is going on in post_phrase_tokenize_processing?
Try running with a very small input text and a profile on the parse action to get some ideas for next steps.
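For example, one rough way to profile it with the standard library's cProfile (a sketch; it assumes the tokenize() function and a short sample_text string from the question are in scope):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
tokenize(sample_text)   # sample_text: any short snippet like the example in the question
profiler.disable()

# show the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)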
You can also do some rough instrumentation of that parse action by wrapping it in the diagnostic decorator that pyparsing provides, traceParseAction. You can add it as a decorator on post_phrase_tokenize_processing, or just wrap it inline in the call to setParseAction:
...
| Word(printables).setParseAction(traceParseAction(post_phrase_tokenize_processing))
...
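The decorator form just goes on the function definition itself (re-listing the question's function here for illustration; traceParseAction logs each call's input tokens and return value to stderr):

from pyparsing import traceParseAction

@traceParseAction
def post_phrase_tokenize_processing(toks):
    # identical body to the question's version
    word = re.sub(r'\d+', '', toks[0])
    word = word.translate(str.maketrans('', '', string.punctuation))
    return word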
Edit: You can get the same behavior as your updated oneOf by using Regex.
Here is a solution using plain regular expressions:
import re
phrases = "ab abc def a".split()
phrases_re = re.compile(r"\b(" + '|'.join(re.escape(w) for w in phrases) + r")\b")
print(phrases_re.findall("abcd bc abc a bc def"))
['abc', 'a', 'def']
You can turn this into a pyparsing Regex with:
import pyparsing as pp
phrases_expr = pp.Regex(phrases_re)
print(phrases_expr.searchString("abcd bc abc a bc def"))
[['abc'], ['a'], ['def']]
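To plug that back into the tokenizer, one option (just a sketch, reusing the phrases dict and parse action from the question; the exact wiring is untested) is to swap the Regex expression in for oneOf:

import re
import pyparsing as pp

# single alternation over the underscored phrase tokens, anchored on word boundaries
phrase_list = list(phrases.values())
phrases_re = re.compile(r"\b(" + "|".join(re.escape(p) for p in phrase_list) + r")\b")

parser = pp.Combine(
    pp.OneOrMore(
        pp.Regex(phrases_re)      # phrase tokens pass through untouched
        | pp.Word(pp.alphas)      # purely alphabetic words need no cleanup
        | pp.Word(pp.printables).setParseAction(post_phrase_tokenize_processing)
    ),
    joinString=' ',
    adjacent=False,
)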