I'm trying to customize Gensim's WikiCorpus corpus processing by setting the tokenizer_func parameter to a custom tokenize function:
# Set tokenizer to our custom tokenizer
wiki = WikiCorpus(input, tokenizer_func=tokenize)
But PyParsing is taking far too long to process the text (for example, a single article still hadn't been processed after running for a day). In my case, I want to clean the Wikipedia corpus as usual, except that any word matching a word list I have (the words may contain digits, underscores, or ampersands) should be left untouched.
Assume a variable-length list of words/phrases, phrase_list, which includes: 81, mcdonalds, twenty one, happy 10, sams car, ham & eggs
Here is some example input that needs cleaning:
Here's some example text to be converted to cleaned text! 8 * 81 is a number, McDonalds is a fast food chain. How about twenty one? That's a number too. Here are some tenses: run, ran, running. I don't know how it feels to be Happy 10, but Sam's car is nice. In-N-Out is a classic of course, and sometimes people write In N Out. 7-11 anyone? Or is it 7-11? Ham & eggs is also a fun book, I guess --
And the desired output:
heres some example text to be converted to cleaned text 81 is number mcdonalds is fast food chain how about twenty_one thats number too here are some tenses run ran running dont know how it feels to be happy_10 but sams_car is nice in_n_out is classic of course and sometimes people write in_n_out 7_11 anyone or is it 7_11 ham_&_eggs is also fun book guess
Note that processing this sample text is very fast, but processing the actual Wikipedia corpus is not (I'm following this tutorial, but customizing it: https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html).
Here is the custom tokenizer (and some helper functions) I wrote using PyParsing:
from typing import List
from pyparsing import *
from gensim.utils import to_unicode
from gensim.corpora import WikiCorpus
import string
import re
TOKEN_MIN_LEN = 2
TOKEN_MAX_LEN = 30
def pre_phrase_tokenize_processing(sentence):
    """
    Helper: For cleaning sentence before phrases have been combined into single underscored tokens. Apply to entire sentence.
    Removes hyphens, and punctuation except ampersands.
    """
    # replace all hyphens with spaces since some phrases use them; just consider as multiword so they can be combined with underscore later
    sentence = sentence.translate(str.maketrans('-', ' '))
    # remove all punctuation except ampersands, since some phrases use them
    remove_punct = string.punctuation.replace("&", "")
    sentence = sentence.translate(str.maketrans('', '', remove_punct))
    return sentence
def turn_phrases_into_tokens(phrases, sentence):
    """
    Helper: Turns all individual phrases in a sentence into single underscored tokens according to a provided phrase dictionary.
    """
    regex = re.compile("|".join([r"\b{}\b".format(phrase) for phrase in phrases]))
    sentence = regex.sub(lambda m: phrases[m.group(0)], sentence)
    return sentence
#@traceParseAction
def post_phrase_tokenize_processing(toks):
    """
    For cleaning sentence after phrases have been combined into single underscored tokens. Apply to each non-phrase word.
    Removes numeric characters and punctuation. Note toks is a list passed by pyparsing.
    """
    # remove numeric characters, since only non-brand words are passed in
    word = re.sub(r'\d+', '', toks[0])
    # remove all punctuation (including &), since only non-brand words are passed in
    word = word.translate(str.maketrans('', '', string.punctuation))
    return word
# our phrase dictionary - actual list may contain many more phrases
phrases = {"81": "81", "mcdonalds": "mcdonalds", "twenty one": "twenty_one", "happy 10": "happy_10", "sams car": "sams_car", "ham & eggs": "ham_&_eggs"}
def tokenize(content: str, token_min_len=TOKEN_MIN_LEN, token_max_len=TOKEN_MAX_LEN, lower=True) -> List[str]:
    """Overrides original tokenize method in wikicorpus.py
    Tokenize a piece of text from Wikipedia.

    Parameters
    ----------
    content : str
        String without markup (see :func:`~gensim.corpora.wikicorpus.filter_wiki`).
    token_min_len : int
        Minimal token length.
    token_max_len : int
        Maximal token length.
    lower : bool
        Convert `content` to lower case?

    Returns
    -------
    list of str
        List of tokens from `content`.
    """
    content = to_unicode(content, encoding='utf8', errors='ignore')
    if lower:
        content = content.lower()
    content = pre_phrase_tokenize_processing(content)
    # Combine any phrases into single tokens
    content = turn_phrases_into_tokens(phrases, content)
    # Match either one of our phrases, or any other nonwhitespace word (in which case we process)
    phrase_list = list(phrases.values())
    parser = Combine(
        OneOrMore(
            oneOf(phrase_list, asKeyword=True)
            | Word(alphas)
            | Word(printables).setParseAction(post_phrase_tokenize_processing)
        ),
        joinString=' ',
        adjacent=False
    )
    content = parser.transformString(content)
    return [
        to_unicode(token) for token in content.split()
        if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
    ]
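A minimal way to exercise the tokenizer on a short snippet (just a sketch, assuming the code above lives in one script; sample_text is an illustrative string, not the exact example text) looks like this:

# quick sanity check: run the custom tokenizer on a short snippet
sample_text = "8 * 81 is a number, McDonalds is a fast food chain. How about twenty one?"
print(tokenize(sample_text))
# should print something like:
# ['81', 'is', 'number', 'mcdonalds', 'is', 'fast', 'food', 'chain', 'how', 'about', 'twenty_one']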
Also, just for reference, this is the actual code I use to invoke the tokenizer when cleaning the Wikipedia corpus (as opposed to the sample text); more can be found in the same tutorial linked above:
def make_corpus(in_f, out_f):
    """Convert Wikipedia xml dump file to text corpus"""
    output = open(out_f, 'w')
    # Set tokenizer to our custom tokenizer
    wiki = WikiCorpus(in_f, tokenizer_func=tokenize)
    i = 0
    for text in wiki.get_texts():
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        i = i + 1
        if (i % 100 == 0):
            print('Processed ' + str(i) + ' articles')
    output.close()
    print('Processing complete!')
So far I'm fairly sure the problem is the PyParsing part (the section where I call parser = Combine(...)): instead of matching every non-whitespace word, I should only be matching the words that actually need cleaning, but I'm a bit stuck on how to do that since I don't have much experience with this library. I also had the problem that the whitespace between words was removed when they were put back together, which is why I had to call Combine with joinString=' ', so any suggestions on that would be greatly appreciated!
After commenting out the content-cleaning calls to pre_phrase_tokenize_processing and turn_phrases_into_tokens (since there isn't enough code here to run them), and commenting out the parse action post_phrase_tokenize_processing, this runs in about 1 second on my system. What exactly is going on in post_phrase_tokenize_processing?
Try running with a very small input text and a profile on the parse action to get some ideas for next steps.
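For example, one rough way to profile it with the standard library's cProfile (a sketch; it assumes the tokenize() function and a short sample_text string from the question are in scope):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
tokenize(sample_text)   # sample_text: any short snippet like the example in the question
profiler.disable()

# show the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)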
You can also do some rough instrumentation of that parse action by wrapping it in the diagnostic decorator that pyparsing provides, traceParseAction. You can add it as a decorator on post_phrase_tokenize_processing, or just wrap it inline in the call to setParseAction:
...
| Word(printables).setParseAction(traceParseAction(post_phrase_tokenize_processing))
...
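The decorator form just goes on the function definition itself (re-listing the question's function here for illustration; traceParseAction logs each call's input tokens and return value to stderr):

from pyparsing import traceParseAction

@traceParseAction
def post_phrase_tokenize_processing(toks):
    # identical body to the question's version
    word = re.sub(r'\d+', '', toks[0])
    word = word.translate(str.maketrans('', '', string.punctuation))
    return word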
Edit: You can get the same behavior as your updated oneOf by using Regex.
Here is a solution using plain regular expressions:
import re
phrases = "ab abc def a".split()
phrases_re = re.compile(r"\b(" + '|'.join(re.escape(w) for w in phrases) + r")\b")
print(phrases_re.findall("abcd bc abc a bc def"))
['abc', 'a', 'def']
You can turn this into a pyparsing Regex with:
import pyparsing as pp
phrases_expr = pp.Regex(phrases_re)
print(phrases_expr.searchString("abcd bc abc a bc def"))
[['abc'], ['a'], ['def']]
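To plug that back into the tokenizer, one option (just a sketch, reusing the phrases dict and parse action from the question; the exact wiring is untested) is to swap the Regex expression in for oneOf:

import re
import pyparsing as pp

# single alternation over the underscored phrase tokens, anchored on word boundaries
phrase_list = list(phrases.values())
phrases_re = re.compile(r"\b(" + "|".join(re.escape(p) for p in phrase_list) + r")\b")

parser = pp.Combine(
    pp.OneOrMore(
        pp.Regex(phrases_re)      # phrase tokens pass through untouched
        | pp.Word(pp.alphas)      # purely alphabetic words need no cleanup
        | pp.Word(pp.printables).setParseAction(post_phrase_tokenize_processing)
    ),
    joinString=' ',
    adjacent=False,
)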