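Tokenizing text is very slow

Question

I have a DataFrame with more than 90,000 rows and a column ['text'] that contains the text of news articles.

The texts average about 3,000 words each, and running them through word_tokenize is very slow. What would be a more efficient approach?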

from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize) 
df.head()

Also, word_tokenize leaves in some punctuation and other characters I don't want, so I wrote a filtering function that uses spaCy.

from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','xa0']
def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        if ~token.is_punct or ~token.is_stop or ~(token.text.lower() in spanish_stopwords) or ~(token.text.lower() in otherCharacters) or ~(token.text.lower() in STOP_WORDS):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens

Is there a better way to do this?

Thanks for reading my possibly newbie question, and have a nice day.

Note

nlp is defined as:

import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()

I use spaCy to tokenize, but I also use NLTK's Spanish stop words.

Answers

If you are only tokenizing, use a blank model (which contains only the tokenizer) instead of es_core_news_sm:

nlp = spacy.blank("es")
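As a rough sketch of how the blank pipeline might be applied to many documents, the following streams the texts through nlp.pipe instead of calling nlp once per row. The helper name tokenize_texts and the batch size are illustrative assumptions, not part of the original answer. It assumes spaCy is installed; a blank pipeline needs no trained model download.

```python
import spacy

# Blank Spanish pipeline: tokenizer only, no tagger, parser or NER.
nlp = spacy.blank("es")


def tokenize_texts(texts, batch_size=256):
    """Return one list of lowercased tokens per input text (hypothetical helper)."""
    results = []
    # nlp.pipe processes the texts in batches rather than one call per text.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append(
            [tok.lower_ for tok in doc if not tok.is_punct and not tok.is_space]
        )
    return results


print(tokenize_texts(["Hola, mundo.", "El gato come pescado."]))
```

With the DataFrame from the question, this could be used as df['tokenized_text'] = tokenize_texts(df['texto']), assuming the column holds plain strings.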

To make spaCy faster when you only want to tokenize, you can change:

nlp = es_core_news_sm.load()

to:

nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])

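A short explanation:
spaCy ships a full language model that not only tokenizes your sentences but also performs syntactic parsing and part-of-speech tagging. In practice, most of the computation time goes to those other tasks (the parse tree, POS tagging, NER) rather than to tokenization, which is a much "lighter" task computationally.
But, as you can see, spaCy lets you load only what you really need, which saves you some time.

Another thing: you can make your function more efficient by lowercasing each token only once and by adding the stop words to spaCy (even if you don't want to do that, keeping otherCharacters as a list rather than a set is not very efficient).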
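To illustrate the two points above without any spaCy dependency, here is a minimal sketch on plain strings. The stop word set and character list are tiny hypothetical stand-ins for the real ones, and filter_tokens is an illustrative name, not something from the question.

```python
# Hypothetical, tiny stand-ins for the real stop word lists.
SPANISH_STOPWORDS = {"el", "la", "de", "que", "y"}
OTHER_CHARACTERS = ["`", "\xa0", " "]

# Merge everything into one frozenset once, so each membership test is a
# hash lookup instead of a scan through a list.
UNWANTED = frozenset(SPANISH_STOPWORDS) | frozenset(OTHER_CHARACTERS)


def filter_tokens(tokens):
    """Lowercase each token once and drop anything found in UNWANTED."""
    kept = []
    for text in tokens:
        lowered = text.lower()  # computed once, reused for the check and the output
        if lowered not in UNWANTED:
            kept.append(lowered)
    return kept


print(filter_tokens(["El", "gato", "`", "come", "de", "la", "mesa"]))
```

The original function called token.text.lower() up to four times per token and checked three separate collections. A single merged set reduces that to one lowercase call and one lookup.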

I would also add:

for w in stopwords.words('spanish'):
    nlp.vocab[w].is_stop = True
for w in otherCharacters:
    nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
    nlp.vocab[w].is_stop = True

and:

for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
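It may help to spell out why the condition in the question never filtered anything. In Python, ~ is bitwise inversion, so on a bool it returns a non-zero integer, which is always truthy. Chaining the checks with or would also keep a token as soon as any single check passed. The sketch below uses plain booleans in place of spaCy token attributes; keep_buggy and keep_fixed are hypothetical names used only for illustration.

```python
import warnings

# Newer Python versions warn about ~ on a bool; silenced so the demo output stays clean.
warnings.simplefilter("ignore", DeprecationWarning)

# Bitwise inversion of a bool gives a non-zero integer, which is truthy.
print(~True, ~False)              # -2 -1
print(bool(~True), bool(~False))  # True True


def keep_buggy(is_punct, is_stop):
    # Mirrors the shape of the question's condition: truthy for every input.
    return bool(~is_punct or ~is_stop)


def keep_fixed(is_punct, is_stop):
    # Logical negation combined with `and`: keep only if neither flag is set.
    return not is_punct and not is_stop


for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, keep_buggy(*flags), keep_fixed(*flags))
```

Only the (False, False) case passes keep_fixed, which matches the intent of dropping punctuation and stop words.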
