Tokenizing sentences for NLP in Python



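I have a dataset with a column 'Text' that contains 2851 sentences, and I want to tokenize them so that I can build a bag-of-words model for NLP. I tried using the loc function, but it didn't work. Can anyone tell me how to do this?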

You can use nltk to tokenize the sentences and apply it to your dataset. I'm not sure what your dataset looks like, but here is an example with pandas.

>>> import pandas as pd
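>>> from nltk.tokenize import word_tokenize
>>> # word_tokenize needs NLTK's tokenizer data; if it is missing, run
>>> # nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').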

>>> s = "Hey! Look! It's a sentence."
>>> word_tokenize(s)
# Output
['Hey', '!', 'Look', '!', 'It', "'s", 'a', 'sentence', '.']
>>> df = pd.DataFrame({"Name": ["first", "second", "third"],
...                    "Text": ["This is the first sentence.",
...                             "This is the second sentence.",
...                             "Hey! Look! It's a third one."]})
>>> df
# Output
     Name                          Text
0   first   This is the first sentence.
1  second  This is the second sentence.
2   third  Hey! Look! It's a third one.
>>> df['Text'].apply(word_tokenize)
# Output
0    [This, is, the, first, sentence, .]
1    [This, is, the, second, sentence, .]
2    [Hey, !, Look, !, It, 's, a, third, one, .]
Name: Text, dtype: object
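
Since the goal is a bag of words, here is a minimal sketch of one way to go from the token lists to count vectors. It uses only the standard library, and the token lists are copied by hand from the output above so the snippet runs without nltk or pandas. The helper name bag_of_words, the lowercasing, and the punctuation filter are my own assumptions, not something the question asked for, so adjust them to your data.

```python
from collections import Counter

# Token lists copied from the word_tokenize output above.
tokenized = [
    ["This", "is", "the", "first", "sentence", "."],
    ["This", "is", "the", "second", "sentence", "."],
    ["Hey", "!", "Look", "!", "It", "'s", "a", "third", "one", "."],
]


def bag_of_words(token_lists):
    """Return (vocabulary, count_vectors) for a list of token lists.

    Hypothetical helper for illustration only.
    """
    # Assumption: lowercase everything and drop tokens that contain no
    # letter or digit, which removes pure punctuation such as "!" and ".".
    cleaned = [
        [tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)]
        for tokens in token_lists
    ]
    # The vocabulary is the sorted set of all remaining tokens.
    vocabulary = sorted({tok for tokens in cleaned for tok in tokens})
    # One count vector per sentence, aligned with the vocabulary order.
    vectors = []
    for tokens in cleaned:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(tokenized)
print(vocabulary)
for row in vectors:
    print(row)
```

With a real DataFrame you could keep the tokens in a new column first, for example df['tokens'] = df['Text'].apply(word_tokenize), and then pass df['tokens'] to the helper.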
