Tokenizing sentences for NLP in Python

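I have a dataset with a column 'Text' that contains 2851 sentences, and I want to tokenize them so I can build a bag of words for NLP. I tried using the loc function, but it didn't work. Can anyone tell me how to do this?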

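You can use nltk to tokenize a sentence and then apply the same function to your dataset. I'm not sure what your dataset looks like, but here is an example with pandas.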

>>> import pandas as pd
>>> from nltk.tokenize import word_tokenize

>>> s = "Hey! Look! It's a sentence."
>>> word_tokenize(s)

# Output
['Hey', '!', 'Look', '!', 'It', "'s", 'a', 'sentence', '.']

>>> df = pd.DataFrame({"Name": ["first", "second", "third"],
...                    "Text": ["This is the first sentence.",
...                             "This is the second sentence.",
...                             "Hey! Look! It's a third one."]})
>>> df

# Output
     Name                          Text
0   first   This is the first sentence.
1  second  This is the second sentence.
2   third  Hey! Look! It's a third one.

>>> df['Text'].apply(word_tokenize)

# Output
0            [This, is, the, first, sentence, .]
1           [This, is, the, second, sentence, .]
2    [Hey, !, Look, !, It, 's, a, third, one, .]
Name: Text, dtype: object
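
Since the goal is a bag of words, here is a minimal sketch of one way to turn token lists like the ones above into count vectors. It uses only the standard library. The helper name `bag_of_words` and the choice to lowercase tokens are my own assumptions for illustration, not something from the question or from nltk.

```python
from collections import Counter

def bag_of_words(token_lists):
    """Hypothetical helper: turn token lists into (vocabulary, count vectors)."""
    # Assumption: lowercase tokens so "This" and "this" share one entry.
    docs = [[tok.lower() for tok in tokens] for tokens in token_lists]
    # A sorted vocabulary gives every token a stable column index.
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # missing tokens count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Token lists copied from the word_tokenize output above.
tokens = [
    ["This", "is", "the", "first", "sentence", "."],
    ["This", "is", "the", "second", "sentence", "."],
    ["Hey", "!", "Look", "!", "It", "'s", "a", "third", "one", "."],
]
vocab, vectors = bag_of_words(tokens)
```

With the DataFrame from the example, the same helper should accept `df['Text'].apply(word_tokenize)` directly, because iterating over that Series yields one token list per row. You may also want to drop punctuation tokens before counting, depending on your task.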
