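The text is split into tokenized words, and only exact matches in the tweet text should count. So "#eat", "@eat", "eating", "eat23", "eat-Python", and the like are not exact matches and can be ignored. If a word ends with one of the punctuation marks ! , ? . ' " it still counts as an exact match; for example, "Python is a dish you should not eat!" matches "eat". Words should be compared case-insensitively.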
I am using the following function to tokenize the text:
import string

def sentenceBreak(text):
    '''This function is used to tokenize the text/sentence and store it in the form of bag of words.
    Parameters:
        text: The text/sentence to be tokenised and stored as bag of words
    Return:
        bag_of_words: Dictionary with key as tokenized word and value as frequency of the word in text/sentence
    #input example: text = "hello yes HEllo yES yes"
    #output example: {"hello": 2, "yes": 3}
    '''
    bag_of_words = {}
    t = text.lower().split()  # Storing words with case insensitivity
    #Question mark: if a word looks like 'live!!!' should 'live' be counted?
    for index, word in enumerate(t):
        # Only a single trailing punctuation character is removed here
        if word[-1] in string.punctuation:
            bag_of_words[word[:-1]] = bag_of_words.get(word[:-1], 0) + 1
        else:
            bag_of_words[word] = bag_of_words.get(word, 0) + 1
    return bag_of_words
With collections.Counter you can write more readable code:
import string
from collections import Counter

def sentenceBreak(text):
    words = (x.strip(string.punctuation) for x in text.lower().split())
    return Counter(words)

counts = sentenceBreak("hello, yes HEllo!! yES yes?!")
print(counts)
print(counts["yes"])
print(counts["hello"])
print([(word, count) for word, count in counts.items()])
This will output:
Counter({'yes': 3, 'hello': 2})
3
2
[('hello', 2), ('yes', 3)]
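
One caveat: strip(string.punctuation) also removes leading characters such as # and @, so "#eat" and "@eat" would be counted as "eat", which the question wants to avoid. Below is a minimal sketch that strips only trailing punctuation, assuming the allowed trailing characters are exactly the ones listed in the question. The names TRAILING and count_exact_words are hypothetical and not part of the original code.

```python
from collections import Counter

# Assumed set of trailing characters the question allows after a word.
TRAILING = "!,?.'\""

def count_exact_words(text):
    """Count lowercased tokens, stripping only the allowed trailing punctuation."""
    tokens = (token.rstrip(TRAILING) for token in text.lower().split())
    # Skip tokens that consisted of nothing but punctuation.
    return Counter(token for token in tokens if token)

counts = count_exact_words(
    "Python is a dish you should not eat! #eat @eat eating eat23 eat-Python"
)
print(counts["eat"])   # only the plain "eat!" token is counted under "eat"
print(counts["#eat"])  # "#eat" stays a separate token
```

Because "#eat", "@eat", "eating", "eat23" and "eat-python" remain distinct keys, looking up counts["eat"] ignores them, and repeated trailing punctuation such as "live!!!" is still reduced to "live".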