I need to process the text of tweets, using either regular expressions or just plain Python code.



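The text consists of tokenized words, and only exact matches within the tweet text should count. So "#eat", "@eat", "eating", "eat23", "eat-Python" and so on are not exact matches and can be ignored. If a word ends with one of the following punctuation marks: ! , ? . ' " then it can still be treated as an exact match; for example, "Python is a dish you should not eat!" would match "eat". Words should be compared case-insensitively.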
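To make the rule concrete, here is a minimal regex sketch of the exact-match check. The helper name `is_exact_match` and the constant `TRAILING` are hypothetical, and allowing any number of trailing punctuation marks (so "eat!!" also matches) is an assumption that the description above does not settle.

```python
import re

# Trailing punctuation the rules allow after a word: ! , ? . ' "
TRAILING = "!,?.'\""

def is_exact_match(token, word):
    """Return True if token equals word, optionally followed by allowed punctuation.

    Assumption: any number of trailing marks is accepted, and case is ignored.
    """
    pattern = re.escape(word) + "[" + re.escape(TRAILING) + "]*"
    return re.fullmatch(pattern, token, flags=re.IGNORECASE) is not None

tokens = "Python is a dish you should not eat! #eat @eat eating eat23 eat-Python".split()
print([t for t in tokens if is_exact_match(t, "eat")])  # ['eat!']
```

Because `re.fullmatch` has to consume the whole token, prefixes such as "#" or "@" and suffixes such as "ing" or "23" make the match fail.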

I am tokenizing the text with the following function:

import string

def sentenceBreak(text):
    '''This function is used to tokenize the text/sentence and store it in the form of bag of words.
    Parameters:
    text: The text/sentence to be tokenised and stored as bag of words
    Return:
    bag_of_words: Dictionary with key as tokenized word and value as frequency of the word in text/sentence
    #input example: text = "hello yes HEllo yES yes"
    #output example: {"hello": 2, "yes": 3}
    '''
    bag_of_words = {}
    t = text.lower().split()  # Storing words with case insensitivity
    # Question mark: if a word looks like 'live!!!' should 'live' be counted?
    for word in t:
        if word[-1] in string.punctuation:
            # Drop only the single trailing punctuation character
            bag_of_words[word[:-1]] = bag_of_words.get(word[:-1], 0) + 1
        else:
            bag_of_words[word] = bag_of_words.get(word, 0) + 1

    return bag_of_words

With collections.Counter you can write more readable code:

import string
from collections import Counter

def sentenceBreak(text):
    words = (x.strip(string.punctuation) for x in text.lower().split())
    return Counter(words)

counts = sentenceBreak("hello, yes HEllo!! yES yes?!")
print(counts)
print(counts["yes"])
print(counts["hello"])
print([(word, count) for word, count in counts.items()])

This will output:

Counter({'yes': 3, 'hello': 2})
3
2
[('hello', 2), ('yes', 3)]
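One caveat: `strip(string.punctuation)` removes every punctuation character from both ends of a token, so "#eat" and "@eat" would also be counted as "eat", which the question wants to avoid. A stricter variant, as a sketch only, could strip just the allowed trailing characters. The names `count_exact_words` and `ALLOWED_TRAILING` are made up for illustration.

```python
from collections import Counter

# Assumption: only these characters may follow a word and still count as a match
ALLOWED_TRAILING = "!,?.'\""

def count_exact_words(text):
    # rstrip removes the allowed punctuation from the end of each token only,
    # so leading characters such as '#' or '@' stay part of the token
    words = (token.rstrip(ALLOWED_TRAILING) for token in text.lower().split())
    # Skip tokens that consisted solely of punctuation
    return Counter(w for w in words if w)

counts = count_exact_words("Eat, #eat @eat eating eat! EAT?!")
print(counts["eat"])   # 3
print(counts["#eat"])  # 1
```

With this version "#eat", "@eat" and "eating" remain separate keys, so looking up `counts["eat"]` only sees the exact matches.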
