My code for removing @user and punctuation does not work



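>I wrote the code below for a dataset of tweets that I want to preprocess. I have already removed the # and the website links, but my code for removing @user and punctuation does not work. I am new to Python, can anyone help me?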

from nltk.corpus import stopwords
import spacy, re
nlp = spacy.load('en')
stop_words = [w.lower() for w in stopwords.words()]
def sanitize(input_string):
    """ Sanitize one string """
    # normalize to lowercase
    string = input_string.lower()
    # spacy tokenizer
    string_split = [token.text for token in nlp(string)]
    # in case the string is empty
    if not string_split:
        return ''
    names = re.compile('@[A-Za-z0-9_][A-Za-z0-9_]+')
    string = [re.sub(names, '@USER', tweet) for tweet in input_string()]
    #remove # and @
    for punc in '":!#':
        string = string.replace(punc, '')
    # remove 't.co/' links
    string = re.sub(r'http//t.co/[^s]+', '', string, flags=re.MULTILINE)
    # removing stop words
    string = ' '.join([w for w in string.split() if w not in stop_words])
    #punctuation
    # string = [''.join(w for w in string.split() if w not in #string.punctuation) for w in string]

    return string

list = ['@Jeff_Atwood Thank you for #stackoverflow', 'All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow' ]
list_sanitized = [sanitize(string) for string in tweets[:300]]
list_sanitized[:50]

The regular expression needs fixing. Try something like this:

names = re.compile('@[A-Za-z0-9_]+')
string = re.sub(names, '@USER', input_string)

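input_string is a variable, not a function, and it is a single string, so you should neither call it nor iterate over it. The corrected pattern works as expected, as shown here: https://regexr.com/55u44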
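The fix described here can be sketched with only the standard library's `re` module. The helper name `replace_mentions` and its `placeholder` parameter are illustrative assumptions, not part of the original code; the sample tweet comes from the question.

```python
import re

# Matches an @ followed by one or more letters, digits, or underscores.
MENTION = re.compile(r'@[A-Za-z0-9_]+')

def replace_mentions(tweet, placeholder='@USER'):
    """Replace every @mention in a single tweet string (hypothetical helper)."""
    return MENTION.sub(placeholder, tweet)

print(replace_mentions('@Jeff_Atwood Thank you for #stackoverflow'))
# prints: @USER Thank you for #stackoverflow
```

Passing `placeholder=''` would delete the mention outright instead of replacing it, which may be closer to what the question asks for.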

Your punctuation removal works fine, see: https://ideone.com/zScVPJ
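The loop in the question only strips the four characters `":!#`. If the goal is to drop all ASCII punctuation, one possible approach is `str.translate` with `string.punctuation`. This is a sketch, not the asker's method: `strip_punctuation` is a hypothetical helper, and the module is imported under an alias because the question's function uses `string` as a local variable name, which would shadow the module.

```python
import string as string_module  # aliased so a local variable named `string` cannot shadow it

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string_module.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation from one string (hypothetical helper)."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation('All hail: "Joel"! #stackoverflow'))
# prints: All hail Joel stackoverflow
```

Note that `string.punctuation` includes `@` and `_`, so running this after the mention replacement would turn `@USER` into `USER`.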

Try this: string = [names.sub('@USER', tweet) for tweet in input_string()]
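A list comprehension of this kind fits when the input is a list of tweets rather than a single string, and the list is iterated directly instead of being called. A minimal sketch, assuming a list named `tweets` (the name used at the end of the question's code) that holds the two sample tweets:

```python
import re

names = re.compile(r'@[A-Za-z0-9_]+')

# Sample data taken from the question; `tweets` is an assumed variable name.
tweets = [
    '@Jeff_Atwood Thank you for #stackoverflow',
    'All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow',
]

# Iterate over the list, applying the substitution to each tweet string.
cleaned = [names.sub('@USER', tweet) for tweet in tweets]
print(cleaned)
```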
