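Apologies, this is a very basic question, but I'm completely new to Python (I've only used R before, since that is what I was taught at university, and admittedly not to a very high level), so I don't know how to do this.
I'm doing sentiment analysis on tweets and found a pre-trained sentiment analysis package that runs in Python (RoBERTa). I've already aggregated and cleaned all my data in R, and now have a CSV with a column containing the cleaned tweets.
This is the code I'm using: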
! pip install transformers
! pip install scipy
import pandas as pd
import io
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['example_cleaned_tweets.csv']))
print(df)
tweet = "This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go."
print(tweet)
# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)
labels = ['Negative', 'Neutral', 'Positive']
encoded_tweet = tokenizer(tweet, return_tensors='pt')
print(encoded_tweet)
# sentiment analysis
output = model(**encoded_tweet)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
for i in range(len(scores)):
    l = labels[i]
    s = scores[i]
    print(l, s)
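I took most of this from a guide on how to use the package, but removed the data-processing stage.
I have imported the CSV as a dataframe. Can anyone help me with how to use the "cleaned_tweets" column from the dataframe instead of the "tweet" variable, where I have to type the text in manually? How do I generate sentiment scores for every row of the cleaned_tweets column, and then append the negative/neutral/positive scores to each row of the dataframe?
Sorry for the basic question, and thank you very much for your help!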
Use df.cleaned_tweets
or df["cleaned_tweets"]
; either of these gives you a pandas Series object.
df[["cleaned_tweets"]]
will return a DataFrame instead.
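As a quick check of the difference between the two forms, here is a minimal sketch. The three-row frame is a made-up stand-in for the uploaded CSV, not data from the question:

```python
import pandas as pd

# Hypothetical stand-in for the CSV loaded in the question
df = pd.DataFrame({"cleaned_tweets": ["good oats", "mushy and soft", "it is fine"]})

as_series = df["cleaned_tweets"]    # single brackets -> pandas Series
as_frame = df[["cleaned_tweets"]]   # double brackets -> one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

The Series form is usually the one you want when iterating over the tweets or converting them to a list.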
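If your model exposes a predict method (as scikit-learn-style models do), you can pass the whole pandas column in for prediction:
df_results = model.predict(df["cleaned_tweets"])
The PyTorch model returned by AutoModelForSequenceClassification.from_pretrained in the question does not have a predict method, so there the text has to go through the tokenizer first.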
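If you use the tokenizer, the documentation states that you can pass a str or a list of str:
text (str, List[str], List[List[str]]) -- The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
You just need to convert the pandas column to a list: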
list_of_cleaned_tweets = df['cleaned_tweets'].tolist()
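Since the tokenizer accepts a list of strings, one option is to feed it the tweets in chunks rather than all at once. The sketch below only covers the list handling; iter_batches is a hypothetical helper name, the batch size of 2 is arbitrary, and the small frame is made-up data. Missing values are dropped first because the tokenizer expects strings.

```python
import pandas as pd


def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts`, each at most `batch_size` items long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Hypothetical data with one missing tweet
df = pd.DataFrame({"cleaned_tweets": ["a", "b", None, "d", "e"]})

# Drop missing rows and make sure every entry is a plain str
list_of_cleaned_tweets = df["cleaned_tweets"].dropna().astype(str).tolist()

batches = list(iter_batches(list_of_cleaned_tweets, batch_size=2))
print(batches)
```

Each batch could then be passed as tokenizer(batch, return_tensors='pt', padding=True, truncation=True); padding is needed so that tweets of different lengths fit into one tensor. Note that dropping rows changes the alignment with the original dataframe, so keep the index around if the scores have to be written back.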
Here is the code I ended up using to run the script, for anyone who finds this in future:
! pip install transformers
! pip install scipy
import pandas as pd
import io
import numpy as np
from google.colab import files
uploaded = files.upload()
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)
labels = ['Negative', 'Neutral', 'Positive']
df = pd.read_csv('nameofcsv.csv')
# add float columns for negative, neutral, positive, initialised to NaN
for label in labels:
    df[label] = np.nan
# .items() yields (index label, tweet) pairs, so writes go to the right row
for i, tweet in df['cleaned_tweets'].items():
    # skip missing tweets; pd.notna is more reliable than `is not np.nan`
    if pd.notna(tweet):
        encoded_tweet = tokenizer(tweet, return_tensors='pt')
        # sentiment analysis
        output = model(**encoded_tweet)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)
        for label, score in zip(labels, scores):
            # .at avoids the chained assignment df[label][i], which may not write back
            df.at[i, label] = score
print(df)
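If the scores for all tweets are collected into one array first, they can be attached to the dataframe in a single step instead of cell by cell. This is only a sketch under assumptions: fake_logits is made-up data standing in for stacked model outputs of shape (number of tweets, 3), softmax_rows is a hypothetical NumPy helper that should behave like scipy.special.softmax with axis=1, and the extra "sentiment" column is an optional addition that is not part of the original script.

```python
import numpy as np
import pandas as pd

labels = ["Negative", "Neutral", "Positive"]


def softmax_rows(logits):
    """Row-wise softmax; subtracting the row max keeps the exponentials stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical tweets and made-up logits, one row of three logits per tweet
df = pd.DataFrame({"cleaned_tweets": ["good oats", "mushy and soft"]})
fake_logits = np.array([[-1.0, 0.0, 2.0], [3.0, 0.5, -2.0]])

probs = softmax_rows(fake_logits)

# Build a score frame on the same index and join it onto the original rows
scores = pd.DataFrame(probs, columns=labels, index=df.index)
df = df.join(scores)

# Optional: name of the highest-scoring label for each tweet
df["sentiment"] = df[labels].idxmax(axis=1)
print(df)
```

Building the score frame on df.index keeps each row of scores aligned with its tweet, including when the index is not a plain 0..n-1 range.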