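I have a CSV file of TripAdvisor reviews. It has the following columns:
person, title, rating, review, review date.
I want this code to do the following:
- In the CSV, create a new column called "tarate"
- Populate "tarate" with "pos", "neg", or "neut". It should read the numeric value in "rating": "tarate" == 'pos' if "rating" >= 40, "tarate" == 'neut' if "rating" == 30, "tarate" == 'neg' if "rating" < 30
- Next, run the "review" column through SentimentIntensityAnalyzer
- Record the output in a new CSV column called "scores"
- For the "compound" value, create a separate column using the "pos" and "neg" classification
- Run the sklearn.metrics tools to compare the TripAdvisor rating ("tarate") with "compound". This can just be printed
Part of the code is based on [http://akashsenta.com/blog/sentiment-analysis-with-vader-with-python/]
Here is my CSV file: [https://github.com/nsusmann/vadersentiment]
I am getting some errors. I'm a beginner, and I think I'm getting tripped up on things like pointing to specific columns and the lambda function.
Here is the code: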
# open command prompt
# import nltk
# nltk.download()
# pip3 install pandas
# pip3 install scikit-learn
# pip3 install matplotlib
# pip3 install seaborn
# pip3 install vaderSentiment
#pip3 install openpyxl
import pandas
import nltk
nltk.download([
"vader_lexicon",
"stopwords"])
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from collections import Counter
import re
import math
import html
import sklearn
import sklearn.metrics as metrics
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import openpyxl
# open the file to save the review
import csv
outputfile = open('D:DocumentsArchaeologyProjectsPatmosTextAnalysisSentimentscraped_cln_sent.csv', 'w', newline='')
df = csv.writer(outputfile)
#open Vader Sentiment Analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#make SIA into an object
analyzer = SentimentIntensityAnalyzer()
#create a new column called "tarate"
df['tarate'],
#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, ['tarate'] == 'Pos',
df.loc[df['rating'] == 30, ['tarate'] == 'Neut',
df.loc[df['rating'] <= 20, ['tarate'] == 'Neg',
#use polarity_scores() to get sentiment metrics and write them to new column "scores"
df.head['scores'] == df['review'].apply(lambda review: sid.polarity_scores['review'])
#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])
#using column "compound", determine whether the score is <0> and write new column "score" recording positive or negative
df['score'] = df['compound'].apply(lambda score: 'pos' if score >=0 else 'neg')
ta.file()
#get accuracy metrics. this will compare the trip advisor rating (text version recorded in column "tarate") to the sentiment analysis results in column "score"
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
accuracy_score(df['tarate'],df['score'])
print(classification_report(df['tarate'],df['score']))
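You don't need to create a new column before populating it. Also, you have spurious commas at the ends of your lines. Don't do that; in Python, a comma at the end of an expression turns it into a tuple. Also remember that = is the assignment operator and == is the comparison operator.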
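A quick illustration of both points, as a minimal sketch. The variable names are made up for the demonstration and have nothing to do with the CSV in the question:

```python
# A trailing comma turns the right-hand side into a one-element tuple.
with_comma = 'Pos',
print(type(with_comma))      # <class 'tuple'>

# Without the comma it is a plain string.
without_comma = 'Pos'
print(type(without_comma))   # <class 'str'>

# `=` assigns a value; `==` compares two values and yields a boolean.
rating = 40
is_forty = (rating == 40)
print(is_forty)              # True
```

A line such as `['tarate'] == 'Pos'` therefore only compares a list with a string and throws the result away; nothing is written to the column.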
The pandas `loc` indexer takes a row indexer and a column indexer:
#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, 'tarate'] = 'Pos'
df.loc[df['rating'] == 30, 'tarate'] = 'Neut'
df.loc[df['rating'] <= 20, 'tarate'] = 'Neg'
Note that this will leave NaN (Not a Number) in the column for values between 20 and 30 and for values between 30 and 40.
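To see the gaps, here is a small sketch on invented ratings (25 and 35 are included on purpose because they match none of the three conditions):

```python
import pandas as pd

# Demo data, made up for illustration only.
df = pd.DataFrame({'rating': [10, 20, 25, 30, 35, 40, 50]})

df.loc[df['rating'] >= 40, 'tarate'] = 'Pos'
df.loc[df['rating'] == 30, 'tarate'] = 'Neut'
df.loc[df['rating'] <= 20, 'tarate'] = 'Neg'

# Rows that matched none of the conditions are still NaN.
unlabelled = df[df['tarate'].isna()]
print(unlabelled['rating'].tolist())   # [25, 35]
```

If the real ratings only ever take values such as 10, 20, 30, 40 and 50 (an assumption about the data, not something the question states), no row is left unlabelled.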
I'm not sure what you are trying to do here, but this is not right:
#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])
`df['scores']` will not contain a column named "compound", which is what you are asking for in the lambda.
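In the question's code, the preceding line uses `==` and `df.head[...]`, so the "scores" column is never actually filled. As a sketch of what the lambda sees once that column does hold one dictionary per row: `toy_scores` below is a hypothetical stand-in that returns a dictionary with the same four keys as VADER's `polarity_scores()` (`neg`, `neu`, `pos`, `compound`). The numbers are invented, so only the column handling is meaningful.

```python
import pandas as pd

def toy_scores(text):
    """Hypothetical stand-in for analyzer.polarity_scores(text)."""
    compound = 0.5 if 'good' in text else -0.5
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': compound}

# Invented reviews, for illustration only.
df = pd.DataFrame({'review': ['good food', 'bad service']})

# Assign with `=` and index the frame with df[...], not df.head[...].
df['scores'] = df['review'].apply(lambda review: toy_scores(review))

# Each element of df['scores'] is a dict, so d['compound'] is a key lookup.
df['compound'] = df['scores'].apply(lambda d: d['compound'])
df['score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

print(df[['review', 'compound', 'score']])
```

With the real analyzer, `toy_scores(review)` would become `analyzer.polarity_scores(review)`, using the `analyzer` object created earlier in the question's code.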
I suggest looking up list comprehensions, and googling "pandas apply method" and "pandas lambda examples" to get more familiar with them.
Example code:
import pandas as pd
#create a demo dataframe called 'df'
df = pd.DataFrame({'rating': [12, 42, 40, 30, 31, 56, 8, 88, 39, 79]})
This gives you a dataframe that looks like this (just one column named 'rating' with integers in it):
   rating
0      12
1      42
2      40
3      30
4      31
5      56
6       8
7      88
8      39
9      79
Using that column to create another column based on its values can be done like this...
#create a new column called 'tarate' and using a list comprehension
#assign a string value of either 'pos', 'neut', or 'neg' based on the
#numeric value in the 'rating' column (it does this going row by row)
df['tarate'] = ['pos' if x >= 40 else 'neut' if x == 30 else 'neg' for x in df['rating']]
#output the df
print(df)
Output:
   rating  tarate
0      12     neg
1      42     pos
2      40     pos
3      30    neut
4      31     neg
5      56     pos
6       8     neg
7      88     pos
8      39     neg
9      79     pos
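One thing the snippets above do not cover is where `df` comes from. In the question, `df` is a `csv.writer`, which cannot be indexed by column name; the column operations shown here need a pandas DataFrame. A minimal sketch, assuming the file has `rating` and `review` columns. The two rows and the in-memory buffers are invented so the example runs without the real file:

```python
import io
import pandas as pd

# Stand-in for the real CSV file (invented rows, for illustration only).
csv_text = "rating,review\n50,Lovely stay\n10,Terrible room\n"

# With a real file, pass its path to pd.read_csv instead of a buffer.
df = pd.read_csv(io.StringIO(csv_text))

df['tarate'] = ['pos' if x >= 40 else 'neut' if x == 30 else 'neg'
                for x in df['rating']]

# Write the result back out; pass a file path instead of a buffer
# to save it to disk.
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())
```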