How to assign labels/scores to data using machine learning



I have a dataframe with many rows containing tweets. I would like to classify them using a machine learning technique (supervised or unsupervised). Since the dataset is unlabelled, I thought of selecting some rows (50%) to label manually (+1 pos, -1 neg, 0 neutral), then using machine learning to assign labels to the remaining rows. To do so, I did the following:

Original dataset

Date                   ID        Tweet                         
01/20/2020           4141    The cat is on the table               
01/20/2020           4142    The sky is blue                       
01/20/2020           53      What a wonderful day                  
...
05/12/2020           532     In this extraordinary circumstance we are together   
05/13/2020           12      It was a very bad decision            
05/22/2020           565     I know you are the best              
  1. Split the dataset 50/50 into training and test sets. I manually labelled the 50% as follows:

    Date                   ID        Tweet                          PosNegNeu
    01/20/2020           4141    The cat is on the table               0
    01/20/2020           4142    The weather is bad today              -1
    01/20/2020           53      What a wonderful day                  1
    ...
    05/12/2020           532     In this extraordinary circumstance we are together   1
    05/13/2020           12      It was a very bad decision            -1
    05/22/2020           565     I know you are the best               1
    

Then I extracted the word frequencies (after removing stop words):

Frequency
bad               2
circumstance      1
best              1
day               1
today             1
wonderful         1
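A frequency table like the one above can be built with a short sketch like the following (the stop-word list and sample tweets here are illustrative stand-ins, not the real data; in practice a standard stop-word list from nltk or scikit-learn would be used):

```python
from collections import Counter

# Illustrative stand-ins for the manually labelled tweets
tweets = [
    "The weather is bad today",
    "What a wonderful day",
    "It was a very bad decision",
]

# Minimal hand-written stop-word list (an assumption for this sketch)
stop_words = {"the", "is", "a", "it", "was", "very", "what"}

# Lowercase, tokenise on whitespace, drop stop words, then count
words = [w.lower() for t in tweets for w in t.split()
         if w.lower() not in stop_words]
freq = Counter(words)
print(freq.most_common())
```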

I would like to try assigning labels to the rest of the data based on:

  • the words in the frequency table, e.g. if a tweet contains "bad" assign -1; if it contains "wonderful" assign 1 (i.e., I would build a list of strings plus a rule)
  • sentence similarity (e.g. using the Levenshtein distance)
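A minimal sketch of the keyword-rule idea from the first bullet (the word lists here are hypothetical; in practice they would be derived from the frequency table built on the labelled half):

```python
# Hypothetical keyword lists (an assumption for this sketch)
positive_words = {"wonderful", "best"}
negative_words = {"bad"}

def rule_label(tweet):
    """Return -1 / 1 / 0 depending on which keyword list the tweet hits."""
    tokens = set(tweet.lower().split())
    if tokens & negative_words:
        return -1
    if tokens & positive_words:
        return 1
    return 0

print(rule_label("It is such a wonderful day"))  # 1
print(rule_label("It was a very bad decision"))  # -1
print(rule_label("The cat is on the table"))     # 0
```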

I know there are several (and better) ways to do this, but I am having trouble classifying/labelling the data and cannot do it all by hand.

My expected output, e.g. for the following test dataset

Date                   ID        Tweet                                   
06/12/2020           43       My cat 'Sylvester' is on the table            
07/02/2020           75       Laura's pen is black                                                
07/02/2020           763      It is such a wonderful day                                    
...
11/06/2020           1415    No matter what you need to do                  
05/15/2020           64      I disagree with you: I think it is a very bad decision           
12/27/2020           565     I know you can improve                         

should look something like this:

Date                   ID        Tweet                                   PosNegNeu
06/12/2020           43       My cat 'Sylvester' is on the table            0
07/02/2020           75       Laura's pen is black                          0                       
07/02/2020           763      It is such a wonderful day                    1                
...
11/06/2020           1415    No matter what you need to do                  0  
05/15/2020           64      I disagree with you: I think it is a very bad decision  -1          
12/27/2020           565     I know you can improve                         0   

Perhaps a better approach would be to consider n-grams rather than single words, or to build a corpus/vocabulary for assigning a score and then a sentiment. Any suggestion would be greatly appreciated, as this is my first machine learning exercise. I think k-means clustering could also be applied, to try to group similar sentences. A complete example (with my data would be nice, but any other data would be fine too) would be really appreciated.
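The k-means idea mentioned above could be sketched roughly as follows, assuming TF-IDF features and scikit-learn (the tweets are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "The cat is on the table",
    "What a wonderful day",
    "It is such a wonderful day",
    "It was a very bad decision",
]

# TF-IDF turns each tweet into a weighted bag-of-words vector;
# k-means then groups vectors that lie close together
X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)
print(km.labels_)
```

Note that the clusters themselves carry no sentiment: they only group similar sentences, so a manual pass (or the labelled half) is still needed to name each cluster.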

In this context I would suggest analysing the polarity of each sentence/tweet. This can be done with the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text is found, it can be stored as a separate column in the dataframe and used for further analysis.

Initial code

from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)

Intermediate result

Date     ...                                  sentiment
0  1/1/2020  ...                                 (0.0, 0.0)
1  2/1/2020  ...                                 (0.0, 0.0)
2  3/2/2020  ...                                 (0.0, 0.1)
3  4/2/2020  ...  (-0.6999999999999998, 0.6666666666666666)
4  5/2/2020  ...                                 (0.5, 0.6)
[5 rows x 4 columns]

From the sentiment column (in the output above), we can see that it holds two components: polarity and subjectivity.

Polarity is a float in the range [-1.0, 1.0], where 0 indicates neutral, +1 a very positive sentiment, and -1 a very negative sentiment.

Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. Subjective sentences express personal feelings, opinions, beliefs, views, allegations, desires, doubts, and speculations, whereas objective sentences are factual.

Note that the sentiment column is a tuple, so we can split it into two columns with df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). Now we can append the split columns to the dataframe as shown below:

df_new = df
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)

Finally, based on the sentence polarity found earlier, we can add a label column to the dataframe indicating whether a tweet is positive, negative, or neutral.

import numpy as np
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)

In the end, the result will look like this:

Final result

       Date  ID                 Tweet  ... polarity  subjectivity     label
0  1/1/2020   1  the weather is sunny  ...      0.0      0.000000   neutral
1  2/1/2020   2       tom likes harry  ...      0.0      0.000000   neutral
2  3/2/2020   3       the sky is blue  ...      0.0      0.100000   neutral
3  4/2/2020   4    the weather is bad  ...     -0.7      0.666667  negative
4  5/2/2020   5         i love apples  ...      0.5      0.600000  positive
[5 rows x 7 columns]

Data

import pandas as pd
# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
"ID":[1,2,3,4,5],
"Tweet":["the weather is sunny",
"tom likes harry", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)

Full code

# create some dummy data
import pandas as pd
import numpy as np
# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
"ID":[1,2,3,4,5],
"Tweet":["the weather is sunny",
"tom likes harry", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)
from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)
# split the sentiment column into two
df1=pd.DataFrame(df['sentiment'].tolist(), index= df.index)
# append cols to original dataframe
df_new = df
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)
print(df_new)
# add label to dataframe based on condition
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)

IIUC, you have labelled a certain percentage of your data and need to label the rest. I would suggest reading about semi-supervised machine learning.

Semi-supervised learning is a machine learning approach that combines a small amount of labelled data with a large amount of unlabelled data during training. It falls between unsupervised learning (no labelled training data) and supervised learning (only labelled training data).

Sklearn provides a wide variety of algorithms that can help with this; be sure to take a look.

If you want a deeper understanding of the topic, I strongly suggest having a look at this article as well.

Here is an example with the iris dataset:

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelPropagation

# Init
label_prop_model = LabelPropagation()
iris = datasets.load_iris()
# Randomly create unlabelled samples (sklearn marks unlabelled points with -1)
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
labels = np.copy(iris.target)
labels[random_unlabeled_points] = -1
# Propagate labels over the remaining unlabelled data
label_prop_model.fit(iris.data, labels)
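Applied to the tweet problem, the same idea could look roughly like this (a sketch, assuming TF-IDF features; note that scikit-learn reserves -1 for "unlabelled", so the -1/0/+1 sentiment labels must be remapped, here to 0=negative, 1=neutral, 2=positive):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

tweets = [
    "What a wonderful day",           # labelled positive
    "It was a very bad decision",     # labelled negative
    "The cat is on the table",        # labelled neutral
    "It is such a wonderful day",     # unlabelled
    "I think it was a bad decision",  # unlabelled
]
# 0 = negative, 1 = neutral, 2 = positive, -1 = unlabelled
labels = np.array([2, 0, 1, -1, -1])

# Dense TF-IDF features; the knn kernel is more robust than rbf on tiny samples
X = TfidfVectorizer(stop_words="english").fit_transform(tweets).toarray()
model = LabelPropagation(kernel="knn", n_neighbors=2).fit(X, labels)
print(model.transduction_)  # labels for every row, including the unlabelled ones
```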
