How do I use a pandas DataFrame after splitting off my test set?



I recently learned how to do a validation split on a pandas DataFrame, but after splitting I noticed that I can no longer slice it by column.

print(my_data['column name']) 

It throws an error. Please help.

My code looks like this:

import pandas as pd  
from sklearn.cross_validation import train_test_split
data = pd.read_csv("labeledTrainData.tsv" , header = 0 ,  
           delimiter = '\t' , quoting  = 3)
train  , test = train_test_split(data , train_size = 0.8 , random_state = 38)
print(len(train['sentiment']))

Please tell me, does numpy face this problem as well?

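train_test_split returns a list of splits, one train/test pair for each array you pass in. One way to use it with the df is to split the row positions and index the df with those:

import numpy as np
train_idx, test_idx = train_test_split(np.arange(len(data)) , train_size = 0.8 , random_state = 38)

Then you index like this:

data.iloc[train_idx]
data.iloc[test_idx]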
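
For comparison, here is a minimal sketch of splitting the DataFrame directly. It assumes a recent scikit-learn, where train_test_split is imported from sklearn.model_selection and a DataFrame passed in comes back as DataFrames. The small frame is made-up stand-in data, not the real labeledTrainData.tsv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the contents of labeledTrainData.tsv
data = pd.DataFrame({
    "id": range(10),
    "sentiment": [0, 1] * 5,
    "review": ["text %d" % i for i in range(10)],
})

# One input array in, one train/test pair out
train, test = train_test_split(data, train_size=0.8, random_state=38)

# Both halves are still DataFrames, so selecting a column by name works
print(type(train).__name__)
print(len(train["sentiment"]))
```

If this runs but the original code does not, the difference is more likely the installed scikit-learn version than the data.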

If we pass in plain numpy arrays, the output is numpy arrays as well. See the example here:

>>> import numpy as np
>>> from sklearn.cross_validation import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
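
On the numpy part of the question: a plain (non-structured) ndarray has no column labels, so selecting by name fails regardless of the split, and columns are selected by position instead. A minimal sketch, rebuilding the same X as in the example above; the column name is only illustrative:

```python
import numpy as np

X = np.arange(10).reshape((5, 2))

# A plain ndarray cannot be indexed with a column name
try:
    X["sentiment"]
except IndexError as exc:
    print("IndexError:", exc)

# Columns are selected by position: all rows, column 0
first_column = X[:, 0]
print(first_column)
```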

Edit

I tried the same thing and did not get any error; I am using Python 2.7+. So is this specific to a different version of Python or scikit-learn?

    import pandas as pd  
    from sklearn.cross_validation import train_test_split
    url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
    data = pd.read_csv(url)
    train  , test = train_test_split(data ,train_size = 0.8 , random_state = 38)
    print (train['total_bill'])
Output:
....
211    25.89
53      9.94
75     10.51
161    12.66
Name: total_bill, dtype: float64
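
One plausible explanation, stated as an assumption rather than a confirmed diagnosis: older scikit-learn releases converted the input to numpy arrays before splitting (I believe this changed around version 0.16), so the returned pieces had no column names. Printing sklearn.__version__ shows which release is installed. If the split does come back as a bare array, it can be wrapped again; as_frame below is a hypothetical helper, not part of pandas or scikit-learn, and the three-row frame is made-up data.

```python
import numpy as np
import pandas as pd

def as_frame(split, columns):
    """Return split as a DataFrame, wrapping it if it is a plain array."""
    if isinstance(split, pd.DataFrame):
        return split
    # The original row index is not recoverable from a bare array
    return pd.DataFrame(np.asarray(split), columns=columns)

data = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01],
                     "tip": [1.01, 1.66, 3.50]})

# Simulate a split that came back as a bare ndarray
raw_train = data.values[:2]

train = as_frame(raw_train, data.columns)
print(train["total_bill"])
```

The simpler fix, where upgrading is possible, is to use a scikit-learn version that keeps the DataFrame type.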
