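I recently learned how to do a validation split on a pandas DataFrame, but after splitting I noticed that I can no longer select a column by name.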
print(my_data['column name'])
It throws an error. Please help.
My code looks like this:
import pandas as pd
from sklearn.cross_validation import train_test_split
data = pd.read_csv("labeledTrainData.tsv", header = 0,
                   delimiter = '\t', quoting = 3)
train , test = train_test_split(data , train_size = 0.8 , random_state = 38)
print(len(train['sentiment']))
Also, could you tell me whether numpy has the same problem?
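train_test_split returns a list of splits, one train/test pair for each array you pass in. If the pieces you get back no longer accept column labels, split the row positions instead and use them to index the df:
import numpy as np
train_idx, test_idx = train_test_split(np.arange(len(data)), train_size = 0.8, random_state = 38)
Then you index like this:
train = data.iloc[train_idx]
test = data.iloc[test_idx]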
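A minimal, self-contained sketch of the index-based split, in case it helps. It assumes a recent scikit-learn, where train_test_split is imported from sklearn.model_selection (the sklearn.cross_validation module used in the question was removed in later releases). The toy DataFrame and the names train_pos and test_pos are made up for illustration and stand in for the real labeledTrainData.tsv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for labeledTrainData.tsv
data = pd.DataFrame({
    "sentiment": [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    "review": ["review %d" % i for i in range(10)],
})

# Split the row positions rather than the frame itself
train_pos, test_pos = train_test_split(
    np.arange(len(data)), train_size=0.8, random_state=38)

# Positional indexing returns DataFrames, so column labels still work
train = data.iloc[train_pos]
test = data.iloc[test_pos]

print(len(train["sentiment"]))
```

Because only integer positions go through train_test_split, this works the same way whether or not the installed version preserves DataFrames.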
If we pass in plain numpy arrays, the output is numpy arrays as well. See the example here:
>>> import numpy as np
>>> from sklearn.cross_validation import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
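On the numpy part of the question: a plain numpy array has no column labels at all, so columns are selected by position, and wrapping the array in a DataFrame is one way to get names back. A rough sketch, assuming a recent scikit-learn (sklearn.model_selection); the column names "a" and "b" are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

# A numpy array is sliced by position, not by column name
first_col = X_train[:, 0]

# Wrapping the array restores label-based access (names are made up here)
train_df = pd.DataFrame(X_train, columns=["a", "b"])

print(type(X_train).__name__)
print(train_df["a"].tolist() == first_col.tolist())
```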
Edit
I tried the same thing and did not get any error. I am using Python 2.7+. So is this specific to a particular version of Python or scikit-learn?
import pandas as pd
from sklearn.cross_validation import train_test_split
url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
data = pd.read_csv(url)
train , test = train_test_split(data ,train_size = 0.8 , random_state = 38)
print (train['total_bill'])
Output:
....
211 25.89
53 9.94
75 10.51
161 12.66
Name: total_bill, dtype: float64
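To find out whether the behaviour depends on the installed version, a quick check along these lines may help. It is only a sketch: the small frame is hand-typed for illustration rather than loaded from the URL above, and the try/except import is an assumption about which module your release provides.

```python
import pandas as pd
import sklearn

try:
    # Newer releases
    from sklearn.model_selection import train_test_split
except ImportError:
    # Older releases, as used in the question
    from sklearn.cross_validation import train_test_split

print(sklearn.__version__)

# Illustrative values only
data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})
train, test = train_test_split(data, train_size=0.8, random_state=38)

# False here means the split did not return a DataFrame,
# so selecting train['total_bill'] would fail
print(isinstance(train, pd.DataFrame))
```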