Python: reading a CSV with np.genfromtxt results in a different number of columns

I am using np.genfromtxt to read a CSV. I am not sure why it raises ValueError(errmsg) on the data. When I open the file in Excel, it shows a total of 23 columns for all 33 rows in the file.

Here is the code and the error:

csv = np.genfromtxt(fname, delimiter=",", names=True)

Here is a snippet of the CSV records:

,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_NN__alpha,param_NN__hidden_layer_sizes,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.34166226387023924,0.0010362625122070312,0.842927342927343,0.8468980402379758,0.1,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25,0.8420706295240185,0.8475292052871167,0.8398771660451854,0.8463774474853288,0.845360824742268,0.846158065046893,0.8385256691531373,0.8486892618185806,0.8488040377441299,0.8457362215519605,0.05093153997183547,0.00018195987247183776,0.0037378988316037944,0.0010747322296072162
1,0.5543142318725586,0.0018250465393066407,0.8465250965250966,0.8527554135893668,0.1,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5,0.846018863785918,0.8530137662480118,0.846018863785918,0.8589919376953875,0.8479929809168677,0.8496681840618658,0.8400614304519526,0.851486234506965,0.8525345622119815,0.8506169454346038,0.10835399357094619,0.00018853748087819175,0.004013613789285713,0.003306836154659678
2,0.5266880512237548,0.0013680458068847656,0.8437609687609687,0.8478413817137904,0.1,"(11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (11, 7)}",17,0.842509322219785,0.8479679701639884,0.8354902390875192,0.8431964021280096,0.8455801710901514,0.8520265452750507,0.8433523475208424,0.851595919710431,0.8518762343647136,0.8444200712914725,0.1041624682160838,0.0003233587082439388,0.005278162504355272,0.0036030369022985215
3,0.49459095001220704,0.0011162281036376954,0.8406458406458407,0.845428443186931,0.1,"(7, 5)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7, 5)}",32,0.8383417416100022,0.848461580650469,0.8429480149155516,0.8501617945483464,0.8468962491774512,0.8514780891789612,0.8312856516015796,0.8381046396841066,0.8437568575817423,0.8389361118727722,0.10397613499936685,0.00018889068500539376,0.005421511394261151,0.005726975087304059
4,0.6175418376922608,0.0024899959564208983,0.8449017199017199,0.8508140227747922,0.1,"(25, 11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 11, 7)}",11,0.8414125904803685,0.8493939560138211,0.8427286685676684,0.8546591345362804,0.8501864443957008,0.8519716996654417,0.8459850811759544,0.8564769112646704,0.8441957428132544,0.8415684123937482,0.1940231074769015,0.00047604030307216253,0.003049662553913791,0.005209439647677219

The error received:

ValueError: Some errors were detected !
    Line #2 (got 26 columns instead of 22)
    Line #3 (got 26 columns instead of 22)
    Line #4 (got 26 columns instead of 22)
    Line #5 (got 26 columns instead of 22)
    Line #6 (got 28 columns instead of 22)
    Line #7 (got 26 columns instead of 22)
    Line #8 (got 28 columns instead of 22)
    Line #9 (got 26 columns instead of 22)
    Line #10 (got 26 columns instead of 22)
    Line #11 (got 26 columns instead of 22)
    Line #12 (got 26 columns instead of 22)
    Line #13 (got 26 columns instead of 22)
    Line #14 (got 28 columns instead of 22)
    Line #15 (got 26 columns instead of 22)
    Line #16 (got 28 columns instead of 22)
    Line #17 (got 26 columns instead of 22)
    Line #18 (got 26 columns instead of 22)
    Line #19 (got 26 columns instead of 22)
    Line #20 (got 26 columns instead of 22)
    Line #21 (got 26 columns instead of 22)
    Line #22 (got 28 columns instead of 22)
    Line #23 (got 26 columns instead of 22)
    Line #24 (got 28 columns instead of 22)
    Line #25 (got 26 columns instead of 22)
    Line #26 (got 26 columns instead of 22)
    Line #27 (got 26 columns instead of 22)
    Line #28 (got 26 columns instead of 22)
    Line #29 (got 26 columns instead of 22)
    Line #30 (got 28 columns instead of 22)
    Line #31 (got 26 columns instead of 22)
    Line #32 (got 28 columns instead of 22)
    Line #33 (got 26 columns instead of 22)

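You are passing , as the delimiter, but many of the column values contain commas themselves. For example, "(25, 7)" and the params dictionary are quoted fields with commas inside them, and np.genfromtxt splits on every comma regardless of the quotes, so each row yields 26 or 28 pieces instead of the expected number. The parser has to be aware of the quote character for this to work.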
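The column counts in the error can be reproduced without NumPy. The following is a minimal sketch, assuming a shortened sample (only five of the columns, two of the rows) in place of the real file; the variable names are illustrative. It compares a plain split on commas with Python's csv module, which honors double quotes by default.

```python
import csv
import io

# Shortened stand-in for the file above: same quoting, fewer numeric columns.
sample = """\
,mean_fit_time,param_NN__hidden_layer_sizes,params,rank_test_score
0,0.34,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25
4,0.61,"(25, 11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 11, 7)}",11
"""

# Splitting on every comma ignores the quotes, so commas inside the
# tuple and the dict are counted as field separators.
naive_counts = [len(line.split(",")) for line in sample.splitlines()]

# csv.reader treats a double-quoted field as a single value.
quoted_counts = [len(row) for row in csv.reader(io.StringIO(sample))]

print(naive_counts)   # header has 5 fields; data rows have 3 and 5 extra pieces
print(quoted_counts)  # every row has 5 fields
```

The extra pieces per row (3 for a one- or two-element tuple, 5 for a three-element tuple) match the 26 and 28 reported against a 23-column file.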

Fortunately, pandas handles this well without much help. You can try loading the data with read_csv and then converting the loaded DataFrame to an array.

import pandas as pd
array = pd.read_csv(name, index_col=[0]).values

The loaded DataFrame (the one you get before calling .values) looks like this:

df = pd.read_csv(name, index_col=[0])
print(df)
   mean_fit_time  mean_score_time  mean_test_score  mean_train_score  
0       0.341662         0.001036         0.842927          0.846898   
1       0.554314         0.001825         0.846525          0.852755   
2       0.526688         0.001368         0.843761          0.847841   
3       0.494591         0.001116         0.840646          0.845428   
4       0.617542         0.002490         0.844902          0.850814   
   param_NN__alpha param_NN__hidden_layer_sizes  
0              0.1                         (7,)   
1              0.1                      (25, 7)   
2              0.1                      (11, 7)   
3              0.1                       (7, 5)   
4              0.1                  (25, 11, 7)   
                                              params  rank_test_score  
0  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               25   
1  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...                5   
2  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               17   
3  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               32   
4  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               11   
   split0_test_score  split0_train_score       ...         split2_test_score  
0           0.842071            0.847529       ...                  0.845361   
1           0.846019            0.853014       ...                  0.847993   
2           0.842509            0.847968       ...                  0.845580   
3           0.838342            0.848462       ...                  0.846896   
4           0.841413            0.849394       ...                  0.850186   
   split2_train_score  split3_test_score  split3_train_score  
0            0.846158           0.838526            0.848689   
1            0.849668           0.840061            0.851486   
2            0.852027           0.843352            0.851596   
3            0.851478           0.831286            0.838105   
4            0.851972           0.845985            0.856477   
   split4_test_score  split4_train_score  std_fit_time  std_score_time  
0           0.848804            0.845736      0.050932        0.000182   
1           0.852535            0.850617      0.108354        0.000189   
2           0.851876            0.844420      0.104162        0.000323   
3           0.843757            0.838936      0.103976        0.000189   
4           0.844196            0.841568      0.194023        0.000476   
   std_test_score  std_train_score  
0        0.003738         0.001075  
1        0.004014         0.003307  
2        0.005278         0.003603  
3        0.005422         0.005727  
4        0.003050         0.005209  
[5 rows x 22 columns]

And yes, the columns are automatically converted to appropriate data types.

print(df.dtypes)
mean_fit_time                   float64
mean_score_time                 float64
mean_test_score                 float64
mean_train_score                float64
param_NN__alpha                 float64
param_NN__hidden_layer_sizes     object
params                           object
rank_test_score                   int64
split0_test_score               float64
split0_train_score              float64
split1_test_score               float64
split1_train_score              float64
split2_test_score               float64
split2_train_score              float64
split3_test_score               float64
split3_train_score              float64
split4_test_score               float64
split4_train_score              float64
std_fit_time                    float64
std_score_time                  float64
std_test_score                  float64
std_train_score                 float64
dtype: object
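Because two of the columns hold strings, the array returned by .values is an object array rather than a float array. If only the numeric columns are needed, one option is to select them before converting. This is a minimal sketch, assuming a shortened two-row sample in place of the real file; the sample text and variable names are illustrative.

```python
import io

import pandas as pd

# Shortened stand-in for the file above: same quoting, fewer numeric columns.
sample = """\
,mean_fit_time,param_NN__hidden_layer_sizes,params,rank_test_score
0,0.34,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25
1,0.55,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5
"""

df = pd.read_csv(io.StringIO(sample), index_col=[0])

# Mixed string and numeric columns collapse to a generic object array.
mixed = df.to_numpy()

# Keeping only the numeric columns gives a regular floating-point array.
numeric = df.select_dtypes(include="number").to_numpy()

print(mixed.dtype)
print(numeric.dtype, numeric.shape)
```

Whether dropping the string columns is acceptable depends on what the array is used for; the parameter columns are lost in the numeric version.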

Statutory warning: given its nature, this data may be more useful to you as a Python list than as a numpy array (which is optimized for working with scalars).
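One possible way to get plain Python objects is sketched below. It assumes the two string columns always contain valid Python literals, as they do in the snippet above, and parses them with ast.literal_eval before converting the rows to a list of dictionaries. The shortened sample and the variable names are illustrative.

```python
import ast
import io

import pandas as pd

# Shortened stand-in for the file above: same quoting, fewer numeric columns.
sample = """\
,mean_fit_time,param_NN__hidden_layer_sizes,params,rank_test_score
0,0.34,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25
1,0.55,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5
"""

df = pd.read_csv(io.StringIO(sample), index_col=[0])

# Turn the text "(25, 7)" into the tuple (25, 7), and the params text into a dict.
# literal_eval only accepts Python literals, so it does not execute arbitrary code.
for col in ["param_NN__hidden_layer_sizes", "params"]:
    df[col] = df[col].apply(ast.literal_eval)

# One dictionary per row, keyed by column name.
records = df.to_dict("records")

print(records[0]["param_NN__hidden_layer_sizes"])
print(records[1]["params"])
```

If any cell is not a valid literal, ast.literal_eval raises an exception, so real data may need a check before this step.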
