scikit-learn normalization removes the column headers



I have a pandas DataFrame with 22 columns, indexed by dateTime.

I am trying to normalize this data with the following code:

from sklearn.preprocessing import MinMaxScaler
# Normalization
scaler = MinMaxScaler(copy=False)
# fit_transform returns a plain numpy array, not a DataFrame
normal_data = scaler.fit_transform(all_data2)

The problem is that applying this makes me lose a lot of information. For example, this is what the data looked like before:

all_data2.head(n=5)
Out[105]: 
                     btc_price  btc_change  btc_change_label  eth_price  
time                                                                      
2017-09-02 21:54:00  4537.8338   -0.066307                 0    330.727   
2017-09-02 22:29:00  4577.6050   -0.056294                 0    337.804   
2017-09-02 23:04:00  4566.3600   -0.059716                 0    336.938   
2017-09-02 23:39:00  4590.0313   -0.056242                 0    342.929   
2017-09-03 00:14:00  4676.1925   -0.035857                 0    354.171   
                      block_size    difficulty  estimated_btc_sent  
time                                                                 
2017-09-02 21:54:00  142521291.0  8.880000e+11        2.040000e+13   
2017-09-02 22:29:00  136524566.0  8.880000e+11        2.030000e+13   
2017-09-02 23:04:00  134845546.0  8.880000e+11        2.010000e+13   
2017-09-02 23:39:00  133910638.0  8.880000e+11        1.990000e+13   
2017-09-03 00:14:00  130678099.0  8.880000e+11        2.010000e+13   
                     estimated_transaction_volume_usd     hash_rate  
time                                                                  
2017-09-02 21:54:00                       923315359.5  7.417412e+09   
2017-09-02 22:29:00                       918188066.9  7.152505e+09   
2017-09-02 23:04:00                       910440915.6  7.240807e+09   
2017-09-02 23:39:00                       901565929.9  7.284958e+09   
2017-09-03 00:14:00                       922422228.4  7.152505e+09   
                     miners_revenue_btc        ...         n_blocks_mined  
time                                           ...                          
2017-09-02 21:54:00              2395.0        ...                  168.0   
2017-09-02 22:29:00              2317.0        ...                  162.0   
2017-09-02 23:04:00              2342.0        ...                  164.0   
2017-09-02 23:39:00              2352.0        ...                  165.0   
2017-09-03 00:14:00              2316.0        ...                  162.0   
                     n_blocks_total   n_btc_mined      n_tx  nextretarget  
time                                                                        
2017-09-02 21:54:00        483207.0  2.100000e+11  241558.0      483839.0   
2017-09-02 22:29:00        483208.0  2.030000e+11  236661.0      483839.0   
2017-09-02 23:04:00        483216.0  2.050000e+11  238682.0      483839.0   
2017-09-02 23:39:00        483220.0  2.060000e+11  237159.0      483839.0   
2017-09-03 00:14:00        483223.0  2.030000e+11  237464.0      483839.0   
                     total_btc_sent  total_fees_btc      totalbtc  
time                                                                
2017-09-02 21:54:00    1.620000e+14    2.959788e+10  1.650000e+15   
2017-09-02 22:29:00    1.600000e+14    2.920230e+10  1.650000e+15   
2017-09-02 23:04:00    1.600000e+14    2.923498e+10  1.650000e+15   
2017-09-02 23:39:00    1.580000e+14    2.899158e+10  1.650000e+15   
2017-09-03 00:14:00    1.580000e+14    2.917904e+10  1.650000e+15   
                     trade_volume_btc  trade_volume_usd  
time                                                     
2017-09-02 21:54:00         102451.92       463497284.7  
2017-09-02 22:29:00         102451.92       463497284.7  
2017-09-02 23:04:00         102451.92       463497284.7  
2017-09-02 23:39:00         102451.92       463497284.7  
2017-09-03 00:14:00          96216.78       440710136.1  
[5 rows x 22 columns]

Afterwards I am left with a numpy array in which the index has been normalized as well (which is not good, since it is a date column) and all of the column headers have been dropped.

Can I somehow normalize only selected columns of the original DataFrame?

If not, how can I pick just the columns I need from the normalized numpy array and insert them back into the original DF?

Try sklearn.preprocessing.scale. There is no need for a class-based scaler here.

Standardize a dataset along any axis. Center to the mean and component wise scale to unit variance.

You can use it like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
df = pd.DataFrame({'col1' : np.random.randn(10),
                   'col2' : np.arange(10, 30, 2),
                   'col3' : np.arange(10)},
                  index=pd.date_range('2017', periods=10))
# Specify columns to scale to N~(0,1)
to_scale = ['col2', 'col3']
df.loc[:, to_scale] = scale(df[to_scale])
print(df)
               col1     col2     col3
2017-01-01 -0.28292 -1.56670 -1.56670
2017-01-02 -1.55172 -1.21854 -1.21854
2017-01-03  0.51800 -0.87039 -0.87039
2017-01-04 -1.75596 -0.52223 -0.52223
2017-01-05  1.34857 -0.17408 -0.17408
2017-01-06  0.12600  0.17408  0.17408
2017-01-07  0.21887  0.52223  0.52223
2017-01-08  0.84924  0.87039  0.87039
2017-01-09  0.32555  1.21854  1.21854
2017-01-10  0.54095  1.56670  1.56670
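
Note that scale standardizes columns to zero mean and unit variance, which is not the same as the [0, 1] scaling MinMaxScaler performs. If you specifically want min-max normalization as in the question, the same column-selection pattern should work with minmax_scale, the functional counterpart of MinMaxScaler (a minimal sketch):

from sklearn.preprocessing import minmax_scale
# Scale only the selected columns to [0, 1], assigning back in place
# so the DataFrame keeps its index and column headers
df.loc[:, to_scale] = minmax_scale(df[to_scale])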

To get a modified copy back instead:

new_df = df.copy()
new_df.loc[:, to_scale] = scale(df[to_scale])
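
As for the second part of the question: if you would rather keep using MinMaxScaler itself, you can wrap its numpy output back into a DataFrame, reusing the original index and column names. Note that fit_transform only sees the values, so the dateTime index is never actually scaled; it is simply missing from the numpy result. A minimal sketch, using all_data2 from the question:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Rebuild a DataFrame around the scaled values, restoring the
# original dateTime index and all 22 column headers
normal_data = pd.DataFrame(scaler.fit_transform(all_data2),
                           index=all_data2.index,
                           columns=all_data2.columns)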

As for the warning: it is hard to say without seeing your data, but it does look like you have some very large values (7.417412e+09). That is where the warning comes from, and I would say it can safely be ignored: it is raised by a tolerance test that checks whether your new mean equals 0, and that check fails. To see whether it actually failed, just use new_df.mean() and new_df.std() to verify that your columns have been normalized to N~(0, 1).
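
For example, a quick sanity check along those lines (note that pandas' .std() defaults to ddof=1, while scale divides by the population standard deviation, so pass ddof=0 to compare like with like):

# Means of the scaled columns should be numerically close to 0
print(new_df[to_scale].mean())
# Standard deviations should be close to 1; ddof=0 matches sklearn's scaling
print(new_df[to_scale].std(ddof=0))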
