I have a pandas DataFrame with 22 columns, indexed by dateTime.
I am trying to normalize this data with the following code:
from sklearn.preprocessing import MinMaxScaler
# Normalization
scaler = MinMaxScaler(copy = False)
normal_data = scaler.fit_transform(all_data2)
The problem is that I lose a lot of data by applying this function. For example, this is what it looked like before:
all_data2.head(n = 5)
Out[105]:
btc_price btc_change btc_change_label eth_price
time
2017-09-02 21:54:00 4537.8338 -0.066307 0 330.727
2017-09-02 22:29:00 4577.6050 -0.056294 0 337.804
2017-09-02 23:04:00 4566.3600 -0.059716 0 336.938
2017-09-02 23:39:00 4590.0313 -0.056242 0 342.929
2017-09-03 00:14:00 4676.1925 -0.035857 0 354.171
block_size difficulty estimated_btc_sent
time
2017-09-02 21:54:00 142521291.0 8.880000e+11 2.040000e+13
2017-09-02 22:29:00 136524566.0 8.880000e+11 2.030000e+13
2017-09-02 23:04:00 134845546.0 8.880000e+11 2.010000e+13
2017-09-02 23:39:00 133910638.0 8.880000e+11 1.990000e+13
2017-09-03 00:14:00 130678099.0 8.880000e+11 2.010000e+13
estimated_transaction_volume_usd hash_rate
time
2017-09-02 21:54:00 923315359.5 7.417412e+09
2017-09-02 22:29:00 918188066.9 7.152505e+09
2017-09-02 23:04:00 910440915.6 7.240807e+09
2017-09-02 23:39:00 901565929.9 7.284958e+09
2017-09-03 00:14:00 922422228.4 7.152505e+09
miners_revenue_btc ... n_blocks_mined
time ...
2017-09-02 21:54:00 2395.0 ... 168.0
2017-09-02 22:29:00 2317.0 ... 162.0
2017-09-02 23:04:00 2342.0 ... 164.0
2017-09-02 23:39:00 2352.0 ... 165.0
2017-09-03 00:14:00 2316.0 ... 162.0
n_blocks_total n_btc_mined n_tx nextretarget
time
2017-09-02 21:54:00 483207.0 2.100000e+11 241558.0 483839.0
2017-09-02 22:29:00 483208.0 2.030000e+11 236661.0 483839.0
2017-09-02 23:04:00 483216.0 2.050000e+11 238682.0 483839.0
2017-09-02 23:39:00 483220.0 2.060000e+11 237159.0 483839.0
2017-09-03 00:14:00 483223.0 2.030000e+11 237464.0 483839.0
total_btc_sent total_fees_btc totalbtc
time
2017-09-02 21:54:00 1.620000e+14 2.959788e+10 1.650000e+15
2017-09-02 22:29:00 1.600000e+14 2.920230e+10 1.650000e+15
2017-09-02 23:04:00 1.600000e+14 2.923498e+10 1.650000e+15
2017-09-02 23:39:00 1.580000e+14 2.899158e+10 1.650000e+15
2017-09-03 00:14:00 1.580000e+14 2.917904e+10 1.650000e+15
trade_volume_btc trade_volume_usd
time
2017-09-02 21:54:00 102451.92 463497284.7
2017-09-02 22:29:00 102451.92 463497284.7
2017-09-02 23:04:00 102451.92 463497284.7
2017-09-02 23:39:00 102451.92 463497284.7
2017-09-03 00:14:00 96216.78 440710136.1
[5 rows x 22 columns]
Afterwards I get a numpy array in which the new index has been normalized as well (not great, since it is the date column) and all the column headers have been dropped.
Can I somehow normalize only selected columns of the original DataFrame?
If not, how can I pick just the columns I need from the normalized numpy array and insert them back into the original DF?
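For reference, the closest I have come is rebuilding a DataFrame around the scaler output by hand (a rough sketch, simply reattaching the index and column names of all_data2; I have not verified it on the full 22-column frame):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# fit_transform returns a plain numpy array, so the original index
# and headers have to be reattached afterwards
normal_data = scaler.fit_transform(all_data2)
normal_df = pd.DataFrame(normal_data,
                         index=all_data2.index,
                         columns=all_data2.columns)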
Try sklearn.preprocessing.scale. There is no need for a class-based scaler here. From its documentation:
Standardize a dataset along any axis. Center to the mean and component wise scale to unit variance.
You can use it like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

df = pd.DataFrame({'col1': np.random.randn(10),
                   'col2': np.arange(10, 30, 2),
                   'col3': np.arange(10)},
                  index=pd.date_range('2017', periods=10))

# Specify columns to scale to N~(0, 1)
to_scale = ['col2', 'col3']
df.loc[:, to_scale] = scale(df[to_scale])
print(df)
col1 col2 col3
2017-01-01 -0.28292 -1.56670 -1.56670
2017-01-02 -1.55172 -1.21854 -1.21854
2017-01-03 0.51800 -0.87039 -0.87039
2017-01-04 -1.75596 -0.52223 -0.52223
2017-01-05 1.34857 -0.17408 -0.17408
2017-01-06 0.12600 0.17408 0.17408
2017-01-07 0.21887 0.52223 0.52223
2017-01-08 0.84924 0.87039 0.87039
2017-01-09 0.32555 1.21854 1.21854
2017-01-10 0.54095 1.56670 1.56670
To get a modified copy back instead:
new_df = df.copy()
new_df.loc[:, to_scale] = scale(df[to_scale])
As for the warning: it is hard to say without seeing your data, but it does look like you have some very large values (e.g. 7.417412e+09). That is where the warning comes from, and I would venture it can safely be ignored; it is raised by a tolerance test that checks whether your new mean equals 0, and that test fails. To see whether the scaling actually failed, just check new_df.mean() and new_df.std() to confirm your columns were normalized to N~(0, 1).
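A minimal sketch of that check, assuming new_df was built as in the snippet above:
# scale() centers each selected column to mean 0 and unit variance.
# pandas' .std() uses ddof=1, so on small samples the result will sit
# slightly above 1 even when the scaling worked as intended.
print(new_df[to_scale].mean())
print(new_df[to_scale].std())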