我有一个CSV文件,其中一列包含一系列间隔为0.1步的实数。但是,缺少一些值。我已经将数据加载到pandas DataFrame中,并将此列设置为索引。由于浮点数的表示方式,我很难识别缺失的索引值。我想找到缺失的行,并用np.nan
填充它们。
例如,给定
A B
0.0 78.4 65.9
0.1 72.9 29.2
0.3 80.7 45.9
0.4 10.9 82.3
0.5 37.7 85.4
0.6 90.4 69.9
0.7 17.0 57.2
0.8 70.0 49.6
1.0 58.6 65.1
我想要
A B
0.0 78.4 65.9
0.1 72.9 29.2
0.2 NaN NaN
0.3 80.7 45.9
0.4 10.9 82.3
0.5 37.7 85.4
0.6 90.4 69.9
0.7 17.0 57.2
0.8 70.0 49.6
0.9 NaN NaN
1.0 58.6 65.1
我试过用熊猫。DataFrame.reindex,但由于浮点表示,它不起作用。以下是我迄今为止所尝试的(我的数据范围要大得多(:
import sys
import numpy as np
import pandas as pd
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
data = """, A, B
0.0, 78.4, 65.9
0.1, 72.9, 29.2
0.3, 80.7, 45.9
0.4, 10.9, 82.3
0.5, 37.7, 85.4
0.6, 90.4, 69.9
0.7, 17.0, 57.2
0.8, 70.0, 49.6
1.0, 58.6, 65.1"""
fp = StringIO(data)
df = pd.read_csv(fp, index_col=0)
print(
df.reindex(
pd.Index(
np.arange(df.index[0], df.index[-1]+0.1, 0.1)
)
)
)
此输出
A B
0.0 78.4 65.9
0.1 72.9 29.2
0.2 NaN NaN
0.3 NaN NaN
0.4 10.9 82.3
0.5 37.7 85.4
0.6 NaN NaN
0.7 NaN NaN
0.8 70.0 49.6
0.9 NaN NaN
1.0 58.6 65.1
由于float不能总是获得100%匹配的链接,我们可以用reindex
传递tolerance
out = df.reindex(np.arange(df.index.min(),df.index.max()+0.1,0.1),method='nearest',tolerance=0.01)
A B
0.0 78.4 65.9
0.1 72.9 29.2
0.2 NaN NaN
0.3 80.7 45.9
0.4 10.9 82.3
0.5 37.7 85.4
0.6 90.4 69.9
0.7 17.0 57.2
0.8 70.0 49.6
0.9 NaN NaN
1.0 58.6 65.1