Interpolate (or extrapolate) only small gaps in a pandas DataFrame

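I have a pandas DataFrame with a time index (1 min frequency) and several columns of data. Sometimes the data contains NaN. When it does, I want to interpolate only if the gap is no longer than 5 minutes, that is, at most 5 consecutive NaNs. The data may look like this (several test cases that show the problem):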

import numpy as np
import pandas as pd
from datetime import datetime, timedelta  # timedelta is needed for the index below

start = datetime(2014, 2, 21, 14, 50)
# np.nan is used instead of np.NaN, an alias that was removed in NumPy 2.0.
data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                    data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                          'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                          'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,
                          'd': [np.nan]*8,
                          'e': [np.nan]*7 + [2330.3],
                          'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                          'g': [2330.3] + [np.nan]*7,
                          'h': [2330.3] + [np.nan]*6 + [2777.7]})

It looks like this:

In [147]: data
Out[147]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

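I am aware of data.interpolate(), but it has several shortcomings. It produces the result below, which is fine for columns a-e but fails for columns f-h, each for a different reason: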

                         a      b      c   d       e       f       g  
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3   
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0  2330.3   
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3  2330.3   
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3  2330.3   
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3  2330.3   
                               h  
2014-02-21 14:50:00  2330.300000  
2014-02-21 14:51:00  2394.214286  
2014-02-21 14:52:00  2458.128571  
2014-02-21 14:53:00  2522.042857  
2014-02-21 14:54:00  2585.957143  
2014-02-21 14:55:00  2649.871429  
2014-02-21 14:56:00  2713.785714  
2014-02-21 14:57:00  2777.700000

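f) The gap at the start consists of 4 minutes of NaNs; they should be replaced by the value 2763.0 (i.e. extrapolated backwards in time).

g) The gap is longer than 5 minutes, but it is still extrapolated.

h) The gap is longer than 5 minutes, but it is still interpolated.

I understand the reasons: of course, I never specified that gaps longer than 5 minutes should not be interpolated. I also understand that interpolate only extrapolates forward in time, but I would like it to extrapolate backward in time as well. Is there a known method that solves my problem without reinventing the wheel?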

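Edit: the data.interpolate method accepts an input parameter limit, which defines the maximum number of consecutive NaNs to be replaced by interpolation. However, this still interpolates up to the limit, whereas I want a gap that is too long to keep all of its NaNs.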
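To make the behaviour described in the edit concrete, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the data above). It suggests that `limit=5` fills the first five NaNs of a seven-NaN gap rather than leaving the whole gap untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical example: one gap of 7 consecutive NaNs between 1.0 and 9.0.
s = pd.Series([1.0] + [np.nan] * 7 + [9.0])

# limit=5 caps how many NaNs get filled, but the gap is still partly filled.
partly_filled = s.interpolate(limit=5)
remaining_nans = int(partly_filled.isna().sum())

print(partly_filled.tolist())
print(remaining_nans)  # 2 of the 7 NaNs are left over
```

This is why `limit` alone is not enough here: the desired result would leave all seven values as NaN.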

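So here is a mask that ought to solve the problem. Just interpolate and then apply the mask to reset the appropriate values to NaN. Honestly, this was a bit more work than I expected, because I had to loop through each column, and groupby did not quite work without me providing a dummy column (such as 'ones').

Anyway, I can explain if anything is unclear, but really only a couple of lines are somewhat hard to understand. See here for more explanation of the trick on the df['new'] line, or just print out the individual lines to better see what is going on.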

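mask = data.copy()
for i in list('abcdefgh'):
    df = pd.DataFrame( data[i] )
    # Give every run of consecutive NaN (or non-NaN) values its own id.
    df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
    # Dummy column so that groupby has something to count.
    df['ones'] = 1
    # Keep a cell if its run is shorter than 5 rows or if it holds real data.
    # Note: `< 5` keeps gaps of up to 4 NaNs; use `<= 5` to also keep gaps of exactly 5.
    mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull()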
In [7]: data
Out[7]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7
In [8]: mask
Out[8]: 
                        a     b     c      d      e     f      g      h
2014-02-21 14:50:00  True  True  True  False  False  True   True   True
2014-02-21 14:51:00  True  True  True  False  False  True  False  False
2014-02-21 14:52:00  True  True  True  False  False  True  False  False
2014-02-21 14:53:00  True  True  True  False  False  True  False  False
2014-02-21 14:54:00  True  True  True  False  False  True  False  False
2014-02-21 14:55:00  True  True  True  False  False  True  False  False
2014-02-21 14:56:00  True  True  True  False  False  True  False  False
2014-02-21 14:57:00  True  True  True  False   True  True  False   True
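The run-labelling trick on the df['new'] line may be easier to follow in isolation. The sketch below applies the same expression to a small made-up Series; the names `run_id` and `run_len` are chosen here for illustration and do not appear in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: a 2-NaN gap in the middle and a 1-NaN gap at the end.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# True wherever the "is this a real value?" flag flips relative to the
# previous row; the cumulative sum turns each flip into a new run id.
run_id = (s.notnull() != s.shift().notnull()).cumsum()

# Length of the run each row belongs to (run_id has no NaNs, so count works).
run_len = run_id.groupby(run_id).transform('count')

print(run_id.tolist())   # [1, 2, 2, 3, 4]
print(run_len.tolist())  # [1, 2, 2, 1, 1]
```

Comparing `run_len` against the maximum allowed gap, and OR-ing with `s.notnull()`, gives the same kind of boolean mask as in the loop above.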

From there it is easy, as long as you are not doing anything fancy with respect to extrapolation:

In [9]: data.interpolate().bfill()[mask]
Out[9]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7

Edited to add: here is a faster (about 2x on this sample data) and slightly simpler way, obtained by moving some things outside the loop:

mask = data.copy()
grp = ((mask.notnull() != mask.shift().notnull()).cumsum())
grp['ones'] = 1
for i in list('abcdefgh'):
    mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull()
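For readers who want the whole mask-then-fill idea in one place, the following is a condensed sketch on a small made-up DataFrame. The column names `short` and `long`, the variable `max_gap`, and the use of `<=` (so that a gap of exactly `max_gap` is still filled) are assumptions made for this illustration; `DataFrame.where` is used here in place of boolean indexing with `[mask]`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'short' has a 2-NaN gap, 'long' has a 6-NaN gap.
df = pd.DataFrame({
    'short': [1.0, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    'long': [1.0] + [np.nan] * 6 + [8.0],
})
max_gap = 5  # assumed threshold: fill gaps of up to 5 consecutive NaNs

# Label runs of NaN / non-NaN values per column, then measure each run.
run_id = (df.notnull() != df.shift().notnull()).cumsum()
run_len = run_id.apply(lambda col: col.groupby(col).transform('count'))

# Keep real data and any cell that sits in a short enough run.
mask = (run_len <= max_gap) | df.notnull()

# Fill everything first, then blank out the cells the mask rejects.
filled = df.interpolate().bfill().where(mask)
print(filled)
```

In this sketch the 2-NaN gap is interpolated while the 6-NaN gap is left entirely as NaN.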

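I had to solve a similar problem before I found the answer above, and came up with a numpy-based solution. Since my code is approximately ten times faster, I am providing it here in case it is useful to someone in the future. It handles NaNs at the end of a series differently from JohnE's solution above: if a series ends with NaNs, it flags that last gap as invalid.

The code is as follows: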


def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    # Index of each valid value; NaN positions point at the last element for now.
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    # Running minimum from the right gives the index of the next valid value.
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out


def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    # At each valid value: number of NaNs since the previous valid value.
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    # Copy that gap length back onto the NaNs of the gap; a trailing gap stays NaN.
    diff = bfill_nan(diff)
    return (diff < maxgap) | ~isnan


mask = data.copy()
for column_name in data:
    x = data[column_name].values
    mask[column_name] = calc_mask(x, 5)

print('data:')
print(data)
print('mask:')
print(mask)

Output:

data:
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7
mask:
                        a     b      c      d      e     f      g      h
2014-02-21 14:50:00  True  True   True  False  False  True   True   True
2014-02-21 14:51:00  True  True   True  False  False  True  False  False
2014-02-21 14:52:00  True  True   True  False  False  True  False  False
2014-02-21 14:53:00  True  True   True  False  False  True  False  False
2014-02-21 14:54:00  True  True  False  False  False  True  False  False
2014-02-21 14:55:00  True  True  False  False  False  True  False  False
2014-02-21 14:56:00  True  True  False  False  False  True  False  False
2014-02-21 14:57:00  True  True  False  False   True  True  False   True
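As a point of comparison, here is a different numpy sketch (not the method used in this answer) that computes the length of every NaN run directly from the run boundaries. The function name `nan_run_lengths` is hypothetical. Unlike `calc_mask` above, it measures a trailing gap by its length instead of flagging it as invalid.

```python
import numpy as np

def nan_run_lengths(arr):
    """For each element: length of the NaN run it belongs to, 0 if not NaN."""
    isnan = np.isnan(arr)
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], isnan, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # first index of each NaN run
    ends = np.flatnonzero(edges == -1)    # one past the last index of each run
    lengths = ends - starts
    out = np.zeros(arr.shape[0], dtype=int)
    # NaN positions appear run by run, so repeating each length fills them in order.
    out[isnan] = np.repeat(lengths, lengths)
    return out

arr = np.array([1.0, np.nan, np.nan, 4.0, np.nan])
lengths = nan_run_lengths(arr)
keep = lengths <= 5  # True for real data (length 0) and for short gaps
print(lengths.tolist())  # [0, 2, 2, 0, 1]
```

A mask built this way can be applied per column in the same loop as above.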

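According to the interpolate documentation, limit_area, which is used below, is new in version 0.23.0. I am not sure whether this is the desired output for columns e and g, since you have not specified the desired output in detail.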

import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta
start = datetime(2014,2,21,14,50)
# np.nan is used instead of np.NaN, an alias that was removed in NumPy 2.0.
df = data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                         data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                               'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                               'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,
                               'd': [np.nan]*8,
                               'e': [np.nan]*7 + [2330.3],
                               'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                               'g': [2330.3] + [np.nan]*7,
                               'h': [2330.3] + [np.nan]*6 + [2777.7]})
df.interpolate(
    limit=5,
    inplace=True,
    limit_direction='both',
    limit_area='outside',
    )
print(df)

Output:

                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN  2763.0  2330.3     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN  2330.3  2142.3  2330.3     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN  2330.3  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7
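The effect of limit_area is perhaps clearer on a tiny made-up Series. The sketch below contrasts 'inside' (fill only NaNs surrounded by valid values) with 'outside' (fill only leading and trailing NaNs), which matches what the output above shows for columns a and f. It assumes pandas 0.23.0 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a leading, an inner and a trailing NaN.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# 'inside': only the NaN between 1.0 and 3.0 is interpolated.
inside = s.interpolate(limit_area='inside')

# 'outside' with both directions: only the edge NaNs are filled,
# using the nearest valid value; the inner NaN is left alone.
outside = s.interpolate(limit_direction='both', limit_area='outside')

print(inside.tolist())
print(outside.tolist())
```

Note that limit_area on its own does not enforce a maximum gap length; `limit` still fills the first few NaNs of a gap that is too long, as columns e and g show.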

I adapted @JohnE's solution into a function (with some tweaks/improvements). I am using Python 3.8, and I believe type hinting changed in 3.9, so you may have to adapt.

from typing import Union

import pandas as pd


def fill_with_hard_limit(
        df_or_series: Union[pd.DataFrame, pd.Series], limit: int,
        fill_method='interpolate',
        **fill_method_kwargs) -> Union[pd.DataFrame, pd.Series]:
    """The fill methods from Pandas such as ``interpolate`` or ``bfill``
    will fill ``limit`` number of NaNs, even if the total number of
    consecutive NaNs is larger than ``limit``. This function instead
    does not fill any data when the number of consecutive NaNs
    is > ``limit``.
    Adapted from: https://stackoverflow.com/a/30538371/11052174
    :param df_or_series: DataFrame or Series to perform interpolation
        on.
    :param limit: Maximum number of consecutive NaNs to allow. Any
        occurrences of more consecutive NaNs than ``limit`` will have no
        filling performed.
    :param fill_method: Filling method to use, e.g. 'interpolate',
        'bfill', etc.
    :param fill_method_kwargs: Keyword arguments to pass to the
        fill_method, in addition to the given limit.
    :returns: A filled version of the given df_or_series according
        to the given inputs.
    """
    # Keep things simple, ensure we have a DataFrame.
    try:
        df = df_or_series.to_frame()
    except AttributeError:
        df = df_or_series
    # Initialize our mask.
    mask = pd.DataFrame(True, index=df.index, columns=df.columns)
    # Get cumulative sums of consecutive NaNs.
    grp = (df.notnull() != df.shift().notnull()).cumsum()
    # Add columns of ones.
    grp['ones'] = 1
    # Loop through columns and update the mask.
    for col in df.columns:
        mask.loc[:, col] = (
                (grp.groupby(col)['ones'].transform('count') <= limit)
                | df[col].notnull()
        )
    # Now, interpolate and use the mask to create NaNs for the larger
    # gaps.
    method = getattr(df, fill_method)
    out = method(limit=limit, **fill_method_kwargs)[mask]
    # Be nice to the caller and return a Series if that's what they
    # provided.
    if isinstance(df_or_series, pd.Series):
        # Return a Series.
        return out.loc[:, out.columns[0]]
    return out

Usage:

>>> data_filled = fill_with_hard_limit(data, 5)
>>> data_filled
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7
