Excluding the current row from feature engineering with Featuretools in Python



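I'm using featuretools to generate historical features for the current row, for example the number of transactions made during a session in the past hour.

featuretools provides the cutoff_time parameter, which excludes all rows that come after the cutoff_time.

I set cutoff_time to the time_index value - 1 second, so I expect the features to be based on the historical data minus the current row. This allows the response variable from historical rows to be included.

The problem is that when this parameter is not equal to the time_index variable, I get a bunch of NaNs in both the original and the generated features.

Example: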

#!/usr/bin/env python3
import featuretools as ft
import pandas as pd
from featuretools import primitives, variable_types

data = ft.demo.load_mock_customer()
transactions_df = data['transactions']
# Cutoff one second before each transaction, intended to exclude the current row.
transactions_df['cutoff_time'] = transactions_df['transaction_time'] - pd.Timedelta(seconds=1)

es = ft.EntitySet('transactions_set')
es.entity_from_dataframe(
    entity_id='transactions',
    dataframe=transactions_df,
    variable_types={
        'transaction_id': variable_types.Index,
        'session_id': variable_types.Id,
        'transaction_time': variable_types.DatetimeTimeIndex,
        'product_id': variable_types.Id,
        'amount': variable_types.Numeric,
        'cutoff_time': variable_types.Datetime
    },
    index='transaction_id',
    time_index='transaction_time'
)
# Derive a sessions entity from the session_id column.
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='sessions',
    index='session_id'
)
es.add_last_time_indexes()

fm, features = ft.dfs(
    entityset=es,
    target_entity='transactions',
    agg_primitives=[primitives.Sum, primitives.Count],
    trans_primitives=[primitives.Day],
    # One cutoff per transaction: its id plus a 'time' column.
    cutoff_time=transactions_df[['transaction_id', 'cutoff_time']].rename(
        index=str, columns={'transaction_id': 'transaction_id', 'cutoff_time': 'time'}
    ),
    training_window='1 hours',
    verbose=True
)
print(fm)

Output (excerpt):

DAY(cutoff_time)  sessions.SUM(transactions.amount)  
transaction_id                                                        
352                          NaN                                NaN   
186                          NaN                                NaN   
319                          NaN                                NaN   
256                          NaN                                NaN   
449                          NaN                                NaN   
40                           NaN                                NaN   
13                           NaN                                NaN   
127                          NaN                                NaN   
21                           NaN                                NaN   
309                          NaN                                NaN   

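The sessions.SUM(transactions.amount) column should be >= 0. The original features session_id, product_id, and amount are also all NaN.

If transactions_df['cutoff_time'] = transactions_df['transaction_time'] (no time delta), this code works, but it includes the current row.

What is the correct way to compute aggregations and transforms that exclude the current row from the calculation?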

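What you're seeing is the expected behavior of cutoff times and the time_index. The time_index of an entity represents the first time any information about each instance can be known. When you give Featuretools a cutoff time, it simulates the state of the dataset at that time by removing the rows whose time index is after the cutoff time.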

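In this case, the transaction_id and session_id of a transaction are not known before the transaction time, which makes sense because the transaction hasn't happened yet. That is why, when you ask Featuretools to compute features one second before the transaction time, it returns NaN for all features.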
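To make that concrete, here is a minimal plain-Python sketch of the filtering described above. It does not call Featuretools; the toy rows and the helper names `visible_rows` and `session_sum_for` are made up for illustration, and it assumes rows exactly at the cutoff are kept.

```python
from datetime import datetime, timedelta

# Hypothetical toy data, not the Featuretools mock customer dataset.
rows = [
    {"transaction_id": 1, "session_id": "a",
     "transaction_time": datetime(2014, 1, 1, 0, 0, 0), "amount": 10.0},
    {"transaction_id": 2, "session_id": "a",
     "transaction_time": datetime(2014, 1, 1, 0, 1, 0), "amount": 20.0},
    {"transaction_id": 3, "session_id": "a",
     "transaction_time": datetime(2014, 1, 1, 0, 2, 0), "amount": 30.0},
]


def visible_rows(rows, cutoff):
    """Keep only rows whose time index is at or before the cutoff (assumed inclusive)."""
    return [r for r in rows if r["transaction_time"] <= cutoff]


def session_sum_for(rows, target_id, cutoff):
    """Session-level SUM(amount) for the target row as of `cutoff`.

    Returns None when the target row itself is not visible yet, because
    its session_id is then unknown and there is nothing to join on.
    """
    known = visible_rows(rows, cutoff)
    target = next((r for r in known if r["transaction_id"] == target_id), None)
    if target is None:
        return None
    return sum(r["amount"] for r in known if r["session_id"] == target["session_id"])


target_time = rows[2]["transaction_time"]
# Cutoff equal to the transaction time: the row is visible, so it is included.
at_time = session_sum_for(rows, 3, target_time)
# Cutoff one second earlier: the row is filtered out entirely.
before = session_sum_for(rows, 3, target_time - timedelta(seconds=1))
print(at_time, before)
```

With the cutoff at the transaction time the current row's amount is part of the sum; one second earlier the row does not exist yet from the simulation's point of view, which corresponds to the NaN values in the question.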

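The way to handle this is to assign a secondary_time_index to variables such as amount in transactions. This is described in the advanced solution in this Stack Overflow answer. It lets you tell Featuretools that a particular variable is not valid at transaction_time and can only be used from the time given in the secondary time index column. In essence, you block certain values of a row from being used at the time of the transaction while allowing others. You can assign a secondary time index to any number of variables in the entity.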
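The idea of a secondary time index can be sketched in plain Python as per-column masking. This is only an illustration of the concept, not how Featuretools implements it; the `snapshot` helper, the toy rows, and the `response_time` column are assumptions for the example.

```python
from datetime import datetime, timedelta

# Hypothetical toy data: response_time is one second after transaction_time.
base = datetime(2014, 1, 1, 0, 0, 0)
rows = [
    {"transaction_id": i + 1, "session_id": "a",
     "transaction_time": base + timedelta(minutes=i),
     "response_time": base + timedelta(minutes=i, seconds=1),
     "amount": amount}
    for i, amount in enumerate([10.0, 20.0, 30.0])
]


def snapshot(rows, cutoff, secondary_columns):
    """Simulate what is known at `cutoff`.

    A row appears once its primary time (transaction_time) has passed.
    Columns tied to the secondary time (response_time) stay hidden (None)
    until that later time has passed as well.
    """
    out = []
    for r in rows:
        if r["transaction_time"] > cutoff:
            continue  # row does not exist yet
        seen = dict(r)
        if r["response_time"] > cutoff:
            for col in secondary_columns:
                seen[col] = None  # value not available yet
        out.append(seen)
    return out


cutoff = rows[2]["transaction_time"]
known = snapshot(rows, cutoff, secondary_columns=["amount"])
amounts = [r["amount"] for r in known]
# Sum over the session, skipping values that are still hidden.
total = sum(a for a in amounts if a is not None)
print(amounts, total)
```

At the third transaction's own time the row is present, so its session_id can be used for the join, but its amount is still hidden, so the session sum only covers the two earlier rows.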

Based on Max Kanter's answer:

#!/usr/bin/env python3
import featuretools as ft
import pandas as pd
from featuretools import primitives, variable_types

data = ft.demo.load_mock_customer()
transactions_df = data['transactions']
# Time at which the response values of a row become usable: one second after the transaction.
transactions_df['response_time'] = transactions_df['transaction_time'] + pd.Timedelta(seconds=1)

es = ft.EntitySet('transactions_set')
es.entity_from_dataframe(
    entity_id='transactions',
    dataframe=transactions_df,
    variable_types={
        'transaction_id': variable_types.Index,
        'session_id': variable_types.Id,
        'transaction_time': variable_types.DatetimeTimeIndex,
        'product_id': variable_types.Id,
        'amount': variable_types.Numeric,
        'response_time': variable_types.Datetime
    },
    index='transaction_id',
    time_index='transaction_time',
    # amount and transaction_id are only available from response_time onward.
    secondary_time_index={'response_time': ['amount', 'transaction_id']}
)
# Derive a sessions entity from the session_id column.
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='sessions',
    index='session_id'
)
es.add_last_time_indexes()

fm, features = ft.dfs(
    entityset=es,
    target_entity='transactions',
    agg_primitives=[primitives.Sum, primitives.Count],
    trans_primitives=[primitives.Day],
    # The cutoff is now the transaction time itself.
    cutoff_time=transactions_df[['transaction_id', 'transaction_time']],
    cutoff_time_in_index=True,
    training_window='5 minutes',
    verbose=True
)
print(fm)

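This code generates the features sessions.SUM(transactions.amount) and sessions.COUNT(transactions), which exclude the current row and include all preceding rows that are less than 5 minutes old.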
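As a sanity check, the intended result can be approximated without Featuretools. The sketch below is a plain-Python reference computation on made-up rows; the helper `prior_session_stats` is hypothetical, and the window boundaries (strictly earlier than the current row, strictly newer than the window start) are an assumption that may differ from how Featuretools treats rows exactly on a boundary.

```python
from datetime import datetime, timedelta

# Hypothetical toy data, not the Featuretools mock customer dataset.
base = datetime(2014, 1, 1, 0, 0, 0)
rows = [
    {"transaction_id": 1, "session_id": "a",
     "transaction_time": base, "amount": 10.0},
    {"transaction_id": 2, "session_id": "a",
     "transaction_time": base + timedelta(minutes=2), "amount": 20.0},
    {"transaction_id": 3, "session_id": "a",
     "transaction_time": base + timedelta(minutes=8), "amount": 30.0},
    {"transaction_id": 4, "session_id": "b",
     "transaction_time": base + timedelta(minutes=3), "amount": 5.0},
]


def prior_session_stats(rows, window):
    """For each row, (sum, count) of earlier rows in the same session within `window`.

    The current row is excluded by the strict `<` comparison.
    """
    stats = {}
    for cur in rows:
        start = cur["transaction_time"] - window
        prior = [
            r["amount"] for r in rows
            if r["session_id"] == cur["session_id"]
            and start < r["transaction_time"] < cur["transaction_time"]
        ]
        stats[cur["transaction_id"]] = (sum(prior), len(prior))
    return stats


stats = prior_session_stats(rows, window=timedelta(minutes=5))
for transaction_id, (total, count) in stats.items():
    print(transaction_id, total, count)
```

Comparing a few rows of the feature matrix against a reference like this is a quick way to confirm that the current row is really left out and that the training window cuts off older rows.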
