Restricting feature generation to specific entities in Featuretools



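I'm trying to understand how to specify primitive_options in Featuretools (version 0.16) so that a primitive is applied only to certain entities. According to the documentation, I should use include_entities:

List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).

Simple case

Here is some example code: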

import pprint

import featuretools as ft
from featuretools.primitives import GreaterThanScalar

esd1 = ft.demo.load_mock_customer(return_entityset=True)


def run_dfs(esd, primitive_options=None):
    # Build only the feature definitions (features_only=True) and print them.
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count", GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options,
        max_depth=4,
        features_only=True,
    )
    pprint.pprint(feature_defs)


run_dfs(esd1)

This produces:

[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions) > 0>]

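Say I'm interested in the counts of sessions and transactions, and in whether the session count is greater than 0. Based on the documentation, I use include_entities here: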

run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "include_entities": ["sessions"],
    }
})

However, the resulting output is:

[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>]

Both GreaterThanScalar features are now gone. If I use ignore_entities instead, I get:

run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
    }
})

[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>]

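So this works, but I'm not sure why ignore_entities gives the result I need while include_entities does not. Am I missing something?

More complex case

Although I more or less got the simple case working, what I really want is something more complex. I want a boolean feature that tells me whether the number of sessions on a particular device is greater than zero.

Doing this: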

esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)

yields:

[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions) > 0>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(sessions WHERE device = desktop) > 0>,
<Feature: COUNT(sessions WHERE device = tablet) > 0>,
<Feature: COUNT(sessions WHERE device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]

The features I need are the 4th to 6th from the bottom. If I try to restrict dfs to the sessions entity and the device variable:

run_dfs(esd2, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": {"sessions": ["device"]},
    }
})

The result is:

[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>]

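There are no GreaterThanScalar features at all.

Is there a way to make dfs give me only the three GreaterThanScalar features I want here?

Update: a third case

Is there a way to restrict what gets counted under a where clause? For example: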

esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()
run_dfs(esd3, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions", "sessions"],
    },
    "count": {
        "ignore_variables": {"transactions": ["session_id"]},
    },
})

gives:

[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE products.brand = B)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(transactions WHERE products.brand = A)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>]

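Is it possible to restrict the COUNT(transactions WHERE ...) features to products only? I still want to keep the COUNT(sessions ...) features.

Adding "session_id" from the "sessions" entity to the include_variables option will generate the features you are looking for: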

primitive_options = {
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": {"sessions": ["session_id", "device"]},
    }
}

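The Count primitive uses the entity index as its base, along with any where columns. If you include only the where column in the GreaterThanScalar primitive options, dfs ends up ignoring every Count feature for GreaterThanScalar, because they all use a column that is implicitly ignored (the entity index). In this case the desired Count features are built on the "sessions" entity, so adding the "sessions" entity index ("session_id") to the include_variables option produces the desired features.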
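To make that rule concrete, here is a toy model in plain Python. It is not Featuretools code and does not reflect how dfs is implemented internally; the function is_allowed and the (entity, column) dependency pairs are hypothetical names made up for illustration. It assumes, as described above, that a Count feature depends on the entity index plus any where column, and that an entity not listed in include_variables is left unrestricted.

```python
def is_allowed(dependencies, include_variables):
    """Toy check: may a primitive consume a feature with these dependencies?

    dependencies: iterable of (entity, column) pairs the feature is built from.
    include_variables: dict mapping entity name -> list of allowed columns.
    Entities missing from include_variables are treated as unrestricted
    (an assumption of this sketch).
    """
    for entity, column in dependencies:
        allowed = include_variables.get(entity)
        if allowed is not None and column not in allowed:
            return False
    return True


# COUNT(sessions WHERE device = tablet) rests on the sessions index and on device.
count_where_device = [("sessions", "session_id"), ("sessions", "device")]

only_device = {"sessions": ["device"]}
index_and_device = {"sessions": ["session_id", "device"]}

print(is_allowed(count_where_device, only_device))       # False: index is excluded
print(is_allowed(count_where_device, index_and_device))  # True
```

In this simplified picture, listing only "device" rejects every Count feature on "sessions", which matches the empty result in the question.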

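Also, in the first example, which used include_entities, the GreaterThanScalar features were lost because the "customers" entity (the target entity) was not included. The Count features are all aggregation features on the "customers" entity; they represent a count per customer. To use the Count features, the GreaterThanScalar primitive must be allowed to use both the "customers" entity, where the Count features live, and the entity that the desired Count features are based on (in this case "sessions").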
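Following that explanation, the include_entities variant would need to list the target entity as well. Here is a minimal sketch of building such an options dict; the helper include_entities_options is a hypothetical convenience function, not part of Featuretools, and I have not verified the exact feature list it produces on 0.16.

```python
def include_entities_options(primitive_name, target_entity, base_entities):
    """Build a primitive_options dict that lets one primitive see the target
    entity plus the entities its input features are built from.

    Hypothetical helper for illustration; it only assembles a plain dict.
    """
    entities = [target_entity]
    for name in base_entities:
        if name not in entities:  # avoid duplicates, keep order
            entities.append(name)
    return {primitive_name: {"include_entities": entities}}


options = include_entities_options("greater_than_scalar", "customers", ["sessions"])
print(options)
# {'greater_than_scalar': {'include_entities': ['customers', 'sessions']}}

# To try it with the helper from the question (requires featuretools 0.16):
# run_dfs(esd1, primitive_options=options)
```

The dict is equivalent to writing "include_entities": ["customers", "sessions"] by hand; the helper only makes it harder to forget the target entity.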
