How to apply Deep Feature Synthesis to a single table



After preprocessing, my data is a single table in which several columns are features and one column is the label. I want to use featuretools.dfs to help me predict the label. Can I do this directly, or do I need to split the single table into multiple tables?

It is possible to run DFS on a single table. For example, if you have a pandas DataFrame df whose index column is 'index', you can write:

import featuretools as ft
es = ft.EntitySet('Transactions')
es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')
fm, features = ft.dfs(entityset=es,
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])

The resulting feature matrix looks like this:

In [1]: fm
Out[1]:
             location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
index
1         main street          3              4           12         29
2         main street          4              5           12         30
3         main street          5              6           12         31
4      arlington ave.         18              0            1          1
5      arlington ave.          1              1            1          2

This applies "transform" primitives to the data. You will usually want to give ft.dfs additional entities so that it can apply aggregation primitives as well. You can read about the difference between the two in our documentation.
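
If you want to see which primitives are transforms and which are aggregations, one option is ft.list_primitives(), which returns a DataFrame describing every built-in primitive. A minimal sketch, assuming a featuretools version that includes this helper:

import featuretools as ft

# each row describes one primitive; the 'type' column says whether it
# is an 'aggregation' or a 'transform' primitive
prims = ft.list_primitives()
print(prims[prims['type'] == 'aggregation'].head())
print(prims[prims['type'] == 'transform'].head())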

The standard workflow is to normalize the single entity by interesting categorical columns. If your df is the single table:

| index | location       | pies sold | date       |
|-------+----------------+-----------+------------|
|     1 | main street    |         3 | 2017-12-29 |
|     2 | main street    |         4 | 2017-12-30 |
|     3 | main street    |         5 | 2017-12-31 |
|     4 | arlington ave. |        18 | 2018-01-01 |
|     5 | arlington ave. |         1 | 2018-01-02 |

you might want to normalize by location:

es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')

Your new entity locations would have this table:

| location       | first_log_time |
|----------------+----------------|
| main street    |     2017-12-29 |
| arlington ave. |     2018-01-01 |
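
You can check what normalize_entity created by printing the new entity's backing dataframe. A quick look, assuming the pre-1.0 featuretools entity API used throughout this answer:

# the dataframe behind the derived 'locations' entity
print(es['locations'].df)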

This enables features like locations.SUM(log.pies sold) and locations.MEAN(log.pies sold), which sum or average all of the values for each location. You can see those features being created in the example below:

In [1]: import pandas as pd
   ...: import featuretools as ft
   ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
   ...:                    'location': ['main street',
   ...:                                 'main street',
   ...:                                 'main street',
   ...:                                 'arlington ave.',
   ...:                                 'arlington ave.'],
   ...:                    'pies sold': [3, 4, 5, 18, 1]})
   ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
   ...: df
   ...:
Out[1]:
   index        location  pies sold       date
0      1     main street          3 2017-12-29
1      2     main street          4 2017-12-30
2      3     main street          5 2017-12-31
3      4  arlington ave.         18 2018-01-01
4      5  arlington ave.          1 2018-01-02
In [2]: es = ft.EntitySet('Transactions')
   ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index',
   ...:                          time_index='date')
   ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations',
   ...:                     index='location')
   ...:
Out[2]:
Entityset: Transactions
  Entities:
    log [Rows: 5, Columns: 4]
    locations [Rows: 2, Columns: 2]
  Relationships:
    log.location -> locations.location
In [3]: fm, features = ft.dfs(entityset=es,
   ...:                       target_entity='log',
   ...:                       agg_primitives=['sum', 'mean'],
   ...:                       trans_primitives=['day'])
   ...: fm
   ...:
Out[3]:
             location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
index
1         main street          3         29                             29                            4.0                            12
2         main street          4         30                             29                            4.0                            12
3         main street          5         31                             29                            4.0                            12
4      arlington ave.         18          1                              1                            9.5                            19
5      arlington ave.          1          2                              1                            9.5                            19
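
From here you can hand fm to any estimator to predict your label column. A minimal sketch, assuming scikit-learn is available and treating 'pies sold' as a stand-in for your label:

from sklearn.ensemble import RandomForestRegressor

# hypothetical target: use 'pies sold' as the label and keep only the
# numeric feature columns
X = fm.drop(columns=['pies sold', 'location'])
y = fm['pies sold']
model = RandomForestRegressor(n_estimators=100).fit(X, y)

In practice you would also encode any remaining categorical columns and hold out a test set before fitting.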
