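Problem statement

Create an efficient fractional encoding (similar to one-hot encoding) for a ragged list of components and their corresponding compositions.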

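Toy example

Take composite materials with the following class: ingredient combinations:

Filler: colloidal silica (filler_A)
Filler: milled glass fiber (filler_B)
Resin: polyurethane (resin_A)
Resin: silicone (resin_B)
Resin: epoxy (resin_C)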

Dummy data

import numpy as np

# ragged rows, so both arrays are stored with dtype=object
components = np.array(
    [
        ["filler_A", "filler_B", "resin_C"],
        ["filler_A", "resin_B"],
        ["filler_A", "filler_B", "resin_B"],
        ["filler_A", "resin_B", "resin_C"],
        ["filler_B", "resin_A", "resin_B"],
        ["filler_A", "resin_A"],
        ["filler_B", "resin_A", "resin_B"],
    ],
    dtype=object,
)
compositions = np.array(
    [
        [0.4, 0.4, 0.2],
        [0.5, 0.5],
        [0.5, 0.3, 0.2],
        [0.5, 0.5, 0.0],
        [0.6, 0.4, 0.0],
        [0.6, 0.4],
        [0.6, 0.2, 0.2],
    ],
    dtype=object,
)
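
Since both arrays are ragged, it can help to confirm before encoding that every row pairs exactly one fraction with each component and that the fractions sum to one. Below is a minimal sketch of such a check. The names `check_ragged_inputs` and `tol` are hypothetical and not part of my original code.

```python
import math


def check_ragged_inputs(components, compositions, tol=1e-8):
    """Hypothetical helper: return indices of rows that look inconsistent."""
    if len(components) != len(compositions):
        raise ValueError("Components and compositions should have the same shape")
    bad_rows = []
    for i, (names, fractions) in enumerate(zip(components, compositions)):
        # each component needs exactly one fraction
        same_length = len(names) == len(fractions)
        # fractions are assumed to sum to one, up to floating-point error
        sums_to_one = math.isclose(sum(fractions), 1.0, abs_tol=tol)
        if not (same_length and sums_to_one):
            bad_rows.append(i)
    return bad_rows
```

An empty list means no row was flagged. Any returned index points at a row worth inspecting by hand.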

Desired output

X_train:

   filler_A  filler_B  resin_A  resin_B  resin_C
0       0.4       0.4      0.0      0.0      0.2
1       0.5       0.0      0.0      0.5      0.0
2       0.5       0.3      0.0      0.2      0.0
3       0.5       0.0      0.0      0.5      0.0
4       0.0       0.6      0.4      0.0      0.0
5       0.6       0.0      0.4      0.0      0.0
6       0.0       0.6      0.2      0.2      0.0

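What I have tried

I have a slow implementation of fractional_encode, a fractional_decode for reference, and basic usage.

My (slow) implementation

After struggling to come up with a faster implementation, I resorted to a slow, two-level nested for loop to create the one-hot-like fractional (or prevalence) encoding.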

from itertools import chain

import numpy as np
import pandas as pd


def fractional_encode(components, compositions, drop_last=False):
    """Fractionally encode components and compositions similar to one-hot encoding.

    In one-hot encoding, components are assigned a "1" if it exists for a particular
    compound, and a "0" if it does not. However, this ignores the case where the
    composition (i.e. the fractional prevalence) of each component is known. For
    example, NiAl is 50% Ni and 50% Al. This function computes the fractional components
    (albeit manually using for loops) where instead of a "1" or a "0", the corresponding
    fractional prevalence is assigned (e.g. 0.2, 0.5643, etc.).

    Parameters
    ----------
    components : list of lists of strings or numbers
        The components that make up the compound for each compound. If strings, then
        each string corresponds to a category. If numbers, then each number must
        uniquely describe a particular category.
    compositions : list of lists of floats
        The compositions of each component that makes up the compound for each compound.
    drop_last : bool, optional
        Whether to drop the last component. This is useful since compositions are
        constrained to sum to one, and therefore there are `n_components - 1` degrees
        of freedom, by default False

    Returns
    -------
    X_train : DataFrame
        Fractionally encoded matrix.

    Raises
    ------
    ValueError
        Components and compositions should have the same shape.

    See also
    --------
    "Convert jagged array to Pandas dataframe" https://stackoverflow.com/a/63496196/13697228
    """
    # lengths, unique components, and initialization
    n_compounds = len(components)
    # chain.from_iterable flattens one level; it stands in for the `flatten`
    # helper, which was not defined in this snippet
    unique_components = np.unique(list(chain.from_iterable(components)))
    n_unique = len(unique_components)
    X_train = np.zeros((n_compounds, n_unique))
    for i in range(n_compounds):
        # unpack
        component = components[i]
        composition = compositions[i]
        # lengths
        n_component = len(component)
        n_composition = len(composition)
        if n_component != n_composition:
            raise ValueError("Components and compositions should have the same shape")
        for j in range(n_unique):
            # unpack
            unique_component = unique_components[j]
            if unique_component in component:
                # assign
                idx = component.index(unique_component)
                X_train[i, j] = composition[idx]
    if drop_last:
        # remove last column: https://stackoverflow.com/a/6710726/13697228
        X_train = np.delete(X_train, -1, axis=1)
        # drop the matching label too, so the column names still line up
        unique_components = unique_components[:-1]
    X_train = pd.DataFrame(data=X_train, columns=unique_components)
    return X_train

Reverse implementation (decode)

For reference, I also made a function for decoding X_train, which uses higher-level operations:

import pandas as pd
from scipy.sparse import coo_matrix


def fractional_decode(X_train):
    """Fractionally decode components and compositions similar to one-hot encoding.

    In one-hot encoding, components are assigned a "1" if it exists for a particular
    compound, and a "0" if it does not. However, this ignores the case where the
    composition (i.e. the fractional prevalence) of each component is known. For
    example, NiAl is 50% Ni and 50% Al. This function decodes the fractional encoding
    where instead of "1" or a "0", the corresponding fractional prevalence is used
    (e.g. 0.2, 0.5643, etc.).

    Parameters
    ----------
    X_train : DataFrame
        Fractionally encoded matrix (similar to a one-hot encoded matrix).

    Returns
    -------
    components : list of tuples of strings or numbers
        The components that make up the compound for each compound. If strings, then
        each string corresponds to a category. If numbers, then each number must
        uniquely describe a particular category.
    compositions : list of tuples of floats
        The compositions of each component that makes up the compound for each compound.

    Notes
    -----
    A matrix encoded with `drop_last=True` in `fractional_encode` is not restored
    here: the dropped last component is not reconstructed.
    """
    # lengths, unique components, and sparse matrix attributes
    unique_components = X_train.columns
    n_unique = len(unique_components)
    sparse_mat = coo_matrix(X_train.values)
    row_ids, col_ids = sparse_mat.row, sparse_mat.col
    idx_pairs = list(zip(row_ids, col_ids))
    comps = sparse_mat.data
    # lookup dictionaries to replace col_ids with components
    component_lookup = {
        component_idx: unique_component
        for (component_idx, unique_component) in zip(range(n_unique), unique_components)
    }
    # lookup dictionaries to replace idx_pairs with compositions
    composition_lookup = {idx_pair: comp for (idx_pair, comp) in zip(idx_pairs, comps)}
    # contains placeholder col_ids and idx_pairs which will get replaced by components
    # and compositions, respectively
    tmp_df = pd.DataFrame(
        data=[(idx_pair[1], idx_pair) for idx_pair in idx_pairs],
        columns=["component", "composition"],
    )
    # NOTE: component_lookup should be mapped before composition_lookup
    tmp_df.component = tmp_df.component.map(component_lookup)
    tmp_df.composition = tmp_df.composition.map(composition_lookup)
    # add a row_id column to use for grouping into ragged entries
    cat_df = pd.concat([pd.DataFrame(row_ids, columns=["row_id"]), tmp_df], axis=1)
    # combine components and compositions compound-wise
    df = (
        cat_df.reset_index()
        .groupby(by="row_id")
        .agg({"component": lambda x: tuple(x), "composition": lambda x: tuple(x)})
    )
    # extract and convert to ragged lists
    components, compositions = [df[key] for key in ["component", "composition"]]
    components = list(components)
    compositions = list(compositions)
    return components, compositions
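
For comparison, the same decoding idea can probably be written more compactly with a boolean mask over the nonzero entries. This is only a sketch, and `fractional_decode_simple` is a hypothetical name. Like the version above, it cannot recover components whose fraction is exactly 0.0, because building a `coo_matrix` from a dense array also keeps only the nonzero entries. Unlike the version above, a row with no nonzero entries comes back as a pair of empty tuples instead of being skipped.

```python
import numpy as np
import pandas as pd


def fractional_decode_simple(X_train):
    """Hypothetical compact decoder: keep the nonzero entries of each row."""
    values = X_train.to_numpy()
    columns = X_train.columns.to_numpy()
    # True wherever a component is present with a nonzero fraction
    mask = values != 0
    components = [tuple(columns[row_mask]) for row_mask in mask]
    compositions = [tuple(row[row_mask]) for row, row_mask in zip(values, mask)]
    return components, compositions
```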

Example usage

X_train = fractional_encode(components, compositions)
components, compositions = fractional_decode(X_train)

Question

What is a faster implementation of fractional_encode?

Here is a solution that initializes an array with zeros and then updates the fields:

# sorted list of all unique component labels (these become the columns)
columns = sorted(list(set(sum(list(components), []))))
data = np.zeros((len(components), len(columns)))
for i in range(data.shape[0]):
    # only visit the components actually present in row i
    for component, composition in zip(components[i], compositions[i]):
        j = columns.index(component)
        data[i, j] = composition
df = pd.DataFrame(columns=columns, data=data)

Output:

   filler_A  filler_B  resin_A  resin_B  resin_C
0       0.4       0.4      0.0      0.0      0.2
1       0.5       0.0      0.0      0.5      0.0
2       0.5       0.3      0.0      0.2      0.0
3       0.5       0.0      0.0      0.5      0.0
4       0.0       0.6      0.4      0.0      0.0
5       0.6       0.0      0.4      0.0      0.0
6       0.0       0.6      0.2      0.2      0.0
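
If the remaining Python-level loop is still a bottleneck, one possible direction is to flatten both ragged inputs and fill the matrix with a single fancy-indexed assignment. The sketch below is an assumption-laden alternative, not a benchmarked solution. `fractional_encode_vectorized` is a hypothetical name, every row is assumed to be non-empty, and no component is assumed to repeat within a row.

```python
import numpy as np
import pandas as pd


def fractional_encode_vectorized(components, compositions):
    """Hypothetical vectorized variant of fractional_encode (a sketch)."""
    # length of each ragged row, used to build one row index per entry
    lengths = np.array([len(c) for c in components])
    if not np.array_equal(lengths, [len(c) for c in compositions]):
        raise ValueError("Components and compositions should have the same shape")

    # flatten both ragged inputs into 1D arrays of equal length
    flat_components = np.concatenate([np.asarray(c) for c in components])
    flat_compositions = np.concatenate(
        [np.asarray(c, dtype=float) for c in compositions]
    )

    # sorted unique labels, plus the column index of every flattened entry
    unique_components, col_ids = np.unique(flat_components, return_inverse=True)
    row_ids = np.repeat(np.arange(len(components)), lengths)

    # one fancy-indexed assignment instead of nested Python loops
    X = np.zeros((len(components), len(unique_components)))
    X[row_ids, col_ids] = flat_compositions
    return pd.DataFrame(X, columns=unique_components)
```

Because `np.unique` returns its labels sorted, the column order should match the `sorted(...)` call in the loop-based version. Whether this is actually faster depends on the data, so it is worth timing both on a realistic input.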

Faster implementation of fractional encoding (similar to one-hot encoding)