Is there a smart way to apply an ordinal encoder (with different categories) to multiple variables?



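I have several variables with text values that I want to convert to numeric values with an ordinal encoder. However, the variables follow different ordering logic. For example: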

import pandas as pd
import numpy as np
d = {"attr1":["Excellent", "Fair", "Fair", "Good", "Poor"],
"attr4":["Fair", "Good", "Good", "Excellent", "Excellent"],
"attr2":["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
"attr3":["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"]}
data = pd.DataFrame(data = d)

Note that "attr1" and "attr4" share the same unique values. To convert the text values to numbers:

from sklearn.preprocessing import OrdinalEncoder
# Assign attributes to different lists based on the values
attr_list1 = ["attr1", "attr4"]
attr_list2 = ["attr2"]
attr_list3 = ["attr3"]
# Create categories to instruct how ordinal encoder should work
cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]
# Initialize the encoder
encoder1 = OrdinalEncoder(categories = [cat1])
encoder2 = OrdinalEncoder(categories = [cat2])
encoder3 = OrdinalEncoder(categories = [cat3])
def ord_encode(attr_list, encoder):
    # Encode every column in attr_list in place using the given encoder
    for attr in attr_list:
        data[attr] = encoder.fit_transform(data[[attr]])
    return data
data = ord_encode(attr_list1, encoder1)
data = ord_encode(attr_list2, encoder2)
data = ord_encode(attr_list3, encoder3)

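I find my solution very inefficient and clumsy. Imagine I had more than 20 attributes with 4 to 5 different sets of categories. Is there a smarter way to solve this?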

Thanks.

sklearn-pandas can be used to do this quickly. I would build a mapping from each column to its categories and then use DataFrameMapper to create the pipeline for me:

column_to_cat = {
"attr1": cat1,
"attr4": cat1,
"attr2": cat2,
"attr3": cat3
}
mapper_df = DataFrameMapper(
[
([col], OrdinalEncoder(categories = [cat])) for col, cat in column_to_cat.items()
],
df_out=True
)
mapper_df.fit_transform(data.copy())

Full code:

import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OrdinalEncoder
d = {"attr1":["Excellent", "Fair", "Fair", "Good", "Poor"],
"attr4":["Fair", "Good", "Good", "Excellent", "Excellent"],
"attr2":["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
"attr3":["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"]}
data = pd.DataFrame(data = d)
# Create categories to instruct how ordinal encoder should work
cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]
# Map each column to the ordered category list it should be encoded with
column_to_cat = {
"attr1": cat1,
"attr4": cat1,
"attr2": cat2,
"attr3": cat3
}

mapper_df = DataFrameMapper(
[
([col], OrdinalEncoder(categories = [cat])) for col, cat in column_to_cat.items()
],
df_out=True
)
mapper_df.fit_transform(data.copy())
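
If adding sklearn-pandas as a dependency is not desirable, the same column-to-category mapping can probably drive a single scikit-learn OrdinalEncoder, since its categories parameter accepts one list per input column. The following is a minimal sketch under that assumption; the names cols, encoder and encoded are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    "attr1": ["Excellent", "Fair", "Fair", "Good", "Poor"],
    "attr4": ["Fair", "Good", "Good", "Excellent", "Excellent"],
    "attr2": ["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
    "attr3": ["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"],
})

# Ordered category lists, lowest rank first
cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]

column_to_cat = {"attr1": cat1, "attr4": cat1, "attr2": cat2, "attr3": cat3}

# One encoder for all columns: categories[i] is the order used for cols[i]
cols = list(column_to_cat)
encoder = OrdinalEncoder(categories=[column_to_cat[c] for c in cols])

# Wrap the resulting array back into a DataFrame with the original labels
encoded = pd.DataFrame(
    encoder.fit_transform(data[cols]),
    columns=cols,
    index=data.index,
)
print(encoded)
```

With 20 or more attributes, only column_to_cat has to grow; the encoder construction stays the same.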

If I understand the question correctly, there is a more concise way.


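from sklearn.preprocessing import OrdinalEncoder
# df is assumed to be an existing DataFrame with the columns "Sex", "Blood" and "Study"
enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])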

Source: Using OrdinalEncoder to transform categorical values
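
One caveat worth checking before relying on a default OrdinalEncoder(): when no categories are given, the order is inferred from the data in sorted order (alphabetical for strings), which may not match the intended ranking. The sketch below compares the default with an explicit order on the question's "attr1" values; the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

attr1 = pd.DataFrame({"attr1": ["Excellent", "Fair", "Fair", "Good", "Poor"]})

# Default: categories are inferred from the data and sorted
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(attr1).ravel().tolist()
inferred_order = default_enc.categories_[0].tolist()

# Explicit: the list fixes the rank of each value, lowest first
ordered_enc = OrdinalEncoder(categories=[["Poor", "Fair", "Good", "Excellent"]])
ordered_codes = ordered_enc.fit_transform(attr1).ravel().tolist()

print(inferred_order)  # order chosen by the default encoder
print(default_codes)   # codes under the inferred order
print(ordered_codes)   # codes under the explicit order
```

Under the inferred order "Excellent" receives the lowest code and "Poor" the highest, so the explicit categories list is what preserves the Poor < Fair < Good < Excellent ranking.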
