Python: create a pandas DataFrame from a dictionary. Column names = the values of the dict; if a value is in the dict's values, then d...



I want to create a df starting from this data:

item_features = {'A': {1, 2, 3}, 'B':{7, 2, 1}, 'C':{3, 2}, 'D':{9, 11} }
pos = {'B', 'C'}
neg = {'A'}

I would like to obtain the following dataset:

   1  2  3  7  positive item_id
0  1  1  0  1         1       B
1  0  1  1  0         1       C
2  1  1  1  0         0       A

So I want the df to:

- always have its feature columns ordered by their number during creation. For example, in this case the order is 1, 2, 3, 4, and I want to be sure I never get an order like 4, 1, 3, 2.
- contain only the item_ids that are in one of the two sets (pos or neg).
- have the 'positive' column set to 1 if the item is positive, else 0.
- use the values in the item_features dictionary as the other column names, but only for the items that are in either pos or neg.
- have a value of 1 in a column if the corresponding column name is in the item_features value for that specific item (and 0 otherwise).

What is an efficient way to do this?

Use:

import pandas as pd

item_features = {'A': {1, 2, 3}, 'B': {4, 2, 1}, 'C': {3, 2}, 'D': {9, 11}}
pos = {'B', 'C'}
neg = {'A'}
# join the sets
both = pos.union(neg)
# create a Series, filter it by both (.loc needs a list, not a set),
# join each set into a '|'-separated string and create indicator columns
df = (pd.Series(item_features)
        .loc[list(both)]
        .apply(lambda x: '|'.join(map(str, x)))
        .str.get_dummies())

df['item_id'] = df.index
df['positive'] = df['item_id'].isin(pos).astype(int)
df = df.reset_index(drop=True)
# row order can vary between runs because sets are unordered
print(df)
   1  2  3  4 item_id  positive
0  0  1  1  0       C         1
1  1  1  0  1       B         1
2  1  1  1  0       A         0

If possible, use lists instead of sets:

import pandas as pd

item_features = {'A': {1, 2, 3}, 'B': {4, 2, 1}, 'C': {3, 2}, 'D': {9, 11}}
pos = ['B', 'C']
neg = ['A']
both = pos + neg
# create a Series, filter it by both and create indicator columns
df = (pd.Series(item_features)
        .loc[both]
        .apply(lambda x: '|'.join(map(str, x)))
        .str.get_dummies())
# get_dummies returns string column labels, so sort them by their integer value
df = df.sort_index(axis=1, key=lambda x: x.astype(int))
df['item_id'] = df.index
df['positive'] = df['item_id'].isin(pos).astype(int)
df = df.reset_index(drop=True)
print(df)
   1  2  3  4 item_id  positive
0  1  1  0  1       B         1
1  0  1  1  0       C         1
2  1  1  1  0       A         0
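
The numeric sort key matters once a feature has more than one digit. The following is a minimal sketch (the two-item Series is made-up illustration data, not part of the original answer) showing that str.get_dummies yields string labels in lexicographic order, and that sorting with an integer key restores numeric order:

```python
import pandas as pd

# Made-up illustration data: features already joined into '|'-separated strings
s = pd.Series({'B': '4|2|11', 'C': '3|2'})

# Column labels are strings, so '11' sorts before '2'
dummies = s.str.get_dummies()
lexicographic = list(dummies.columns)

# Sorting with an integer key gives the numeric order instead
numeric = list(dummies.sort_index(axis=1, key=lambda x: x.astype(int)).columns)

print(lexicographic)
print(numeric)
```

Note that the labels stay strings after sorting; only their order changes.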

EDIT: A solution with better performance:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

item_features = {'A': {1, 2, 3}, 'B': {4, 2, 11}, 'C': {3, 2}, 'D': {9, 11}}
pos = ['B', 'C']
neg = ['A']
both = pos + neg
# keep only the items in both, preserving their order
d = {k: item_features[k] for k in both}
# one indicator column per distinct feature; mlb.classes_ holds the sorted labels
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(d.values()), columns=mlb.classes_)
print(df)

df['item_id'] = list(d)
df['positive'] = df['item_id'].isin(pos).astype(int)
print(df)
   1  2  3  4  11 item_id  positive
0  0  1  0  1   1       B         1
1  0  1  1  0   0       C         1
2  1  1  1  0   0       A         0
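
If scikit-learn is not available, the same table can probably be built with plain Python and pandas. This is only a sketch under that assumption, not part of the original answer; the variable names features and rows are illustrative, and it has not been benchmarked against the versions above:

```python
import pandas as pd

item_features = {'A': {1, 2, 3}, 'B': {4, 2, 11}, 'C': {3, 2}, 'D': {9, 11}}
pos = ['B', 'C']
neg = ['A']
both = pos + neg

# All features used by the selected items, sorted numerically (hypothetical name)
features = sorted(set().union(*(item_features[k] for k in both)))

# One row per item: 1 if the feature is in that item's set, else 0
rows = [[int(f in item_features[k]) for f in features] for k in both]

df = pd.DataFrame(rows, columns=features)
df['item_id'] = both
df['positive'] = df['item_id'].isin(pos).astype(int)
print(df)
```

Because the column labels stay integers and are sorted before the DataFrame is created, the column order does not depend on set iteration order.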
