Python pandas: ensure every resource_id has a full set of tag_key rows, adding rows where they are missing



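I am organizing AWS resources for tagging and have captured the data in a CSV file. A sample of the CSV file is shown below. For each resource_id, I want to make sure that a required set of tag_key values is present. The required set is: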

tag_key

Application
Client
Environment
Name
Owner
Project
Purpose

I am new to pandas and have only managed to read the CSV file into a DataFrame:

import pandas as pd
file_name = "z.csv"
df = pd.read_csv(file_name, names=['resource_id', 'resource_type', 'tag_key', 'tag_value'])
print(df)

CSV file:

vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company

I expect the output to look like this:

vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
vol-00441b671ca48ba41,volume,Client,
vol-00441b671ca48ba41,volume,Owner,
vol-00441b671ca48ba41,volume,Application,
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company
i-1234567890abcdef0,instance,Application,
i-1234567890abcdef0,instance,Client,
i-1234567890abcdef0,instance,Name,
i-1234567890abcdef0,instance,Project,
i-1234567890abcdef0,instance,Purpose,

One way is to use MultiIndex.from_product and reindex:

taglist = ['Application',
           'Client',
           'Environment',
           'Name',
           'Owner',
           'Project',
           'Purpose']
# Reindex on every (resource_id, tag_key) combination; missing pairs become NaN rows
df_out = (df.set_index(['resource_id', 'tag_key'])
            .reindex(pd.MultiIndex.from_product([df['resource_id'].unique(), taglist],
                                                names=['resource_id', 'tag_key'])))
# Fill resource_type on the newly added rows from the same resource_id
df_out.assign(resource_type=df_out.groupby('resource_id')['resource_type']
                                  .ffill().bfill()).reset_index()

Output:

              resource_id      tag_key resource_type                tag_value
0   vol-00441b671ca48ba41  Application        volume                      NaN
1   vol-00441b671ca48ba41       Client        volume                      NaN
2   vol-00441b671ca48ba41  Environment        volume              Development
3   vol-00441b671ca48ba41         Name        volume           Database Files
4   vol-00441b671ca48ba41        Owner        volume                      NaN
5   vol-00441b671ca48ba41      Project        volume  Application Development
6   vol-00441b671ca48ba41      Purpose        volume               Web Server
7     i-1234567890abcdef0  Application      instance                      NaN
8     i-1234567890abcdef0       Client      instance                      NaN
9     i-1234567890abcdef0  Environment      instance               Production
10    i-1234567890abcdef0         Name      instance                      NaN
11    i-1234567890abcdef0        Owner      instance             Fast Company
12    i-1234567890abcdef0      Project      instance                      NaN
13    i-1234567890abcdef0      Purpose      instance                      NaN
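To get back to the CSV layout shown in the question, the reindexed frame can be written out with the original column order and with missing values left blank. The following is only a sketch: it inlines the sample data as a string instead of reading `z.csv`, and the names `csv_text`, `cols`, `full_index`, `result` and `csv_out` are illustrative, not part of the original answer. Note that the rows come out grouped by resource_id in `taglist` order rather than in the exact order shown in the question.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained
csv_text = """vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company
"""
cols = ['resource_id', 'resource_type', 'tag_key', 'tag_value']
df = pd.read_csv(io.StringIO(csv_text), names=cols)

taglist = ['Application', 'Client', 'Environment', 'Name',
           'Owner', 'Project', 'Purpose']

# Every (resource_id, tag_key) pair that should exist
full_index = pd.MultiIndex.from_product(
    [df['resource_id'].unique(), taglist],
    names=['resource_id', 'tag_key'])
df_out = df.set_index(['resource_id', 'tag_key']).reindex(full_index)

# Fill resource_type strictly within each resource_id group
df_out['resource_type'] = (
    df_out.groupby(level='resource_id')['resource_type']
          .transform(lambda s: s.ffill().bfill()))

# Restore the original column order
result = df_out.reset_index()[cols]

# na_rep='' writes missing tag values as empty fields
csv_out = result.to_csv(index=False, header=False, na_rep='')
print(csv_out)
```

Passing a file path as the first argument to `to_csv` would write the same content to disk instead of returning a string.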

Here is a slightly simpler example. I have a DataFrame df:

df = pd.DataFrame(data={'a': [1, 1, 2, 2], 'b': [[1, 2], [3, 5], [1, 2], [5]]})

which gives

   a       b
0  1  [1, 2]
1  1  [3, 5]
2  2  [1, 2]
3  2     [5]

The wanted values of b are 1, 2, 3, 4 and 5.

Then we need to find out what we "already have". We do this with:

def flatten(lsts):
    # Merge a sequence of lists into one flat list
    return [j for i in lsts for j in i]

df_new = df.groupby(by=['a'])['b'].apply(flatten)

which gives:

a
1    [1, 2, 3, 5]
2       [1, 2, 5]

Now we need to work out which values are missing for each group and add them as new rows:

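df_new = df_new.reset_index()
lst_wanted = [1, 2, 3, 4, 5]
for row in df_new.itertuples():
    for j in lst_wanted:
        if j not in row.b:
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            new_row = pd.DataFrame([{'a': row.a, 'b': j}])
            df = pd.concat([df, new_row], ignore_index=True)
print(df)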

which gives:

   a       b
0  1  [1, 2]
1  1  [3, 5]
2  2  [1, 2]
3  2     [5]
4  1       4
5  2       3
6  2       4
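The loop above grows the DataFrame one row at a time, which can get slow on larger inputs. As a possible alternative, the missing pairs could be computed in one step as a set difference between two MultiIndex objects. This is a sketch under the assumption that every element of the lists in `b` is an integer; the names `have`, `want`, `missing` and `df_full` are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame(data={'a': [1, 1, 2, 2], 'b': [[1, 2], [3, 5], [1, 2], [5]]})
lst_wanted = [1, 2, 3, 4, 5]

# One row per (a, b) pair that already exists; explode leaves b as object
# dtype, so cast it to int64 to match the wanted values
exploded = df.explode('b').astype({'b': 'int64'})
have = pd.MultiIndex.from_frame(exploded)

# Every (a, b) pair that should exist
want = pd.MultiIndex.from_product([df['a'].unique(), lst_wanted],
                                  names=['a', 'b'])

# Pairs that are wanted but not present yet, as a two-column DataFrame
missing = want.difference(have).to_frame(index=False)

# Append the missing rows to the original frame
df_full = pd.concat([df, missing], ignore_index=True)
print(df_full)
```

For this sample the added rows should be the same three (a, b) pairs as in the loop version: (1, 4), (2, 3) and (2, 4).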
