Updating/upserting a data table using Python



I would like some advice on how to update/insert new data into an existing data table using Python/Databricks:


# Inserting and updating already existing data
# Original data
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']}
df1 = pd.DataFrame(source_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']}
df2 = pd.DataFrame(new_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df2)
# What the updated table will look like
updated_data = {'Customer Number': ['1', '2', '3', '4'],
                'Colour': ['Blue', 'Blue', 'Green', 'Blue'],
                'Flow': ['Bad', 'Bad', 'Good', 'Bad']}
df3 = pd.DataFrame(updated_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df3)

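As you can see here, the original data has three customers. I then receive "new_data", which contains an update to customer 1's data and new data for customer 4, who is not in the original data. If you look at "updated_data", you can see what the final data should look like: customer 1's data has been updated and customer 4's data has been inserted.

Does anyone know where I should start? Which module could I use?

I'm not expecting anyone to solve this for me in terms of development; I just need a nudge in the right direction.

Edit: the data source is .txt or CSV and the output is JSON, but when I load the data into Cosmos DB it is converted automatically, so don't worry too much about that.

Thanks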

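Current data frame structure and "DataFrame.update"

With a little preparation you can use the pandas "update" method. First, the data frames must be indexed (which is often useful anyway). Second, the source data frame must be extended by the new indices, filled with dummy/NaN data, so that those rows can be updated.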

# set indices of original data frames
col = 'Customer Number'
df1.set_index(col, inplace=True)
df2.set_index(col, inplace=True)
df3.set_index(col, inplace=True)
# extend source data frame by new customer indices (new rows are filled with NaN)
df4 = df1.reindex(index=df1.index.union(df2.index))
# update data in place
df4.update(df2)
# verify that new approach yields correct results
# (all(df3 == df4) would only iterate over the column labels, so compare element-wise)
assert (df3 == df4).all().all()
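
The reindex-then-update steps above can also be collapsed into a single call to "DataFrame.combine_first". The following is only a minimal sketch under a few assumptions: the frames "old" and "new" are hypothetical stand-ins for the indexed df1 and df2, and missing (NaN) values in the new data are not meant to overwrite existing values, which is the same limitation "update" has.

```python
import pandas as pd

# Hypothetical frames mirroring the question's data, indexed by customer number
old = pd.DataFrame(
    {"Colour": ["Red", "Blue", "Green"], "Flow": ["Good", "Bad", "Good"]},
    index=pd.Index(["1", "2", "3"], name="Customer Number"),
)
new = pd.DataFrame(
    {"Colour": ["Blue", "Blue"], "Flow": ["Bad", "Bad"]},
    index=pd.Index(["1", "4"], name="Customer Number"),
)

# Values from `new` take priority; rows that exist only in `old` are kept.
# The result is indexed by the union of both indices.
upserted = new.combine_first(old)
print(upserted)
```

Calling it as new.combine_first(old) rather than old.combine_first(new) matters: the frame on the left wins wherever both frames hold a non-missing value.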

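Current data frame structure and "pd.concat"

A slightly simpler approach concatenates the data frames and removes duplicate rows (sorting by index if required). However, the temporary concatenation needs more memory, which may limit the size of the data frames.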

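df5 = pd.concat([df1, df2])
# keep the last occurrence of each customer so that rows from df2 win
df5 = df5.loc[~df5.index.duplicated(keep='last')].sort_index()
# compare element-wise; all(df3 == df5) would only iterate over the column labels
assert (df3 == df5).all().all()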
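
If the customer number is kept as a regular column instead of the index, the same concatenate-and-deduplicate idea can be sketched with "drop_duplicates". The helper name "upsert_by_key" and the frames "old" and "new" are made up for this illustration and are not part of the question.

```python
import pandas as pd


def upsert_by_key(old, new, key):
    """Return `old` with rows from `new` added or replaced, matched on column `key`."""
    combined = pd.concat([old, new], ignore_index=True)
    # keep="last" lets rows from `new` win over rows from `old` with the same key
    deduped = combined.drop_duplicates(subset=key, keep="last")
    # Keys are strings here, so the sort is lexicographic ("10" sorts before "2")
    return deduped.sort_values(key).reset_index(drop=True)


# Hypothetical frames mirroring the question's data
old = pd.DataFrame({
    "Customer Number": ["1", "2", "3"],
    "Colour": ["Red", "Blue", "Green"],
    "Flow": ["Good", "Bad", "Good"],
})
new = pd.DataFrame({
    "Customer Number": ["1", "4"],
    "Colour": ["Blue", "Blue"],
    "Flow": ["Bad", "Bad"],
})

result = upsert_by_key(old, new, "Customer Number")
print(result)
```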

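Alternative data structure

Given that "Customer Number" is the key attribute of your data, you might also consider restructuring your original dictionary like this:

{'1': ['Red', 'Good'], '2': ['Blue', 'Bad'], '3': ['Green', 'Good']}

Updating your data then simply amounts to (re)setting the keys of the source data with the new data. Working directly with dictionaries is usually faster than working with data frames.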

# define function to restructure data, for demonstration purposes only
def restructure(data):
    # transpose original data
    # https://stackoverflow.com/a/6473724/5350621
    vals = data.values()
    rows = list(map(list, zip(*vals)))
    # create new restructured dictionary with customers as keys
    restructured = dict()
    for row in rows:
        restructured[row[0]] = row[1:]
    return restructured
# restructure data
source_restructured = restructure(source_data)
new_restructured = restructure(new_data)
# simply (re)set new keys
final_restructured = source_restructured.copy()
for key, val in new_restructured.items():
    final_restructured[key] = val
# convert to data frame and check results
df6 = pd.DataFrame(final_restructured, index=['Colour', 'Flow']).T
# compare element-wise; all(df3 == df6) would only iterate over the column labels
assert (df3 == df6).all().all()
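
The loop that (re)sets the keys can also be written with Python's built-in dictionary merging. This is just a small sketch; the literal dictionaries below are assumed to be what restructure() returns for the question's data.

```python
# Assumed output of restructure() for the source and new data in the question
source = {"1": ["Red", "Good"], "2": ["Blue", "Bad"], "3": ["Green", "Good"]}
new = {"1": ["Blue", "Bad"], "4": ["Blue", "Bad"]}

# Unpacking both dictionaries builds a new one; keys in `new` override keys in `source`
merged = {**source, **new}

# dict.update does the same thing in place, here on a shallow copy of the source
merged_in_place = dict(source)
merged_in_place.update(new)

print(merged)
```

Both variants leave the original "source" dictionary untouched, which is handy if the old state is still needed for comparison.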

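PS: When setting "df1 = pd.DataFrame(source_data, columns=[…])", you do not need the "columns" argument, because the dictionary keys are already well named and are automatically used as column names.

You can use a set intersection to find the customer numbers to update and a set difference to find the new customer numbers to add.

Then you can first update the rows of the initial data frame for the customer numbers in the intersection, and afterwards concatenate the initial data frame with only the new rows from the data frame holding the new values.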

# this assumes df1 and df2 still have 'Customer Number' as a regular column
# same name column for clarity
cn = 'Customer Number'
# convert Customer Number values into integer to use set
CusNum_df1 = [int(x) for x in df1[cn].values]
CusNum_df2 = [int(x) for x in df2[cn].values]
# find Customer Numbers to update and to add
CusNum_to_update = list(set(CusNum_df1).intersection(set(CusNum_df2)))
CusNum_to_add = list(set(CusNum_df2) - set(CusNum_df1))
# update rows in initial data frame
for num in CusNum_to_update:
    index_initial = df1.loc[df1[cn] == str(num)].index[0]
    index_new = df2.loc[df2[cn] == str(num)].index[0]
    for col in df1.columns:
        df1.at[index_initial, col] = df2.loc[index_new, col]
# concatenate new rows to initial data frame
for num in CusNum_to_add:
    df1 = pd.concat([df1, df2.loc[df2[cn] == str(num)]]).reset_index(drop=True)

Output:
Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad
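
The same intersection/difference idea can be sketched without explicit Python loops by using pandas Index set operations and "isin". The frames "old" and "new" below are hypothetical copies of the question's data, with the customer number kept as a regular column.

```python
import pandas as pd

key = "Customer Number"
# Hypothetical frames mirroring the question's data
old = pd.DataFrame({
    key: ["1", "2", "3"],
    "Colour": ["Red", "Blue", "Green"],
    "Flow": ["Good", "Bad", "Good"],
})
new = pd.DataFrame({
    key: ["1", "4"],
    "Colour": ["Blue", "Blue"],
    "Flow": ["Bad", "Bad"],
})

old_keys = pd.Index(old[key])
new_keys = pd.Index(new[key])
to_update = old_keys.intersection(new_keys)  # customers present in both frames
to_add = new_keys.difference(old_keys)       # customers only in the new data

# Keep the old rows that are not being replaced, then append every new row
untouched = old[~old[key].isin(to_update)]
result = (
    pd.concat([untouched, new], ignore_index=True)
    .sort_values(key)
    .reset_index(drop=True)
)
print(result)
```

Here "to_add" is computed only to show which customers are new; since every row of "new" is either an update or an insert, appending the whole frame covers both cases.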

There are many ways to do this, but in terms of readability I prefer the following.

import pandas as pd
dict_source = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']}
df_origin = pd.DataFrame.from_dict(dict_source)
dict_new = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']}
df_new = pd.DataFrame.from_dict(dict_new)
df_result = df_origin.copy()
df_result.set_index(['Customer Number'], inplace=True)
df_new.set_index(['Customer Number'], inplace=True)
df_result.update(df_new)  # update number 1
# handle number 4
df_result.reset_index(['Customer Number'], inplace=True)
df_new.reset_index(['Customer Number'], inplace=True)
df_result = df_result.merge(df_new, on=list(df_result), how='outer')
print(df_result)

Output:
Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad

You can use 'Customer Number' as the index and use the update method:

import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']}
df1 = pd.DataFrame(source_data, index=source_data['Customer Number'], columns=['Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']}
df2 = pd.DataFrame(new_data, index=new_data['Customer Number'], columns=['Colour', 'Flow'])
print(df2)
df3 = df1.reindex(index=df1.index.union(df2.index))
df3.update(df2)
print(df3)

Output:
Colour  Flow
1   Blue   Bad
2   Blue   Bad
3  Green  Good
4   Blue   Bad
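
Since the question mentions CSV input and JSON output, here is a rough end-to-end sketch of the reindex-and-update approach under those conditions. The CSV contents are made up to match the example data and are read from in-memory strings; with real files you would pass a path to "pd.read_csv" instead.

```python
import io
import json

import pandas as pd

# Hypothetical CSV contents standing in for the .txt/CSV files from the question
source_csv = "Customer Number,Colour,Flow\n1,Red,Good\n2,Blue,Bad\n3,Green,Good\n"
new_csv = "Customer Number,Colour,Flow\n1,Blue,Bad\n4,Blue,Bad\n"

key = "Customer Number"
# dtype=str keeps the customer numbers as strings, as in the examples above
df_source = pd.read_csv(io.StringIO(source_csv), dtype=str).set_index(key)
df_new = pd.read_csv(io.StringIO(new_csv), dtype=str).set_index(key)

# Add empty rows for new customers, then overwrite with the new values
df_out = df_source.reindex(df_source.index.union(df_new.index))
df_out.update(df_new)

# Serialise as a JSON array with one object per customer
records_json = df_out.reset_index().to_json(orient="records")
records = json.loads(records_json)
print(records_json)
```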
