What is the best/fastest solution for the following task?

Technology used: MySQL database + Python.

I am downloading a data.sql file. Its format:
```sql
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
...
```

The values differ in each row.
Table structure: http://sqlfiddle.com/#!9/8f10d6

- A person can own multiple cities.
- A person can be in at most one group, or in no group at all.
- A group can contain multiple people.

I know which country the data in the .sql file comes from.
I need to split this data into 3 tables. I want to update rows that already exist in the tables and insert new rows where they do not. So I came up with two solutions:

1. Split the values out of the file in Python, then execute 3x SELECT plus 3x UPDATE/INSERT per row inside a transaction.
2. Somehow bulk-insert the data into a temporary table and process it inside the database, meaning that for each row of the temporary table I would run 3 SELECT queries (one per real table), and if the row is found send 3 UPDATE queries, otherwise run INSERT queries.

I will run this routine several times a day; the .sql file has more than 10K rows, and it will update/create more than 30K rows in the database.
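A minimal sketch of the splitting step from option 1. It assumes every statement sits on one line in exactly the format shown above, and that the quoted names contain no commas; a real parser would need to handle escaping.

```python
import re

# Pull the eleven values out of each INSERT line with a regex.
# Assumes one statement per line, exactly the format shown above,
# and no commas inside the quoted string columns.
ROW_RE = re.compile(r"INSERT INTO `temp_table` VALUES \((.*)\);")

def parse_line(line):
    match = ROW_RE.search(line)
    if match is None:
        return None  # not an INSERT line (e.g. a COMMIT statement)
    values = []
    for raw in match.group(1).split(","):
        raw = raw.strip()
        if raw.startswith("'") and raw.endswith("'"):
            values.append(raw[1:-1])  # quoted string column
        else:
            values.append(int(raw))   # numeric column
    return values
```

The returned list follows the column order of temp_table, so the values can be passed straight to parameterized INSERT/UPDATE statements.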
// Edit

My current insert/update code:
```python
autocommit = "SET autocommit=0"
with connection.cursor() as cursor:
    cursor.execute(autocommit)

# Execute every INSERT from the downloaded file against temp_table
with open("data.sql") as sql_file:
    lines = sql_file.read().splitlines()
for line in lines:
    with connection.cursor() as cursor:
        cursor.execute(line)

with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM temp_table")
    temp_data = cursor.fetchall()

for temp_row in temp_data:
    group_id = temp_row[0]
    city_id = temp_row[1]
    zip_code = temp_row[2]
    post_code = temp_row[3]
    earnings = temp_row[4]
    group_name = temp_row[5]
    votes = temp_row[6]
    city_name = temp_row[7]
    person_id = temp_row[8]
    person_name = temp_row[9]
    networth = temp_row[10]

    group_select = "SELECT * FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
    group_values = (group_id, countryid)
    with connection.cursor() as cursor:
        row = cursor.execute(group_select, group_values)  # pymysql returns the row count
    if row == 0 and group_id != 0:  # if the person has no group, do not create one
        group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s)"
        group_insert_values = (group_id, group_name, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_insert, group_insert_values)
            groupid = cursor.lastrowid
    elif row == 1 and group_id != 0:
        group_update = "UPDATE perm_group SET group_name = %s WHERE group_id = %s AND countryid_fk = %s"
        group_update_values = (group_name, group_id, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_update, group_update_values)
        # Select the id for the current row to assign the correct group to the person
        group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
        group_certain_select_values = (group_id, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_certain_select, group_certain_select_values)
            groupid = cursor.fetchone()
    # .
    # .
    # .
    # The same piece of code is repeated for person and city
```
Measured time: 206 seconds, which is unacceptable.
```python
group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s) ON DUPLICATE KEY UPDATE group_id = %s, group_name = %s"
group_insert_values = (group_id, group_name, countryid, group_id, group_name)
with connection.cursor() as cursor:
    cursor.execute(group_insert, group_insert_values)

# Select the id for the current row to assign the correct group to the person
group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
group_certain_select_values = (group_id, countryid)
with connection.cursor() as cursor:
    cursor.execute(group_certain_select, group_certain_select_values)
    groupid = cursor.fetchone()
```
Measured time: 30 to 50 seconds. (Still long, but it is getting better.)
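One further step in the same direction is to stop issuing one statement per row and batch the upserts with executemany(). The sketch below assumes a pymysql-style connection, the temp_table column order above, and a unique key over (group_id, countryid_fk) so that ON DUPLICATE KEY UPDATE fires.

```python
# Batched upsert sketch: executemany() sends the rows in far fewer
# round trips than one execute() per row.
UPSERT_GROUP = (
    "INSERT INTO perm_group (group_id, group_name, countryid_fk) "
    "VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE group_name = VALUES(group_name)"
)

def group_params(temp_rows, countryid):
    """One parameter tuple per row, skipping people without a group."""
    return [(row[0], row[5], countryid) for row in temp_rows if row[0] != 0]

def chunked(items, size=1000):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def upsert_groups(connection, temp_rows, countryid):
    params = group_params(temp_rows, countryid)
    with connection.cursor() as cursor:
        for batch in chunked(params, 1000):
            cursor.executemany(UPSERT_GROUP, batch)
    connection.commit()
```

The same pattern applies to the person and city tables, each with its own column mapping.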
Are there any other, better (faster) options for doing this?

Thanks for any advice.
I suggest you load the data into a temporary table and do the processing in SQL.

Fundamentally, your end result is a set of SQL tables, so SQL is necessarily part of the solution. Pushing as much of the logic as possible into the database also reduces the number of tools you need.
Loading 10,000 rows should not take much time. However, if you have a choice of data format, I would recommend a CSV file over INSERT statements. The INSERTs incur extra overhead, if only because they are bigger.
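If the data does arrive as CSV, the bulk load can be done with LOAD DATA LOCAL INFILE. A sketch, assuming local_infile is enabled on both the server and the client (e.g. pymysql.connect(..., local_infile=True)) and a comma-separated file matching the temp_table columns:

```python
def build_load_sql(path, table="temp_table"):
    # LOAD DATA does not accept the file path as a bound parameter, so it
    # is interpolated here; the path must come from trusted configuration.
    return (
        "LOAD DATA LOCAL INFILE '{path}' "
        "INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' "
        "OPTIONALLY ENCLOSED BY \"'\" "
        "LINES TERMINATED BY '\\n'"
    ).format(path=path, table=table)

def load_csv(connection, path):
    with connection.cursor() as cursor:
        cursor.execute(build_load_sql(path))
    connection.commit()
```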
Once the data is in the database, I would not worry much about the processing time for storing it into the three tables.
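That processing step can be set-based rather than row-by-row: one INSERT ... SELECT per permanent table, with ON DUPLICATE KEY UPDATE refreshing rows that already exist. A sketch for perm_group only (perm_person and perm_city would follow the same pattern with their own, here unspecified, columns), again assuming a unique key over (group_id, countryid_fk):

```python
# One statement replaces the 10K+ per-row round trips for this table.
SYNC_GROUPS = (
    "INSERT INTO perm_group (group_id, group_name, countryid_fk) "
    "SELECT DISTINCT group_id, group_name, %s FROM temp_table "
    "WHERE group_id != 0 "
    "ON DUPLICATE KEY UPDATE group_name = VALUES(group_name)"
)

def sync_from_temp(connection, countryid):
    with connection.cursor() as cursor:
        cursor.execute(SYNC_GROUPS, (countryid,))
        # ... same INSERT ... SELECT pattern for perm_person and perm_city ...
    connection.commit()
```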