What is the best/fastest solution for the following task?

Technology used: MySQL database + Python.

I am downloading a data.sql file. Its format:
```sql
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
...
```

The values differ in each row.
Table structure: http://sqlfiddle.com/#!9/8f10d6

- A person can own multiple cities.
- A person can be in at most one group, or in no group at all.
- A group can contain multiple people.

I know which country the data in the .sql file comes from.
I need to split this data into 3 tables. I want to update rows that already exist in the tables and insert new rows where they do not. So I came up with two solutions:

1. Split the values out of the file in Python, then execute 3x SELECT plus 3x UPDATE/INSERT per row inside a transaction.
2. Somehow bulk-insert the data into a temporary table and process it inside the database, meaning that for each row of the temporary table I would run 3 SELECT queries (one per real table), and if the row is found send 3 UPDATE queries, otherwise run INSERT queries.

I will run this routine several times a day; the .sql file has more than 10K rows, and it will update/create more than 30K rows in the database.
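A minimal sketch of the splitting step from option 1. It assumes every statement sits on one line in exactly the format shown above, and that the quoted names contain no commas; a real parser would need to handle escaping.

```python
import re

# Pull the eleven values out of each INSERT line with a regex.
# Assumes one statement per line, exactly the format shown above,
# and no commas inside the quoted string columns.
ROW_RE = re.compile(r"INSERT INTO `temp_table` VALUES \((.*)\);")

def parse_line(line):
    match = ROW_RE.search(line)
    if match is None:
        return None  # not an INSERT line (e.g. a COMMIT statement)
    values = []
    for raw in match.group(1).split(","):
        raw = raw.strip()
        if raw.startswith("'") and raw.endswith("'"):
            values.append(raw[1:-1])  # quoted string column
        else:
            values.append(int(raw))   # numeric column
    return values
```

The returned list follows the column order of temp_table, so the values can be passed straight to parameterized INSERT/UPDATE statements.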
// Edit

My current insert/update code:
```python
autocommit = "SET autocommit=0"
with connection.cursor() as cursor:
    cursor.execute(autocommit)

# Execute every INSERT from the downloaded file against temp_table
with open("data.sql") as sql_file:
    lines = sql_file.read().splitlines()
for line in lines:
    with connection.cursor() as cursor:
        cursor.execute(line)

with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM temp_table")
    temp_data = cursor.fetchall()

for temp_row in temp_data:
    group_id = temp_row[0]
    city_id = temp_row[1]
    zip_code = temp_row[2]
    post_code = temp_row[3]
    earnings = temp_row[4]
    group_name = temp_row[5]
    votes = temp_row[6]
    city_name = temp_row[7]
    person_id = temp_row[8]
    person_name = temp_row[9]
    networth = temp_row[10]

    group_select = "SELECT * FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
    group_values = (group_id, countryid)
    with connection.cursor() as cursor:
        row = cursor.execute(group_select, group_values)  # pymysql returns the row count
    if row == 0 and group_id != 0:  # if the person has no group, do not create one
        group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s)"
        group_insert_values = (group_id, group_name, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_insert, group_insert_values)
            groupid = cursor.lastrowid
    elif row == 1 and group_id != 0:
        group_update = "UPDATE perm_group SET group_name = %s WHERE group_id = %s AND countryid_fk = %s"
        group_update_values = (group_name, group_id, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_update, group_update_values)
        # Select the id for the current row to assign the correct group to the person
        group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
        group_certain_select_values = (group_id, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_certain_select, group_certain_select_values)
            groupid = cursor.fetchone()
    # .
    # .
    # .
    # The same piece of code is repeated for person and city
```
Measured time: 206 seconds, which is unacceptable.
```python
group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s) ON DUPLICATE KEY UPDATE group_id = %s, group_name = %s"
group_insert_values = (group_id, group_name, countryid, group_id, group_name)
with connection.cursor() as cursor:
    cursor.execute(group_insert, group_insert_values)

# Select the id for the current row to assign the correct group to the person
group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
group_certain_select_values = (group_id, countryid)
with connection.cursor() as cursor:
    cursor.execute(group_certain_select, group_certain_select_values)
    groupid = cursor.fetchone()
```
Measured time: 30 to 50 seconds. (Still long, but it is getting better.)
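One further step in the same direction is to stop issuing one statement per row and batch the upserts with executemany(). The sketch below assumes a pymysql-style connection, the temp_table column order above, and a unique key over (group_id, countryid_fk) so that ON DUPLICATE KEY UPDATE fires.

```python
# Batched upsert sketch: executemany() sends the rows in far fewer
# round trips than one execute() per row.
UPSERT_GROUP = (
    "INSERT INTO perm_group (group_id, group_name, countryid_fk) "
    "VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE group_name = VALUES(group_name)"
)

def group_params(temp_rows, countryid):
    """One parameter tuple per row, skipping people without a group."""
    return [(row[0], row[5], countryid) for row in temp_rows if row[0] != 0]

def chunked(items, size=1000):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def upsert_groups(connection, temp_rows, countryid):
    params = group_params(temp_rows, countryid)
    with connection.cursor() as cursor:
        for batch in chunked(params, 1000):
            cursor.executemany(UPSERT_GROUP, batch)
    connection.commit()
```

The same pattern applies to the person and city tables, each with its own column mapping.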
Are there any other, better (faster) options for doing this?

Thanks for any advice.
I suggest you load the data into a temporary table and do the processing in SQL.

Fundamentally, your end result is a set of SQL tables, so SQL is necessarily part of the solution. Pushing as much of the logic as possible into the database also reduces the number of tools you need.
Loading 10,000 rows should not take much time. However, if you have a choice of data format, I would recommend a CSV file over INSERT statements. The INSERTs incur extra overhead, if only because they are bigger.
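If the data does arrive as CSV, the bulk load can be done with LOAD DATA LOCAL INFILE. A sketch, assuming local_infile is enabled on both the server and the client (e.g. pymysql.connect(..., local_infile=True)) and a comma-separated file matching the temp_table columns:

```python
def build_load_sql(path, table="temp_table"):
    # LOAD DATA does not accept the file path as a bound parameter, so it
    # is interpolated here; the path must come from trusted configuration.
    return (
        "LOAD DATA LOCAL INFILE '{path}' "
        "INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' "
        "OPTIONALLY ENCLOSED BY \"'\" "
        "LINES TERMINATED BY '\\n'"
    ).format(path=path, table=table)

def load_csv(connection, path):
    with connection.cursor() as cursor:
        cursor.execute(build_load_sql(path))
    connection.commit()
```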
Once the data is in the database, I would not worry much about the processing time for storing it into the three tables.
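That processing step can be set-based rather than row-by-row: one INSERT ... SELECT per permanent table, with ON DUPLICATE KEY UPDATE refreshing rows that already exist. A sketch for perm_group only (perm_person and perm_city would follow the same pattern with their own, here unspecified, columns), again assuming a unique key over (group_id, countryid_fk):

```python
# One statement replaces the 10K+ per-row round trips for this table.
SYNC_GROUPS = (
    "INSERT INTO perm_group (group_id, group_name, countryid_fk) "
    "SELECT DISTINCT group_id, group_name, %s FROM temp_table "
    "WHERE group_id != 0 "
    "ON DUPLICATE KEY UPDATE group_name = VALUES(group_name)"
)

def sync_from_temp(connection, countryid):
    with connection.cursor() as cursor:
        cursor.execute(SYNC_GROUPS, (countryid,))
        # ... same INSERT ... SELECT pattern for perm_person and perm_city ...
    connection.commit()
```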