Slow code for "inner joins" on lists in Python

I have seen several posts here about Python lists, but I haven't found the right answer to my question, because it is about optimizing code.

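I have Python code that compares two lists. It has to find the matching codes and modify the value in the second position. It does produce the correct result in the end, but it takes a long time. In SQL this query takes 2 minutes, no more, but here it takes 15 minutes, so I can't tell whether this is a memory problem or badly written code.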

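I have two lists. The first holds [code, points] and the second holds [code, license]. If the first value (code) of an entry in the first list matches the first value (code) of an entry in the second list, and the license equals 'THIS', then the second value (points) of the first-list entry must be updated. For example: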

itemswithscore = [5675, 0], [6676, 0], [9898, 0], [4545, 0]
itemswithlicense = [9999, 'ATR'], [9191, 'OPOP'], [9898, 'THIS'], [2222, 'PLPL']
for sublist1 in itemswithscore:
    for sublist2 in itemswithlicense:
        if sublist1[0] == sublist2[0]: #this is the "inner join" :)
            if sublist2[1] == 'THIS': #It has to be license 'THIS'
                sublist1[1] += 50 #I add 50 to the score value

In the end I get this list, updated at code 9898:

itemswithscore = [5675, 0], [6676, 0], [9898, 50], [4545, 0]

The real lists have 80,000 entries each... :(

Thanks in advance!!

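I suggest converting the data structures to dictionaries (or keeping them as dictionaries in the first place). That way you don't need nested for loops, an O(n²) or O(n × m) operation that walks both lists to find where the code numbers line up, before updating the score value.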

You simply update the score wherever the same key in the license dictionary maps to the search string:

dct_score = dict(itemswithscore)
dct_license = dict(itemswithlicense)
for k in dct_score:
    if dct_license.get(k) == 'THIS': # use dict.get in case key does not exist
        dct_score[k] += 50
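
If the original list-of-lists layout has to be kept, a related variant is to index only the license list and then walk the score list once. The following is a minimal sketch rather than the code above: the helper name `add_license_bonus` and its parameters are made up for illustration, and `collections.Counter` is used on the assumption that a code listed several times with the license should get the bonus once per occurrence, as the nested loops in the question would do.

```python
from collections import Counter


def add_license_bonus(scores, licenses, wanted="THIS", bonus=50):
    """Add `bonus` to scores[i][1] once per matching (code, wanted) pair.

    Hypothetical helper for illustration; mutates `scores` in place.
    """
    # Count how often each code carries the wanted license: one O(m) pass.
    hits = Counter(code for code, lic in licenses if lic == wanted)
    # One O(n) pass over the scores; a missing code counts as 0 hits.
    for entry in scores:
        entry[1] += bonus * hits[entry[0]]
    return scores


itemswithscore = [[5675, 0], [6676, 0], [9898, 0], [4545, 0]]
itemswithlicense = [[9999, 'ATR'], [9191, 'OPOP'], [9898, 'THIS'], [2222, 'PLPL']]
add_license_bonus(itemswithscore, itemswithlicense)
print(itemswithscore)  # [[5675, 0], [6676, 0], [9898, 50], [4545, 0]]
```

With n scores and m licenses this does roughly n + m steps instead of n × m, which for two lists of 80,000 entries is the difference between about 160,000 steps and 6.4 billion comparisons.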

If you can use pandas, that would be very efficient.

You can create two DataFrames and merge them on a common column.

Something like this:

import pandas as pd

itemswithscore = [5675, 0], [6676, 0], [9898, 0], [4545, 0]
itemswithlicense = [9999, 'ATR'], [9191, 'OPOP'], [9898, 'THIS'], [2222, 'PLPL']
df1 = pd.DataFrame(list(itemswithscore), columns=['code', 'points'])
df2 = pd.DataFrame(list(itemswithlicense), columns=['code', 'license'])
# Inner join on 'code': keeps only the codes present in both lists
df3 = pd.merge(df1, df2, on='code', how='inner')
# Drop 'points', leaving each matched code with its license
df3 = df3.drop('points', axis=1)

Hope this helps. If it is correct, please accept the answer.

Cheers!

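I'm pretty sure the slowness is mainly due to the loops themselves, which are not very fast in Python. You can speed the code up a little by caching variables, like this: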

for sublist1 in itemswithscore:
    a = sublist1[0]  # Save to variable to avoid repeated list-lookup
    for sublist2 in itemswithlicense:
        if a == sublist2[0]:
            if sublist2[1] == 'THIS':
                sublist1[1] += 50

Also, if you happen to know that 'THIS' does not occur more than once in itemswithlicense, you should insert a break after updating sublist1[1].
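
A sketch of that early exit, using the sample data from the question. It assumes that no code has a second matching 'THIS' entry in itemswithlicense; if one did, the break would skip it and the result would differ from the original loops.

```python
itemswithscore = [[5675, 0], [6676, 0], [9898, 0], [4545, 0]]
itemswithlicense = [[9999, 'ATR'], [9191, 'OPOP'], [9898, 'THIS'], [2222, 'PLPL']]

for sublist1 in itemswithscore:
    a = sublist1[0]  # cache the code to avoid repeated list lookups
    for sublist2 in itemswithlicense:
        if a == sublist2[0] and sublist2[1] == 'THIS':
            sublist1[1] += 50
            break  # assumption: no further match exists for this code

print(itemswithscore)  # [[5675, 0], [6676, 0], [9898, 50], [4545, 0]]
```

Note that the break only shortens the inner loop for codes that do match; a code with no match still scans the whole license list.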

Let me know how much of a difference this makes.
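
One way to measure it is a small timing harness like the sketch below. Everything in it is an assumption made for illustration: the synthetic data, the size `n`, the license names other than 'THIS', and the function names `nested_loops` and `set_lookup`. The second function is a set-based lookup added only as a point of comparison, and the codes are generated unique so that both functions produce the same result.

```python
import random
import time


def nested_loops(scores, licenses):
    # The cached-variable version of the original double loop.
    for s in scores:
        a = s[0]
        for lic in licenses:
            if a == lic[0] and lic[1] == 'THIS':
                s[1] += 50


def set_lookup(scores, licenses):
    # Collect the codes licensed as 'THIS' once, then test membership.
    wanted = {code for code, lic in licenses if lic == 'THIS'}
    for s in scores:
        if s[0] in wanted:
            s[1] += 50


random.seed(0)
n = 1000  # deliberately small; the question has about 80,000 entries per list
codes = random.sample(range(10 * n), n)  # unique codes
licenses = [[c, random.choice(['THIS', 'ATR', 'OPOP'])] for c in codes]

results = {}
for func in (nested_loops, set_lookup):
    scores = [[c, 0] for c in codes]  # fresh copy for each run
    start = time.perf_counter()
    func(scores, licenses)
    elapsed = time.perf_counter() - start
    results[func.__name__] = scores
    print(f"{func.__name__}: {elapsed:.4f} s")
```

Absolute timings depend on the machine, so the numbers are only meaningful relative to each other; raising `n` shows how quickly the nested version grows.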
