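My Python code takes more than 8 hours to iterate over large data

Update: it has now been more than 24 hours and the code still has not finished :)

I have the Python code below. Basically, this code currently uses only 1% of the dataset (which is why it is called sample). It has 32,968 rows of plain names. I removed the punctuation and lowercased everything.

My problem is that this code has been running for 8 hours so far and has not finished. As mentioned, since I am only using 1% of the data, I will later need to run this code again on the entire dataset, which would take 100 times longer. I don't think waiting 800 hours is a good idea. So, my question:

Is there any way to make it faster? Should I learn Spark or MapReduce and try to use them for this code?

Edit: OK, I'll try to add more information about what the code actually does. Here is an example of the names before cleaning: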

import pandas as pd
import numpy as np
data = {'clean_name': ['Abbott Laboratories','Apple Computers', 'Apple, Inc.', 'Abercrombie & Fitch Co.', 'ABM Industries Incorporated', 'Ace Hardware Corporation'], 'name_group': np.zeros(6, dtype=int)}
sample = pd.DataFrame(data)
sample
Out[2]: 
                    clean_name  name_group
0          Abbott Laboratories           0
1              Apple Computers           0
2                  Apple, Inc.           0
3      Abercrombie & Fitch Co.           0
4  ABM Industries Incorporated           0
5     Ace Hardware Corporation           0

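Then I stripped the punctuation and lowercased everything. Basically, I want to compare each name with every other name and, if two names are similar, give them the same group number. Something like this: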

sample
Out[28]: 
                    clean_name  name_group
0          abbott laboratories           0
1              apple computers           1
2                  apple  inc            1
3      abercrombie   fitch co            0
4  abm industries incorporated           0
5     ace hardware corporation           0

The code below is what I came up with:

import itertools
from fuzzywuzzy import fuzz  # assumed source of fuzz; the post does not show the import

i = 1  # next unused group number
for alpha, beta in itertools.combinations(sample.clean_name, 2):
    score = fuzz.token_sort_ratio(alpha, beta)
    # Current group numbers of the two names (0 means "not grouped yet")
    A = sample.loc[sample.clean_name==alpha, 'name_group'].values[0]
    B = sample.loc[sample.clean_name==beta, 'name_group'].values[0]
    if score > 60:
        if ((B != 0) & (A !=0)): continue
        if ((A == 0) & (B !=0)): A = B
        elif ((B == 0) & (A !=0)): B = A
        elif ((B == 0) & (A ==0)):
            A, B = i, i
            i += 1
        sample.loc[sample.clean_name==alpha, 'name_group'] = A
        sample.loc[sample.clean_name==beta, 'name_group'] = B

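Using itertools.combinations on 32k rows will certainly make the code slow. Here is an approach that uses numpy instead of pandas, on a smaller dataset, to achieve the following goals:

Implement a function that groups company names by some condition (a predicate)
Make the function faster than the posted implementation

Use this post to approach your problem from a different angle.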
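As a rough check on why the pairwise loop is slow, it can help to count the pairs that itertools.combinations produces. The sketch below uses the row count from the question and assumes the full dataset is exactly 100 times the sample; the variable names are illustrative only.

```python
import math

n_sample = 32968                        # rows in the 1% sample (from the question)
n_full = n_sample * 100                 # assumption: full dataset is 100x the sample

pairs_sample = math.comb(n_sample, 2)   # unordered pairs compared in the sample
pairs_full = math.comb(n_full, 2)       # unordered pairs in the full dataset

print(pairs_sample)                     # 543428028
print(round(pairs_full / pairs_sample)) # 10000
```

Because the number of pairs grows quadratically, 100 times more rows means roughly 10,000 times more comparisons, not 100 times.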

Given

Here we build a small list of company names A, B, C and Aa:

import itertools as it
import collections as ct
import numpy as np

companies = "A B C Aa".split()

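Code

Step 1

First, we create a 2D array whose horizontal and vertical indices are the same company names. The cells of the matrix hold the joined company names: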

# 1. Build a 2D array of joined strings
def get_2darray(seq):
    """Return a 2D array of identical axes."""
    x = np.array(seq)
    y = np.array(seq)
    xx = x[:, np.newaxis]                                  # column vector
    yy = y[np.newaxis, :]                                  # row vector
    # np.char.add is the public alias of np.core.defchararray.add
    return np.char.add(xx, yy)                             # ref 001

Demo

arr = get_2darray(companies)
arr
# array([['AA', 'AB', 'AC', 'AAa'],
#        ['BA', 'BB', 'BC', 'BAa'],
#        ['CA', 'CB', 'CC', 'CAa'],
#        ['AaA', 'AaB', 'AaC', 'AaAa']], 
#       dtype='<U4')

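Step 2

Second, we implement a group function for enumerating similar companies. Given a 2D array, a helper function (func) is used to "transform" each element into a group number: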

# 2. Group companies by "similarity", e.g. "AB" == "BA"
def group(func, arr, pred=None, verbose=False):
    """Return an array of items enumerated by similarity."""
    if pred is None:
        # Set diagonal to zero
        pred = lambda x: len(set(x)) != len(x)

    # Each new key gets the next integer from the counter
    dd = ct.defaultdict(it.count().__next__)
    dd[""] = np.nan

    # opt_func = np.vectorize(func)                        # optional, slower
    opt_func = np.frompyfunc(func, 3, 1)                   # ref 002
    m = opt_func(arr, dd, pred)
    if verbose: print(dd)
    return m


def transform(item, lookup, pred):
    """Return a unique group number element-wise."""
    unique_idx = "".join(sorted(item.lower()))
    name_group = lookup[unique_idx]
    if pred(item):
        return 0
    else:
        return name_group

Demo

groups = group(transform, arr, verbose=True)
groups
# defaultdict(<method-wrapper '__next__' of itertools.count object at 0x00000000062BE408>,
# {'': nan, 'aaa': 3, 'aac': 8, 'ab': 1, 
# 'cc': 7, 'aa': 0, 'bc': 5, 'aaaa': 9,
# 'ac': 2, 'bb': 4, 'aab': 6})
# array([[0, 1, 2, 0],
#        [1, 0, 5, 6],
#        [2, 5, 0, 8],
#        [0, 6, 8, 0]], dtype=object)

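Each combination of company names is assigned a unique group number.

Step 3

You can now access the group number of two companies by indexing into the groups array: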

# 3. Lookup the group number shared by companies
reversed_lookup = {v:k for k, v in enumerate(companies)}
def group_number(arr, a, b):
    """Return the name_group given company names, in 2D array `arr`."""
    i, j = reversed_lookup[a], reversed_lookup[b]
    return arr[i, j]

# Loop variable renamed to `pair` so it does not overwrite the `companies` list
for pair in [["B", "C"], ["A", "B"], ["C", "C"]]:
    msg = "Companies {}: group {}"
    print(msg.format(" & ".join(pair), group_number(groups, *pair)))

# Companies B & C: group 5
# Companies A & B: group 1
# Companies C & C: group 0

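Details

Step 1

Why use an array? Why join company names in the array?

A 2D array of joined strings is used to compare company names. This way of comparing resembles a statistical correlation matrix.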
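The grid of joined names comes from numpy broadcasting a column vector against a row vector. A minimal sketch on three names follows; the variable names col, row and grid are illustrative and not part of the answer's code.

```python
import numpy as np

x = np.array(["A", "B", "C"])
col = x[:, np.newaxis]         # shape (3, 1): one name per row
row = x[np.newaxis, :]         # shape (1, 3): one name per column
grid = np.char.add(col, row)   # broadcasts to shape (3, 3)

print(col.shape, row.shape, grid.shape)  # (3, 1) (1, 3) (3, 3)
print(grid[0, 1], grid[1, 0])            # AB BA
```

Every cell (i, j) holds name i joined with name j, so each pair of names appears twice, once on each side of the diagonal.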

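Step 2

How are groups determined?

Company names are passed to a special dictionary (dd) that assigns an incrementing integer only when a new key is found. This dictionary is used to track groups as the transform helper function is applied to each element.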
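The behavior of that dictionary can be seen in isolation. This is a small sketch with made-up keys, showing that a defaultdict whose factory is a counter's __next__ method hands out a new integer only for unseen keys:

```python
import collections as ct
import itertools as it

# A missing key triggers the factory, which returns the next count: 0, 1, 2, ...
dd = ct.defaultdict(it.count().__next__)

ids = [dd[key] for key in ["ab", "ac", "ab", "bc", "ac"]]
print(ids)        # [0, 1, 0, 2, 1]
print(dict(dd))   # {'ab': 0, 'ac': 1, 'bc': 2}
```

Repeated keys return the number they were first given, which is what lets equal keys share a group.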

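Why use a helper function?

The transform function converts each item in the array into a group number. Note that the tracking dictionary (lookup) is passed in along with the predicate. Here are some notes on these group parameters:

The keys of the tracking dictionary are made by lowercasing and sorting the given string. This technique is used internally to equate strings whose company names are swapped. For example, the joined companies "AB" and "BA" should belong to the same group.
The predicate is up to the user. If no predicate is given (pred=None), a default predicate is applied, which naively flags strings built from identical names (notably along the diagonal).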
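The two notes above can be checked directly on a few joined strings. In this sketch, make_key and default_pred are hypothetical names for the key expression inside transform and the default lambda inside group:

```python
def make_key(joined):
    """Lowercase and sort the characters, as transform does for its lookup key."""
    return "".join(sorted(joined.lower()))

# Same expression as the default predicate: True when a character repeats
default_pred = lambda x: len(set(x)) != len(x)

print(make_key("AB") == make_key("BA"))  # True: swapped names share a key
print(make_key("AaA"))                   # aaa
print(default_pred("AB"))                # False: kept as a real group
print(default_pred("AA"))                # True: diagonal entry, forced to 0
print(default_pred("AAa"))               # True: "A" repeats, also forced to 0
```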

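You may want to use a different predicate. For example, under the default predicate any set of lowercased strings is treated as equivalent, so A == Aa == AaAa (see the corners of the array, which are assigned to group 0). Here is another sample predicate that distinguishes A from Aa (groups 0 and 3, respectively):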

pred = lambda x: all(not(v%2) for k, v in ct.Counter(x).items())
group(transform, arr, pred)
# array([[0, 1, 2, 3],
#        [1, 0, 5, 6],
#        [2, 5, 0, 8],
#        [3, 6, 8, 0]], dtype=object)

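How is performance optimized?

Some operations are vectorized to help speed up the code with C implementations. In the group function, numpy.frompyfunc wraps the helper function. This particular "universal function" was found to be faster than the vectorizing function numpy.vectorize. See the Scipy Lecture Notes for more ways to optimize numpy code.
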
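One way to compare the two wrappers yourself is to time them on the same toy function. This is only a sketch: key_len is a hypothetical stand-in for transform, and the timings depend on the machine and numpy version, so none are quoted here.

```python
import timeit
import numpy as np

def key_len(s):
    """Toy element-wise function (hypothetical stand-in for transform)."""
    return len(set(s.lower()))

words = np.array([str(i) for i in range(10_000)])

uf = np.frompyfunc(key_len, 1, 1)          # 1 input, 1 output; returns dtype=object
vf = np.vectorize(key_len, otypes=[int])   # returns an integer array

# Both wrappers should produce the same values
same = np.array_equal(uf(words).astype(int), vf(words))

t_uf = timeit.timeit(lambda: uf(words), number=5)
t_vf = timeit.timeit(lambda: vf(words), number=5)
print(same, t_uf, t_vf)
```

Both wrappers still call the Python function once per element, so any gain comes from lower wrapper overhead rather than from running the comparison itself in C.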
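Step 3

How do I find the group number of two companies?

This is done simply by indexing into the array returned by the group function. group_number is the lookup function used to query the array. Since the indices from Step 2 are now numeric, we build a reverse dictionary from the original ordered sequence, companies, to look up the numeric index by company name. Note that the reverse dictionary is built outside the lookup function to avoid rebuilding it on every query.

Performance

How fast is it?

For our <10 rows, the speed is sub-millisecond: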

%timeit group(transform, arr)
# 10000 loops, best of 3: 110 µs per loop

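For demonstration, let's scale the data up to about 1000 rows (beyond that, even creating the dataset takes a long time and consumes a lot of memory).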

test = tuple(map(str, range(1000)))
full_arr = get_2darray(test)
print(full_arr.shape)
full_arr
# (1000, 1000)
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
#        ['10', '11', '12', ..., '1997', '1998', '1999'],
#        ['20', '21', '22', ..., '2997', '2998', '2999'],
#        ...,
#        ['9970', '9971', '9972', ..., '997997', '997998', '997999'],
#        ['9980', '9981', '9982', ..., '998997', '998998', '998999'],
#        ['9990', '9991', '9992', ..., '999997', '999998', '999999']],
#       dtype='<U6')

%timeit group(transform, full_arr)
# 1 loop, best of 3: 5.3 s per loop

Save some computation time by evaluating only half of the matrix:

half_arr = np.triu(full_arr)   # keep the upper triangle of the joined-name matrix
half_arr
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
#        ['', '11', '12', ..., '1997', '1998', '1999'],
#        ['', '', '22', ..., '2997', '2998', '2999'],
#        ...,
#        ['', '', '', ..., '997997', '997998', '997999'],
#        ['', '', '', ..., '', '998998', '998999'],
#        ['', '', '', ..., '', '', '999999']],
#       dtype='<U6')

%timeit group(transform, half_arr)
# 1 loop, best of 3: 3.61 s per loop

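A variation on evaluating half of the matrix is to skip building the full square array and join only the pairs above the diagonal. This is a sketch of that idea using np.triu_indices on the small company list; it is an alternative layout (a flat array of pairs), not the answer's own code.

```python
import numpy as np

names = np.array("A B C Aa".split())
n = len(names)

# Index pairs (i, j) with i < j, i.e. strictly above the diagonal
rows, cols = np.triu_indices(n, k=1)
joined = np.char.add(names[rows], names[cols])

print(len(joined))      # 6, which equals n * (n - 1) // 2
print(joined.tolist())  # ['AB', 'AC', 'AAa', 'BC', 'BAa', 'CAa']
```

This stores about half as many strings as the square matrix, at the cost of needing rows and cols to map a result back to its pair of names.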
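Note: profiling was not performed on the 32k-row dataset.

Conclusion

In this approach, the goals above were achieved by:

separating data munging and evaluation of a small dataset into Steps 1 and 2
analyzing in Step 3 by slicing the final numpy array of grouped companies

Consider numpy for optimizing comparison functions at the C level. Although the performance tests in this post may still take time, numpy leaves room for further optimization. In addition, this code would likely take less than 8 hours on the OP's dataset. Further analysis is needed to assess the complexity of this approach. If the complexity is reasonable, the user can decide how to proceed, e.g. parallel processing on multiple threads. These tasks are left to those who may be interested.
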
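For readers who want to try the parallel route, here is a minimal sketch of scoring pairs with a thread pool. It assumes difflib.SequenceMatcher as a stand-in similarity measure (the question uses fuzz.token_sort_ratio instead), and the score helper is hypothetical.

```python
import itertools as it
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

def score(pair):
    """Return the pair and a similarity ratio between 0 and 1 (stand-in metric)."""
    a, b = pair
    return pair, SequenceMatcher(None, a, b).ratio()

names = ["apple computers", "apple inc", "abbott laboratories"]
pairs = list(it.combinations(names, 2))

# Score all pairs across worker threads; map preserves the input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(score, pairs))

best = max(results, key=results.get)
print(best)   # ('apple computers', 'apple inc')
```

In CPython, threads do not run pure-Python CPU-bound work in parallel because of the global interpreter lock, so a real speedup for this kind of scoring usually needs a process pool or a similarity function that releases the lock.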
References

001: How to concatenate strings in a numpy array
002: Vectorizing functions
