Populating ndarrays from a pandas Series and a csr matrix in parallel



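I currently use a for loop to populate ndarrays with values from a pandas Series (category/object dtype) and a csr matrix (numpy), and I would like to speed this up.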

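What I have tried so far: a sequential for loop (works), numba (does not like Series and strings), joblib (slower than the sequential loop), and swifter.apply (much slower, because I have to use pandas, but it does parallelize).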

import pandas as pd
import numpy as np
from scipy.sparse import rand

nr_matches = 10**5
# Random 10-character names (pd.util.testing is deprecated since pandas 1.0
# and is no longer available in recent pandas releases)
name_vector = pd.Series(pd.util.testing.rands_array(10, nr_matches))
# Random sparse similarity matrix in CSR format
matches = rand(nr_matches, 10, density = 0.2, format = 'csr')
non_zeros = matches.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]

left_side = np.empty([nr_matches], dtype = object)
right_side = np.empty([nr_matches], dtype = object)
similarity = np.zeros(nr_matches)

# Fill the output arrays one element at a time (this is the slow part)
for index in range(0, nr_matches):
    left_side[index] = name_vector.iat[sparserows[index]]
    right_side[index] = name_vector.iat[sparsecols[index]]
    similarity[index] = matches.data[index]

There is no error message, but this is slow because it only uses a single thread!

As Divarak mentioned, indexing with the row and column arrays directly does the job, with no loop needed:

matches_df["left_side"] = name_vector.iloc[sparserows].values
matches_df["right_side"] = name_vector.iloc[sparsecols].values
matches_df["similarity"] = matches.data
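
For anyone who wants to check this, here is a minimal, self-contained sketch comparing the loop with direct indexing on a smaller problem. The size `n_names`, the random name generator (a stand-in for `pd.util.testing.rands_array`) and the way `matches_df` is built are my own assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from scipy.sparse import rand

rng = np.random.default_rng(0)
n_names = 1_000  # hypothetical size, smaller than in the question so the check runs quickly

# Random 10-letter names; a stand-in for pd.util.testing.rands_array
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
name_vector = pd.Series(
    ["".join(rng.choice(letters, size=10)) for _ in range(n_names)],
    dtype=object,
)

# Random sparse similarity matrix, as in the question
matches = rand(n_names, 10, density=0.2, format="csr")
sparserows, sparsecols = matches.nonzero()

# Vectorized version: index the underlying array with the index arrays
names = name_vector.to_numpy()
matches_df = pd.DataFrame(
    {
        "left_side": names[sparserows],
        "right_side": names[sparsecols],
        "similarity": matches.data,
    }
)

# Reference version: the element-by-element loop, over all non-zero entries
n = len(sparserows)
left_loop = np.empty(n, dtype=object)
right_loop = np.empty(n, dtype=object)
sim_loop = np.zeros(n)
for i in range(n):
    left_loop[i] = name_vector.iat[sparserows[i]]
    right_loop[i] = name_vector.iat[sparsecols[i]]
    sim_loop[i] = matches.data[i]

# Both approaches should produce the same columns
print(np.array_equal(matches_df["left_side"].to_numpy(), left_loop))
print(np.array_equal(matches_df["similarity"].to_numpy(), sim_loop))
```

The indexing version does the whole lookup inside NumPy in one call per column, so there is no per-element Python overhead left to parallelize.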
