Cosine similarity between two pandas DataFrame columns to get the cosine distance



I have a dataframe that looks like this:

vector_a            vector_b
[1,2,3]             [2,5,6]
[0,2,1]             [2,9,1]
[4,7,1]             [1,7,4]

I want to apply sklearn's cosine_similarity between the vector_a and vector_b columns to get a new column called "cosine_distance" in the same dataframe. Note that vector_a and vector_b are pandas df columns of lists.

Here is what I tried:

df['vector_a'] = df['vector_a'].apply(lambda x: np.asarray(x))
df['vector_b'] = df['vector_b'].apply(lambda x: np.asarray(x))
df['cosine_distance'] = cosine_similarity(df['vector_a'].apply(lambda x: np.transpose(x)), 
df['vector_b'].apply(lambda x: np.transpose(x)))

I got this error:

---> 58         df['cosine_distance'] = cosine_similarity(df['vector_a'].apply(lambda x: np.transpose(x)), df['vector_b'].apply(lambda x: np.transpose(x)))
~\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in cosine_similarity(X, Y, dense_output)
1025     # to avoid recursive import
1026 
-> 1027     X, Y = check_pairwise_arrays(X, Y)
1028 
1029     X_normalized = normalize(X, copy=True)
~\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
110     else:
111         X = check_array(X, accept_sparse='csr', dtype=dtype,
--> 112                         estimator=estimator)
113         Y = check_array(Y, accept_sparse='csr', dtype=dtype,
114                         estimator=estimator)
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
494             try:
495                 warnings.simplefilter('error', ComplexWarning)
--> 496                 array = np.asarray(array, dtype=dtype, order=order)
497             except ComplexWarning:
498                 raise ValueError("Complex data not supported\n"
~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536 
537     """
--> 538     return array(a, dtype, copy=False, order=order)
539 
540 
ValueError: setting an array element with a sequence.

Thanks in advance!

TLDR:

df['cosine_similarity'] = df.apply(
    lambda row: cosine_similarity([row['vector_a']], [row['vector_b']])[0][0],
    axis=1)
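For anyone who wants to run this end to end, here is a minimal sketch using the sample data from the question. The DataFrame construction and the extra "cosine_distance" column are my additions: the question asks for a column with that name, and cosine distance is conventionally 1 minus cosine similarity, so the sketch derives it that way.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample data copied from the question; each cell holds a plain Python list.
df = pd.DataFrame({
    "vector_a": [[1, 2, 3], [0, 2, 1], [4, 7, 1]],
    "vector_b": [[2, 5, 6], [2, 9, 1], [1, 7, 4]],
})

# Wrap each vector in a list so sklearn sees two 1 x n matrices,
# then pull the single value out of the resulting 1 x 1 array.
df["cosine_similarity"] = df.apply(
    lambda row: cosine_similarity([row["vector_a"]], [row["vector_b"]])[0][0],
    axis=1,
)

# Assumption: "distance" here means 1 - similarity.
df["cosine_distance"] = 1 - df["cosine_similarity"]

print(df[["cosine_similarity", "cosine_distance"]])
```

If only the distance is needed, sklearn.metrics.pairwise also provides cosine_distances, which should give the same values as the subtraction above.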

Explanation:

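  • cosine_similarity expects a 2D np.array or a list of lists. It does not know how to interpret a pd.Series of lists. But even if we did convert the column to a list of lists, the next problem would arise:
  • cosine_similarity returns all-vs-all similarities. So we restrict it to a pairwise comparison by artificially creating a second dimension (note the extra square brackets in [row['vector_a']], [row['vector_b']]) and then taking the only element of the resulting 1x1 array (the zeros at the end of cosine_similarity(...)[0][0]).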
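The two points above can be seen directly with a small sketch. It assumes every list in a column has the same length, so the column can be stacked into a 2D array; the names A, B, all_pairs, row_wise and vectorized are only illustrative. The last step is a NumPy-only alternative to the row-by-row apply, not something the answer above relies on.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "vector_a": [[1, 2, 3], [0, 2, 1], [4, 7, 1]],
    "vector_b": [[2, 5, 6], [2, 9, 1], [1, 7, 4]],
})

# Stack each list column into a 2D array of shape (n_rows, n_features).
# Assumption: all vectors in a column have equal length.
A = np.array(df["vector_a"].tolist(), dtype=float)
B = np.array(df["vector_b"].tolist(), dtype=float)

# With proper 2D input, sklearn compares every row of A with every row of B.
all_pairs = cosine_similarity(A, B)
print(all_pairs.shape)  # one entry per (row of A, row of B) pair

# The row-by-row similarities are the diagonal of that matrix.
row_wise = np.diag(all_pairs)

# Alternative without the full matrix: dot product over the product of norms.
# Note: a zero vector would cause a division by zero here.
vectorized = (A * B).sum(axis=1) / (
    np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
)

df["cosine_similarity"] = vectorized
```

For a large dataframe the vectorized form avoids both the Python-level loop of apply and the n x n matrix that the all-vs-all call would build.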
