Using a similarity score to filter DataFrame columns in pandas



I have a pandas DataFrame, df, with the following column names:

columns = ['Baillie Gifford Positive Change Fund B Accumulation',
'Stewart Investors Worldwide Select Fund Class B (accumulation) Gbp',
'Stewart Investors Worldwide Select Fund Class A (accumulation) Gbp',
'Close Ftse Techmark Fund X Acc',
'Stewart Investors Asia Pacific Leaders Fund Class B (accumulation) Gbp',
'Stewart Investors Asia Pacific Leaders Fund Class A (accumulation) Gbp',
'Stewart Investors Worldwide Sustainability Fund Class A (accumulation) Gbp',
'Stewart Investors Worldwide Sustainability Fund Class B (accumulation) Gbp',
'Mi Somerset Emerging Markets Dividend Growth A Accumulation Shares',
'Axa Framlington Biotech Fund Gbp Z Acc',
'Stewart Investors Global Emerging Markets Sustainability Fund Class B (accumulation) Gbp',
'Schroder Asian Income Fund L Accumulation Gbp',
'Fidelity Active Strategy - Fast - Asia Fund Y-acc-gbp',
'Lf Miton Uk Value Opportunities Fund B Institutional Accumulation',
'Liontrust India Fund C Acc Gbp',
'Fidelity Asian Dividend Fund W Acc',
'Stewart Investors Global Emerging Markets Sustainability Fund Class A (accumulation) Gbp',
'Quilter Investors Emerging Markets Equity Growth Fund U2 (gbp) Accumulation',
'Man Glg Continental European Growth Fund Retail Accumulation Shares (class A)',
'Quilter Investors Europe (ex Uk) Equity Growth Fund A (gbp) Accumulation']

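What I want is to filter out similar columns and keep only one of each group.

For example, 'Stewart Investors Worldwide Select Fund Class B (accumulation) Gbp' and 'Stewart Investors Worldwide Select Fund Class A (accumulation) Gbp' are similar.

I was thinking that one of the similarity scores used in NLP to identify similar texts might help here, but I don't know how to apply it to my case.

The expected result is a list (which I will use to filter my DataFrame) in which only one of each group of similar names is kept. For example: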

columns_filtered = ['Baillie Gifford Positive Change Fund B Accumulation',
'Stewart Investors Worldwide Select Fund Class B (accumulation) Gbp',
'Close Ftse Techmark Fund X Acc',
'Stewart Investors Asia Pacific Leaders Fund Class A (accumulation) Gbp',
'Stewart Investors Worldwide Sustainability Fund Class B (accumulation) Gbp',
'Mi Somerset Emerging Markets Dividend Growth A Accumulation Shares',
'Axa Framlington Biotech Fund Gbp Z Acc',
'Stewart Investors Global Emerging Markets Sustainability Fund Class B (accumulation) Gbp',
'Schroder Asian Income Fund L Accumulation Gbp',
'Fidelity Active Strategy - Fast - Asia Fund Y-acc-gbp',
'Lf Miton Uk Value Opportunities Fund B Institutional Accumulation',
'Liontrust India Fund C Acc Gbp',
'Fidelity Asian Dividend Fund W Acc',
'Stewart Investors Global Emerging Markets Sustainability Fund Class A (accumulation) Gbp',
'Quilter Investors Emerging Markets Equity Growth Fund U2 (gbp) Accumulation',
'Man Glg Continental European Growth Fund Retail Accumulation Shares (class A)',
'Quilter Investors Europe (ex Uk) Equity Growth Fund A (gbp) Accumulation']

Any help?

I found a solution:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Bag-of-words counts: one row per column name, one column per token
counts = CountVectorizer().fit_transform(df.columns.tolist())

# Pairwise cosine similarity between the column names (dense square array)
similarity_score = cosine_similarity(counts)

# Label rows and columns with the original column names
df_similarity = pd.DataFrame(similarity_score, index=df.columns, columns=df.columns)
df_similarity

df_similarity is a DataFrame that holds the similarity score of each column name against every other column name.
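The similarity matrix alone does not yet produce the filtered list the question asks for. One possible way to get there, as a minimal sketch and not necessarily what the answer's author did, is a greedy pass: walk through the names in order and keep a name only if its cosine similarity to every name already kept is below a cutoff. The helper name `filter_similar_names` and the `threshold=0.9` default are assumptions made for illustration and would need tuning on real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def filter_similar_names(names, threshold=0.9):
    """Greedily keep a name only if it is not too similar to one already kept.

    `threshold` is an assumed cutoff on cosine similarity, not a value
    taken from the original answer.
    """
    counts = CountVectorizer().fit_transform(names)
    similarity = cosine_similarity(counts)  # dense (n, n) array

    kept = []  # positions of the names kept so far
    for i in range(len(names)):
        if all(similarity[i, j] < threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]


example_names = [
    'Stewart Investors Worldwide Select Fund Class B (accumulation) Gbp',
    'Stewart Investors Worldwide Select Fund Class A (accumulation) Gbp',
    'Close Ftse Techmark Fund X Acc',
]
columns_filtered = filter_similar_names(example_names)
print(columns_filtered)
```

With a real DataFrame this would be called as `filter_similar_names(df.columns.tolist())`, and the result used as `df[columns_filtered]`. Two caveats: the pass always keeps the first name of each similar group, so it will not necessarily match a hand-picked list, and `CountVectorizer`'s default token pattern drops single-character tokens, so the "A" and "B" class letters do not contribute to the score at all.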

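Note that I used just one of the similarity scores common in NLP (cosine similarity over token counts). You can use any other similarity score instead.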
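As an illustration of swapping in a different score, here is a small sketch that uses character-level matching from the standard library's `difflib` instead of token counts. This is an alternative chosen for the example, not something the original answer used, and the function names `name_similarity` and `similarity_table` are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_table(names):
    """Nested dict mapping name -> name -> similarity for every pair."""
    return {a: {b: name_similarity(a, b) for b in names} for a in names}


fund_b = 'Stewart Investors Worldwide Select Fund Class B (accumulation) Gbp'
fund_a = 'Stewart Investors Worldwide Select Fund Class A (accumulation) Gbp'
other = 'Close Ftse Techmark Fund X Acc'

table = similarity_table([fund_b, fund_a, other])
print(table[fund_b][fund_a])  # close to 1: the names differ by one character
print(table[fund_b][other])   # much lower: little shared text
```

If pandas is available, `pd.DataFrame(table)` turns the nested dict into a labelled matrix similar in shape to df_similarity. Unlike the token-count approach, this score does see the single-letter share class, so near-duplicates score just under 1 instead of exactly 1.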
