Grouping profiles that contain the same words in a different order in Python



I have a dataframe with a column of profile types that looks like this:

0                                    Android Java
1                  Software Development Developer
2                            Full-stack Developer
3                      JavaScript Frontend Design
4                          Android iOS JavaScript
5                             Ruby JavaScript PHP

I used NLP to fuzzy-match similar profiles, which returns the following similarity dataframe:

    left_side                    right_side                   similarity
7   JavaScript Frontend Design   Design JavaScript Frontend   0.849943
8   JavaScript Frontend Design   Frontend Design JavaScript   0.814599
9   JavaScript Frontend Design   JavaScript Frontend          0.808010
10  JavaScript Frontend Design   Frontend JavaScript Design   0.802881
12  Android iOS JavaScript       Android iOS Java             0.925126
15  Machine Learning Engineer    Machine Learning Developer   0.839165
21  Android Developer Developer  Android Developer            0.872646
25  Design Marketing Testing     Design Marketing             0.817195
28  Quality Assurance            Quality Assurance Developer  0.948010

While this helped, taking me from 478 unique profiles down to 461, what I want to focus on are profiles like these:

Frontend Design JavaScript  Design Frontend JavaScript

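The only tool I have seen that addresses this is difflib. My question is: what other techniques can I use to standardize these profiles, which are made up of the same words in a different order, into one standard string? The desired output would be to take any string containing "Design", "Frontend" and "JavaScript" and replace it with "Design Frontend JavaScript".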

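Right now I am merging my original dataframe with the similarity dataframe to replace every profile string that appears on the right side with the one on the left side. But that means I would also replace the right side below with the left side below ("JavaScript Python Data Science"), even though these are different profiles.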

53  JavaScript Python Data Science  Java Python Data Science

Any help would be greatly appreciated!

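Edit: I wrote the following to replace any value in the clean_talentpool['profile'] column with the matching entry from words_to_keep, but it doesn't seem to work. Can anyone point out what I'm not seeing? I'd really appreciate it!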

def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
    for word in words_to_keep:
        if (sorted(word.replace(" ", ""))) == sorted(
            row.replace(" ", "")
        ) and word != row:
            row.replace(row, word)
    return row

clean_talentpool["profile"] = clean_talentpool["profile"].apply(
    lambda x: standardize_word_order(x)
)

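In your case I would focus on the characters rather than the strings. Basically, two strings match if they are made up of the same characters, that is, if one is a permutation of the other.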

a = "Frontend Design JavaScript"
b = "JavaScript Frontend Design"
print(sorted(a) == sorted(b))
# prints True

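You may also want to remove the spaces and apply other preprocessing, such as lowercasing, so that differences in case or spacing do not prevent a match.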

if sorted(a.lower().replace(" ", "")) == sorted(b.lower().replace(" ", "")):
    # they are the same, do something
    print("match")
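Comparing characters can also match strings that are built from different words that happen to share the same letters. A slightly stricter variant, sketched below, compares the sorted words instead of the sorted characters. The helper name `word_key` is made up for this illustration, and the sample strings are taken from the question.

```python
def word_key(profile: str) -> tuple:
    """Return an order-independent key: the lowercased words, sorted."""
    return tuple(sorted(profile.lower().split()))


a = "Frontend Design JavaScript"
b = "Javascript Frontend Design"
c = "Java Python Data Science"
d = "JavaScript Python Data Science"

# Same words in a different order (and different case) give the same key.
print(word_key(a) == word_key(b))  # True
# "Java" and "JavaScript" are different words, so these stay distinct.
print(word_key(c) == word_key(d))  # False
```

Splitting on whitespace also takes care of repeated spaces, so no separate `replace(" ", "")` step is needed.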

Based on your example, one possible implementation is:

def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
    for word in words_to_keep:
        # Strings are immutable: row.replace(...) returns a new string and
        # leaves row untouched, so return the matching entry directly.
        # The "word != row" check is dropped so that list entries which are
        # permutations of each other all map to the first one listed.
        if sorted(word.replace(" ", "")) == sorted(row.replace(" ", "")):
            return word
    return row

# Apply the function to each value; it expects a single string, not a Series.
clean_talentpool["profile"] = clean_talentpool["profile"].apply(
    standardize_word_order
)
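If maintaining a hard-coded list is not practical, the canonical spelling can be derived from the data itself. The following is only a sketch under that assumption, not part of the original answer: it keeps the first spelling seen for each set of words and maps every later permutation onto it. The function name `canonicalize` is hypothetical, and the sample profiles come from the question.

```python
def canonicalize(profiles):
    """Map each profile to the first-seen spelling that uses the same words."""
    canonical = {}  # sorted-word key -> first spelling seen for that key
    result = []
    for profile in profiles:
        key = tuple(sorted(profile.lower().split()))
        # setdefault stores the profile for a new key, otherwise it returns
        # the spelling that was stored earlier.
        result.append(canonical.setdefault(key, profile))
    return result


profiles = [
    "JavaScript Frontend Design",
    "Design JavaScript Frontend",
    "Frontend Design JavaScript",
    "Java Python Data Science",
    "JavaScript Python Data Science",
]
for original, standard in zip(profiles, canonicalize(profiles)):
    print(f"{original!r} -> {standard!r}")
```

Because the function accepts any iterable of strings, it could presumably be used on the column as `clean_talentpool["profile"] = canonicalize(clean_talentpool["profile"])`. Unlike the fuzzy-match merge, this never merges "Java Python Data Science" into "JavaScript Python Data Science", because the word sets differ.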
