I have a dataframe with a column of profile types that looks like this:
0 Android Java
1 Software Development Developer
2 Full-stack Developer
3 JavaScript Frontend Design
4 Android iOS JavaScript
5 Ruby JavaScript PHP
I used NLP to fuzzy-match similar profiles, which returned the following similarity dataframe:
left_side right_side similarity
7 JavaScript Frontend Design Design JavaScript Frontend 0.849943
8 JavaScript Frontend Design Frontend Design JavaScript 0.814599
9 JavaScript Frontend Design JavaScript Frontend 0.808010
10 JavaScript Frontend Design Frontend JavaScript Design 0.802881
12 Android iOS JavaScript Android iOS Java 0.925126
15 Machine Learning Engineer Machine Learning Developer 0.839165
21 Android Developer Developer Android Developer 0.872646
25 Design Marketing Testing Design Marketing 0.817195
28 Quality Assurance Quality Assurance Developer 0.948010
This helped, taking me from 478 unique profiles down to 461, but what I want to focus on are profiles like these:
Frontend Design JavaScript Design Frontend JavaScript
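The only tool I've seen that addresses this is difflib. My question is: what other techniques can I use to normalize profiles that consist of the same words in a different order into one standard string? The desired output would take any profile made up of "Design", "Frontend", and "JavaScript" and replace it with "Design Frontend JavaScript".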
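Right now I am merging my original dataframe with the similarity dataframe to replace every profile string that appears on the right side with the one on the left side, but that means I would also replace the right side below with the left side below ("JavaScript Python Data Science"):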
53 JavaScript Python Data Science Java Python Data Science
Any help would be greatly appreciated!
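Edit: I wrote the following to replace each profile in the clean_talentpool['profile'] column with the matching entry from words_to_keep, but it doesn't seem to work. Can anyone point out what I'm missing? I'd really appreciate it!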
def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
    for word in words_to_keep:
        if (sorted(word.replace(" ", ""))) == sorted(
            row.replace(" ", "")
        ) and word != row:
            row.replace(row, word)
    return row


clean_talentpool["profile"] = clean_talentpool["profile"].apply(
    lambda x: standardize_word_order(x)
)
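In your case I would focus on the characters rather than the strings. Basically, two strings match if they are made up of the same characters, that is, if one is a permutation of the other.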
a = "Frontend Design JavaScript"
b = "JavaScript Frontend Design"
print(sorted(a) == sorted(b))
# prints True
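You can also consider removing the spaces and doing other preprocessing, such as lowercasing, so that differences in spacing or case (for example "Javascript" vs. "JavaScript") do not prevent a match.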
if sorted(a.lower().replace(" ", "")) == sorted(b.lower().replace(" ", "")):
    # they are the same, do something
    pass
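One caveat with comparing characters is that it ignores word boundaries, so any two strings that happen to be anagrams of each other would count as a match. If the goal is specifically "same words, different order", a word-level variant may be safer. The following is only a sketch under that assumption; `normalize_word_order` is a hypothetical helper name, not something from the question, and it picks alphabetical word order as the standard form.

```python
def normalize_word_order(profile):
    """Return the profile with its words sorted alphabetically, ignoring case."""
    # split() breaks on any whitespace; key=str.lower sorts case-insensitively
    # while leaving the original spelling of each word untouched.
    return " ".join(sorted(profile.split(), key=str.lower))


profiles = [
    "Frontend Design JavaScript",
    "Design Frontend JavaScript",
    "JavaScript Python Data Science",
    "Java Python Data Science",
]
standardized = [normalize_word_order(p) for p in profiles]
print(standardized)
```

The first two profiles both become "Design Frontend JavaScript", while the "JavaScript ..." and "Java ..." profiles stay distinct because their word sets differ. This needs no hand-maintained list of strings to keep, and on a pandas column it could be applied with `clean_talentpool["profile"].apply(normalize_word_order)`. Words that differ only in case (such as "Javascript" and "JavaScript") would still need to be lowercased first if they should be merged.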
Based on your example, one implementation could be:
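def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
    # Lowercase and drop spaces so the comparison matches the lowercase list entries
    row_key = sorted(row.lower().replace(" ", ""))
    for word in words_to_keep:
        # Return the first entry made of the same characters as row. The list
        # holds permutations of the same profile (e.g. "javascript frontend design"
        # and "frontend design javascript"), so always taking the first match
        # keeps the two from simply being swapped.
        if sorted(word.replace(" ", "")) == row_key:
            return word
    # No entry matched: leave the profile unchanged
    return row


# The function expects a single string, so apply it to each value in the column
clean_talentpool["profile"] = clean_talentpool["profile"].apply(standardize_word_order)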
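As for why the version in the question appeared to do nothing: Python strings are immutable, so `str.replace` returns a new string instead of modifying `row` in place, and the question's code discards that return value. A small illustration (the variable names `unchanged` and `replaced` are only for this demo):

```python
row = "frontend design javascript"
word = "javascript frontend design"

# str.replace builds and returns a new string; row itself is not modified,
# and here the returned string is thrown away.
row.replace(row, word)
unchanged = row

# To keep the result it has to be assigned (or returned from the function).
replaced = row.replace(row, word)

print(unchanged)
print(replaced)
```

Since `row.replace(row, word)` swaps out the entire string anyway, returning `word` directly has the same effect and is easier to read.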