Find/replace exact terms in a tokenized pandas Series using a dictionary



Using a dictionary, I need to find and replace terms in a pandas Series according to the following criteria:

  1. The dictionary value replaces the dictionary key anywhere the key appears in the pandas Series (e.g., with 'mastersphd': 'masters phd', the result is 'masters phd' wherever 'mastersphd' appears).
  2. Records must stay intact (i.e., a bag-of-words approach won't work, because I need each individual record to remain whole).
  3. Only exact matches should be replaced (e.g., with the key:value pair 'rf': 'random forest', the replacement should not turn 'performance' into 'perandom forestormance'; regex=True apparently causes exactly that).

Data: term_fixes is the dictionary and df['job_description'] is the tokenized Series of interest

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}

df = pd.DataFrame(data={'job_description': [['knowledge', 'of', 'algorithm', 'like', 'rf'],
                                            ['must', 'have', 'a', 'mastersphd'],
                                            ['trustworthy', 'and', 'possess', 'curiosity'],
                                            ['we', 'realise', 'performance', 'is', 'key']]})

**Note:** I have also tried (unsuccessfully) with an untokenized data structure, but I prefer the tokenized one because I have more NLP work to do.

df = pd.DataFrame(data={'job_description': ['knowledge of algorithm like rf',
                                            'must have a mastersphd',
                                            'must be trustworthy and possess curiosity',
                                            'we realise performance is critical']})

**Expected result** (note that the 'rf' inside 'performance' is not replaced with 'random forest'): df['job_description']

0    ['knowledge' 'of' 'algorithm' 'like' 'random' 'forest']
1                        ['must' 'have' 'a' 'masters' 'phd']
2          ['must' 'be' 'ethical' 'and' 'possess' 'curious']
3             ['we' 'realize' 'performance' 'is' 'critical']

I have tried many approaches. Failed: df['job_description'].replace(list(term_fixes.keys()), list(term_fixes.values()), regex=False, inplace=True)

Failed: df['job_description'].replace(dict(zip(list(term_fixes.keys()), list(term_fixes.values()))), regex=False, inplace=True)

Failed: df['job_description'] = df['job_description'].str.replace(term_fixes, regex=False)

Failed: df['job_description'] = df['job_description'].str.replace(str(term_fixes.keys()), str(term_fixes.values()), regex=True)

The closest I have gotten is

df['job_description'] = df['job_description'].replace(term_fixes, regex=True)

However, regex=True flags any match, as in the 'rf' and 'performance' example above. Unfortunately, changing the flag to regex=False replaces nothing at all. I have searched the docs for another argument I could use, but found none. Note that this attempt uses the untokenized structure.
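For reference, here is a minimal sketch reproducing the behaviour I am describing (trimmed to two terms for brevity):

import pandas as pd

term_fixes = {'rf': 'random forest', 'realise': 'realize'}
df = pd.DataFrame({'job_description': ['knowledge of algorithm like rf',
                                       'we realise performance is critical']})

# regex=True does substring replacement, so the 'rf' inside 'performance' is hit too
print(df['job_description'].replace(term_fixes, regex=True).tolist())

# regex=False only replaces cells that are exactly equal to a key, so nothing changes
print(df['job_description'].replace(term_fixes, regex=False).tolist())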

Any help would be greatly appreciated. Thank you.

With the "tokenized" version of the df:

df['job_description'] = df['job_description'].explode().replace(term_fixes).groupby(level=-1).agg(list)
# explode to get single terms per "cell"
# replace to replace the terms in "term_fixes"
# groupby to reverse the previous explode and return to a column of lists
job_description
0  [knowledge, of, algorithm, like, random forest]
1                     [must, have, a, masters phd]
2                 [ethical, and, possess, curious]
3              [we, realize, performance, is, key]
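A brief aside on why the groupby undoes the explode (my reading, not spelled out above): explode repeats the original row index once per token, so grouping on that index level collects the tokens back into one list per row. A tiny illustration:

s = pd.Series([['a', 'b'], ['c']])
print(s.explode())
# 0    a
# 0    b
# 1    c
# dtype: object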

If you also need the new terms to be split on whitespace, you can add another intermediate step, .str.split().explode(), before the final groupby:

df['job_description'] = df['job_description'].explode().replace(term_fixes).str.split().explode().groupby(level=-1).agg(list)
job_description
0  [knowledge, of, algorithm, like, random, forest] # random forest is now split
1                     [must, have, a, masters, phd] # masters phd is now split
2                  [ethical, and, possess, curious]
3               [we, realize, performance, is, key]
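As a side note (not part of the answer above), the same exact-token replacement could also be written with apply and a list comprehension, assuming every cell is a list of tokens; the .split() handles multi-word replacements such as 'random forest' in the same pass:

# hedged alternative sketch: look up each token, split multi-word replacements
df['job_description'] = df['job_description'].apply(
    lambda tokens: [w for t in tokens for w in term_fixes.get(t, t).split()]
)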

You can use something like the following for the untokenized data:

for k in term_fixes:
    # anchor each key to the start/end of the string or surrounding spaces so only whole words match
    df['job_description'] = df['job_description'].str.replace(r'(^|(?<= )){}((?= )|$)'.format(k), term_fixes[k], regex=True)
print(df)
job_description
0  knowledge of algorithm like random forest
1                    must have a masters phd
2        must be ethical and possess curious
3         we realize performance is critical
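If you would rather avoid the Python-level loop, a similar effect can probably be achieved by building word-boundary patterns from the dictionary and passing them to replace in one call (note that \b also treats punctuation as a boundary, unlike the space-based lookarounds above):

import re

# hedged sketch: \b keeps 'rf' inside 'performance' from matching
pattern_fixes = {r'\b{}\b'.format(re.escape(k)): v for k, v in term_fixes.items()}
df['job_description'] = df['job_description'].replace(pattern_fixes, regex=True)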
