Find partial text in a column and, when it is found, create a new column holding an assigned text value instead of True or False



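I want to create a category based on which keyword string is found in the text, and assign "other" when none of the keywords is found.

For example, if "health" is found, label that keyword row "health"; if "therapist" is found, label it "therapist".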

  1. Create the "category" column
  2. Assign the category based on the condition

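I was able to do this in Excel by creating a table and using INDEX/MATCH, and I want to switch to Python so I can apply it to a large dataset.

Below is the sample data: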

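keyword
global sales training platform openings
asset grant management system
digital team project management solution openings
global training platform openings
global sales training platform openings
digital team project management solution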

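You can build a single regular expression from all the keywords. Then, depending on whether you want the first match or all matches, use str.extract, or str.extractall followed by an aggregation.

I added the keyword "private" so that row 3 serves as an example of a row with more than one match: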

import re

words = ['health', 'therapist', 'sales', 'private']
regex = '|'.join(map(re.escape, words))
# 'health|therapist|sales|private'

# option 1: get first match
df['category_first'] = (df['keyword']
    .str.extract(f'(?i)({regex})', expand=False)
    .fillna('other')
)

# option 2: get all matches
df['category_all'] = (df['keyword']
    .str.extractall(f'(?i)({regex})')
    [0].groupby(level=0).agg(','.join)
    .reindex(df.index, fill_value='other')
)

print(df)

Output:

                                              keyword   category category_first       category_all
0                     HR Consultancy UK-d-uk-159_bing      other          other              other
1           it support COMPANY LONDON-D-UK-G1161_bing      other          other              other
2             global sales training platform openings      sales          sales              sales
3                     tele private practice therapist  therapist        private  private,therapist
4                       asset grant management system      other          other              other
5   digital team project management solution openings      other          other              other
6                   global training platform openings      other          other              other
7                             tele practice therapist  therapist      therapist          therapist
8             global sales training platform openings      sales          sales              sales
9                                tele health practice     health         health             health
10                             asset grant management      other          other              other
11           digital team project management solution      other          other              other
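
Because `(?i)` makes the match case-insensitive, the extracted text keeps whatever casing appears in the row. The following is a minimal sketch, not part of the answer above, of normalising the match and turning it into a display label. The sample frame and the `labels` dictionary are hypothetical and only there for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'keyword': [
    'Global Sales training platform openings',
    'tele HEALTH practice',
    'asset grant management',
]})

# Hypothetical mapping from keyword to the label to display
labels = {'health': 'Health', 'therapist': 'Therapist', 'sales': 'Sales'}
regex = '|'.join(map(re.escape, labels))

df['category'] = (df['keyword']
                  .str.extract(f'(?i)({regex})', expand=False)  # first match, in the row's casing
                  .str.lower()                                  # normalise so the lookup works
                  .map(labels)                                  # keyword -> display label
                  .fillna('other'))                             # no keyword found
print(df)
```

Lower-casing before the lookup matters here: without it, "HEALTH" would not be found in the dictionary and the row would fall through to "other".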

The simplest solution is:

import pandas as pd
df = ...  # your DataFrame

First assign a default value. You can also use None or np.nan for it.

df['category'] = 'other'
df.loc[df.keyword.str.contains('therapist'), 'category'] = 'therapist'
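
The same pattern extends to several keywords by repeating the assignment in a loop. This is a minimal sketch, assuming a hypothetical sample frame and a hypothetical keyword-to-category dictionary. Later entries overwrite earlier ones, so the dictionary order decides which keyword wins when a row contains more than one.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'keyword': [
    'tele health practice',
    'tele private practice therapist',
    'asset grant management',
]})

# Hypothetical mapping: keyword to look for -> category to assign
categories = {'health': 'health', 'therapist': 'therapist'}

df['category'] = 'other'  # default for rows where nothing matches
for word, category in categories.items():
    # plain substring test, case-insensitive, treating missing values as "no match"
    mask = df['keyword'].str.contains(word, case=False, regex=False, na=False)
    df.loc[mask, 'category'] = category

print(df)
```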

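You can first create a list of the keywords to test for, collect the matches, and then fill the empty results with "other".

import pandas as pd

# example df
df = pd.DataFrame(data=['health xxxxx', 'yyyy therapist', 'zzzzz'], columns=["keyword"])
keywords = ['health', 'therapist']

# collect every keyword found in each row, de-duplicate, and join with commas
df['category'] = df['keyword'].str.findall('|'.join(keywords)).apply(set).str.join(',')
# rows with no match hold an empty string; replace it with "other"
df = df.apply(lambda x: x.str.strip()).replace('', "other")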
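
A Python set has no guaranteed order, so a row with several matches may list them in varying order. The following is a minimal sketch of a variant that sorts the matches and ignores case. The helper `join_matches` and the sample rows are hypothetical names and data introduced for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'keyword': ['Health xxxxx', 'yyyy therapist health', 'zzzzz']})
keywords = ['health', 'therapist']
pattern = '|'.join(map(re.escape, keywords))

def join_matches(matches):
    """Join the distinct matches in sorted order, or return 'other' if there are none."""
    unique = sorted({m.lower() for m in matches})
    return ','.join(unique) if unique else 'other'

df['category'] = (df['keyword']
                  .str.findall(pattern, flags=re.IGNORECASE)  # list of matches per row
                  .apply(join_matches))
print(df)
```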

  1. pandas already provides the apply method for this:

def f(s):
    if s.find('health') >= 0:
        return 'health'
    if s.find('thera...') >= 0:
        return 'thera'
    ...

df['category'] = df['text'].apply(f)

Read: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
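
A complete version of the apply approach might look like the sketch below. The `KEYWORDS` list, the `categorize` function, and the sample rows are hypothetical. The first keyword in the list that occurs in the text wins, and rows with no keyword get "other".

```python
import pandas as pd

# Hypothetical keyword list; its order decides priority when several keywords occur
KEYWORDS = ['health', 'therapist', 'sales']

def categorize(text):
    """Return the first keyword from KEYWORDS contained in text, else 'other'."""
    lowered = text.lower()
    for word in KEYWORDS:
        if word in lowered:
            return word
    return 'other'

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'text': [
    'tele Health practice',
    'global sales training',
    'asset grant management',
]})
df['category'] = df['text'].apply(categorize)
print(df)
```

Calling a Python function per row is usually slower than the vectorised string methods, so on a large dataset this is probably best kept for logic that is hard to express as a regular expression.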

  2. pandas also has a map method. First create a dictionary, then use map.

d = {keyword1: category1, ...}
df['category'] = df['keyword'].map(d)

Use this when you have a specific dictionary. Read: https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html
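
Note that map looks up the whole cell value in the dictionary, so it fits exact matches rather than partial text. The sketch below uses a hypothetical dictionary and sample rows to show this. The third row contains "health" but is not a key, so it falls back to "other".

```python
import pandas as pd

# Hypothetical dictionary: full cell value -> category
d = {
    'tele health practice': 'health',
    'tele practice therapist': 'therapist',
}

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'keyword': [
    'tele health practice',
    'tele practice therapist',
    'global health openings',
]})

# values missing from the dictionary become NaN, which is then filled with 'other'
df['category'] = df['keyword'].map(d).fillna('other')
print(df)
```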
