Extract a pattern from one column based on the value in another column

Given two columns of a pandas DataFrame:

import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
      'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])

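I want to extract the substring of column word that includes everything up to the end of the string in the corresponding column root, or NaN if the string in root is not contained in word. That is, the resulting DataFrame would look like this: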

word       root    match
replay     play    replay
replayed   play    replay
playable   play    play
thinker    think   think
think      think   think
thoughtful think   NaN
ex)mple    ex)mple ex)mple

My DataFrame has several thousand rows, so I would like to avoid for loops if possible.

You can use a regex with str.extract inside groupby+apply:

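import re
# group_keys=False keeps the original row index so the result aligns with df;
# expand=False makes str.extract return a Series rather than a one-column DataFrame
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
              )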

Or, if you expect few repeated "root" values:

import re
df['match'] = df.apply(lambda r: m.group()
                       if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)

Output:

         word   root   match
0      replay   play  replay
1    replayed   play  replay
2    playable   play    play
3     thinker  think   think
4       think  think   think
5  thoughtful  think     NaN
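
For comparison, the same result can be sketched without a regex, which sidesteps escaping special characters such as the `)` in `ex)mple`. This is an alternative, not the method from the answer above: it uses `str.rfind` to find the last occurrence of root in word (mirroring the greedy `.*`), and the helper name `match_up_to_root` is made up for illustration. It is still a Python-level loop, much like the `df.apply(..., axis=1)` variant.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'word': ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple'],
    'root': ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple'],
})

def match_up_to_root(word, root):
    """Return word up to the end of the last occurrence of root, or NaN (hypothetical helper)."""
    pos = word.rfind(root)  # -1 when root does not occur in word
    return np.nan if pos == -1 else word[:pos + len(root)]

# Plain comprehension over the two columns; no regex escaping needed.
df['match'] = [match_up_to_root(w, r) for w, r in zip(df['word'], df['root'])]
print(df)
```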

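Building on mozway's answer, the regex can thankfully also be pieced together from parts. This is a different application, but one that people may find generally useful.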

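Here there are two columns, full and tiny, and a third column, context, is being created.

A tiny value such as 30 year old (though they vary a lot: days, weeks, months, decades, etc.) is extracted from the long text in the full string/column (and then processed to get only the integer into another column, which is irrelevant for these purposes).

I decided it would be better to capture more of the surrounding context rather than just the bare tiny string, and this solved the problem without complicated changes to the existing code.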

df['context'] = df.groupby('tiny', group_keys=False)['full'].apply(
    lambda g: g.str.extract(
        r'\b(.{0,20}' + f'{re.escape(g.name)}' + r'.{0,20}\b)'
    )
)

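Explaining the regex:

r'\b(.{0,20}' + f'{re.escape(g.name)}' + r'.{0,20}\b)'

It basically says: for whatever is found in the tiny column of each row, look for its match in the column named full, but include up to 20 characters before it (stopping at a word boundary where necessary, to avoid cutting a word in half) and up to 20 characters after it, again ending at a word boundary with \b.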

group_keys=False is there to avoid a "FutureWarning" under Python 3.7.
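
To see the pattern in action, here is a minimal self-contained sketch. The sample rows and the helper name `extract_context` are invented for illustration; only the regex and the groupby call follow the snippet above. It also passes `expand=False`, an addition here so that each group yields a Series.

```python
import re
import pandas as pd

# Invented sample rows: 'full' holds long text, 'tiny' the phrase already pulled from it.
df = pd.DataFrame({
    'full': ['The suspect, a 30 year old man from Leeds, was arrested on Friday.',
             'She adopted a 6 week old puppy last spring.',
             'A 30 year old bridge collapsed.'],
    'tiny': ['30 year old', '6 week old', '30 year old'],
})

def extract_context(g):
    # g.name is the 'tiny' value shared by every row in this group.
    pattern = r'\b(.{0,20}' + re.escape(g.name) + r'.{0,20}\b)'
    # expand=False returns a Series, which aligns back to df by index.
    return g.str.extract(pattern, expand=False)

df['context'] = df.groupby('tiny', group_keys=False)['full'].apply(extract_context)

for tiny, context in zip(df['tiny'], df['context']):
    print(repr(tiny), '->', repr(context))
```

Each context value contains the tiny phrase plus at most 20 characters on either side, trimmed back to a word boundary.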
