如何使用条件拆分熊猫的列



我有一个Log df,其中df有Description列。看起来像。

Description
Machine x : Turn off
Another action here
Another action here
Machine y : Turn off
Machine x : Turn on
Another action here

我只需要用":"拆分行

类似:

Description               Machine           Action
Machine x : Turn off      Machine x         Turn off
Another action here
Another action here
Machine y : Turn off      Machine y         Turn off
Machine x : Turn on       Machine x         Turn on
Another action here

我已经试过了:

s = df["Description"].apply(lambda x:x.split(":"))
df["Action"] = s.apply(lambda x: x[1])
df["Machine"] = s.apply(lambda x: x[0])

还有一些带有"startswith"的东西。

您可以将str.extract与合适的regex一起使用。这将查找:周围的所有值(同时剥离冒号周围的空格(:

df[['Machine', 'Action']] = df.Description.str.extract('(.*) : (.*)',expand=True)
>>> df
Description    Machine    Action
0  Machine x : Turn off  Machine x  Turn off
1   Another action here        NaN       NaN
2   Another action here        NaN       NaN
3  Machine y : Turn off  Machine y  Turn off
4   Machine x : Turn on  Machine x   Turn on
5   Another action here        NaN       NaN
# df[['Machine', 'Action']] = df.Description.str.extract('(.*) : (.*)',expand=True).fillna('')

给定一个数据帧

>>> df
Description
0  Machine x : Turn off
1   Another action here
2   Another action here
3  Machine y : Turn off
4   Machine x : Turn on
5   Another action here

我会通过Series.str.split(splitter, expand=True)来处理这个问题。

>>> has_colon = df['Description'].str.contains(':')
>>> df[['Machine', 'Action']] = df.loc[has_colon, 'Description'].str.split('s*:s*', expand=True)
>>> df
Description    Machine    Action
0  Machine x : Turn off  Machine x  Turn off
1   Another action here        NaN       NaN
2   Another action here        NaN       NaN
3  Machine y : Turn off  Machine y  Turn off
4   Machine x : Turn on  Machine x   Turn on
5   Another action here        NaN       NaN

如果您喜欢空字符串,可以通过替换NaN单元格

>>> df.fillna('')
Description    Machine    Action
0  Machine x : Turn off  Machine x  Turn off
1   Another action here                     
2   Another action here                     
3  Machine y : Turn off  Machine y  Turn off
4   Machine x : Turn on  Machine x   Turn on
5   Another action here 

仅使用splitexpand=True

df[['Machine', 'Action']] =df.Description.str.split(':',expand=True).dropna()
df
Description     Machine     Action
0  Machine x : Turn off  Machine x    Turn off
1   Another action here         NaN        NaN
2   Another action here         NaN        NaN
3  Machine y : Turn off  Machine y    Turn off
4   Machine x : Turn on  Machine x     Turn on
5   Another action here         NaN        NaN

使用pd.Series.str.extract函数和特定的正则表达式模式(覆盖:分隔符周围的潜在多个空格(:

In [491]: df
Out[491]: 
Description
0  Machine x : Turn off
1   Another action here
2   Another action here
3  Machine y : Turn off
4   Machine x : Turn on
5   Another action here
In [492]: pd.concat([df, df.Description.str.extract('(?P<Machine>[^:]+)s+:s+(?P<Action>[^:]+)').fillna('')], axis=1)
Out[492]: 
Description    Machine    Action
0  Machine x : Turn off  Machine x  Turn off
1   Another action here                     
2   Another action here                     
3  Machine y : Turn off  Machine y  Turn off
4   Machine x : Turn on  Machine x   Turn on
5   Another action here                     

StringMethods既有用又方便,但通常性能不佳

我建议使用默认构造函数和纯python字符串处理

df[['Machine', 'Action']] = pd.DataFrame([x.split(':') for x in df.Description]).dropna()

定时比.str访问器选项更好。

df = pd.concat([df]*1000)
%timeit pd.DataFrame([x.split(':') for x in df.Description]).dropna()
4.47 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.Description.str.split(':',expand=True).dropna()
14.9 ms ± 323 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.Description.str.extract('(.*) : (.*)',expand=True)
16.6 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.concat([df, df.Description.str.extract('(?P<Machine>[^:]+)s+:s+(?P<Action>[^:]+)').fillna('')], axis=1)
22.5 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

我的主张是:

msk = df.Description.str.contains(':')
df[['Machine', 'Action']] = df.Description.str.split(':', 1, expand=True).where(msk, '')

首先创建一个掩码-行可以接收非空值。

然后只对掩码为true的行执行实际替换。其他行(实际上是所有新列(接收一个空字符串。

最新更新