Extract the word that follows a fixed combination of words and special characters, and count its occurrences in that cell



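I have a DataFrame df with a single column. I want to extract the word that follows a fixed combination of words and special characters, and I also need the number of times that word occurs in the same cell.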

For example: (Most Recent Alarm Trigger)','valueString':'Tilt Sensor', (Most Recent Alarm Trigger)','valueString':'Hello World', (Most Recent Alarm Trigger)','valueString':'ABC',

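From the line above, I want to extract whatever word sits between the commas after "(Most Recent Alarm Trigger)','valueString':".

So in this case I only want "Tilt Sensor" and its count within that particular cell.

I don't need "Hello World" or "ABC", because they come second or third. Basically, I want the first match only.

Below is my df: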

import re
import pandas as pd
import numpy as np

data = {'product_name': ["[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Tilt Sensor','packetType':'enumerated','leastSigBit':440,,'Tilt Sensor','mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)', (Most Recent Alarm Trigger)','valueString':'Band','valueNumber':2022.0,'units':'Undefined / Not Used','packetType':'Tilt Sensor','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month,(Most Recent Alarm Trigger)','valueString':'Back space',{'name':'User Set Minute (Most Recent Alarm Trigger)','valueNumber':16.0,'units':'min','packetType':'unsigned','leastSigBit':400,'mostSigBit':407},'Tilt Sensor',{'name':'User Set Second (Most Recent Alarm Trigger)','valueNumber':36.0,'units':'s','packetType':'unsigned','leastSigBit':392,'mostSigBit':399}]",
"[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Volumetric Sensor','packetType':'enumerated','leastSigBit':440,'mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)','valueNumber':2022.0,'units':'(Most Recent Alarm Trigger)','valueString':'Being human','packetType':'unsigned','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month (Most Recent Alarm Trigger)','valueNumber':6.0,'(Most Recent Alarm Trigger)','valueString':'Hello'':'Month','Volumetric Sensor','packetType':'unsigned','leastSigBit':424,'mostSigBit':431},{'name':'User Set Day (Most Recent Alarm ]"]}
df = pd.DataFrame(data)
df

I tried regex and apply, but didn't get what I wanted.

Here is some of the code I tried:

df["Extract"] = df["product_name"].apply(lambda st: st[st.find("(Most Recent Alarm Trigger)','valueString':")+1:st.find(",")])
df['Title'] = df.product_name.str.extract(r'"(Most Recent Alarm Trigger)','valueString':'"s*([^.]*)s*.', expand=False)

Here is my expected result:

data = {'product_name': ["[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Tilt Sensor','packetType':'enumerated','leastSigBit':440,,'Tilt Sensor','mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)', (Most Recent Alarm Trigger)','valueString':'Band','valueNumber':2022.0,'units':'Undefined / Not Used','packetType':'Tilt Sensor','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month,(Most Recent Alarm Trigger)','valueString':'Back space',{'name':'User Set Minute (Most Recent Alarm Trigger)','valueNumber':16.0,'units':'min','packetType':'unsigned','leastSigBit':400,'mostSigBit':407},'Tilt Sensor',{'name':'User Set Second (Most Recent Alarm Trigger)','valueNumber':36.0,'units':'s','packetType':'unsigned','leastSigBit':392,'mostSigBit':399}]",
"[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Volumetric Sensor','packetType':'enumerated','leastSigBit':440,'mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)','valueNumber':2022.0,'units':'(Most Recent Alarm Trigger)','valueString':'Being human','packetType':'unsigned','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month (Most Recent Alarm Trigger)','valueNumber':6.0,'(Most Recent Alarm Trigger)','valueString':'Hello'':'Month','Volumetric Sensor','packetType':'unsigned','leastSigBit':424,'mostSigBit':431},{'name':'User Set Day (Most Recent Alarm ]"],
'Extract': ['Tilt Sensor','Volumetric Sensor'],'Count': [4,2]}
df = pd.DataFrame(data)
df

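One solution is as follows:

  • Use Series.str.extract to get the first match between (Most Recent Alarm Trigger)','valueString':' and the closing ',
  • Next, use df.apply with a lambda function on each row (axis=1) to count how often the value stored in df.Extract occurs in the corresponding product_name string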
import pandas as pd
# also adding the string from your comment
data = {'product_name': ["[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Tilt Sensor','packetType':'enumerated','leastSigBit':440,,'Tilt Sensor','mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)', (Most Recent Alarm Trigger)','valueString':'Band','valueNumber':2022.0,'units':'Undefined / Not Used','packetType':'Tilt Sensor','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month,(Most Recent Alarm Trigger)','valueString':'Back space',{'name':'User Set Minute (Most Recent Alarm Trigger)','valueNumber':16.0,'units':'min','packetType':'unsigned','leastSigBit':400,'mostSigBit':407},'Tilt Sensor',{'name':'User Set Second (Most Recent Alarm Trigger)','valueNumber':36.0,'units':'s','packetType':'unsigned','leastSigBit':392,'mostSigBit':399}]",
"[{'name':'Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'Volumetric Sensor','packetType':'enumerated','leastSigBit':440,'mostSigBit':447},{'name':'User Set Year (Most Recent Alarm Trigger)','valueNumber':2022.0,'units':'(Most Recent Alarm Trigger)','valueString':'Being human','packetType':'unsigned','leastSigBit':432,'mostSigBit':439},{'name':'User Set Month (Most Recent Alarm Trigger)','valueNumber':6.0,'(Most Recent Alarm Trigger)','valueString':'Hello'':'Month','Volumetric Sensor','packetType':'unsigned','leastSigBit':424,'mostSigBit':431},{'name':'User Set Day (Most Recent Alarm ]",
"[{'name':'Power Mode Quality Factor','valueString':'Power Mode Undefined','valueString':'Finally',Trigger Cause Status (Most Recent Alarm Trigger)','valueString':'No Trigger (Event Store Empty)',}]"]}
df = pd.DataFrame(data)
df['Extract'] = df.product_name.str.extract(
    r"\(Most Recent Alarm Trigger\)','valueString':'(.*?)',", expand=False)
# N.B. The parentheses in the fixed text are escaped so they match literally.
# The question mark makes '.*' lazy, so the match stops at the first "',"
df['Count'] = df.apply(lambda row: row.product_name.count(row.Extract), axis=1)
print(df.iloc[:,1:])
                          Extract  Count
0                     Tilt Sensor      4
1               Volumetric Sensor      2
2  No Trigger (Event Store Empty)      1
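
To see why the lazy quantifier matters, here is a minimal sketch on a shortened, made-up string. The `cell` value and the `PREFIX` name are assumptions for illustration and are not part of the original data. It compares the lazy `(.*?)` with the greedy `(.*)`, and uses `re.escape` so that the parentheses in the fixed prefix are matched literally:

```python
import re

# Hypothetical shortened cell, modelled on the product_name strings above.
cell = ("(Most Recent Alarm Trigger)','valueString':'Tilt Sensor','x':1,"
        "(Most Recent Alarm Trigger)','valueString':'Band','Tilt Sensor',")

# re.escape makes "(" and ")" literal characters instead of a capture group.
PREFIX = re.escape("(Most Recent Alarm Trigger)','valueString':'")

# Lazy: stop at the first "'," after the prefix.
lazy = re.search(PREFIX + r"(.*?)',", cell).group(1)
# Greedy: run on to the last "'," in the string.
greedy = re.search(PREFIX + r"(.*)',", cell).group(1)

print(lazy)              # Tilt Sensor
print(cell.count(lazy))  # 2
print(greedy == lazy)    # False
```

The greedy version swallows everything up to the last `',` in the cell, which is why the lazy form is the one that returns only the first value.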

N.B. If str.extract finds no match for a row, you will get NaN in df.Extract for that row. In that case, df.apply(lambda row: row.product_name.count(row.Extract), axis=1) will raise an error (because str.count expects a string). To avoid this, you can use:

df['Count'] = df.apply(
    lambda row: row.product_name.count(row.Extract)
    if isinstance(row.Extract, str) else 0, axis=1)
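
As a quick check of that guard, the sketch below runs both steps on two made-up rows, one of which does not contain the fixed prefix at all. The row contents are assumptions for illustration only; the pattern is the same one used with str.extract above, with the parentheses escaped:

```python
import pandas as pd

# Made-up rows: the second one does not contain the fixed prefix.
df = pd.DataFrame({"product_name": [
    "(Most Recent Alarm Trigger)','valueString':'Tilt Sensor','a','Tilt Sensor',",
    "{'name':'Power Mode Quality Factor','valueString':'Finally',}",
]})

# expand=False returns a Series, which maps cleanly onto a single column.
df["Extract"] = df["product_name"].str.extract(
    r"\(Most Recent Alarm Trigger\)','valueString':'(.*?)',", expand=False)

# Rows without a match hold a missing value, so only count real strings.
df["Count"] = df.apply(
    lambda row: row["product_name"].count(row["Extract"])
    if isinstance(row["Extract"], str) else 0,
    axis=1)

print(df[["Extract", "Count"]])
```

The row without a match keeps a missing value in Extract and gets a Count of 0 instead of raising an error.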
