Identifying words that follow specific patterns in the sentences of a text body



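I want to use Python to find, in a body of text, all the words that follow the four patterns below. I am trying regex. For example, the first pattern should capture the word "carbon" together with any of "law", "policy", "policies" or "regulation", as in "carbon law", "carbon policy", "carbon policies" and "carbon regulation". It should match both upper and lower case. If there are options other than regex, those are fine too.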

1. ("carbon" AND ("law" OR "polic*" OR "regulation"))
2. (("Carbon" OR "carbon dioxide") AND "emissions")
3. (("greenhouse gas*" OR "GHG") AND "emission*")
4. (("carbon" OR "GHG" OR "greenhouse gas*") AND "pollution")
* denotes a wildcard character.

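A reproducible example is the following dataframe column df['Text']. Every example here should be identified by the regex or by whatever other solution is used.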

df['Text']
Text
1.  carbon footprint reducing law, and the policies have a potential to form regulations. There are many examples of regulations happening.
2.  Net Zero Carbon sourced emissions come from carbon dioxide generated emissions from fossil fuel.
3.  carbon reduction essentially means Carbon led greenhouse gases or GHG laced emissions.
4.  Reducing carbon  and netzero carbon footprint can happen from GHG reduction, greenhouse gases reduction and reducing pollution therefrom.

Essentially, rows should be identified according to the following conditions.

1. (Word "carbon" AND any of the words ("law" OR "polic*" OR "regulation") appearing anywhere in the group of sentences.)
2. (Word ("Carbon" OR "carbon dioxide") AND the word "emissions" appearing anywhere within the group of sentences.)
3. (Word ("greenhouse gas*" OR "GHG") AND the word "emission*" appearing anywhere within the group of sentences.)
4. (Word ("carbon" OR "GHG" OR "greenhouse gas*") AND the word "pollution" appearing anywhere.)

All the words may be in lower or upper case. A pattern may occur multiple times.

We could then use a regex function on df['Text'] to identify the examples:

df['Text'].apply(lambda x: regex(x))

I tried:

import pandas as pd

df_x = pd.DataFrame()
df_x['Text'] = ['carbon footprint reducing law, and the policies have a potential to form regulations. There are many examples of regulations happening.',
'Net Zero Carbon sourced emissions come from carbon dioxide generated emissions from fossil fuel.',
'carbon reduction essentially means Carbon led greenhouse gases or GHG laced emissions.',
'Reducing carbon  and netzero carbon footprint can happen from GHG reduction, greenhouse gases reduction and reducing pollution therefrom.']

# Validation query = ("carbon" AND ("law" OR "polic*" OR "regulation"))

df_x.loc[((df_x["carbon"] == True) & (df_x["law"] == True)) |  ((df_x["carbon"] == True) & (df_x["polic"] == True)) |
((df_x["carbon"] == True) & (df_x["regulation"] == True))
]

The above fails with the error: KeyError: 'carbon'

One way is to use Series.str.contains: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html

df.Text.str.contains("carbon", case=False)

will give you

0    True
1    True
2    True
3    True

Something like

df["carbon"] = df.Text.str.contains("carbon", case=False)
df["law"] = df.Text.str.contains("law", case=False)
df["regulation"] = df.Text.str.contains("regulation", case=False)
df["polic"] = df.Text.str.contains("polic", case=False)

and then query with

df[
((df["carbon"] == True) & (df["law"] == True)) |
((df["carbon"] == True) & (df["polic"] == True)) |
((df["carbon"] == True) & (df["regulation"] == True))
]
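
The three clauses above all share the "carbon" term, so the query can be factored without storing helper columns. The following is a minimal sketch under the assumption that the data is the sample from the question; the helper name `has` is hypothetical and introduced only for illustration. Note that it does plain substring matching, so "law" would also match inside a word such as "lawn".

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "carbon footprint reducing law, and the policies have a potential to form regulations. There are many examples of regulations happening.",
    "Net Zero Carbon sourced emissions come from carbon dioxide generated emissions from fossil fuel.",
    "carbon reduction essentially means Carbon led greenhouse gases or GHG laced emissions.",
    "Reducing carbon  and netzero carbon footprint can happen from GHG reduction, greenhouse gases reduction and reducing pollution therefrom.",
]})

def has(word):
    """Hypothetical helper: case-insensitive literal substring test per row."""
    return df["Text"].str.contains(word, case=False, regex=False)

# carbon AND (law OR polic* OR regulation); substring matching covers "polic*"
mask = has("carbon") & (has("law") | has("polic") | has("regulation"))
matches = df[mask]
print(matches.index.tolist())
```

Because each word is tested on its own, the result does not depend on the order in which the words appear in the text.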

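You can generate a boolean matrix by applying contains for each word, and then query the matrix to get the output. By contrast, a single regex that lists the words in a fixed order may not work when the order is reversed, for example when "law" comes before "carbon".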

If there are not many words, this approach can be used. Otherwise, go with regex.
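
When the word list grows, one option is to describe each of the four rules as data and build the masks in a loop. This is only a sketch: the names `RULES`, `rule_mask` and `flags` are hypothetical, `\w*` is assumed to be an acceptable stand-in for the `*` wildcard, and `\b` word boundaries are an added assumption so that, for example, "law" does not match inside "lawn".

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "carbon footprint reducing law, and the policies have a potential to form regulations. There are many examples of regulations happening.",
    "Net Zero Carbon sourced emissions come from carbon dioxide generated emissions from fossil fuel.",
    "carbon reduction essentially means Carbon led greenhouse gases or GHG laced emissions.",
    "Reducing carbon  and netzero carbon footprint can happen from GHG reduction, greenhouse gases reduction and reducing pollution therefrom.",
]})

# Each rule is a list of groups: every group must match (AND),
# and any alternative inside a group may match (OR).
RULES = {
    "rule1": [r"carbon", r"law|polic\w*|regulation"],
    "rule2": [r"carbon|carbon dioxide", r"emissions"],
    "rule3": [r"greenhouse gas\w*|GHG", r"emission\w*"],
    "rule4": [r"carbon|GHG|greenhouse gas\w*", r"pollution"],
}

def rule_mask(series, groups):
    """Return a boolean Series that is True where every group matches."""
    mask = pd.Series(True, index=series.index)
    for group in groups:
        # Non-capturing group keeps the alternation inside the word boundaries.
        pattern = rf"\b(?:{group})\b"
        mask = mask & series.str.contains(pattern, case=False, regex=True)
    return mask

# One boolean column per rule, one row per text.
flags = pd.DataFrame(
    {name: rule_mask(df["Text"], groups) for name, groups in RULES.items()}
)
print(flags)
```

Each group is checked separately, so word order within the text does not matter, and adding a rule only means adding an entry to the dictionary.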

Series.str.contains also supports regex.

You can apply a regex such as

df.Text.str.contains("^.*carbon.*law.*$", regex=True, case=False)
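
A pattern of the form carbon.*law only matches when "carbon" comes first. One way around this, sketched here as a suggestion rather than as part of the original approach, is to use one lookahead per required word so that the order no longer matters. The three sample sentences are made up for illustration.

```python
import pandas as pd

# Hypothetical sample sentences, chosen to show both word orders.
s = pd.Series([
    "A carbon law was passed.",
    "The law targets Carbon output.",
    "Nothing relevant here.",
])

# Fixed order: "carbon" must appear before "law".
ordered = r"^.*carbon.*law.*$"
# Lookaheads: each required word may appear anywhere on the line.
unordered = r"^(?=.*carbon)(?=.*law)"

ordered_hits = s.str.contains(ordered, case=False, regex=True)
unordered_hits = s.str.contains(unordered, case=False, regex=True)

print(ordered_hits.tolist())
print(unordered_hits.tolist())
```

By default `.` does not match newline characters, so for multi-line texts the lookaheads would need the `re.DOTALL` flag passed through the `flags` argument.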
