How do I get a list that counts both "None" matches and positive matches when using regex.findall(text)?



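This is my first post, so I will try to be as complete as possible:

I'm trying to write my first web scraping program in Python. I'm studying the coronavirus and trying to collect the data myself from a page that publishes raw data. The main goal is to create a data frame containing "Month Day", "New case(s)" and "New death(s)", as well as the country and province, but I will ask about those last two in a separate question.

So far, with the imported libraries, I have been able to scrape the elements of the HTML file, specifically the <li> and <h4> elements.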

import re
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome()  # driver path purposefully left blank

# Names of my lists
cases = []
death = []
Country = []
Province = []

# The page I'm scraping
driver.get("https://bnonews.com/index.php/2020/01/timeline-coronavirus-epidemic/")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")

# Regex for finding ["Any number" + new + case(s)] in a <li> element
newcaseReg = re.compile(r'''(
    \d+\s\bn?e?w?\scases?\b
    )''', re.VERBOSE)

# Regex for finding ["Any number" + new + death(s)] in a <li> element
deathcaseReg = re.compile(r'''(
    \d+\s?\bn?e?w?\sdeaths?
    )''', re.VERBOSE)

# Regex for finding ["in province(s), country"]
provconReg = re.compile(r'''(
    \bin\s\w+?.*?
    [?\s\bprovince?s]
    ?[,\s]
    ?\w+..?
    )''', re.VERBOSE)

# Regex for cleaning the country and province regex and returning a list with only the names
SepProvConReg = re.compile(r'''(
    [A-Z]\w+
    )''', re.VERBOSE)

# Variables that contain all the <h4> and <li> elements of the page
h4Tag = str(soup.findAll('h4'))
liTag = str(soup.findAll('li'))

# Cleans the string and adds the number of new cases per <li> element to the "cases" list
for i in newcaseReg.findall(liTag):
    cleanNCReg = re.compile(r'''(
    \d+
    )''', re.VERBOSE)
    cases.append(str(cleanNCReg.findall(i)))

# This is supposed to append "0" when deathcaseReg.findall(liTag) == 'None' and also append "value" when
# deathcaseReg.findall(liTag) finds the regex.
for i in deathcaseReg.findall(liTag):
    if i == 'None':
        death.append('None')
    else:
        death.append(i)

len(cases)

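So the output of len(cases) is 1950+, but len(death) is 284. This is because the regex only counts the positive matches and does not append "0" as I would like. This is where I need help: I have searched and tried the answers from "How to return a string if re.findall finds no match", but they did not help, because the output keeps returning 278 (I tried every answer from that search result).

One more question: since I'm trying to build a column-based data frame to analyze the data in R, I was wondering whether anyone can think of a way to write code that repeats each <h4> element as many times as there are <li> tags belonging to that <h4> tag. What I mean is, suppose: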

<h4>4th of March</h4>
<li></li>
<li></li>
<li></li>
.
.
.
x50
<h4>3rd of March</h4>
<li></li>
.
x30
<h4>2nd of March</h4>

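So I would like to write code that counts the number of <li> elements between the first <h4> and the second <h4>, and creates a list that repeats the <h4> string that many times.

Any help would be greatly appreciated. Thank you for taking the time to read this.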

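Actually, your regex never matches the literal string "None". findall returns only the matches it finds, so no item in its result is ever equal to 'None' and the if branch never runs.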

import re

death_case_re = re.compile(r"\d+\s?\bn?e?w?\sdeaths?")
match = death_case_re.search("None")
print(match.group() if match else "(don't match)")
# => (don't match)
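
To see why the two lists end up with different lengths, here is a minimal sketch comparing findall over the joined text with a per-item search. The sample strings and the simplified pattern are assumptions made for illustration; they are not taken from the scraped page.

```python
import re

# Hypothetical sample items standing in for the text of each <li> element.
items = [
    "3 new cases in San Marino.",
    "2 new cases and 1 new death in New South Wales, Australia.",
    "First case in Washington, D.C.",
]

# Simplified death pattern (an assumption, not the pattern from the question).
death_re = re.compile(r"(\d+)\s+(?:new\s+)?deaths?", re.IGNORECASE)

# findall over the joined text returns only the positive matches.
positives = death_re.findall(" ".join(items))

# Searching each item separately keeps one entry per item, 0 when absent.
per_item = []
for item in items:
    mo = death_re.search(item)
    per_item.append(int(mo.group(1)) if mo else 0)

print(positives)  # ['1']
print(per_item)   # [0, 1, 0]
```

Only the per-item version stays aligned with the list of items, which is what a column in a data frame needs.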

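When looping over the sentences, you can use the regex r"(First|\d+) \s+ (?: new \s+)? cases?" (with the re.VERBOSE flag, so the spaces inside the pattern are ignored) to match the cases, and a similar regex for the deaths.

For example: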

import re

# strip() drops the leading newline so that no empty sentence is processed
sentences = """
23:59: 3 new cases in San Marino. (Source)
23:59: 1 new case in Fairfax County, Virginia, United States. This is the first case in Virginia. The patient is a U.S. Marine assigned to Fort Belvoir who recently returned from overseas business. (Source)
23:50: 2 new cases in Thailand. (Source)
[...]
22:42: First case in Washington, D.C. (Source)
22:41: 2 new cases and 1 new death in New South Wales, Australia. (Source 1, Source 2)
22:34: 1 new case, a patient who has already died, in Argentina. This is the first death in South America. The patient was a 64-year-old man who had traveled to Paris, France. He had underlying health conditions. (Source 1, Source 2, Source 3)""".strip().splitlines()

FLAGS = re.VERBOSE | re.DOTALL | re.IGNORECASE

for sentence in sentences:
    # Number of cases: a digit string, or "First" (counted as 1), or 0 if absent
    mo = re.search(r"(First|\d+) \s+ (?: new \s+)? cases?", sentence, flags=FLAGS)
    if mo:
        cases = mo.group(1)
        cases = int(cases) if cases.isdigit() else 1
    else:
        cases = 0
    # Number of deaths, handled the same way
    mo = re.search(r"(First|\d+) \s+ (?: new \s+)? deaths?", sentence, flags=FLAGS)
    if mo:
        deaths = mo.group(1)
        deaths = int(deaths) if deaths.isdigit() else 1
    else:
        deaths = 0
    print(f"{cases} case(s), {deaths} death(s): {sentence}")

Output:

3 case(s), 0 death(s): 23:59: 3 new cases in San Marino. (Source)
1 case(s), 0 death(s): 23:59: 1 new case in Fairfax County, Virginia, United States. This is the first case in Virginia. The patient is a U.S. Marine assigned to Fort Belvoir who recently returned from overseas business. (Source)
2 case(s), 0 death(s): 23:50: 2 new cases in Thailand. (Source)
0 case(s), 0 death(s): [...]
1 case(s), 0 death(s): 22:42: First case in Washington, D.C. (Source)
2 case(s), 1 death(s): 22:41: 2 new cases and 1 new death in New South Wales, Australia. (Source 1, Source 2)
1 case(s), 1 death(s): 22:34: 1 new case, a patient who has already died, in Argentina. This is the first death in South America. The patient was a 64-year-old man who had traveled to Paris, France. He had underlying health conditions. (Source 1, Source 2, Source 3)
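
The two near-identical branches in the loop above could be folded into one helper, which also makes it easy to build columns of equal length. This is only a sketch: the name count_mentions and the sample sentences are hypothetical, and the pattern is the one from this answer with the noun substituted in.

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE

def count_mentions(sentence, noun):
    """Return the number reported before `noun` ("case" or "death"), or 0."""
    # Same pattern as in the answer; whitespace is ignored under re.VERBOSE.
    pattern = rf"(First|\d+) \s+ (?: new \s+)? {noun}s?"
    mo = re.search(pattern, sentence, flags=FLAGS)
    if mo is None:
        return 0
    value = mo.group(1)
    # "First case" / "first death" is counted as 1.
    return int(value) if value.isdigit() else 1

# Hypothetical sample sentences, one per <li> element.
sentences = [
    "3 new cases in San Marino.",
    "2 new cases and 1 new death in New South Wales, Australia.",
    "First case in Washington, D.C.",
    "No update.",
]

cases = [count_mentions(s, "case") for s in sentences]
deaths = [count_mentions(s, "death") for s in sentences]

print(cases)   # [3, 2, 1, 0]
print(deaths)  # [0, 1, 0, 0]
```

Because every sentence contributes exactly one value to each list, the lists can be passed directly as columns to a data frame.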

Edit

In your particular case, you need to loop over the <li> tags and get the string of each tag, for example:

import re

FLAGS = re.VERBOSE | re.DOTALL | re.IGNORECASE

# `soup` is the BeautifulSoup object built in the question
for li_tag in soup.findAll('li'):
    sentence = str(li_tag)
    mo = re.search(r"(First|\d+) \s+ (?: new \s+)? cases?", sentence, flags=FLAGS)
    if mo:
        # group(1) is the captured count; group(0) would be the whole match
        cases = mo.group(1)
        cases = int(cases) if cases.isdigit() else 1
    else:
        cases = 0
    print(f"{cases} case(s)")
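
For the second part of the question (repeating each <h4> heading once per <li> that follows it), one possible approach is to walk the document in order and remember the most recent <h4>. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies; the class name DateLabeller and the sample markup are hypothetical, and nested <li> elements are not handled.

```python
from html.parser import HTMLParser

class DateLabeller(HTMLParser):
    """Pair each <li> text with the most recent <h4> heading seen before it."""

    def __init__(self):
        super().__init__()
        self.current_tag = None   # "h4" or "li" while its text is collected
        self.buffer = []          # text fragments of the current element
        self.current_date = None  # text of the latest <h4>
        self.rows = []            # (date, li_text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("h4", "li"):
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current_tag:
            return  # closing tag of some other element, e.g. <a> or <ul>
        text = "".join(self.buffer).strip()
        if tag == "h4":
            self.current_date = text
        else:
            self.rows.append((self.current_date, text))
        self.current_tag = None

# Hypothetical markup standing in for the real page.
html = """
<h4>4th of March</h4>
<ul><li>3 new cases in San Marino.</li><li>2 new cases in Thailand.</li></ul>
<h4>3rd of March</h4>
<ul><li>First case in Washington, D.C.</li></ul>
"""

parser = DateLabeller()
parser.feed(html)
dates = [date for date, _ in parser.rows]
print(dates)  # ['4th of March', '4th of March', '3rd of March']
```

With BeautifulSoup the same idea would be to iterate over soup.find_all(['h4', 'li']), which yields the tags in document order, and to keep the text of the last <h4> seen while appending one row per <li>.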
