Web scraping with Python and Beautiful Soup: extracting all <li> tags under a specific <h2> tag



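I am trying to extract all events for a given date (for example, May 9) from the Wikipedia article and put them into a single DataFrame column, while ignoring the <h3> subheadings Pre-1600, 1601–1900, and 1901–present. The events from all of these subsections should be concatenated seamlessly into one column.

I also want to ignore the other sections marked by <h2> tags, such as Births and Deaths, so that only the Events section is extracted. The <h2> tag/section of interest is the second one in the list shown below.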

import requests, itertools, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://en.wikipedia.org/wiki/May_9').text, 'html.parser')
h2 = d.find_all("h2")
h2
[<h2 id="mw-toc-heading">Contents</h2>,
<h2><span class="mw-headline" id="Events">Events</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=May_9&amp;action=edit&amp;section=1" title="Edit section: Events">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Births">Births</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=May_9&amp;action=edit&amp;section=5" title="Edit section: Births">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Deaths">Deaths</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=May_9&amp;action=edit&amp;section=9" title="Edit section: Deaths">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Holidays_and_observances">Holidays and observances</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=May_9&amp;action=edit&amp;section=13" title="Edit section: Holidays and observances">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="References">References</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=May_9&amp;action=edit&amp;section=14" title="Edit section: References">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="External_links">External links</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=May_9&amp;action=edit&amp;section=15" title="Edit section: External links">edit</a><span class="mw-editsection-bracket">]</span></span></h2>,
<h2>Navigation menu</h2>]

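I am struggling to build a function that selects the Events section and then the <li> tags that follow it, while ignoring the subheadings and the other sections.

I tried to split the document into <h2> sections with: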

data = [[i.name, i] for i in d.find_all(re.compile('h2|ul'))]
new_data = [[a, list(b)] for a, b in itertools.groupby(data, key=lambda x:x[0] == 'h2')]

But I am stuck at this point. If there is a better approach, I would be happy to use it.

You can use .find_previous to check whether the nearest preceding <h2> is the Events heading:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/May_9"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
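for li in soup.select("h3 + ul > li"):
    # Keep only items whose nearest preceding <h2> is the Events heading.
    if (h2 := li.find_previous("h2")) and (h2.find(id="Events")):
        # Normalize the en dash, then split "year - event" once.
        date, event = li.text.replace("–", "-").split(" - ", maxsplit=1)
        print("{:<10} {}".format(date, event))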

Prints:

0328       Athanasius is elected Patriarch of Alexandria.[1]
1009       Lombard Revolt: Lombard forces led by Melus revolt in Bari against the Byzantine Catepanate of Italy.
1386       England and Portugal formally ratify their alliance with the signing of the Treaty of Windsor, making it the oldest diplomatic alliance in the world which is still in force.
1450       'Abd al-Latif (Timurid monarch) is assassinated.
1540       Hernando de Alarcón sets sail on an expedition to the Gulf of California.
1662       The figure who later became Mr. Punch makes his first recorded appearance in England.[2]
1671       Thomas Blood, disguised as a clergyman, attempts to steal England's Crown Jewels from the Tower of London.
1726       Five men arrested during a raid on Mother Clap's molly house in London are executed at Tyburn.
1864       Second Schleswig War: The Danish navy defeats the Austrian and Prussian fleets in the Battle of Heligoland.
1865       American Civil War: Nathan Bedford Forrest surrenders his forces at Gainesville, Alabama.
1865       American Civil War: President Andrew Johnson issues a proclamation ending belligerent rights of the rebels and enjoining foreign nations to intern or expel Confederate ships.
1873       Der Krach: Vienna stock market crash heralds the Long Depression.
1877       Mihail Kogălniceanu reads, in the Chamber of Deputies, the Declaration of Independence of Romania. This day became the Independence Day of Romania.
1901       Australia opens its first national parliament in Melbourne.
1911       The works of Gabriele D'Annunzio are placed in the Index of Forbidden Books by the Vatican.
1915       World War I: Second Battle of Artois between German and French forces.
1918       World War I: Germany repels Britain's second attempt to blockade the port of Ostend, Belgium.
1920       Polish-Soviet War: The Polish army under General Edward Rydz-Śmigły celebrates its capture of Kiev with a victory parade on Khreshchatyk.
1926       Admiral Richard E. Byrd and Floyd Bennett claim to have flown over the North Pole (later discovery of Byrd's diary appears to cast some doubt on the claim.)
1927       Old Parliament House, Canberra officially opens.[3]
1936       Italy formally annexes Ethiopia after taking the capital Addis Ababa on May 5.
1941       World War II: The German submarine U-110 is captured by the Royal Navy. On board is the latest Enigma machine which Allied cryptographers later use to break coded German messages.
1942       The Holocaust in Ukraine: The SS executes 588 Jewish residents of the Podolian town of Zinkiv (Khmelnytska oblast. The Zoludek Ghetto (in Belarus) is destroyed and all its inhabitants executed or deported.
1945       World War II: The final German Instrument of Surrender is signed at the Soviet headquarters in Berlin-Karlshorst.
1946       King Victor Emmanuel III of Italy abdicates and is succeeded by Umberto II.
1948       Czechoslovakia's Ninth-of-May Constitution comes into effect.
1950       Robert Schuman presents the "Schuman Declaration", is considered by some people to be the beginning of the creation of what is now the European Union.
1955       Cold War: West Germany joins NATO.
1960       The Food and Drug Administration announces it will approve birth control as an additional indication for Searle's Enovid, making Enovid the world's first approved oral contraceptive pill.
1969       Carlos Lamarca leads the first urban guerrilla action against the military dictatorship of Brazil in São Paulo, by robbing two banks.
1974       Watergate scandal: The  United States House Committee on the Judiciary opens formal and public impeachment hearings against President Richard Nixon.
1979       Iranian Jewish businessman Habib Elghanian is executed by firing squad in Tehran, prompting the mass exodus of the once 100,000-strong Jewish community of Iran.
1980       In Florida, United States, Liberian freighter MV Summit Venture collides with the Sunshine Skyway Bridge over Tampa Bay, making a 1,400-ft. section of the southbound span collapse. Thirty-five people in six cars and a Greyhound bus fall 150 ft. into the water and die.
1980       In Norco, California, United States, five masked gunmen hold up a Security Pacific bank, leading to a violent shoot-out and one of the largest pursuits in California history. Two of the gunmen and one police officer are killed and thirty-three police and civilian vehicles are destroyed in the chase.
1987       LOT Flight 5055 Tadeusz Kościuszko crashes after takeoff in Warsaw, Poland, killing all 183 people on board.
1988       New Parliament House, Canberra officially opens.[3]
1992       Armenian forces capture Shusha, marking a major turning point in the First Nagorno-Karabakh War.
1992       Westray Mine disaster kills 26 workers in Nova Scotia, Canada.
2001       In Ghana, 129 football fans die in what became known as the Accra Sports Stadium disaster. The deaths are caused by a stampede (caused by the firing of tear gas by police personnel at the stadium) that followed a controversial decision by the referee.
2002       The 38-day stand-off in the Church of the Nativity in Bethlehem comes to an end when the Palestinians inside agree to have 13 suspected terrorists among them deported to several different countries.[4]
2017       US President Donald Trump fires FBI Director James Comey.[5]
2018       The historic defeat for Barisan Nasional, the governing coalition of Malaysia since the country's independence in 1957 in 2018 Malaysian general election.
2020       The COVID-19 recession causes the U.S. unemployment rate to hit 14.9 percent, its worst rate since the Great Depression.[6]
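
Since the question asks for the events in a DataFrame rather than printed, the same filter can feed pandas. The following is only a sketch under a few assumptions: `SAMPLE_HTML` is a hypothetical, trimmed-down stand-in for the page (so the example runs without a network request), `events_to_frame` is an illustrative name, and each item is assumed to separate year and text with " – " (an en dash with spaces). It mirrors the heading markup shown in the question, which may differ from what Wikipedia serves today.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample that imitates the heading structure from the question.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="Events">Events</span></h2>
<h3><span class="mw-headline" id="Pre-1600">Pre-1600</span></h3>
<ul>
<li>1540 – Hernando de Alarcón sets sail on an expedition to the Gulf of California.</li>
</ul>
<h3><span class="mw-headline" id="1901-present">1901–present</span></h3>
<ul>
<li>1901 – Australia opens its first national parliament in Melbourne.</li>
<li>1955 – Cold War: West Germany joins NATO.</li>
</ul>
<h2><span class="mw-headline" id="Births">Births</span></h2>
<h3><span class="mw-headline" id="Pre-1600_2">Pre-1600</span></h3>
<ul>
<li>1900 – Placeholder entry that must not appear in the result</li>
</ul>
"""


def events_to_frame(html, section_id="Events"):
    """Collect the <li> items of one <h2> section into a two-column DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.select("h3 + ul > li"):
        # Skip items whose nearest preceding <h2> is not the wanted section.
        h2 = li.find_previous("h2")
        if h2 is None or h2.find(id=section_id) is None:
            continue
        # Assumed item format: "<year> – <event text>".
        year, sep, event = li.get_text().strip().partition(" – ")
        if not sep:
            continue  # item does not follow the assumed format
        rows.append({"year": year.strip(), "event": event.strip()})
    return pd.DataFrame(rows, columns=["year", "event"])


df = events_to_frame(SAMPLE_HTML)
print(df)
```

To run this against the live page, pass `requests.get(url).content` instead of `SAMPLE_HTML`. If only one column is needed, `df["event"]` holds the event text with the subsections already merged in document order.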
