Beautiful Soup - <em> tags are giving me trouble with my results



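I'm trying to put the headings from the <strong> tags into headerList and the rest of the information into infoList. It works for everything except the <em> tags. I know, I know, the HTML is awful, but I didn't write it. Anyway, here is the HTML I'm working with: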

<table border="0">
<tbody>
<tr>
<td>
<p><strong>MIKE ALSTOTT</strong></p>
<p><strong>Inducted: </strong>June 25, 2014 in West Lafayette, IN</p>
<p><strong>Date of Birth: </strong>December 21, 1973 in Joliet, IL</p>
<p><strong>High School Attended: </strong>Joliet Catholic Academy         <strong>Graduated: </strong>1992</p>
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
<p><strong>College Attended: </strong>Purdue University                       <strong>Graduated: </strong>1996</p>
<p><strong>College Honors:  </strong>4-year starting fullback; team MVP last 3 years; Purdue's all-time leading rusher with 3,635 yards, 5.6 yards per carry; holds PU record for career TDs with 42 and all-time, all-purpose yardage leader; holds several single season records; rushed for 100 yards or more 16 times; only PU player to accumulate more than 2,500 yards rushing and 1,000 yards receiving; as a senior, finished 11th in Heisman Trophy balloting, First Team All-Big Ten, and Gannett All-American.</p>
<p><strong>Professional Athletic Background:  </strong>Drafted 35th by NFL Tampa Bay Buccaneers 1996 and played there 12 seasons; forced to retire on January 24, 2008, due to neck injuries .</p>
<p><strong>Professional Athletic Honors:  </strong>Buccaneers won Super Bowl XXXVII in 2003; after being named 2nd team All-Pro in 1996, became first offensive player in Bucs' team history to be named 1st team Associated Press All-Pro 1997; selected All-Pro fullback 6 times; holds franchise record of 71 TDs; ran for over 5,000 yards in NFL career.</p>
<p><strong>Special Recognition:  </strong>Since retiring, has worked in private business in St. Petersburg area; established the Mike Alstott Family Foundation that supports the Children's Cancer Center, Ronald McDonald House, St. Petersburg All Children's Hospital, Sally House, and Big Brothers/Big Sisters in the St. Petersburg area; inducted into Purdue Athletics Hall of Fame 2006.</p>
<p><strong>Family:  </strong>Wife, Nicole; children, Griffin, Hannah, and Lexie.</p>
</td>
<td valign="top"><img src="/images/alstott_mike2%207-14.jpg" alt="" width="178" height="249" /></td>
</tr>
</tbody>
</table>

Here is my Python so far:

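# Assumes soup, headerList, and infoList already exist.
for strong_tag in soup.find_all('strong'):
    # Header text without the colon or non-breaking spaces
    headers = strong_tag.text.replace(':', '').replace('\xa0', ' ').strip()
    # Only the single node immediately after </strong>
    info = strong_tag.next_sibling
    headerList.append(headers)
    infoList.append(info)
print(headerList)
print(infoList)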

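Here is the result I get, and it's what I need help fixing. The problem is with Parade: the <em> tag is returned on its own, and the rest of the information after it is not captured: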

['MIKE ALSTOTT', 'Inducted', 'Date of Birth', 'High School Attended', 'Graduated', 'High School Honors', 'College Attended', 'Graduated', 'College Honors', 'Professional Athletic Background', 'Professional Athletic Honors', 'Special Recognition', 'Family']
[None, 'June 25, 2014 in West Lafayette, IN', 'December 21, 1973 in Joliet, IL', 'Joliet Catholic Academy\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 ', '1992', <em>Parade </em>, 'Purdue University\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0', '1996', "4-year starting fullback; team MVP last 3 years; Purdue's all-time leading rusher with 3,635 yards, 5.6 yards per carry; holds PU record for career TDs with 42 and all-time, all-purpose yardage leader; holds several single season records; rushed for 100 yards or more 16 times; only PU player to accumulate more than 2,500 yards rushing and 1,000 yards receiving; as a senior, finished 11th in Heisman Trophy balloting, First Team All-Big Ten, and Gannett All-American.", 'Drafted 35th by NFL Tampa Bay Buccaneers 1996 and played there 12 seasons; forced to retire on January 24, 2008,\xa0due to neck injuries .', "Buccaneers won Super Bowl XXXVII in 2003; after being named 2nd team All-Pro in 1996, became first offensive player in Bucs' team history to be named 1st\xa0team Associated Press All-Pro 1997; selected All-Pro fullback 6 times; holds franchise record of 71 TDs; ran for over 5,000 yards in NFL career.", "Since retiring, has worked\xa0in private business in St. Petersburg\xa0area; established the Mike Alstott Family Foundation that supports the Children's Cancer Center, Ronald McDonald House, St. Petersburg All Children's Hospital, Sally House, and Big Brothers/Big Sisters in the St. Petersburg area; inducted into Purdue Athletics Hall of Fame 2006.", 'Wife, Nicole; children, Griffin, Hannah, and Lexie.']
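For context on why only <em>Parade </em> shows up: .next_sibling returns just the one node that directly follows the tag, while .next_siblings iterates over every remaining node under the same parent. The snippet below is a small sketch of that difference, using a shortened copy of the "High School Honors" paragraph (the trimmed HTML string is my own abbreviation, not the full page):

```python
from bs4 import BeautifulSoup

# Shortened copy of the "High School Honors" paragraph from the question.
html = ('<p><strong>High School Honors: </strong><em>Parade </em>All-American; '
        '<em>Chicago Sun-Times </em>Illinois Player of the Year honors.</p>')
soup = BeautifulSoup(html, 'html.parser')
strong_tag = soup.find('strong')

# .next_sibling is only the single node directly after </strong>.
first = strong_tag.next_sibling
# .next_siblings walks every remaining node inside the same <p>.
siblings = list(strong_tag.next_siblings)

print(repr(first))
for node in siblings:
    print(type(node).__name__, repr(str(node)))
```

The paragraph body is split into alternating tag and text nodes, so a single .next_sibling lookup can never return all of it.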

Try this:

from bs4 import BeautifulSoup, Tag

for strong_tag in soup.find_all('strong'):
    headers = strong_tag.text.replace(':', '').replace('\xa0', ' ').strip()
    # Join every following sibling, using .text for tags such as <em>
    info = ' '.join([i if not isinstance(i, Tag) else i.text for i in strong_tag.next_siblings])
    headerList.append(headers)
    infoList.append(info)
print(headerList)
print(infoList)
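One thing to watch with the loop above: .next_siblings keeps going to the end of the <p>, so for a paragraph with two headers (such as "High School Attended" followed by "Graduated") the first entry also picks up the text of the second header. A possible variation, sketched below, stops at the next <strong> and joins without an extra separator. The helper name text_until_next_strong and the two-paragraph sample are my own, chosen only for illustration:

```python
from bs4 import BeautifulSoup, Tag

# Two paragraphs abbreviated from the question's HTML.
html = """
<p><strong>High School Attended: </strong>Joliet Catholic Academy <strong>Graduated: </strong>1992</p>
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors.</p>
"""
soup = BeautifulSoup(html, 'html.parser')


def text_until_next_strong(strong_tag):
    """Hypothetical helper: join sibling text up to the next <strong>."""
    parts = []
    for node in strong_tag.next_siblings:
        if isinstance(node, Tag) and node.name == 'strong':
            break  # the next header starts here
        # get_text() flattens tags such as <em>; plain strings go through str().
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    return ''.join(parts).replace('\xa0', ' ').strip()


headerList, infoList = [], []
for strong_tag in soup.find_all('strong'):
    header = strong_tag.get_text().replace(':', '').replace('\xa0', ' ').strip()
    headerList.append(header)
    infoList.append(text_until_next_strong(strong_tag))

print(headerList)
print(infoList)
```

A header with nothing after it (like the name line) ends up as an empty string here rather than None, so adjust that if you rely on None downstream.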

Another solution is to strip the <em> tags from the HTML string before parsing it.

html = html.replace('<em>', '').replace('</em>', '')
soup = BeautifulSoup(html, 'html.parser')
...........
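To show the effect of the string replacement end to end, here is a minimal sketch on a shortened copy of the problem paragraph (the abbreviated HTML and the variable name cleaned are mine). With the <em> markup gone, the paragraph body becomes a single text node, so the original .next_sibling lookup is enough:

```python
from bs4 import BeautifulSoup

# Shortened copy of the "High School Honors" paragraph from the question.
html = ('<p><strong>High School Honors: </strong><em>Parade </em>All-American; '
        '<em>Chicago Sun-Times </em>Illinois Player of the Year honors.</p>')

# Drop the <em> markup but keep its text, so the body is one string.
cleaned = html.replace('<em>', '').replace('</em>', '')
soup = BeautifulSoup(cleaned, 'html.parser')

strong_tag = soup.find('strong')
info = strong_tag.next_sibling
print(repr(info))
```

Keep in mind that a plain str.replace only matches the exact strings '<em>' and '</em>', so tags written in uppercase or carrying attributes would be left in place.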
