Extracting text from HTML with BeautifulSoup



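This is my first time using BeautifulSoup, and I am trying to extract a joke from an HTML file that I have already downloaded. Unfortunately, the markup has no class I can use to select the information.

There is one comment line "begin of joke" and one "end of joke"; what I want is the title and the text of the joke. My code and its output are below.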

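from bs4 import BeautifulSoup

# Read the downloaded page and parse it with the lxml parser.
with open('init1.html', 'r') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')
    # No parentheses after prettify, so this prints the bound method's repr (see the output).
    print(soup.prettify)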
Output:
<bound method Tag.prettify of <html>
<head>
<title>Joke 1 of 25</title>
</head>
<body bgcolor="#fddf84" text="black">
<center>
<table cellpadding="0" cellspacing="0" width="620">
<td width="470">
<font size="+1"> <br/>
<!--begin of joke -->
A man visits the doctor. The doctor says "I have bad news for you.You have
cancer and Alzheimer's disease". <p>
The man replies "Well,thank God I don't have cancer!"
<!--end of joke -->
</p></font></td></table>
</center>
</body>
</html>
>

This is simple and works:

soup.table.td.text.strip()
# -> 'A man visits the doctor. The doctor says "I have bad news for you.You have\ncancer and Alzheimer's disease". \nThe man replies "Well,thank God I don't have cancer!"'
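
Since the question also asks for the title and mentions the "begin of joke" and "end of joke" comments, here is a minimal sketch that uses those comments as markers instead of relying on the table layout. It makes a few assumptions: the page is inlined as a string so the example runs without init1.html, the built-in "html.parser" is swapped in for lxml to avoid an extra dependency, and extract_joke is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment

# Inline copy of the page from the question, so the sketch runs without init1.html.
HTML = """<html>
<head>
<title>Joke 1 of 25</title>
</head>
<body bgcolor="#fddf84" text="black">
<center>
<table cellpadding="0" cellspacing="0" width="620">
<td width="470">
<font size="+1"> <br/>
<!--begin of joke -->
A man visits the doctor. The doctor says "I have bad news for you.You have
cancer and Alzheimer's disease". <p>
The man replies "Well,thank God I don't have cancer!"
<!--end of joke -->
</p></font></td></table>
</center>
</body>
</html>"""


def extract_joke(html):
    """Hypothetical helper: return (title, text) for a page laid out like the one above."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True)

    # Locate the comment node that marks the start of the joke.
    begin = soup.find(
        string=lambda s: isinstance(s, Comment) and "begin of joke" in s
    )

    parts = []
    # Walk forward in document order until the closing comment marker.
    for node in begin.next_elements:
        if isinstance(node, Comment):
            if "end of joke" in node:
                break
            continue  # skip any other comment
        if isinstance(node, str):  # text nodes are str subclasses; tags are skipped
            parts.append(node)

    # Collapse newlines and repeated spaces into single spaces.
    text = " ".join("".join(parts).split())
    return title, text


title, text = extract_joke(HTML)
print(title)
print(text)
```

If the layout is as regular as in this file, soup.table.td.text.strip() is shorter; the comment-based walk is only worth it when the surrounding tags vary between pages but the markers stay the same.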
