import BeautifulSoup
html = """
<html><head></head>
<body>
<a href='http://www.gurletins.com'>My HomePage</a>
<a href='http://www.gurletins.com/sections'>Sections</a>
</body>
</html>
"""
soup = BeautifulSoup.BeautifulSoup(html)
Now I want to get the links whose text contains the keyword Home.
Can anyone tell me how to do this with BeautifulSoup?
html = """
<html><head></head>
<body>
<a href='http://www.gurletins.com'>My HomePage</a>
<a href='http://www.gurletins.com/sections'>Sections</a>
</body>
</html>
"""
from bs4 import BeautifulSoup
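soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all("a"):
    # str(i) is the whole tag; the piece after the first ">" holds the link text
    if "HOME" in str(i).split(">")[1].upper():
        print(i["href"])
Prints:
http://www.gurletins.com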
There is a better way. Pass a compiled regular expression as the text argument:
import re
from bs4 import BeautifulSoup
html = """
<html><head></head>
<body>
<a href='http://www.gurletins.com'>My HomePage</a>
<a href='http://www.gurletins.com/sections'>Sections</a>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
# Keep only the <a> tags whose text matches the pattern
for a in soup.find_all("a", text=re.compile('Home')):
    print(a['href'])
Prints:
http://www.gurletins.com
Note that the match is case-sensitive by default. If you need it to be case-insensitive, pass the re.IGNORECASE flag to re.compile():
re.compile('Home', re.IGNORECASE)
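To see what the flag changes without involving the parser, here is a minimal sketch using only the re module. The three strings are the link texts from the demo that follows, and it assumes the pattern is applied with search(), so it matches anywhere inside the text, which is why "My HomePage" matches even though "Home" is only part of a word. The variable names are made up for illustration.

```python
import re

# Link texts copied from the demo HTML
link_texts = ["My HomePage", "Sections", "So nice to be home"]

case_sensitive = re.compile("Home")
case_insensitive = re.compile("Home", re.IGNORECASE)

# search() looks for the pattern anywhere in the string
sensitive_hits = [t for t in link_texts if case_sensitive.search(t)]
insensitive_hits = [t for t in link_texts if case_insensitive.search(t)]

print(sensitive_hits)    # ['My HomePage']
print(insensitive_hits)  # ['My HomePage', 'So nice to be home']
```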
Demo:
>>> import re
>>> from bs4 import BeautifulSoup
>>>
>>> html = """
... <html><head></head>
... <body>
... <a href='http://www.gurletins.com'>My HomePage</a>
... <a href='http://www.gurletins.com/sections'>Sections</a>
... <a href='http://www.gurletins.com/home'>So nice to be home</a>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html, "html.parser")
>>> for a in soup.find_all("a", text=re.compile('Home', re.IGNORECASE)):
...     print(a['href'])
...
http://www.gurletins.com
http://www.gurletins.com/home
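One caveat, as far as I know: the text argument is compared against each tag's .string, which is None when a link contains a mix of text and nested tags, so such links may be skipped. In Beautiful Soup 4.4.0 and later the same argument is also available under the name string. If your links can contain nested markup, a function filter based on get_text() is one possible workaround. This is only a sketch: the nested <b> markup and the helper name link_mentions are invented for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's HTML: the first link has nested markup
html = """
<body>
<a href='http://www.gurletins.com'>My <b>Home</b>Page</a>
<a href='http://www.gurletins.com/sections'>Sections</a>
</body>
"""

def link_mentions(keyword):
    """Build a filter that accepts <a> tags whose full text contains keyword."""
    keyword = keyword.lower()

    def matcher(tag):
        # get_text() joins the text of all descendants, nested tags included
        return tag.name == "a" and keyword in tag.get_text().lower()

    return matcher

soup = BeautifulSoup(html, "html.parser")
hrefs = [a["href"] for a in soup.find_all(link_mentions("home"))]
print(hrefs)
```

The comparison here is a plain case-insensitive substring test; swap in a compiled pattern inside matcher if you need real regular expressions.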