Python newbie here, just playing around with web scraping using the bs4 and requests modules. My current code keeps printing every occurrence of my keyword, and I'd like to know how to make it print only once. Should I use "break", and where would I insert it in my code?
import requests
from bs4 import BeautifulSoup
# Test for agency offering scrape
def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text)
    lowercased = result.text.lower()
    keywords = ['creative']
    for keyword in keywords:
        if keyword.lower() in lowercased:
            print(keyword)
    links = soup.find_all('a')[1:]
    for link in links:
        seo(link['href'])

seo("http://www.daileyideas.com/")
If you want to exit your function as soon as you find the keyword, just return:
def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text)
    lowercased = result.text.lower()
    found = False
    keywords = ['creative']
    print(keywords[0] in lowercased)
    for keyword in keywords:
        if keyword.lower() in lowercased:
            found = True
    links = soup.find_all('a')[1:]
    for link in links:
        if not found:
            seo(link['href'])
        else:
            print(keyword)
            return
This function gets all the links on the first page and visits each one until the keyword is found or we run out of links:
from urllib.parse import urljoin

def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text)
    links = [urljoin(url, tag['href']) for tag in soup.findAll('a', href=True)]  # get all links on the page
    lower_cased = result.text.lower()
    keywords = ['creative']
    while links:  # keep going until the list is empty
        for keyword in keywords:
            if keyword.lower() in lower_cased:
                print("Success, we found the keyword: {}".format(keyword))
                return
        link = links.pop()  # get the next link to check
        result = requests.get(link)
        lower_cased = result.text.lower()
With a recursive search you need to set some kind of depth limit, or your search will just keep going if the keyword is never found. Scrapy has the tools you want, so if you really want to do this it is worth a look.
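As a rough sketch of what such a depth limit could look like on the recursive approach (the max_depth parameter, the depth counter, and the use of urljoin are illustrative additions, not part of the answers above):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Sketch only: max_depth and depth are hypothetical parameters added to cap the recursion.
def seo(url, depth=0, max_depth=2):
    if depth > max_depth:  # stop recursing once the limit is reached
        return False
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    lowercased = result.text.lower()
    keywords = ['creative']
    for keyword in keywords:
        if keyword.lower() in lowercased:
            print(keyword)
            return True  # found a match, stop here
    for link in soup.find_all('a', href=True):
        # follow each link one level deeper
        if seo(urljoin(url, link['href']), depth + 1, max_depth):
            return True
    return False  # nothing found within the depth limit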
You should return something from seo that indicates whether a match was found. The calling code can then check the return value and break out of its loop as soon as the return value indicates there is a match:
def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text)
    lowercased = result.text.lower()
    keywords = ['creative']
    for keyword in keywords:
        if keyword.lower() in lowercased:
            print(keyword)
            return True  # Found a match
    links = soup.find_all('a')[1:]
    for link in links:
        if seo(link['href']):
            return True
    return False  # No match
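The calling code could then look something like this (the start_urls list here is purely illustrative):

start_urls = ["http://www.daileyideas.com/"]  # illustrative list of pages to check

for url in start_urls:
    if seo(url):
        print("Match found, stopping")
        break  # stop as soon as a match is reported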