How do I stop my web scraper from printing the keyword repeatedly?



Python beginner here, just playing with web scraping using the bs4 and requests modules. My current code keeps printing every occurrence of my keyword, and I'd like it to print only once. Should I use `break`, and where would I insert it in my code?

import requests
from bs4 import BeautifulSoup

# Test for agency offering scrape
def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    lowercased = result.text.lower()
    keywords = ['creative']
    for keyword in keywords:
        if keyword.lower() in lowercased:
            print(keyword)
    links = soup.find_all('a')[1:]
    for link in links:
        seo(link['href'])

seo("http://www.daileyideas.com/")

If you want to exit your function as soon as you find the keyword, just `return`:

def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    lowercased = result.text.lower()
    keywords = ['creative']
    found = False
    for keyword in keywords:
        if keyword.lower() in lowercased:
            found = True
            break  # keyword now holds the matched keyword
    links = soup.find_all('a')[1:]
    for link in links:
        if not found:
            seo(link['href'])
        else:
            print(keyword)
            return

This function collects all the links on the first page and visits each one until the keyword is found or we run out of links:

from urllib.parse import urljoin  # Python 3; on Python 2 use urlparse.urljoin

def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    # get all links on the page, resolved against the base URL
    links = [urljoin(url, tag['href']) for tag in soup.find_all('a', href=True)]
    lower_cased = result.text.lower()
    keywords = ['creative']
    while links:  # keep going until the list is empty
        for keyword in keywords:
            if keyword.lower() in lower_cased:
                print("Success we found the keyword: {}".format(keyword))
                return
        link = links.pop()  # get the next link to check
        result = requests.get(link)
        lower_cased = result.text.lower()

With a recursive search you need to set some kind of depth limit, or your search will just keep going if the keyword is never found. Scrapy has the tools you want, so if you're serious about this it's worth a look.
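A minimal sketch of the depth-limit idea. To keep it runnable without network access, the `PAGES` dict and `crawl` function below are illustrative stand-ins (not from the answers above): each fake "URL" maps to its page text and outgoing links.

```python
# Toy page graph standing in for real HTTP responses:
# "url" -> (page text, list of outgoing links)
PAGES = {
    "a": ("welcome page", ["b", "c"]),
    "b": ("nothing here", ["d"]),
    "c": ("a creative agency", []),
    "d": ("deep page", []),
}

def crawl(url, keyword, depth=0, max_depth=2, visited=None):
    """Return the first URL whose text contains keyword, or None.

    Recursion stops past max_depth, and visited pages are skipped
    so link cycles can't loop forever.
    """
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:
        return None
    visited.add(url)
    text, links = PAGES.get(url, ("", []))
    if keyword in text.lower():
        return url
    for link in links:
        hit = crawl(link, keyword, depth + 1, max_depth, visited)
        if hit:
            return hit
    return None

print(crawl("a", "creative"))  # found at page "c", two levels down
```

The same two guards (a depth counter and a `visited` set) drop straight into the recursive `seo` versions above as extra parameters.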

You should return something from `seo` that indicates a match was found. The calling code can then check the return value and break out of its loop when a match is signalled:

def seo(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    lowercased = result.text.lower()
    keywords = ['creative']
    for keyword in keywords:
        if keyword.lower() in lowercased:
            print(keyword)
            return True  # Found a match
    links = soup.find_all('a')[1:]
    for link in links:
        if seo(link['href']):
            return True
    return False  # No match
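The calling side might then look like the sketch below. To show the control flow without network access, `seo` here is a stand-in that just checks the URL string itself; the URLs are made up for illustration:

```python
def seo(url):
    # Stand-in for the real seo() above: pretend the page text is the
    # URL itself, and report whether the keyword appears in it.
    return "creative" in url.lower()

start_urls = [
    "http://example.com/about",
    "http://example.com/creative-services",
    "http://example.com/contact",
]

for url in start_urls:
    if seo(url):  # the boolean return value signals a match
        print("match at", url)
        break  # stop after the first match
```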
