Scrapy or BeautifulSoup for scraping links and text from various websites



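I'm trying to scrape links from a URL that the user enters, but it only works for one URL (http://www.businessinsider.com). How can I adapt it to scrape from any URL that is entered? I'm using BeautifulSoup, but would Scrapy be a better fit for this?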

import urllib.request
from bs4 import BeautifulSoup

def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        # only matches <a> tags whose class is "title"
        link_title=soup.findAll('a',{'class':'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href']+","+eachtitle.string)
    else:
        print('Please enter another Website...')

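You can write a more generic scraper that searches all tags and collects every link inside them. Once you have the list of all links, you can use a regular expression or something similar to find the links that match the structure you want.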

import re

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.businessinsider.com')
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.content, 'html.parser')
# collect the href of every tag that carries a non-empty one
# (a list comprehension, because map() is lazy in Python 3 and would append nothing)
links = [tag['href'] for tag in soup.find_all(href=True) if tag['href']]
# example: keep only careerbuilder links (dots escaped so they match literally)
careerbuilder_links = [link for link in links
                       if re.search(r'www\.careerbuilder\.com', link)]
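
The filtering step can also be sketched on its own, independent of how the links were collected. This is only an illustration, not the answerer's code: `filter_links` is a hypothetical helper, and the sample links are made up. It resolves relative hrefs against the page URL with `urllib.parse.urljoin`, drops duplicates, and keeps the links that match a regular expression.

```python
import re
from urllib.parse import urljoin


def filter_links(hrefs, base_url, pattern):
    """Resolve each href against base_url and keep those matching pattern.

    Hypothetical helper for illustration; not part of any library.
    """
    regex = re.compile(pattern)
    seen = set()
    matches = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # '/jobs' becomes a full URL
        if absolute not in seen and regex.search(absolute):
            seen.add(absolute)
            matches.append(absolute)
    return matches


if __name__ == "__main__":
    # made-up sample data, for illustration only
    sample = [
        "/careers",
        "http://www.careerbuilder.com/jobs",
        "http://www.careerbuilder.com/jobs",  # duplicate, dropped
        "http://example.com/about",
    ]
    print(filter_links(sample, "http://www.businessinsider.com",
                       r"www\.careerbuilder\.com"))
```

Resolving before matching means the pattern is always tested against an absolute URL, so one expression covers both relative and absolute hrefs.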

Code:

import urllib.request

import bs4


def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    # only <a> tags with class "title" are matched
    title_tags = soup.find_all('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)

out:

Where do you want to scrape from today?: http://www.businessinsider.com 
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
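
The function above still selects only `<a>` tags with class `title`, which is probably specific to the markup of that one site. A minimal sketch of a site-independent variant, assuming that any `<a>` tag with an `href` counts as a link, could look like this. `extract_links` is a hypothetical name, and the `html.parser` backend is used here only to avoid the extra lxml dependency.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a href> in html.

    Hypothetical helper for illustration; it does not filter by CSS class.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all("a", href=True):
        # get_text() also works when the text sits in nested tags,
        # where tag.string would be None
        text = tag.get_text(strip=True)
        pairs.append((urljoin(base_url, tag["href"]), text))
    return pairs
```

To use it on a live page, fetch the HTML as in the code above, for example with `urllib.request.urlopen(url).read()`, and pass the result together with `url` to `extract_links`.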
