Scraping multiple web pages, but the results are overwritten by the last URL



I want to scrape all the URLs from several web pages. It works, but only the results from the last page end up in the file.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests
urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']
for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9-])+$")}):
    links.append(link.get('href'))
filename = 'output.csv'
with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%sn" % s)

What am I missing here?

It would be even nicer if I could use a CSV file containing all the URLs instead of a list, but everything I've tried has gone completely wrong...

You are only using the soup from the last URL. You should move your second for loop inside the first one. Also, your regex matches every element with such an href, and there are matching elements outside the list you are actually trying to scrape.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

urls = ['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2']
links = []
for url in urls:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # Only take movies from the main list (ol.list_products); otherwise the
    # "coming soon" section would be appended as well. That is why select_one is used.
    for link in soup.select_one('ol.list_products').findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9-])+$")}):
        links.append(link.get('href'))

filename = 'output.csv'
with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)

Here are the results.

/movie/woman-at-war
/movie/destroyer
/movie/aquaman
/movie/bumblebee
/movie/between-worlds
/movie/american-renegades
/movie/mortal-engines
/movie/spider-man-into-the-spider-verse
/movie/the-quake
/movie/once-upon-a-deadpool
/movie/all-the-devils-men
/movie/dead-in-a-week-or-your-money-back
/movie/blood-brother-2018
/movie/ghostbox-cowboy
/movie/robin-hood-2018
/movie/creed-ii
/movie/outlaw-king
/movie/overlord-2018
/movie/the-girl-in-the-spiders-web
/movie/johnny-english-strikes-again
/movie/hunter-killer
/movie/bullitt-county
/movie/the-night-comes-for-us
/movie/galveston
/movie/the-oath-2018
/movie/mfkz
/movie/viking-destiny
/movie/loving-pablo
/movie/ride-2018
/movie/venom-2018
/movie/sicario-2-soldado
/movie/black-water
/movie/jurassic-world-fallen-kingdom
/movie/china-salesman
/movie/incredibles-2
/movie/superfly
/movie/believer
/movie/oceans-8
/movie/hotel-artemis
/movie/211
/movie/upgrade
/movie/adrift-2018
/movie/action-point
/movie/solo-a-star-wars-story
/movie/feral
/movie/show-dogs
/movie/deadpool-2
/movie/breaking-in
/movie/revenge
/movie/manhunt
/movie/avengers-infinity-war
/movie/supercon
/movie/love-bananas
/movie/rampage
/movie/ready-player-one
/movie/pacific-rim-uprising
/movie/tomb-raider
/movie/gringo
/movie/the-hurricane-heist
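
As for reading the URLs from a CSV file instead of a hard-coded list, a minimal sketch using the standard csv module could look like this (assuming a hypothetical urls.csv with one URL per row; the rest of the script stays the same):

import csv

urls = []
with open('urls.csv', newline='') as infile:
    for row in csv.reader(infile):
        if row:  # skip empty rows
            urls.append(row[0])
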
Hey,

This is my first answer, so I'll do my best to help.

The problem with the data being overwritten is that you loop over the URLs in one loop and then loop over the soup object in a separate loop.

That will always leave you with the last soup object once the loop has finished, so the best thing to do is either append each soup object to a list inside the URL loop, or actually query the soup object inside the URL loop:

soup_obj_list = []
for url in urls:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    soup_obj_list.append(soup)  # keep the soup for every page instead of overwriting it
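
For completeness, a rough sketch of the follow-up step (querying every stored soup afterwards, reusing the regex, the re import, and the file-writing code from your question) could look like this:

links = []
for soup in soup_obj_list:
    for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9-])+$")}):
        links.append(link.get('href'))

with open('output.csv', mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)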

Hope that solves your first problem. I can't really help with the CSV issue, though.
