Web scraping: Python -> BeautifulSoup -> loop through URLs (1 to 53) and save the results



This is the website I'm trying to scrape: http://livingwage.mit.edu/

The specific URLs run from

http://livingwage.mit.edu/states/01
http://livingwage.mit.edu/states/02
http://livingwage.mit.edu/states/04 (For some reason they skipped 03)
...all the way to...
http://livingwage.mit.edu/states/56

For each URL, I need the last row of the second table:

Example: http://livingwage.mit.edu/states/01

Required annual income before taxes $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Desired output:

Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403

Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424

After 2 hours of fiddling, this is what I have so far (I'm a beginner):

import requests, bs4
res = requests.get('http://livingwage.mit.edu/states/01')
res.raise_for_status()
states = bs4.BeautifulSoup(res.text)

state_name=states.select('h1')
table = states.find_all('table')[1]
rows = table.find_all('tr', 'odd')[4:]

result=[]
result.append(state_name)
result.append(rows)

When I look at state_name and rows in the Python console, they give me the HTML elements:

[<h1>Living Wag...Alabama</h1>]

[<tr class = "odd...   </td> </tr>]

Question 1: This is the output I want, but how can I get Python to give it to me as strings instead of HTML like the above?

Question 2: How do I loop through requests.get() from URL 01 to URL 56?

Thank you for your help.

And if you can offer a more efficient way to get to the rows variable in my code, I would really appreciate it, because the way I got there is not very Pythonic.

Get all the states from the initial page; then you can select the second table and use the css classes odd results to get the tr you need. There is no need to slice, as the class names are unique:

import requests
from bs4 import BeautifulSoup
from urllib.parse import  urljoin # python2 -> from urlparse import urljoin 

base = "http://livingwage.mit.edu"
res = requests.get(base)
res.raise_for_status()
states = []
# Get all the state urls and state names from the anchor tags on the base page:
# select all the anchors inside each li that are children of the
# ul with the css classes "states list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we split on "/" from the right -> "/states/51"
    # and join it to the base url. The anchor text holds the state name,
    # so we keep the full url and the state, e.g. ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))

def parse(soup):
    # Get the second table; css indexing starts at 1, so "table:nth-of-type(2)" selects the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # The row we want is the tr with the css classes "odd results". "td + td" starts from the
    # second td, skipping the first ("Required annual income before taxes"), and we call .text on each.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]

# Unpack the url and state from each tuple in our states list. 
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

If you run the code, you will see output like:

Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']
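
Since the title also mentions saving the results, here is a minimal sketch of how the loop above could write each state to a CSV file instead of printing it. It reuses the states list and the parse function defined above; the filename living_wages.csv is just an assumption made for this example:

import csv

# Write one row per state: the state name followed by the twelve income figures.
# "living_wages.csv" is an arbitrary filename chosen for this sketch.
with open("living_wages.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url, state in states:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        writer.writerow([state] + parse(soup))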

You could loop over a range of 1-53, but extracting the anchors from the base page also gives us the state name in a single step. Using the h1 from each page would give you output like Living Wage Calculation for Alabama, which you would then have to parse to get just the name, and that is not trivial considering some states have more than one word in their name.

Question 1: This is the output I want, but how can I get Python to give it to me as strings instead of HTML like the above?

You can get the text simply by doing something along these lines:

state_name=states.find('h1').text

The same can be applied to each of the rows.
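
For example, a minimal sketch continuing from the state_name and rows variables in your snippet (get_text with a separator flattens each tr element into one readable string):

state_name = states.find('h1').text
for row in rows:
    # get_text joins the text of all the cells in the row, separated by spaces
    print(row.get_text(" ", strip=True))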

Question 2: How do I loop through requests.get() from URL 01 to URL 56?

The same block of code can be put inside a loop from 1 to 56, like so:

for i in range(1,57):
    res = requests.get('http://livingwage.mit.edu/states/'+str(i).zfill(2))
    ...rest of the code...

zfill adds the leading zeros. Also, it is better to wrap requests.get in a try-except block, so that the loop continues gracefully even when a URL is bad.
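
A rough sketch of what that could look like, building on your original snippet; skipping a failed URL (such as the missing 03) and moving on is an assumption about the behaviour you want:

import requests, bs4

for i in range(1, 57):
    url = 'http://livingwage.mit.edu/states/' + str(i).zfill(2)
    try:
        res = requests.get(url)
        res.raise_for_status()
    except requests.exceptions.RequestException:
        # Skip state numbers that do not exist (e.g. 03) instead of stopping the loop.
        continue
    states = bs4.BeautifulSoup(res.text, 'html.parser')
    # ...rest of the code...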
