How do I use BeautifulSoup when an HTML element has no class name?



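I'm using the following code (slightly modified from an early example in Nathan Yau's "Visualize This") to scrape weather data from WUnderGround's website. As you can see, Python grabs the numeric data from elements with the class name "wx-data".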

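However, I'd also like to grab the average humidity from DailyHistory.html. The problem is that not all of the "span" elements have a class name, and the average humidity cell is one of them. How can I select this particular cell using BeautifulSoup and the code below?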

(Here is an example of a page being scraped. Open your browser's developer tools and search for "wx-data" to see the referenced "span" elements:

http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html)

import urllib2
from BeautifulSoup import BeautifulSoup
year = 2004    

#create comma-delim file
f = open(str(year) + '_LAXwunder_data.txt','w')
#iterate through month and day
for m in range(1,13):
    for d in range (1,32):
        #Chk if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4,6,9,11]) and d > 30:
            break
        # open wug url
        timestamp = str(year)+'0'+str(m)+'0'+str(d)
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/'+str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)
        #Get temp from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs = {'class':'wx-data'})[5].span.string
        #Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)
        #Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)
        #Build timestamp
        timestamp = str(year)+ mStamp + dStamp
        #Write timestamp and temp to file
        f.write(timestamp + ',' + dayTemp + '\n')
#done - close
f.close()

You can search for the cell that contains the label text, then move up to its parent cell and across to the next cell:

humidity = soup.find(text='Average Humidity')
next_cell = humidity.find_parent('td').find_next_sibling('td')
humidity_value = next_cell.string

I'm using BeautifulSoup version 4 here, not 3; you really want to upgrade, as version 3 was mothballed two years ago.

BeautifulSoup 3 can do this too; use findParent() and findNextSibling() there instead.

Demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get('http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html')
>>> soup = BeautifulSoup(response.content)
>>> humidity = soup.find(text='Average Humidity')
>>> next_cell = humidity.find_parent('td').find_next_sibling('td')
>>> next_cell.string
u'88'
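
The demo above assumes the label is always on the page. As a rough sketch of the same lookup with a guard for missing labels, the snippet below runs against a small inline table instead of the live site. The sample markup, the helper name value_next_to_label, and the use of Python 3 with the string= argument (available in BeautifulSoup 4.4 and later) are assumptions for illustration; the real page's HTML may be laid out differently.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the history table (assumed markup, not copied from the site).
SAMPLE_HTML = """
<table>
  <tr><td class="indent"><span>Average Humidity</span></td><td>88</td></tr>
  <tr><td class="indent"><span>Maximum Humidity</span></td><td>100</td></tr>
</table>
"""


def value_next_to_label(soup, label):
    """Return the text of the <td> after the cell containing `label`, or None."""
    # Find the text node that exactly matches the label.
    label_node = soup.find(string=label)
    if label_node is None:
        return None
    # Walk up to the enclosing table cell.
    label_cell = label_node.find_parent('td')
    if label_cell is None:
        return None
    # Step sideways to the next cell in the same row.
    value_cell = label_cell.find_next_sibling('td')
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
avg_humidity = value_next_to_label(soup, 'Average Humidity')
missing = value_next_to_label(soup, 'No Such Label')
```

Returning None instead of raising lets a long scraping loop skip a day whose page lacks the row, rather than stopping on an AttributeError.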

Many thanks to @Martijn_Pieters for helping produce this final script:

import requests
import urllib2
from bs4 import BeautifulSoup
year = 2003
#create comma-delim file
f = open(str(year) + '_LAXwunder_data.txt','w')
#change the year here, ->run

#iterate through month and day
for m in range(1,13):
    for d in range(1,32): #could sample every 5th day using range(1,32,5)
        #Chk if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4,6,9,11]) and d > 30:
            break
        # open wug url
        timestamp = str(year)+'.'+str(m)+'.'+str(d)
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/'+str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)
        #Get temp from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs = {'class':'wx-data'})[5].span.string
        #Get average humidity from the cell next to its label
        humidity = soup.find(text='Average Humidity')
        next_cell = humidity.find_parent('td').find_next_sibling('td')
        avg_humidity = next_cell.string
        #Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)
        #Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)
        #Build timestamp
        timestamp = str(year)+ mStamp + dStamp
        #Write timestamp, temp and humidity to file
        f.write(timestamp + ',' + dayTemp + ',' + avg_humidity + '\n')
        print dayTemp, avg_humidity
#done - close
f.close()
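
The month-length checks and manual zero-padding in the script above could also be left to the standard library. The following is a minimal sketch of that idea, not part of the original script: the helper names iter_days, history_url, and timestamp are hypothetical, and the URL pattern is simply the one used in the script. It behaves differently in one respect: it includes 29 February in leap years such as 2004, which the `d > 28` check skips.

```python
from datetime import date, timedelta

# Same base URL as in the script above.
BASE_URL = 'http://www.wunderground.com/history/airport/LAX/'


def iter_days(year):
    """Yield every calendar date in `year`, including 29 February in leap years."""
    day = date(year, 1, 1)
    while day.year == year:
        yield day
        day += timedelta(days=1)


def history_url(day):
    """Build the DailyHistory URL in the shape the script uses (no zero padding)."""
    return '%s%d/%d/%d/DailyHistory.html' % (BASE_URL, day.year, day.month, day.day)


def timestamp(day):
    """Format a date as a zero-padded YYYYMMDD string."""
    return day.strftime('%Y%m%d')


days_2004 = list(iter_days(2004))
first_url = history_url(days_2004[0])
first_stamp = timestamp(days_2004[0])
```

With these helpers the two nested loops reduce to a single `for day in iter_days(year):`, and the month and day formatting branches are no longer needed.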
