Scrape only hyperlinks ending in .ece from an HTML page using BeautifulSoup



I wrote some code to scrape only the hyperlinks ending in .ece. This is my code:

import os
import requests
import urllib2
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
r = requests.get(_URL)
soup = BeautifulSoup(r.text)
urls = []
names = []
newpath = r'D:\fyp\data set'
os.chdir(newpath)
name='testecmlinks'
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])
names_urls = zip(names, urls)
for name, url in names_urls:
    print url
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    pdf = open(name+'.txt', 'wb')
    pdf.write(res.read())
    pdf.close()

But I get the following error:

Traceback (most recent call last):
  File "D:/fyp/scripts/test.py", line 18, in <module>
    _FULLURL = _URL + link.get('href')
TypeError: cannot concatenate 'str' and 'NoneType' objects

Can you help me get the hyperlinks that end in .ece?

Try this. I hope it gets you all the hyperlinks ending in .ece from that page.

import requests
from bs4 import BeautifulSoup
response = requests.get("http://www.thehindu.com/archive/web/2017/08/08/").text
soup = BeautifulSoup(response,"lxml")
# the CSS attribute selector [href$='.ece'] matches hrefs ending in .ece
for link in soup.select("a[href$='.ece']"):
    print(link.get('href'))
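
For completeness, a minimal sketch (an assumption about the intended workflow, not the asker's exact setup) that combines this selector with the download step from the question; urljoin handles both relative and absolute hrefs, and the file name is simply derived from the last path segment:

import requests
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

base = "http://www.thehindu.com/archive/web/2017/08/08/"
soup = BeautifulSoup(requests.get(base).text, "lxml")

for link in soup.select("a[href$='.ece']"):
    url = urljoin(base, link['href'])
    # file name derived from the last path segment of the URL
    name = url.rstrip('/').rsplit('/', 1)[-1]
    with open(name + '.txt', 'wb') as f:
        f.write(requests.get(url).content)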

The error indicates that the result of link.get('href') is None. The filtering of the links is better done directly in the for loop using Beautiful Soup. Change the original code

...
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])
...

to this:

...
for i, link in enumerate(soup.find_all('a', href=re.compile(r'\.ece$'))):
...
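
Put together, the regex approach becomes the following sketch (note that re must be imported and the dot escaped, otherwise the pattern would also match, e.g., 'xece'):

import re
import requests
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
soup = BeautifulSoup(requests.get(_URL).text, 'lxml')

# find_all with href=<pattern> only yields anchors that have an href
# matching the pattern, so link.get('href') can no longer be None here
for link in soup.find_all('a', href=re.compile(r'\.ece$')):
    print(link['href'])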

You can find better solutions, but with your current code you have to check that link.get('href') is not None before adding it to _URL:

for link in soup.findAll('a'):
    url = link.get('href')  # returns the `href` value or `None`
    if url and url.endswith('.ece'):  # skip `None` and non-.ece links
        names_urls.append((_URL + url, url))
        # ... or download the file directly ...
        # rq = urllib2.Request(_URL + url)
        # res = urllib2.urlopen(rq)
        # pdf = open(url + '.txt', 'wb')
        # pdf.write(res.read())
        # pdf.close()
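
For illustration, a tiny self-contained example (the HTML snippet is made up) showing where the None comes from: anchors without an href attribute, such as named anchors, make link.get('href') return None, and concatenating None to _URL raises exactly the TypeError from the traceback.

from bs4 import BeautifulSoup

html = '<a name="top"></a><a href="story.ece">story</a>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.findAll('a'):
    print(a.get('href'))  # prints None, then story.ece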
