Downloading Word documents in Python



For my coursework I have to build a web scraper that scrapes the images, Word documents and PDFs from a website and downloads them to a file. I have the image download working, but when I change the code to download the documents or PDFs it finds nothing at all. I am scraping the site with BeautifulSoup, and I know the site has documents and PDFs on it, yet none of them get downloaded.

from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time
import os
import url
import hashlib
import re

url = 'http://www.soc.napier.ac.uk/~40009856/CW/'
path = ('c:\temp\')

def ensure_dir(path):
    directory = os.path.dirname(path)
    if not os.path.exists(path):
        os.makedirs(directory)
    return path

os.chdir(ensure_dir(path))

def webget(url):
    response = requests.get(url)
    html = response.content
    return html

def get_docs(url):
    soup = make_soup(url)
    docutments = [docs for docs in soup.findAll('doc')]
    print(str(len(docutments)) + " docutments found.")
    print('Downloading docutments to current working directory.')
    docutments_links = [each.get('src') for each in docutments]
    for each in docutments_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print('  An error occured. Continuing.')
    print('Done.')

if __name__ == '__main__':
    get_docs(url)

First of all, you should have a look at what .find_all() (and the other search methods) actually do.

The first argument of .find_all() is a tag name. Your images are

<img src='some_url'>

tags: you got all of them with soup.find_all('img'), extracted each file's URL from the src attribute and downloaded the files.
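That pattern only works because every <img> tag carries its file URL in a src attribute. As a minimal sketch of that image pass (the URL comes from the question; the rest is illustrative, not the asker's exact code):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'http://www.soc.napier.ac.uk/~40009856/CW/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# find_all('img') matches on the tag name; the file URL lives in src
for img in soup.find_all('img'):
    src = img.get('src')
    if src:
        print(urljoin(url, src))  # resolve relative paths against the page URL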

Now you are looking for tags like

<a href='some_url'></a>

where the URL contains ".doc". This should do it:

soup.select('a[href*=".doc"]')
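Wired into the question's get_docs, that selector replaces the findAll('doc') call, and the file URL now comes from href rather than src. A hedged sketch reusing the question's imports (requests, shutil, urljoin, BeautifulSoup), untested against the live page:

def get_docs(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # anchor tags whose href contains ".doc" (this also matches .docx)
    links = [a['href'] for a in soup.select('a[href*=".doc"]')]
    print(str(len(links)) + ' documents found.')
    for href in links:
        filename = href.strip().split('/')[-1]
        src = urljoin(url, href)  # complete relative paths
        print('Getting: ' + filename)
        response = requests.get(src, stream=True)
        with open(filename, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)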

More of an aside, but you could use the OR syntax of CSS selectors to combine pdf, docx etc. Note that you will still need to complete some of the paths, e.g. by prefixing them with "http://www.soc.napier.ac.uk/~40009856/CW/". The snippet below uses the attribute = value CSS selector syntax with the $ operator (meaning the attribute value ends with the given string):

from bs4 import BeautifulSoup
import requests

url = 'http://www.soc.napier.ac.uk/~40009856/CW/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
items = soup.select("[href$='.docx'], [href$='.pdf'], img[src]")
print([item['href'] if 'href' in item.attrs else item['src'] for item in items])
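Since some of those href/src values are relative, a follow-up step would be to resolve them before downloading. A minimal sketch continuing from the snippet above (the urljoin and os.path.basename steps are my additions, not part of the original answer):

import os
from urllib.parse import urljoin

for item in items:
    link = item['href'] if 'href' in item.attrs else item['src']
    full = urljoin(url, link)  # prefix relative paths with the page URL
    with open(os.path.basename(full), 'wb') as out_file:
        out_file.write(requests.get(full).content)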
