I'm trying to build a web crawler that generates a text file for each of several different websites. After it crawls a site, it should collect all of the links on that site. However, I ran into a problem while crawling Wikipedia. The Python script gives me this error:
Traceback (most recent call last):
File "/home/banana/Desktop/Search engine/data/crawler?.py", line 22, in <module>
urlwaitinglist.write(link.get('href'))
TypeError: write() argument must be str, not None
I dug into it further and had it print the links it found, and there was a "None" at the top of the output. I'd like to know whether there is a function to check if a variable has any value.
Here is the code I have written so far:

from bs4 import BeautifulSoup
import os
import requests
import random
import re
toscan = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
url = toscan
source_code = requests.get(url)
plain_text = source_code.text
removal_list = ["http://", "https://", "/"]
for word in removal_list:
    toscan = toscan.replace(word, "")

soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    urlwaitinglist = open("/home/banana/Desktop/Search engine/data/toscan", "a")
    urlwaitinglist.write('\n')
    urlwaitinglist.write(link.get('href'))
    urlwaitinglist.close()
print(soup.get_text())
directory = "/home/banana/Desktop/Search engine/data/Crawled Data/"
results = soup.get_text()
results = results.strip()
f = open("/home/banana/Desktop/Search engine/data/Crawled Data/" + toscan + ".txt", "w")
f.write(url)
f.write('\n')
f.write(results)
f.close()
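The traceback happens because not every `<a>` tag has an `href` attribute, so `link.get('href')` sometimes returns `None`, and `write()` refuses it. Python's idiomatic way to test whether a variable holds a value is an explicit `is not None` check (or a bare truthiness test). A minimal sketch, using a hypothetical list of hrefs in place of the live scrape:

```python
# Anchor tags without an href attribute make .get('href') return None.
hrefs = ["https://example.org", None, "/wiki/Main_Page", None]

# Keep only entries that actually hold a value. `is not None` is the explicit
# test; a bare truthiness check (`if href:`) would also drop empty strings.
valid = [href for href in hrefs if href is not None]
```

Only the entries in `valid` are safe to pass to `write()`.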
The following version collects only absolute http/https links, which also skips anchors whose href is None:

from bs4 import BeautifulSoup
import os
import requests
import random
import re
file_directory = './' # your specified directory location
filename = 'urls.txt' # your specified filename
url = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
res = requests.get(url)
html = res.text
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a'):
    link = link.get('href')
    print(link)
    match = re.search('^(http|https)://', str(link))
    if match:
        links.append(str(link))

with open(file_directory + filename, 'w') as file:
    for link in links:
        file.write(link + '\n')
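Note that the regex above keeps only absolute URLs, while most of Wikipedia's internal links are relative paths like `/wiki/...`, so they are all discarded. If you want those too, a sketch using the standard library's `urljoin` (the `hrefs` list here is a hypothetical stand-in for the scraped values):

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
hrefs = ["/wiki/Portal:History", "https://example.org/page", None, "#cite_note-1"]

absolute = []
for href in hrefs:
    # Skip missing hrefs and in-page fragment anchors.
    if not href or href.startswith("#"):
        continue
    # Resolve relative paths against the page URL; absolute URLs pass through unchanged.
    absolute.append(urljoin(base, href))
```

This keeps every followable link in absolute form, ready for the crawl queue.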