I'm trying to build a web crawler that generates a text file for each of several different websites. After it crawls a site, it should collect all of the links on that site. However, I ran into a problem while crawling Wikipedia. The Python script gives me this error:
Traceback (most recent call last):
File "/home/banana/Desktop/Search engine/data/crawler?.py", line 22, in <module>
urlwaitinglist.write(link.get('href'))
TypeError: write() argument must be str, not None
I dug into it further and had it print the links it found, and there was a "None" at the top of the output. I'd like to know whether there is a function to check if a variable has any value.
Here is the code I have written so far:

from bs4 import BeautifulSoup
import os
import requests
import random
import re
toscan = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
url = toscan
source_code = requests.get(url)
plain_text = source_code.text
removal_list = ["http://", "https://", "/"]
for word in removal_list:
    toscan = toscan.replace(word, "")

soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    urlwaitinglist = open("/home/banana/Desktop/Search engine/data/toscan", "a")
    urlwaitinglist.write('\n')
    urlwaitinglist.write(link.get('href'))
    urlwaitinglist.close()
print(soup.get_text())
directory = "/home/banana/Desktop/Search engine/data/Crawled Data/"
results = soup.get_text()
results = results.strip()
f = open("/home/banana/Desktop/Search engine/data/Crawled Data/" + toscan + ".txt", "w")
f.write(url)
f.write('\n')
f.write(results)
f.close()
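The traceback happens because not every `<a>` tag has an `href` attribute, so `link.get('href')` sometimes returns `None`, and `write()` refuses it. Python's idiomatic way to test whether a variable holds a value is an explicit `is not None` check (or a bare truthiness test). A minimal sketch, using a hypothetical list of hrefs in place of the live scrape:

```python
# Anchor tags without an href attribute make .get('href') return None.
hrefs = ["https://example.org", None, "/wiki/Main_Page", None]

# Keep only entries that actually hold a value. `is not None` is the explicit
# test; a bare truthiness check (`if href:`) would also drop empty strings.
valid = [href for href in hrefs if href is not None]
```

Only the entries in `valid` are safe to pass to `write()`.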
The following version collects only absolute http/https links, which also skips anchors whose href is None:

from bs4 import BeautifulSoup
import os
import requests
import random
import re
file_directory = './' # your specified directory location
filename = 'urls.txt' # your specified filename
url = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
res = requests.get(url)
html = res.text
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a'):
    link = link.get('href')
    print(link)
    match = re.search('^(http|https)://', str(link))
    if match:
        links.append(str(link))

with open(file_directory + filename, 'w') as file:
    for link in links:
        file.write(link + '\n')
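Note that the regex above keeps only absolute URLs, while most of Wikipedia's internal links are relative paths like `/wiki/...`, so they are all discarded. If you want those too, a sketch using the standard library's `urljoin` (the `hrefs` list here is a hypothetical stand-in for the scraped values):

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
hrefs = ["/wiki/Portal:History", "https://example.org/page", None, "#cite_note-1"]

absolute = []
for href in hrefs:
    # Skip missing hrefs and in-page fragment anchors.
    if not href or href.startswith("#"):
        continue
    # Resolve relative paths against the page URL; absolute URLs pass through unchanged.
    absolute.append(urljoin(base, href))
```

This keeps every followable link in absolute form, ready for the crawl queue.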