Problem with a Python web scraper in PyCharm (beginner)

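I recently started learning Python. While learning web scraping, I tried scraping Google News as an example. After running the code, I get the message "Process finished with exit code 0", but no results are printed. If I change the url to "https://yahoo.com", I do get results. Can anyone point out what I am doing wrong?

Code: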

import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)

news = "https://news.google.com/"
Scraper(news).scrape()

Try this:

import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                # Skip <a> tags that have no href attribute
                continue
            else:
                # Print every link, with no "html" filter
                print("\n" + url)

if __name__ == '__main__':
    news = "https://news.google.com/"
    Scraper(news).scrape()

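Originally, you checked each link to see whether it contained "html", and printed it only if it did. I assume the example you were following was checking whether a link ends in ".html". If none of the links on the page contain "html", nothing is printed and the script exits normally, which matches what you saw.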
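
To make the difference between the two checks concrete, here is a minimal sketch. The sample hrefs are invented for illustration and are not taken from Google News or Yahoo, and the helper names contains_html and ends_with_html are hypothetical.

```python
# Hypothetical href values, invented for illustration only.
sample_hrefs = [
    "https://example.com/story.html",
    "https://example.com/html-basics/",
    "./articles/abc123",
    None,  # tag.get("href") returns None when an <a> tag has no href
]


def contains_html(url):
    """The check from the question: true if "html" appears anywhere in the link."""
    return url is not None and "html" in url


def ends_with_html(url):
    """A stricter check: true only if the link ends in ".html"."""
    return url is not None and url.endswith(".html")


substring_matches = [u for u in sample_hrefs if contains_html(u)]
suffix_matches = [u for u in sample_hrefs if ends_with_html(u)]

print(substring_matches)
print(suffix_matches)
```

Both checks come back empty for a page whose links never contain "html", which would be consistent with the empty output described in the question.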

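Beautiful Soup works well, but you need to look at the source of the site you are scraping to understand how its markup is laid out. The developer tools in Chrome are well suited to this; pressing F12 opens them quickly.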
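
Alongside the browser tools, it can help to summarize the links a page actually contains before deciding how to filter them. The sketch below is only an illustration: it uses the standard library's HTMLParser in place of Beautiful Soup so that it runs without extra installs, the sample markup stands in for a downloaded page, and the names LinkCollector and summarize are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<a href="https://example.com/story.html">Story</a>
<a href="./articles/abc123">Article</a>
<a href="/topics/world">World</a>
<a>No href here</a>
"""


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def summarize(hrefs):
    """Counts links by kind: absolute, relative, and ending in .html."""
    summary = Counter()
    for href in hrefs:
        # A link with no scheme (no "https:" etc.) is treated as relative.
        kind = "absolute" if urlparse(href).scheme else "relative"
        summary[kind] += 1
        if href.endswith(".html"):
            summary["ends_with_html"] += 1
    return summary


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
link_summary = summarize(collector.hrefs)
print(collector.hrefs)
print(dict(link_summary))
```

If a summary like this shows few or no links ending in ".html", a filter on "html" will hide most of the page's links.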

I removed:

if "html" in url:
    print("\n" + url)

and replaced it with:

else:
    print("\n" + url)
