Problem with a Python web scraper in PyCharm (beginner)



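I recently started learning Python. While learning web scraping, I followed an example that scrapes Google News. When I run the code, I get the message "Process finished with exit code 0", but no results are printed. If I change the URL to "https://yahoo.com", I do get results. Can anyone point out what I'm doing wrong?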

Code:

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            # Only hrefs that contain the text "html" are printed
            if "html" in url:
                print("\n" + url)


news = "https://news.google.com/"
Scraper(news).scrape()

Try this:

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                # Skip <a> tags that have no href attribute
                continue
            else:
                # Print every href, with no "html" filter
                print("\n" + url)


if __name__ == '__main__':
    news = "https://news.google.com/"
    Scraper(news).scrape()

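Originally, you checked every link to see whether it contained "html". I assume the example you were following checks whether the link ends with ".html". If none of the links on the page contain that text, nothing is printed and the script still exits normally with code 0.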
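To make the difference between the two checks concrete, here is a minimal sketch. The helper names and the sample hrefs are made up for illustration; they are not taken from Google News or from your code.

```python
def contains_html(url):
    """The check from the question: "html" appears anywhere in the href."""
    return "html" in url


def ends_with_html(url):
    """A stricter check: the href ends with the ".html" extension."""
    return url.endswith(".html")


# Hypothetical hrefs, invented for this example
sample_hrefs = [
    "https://example.com/story.html",
    "https://example.com/html-basics/",
    "./articles/abc123",
]

for href in sample_hrefs:
    # Print the href followed by the result of each check
    print(href, contains_html(href), ends_with_html(href))
```

An href like the third one passes neither check, so a page made up only of links like that would produce no output at all.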

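Beautiful Soup works well, but you need to view the source of the site you are scraping to understand how its markup is laid out. The developer tools in Chrome are great for this; F12 opens them quickly.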

I removed:

            if "html" in url:
                print("\n" + url)

and replaced it with:

            else:
                print("\n" + url)
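Once every href is printed, some of them may turn out to be relative paths rather than full URLs. That is an assumption on my part; I have not checked what Google News serves today. If so, a sketch like the following, using urljoin from the standard library, turns them into absolute links. The function name absolute_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin


def absolute_links(base, hrefs):
    """Resolve each href against the page URL, skipping missing ones."""
    return [urljoin(base, href) for href in hrefs if href is not None]


base = "https://news.google.com/"
# Hypothetical hrefs: two relative, one missing, one already absolute
sample_hrefs = [
    "./articles/abc123",
    "/topics/xyz",
    None,
    "https://example.com/a.html",
]

for link in absolute_links(base, sample_hrefs):
    print(link)
```

Absolute hrefs pass through urljoin unchanged, so the same call is safe for both kinds of link.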
