Can I add "www" to any domain?



I modified the web scraping code from https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/.

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from sys import argv
from bs4 import BeautifulSoup

# Use result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

r = Render(url)
result = unicode(r.frame.toHtml().toUtf8(), encoding="UTF-8")
soup = BeautifulSoup(result, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text().encode("utf-8")

With this code, I found that "nate.com" returns no text, but "www.nate.com" does. So I am trying to add "www" to every domain.

  1. Are there any websites where I should not add "www" to the domain?

(Something like this:)

if "www" in url:
    url = url.split("www")[1]
url = "www" + url
  2. (Optional) Why does "nate.com" return no text while "www.nate.com" does? I found that in Chrome it redirects to "www.nate.com".

Any comments are welcome.

Are there any websites where I should not add "www" to the domain?

Yes. For example, huji.ac.il:

$ http http://huji.ac.il
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 94
Cache-Control: max-age=300
Connection: Keep-Alive
Content-Length: 173
Content-Type: text/html
Date: Fri, 25 Aug 2017 01:16:23 GMT
Expires: Fri, 25 Aug 2017 01:19:49 GMT
Server: Apache/2.2.15 (Red Hat)
<HTML>
<HEAD>
<meta http-equiv="refresh" content="0; URL=http://new.huji.ac.il">
</HEAD>
<BODY>
<a href="http://new.huji.ac.il">click here</a> jumping ....
</BODY>
</HTML>

OK, now let's try www.huji.ac.il:

$ http http://www.huji.ac.il
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=300
Connection: close
Content-Length: 173
Content-Type: text/html
Date: Fri, 25 Aug 2017 01:16:31 GMT
Expires: Fri, 25 Aug 2017 01:21:31 GMT
Server: Apache/2.2.15 (Red Hat)
<HTML>
<HEAD>
<meta http-equiv="refresh" content="0; URL=http://new.huji.ac.il">
</HEAD>
<BODY>
<a href="http://new.huji.ac.il">click here</a> jumping ....
</BODY>
</HTML>

Either way it redirects to new.huji.ac.il, so let's try that host with www:

$ http http://www.new.huji.ac.il
http: error: ConnectionError: HTTPConnectionPool(host='www.new.huji.ac.il', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f54f57fdd90>: Failed to establish a new connection: [Errno -2] Name or service not known',)) while doing GET request to URL: http://www.new.huji.ac.il/

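So in this example adding www causes an error: www.new.huji.ac.il does not resolve at all ("Name or service not known").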
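One way to avoid this kind of failure is to use the "www." variant only when it actually resolves. The following is a minimal sketch under that assumption, written for Python 3 with only the standard library; `resolves` and `pick_host` are hypothetical helper names, not part of the original code. A name that resolves is not guaranteed to serve the same site, so treat this as a heuristic.

```python
import socket


def resolves(host):
    """Return True if DNS can resolve `host` (needs network access)."""
    try:
        socket.getaddrinfo(host, 80)
        return True
    except socket.gaierror:
        return False


def pick_host(host, resolver=resolves):
    """Prefer the "www." variant of `host`, but only if it resolves.

    `resolver` is injectable so the logic can be checked without a network.
    """
    if host.startswith("www."):
        return host
    candidate = "www." + host
    return candidate if resolver(candidate) else host
```

With the hosts above, `pick_host("new.huji.ac.il")` would fall back to the bare name whenever www.new.huji.ac.il fails to resolve, instead of raising a connection error later.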

Why does "nate.com" return no text while "www.nate.com" does? I found that in Chrome it redirects to "www.nate.com".

Because "nate.com" uses JavaScript to redirect:

$ http http://nate.com
HTTP/1.1 200 OK
Cache-Control: no-store, no-cache, must-revalidate
Connection: close
Content-Encoding: gzip
Content-Language: ko
Content-Length: 88
Content-Type: text/html; charset=utf-8
Date: Fri, 25 Aug 2017 01:13:34 GMT
Pragma: no-cache
Server: Apache
Vary: Accept-Encoding
<script type='text/javascript'>location.href='http://www.nate.com';</script>

As pointed out in the comments, you should add a feature to your code that follows redirects.
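Both redirects shown in this answer happen in the page body (a meta refresh for huji.ac.il, a `location.href` assignment for nate.com), so an HTTP client that only follows 3xx responses will not see them. As a rough sketch, assuming Python 3 and only the standard library, the target could be pulled out of the returned HTML and loaded in a second request. The function name `find_client_redirect` and both regular expressions are illustrative heuristics, not a complete HTML or JavaScript parser.

```python
import re
from urllib.parse import urljoin

# Matches e.g. <meta http-equiv="refresh" content="0; URL=http://example.com">
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["']?\s*\d+\s*;\s*url=([^"'>\s]+)""",
    re.IGNORECASE,
)

# Matches e.g. location.href='http://example.com' or window.location = "..."
JS_LOCATION = re.compile(
    r"""location(?:\.href)?\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def find_client_redirect(html, base_url):
    """Return the absolute redirect target found in `html`, or None."""
    for pattern in (META_REFRESH, JS_LOCATION):
        match = pattern.search(html)
        if match:
            # Resolve relative targets against the page that was fetched.
            return urljoin(base_url, match.group(1))
    return None
```

A scraper could call this on the first response and, when it returns a URL, render that URL instead, ideally with a small limit on the number of hops to avoid redirect loops.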
