HttpURLConnection returns "Moved Permanently" when fetching a page title



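This is the code I wrote in Groovy to get the page title from a URL. For some websites, however, I get "Moved Permanently" as the title, which I think is caused by a 301 redirect. How can I avoid this and have HttpURLConnection follow the redirect to the right URL so that I get the correct page title?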

For example, for this page I get "Moved Permanently" instead of the correct page title: http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html

def con = (HttpURLConnection) new URL(url).openConnection()
con.connect()
def inputStream = con.inputStream
HtmlCleaner cleaner = new HtmlCleaner()
CleanerProperties props = cleaner.getProperties()
TagNode node = cleaner.clean(inputStream)
TagNode titleNode = node.findElementByName("title", true);
def title = titleNode.getText().toString()
title = StringEscapeUtils.unescapeHtml(title).trim()
title = title.replace("\n", "");
return title

I can get this to work if I manage the redirects myself...

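I think the problem is that the site expects cookies to be sent back partway through the redirect chain, and if it doesn't get them, it sends you to a login page.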

This code obviously needs some cleanup (and there is probably a better way to do it), but it shows how I could extract the title:
@Grab( 'net.sourceforge.htmlcleaner:htmlcleaner:2.2' )
@Grab( 'commons-lang:commons-lang:2.6' )
import org.apache.commons.lang.StringEscapeUtils
import org.htmlcleaner.*
String location = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html'
String cookie = null
String pageContent = ''
while( location ) {
  new URL( location ).openConnection().with { con ->
    // We'll do redirects ourselves
    con.instanceFollowRedirects = false
    // If we got a cookie last time round, then add it to our request
    if( cookie ) con.setRequestProperty( 'Cookie', cookie )
    con.connect()
    // Get the response code, and the location to jump to (in case of a redirect)
    int responseCode = con.responseCode
    location = con.getHeaderField( "Location" )
    // Try and get a cookie the site will set, we will pass this next time round
    cookie = con.getHeaderField( "Set-Cookie" )
    // Read the HTML and close the inputstream
    pageContent = con.inputStream.withReader { it.text }
  }
}
// Then, clean pageContent and get the title
HtmlCleaner cleaner = new HtmlCleaner()
CleanerProperties props = cleaner.getProperties()
TagNode node = cleaner.clean( pageContent )
TagNode titleNode = node.findElementByName("title", true);
def title = titleNode.text.toString()
title = StringEscapeUtils.unescapeHtml( title ).trim()
title = title.replace( "\n", "" )
println title
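
The loop above passes along a single raw `Set-Cookie` value, attributes such as `Path` and `Expires` included. Here is a rough sketch of one way to tidy that up, assuming the response's `Set-Cookie` values are available as a list. The helper name `toCookieHeader` and the sample header values are made up for illustration and are not part of the code above.

```groovy
// Hypothetical helper (an illustration only): build one Cookie request header
// from the raw Set-Cookie values of a response, keeping just "name=value".
String toCookieHeader( List<String> setCookieValues ) {
  if( !setCookieValues ) return null
  // Everything after the first ';' is an attribute (Path, Expires, ...), so drop it
  def pairs = setCookieValues.collect { it.tokenize( ';' )[ 0 ]?.trim() }
  // Skip empty entries and join the rest the way a Cookie header expects
  return pairs.findAll { it }.join( '; ' )
}

// Made-up sample values, just to show the shape of the result
println toCookieHeader( [ 'a=1; Path=/; HttpOnly', 'b=2; Expires=Wed, 21 Oct 2015 07:28:00 GMT' ] )
// prints: a=1; b=2
```

Inside the loop, the list could probably be taken from `con.headerFields[ 'Set-Cookie' ]` (which is null when the header is absent), and the result passed to `con.setRequestProperty( 'Cookie', ... )` on the next request.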

Hope it helps!

You need to call setInstanceFollowRedirects(true) on the HttpURLConnection. That is, after the first line, insert con.setInstanceFollowRedirects(true)
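
A minimal sketch of where that call would go, using the URL from the question. As far as I know, HttpURLConnection already follows redirects by default and does not follow a redirect that switches protocol (for example http to https), so this setting alone may not be enough for every site.

```groovy
// URL taken from the question; any http(s) address works for this sketch
def url = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html'

// openConnection() only creates the connection object; nothing is sent yet
def con = (HttpURLConnection) new URL( url ).openConnection()

// Must be set before connect() or before reading the response
con.setInstanceFollowRedirects( true )

// con.connect() and the HtmlCleaner code from the question would follow here
```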
