How to parse a website and get information

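I'm trying to parse a website. What I've done so far is download the page source and traverse it with Nokogiri to pull out the information I need, such as links and content. I already have a script that extracts the data. However, I've run into a problem: one of the links only works when you click it on the live site.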

Here is a sample of the source I'm traversing.

<div class="story-item-content group">
<div class="story-item-details">
  <h3 class="story-item-title">
    <a href="/story/r/how_not_to_fix_your_computer_part_2" target="_blank" class="external-link ">How NOT to fix your computer, part 2.</a>
    <span class="external-link-icon"></span>                                            
    </h3>
    <p class="story-item-description">
         <a href="/search?q=site:zug.com" class="story-item-source" title="More stories from zug.com">zug.com</a>                            <a href="/news/technology/how_not_to_fix_your_computer_part_2" class="story-item-teaser">&mdash; After you read this you should understand what not to do.
        <span class="timestamp">21 hr 59 min ago</span></a>
        <a class="crawl4link" href="http://crawl4.digg.internal/permalink/view/how_not_to_fix_your_computer_part_2">View in Crawl 4</a>
    </p>
</div>

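On line 4, the link href="/story/r/how_not_to_fix_your_computer_part_2"

only works on the live site. When I download the source and click the link, it doesn't work. I'm guessing the link is stored on the server. Any idea how I can get the full link? I'd like a script that follows that link so I can get the working URL. Does anyone know how to do this? Thanks a lot.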

That URL is a relative URL,

so if the site you are visiting is:

http://mywebsite.com/index.html

then the full link is

http://mywebsite.com/story/r/how_not_to_fix_your_computer_part_2
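Since the asker is already working in Ruby, the same resolution can be sketched with the standard library's `URI.join`. This is only an illustration: the helper name `absolutize` is made up here, and the page URL is the placeholder from the answer above, not the real site.

```ruby
require "uri"

# Resolve an href taken from a downloaded page against the URL the page
# was originally fetched from. An href that is already absolute comes
# back unchanged.
# NOTE: `absolutize` is a hypothetical helper name, not a library method.
def absolutize(page_url, href)
  URI.join(page_url, href).to_s
end

page_url = "http://mywebsite.com/index.html" # assumed: where the page came from
href     = "/story/r/how_not_to_fix_your_computer_part_2"

puts absolutize(page_url, href)
```

Because the href starts with a slash, only the scheme and host of the page URL are kept; an href without a leading slash would instead be resolved against the page's directory.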

This is a relative link, relative to the root of the website. Just prepend the domain name (i.e. example.com/story/r/how_not_to_fix_your_computer_part_2).

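The reason clicking the link doesn't work is that the href value is relative: it is resolved against the location the page was loaded from. Once you download the page to your local machine, that location is no longer the original domain, so the browser resolves the link locally instead, for example to http://localhost/story/r/how_not_to_fix_your_computer_part_2. Since there is no file or resource at that URL, the request fails.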

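What you need to do is change the href value to an absolute URL by prefixing it with the original domain (i.e. digg.com/story/r/how_not_to_fix_your_computer_part_2). Then it will work when you click it from your local drive.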

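You don't need to worry about the number that gets added to the URL once it is finally resolved; that is handled by the resource at the digg.com/story/r/how_not_to_fix_your_computer_part_2 URL.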
