Extracting links from a Google page in bash



I'm writing a bash script that collects all the links from a Google results page. I fetch the page with the w3m utility, using the following script:

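#!/bin/bash
# Performs a Google search using the word given as the first argument.
word=$1
touch .google
# Quote the variable so the test also works when it is empty or contains spaces.
if [ -z "$word" ]
then
    echo "search word missing!" >&2
    echo "Aborting..." >&2
    exit 1
fi

a="www.google.com/search?q="
search=$a$word
# Quote the URL so the shell does not glob-expand the '?'.
w3m -no-cookie "$search" > .google
sleep 1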

Next, I need to get all the websites from this page. My idea was to take every string that starts with www. and ends with /:

echo `grep -wo "www[^/]*" .google`> .temp

The problem is that this misses many links that don't start with www, and when a site doesn't end with /, I risk breaking everything.

Is there a better way to get the URLs out of this response?

Link extraction is a hard problem. However, the lynx program has a handy -dump option that lets you skip most (or all) of the HTML parsing.

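Specifically, note the References section at the bottom of the output. You can take the output from that line onward and strip the leading list numbers: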

$ lynx -dump 'http://www.seomoz.org/'
   #[1]RSS 2.0 [2]publisher
   [3]SEOmoz
     * [4]Log in
     * [5]Sign up
     * [6]Help
          + [7]Help Resources
          + [8]Support Forums
          + [9]Request a Feature
          + [10]Contact Us
     * [11]Features
     * [12]Pricing & Plans
     * [13]Community
          + [14]SEO Blog
          + [15]YOUmoz User Blog
          + [16]Top Users
          + [17]Events
          + [18]Recommended Companies
     * [19]Resources
          + [20]Learn SEO
          + [21]SEO Tools
          + [22]PRO Q&A Forum
          + [23]Mozscape API
     * [24]Blog
          + [25]SEO Blog
          + [26]YOUmoz User Blog
     * [27]About
          + [28]Our TAGFEE Mission
          + [29]Meet the Mozzers
          + [30]Contact Us
          + [31]Join Our Team
          + [32]Press & Awards
          + [33]Events
   Search SEOmoz
   ____________________ Search
SEO & Social Monitoring
Made Simple.
   SEOmoz PRO combines SEO management, social media monitoring, actionable
   recommendations, and so much more in one easy-to-use platform. Try it
   free for 30 days.
   [34]Try it for Free!
   [35]Take a tour of SEOmoz PRO
   or see [36]plans & pricing
     * Campaign Overview
     * Social Dashboard
     * Crawl Diagnostics
     * Dashboard
     * Google Analytics
     * Link Analysis
Loved By...
     * Zillow
     * Disney
     * Overstock
     * Best Buy
     * Yelp
     * Sun Microsystems

   Roger Mozbot
Be My Buddy...
     * [37]RSS
     * [38]Twitter
     * [39]Facebook
     * [40]Google+
Effectively Manage Your SEO and Monitor Your Social Media
   [41]Link Analysis
   Analyze links and track key performance metrics in an efficient
   all-in-one dashboard.
   [42]Identify SEO Issues
   Identify critical SEO issues and get actionable recommendations.
   [43]Monitor Changes
   Automatically monitor changes to your rankings and take control of your
   organic traffic.
   Avinash Kaushik
     "SEOmoz tools provide best of class data. Their tools are a
     must-have for marketers looking to optimize their organic search
     results."
Avinash Kaushik,
   Author, Web Analytics 1.0: An Hour A Day
   Patrick Altoft
     "SEOmoz has enabled us to scale our link-building process quickly
     without compromising on quality."
Patrick Altoft,
   CEO, Branded3
Latest from the SEOmoz Blog
     __________________________________________________________________
   [44]jennita
[45]Winners of #MozCation 2012
   Posted by [46]jennita on 08/04/2012
   Whoa. Ever have one of those times where your expectations are
   completely blown out of the water? Well that's what happened during
   this year's nomination for a MozCation. Wait, wait, wait, before I get
   too far ahead of myself, I...
   [47]Read Full Entry
   13
   2
   [48]13 Comments
     __________________________________________________________________
Latest from the Community YouMoz Blog
     __________________________________________________________________
   [49]larry.kim
[50]Does SEO Even Work for Small Businesses?
   Posted by [51]larry.kim on 08/03/2012
   Clicks on paid search listings beat out organic listings by nearly a
   2:1 margin for keywords with high commercial intent in the US. Is SEO
   still a viable marketing tactic for the average small business owner?
   [52]Read Full Entry
   17
   3
   [53]28 Comments
     __________________________________________________________________
Voted Best SEO Tool 2010!
   [54]Try it for Free!
Looking for SEO consulting?
   SEOmoz doesn't provide consulting, but our friends at [55]Distilled
   still do. Rock on!
   Copyright ? 1996-2012 SEOmoz. All Rights Reserved.
Product and Tools
     * [56]SEOmoz PRO
     * [57]Pricing and Plans
     * [58]Open Site Explorer
     * [59]SEO Toolbar
     * [60]Mozscape API
     * [61]More SEO Tools
Company
     * [62]About
     * [63]SEO Blog
     * [64]YOUmoz Blog
     * [65]Affiliate Program
     * [66]Terms & Privacy Policy
     * [67]PRO Perks
Popular Content
     * [68]Link Building
     * [69]Reputation Management
     * [70]Analytics
     * [71]Social Media
     * [72]Content & Blogging
     * [73]See All Categories
Stay in Touch
     *
          + [74]RSS
          + [75]Twitter
          + [76]Facebook
          + [77]LinkedIn
     *

    SEOmoz
    119 Pine St. Suite 400
    Seattle, WA 98101
    206.632.3171
     * [78]Contact Us
     * [79]Sitemap
References
   1. http://feeds.feedburner.com/seomoz
   2. https://plus.google.com/112544075040456048636
   3. http://www.seomoz.org/
   4. https://www.seomoz.org/users/login
   5. https://www.seomoz.org/users/register
   6. http://www.seomoz.org/
   7. http://www.seomoz.org/help
   8. http://www.seomoz.org/q
   9. http://seomoz.zendesk.com/forums/293194-seomoz-PRO-feature-requests
  10. http://www.seomoz.org/about/contact
  11. http://www.seomoz.org/features
  12. http://www.seomoz.org/plans
  13. http://www.seomoz.org/community
  14. http://www.seomoz.org/blog
  15. http://www.seomoz.org/ugc
  16. http://www.seomoz.org/users
  17. http://www.seomoz.org/about/events
  18. http://www.seomoz.org/article/recommended
  19. http://www.seomoz.org/resources
  20. http://www.seomoz.org/learn-seo
  21. http://www.seomoz.org/tools
  22. http://www.seomoz.org/q
  23. http://www.seomoz.org/api
  24. http://www.seomoz.org/blog
  25. http://www.seomoz.org/blog
  26. http://www.seomoz.org/ugc
  27. http://www.seomoz.org/about
  28. http://www.seomoz.org/about/mission
  29. http://www.seomoz.org/about/team
  30. http://www.seomoz.org/about/contact
  31. http://www.seomoz.org/about/jobs
  32. http://www.seomoz.org/about/press
  33. http://www.seomoz.org/about/seo-events
  34. http://www.seomoz.org/cart/freetrial?pg=home
  35. http://www.seomoz.org/features
  36. http://www.seomoz.org/plans
  37. http://feeds.feedburner.com/seomoz
  38. http://twitter.com/seomoz
  39. http://www.facebook.com/SEOmoz
  40. https://plus.google.com/112544075040456048636?prsrc=3
  41. http://www.seomoz.org/features
  42. http://www.seomoz.org/features
  43. http://www.seomoz.org/features
  44. http://www.seomoz.org/users/profile/81197
  45. http://www.seomoz.org/blog/winners-mozcation-2012
  46. http://www.seomoz.org/users/profile/81197
  47. http://www.seomoz.org/blog/winners-mozcation-2012
  48. http://www.seomoz.org/blog/winners-mozcation-2012#comments
  49. http://www.seomoz.org/users/profile/402613
  50. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses
  51. http://www.seomoz.org/users/profile/402613
  52. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses
  53. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses#comments
  54. http://www.seomoz.org/cart/freetrial?pg=features
  55. http://www.seomoz.org/dp/distilled
  56. http://www.seomoz.org/features
  57. http://www.seomoz.org/plans
  58. http://www.opensiteexplorer.org/
  59. http://www.seomoz.org/seo-toolbar
  60. http://www.seomoz.org/api
  61. http://www.seomoz.org/tools
  62. http://www.seomoz.org/about
  63. http://www.seomoz.org/blog
  64. http://www.seomoz.org/ugc
  65. http://www.seomoz.org/dp/seomoz-pro-affiliate-program
  66. http://www.seomoz.org/terms-and-privacy
  67. http://www.seomoz.org/pro-perks
  68. http://www.seomoz.org/blog/category/4
  69. http://www.seomoz.org/blog/category/19
  70. http://www.seomoz.org/blog/category/8
  71. http://www.seomoz.org/blog/category/18
  72. http://www.seomoz.org/blog/category/1
  73. http://www.seomoz.org/blog
  74. http://feeds.feedburner.com/seomoz
  75. http://twitter.com/seomoz
  76. http://www.facebook.com/SEOmoz
  77. http://www.linkedin.com/groups?about=&gid=2976409&trk=anet_ug_grppro
  78. http://www.seomoz.org/about/contact
  79. http://www.seomoz.org/sitemap
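
As a minimal sketch of that idea, the helper below reads `lynx -dump` output on standard input and prints only the URLs from the References section. The function name `extract_reference_urls` is made up for illustration, and the sketch assumes the section header is a line containing exactly `References` and that each entry looks like `  12. http://...`, as in the dump above.

```shell
#!/bin/bash
# Hypothetical helper (not part of lynx): reads `lynx -dump` output on stdin
# and prints one URL per line from the References section.
extract_reference_urls() {
    awk '
        # Assumption: the section starts at a line that is exactly "References".
        /^References$/ { in_refs = 1; next }
        # Inside the section, entries look like "  12. http://...";
        # field 1 is the number with a trailing dot, field 2 is the URL.
        in_refs && $1 ~ /^[0-9]+\.$/ { print $2 }
    '
}
```

Usage would look something like `lynx -dump 'http://www.seomoz.org/' | extract_reference_urls | sort -u`, where `sort -u` drops the duplicate URLs. lynx also has a `-listonly` option that, combined with `-dump`, prints only the list of links, which may make the header check unnecessary.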

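You may need to grep for <a href=" and take the value up to the next quote, then filter out all the JavaScript entries. This solution is probably not foolproof either, though.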
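
A rough sketch of that grep approach, assuming the raw HTML arrives on standard input and that every link is written as a lowercase `href` with a double-quoted value. The function name `extract_hrefs` is hypothetical, and unquoted or single-quoted attributes would be missed, which is one reason this is not foolproof.

```shell
#!/bin/bash
# Hypothetical helper: reads raw HTML on stdin and prints the value of each
# double-quoted href attribute found inside an <a ...> tag.
extract_hrefs() {
    grep -oE '<a[[:space:]][^>]*href="[^"]*"' |   # each '<a ... href="..."' fragment
        sed -E 's/.*href="([^"]*)"/\1/' |         # keep only the quoted value
        grep -vi '^javascript:'                   # drop javascript: pseudo-links
}
```

It could be fed with the raw page source, for example `w3m -dump_source "$search" | extract_hrefs`, where `$search` is the URL built in the question's script. Relative links such as `/path` are printed as they appear and would still need to be resolved against the page URL.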
