Bash script to clone a website using wget



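I am trying to create a bash script that clones a website, by adapting a wget command I found here.

This is what I have so far, but I am having trouble parsing the URL correctly: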

URL=$1
DOMAIN=`echo $URL | sed -e 's/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/'`
echo $DOMAIN
wget \
     --recursive \ # Download the whole site.
     --page-requisites \ # Get all assets/elements (CSS/JS/images).
     --adjust-extension \ # Save files with .html on the end.
     --span-hosts \ # Include necessary assets from offsite as well.
     --convert-links \ # Update links to still work in the static version.
     --restrict-file-names=windows \ # Modify filenames to work in Windows as well.
     --domains $DOMAIN \ # Do not follow links outside this domain.
     --no-parent \ # Don't follow links outside the directory you pass in.
         $URL # The URL to download

The problem seems to be determining the domain from the URL.

This is the result when I try to clone a web page:

./clone_website.sh https://www.elliman.com/newyork/sales/detail/612-l-566-14_h344874/5-paumanok-road-water-mill-ny-11976
www.elliman.com
--2021-01-26 11:06:11--  http://%20/
Resolving   ( )... failed: Name or service not known.
wget: unable to resolve host address ‘ ’
--2021-01-26 11:06:11--  http://download/
Resolving download (download)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘download’
--2021-01-26 11:06:11--  http://the/
Resolving the (the)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘the’
--2021-01-26 11:06:11--  http://whole/
Resolving whole (whole)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘whole’
--2021-01-26 11:06:11--  http://site./
Resolving site. (site.)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘site.’
./clone_website.sh: line 7: syntax error near unexpected token `('
./clone_website.sh: line 7: `     --page-requisites \ # Get all assets/elements (CSS/JS/images).'

How can I fix this?

The backslash (\) has to be the last character on a line. No comments are allowed after it. You want:

wget \
    --recursive \
    --page-requisites \
    --adjust-extension \
    --span-hosts \
    --convert-links \
    --restrict-file-names=windows \
    --domains "$DOMAIN" \
    --no-parent \
    "$URL" # The URL to download. This comment is fine.

Your script is executing:

wget --recursive ' #' Download the whole site.
--page-requisites ' #' Get all assets/elements (CSS/JS/images).
                                               ^ - syntax error
etc.

As a result, wget tries to download http:// #/, http://Download/ and so on, treating every word of the comment as a site.
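
To see how bash splits the two forms, a small sketch can print the arguments a command actually receives. Here show_args is a hypothetical helper that stands in for wget; it is not part of the original script.

```shell
#!/usr/bin/env bash
# show_args is a hypothetical stand-in for wget: it prints each argument
# it receives on its own line, wrapped in brackets.
show_args() {
  printf '[%s]\n' "$@"
}

# Backslash as the very last character: the command continues on the next line.
show_args --recursive \
  --page-requisites

# Backslash followed by a space: the space is escaped and fused with '#',
# so ' #' and the words after it are passed as ordinary arguments, and the
# command ends at this newline.
show_args --recursive \ # Download the whole site.
```

The first call receives two arguments. The second receives six: --recursive, ' #', and the four words of the intended comment, which matches the hosts wget tried to resolve.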

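Remember to quote variable expansions to prevent filename expansion and word splitting. (When a URL is typed on the command line, quote the argument as well, so that an & in it does not send the command to the background.) In a script, a bash array is easier to manage: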

runme=(
wget # yay bash array - you can comment and no need for '\'
--recursive # another comment
--page-requisites # to this and that
--adjust-extension
--span-hosts
--convert-links
--restrict-file-names=windows
--domains "$DOMAIN"
--no-parent 
"$URL" # The URL to download This comment is fine
)
"${runme[@]}"
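
Bash parameter expansion can also extract the domain together with its protocol:

# replace the double-slash between protocol and domain with `%`:
DOMAIN_WITH_PROTOCOL=${URL/\/\//%}
# strip away all the path components
DOMAIN_WITH_PROTOCOL=${DOMAIN_WITH_PROTOCOL%%/*}
# replace the `%` (that we used to delimit protocol from domain) back to `//`;
# `[%]` keeps the pattern from starting with `%`, which would anchor the match at the end
DOMAIN_WITH_PROTOCOL=${DOMAIN_WITH_PROTOCOL/[%]/\/\/}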

Or just find the domain and the protocol in separate steps. This takes one more step, but I find it easier to follow:

# everything before the first `://` is the protocol
proto=${URL%%://*}
# strip away the protocol, which gives us <domain>+<path>
domain=${URL#*://}
# strip the path from the domain:
domain=${domain%%/*}
# and reassemble protocol+domain:
DOMAIN_WITH_PROTOCOL="${proto}://${domain}"
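
The separate steps above can be wrapped in functions, together with the user@ and :port stripping that the sed expression in the question attempts. This is only a sketch under assumptions: url_origin and url_host are hypothetical names, the input is assumed to contain ://, and IPv6 literals or URLs without a path separator before a query string are not handled.

```shell
#!/usr/bin/env bash
# url_origin and url_host are hypothetical helper names for this sketch.
# Both assume the URL contains "://"; other inputs are not handled.

# Print "<protocol>://<domain>", with any user info or port left in place.
url_origin() {
  local url="$1"
  local proto="${url%%://*}"   # everything before the first "://"
  local rest="${url#*://}"     # drop the protocol
  rest="${rest%%/*}"           # drop the path
  printf '%s://%s\n' "$proto" "$rest"
}

# Print only the host name, e.g. for use with wget --domains.
url_host() {
  local url="$1"
  local host="${url#*://}"     # drop the protocol
  host="${host%%/*}"           # drop the path
  host="${host##*@}"           # drop "user:password@" if present
  host="${host%%:*}"           # drop ":port" if present
  printf '%s\n' "$host"
}

url_origin "https://www.elliman.com/newyork/sales"
url_host "https://user@www.elliman.com:8443/newyork/sales"
```

In the script from the question, the sed line could then become DOMAIN=$(url_host "$URL").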
