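I am trying to create a bash script that clones a website, by modifying the wget command found here.
This is what I have so far, but I am having trouble parsing the URL correctly: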
URL=$1
DOMAIN=`echo $URL | sed -e 's/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/'`
echo $DOMAIN
wget \
     --recursive \ # Download the whole site.
     --page-requisites \ # Get all assets/elements (CSS/JS/images).
     --adjust-extension \ # Save files with .html on the end.
     --span-hosts \ # Include necessary assets from offsite as well.
     --convert-links \ # Update links to still work in the static version.
     --restrict-file-names=windows \ # Modify filenames to work in Windows as well.
     --domains $DOMAIN \ # Do not follow links outside this domain.
     --no-parent \ # Don't follow links outside the directory you pass in.
     $URL # The URL to download
The problem seems to be determining the domain from the URL.
This is the result when I try to clone a web page:
./clone_website.sh https://www.elliman.com/newyork/sales/detail/612-l-566-14_h344874/5-paumanok-road-water-mill-ny-11976
www.elliman.com
--2021-01-26 11:06:11-- http://%20/
Resolving ( )... failed: Name or service not known.
wget: unable to resolve host address ‘ ’
--2021-01-26 11:06:11-- http://download/
Resolving download (download)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘download’
--2021-01-26 11:06:11-- http://the/
Resolving the (the)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘the’
--2021-01-26 11:06:11-- http://whole/
Resolving whole (whole)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘whole’
--2021-01-26 11:06:11-- http://site./
Resolving site. (site.)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘site.’
./clone_website.sh: line 7: syntax error near unexpected token `('
./clone_website.sh: line 7: ` --page-requisites # Get all assets/elements (CSS/JS/images).'
How can I fix this?
The backslash (\) must be the last character on the line. No comment is allowed after it. What you want is:
wget \
     --recursive \
     --page-requisites \
     --adjust-extension \
     --span-hosts \
     --convert-links \
     --restrict-file-names=windows \
     --domains "$DOMAIN" \
     --no-parent \
     "$URL" # The URL to download This comment is fine
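Your script is actually executing:
wget --recursive ' #' Download the whole site.
--page-requisites ' #' Get all assets/elements (CSS/JS/images).
                                               ^^ - syntax error
etc.
Each backslash escapes the space that follows it, so the "#" no longer starts a comment and the comment text is passed to wget as arguments. wget then tries to download http:// #/, http://Download/ and so on, and the next line is run as a separate command.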
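To see this for yourself, here is a minimal sketch. show_args is a hypothetical helper, not part of the original script; it just prints every argument it receives, one per line.

```bash
# Hypothetical helper: print each received argument on its own line.
show_args() {
    printf '<%s>\n' "$@"
}

# A backslash followed by a space and a comment, as in the question.
# The backslash escapes the space, so " #" becomes an ordinary argument
# and the words of the "comment" follow it as further arguments.
show_args --recursive \ # Download the whole site.
```

This should print six arguments, the second one being "< #>", which matches the hosts wget complained about in the question.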
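Remember to quote variable expansions to disable filename expansion and word splitting (and, when you type a URL on the command line, quote the argument so that an & in it does not run the command in the background).
In a script, a bash array is easier to manage: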
runme=(
    wget # yay bash array - you can comment and no need for \
    --recursive # another comment
    --page-requisites # to this and that
    --adjust-extension
    --span-hosts
    --convert-links
    --restrict-file-names=windows
    --domains "$DOMAIN"
    --no-parent
    "$URL" # The URL to download This comment is fine
)
"${runme[@]}"
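One thing the array makes easy is a dry run: you can print the command before executing it. The sketch below is only an illustration; build_wget_cmd, WGET_CMD and the example.com URL are made-up names, and the option list is shortened.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the wget command as an array without running it.
build_wget_cmd() {
    local url=$1 domain=$2
    WGET_CMD=(
        wget
        --recursive
        --no-parent
        --domains "$domain"
        "$url"          # stays one element, even with ? and & in it
    )
}

build_wget_cmd 'https://example.com/a?x=1&y=2' 'example.com'

# Print every element shell-quoted; nothing is downloaded here.
printf '%q ' "${WGET_CMD[@]}"
echo
# When the printed command looks right, run it with: "${WGET_CMD[@]}"
```

Because each array element is passed as exactly one argument, special characters in the URL need no extra escaping inside the script.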
You can also get the protocol and domain with bash parameter expansion alone, without sed:
# replace the double-slash between protocol and domain with `%`:
DOMAIN_WITH_PROTOCOL=${URL/\/\//%}
# strip away all the path components
DOMAIN_WITH_PROTOCOL=${DOMAIN_WITH_PROTOCOL%%/*}
# replace the `%` (that we used to delimit protocol from domain) back to `//`
DOMAIN_WITH_PROTOCOL=${DOMAIN_WITH_PROTOCOL/\%/\/\/}
Or just find the domain and the protocol in separate steps, which takes one more step but which I find easier to follow:
# everything before the first `://` is the protocol
proto=${URL%%://*}
# strip away the protocol, which gives us <domain>+<path>
domain=${URL#*://}
# strip the path from the domain:
domain=${domain%%/*}
# and reassemble protocol+domain:
DOMAIN_WITH_PROTOCOL="${proto}://${domain}"
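Note that wget's --domains option takes bare host names, so for that you probably want the domain without the protocol. Below is a rough sketch of the same expansions wrapped in a function; url_host is a hypothetical name. Like the sed expression in the question, it also drops an optional "user@" prefix and ":port" suffix. It assumes the path starts with "/" and does not handle IPv6 literals such as [::1].

```bash
# Hypothetical helper: print the bare host of a URL
# (no protocol, user info, port or path).
url_host() {
    local rest="${1#*://}"        # drop "protocol://"
    rest="${rest%%/*}"            # drop the path
    rest="${rest##*@}"            # drop "user:password@" if present
    printf '%s\n' "${rest%%:*}"   # drop ":port" if present
}

url_host 'https://www.elliman.com/newyork/sales'    # prints www.elliman.com
url_host 'http://user@example.com:8080/index.html'  # prints example.com
```

In the script you could then write DOMAIN=$(url_host "$URL") instead of the sed pipeline.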