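I have this PowerShell script that strips the HTML tags from a file, leaving only the text, and shows the word count of that HTML file when the script runs. My question is, when I execute: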
function Html-ToText {
param([System.String] $html)
# remove line breaks, replace with spaces
$html = $html -replace "(`r|`n|`t)", " "
# write-verbose "removed line breaks: `n`n$html`n"
# remove invisible content
@('head', 'style', 'script', 'object', 'embed', 'applet', 'noframes', 'noscript', 'noembed') | % {
$html = $html -replace "<$_[^>]*?>.*?</$_>", ""
}
# write-verbose "removed invisible blocks: `n`n$html`n"
# Condense extra whitespace
$html = $html -replace "( )+", " "
# write-verbose "condensed whitespace: `n`n$html`n"
# Add line breaks
@('div','p','blockquote','h[1-9]') | % { $html = $html -replace "</?$_[^>]*?>.*?</$_>", ("`n" + '$0' )}
# Add line breaks for self-closing tags
@('div','p','blockquote','h[1-9]','br') | % { $html = $html -replace "<$_[^>]*?/>", ('$0' + "`n")}
# write-verbose "added line breaks: `n`n$html`n"
#strip tags
$html = $html -replace "<[^>]*?>", ""
# write-verbose "removed tags: `n`n$html`n"
# replace common entities
@(
@("&bull;", " * "),
@("&lsaquo;", "<"),
@("&rsaquo;", ">"),
@("&(rsquo|lsquo);", "'"),
@("&(quot|ldquo|rdquo);", '"'),
@("&trade;", "(tm)"),
@("&frasl;", "/"),
@("&(quot|#34|#034|#x22);", '"'),
@('&(amp|#38|#038|#x26);', "&"),
@("&(lt|#60|#060|#x3c);", "<"),
@("&(gt|#62|#062|#x3e);", ">"),
@('&(copy|#169);', "(c)"),
@("&(reg|#174);", "(r)"),
@("&nbsp;", " "),
@("&(.{2,6});", "")
) | % { $html = $html -replace $_[0], $_[1] }
# write-verbose "replaced entities: `n`n$html`n"
return $html + $a | Measure-Object -word
}
and then run:
Html-ToText (new-object net.webclient).DownloadString("test.html")
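the PowerShell output shows a word count of 4. How can I export that output from the PowerShell window to an Excel spreadsheet with a column named Words and a count of 4?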
The CSV you want looks like this:
Words
4
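It's easy enough to write that to a text file, and Excel will read it. But you're in luck: the output of Measure-Object is already an object with 'Words' as a property and '4' as its value, so you can feed it straight into Export-Csv. Use Select-Object to pick the properties you want: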
$x = Html-ToText (new-object net.webclient).DownloadString("test.html")
# drop the Lines/Characters/etc fields, just export words
$x | select-Object Words | Export-Csv out.csv -NoTypeInformation
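To see the shape of the data without downloading anything, here is a minimal sketch of the same pipeline. The sample string and the temp-file path are made up for illustration; they just stand in for the text your function produces and for out.csv.

```powershell
# Hypothetical sample text standing in for the stripped HTML.
$text = 'one two three four'

# Measure-Object -Word emits a single object that has a Words property.
$measure = $text | Measure-Object -Word

# Hypothetical output path in the temp directory (stands in for out.csv).
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'wordcount-sketch.csv'

# Keep only the Words property, then write it out as a one-column CSV.
$measure | Select-Object Words | Export-Csv -Path $csvPath -NoTypeInformation

# Show the file contents: a header line followed by one value line.
Get-Content -Path $csvPath
```

Opening that file in Excel should give you one column headed Words with the count underneath.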
I'd be interested to see whether you could use
$x = Invoke-WebRequest http://www.google.com
$x.AllElements.InnerText
to get the words out of the HTML, rather than trying to strip the content with replaces.
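A rough sketch of that idea, untested against real pages. It assumes Windows PowerShell 5.1 without -UseBasicParsing, since as far as I know the parsed DOM (ParsedHtml, AllElements) is not populated in PowerShell 6 and later. The function name Get-BodyWordCount is made up. It reads the body's innerText instead of AllElements.InnerText, on the assumption that collecting the text of every element would count nested text more than once.

```powershell
# Hypothetical helper: counts the words in the parsed body text of a response.
# Assumes $Response exposes ParsedHtml.body.innerText, as the object returned
# by Invoke-WebRequest does in Windows PowerShell 5.1.
function Get-BodyWordCount {
    param(
        [Parameter(Mandatory)]
        $Response
    )

    # innerText is the rendered text of the body, with the tags already removed.
    $text = $Response.ParsedHtml.body.innerText

    # Return an object that carries only the Words property.
    $text | Measure-Object -Word | Select-Object Words
}

# Usage (needs network access and Windows PowerShell 5.1):
# Get-BodyWordCount (Invoke-WebRequest -Uri 'http://www.google.com') |
#     Export-Csv out.csv -NoTypeInformation
```

Whether the count matches what the regex version gives will depend on the page, since the browser engine decides what counts as visible text.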
I figured it out. What I did was add + $a | Measure-Object -Word after the $html variable in the script, and then run: Html-ToText (new-object net.webclient).DownloadString("test.html") + select-Object Words | Export-Csv out.csv -NoTypeInformation to output the word count.