Parse an ATOM/RSS feed and remove HTML tags



I am developing this code in PowerShell. I need to be able to strip out the HTML tags.

Invoke-WebRequest -Uri 'https://psu.box.com/shared/static/jf36ohodxnw7oemghsau1t7qb0w4y708.rss' -OutFile C:\users\anr2809\Documents\alerts.txt
[xml]$Content = Get-Content C:\users\anr2809\Documents\alerts.txt -Raw
$Regex = '(?s)SE1046.*?Description := "(?<Description>.*?)"'
If ($Content -match $Regex) {
    "Description is '$($Matches['Description'])'"
    # do something here with $Matches['Description']
}
Else {
    "No match."
}
$Feed = $Content.rss.channel
ForEach ($msg in $Feed.Item) {
    $ParseData = $msg.description
    ForEach ($Datum in $ParseData) {
        If ($Datum -like "Title") { [int]$Upvote = ($Datum).split(' ') | Select-Object -First 1 }#EndIf
        If ($Datum -like "comments") { [int]$Downvote = ($Datum).split(' ') | Select-Object -First 1 }#EndIf
    }#EndForEach
    [PSCustomObject]@{
        'LastUpdated' = [datetime]$msg.pubDate
        'Title'       = $msg.title
        'Category'    = $msg.category
        'Author'      = $msg.author
        'Link'        = $msg.link
        'UpVotes'     = $Upvote
        'DownVotes'   = $Downvote
        'Validations' = $Validation
        'WorkArounds' = $Workaround
        'Comments'    = $msg.description.InnerText
        'FeedbackID'  = $FeedBackID
    }#EndPSCustomObject
}

This is the result. I would like to remove the HTML tags.

LastUpdated : 3/30/2020 9:45:52 AM
Title       : Enterprise Network Planned Outage
Category    : 
Author      : 
Link        : link
UpVotes     : 
DownVotes   : 
Validations : 
WorkArounds : 
Comments    : 
<p><strong>People and Locations Impacted:</strong><br />All    students, faculty, and staff at all State locations<br /><br />
FeedbackID  : 

You can replace the <br /> tags with actual newlines and then strip out the remaining tags entirely:

$commentsPlain = $msg.description.InnerText -replace '<br ?/?>',[System.Environment]::NewLine -replace '<[^>]+>'
[PSCustomObject]@{
    'LastUpdated' = [datetime]$msg.pubDate
    'Title'       = $msg.title
    'Category'    = $msg.category
    'Author'      = $msg.author
    'Link'        = $msg.link
    'UpVotes'     = $Upvote
    'DownVotes'   = $Downvote
    'Validations' = $Validation
    'WorkArounds' = $Workaround
    'Comments'    = $commentsPlain
    'FeedbackID'  = $FeedBackID
}
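
To try the replace-based approach without downloading the feed, it can be wrapped in a small function and run on the fragment shown in the question's output. This is only a minimal sketch: the function name ConvertTo-PlainText is hypothetical, and the entity-decoding step with [System.Net.WebUtility]::HtmlDecode is an addition that is not part of the answer above. Regex stripping is not a real HTML parser, so it is assumed that the feed contains simple, well-formed markup.

```powershell
# Hypothetical helper (not from the original thread): turn an HTML fragment into plain text.
function ConvertTo-PlainText {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$Html
    )

    $newLine = [System.Environment]::NewLine

    # 1. Turn <br>, <br/> and <br /> into real line breaks (-replace is case-insensitive).
    $text = $Html -replace '<br\s*/?>', $newLine
    # 2. Remove every remaining tag.
    $text = $text -replace '<[^>]+>'
    # 3. Decode entities such as &amp; (an assumption: the feed may contain them).
    $text = [System.Net.WebUtility]::HtmlDecode($text)
    # 4. Trim the whitespace left behind by leading or trailing tags.
    $text.Trim()
}

# Sample fragment taken from the output shown in the question.
$sample = '<p><strong>People and Locations Impacted:</strong><br />All students, faculty, and staff at all State locations<br /><br />'
$plain = ConvertTo-PlainText -Html $sample
$plain
```

Inside the loop, the 'Comments' entry could then be set to the result of ConvertTo-PlainText -Html $msg.description.InnerText instead of the raw InnerText.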

You should be able to use the following script. It uses the HTMLFile COM object.

Invoke-WebRequest -Uri 'https://*.rss' -OutFile C:*.rss
[xml]$Content = Get-Content C:*.rss -Raw
$Regex = '(?s)SE1046.*?Description := "(?<Description>.*?)"'
If ($Content -match $Regex) {
    "Description is '$($Matches['Description'])'"
    # do something here with $Matches['Description']
}
Else {
    "No match."
}
$Feed = $Content.rss.channel
ForEach ($msg in $Feed.Item) {
    $ParseData = $msg.description
    ForEach ($Datum in $ParseData) {
        If ($Datum -like "Title") { [int]$Upvote = ($Datum).split(' ') | Select-Object -First 1 }#EndIf
        If ($Datum -like "comments") { [int]$Downvote = ($Datum).split(' ') | Select-Object -First 1 }#EndIf
    }#EndForEach
    # Load the description's HTML into an HTMLFile COM document so it can be queried by tag
    $HTML = New-Object -ComObject "HTMLFile"
    $HTML.IHTMLDocument2_write($ParseData.InnerText)
    [PSCustomObject]@{
        'LastUpdated' = [datetime]$msg.pubDate
        'Title'       = $msg.title
        'Category'    = $msg.category
        'Author'      = $msg.author
        'Link'        = $msg.link
        'UpVotes'     = $Upvote
        'DownVotes'   = $Downvote
        'Validations' = $Validation
        'WorkArounds' = $Workaround
        'Comments'    = $HTML.all.tags("p") | % InnerText
        'FeedbackID'  = $FeedBackID
    }#EndPSCustomObject
}
