Removing unwanted tags with BeautifulSoup



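I am using beautifulsoup4 for a web-scraping project. However, my code returns more content than I expect. For example, the content shown further down is also returned. How can I get rid of it? I have already tried the simple housekeeping approach.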

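As you can see from my code, I have added 'style' to the list of tags to remove, but it still shows up in the results. I don't know why this happens.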

    # Remove unwanted tag elements:
    cleaned_text = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head',
        'input',
        'script',
        'style',
    ]
    # Then we will loop over every item in the extracted text and make sure
    # that its parent beautifulsoup4 tag is NOT in the blacklist
    for item in text:
        if item.parent.name not in blacklist:
            cleaned_text += '{} '.format(item)

    # Remove any tab separation and strip the text:
    cleaned_text = cleaned_text.replace('\t', '')
    return cleaned_text.strip()

Below is the unwanted output that needs to be removed.

[if !mso]>
<style>
v:* {behavior:url(#default#VML);}
</style>
<![endif] [if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:RelyOnVML/>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif] [if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:TrackMoves/>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:DontVertAlignInTxbx/>
<w:Word11KerningPairs/>
<w:CachedColBalance/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="--"/>
<m:smallFrac m:val="off"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif] [if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="267">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
</w:LatentStyles>
</xml><![endif] [if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-para-margin-right:0in;
mso-para-margin-bottom:10.0pt;
mso-para-margin-left:0in;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
</style>
<![endif] 
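The `[if gte mso 9]> ... <![endif]` fragments above look like the bodies of Microsoft Office conditional comments (`<!--[if gte mso 9]> ... <![endif]-->`). Assuming that is what they are, BeautifulSoup would return each one as a single `Comment` string whose parent is the enclosing tag (for example `head` or `body`), not `style`, which would explain why a blacklist of parent names misses them. Here is a minimal sketch that skips comments by type. The function name `visible_text` and the sample markup are illustrative and not taken from the original code.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Same tag names as the blacklist in the question.
BLACKLIST = {'[document]', 'noscript', 'header', 'html', 'meta',
             'head', 'input', 'script', 'style'}


def visible_text(html):
    """Return text nodes that are neither comments nor inside blacklisted tags."""
    soup = BeautifulSoup(html, 'html.parser')
    pieces = []
    for item in soup.find_all(string=True):
        # Conditional comments arrive as Comment objects; drop them by type.
        if isinstance(item, Comment):
            continue
        if item.parent.name in BLACKLIST:
            continue
        stripped = item.strip()
        if stripped:  # ignore whitespace-only nodes between tags
            pieces.append(stripped)
    return ' '.join(pieces)


# Hypothetical sample page: a <style> tag plus an Office conditional comment.
sample = """<html><head></head><body>
<!--[if gte mso 9]><xml><o:AllowPNG/></xml><![endif]-->
<style>p {color: red;}</style>
<p>Hello world</p>
</body></html>"""

print(visible_text(sample))
```

Checking the node type rather than the parent name means the filter does not depend on where in the page the comment happens to sit.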

To solve this, I printed item.parent.name, checked each value manually, and removed the unwanted ones accordingly. This is similar to what diggusbickus mentioned.
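The manual check described above could be sketched roughly as follows: tally the parent tag name and the node class of every non-empty text node, so that unexpected sources stand out. The helper name `summarize_text_nodes` and the sample markup are hypothetical. The class name reported for text inside `<style>` may differ between beautifulsoup4 versions.

```python
from collections import Counter

from bs4 import BeautifulSoup


def summarize_text_nodes(html):
    """Count non-empty text nodes by (parent tag name, node class name)."""
    soup = BeautifulSoup(html, 'html.parser')
    counts = Counter()
    for item in soup.find_all(string=True):
        if item.strip():
            counts[(item.parent.name, type(item).__name__)] += 1
    return counts


# Hypothetical sample page with a conditional comment inside <head>.
sample = """<html><head>
<!--[if gte mso 9]><xml><o:AllowPNG/></xml><![endif]-->
<style>p {color: red;}</style>
</head><body><p>Hello</p></body></html>"""

for (parent, kind), n in sorted(summarize_text_nodes(sample).items()):
    print(parent, kind, n)
```

Any row whose parent is already in the blacklist, or whose kind is `Comment`, is a candidate for filtering.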
