BeautifulSoup: getting the content of &lt; &gt; tags



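I have a set of scraped pages that I have to use (I can't scrape them again). These pages contain meta information inside escaped &lt; &gt; tags, like this: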

...
<span class="html-tag">
&lt;meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>" 
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" /&gt;
...
&lt;meta <span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:url</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" /&gt;
...

Update 3:

Loaded in Chrome, these lines look like this:

<meta name="twitter:title" property="og:title" content="Smart TV wifi won't turn on" />
<meta property="og:url" content="https://x.y.org/discussion/437/lg-smart-tv-wifi-wont-turn-on" />

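But the raw scraped text has &lt;meta .... &gt; in place of real <meta> tags.

Is it possible to use BeautifulSoup to get the content from these &lt;meta .... &gt; tags? In this case I need to get "Smart TV wifi won't turn on" and the URL "https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on". How can I do that?

I don't know if this is what you want.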

from simplified_scrapy import SimplifiedDoc
html = '''
<span class="html-tag">
&lt;meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>" 
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" /&gt;
...
&lt;meta <span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:url</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" /&gt;
'''
doc = SimplifiedDoc(html)
# Get the first data block (one escaped <meta ... /> tag).
block = doc.getSectionByReg(r'&lt;meta[\s\S]+?/&gt;')
# The value span follows the span whose text is 'content'.
span = SimplifiedDoc(block).getElementByText('content').next.text
print(span)
# Get all data blocks.
blocks = doc.getSectionsByReg(r'&lt;meta[\s\S]+?/&gt;')
for block in blocks:
    span = SimplifiedDoc(block).getElementByText('content').next.text
    print(span)

Result:

Smart TV wifi won't turn on
Smart TV wifi won't turn on
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on

from bs4 import BeautifulSoup

html = """ ...
<span class="html-tag">
&lt;meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>" 
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" /&gt;
...
"""

soup = BeautifulSoup(html, 'html.parser')
# Index [2] selects the third value span, which holds the content text.
for item in soup.findAll("span", {'class': 'html-attribute-value'})[2]:
    print(item)
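
Picking the value by position (`[2]`) only works for the first tag. A more general sketch, assuming every escaped attribute is rendered as an `html-attribute-name` span followed by an `html-attribute-value` span (as in the sample from the question), is to look up each `content` name span and read the value span that follows it. The helper name `content_values` is made up for this illustration.

```python
from bs4 import BeautifulSoup

# Escaped markup copied from the question (view-source style highlighting).
raw = '''
<span class="html-tag">
&lt;meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>" 
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" /&gt;
...
&lt;meta <span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:url</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" /&gt;
'''


def content_values(raw_html):
    """Hypothetical helper: return the value following each 'content' name span."""
    soup = BeautifulSoup(raw_html, "html.parser")
    values = []
    for name_span in soup.find_all("span", class_="html-attribute-name"):
        # Skip attribute names other than 'content' (name, property, ...).
        if name_span.get_text(strip=True) != "content":
            continue
        # The matching value is the next value span in document order.
        value_span = name_span.find_next("span", class_="html-attribute-value")
        if value_span is not None:
            values.append(value_span.get_text(strip=True))
    return values


for value in content_values(raw):
    print(value)
```

`get_text(strip=True)` removes the line break that precedes the URL inside its span.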

Update:

from bs4 import BeautifulSoup
import re
html = """<meta name="twitter:title" property="og:title" content="Smart TV wifi won't turn on" />
<meta property="og:url" content="https://x.y.org/discussion/437/lg-smart-tv-wifi-wont-turn-on" />"""

soup = BeautifulSoup(html, 'html.parser')
# Match every <meta> whose property attribute starts with "og".
for item in soup.findAll("meta", property=re.compile("^og")):
    print(item.get("content"))

Output:

Smart TV wifi won't turn on
https://x.y.org/discussion/437/lg-smart-tv-wifi-wont-turn-on
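
The update above works on real `<meta>` tags, while the scraped files only contain the escaped form. One possible bridge, sketched here under the assumption that the raw files look like the sample in the question, is to parse twice: the first pass with `get_text()` drops the highlighting spans and decodes `&lt;` and `&gt;`, and the second pass parses the resulting text as ordinary HTML. The function name `og_contents` is hypothetical.

```python
from bs4 import BeautifulSoup

# Escaped markup copied from the question (view-source style highlighting).
raw = '''
<span class="html-tag">
&lt;meta <span class="html-attribute-name">name</span>="
<span class="html-attribute-value">twitter:title</span>" 
<span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:title</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">Smart TV wifi won't turn on</span>" /&gt;
...
&lt;meta <span class="html-attribute-name">property</span>="
<span class="html-attribute-value">og:url</span>" 
<span class="html-attribute-name">content</span>="
<span class="html-attribute-value">
https://x.y.org/discussion/437/smart-tv-wifi-wont-turn-on</span>" /&gt;
'''


def og_contents(raw_html):
    """Hypothetical helper: map each og:* property to its content value."""
    # First pass: strip the span markup and decode the entities,
    # which leaves plain text containing real <meta ... /> tags.
    text = BeautifulSoup(raw_html, "html.parser").get_text()
    # Second pass: parse that text as regular HTML.
    inner = BeautifulSoup(text, "html.parser")
    results = {}
    for meta in inner.find_all("meta"):
        # Attribute values keep the line breaks from the source, so strip them.
        prop = meta.get("property", "").strip()
        if prop.startswith("og:"):
            results[prop] = meta.get("content", "").strip()
    return results


for prop, content in og_contents(raw).items():
    print(prop, content)
```

The `.strip()` calls matter here because the highlighted source puts each attribute value on its own line, so the decoded attributes start with a newline.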
