Scraping <span> text with BeautifulSoup and urllib



I am using BeautifulSoup to scrape data from a website. For whatever reason, I cannot seem to find a way to print the text between the span elements. Here is the HTML I am working with:

data = """ <div class="grouping">
     <div class="a1 left" style="width:20px;">Text</div>
     <div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
     <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
     <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""

My end goal is to be able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting Python and urllib to produce any of the text in between the spans. Here is what I am running:

import urllib
from bs4 import BeautifulSoup
url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
Search_List = [0,4,5] # list of Target IDs to scrape
for i in Search_List:
    h = str(i)
    root = 'target_' + h
    taggr = soup.find("span", { "id" : root })
    print taggr, ", ", taggr.text

When I use urllib, it produces the following:

<span id="target_0"></span>, 
<span id="target_4"></span>, 
<span id="target_5"></span>, 

However, I also downloaded the HTML file, and when I parse the downloaded file instead, it produces this output (the output I want):

<span id="target_0">Data1</span>, Data1 
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1

Can anyone explain to me why urllib doesn't produce the same results?

Use this code:

...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'target_0'}):
    your_data.append(line.text)

...

Similarly, add all the class attributes you need to extract data from, and then write the your_data list out to a CSV file. Hope this helps; let me know if it doesn't solve it.
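The "write the your_data list to a CSV file" step above could be sketched like this with Python's csv module (the filename output.csv and the sample values are just placeholders, not part of the original answer):

```python
import csv

# your_data as collected by the scraping loop above
# (sample values taken from the question's HTML)
your_data = ['Text', 'Data1', 'Data2']

# Write the scraped values as one CSV row; 'output.csv' is an example filename
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerow(your_data)

# Read it back to verify the row round-trips
with open('output.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows[0])  # ['Text', 'Data1', 'Data2']
```

In a real scraper you would call writerow once per scraped entry, building up one row per "grouping" div.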

You can create your lists from the source HTML you have shown using the following approach:

from bs4 import BeautifulSoup
data = """ 
<div class="grouping">
     <div class="a1 left" style="width:20px;">Text0</div>
     <div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
     <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
     <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
     <div class="a1 left" style="width:20px;">Text2</div>
     <div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
     <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
     <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
     <div class="a1 left" style="width:20px;">Text4</div>
     <div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
     <div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
     <div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")
search_ids = [0, 4, 5] # list of Target IDs to scrape
for i in search_ids:
    span = soup.find("span", id='target_{}'.format(i))
    if span:
        grouping = span.parent.parent
        print list(grouping.stripped_strings)[:-1]      # -1 to remove "Data3"

The example data has been modified slightly to show the script finding IDs 0 and 4. This displays the following output:

[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']

Note: if the HTML you get back from the URL is different from what you see when viewing the page source in your browser (i.e. the data you want is missing entirely), then you will need a solution such as selenium to connect to your browser and extract the HTML from there. This is because, in that case, the HTML is probably being generated locally by JavaScript, and urllib does not have a JavaScript processor.
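If that turns out to be the case here, a minimal selenium sketch would look like the following (this assumes selenium and a matching Chrome driver are installed; http://target.com is the question's placeholder URL):

```python
# Sketch: fetch JavaScript-rendered HTML via a real browser, then parse as before.
# Assumes: pip install selenium, plus a Chrome driver on the PATH.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://target.com')        # placeholder URL from the question
html = driver.page_source              # HTML *after* JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'lxml')
span = soup.find('span', id='target_0')
print(span.text if span else 'not found')
```

Unlike urllib, driver.page_source returns the DOM after any client-side scripts have filled in the span contents, so the same BeautifulSoup code from above should then find the text.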
