How to convert scraped results into a dict to create a DataFrame



I am trying to scrape a page that contains the following code:

<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor no">No</span></span>
</li>

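I have already used bs4 to find the classes these tags belong to. My goal is to get each of these span classes, collect them into a dict, and convert that into a DataFrame.

I am really stuck! Any help would be great.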

Use a dict comprehension: take the key from each span's class and the value from its text

{x.get('class')[0]: x.text for x in li.select('span')} 

Because x.get('class') returns a list, and a list cannot be used as a dict key, we have to pick its first element for the dict comprehension to work
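To make the point about the list concrete, here is a minimal sketch in plain Python. The `classes` list is written out by hand and stands in for what `x.get('class')` gives back for `class="contractor yes"`; the variable names are only illustrative.

```python
# Stand-in for x.get('class') on <span class="contractor yes">:
# BeautifulSoup treats class as a multi-valued attribute and returns a list.
classes = ["contractor", "yes"]

# A list is unhashable, so using it directly as a dict key fails.
try:
    bad = {classes: "Yes"}
except TypeError as exc:
    error = str(exc)

# Taking the first class name gives a plain string, which is a valid key.
row = {classes[0]: "Yes"}
print(row)
```

Note that picking index 0 quietly drops any further class names, such as the `yes` or `no` modifier here.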

Adjusting the employee value, which is needed because of the nested <span>:

df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
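The same string manipulation can be tried on plain strings, which may make it easier to see what the lambda does per row. This is only a sketch; `own_text` is a hypothetical helper name, and the inputs are the raw `.text` values that the sample HTML would produce.

```python
def own_text(employee_text, contractor_text):
    """Drop the leading '-' and the first occurrence of the nested span's text."""
    without_dash = employee_text.strip("-")
    # split(..., 1) cuts out one occurrence only; join glues the rest back together
    return "".join(without_dash.split(contractor_text, 1))

first = own_text("-YesYes", "Yes")   # employee text when contractor is "Yes"
third = own_text("-YesNo", "No")     # employee text when contractor is "No"
print(first, third)
```

One caveat: `split` removes the first occurrence it finds, so when the outer and nested texts are identical the result is the same either way, but for other values the match may not be the nested part.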

Example

from bs4 import BeautifulSoup
import pandas as pd
html='''
<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor no">No</span></span>
</li>
'''
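# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
data = []
for li in soup.select('li'):
    # One dict per <li>: first class name -> text of the span
    data.append({x.get('class')[0]: x.text for x in li.select('span')})
df = pd.DataFrame(data)
# Strip the leading '-' and the nested contractor text from the employee value
df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
df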

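Output

           name organization employee contractor
0    Person One         Mall      Yes        Yes
1    Person Two       Market      Yes        Yes
2  Person Three         Mall      Yes         No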
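As a possible alternative to cleaning the employee column afterwards, the nested text can be avoided at extraction time by reading only the text node that is a direct child of each span. This is a sketch, not part of the original answer; `direct_text` is a hypothetical helper, and the HTML is cut down to a single item.

```python
from bs4 import BeautifulSoup

html = '''
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor no">No</span></span>
</li>
'''

soup = BeautifulSoup(html, "html.parser")

def direct_text(tag):
    # First text node that is an immediate child of the tag; nested tags are ignored
    node = tag.find(string=True, recursive=False)
    # Also drop the leading '-' that the employee value carries in this markup
    return node.strip().lstrip("-") if node else ""

rows = [
    {span.get("class")[0]: direct_text(span) for span in li.select("span")}
    for li in soup.select("li")
]
print(rows)
```

The resulting `rows` list can be passed to `pd.DataFrame` in the same way as `data` in the example.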
