I'm trying to scrape a page that contains the following markup:
<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor no">No</span></span>
</li>
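I have used bs4 to find the classes these tags belong to. My goal is to grab each of these span classes, collect them into a dict, and convert that into a DataFrame.
I'm really stuck! Any help would be great.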
Use a dict comprehension: take the key from each span's class and the value from its text:
{x.get('class')[0]: x.text for x in li.select('span')}
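Because x.get('class') returns a list, and a list is not hashable and so cannot be used as a dict key, we have to pick its first element for the dict comprehension to work.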
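To make the point about the class list concrete, here is a minimal sketch using one span from the question's markup. The variable names (snippet, classes, key) are only illustrative, and html.parser is an assumed parser choice:

```python
from bs4 import BeautifulSoup

# A single span taken from the markup in the question
snippet = '<span class="contractor yes">Yes</span>'
span = BeautifulSoup(snippet, 'html.parser').span

# "class" is a multi-valued attribute, so bs4 returns a list of class names
classes = span.get('class')
# The first class name is a plain string and can be used as a dict key
key = classes[0]

print(classes)  # ['contractor', 'yes']
print(key)      # contractor
```

Taking index 0 means the second class (yes / no) is dropped from the key, which is what keeps the column name the same across all rows.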
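The text of the employee span also includes the text of the nested contractor <span> (for example '-YesYes'), so the employee value needs to be adjusted: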
df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
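The same string manipulation can be tried on plain strings, which may make it easier to see what each step does. This is only a sketch; clean_employee is a hypothetical helper name that does not appear in the answer:

```python
def clean_employee(employee_text, contractor_text):
    # Drop the leading/trailing '-' characters, e.g. '-YesNo' -> 'YesNo'
    stripped = employee_text.strip('-')
    # Split once on the contractor value and glue the remaining pieces
    # back together, which removes its first occurrence:
    # 'YesNo'.split('No', 1) -> ['Yes', ''] -> 'Yes'
    return ''.join(stripped.split(contractor_text, 1))

print(clean_employee('-YesYes', 'Yes'))  # Yes
print(clean_employee('-YesNo', 'No'))    # Yes
```

Note that split removes the first occurrence of the contractor value, so the result relies on the two values not overlapping in a misleading way; for the values in the question (Yes / No) this holds.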
Example
from bs4 import BeautifulSoup
import pandas as pd
html='''
<li>
<span class="name">Person One</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Two</span>
<span class="organization">Market</span>
<span class="employee">-Yes<span class="contractor yes">Yes</span></span>
</li>
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor no">No</span></span>
</li>
'''
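# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# Build one dict per <li>: first class name -> text of the span
data = []
for li in soup.select('li'):
    data.append({x.get('class')[0]: x.text for x in li.select('span')})
df = pd.DataFrame(data)
# The employee text contains the nested contractor text (e.g. '-YesYes'):
# drop the leading '-' and remove the first occurrence of the contractor value
df['employee'] = df.apply(lambda x: ''.join(x['employee'].strip('-').split(x['contractor'], 1)), axis=1)
df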
Output
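| | name | organization | employee | contractor |
|---|---|---|---|---|
| 0 | Person One | Mall | Yes | Yes |
| 1 | Person Two | Market | Yes | Yes |
| 2 | Person Three | Mall | Yes | No |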
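As a possible variation, the post-processing step could be avoided by reading only the text that belongs directly to each span, ignoring nested tags. This is a sketch under assumptions rather than part of the answer above: own_text is a hypothetical helper, and it relies on find_all(string=True, recursive=False) returning only the direct child strings of a tag.

```python
from bs4 import BeautifulSoup

html = '''
<li>
<span class="name">Person Three</span>
<span class="organization">Mall</span>
<span class="employee">-Yes<span class="contractor no">No</span></span>
</li>
'''

def own_text(tag):
    # Join only the strings that are direct children of the tag,
    # skipping text that belongs to nested tags
    return ''.join(tag.find_all(string=True, recursive=False)).strip()

soup = BeautifulSoup(html, 'html.parser')
rows = []
for li in soup.select('li'):
    row = {span.get('class')[0]: own_text(span) for span in li.select('span')}
    # The employee value still carries its leading '-'
    row['employee'] = row['employee'].lstrip('-')
    rows.append(row)

print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as data in the example.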