How do I scrape HTML from a TXT file and store all items in a CSV?



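I'm trying to export tagged items from HTML saved in a TXT file. For some reason, my code only picks up the last row and exports it to the CSV. It doesn't scrape the other listed items, and I'm not sure why. I've tried several approaches, but none of them worked.

Here is my code: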

import pandas as pd
from bs4 import BeautifulSoup
import schedule
import time
#import urllib.parse
import requests

baseurl = 'https://www.soxboxmtl.com'
dataset = []
with open(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.txt', "r") as f:

    soup = BeautifulSoup(f.read(), "html.parser")
    for imgurl in soup.find_all('img', class_='grid-item-image'):(imgurl['data-src'])
    for name in soup.find_all('div', class_='grid-title'):(name.text)
    for link in soup.find_all('a', class_='grid-item-link'):(link['href'])
    for price in soup.find_all('div', class_='product-price'):(price.text)

    dataset.append({'Field_01':(imgurl['data-src']),'Field_02':name.text,'Field_03':(baseurl + link['href']),'Field_04':price.text})

print(dataset)
df = pd.DataFrame(dataset).to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.csv', index = False)

Below is a sample of the HTML data:

<div class="grid-item hentry tag-paddle tag-brush tag-bristle tag-wide tag-detangle tag-kitsch tag-anti-frizz tag-black author-jill-kessner post-type-store-item article-index-45 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="625ef30d651884142d5a2dc2" id="thumb-kitsch-paddle-hair-brush">
<a aria-label="Kitsch Paddle Hair Brush" class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush">
</a>
<figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
<div class="grid-image-wrapper has-hover-img">
<img alt="Screenshot 2022-04-19 at 1.31.04 PM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png" data-image-dimensions="1341x1335" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png"/>
<img alt="Screenshot 2022-04-19 at 1.31.24 PM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png" data-image-dimensions="1338x1338" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png"/>
<div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
<span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="625ef30d651884142d5a2dc2" role="button" tabindex="0">Quick View</span>
</div>
</div>
</figure>
<section class="grid-meta-wrapper" data-animation-role="content">
<div class="grid-main-meta">
<div class="grid-title" data-test="plp-grid-title">
Kitsch Paddle Hair Brush
</div>
<div class="grid-prices" data-test="plp-grid-prices">
<div class="product-price">
CA$24.00
</div>
</div>
</div>
<div class="grid-meta-status" data-test="plp-grid-status">
<div class="product-scarcity">
Only 2 left in stock
</div>
</div>
</section>
</div>
<div class="grid-item hentry tag-blanket tag-plush tag-cozy-plush tag-pj-salvage tag-embroidered tag-blush tag-pink tag-luxe-plush tag-luxe author-jill-kessner post-type-store-item article-index-46 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="635031c65ac9872b4ba44f5a" id="thumb-pj-salvage-luxe-plush-embroidered-blanket-blush">
<a aria-label="PJ Salvage Luxe Plush Embroidered Blanket - Blush" class="grid-item-link" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush">
</a>
<figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
<div class="grid-image-wrapper has-hover-img">
<img alt="Screenshot 2022-10-17 at 12.03.06 AM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png" data-image-dimensions="891x1340" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png"/>
<img alt="Screenshot 2022-10-17 at 12.02.56 AM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png" data-image-dimensions="890x1339" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png"/>
<div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
<span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="635031c65ac9872b4ba44f5a" role="button" tabindex="0">Quick View</span>
</div>
</div>
</figure>
<section class="grid-meta-wrapper" data-animation-role="content">
<div class="grid-main-meta">
<div class="grid-title" data-test="plp-grid-title">
PJ Salvage Luxe Plush Embroidered Blanket - Blush
</div>
<div class="grid-prices" data-test="plp-grid-prices">
<div class="product-price">
CA$118.00
</div>
</div>
</div>
<div class="grid-meta-status" data-test="plp-grid-status">
<div class="product-scarcity">
Only 1 left in stock
</div>
</div>

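The current implementation has two problems:

Problem 1

Your loops don't actually do anything with the data bs4 finds. The only thing that adds data to the dataset is the single call to dataset.append(), which produces the single row of data you are seeing.

Problem 2

Even if the loops worked, the script could still fail, because a pandas DataFrame built from columns requires every column to have the same length. For example, there are more images than titles, so you would end up with columns of different lengths.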
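
To make the column-length point concrete, here is a minimal sketch. The file names and the two-images-per-title shape are made up for illustration, and the exact error text can vary between pandas versions.

```python
import pandas as pd

# Hypothetical columns: two image URLs but only one title,
# mirroring the cover/hover image pair in each grid item.
columns = {
    "Field_01": ["cover.png", "hover.png"],
    "Field_02": ["Kitsch Paddle Hair Brush"],
}

try:
    pd.DataFrame(columns)
    error_message = None
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length.
    error_message = str(exc)

print(error_message)
```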

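Solution

Besides making sure we actually append the data correctly, we also need to make sure all columns are well-formed and consistent. Instead of searching for every piece of information separately, with nothing tying the results together, we search for all the parent elements that contain the information we need.

We then iterate over the list of parent elements. In each iteration, we search only within that parent element for the available data, then format it for use in a DataFrame. That DataFrame is appended to our list of DataFrames. Once iteration is complete, the list is concatenated into a single DataFrame and finally exported.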

# Find all the grid-items first.
sections = soup.find_all('div', {'class': 'grid-item'}, recursive=True)
# We will append our formatted data to this list, then
# concatenate it into a single DataFrame at the end.
df_items = []
# Format and add the data from each grid-item to the DataFrame.
for section in sections:
    title = section.find('a', {'class': 'grid-item-link'})
    imgs = section.find_all('img')
    price = section.find('div', {'class': 'product-price'})
    data = {
        'Field_01': [img['data-src'] for img in imgs],
        'Field_02': [title['aria-label']],
        'Field_03': [baseurl + title['href']],
        'Field_04': [''.join(price.text.split())],
    }
    # DataFrames require all arrays to be the same length.
    # Building with orient='index' and transposing fills in
    # any missing cells automatically.
    df = pd.DataFrame.from_dict(data, orient='index')
    df = df.transpose()
    # Append the DataFrame to our list of DataFrames.
    df_items.append(df)
# Concatenate all DataFrames.
result = pd.concat(df_items)
# Export
result.to_csv('data.csv', index=False)
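
The padding step in the loop above can be tried on its own. This is a small sketch with made-up values (the file names are placeholders, not data from the page), showing how building with orient='index' and then transposing leaves the shorter column with an empty cell instead of raising an error.

```python
import pandas as pd

# Hypothetical ragged data: two images, one title.
data = {
    "Field_01": ["cover.png", "hover.png"],
    "Field_02": ["Kitsch Paddle Hair Brush"],
}

# Each key becomes a row first; short rows are padded with missing values.
# Transposing then turns the keys back into column names.
padded = pd.DataFrame.from_dict(data, orient="index").transpose()

print(padded)
```

Note that this yields one row per image, so the title appears only on the first row of each item. If one row per product is wanted, picking a single image per item is the simpler route.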

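This happens because each for loop iterates over its results but always overwrites the loop variable, so only the last value is kept, and that is what gets added to dataset.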
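
The overwriting behavior can be reproduced without any HTML at all. The sketch below uses two plain lists as stand-ins for the find_all() results (the variable names broken and fixed are illustrative) and compares appending once after the loops with appending inside a single loop.

```python
names = ["Kitsch Paddle Hair Brush",
         "PJ Salvage Luxe Plush Embroidered Blanket - Blush"]
prices = ["CA$24.00", "CA$118.00"]

# Pattern from the question: each loop body is a bare expression,
# so the value is evaluated and thrown away on every iteration.
for name in names:
    (name)
for price in prices:
    (price)

# After the loops, name and price still hold only the last items.
broken = []
broken.append({"Field_02": name, "Field_04": price})

# Appending inside one loop keeps every item.
fixed = []
for name, price in zip(names, prices):
    fixed.append({"Field_02": name, "Field_04": price})

print(len(broken), len(fixed))
```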

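Suggestion: simplify by targeting the container elements with the class grid-item, which hold all the information for one product. Iterate over those containers and append the data to dataset inside the loop. That way you only need one for loop, which is easier to control.

The following example uses CSS selectors, because I prefer working with them: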

...
soup = BeautifulSoup(f.read(), "html.parser")
for e in soup.select('.grid-item'):
    dataset.append({
        'Field_01':e.img.get('data-src'),
        'Field_02':e.select_one('.grid-title').get_text(strip=True),
        'Field_03':baseurl + e.a.get('href'),
        'Field_04':e.select_one('.product-price').get_text(strip=True)
    })

You can also use find_all() and find() instead. Also check get_text() and its parameters to get rid of line breaks or surrounding whitespace.

for e in soup.find_all('div', class_='grid-item'):
    dataset.append({
        'Field_01':e.find('img', class_='grid-item-image').get('data-src'),
        'Field_02':e.find('div', class_='grid-title').get_text(strip=True),
        'Field_03':baseurl + e.find('a', class_='grid-item-link').get('href'),
        'Field_04':e.find('div', class_='product-price').get_text(strip=True)
    })
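
As a quick illustration of the get_text() remark, this sketch parses a one-element snippet modeled on the sample HTML and compares .text with get_text(strip=True). The variable names raw and stripped are only for the example.

```python
from bs4 import BeautifulSoup

# A price element shaped like the one in the sample HTML,
# with the text on its own line inside the div.
html = '<div class="product-price">\nCA$24.00\n</div>'
price = BeautifulSoup(html, "html.parser").find("div", class_="product-price")

raw = price.text                       # keeps the surrounding newlines
stripped = price.get_text(strip=True)  # trims whitespace around the text

print(repr(raw))
print(repr(stripped))
```

Without strip=True, the newlines end up in the CSV cells, which is easy to miss when the file is opened in a spreadsheet.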

This results in:

Field_01,Field_02,Field_03,Field_04
https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png,Kitsch Paddle Hair Brush,https://www.soxboxmtl.com/home-bath-body/p/kitsch-paddle-hair-brush,CA$24.00
https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png,PJ Salvage Luxe Plush Embroidered Blanket - Blush,https://www.soxboxmtl.com/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush,CA$118.00