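TypeError when trying to save scraped data to a CSV file with Python

I am trying to save scraped data to a CSV file. However, I get the following error:

TypeError: list indices must be integers or slices, not str

I think the error comes from this line of code: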

csv_writer.writerow(str(row['url']), str(row['img']), str(row['text']))

Here is the whole code:

import requests
from bs4 import BeautifulSoup
import csv

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.get('src')
        text = link.span.text
        link_list.append([{'url':url, 'img':img, 'text':text}])
    except AttributeError:
        pass
with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
    for row in link_list:
        csv_writer.writerow(str(row['url']), str(row['img']), str(row['text']))
print('All done')

Note: the code below creates the file and writes the header row:

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])

Update

Using csv.DictWriter():

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    i = csv.DictWriter(csv_out, fieldnames = set().union(*(d.keys() for d in link_list)))
    i.writeheader()
    i.writerows(link_list)

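You can use set().union(*(d.keys() for d in link_list)) to collect the keys from the dicts, or simply pass ['url', 'img', 'text'] as fieldnames. Because a set has no defined order, passing the explicit list also keeps the columns in a predictable order.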
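As a minimal sketch of the DictWriter approach with an explicit fieldnames list, the snippet below writes two made-up records (hypothetical sample data, not taken from the scraped page) to an in-memory buffer instead of links.csv, so the resulting column order is easy to inspect:

```python
import csv
import io

# Hypothetical sample records standing in for the scraped link_list.
link_list = [
    {'url': 'https://example.com/a', 'img': 'a.png', 'text': 'First'},
    {'url': 'https://example.com/b', 'img': 'b.png', 'text': 'Second'},
]

# An in-memory buffer is used here only so the output can be inspected;
# for a real file, open it with newline='' as in the code above.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['url', 'img', 'text'])
writer.writeheader()         # header row built from fieldnames
writer.writerows(link_list)  # one CSV row per dict, in fieldnames order

csv_text = buffer.getvalue()
print(csv_text)
```

With a list, the header and every row follow the order given in fieldnames, whereas a set may iterate in a different order from one run to the next.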

Example
import requests
from bs4 import BeautifulSoup
import csv

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        # store each record as a plain dict (not wrapped in a list)
        link_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        # links without an <img> or <span> are skipped
        pass
with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    # a set has no defined order, so the column order may vary between runs
    i = csv.DictWriter(csv_out, fieldnames = set().union(*(d.keys() for d in link_list)))
    i.writeheader()
    i.writerows(link_list)
print('All done')

Alternative approach:

Store each record simply as a dict, instead of wrapping the dict in its own list inside link_list:

link_list.append({'url':url, 'img':img, 'text':text})

and write it as:

csv_writer.writerow([row['url'], row['img'], row['text']])

Or, even more simply, store each record directly as a list:

link_list.append([url,img,text])

and write it as a list:

csv_writer.writerow(row)
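
To see why the original loop fails, here is a small sketch with one hypothetical record. It assumes the shapes discussed above: a dict wrapped in a list (as in the question) versus a plain dict or a plain list. It also shows that writerow() of csv.writer expects a single iterable rather than several positional arguments.

```python
import csv
import io

# Hypothetical record standing in for one scraped link.
record = {'url': 'https://example.com/a', 'img': 'a.png', 'text': 'First'}

# Shape used in the question: each entry is a list holding one dict.
wrapped = [record]
try:
    wrapped['url']            # indexing a list with a str
except TypeError as exc:
    error_message = str(exc)  # the message reported in the question

buffer = io.StringIO()
csv_writer = csv.writer(buffer)

# Shape 1: a plain dict; pick the values out into one list per row.
csv_writer.writerow([record['url'], record['img'], record['text']])

# Shape 2: a plain list; it can be passed to writerow() unchanged.
row = ['https://example.com/b', 'b.png', 'Second']
csv_writer.writerow(row)

# writerow() takes exactly one iterable, so passing three separate
# arguments (as the question's code does) also raises TypeError.
try:
    csv_writer.writerow('x', 'y', 'z')
    multiple_args_ok = True
except TypeError:
    multiple_args_ok = False

print(error_message)
print(buffer.getvalue())
```
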
Example
import requests
from bs4 import BeautifulSoup
import csv

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        # store each record as a plain dict
        link_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        # links without an <img> or <span> are skipped
        pass
with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
    for row in link_list:
        # writerow() takes one iterable holding the values of the row
        csv_writer.writerow([row['url'], row['img'], row['text']])
print('All done')

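Bug fix: replace the line img = link.get('src') with img = link.img.get('src'), because the src attribute belongs to the <img> nested inside the link, not to the <a> tag itself.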
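
The difference between the two calls can be checked on a small fragment. The markup below is hypothetical (an assumption about the general shape of such a link, not copied from the page), and the sketch uses Python's built-in 'html.parser' rather than html5lib so that no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a link that wraps an image and a caption.
html = '<article><a href="/page"><img src="pic.png"><span>Caption</span></a></article>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.article.a

src_on_link = link.get('src')     # looked up on the <a> tag, which has no src
src_on_img = link.img.get('src')  # looked up on the nested <img> tag

print(src_on_link)
print(src_on_img)
```

If a link contains no <img> at all, link.img is None and calling .get on it raises AttributeError, which the try/except in the loop catches, so such links are skipped.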

Updated code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        link_list.append({
            'url':url,
            'img':img,
            'text':text,
        })
    except AttributeError:
        # links without an <img> or <span> are skipped
        pass
# one row per dict; the dict keys become the column names
df = pd.DataFrame(link_list)
print(df)
# to save the result as CSV, write the frame out without the index column
df.to_csv('links.csv', index=False)
