I am trying to save scraped data to a CSV file. However, I get the following error:

TypeError: list indices must be integers or slices, not str

I believe the error comes from this line:

csv_writer.writerow(str(row['url']), str(row['img']), str(row['text']))

Here is the full code:
import requests
from bs4 import BeautifulSoup
import csv

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}

rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article

link_list = []

for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.get('src')
        text = link.span.text
        link_list.append([{'url':url, 'img':img, 'text':text}])
    except AttributeError:
        pass

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
    for row in link_list:
        csv_writer.writerow(str(row['url']), str(row['img']), str(row['text']))

print('All done')
Please note: the code below does create the file and write the header row:

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
Update: using csv.DictWriter():
with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=set().union(*(d.keys() for d in link_list)))
    writer.writeheader()
    writer.writerows(link_list)
You can use set().union(*(d.keys() for d in link_list)) to collect the field names from the dicts, or simply pass ['url', 'img', 'text'] as fieldnames. Note that a set has no guaranteed order, so the column order may vary between runs; passing the explicit list keeps it fixed.
Example:
import requests
from bs4 import BeautifulSoup
import csv

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}

rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article

link_list = []

for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        link_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        pass

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=set().union(*(d.keys() for d in link_list)))
    writer.writeheader()
    writer.writerows(link_list)

print('All done')
Alternative approach: simply store each row as a dict, rather than a dict wrapped in a list:

link_list.append({'url':url, 'img':img, 'text':text})

and write it as:

csv_writer.writerow([row['url'], row['img'], row['text']])
Or, even more simply, store it directly as a list:

link_list.append([url, img, text])

and write the row as-is:

csv_writer.writerow(row)
Example:
import requests
from bs4 import BeautifulSoup
import csv

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}

rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article

link_list = []

for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        link_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        pass

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
    for row in link_list:
        csv_writer.writerow([row['url'], row['img'], row['text']])

print('All done')
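For reference, the original TypeError can be reproduced without any scraping at all: the original append call wrapped each dict in an extra pair of brackets, so iterating over link_list yields lists, and indexing a list with a string key raises exactly that error. A minimal sketch (with made-up placeholder values):

```python
# Reproduce the bug: the append wraps each dict in an extra list.
link_list = []
link_list.append([{'url': 'https://example.com', 'img': 'pic.png', 'text': 'home'}])

row = link_list[0]  # row is a one-element list, not a dict
try:
    row['url']      # indexing a list with a string key
except TypeError as err:
    print(err)      # list indices must be integers or slices, not str

# Appending the dict itself (no extra brackets) makes the lookup work:
link_list = [{'url': 'https://example.com', 'img': 'pic.png', 'text': 'home'}]
print(link_list[0]['url'])
```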
Bug fix: replace the line img = link.get('src') with img = link.img.get('src'), because the src attribute lives on the nested <img> tag, not on the <a> tag itself.
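The difference is easy to see on a minimal snippet (a hypothetical one-link HTML fragment, parsed with the stdlib html.parser instead of html5lib):

```python
from bs4 import BeautifulSoup

html = '<a href="page.html"><img src="pic.png"><span>Caption</span></a>'
link = BeautifulSoup(html, 'html.parser').a

print(link.get('src'))      # None: the <a> tag has no src attribute of its own
print(link.img.get('src'))  # pic.png: src lives on the nested <img> tag
```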
Updated code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}

rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article

link_list = []

for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        link_list.append({
            'url': url,
            'img': img,
            'text': text,
        })
    except AttributeError:
        pass

df = pd.DataFrame(link_list)
print(df)
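Since the original goal was a CSV file, the DataFrame from the pandas version can then be written out with to_csv. A minimal sketch, using placeholder rows standing in for the scraped link_list:

```python
import pandas as pd

# Placeholder rows standing in for the scraped link_list
link_list = [
    {'url': 'https://example.com/a', 'img': 'a.png', 'text': 'Link A'},
    {'url': 'https://example.com/b', 'img': 'b.png', 'text': 'Link B'},
]

df = pd.DataFrame(link_list)
df.to_csv('links.csv', index=False)  # index=False drops pandas' row-number column
```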