I'm having some trouble saving rows to a csv file after web scraping. I used the same approach that worked fine on another website, but now the csv file is blank; Python doesn't seem to write any rows.
Here is my code, thanks in advance:
import requests
from bs4 import BeautifulSoup
import csv
import lxml
html_page = requests.get('https://www.scrapethissite.com/pages/forms/?page_num=1').text
soup = BeautifulSoup(html_page, 'lxml')
# get the number of pages (it might change in the future as the data is updated)
pagenum = soup.find('ul', {'class': 'pagination'})
n = pagenum.findAll('li')[-2].find('a')['href'].split('=')[1]
# now we convert the value of the page in a range so that we can loop over it
page = range(1, int(n) + 1)
print(page)
with open('HockeyLeague.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['team_name', 'year', 'wins', 'losses', 'win_perc', 'goal_for', 'goal_against'])

    for p in page:
        html_page = requests.get(f'https://www.scrapethissite.com/pages/forms/?page_num={p}&per_page=25').text
        soup = BeautifulSoup(html_page, 'lxml')
        table = soup.find('table', {'class': 'table'})
        for row in table.findAll('tr', {'class': 'team'}):
            # getting the wanted variables:
            team_name = row.find('td', {'class': 'name'}).text
            year = row.find('td', {'class': 'year'}).text
            wins = row.find('td', {'class': 'wins'}).text
            losses = row.find('td', {'class': 'losses'}).text
            goal_for = row.find('td', {'gf'}).text
            goal_against = row.find('td', {'ga'}).text
            try:
                win_perc = row.find('td', {'pct text-success'}).text
            except:
                win_perc = row.find('td', {'pct text-danger'}).text
            # write the data in the csv file we created at the beginning
            csv_writer.writerow([team_name, year, wins, losses, win_perc, goal_for, goal_against])
Since your script basically works, here are a few things you should keep in mind:
- I would recommend opening the file with newline='' on all platforms to disable universal newline translation, and with encoding='utf-8' to be sure you are working "correctly":
with open('HockeyLeague.csv', 'w', newline='', encoding='utf-8') as f: ...
- .strip() your texts, or use .get_text(strip=True), to get clean output and avoid unwanted line breaks:
team_name = row.find('td', {'class': 'name'}).text.strip()
year = row.find('td', {'class': 'year'}).text.strip()
...
- In newer code avoid the old syntax findAll() and use find_all() instead. For more, take a minute to check the docs. A sketch that applies all three points to your original loop follows below.
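For reference, a minimal sketch of your original page-range loop with the points above applied (newline=''/encoding='utf-8' on open(), .strip() on every cell, find_all() instead of findAll()). It assumes the column classes name/year/wins/losses/pct/gf/ga that your own selectors already use:
import requests
from bs4 import BeautifulSoup
import csv

# read the last pagination link once to find the number of pages, as in your code
html_page = requests.get('https://www.scrapethissite.com/pages/forms/?page_num=1').text
soup = BeautifulSoup(html_page, 'lxml')
n = soup.find('ul', {'class': 'pagination'}).find_all('li')[-2].find('a')['href'].split('=')[1]

with open('HockeyLeague.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['team_name', 'year', 'wins', 'losses', 'win_perc', 'goal_for', 'goal_against'])

    for p in range(1, int(n) + 1):
        html_page = requests.get(f'https://www.scrapethissite.com/pages/forms/?page_num={p}&per_page=25').text
        soup = BeautifulSoup(html_page, 'lxml')
        for row in soup.find('table', {'class': 'table'}).find_all('tr', {'class': 'team'}):
            # .strip() removes the surrounding whitespace and newlines from each cell
            csv_writer.writerow([
                row.find('td', {'class': 'name'}).text.strip(),
                row.find('td', {'class': 'year'}).text.strip(),
                row.find('td', {'class': 'wins'}).text.strip(),
                row.find('td', {'class': 'losses'}).text.strip(),
                row.find('td', {'class': 'pct'}).text.strip(),  # matches both 'pct text-success' and 'pct text-danger'
                row.find('td', {'class': 'gf'}).text.strip(),
                row.find('td', {'class': 'ga'}).text.strip(),
            ])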
Example: uses a while loop that checks for the "Next" button and extracts its url, plus stripped_strings to pull the texts from each row:
import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.scrapethissite.com/pages/forms/'

with open('HockeyLeague.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['team_name', 'year', 'wins', 'losses', 'win_perc', 'goal_for', 'goal_against'])

    while True:
        html_page = requests.get(url).text
        soup = BeautifulSoup(html_page, 'html.parser')  # name the parser explicitly to avoid the "no parser" warning
        for row in soup.find_all('tr', {'class': 'team'}):
            # write the data in the csv file we created at the beginning
            csv_writer.writerow(list(row.stripped_strings)[:-1])
        # follow the "Next" button until there is none
        if soup.select_one('.pagination a[aria-label="Next"]'):
            url = 'https://www.scrapethissite.com' + soup.select_one('.pagination a[aria-label="Next"]').get('href')
        else:
            break
Output:
team_name,year,wins,losses,win_perc,goal_for,goal_against
Boston Bruins,1990,44,24,0.55,299,264
Buffalo Sabres,1990,31,30,0.388,292,278
Calgary Flames,1990,46,26,0.575,344,263
Chicago Blackhawks,1990,49,23,0.613,284,211
Detroit Red Wings,1990,34,38,0.425,273,298
Edmonton Oilers,1990,37,37,0.463,272,272
...
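If your original script still produces a blank file, a quick sanity check (a minimal sketch, assuming the file name used above) is to read the file back right after the with-block closes and count the rows:
import csv

# if this prints 0, nothing was written inside the loop, i.e. the loop body never ran
with open('HockeyLeague.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(len(rows), 'rows written')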