How to scrape player data from RealGM using Scrapy



What I'm trying to do is scrape fields from RealGM player pages, for example:
https://basketball.realgm.com/player/player/Summary/1
https://basketball.realgm.com/player/player/Summary/160000

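I'm trying to extract every piece of information from the player profile box, so in the first example I want to extract:

Greg Oden C #20
Born: Jan 22, 1988 (33 years old)
Birthplace/Hometown: Buffalo, New York
Nationality: United States
Height: 7-0 (213cm)
Weight: 273 (124kg)
Draft: 2007 NBA Draft
Pre-Draft Team: Ohio State (Fr)
High School: Lawrence North High School [Indianapolis, Indiana]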

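I haven't had much success. The code below extracts the hrefs, which isn't perfect, but I can work with it. The problem is that I hit an error, which I think happens because not all players have the same data fields. The example above is the maximum output I want, but some players have no birth date, some have no pre-draft team, and so on. For those players I need the spider to return a blank for the missing field and keep scraping.

The second problem is pulling a field like height/weight, which has no href: the whole value is plain text. I haven't managed to pull it; whenever I reference that part of the page, the result comes back blank.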

Any help would be greatly appreciated! This is what I have so far, but I'm stuck:


import scrapy


class RealGMSpider(scrapy.Spider):
    name = "players"
    start_urls = [
        'https://basketball.realgm.com/player/player/Summary/1',
        'https://basketball.realgm.com/player/player/Summary/2',
        'https://basketball.realgm.com/player/player/Summary/160000',
    ]

    def parse(self, response):
        for player in response.css('.profile-box .container , .level-1'):
            yield {
                # .get() has to be called; a bare .get yields the method object
                'name': player.css('span::text')[1].get(),
                # .attrib['href'] raises KeyError when the selector matches nothing
                'link': player.css('a.selected').attrib['href'],
                'bday': player.css('.half-column-left img+ p a').attrib['href'],
                'htwn': player.css('p:nth-child(4) a').attrib['href'],
                'ntion': player.css('.half-column-left p~ p+ p a').attrib['href'],
                'cteam': player.css('.half-column-right img+ p a').attrib['href'],
                'agent': player.css('.half-column-right p:nth-child(5) a').attrib['href'],
                'draftyr': player.css('p:nth-child(6) a').attrib['href'],
                'earlyen': player.css('p:nth-child(7) a').attrib['href'],
                'drafted': player.css('p:nth-child(8) a').attrib['href'],
                'predraft': player.css('p:nth-child(9) a').attrib['href'],
                'hs': player.css('p:nth-child(10) a').attrib['href'],
            }
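
One possible way to get a blank instead of an error is to look up `href` with a default, and to read link-free fields such as height/weight as text rather than through an `a` element. The sketch below is only an illustration: `href_or_blank` and `text_or_blank` are hypothetical helper names, and `node` is assumed to be a Scrapy selector such as `player` in the loop above. The right CSS query for each field still depends on the page markup.

```python
def href_or_blank(node, query):
    """Return the href of the first element matching `query`, or '' if none."""
    # SelectorList.attrib is the attribute dict of the first match and an
    # empty dict when nothing matched, so .get() never raises KeyError.
    return node.css(query).attrib.get("href", "")


def text_or_blank(node, query):
    """Return the whitespace-normalised text of the first match, or ''."""
    # normalize-space(.) flattens the element and all of its descendants
    # into one string, so it also works for fields that contain no <a> tag.
    return node.css(query).xpath("normalize-space(.)").get(default="")
```

Inside `parse`, a field could then be written as `'bday': href_or_blank(player, '.half-column-left img+ p a')`, and a missing field yields an empty string while the spider keeps going.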

Never mind, I was able to do it with BeautifulSoup!

import csv
import re

import requests
from bs4 import BeautifulSoup

url_list = ['https://basketball.realgm.com/player/player/Summary/2',
            'https://basketball.realgm.com/player/player/Summary/1']
for url in url_list:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Take the first div with these classes (the player profile)
    player = soup.find_all('div', class_='wrapper clearfix container')[0]
    # Collapse runs of blank lines into single newlines
    playerprofile = re.sub(
        r'\n\s*\n', r'\n', player.get_text().strip(), flags=re.M)
    output = playerprofile + "\n"
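
To turn that profile text into one row per player, with a blank for every field a player lacks, the text could be split on known labels. This is a minimal sketch under assumptions: fields are taken to appear as `Label:` in the text, the labels in `FIELDS` are guesses based on the example profile and may need adjusting to the real page, and `profile_to_row` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io
import re

# Assumed labels; adjust them to whatever the profile text actually contains.
FIELDS = ["Born", "Hometown", "Nationality", "Height", "Weight",
          "Pre-Draft Team", "High School"]


def profile_to_row(profile_text, fields=FIELDS):
    """Map each known label to its value, leaving missing fields blank."""
    row = dict.fromkeys(fields, "")
    pattern = "(" + "|".join(re.escape(f) for f in fields) + "):"
    # With one capturing group, re.split returns
    # [preamble, label, value, label, value, ...]
    parts = re.split(pattern, profile_text)
    for label, value in zip(parts[1::2], parts[2::2]):
        row[label] = " ".join(value.split())  # collapse internal whitespace
    return row


def rows_to_csv(rows, fields=FIELDS):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `profile_to_row(output)` inside the loop and collecting the results in a list would give rows that `rows_to_csv` can write out. Splitting on labels rather than on line breaks means two fields that share a line, such as height and weight, are still separated.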
