Python web scraping with Selenium and loading the data into a Pandas DataFrame



I am trying to scrape every Track, Roblox ID, and Rating from https://robloxsong.com/ and get them into a pandas DataFrame. But when I try the code below, it gives me a single list in which all the tracks, IDs, and ratings are joined together with "\n". On top of that, the data spans 50 pages, and I can't work out how to go through all of them and collect everything.

#Importing
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

webD = webdriver.Chrome(ChromeDriverManager().install())
webD.get('https://robloxsong.com/')
#Loading data from the songs tag
elements = webD.find_elements_by_class_name('songs')
#Declaring the DataFrame
result = pd.DataFrame(columns=['Track', 'Roblox_ID', 'Rating'])
#Extracting the text
listOfElements = []
for i in elements:
    listOfElements.append(i.text)

print(listOfElements)

When I print listOfElements, this is the output:

>>> ["Track Roblox ID Ratingneggn5128532009nCopyn30267nCaillou Trap Remixn212675193nCopyn26550nZEROTWOOOOOn3951847031nCopyn26045nRUNNING IN THE OOFS! (EPIC)n1051512943nCopyn25938nSPOOKY SCARY SKELETONS (100,000+ sales)n160442087nCopyn24106nBanana Song (I'm A Banana)n169360242nCopyn23065nshrek anthemn152828706nCopyn22810nraining tacosn142376088nCopyn19135nGFMO - Hello (100k!!)n214902446nCopyn19118nWide Put in Walking Audion5356051569nCopyn13472nRaining Tacos. (Original)n142295308nCopyn13235nNARWHALSn130872377nCopyn12858nOld Town Roadn2862170886nCopyn11888nnon130786686nCopyn11570nCRAB RAVE OOFn2590490779nCopyn11551nKFC is illuminati confirmed ( ͡° ͜ʖ ͡° )n205254380nCopyn10668nNightcore - Titaniumn398159550nCopyn10667nHelp Me Help You Logan Pauln833322858nCopyn10631nI Like Trainsn131072261nCopyn10271nI'm Fine.n513919776nCopyn9289nAINT NOBODY GOT TIME FOR DATn130776739nCopyn9093nRoxannen4277136473nCopyn8912nFlamingo Intron6123746751nCopyn8836nOld Town Road OOFEDn3180460921nCopyn8447nWii Musicn1305251774nCopyn8364nHow To Save A Life (Bass Boosted)n727844285nCopyn8309nDubstep Remix [26k+]n130762736nCopyn8052nEVERYBODY DO THE FLOPn130778839nCopyn7962nAnt, SeeDeng, Poke - PRESTONPLAYZ ROBLOXn1096142805nCopyn7778nYeah Yeah Yeahs - Heads Will Roll (JVH-C remix)n290176752nCopyn7706n♪ Nightcore - Light 'Em Up x Girl On Fire (S/V)n587156015nCopyn7527nDo the Harlem Shake!n131154740nCopyn7314nZero two but full songn5060369688nCopyn7221nInvinsible [NCS]n6104227669nCopyn7011nParty Musicn141820924nCopyn7009n♫♫Ƴℴu'ѵҿ ßƏƏƝ ƮƦ☉ᏝᏝƎƊ♫♫n142633540nCopyn6972nRevenge (Minecraft Music)n3807239428nCopyn6943nOOF LASAGNAn2866646141nCopyn6808nAlbert Sings Despaciton1398660411nCopyn6655nDo A Barrel Roll!n130791919nCopyn6647nLadies And Gentlemen We Got Himn2624663028nCopyn6642nCreepy Music Boxn143382469nCopyn6516nThe Roblox Songn1784385682nCopyn6474nZEROTWOOOOO with pandan4459223174nCopyn6362nsad violinn135308045nCopyn6261noofing in the 90'sn915288747nCopyn6092nElevator Musicn130768299nCopyn5998nFEED ME!n130766856nCopyn5909nTanqR Outron5812114304nCopyn5859nMako - Beam (Proximity)n165065112nCopyn5787"]

Two questions need answering:

  1. How do I get this into a DataFrame?
  2. How do I get the data from all 50 pages?
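
First, a note on why you got one big string: the class name songs evidently matched a single container element rather than one element per row, so Selenium's .text property returned the visible text of the whole container, with the individual cells joined by "\n". A quick diagnostic (reusing the elements list from your code) would make this visible:

# Diagnostic sketch: if 'songs' is the class of one wrapping container,
# the list holds a single WebElement whose .text flattens every row.
print(len(elements))                       # expected: 1
print(elements[0].text.split('\n')[:5])    # header line, then track, ID, 'Copy', rating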

To fix both problems, you can simply use the requests library to fetch each page and pandas to parse the tables out of the page HTML. To get everything, you need to fetch and parse each of the 50 pages separately. The following code parses all of the pages into a single DataFrame:

import requests
import pandas as pd

def parse_tables(page_html):
    page_tables = pd.read_html(page_html)    # directly parse the page's tables into DataFrames
    column_names = page_tables[0].columns    # save the column names for later use
    page_tables[0].columns = range(len(column_names))
    # The data is spread across multiple <table> html tags that render as a
    # single table, so combine the data from all <table> tags
    df = pd.concat(page_tables)
    df.columns = column_names
    return df.reset_index(drop=True)

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
base_url = "https://robloxsong.com/"
num_pages = 50    # number of pages that you want to parse
ratings_tables = []
for page_num in range(1, num_pages + 1):    # range excludes the stop value, so add 1 to reach page 50
    page_url = base_url + "?page=" + str(page_num)
    print("Parsing page " + str(page_num))
    response = requests.get(page_url, headers=headers)   # fetch the html page
    if response.ok:
        page_html = response.content
        page_table = parse_tables(page_html)
        ratings_tables.append(page_table)
    else:
        print("Unable to fetch page:", response.content)
final_ratings_table = pd.concat(ratings_tables).reset_index(drop=True)
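
Alternatively, if you want to stay with Selenium, question 1 can be answered by splitting the blob you already extracted. A minimal sketch, assuming the layout shown in your output (a header line followed by repeating groups of track, Roblox ID, the 'Copy' button text, and rating); text_to_dataframe is a hypothetical helper, not part of any library:

import pandas as pd

def text_to_dataframe(blob):
    # Drop the 'Track Roblox ID Rating' header line, then read the remaining
    # lines in groups of four: (track, roblox_id, 'Copy' button text, rating).
    tokens = blob.split('\n')[1:]
    rows = [tokens[i:i + 4] for i in range(0, len(tokens), 4)]
    return pd.DataFrame(
        [(track, roblox_id, rating) for track, roblox_id, _copy, rating in rows],
        columns=['Track', 'Roblox_ID', 'Rating'],
    )

df = text_to_dataframe(listOfElements[0])

Note that this is more brittle than pd.read_html: any change to the page layout, such as an extra line per row, breaks the grouping into fours.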
