OSU: download link opens the beatmap page instead of downloading the beatmap file

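I noticed that about 98% of the songs in the official OSU beatmap packs are ones I don't want to play. The same goes for the unofficial mega packs you can find (2011, 2012, 2013, and so on), each with around 20 song packs per year.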

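I did find that the "most favourited" page on osu, https://osu.ppy.sh/beatmapsets?sort=favourites_desc, has a ton of songs I like or would be willing to play. So I tried to write a Python script that clicks the download button on every beatmap panel. I learned a lot along the way: ActionChains move_to_element (for hover menus), WebDriverWait with element_to_be_clickable, StaleElementReferenceException, and scrolling the page with execute_script.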

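Because elements kept disappearing from the page/DOM, a plain "for element in elements" loop did not work reliably. I decided instead to have the script scroll several times to load more beatmaps and then scrape the HREF links containing the word "download". That worked well for capturing "most" of the links: it collected at least 3000 unique ones.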
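The link-scraping step can be sketched without a browser. This is only a minimal sketch, assuming the HTML of the scrolled page has already been obtained (for example from Selenium's `driver.page_source`). The names `DownloadLinkParser` and `collect_download_links` are hypothetical, and the filter keeps only HREFs ending in `/download`, which is an assumption based on the links shown in this post.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Collects href values of <a> tags whose link ends in '/download'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: download links look like .../beatmapsets/<id>/download
        if href and href.rstrip("/").endswith("/download"):
            self.links.append(href)


def collect_download_links(html):
    """Return the unique download links in the order they first appear."""
    parser = DownloadLinkParser()
    parser.feed(html)
    # dict.fromkeys keeps insertion order and drops duplicates
    return list(dict.fromkeys(parser.links))
```

Parsing the page source once, after scrolling has finished, avoids holding references to elements that may go stale while the page keeps loading.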

I put them in a text file, which looks like this:

...  
https://osu.ppy.sh/beatmapsets/1457867/download  
https://osu.ppy.sh/beatmapsets/881996/download  
https://osu.ppy.sh/beatmapsets/779173/download  
https://osu.ppy.sh/beatmapsets/10112/download  
https://osu.ppy.sh/beatmapsets/996628/download  
https://osu.ppy.sh/beatmapsets/415886/download  
https://osu.ppy.sh/beatmapsets/490662/download  
...
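For reference, a file in this format could be read back with something like the following. It is a rough sketch: `parse_link_lines` is a hypothetical helper, and the pattern assumes every link has exactly the shape shown above.

```python
import re

# Matches the download links shown above and captures the beatmapset ID.
DOWNLOAD_URL = re.compile(r"^https://osu\.ppy\.sh/beatmapsets/(\d+)/download$")


def parse_link_lines(lines):
    """Return unique beatmapset IDs (as ints), in file order."""
    ids = []
    seen = set()
    for line in lines:
        match = DOWNLOAD_URL.match(line.strip())
        if match is None:
            continue  # skips blank lines, "..." placeholders, anything else
        beatmapset_id = int(match.group(1))
        if beatmapset_id not in seen:
            seen.add(beatmapset_id)
            ids.append(beatmapset_id)
    return ids
```

It accepts any iterable of lines, so an open file object (for example a hypothetical `links.txt`) can be passed in directly.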
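
The "Download" button on each panel has this HREF link. If you click the button, the beatmap file downloads; it is an .osz file. However, if you right-click the "Download" button, choose "Copy link", and open that link in a new page or new tab, it redirects to the beatmap's page and does not download the file.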

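I got it working by using the pandas module to read an .xlsx Excel file of the URLs and looping over each URL. After the URL's page opens, the script clicks the download button: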

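import os
import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# driver_path (path to chromedriver) and download_file() are defined elsewhere in the script.

def read_excel():
    df = pd.read_excel('book.xlsx')  # Get all the urls from the excel
    mylist = df['urls'].tolist()  # urls is the column name
    print(mylist)  # will print all the urls
    # now loop through each url & perform actions.
    for url in mylist:
        options = webdriver.ChromeOptions()
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        # Raw string keeps the backslashes literal; expandvars fills in %UserName% on Windows.
        profile = os.path.expandvars(r"C:\Users\%UserName%\AppData\Local\Google\Chrome\User Data\Profile1")
        options.add_argument("user-data-dir=" + profile)
        # Selenium 4 style; older code passed executable_path= and chrome_options= instead.
        driver = webdriver.Chrome(service=Service(driver_path), options=options)
        driver.get(url)
        # Accept a JavaScript alert if one shows up within 3 seconds.
        try:
            WebDriverWait(driver, 3).until(EC.alert_is_present(), 'Timed out waiting for alert.')
            alert = driver.switch_to.alert
            alert.accept()
            print("alert accepted")
        except TimeoutException:
            print("no alert")
        time.sleep(1)
        wait = WebDriverWait(driver, 10)
        # Click the download button on the beatmap page once it is clickable.
        try:
            wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "body > div.osu-layout__section.osu-layout__section--full.js-content.beatmaps_show > div > div > div:nth-child(2) > div.beatmapset-header > div > div.beatmapset-header__box.beatmapset-header__box--main > div.beatmapset-header__buttons > a:nth-child(2) > span"))).click()
            time.sleep(1)
        except Exception:
            print("Can't find the Element Download")
        time.sleep(10)
        download_file()  # blocks until the download folder has no file in progress
        driver.quit()  # quit() also ends the chromedriver process started for this url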

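This is a sequential, "one at a time" function. download_file() is a loop that checks the download folder to see whether a file is still downloading and, if not, moves on to the next URL. This works. Of course, the site has limits: you can only run 8 downloads at a time, and after 100 to 200 downloads you can't download any more and have to wait a while. But the loop keeps going and tries every URL unless you stop the script. Fortunately, you can see the last beatmap that downloaded, find its position in the Excel spreadsheet, delete the rows above it, and restart the script. I'm sure I could code it so that the loop stops when no new file shows up in the download folder.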
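The folder check and the "stop when nothing new appears" idea could look roughly like this. It is a sketch under assumptions, not the actual download_file(): it relies on Chrome giving in-progress downloads a `.crdownload` suffix, and the names `list_files`, `wait_for_downloads`, and `new_osz_files` are hypothetical.

```python
import os
import time


def list_files(folder):
    """Return the set of file names currently in the folder."""
    return set(os.listdir(folder))


def wait_for_downloads(folder, timeout=120.0, poll=1.0):
    """Block until no '.crdownload' partial files remain, or the timeout passes.

    Returns True if the folder is free of partial downloads, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        pending = [n for n in list_files(folder) if n.endswith(".crdownload")]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def new_osz_files(before, after):
    """Return names of .osz files present in 'after' but not in 'before'."""
    return sorted(n for n in after - before if n.lower().endswith(".osz"))
```

The loop would take a `list_files` snapshot before clicking, call `wait_for_downloads` afterwards, and take a second snapshot. If `new_osz_files` returns an empty list, the script has probably hit the download limit and can break out instead of trying every remaining URL.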

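Final question: is there a way to open these download links and have the file download without having to click the "Download" button after the page opens? The link redirects to the beatmap page instead of downloading the file automatically. There must be some JavaScript/HTML detail I'm not aware of.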
