Speeding up a potentially large chain of BeautifulSoup tasks



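I'm very new to web scraping (I know almost nothing about HTML, and this is my first time using BeautifulSoup), and I'm making a program that basically lets me generate a PDF or EPUB of online novels. I'm not worried about compatibility with various websites, since I'm only doing this for myself. I wrote code that, given a link to any chapter, collects the links to all chapters of a web novel into a list, but it takes a long time: about one second per link. Given that some novels have more than 1-2 thousand chapters, just collecting all the links would take half an hour, and that's before the program has fetched the body text behind each link and compiled it into a PDF. Is there a way to make the code faster?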

import requests
from bs4 import BeautifulSoup

def list_chapters():
    given_chapter = 'https://www.box-novel.com/novel/cannon-fodder-counterattack-system/chapter-4-1/'
    # find_first_chapter() is defined elsewhere in the program (not shown here)
    current_chapter = find_first_chapter(given_chapter)
    print("Starting chapter: ", current_chapter)
    link_list = []
    try:
        # Follow the "next chapter" link page by page until there is none left
        while True:
            link_list.append(current_chapter)
            r = requests.get(current_chapter)
            soup = BeautifulSoup(r.content, 'html.parser')
            s = soup.find('div', class_='nav-next')
            for link in s.find_all('a'):
                current_chapter = link.get('href')
    except AttributeError:
        # No nav-next div on this page: s is None, so drop the last entry
        link_list.pop(-1)
    print(len(link_list), "chapters detected.")

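Please also tell me about any other ways to improve the code. Notes: I pop the last value from the link list because that is easier than detecting a nav-next value that points to the manga info page, as happens with the previous-chapter navigation. Also, ignore the random trashy novel link I used; it was the shortest one I could find on the front page.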

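Your task is not trivial. First, the links to all chapters are loaded through an ajax POST request made from the entry-point page. Once you have sorted that out, you need a robust async solution, by which I mean one that could handle a list of 1bn links and could run on a Raspberry Pi (so you need some notion of a queue). The following takes about 10 seconds and returns a dataframe with the title and content of each of the 90 chapters in the novel (you can sort by title if you like):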

import asyncio
from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## run this if you're executing the code in a notebook
import nest_asyncio
nest_asyncio.apply()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

#### setup some sort of mock persistence ###
big_df_list = []

#### async scrape funcs ####
def all_chapters_urls():
    # One POST to the ajax endpoint returns the chapter dropdown with every chapter URL
    url_list = []
    payload = {
        'action': 'manga_get_reading_nav',
        'manga': '1987979',
        'chapter': 'chapter-29-7',
        'volume_id': '0',
        'type': 'content'
    }
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        r = client.post('https://www.box-novel.com/wp-admin/admin-ajax.php', data=payload)
        soup = BeautifulSoup(r.text, 'html.parser')
        links = soup.select_one('select.c-selectpicker.selectpicker_chapter.selectpicker.single-chapter-select').select('option')
        for l in links:
            url_list.append(l.get('data-redirect'))
    return url_list

async def get_chapters(url):
    # Fetch one chapter page and store its (title, text) tuple
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = BeautifulSoup(r.text, 'html.parser')
            title = soup.select_one('h1#chapter-heading').get_text(strip=True)
            text_content = soup.select_one('div.text-left').get_text(strip=True)
            big_df_list.append((title, text_content))
        except Exception as e:
            print(url, e)

async def scrape_chapters():
    start_time = datetime.now()
    # The queue holds not-yet-started coroutines; workers pull and await them
    tasks = asyncio.Queue()
    for x in all_chapters_urls():
        tasks.put_nowait(get_chapters(x))

    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()

    # 20 workers means at most 20 requests are in flight at any time
    await asyncio.gather(*[worker() for _ in range(20)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('chapters scraping took', duration)

asyncio.run(scrape_chapters())
df = pd.DataFrame(big_df_list, columns=['Chapter', 'Content'])
print(df)

This will print the following in the terminal:

chapters scraping took 0:00:10.991827
Chapter Content
0   Cannon Fodder Counterattack System - Chapter 30.1   The power of gossip was never been underestimated. Huang Dezheng’s reputation for kind and charismatic was far-reaching. His neighbours recognized him. The original impression of him was quite good, but he did not expect that he would be well-known not only in public but also in private. Especially messing about with your own students!Seeing his white and tender student being dragged by him, notice the way he couldn’t even walk properly. Hehe! What a scumbag!The gossipy neighbours recalled the scene they saw through their door’s peepholes and were still amazed. There was no way. At that time, the two of them were getting intimate, there was still energy to pay attention to whether the door was open, wasn’t there?Huang Dezheng did not notice this little detail when he left with Su Yibai in anger. The time he realized this, it was already several days later.The campus forum calmness of the past was swept away with an earthquake. The entire page layout was filled by posts with similar titles! Among them, the top one was the most eye-catching and popular!“During the 18th of August, School grass[1] Su and Teacher Huang’s cohabitation dog blood drama, here are the pictures and truth”Huang Dezheng, who was passing by his colleague’s computer, inadvertently caught a glimpse of this thick red line of words, and his heart jerked. He quietly held his breath as he returned to his office. His face paled as he entered into the forum he had previously scorned. With trembling hands, he opened the very hot post.“It is said that the landlord was shocked when he heard this. He was not familiar with the school, but the teacher Huang’s reputation in the school was very good. How could it be that he did not close the door and even did it with a student? What a scum?! But there are pictures of the truth, so it was not nonsense, the pictures are linked below.”“Fu*k! 
It turned out to be true!!!”“The soft and cute school grass together with the male god! Look at the hickey on the neck! Fu*k! It’s too intense! Teacher Huang bao dao wei lao[2]!!”“After examining the pictures, it truly hasn’t been photo-shopped… Fu*k! What a scumbag!!”“It should be true… School grass Su never returned to the dormitory and stayed outside, so it turned out…”“To help the landlord add fire, the photos were taken by a friend who went to the nightclub to play”” It turns out that Su Xuedi[3] is like this in private! Look at the half-covered chest, the creamy thighs! No wonder Teacher Huang This white flower has a half-covered chest and a chest, and the trough is still pink!! No wonder Huang teacher doesn’t love Jiangshan beauties!!”“Wow, there’s a reason the number of people who never go to class is so high. With these two pictures, it seems like our Su Xuedi’s eyes are not very good!”“…”Huang Dezheng looked at the increasingly unsightly text and pictures on the computer screen, his whole body was shaking in anger!Who was it?! Who did he offend for him to be framed so viciously?!He immediately left a message asking the moderator to delete the post, but it didn’t take long for the message that didn’t hide his identity to completely detonate the entire forum!Fu*k the person involved actually appeared!!!The forum was boiling with this additional drama and Huang Dezheng got so angry that his liver began to ache. Not only were the posts not deleted, but his message was even re-posted with screenshots!These students were really shameless![1]School grass: most handsome guy in school. For the opposite gender it would be school flower.[2] Bao dao wei loa: Old but still vigorous. I think that explains it.[3] Xuedi: junior or younger male school mate.(Visited 1 times, 1 visits today)
1   Cannon Fodder Counterattack System - Chapter 29.7   Qin Shiyue rushed back to the house without saying a word, he was tempted to blow up, but he was afraid of hurting the stupid rabbit, so he kept suppressing it.Ye Si Nian also did not say a word, and when he got home, he went into the bathroom without saying anything.The more he thought about the more frustrated he was! Qin Shiyue was tense like a trapped beast as he moved about in the study. The desk was already in chaos, and there were scattered documents on the floor.Just as his anger was reaching the apex, the study door was opened, and the stupid rabbit who had just taken a bath with a towel around his body leisurely walked in.His body was covered with a thin layer of tight and well-proportioned muscles. The skin was fair and smooth, the waist, thin but not weak. At first glance, it was full of explosive power.His eyes glided uncontrollably as he observed the man’s movement. Qin Shiyue was frozen in place, his heart almost stopped beating, and a thought flashed in his mind flashed that allowed him to recover his heartbeat whose speed soared to the limit.Ye Si Nian was getting closer and closer, and Qin Shiyue, who only had a theoretical experience, wanted to step forward into his (Ye Si Nian’s)arms, but Qin Shiyue’s brain was blank, and he didn’t know where to start…Intensely attracted to his lover who was stunned, he pressed his naked and exposed skin on the man’s thin shirt and gently rubbed on them.The man’s reaction was very interesting. Ye Si Nian pursed his lips and pushed the man slightly on his shoulder to make him sit down on the large chair.Smiling as Qin Shiyue raised his head to look up at him, Ye Si Nian’s index finger hooked up his chin and he bent to kiss the tense tightly-close thin lip.Effortlessly prying his lover’s lips open, Ye Si Nian invaded his soft tongue constantly wreaking havoc in Qin Shiyue’s mouth. 
He licked and played with Qin Shiyue’s sensitive mouth before his lover finally reacted.The breathing became more intense, his lover’s strength also increased, Ye Si Nian hummed and pulled away from Qin Shiyue’s mouth and gently licked his lower lip.“I want you, Qin Shiyue.”Looking at his lover’s suddenly large eyes, Ye Si Nian smiled smugly, kissing his earlobe and licking his ears he murmured slowly, “I want you… Qin Shiyue… I want you……”If one could hold back at this time, would he still be a man?!!Qin Shiyue slammed down Ye Si Nian’s thin waist, suppressing his desire. His voice was hoarse with craving, “Stupid rabbit, do you know that you are playing with fire?!”Ye Si Nian raised an eyebrow and replied to the question with action instead.(Visited 1 times, 1 visits today)
2   Cannon Fodder Counterattack System - Chapter 29.8   With his long leg stretched, Ye Si Nian sat on Qin Shiyue’s lap, lowering his head to nibble on his throat, he felt his slight trembling and repressed gasp. He flexibly untied his clothes and put his hands on the well-defined chest.No longer be a man!!Qin Shiyue made a beast-like roar and kissed Ye Si Nian’s fragile neck hard. The hands clinging behind him tore open Ye Si Nian’s towel.=======================The next afternoon Ye Si Nian sat up in bed sourly and examined the various traces all over his body. He was full of regrets.He really underestimated the enemy’s fighting power!The two personalities were frightening! They being virgins who were almost thirty years old was also dreadful! The combination of the two resulted in being tossed from yesterday afternoon to this morning was scary!!!When Qin Shiyue and Pei Yiyuan took turns in battle, who said that having a double personality was amazing? !!Complaining in his heart, Ye Si Nian saw the door being pushed open, and Pei Yiyuan came in with a gentle smile like a spring breeze.“Woken up? Are there any uncomfortable place in your body?” Pei Yiyuan went near the bed and knelt on one knee as he reached out and placed Ye Si Nian into his arms.“No.” Ye Si Nian gave a serious thought about it. He felt that the communication last night was really hearty and he enjoyed himself. It was normal for the muscles to be sore, and it was obvious that he was clean and dry now, so he decided to praise instead, “I felt very good last night!”“It will get better in the future!” The performance of the first time last night was affirmed. Pei YiYuan felt a little proud in his heart. He bowed to kiss Ye Si Nian’s lips. “Yes, Qin Shiyue wanted me to ask how you intend to deal with those two?”Speaking about the incident, the second personality was embarrassed to come out himself to ask. 
Ye Si Nian’s lips twitched and said: “I decided to sell the apartment.”“That’s it?” Pei Yiyuan raised his eyebrows, he also had no good feelings for the two people.“Don’t underestimate the power of gossip…” Ye Si Nian shook his head with a smile and said, “Otherwise, you just wait and see! Without me, they are well able to kill themselves!”“Then I’ll wait and see.” Pei Yiyuan’s arm wrapped around him as he lifted Ye Si Nian up to carry to the bathroom. He did not care and decided to change to a more important topic, “I just went out for a walk and bought your favourite. Porridge…”(Visited 1 times, 1 visits today)
[...]
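
The queue-plus-workers idea can also be seen in isolation, without any network access. The following is only a minimal sketch under stated assumptions, not the code used above: `fake_fetch` is a hypothetical stand-in for a real HTTP call, `scrape_all` is an illustrative name, and the `example.com` URLs are placeholders. It queues plain URLs rather than coroutines and writes each result to its original position, so the output order matches the input order.

```python
import asyncio

async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP GET; swap in a real client call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"content of {url}"

async def scrape_all(urls, fetch=fake_fetch, n_workers=20):
    """Process urls with a fixed pool of workers pulling from a queue."""
    queue = asyncio.Queue()
    for index, url in enumerate(urls):
        queue.put_nowait((index, url))
    results = [None] * len(urls)

    async def worker():
        while True:
            try:
                index, url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do, this worker exits
            try:
                results[index] = await fetch(url)
            except Exception as exc:
                # A failed URL leaves None in its slot and is reported
                print(url, exc)

    # At most n_workers fetches are in flight at any moment
    await asyncio.gather(*(worker() for _ in range(n_workers)))
    return results

if __name__ == "__main__":
    urls = [f"https://example.com/chapter-{n}" for n in range(1, 6)]
    print(asyncio.run(scrape_all(urls)))
```

Because only `n_workers` fetches run at once, memory use and load on the server stay bounded no matter how long the URL list is.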

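If one request at a time takes too long, we should fire off multiple requests at the same time!

How? Well, there are several options, but I'll stick with the aiohttp library, which does what requests does, but asynchronously.

Here is an example of how to use it, which I shamelessly stole from another question: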

import asyncio
import aiohttp
import time

websites = """https://www.youtube.com
http://www.chrome.com
http://www.booking.com
http://www.googleusercontent.com
http://www.google.com.au
http://www.popads.net
http://www.cntv.cn"""

async def get(url, session):
    # Fetch a single URL; report failures instead of raising
    try:
        async with session.get(url=url) as response:
            resp = await response.read()
            print("Successfully got url {} with resp of length {}.".format(url, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))

async def main(urls):
    # One shared session; gather() starts all requests concurrently
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, session) for url in urls])
    print("Finalized all. Return is a list of len {} outputs.".format(len(ret)))

# Split on newlines to get one URL per list entry
urls = websites.split("\n")
start = time.time()
asyncio.run(main(urls))
end = time.time()
print("Took {} seconds to pull {} websites.".format(end - start, len(urls)))
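
Note that `asyncio.gather` as used above starts every request at once, which may be too aggressive for a novel with one or two thousand chapters. One common way to cap concurrency is an `asyncio.Semaphore`. The sketch below is an assumption-laden illustration rather than part of the example above: `gather_limited` and `demo_job` are hypothetical names, and `demo_job` only sleeps instead of making a real request.

```python
import asyncio

async def gather_limited(coro_funcs, limit=10):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results come back in the same order as coro_funcs.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run(func):
        # Each job waits here until one of the `limit` slots is free
        async with semaphore:
            return await func()

    return await asyncio.gather(*(run(func) for func in coro_funcs))

async def demo_job(n):
    """Hypothetical stand-in for a request such as get(url, session)."""
    await asyncio.sleep(0.01)
    return n * n

if __name__ == "__main__":
    jobs = [lambda n=n: demo_job(n) for n in range(5)]
    print(asyncio.run(gather_limited(jobs, limit=2)))
```

With aiohttp, each entry in `coro_funcs` could be something like `lambda url=url: get(url, session)`, so the coroutine is only created once a slot is available.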
