So far I've managed to put this together:
from bs4 import BeautifulSoup
import requests
def function():
    url = 'https://dynasty-scans.com/chapters/liar_satsuki_can_see_death_ch28_6#6'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    script = soup.find_all('script')
    print(script[1])
Output:
<script>
//<![CDATA[
var pages = [{"image":"/system/releases/000/036/945/1.png","name":"1"},{"image":"/system/releases/000/036/945/2.png","name":"2"},{"image":"/system/releases/000/036/945/3.png","name":"3"},{"image":"/system/releases/000/036/945/4.png","name":"4"},{"image":"/system/releases/000/036/945/5.png","name":"5"},{"image":"/system/releases/000/036/945/6.png","name":"6"},{"image":"/system/releases/000/036/945/7.png","name":"7"},{"image":"/system/releases/000/036/945/credits.png","name":"credits"}];
//]]>
</script>
I'm trying to extract the "image" values as strings,
for example: "/system/releases/000/036/945/7.png"
How can I do that?
You can use a regular expression to extract the "pages" variable:
import re, json, requests
url = 'https://dynasty-scans.com/chapters/liar_satsuki_can_see_death_ch28_6#6'
r = requests.get(url)
# extract the array literal; the brackets must be escaped to match literally
match = re.search(r'var pages = (\[.*?\]);', r.text).group(1)
# parse it into json
match_json = json.loads(match)
# iterate through it to get the links
images = [img['image'] for img in match_json]
Output:
['/system/releases/000/036/945/1.png',
'/system/releases/000/036/945/2.png',
'/system/releases/000/036/945/3.png',
'/system/releases/000/036/945/4.png',
'/system/releases/000/036/945/5.png',
'/system/releases/000/036/945/6.png',
'/system/releases/000/036/945/7.png',
'/system/releases/000/036/945/credits.png']
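
If it helps, here is a minimal offline sketch of the same idea that can be tried without hitting the site. The helper name extract_image_paths and the trimmed sample string are my own, not part of the page or of any library. It returns an empty list when the "pages" variable is missing, and it builds absolute URLs with urljoin on the assumption that the images are served from the same host as the chapter page.

```python
import json
import re
from urllib.parse import urljoin

# Matches the array literal assigned to `var pages`; brackets are escaped
# so they are treated as literal characters, not a character class.
PAGES_RE = re.compile(r'var pages = (\[.*?\]);', re.DOTALL)


def extract_image_paths(html):
    """Hypothetical helper: return the "image" values, or [] if not found."""
    match = PAGES_RE.search(html)
    if match is None:
        return []
    return [page['image'] for page in json.loads(match.group(1))]


# Sample trimmed from the script tag shown in the question.
sample = '''<script>
//<![CDATA[
var pages = [{"image":"/system/releases/000/036/945/1.png","name":"1"},{"image":"/system/releases/000/036/945/credits.png","name":"credits"}];
//]]>
</script>'''

paths = extract_image_paths(sample)

# Assumption: the relative paths resolve against the site root.
urls = [urljoin('https://dynasty-scans.com/', p) for p in paths]

print(paths)
print(urls)
```

Checking for None before calling group(1) avoids an AttributeError if the page layout changes and the variable is no longer present.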