Scraping a script tag that contains specific text



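I'm trying to scrape the value of "imageToken" from a script tag in the page source below. The Python code below gets the token about 75% of the time, but the rest of the time the number of script tags on the page apparently changes, and it picks the wrong tag.

Is there a way to search all of the script tags for the one that contains "imageToken"?

Here is the code that works about 75% of the time.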

html_source = driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')

scripts = soup.find_all('script')[20]
findtoken = scripts.string.split(',')[58]
token = findtoken.split(':')[2].strip('"')
print(token)

I also tried this, but it returns nothing:

html_source = driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')

scripts = soup.find_all('script')
for script in scripts:
    if 'imageToken' in script:
        print(script)

Here is the source of the script tag. There are many other scripts on the page, but this is the only one that contains "imageToken".

<script>
((data) => {
/* TASK: Fix this. Move away from F3.page */
window.F3 = window.F3 || {};
window.F3.page = window.F3.page || {};
Object.assign(window.F3.page, data);
})({"user":{"email":"email@address.com","useFacebookPhoto":false,"joinDate":"2021-02-05T10:11:11-07:00","hasIcon":false,"confirmed":true,"disabled":false,"hasPassword":true,"ancestrySubscriber":false,"admin":false,"accountStatus":"monthly-subscriber","subscriptionStatus":"subscriber","FreeAccess":true,"accountState":{"signedOut":false,"registered":true,"subscriber":true,"expiring":false,"freeTrialSubscriber":false,"payingSubscriber":true,"bundleSubscriber":false,"newspaperSubscriber":false,"acomSubscriber":false,"formerSubscriber":false,"formerPayingSubscriber":false,"formerBundle":false,"currentSubscriptionType":"monthly","currentAccountStatus":"monthly-subscriber","oldSubscriptionStatus":"subscriber"},"passwordSerial":1,"userId":6812311,"username":"myusername"},"totalImages":585709948,"config":{"api":{"host":"http://svc.fold3.com:50000","f3Api":"http://api.fold3.com/fold31-api","path":"/fold31/api"},"app":{"canonical":"https://www.fold3.com","cookieDomain":".fold3.com","env":"live","goStack":"https://go.fold3.com","hostname":"www.fold3.com","trustedHostname":"fold3.com"},"ancestry":{"domain":"https://www.ancestry.com","internalDomain":"ancestry.int","redirectHost":"https://www.fold3.com","clientId":"60e8bf12987c2a38a1f48b3c8e41f4400d3b7eb2","redirectPath":"/auth/openid","ssoPath":"/sso/oidc/authorize"},"fold3":{"contactNumber":"1-800-613-0181"},"image":{"host":"https://img.fold3.com","hostRotating":"https://img#.fold3.com","path":"/img/"},"oldStack":{"host":"http://php.fold3.com:9090"},"regiment":{"host":"http://regiment.fold3.com","path":"/fold31-regiment/api"},"search":{"host":"http://search-es.fold3.com","path":"/fold31-search/api"}},"isMobile":false,"image":{"imageToken":"4IIROAS9p-z9rCHcF2toENYedok9hGmwdOsdlKGAfCzNNch2fNPT9HcElRYXBOL66kcnDgT7C9-aivjlk5o4Kwlgc7HB6U_MeIjtQuF2mMrfZq6dsivylzR2d30JiKv46hcMyMMwmBuRSI9_TlCelg==","imageId":692219369,"publication":{"dbid":61641,"mediaProvider":"EMS","allowDownloadDoc":true,"allowAnnotations":true,"hasOcr":false,"recordCountMode":"images","rollupImage":"NONE","lastModification":"2020-10-14T11:06:06-06:00","lastSorted":"2020-10-15T09:16:04-06:00","configuredAccessLevel":"REGISTERED","maximumAccessLevel":"REGISTERED","minimumAccessLevel":"REGISTERED","featured":false,"hashPath":"hiOcMlUzt","publicationId":1104,"contentType":"IMAGE"</script>

To search for a tag that contains specific text, you can use the :contains(<my text>) selector.

In your example, to find the script that contains the text imageToken, use:

print(soup.select_one("script:contains('imageToken')"))

Note: to use a selector, call the select() method instead of find_all(), or select_one() instead of find().
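As a side note, newer releases of Soup Sieve (the selector engine behind select(), version 2.1 and later) deprecate :contains in favor of :-soup-contains. The following is a minimal sketch of the same lookup with the newer spelling. The HTML is a made-up miniature of the page, and the token value "abc123==" is a placeholder, not real data.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real page has many more script tags.
html = """
<html><body>
<script src="/static/app.js"></script>
<script>window.other = {"foo": 1};</script>
<script>window.F3 = {"image": {"imageToken": "abc123==", "imageId": 1}};</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# :-soup-contains is the newer name for :contains (assumes Soup Sieve >= 2.1).
tag = soup.select_one('script:-soup-contains("imageToken")')

# select_one() returns None when nothing matches, so guard before using it.
print(tag.string if tag is not None else "no matching script tag")
```

Because the selector matches on text rather than position, it should keep working when the number of script tags on the page changes.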

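Your second approach is on the right track, but it is missing .string. Writing 'imageToken' in script checks whether one of the tag's child nodes is equal to that string; it does not search inside the script text, so it never matches. Also note that .string is None for script tags with no inline content (for example, ones that only have a src attribute), so guard against that: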

for script in scripts:
    # <== add .string; skip tags whose .string is None (e.g. <script src=...>)
    if script.string and 'imageToken' in script.string:
        print(script.string)
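Once the right script tag is found, the token still has to be pulled out of its text. Splitting on commas and colons at fixed positions is fragile, so here is one possible sketch that uses a regular expression instead. It assumes the token always appears as "imageToken":"..." with no escaped quotes inside the value. The HTML and the token value are made-up placeholders, and find() with string=re.compile(...) is used here as an alternative to the loop above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature page modeled on the script tag in the question.
html = """
<script src="/static/app.js"></script>
<script>
((data) => {
    window.F3 = window.F3 || {};
    Object.assign(window.F3, data);
})({"isMobile":false,"image":{"imageToken":"abc123==","imageId":1}})
</script>
"""

soup = BeautifulSoup(html, "html.parser")

# string= accepts a compiled regex and is matched against each tag's .string;
# tags whose .string is None (such as <script src=...>) do not match.
script = soup.find("script", string=re.compile("imageToken"))

token = None
if script is not None:
    # Capture everything between the quotes that follow "imageToken":
    match = re.search(r'"imageToken"\s*:\s*"([^"]+)"', script.string)
    if match:
        token = match.group(1)

print(token)
```

If the embedded object were complete, valid JSON, parsing it with json.loads and reading data["image"]["imageToken"] would be a sturdier option than a regex; the regex is used here because it does not depend on the rest of the payload.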
