Extract the text of a specific <p> tag



I only want to extract the address from the p tags. For example, from the output below I want to get Santa Barbara, CA 93101.

[<p class="hide" id="phoneDiv_80863"><i aria-hidden="true" class="fa fa-phone-square"></i> (805) 636-9890</p>, <p>
Santa Barbara, CA 93101

</p>, <p style="margin-top:2em;"><a class="btn btn-default" href="/profile/id/80863/NicoleABotaitis93101" target="_top">View</a> <a class="btn btn-default" href="mailto:nicole@santabarbaratherapist.com" id="eml80863" target="_top">Email</a></p>]
[]
[<p class="hide" id="phoneDiv_26092"><i aria-hidden="true" class="fa fa-phone-square"></i> 8058956960</p>, <p>
Santa Barbara, CA 93111

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
limit = 25
url = f'https://www.counselingcalifornia.com/cc/cgi-bin/utilities.dll/customlist?FIRSTNAME=~&LASTNAME=~&ZIP=&DONORCLASSSTT=&_MULTIPLE_INSURANCE=&HASPHOTOFLG=&_MULTIPLE_EMPHASIS=&ETHNIC=&_MULTIPLE_LANGUAGE=ENG&QNAME=THERAPISTLIST&WMT=NONE&WNR=NONE&WHP=therapistHeader.htm&WBP=therapistList.htm&RANGE=1%2F{limit}&SORT=LASTNAME'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
rows = soup.find_all('div', {'class':'row'})
temp=[]
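for row in rows:
    # Each therapist entry sits in a div with class "col-sm-3"
    t = row.find_all('div', class_='col-sm-3')
    for i in t:
        # Prints every <p> in the column, not just the address
        u = i.find_all('p')
        print(u)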

Here is a better solution using a CSS selector:

import requests
from bs4 import BeautifulSoup
import pandas as pd
limit = 25
url = f'https://www.counselingcalifornia.com/cc/cgi-bin/utilities.dll/customlist?FIRSTNAME=~&LASTNAME=~&ZIP=&DONORCLASSSTT=&_MULTIPLE_INSURANCE=&HASPHOTOFLG=&_MULTIPLE_EMPHASIS=&ETHNIC=&_MULTIPLE_LANGUAGE=ENG&QNAME=THERAPISTLIST&WMT=NONE&WNR=NONE&WHP=therapistHeader.htm&WBP=therapistList.htm&RANGE=1%2F{limit}&SORT=LASTNAME'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
for address in soup.select('.col-sm-3>p:nth-child(3)'):
    print(address.text.strip())
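To see why `p:nth-child(3)` picks the address, here is a minimal offline sketch. The `<p>` tags are copied from the question's output, but the `<h4>` placed before them is an assumption: the selector implies that some element precedes the phone paragraph, and the source does not show which one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <h4> first child is an assumption, not taken
# from the real page. The <p> tags mirror the question's printed output.
html = """
<div class="row">
  <div class="col-sm-3">
    <h4>Example Name</h4>
    <p class="hide" id="phoneDiv_80863"><i aria-hidden="true" class="fa fa-phone-square"></i> (805) 636-9890</p>
    <p>
Santa Barbara, CA 93101

</p>
    <p style="margin-top:2em;"><a class="btn btn-default" href="#">View</a></p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# :nth-child(3) counts element children only, so with the assumed <h4>
# in position 1 and the phone <p> in position 2, the address <p> is third.
addresses = [p.get_text(strip=True) for p in soup.select(".col-sm-3 > p:nth-child(3)")]
print(addresses)
```

If the real page has a different number of elements before the address paragraph, the index in `:nth-child()` would need to change accordingly.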

Sample output:

Santa Barbara, CA 93101
Santa Barbara, CA 93111
Santa Barbara, CA 93101
Tustin, CA 92780
Valencia, CA 91355
Pasadena, CA 91105
United States
Walnut Creek, CA 94596
Woodland Hills, CA 91365-0644
Monterey, CA 93940
Granada Hills, CA 91344
United States
Studio City, CA 91604
Santa Rosa, CA 95404
Sonoma
San Dimas, CA 91773
United States
San Francisco, CA 94116
Rancho Mirage, CA 92270
Berkeley, CA 94705-1808
Anderson, CA 96007
Shasta
Mission Viejo, CA 92691
United States
Claremont, CA 91711
Seal Beach, CA 90740
USA
West Covina, CA 91790
Los Angeles
Mission Viejo, CA 92692
Laguna Niguel, CA 92677
Camarillo, CA 93010
West Hills, CA 91308
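The sample output also contains lines such as "United States" and "Sonoma". If those belong to the same `<p>` as the address printed just before them (the output suggests this, but the source does not confirm it), keeping only the first non-empty line of each paragraph's text would drop them. A small sketch, where `first_line` is a hypothetical helper and the sample strings are made-up stand-ins for `address.text`:

```python
def first_line(text):
    """Return the first non-empty line of text, stripped of whitespace."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""

# Assumed raw paragraph texts; the real page may format them differently.
samples = [
    "\nSanta Barbara, CA 93101\n\n",
    "\nSanta Rosa, CA 95404\nSonoma\n",
    "\nPasadena, CA 91105\nUnited States\n",
]

cities = [first_line(s) for s in samples]
print(cities)
```

In the loop above, this would amount to calling `first_line(address.text)` instead of `address.text.strip()`.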

References:

  • :nth-child()
  • soup.select()

This is what you want:

soup = BeautifulSoup(response.text, 'html.parser')
rows = soup.find_all('div', {'class':'row'})
temp=[]
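for row in rows:
    t = row.find_all('div', class_='col-sm-3')
    for i in t:
        # The address is the second <p>; slicing returns [] instead of
        # raising IndexError when a column has fewer than two <p> tags
        u = i.find_all('p')[1:2]
        for each_u in u:
            # The text starts with a newline, so the address is at index 1
            address = each_u.text.split('\n')[1]
            print(address)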
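A variation on the same idea, as a self-contained sketch: it skips columns with fewer than two `<p>` tags (the question's output shows some columns printing `[]`) and uses `strip()` rather than splitting on newlines. The HTML below is a trimmed stand-in based on the question's output, not the real page.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup; the empty first column is an assumption that
# mimics the "[]" lines in the question's output.
html = """
<div class="row">
  <div class="col-sm-3"></div>
  <div class="col-sm-3">
    <p class="hide" id="phoneDiv_80863"> (805) 636-9890</p>
    <p>
Santa Barbara, CA 93101

</p>
    <p style="margin-top:2em;"><a href="#">View</a></p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

addresses = []
for col in soup.find_all("div", class_="col-sm-3"):
    paragraphs = col.find_all("p")
    if len(paragraphs) < 2:
        # No address paragraph in this column
        continue
    # Second <p> holds the address; strip() removes the surrounding newlines
    addresses.append(paragraphs[1].get_text().strip())

print(addresses)
```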
