I'm trying to scrape this site but can't fix this error:
AttributeError: 'unicode' object has no attribute 'find_all'
I'm using unicodedata.normalize to strip \xa0 from the parsed string (it shows up when there are empty p tags).
pages = ["http://sg.startupjobs.asia/sg/job/search?w=jobs&q=data+scientist+OR+data+analyst+OR+business+analyst+OR+business+intelligence&l=Anywhere&t=any&job_page=" + str(i) for i in range(1, 12)]
job_links = []

for p in pages:
    r = requests.get(p)
    data = r.text
    soup = BeautifulSoup(data, "lxml").text
    clean_soup = unicodedata.normalize("NFKD", soup)
    container = clean_soup.find_all('div', attrs={'id': 'yw0'})
    for text in container:
        job_names = text.find_all('span', attrs={'class': 'JobRole'})
        for name in job_names:
            for link in name.find_all('a'):
                job_link = link.get('href')
                job_links.append("http://sg.startupjobs.asia" + str(job_link))
clean_soup is a unicode string, not a BeautifulSoup object, so it has no find_all method. Parse it again:
clean_soup_2 = BeautifulSoup(clean_soup, 'lxml')
clean_soup_2.find_all('div', attrs={'id': 'yw0'})
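As a side note on the \xa0 itself: NFKD normalization maps the non-breaking space (U+00A0) to a regular space, which is why unicodedata.normalize works for this cleanup. A minimal stdlib-only sketch (the sample string is made up for illustration):

```python
import unicodedata

# U+00A0 (non-breaking space, shown as \xa0) sits between the words
raw = "Data\xa0Scientist"

# NFKD decomposes compatibility characters; U+00A0 becomes a plain space
clean = unicodedata.normalize("NFKD", raw)

print(clean)            # Data Scientist
print("\xa0" in clean)  # False
```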
The following should work. Change
r = requests.get(p)
data = r.text
soup = BeautifulSoup(data, "lxml").text
clean_soup = unicodedata.normalize("NFKD", soup)
container = clean_soup.find_all('div', attrs={'id': 'yw0'})
to
r = requests.get(p)
clean_text = unicodedata.normalize('NFKD', r.text)
soup = BeautifulSoup(clean_text, 'lxml')
container = soup.find_all('div', attrs={'id': 'yw0'})
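Putting the corrected order together (normalize first, then parse, then search), here is a self-contained sketch. The inline HTML and its links are made up to mimic the page's yw0/JobRole structure, and html.parser is used so it runs without lxml installed:

```python
import unicodedata
from bs4 import BeautifulSoup

# Made-up HTML mimicking the structure of the job-search page
html = """
<div id="yw0">
  <span class="JobRole"><a href="/sg/job/1">Data\xa0Scientist</a></span>
  <span class="JobRole"><a href="/sg/job/2">Business Analyst</a></span>
</div>
"""

# 1. Normalize while the document is still a plain string
clean_text = unicodedata.normalize("NFKD", html)

# 2. Then parse it, keeping the BeautifulSoup object (no .text here)
soup = BeautifulSoup(clean_text, "html.parser")

# 3. find_all works because soup is a BeautifulSoup object, not a string
job_links = []
for container in soup.find_all("div", attrs={"id": "yw0"}):
    for name in container.find_all("span", attrs={"class": "JobRole"}):
        for link in name.find_all("a"):
            job_links.append("http://sg.startupjobs.asia" + link.get("href"))

print(job_links)
```

This prints the two absolute URLs built from the sample hrefs, and the \xa0 in the first job title has been replaced by a regular space.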