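Please help. I want to get all the company names from every page; there are 12 pages.
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/2
Only the number at the end of the URL changes.
This is my code so far. Can I get just the titles (the company names) from all 12 pages? Thanks in advance.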
from bs4 import BeautifulSoup
import requests

maximum = 0
page = 1
URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1'
response = requests.get(URL)
source = response.text
soup = BeautifulSoup(source, 'html.parser')

whole_source = ""
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/' + str(page_number)
    response = requests.get(URL)
    whole_source = whole_source + response.text

soup = BeautifulSoup(whole_source, 'html.parser')
find_company = soup.select("#content > div.wrap_analysis_data > div.public_con_box.public_list_wrap > ul > li:nth-child(13) > div > strong")
for company in find_company:
    print(company.text)
--------- Output for one page
--------- Page source:
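So you want to strip out all the headers and get only the company name as a string? Basically, you can use soup.findAll to find the list of companies, which appear in the following format: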
<strong class="company"><span>중소기업진흥공단</span></strong>
Then use the .find function to extract the <span> tag from each result:
<span>중소기업진흥공단</span>
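After that, use the .contents attribute (a list of the tag's children) and take its first element to get the string from the <span> tag: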
'중소기업진흥공단'
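The three steps above can be tried on a small piece of static HTML before running them against the site. This is only a sketch: the markup below is a hypothetical sample modeled on the format shown above, and the second company name is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the <strong class="company"> format above;
# the second company name is made up for illustration.
html = """
<ul>
  <li><strong class="company"><span>중소기업진흥공단</span></strong></li>
  <li><strong class="company"><span>Example Corp</span></strong></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

names = []
for entry in soup.findAll('strong', attrs={'class': 'company'}):
    span = entry.find('span')            # the <span> tag inside <strong>
    names.append(str(span.contents[0]))  # first child of <span>, converted to a plain str

print(names)
```

Note that .contents returns a list, which is why the index [0] is needed to get the text itself.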
So you write a loop that does the same thing for each page, and create a list named company_list that stores the results from every page, appended together.
Here is the code:
from bs4 import BeautifulSoup
import requests

maximum = 12
company_list = []  # List for result storing

for page_number in range(1, maximum + 1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(page_number)
    response = requests.get(URL)
    print(page_number)
    whole_source = response.text
    soup = BeautifulSoup(whole_source, 'html.parser')
    # Finding all company names in the page
    for entry in soup.findAll('strong', attrs={'class': 'company'}):
        # Extracting name from the result
        company_list.append(entry.find('span').contents[0])
company_list will then hold all the company names you want.
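Since the question used soup.select, the same extraction could probably also be written as a CSS selector. A minimal sketch, assuming the page markup matches the <strong class="company"><span>...</span></strong> format shown above; extract_company_names is a hypothetical helper name and the sample HTML is made up:

```python
from bs4 import BeautifulSoup

def extract_company_names(html):
    """Return the text of every <span> inside <strong class="company">."""
    soup = BeautifulSoup(html, 'html.parser')
    # 'strong.company span' matches every company on the page, whereas a
    # selector containing li:nth-child(13) matches only the 13th list item.
    return [span.get_text(strip=True) for span in soup.select('strong.company span')]

# Made-up sample page fragment for a quick offline check
sample = '<ul><li><div><strong class="company"><span> 중소기업진흥공단 </span></strong></div></li></ul>'
print(extract_company_names(sample))
```

Inside the page loop this would be used as company_list.extend(extract_company_names(response.text)).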
I finally figured it out. Thank you for your answer!
Image: the code captured in a Jupyter notebook
This is my final code.
from urllib.request import urlopen
from bs4 import BeautifulSoup

company_list = []
for n in range(12):
    url = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(n + 1)
    webpage = urlopen(url)
    source = BeautifulSoup(webpage, 'html.parser', from_encoding='utf-8')
    companys = source.findAll('strong', {'class': 'company'})
    for company in companys:
        # Strip surrounding whitespace and drop embedded newlines, tabs and carriage returns
        company_list.append(company.get_text().strip().replace('\n', '').replace('\t', '').replace('\r', ''))

# Write one company name per line
with open('company_name1.txt', 'w', encoding='utf-8') as file:
    for company in company_list:
        file.write(company + '\n')
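The whitespace cleanup can be checked on a sample string without requesting the site. The sketch below is a variation rather than the code above: clean_name is a hypothetical helper that uses str.split() and join, so runs of newlines, tabs and spaces inside a name become a single space instead of being deleted.

```python
def clean_name(text):
    """Collapse newlines, tabs and repeated spaces into single spaces and trim the ends."""
    # str.split() with no arguments splits on any run of whitespace,
    # including '\n', '\t' and '\r', and discards leading/trailing whitespace.
    return ' '.join(text.split())

# Made-up sample resembling text pulled out of indented HTML
sample = '\n\t  중소기업진흥공단 \r\n'
print(repr(clean_name(sample)))
```

Whether to delete inner whitespace or collapse it to one space depends on how the names appear in the page source.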