After doing some scraping I have all my data stored in a pandas DataFrame, but I'm having trouble with the header row. Since I'm scraping many pages of a job site, I created a loop that iterates over the pages and builds a separate DataFrame per page; when a page is done, I append the DataFrame to a CSV file.
The problem is that the header is written once on every iteration, while I only want it written once in total.
I've already tried all the solutions from previous questions here, but I still can't find a way to fix this. Apologies if this is a silly question, but I'm still learning and loving the journey. Any help, hints, or suggestions would be much appreciated.
Here's my code:
def find_data(soup):
    l = []
    for div in soup.find_all('div', class_='js_result_container'):
        d = {}
        try:
            d["Company"] = div.find('div', class_='company').find('a').find('span').get_text()
            d["Date"] = div.find('div', {'class': ['job-specs-date', 'job-specs-date']}).find('p').find('time').get_text()
            pholder = div.find('div', class_='jobTitle').find('h2').find('a')
            d["URL"] = pholder['href']
            d["Role"] = pholder.get_text().strip()
            l.append(d)
        except:
            pass
    df = pd.DataFrame(l)
    df = df[['Date', 'Company', 'Role', 'URL']]
    df = df.dropna()
    df = df.sort_values(by=['Date'], ascending=False)
    df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)
if __name__ == '__main__':
    f = open("csv_files/pandas_data.csv", "w")
    f.truncate()
    f.close()

    query = input('Enter role to search: ')
    max_pages = int(input('Enter number of pages to search: '))

    for i in range(max_pages):
        page = 'https://www.monster.ie/jobs/search/?q=' + query + '&where=Dublin__2C-Dublin&sort=dt.rv.di&page=' + str(i+1)
        soup = getPageSource(page)
        print("Scraping Page number: " + str(i+1))
        find_data(soup)
Output:
Date,Company,Role,URL
Posted today,Solas IT,QA Engineer,https://job-openings.monster.ie/QA-Engineer-Dublin-Dublin-Ireland-Solas-IT/11/195166152
Posted today,Hays Ireland,Resident Engineer,https://job-openings.monster.ie/Resident-Engineer-Dublin-Dublin-Ireland-Hays-Ireland/11/195162741
Posted today,IT Alliance Group,Presales Consultant,https://job-openings.monster.ie/Presales-Consultant-Dublin-Dublin-IE-IT-Alliance-Group/11/192391675
Posted today,Allen Recruitment Consulting,Automation Test Engineer,https://job-openings.monster.ie/Automation-Test-Engineer-Dublin-West-Dublin-IE-Allen-Recruitment-Consulting/11/191229801
Posted today,Accenture,Privacy Analyst,https://job-openings.monster.ie/Privacy-Analyst-Dublin-Dublin-IE-Accenture/11/195164219
Date,Company,Role,URL
Posted today,Solas IT,Automation Engineer,https://job-openings.monster.ie/Automation-Engineer-Dublin-Dublin-Ireland-Solas-IT/11/195159636
Posted today,PROTENTIAL RESOURCES,Desktop Support Engineer,https://job-openings.monster.ie/Desktop-Support-Engineer-Santry-Dublin-Ireland-PROTENTIAL-RESOURCES/11/195159322
Posted today,IT Alliance Group,Service Desk Team Lead,https://job-openings.monster.ie/Service-Desk-Team-Lead-Dublin-Dublin-IE-IT-Alliance-Group/11/193234050
Posted today,Osborne,IT Internal Audit Specialist – Dublin City Centre,https://job-openings.monster.ie/IT-Internal-Audit-Specialist-–-Dublin-City-Centre-Dublin-City-Centre-Dublin-IE-Osborne/11/192169909
Posted today,Brightwater Recruitment Specialists,Corporate Tax Partner Designate,https://job-openings.monster.ie/Corporate-Tax-Partner-Designate-Dublin-2-Dublin-IE-Brightwater-Recruitment-Specialists/11/183837695
Because you call find_data(soup) max_pages times, you also execute the following lines multiple times, once per page:
df = pd.DataFrame(l)
df = df[['Date', 'Company', 'Role', 'URL']]
df = df.dropna()
df = df.sort_values(by=['Date'], ascending=False)
df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)
Try changing find_data() so that it receives a list, fills it, and returns it. Then, after the loop has called the function for every page, you can build the DataFrame and write it to the file, header included, with a single to_csv() call.
For example:
def find_data(soup, l):
    for div in soup.find_all('div', class_='js_result_container'):
        d = {}
        try:
            d["Company"] = div.find('div', class_='company').find('a').find('span').get_text()
            d["Date"] = div.find('div', {'class': ['job-specs-date', 'job-specs-date']}).find('p').find('time').get_text()
            pholder = div.find('div', class_='jobTitle').find('h2').find('a')
            d["URL"] = pholder['href']
            d["Role"] = pholder.get_text().strip()
            l.append(d)
        except:
            pass
    return l
if __name__ == '__main__':
    f = open("csv_files/pandas_data.csv", "w")
    f.truncate()
    f.close()

    query = input('Enter role to search: ')
    max_pages = int(input('Enter number of pages to search: '))

    l = []
    for i in range(max_pages):
        page = 'https://www.monster.ie/jobs/search/?q=' + query + '&where=Dublin__2C-Dublin&sort=dt.rv.di&page=' + str(i+1)
        soup = getPageSource(page)
        print("Scraping Page number: " + str(i+1))
        l = find_data(soup, l)

    df = pd.DataFrame(l)
    df = df[['Date', 'Company', 'Role', 'URL']]
    df = df.dropna()
    df = df.sort_values(by=['Date'], ascending=False)
    df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)