I'm scraping a series of URLs with the following code:
df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList%5Bprofession_name.fr.Tech%5D%5B%5D=Data%20Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    urls = elem.get_attribute("href")
    print(urls)
This prints the correct results I want to see. The problem comes when I try to add these "urls" to my empty DataFrame "df1" with the following code:
df_test = df1.append({'URLS' : urls}, ignore_index = True)
df_test.head()
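It doesn't show the URLs I want (it doesn't raise an error, but the result doesn't make sense).
I'm just starting out with Python, so I suspect my problem has a simple answer. I hope I've been clear.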
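The problem with your code is that you overwrite the urls variable on every iteration of the loop, and only then append to the DataFrame, so only the last scraped URL is added. Move the df1.append statement inside the for block: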
import time
import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList%5Bprofession_name.fr.Tech%5D%5B%5D=Data%20Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    url = elem.get_attribute("href")  # <--- get the url from the <a> tag
    df1 = df1.append({'URLS': url}, ignore_index=True)  # <--- add the url to the dataframe in the URLS column
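
As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older pandas versions. A minimal sketch of an alternative that works on both old and new versions is to collect the URLs in a plain list inside the loop and build the DataFrame once afterwards. The scraped_hrefs list below is a hypothetical stand-in for the values that elem.get_attribute("href") would return, and the example.com addresses are placeholders, not real scraping results.

```python
import pandas as pd

# Hypothetical stand-ins for the href values returned by
# elem.get_attribute("href"); placeholders, not real results.
scraped_hrefs = [
    "https://example.com/jobs/1",
    "https://example.com/jobs/2",
    "https://example.com/jobs/3",
]

# Collect one row per URL inside the loop, so no value is overwritten.
rows = []
for href in scraped_hrefs:
    rows.append({"URLS": href})

# Build the DataFrame once, after the loop has finished.
df1 = pd.DataFrame(rows, columns=["URLS"])
print(df1)
```

In the real script the loop would iterate over elems and call elem.get_attribute("href") instead of reading from scraped_hrefs. Building the DataFrame once also avoids copying the whole frame on every iteration, which is what repeated appends do.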