BeautifulSoup4 and w3lib: why are my results printed vertically, and how do I format them as CSV?



Here is the code:


while startDate <= endDate:
    try:
        the_year = startDate.strftime('%Y')
        the_month = startDate.strftime('%B')
        the_day = startDate.strftime('%-d')
        url_template = base_url + the_year + "/" + the_month + "/" + the_day + "/"
        url = url_template
        page_two = "?page=2"
        time.sleep(1)
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
        content = bs(response.content, "html.parser")
        uls = content.find("div", {'class': 'sitemap-column-wrapper'}).findAll("ul", {'class': 'sitemap-list'})
        for ul in uls:
            for li in ul.find_all('li', {'class': 'sitemap-list-item'}):
                for a in li.find('a').text:
                    a = w3lib.html.remove_tags(a)
                    print(str(startDate) + ',"' + a)
        time.sleep(1)
        url = url_template + page_two
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
        content = bs(response.content, "html.parser")
        uls = content.find("div", {'class': 'sitemap-column-wrapper'}).findAll("ul", {'class': 'sitemap-list'})
        for ul in uls:
            for li in ul.find_all('li', {'class': 'sitemap-list-item'}):
                for a in li.find('a').text:
                    a = w3lib.html.remove_tags(a)
                    print(str(startDate) + ',"' + a)
        time.sleep(1)
        startDate += delta
    except Exception as e:
        print(e)
        break

Here is the result:

2020-01-01,"C
2020-01-01,"i
2020-01-01,"n
2020-01-01,"c
2020-01-01,"i
2020-01-01,"n
2020-01-01,"n

And so on. What is going on? What I want is to print the date and the title in CSV format: "date,title".

Before I used ".remove_tags", I got a block of HTML containing all the titles.

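Never mind, I had one loop too many. li.find('a').text is already a string, so looping over it with "for a in ..." yields one character per iteration, which is why every letter ended up on its own line. Without the inner loop: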

content = bs(response.content, "html.parser")
uls = content.find("div", {'class': 'sitemap-column-wrapper'}).findAll("ul", {'class': 'sitemap-list'})
for ul in uls:
    for li in ul.find_all('li', {'class': 'sitemap-list-item'}):
        li = li.find('a').text
        title = w3lib.html.remove_tags(li)
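
For the CSV part of the question, one option is to hand the (date, title) pairs to Python's standard csv module instead of concatenating strings, so that titles containing commas or quotes are escaped correctly. The following is only a minimal sketch: the helper name rows_to_csv and the sample titles are made up for illustration and are not taken from the scraped site.

```python
import csv
import io
from datetime import date


def rows_to_csv(rows):
    """Return CSV text for an iterable of (date, title) pairs."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL quotes a field only when it contains a comma,
    # a quote character, or a line break.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for day, title in rows:
        writer.writerow([day.isoformat(), title])
    return buffer.getvalue()


# Hypothetical sample data standing in for the scraped titles.
sample = [
    (date(2020, 1, 1), "Example headline, with a comma"),
    (date(2020, 1, 1), 'A "quoted" headline'),
]
print(rows_to_csv(sample), end="")
```

In the scraping loop, the same idea would probably mean collecting (startDate, title) pairs and writing them once at the end. When writing to a real file rather than a string buffer, the csv documentation recommends opening it with newline="".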
