How do I increment a column in the CSV output of this Beautiful Soup Python script?

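I have a Beautiful Soup Python script that finds href links inside a component on a website and writes them to a CSV file, one per line. I plan to run the script daily via a cron job, and I'd like to add a second column to the CSV labeled "Number of times seen". When the script runs and finds a link that is already in the list, it should just add to the number in that column. For example, if this is the second time a particular link has been seen, its value would be N+1, i.e. 2. If this is the first time the script has seen the link, it should just append the link to the bottom of the list. I'm not sure how to attack this, as I'm quite new to Python.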

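I've developed the Python script to scrape links from a component on every page in an XML sitemap. However, I'm not sure how to increment the "Number of times seen" column in the CSV output as the cron job runs the script each day. I don't want the file to be overwritten; I only want the "Number of times seen" column to increment or, if this is the first time a link has been seen, the link to be added to the bottom of the list.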

Here is the Python script I have so far:

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm
import time

sitemap_url = 'https://www.lowes.com/sitemap/navigation0.xml'

# def get_urls(url):
page = requests.get(sitemap_url)
soup = BeautifulSoup(page.content, 'html.parser')
links = [element.text for element in soup.findAll('loc')]
# return links
print('Found {:,} URLs in the sitemap! Now beginning crawl of each URL...'
      .format(len(links)))
csv_file = open('cms_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['hrefs', 'Number of times seen:'])
for i in tqdm(links):
    # print("beginning of crawler code")
    r = requests.get(i)
    data = r.text
    soup = BeautifulSoup(data, 'lxml')
    all_a = soup.select('.carousel-small.seo-category-widget a')
    for a in all_a:
        hrefs = a['href']
        print(hrefs)
        csv_writer.writerow([hrefs, 1])
csv_file.close()

Current state: Right now, every time the script runs, the "Number of times seen:" column in the CSV output is overwritten.

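Desired state: I want the "Number of times seen:" column to increment whenever the script finds a link that was seen in a previous crawl, or, if this is the first time the link has been seen, I want this column to say "1".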

Thank you very much for your help!!

So, this isn't really a question about bs4, but about how to handle data structures in Python.

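Your script is missing the part that loads the data you already know. One way to do this is to build a dict with every href as a key and its count as the value.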

So given a csv with rows like this...

href,seen_count
https://google.com/1234,4
https://google.com/3241,2

...you first need to build the dictionary:

# read all lines; the context manager closes the file afterwards
with open("cms_scrape.csv", "r", encoding="utf-8") as csv_file:
    csv_list = list(csv_file)
# we skip the first line, since it holds your header and not data
csv_list = csv_list[1:]
# now we convert this to a dict
hrefs_dict = {}
for line in csv_list:
    url, count = line.split(",")
    # remove linebreak from count and convert to int
    count = int(count.strip())
    hrefs_dict[url] = count

This produces a dictionary like this:

{
    "https://google.com/1234": 4,
    "https://google.com/3241": 2
}
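
Splitting lines by hand is fragile: `csv.writer` quotes any field that contains a comma, so such a URL would not survive the round trip, and opening the file fails on the very first run when it does not exist yet. A minimal sketch of a more forgiving loader, assuming the same two-column layout (the function name `load_counts` is made up for illustration):

```python
import csv
import os


def load_counts(path):
    """Return {href: count} from an existing CSV, or {} if there is none yet."""
    counts = {}
    if not os.path.exists(path):
        # First run: nothing has been seen so far.
        return counts
    with open(path, "r", newline="", encoding="utf-8") as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed rows
            href, count = row[0], row[1]
            counts[href] = int(count)
    return counts
```

With this, `hrefs_dict = load_counts("cms_scrape.csv")` would give an empty dict on the first run and the saved counts on every later one.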

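Now you can check whether each href you come across exists as a key in this dictionary. If it does, increase the count by 1. If not, insert the href into the dictionary and set its count to 1.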

To apply this to your code, I suggest you scrape the data first and write to the file only after all scraping is complete. Like this:

for i in tqdm(links):
    # print("beginning of crawler code")
    r = requests.get(i)
    data = r.text
    soup = BeautifulSoup(data, 'lxml')
    all_a = soup.select('.carousel-small.seo-category-widget a')
    for a in all_a:
        href = a['href']
        print(href)
        # if href is a key in hrefs_dict increase the value by one
        if href in hrefs_dict:
            hrefs_dict[href] += 1
        # else insert it into the hrefs_dict and set the count to 1
        else:
            hrefs_dict[href] = 1
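
The if/else above can also be expressed with `collections.Counter` from the standard library. A small sketch, with made-up hrefs standing in for what the crawl finds (`previous` and `found_today` are illustrative names, not part of the original script):

```python
from collections import Counter

# Counts loaded from the previous run (illustrative values).
previous = {"https://google.com/1234": 4, "https://google.com/3241": 2}

# Hrefs found during today's crawl (illustrative values).
found_today = ["https://google.com/1234", "https://google.com/9999"]

hrefs_dict = Counter(previous)
# update() with an iterable adds 1 for every occurrence.
hrefs_dict.update(found_today)

print(dict(hrefs_dict))
```

Like the loop above, this counts every occurrence, so a link that shows up on several pages in one crawl is incremented more than once. Passing `set(found_today)` instead would give at most one increment per run, if that is what "times seen" should mean.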

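Now, once scraping is done, loop over every entry in the dictionary and write it to your file. It is generally recommended to use a context manager when writing to files (so the file is not left open, blocking other access, if you accidentally forget to close it). So the "with" takes care of opening and closing the file: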

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('cms_scrape.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['hrefs', 'Number of times seen:'])
    # loop through the hrefs_dict
    for href, count in hrefs_dict.items():
        csv_writer.writerow([href, count])

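So, if you don't actually have to use a csv file for this, I'd suggest using JSON or Pickle. That way you can read and store the dictionary without converting back and forth to csv.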
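
As a rough sketch of the JSON route, using only the standard library (the helper names and the file name `cms_scrape.json` mentioned below are assumptions for illustration):

```python
import json
import os


def load_counts_json(path):
    """Return the saved {href: count} dict, or {} on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as json_file:
        return json.load(json_file)


def save_counts_json(path, counts):
    """Write the whole dict back out, replacing the previous file."""
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(counts, json_file, indent=2)
```

Each run would then start with `hrefs_dict = load_counts_json("cms_scrape.json")`, update the counts during the crawl, and finish with `save_counts_json("cms_scrape.json", hrefs_dict)`.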

I hope this solves your problem...
