Python web scraping. How to fix sorting by post publication date



This script was written to scrape news from the Thehackernews website. The main goals:

  • Send an e-mail containing only the links (hrefs) that match one of the special strings (collected by the Scraper class, about 75 lines of code).

  • The Redis database runs on localhost only. Don't store too much data in it; keep it lightweight and fast.

  • Send the e-mail with the links once a day from a scheduler.

But when I tested the script and added more special strings to the list, links were duplicated, because nothing is sorted or filtered by date.

In the HTML of the hackernews site, the date is represented like this:

  • <span class='h-datetime'><i class='icon-font icon-calendar'>&#59394;</i>Dec 22, 2022</span>
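For reference, that span can be parsed in isolation; a minimal sketch using the markup shown above (the `&#59394;` entity is the icon glyph U+E802, which `get_text()` includes and therefore has to be stripped before parsing):

```python
from datetime import datetime

from bs4 import BeautifulSoup

# The date markup exactly as it appears on the site.
html = "<span class='h-datetime'><i class='icon-font icon-calendar'>&#59394;</i>Dec 22, 2022</span>"

span = BeautifulSoup(html, "html.parser").select_one("span.h-datetime")
# get_text() includes the icon character (U+E802), so remove it first.
raw = span.get_text(strip=True).replace("\ue802", "")
posted = datetime.strptime(raw, "%b %d, %Y")
print(posted.date())  # 2022-12-22
```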

My code:

from bs4 import BeautifulSoup
import redis
from password import bot_email_pw
import requests

# scraper
class Scraper:
    def __init__(self, keywords):
        self.markup = requests.get('https://thehackernews.com/').text
        self.keywords = keywords

    # parser
    def parse(self):
        soup = BeautifulSoup(self.markup, 'html.parser')
        links = soup.findAll('a')
        # links = soup.findAll("a", {"class": "titlelink"})
        self.saved_links = []
        for link in links:
            for keyword in self.keywords:
                if keyword in link.text:
                    self.saved_links.append(link)

    # store
    def store(self):
        r = redis.Redis(host='localhost', port=6379, db=0, charset="utf-8", decode_responses=True)
        for link in self.saved_links:
            # r.set(link.get('href'), link.h2.text)
            r.set(link.text, link.get('href'))
            print(link)

    # send email
    def email(self):
        r = redis.Redis(host='localhost', port=6379, db=0, charset="utf-8", decode_responses=True)
        links = [str(r.get(k)) for k in r.keys()]
        print(links)
        # email
        import smtplib
        from email.mime.multipart import MIMEMultipart
        from email.mime.text import MIMEText
        fromEmail = ""
        toEmail = ""
        msg = MIMEMultipart('alternative')
        msg['Subject'] = "Newsy z HackerNews"
        msg['From'] = fromEmail
        msg['To'] = toEmail
        html = """
        <h4> %s linków mogących ciebie zainteresować: </h4>
        %s <br/><br/>
        """ % (len(links), "<br/> <br/>".join(links))
        mime = MIMEText(html, 'html')
        msg.attach(mime)
        try:
            mail = smtplib.SMTP('smtp.gmail.com', 587)
            mail.ehlo()
            mail.starttls()
            mail.login(fromEmail, "")
            mail.sendmail(fromEmail, toEmail.split(','), msg.as_string())
            mail.quit()
            print('Email sent!')
        except Exception as exc:
            print('something might went wrong...%s' % exc)
        # Free redis
        r.flushdb()

s = Scraper(['malware', 'exploit', 'cve', 'ransomware', 'campaign', 'agent tesla',
             'Hackers', 'hackers', 'hacker', 'Hacker', 'Ddos',
             'Vulnerability', 'vulnerability', 'Botnet', 'Dec 20'])
s.parse()
s.store()
s.email()

The question is: how do I sort this HTML (scraped with BeautifulSoup) by the date the post was published on the site, and get only the links that were actually published on a given day, e.g. on 22.12?

Filter on the publication date alone: if it matches today's date, then check whether a keyword appears in the post.

You can easily adapt the code to your needs and push the relevant links to the DB.

例如:

from datetime import datetime
import requests
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, keywords):
        self.saved_links = None
        self.markup = requests.get('https://thehackernews.com/').text
        self.keywords = keywords

    def parse(self):
        posts = BeautifulSoup(self.markup, 'html.parser').select(".body-post")
        self.saved_links = []
        for post in posts:
            date = (
                post
                .select_one('span[class="h-datetime"]')
                .getText(strip=True)
                .replace("\ue802", "")  # strip the &#59394; calendar-icon glyph
            )
            if date == datetime.now().strftime("%b %d, %Y"):
                if any(word in post.text for word in self.keywords):
                    self.saved_links.append(post.select_one('a').attrs['href'])
        return self.saved_links

s = Scraper(
    [
        'malware', 'exploit', 'cve', 'ransomware', 'campaign', 'agent tesla',
        'Hackers', 'hackers', 'hacker', 'Hacker', 'Ddos', 'Vulnerability',
        'vulnerability', 'Botnet', 'Dec 20',
    ]
)
print("\n".join(s.parse()))

This prints:

https://thehackernews.com/2022/12/two-new-security-flaws-reported-in.html
https://thehackernews.com/2022/12/zerobot-botnet-emerges-as-growing.html
https://thehackernews.com/2022/12/hackers-breach-oktas-github.html
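If you want true sorting by publication date rather than a today-only filter, parse each visible date with `strptime` and sort on the resulting `datetime`. A minimal sketch; the (date, href) pairs below are made-up sample data standing in for what the scraper would collect:

```python
from datetime import datetime

# Illustrative (date_text, href) pairs, not real scrape results.
posts = [
    ("Dec 20, 2022", "https://thehackernews.com/a.html"),
    ("Dec 22, 2022", "https://thehackernews.com/b.html"),
    ("Dec 21, 2022", "https://thehackernews.com/c.html"),
]

# Parse the visible date text and sort newest first.
posts.sort(key=lambda p: datetime.strptime(p[0], "%b %d, %Y"), reverse=True)

for date_text, href in posts:
    print(date_text, href)
```

With the list ordered, you can also take just the head of it (e.g. `posts[:1]`) to keep only the most recent day's links before pushing them to Redis.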
