Python:将以前抓取的数据保存在数据库(已创建)中



我已经正确地用硒进行了刮擦,结果打印得很完美。我不知道如何将它们保存在数据库中,我已经使用 Sqlite 创建了表和字段。一般来说,我知道如何将数据保存在数据库中,但我不知道如何保存抓取数据。我不知道的是,如何编写代码来保存抓取数据。

现在我只想将抓取的数据添加到数据库中并保存。数据库表名称为 TableExample1。这些字段是:ID,Product_NameDB,Product_DescriptionDB,供应商数据库,价格数据库。

我有 2 个问题

  1. 保存在数据库中的文本如下所示:[
  2. 仅向数据库添加一 (1) 行。相反,应该添加数十个或数百个(每个都像这样:Product_Name,Product_Description,供应商,价格),具体取决于抓取的数据。

抓取打印数据的Python代码如下:

#Name of the scraped data 
Product_Name = (driver.find_element_by_class_name ("tablet-desktop-only"). Text)
Product_Description = (driver.find_element_by_class_name ("h-text-left"). Text)
Vendor = (driver.find_element_by_class_name ("in-vendor"). Text)
Price = (driver.find_element_by_class_name ("h-text-center"). Text)
for seq in Product_Name + Product_Description + Vendor + Price:
print(seq.text)

更新 N.1 当前代码,遵循"杰里米·卡汉"对这个问题的回答<</strong>br/>和 THIS(不同,我要求在控制台中打印所有抓取结果,而不仅仅是 1)

这是当前最稳定和功能最强大的代码。刮擦工作正常,但是Num_Groups,对于您在另一个问题中建议的范围(Num_Groups)中的i,仅打印一组,而不是全部。

我仍然有同样的 2 个问题

  1. 保存在数据库中的文本如下所示:[

  2. 仅向数据库添加一 (1) 行。相反,应添加数十个或数百个(每个如下所示:Product_Name,Product_Description,供应商,价格),具体取决于抓取的数据。

    import sqlite3
    from datetime import datetime 
    #SCRAPING
    Product_Name=driver.find_elements_by_class_name("tablet-desktop-only")
    Product_Description=driver.find_elements_by_class_name("h-text-left")
    Vendor=driver.find_elements_by_class_name("in-match")
    Price=driver.find_elements_by_class_name("h-text-center")
    
    # How do I print the other data always with the same html name?
    # This is one row data. This is the code you wrote me in the  
    # other question. Print only one group. What to write to print 
    # all groups?  
    Num_Groups = min(len(Product_Name),len(Product_Description),len(Vendor), len(Price))
    for i in range(Num_Groups):    
    print(Product_Name[i].text)
    print(Product_Description[i].text)
    print(Vendor[i].text)
    print(Price[i].text)
    
    #INSERT DATA IN DATABASE
    con = sqlite3.connect('/home/mypc/Desktop/aaaaa/Database.db')
    cursor = con.cursor()
    ID=datetime.today().strftime('%Y%m%d%H%M%S')
    Values =   f"VALUES({ID},'{Product_Name}','{Product_Description}','{Vendor}','{Price}')"
    sqlite_insert_query = 'INSERT INTO TableExample1 (ID, Product_Name, Product_Description, Vendor, Price)' + Values
    count = cursor.execute(sqlite_insert_query)
    con.commit()
    print("Record inserted successfully ", cursor.rowcount)
    cursor.close()
    

2° 上传(最终?问题:找到 12 组,记录插入成功 1,共添加 1 条记录 如何在数据库中插入 12 个五线谱?

Num_Groups = min(len(Product_Name),len(Product_Description),len(Vendor), len(Price))
records_added = 0
for i in range(Num_Groups):    
print(Product_Name[i].text)
print(Product_Description[i].text)
print(Vendor[i].text)
print(Price[i].text)

con = sqlite3.connect('/home/mypc/Scrivania/aaaa/Database.db')
cursor = con.cursor()
Values = f" VALUES ('{Product_Name[i].text}','{Product_Description[i].text}','{Vendor[i].text}','{Price[i].text}')"
sqlite_insert_query = 'INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price)' + Values
print("Qui, all'interno del ciclo, eseguiresti" + sqlite_insert_query)
count = cursor.execute(sqlite_insert_query)
con.commit()
print("Record inserted successfully ", cursor.rowcount)
records_added = records_added + 1
cursor.close()
print("")
print(f'Added a total of {records_added} records')
print(f"found {Num_Groups} groups")  # should be more than 1

好的。所以你告诉我你擅长连接到数据库,获取游标,使用游标和连接执行sql,提交更改和关闭游标。你需要的是要执行的 sql 字符串。

Values = f"VALUES('{Product_Name}','{Product_Description}','{Vendor}','{Price}')"
sqlite_insert_query = 'INSERT INTO TableExample1(Product_NameDB, Product_DescriptionDB, VendorDB, PriceDB) ' + Values

然后,假设光标设置cursor,您应该能够:

count = cursor.execute(sqlite_insert_query)

我假设您设置了东西,因此ID是一个唯一的键,如果您不指定它,将生成该键。

编辑:所以听起来数据库正在与ID作斗争,而不是制作多个新ID。假设您的 ID 只需要是顺序的(并且只有您在使用它并且抓取需要一秒钟以上),您可以按如下方式自行处理 ID。早from datetime import datetime之后

ID=datetime.today().strftime('%Y%m%d%H%M%S')
Values = f"VALUES({ID},'{Product_Name}','{Product_Description}','{Vendor}','{Price}')"
sqlite_insert_query = 'INSERT INTO TableExample1(ID, Product_NameDB, Product_DescriptionDB, VendorDB, PriceDB) ' + Values

此外,您提到获取描述元素的文本Product_Name而不仅仅是Product_Name(其他字段也是如此)。你需要:

Product_Name = driver.find_element_by_class_name("tablet-desktop-only").text

(或者把(Product_Name.text)放在我的 f 字符串中,但这似乎令人困惑)

2ND编辑:@Prophet想说的是 https://stackoverflow.com/questions/68110293/problems-saving-scraped-data-to-database-python/68111109?noredirect=1#comment120399103_68111109 如下。 当你说:

`Product_Name = driver.find_element_by_class_name("tablet-desktop-only").text`

Product_Name是一个字符串,你的 for 循环Product_Name + Product_Description + Vendor + Price循环访问单个串联字符串,然后 seq 具有字符串的字符,一次获取一个。然后seq.text失败了,正如你所经历的那样。这就是为什么我在那里注释掉了print命令,稍后再放一个来打印Values字符串。这是一种应该行之有效的方法。

如果你让事情保持原来的样子

`Product_Name = driver.find_element_by_class_name("tablet-desktop-only")

Product_Name是一个元素,当合并到 Values 字符串中时,它会转换为元素的字符串表示形式,这就是您在数据库中看到所有不相关文本的原因。我理解你离开它,以便 for 循环工作,但你应该做这样的事情:

ID=datetime.today().strftime('%Y%m%d%H%M%S')
Values = f"VALUES({ID},'{Product_Name.text}','{Product_Description.text}','{Vendor.text}','{Price.text}')"

Product_Name_Text = Product_Name.Text
Product_Description_Text = Product_Description.Text
Vendor_Text = Vendor.Text
Price_Text = Price.Text

然后

Values = f"VALUES({ID},'{Product_Name_Text}','{Product_Description_Text}','{Vendor_Text}','{Price_Text}')"

ID=datetime.today().strftime('%Y%m%d%H%M%S')
Values = f"VALUES({ID}"
for seq in Product_Name + Product_Description + Vendor + Price:
print(seq.text)
Values = Values + "," + "'" + seq.text + "'"
Values = Values + ")"

我肯定会推荐,至少对于调试、打印Valuessqlite_insert_query。如果您将这些结果与我们分享,如果仍然不起作用,我们也许可以提供帮助。

在上述选项中,您可以省略我分配的 ID,如果这毕竟不是问题。

我没有在您的代码中看到不同元素组的任何循环,因此我不确定您在插入多个新条目方面的期望。

第三名(决赛?编辑:您有2个问题。你在数据库中得到额外的东西是因为你把每个元素的描述放到SQL的Values部分,而不是文本。我已经说明了如何通过将文本形式化到 Values 字符串中来解决此问题。 您遇到的第二个问题是,当您说find_elements时,您的定位器只找到一件事(我无法调试您的定位器,因为我不知道页面)。但是这里的代码应该告诉你有多少东西匹配。为了验证这一假设,我在下面写了我自己的版本,它可以(如果页面是永久性的)从亚马逊获取化妆品数据。细节并不重要,但代码应该说明必须发生的情况。另外,我相信我之前对 ID 所做的是不必要的,数据库会处理这个问题。 由于我实际上不是一个数据库,所以我把它注释掉了。您将取消注释它。

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# launch and go to site, yours will vary
driver = webdriver.Firefox(executable_path='/usr/bin/geckodriver')
driver.maximize_window()
wait = WebDriverWait(driver, 15)
driver.get("https://www.amazon.com/b?node=18505451011&pd_rd_w=gVvMZ&pf_rd_p=b6363b44-58dd-4354-979f-1446a1c45f7a&pf_rd_r=5FYAS41AJR7GPQ5Q74J4&pd_rd_r=9bfd1639-2256-4c17-928c-99cf03e00d63&pd_rd_wg=UPUHV")
wait.until(EC.presence_of_element_located((By.LINK_TEXT, "See all results")))
# SCRAPING
# you will want to change back to your locators
Product_Name = driver.find_elements_by_class_name("apb-line-clamp-3")
# actually number of reviewers, picks up see all at the bottom, too, but it's ok because of min
Product_Description = driver.find_elements_by_class_name("a-color-link")
Vendor = driver.find_elements_by_class_name(
"apb-browse-searchresults-product-byline")
Price = driver.find_elements_by_class_name("a-price")  # prices
# if the locators work well, these match multiple groups of products not just one
Num_Groups = min(len(Product_Name), len(
Product_Description), len(Vendor), len(Price))
print(f"found {Num_Groups} groups")  # should be more than 1
# I have commented out the database code, because I am not actually using one
#con = sqlite3.connect('/home/mypc/Desktop/aaaaa/Database.db')
# connect to database outside the loop, not for each item
# I removed the code about generating ID's, which the database should handle
records_added = 0
for i in range(Num_Groups):  # should cause i to count from 0 up to and including Num_Groups-1
print("")  # skip a line between stuff
print(Product_Name[i].text)
print(Product_Description[i].text)
print(Vendor[i].text)
print(Price[i].text)
# note below I need to format the .text into the values string, not the text description of the element
Values = f" VALUES ('{Product_Name[i].text}','{Product_Description[i].text}','{Vendor[i].text}','{Price[i].text}')"
sqlite_insert_query = 'INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price)' + Values
print("Here, inside the loop,  you would execute " + sqlite_insert_query)
#cursor = con.cursor()
#count = cursor.execute(sqlite_insert_query)
# con.commit()
#print("Record inserted successfully ", cursor.rowcount)
records_added = records_added + 1
# cursor.close()
print("")
print(f'Added a total of {records_added} records')

输出为:

found 12 groups
Crest 3D White Professional Effects Whitestrips 20 Treatments + Crest 3D White 1 Hour Express Whitestrips 2 Treatments - Teeth Whitening Kit
46,020
by Crest
$47
88
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Crest 3D White Professional Effects Whitestrips 20 Treatments + Crest 3D White 1 Hour Express Whitestrips 2 Treatments - Teeth Whitening Kit','46,020','by Crest','$47
88')
REVLON One-Step Hair Dryer And Volumizer Hot Air Brush, Black, Packaging May Vary
274,196
by REVLON
$41
99
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('REVLON One-Step Hair Dryer And Volumizer Hot Air Brush, Black, Packaging May Vary','274,196','by REVLON','$41
99')
Waterpik WP-660 Water Flosser Electric Dental Countertop Professional Oral Irrigator For Teeth, Aquarius, White
72,786
by Waterpik
$59.99
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Waterpik WP-660 Water Flosser Electric Dental Countertop Professional Oral Irrigator For Teeth, Aquarius, White','72,786','by Waterpik','$59.99')
Schick Hydro Silk Touch-Up Multipurpose Exfoliating Dermaplaning Tool, Eyebrow Razor, and Facial Razor with Precision Cover, 3 Count (Packaging May Vary)
113,269
by Schick Hydro Silk
$68
27
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Schick Hydro Silk Touch-Up Multipurpose Exfoliating Dermaplaning Tool, Eyebrow Razor, and Facial Razor with Precision Cover, 3 Count (Packaging May Vary)','113,269','by Schick Hydro Silk','$68
27')
Neutrogena Makeup Remover Cleansing Face Wipes, Daily Cleansing Facial Towelettes to Remove Waterproof Makeup and Mascara, Alcohol-Free, Value Twin Pack, 25 Count, 2 Pack
62,881
by Neutrogena
$4
99
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Neutrogena Makeup Remover Cleansing Face Wipes, Daily Cleansing Facial Towelettes to Remove Waterproof Makeup and Mascara, Alcohol-Free, Value Twin Pack, 25 Count, 2 Pack','62,881','by Neutrogena','$4
99')
Gillette Fusion Power Men's Razor Blades - 8 Refills
32,435
by Gillette
$6.99
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Gillette Fusion Power Men's Razor Blades - 8 Refills','32,435','by Gillette','$6.99')
Softsoap Moisturizing Liquid Hand Soap, Soothing Clean Aloe Vera - 7.5 Fluid Ounces (6 Pack)
47,238
by Softsoap
$8
12
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Softsoap Moisturizing Liquid Hand Soap, Soothing Clean Aloe Vera - 7.5 Fluid Ounces (6 Pack)','47,238','by Softsoap','$8
12')
Neutrogena Lightweight Body Oil for Dry Skin, Sheer Body Moisturizer in Light Sesame Formula, 16 fl. oz
13,011
by Neutrogena
$11.96
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Neutrogena Lightweight Body Oil for Dry Skin, Sheer Body Moisturizer in Light Sesame Formula, 16 fl. oz','13,011','by Neutrogena','$11.96')
Crest 3D White Whitestrips with Light, Teeth Whitening Strips Kit, 10 Treatments, 20 Individual Strips (Packaging May Vary)
9,172
by Crest
$23
91
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Crest 3D White Whitestrips with Light, Teeth Whitening Strips Kit, 10 Treatments, 20 Individual Strips (Packaging May Vary)','9,172','by Crest','$23
91')
Neutrogena Hydro Boost Hyaluronic Acid Hydrating Water Gel Daily Face Moisturizer for Dry Skin, Oil-Free, Non-Comedogenic Face Lotion, 1.7 fl. oz
51,417
by Neutrogena
$32.51
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Neutrogena Hydro Boost Hyaluronic Acid Hydrating Water Gel Daily Face Moisturizer for Dry Skin, Oil-Free, Non-Comedogenic Face Lotion, 1.7 fl. oz','51,417','by Neutrogena','$32.51')
CHI 44 Iron Guard Thermal Protection Spray 8 Fl Oz
17,660
by CHI
$5
33
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('CHI 44 Iron Guard Thermal Protection Spray 8 Fl Oz','17,660','by CHI','$5
33')
Crest 3D White Toothpaste Radiant Mint (3 Count of 4.1 oz Tubes), 12.3 oz (Packaging May Vary)
40,354
by Crest
$13.99
Here, inside the loop,  you would execute INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price) VALUES ('Crest 3D White Toothpaste Radiant Mint (3 Count of 4.1 oz Tubes), 12.3 oz (Packaging May Vary)','40,354','by Crest','$13.99')
added a total of 12 records

修订版这是您的代码需要的外观。

Num_Groups = min(len(Product_Name),len(Product_Description),len(Vendor), len(Price))
con = sqlite3.connect('/home/mypc/Scrivania/aaaa/Database.db')
cursor = con.cursor()
records_added = 0
for i in range(Num_Groups):    
print(Product_Name[i].text)
print(Product_Description[i].text)
print(Vendor[i].text)
print(Price[i].text)
Values = f" VALUES ('{Product_Name[i].text}','{Product_Description[i].text}','{Vendor[i].text}','{Price[i].text}')"
sqlite_insert_query = 'INSERT INTO TableExample1 (Product_Name, Product_Description, Vendor, Price)' + Values
print("Qui, all'interno del ciclo, eseguiresti" + sqlite_insert_query)
count = cursor.execute(sqlite_insert_query)
con.commit()
print("Record inserted successfully ", cursor.rowcount)
records_added = records_added + 1
cursor.close()

for:下缩进的代码主体针对 i 的每个值执行(在您提到的情况下,从 0 到 11)。现在,由于数据库插入在此循环之外,因此仅执行一次。

真的是最后一次编辑:我们将使用参数化查询让数据库引擎处理值中的撇号(以及我可能错过的任何其他内容)。从我所读到的内容来看,无论如何,这是一种更安全的方法,可以帮助防止SQL注入攻击。

Values=行(保持相同的缩进)更改为

Values = (Product_Name[i].text, Product_Description[i].text, Vendor[i].text, Price[i].text)

sqlite_insert_query =行更改为 say(保持与现在相同的缩进):

sqlite_insert_query = 'INSERT INTO TableExample1 (Product_NameDB, Product_DescriptionDB, VendorDB, PriceDB) VALUES (?, ?, ?, ?);'

count=cursor.execute行更改为

count = cursor.execute(sqlite_insert_query, Values)

这应该更好,更安全。如果它有效,您可以进行一些可选的清理。例如,您可以在循环外部设置sqlite_insert_query,回到连接到数据库并初始化变量的位置。您也可以停止打印sqlite_insert_query并打印值(或者只是不打印任何值,因为您之前有 4 行打印值)。

最新更新