Python - why are the print results duplicated and "write to a text" only one line



Lovely people! I am completely new to Python. I am trying to scrape several URLs, but I ran into a problem with the "print".

I am trying to print and write down the "shipment status". I have two URLs, so ideally I would get two results.

Here is my code:

from bs4 import BeautifulSoup
import re
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    for p in shipment:
        # extract information
        print(url, ';', "Preparation", Preparation.getText(), ";", "Sent", Sent.getText(), ";", "InTransit", InTransit.getText(), ";", "Delivered", Delivered.getText())

import sys

file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
print(url, ';', "Preparation", Preparation.getText(), ";", "Sent", Sent.getText(), ";", "InTransit", InTransit.getText(), ";", "Delivered", Delivered.getText())

I have two problems here:

  1. Problem one: I only have two URLs, but when I print the results, each "span" is repeated 4 times (because there are four "spans"). The results in the output look like this:

(I removed the example output for privacy.)

  2. Problem two: I tried to write the "print" to a text file, but only one line appears in the file:

(I removed the example output for privacy.)

I would like to know what went wrong in the code. I just want to print the results for the 2 URLs.

Thank you very much for your help! Thanks in advance!

The first point is caused by iterating over shipment: just remove the for loop and correct the indentation of the print():

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    print(url, ';', "Preparation", Preparation.getText(), ";", "Sent", Sent.getText(), ";", "InTransit", InTransit.getText(), ";", "Delivered", Delivered.getText())

The second issue is caused by calling the write outside the loop and not in append mode. You want to end up with this as your loop:

# open file in append mode
with open('somefile.txt', 'a') as f:
    # start iterating your urls
    for url in line_in_list:
        soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
        # parse something special in the file
        shipment = soup.find_all('span')
        Preparation = shipment[0]
        Sent = shipment[1]
        InTransit = shipment[2]
        Delivered = shipment[3]
        # create output text
        line = f'{url};Preparation{Preparation.getText()};Sent{Sent.getText()};InTransit{InTransit.getText()};Delivered{Delivered.getText()}'
        # print output text
        print(line)
        # append output text to file
        f.write(line + '\n')

You can delete:

import sys
file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
print(url, ';', "Preparation", Preparation.getText(), ";", "Sent", Sent.getText(), ";", "InTransit", InTransit.getText(), ";", "Delivered", Delivered.getText())

A slightly optimized code example:

from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

with open('somefile.txt', 'a', encoding='utf-8') as f:
    for url in line_in_list:
        soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
        # parse something special in the file
        shipment = list(soup.select_one('#progress').stripped_strings)
        line = f"{url},{';'.join([':'.join(x) for x in list(zip(shipment[::2], shipment[1::2]))])}"
        print(line)
        f.write(line + '\n')
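To make the slicing/zip trick in the line above concrete, here is a minimal sketch with made-up status strings (the real ones come from the page's `#progress` element):

```python
# Hypothetical status strings standing in for the scraped #progress text:
# even indices are labels, odd indices are the values that follow them.
shipment = ["Preparation", "Done", "Sent", "Done",
            "InTransit", "Done", "Delivered", "Pending"]

# Pair every label (shipment[::2]) with its value (shipment[1::2]).
pairs = list(zip(shipment[::2], shipment[1::2]))
line = ';'.join(':'.join(x) for x in pairs)
print(line)  # Preparation:Done;Sent:Done;InTransit:Done;Delivered:Pending
```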
There are four spans actually, try this:

import sys
from bs4 import BeautifulSoup
from urllib.request import urlopen

list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")
file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")

for url in line_in_list:
    soup = BeautifulSoup(urlopen(url).read(), 'html')
    # parse something special in the file
    shipments = soup.find_all("span")  # there are four spans actually
    sys.stdout.write('Url ' + url + '; Preparation' + shipments[0].getText() + '; Sent' + shipments[1].getText() + '; InTransit' + shipments[2].getText() + '; Delivered' + shipments[3].getText())
    # change line
    sys.stdout.write("\n")

First issue

You have two nested loops:

for url in line_in_list:
    for p in shipment:
        print(...)

The print is nested in the second loop. With 4 shipments per URL, you get 4 prints per URL.
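A minimal sketch of that duplication, with stand-in values instead of real URLs and spans:

```python
# Two stand-in URLs and four stand-in spans, mirroring the question's data.
urls = ["url1", "url2"]
shipment = ["a", "b", "c", "d"]

lines = []
for url in urls:
    for p in shipment:    # p is never used in the body,
        lines.append(url) # so the same line is produced 4 times per URL

print(len(lines))  # 8 lines instead of the expected 2
```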

Since you never use the p from for p in shipment, you can get rid of the second loop entirely and move the print one indentation level to the left, like this:

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    print(url, ';', "Preparation", Preparation.getText(), ";", "Sent", Sent.getText(), ";", "InTransit", InTransit.getText(), ";", "Delivered", Delivered.getText())

Second issue

sys.stdout = open(file_path, "w")
print(url, ';', "Preparation", Preparation.getText(), ";", "Sent", Sent.getText(), ";", "InTransit", InTransit.getText(), ";", "Delivered", Delivered.getText())

Without the file keyword argument, print writes to sys.stdout, which is your terminal output by default. There is only one print after sys.stdout = ..., so only one line is written to the file.

There is another way to print to a file:

with open('demo.txt', 'a') as f:
    print('Hello world', file=f)

The with keyword ensures that the file is closed even if an exception is raised.
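For reference, the with-block behaves roughly like this try/finally (a sketch; 'demo.txt' is just the example filename):

```python
# Equivalent of `with open('demo.txt', 'a') as f: ...` -- the file is
# closed in the finally-block even if the body raises.
f = open('demo.txt', 'a')
try:
    print('Hello world', file=f)
finally:
    f.close()
```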

Combining the two

As I understand it, you want two lines printed to the file. Here is a solution:

from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")
file_path = "randomfile.txt"

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html")
    # parse something special in the file
    shipment = soup.find_all("span")
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    with open(file_path, "a") as f:
        f.write(
            f"{url} ; Preparation {Preparation.getText()}; Sent {Sent.getText()}; InTransit {InTransit.getText()}; Delivered {Delivered.getText()}\n"
        )
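To see why mode "a" rather than "w" matters here, a small sketch (file names are made up) showing that "w" truncates on every open, which is exactly why the original code kept only one line:

```python
# Write two lines, reopening the file each time, in both modes.
for i in range(2):
    with open('w_mode.txt', 'w') as f:  # "w" truncates on every open
        f.write(f"line {i}\n")
    with open('a_mode.txt', 'a') as f:  # "a" appends
        f.write(f"line {i}\n")

print(open('w_mode.txt').read())  # only "line 1" survives
print(open('a_mode.txt').read())  # both "line 0" and "line 1"
```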
