How can I use Python to automatically select data on a web page and download the resulting xls file?



I am new to Python. I am trying to scrape the data on this page:

For example:

Category: grains  
Organic: No  
Commodity: Coarse  
SubCommodity: Corn  
Publications: Daily  
Location: All  
Refine Commodity: All
Dates: 07/31/2018 - 08/01/2019

Is there a way to have Python select these options on the web page, click Run, and then click Download to Excel and save the Excel file?

Is that possible? I am new to coding and could use some guidance here.

What I do currently is enter the options manually and then scrape the table from the results page with Beautiful Soup. However, since the table is spread across more than 200 pages, this takes a lot of time.

Using the query you defined as an example, I entered it manually and found the following URL for the Excel (actually HTML) format:

url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=excel'

The URL contains parameters that we can set in Python, and we can easily build a loop to change them. Now let me get to an example of actually fetching this data. I use pandas.read_html to read this HTML result and populate a DataFrame, which you can think of as a table with columns and rows.

import pandas as pd
#  use URL defined earlier
#  url = '...'
df_lst = pd.read_html(url, header=1)

Now df_lst is a list of DataFrame objects holding the desired data. For your specific example, this yields 30,674 rows and 11 columns:

>>> df_lst[0].columns
Index([u'Report Date', u'Location', u'Class', u'Variety', u'Grade Description',
u'Units', u'Transmode', u'Low', u'High', u'Pricing Point',
u'Delivery Period'],
dtype='object')
>>> df_lst[0].head()
Report Date         Location   Class  Variety Grade Description   Units   Transmode   Low  High         Pricing Point Delivery Period
0  07/31/2018  Blytheville, AR  YELLOW      NaN           US NO 2  Bushel       Truck  3.84  3.84     Country Elevators            Cash
1  07/31/2018       Helena, AR  YELLOW      NaN           US NO 2  Bushel       Truck  3.76  3.76     Country Elevators            Cash
2  07/31/2018  Little Rock, AR  YELLOW      NaN           US NO 2  Bushel       Truck  3.74  3.74  Mills and Processors            Cash
3  07/31/2018   Pine Bluff, AR  YELLOW      NaN           US NO 2  Bushel       Truck  3.67  3.67     Country Elevators            Cash
4  07/31/2018       Denver, CO  YELLOW      NaN           US NO 2  Bushel  Truck-Rail  3.72  3.72    Terminal Elevators            Cash
>>> df_lst[0].shape
(30674, 11)
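
Since your goal is an Excel file, note that pandas can also write one directly once the DataFrame is in memory (this needs an Excel writer engine such as openpyxl installed); a minimal sketch, with report.xlsx as an example output name:

# assumes df_lst from pd.read_html above
df_lst[0].to_excel('report.xlsx', index=False)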

Now, back to my point about the URL parameters: with Python we can iterate over a list and format the URL string however we like. For example, stepping through 20 years of the given query can be done by putting numbered placeholders in the URL that correspond to the positional arguments of Python's str.format() method. Here is a complete example:

import datetime
import pandas as pd
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    #print(df_lst[0].head())  # uncomment to see first five rows
    #print(df_lst[0].shape)  # uncomment to see DataFrame shape

Be careful with pd.read_html. I have modified my answer to pass a header keyword argument to pd.read_html(), because the MultiIndex made it a pain to get at the results. With a single row used as the header it is no longer a MultiIndex, and indexing the data is easy. For example, I can get the corn Class with:

df_lst[0]['Class']
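
From there, ordinary pandas indexing applies; for example, a quick sketch (column names as in the output above) that keeps only the US NO 2 rows and averages their Low price:

df0 = df_lst[0]
no2 = df0[df0['Grade Description'] == 'US NO 2']
print(no2['Low'].mean())  # average low price across the US NO 2 rows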

It is also easy to compile all the reports into one big file with pandas. Since we have DataFrames, we can use the pandas.to_csv function to export the data as CSV (or any other file type you like, but I chose CSV for this example). Here is a modified version with the added feature of writing a CSV:

import datetime
import pandas as pd
# URL
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
# CSV output file and flag
csv_out = 'myreports.csv'
flag = True
# Start and end dates
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
# Iterate through dates and get report from URL
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    print(df_lst[0].head())   # first five rows
    print(df_lst[0].shape)    # DataFrame shape
    # Save to big CSV
    if flag is True:
        # 0th iteration, so write header and overwrite existing file
        df_lst[0].to_csv(csv_out, header=True, mode='w')  # change mode to 'wb' if Python 2.7
        flag = False
    else:
        # Subsequent iterations should append to file and not add new header
        df_lst[0].to_csv(csv_out, header=False, mode='a')  # change mode to 'ab' if Python 2.7
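
Once the loop finishes, the combined CSV can be read back for analysis. For example, a short sketch (assuming the myreports.csv written above, with pandas already imported as pd):

df_all = pd.read_csv('myreports.csv')
print(df_all.groupby('Location')['Low'].mean())  # average low price per location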

Your specific query generated at least 1,227 pages of data, so I trimmed it down to a single location, Arkansas (from 07/31/2018 to 08/01/2019), which now produces 47 pages of data; the .xml file is about 500 KB.

You can semi-automate it like this:

>>> end_day='01'
>>> start_day='31'
>>> start_month='07'
>>> end_month='08'
>>> start_year='2018'
>>> end_year='2019'
>>> link = f"https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={end_month}%2F{end_day}%2F{end_year}&commDetail=All&endDateWeekly={end_month}%2F{end_day}%2F{end_year}&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={start_month}%2F{start_day}%2F{start_year}+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear={end_year}&repDateWeekly={start_month}%2F{start_day}%2F{start_year}&_wrange=1&endDateWeeklyGrain=&repYear={end_year}&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate={start_month}%2F{start_day}%2F{start_year}&endDate={end_month}%2F{end_day}%2F{end_year}&format=xml"
>>> link
'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&endDateWeekly=08%2F01%2F2019&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear=2019&repDateWeekly=07%2F31%2F2018&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=xml'
>>> import urllib.request
>>> with urllib.request.urlopen(link) as response:
...     html = response.read()
... 

Loading the HTML can take a minute or so for large queries.
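
If you want to keep a local copy of the response for the ElementTree steps below, here is a minimal sketch (the file name report.xml is just an example):

>>> with open('report.xml', 'wb') as f:
...     f.write(html)      # returns the number of bytes written
... 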

If for some reason you want to process the entire dataset, you can repeat this process, but you may want to look into techniques optimized for larger data, perhaps a solution involving Python's pandas and numexpr (for acceleration/parallelization).
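
As a small illustration, here is a sketch that processes a large CSV in chunks with pandas, so the whole file never has to fit in memory at once (assuming, for example, a file like the myreports.csv built earlier):

import pandas as pd

total = 0.0
count = 0
# read the combined CSV in manageable pieces
for chunk in pd.read_csv('myreports.csv', chunksize=100000):
    prices = chunk['Low'].dropna()
    total += prices.sum()
    count += len(prices)

print(total / count)  # overall average of the 'Low' column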

You can find the data used in this answer here; you can download it as xml.

First, import the xml parser:

>>> import xml.etree.ElementTree as ET

You can parse the response you downloaded from the website directly in Python (ET.fromstring takes the bytes and returns the root element):

>>> report = ET.fromstring(html)

or from a file you downloaded manually:

>>> tree = ET.parse('report.xml')
>>> report = tree.getroot()

Then you can do things like this:

>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> for el in report[0]:
...     print(el)
... 
<Element 'reportDate' at 0x7f902adcf368>
<Element 'location' at 0x7f902ac814f8>
<Element 'classStr' at 0x7f902ac81548>
<Element 'variety' at 0x7f902ac81b88>
<Element 'grade' at 0x7f902ac29cc8>
<Element 'units' at 0x7f902ac29d18>
<Element 'transMode' at 0x7f902ac29d68>
<Element 'bidLevel' at 0x7f902ac29db8>
<Element 'deliveryPoint' at 0x7f902ac29ea8>
<Element 'deliveryPeriod' at 0x7f902ac29ef8>

See here for more details on parsing xml. You will want to learn a bit of Python, but hopefully you can follow the snippet below. Fortunately, there are many free Python tutorials online; here is a quick snippet to get you started.

# let's find the lowest bid on a certain day
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> report[0][7]
<Element 'bidLevel' at 0x7f902ac29db8>
>>> report[0][7][0]
<Element 'lowPrice' at 0x7f902ac29e08>
>>> report[0][7][0].text
'3.84'
#how many low bids are there?
>>> len(report)
1216
#get an average of the lowest bids...
>>> low_bid_list = [float(bid[7][0].text) for bid in report]
>>> low_bid_list
[3.84, 3.76, 3.74, 3.67, 3.65, 3.7, 3.5, 3.7, 3.61,...]
>>> total = 0
>>> for el in low_bid_list:
...     total = total + el
... 
>>> total
4602.599999999992
>>> total/len(report)
3.7850328947368355
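
Instead of numeric indexes you can also look elements up by tag name, which is less fragile if the report layout changes. A short sketch using the tags shown above (this assumes each row has exactly one lowPrice element, as in the sample output):

>>> low_bids = [float(el.text) for el in report.iter('lowPrice')]
>>> sum(low_bids) / len(low_bids)   # same average as computed above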
