I am using the following Python code to pull data through the SODA API:
response = requests.get('https://healthdata.gov/resource/uqq2-txqb.json')
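The dataset contains 434865 rows, but the API only returns the first 1000. I saw in another question that $limit can be used to get the first 50000 rows, but how do I combine it with $offset to get all 434865 rows?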
Update: I figured out how to use $offset and now have the code below. Is there any way to condense it?
response1 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000')
response2 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=50001')
response3 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=100002')
response4 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=150003')
response5 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=200004')
response6 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=250005')
response7 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=300006')
response8 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=350007')
response9 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=400008')
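This is called paging. It is documented here, for example: https://dev.socrata.com/docs/paging.html
That page also notes that there are two versions of the API:
- v2.0, where $limit can be at most 50000
- v2.1, where $limit is not capped
The endpoint you are using appears to support v2.1, at least according to https://dev.socrata.com/foundry/healthdata.gov/uqq2-txqb, so you should be able to pass a large value for $limit and retrieve the whole set in one request.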
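A minimal sketch of the single-request approach, assuming the endpoint really does accept an uncapped $limit as suggested above. The function name `fetch_all_rows`, the `row_limit` default of 500000, the 300-second timeout, and the `get` hook (there only so the function can be exercised without network access) are illustrative choices, not part of the SODA API.

```python
URL = 'https://healthdata.gov/resource/uqq2-txqb.json'


def fetch_all_rows(row_limit=500000, get=None):
    """Fetch up to row_limit rows in a single request.

    row_limit only needs to exceed the dataset size (434865 rows in the question).
    get is a hypothetical hook: pass a stand-in for requests.get when testing.
    """
    if get is None:
        # Imported lazily so the function can be tested without requests installed
        import requests
        get = requests.get
    # Passing params as a dict lets requests build the query string
    response = get(URL, params={'$limit': row_limit}, timeout=300)
    # Fail loudly on HTTP errors instead of parsing an error body as data
    response.raise_for_status()
    return response.json()
```

Calling `rows = fetch_all_rows()` would then return the decoded JSON list. If the response turns out to be truncated at 50000 rows, the endpoint is probably still on the capped version and paging is needed after all.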
If you do go the paging route, note that $offset is 0-based, so your queries should be rewritten as:
response1 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000')
response2 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=50000')
response3 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=100000')
response4 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=150000')
Note that each $offset is aligned on a multiple of $limit.
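Since every offset is just a multiple of the limit, the list of page URLs can be generated instead of typed out. This is a small sketch that assumes the row count is known up front (434865, as stated in the question); `page_offsets`, `BASE_URL`, `LIMIT`, and `TOTAL_ROWS` are names made up for the illustration.

```python
BASE_URL = 'https://healthdata.gov/resource/uqq2-txqb.json'
LIMIT = 50000
TOTAL_ROWS = 434865  # row count quoted in the question


def page_offsets(total_rows, limit):
    """Return the 0-based $offset of every page needed to cover total_rows."""
    return list(range(0, total_rows, limit))


offsets = page_offsets(TOTAL_ROWS, LIMIT)
# One URL per page, each offset aligned on a multiple of LIMIT
urls = [f'{BASE_URL}?$limit={LIMIT}&$offset={offset}' for offset in offsets]
```

Each URL in `urls` can then be passed to `requests.get` in a loop and the decoded pages concatenated. Socrata's paging documentation also recommends adding an $order clause so that rows keep a stable order across pages.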
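import requests

# Endpoint from the question
url = 'https://healthdata.gov/resource/uqq2-txqb.json'
# Assumption: no extra headers are needed; add any you use (e.g. an app token) here
headers = {}

# Initialize variables for pagination
limit = 50000
offset = 0
data = []
while True:
    # Set query parameters
    params = {
        '$limit': limit,
        '$offset': offset
    }
    # Make a GET request to the API endpoint with the query parameters
    response = requests.get(url, headers=headers, params=params)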