How do I get the number of rows and columns of a dataset using the Foundry API?



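I'm looking to retrieve the number of records and columns in a dataset using an API in Foundry. One API I found that seems to show the number of records is "monole/api/table/stats"; however, I can't figure out how to pass in the dataset's rid.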

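Ultimately, I'm trying to get the total number of columns, the total number of records, and the total size of all the datasets I manage, so that I can build a dashboard with Quiver or Slate showing how much data we manage within the Foundry platform.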

You can use the following example code to compute the statistics of a dataset:

import time
import requests
from urllib.parse import quote_plus
import json


def calculate_dataset_stats(token: str,
                            dataset_rid: str,
                            branch='master',
                            api_base='https://foundry-stack.com'
                            ) -> dict:
    """
    Calculates statistics for last transaction of a dataset in a branch
    Args:
        token: bearer token used to authenticate
        dataset_rid: the dataset rid
        branch: branch of the dataset
        api_base: base URL of the Foundry stack
    Returns: a dictionary with statistics
    """
    # Kick off the stats calculation for the latest view of the branch.
    start_stats_calculation = requests.post(f"{api_base}/foundry-stats/api/stats/datasets/"
                                            f"{dataset_rid}/branches/{quote_plus(branch)}",
                                            headers={
                                                'content-type': "application/json",
                                                'authorization': f"Bearer {token}",
                                            })
    start_stats_calculation.raise_for_status()
    metadata = start_stats_calculation.json()
    transaction_rid = metadata['view']['endTransactionRid']
    schema_id = metadata['view']['schemaId']

    calculated_finished = False
    maybe_stats = {
        'status': 'FAILED'
    }

    # Poll until the calculation reports a terminal status.
    while not calculated_finished:
        response = requests.get(f"{api_base}/foundry-stats/api/stats/datasets/"
                                f"{dataset_rid}/branches/{quote_plus(branch)}",
                                headers={
                                    'content-type': "application/json",
                                    'authorization': f"Bearer {token}",
                                },
                                params={
                                    'endTransactionRid': transaction_rid,
                                    'schemaId': schema_id
                                })
        response.raise_for_status()
        maybe_stats = response.json()
        if (maybe_stats['status'] == 'SUCCEEDED') or (maybe_stats['status'] == 'FAILED'):
            calculated_finished = True
        time.sleep(0.5)

    if maybe_stats['status'] != 'SUCCEEDED':
        raise ValueError(f'Stats Calculation failed for dataset {dataset_rid}. '
                         f'Failure handling not implemented.')
    return maybe_stats['result']['succeededDatasetResult']['stats']


token = "eyJwb..."
dataset_rid = "ri.foundry.main.dataset.14703427-09ab-4c9c-b036-1234b34d150b"
stats = calculate_dataset_stats(token, dataset_rid)
print(json.dumps(stats, indent=4))
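
The polling loop above keeps requesting until the status is SUCCEEDED or FAILED, so it never returns if the calculation gets stuck. One possible way to bound the wait is sketched below. The helper name poll_until_finished, its parameters, and the 'NOT_DONE_YET' status are made up for illustration and are not part of any Foundry API. The status request is passed in as a function so the helper can be tried without a Foundry stack.

```python
import time
from typing import Callable


def poll_until_finished(fetch_status: Callable[[], dict],
                        timeout_s: float = 60.0,
                        interval_s: float = 0.5) -> dict:
    """Call fetch_status() until it reports a terminal status or time runs out.

    fetch_status is a hypothetical callable that performs the GET request
    and returns the decoded JSON body as a dict with a 'status' key.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        stats = fetch_status()
        # SUCCEEDED and FAILED are the two terminal statuses used above.
        if stats.get('status') in ('SUCCEEDED', 'FAILED'):
            return stats
        if time.monotonic() >= deadline:
            raise TimeoutError(f'No terminal status after {timeout_s} seconds')
        time.sleep(interval_s)


# Stand-in for the real request: two non-terminal answers, then success.
# 'NOT_DONE_YET' is a placeholder, not a status taken from the API.
_responses = iter([{'status': 'NOT_DONE_YET'},
                   {'status': 'NOT_DONE_YET'},
                   {'status': 'SUCCEEDED'}])
result = poll_until_finished(lambda: next(_responses), timeout_s=5, interval_s=0.01)
print(result['status'])
```

In calculate_dataset_stats, the body of the while loop (the requests.get call, raise_for_status and response.json()) would become the function passed as fetch_status.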

The other answer uses the computeStats and getDatasetStats Foundry APIs. There is yet another API, getComputedDatasetStats, which can get you the statistics you need and may even perform better.

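Based on my tests:

  • getDatasetStats is not available unless computeStats has been run, and the latter takes time. getComputedDatasetStats, on the other hand, is available immediately.
  • getComputedDatasetStats returns sizeInBytes, but only if computeStats has not been run. After I called the computeStats API and the job finished, sizeInBytes became null. getDatasetStats also shows it as null.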

To get the row count, the column count and the dataset size, you can try something like this:

import requests
import json


def getComputedDatasetStats(token, dataset_rid, api_base='https://.....'):
    response = requests.post(
        url=f'{api_base}/foundry-stats/api/computed-stats-v2/get',
        headers={
            'content-type': 'application/json',
            'Authorization': 'Bearer ' + token
        },
        data=json.dumps({
            "datasetRid": dataset_rid,
            "branch": "master"
        })
    )
    # Fail early on HTTP errors instead of on a missing key later.
    response.raise_for_status()
    return response.json()


token = 'eyJwb.....'
dataset_rid = 'ri.foundry.main.dataset.1d9ef04e-7ec6-456e-8326-1c64b1105431'
result = getComputedDatasetStats(token, dataset_rid)

# full resulting json:
# print(json.dumps(result, indent=4))

# required statistics:
print('size:', result['computedDatasetStats']['sizeInBytes'])
print('rows:', result['computedDatasetStats']['rowCount'])
print('cols:', len(result['computedDatasetStats']['columnStats']))

Example output:

size: 24
rows: 2
cols: 2
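
Since the question asks for totals across all managed datasets, here is a minimal sketch of how the per-dataset responses could be added up. It assumes each response has the computedDatasetStats shape used above (sizeInBytes, rowCount, columnStats). The function name summarize_computed_stats, the example RIDs, and the column names are hypothetical. Because sizeInBytes can come back as null, datasets without a size are listed separately rather than counted as zero bytes.

```python
from typing import Dict


def summarize_computed_stats(results: Dict[str, dict]) -> dict:
    """Sum rows, columns and bytes over {dataset_rid: response} pairs.

    Each response is assumed to look like the JSON returned by
    getComputedDatasetStats above; this shape is taken from that
    snippet, not from official documentation.
    """
    totals = {'datasets': 0, 'rows': 0, 'cols': 0,
              'size_in_bytes': 0, 'size_unknown': []}
    for rid, result in results.items():
        stats = result['computedDatasetStats']
        totals['datasets'] += 1
        totals['rows'] += stats.get('rowCount') or 0
        totals['cols'] += len(stats.get('columnStats') or {})
        size = stats.get('sizeInBytes')
        if size is None:
            # sizeInBytes may be null, so record the rid instead of adding 0.
            totals['size_unknown'].append(rid)
        else:
            totals['size_in_bytes'] += size
    return totals


# Hypothetical responses standing in for real API calls.
sample = {
    'ri.foundry.main.dataset.example-a': {'computedDatasetStats': {
        'sizeInBytes': 24, 'rowCount': 2,
        'columnStats': {'col_1': {}, 'col_2': {}}}},
    'ri.foundry.main.dataset.example-b': {'computedDatasetStats': {
        'sizeInBytes': None, 'rowCount': 5,
        'columnStats': {'col_1': {}}}},
}
print(summarize_computed_stats(sample))
```

With real data, the dictionary would be built by calling getComputedDatasetStats once per dataset rid; the resulting totals can then be written to a dataset that Quiver or Slate reads.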
