Bulk Upsert JavaScript stored procedure always exceeds the 5 second execution cap and results in a timeout



I am currently running a script with the Python SDK that programmatically bulk-adds 1.5 million documents to a collection in an Azure Cosmos DB database. I have been using the bulk import stored procedure from the samples provided in the GitHub repo: https://github.com/Azure/azure-cosmosdb-js-server/tree/master/samples/stored-procedures, the only change being that I replaced collection.createDocument with collection.upsertDocument. I include my stored procedure in full below.
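For context, the stored procedure only needs to be registered once from the Python SDK (azure-cosmos v4) before the bulk run. A minimal sketch, where the endpoint, key, resource names, and the file name bulkUpsert.js are placeholders rather than my real configuration:

from azure.cosmos import CosmosClient

# Placeholder connection details and resource names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Read the JavaScript source and register it under the id "bulkUpsert".
with open("bulkUpsert.js") as f:
    sproc_definition = {"id": "bulkUpsert", "body": f.read()}
container.scripts.create_stored_procedure(body=sproc_definition)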

The stored procedure does run successfully: it upserts the documents consistently and relatively quickly. However, it only gets to roughly 30% progress before this error is thrown:

CosmosHttpResponseError: (RequestTimeout) Message: {"Errors":["The requested operation exceeded maximum alloted time. Learn more: https://aka.ms/cosmosdb-tsg-service-request-timeout"]}
ActivityId: 9f2357c6-918c-4b67-ba20-569034bfde6f, Request URI: /apps/4a997bdb-7123-485a-9808-f952db2b7e52/services/a7c137c6-96b8-4b53-a20c-b9577981b353/partitions/305a8287-11d1-43f8-be1f-983bd4c4a63d/replicas/132488328092882514p/, RequestStats:
RequestStartTime: 2020-11-03T23:43:59.9158203Z, RequestEndTime: 2020-11-03T23:44:05.3858559Z, Number of regions attempted:1
ResponseTime: 2020-11-03T23:44:05.3858559Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-centralus1-fd22.documents.azure.com:14354/apps/4a997bdb-7123-485a-9808-f952db2b7e52/services/a7c137c6-96b8-4b53-a20c-b9577981b353/partitions/305a8287-11d1-43f8-be1f-983bd4c4a63d/replicas/132488328092882514p/, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 408, SubStatusCode: 0, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: null, ResourceType: StoredProcedure, OperationType: ExecuteJavaScript, SDK: Microsoft.Azure.Documents.Common/2.11.0

Is there any way to add some retry logic, or to extend the timeout for the bulk upsert? I believe the section of the stored procedure below with if (!isAccepted) getContext().getResponse().setBody(count); is supposed to help in this situation, but it does not seem to work in my case.

Bulk upsert stored procedure in JavaScript:

function bulkUpsert(docs) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();

    // The count of imported docs, also used as current doc index.
    var count = 0;

    // Validate input.
    if (!docs) throw new Error("The array is undefined or null.");

    var docsLength = docs.length;
    if (docsLength == 0) {
        getContext().getResponse().setBody(0);
        return;
    }

    // Call the CRUD API to create a document.
    tryCreate(docs[count], callback);

    // Note that there are 2 exit conditions:
    // 1) The upsertDocument request was not accepted.
    //    In this case the callback will not be called, we just call setBody and we are done.
    // 2) The callback was called docs.length times.
    //    In this case all documents were created and we don't need to call tryCreate anymore. Just call setBody and we are done.
    function tryCreate(doc, callback) {
        var isAccepted = collection.upsertDocument(collectionLink, doc, callback);

        // If the request was accepted, callback will be called.
        // Otherwise report current count back to the client,
        // which will call the script again with remaining set of docs.
        // This condition will happen when this stored procedure has been running too long
        // and is about to get cancelled by the server. This will allow the calling client
        // to resume this batch from the point we got to before isAccepted was set to false
        if (!isAccepted) {
            getContext().getResponse().setBody(count);
        }
    }

    // This is called when collection.upsertDocument is done and the document has been persisted.
    function callback(err, doc, options) {
        if (err) throw err;

        // One more document has been inserted, increment the count.
        count++;

        if (count >= docsLength) {
            // If we have created all documents, we are done. Just set the response.
            getContext().getResponse().setBody(count);
        } else {
            // Create next document.
            tryCreate(docs[count], callback);
        }
    }
}
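The Python side calls it roughly like this (a sketch rather than my exact script; the sproc id "bulkUpsert" and the partition key value are placeholders). Since a stored procedure executes against a single logical partition, every document in one call has to share the same partition key value:

# Sketch of a single call from the Python SDK.
chunk = docs[:500]
count = container.scripts.execute_stored_procedure(
    sproc="bulkUpsert",
    partition_key="<pk-value>",   # placeholder partition key value shared by the chunk
    params=[chunk],               # arrives in the sproc as its docs parameter
)
# count is whatever the sproc passed to setBody: the number of documents upserted.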

I think the problem is likely in the stored procedure rather than the Python script, but I can provide my Python script if that turns out not to be the case. Any help on this would be greatly appreciated; it has been giving me a headache for days!

Additional information:

Throughput = 10,000, and the per-partition upsert batch size is consistently 1.9 MB.

In case anyone else runs into this problem, the workaround I used was to temporarily increase the throughput to 100,000 instead of 10,000 while the bulk upsert operation was in progress. The errors do not occur when the bulk upsert stored procedure is combined with a sufficiently high throughput. I suspect the timeouts started happening frequently once the bulk upsert had covered roughly 30% of the 1.5 million records because the throughput was not being distributed adequately across partitions, creating a bottleneck. Once my container is in practical use I may have to assign it a higher throughput again, or I may be able to reduce it to save cost. Either way, the code is simple; it only takes the following call:

new_throughput = 10000; container.replace_throughput(new_throughput)
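Expanding that one-liner into a minimal sketch, assuming the container client already exists as container, that 100,000 RU/s is within the account's limits, and that run_bulk_upsert is a hypothetical driver loop (a sketch of one follows the next answer):

# Temporarily raise provisioned throughput for the bulk run, then put it back
# afterwards so the container is not billed for unneeded capacity.
bulk_throughput = 100000    # RU/s while the bulk upsert runs
normal_throughput = 10000   # RU/s for the regular workload

container.replace_throughput(bulk_throughput)
try:
    run_bulk_upsert(container, docs)  # hypothetical driver loop, sketched below
finally:
    container.replace_throughput(normal_throughput)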

Stored procedures have a bounded execution time of 5 seconds. However, you can write your stored procedure to handle bounded execution by checking the boolean return value, and then using the count of items inserted on each invocation of the stored procedure to track and resume progress across your batches. There is an example here.
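A minimal sketch of that resume pattern with the Python SDK, assuming the sproc was registered as "bulkUpsert" and that docs all share the placeholder partition key value pk_value (the chunk size is arbitrary):

def run_bulk_upsert(container, docs, pk_value="<pk-value>", chunk_size=500):
    # A stored procedure runs inside one logical partition, so docs is assumed
    # to hold documents that all share the partition key value pk_value.
    total = 0
    while total < len(docs):
        remaining = docs[total:total + chunk_size]
        # The sproc sets its response body to the number of documents it
        # upserted before hitting the time bound, so resume from that offset.
        upserted = container.scripts.execute_stored_procedure(
            sproc="bulkUpsert",
            partition_key=pk_value,
            params=[remaining],
        )
        if not upserted:
            raise RuntimeError("Stored procedure made no progress; aborting")
        total += upserted
    return total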
