Fastest way in an Azure Function to combine multiple CSV files from one blob storage container into a single CSV file in another blob storage container



I would like to know whether it is possible to improve the code below so that it runs faster (and perhaps cheaper) as part of an Azure Function that uses Python to combine multiple CSV files from a source blob storage container into one CSV file in a target Azure blob storage container (note that I would also be open to using another library instead of pandas, if need be):

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient
import pandas as pd
from io import StringIO

# Get the storage connection string from Azure Key Vault for authentication purposes
credential = DefaultAzureCredential()
vault_url = 'AzureKeyVaultURL'
secret_client = SecretClient(vault_url=vault_url, credential=credential)
azure_datalake_connection_str = secret_client.get_secret('Datalake_connection_string')

# Connect to the source Azure blob storage container where the CSV files are stored
blob_block_source = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="sourceContainerName"
)

# Connect to the target Azure blob storage container into which the source CSV files should be combined
blob_block_target = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="targetContainerName"
)

# List the blob names in the source container, keeping only those that end with the .csv file extension
blobNames = [blob.name for blob in blob_block_source.list_blobs()]
only_csv_blob_names = [name for name in blobNames if name.endswith(".csv")]

# Create a list of dataframes - one dataframe per CSV file found in the source container
listOfCsvDataframes = []
for csv_blobname in only_csv_blob_names:
    csv_text = blob_block_source.download_blob(csv_blobname).content_as_text(encoding='utf-8')
    df = pd.read_csv(StringIO(csv_text), header=0, low_memory=False)
    listOfCsvDataframes.append(df)

# Concatenate the different dataframes into one dataframe
df_concat = pd.concat(listOfCsvDataframes, axis=0, ignore_index=True)

# Serialize the concatenated dataframe to a CSV string
outputCSV = df_concat.to_csv(index=False, sep=',', header=True)

# Upload the combined data as one CSV file (i.e. the CSV files have been combined into a single CSV file)
blob_block_target.upload_blob('combinedCSV.csv', outputCSV, blob_type="BlockBlob", overwrite=True)

Instead of an Azure Function, you could use Azure Data Factory to concatenate the files.

ADF will likely be more efficient than an Azure Function that uses pandas.

Have a look at this blog post: https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory

If you want to use an Azure Function, try concatenating the files without pandas. If all files have the same columns in the same order, you can concatenate the strings directly, dropping the header row (if there is one) of every file except the first.
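A minimal sketch of that header-stripping concatenation, shown here on in-memory strings (with Azure blobs, each `csv_text` would come from `download_blob(name).content_as_text()` as in the question's code; the function name `merge_csv_texts` is just for illustration):

```python
def merge_csv_texts(csv_texts):
    """Concatenate CSV file contents, keeping only the first file's header row."""
    parts = []
    for i, text in enumerate(csv_texts):
        lines = text.splitlines()
        if i > 0:
            lines = lines[1:]  # drop the duplicate header row
        parts.extend(lines)
    return "\n".join(parts) + "\n"

merged = merge_csv_texts([
    "id,name\n1,alice\n",
    "id,name\n2,bob\n",
])
print(merged)
# id,name
# 1,alice
# 2,bob
```

This avoids parsing every value into a dataframe, so for large files it should use far less memory and CPU than the pandas approach.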
