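I have a class that runs some extract, transform, and load (ETL) steps on datasets stored in different JSON files.
This process works fine. However, I have to run it manually every month: I submit a Spark application from IntelliJ (a Scala singleton object that contains the transformations).
So I am trying to automate this process. However, I have not found documentation or a tutorial that explains which service is best suited to this.
The process should:
- Create an HDInsight Spark cluster
- Run the process (a Scala class)
- Delete the HDInsight Spark cluster created earlier
I have searched, but the only links I found (searching for "create on-demand HDInsight Spark cluster") are these:
- Access Data Lake from Azure Data Factory V2 using an on-demand HDInsight cluster
- How to create an Azure on-demand HDInsight Spark cluster using Data Factory
Other options I have looked into:
- Hosting and running PowerShell scripts in Azure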
- Azure Logic Apps
- Azure Automation
Thanks!
Here is the process you want:
- Create an HDInsight Spark cluster
Creating an HDInsight cluster with PowerShell should be straightforward. Here is sample code:
### Create a Spark 2.3 cluster in Azure HDInsight
# Default cluster size (# of worker nodes), version, and type
$clusterSizeInNodes = "1"
$clusterVersion = "3.6"
$clusterType = "Spark"
# Create the resource group
$resourceGroupName = Read-Host -Prompt "Enter the resource group name"
$location = Read-Host -Prompt "Enter the Azure region to create resources in, such as 'Central US'"
$defaultStorageAccountName = Read-Host -Prompt "Enter the default storage account name"
New-AzResourceGroup -Name $resourceGroupName -Location $location
# Create an Azure storage account and container
# Note: Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.
New-AzStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName `
    -Location $location `
    -SkuName Standard_LRS `
    -Kind StorageV2 `
    -EnableHttpsTrafficOnly $true
$defaultStorageAccountKey = (Get-AzStorageAccountKey `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName)[0].Value
$defaultStorageContext = New-AzStorageContext `
    -StorageAccountName $defaultStorageAccountName `
    -StorageAccountKey $defaultStorageAccountKey
# Create a Spark 2.3 cluster
$clusterName = Read-Host -Prompt "Enter the name of the HDInsight cluster"
# Cluster login is used to secure HTTPS services hosted on the cluster
$httpCredential = Get-Credential -Message "Enter Cluster login credentials" -UserName "admin"
# SSH user is used to remotely connect to the cluster using SSH clients
$sshCredentials = Get-Credential -Message "Enter SSH user credentials" -UserName "sshuser"
# Set the storage container name to the cluster name
$defaultBlobContainerName = $clusterName
# Create a blob container. This holds the default data store for the cluster.
New-AzStorageContainer `
    -Name $clusterName `
    -Context $defaultStorageContext
$sparkConfig = New-Object "System.Collections.Generic.Dictionary``2[System.String,System.String]"
$sparkConfig.Add("spark", "2.3")
# Create the HDInsight cluster
New-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName `
    -Location $location `
    -ClusterSizeInNodes $clusterSizeInNodes `
    -ClusterType $clusterType `
    -OSType "Linux" `
    -Version $clusterVersion `
    -ComponentVersion $sparkConfig `
    -HttpCredential $httpCredential `
    -DefaultStorageAccountName "$defaultStorageAccountName.blob.core.windows.net" `
    -DefaultStorageAccountKey $defaultStorageAccountKey `
    -DefaultStorageContainer $clusterName `
    -SshCredential $sshCredentials
# Show the properties of the cluster that was just created
Get-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName
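The sample above prompts for input with Read-Host and Get-Credential, which would block an unattended monthly run. One possible workaround is to build the credential objects from values supplied by the scheduler. This is only a sketch: the helper name `New-ClusterCredential` and the environment variable `HDI_HTTP_PASSWORD` are hypothetical, and a secret store is likely a better source for the password than a plain environment variable.

```powershell
# Hypothetical helper: builds a PSCredential without prompting the user.
function New-ClusterCredential {
    param(
        [Parameter(Mandatory)] [string] $UserName,
        [Parameter(Mandatory)] [string] $PlainPassword
    )
    # -AsPlainText requires -Force to acknowledge the plain-text input
    $secure = ConvertTo-SecureString -String $PlainPassword -AsPlainText -Force
    return New-Object System.Management.Automation.PSCredential($UserName, $secure)
}

# Assumed usage, replacing the Get-Credential call for the cluster login:
# $httpCredential = New-ClusterCredential -UserName "admin" -PlainPassword $env:HDI_HTTP_PASSWORD
```

The resulting object can be passed to -HttpCredential and -SshCredential in the same way as the output of Get-Credential.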
- Run the process (a Scala class)
You can follow this link to submit the application job remotely to the Spark cluster:
https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-create-standalone-application#run-the-application-on-the-apache-spark-cluster
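If the job has to be submitted from the same script rather than from IntelliJ, one option is the Apache Livy REST endpoint that HDInsight Spark clusters expose at `/livy/batches`. The following is a rough sketch under assumptions, not the method described in the linked article: the function names are made up for this example, the jar path and class name in the usage comment are placeholders, and the jar is assumed to be already uploaded to the cluster's default storage.

```powershell
# Builds the JSON body for a Livy batch: the application jar and its main class.
function New-LivyBatchBody {
    param(
        [Parameter(Mandatory)] [string] $JarPath,
        [Parameter(Mandatory)] [string] $ClassName
    )
    return ([ordered]@{ file = $JarPath; className = $ClassName } | ConvertTo-Json -Compress)
}

# Returns $true for Livy states after which the batch no longer changes.
function Test-LivyBatchFinished {
    param([Parameter(Mandatory)] [string] $State)
    return $State -in @('success', 'dead', 'killed', 'error')
}

# Submits the batch and polls until it reaches a final state. Returns the last batch object.
function Invoke-SparkBatch {
    param(
        [Parameter(Mandatory)] [string] $ClusterName,
        [Parameter(Mandatory)] [pscredential] $Credential,   # cluster (HTTP) login
        [Parameter(Mandatory)] [string] $JarPath,
        [Parameter(Mandatory)] [string] $ClassName,
        [int] $PollSeconds = 30
    )
    $livyUri = "https://$ClusterName.azurehdinsight.net/livy/batches"
    # Livy expects the X-Requested-By header on POST requests
    $headers = @{ 'X-Requested-By' = $Credential.UserName }
    $batch = Invoke-RestMethod -Uri $livyUri -Method Post -Credential $Credential `
        -ContentType 'application/json' -Headers $headers `
        -Body (New-LivyBatchBody -JarPath $JarPath -ClassName $ClassName)
    while (-not (Test-LivyBatchFinished -State $batch.state)) {
        Start-Sleep -Seconds $PollSeconds
        $batch = Invoke-RestMethod -Uri "$livyUri/$($batch.id)" -Credential $Credential
    }
    return $batch
}

# Assumed usage (placeholder jar path and class name):
# $result = Invoke-SparkBatch -ClusterName $clusterName -Credential $httpCredential `
#     -JarPath 'wasbs:///example/jars/etl.jar' -ClassName 'com.example.EtlJob'
# if ($result.state -ne 'success') { throw "Spark job ended in state $($result.state)" }
```

Polling until a final state matters here because the cluster must not be deleted while the job is still running.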
- Delete the HDInsight Spark cluster created earlier
You can also use PowerShell to clean up the cluster. Here is sample code for that:
# Removes the specified HDInsight cluster from the current subscription.
Remove-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName
# Removes the specified storage container.
# -Force skips the confirmation prompt, which would otherwise block an unattended run.
Remove-AzStorageContainer `
    -Name $clusterName `
    -Context $defaultStorageContext `
    -Force
# Removes a Storage account from Azure.
Remove-AzStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName `
    -Force
# Removes a resource group (and everything still inside it).
Remove-AzResourceGroup `
    -Name $resourceGroupName `
    -Force
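For a scheduled run, it may be worth making sure the cluster is deleted even when the Spark job fails, since a cluster left running keeps incurring cost. A minimal sketch of that idea using try/finally follows. The helper `Invoke-WithCleanup` is hypothetical (it is not part of the Az module), and the usage comment assumes the variables from the scripts above.

```powershell
# Hypothetical helper: runs $Work, then always runs $Cleanup, even if $Work throws.
function Invoke-WithCleanup {
    param(
        [Parameter(Mandatory)] [scriptblock] $Work,
        [Parameter(Mandatory)] [scriptblock] $Cleanup
    )
    try {
        & $Work
    }
    finally {
        & $Cleanup
    }
}

# Assumed usage: put the create and submit steps in -Work, the delete step in -Cleanup.
# Invoke-WithCleanup `
#     -Work    { <# create the cluster, then submit the job and wait for it #> } `
#     -Cleanup { Remove-AzHDInsightCluster -ResourceGroupName $resourceGroupName -ClusterName $clusterName }
```

An error raised in the work block still propagates to the caller after the cleanup has run, so the scheduler can report the failed run.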
Additional references:
https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql-use-powershell
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-powershell.md
Hope this helps.