Spark YARN remote submit



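I am currently working on a Spark Streaming project, and I am still new to Spark, Kafka, YARN, and Cloudera. To try out the program (or see its results), I currently have to build the project's jar, upload it to the cluster, and then run spark-submit, which I find inefficient.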

Can I run this program programmatically from the IDE (remotely)? I use Scala IDE. I have been looking for sample code but have not found anything suitable yet.

My environment: - Cloudera 5.8.2 [OS Red Hat 7.2, Kerberos 5, Spark 2.1, Scala 2.11] - Windows 7

Follow the steps below to unit test your application.

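  1. Download winutils and set the HADOOP_HOME environment variable.
  2. Provide the exact Kafka broker URL and the topic name for Spark Streaming.
  3. Make sure the proper offset management properties are set.
  4. Use IntelliJ IDEA (Scala IDE also works). Just run the program as a Scala application.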

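    import java.util.UUID
    import kafka.serializer.StringDecoder
    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.spark.streaming.kafka.KafkaUtils

    // ssc is an existing StreamingContext
    val kafkaParams = Map(
      "metadata.broker.list" -> "168.172.72.128:9092",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "smallest",
      "group.id" -> UUID.randomUUID().toString)

    val topicSet = Set("test") // name of the topic
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSet)

    // Create BSON data structure and load data to MongoDB collection
    kafkaStream.foreachRDD { rdd =>
      // code for business logic
    }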
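Step 1 can also be handled from code rather than through the Windows environment settings. The sketch below assumes that the download in step 1 is winutils.exe placed under a `bin` folder. It relies on Hadoop's client libraries reading the `hadoop.home.dir` JVM system property as an alternative to HADOOP_HOME. The object name `LocalHadoopHome` and the example path are hypothetical.

```scala
import java.nio.file.{Files, Paths}

object LocalHadoopHome {
  // Points Hadoop's client libraries at a local directory via the
  // hadoop.home.dir system property (an alternative to HADOOP_HOME).
  // Returns true if bin/winutils.exe exists under that directory,
  // so the caller can warn early when the download is missing.
  def configure(hadoopHome: String): Boolean = {
    System.setProperty("hadoop.home.dir", hadoopHome)
    Files.exists(Paths.get(hadoopHome, "bin", "winutils.exe"))
  }
}
```

A call such as `LocalHadoopHome.configure("C:\\hadoop")` would go at the very start of `main`, before the streaming context is created.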

I followed this tutorial: http://blog.andlypls.com/blog/2017/10/15/ususe-spark-sql-sql-sql-and-spark-stark-streaming-togeth/pect->

Here is my code:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import scala.collection.mutable.ListBuffer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count
object SparkKafkaExample {
  def main(args: Array[String]): Unit =
  {
  val brokers = "broker1.com:9092,broker2.com:9092," +
    "broker3.com:9092,broker4.com:9092,broker5.com:9092"
  // Create Spark Session
  val spark = SparkSession
    .builder()
    .appName("KafkaSparkDemo")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._
  // Create Streaming Context and Kafka Direct Stream with provided settings and 10 seconds batches
  val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
  val kafkaParams = Map(
    "bootstrap.servers" -> brokers,
    "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "group.id" -> "test",
    "security.protocol" -> "SASL_PLAINTEXT",
    "sasl.kerberos.service.name" -> "kafka",
    "auto.offset.reset" -> "earliest")
  val topics = Array("sparkstreaming")
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams))
  // Define a schema for JSON data
  val schema = new StructType()
    .add("action", StringType)
    .add("timestamp", TimestampType)
  // Process batches:
  // Parse JSON and create Data Frame
  // Execute computation on that Data Frame and print result
  stream.foreachRDD { (rdd, time) =>
    val data = rdd.map(record => record.value)
    val json = spark.read.schema(schema).json(data)
    val result = json.groupBy($"action").agg(count("*").alias("count"))
    result.show
  }
  ssc.start
  ssc.awaitTermination
}
}

Because my cluster uses Kerberos, I pass this configuration file (kafka-jaas.conf) to my IDE (Eclipse -> VM arguments):

-Djava.security.auth.login.config=kafka-jaas.conf

Contents of kafka-jaas.conf:

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="user.keytab"
    serviceName="kafka"
    principal="user@HOST.COM";
};
Client {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   keyTab="user.keytab"
   storeKey=true
   useTicketCache=false
   serviceName="zookeeper"
   principal="user@HOST.COM";
};
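
The same JAAS setting can also be applied from code instead of the IDE's VM arguments, provided it runs before the first Kafka client is created. This is a minimal sketch, not part of the original setup: `KerberosClientSetup` is a hypothetical helper. The optional `krb5Conf` argument assumes a Kerberos configuration file that must be pointed to explicitly through the standard `java.security.krb5.conf` property.

```scala
import java.nio.file.{Files, Paths}

object KerberosClientSetup {
  // Sets the same JVM property as -Djava.security.auth.login.config=<file>.
  // Fails fast if the JAAS file cannot be read, since a relative path such as
  // "kafka-jaas.conf" is resolved against the IDE's working directory.
  def configure(jaasConf: String, krb5Conf: Option[String] = None): Unit = {
    val jaasPath = Paths.get(jaasConf).toAbsolutePath
    require(Files.isReadable(jaasPath), s"JAAS file not readable: $jaasPath")
    System.setProperty("java.security.auth.login.config", jaasPath.toString)
    // Optional: explicit location of the Kerberos configuration file.
    krb5Conf.foreach(path => System.setProperty("java.security.krb5.conf", path))
  }
}
```

Calling `KerberosClientSetup.configure("kafka-jaas.conf")` as the first statement of `main` would have the same effect as the VM argument shown above.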
