I usually load data from Cassandra into Apache Spark using Java like this:
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
"WHERE CAST(store_id as string) = '" + storeId + "'");
But imagine a harder case where I need to load several partition keys into this DataFrame. I could put a WHERE IN (...) in my query and call cassandraSql again, but I'm reluctant to use WHERE IN because of the notorious single-point-of-failure problem with the coordinator node, which is explained here:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
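For reference, the WHERE IN form I'd rather avoid would look roughly like this (storeId1 and storeId2 are just placeholders for the extra partition keys):
DataFrame customers = sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) IN ('" + storeId1 + "', '" + storeId2 + "')");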
Is there a way to use several queries but load the results into a single DataFrame?
One way is to run the queries individually and then unionAll / union the resulting DataFrames/RDDs.
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId1 + "'");
DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId2 + "'");
DataFrame allCustomers = customersOne.unionAll(customersTwo);
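If the number of store IDs isn't fixed, the same idea can be written as a loop. A minimal sketch, assuming the IDs arrive in a hypothetical List<String> named storeIds:
DataFrame allCustomers = null;
for (String storeId : storeIds) {
    // One single-partition query per store ID, then fold the results together
    DataFrame df = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
        "WHERE CAST(store_id as string) = '" + storeId + "'");
    allCustomers = (allCustomers == null) ? df : allCustomers.unionAll(df);
}
Each query still targets a single partition, so the coordinator fan-out problem from the linked article doesn't apply; unionAll simply concatenates the partitions of the individual DataFrames without a shuffle.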