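I am running Vora 1.3 on HDP 2.4.3 with Spark 1.6.2.
I have two tables containing data with the same schema: one resides in a HANA database, and the other is stored as CSV files in HDFS.
I created both tables in Vora using Zeppelin: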
CREATE TABLE flights_2006 (Year int, Month_ int, DayofMonth int, DayOfWeek int, DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int, UniqueCarrier string, FlightNum int,
TailNum string, ActualElapsedTime int, CRSElapsedTime int, AirTime int, ArrDelay int, DepDelay int, Origin string, Dest string, Distance int, TaxiIn int, TaxiOut int,
Cancelled int, CancellationCode int, Diverted int, CarrierDelay int, WeatherDelay int, NASDelay int, SecurityDelay int, LateAircraftDelay int)
USING com.sap.spark.vora
OPTIONS (
files "/exch/flights_filtered/part-00000,/exch/flights_filtered/part-00001,/exch/flights_filtered/part-00002,/exch/flights_filtered/part-00003,/exch/flights_filtered/part-00004",
csvdelimiter ","
)
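Question 1. By the way, when creating a Vora table from a file source, when will it be possible to provide only the directory name instead of listing every file in the directory? Listing them is very impractical, because there is no way to predict how many part files a directory will contain.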
CREATE TABLE flights_2007
USING com.sap.spark.hana
OPTIONS (
tablepath "XXXXXXXXXXXX",
dbschema "XXXXXXXXXX",
host "XXXXXXXXXXX",
instance "00",
user "XXXXXXXXXXX",
passwd "XXXXXXXXXX"
)
I am able to produce results from a join of these two tables (leaving aside the business meaning of such a join):
select f7.MONTH, f7.DAYOFMONTH, f7.UNIQUECARRIER, f7.FLIGHTNUM, f7.YEAR, f7.DEPTIME, f6.year, f6.DepTime
from flights_2007 as f7 inner join flights_2006 as f6
on f7.MONTH = f6.Month_ and f7.DAYOFMONTH = f6.DayofMonth and f7.UNIQUECARRIER = f6.UniqueCarrier and f7.FLIGHTNUM = f6.FlightNum
where f7.MONTH = 1 and f7.DAYOFMONTH = 2 and f7.UNIQUECARRIER = 'WN'
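Then I tried to perform the same steps in the Vora Modeler.
Question 2. Why are tables registered in Zeppelin not available in the Vora Modeler?
So I executed the same two table creation statements in the Vora Modeler, using all uppercase letters in the table names, because I remember Vora having some issues with this before. I then created a Vora view as a join of the two tables with the following condition: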
FLIGHTS_2007.MONTH = FLIGHTS_2006.MONTH_ and
FLIGHTS_2007.DAYOFMONTH = FLIGHTS_2007.DAYOFMONTH and
FLIGHTS_2007.UNIQUECARRIER = FLIGHTS_2006.UNIQUECARRIER and
FLIGHTS_2007.FLIGHTNUM = FLIGHTS_2006.FLIGHTNUM
.. and used the where-condition:
FLIGHTS_2007.MONTH = 1 and
FLIGHTS_2007.DAYOFMONTH = 2 and
FLIGHTS_2007.UNIQUECARRIER = 'WN'
The expected result of the view preview would be the same as that of the Zeppelin-based select. The actual result (first few lines):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2165.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2165.0 (TID 78743, eba165.extendtec.com.au): com.sap.spark.vora.client.jdbc.VoraJdbcException: [Vora [eba165.extendtec.com.au:34530.1615085]] Unknown error when executing SELECT "FLIGHTS_2006"."FLIGHTNUM", "FLIGHTS_2006"."DEPTIME", "FLIGHTS_2006"."UNIQUECARRIER", "FLIGHTS_2006"."MONTH_", "FLIGHTS_2006"."YEAR" FROM "FLIGHTS_2006": HL(9): Runtime error. (schema error: could not resolve column "FLIGHTS_2006"."YEAR" (sql parse error)) at com.sap.spark.vora.client.jdbc.VoraJdbcClient.liftedTree1$1(VoraJdbcClient.scala:210) at com.sap.spark.vora.client.jdbc.VoraJdbcClient.generateAutocloseableIteratorFromQuery(VoraJdbcClient.scala:187) at com.sap.spark.vora.client.VoraClient$$anonfun$generateAutocloseableIteratorFromQuery$1.apply(VoraClient.scala:363) at com.sap.spark.vora.client.VoraClient$$anonfun$generateAutocloseableIteratorFromQuery$1.apply(VoraClient.scala:363) at scala.util.Try$.apply(Try.scala:161) at com.sap.spark.vora.client.VoraClient.handleExceptions(VoraClient.scala:775) at com.sap.spark.vora.client.VoraClient.generateAutocloseableIteratorFromQuery(VoraClient.scala:362) at com.sap.spark.vora.VoraRDD.compute(voraRDD.scala:54) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at
Question 3. Am I doing something wrong in the Vora Modeler, or is this actually a bug?
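You mentioned that you used all uppercase letters for the table names when running the CREATE statements. In my experience with the 1.3 Modeler, you also have to use all uppercase letters for the column names.
schema error: could not resolve column "FLIGHTS_2006"."YEAR"
For example, if you used "CREATE TABLE FLIGHTS_2006 (Year int, ...", try changing it to "CREATE TABLE FLIGHTS_2006 (YEAR int, ...".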
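To illustrate the renaming, here is a minimal Python sketch that upper-cases the column names of a column list before it is pasted into the CREATE TABLE statement. The helper `uppercase_columns` is hypothetical (it is not part of Vora), and it assumes simple `name type` pairs with no commas inside the type. Only the first few columns from the question are used here.

```python
def uppercase_columns(column_defs):
    """Upper-case the name in each 'name type' pair; the type is left as-is.

    Assumption: every column is a plain 'name type' pair separated by commas.
    """
    converted = []
    for column in column_defs.split(","):
        name, col_type = column.split()
        converted.append("%s %s" % (name.upper(), col_type))
    return ", ".join(converted)


# Shortened column list taken from the CREATE TABLE statement in the question.
cols = "Year int, Month_ int, DayofMonth int, UniqueCarrier string"
print("CREATE TABLE FLIGHTS_2006 (%s)" % uppercase_columns(cols))
```

The USING and OPTIONS clauses stay exactly as in the original statement; only the identifiers inside the parentheses change.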
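Regarding your Q1, yes, this is currently being reviewed as a feature request.
Regarding your Q2, is your Zeppelin connected to the same Vora Thriftserver as your Vora Modeler (aka Vora Tools)?
Regarding your Q3, the other reply from Ryan is correct: column names are also case-sensitive in Vora 1.3.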
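As a possible stopgap for Q1, the value of the `files` option could be generated from a directory listing instead of being typed by hand. This is a minimal Python sketch, assuming the listing (for example from `hdfs dfs -ls`) has already been reduced to a list of paths. `files_option` is a hypothetical helper, not a Vora API.

```python
def files_option(paths):
    """Build the comma-separated value for the `files` option.

    Assumption: `paths` is an iterable of full HDFS paths from a directory
    listing; only Spark-style part files (part-*) are kept.
    """
    parts = sorted(p for p in paths if p.rsplit("/", 1)[-1].startswith("part-"))
    if not parts:
        raise ValueError("no part files found")
    return ",".join(parts)


# Example listing; the _SUCCESS marker is skipped because it is not a part file.
listing = [
    "/exch/flights_filtered/_SUCCESS",
    "/exch/flights_filtered/part-00001",
    "/exch/flights_filtered/part-00000",
]
print('files "%s",' % files_option(listing))
```

The printed line can then be pasted into the OPTIONS clause of the CREATE TABLE statement, so the statement no longer depends on knowing the number of part files in advance.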