I cannot get any queries to work against my partitioned AWS Glue table. The error I get is:
HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: type expected at the position 0 of 'STRING' but 'STRING' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null)
I found another thread that raised the fact that database and table names cannot contain characters other than alphanumerics and underscores. I therefore made sure that the database name, table name, and all column names adhere to this restriction. The only object that does not is my s3 bucket name, which would be very difficult to change.
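If you want to check names up front, a minimal sketch of that validation in python might look like this (the allowed-character rule is my reading of that thread, not an official AWS specification):

import re

def is_safe_glue_name(name: str) -> bool:
    # Allow only lowercase alphanumerics and underscores, per the
    # restriction described in the thread mentioned above.
    return re.fullmatch(r"[a-z0-9_]+", name) is not None

assert is_safe_glue_name("system_admin_created")
assert not is_safe_glue_name("ph-data-lake-cududfs2z3xveg5t")  # hyphens fail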
Below are the table definition and a data dump from parquet-tools.
AWS Glue table definition
{
"Table": {
"UpdateTime": 1545845064.0,
"PartitionKeys": [
{
"Comment": "call_time year",
"Type": "INT",
"Name": "date_year"
},
{
"Comment": "call_time month",
"Type": "INT",
"Name": "date_month"
},
{
"Comment": "call_time day",
"Type": "INT",
"Name": "date_day"
}
],
"StorageDescriptor": {
"OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"SortColumns": [],
"InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"Name": "ser_de_info_system_admin_created",
"Parameters": {
"serialization.format": "1"
}
},
"BucketColumns": [],
"Parameters": {},
"Location": "s3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/",
"NumberOfBuckets": 0,
"StoredAsSubDirectories": false,
"Columns": [
{
"Comment": "Unique user ID",
"Type": "STRING",
"Name": "user_id"
},
{
"Comment": "Unique group ID",
"Type": "STRING",
"Name": "group_id"
},
{
"Comment": "Date and time the message was published",
"Type": "TIMESTAMP",
"Name": "call_time"
},
{
"Comment": "call_time year",
"Type": "INT",
"Name": "date_year"
},
{
"Comment": "call_time month",
"Type": "INT",
"Name": "date_month"
},
{
"Comment": "call_time day",
"Type": "INT",
"Name": "date_day"
},
{
"Comment": "Given name for user",
"Type": "STRING",
"Name": "given_name"
},
{
"Comment": "IANA time zone for user",
"Type": "STRING",
"Name": "time_zone"
},
{
"Comment": "Name that links to geneaology",
"Type": "STRING",
"Name": "family_name"
},
{
"Comment": "Email address for user",
"Type": "STRING",
"Name": "email"
},
{
"Comment": "RFC BCP 47 code set in this user's profile language and region",
"Type": "STRING",
"Name": "language"
},
{
"Comment": "Phone number including ITU-T ITU-T E.164 country codes",
"Type": "STRING",
"Name": "phone"
},
{
"Comment": "Date user was created",
"Type": "TIMESTAMP",
"Name": "date_created"
},
{
"Comment": "User role",
"Type": "STRING",
"Name": "role"
},
{
"Comment": "Provider dashboard preferences",
"Type": "STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING>",
"Name": "preferences"
},
{
"Comment": "Provider notification settings",
"Type": "STRUCT<digest_email:BOOLEAN>",
"Name": "notifications"
}
],
"Compressed": true
},
"Parameters": {
"classification": "parquet",
"parquet.compress": "SNAPPY"
},
"Description": "System wide admin_created messages",
"Name": "system_admin_created",
"TableType": "EXTERNAL_TABLE",
"Retention": 0
}
}
AWS Athena schema
CREATE EXTERNAL TABLE `system_admin_created`(
`user_id` STRING COMMENT 'Unique user ID',
`group_id` STRING COMMENT 'Unique group ID',
`call_time` TIMESTAMP COMMENT 'Date and time the message was published',
`date_year` INT COMMENT 'call_time year',
`date_month` INT COMMENT 'call_time month',
`date_day` INT COMMENT 'call_time day',
`given_name` STRING COMMENT 'Given name for user',
`time_zone` STRING COMMENT 'IANA time zone for user',
`family_name` STRING COMMENT 'Name that links to geneaology',
`email` STRING COMMENT 'Email address for user',
`language` STRING COMMENT 'RFC BCP 47 code set in this user\'s profile language and region',
`phone` STRING COMMENT 'Phone number including ITU-T ITU-T E.164 country codes',
`date_created` TIMESTAMP COMMENT 'Date user was created',
`role` STRING COMMENT 'User role',
`preferences` STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING> COMMENT 'Provider dashboard preferences',
`notifications` STRUCT<digest_email:BOOLEAN> COMMENT 'Provider notification settings')
PARTITIONED BY (
`date_year` INT COMMENT 'call_time year',
`date_month` INT COMMENT 'call_time month',
`date_day` INT COMMENT 'call_time day')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/'
TBLPROPERTIES (
'classification'='parquet',
'parquet.compress'='SNAPPY')
parquet-tools dump
role = admin
date_created = 2018-01-11T14:40:23.142Z
preferences:
.patients_hidden = false
.weekend_digests = true
.portal_welcome_done = true
email = foo.barr+123@example.com
notifications:
.digest_email = true
group_id = 5a5399df23a804001aa25227
given_name = foo
call_time = 2018-01-11T14:40:23.000Z
time_zone = US/Pacific
family_name = bar
language = en-US
user_id = 5a5777572060a700170240c3
parquet-tools schema
message spark_schema {
optional binary role (UTF8);
optional binary date_created (UTF8);
optional group preferences {
optional boolean patients_hidden;
optional boolean weekend_digests;
optional boolean portal_welcome_done;
optional binary last_announcement (UTF8);
}
optional binary email (UTF8);
optional group notifications {
optional boolean digest_email;
}
optional binary group_id (UTF8);
optional binary given_name (UTF8);
optional binary call_time (UTF8);
optional binary time_zone (UTF8);
optional binary family_name (UTF8);
optional binary language (UTF8);
optional binary user_id (UTF8);
optional binary phone (UTF8);
}
I ran into a similar PrestoException, and the cause was using uppercase letters for the column type. Once I changed 'VARCHAR(10)' to 'varchar(10)', it worked.
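If the table already exists with uppercase types, a minimal sketch of lowercasing them in place through boto3 might look like the following (the database and table names are placeholders, and the set of TableInput keys preserved here is an assumption about which fields you care about):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Fetch the current table definition (database/table names are placeholders).
table = glue.get_table(DatabaseName="my_database", Name="system_admin_created")["Table"]

# Lowercase every column type, including the partition keys.
for col in table["StorageDescriptor"]["Columns"] + table.get("PartitionKeys", []):
    col["Type"] = col["Type"].lower()

# update_table accepts only TableInput fields, so drop read-only ones
# such as UpdateTime before writing the definition back.
table_input = {k: table[k] for k in ("Name", "Description", "StorageDescriptor",
                                     "PartitionKeys", "TableType", "Parameters",
                                     "Retention") if k in table}
glue.update_table(DatabaseName="my_database", TableInput=table_input)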
I had declared the partition keys as fields in the table itself. I also found the Parquet-versus-Hive discrepancy around TIMESTAMP and converted those values to ISO 8601 strings. At that point I nearly gave up, because Athena throws a schema error unless every parquet file in the s3 bucket has the same schema as the Athena table, and with optional fields and sparse columns that mismatch is guaranteed to happen.
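For reference, the conversion I mean is along these lines, assuming the records pass through Python before being written to parquet (the function name is illustrative); the affected columns are then declared as string rather than timestamp, which matches the binary (UTF8) call_time and date_created fields in the parquet-tools schema above:

import datetime

def to_iso8601(ts: datetime.datetime) -> str:
    # Render the timestamp as an ISO 8601 string with millisecond
    # precision and a trailing Z, as in the dump above, instead of
    # writing a parquet/Hive TIMESTAMP value.
    return ts.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

print(to_iso8601(datetime.datetime(2018, 1, 11, 14, 40, 23, 142000)))
# prints 2018-01-11T14:40:23.142Z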
I also ran into this error, and of course the error message ultimately did not tell me the actual problem. I made exactly the same mistake as the original poster.
I was creating the Glue table through the python boto3 API, feeding it the column names, types, partition columns, and a few other things. The problem is explained below; here is the code I used to create the table:
import boto3

glue_clt = boto3.client("glue", region_name="us-east-1")

glue_clt.create_table(
    DatabaseName=database,
    TableInput={
        "Name": table,
        "StorageDescriptor": {
            "Columns": table_cols,
            "Location": table_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
        "PartitionKeys": partition_cols,
        "TableType": "EXTERNAL_TABLE",
    },
)
So I ended up defining all of the column names and types in the API's Columns input, and then also fed the partition columns' names and types to the PartitionKeys input. When I browsed to the AWS console, I realized that because I had defined the partition columns in both Columns and PartitionKeys, they were defined twice on the table.
Interestingly, if you try to do this through the console, it throws a more descriptive error letting you know that the column already exists (if you try to add a partition column that is already on the table).
To resolve it, I removed the partition columns and their types from the Columns input and fed them only through the PartitionKeys input, so they would no longer be put on the table twice. Frustratingly, that duplication turned out to be the cause of the same error message the OP saw when querying through Athena. The corrected call is sketched below.
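A minimal sketch of the corrected call, reusing the identifiers from the snippet above and assuming table_cols still contains every column (the comprehension that strips the partition columns out is illustrative):

partition_names = {col["Name"] for col in partition_cols}
non_partition_cols = [col for col in table_cols if col["Name"] not in partition_names]

glue_clt.create_table(
    DatabaseName=database,
    TableInput={
        "Name": table,
        "StorageDescriptor": {
            "Columns": non_partition_cols,  # partition columns removed here
            "Location": table_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
        "PartitionKeys": partition_cols,  # the only place they are declared
        "TableType": "EXTERNAL_TABLE",
    },
)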
This could also be related to how you created the database (whether through CloudFormation, the UI, or the CLI), or to whether you have any forbidden characters such as '-'. We have hyphens in both our database names and table names, and it renders many features unusable.