Google Cloud Dataflow (Text Files to Cloud Spanner) java.lang.RuntimeException: Error to parseRow. row: CSVR



I am running a Cloud Dataflow job that imports multiple text files (.csv) from GCS into Cloud Spanner.

The job partially succeeds: roughly 6 million of the 1 billion rows are imported, but then it fails with the following error:

Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Error to parseRow. row: CSVRecord [comment='null', recordNumber=1, values=[source_set_id_hash, rank, run_id, source_set_id, recommended_set_id, score, updated_at, version]], table: CREATE TABLE `set_recs_similar_content` (
`source_set_id_hash`                    STRING(MAX) NOT NULL,
`version`                               STRING(MAX) NOT NULL,
`rank`                                  INT64 NOT NULL,
`recommended_set_id`                    INT64 NOT NULL,
`run_id`                                STRING(MAX) NOT NULL,
`score`                                 FLOAT64 NOT NULL,
`source_set_id`                         INT64 NOT NULL,
`updated_at`                            TIMESTAMP NOT NULL,
) PRIMARY KEY (`source_set_id_hash` ASC, `version` ASC, `rank` ASC)

Is this happening because it is reading the first row of the CSV and expecting it to match the schema?

The relevant part of my manifest.json file looks like this:

"columns": [
{"column_name": "source_set_id_hash", "type_name": "STRING"},
{"column_name": "rank", "type_name": "INT64"},
{"column_name": "run_id", "type_name": "STRING"},
{"column_name": "source_set_id", "type_name": "INT64"},
{"column_name": "recommended_set_id", "type_name": "INT64"},
{"column_name": "score", "type_name": "FLOAT64"},
{"column_name": "updated_at", "type_name": "TIMESTAMP"},
{"column_name": "version", "type_name": "STRING"}
]

All of the files in GCS have the same format, so it seems strange that 6% of the job would complete and then fail.

Link to the relevant documentation: https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#gcstexttocloudspanner

Thanks

It turns out that the CSV files must not contain a header row.

From the template source file:

* <p>Text file must NOT have a header.

Try again after removing the header from all of the files. The following line from the error message shows that the first row is a header:

recordNumber=1, values=[source_set_id_hash, rank, run_id, source_set_id, recommended_set_id, score, updated_at, version]
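
For stripping the headers in bulk, here is a minimal Python sketch, assuming the google-cloud-storage client library. The bucket name and object prefix are hypothetical placeholders, and it assumes each file is small enough to rewrite in memory:

# Minimal sketch: remove the first line (the header) from every .csv
# under a GCS prefix before re-running the Dataflow import job.
# BUCKET and PREFIX are hypothetical placeholders -- adjust for your setup.
from google.cloud import storage

BUCKET = "my-import-bucket"
PREFIX = "set_recs_similar_content/"

client = storage.Client()
bucket = client.bucket(BUCKET)

for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    if not blob.name.endswith(".csv"):
        continue
    text = blob.download_as_text()
    if "\n" not in text:
        # Single-line file: nothing after the header, skip it.
        continue
    # Split off everything up to and including the first newline.
    header, _, body = text.partition("\n")
    bucket.blob(blob.name).upload_from_string(body)
    print(f"Stripped header from {blob.name}: {header!r}")

For very large files you would want to stream the objects rather than load each one into memory, but the idea is the same.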
