I'm loading some data into BigQuery using Spark. The idea is to read the data from S3 and load it with Spark and the BigQuery client API. Below is the code that performs the insert into BigQuery:
val bq = createAuthorizedClientWithDefaultCredentialsFromStream(appName, credentialStream)
val bqjob = bq.jobs().insert(pid, job, data).execute() // data is an InputStreamContent
With this approach I'm seeing a lot of SocketTimeoutExceptions:
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:911)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
It looks like the latency of reading from S3 causes the Google http-client to time out. I wanted to increase the timeout and tried the following:
val req = bq.jobs().insert(pid, job, data).buildHttpRequest()
req.setReadTimeout(3 * 60 * 1000)
val res = req.execute()
But this causes a Precondition failure in BigQuery. It expects the mediaUploader to be null, but I'm not sure why.
Exception in thread "main" java.lang.IllegalArgumentException
at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:37)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:297)
That led me to try the second insert API on BigQuery:
val req = bq.jobs().insert(pid, job).buildHttpRequest().setReadTimeout(3 * 60 * 1000).setContent(data)
val res = req.execute()
This time it failed with a different error:
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: ",
"reason" : "invalid"
} ],
"message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: "
}
Please tell me how to set the timeout, and point out anything else I'm doing wrong.
I'll answer the main question from the title: how to set timeouts using the Java client library.
To set timeouts, you need a custom HttpRequestInitializer configured in your client. For example:
// NetHttpTransport suits a non-App Engine environment such as Spark
// (the original snippet used UrlFetchTransport, which is App Engine-only).
Bigquery.Builder builder =
    new Bigquery.Builder(new NetHttpTransport(), new JacksonFactory(), credential);
final HttpRequestInitializer existing = builder.getHttpRequestInitializer();
builder.setHttpRequestInitializer(new HttpRequestInitializer() {
  @Override
  public void initialize(HttpRequest request) throws IOException {
    // Keep the existing behavior (credentials etc.), then raise the timeouts.
    existing.initialize(request);
    request
        .setReadTimeout(READ_TIMEOUT)           // e.g. 3 * 60 * 1000 ms
        .setConnectTimeout(CONNECTION_TIMEOUT); // e.g. 60 * 1000 ms
  }
});
Bigquery client = builder.build();
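To connect this back to the question's createAuthorizedClientWithDefaultCredentialsFromStream helper, here's a minimal sketch of building the whole client that way. GoogleCredential.fromStream and BigqueryScopes.BIGQUERY are real client-library APIs, but the surrounding wiring and the timeout values are assumptions about your setup:

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestInitializer;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;

public class BigqueryClientFactory {
  private static final int READ_TIMEOUT = 3 * 60 * 1000; // 3 minutes, in milliseconds
  private static final int CONNECT_TIMEOUT = 60 * 1000;  // 1 minute, in milliseconds

  // Builds an authorized BigQuery client whose requests all carry longer timeouts,
  // mirroring what createAuthorizedClientWithDefaultCredentialsFromStream presumably does.
  public static Bigquery createClient(String appName, InputStream credentialStream)
      throws IOException {
    final GoogleCredential credential = GoogleCredential.fromStream(credentialStream)
        .createScoped(Collections.singleton(BigqueryScopes.BIGQUERY));
    return new Bigquery.Builder(new NetHttpTransport(), new JacksonFactory(),
        new HttpRequestInitializer() {
          @Override
          public void initialize(HttpRequest request) throws IOException {
            credential.initialize(request); // sign the request first
            request.setReadTimeout(READ_TIMEOUT).setConnectTimeout(CONNECT_TIMEOUT);
          }
        })
        .setApplicationName(appName)
        .build();
  }
}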
I don't think this will solve all of the problems you're hitting. A few thoughts that might help, although I don't fully understand the scenario, so these may be off track:
- If you're moving large files: consider staging them on GCS before loading them into BigQuery (see the first sketch after this list).
- If you're using media upload to send your request data: it can't be too large, or you risk timeouts or network connection failures (the second sketch after this list shows one way to tune the uploader).
- If you're running an embarrassingly parallel data migration and the chunks are relatively small, bigquery.tabledata.insertAll may be more appropriate for a large fan-in scenario like that (third sketch below). See https://cloud.google.com/bigquery/streaming-data-into-bigquery for details.
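For the first bullet, a minimal sketch of a load job reading staged files from GCS, reusing bq and pid from the question; the bucket path, source format, and table names are assumptions:

import java.util.Collections;

import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.TableReference;

// Stage the S3 data on GCS first (e.g. via gsutil or the GCS connector),
// then point the load job at the staged objects instead of streaming the bytes.
Job gcsLoadJob = new Job().setConfiguration(new JobConfiguration().setLoad(
    new JobConfigurationLoad()
        .setSourceUris(Collections.singletonList("gs://my-staging-bucket/data/part-*")) // assumed path
        .setSourceFormat("NEWLINE_DELIMITED_JSON") // or "CSV", depending on the data
        .setDestinationTable(new TableReference()
            .setProjectId(pid)
            .setDatasetId("my_dataset")   // assumed
            .setTableId("my_table"))));   // assumed

// No media is attached, so the insert request is tiny and far less likely to time out.
Job submitted = bq.jobs().insert(pid, gcsLoadJob).execute();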
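For the second bullet, if you keep streaming the bytes through jobs().insert(pid, job, data), you can at least tune the resumable uploader; getMediaHttpUploader and setChunkSize exist on the generated request, but the chunk size here is an arbitrary assumption:

import com.google.api.client.googleapis.media.MediaHttpUploader;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;

Bigquery.Jobs.Insert insert = bq.jobs().insert(pid, job, data);
MediaHttpUploader uploader = insert.getMediaHttpUploader();
uploader.setDirectUploadEnabled(false); // keep the resumable (chunked) protocol
// Smaller chunks mean each HTTP round trip is shorter, at the cost of more requests.
uploader.setChunkSize(8 * MediaHttpUploader.MINIMUM_CHUNK_SIZE); // 8 * 256 KB = 2 MB
Job uploadedJob = insert.execute();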
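And for the last bullet, a minimal sketch of the streaming path via bigquery.tabledata.insertAll; the dataset, table, and row payload are assumptions:

import java.util.Collections;

import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataInsertAllResponse;

// Each parallel worker streams its own small batch of rows straight into the table.
TableDataInsertAllRequest.Rows row = new TableDataInsertAllRequest.Rows()
    .setJson(Collections.<String, Object>singletonMap("column_name", "value")); // assumed schema
TableDataInsertAllRequest body =
    new TableDataInsertAllRequest().setRows(Collections.singletonList(row));
TableDataInsertAllResponse response =
    bq.tabledata().insertAll(pid, "my_dataset", "my_table", body).execute(); // assumed table
if (response.getInsertErrors() != null && !response.getInsertErrors().isEmpty()) {
  // insertAll is not all-or-nothing: inspect and retry the failed rows here.
}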
Thanks for the question!