StormCrawler seed injection fails on an AWS node



I'm running StormCrawler 1.15 (ES 7.3.0, Storm 1.2.3) on an AWS instance. Seed injection (ESSeedInjector) fails and I can't work out why. Essentially, every URL passed to the "enqueue" bolt fails.

Here is a sample from the Apache Storm worker log:

...
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquaplex.pvi.com/, , DISCOVERED]
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquaplex.pvi.com/, discoveryDate: 2019-11-13T09:38:18.937Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquaponics.com/aquaponic-systems/com
mercial-systems/, , DISCOVERED]
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquaponics.com/aquaponic-systems/commercial-systems/, disc
overyDate: 2019-11-13T09:38:18.937Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquarium-fish.kamihata.net/guppy/, ,
DISCOVERED]
2019-11-13 09:38:18.935 c.d.s.e.p.StatusUpdaterBolt I/O dispatcher 2 [ERROR] Exception with bulk 1 - failing the whole lot
org.elasticsearch.ElasticsearchStatusException: Unable to parse response body
at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1707) ~[stormjar.jar:?]
at org.elasticsearch.client.RestHighLevelClient$1.onFailure(RestHighLevelClient.java:1621) [stormjar.jar:?]
at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:564) [stormjar.jar:?]
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:310) [stormjar.jar:?]
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:294) [stormjar.jar:?]
at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) [stormjar.jar:?]
at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181) [stormjar.jar:?]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) [stormjar.jar:?]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) [stormjar.jar:?]
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) [stormjar.jar:?]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [stormjar.jar:?]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) [stormjar.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://node-1], URI [/_bulk?timeout=1m], status line [HTTP/1.1 404 Not Found]
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /_bulk was not found on this server.</p>
</body></html>
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:253) ~[stormjar.jar:?]
at org.elasticsearch.client.RestClient.access$900(RestClient.java:95) ~[stormjar.jar:?]
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:298) ~[stormjar.jar:?]
... 16 more
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquarium-fish.kamihata.net/guppy/, discoveryDate: 2019-11-
13T09:38:18.937Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquarius-spectrum.com/pdf/Aquarius-S
pectrum-PPT.pdf, , DISCOVERED]
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquarius-spectrum.com/pdf/Aquarius-Spectrum-PPT.pdf, disco
veryDate: 2019-11-13T09:38:18.938Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquaservinc.com/contact-us/, , DISCO
...

Has anyone run into the same problem?

Are you specifying the port of the ES server correctly? It looks like you are hitting a plain HTTP server, which of course knows nothing about _bulk.

See the example ES conf. If the port is not specified, it should pick the default one, so setting just the host should be enough.
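
As a rough sketch of what that setting might look like: the log above shows the bulk request going to host [http://node-1] with no port, and the reply is an HTML 404 page, which suggests a web server answered instead of Elasticsearch. The fragment below assumes the status index address is set through the es.status.addresses key, as in the sample es-conf.yaml shipped with StormCrawler, and that Elasticsearch on node-1 listens on its usual HTTP port 9200. Both are assumptions to check against your own setup.

```yaml
# es-conf.yaml (fragment). The key name and the port are assumptions;
# verify them against your own configuration and cluster.
config:
  # Full URL of the Elasticsearch node, including the port it listens on
  # (9200 is the Elasticsearch default for HTTP).
  es.status.addresses: "http://node-1:9200"
```

If other components in the topology point at the same cluster (for example the metrics or indexer settings), their address entries would presumably need the same change.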
