如何从docker容器连接到hdfs ?



我的目标是在气流中从hdfs读取文件并进行进一步的操作。

经过研究,我发现我需要使用的网址如下:

df = pd.read_parquet('http://localhost:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')

localhost/172.20.80.1/computer-name.mshome.net可以互换使用,

9870 - namenode端口,

hadoop_files/sample_2022_01。Parquet -我在根目录下创建的文件夹和文件。

我可以在PyCharm中访问和读取本地文件,但我无法在docker中获得相同的气流结果。我尝试使用本地hdfs和hdfs托管在docker和更改主机到host.docker.internal,但我得到相同的错误。

堆栈跟踪:

[2022-06-12, 17:52:45 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/usr/local/lib/python3.7/http/client.py", line 948, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/local/lib/python3.7/socket.py", line 728, in create_connection
raise err
File "/usr/local/lib/python3.7/socket.py", line 716, in create_connection
sock.connect(sa)
OSError: [Errno 113] No route to host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 207, in execute
branch = super().execute(context)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 171, in execute
return_value = self.execute_callable()
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 189, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/includes/parquet_dag/main.py", line 15, in main
df_parquet = read('hdfs://localhost:9000/hadoop_files/sample_2022_01.parquet')
File "/opt/airflow/dags/includes/parquet_dag/utils.py", line 29, in read
df = pd.read_parquet('http://172.20.80.1:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
**kwargs,
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
mode="rb",
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
path_or_handle, mode, is_text=False, storage_options=storage_options
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 614, in get_handle
storage_options=storage_options,
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 312, in _get_filepath_or_buffer
with urlopen(req_info) as req:
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 212, in urlopen
return urllib.request.urlopen(*args, **kwargs)
File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/local/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/local/lib/python3.7/urllib/request.py", line 1378, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 113] No route to host>
与host.docker.internal:

urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>

您需要使用任何可路由的地址在气流docker容器。

如果hadoop也在docker容器中,使用docker inspect CONTAINER(doc)检查它的ip地址。如果hadoop在本地主机上,可以设置network_mode: "host"(doc)

还有一个重要的注意事项,如果你在macos上使用docker桌面应用程序,它基本上是一个虚拟机。在这种情况下,你需要一些额外的设置,检查这个,例如。

其中localhost/172.20.80.1/computername.mshome.net可以互换使用,

它们不应该在Docker网络中互换。

在气流中,你可以使用Docker服务名,而不是IP地址,并确保容器在同一个桥接网络(而不是)主机模式,仅适用于Linux)。host.docker.internal也不正确,因为您试图到达另一个容器,而不是您的主机

https://docs.docker.com/network/bridge/

我还建议使用气流Spark操作符从HDFS实际读取Parquet,使用Spark,而不是Pandas或WebHDFS。如果需要,可以将Spark数据框转换为Pandas