I have a setup which, without any configuration change, sometimes works and sometimes does not. I would welcome any help to understand why (and to have it work 100% of the time).
Setup
Platform:
- Windows 10
- WSL2, ubuntu 21.04
- docker compose 1.29.2
- docker engine v20.10.7
- docker desktop (WSL2 backend) 3.4.1 (65384)
A slightly older set of docker/docker desktop versions showed the same behavior.
Cluster setup:
- docker-compose runs under WSL2 and talks to docker in Windows
- my docker compose file starts 5 services:
  - kudu master (Kudu is an Apache distributed database, this is the master)
  - 3 kudu tservers (the workers of the database)
  - 1 Apache impala (sql interface to the database)
- an app (outside docker, spark/scala in my case) first talks to the master, receives the IPs of all the workers, and then speaks to them directly
- the components of the cluster also need to speak to each other, each from within its own docker container
Run/docker-compose.yaml
Each service will advertise its IP as being the host IP (ifconfig | grep "inet " | grep -Fv 127.0.0.1 | awk '{print $2}' | tail -1). They all listen on different ports.
If I give host.docker.internal as the advertised address, it does not work either (because the address is eventually used outside docker, where it is not valid).
I do not set up any specific networking configuration.
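For reference, the advertised IP described above can be computed and exported before running docker-compose roughly as follows. This is a minimal sketch: the helper name pick_host_ip is hypothetical, and it assumes ifconfig prints lines of the form "inet <address> netmask ...", as it does on recent Ubuntu releases.

```shell
# Hypothetical helper: read ifconfig-style text on stdin and print the last
# non-loopback IPv4 address (same pipeline as quoted above).
pick_host_ip() {
  grep "inet " | grep -Fv 127.0.0.1 | awk '{print $2}' | tail -1
}

# Export the variable that the compose file interpolates.
# Skipped when ifconfig is not installed.
if command -v ifconfig >/dev/null 2>&1; then
  KUDU_QUICKSTART_IP="$(ifconfig | pick_host_ip || true)"
  export KUDU_QUICKSTART_IP
  echo "KUDU_QUICKSTART_IP=${KUDU_QUICKSTART_IP}"
fi
```

Because the pipeline keeps only the last matching line, the result depends on the order in which interfaces are listed, so it is worth printing the value before each run to see whether it changes between a working and a failing start.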
A subset of my docker compose file, trimmed to not pollute the question too much:
services:
  master:
    image: apache/kudu:latest
    ports:
      - "7051:7051" # RPC interface
      - "8051:8051" # Web interface
    command: ["master"]
  tserver-1:
    image: apache/kudu:latest
    depends_on:
      - master
    ports:
      - "7050:7050" # RPC interface
      - "8050:8050" # Web interface
    command: ["tserver"]
    environment:
      - KUDU_MASTERS=${KUDU_QUICKSTART_IP}:7051
      - >
        TSERVER_ARGS=
        --rpc_bind_addresses=0.0.0.0:7050
        --rpc_advertised_addresses=${KUDU_QUICKSTART_IP}:7050
It is heavily based on the example from kudu itself: https://github.com/apache/kudu/blob/master/docker/quickstart.yml
Problem
- sometimes, it completely works.
- sometimes, I can access kudu via impala (which I believe only talks to the master, with the master itself talking to the workers), but not via my app, which tries to speak to the workers directly.
- sometimes, even impala will not connect.
All without changing docker-compose.yaml! I make sure to destroy all volumes every time as well, just in case.
If I change all rpc_advertised_addresses to the service name (which is thus a DNS name as well), I end up with the same error as in the other cases, just a bit further down the line.
Logs
If something does not go well, I see in the logs inside the containers an error along the lines of:
Timed out: Client connection negotiation failed: client connection to 172.17.188.205:7150: Timeout exceeded waiting to connect
This means that the containers themselves cannot talk to each other.
Here, 172.17.188.205 is my host IP. If I use service names as rpc_advertised_addresses, I see the service IP instead.
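The timeout can be checked independently of kudu with a plain TCP probe against the advertised address. This is a minimal sketch, assuming bash and the coreutils timeout command are available wherever it runs (an assumption for the apache/kudu image); the function name can_connect is hypothetical, and the two ports are the RPC ports from the compose subset above.

```shell
# Hypothetical helper: succeed if a TCP connection to host ($1), port ($2)
# can be opened within 3 seconds. Uses bash's /dev/tcp pseudo-device.
can_connect() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Probe the master and first tserver RPC ports on the advertised IP.
# Falls back to 127.0.0.1 when KUDU_QUICKSTART_IP is not set.
for endpoint in "${KUDU_QUICKSTART_IP:-127.0.0.1}:7051" \
                "${KUDU_QUICKSTART_IP:-127.0.0.1}:7050"; do
  host="${endpoint%:*}"
  port="${endpoint##*:}"
  if can_connect "$host" "$port"; then
    echo "reachable:   $endpoint"
  else
    echo "unreachable: $endpoint"
  fi
done
```

Running it once from the WSL2 shell and once from a shell inside a container (for example via docker-compose exec) would show from which side the advertised address stops being reachable.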
The problem
I'm not the only one with this issue; a teammate sees the same behavior. It feels like I'm not understanding the networking correctly.