Elasticsearch: reconcile-desired-balance

6 June 2025

I've been struggling all day to get my GitLab pipelines running properly again.

It turns out that the combination of Docker happily eating all your disk space and Elasticsearch being very cautious whenever there isn't plenty of free disk made things break down.

My setup is GitLab runners running inside Docker, testing a Django app with Elasticsearch (ES) attached as a service. This has been working flawlessly forever. But at some point recently, I started getting errors in the tests that depend on ES. And the weird thing is that it started during a quiet period where I had not touched the setup or the code.

From Python, I was getting the error message:

elastic_transport.ConnectionTimeout: Connection timed out

This seemed weird, as I was perfectly able to connect to the ES container from the app container: I do a check at the beginning of my test script by simply curling the ES host. Even the line of code right before the one that timed out connected to ES successfully. Those two lines do the following:

1. Remove the search index.
2. Create the search index again.

With everything running inside GitLab runners that themselves run inside Docker, the whole thing felt like a bit of a black box. But I figured out a way to get the logs from the ES container:

1. Run the job in GitLab.
2. Run docker ps on the GitLab runner host and get the ID of the ES container when it shows up.
3. Save a copy of the log with docker logs -f <container_id> > elasticsearch.log and wait for the job to fail.

Looking at the log I found the very last entry to contain a clue:

{"@timestamp":"2025-06-06T11:47:27.110Z", "log.level": "INFO",  "current.health":"RED","message":"Cluster health status changed from [YELLOW] to [RED] (reason: [reconcile-desired-balance]).","previous.health":"YELLOW","reason":"reconcile-desired-balance" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[5a28270d5238][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"dX-ZZeLEQ-OMCDKm2PjoCQ","elasticsearch.node.id":"r0CL-D0LT5u-tpTeHujQwQ","elasticsearch.node.name":"5a28270d5238","elasticsearch.cluster.name":"docker-cluster"}
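Since the ES log is one JSON object per line, the health changes can also be pulled out with a few lines of Python instead of scrolling through the file. This is only a rough sketch: it assumes the log was saved as elasticsearch.log as in the steps above, and the function names health_changes and print_health_changes are made up for this example. The field names (current.health, previous.health, reason) are the ones seen in the log entry above.

```python
import json


def health_changes(log_lines):
    """Yield (timestamp, previous, current, reason) for every health change entry."""
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise malformed lines
        if "current.health" in entry:
            yield (
                entry.get("@timestamp"),
                entry.get("previous.health"),
                entry["current.health"],
                entry.get("reason"),
            )


def print_health_changes(path="elasticsearch.log"):
    """Print one line per health change found in the saved log (path is an assumption)."""
    with open(path, encoding="utf-8") as fh:
        for timestamp, previous, current, reason in health_changes(fh):
            print(f"{timestamp}: {previous} -> {current} ({reason})")
```

Calling print_health_changes() after the job has failed should list every status change in order, which makes it easier to see when the cluster went RED and what reason ES gave.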

It wasn't super clear what "reconcile-desired-balance" meant, but fortunately I found a forum post from someone with the same problem, suggesting that the cause is a lack of disk space.

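Checking the disk (df -h), I had more than 30 GB free, but the usage percentage had crawled past 90%. That is a red flag for ES: 90% is the default for its high disk watermark (cluster.routing.allocation.disk.watermark.high), and above that it stops allocating shards to the node.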
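To catch this earlier next time, a small check like the following could run on the runner host. It is a minimal sketch, not how ES itself decides: it compares the used percentage of a path against Elasticsearch's default disk watermarks (85% low, 90% high, 95% flood stage), assuming those defaults have not been changed in the cluster settings. The function names are made up for this example, and the percentage may differ slightly from what df -h or ES report.

```python
import shutil

# Elasticsearch's default disk watermarks, as percent of disk used.
# Assumption: the cluster runs with the defaults; adjust if yours are overridden.
WATERMARKS = [("flood_stage", 95.0), ("high", 90.0), ("low", 85.0)]


def used_percent(total, free):
    """Return the percentage of the disk that is not free."""
    return 100.0 * (total - free) / total


def exceeded_watermark(percent_used):
    """Return the name of the highest watermark reached, or None if below all of them."""
    for name, threshold in WATERMARKS:  # ordered from highest to lowest threshold
        if percent_used >= threshold:
            return name
    return None


def check_path(path="/"):
    """Return (percent_used, watermark_name_or_None) for the disk holding `path`."""
    usage = shutil.disk_usage(path)
    percent = used_percent(usage.total, usage.free)
    return percent, exceeded_watermark(percent)
```

For example, check_path("/var/lib/docker") would report on the disk that holds Docker's data, assuming that is where Docker stores it on the runner host.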

I do know that Docker will happily eat all your disk space over time. That has caused me problems before. And yes, it had also had a feast this time. Running docker system prune -a reclaimed 265 GB of disk.

After this, tadaaa! Elasticsearch no longer turns to a RED health status and thus does not time out: My tests are passing again. Oh, the joys of modern development.