So we had a power outage yesterday. The UPS did its thing and shut down the main server safely. The Pi that runs my backup PiHole was knocked offline by the outage and came back up when the power did.
I get home from work, the power is back on, and my phone is telling me my WiFi has no internet access. Modem, router and backup Pi are all online… WTF has happened?!
A few weeks prior…
I had decided to finally set up PiHole properly on my network and force all DNS traffic through it. I got it up and running on my main server and decided I should also install it on the Pi that runs Nagios, as a backup PiHole in case my main server goes down. Nearly everything on my network is dockerized, including Nagios on the Pi. I had recently set up this Pi so that all logs get sent to a syslog server running on my main server, which included setting Docker's log driver to syslog over TCP. I made this decision because I had gone through a few SD cards, and for some reason am still resisting running the Pi off an external SSD; I figured shipping logs off the card might help it last longer than three months.
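The relevant piece of that setup is the Docker daemon config on the Pi. A sketch of what mine looked like (the syslog server address is a placeholder for the main server):

```json
{
  "log-driver": "syslog",
  "log-opts": {
    "syslog-address": "tcp://192.168.1.10:514",
    "tag": "{{.Name}}"
  }
}
```

This lives in `/etc/docker/daemon.json` and applies to every container on the host; restarting the Docker daemon picks it up.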
Everything is working fine, I simulate a server failure by stopping the PiHole docker on the master server, watch the DNS queries flow onto the Pi and then back to the main server when the PiHole docker is started again. Great, job done!
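The failover test amounted to something like this (container name and IP addresses are placeholders for my setup):

```shell
# On the main server: take the primary PiHole down
docker stop pihole

# From a client: confirm DNS still resolves, now answered by the backup Pi
dig example.com @192.168.1.20 +short

# Bring the primary back up and confirm queries flow back to it
docker start pihole
```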
Then we have a power outage… My main server runs Unraid with a dedicated VM for my dockers. The UPS is configured not to power back on after it shuts down, and Unraid is configured so that I have to manually start everything (including the physical box), so after a power outage it stays offline. This shouldn't have been an issue, right? I have the Pi running PiHole, which would start up when the power returned and provide DNS to the household until I could get the server back online safely.
What I did not count on was that when the Docker syslog driver is using TCP, it will prevent any container from starting if it can't connect to the syslog server, which was of course on the main server that was still offline. I had a quick look and found a bug report on Docker about this issue from a few years ago, and no real change or decision had been made about how best to resolve it from Docker's side.
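This is easy to reproduce on any Docker host: point the syslog driver at a TCP address nothing is listening on (the address here is a placeholder), and the container refuses to start at all:

```shell
docker run --rm \
  --log-driver syslog \
  --log-opt syslog-address=tcp://192.168.1.10:514 \
  alpine echo hello
# Fails with an error along the lines of:
#   docker: Error response from daemon: failed to initialize logging driver:
#   dial tcp 192.168.1.10:514: connect: connection refused.
```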
To get around this issue myself, I just set the Docker syslog driver to use UDP instead of TCP. A post on the bug report recommended installing a log forwarder locally on the host: Docker forwards logs to the local forwarder, and if the central syslog server is down, the forwarder caches the logs until the server comes back online. That would be the ideal setup in a proper production environment, but as this is just my home network, the unreliability of UDP isn't really an issue.

Lessons learned:
- Docker is designed to preserve logs at all costs: with the syslog driver over TCP, it will refuse to start a container when it cannot confirm a connection to the syslog server. With syslog over UDP, delivery is fire-and-forget and containers start regardless.
- When simulating a “server failure”, make sure every resource that server provides is also simulated as offline.
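For reference, the workaround boils down to a one-word change in the Pi's `/etc/docker/daemon.json` (again, the syslog server address is a placeholder):

```json
{
  "log-driver": "syslog",
  "log-opts": {
    "syslog-address": "udp://192.168.1.10:514"
  }
}
```

With `udp://`, the driver just fires datagrams at the server without needing an established connection, so containers on the Pi start even while the main server is down.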