We were upgrading our Kafka cluster to the newest version, following the rolling upgrade plan from the documentation. When we restarted brokers, even without upgrading them, they took a long time to rejoin the cluster. The Kafka server logs showed warnings about corrupted index files that were being rebuilt:
[2019-07-08 12:12:58,901] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/kafka-logs/email-opens/00000000000008365023.index) has non-zero size but the last offset is 8365023 which is no larger than the base offset 8365023.}. deleting /kafka-logs/email-opens/00000000000008365023.timeindex, /kafka-logs/email-opens/00000000000008365023.index and rebuilding index... (kafka.log.Log)
This warning appeared for every single topic.
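To see how widespread the warning is, the server log can be scanned and the affected index files grouped by their log directory. This is a minimal Python sketch, assuming the warning format shown above; the function name `corrupted_index_dirs` is made up for illustration.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Matches the index path inside the "Corrupt index found" warning shown above.
CORRUPT_INDEX = re.compile(r"Corrupt index found, index file \((?P<path>[^)]+)\)")


def corrupted_index_dirs(lines):
    """Count corrupted-index warnings per log directory (the parent of the index file)."""
    counts = Counter()
    for line in lines:
        match = CORRUPT_INDEX.search(line)
        if match:
            # The directory name identifies the topic data the index belongs to.
            counts[PurePosixPath(match.group("path")).parent.name] += 1
    return counts
```

Feeding it the lines of a broker's server log returns a counter keyed by directory name, such as email-opens for the warning above.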
We keep a lot of data on our Kafka cluster for analytics, and seeing such warnings worried us a lot. After quite some debugging, we realised that our systemd configuration was the cause.
Systemd wasn’t letting Kafka shut down safely
We manage Kafka on our brokers with a systemd service. During the shutdown phase of a restart, systemd was not letting Kafka complete its shutdown and was forcibly terminating it instead.
We realised this because we knew from previous experience that Kafka logs a "Shutdown completed" message when it finishes its shutdown routine, and that message was missing from the logs.
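A quick way to check for this is to look at what the broker logged around the last stop request. The helper below is a hypothetical sketch, not part of Kafka. The exact wording of the message may differ between Kafka versions, so the pattern is a parameter, and the assumed default matches "shutdown completed" or "shut down completed" in any letter case.

```python
import re

# Assumed default pattern; adjust it to whatever your Kafka version actually logs.
DEFAULT_PATTERN = r"shut\s?down completed"


def shutdown_completed(lines, pattern=DEFAULT_PATTERN):
    """Return True if any log line contains the shutdown-completed message."""
    marker = re.compile(pattern, re.IGNORECASE)
    return any(marker.search(line) for line in lines)
```

If this returns False for the log lines written during a stop, the broker was most likely terminated before it finished shutting down.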
It turns out that systemd waits 90 seconds by default for a service's processes to stop. If they have not stopped by then, it forcibly terminates them with SIGKILL.
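The stop behaviour described here can be imitated in a few lines of Python. This is only an illustration of the "ask politely, wait, then kill" pattern, not systemd's own code, and `stop_with_timeout` is a name invented for this sketch.

```python
import subprocess


def stop_with_timeout(proc, timeout_seconds):
    """Send SIGTERM, wait up to timeout_seconds, then fall back to SIGKILL.

    Returns True if the process had to be killed, False if it exited in time.
    """
    proc.terminate()  # SIGTERM on POSIX: a request to shut down cleanly
    try:
        proc.wait(timeout=timeout_seconds)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught, so no cleanup code runs
        proc.wait()
        return True
```

A process that needs longer than the timeout to flush its state never gets to finish, which is the situation Kafka was in with the 90 second default.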
To give Kafka more time, we set TimeoutStopSec to 600 seconds in the [Service] section of the Kafka unit file:
...
[Service]
Type=simple
PIDFile=/var/run/kafka.pid
User=experteer
Group=experteer
TimeoutStopSec=600
...
and reloaded systemd to pick up the change.
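For reference, reloading systemd and confirming the timeout it reports can be scripted roughly as follows. This is a sketch under assumptions: the unit name kafka.service is a guess (the unit is not named here), the function name is invented, and the commands need sufficient privileges on a host running systemd.

```python
import subprocess


def reload_and_show_timeout(unit="kafka.service", run=subprocess.run):
    """Reload systemd's unit files, then return the stop timeout it reports for `unit`."""
    # Re-read unit files from disk so the edited TimeoutStopSec takes effect.
    run(["systemctl", "daemon-reload"], check=True)
    # Ask systemd which stop timeout is now in effect for the unit.
    result = run(
        ["systemctl", "show", unit, "--property=TimeoutStopUSec"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

The returned string should start with TimeoutStopUSec= followed by the effective value. The `run` parameter exists only so the function can be exercised without touching a real systemd.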
Since then, stopping the Kafka server takes about 6 minutes and rejoining the cluster after a restart takes less than a minute. Also, no more corrupted index warnings!
As the number of messages and topics in a Kafka cluster grows, it takes longer for the Kafka server to shut down safely. Systemd, or whatever alternative is used to manage the Kafka server on the brokers, needs to give the server process enough time to do so. We had 150 topics at the time of this post with about 2TB of data, and it takes about 6 minutes for Kafka to shut down safely.
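A small check like the following could flag a unit file whose stop timeout is shorter than the shutdown time you observe. It is a rough sketch: both function names are invented, it only understands plain second counts such as 600 (not spans like 10min), and it assumes the 90 second default mentioned earlier when TimeoutStopSec is not set.

```python
import configparser


def stop_timeout_seconds(unit_text, default=90):
    """Read TimeoutStopSec from unit file text; fall back to `default` if unset.

    Only plain second counts such as "600" are handled, not spans like "10min".
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    return parser.getint("Service", "TimeoutStopSec", fallback=default)


def allows_clean_shutdown(unit_text, observed_shutdown_seconds):
    """True if the configured stop timeout exceeds the observed shutdown time."""
    return stop_timeout_seconds(unit_text) > observed_shutdown_seconds
```

With the 6 minute shutdown mentioned above (360 seconds), a unit with TimeoutStopSec=600 passes the check, while one relying on the default does not.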