We were upgrading our Kafka cluster to the newest version, following the rolling upgrade plan in the documentation. When we restarted brokers, even without upgrading them, they took a long time to rejoin the cluster. The Kafka server logs showed warnings about corrupted index files and that Kafka was rebuilding the indexes:

[2019-07-08 12:12:58,901] WARN Found a corrupted index file due to requirement failed:
Corrupt index found, index file (/kafka-logs/email-opens/00000000000008365023.index)
  has non-zero size but the last offset is 8365023 which is no larger than the
  base offset 8365023.}. deleting /kafka-logs/email-opens/00000000000008365023.timeindex,
  /kafka-logs/email-opens/00000000000008365023.index and rebuilding index...
  (kafka.log.Log)

This warning appeared for every single topic.
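
To gauge how widespread the warnings are, the affected index files can be pulled out of the server log and grouped by partition directory. The following is only a minimal sketch: the function name corrupt_index_dirs is hypothetical, and it assumes the index path appears on the same line as "index file (", as in the warning above.

```shell
# Hypothetical helper: list partition directories that reported a corrupted
# index, with the number of affected index files per directory.
# $1 is the path to a Kafka server log (its location depends on your setup).
corrupt_index_dirs() {
  grep -o 'index file ([^)]*\.index)' "$1" \
    | sed -e 's/^index file (//' -e 's|/[^/]*)$||' \
    | sort | uniq -c | sort -rn
}
```

Calling corrupt_index_dirs with the path to a broker's server log prints one line per partition directory, most affected first.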

We keep a lot of data on our Kafka cluster for analytics, and seeing such errors worried us a lot. After quite some debugging, we realised that our systemd configuration was the cause.

Systemd wasn’t letting Kafka shut down safely

We manage Kafka on our brokers using a systemd service. During the shutdown phase of the restart, systemd was not letting Kafka complete its shutdown and was forcibly terminating it instead.

We realised this because we knew from previous experience that Kafka logs a “Shutdown completed” message when it finishes its shutdown routine, and that message was missing from our logs.
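
A quick way to check for this message is to search the broker's server log for it. The sketch below is an assumption-laden helper, not part of our setup: the function name last_clean_shutdown is hypothetical, and because the exact wording of the message may differ between Kafka versions, the pattern is kept deliberately loose.

```shell
# Hypothetical helper: print the most recent shutdown-completion line from a
# Kafka server log, and return non-zero if there is none.
# The pattern is an assumption: it matches both "Shutdown completed" and
# "shut down completed", case-insensitively.
last_clean_shutdown() {
  match=$(grep -Ei 'shut ?down completed' "$1" | tail -n 1)
  [ -n "$match" ] && printf '%s\n' "$match"
}
```

If the timestamp of the printed line is older than the last restart, or nothing is printed at all, the broker most likely did not finish its shutdown routine.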

It turns out that systemd’s default timeout for stopping a service is 90 seconds. If the service’s processes have not stopped by then, systemd forcibly terminates them with SIGKILL.

We increased TimeoutStopSec to 600 seconds in the kafka.service file:

...

[Service]
Type=simple
PIDFile=/var/run/kafka.pid
User=experteer
Group=experteer
TimeoutStopSec=600
...

and reloaded systemd with systemctl daemon-reload to pick up the changes.
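
To confirm that systemd has picked up the new value, the effective stop timeout of the unit can be queried. This is a small sketch rather than something from our setup: the wrapper name stop_timeout is hypothetical, and it defaults to the kafka.service unit name used above.

```shell
# Hypothetical wrapper: print the stop timeout systemd will apply to a unit.
# systemd exposes the TimeoutStopSec setting as the TimeoutStopUSec property.
# $1 is the unit name; defaults to kafka.service as used in this post.
stop_timeout() {
  systemctl show -p TimeoutStopUSec "${1:-kafka.service}"
}
```

After the reload, running stop_timeout should report a value equivalent to 600 seconds instead of the 90-second default.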

Since then, stopping the Kafka server takes about 6 minutes and rejoining the cluster after a restart takes less than a minute. Also, no more corrupted index warnings!

Summary

As the number of messages and topics in a Kafka cluster grows, it takes longer for the Kafka server to shut down safely. Systemd, or whatever alternative is used to manage the Kafka server on the brokers, needs to give the server process enough time to shut down safely. At the time of this post we had 150 topics with about 2TB of data, and it took about 6 minutes for Kafka to shut down safely.