After setting up a two-node replicated Gluster cluster to use as a web content backend, I began testing what would happen when I brought down one of the Gluster nodes. The web nodes access the Gluster cluster using the GlusterFS native client, and I expected them to quickly see that one of the Gluster nodes was down and begin serving content from the healthy Gluster node, but that was not happening. My test simply involved rebooting one of the Gluster nodes. That node would come back online within about 30 seconds, but for those 30 seconds my website was down.
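A reboot test like this is easier to compare across configuration changes if the outage is counted rather than eyeballed. The following is a minimal sketch, not the procedure I originally used: count_failures is a hypothetical helper that runs a check command a fixed number of times and prints how many runs failed.

```shell
#!/bin/sh
# Hypothetical helper: run a check command repeatedly and count failures.
# Arguments: number of attempts, seconds to sleep between attempts,
# then the check command and its arguments.
count_failures() {
    attempts=$1
    interval=$2
    shift 2
    failed=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Any non-zero exit status from the check counts as a failed probe.
        if ! "$@" >/dev/null 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$failed"
}
```

Started on a separate machine just before the reboot, something like count_failures 60 1 curl -fsS -m 2 -o /dev/null http://www.example.com/ would probe the site about once a second for a minute (the URL is a placeholder). The number of failed probes gives only a rough estimate of the outage, since each failing probe can itself take up to the 2 second curl timeout.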
I discovered that, by default, a Gluster client waits 42 seconds before it flags an unresponsive Gluster node as down. You would think this would be easy enough to change, but it turned out to be a bit trickier than I expected.
First, I stumbled upon the network.ping-timeout Gluster volume setting. According to the documentation, this value defaults to 42 seconds. On one of the Gluster nodes, I changed it to 5 seconds with the following command (gv0 is the Gluster volume):
gluster volume set gv0 network.ping-timeout 5
However, this had no visible effect on my reboot test.
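When a volume setting appears to have no effect, it can help to confirm what value the volume actually reports. This is a sketch under two assumptions: the Gluster release supports the gluster volume get subcommand, and its output is a two-column table with the option name followed by its value. option_value is a hypothetical helper that pulls one value out of that table.

```shell
#!/bin/sh
# Hypothetical helper: read "gluster volume get" style output on stdin and
# print the value column for the option named in the first argument.
# Assumes each data row looks like: <option-name> <value>
option_value() {
    awk -v opt="$1" '$1 == opt { print $2 }'
}

# Intended use on a Gluster node (not run here):
#   gluster volume get gv0 network.ping-timeout | option_value network.ping-timeout
```

If the reported value is already 5, the setting was stored on the volume, and the remaining question is whether the clients are using it.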
Second, on both of the Gluster nodes, I tried applying a similar configuration change to the main Gluster configuration file at /etc/glusterfs/glusterd.vol by adding option network.ping-timeout 5:
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option network.ping-timeout 5
#   option base-port 49152
end-volume
After restarting the glusterd service on both Gluster nodes, nothing had changed.
Finally, on both of the Gluster nodes, I found that I needed to add a specific volume configuration block for the Gluster volume in /etc/glusterfs/glusterd.vol and set option ping-timeout 5 inside it:
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
#   option base-port 49152
end-volume

volume gv0
    type protocol/client
    option ping-timeout 5
end-volume
After restarting the glusterd service on both Gluster nodes, I was able to reboot one of the Gluster nodes (with a healthy node still online), and my website resumed serving content within about 5 seconds.
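Because the file has to be edited by hand on both Gluster nodes, it is easy for the two copies to drift apart. As a rough sketch, the hypothetical helper below (volfile_option is my own name, not a Gluster tool) prints the value of an option inside a named volume ... end-volume block, so the same check can be run on each node. It assumes the simple one-option-per-line layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of an option inside a named
# "volume <name> ... end-volume" block of a volfile.
# Arguments: volfile path, volume name, option name.
volfile_option() {
    awk -v vol="$2" -v opt="$3" '
        $1 == "volume"     { current = $2 }   # entering a block
        $1 == "end-volume" { current = "" }   # leaving the block
        # Commented-out lines start with "#", so they never match "option".
        $1 == "option" && current == vol && $2 == opt { print $3 }
    ' "$1"
}

# Intended use on each Gluster node (not run here):
#   volfile_option /etc/glusterfs/glusterd.vol gv0 ping-timeout
```

An empty result on either node would mean the gv0 block or its ping-timeout option is missing from that node's copy of the file.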