|Checking ldirectord is alive
||[Jul. 24th, 2009|01:20 pm]
Trent 'Lathiat' Lloyd (トレント)
This week's blog post is not strictly to do with MySQL, but a little more to do with highly available clusters using heartbeat and ldirectord
While many people use heartbeat with MySQL, ldirectord usage is less common in these scenarios and more often using in clustered web and mail servers. It is however sometimes used with a selection of MySQL slave servers.
ldirectord is a tool that manages IPVS in the kernel. Essentially you give it a list of servers that act as a "cluster" for a service, for example, 2 web servers. It will then setup the linux load balancer IPVS to direct to these 2 machines. However, what it does is keeps monitoring the servers, and if one of them goes away, it removes them from the cluster pool.
The problem I have run into a number of times, is that ldirectord gets stuck and stops monitoring services. However when it does this, then services stop getting updated, and if a service goes down, it won't be removed from the cluster. On top of that, I have had it get stuck on START UP, after a failover, in this case not many of the services had a chance to come up yet, and you are stuck with a cluster often with 0 nodes available to service some requests - which causes downtime.
So last night, Shane Short and myself wrote a patch to ldirectord and a nagios plugin, in order to make sure that ldirectord is doing it's job and hasn't got stuck. It works by hooking into ldirectord's '_check_real' function, which is called whenever a server is checked. It will then adjust the timestamp on a 'pulse' file, which we later check with our nagios script.
Here is the patch to ldirectord: http://lathiat.net/files/ldirectord.patch
So now, in the default configuration we will end up with a file at /var/run/ldirector.ldirectord.pulse (this file is actually in the same location as your ldirectord PID file, so if you have moved that, it will be with it) which has it's timestamp updated with each service check.
Now, we make a plugin for nagios, which I have here:
And you can configure this plugin into Nagios or Nagios NRPE as normal. You can test it is working by running the plugin manually, you should have a result like this;
OK: ldirectord pulse is regular (4 seconds since last beat)
The default timeout is 120 seconds (2 minutes) for a warning, and 300 seconds (5 minutes) for a critical alert.
Hope this helps some people using ldirectord! Personally this has caused downtime for ourselves a few times when ldirectord got stuck. I would love to hear any feedback or if you are successfully (or unsuccesfully) using this, you can contact me here: