Category: Nagios

Merlin goes distributed

by Andreas Ericsson

For the past three weeks, two distributed Merlin setups have been running non-stop in our lab without hiccups. I am, of course, quite proud of this. The two scenarios we're currently testing are 2*peers and 1*master+2*slaves. The new algorithm for preventing a server from running checks that some other server is supposed to take care of seems to be working extraordinarily well, and the 15-second take-over slowdown in the loadbalanced peer situation is now gone. Peers share all checks evenly between them, so in the two-peer setup each peer runs exactly half of the checks.
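To make the "each peer runs its share" idea concrete, here is a minimal sketch of one way to split checks deterministically between peers. It is not necessarily the algorithm Merlin uses; the function name `checks_for_peer`, the peer numbering and the modulo rule are all assumptions made for illustration.

```python
def checks_for_peer(check_names, peer_id, num_peers):
    """Return the checks that the peer numbered `peer_id` should run.

    Assumption: every peer has the same configuration, so sorting the
    check names gives every peer the same ordering to agree on.
    """
    ordered = sorted(check_names)
    # A check at position i belongs to peer (i % num_peers).
    return [name for i, name in enumerate(ordered) if i % num_peers == peer_id]


checks = ["db1;disk", "db1;load", "web1;http", "web1;ping"]
peer0 = checks_for_peer(checks, 0, 2)
peer1 = checks_for_peer(checks, 1, 2)

# If one peer disappears, the survivor recomputes with num_peers=1
# and ends up owning every check.
survivor = checks_for_peer(checks, 0, 1)
```

Because each peer can compute its own share locally, no negotiation is needed in this sketch when the number of active peers changes.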

Slaves will of course "only" do all checks assigned to them, but the merlin module will prevent checks it doesn't know about from entering Nagios. This makes it possible to configure checks on a poller that the master needn't know anything about. Someday I'll add some marker so the daemon doesn't even send events for such checks to its masters.

A lot of people have opened their eyes to Merlin and are now using it in production environments. Bold? Maybe, but they really don't have that many alternatives, and Merlin has been running steadily in a lot of production networks for weeks (or even months) now.

Next step now is to merge the reports-module with merlin. This is necessary in order to allow each poller node to be a fully fledged monitoring server in its own right, while all checkresults must still be sent to the master so it has all the information. It shouldn't take long really, and I've already merged over the improved hash library and all its tests from the reports-module, as well as some other things.

After that comes proper command forwarding. Commands concerning a particular check will have to be forwarded to the poller or peer in charge of running that particular check, or people will have to figure out which peer or slave is running the checks. Since that's very nearly impossible to do with a peered setup, this is not very practical, so command forwarding is absolutely essential in order to let sysadmins manage their monitoring systems sanely.
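The routing decision described above could look roughly like the sketch below. The `owners` table, the node names and the `route_command` function are hypothetical; how Merlin will actually track which node runs which check is not described here.

```python
def route_command(target, owners, local_node):
    """Decide where a command about `target` should be handled.

    `owners` maps a check name to the node in charge of running it
    (an assumed lookup table, for illustration only).
    Returns a tuple (action, node).
    """
    owner = owners.get(target)
    if owner is None or owner == local_node:
        # Unknown checks and our own checks are handled locally.
        return ("run_locally", local_node)
    return ("forward", owner)


owners = {"web1;http": "poller-a", "db1;load": "master"}
decision_remote = route_command("web1;http", owners, "master")
decision_local = route_command("db1;load", owners, "master")
```

The point of the sketch is that the sysadmin submits the command to whichever node is at hand and never has to know who owns the check.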

When that's done, I'll get busy with config syncing. This will be tricky, and will naturally require me to take care of a million small and annoying things, as well as extending the merlin protocol to accept streaming data where it currently only accepts packets with a max-size of 128KiB (or possibly 64KiB, I keep forgetting).
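One common way to push large data through a protocol with a fixed maximum packet size is to split it into numbered chunks and reassemble them on the other side. The sketch below shows that idea only; the `(seq, total, body)` packet layout is an assumption and not the merlin wire format, and the 128KiB limit is taken from the figure mentioned above.

```python
MAX_BODY = 128 * 1024  # assumed per-packet limit, from the 128KiB figure above


def split_into_packets(payload: bytes, max_body: int = MAX_BODY):
    """Split `payload` into (seq, total, body) tuples of at most max_body bytes."""
    total = max(1, -(-len(payload) // max_body))  # ceiling division, at least 1
    return [
        (seq, total, payload[seq * max_body:(seq + 1) * max_body])
        for seq in range(total)
    ]


def reassemble(packets):
    """Rebuild the payload; raise ValueError if a chunk is missing or duplicated."""
    ordered = sorted(packets, key=lambda p: p[0])
    total = ordered[0][1]
    if [p[0] for p in ordered] != list(range(total)):
        raise ValueError("missing or duplicate chunk")
    return b"".join(p[2] for p in ordered)


config_blob = b"x" * 300000
packets = split_into_packets(config_blob)
restored = reassemble(reversed(packets))  # arrival order does not matter
```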

The scenarios for config syncing we'll be supporting are "master to poller", where the master server splits out the config the poller should handle and sends it to the poller, and "peer -> peer", where a peer simply sends all its configuration to the other peer. The peer2peer scenario is relatively straightforward, since it doesn't involve any config-splitting. The master2poller scenario is a lot more tricky. Frankly, it would be a lot easier to do things the other way around, so maybe I'll throw that one in too while I'm at it.
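As a rough illustration of why master2poller is the harder case, here is a minimal sketch of config splitting. It assumes that a poller is responsible for a set of hostgroups and that the config is already parsed into plain Python structures; both are assumptions for the example, and the real job also has to deal with contacts, commands, dependencies and the million small things mentioned earlier.

```python
def split_config_for_poller(hosts, services, poller_hostgroups):
    """Pick out the hosts and services one poller should handle.

    hosts:    dict of host_name -> list of hostgroup names
    services: list of (host_name, service_description) tuples
    """
    wanted = set(poller_hostgroups)
    # Keep every host that belongs to at least one of the poller's hostgroups.
    poller_hosts = {
        name: groups for name, groups in hosts.items() if wanted & set(groups)
    }
    # A service follows its host.
    poller_services = [(h, s) for h, s in services if h in poller_hosts]
    return poller_hosts, poller_services


hosts = {"web1": ["dmz"], "db1": ["internal"], "web2": ["dmz", "internal"]}
services = [("web1", "http"), ("db1", "load"), ("web2", "http")]
dmz_hosts, dmz_services = split_config_for_poller(hosts, services, ["dmz"])
```

The peer2peer case needs none of this: the whole configuration is sent as-is.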

Merlin progress report

by Andreas Ericsson

I'm clearly a workaholic when I'm fiddling with stuff I really like, and all the community interest in Merlin and Ninja lately has just made me a pure-bred hacker fanatic.

So I've implemented the state retention stuff in Merlin. Turns out that all that was really required was to make sure the status and object import works ok and is up-to-date (so I implemented an automagic way of making that happen). Then I can just read the current status from the database. I use an array sorted by object name ("host_name" for hosts and "host_name;service_description" for services) so I can use a binary search. 3000000 lookups of some randomly chosen nodes in a config with 15k hosts complete in just under 2.2 seconds. Quite impressive. Especially so when about 0.8 seconds is spent loading all the states and sorting the array in the first place.
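The lookup scheme described above (a sorted array plus binary search) can be sketched in a few lines. Merlin itself is written in C; this Python version only illustrates the technique. The function names are made up for the example, and keeping hosts and services in one combined array is a simplification.

```python
import bisect


def build_index(states):
    """Turn a dict of key -> state into two parallel, key-sorted lists."""
    keys = sorted(states)
    return keys, [states[k] for k in keys]


def lookup(keys, values, host_name, service_description=None):
    """Binary-search the sorted keys; return the state or None if absent."""
    if service_description is None:
        key = host_name
    else:
        key = f"{host_name};{service_description}"
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


states = {"web1": "UP", "db1": "DOWN", "web1;http": "OK", "db1;load": "CRITICAL"}
keys, values = build_index(states)
host_state = lookup(keys, values, "db1")
service_state = lookup(keys, values, "web1", "http")
```

Sorting is paid for once at load time, after which each lookup needs only about log2(n) string comparisons.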

I've also had a chance to look over the cross-host event transport stuff, which was subtly broken due to a brainfart of mine in a ternary operation. I've tested it and it works just fine again now.

With the changes mentioned above, Merlin is rapidly approaching production quality in terms of its planned feature-set, so I've just released v0.5 in the hope of attracting some more testers.

Let it rip, people, and make sure to let me know how it's going. Merlin can be downloaded from our git repositories using the following command:


  git clone git://git.op5.org/nagios/merlin.git

Cheerios for now.

Nagios for huge networks

by Andreas Ericsson

With the recent changes to the Nagios core development team, patches have been flooding in to the nagios-devel list. There's been such a flurry of improvements that I've actually had to stop working on Ninja and Merlin entirely over the past two weeks and just work on testing, adding and commenting on patches sent to the list. In view of that, I must say I'm convinced Ethan did the right thing when he extended the core dev team a bit.

However, this post is mostly about one particular patch from a guy named Jean Gabès. The patch speeds up Nagios' circular-parent-child dependency checks a *lot*. In a network 300 levels deep (root-host -> lvl1-child -> lvl2-child -> ... -> lvl300-child) where each level in itself has 500 hosts, vanilla Nagios had to be Ctrl-C'd out of after 53 minutes, while Nagios with Jean's patch completed in less than seven seconds (a speedup of more than 51000%!).
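For readers wondering how such a check can be made fast at all: a circular parent-child relationship is a cycle in a directed graph, and a depth-first search that marks each host as "in progress" or "done" finds one while visiting every host only once. The sketch below shows that standard technique; it is not Jean's patch, and the `parents` mapping is an assumed representation of the host definitions.

```python
def has_parent_cycle(parents):
    """Return True if the host -> parents mapping contains a cycle.

    parents: dict of host_name -> list of parent host names.
    Uses an iterative depth-first search so deep trees don't hit
    Python's recursion limit.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, done
    color = {}
    for start in parents:
        if color.get(start, WHITE) != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(parents.get(start, ())))]
        while stack:
            node, remaining = stack[-1]
            for parent in remaining:
                state = color.get(parent, WHITE)
                if state == GREY:
                    return True  # walked back into the current path
                if state == WHITE:
                    color[parent] = GREY
                    stack.append((parent, iter(parents.get(parent, ()))))
                    break
            else:
                # All parents of this host are explored.
                color[node] = BLACK
                stack.pop()
    return False


ok_tree = {"lvl2": ["lvl1"], "lvl1": ["root"], "root": []}
broken_tree = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

Hosts already marked "done" are never explored again, which is what keeps the running time proportional to the number of hosts and parent links.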

For a more modest network of 15000 nodes, (30 levels deep, 500 hosts in each level), vanilla Nagios completed a configuration check in 3 minutes and 33 seconds, while patched Nagios did the exact same job in less than one second.

Awesome, Jean. Thanks a lot indeed :-)

The future of Nagios

by Andreas Ericsson

Some of you might know that a fork of Nagios has appeared recently. If you don't, go read about it in the nagios-devel mailing list archives. They're available on sourceforge somewhere, but I can't be bothered to look for them right now.

Working for a company that makes a living out of supporting and writing addons for Nagios, I must say I'm a bit sad. Being an enthusiastic and optimistic guy, I must say I'm thrilled.

A couple of facts before we set off:

  • The fork was instigated largely by German members of the community. It appears to have been spearheaded by a German company that makes its living selling customized Nagios solutions and/or support. I don't know this for sure, but it sure looks as if that's what's happened.
  • The German company has unlawfully used the Nagios trademark after being asked not to do so. It has also registered Nagios as a trademark in Germany, which is a huge slap in the face of an opensource project. They are naturally not on the best of terms with Nagios' founding father, Ethan, at the moment.
  • Ethan has been absent working with the aforementioned lawsuit (or whatever it is a trademark violation results in when friendly talk is no longer enough), and also trying to put together a new web-based user interface for Nagios.
  • Patches from all levels of the community have been erratically ignored during Ethan's absence. Some were picked up, but as many or more slipped between the cracks.
  • Ethan has always been the single person with commit access to the Nagios CVS (yuk) repository.
  • The fork uses git to track their patches.



The community developers have voiced a complaint that they cite as the primary reason for the fork:

Nagios is not being developed fast and openly enough.

I agree with this, and I'm currently discussing expanding the developer base with Ethan. Unfortunately, the scarce resource "trust" is even scarcer for those developers who joined the fork, which leaves the available candidates rather few. Happily, I count myself among them, and apparently so does Ethan. He emailed me away from public channels asking if I'd be willing to become a core developer, and op5 has graciously given a tentative promise to devote one to two days per week to Nagios development / patch management. Nothing's settled yet, but development has to continue even if the core maintainer takes a leave of absence, so one way or another, we'll make sure this happens.


In a perfect world (i.e., one where I get to decide everything ;)), here's what will happen:

  • Nagios incorporates the good changes that the fork produces.
  • The benevolent but previously frustrated developers from the fork hop back to working on Nagios when they see it's once again moving forward. They could actually do that by keeping on working on their fork, although that would set them apart from the Nagios community a bit rather than make them members of it.
  • Nagios development picks up its pace and a new GUI is added to it which fulfills everyone's wildest dreams.
  • Nagios development moves to using git instead of CVS. Since git actually invites people to fork the code but makes it incredibly simple to merge those changes back to the pre-fork project again, there could be any number of forks and Nagios would be the grand total of the best of all of them. Who would win on that? Well, the Nagios users for a start, and Nagios itself, and Ethan, and every company making a living off of Nagios one way or another. So that'd be a win-win-win-win-win situation? I like it.




For those who wonder where I'm standing in all this, I'll be working with Ethan to make the community developers happy while at the same time trying to prevent the community users from living through the confusion that a long-lived fork means. In the end, I hope Nagios becomes a better product with a stronger and better community backing it, which seems rather inevitable now that more people than ever are working frantically at making it so. Hopefully it results in a happy community where The Right People(tm) are part of a Nagios steering committee or some such.


Time will tell. It always does ;-)

Making Nagios even more awesome

by Andreas Ericsson

It's been quite a while since I blogged anything now, and the reason is that I, along with my colleagues here at op5, have been hard at work producing a new GUI for Nagios. Naturally it will be GPL'd, and equally naturally it will be blazing fast, awesomely pretty and contain lots and lots of cool stuff, such as our reporting tool (pretty graphs for the suits), a new flash-based network map (based on RaVis by Google), and the Merlin module.

What with me being the company's die-hard C programmer, I'm naturally taking care of finishing off the Merlin module.

As some of you know, the merlin module was originally designed to be an event transport for effortless redundant and loadbalanced network monitoring. Since modules running inside Nagios have certain restrictions put upon them, we decided to empower the Merlin module with the capabilities to insert events into a database (a rather straightforward patch). The really cool part about it is that Merlin still retains its multiplexing networking capabilities, which means that you can now use Merlin as a (very, very fast) way of communicating Nagios events to other servers.

Since merlin is designed to work with a plethora of different topologies, this means that Nagios will easily be the most scalable network monitoring system of them all. If you want to monitor Google's server-park from a single tool, you'll have to use Nagios. If you want to monitor Second Life's vast and widespread server network, Nagios is the only choice. If you want to monitor the entire internet, Nagios can do that (provided you spend "some" money on hardware ;-))

If you're a handy guy when it comes to doing certificate authentication in C, I might have a job for you though. Currently all nodes have to be configured upstream in their chain of responsibility. The capability to add random servers without modifying the configuration of running servers would be even more awesome :)
