Merlin goes distributed
For a couple of weeks now (3, to be precise), two distributed Merlin setups have been running non-stop in our lab without hiccups. I am, of course, quite proud of this. The two scenarios we're currently testing are 2*peers and 1*master+2*slaves. The new algorithm for preventing a server from running checks that some other server is supposed to take care of seems to be working extraordinarily well, and the 15 second take-over slowdown in the load-balanced peer situation is now gone. The peers share all checks between them, with each of the two peers running exactly half.
Slaves will, of course, "only" do the checks assigned to them, but the merlin module will prevent checks it doesn't know about from entering Nagios. This makes it possible to configure checks on a poller that the master needn't know anything about. Someday I'll add some marker so the daemon doesn't even send events for such checks to its masters.
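As a rough illustration of how peers can split checks without negotiating, here is a minimal Python sketch. It assumes every peer holds the same list of checks and knows its own index among the peers. The function name and the sort-then-modulo scheme are assumptions made for illustration, not necessarily what the merlin module does.

```python
def checks_for_peer(all_checks, peer_index, num_peers):
    """Return the checks this peer should run.

    Assumption: every peer has the same configuration, so sorting the
    check names gives every peer the same ordering without any messages.
    """
    ordered = sorted(all_checks)
    # Check number i goes to the peer whose index equals i modulo num_peers.
    return [name for i, name in enumerate(ordered) if i % num_peers == peer_index]


checks = ["web1;HTTP", "web1;PING", "db1;PING", "db1;MySQL"]
peer0 = checks_for_peer(checks, 0, 2)
peer1 = checks_for_peer(checks, 1, 2)
```

With a scheme like this, a surviving peer that notices its sibling is gone can simply recompute with num_peers lowered by one and pick up the rest of the checks right away.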
A lot of people have opened their eyes to Merlin and are now using it in production environments. Bold? Maybe, but they really don't have that many alternatives, and Merlin has been running steadily in a lot of production networks for weeks (or even months) now.
The next step is to merge the reports module with Merlin. This is necessary in order to allow each poller node to be a fully fledged monitoring server in its own right, while all check results must still be sent to the master so it has all the information. It shouldn't take long really, and I've already merged over the improved hash library and all its tests from the reports module, as well as some other things.
After that comes proper command forwarding. Commands concerning a particular check will have to be forwarded to the poller or peer in charge of running that particular check, or people will have to figure out which peer or slave is running the checks. Since that's very nearly impossible to do with a peered setup, this is not very practical, so command forwarding is absolutely essential in order to let sysadmins manage their monitoring systems sanely.
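The routing decision described above can be sketched in a few lines of Python. The lookup table `owner_of`, the node names and the function itself are hypothetical, and the sketch says nothing about how Merlin will actually transport the command.

```python
def route_command(check_name, command, owner_of, local_node):
    """Decide where a command concerning one check should be executed.

    owner_of is a hypothetical table mapping a check name to the node
    that runs it. Returns a tuple describing what to do with the command.
    """
    owner = owner_of.get(check_name)
    if owner is None:
        raise KeyError(f"no node is known to run {check_name!r}")
    if owner == local_node:
        return ("run_locally", command)
    return ("forward", owner, command)


owner_of = {"web1;HTTP": "peer-a", "db1;PING": "poller-1"}
local = route_command("web1;HTTP", "SCHEDULE_FORCED_SVC_CHECK", owner_of, "peer-a")
remote = route_command("db1;PING", "SCHEDULE_FORCED_SVC_CHECK", owner_of, "peer-a")
```

The point is that the sysadmin submits the command to whichever node is at hand, and the lookup, not the human, works out who is in charge of the check.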
When that's done, I'll get busy with config syncing. This will be tricky, and will naturally require me to take care of a million small and annoying things, as well as extending the merlin protocol to accept streaming data where it currently only accepts packets with a max-size of 128KiB (or possibly 64KiB, I keep forgetting).
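One way to carry a large config over a transport that only accepts bounded packets is to cut it into numbered chunks and flag the last one. The Python sketch below assumes a 128KiB limit and an invented (sequence number, is-last flag, chunk) tuple layout. It is not the merlin protocol, only an illustration of the idea.

```python
MAX_PACKET_BODY = 128 * 1024  # assumed limit; it may just as well be 64KiB


def split_into_packets(payload, max_body=MAX_PACKET_BODY):
    """Yield (sequence_number, is_last, chunk) tuples covering payload."""
    if max_body <= 0:
        raise ValueError("max_body must be positive")
    total = len(payload)
    if total == 0:
        # An empty stream is still terminated by one (empty) last packet.
        yield (0, True, b"")
        return
    for seq, start in enumerate(range(0, total, max_body)):
        chunk = payload[start:start + max_body]
        yield (seq, start + max_body >= total, chunk)


def reassemble(packets):
    """Rebuild the payload from packets that may arrive out of order."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    if not ordered or not ordered[-1][1]:
        raise ValueError("stream is incomplete")
    return b"".join(chunk for _, _, chunk in ordered)


config = b"define host { host_name web1 }\n" * 10000
packets = list(split_into_packets(config))
restored = reassemble(packets)
```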
The scenarios for config syncing we'll be supporting are "master to poller", where the master server splits out the config the poller should handle and sends it to the poller, and "peer -> peer", where a peer simply sends all its configuration to the other peer. The peer2peer scenario is relatively straightforward, since it doesn't involve any config-splitting. The master2poller scenario is a lot more tricky. Frankly, it would be a lot easier to do things the other way around, so maybe I'll throw that one in too while I'm at it.
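To show why the master2poller case needs more work than peer2peer, here is a small Python sketch of config-splitting. It assumes, purely for illustration, that a poller is responsible for a set of hostgroups. The data layout and the function name are made up.

```python
def split_config_for_poller(hosts, services, poller_hostgroups):
    """Pick out the part of the config one poller should handle.

    hosts maps host_name -> set of hostgroup names (assumed layout).
    services is a list of (host_name, service_description) pairs.
    """
    selected_hosts = {
        name for name, groups in hosts.items() if groups & poller_hostgroups
    }
    # A service follows its host: keep it only if the host was selected.
    selected_services = [(h, d) for h, d in services if h in selected_hosts]
    return sorted(selected_hosts), selected_services


hosts = {
    "web1": {"stockholm"},
    "web2": {"gothenburg"},
    "db1": {"stockholm", "databases"},
}
services = [("web1", "HTTP"), ("web2", "HTTP"), ("db1", "MySQL")]
poller_hosts, poller_services = split_config_for_poller(
    hosts, services, {"stockholm"}
)
```

A real split would also have to drag along everything the selected objects refer to (contacts, timeperiods, commands and so on), which is where the million small and annoying things come in. The peer2peer case skips all of this and just sends everything.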
libgit2 at Google Summer of Code
I've been quite busy the past few weeks. Besides vacationing, with all that that entails in the form of beach volleyball, partying and lazing around in the sun, I've also been working with a guy named Vicent Marti and his mentor, Scott Chacon (from GitHub). Both have put quite a lot of effort into making the Google Summer of Code 2010 a really good year for libgit2. Ramsay Jones, the co-maintainer who refuses push access, should also be mentioned. He's been working tirelessly to make sure Vicent's recent changes keep the library working flawlessly on Windows boxen.
The goal for libgit2 summer of code was to add an efficient and functional revision walking machinery, as well as capabilities to manipulate the index and references in all the ways that git.git can.
This may not seem like such a great chore until one considers the fact that Shawn O. Pearce, the man who started the libgit2 effort, who has contributed tens of thousands of lines of code to git.git itself and who now maintains Gerrit and JGit, spent several months writing up the revision walking machinery for JGit. It becomes even more daunting when you realize that the code to read objects out of packfiles wasn't in libgit2 when Vicent started his project.
So far, Vicent has added a commit walking machinery with plenty of tests to it. Once we add branch, tag and reflog parsing code to libgit2, it will be possible to write a "git fsck"-like application using only libgit2 calls to access the repository. This is something of a holy grail in terms of capabilities, since such an application would need to be able to read and parse all types of objects from all types of sources. The writing of such an application is not in itself a goal for libgit2 in GSoC2010, but adding the capabilities to do it is.
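For readers who haven't met the term, the core of a revision walk can be sketched in Python over a toy in-memory commit graph. This is not libgit2's API or Vicent's implementation. The dictionary layout and the function name are invented, and a real walker also has to read and parse the commits from loose objects and packfiles.

```python
import heapq


def walk_commits(commits, tip):
    """Yield the commit ids reachable from tip, newest first.

    commits maps a commit id to (timestamp, [parent ids]) (toy layout).
    """
    seen = {tip}
    # heapq is a min-heap, so the timestamp is negated to pop newest first.
    heap = [(-commits[tip][0], tip)]
    while heap:
        _, commit_id = heapq.heappop(heap)
        yield commit_id
        for parent in commits[commit_id][1]:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-commits[parent][0], parent))


commits = {
    "a": (1, []),
    "b": (2, ["a"]),
    "c": (3, ["a"]),
    "d": (4, ["b", "c"]),  # a merge of b and c
}
order = list(walk_commits(commits, "d"))
```

The seen set is what keeps the shared ancestor "a" from being visited twice after the merge.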
Vicent is, in short, doing a stellar job. His mentor, Scott Chacon, has also started writing Python bindings. It makes sense, I guess. He works for (or owns, I'm not sure) GitHub, and every piece of Python code he can offload to an open source implementation of git is of course a piece of code he needn't maintain, or pay to maintain, himself.
All in all though, it's working out quite well and the community is already benefiting from Vicent's work. If you want to help out, make sure to check out the official public libgit2 repository at git://repo.or.cz/libgit2.git
Big thanks go out to Ramsay Jones, Vicent Marti and Scott Chacon for putting so much effort into this. If the pace continues the way it has, we should have the entire library in just a few months.
Merlin contributions
Merlin development has really kicked off. Single-server database support is in production at all our customers (all 350 or so of them), and people in the community have started using it in production in distributed environments.
Three people in particular have contributed awesomely to making Merlin better for distributed environments. The first to pick it up for this purpose is a guy named Russel Jennings. He's written several concise bug-reports and done tireless testing with various versions to find where some bug was introduced and which versions of the Merlin daemon work well together. A great big thanks for that, Russel!
The second is a guy named Sean Millichamp. He's gone the extra mile and has started sending in patches for bugs he's found. So far, he's contributed 10 patches, making him the second most prominent developer for the merlin module-daemon pair. His patches hold a very high standard indeed, and I have great hopes that we'll see more of his excellent contributions making it into future releases.
Jean-Marc Le Fevre has also contributed some minor patches that he deserves recognition for.
Numerous other people have reported issues with Merlin and contributed to the Wiki and HOWTOs. Thank you all.
Merlin progress report
I'm clearly a workaholic when I'm fiddling with stuff I really like, and all the community interest in Merlin and Ninja lately has just made me a pure-bred hacker fanatic.
So I've implemented the state retention stuff in Merlin. Turns out that all that was really required was to make sure the status and object import works ok and is up-to-date (so I implemented an automagic way of making that happen). Then I can just read the current status from the database. I use an array sorted by object name ("host_name" for hosts and "host_name;service_description" for services) so I can use a binary search. 3,000,000 lookups of some randomly chosen nodes in a config with 15k hosts complete in just under 2.2 seconds. Quite impressive. Especially so when about 0.8 seconds is spent loading all the states and sorting the array in the first place.
I've also had a chance to look over the cross-host event transport stuff, which was subtly broken due to a brainfart of mine in a ternary operation. I've tested it and it works just fine again now.
With the changes mentioned above, Merlin is rapidly approaching production quality in terms of its planned feature-set, so I've just released v0.5 in the hope of attracting some more testers.
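The lookup scheme is easy to show in a few lines. Merlin itself is not written in Python, so treat this as a minimal sketch of the sorted-array-plus-binary-search idea with made-up function names and state values, not as the actual code.

```python
import bisect


def build_index(states):
    """Sort (object_name, state) pairs by name into two parallel lists."""
    ordered = sorted(states, key=lambda pair: pair[0])
    names = [name for name, _ in ordered]
    values = [state for _, state in ordered]
    return names, values


def lookup(names, values, object_name):
    """Binary-search the sorted names; return the state, or None if absent."""
    i = bisect.bisect_left(names, object_name)
    if i < len(names) and names[i] == object_name:
        return values[i]
    return None


# Hosts are keyed "host_name", services "host_name;service_description".
states = [("web1;HTTP", 0), ("db1", 1), ("web1", 0), ("db1;MySQL", 2)]
names, values = build_index(states)
db_service_state = lookup(names, values, "db1;MySQL")
missing = lookup(names, values, "no-such-host")
```

A sorted array of 15,000 names needs at most about 14 string comparisons per lookup, which goes a long way towards explaining the numbers above.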
Let it rip, people, and make sure to let me know how it's going. Merlin can be downloaded from our git repositories using the following command:
git clone git://git.op5.org/nagios/merlin.git
Cheerios for now.
Nagios for huge networks
With the recent changes to the Nagios core development team, patches have been flooding in to the nagios-devel list. There's been such a flurry of improvements that I've actually had to stop working on Ninja and Merlin entirely over the past two weeks and just work on testing, adding and commenting on patches sent to the list. In view of that, I must say I'm convinced Ethan did the right thing when he extended the core dev team a bit.
However, this post is mostly about one particular patch from a guy named Jean Gabès. The patch speeds up Nagios' circular-parent-child dependency checks a *lot*. In a network 300 levels deep (root-host -> lvl1-child -> lvl2-child -> ... -> lvl300-child) where each level in itself has 500 hosts, vanilla Nagios had to be Ctrl-C'd out of after 53 minutes, while Nagios with Jean's patch completed in less than seven seconds (a speedup of more than 51000%!).
For a more modest network of 15000 nodes (30 levels deep, 500 hosts in each level), vanilla Nagios completed a configuration check in 3 minutes and 33 seconds, while patched Nagios did the exact same job in less than one second.
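For the curious, a circular parent check can be done in time linear in the number of hosts and parent links with a depth-first search that marks hosts as unvisited, in progress or done. The Python sketch below shows that general technique on a small layered network like the ones above. It is not Jean's patch, and the function name and data layout are invented for the example.

```python
def find_parent_cycle(parents):
    """Return a list of hosts forming a parent loop, or None if there is none.

    parents maps each host name to the list of its parent host names.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {host: WHITE for host in parents}
    for start in parents:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        path = [start]
        stack = [iter(parents[start])]
        while stack:
            for parent in stack[-1]:
                state = color.get(parent, WHITE)
                if state == GREY:
                    # The parent is already on the current path: a loop.
                    return path[path.index(parent):] + [parent]
                if state == WHITE:
                    color[parent] = GREY
                    path.append(parent)
                    stack.append(iter(parents.get(parent, [])))
                    break
            else:
                # All parents of the host on top of the path are checked.
                color[path.pop()] = BLACK
                stack.pop()
    return None


# A small layered network: 30 levels deep with 5 hosts in each level.
levels, width = 30, 5
tree = {"root": []}
for level in range(1, levels + 1):
    for n in range(width):
        parent = "root" if level == 1 else f"lvl{level - 1}-host{n}"
        tree[f"lvl{level}-host{n}"] = [parent]

no_cycle = find_parent_cycle(tree)
loop = find_parent_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Each host is marked done exactly once and never walked again, so adding levels grows the work in proportion to the number of hosts rather than exploding with the depth.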
Awesome, Jean. Thanks a lot indeed :-)
07/09/10 03:39:30 pm