Sunday, November 30, 2008

The sad state of open source monitoring tools

I've been looking lately at open source network monitoring tools. I'm not impressed at all by what I've seen so far. Pretty much the least common denominator when it comes to this type of tool is Nagios, which is not a bad tool (I used it a few years ago), but did you see its Web interface? It's soooooo 1999 -- think 'Perl CGI scripts'!

A slew of other tools are based on the Nagios engine, and are trying hard to be more pleasing to the eye -- Opsview and GroundWork are some examples. Opsview seems to be just a wrapper around Nagios, without many improvements in either functionality or UI.

I looked at the GroundWork screencast and it seemed promising, but when I tried to install it I had a very unpleasant experience. First of all, the install script uses curses (did those guys hear about unattended installs?), and requires Java 1.5. Although I had both Java 1.5 and 1.6 on my CentOS server, and JAVA_HOME set correctly, it didn't stop the installer from complaining and exiting. Good riddance.

I should say that the first open source network monitoring tool that I tried was Zenoss, which is supposed to be the poster child for Python-based monitoring tools. Believe me, I tried hard to like it. I even went back and gave it a second chance, after noticing that other tools aren't any better. But to no avail -- I couldn't get past the sensation that it's a half-baked tool, with poor documentation and an obscure user interface. It could work fine if you just want to monitor some devices with SNMP, but as soon as you try to extend it with your own plugins (called Zen Packs), or if you try to use their agents (called Zen Plugins), you run into a wall. At least I did. I got tired of Python tracebacks, obscure references to 'restarting Zope' (I thought it was based on Twisted), fiddling with values for the so-called zProperties of a device, trying unsuccessfully to get ssh key authentication to work with the Zen Plugins, etc., etc. I'm not the only one who went through these frustrations either -- there are plenty of other users saying in the Zenoss forums that they've had it, and that they're going to look for something else. Which is what I did too.

I also tried OpenNMS, which was better than Zenoss, but it still had a CGI feel in terms of its Web interface.

So...for now I settled on Hyperic. It's a Java-based tool with a modern Web interface, very good documentation, and it's extensible via your own plugins (which you can write in any language you want, as long as you conform to some conventions which are not overly restrictive). Hyperic uses agents that you install on every server you need to monitor. I don't mind this; I find it better than configuring SNMP to death. It does have its quirks -- for example it calls devices that it monitors 'platforms' (instead of just 'devices' or 'servers'), and it calls the plugins that monitor specific services 'servers' (instead of services). Once you get used to it, it's not that bad. However, I wish there were a standard nomenclature for this stuff, as well as a standard way for these tools to inter-operate. As it is, you have to learn each tool and train your brain to ignore all the weirdness that it encounters. Not an optimal scenario by any means.

I'm very curious to see what tools other people use. If you care to leave a comment about your monitoring tool of choice, please do so!

I'll report back with more stuff about my experiences with Hyperic.

63 comments:

Anonymous said...

Try having a look at moodss.
http://moodss.sourceforge.net/

Anonymous said...

There's good money to be made in commercializing monitoring software packages. Why would a developer want to miss out on that cash?

Justin A said...

https://labs.omniti.com/docs/reconnoiter/

and

https://labs.omniti.com/trac/resmon

might be worth a look.

Anonymous said...

I looked at some of these same tools recently, and share the conclusion about the sad state:

http://kylecordes.com/2008/10/19/network-system-monitoring-smorgasbord/

I've settled on Zabbix for the moment, but I think Ganglia might end up part of the mix here also.

Anonymous said...

You may want to look into Zabbix. It's got a nicer UI than Nagios and seems to be getting a good amount of ongoing development.

Anonymous said...

Zabbix - Simple or complex as you want it to be.
Not affiliated but I have used it to monitor community television station infrastructure and now use it to monitor remote data collection units for commercial installations.

Anonymous said...

People use Nagios for a reason. It does the job. Nobody cares about a wanky web 2.0 interface, ajaxified. They just want a monitoring tool.

Heri said...

I use monit/munin. It's very easy to install, and the interface is slick. It's also in active development.

Anonymous said...

Nagios rocks! Can you expound on what you don't like in its interface? Eye candy for unwashed consumers :-) is one thing, but in tools for my own use in monitoring a site, I care more about reliability and familiarity than a gee-whiz interface.

I'm not a total Luddite on this score -- I'm not suggesting we should be satisfied with ASCII character charts. But, surely, a "1999" interface, even if we agreed that's what it has (which I don't, but let's put that aside for the moment), is perfectly adequate for the task?

Anonymous said...

zabbix works well.

Anonymous said...

Zabbix? It's pretty major. 1.6 came out with a bunch of interesting features.

We've been using it for over a year now and I think it's a good product.

http://www.zabbix.com/

Anonymous said...

We are facing the same problem at work. We're using zabbix but we aren't happy with it; we have found potential alternative solutions but haven't been able to try them out yet.

One is based on nagios: FAN (http://fannagioscd.sourceforge.net), the other one "looks" pretty nice and is Pandora FMS (http://pandorafms.org).

Keep us posted if you try them or find something nice. ;-)

Anonymous said...

My company uses zabbix. Not sure what it has in the lines of network specific stuff though. I can't say I'm a huge fan.

Unknown said...

It's not a bad tool, but then you imply that the UI is 'so 1999'.

Grow up. The tool is what does the job. If the UI is dated by today's standards, but the tool is still good, who cares?

/greg said...

That's "too hard", not "to hard"...

Interesting article. I'll have to look into these tools and see what they are about. "Perl/CGI" interface doesn't necessarily mean bad, it just means not "flashy". Linux CGI is even worse than Perl/CGI, but damn, it works...

Unknown said...

I have used OSSIM before, its only issue is that it is designed to run on Debian machines and we were running it on a Gentoo Xen Network.

Its documentation was also slightly lacking; however, as an integrated solution, its plugin interface was quite versatile.

OSSIM supports nagios, snort, hids, ntop, and numerous other underlying tools.

Jonathan Ellis said...

Depending on what you need, a more specialized monitoring system can be better than a general purpose one. For instance, imo Ganglia is best in class at what it does: http://ganglia.info/

kordless said...

There's a free version of Splunk available which does up to 500MB of logs and/or inputs a day. With some ingenuity you can monitor a whole lot of stuff within that daily limit.

Most of Splunk's power comes from its ability to index textual data, and do field extractions on the text it eats. You can do scripted inputs to monitor servers, networks, disks, routers, etc., and then run reports and alerts based on the searches you build.

Splunk differs a little from these other products in that you need to build up the searches, reports, and dashboards needed for your particular setup. However, Splunk also has a bunch of apps you can install that extend the initial interface. These apps are currently located on splunkbase.com.

Anonymous said...

Glad you had a good experience with Hyperic, Grig. Please let us know how we can continue to make it better.

-javier

Elliot "statik": Murphy said...

Hi Grig, I haven't found anything great yet myself either, but I'm keeping my eye on a few things:

Munin, at http://munin.projects.linpro.no/
Reconnoiter, which looks like it is based on some fantastic real world experience: http://labs.omniti.com/trac/reconnoiter
Graphite, which has some interesting changes from RRDTool (but I'm not yet fully convinced) http://graphite.wikidot.com/faq

Collectd, which looks super easy to deploy, but haven't yet tried connecting it to graphite. http://collectd.org/

Hope some of these help you!

Thomas Stromberg said...

You may want to give Zabbix a shot as well: http://www.zabbix.com/ .. As a former Nagios user, I especially love that you can write quick 'command lines' to run on remote hosts instead of a plug-in, for instance, monitor how many users there are, and see graphs over time:

user_cmd=wc -l /etc/passwd | cut -d" " -f1

And then just set a warning when user_cmd exceeds 5000. Because Zabbix collects all of this historical output data in MySQL, it can get a bit heavy with >100 hosts though.
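
For readers who would rather script this kind of check themselves, here is a minimal Python sketch of the same idea: count the lines in /etc/passwd and flag a warning once the count goes over a threshold. This is not how Zabbix implements it; the function names `count_lines` and `user_count_status` are made up for illustration, and the 5000 threshold is simply the number mentioned above.

```python
def count_lines(path="/etc/passwd"):
    """Count the lines in a file, roughly what `wc -l` reports."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def user_count_status(count, warn_above=5000):
    """Return "WARNING" once the count exceeds the threshold, else "OK"."""
    return "WARNING" if count > warn_above else "OK"


if __name__ == "__main__":
    # Hypothetical stand-alone use: print the count and its status.
    users = count_lines()
    print(users, user_count_status(users))
```

A monitoring agent would typically run something like this on a schedule and store each value, which is where the historical graphs come from.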

Flávio said...

In my workplace they just switched to NetIQ and it sucks too! So don't complain much.

Anonymous said...

http://bb4.org/download.html

Unknown said...

Hehe, your article is interesting as I went through almost the same process of elimination of open source monitoring tools. I too ended up deploying Hyperic to monitor 100 RHEL boxes; in the end Hyperic was unstable and lacked an RPM package for RHEL (not sure if they have fixed that now).
Also a lot of the advanced functionality was only included in the paid version, so I have given up on it. I now have a mixture of Munin and a web-based service called host-tracker.

Anonymous said...

www.zabbix.com

I've looked at most of the same tools you list at one time or another over the past 3 years. I haven't seen anything that comes close to the flexibility and scalability of zabbix in a production environment.

Anonymous said...

The Ruby people are pretty into Monit and the controversially-named God.

I've used Monit and I like that it's easier to set up and use than Nagios, and it has a prettier web interface, but it's useless for network (as opposed to system or process) monitoring. Hence the new product, M/Monit (I guess stands for meta-monit) but the fact that it's a giant binary and it "hooks right up" doesn't make me feel great.

Some friends of mine were working on a Common Lisp-based system with the even-worse name LoGS, which I believe will soon be supplanted by something called NOCtool, which is more general than just log monitoring, and more network-oriented, also written in Common Lisp and using a novel web framework called SymbolicWeb. I can't really recommend it though, since it's still in extremely heavy development.

Corey Goldberg said...

I have also been disappointed with the options in this area. I am currently in the process of rolling my own distributed monitoring system for windows (using wmi/win32).

not Python related, but also check out Cacti.

Anonymous said...

You should check out running Groundwork Community Edition in a virtual machine at

http://www.groundworkopensource.com/community/downloads/vmware.html

Unknown said...

check out www.zabbix.com

lalligood said...

I completely agree with your assessment of Nagios--not only does its interface remind me of 1999, its configuration is arguably even more ancient!

One other open source network monitor that I've only spent a few minutes with but that showed some promise is Zabbix (http://www.zabbix.com/). At least the entire application & all of the resources you want to monitor can be configured through the web interface!

I look forward to reading an update from you.

Anonymous said...

Nagios may look old-school but it works.

Anonymous said...

try jffnms, it does all the usual monitoring stuff well, autodiscovers your networks/servers, graphs everything with rrdtool, does cisco config diffs like rancid, and lots more.

http://jffnms.org/

Anonymous said...

I've used Tivoli monitoring, CA NSM and Nagios, and Nagios is by far the best and most extensible. Creating new service definitions and commands is intuitive and incredibly useful - at all the companies I have worked for that did not use nagios I have found the employees running a private Nagios server to overcome terrible commercial monitoring tools... Complaining about the interface is silly.

Anonymous said...

Have you tried Zabbix? We've been using it with multiple locations with good success.

est said...

I was expecting some tool with Open Flash Chart live status view, and a comet based HTTP webshell. :-)

Todd Stout said...

I feel your pain. My search for decent open-source monitoring tools began about 4 years ago. It is a shame that not much has changed since then.

Companies are still willing to pay big bucks for monitoring tools, and most open source companies feel that the basics of what unix provides is good enough.

I worked for a couple of years developing a commercial product in the telecom industry that provided basic cross-platform monitoring support. Unfortunately, even today, if you want to be platform agnostic, SNMP is still your only choice. SNMP only gets you so far however (not to mention it's a very old and crusty protocol).

Anonymous said...

I agree that stock nagios looks ugly as hell, but you can make it a lot better by installing the nagios "nuvola style" CSS and images from nagios exchange.

Anonymous said...

Osiris > Nagios

Anonymous said...

I would try SpiceWorks - http://spiceworks.com/

Anonymous said...

Did you _actually_ discredit Nagios because of the web interface? It seems a tad ... unprofessional to me.

My experience with Nagios is nothing but good: I use it with almost 100 servers, many in a virtual environment. I use the groups functionality extensively, SLA reporting tools and I export graphs with nagiosgraph.

From time to time with misbehaving applications I let Nagios restart and report. And believe me, when something happens and a dozen servers die, I am very happy that Nagios is a mature tool. I can rely on its reporting.

My advice is to use the most mature tool available. You want it out of your way for your daily work but you need _relevant_ information when something goes wrong.

Noah Gift said...

Monitoring is tough. I just don't think this problem has been 100% solved yet. I have had some decent experience with Zenoss, but I haven't done anything too tricky either.

Personally, I think there is a something to be said for writing some code yourself with Net-SNMP:

net-snmp and python.

I do concur though, that log file analysis with a central syslog server is a good idea too.

Anonymous said...

http://hobbitmon.sourceforge.net
Better than Hobbit, configured via text files, optional agent on monitored servers, TCP, UDP and ICMP tests inside, can write own plugins.

Unknown said...

Some functionality Opsview provides over a standard Nagios installation:

- Distributed monitoring with clustering
- SNMP trap processing / rules engine
- Automatic graphing of performance data (no configuration required)
- Data Warehouse and reporting tools
- XML configuration and monitoring APIs
- LDAP / Active Directory authentication

IMO it is not "just a wrapper around Nagios" but I'll leave everyone to form their own opinions.

-JP

Anonymous said...

You have to check out Cacti :-)!

Anonymous said...

Yet another vote here for Zabbix.

It gets the job done with a minimum of fuss; it supports basic things like SNMP and trending right out of the box, and it's very easy to add new commands.

Anonymous said...

Groundwork uses Nagios as its base. GW is just a bunch of features like reporting and graphing on top of Nagios.

There are a number of nice themes out there for Nagios. Google is your friend.

A few points. Nagios is highly scalable. I have a parent server and four child servers reporting >8000 service checks and >1200 host checks to the parent server.

Groundwork, which I use on the parent server (the child servers are plain Nagios) expects the base install to be a clean and minimum install - not some server you have piles of crap installed on.

Anonymous said...

Hey Grig,

I went through a similar process recently and chose OpenNMS. It just "works" out of the box.

Zabbix seems nice, but it wasn't as easy to set up and the design decision to store *all* monitoring data in the database means it doesn't scale.

R.

Anonymous said...

I used to use Nagios for monitoring/alerting and MRTG for trend analysis from 2000 until about a year ago. I have since moved to Zabbix. It has alerting capabilities similar to Nagios but much better trend analysis than good old MRTG. Zabbix is not perfect, but I'm happy with it. I was recently looking at Zenoss and NetXMS but I think I'll stick with Zabbix for now.

Anonymous said...

Amazing that you would discount Groundwork because you couldn't get it installed. This reviewer had no problems:

http://www.networkworld.com/columnists/2008/071608-gearhead.html

Anonymous said...

I went through the same process last spring and settled on Hyperic as well. I like being able to code monitoring stuff in Python. UI plugins can be coded in Groovy, which feels much less aggravating than Java (in which Hyperic is written).

Hyperic doesn't have much of an open source community compared to many of the other options. It isn't as widely used as something like Nagios, so web searches on any problems you run into are unlikely to come up with solutions. I have had pretty good success posting on their forums, though.

NetDiva said...

Just an FYI, GroundWork offers free support to their open source users. So if you're having trouble, it's a bit silly not to take advantage of it and open a ticket. http://www.groundworkopensource.com/services/support/community-support.html

Unknown said...

I went through a similar analysis process in the summer, comparing Nagios, OpenNMS and Zenoss ( http://www.skills-1st.co.uk/papers/jane/open_source_mgmt_options.html ).

Given that I have a LONG background in IBM equivalent software, there were elements of all 3 that surpassed commercial products.

Part of the problem with Open Source is that most of us are prepared to "comment" but far fewer of us are prepared to contribute - even if it is simply contributing to the body of knowledge in the public domain through maillists and fora (and that doesn't just mean "us" personally, it also often implies enough time from our employers to contribute). With "free" software you trade an obvious price ticket (usually huge from the Big 4) for a much more nebulous cost in time, skills, and this contributing back to the pool of knowledge.

The great thing about Open Source is that this IS possible! With the Big 4, especially if you are not a major enterprise, the chance of you influencing anything (or sometimes even getting major bugs fixed) is zip.

I agree that documentation is often the problem with Open Source offerings. Certainly I have found with Zenoss that it is possible to do most things I want but you have to dig around fora, wikis and FAQs to find more detailed information.

Just my 2 pennorth!
Jane

anjel said...

Hi, I use the free online network monitoring tools from Dotcom; you can check them out at http://www.dotcom-monitor.com. They offer a paid service too. Before that I used to be keen on Nagios for sure :)

mray said...

Sorry to hear you had such a poor Zenoss experience. We've been making a really strong push on the documentation and end-user experience with the latest release, don't know if you got to use the new 2.3 release and the new Getting Started with Zenoss Guide. Feel free to email me with direct feedback and check us out again soon, I really feel Zenoss has a strong base and is rapidly improving the end-user experience.

Thanks,
Matt Ray
Zenoss Community Manager
mray@zenoss.com

Gheorghe Gheorghiu said...

I am impressed by the interest this article has stirred up, proof that it is a subject that concerns many people. CONGRATULATIONS, and keep it up!

Anonymous said...

The newer Groundwork version in the works has a much nicer BitRock-based installer:

http://groundworkopensource.com/community/downloads/5.3alpha.html

Daniel

Anonymous said...

I miss Osmius.

http://Osmius.net

Network devices, servers, DB and applications monitoring. Create your own Services and assign them availability or state SLA, and it comes with reports and BI to squeeze info from data, just to mention some features.

They are just starting out, but perhaps it is worth taking a look.

vlad said...

I also tested the tools mentioned in the article and I settled on Hyperic. It has a few quirks but it is extensible and it is in active development. Groundworks was ok for a while until I ran into issues where it would stop monitoring something, and it was a pain to track down why it suddenly stopped. The interface was not a big deal, but what turned me off was that Groundworks was using an older version of Nagios, and this kind of version mismatch was not a good sign for me. I also wanted to use it for web site monitoring, and the recommended tool was WebInject - a perl script written in 1998 or so which couldn't parse javascript-enabled sites.

Corey Goldberg said...

> WebInject - a perl script written in 1998

To be fair, WebInject was developed in 2004 (I wrote it)

Anonymous said...

If the application looks horrible, it is probably coded horribly too. It is 2009 and there is no excuse for a shitty interface on any piece of software. Some of the proprietary software is the worst. I can't believe people actually pay hundreds of thousands of dollars for shitty software. It is amazing.

magnet said...

The sad state of internet.

There are at least 2 kinds of people behind a computer, the admin and the user.

you're a USER.

Nagios owns you, man!

Steve Francis said...

I don't think in-depth monitoring (dashboards, graphs, alerts for your entire infrastructure) could be made any easier than with LogicMonitor:

http://logicmonitor.com

We've automated the setup process so configuration takes just minutes. Simply enter the device's hostname and you’re done. LogicMonitor then automatically discovers devices, and keeps them up to date.

It's not free, but if you value your time, it's worth considering.

CS said...

What's the deal with all the Nagios clowns? I don't care about the interface, but guess who does care? The executives and sales types who are giving tours of the facility and showing off our WidgetSoft 2.0. The Nagios interface is craptastic, and no amount of "I'm too cool for flashy" is going to change it.

Those tours sell customers, which pay my salary and bonuses. You hipster-nerds should just go hang with RMS and bitch about anything that seems too flashy or *gasp* user friendly.
