Application monitoring with Nagios and Elasticsearch

As the applications under your control grow, both in number and complexity, it becomes increasingly difficult to rely on predicative monitoring.

Predicative monitoring is monitoring things that you know should be happening. For instance, you know your web server should be accepting HTTP connections on TCP port 80, so you use a monitor to test that HTTP connections are possible on TCP port 80.

In more complex applications, it harder to predict what may or may not go wrong; similarly, some things can’t be monitored in predictive way, because your monitoring system may not be able to emulate the process that you want to monitor.

For example, lets say your application sends Push message to a mobile phone application. To monitor this thoroughly, you would have to have a monitor that persistently sends Push messages to a mobile phone, and some way of monitoring that the mobile phone received them.

At this stage, you need to invert your monitoring system, so that it stops asking if things are OK, and instead listens for applications that are telling it that they are not OK.

Using your application logs files is one way to do this.

Well-written applications are generally quite vocal when it comes to being unwell, and will always describe an ERROR in their logs if something has gone wrong. What you need to do is find a way of linking your monitoring system to that message, so that it can alert you that something needs to be checked.

This doesn’t mean you can dispense with predictative monitoring altogether; what is does means is that you don’t need to rely on predicative monitoring entirely (or in other words, you don’t need to be able to see into the future) to keep your applications healthy.

This is how I’ve implemented log based monitoring. This was something of a nut to crack, as our logs arise from an array of technologies and adhere to very few standards in terms of layout, logging levels and storage locations.

The first thing you need is a logstash implementation. Logstash comprises a stack of technologies: an agent to ship logs out to a Redis server; a Redis server to queue logs for indexing; a logstash server for creating indices and storing them in elasticsearch; an elasticsearch server to search your indices.

The setup of this stack is beyond this article; its well-described over on the logstash website, and is reasonably straightforward.

Once you have your logstash stack set up, you can start querying the elasticsearch search api for results. Queries are based on HTTP POST and JSON, and results are output in JSON.

Therefore, to test you logs, you need to issue a HTTP POST query from Nagios, check the results for ERROR strings, and alert accordingly.

The easient way to have Nagios send a POST request with a JSON payload to elasticsearch is with the Nagios jmeter plugin, which allows you to create monitors based on your jmeter scripts.

All you need then is a correctly constructed JSON query to send to elasticsearch, which is where things get a bit trickier.

Without going into this in any great detail, formulating a well-constructed JSON query that will parse just the right log indices in elasticsearch isn’t easy. I cheated a little in this. I am familiar with the Apache Lucene syntax that the Logstash Javascript client, Kibana, uses, and was able to formulate my query based on this.

Kibana sends encrypted queries to elasticsearch, so you can’t pick them out of the HTTP POST/GET variables. Instead, I enabled logging of slow queries on elasticsearch (threshold set to 0s) so that I could see in the elasticsearch logs what exact queries were being run against elasticsearch. Here’s an example:


{
  "size": 100,
  "sort": {
    "@timestamp": {
      "order": "desc"
    }
  },
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "NOT @source_host:\"uatserver\"",
          "default_field": "_all",
          "default_operator": "OR"
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "from": "2014-10-06T11:05:25+00:00",
            "to": "2014-10-06T12:05:25+00:00"
          }
        }
      }
    }
  },
  "from": 0
}

You can test a query like this by sending it straight to your elasticsearch API:


curl -XPOST 'http://localhost:9200/_search' -d '{"size":100,"sort":{"@timestamp":{"order":"desc"}},"query":{"filtered":{"query":{"query_string":{"query":"NOT @source_host:\"uatserver\"","default_field":"_all","default_operator":"OR"}},"filter":{"range":{"@timestamp":{"from":"2014-10-06T11:05:25+00:00","to":"2014-10-06T12:05:25+00:00"}}}}},"from":0}'

This searches a batch of 100 log entries that do not have a tag of “uatserver”, from a previous 5 minute period.

Now that we now what we want to send to elasticsearch, we can construct a simple jmeter script. In this this, we simply specify a a HTTP POST request, containing Body Data of the JSON given above, and include a Response Assertion for the strings we do not want to see in the logs.

We can then use that script in Nagios with the jquery plugin. If the script finds the ERROR string in the logs, it will generate an alert.

2 things are important here:

The alert will only tell you that an error has appeared in the logs, not what that error was; and if the error isn’t persistent, the monitor will eventually recover.

Clearly, there is a lot of scope for false negatives in this, so if your logs are full of tolerable errors (they shouldn’t be really) you are going to have to be more specific about your search strings.

The good news is that if you get this all working, its very easy to create new monitors. Rather than writing bespoke scripts and working with Nagios plugins, all you need to do is change the queries and the Response Assertions in your jmeter script, and you should be able to monitor anything that is referenced in your application logs.

To assist in some small way, here is a link to a pre-baked JMeter script that includes an Apache Lucene query, and is also set up with the necessary Javascript-based date variables to search over the previous 15 minutes.

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>