Category Archives: Performance

Using Elasticsearch Logstash Kibana (ELK) to monitor server performance

There are myriad tools that claim to monitor server performance for you, but when you’ve already got a sizeable bag of tools doing various automated operations, it’s always nice to be able to fulfil an operational requirement with one of those rather than having to onboard another one.

I love Elasticsearch. It can be a bit of a minefield to learn, but when you get to grips with it, and bolt on Kibana, you realise that there is very little you can’t do with it.

Even better, Amazon AWS now have their own Elasticsearch Service, so you can reap all the benefits of the technology without having to worry about maintaining a cluster of Elasticsearch servers.

In this case, my challenge was to expose performance data from a large fleet of Amazon EC2 server instances. Yes, there is a certain amount of data available in AWS Cloudwatch, but it lacks key metrics like memory usage and load average, which are invariably the metrics you most want to review.

One approach to this would be to put some sort of agent on the servers and have a central server poll the agent, but again, that’s extra tooling. Another approach would be to put scripts on the servers that push metrics to Cloudwatch, so that you can augment the existing EC2 Cloudwatch data. This was something we considered, but with this method, the metrics aren’t logged to the same place in Cloudwatch as the EC2 data, so it all felt a bit clunky. And you only get two weeks of history.
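
For reference, the push-to-Cloudwatch approach boils down to a small script run from cron on each instance. The sketch below is illustrative only; it assumes the AWS CLI is installed and the instance has an IAM role allowing cloudwatch:PutMetricData, and the namespace and metric names are ones I have made up for the example.

#!/bin/bash
# Illustrative sketch: push load average and memory usage to CloudWatch as custom metrics.
# Assumes the AWS CLI is installed and credentials/region are available to the instance.

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
LOAD=$(awk '{print $1}' /proc/loadavg)
MEM_USED_PCT=$(free | awk '/Mem:/ {printf "%.1f", $3/$2*100}')

aws cloudwatch put-metric-data --namespace "Custom/EC2" \
    --dimensions InstanceId="$INSTANCE_ID" \
    --metric-name LoadAverage --value "$LOAD"

aws cloudwatch put-metric-data --namespace "Custom/EC2" \
    --dimensions InstanceId="$INSTANCE_ID" \
    --metric-name MemoryUsedPercent --unit Percent --value "$MEM_USED_PCT"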

This is where we turned to Elasticsearch. We were already using Elasticsearch to store information about access to our S3 buckets, which we were happy with. I figured there had to be a way to leverage this to monitor server performance, so I set about some testing.

Our basic setup was a Logstash server using the S3 input plugin and the Elasticsearch output plugin, with the output configured to send data to our Elasticsearch domain in AWS:

output {
 if [type] == "s3-access" {
     elasticsearch {
         index => "s3-access-%{+YYYY.MM.dd}"
         hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
         ssl => true
    }
 } 
}

We now wanted to create a different type of index, which would hold our performance metric data. This data was going to be taken from lots of servers, so Logstash needed a way to ingest the data from lots of remote hosts. The easiest way to do this is with the Logstash syslog input plugin. We first set up Logstash to listen for syslog input.

input {
     syslog {
         type => syslog
         port => 8514
     }
}

We then get our servers to send their syslog output to our Logstash server by giving them a universal rsyslogd configuration, where logs.mydomain.com is our Logstash server:

#Logstash Configuration
$WorkDirectory /var/lib/rsyslog # where to place spool files
$template LogFormat,"%HOSTNAME% ops %syslogtag% %msg%"
*.* @@logs.mydomain.com:8514;LogFormat

We now update our output plugin in Logstash to create the necessary Index in Elasticsearch:

output {
 if [type] == "syslog" {
    elasticsearch {
       index => "test-syslog-%{+YYYY.MM.dd}"
       hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
       ssl => true
    }
 } else {
    elasticsearch {
       index => "s3-access-%{+YYYY.MM.dd}"
       hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
       ssl => true
    }
 }
}

Note that I have called the syslog Index “test-syslog-…”. I will explain this in a moment, but it’s important that you do this.

Once these steps have been completed, it should be possible to see syslog data in Kibana, as indexed by Logstash and stored in our AWS Elasticsearch domain.
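
If you want to confirm the index is actually being created, independently of Kibana, you can ask Elasticsearch directly. A quick check along these lines should work (the endpoint below is a placeholder for your own AWS Elasticsearch domain endpoint):

curl -s "https://your-es-endpoint.us-west-2.es.amazonaws.com/_cat/indices/test-syslog-*?v"

This should list one test-syslog-YYYY.MM.dd index for each day Logstash has been receiving syslog data.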

Building on this, all we had to do next was get our performance metric data into the syslog stream on each of our servers. This is very easy. Logger is a handy little utility that comes pre-installed on most Linux distros and allows you to send messages to syslog (/var/log/messages by default).

We trialled this with Load Average. To get the data to syslog, we set up the following cronjob on each server:

* * * * * root cat /proc/loadavg | awk '{print "LoadAverage: " $1}' | xargs logger

This writes the following line to /var/log/messages every minute:

Jun 21 17:02:01 server1 root: LoadAverage: 0.14

It should then be possible to search for this line in Kibana

message: "LoadAverage"

to verify that it is being stored in Elasticsearch. When we do find results in Kibana, we can see that the LogFormat template we used in our server rsyslog conf has converted the log line to:

server1 ops root: LoadAverage: 0.02
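
The same pattern extends to any metric you can produce with a one-liner, which is handy given that memory usage was one of the metrics we were missing from Cloudwatch. As a sketch, a similar cronjob for memory usage might look like the line below (hedged on the output format of free on your distro, which on older versions counts buffers/cache as used; “MemoryUsedPercent” is just a label I have chosen):

* * * * * root free | awk '/Mem:/ {print "MemoryUsedPercent: " int($3/$2*100)}' | xargs logger

Because the log line keeps the same “MetricName: value” shape, nothing else in the pipeline needs to change.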

To really make this data useful, however, we need to be able to perform visualisation logic on the data in Kibana. This means exposing the fields we require and making sure those fields have the correct data type for numerical visualisations. This involves using some extra filters in your Logstash configuration.

filter {
   if [type] == "syslog" {
       grok {
          match => { "message" => '(%{HOSTNAME:hostname})\s+ops\s+root:\s+(%{WORD:metric-name}): (%{NUMBER:metric-value:float})' }
       }
   }
}

This filter operates on the message field after it has been converted by rsyslog, rather than on the format of the log line in /var/log/messages. The crucial part of this is to expose the Load Average value (metric-value) as a float, so that Kibana/Elasticsearch can deal with it as a number rather than a string. If you only specify NUMBER as your grok data type, it will be exposed as a string, so you need to add “:float” to complete the conversion to a numeric type.

To check how the field is being exposed, look in Kibana under Settings -> Indices. You should only have a single Index Pattern at this point (test-syslog-*). Refresh the field list for this, and search for “metric-value”. At this point, it may indicate that the data type for this is “String”, which we can now deal with. If it already has data type “Number”, you’re all set.

In Elasticsearch indices, you can only set the data type for a field when the index is created. If your “test-syslog-” index was created before we properly converted “metric-value” to a number, you can now create a new index and verify that metric-value is mapped as a number. To do this, update the output plugin in your Logstash configuration and restart Logstash.

output {
 if [type] == "syslog" {
    elasticsearch {
       index => "syslog-%{+YYYY.MM.dd}"
       hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
       ssl => true
    }
 } 
}

A new Index (syslog-) will now be created. Delete the existing Index Pattern in Kibana and create a new one for syslog-*, using @timestamp as the default time field. Once this has been created, Kibana will obtain an updated field list (after a few seconds), and in this, you should see that “metric-value” now has a data type of “Number”.

(For neatness, you may want to replace the “test-syslog-” index with a properly named index even if your data type for “metric-value” is already “Number”).
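
You can also check the mapping directly against the Elasticsearch API rather than relying on Kibana’s field list (again, the endpoint is a placeholder for your own domain endpoint):

curl -s "https://your-es-endpoint.us-west-2.es.amazonaws.com/syslog-*/_mapping/field/metric-value?pretty"

The field should come back with a numeric type (float or double, depending on the Elasticsearch version) rather than string.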

Now that you have the data you need in Elasticsearch, you can graph it with a visualisation.

First, set your interval to “Last Hour” and create/save a Search for what you want to graph, e.g.:

metric-name: "LoadAverage" AND hostname: "server1"

Now, create a Line Graph visualisation for that Search, setting the Y-Axis to Average for the field “metric-value” and the X-Axis to Date Histogram. Click “Apply” and you should see a graph like the one below:

[Screenshot: Kibana line chart visualisation of the LoadAverage metric]

 

 

Application monitoring with Nagios and Elasticsearch

As the applications under your control grow, both in number and complexity, it becomes increasingly difficult to rely on predictive monitoring.

Predictive monitoring is monitoring things that you know should be happening. For instance, you know your web server should be accepting HTTP connections on TCP port 80, so you use a monitor to test that HTTP connections are possible on TCP port 80.

In more complex applications, it is harder to predict what may or may not go wrong; similarly, some things can’t be monitored in a predictive way, because your monitoring system may not be able to emulate the process that you want to monitor.

For example, let’s say your application sends push messages to a mobile phone application. To monitor this thoroughly, you would have to have a monitor that persistently sends push messages to a mobile phone, and some way of verifying that the mobile phone received them.

At this stage, you need to invert your monitoring system, so that it stops asking if things are OK, and instead listens for applications that are telling it that they are not OK.

Using your application log files is one way to do this.

Well-written applications are generally quite vocal when it comes to being unwell, and will always describe an ERROR in their logs if something has gone wrong. What you need to do is find a way of linking your monitoring system to that message, so that it can alert you that something needs to be checked.

This doesn’t mean you can dispense with predictive monitoring altogether; what it does mean is that you don’t need to rely on predictive monitoring entirely (or in other words, you don’t need to be able to see into the future) to keep your applications healthy.

This is how I’ve implemented log-based monitoring. It was something of a nut to crack, as our logs arise from an array of technologies and adhere to very few standards in terms of layout, logging levels and storage locations.

The first thing you need is a logstash implementation. Logstash comprises a stack of technologies: an agent to ship logs out to a Redis server; a Redis server to queue logs for indexing; a logstash server for creating indices and storing them in elasticsearch; and an elasticsearch server for searching your indices.

The setup of this stack is beyond the scope of this article; it’s well described over on the logstash website, and is reasonably straightforward.

Once you have your logstash stack set up, you can start querying the elasticsearch search api for results. Queries are based on HTTP POST and JSON, and results are output in JSON.

Therefore, to test your logs, you need to issue an HTTP POST query from Nagios, check the results for ERROR strings, and alert accordingly.
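
To make that concrete before introducing JMeter, here is a rough sketch of the kind of check we are after, written as a plain shell script using standard Nagios exit codes. It is an illustration of the logic only (query.json and the search string are placeholders), not the approach I ended up using:

#!/bin/bash
# Illustrative sketch: query elasticsearch and raise a Nagios CRITICAL if ERROR strings come back.
# query.json holds the JSON query body (see the example query further down).

RESULT=$(curl -s -XPOST 'http://localhost:9200/_search' -d @query.json)

if echo "$RESULT" | grep -q 'ERROR'; then
    echo "CRITICAL: ERROR entries found in application logs"
    exit 2
fi

echo "OK: no ERROR entries found"
exit 0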

The easiest way to have Nagios send a POST request with a JSON payload to elasticsearch is with the Nagios jmeter plugin, which allows you to create monitors based on your jmeter scripts.

All you need then is a correctly constructed JSON query to send to elasticsearch, which is where things get a bit trickier.

Without going into this in any great detail, formulating a well-constructed JSON query that will search just the right log indices in elasticsearch isn’t easy. I cheated a little here. I am familiar with the Apache Lucene syntax that Kibana, the Logstash web front end, uses, and was able to formulate my query based on this.

Kibana sends encrypted queries to elasticsearch, so you can’t pick them out of the HTTP POST/GET variables. Instead, I enabled logging of slow queries on elasticsearch (threshold set to 0s) so that I could see in the elasticsearch logs exactly what queries were being run against elasticsearch. Here’s an example:


{
  "size": 100,
  "sort": {
    "@timestamp": {
      "order": "desc"
    }
  },
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "NOT @source_host:\"uatserver\"",
          "default_field": "_all",
          "default_operator": "OR"
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "from": "2014-10-06T11:05:25+00:00",
            "to": "2014-10-06T12:05:25+00:00"
          }
        }
      }
    }
  },
  "from": 0
}

You can test a query like this by sending it straight to your elasticsearch API:


curl -XPOST 'http://localhost:9200/_search' -d '{"size":100,"sort":{"@timestamp":{"order":"desc"}},"query":{"filtered":{"query":{"query_string":{"query":"NOT @source_host:\"uatserver\"","default_field":"_all","default_operator":"OR"}},"filter":{"range":{"@timestamp":{"from":"2014-10-06T11:05:25+00:00","to":"2014-10-06T12:05:25+00:00"}}}}},"from":0}'

This returns up to 100 log entries that do not come from “uatserver”, over the one-hour window defined by the range filter.
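
As an aside, the slow query logging mentioned earlier can be enabled by dropping the slow-query thresholds to zero on the indices you are interested in. Something along these lines should work (the setting names are for the 1.x-era elasticsearch in use here and may differ in later versions):

curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "0s",
  "index.search.slowlog.threshold.fetch.warn": "0s"
}'

Remember to put the thresholds back up afterwards, or the slow log will grow very quickly.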

Now that we know what we want to send to elasticsearch, we can construct a simple jmeter script. In this, we simply specify an HTTP POST request, containing Body Data of the JSON given above, and include a Response Assertion for the strings we do not want to see in the logs.

We can then use that script in Nagios with the jmeter plugin. If the script finds the ERROR string in the logs, it will generate an alert.

Two things are important here:

The alert will only tell you that an error has appeared in the logs, not what that error was; and if the error isn’t persistent, the monitor will eventually recover.

Clearly, there is a lot of scope for false positives in this, so if your logs are full of tolerable errors (they really shouldn’t be), you are going to have to be more specific about your search strings.
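
For example, rather than pulling back a batch of recent entries and asserting on them, you can push more of the specificity into the query string itself, excluding errors you know to be harmless. A hedged illustration (the “known tolerable error” phrase is a stand-in for whatever noise appears in your own logs):

curl -XPOST 'http://localhost:9200/_search' -d '{"size":100,"query":{"query_string":{"query":"ERROR AND NOT \"known tolerable error\" AND NOT @source_host:\"uatserver\"","default_field":"_all"}}}'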

The good news is that if you get this all working, it’s very easy to create new monitors. Rather than writing bespoke scripts and working with Nagios plugins, all you need to do is change the queries and the Response Assertions in your jmeter script, and you should be able to monitor anything that is referenced in your application logs.

To assist in some small way, here is a link to a pre-baked JMeter script that includes an Apache Lucene query, and is also set up with the necessary Javascript-based date variables to search over the previous 15 minutes.

Is Skype an appropriate tool in corporate environments?

This is a question that has plagued me for several years, in that I have never been able to establish a consistent level of Skype quality in a corporate environment, despite having lots of bandwidth and the consultancy services of CCIE-level network experts.

The answer to the question is ultimately, no.

Let me explain by running through the questions.

1. How does Skype work at a network level?

Skype is a “Peer To Peer” (P2P) application. That means that when two people are having a Skype conversation, their computers *should* be directly connected, rather than connected via a third computer. For the sake of comparison, Google Hangouts is not a P2P application. Google Hangouts participants connect to each other via Google conferencing servers.

2. Does Skype work with UDP or TCP?

Skype’s preference is for UDP, and when Skype can establish a direct P2P connection using UDP, which is typically the case for residential users, call quality is very good. This is because UDP has much lower overhead and latency than TCP when used for streaming audio and video.

3. What’s the difference between residential and corporate users?

Residential internet connections are typically allocated a single public IP address (usually dynamically assigned, but dedicated to that connection). This IP gets registered to a Skype user on Skype’s servers, so when someone needs to contact that user, Skype knows where to direct the call, and can use UDP to establish a call between the participating users.

In corporate environments, where there are lots of users using the same internet connection, a single public IP address has to be shared between those users (Port Address Translation). That means that the Skype servers will have registered the same public IP address for all the users in that organisation. As a result, Skype is not able to establish a direct UDP P2P connection between a user outside that organisation and a user inside it, and has to use other means to make that connection.

4. What are those other means?

When direct connectivity between clients is not possible, Skype uses a process called “UDP hole punching”. In this mechanism, 2 computers that cannot communicate directly with each other communicate with one or more third party computers that can communicate with both computers.

Connection information is passed between the computers in order to try and establish a direct connection between the 2 computers participating in the Skype call.

If ultimately a direct connection cannot be established, Skype will use the intermediary computers to relay the connection between the 2 computers participating in the conversation.

In Skype terminology, these are known as “relay nodes”, which are basically just computers running Skype that have direct UDP P2P capability (typically residential users with good broadband speeds).

From the Skype Administrators Manual:

http://download.skype.com/share/business/guides/skype-it-administrators-guide.pdf

2.2.4 Relays

If a Skype client can’t communicate directly with another client, it will find the appropriate relays for the connection and call traffic. The nodes will then try connecting directly to the relays. They distribute media and signalling information between multiple relays for fault tolerance purposes. The relay nodes forward traffic between the ordinary nodes. Skype communication (IM, voice, video, file transfer) maintains its encryption end-to-end between the two nodes, even with relay nodes inserted.

As with supernodes, most business users are rarely relays, as relays must be reachable directly from the internet. Skype software minimizes disruption to the relay node’s performance by limiting the amount of bandwidth transferred per relay session. 

5. Does that mean that corporate Skype traffic is being relayed via anonymous third party computers?

Yes. The traffic is encrypted, but it is still relayed through other unknown hosts if a direct connection between 2 Skype users is not possible.

6. Is this why performance in corporate environments is sometimes not good?

Yes. If a Skype conversation is dependent on one or more relay nodes, and one of these nodes experiences congestion, this will impact on the quality of the call.

7. Surely, there is some solution to this?

A corporate network can deploy a proxy server, which is directly mapped to a dedicated public IP address. Ideally, this should be a UDP-enabled SOCKS5 server, but a TCP HTTP proxy server can also be used. If all Skype connections are relayed through this server, Skype does not have to use relay nodes, as Port Address Translation is not in use.

8. So what’s the catch?

The problem with this solution is that it is not generally possible to force the Skype client to use a Proxy Server. When the client is configured to use a Proxy Server, it will only use it if there is no other way to connect to the Internet. So, if you have a direct Internet connection, even one based on Port Address Translation, which impacts on Skype quality, Skype will continue to use this, even if a better solution is available via a Proxy Server.

9. Why would Skype do this?

Skype is owned by Microsoft. Skype have a business product that attaches to Microsoft Active Directory and allows you to force a proxy connection. So if you invest in a Microsoft network, Microsoft will give you a solution to enable better Skype performance in corporate networks. If you don’t want to invest in a Microsoft network, you’re stuck, and your only option is to block all outbound Internet access from your network and divert it via your proxy server.

For a lot of companies, particularly software development companies who depend on 3rd party web services, this is not a practical option.

10. What is the solution?

At this time the primary options for desktop Audio/Video conferencing are either Skype or Google Hangouts.

When Skype can be used in an environment where P2P UDP connectivity is “always on”, it provides a superior audio/video experience to Google Hangouts, which is not P2P, and which communicates via central Google Servers.

Where an environment uses Port Address Translation, Skype performance will depend on the ability of the Skype client to establish connections via relays, which means it becomes dependent on the resources available to those relays.

In this instance, Google Hangouts may be a better choice where consistent quality is required, as quality can be assured by providing sufficient bandwidth between the corporate network and Google.

 

The Scriptalizer

Given the extensive use of JavaScript in today’s web applications, one of the best ways to improve site performance is to minify (obfuscate) your JavaScript before it is loaded into the browser.

Trimming whitespace and newlines can shave as much as 50% from your script file sizes, which can make a significant difference to the initial load delay of your site.

http://www.scriptalizer.com is as good a solution as I have found for this.

You upload your files, hit the button, and wham, all your beautifully formed JavaScript is compressed into a congealed, but wonderfully efficient, blob in a single file.

The only thing is that you have to insert a bit of logic into your code to load your human-readable files in development and the single obfuscated file in production, because you don’t want to have to repeat the obfuscation process every time you want to test a new piece of code, and you certainly don’t want to try to edit the obfuscated file by hand.