Breaking up 3’s mobile broadband scam

3 mobile offer a mobile broadband service. You buy a router which takes a mobile SIM, which has mobile data enabled and which is subscribed to a mobile data plan. The router connects to the mobile network and then shares that connection over a WIFI network in your home.

3 offer a number of plans based on the amount of data you want to download, ranging from 3GB per month up to 100GB per month. For example, 30GB per month costs €29.99.

If you go over your limit, you pay penalty charges. The penalty rate is 5c per MB. This doesn’t sound like much, but if you go 1GB over your limit (half a Netflix movie), that’s €50. This applies whether you are on the 3GB or the 100GB plan. Once you get to 1GB over your limit, 3 stops the connection and you can’t download any more data until your plan renews at the start of your next billing cycle.
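To see how quickly this adds up, here is a rough sketch of the arithmetic in Python. It assumes 1GB is billed as 1,000 MB (3 may well count 1,024) and that nothing is charged beyond the 1GB cut-off; the function name is just for illustration.

```python
PENALTY_CENTS_PER_MB = 5   # 5c per MB, as quoted above
CUTOFF_MB = 1000           # assumption: the "1GB over" cut-off counted as 1,000 MB


def overage_charge(mb_over_limit):
    """Return the penalty in euro for going mb_over_limit MB over the plan."""
    # Nothing is billed below the limit, and the connection stops at the cut-off.
    billable_mb = min(max(mb_over_limit, 0), CUTOFF_MB)
    return billable_mb * PENALTY_CENTS_PER_MB / 100


for mb in (10, 100, 500, 1000):
    print(f"{mb:>5} MB over the limit costs EUR {overage_charge(mb):.2f}")
```

Two full breaches at this rate come to €100, which is exactly the difference between the €119.96 I budgeted and the €219.96 I ended up paying.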

Does this sound like a scam? No, not really, but I haven’t included all the detail.

Given the penal rate that applies if you go over your limit, you would think that 3 had some sure-fire mechanism of alerting you that you are approaching your limit or are at your limit, so that you could stop using the service.

Well, they don’t.

All they do is send a warning SMS message to the mobile phone number that is associated with the SIM card in your router. The only way that you’ll ever see this message is if you log in to the router and check the messages, which you’re never going to do, because there is no reason to log in to the router other than when you first set it up.

It would of course be much more logical for them to send you an email (you supply your email address when you buy the router and/or subscribe to the data plan), but they don’t do this. Nor do they make any functionality available whereby you can set up an email warning yourself.

Nor do they shut off your connection when you reach your limit. Of course, they could argue that they want to give users the capability to burst beyond their limit, in case of some emergency, but why then do they go ahead and shut down your connection when you go over your limit by 1GB, which costs €50? And why don’t they give users the option of having their connection shut off when they breach their limit? And regardless of both these points, you can also buy a data add-on for a couple of euro if you want to go above your limit in a given month.

The only explanation that stacks up here is that 3 are making a small fortune from users breaching their mobile data limits. I use the service, and am technically savvy, but in the first 4 months of using it, I breached my limit twice. I had budgeted paying €29.99 per month for 30GB per month over 4 months (Total: €119.96 for 120GB of data), but ended up paying €219.96 for 122GB of data.

Certain that 3 were playing offside here, I contacted ComReg to make a complaint. Surely, they would recognise how ridiculous it was for service providers to be warning users about breaching data limits by sending SMS messages to SIM cards buried in routers.

But no. To my amazement, I got a response from ComReg saying that mobile providers were not obliged to warn users about limit breaches by email. They were only required to send SMS messages!

Now my blood was really up.

I got to work with some Python and Selenium and wrote a script that logs into my 3 account once per hour and picks up my remaining allowance. This allowance is then posted to my website, where it is checked by a simple web content checker app running on my phone. Every time the content changes, the app alerts me.
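The part of the script that decides whether to raise the alarm can be sketched as below. This is not my actual script: the Selenium login step is left out because it depends entirely on the layout of the 3 account page, and the text format being parsed (“12.5GB remaining”) and the 2GB threshold are assumptions for illustration.

```python
import re


def parse_remaining_gb(page_text):
    """Pull the remaining allowance, in GB, out of the account page text.

    Assumes the page shows something like '12.5GB remaining' or '500 MB left'.
    Returns None when no allowance figure can be found.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*(GB|MB)", page_text, re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    # Normalise MB figures to GB so the caller only deals with one unit.
    return value / 1000 if match.group(2).upper() == "MB" else value


def needs_alert(remaining_gb, threshold_gb=2.0):
    """True when the allowance is known and at or below the warning threshold."""
    return remaining_gb is not None and remaining_gb <= threshold_gb


remaining = parse_remaining_gb("You have 12.5GB remaining this month")
print(remaining, needs_alert(remaining))
```

Posting the figure to a web page and letting a phone app watch for changes is just one way of delivering the alert; the same check could equally trigger an email.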

My plan now is to extend this functionality to other users, so that they too can cut off the supply of €50 fines being delivered to 3. If you would like to have your allowance monitored in this way, please let me know and I will send you what you need.

If you’re tech savvy, you can do it yourself. See here:

How Bitcoin mining pools work

I’ve written this to clarify my own understanding. Treat with caution.

Bitcoin mining pools exist because the computational power required to mine Bitcoins on a regular basis is so vast that it is beyond the financial and technical means of most people. Rather than investing a huge amount of money in mining equipment that will (hopefully) give you a return over a period of decades, a mining pool allows the individual to accumulate smaller amounts of Bitcoin more frequently.

The principle is quite straightforward: lots of people come together to combine their individual computing power into a single logical computing unit and share any rewards (Bitcoins) proportionally based on the amount of computing effort they contributed.

Where the complexity arises is in the regulation of the network, for instance:

How do you know how much effort was contributed by each member?

How do you prevent members jumping in and out of pools, particularly pools that haven’t mined a Bitcoin in a while, and where it is likely they will mine a Bitcoin in the near future?

How do you prove that individual members are actually working?

How do you prevent more powerful members from hogging the network bandwidth of the master miner, preventing less powerful members from contributing?

How do you ensure that the master miner is getting enough throughput to have a reasonable chance of mining a Bitcoin?

These are just a sample of the problems that exist in relation to efficient and fair Bitcoin mining in pools. The following is a general explanation of how these problems are dealt with.

Let’s clarify terminology first (I’m assuming basic knowledge of how Bitcoin works here).

Hash: the end product, a binary number, of a hashing operation (in Bitcoin, SHA-256 applied twice) performed by a miner. Each new Hash is created by the miner adding a sequential value (a nonce) to the source data of the hashing operation. Even an ordinary computer can generate millions of Hashes per second.

Target: In Bitcoin, each new Block has a Target. This is a binary number. To succeed in creating a new Block, a miner has to compute a Hash that is lower than that number. The Bitcoin protocol adjusts that number depending on the amount of activity on the network. If there is a lot of activity, the number becomes smaller, if there is less activity, the number becomes larger. The objective is to regulate the creation of new Blocks (ie Bitcoins) to 1 every 10 minutes or so.

Difficulty: This is a measure of how difficult it is for a miner to derive a Hash that is less than the current Target (ie mine Bitcoins). It is extrapolated from the amount of time it took to generate the last 2,016 blocks. At a rate of 1 block every 10 minutes, it should take 2 weeks to generate 2,016 blocks. If it has actually taken less than this, the difficulty will increase (ie the Target will be a smaller number). If it has taken more than this, the difficulty will decrease (ie the Target will be a larger number). Difficulty itself is a unitless ratio (the largest permitted Target divided by the current Target), so it is proportional to the average number of Hashes the entire network has to generate to create a Block.

Master Node: a full Bitcoin node that operates on the Bitcoin P2P network and which regulates a pool of members, who do not directly communicate on the Bitcoin P2P network, but who use the Master Node as a proxy.

Share: a Share is something that is particular to Mining pools. It does not form part of the wider Bitcoin protocol. It is the primary method used by the Master Node to regulate the activity of members of a pool. The next section deals with Shares.
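Before moving on to Shares, the relationship between Hash, nonce and Target defined above can be sketched as a toy in Python. This is an illustration only: a real Bitcoin block header has a fixed 80-byte layout, whereas here the nonce is simply appended to an arbitrary byte string, and the Target is a made-up easy value so the search finishes quickly.

```python
import hashlib


def block_hash(header: bytes, nonce: int) -> int:
    """Double SHA-256 of header + nonce, returned as an integer."""
    data = header + nonce.to_bytes(4, "little")
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    # Bitcoin treats the 32-byte digest as a little-endian number.
    return int.from_bytes(digest, "little")


def mine(header: bytes, target: int, max_nonce: int = 2**32):
    """Try nonces in sequence until a Hash lower than the Target turns up."""
    for nonce in range(max_nonce):
        if block_hash(header, nonce) < target:
            return nonce
    return None  # nonce space exhausted without success


# Toy Target: about 1 Hash in 65,536 will be lower than this.
EASY_TARGET = 1 << 240

nonce = mine(b"toy block header", EASY_TARGET)
print("winning nonce:", nonce)
```

Making the Target smaller (say `1 << 230`) makes a winning nonce about a thousand times harder to find, which is all that “increasing the Difficulty” means.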

When you start running a Bitcoin mining process, you will probably be aware of your Hash Rate. This is the number of Hashes your Bitcoin mining hardware is generating per second. These days, this is normally measured in Ghps, which means billions (Giga) of Hashes Per Second. A typical high-end Graphics Card (GPU) on a modern PC can generate about 0.5 Ghps. A dedicated ASIC mining rig, which will cost over €1,000, might be able to generate 8,000 Ghps. A mining pool will typically measure its combined compute power in Thps, or trillions (Tera) of Hashes Per Second.

When you participate in a mining pool, and you see your hardware generating (let’s say) 5,000 Ghps, this does not mean that you are submitting 5,000 Ghps to the Bitcoin network, via the Master Node. If that were the case, and all the pool members were doing the same, the Master Node that controls the pool would simply explode.

What it means is that your mining hardware can generate 5,000 Ghps locally on your computer.

This capacity isn’t used directly on the Bitcoin network. Instead, the Master Node that controls the pool acts as a proxy between the pool members and the main Bitcoin network. For this to work, the Master Node has to ensure both that the members are supplying enough Hashes for the Master Node to be able to compete on the main Bitcoin mining network, and that the allocation of any Bitcoin mined is divided proportionately according to the amount of compute effort supplied by the individual members.

To do this, the Master Node observes the Hash Rate of each of the members, and distributes computational challenges to them that all have a lower Difficulty rating than the Difficulty rating of the current Block Target (let’s call this easier threshold the Proxy Target). If a member finds a Hash that is lower than the Proxy Target, the Master Node accepts this as a “Proof of Work Accepted”. If a Hash that is lower than the Proxy Target is not found within the allotted time, the member completes the work anyway before moving on to the next computational challenge distributed by the Master Node.

In this way, your mining process will log “Work Units Started”, which will always be a higher number than “Proof Of Works Accepted”, because not every work unit produces a qualifying Hash. The smaller the gap between these numbers, the “luckier” you are. The greater the gap, the “unluckier” you are. In reality however, and over time, the gap should be consistent across all members, as the Master Node will adjust the Difficulty of the computational challenges based on the Hash Rate of the member, which can change over time.

A Share is therefore the equivalent of a “Proof of Work Accepted”. The Master Node will keep a record of all “Proof of Works Accepted” from each member, and distribute Bitcoin mined based on the number of Shares each member has contributed.

Confused? Of course you are, so let’s go through that again, from a different perspective.

If the Master Node were sending computational challenges to members that had a Difficulty rating that was equal to the Difficulty rating of the current Target in the Blockchain, the Master Node would just be sitting there idly for days on end waiting for one of the members to come up with the necessary Hash to create the new Block. The Master Node would have no knowledge of what effort the other members contributed, and would have no option but to award the full reward to the successful member, even if that member only contributed 0.001% of the computational effort involved in creating the Block.

Instead, the Master Node lowers the bar on the Difficulty rating (relative to the actual Difficulty rating of the current Block) so that it receives lots of Hashes from the members. Almost all of these Hashes will be higher than the current Blockchain Target (ie not good enough to create a Block), but at least now the Master Node can confirm that its members are working, and at what rate they are working. It can then use that information to both regulate the traffic received from the members and proportionately divide any rewards.
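The “lowered bar” can be pictured as two thresholds applied to every Hash a member submits. The sketch below is my own illustration, not Bitminter’s code; both Target values are invented, and the only thing that matters is that the share Target is a much larger (easier) number than the Block Target.

```python
BLOCK_TARGET = 1 << 200   # invented stand-in for the network's current Target
SHARE_TARGET = 1 << 230   # invented, easier Target set by the Master Node


def classify(hash_value, share_target=SHARE_TARGET, block_target=BLOCK_TARGET):
    """Decide what a submitted Hash is worth to the pool."""
    if hash_value < block_target:
        return "block"   # good enough for the Bitcoin network; also a Share
    if hash_value < share_target:
        return "share"   # proves work was done, but creates no Block
    return "miss"        # not even low enough to count as a Share


def tally(hash_values):
    """Count how many submitted Hashes fall into each category."""
    counts = {"block": 0, "share": 0, "miss": 0}
    for value in hash_values:
        counts[classify(value)] += 1
    return counts


print(tally([1 << 199, 1 << 210, 1 << 229, 1 << 240]))
```

With these numbers a Share is about a billion times easier to find than a Block, so the Master Node hears from every working member regularly even though Blocks themselves remain rare.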

Additionally, it can ensure that more powerful members, whose submissions are rate-limited to allow submissions from less powerful members, are not discriminated against. These more powerful nodes are given challenges with higher Difficulty ratings, but any “Proof of Work Accepted” results (ie Shares) that they accumulate are weighted according to the Difficulty rating that was set, giving them a higher proportion of any ultimate reward.
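A minimal sketch of that weighting, assuming each accepted Share is recorded along with the Difficulty it was issued at. The member names and numbers are made up, and real pools also deduct fees, which are ignored here.

```python
def split_reward(reward, accepted_shares):
    """Divide a reward in proportion to difficulty-weighted Shares.

    accepted_shares is a list of (member, share_difficulty) pairs,
    one entry per accepted Share.
    """
    weights = {}
    for member, difficulty in accepted_shares:
        weights[member] = weights.get(member, 0) + difficulty
    total = sum(weights.values())
    if total == 0:
        return {}
    return {member: reward * weight / total for member, weight in weights.items()}


# A powerful member submits fewer, harder Shares; a small member submits
# more, easier ones. The weighting stops the powerful member losing out.
shares = [("big_rig", 8), ("big_rig", 8),
          ("laptop", 1), ("laptop", 1), ("laptop", 1), ("laptop", 1)]
print(split_reward(12.5, shares))
```

Counted by number of Shares alone, the laptop would take two thirds of the reward; counted by weight, it gets a fifth, which is closer to the work it actually did.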

While all of the above may sound complex, it is still in fact just a general introduction to mining pools, and based on the mining pool that I use, Bitminter.

Other pools use variations of this methodology, but all follow the general principle that the Master Node is a proxy to the main Bitcoin network, that members must prove to the Master Node that they are working and that rewards are allocated based on the amount of work done.

A brief note about payment methodologies is also warranted. Many pools will use either the PPS (Pay Per Share) or PPLNS (Pay Per Last N Shares) method to distribute rewards.

In the PPS model, you get a payment for each Share you contribute regardless of the success or failure of the mining pool. This makes for a regular income, but doesn’t allow you to benefit when the pool has a lucky streak.

In the PPLNS model, you get paid only when Bitcoins are mined, and only on the basis of the Shares you submitted to that effort. This makes for more irregular income, but allows you to benefit when the pool has a lucky streak.
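The difference between the two models is easier to see side by side. This is a simplified sketch under my own assumptions: amounts are in satoshis for PPS to keep the arithmetic exact, all Shares have equal weight, pool fees are ignored, and the function names are invented.

```python
def pps_payout(my_shares, satoshis_per_share):
    """PPS: a fixed amount per Share, whether or not the pool finds a Block."""
    return my_shares * satoshis_per_share


def pplns_payout(block_reward, last_n_shares, member):
    """PPLNS: a cut of a Block reward, based on the last N Shares submitted.

    last_n_shares is the list of member names behind the N most recent
    Shares at the moment the Block was found.
    """
    if not last_n_shares:
        return 0.0
    return block_reward * last_n_shares.count(member) / len(last_n_shares)


print(pps_payout(100, 50))                           # paid regardless of luck
print(pplns_payout(12.5, ["a", "b", "a", "c"], "a"))  # paid only when a Block lands
```

Under PPLNS a member who stops contributing drops out of the last N Shares and earns nothing from the next Block, which is also how this model discourages jumping in and out of pools.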



How to monitor Docker containers with Nagios and NRPE

Monitoring whether or not a Docker container is alive on a remote host should be fairly easy, right?

The standard approach to this is to include a suitable NRPE script on the remote host, and call that remotely from your Nagios server via the NRPE TCP daemon on the remote host. This script is a good example of same, and we’ll refer to it in the rest of the article.

This generally works fine when you’re doing innocuous things like checking free disk space or if a certain process is running. Checking a Docker container is a little bit harder, because the command:

docker inspect

can only be run as root, whereas the NRPE service on the remote host runs as a non-privileged user (usually called nagios).

As such, when you test your NRPE call from the Nagios server, like so:

/usr/lib64/nagios/plugins/check_nrpe -H -c check_docker_container1

You will see a response like:

NRPE: Unable to read output


UNKNOWN - container1 does not exist.

You get this response because the nagios user cannot execute the docker control command.

You could get around this by running NRPE on the remote host as the root user, but that really isn’t a good idea, and you should never do this.

A better play (if you are confident that your Nagios setup is secure) is to extend controlled privileges to the nagios user via sudo. You can create the following file in /etc/sudoers.d/docker to achieve this:

nagios    ALL=(ALL:ALL)  NOPASSWD: /usr/bin/docker inspect *
nagios    ALL=(ALL:ALL)  NOPASSWD: /usr/lib64/nagios/plugins/ *

This allows the nagios user to run both the wrapper script around the docker inspect command and the docker control command itself, without requiring a password. Note, only inspect permission is granted. Obviously, we don’t want to give nagios permission to actually manipulate containers.

In addition to this, we must make provision for NRPE to run the command using sudo when called via the NRPE TCP daemon. So, in nrpe.cfg, instead of:

command[check_docker_container1]=/usr/lib64/nagios/plugins/ container1

we have:

command[check_docker_container1]=sudo /usr/lib64/nagios/plugins/ container1
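For reference, a wrapper of the kind discussed above can be sketched in a few lines of Python. This is not the script referred to earlier, just a minimal equivalent under my own assumptions: it takes the container name as its only argument, asks docker inspect whether the container is running, and returns the standard Nagios plugin exit codes.

```python
import subprocess
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def evaluate(name, returncode, stdout):
    """Turn the result of 'docker inspect' into a Nagios status and message."""
    if returncode != 0:
        # docker inspect exits non-zero when there is no such container.
        return UNKNOWN, f"UNKNOWN - {name} does not exist."
    if stdout.strip() == "true":
        return OK, f"OK - {name} is running."
    return CRITICAL, f"CRITICAL - {name} is not running."


def check_container(name):
    """Run docker inspect for one container and evaluate the outcome."""
    try:
        result = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.Running}}", name],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        return UNKNOWN, "UNKNOWN - docker command not found."
    return evaluate(name, result.returncode, result.stdout)


if __name__ == "__main__" and len(sys.argv) == 2:
    code, message = check_container(sys.argv[1])
    print(message)
    sys.exit(code)
```

Because NRPE calls the script through sudo, it runs with enough privilege for docker inspect to succeed, while the nagios user still has no way to start, stop or remove containers.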



Using Elasticsearch Logstash Kibana (ELK) to monitor server performance

There are myriad tools that claim to be able to monitor server performance for you, but when you’ve already got a sizeable bag of tools doing various automated operations, it’s always nice to be able to fulfil an operational requirement using one of those rather than having to onboard another one.

I love Elasticsearch. It can be a bit of a minefield to learn, but when you get to grips with it, and bolt on Kibana, you realise that there is very little you can’t do with it.

Even better, Amazon AWS now have their own Elasticsearch Service, so you can reap all the benefits of the technology without having to worry about maintaining a cluster of Elasticsearch servers.

In this case, my challenge was to expose performance data from a large fleet of Amazon EC2 server instances. Yes, there is a certain amount of data available in AWS Cloudwatch, but it lacks key metrics like memory usage and load average, which are invariably the metrics you most want to review.

One approach to this would be to put some sort of agent on the servers and have a server poll the agent, but again, that’s extra tools. Another approach would be to put scripts on the servers that push metrics to Cloudwatch, so that you can augment the existing EC2 Cloudwatch data. This was something we considered, but with this method, the metrics aren’t logged to the same place in Cloudwatch as the EC2 data, so it all felt a bit clunky. And you only get 2 weeks of backlog.

This is where we turned to Elasticsearch. We were already using Elasticsearch to store information about access to our S3 buckets, which we were happy with. I figured there had to be a way to leverage this to monitor server performance, so I set about some testing.

Our basic setup was a Logstash server using the S3 Input plugin, and the Elasticsearch output plugin, which was configured to send output to our Elasticsearch domain in AWS:

output {
 if [type] == "s3-access" {
     elasticsearch {
         index => "s3-access-%{+YYYY.MM.dd}"
         hosts => ["search-*********"]
         ssl => true
     }
 }
}

We now wanted to create a different type of index, which would hold our performance metric data. This data was going to be taken from lots of servers, so Logstash needed a way to ingest the data from lots of remote hosts. The easiest way to do this is with the Logstash input plugin syslog. We first set up Logstash to listen for syslog input.

input {
     syslog {
         type => syslog
         port => 8514
     }
}

We then get our servers to send their syslog output to our Logstash server, by giving them a universal rsyslogd configuration, where is our Logstash server:

#Logstash Configuration
$WorkDirectory /var/lib/rsyslog # where to place spool files
$template LogFormat,"%HOSTNAME% ops %syslogtag% %msg%"

We now update our output plugin in Logstash to create the necessary Index in Elasticsearch:

output {
 if [type] == "syslog" {
    elasticsearch {
       index => "test-syslog-%{+YYYY.MM.dd}"
       hosts => ["search-*********"]
       ssl => true
    }
 } else {
    elasticsearch {
       index => "s3-access-%{+YYYY.MM.dd}"
       hosts => ["search-*********"]
       ssl => true
    }
 }
}

Note that I have called the syslog Index “test-syslog-…”. I will explain this in a moment, but it’s important that you do this.

Once these steps have been completed, it should be possible to see syslog data in Kibana, as indexed by Logstash and stored in our AWS Elasticsearch domain.

Building on this, all we had to do next was get our performance metric data into the syslog stream on each of our servers. This is very easy. Logger is a handy little utility that comes pre-installed on most Linux distros that allows you to send messages to syslog (/var/log/messages by default).

We trialled this with Load Average. To get the data to syslog, we set up the following cronjob on each server:

* * * * * root cat /proc/loadavg | awk '{print "LoadAverage: " $1}' | xargs logger

This writes the following line to /var/log/messages every minute:

Jun 21 17:02:01 server1 root: LoadAverage: 0.14

It should then be possible to search for this line in Kibana

message: "LoadAverage"

to verify that it is being stored in Elasticsearch. When we do find results in Kibana, we can see that the LogFormat template we used in our server rsyslog conf has converted the log line to:

server1 ops root: LoadAverage: 0.02
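The same trick works for memory usage, the other metric missing from Cloudwatch. Load average is a one-liner with awk, but memory needs a little arithmetic, so here is a minimal sketch in Python. It assumes a kernel recent enough to report MemAvailable in /proc/meminfo, and the metric name MemoryUsedPercent is my own invention.

```python
import os


def memory_used_percent(meminfo_text):
    """Work out the percentage of memory in use from /proc/meminfo content."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]  # assumption: present on this kernel
    return round(100.0 * (total - available) / total, 2)


def metric_line(name, value):
    """Format a metric in the same 'Name: value' shape as the LoadAverage line."""
    return f"{name}: {value}"


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        print(metric_line("MemoryUsedPercent", memory_used_percent(handle.read())))
```

Saved somewhere like /usr/local/bin/memory_metric.py (a hypothetical path), it can be run from cron and piped to xargs logger in exactly the same way as the load average job.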

To really make this data useful however, we need to be able to perform visualisation logic on the data in Kibana. This means exposing the fields we require and making sure those fields have the correct data type for numerical visualisations. This involves using some extra filters in your Logstash configuration.

filter {
   if [type] == "syslog" {
       grok {
          match => { "message" => '(%{HOSTNAME:hostname})\s+ops\s+root:\s+(%{WORD:metric-name}): (%{NUMBER:metric-value:float})' }
       }
   }
}

This filter operates on the message field after it has been converted by rsyslog, rather than on the format of the log line in /var/log/messages. The crucial part of this is to expose the Load Average value (metric-value) as a float, so that Kibana/Elasticsearch can deal with it as a number rather than a string. If you only specify NUMBER as your grok data type, it will be exposed as a string, so you need to add the “:float” to complete the conversion to a numeric data type.
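If you want to check what the filter will extract before restarting Logstash, the grok expression can be approximated with an ordinary regular expression. The Python below is only a rough stand-in: the real HOSTNAME, WORD and NUMBER grok patterns are more elaborate than the simplified equivalents used here.

```python
import re

# Simplified equivalents of the grok patterns used in the filter:
#   HOSTNAME -> \S+    WORD -> \w+    NUMBER -> optional sign, digits, optional decimals
LINE_PATTERN = re.compile(
    r"(?P<hostname>\S+)\s+ops\s+root:\s+"
    r"(?P<metric_name>\w+): "
    r"(?P<metric_value>-?\d+(?:\.\d+)?)"
)


def parse_metric(message):
    """Extract hostname, metric name and numeric value from a syslog message."""
    match = LINE_PATTERN.search(message)
    if match is None:
        return None
    return {
        "hostname": match.group("hostname"),
        "metric-name": match.group("metric_name"),
        # The float() call plays the role of ':float' in the grok pattern.
        "metric-value": float(match.group("metric_value")),
    }


print(parse_metric("server1 ops root: LoadAverage: 0.02"))
```

Without the float() conversion the value would stay a string, which is exactly the situation the “:float” suffix avoids in Elasticsearch.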

To verify that it is exposed as a number, look in Kibana under Settings -> Indices. You should only have a single Index Pattern at this point (test-syslog-*). Refresh the field list for this, and search for “metric-value”. At this point, it may indicate that the data type for this is “String”, which we can now deal with. If it already has data type “Number”, you’re all set.

In Elasticsearch indices, you can only set the data type for a field when the index is created. If your “test-syslog-” index was created before we properly converted “metric-value” to a number, you can now create a new index and verify that metric-value is a number. To do this, update the output plugin in your Logstash configuration and restart Logstash.

output {
 if [type] == "syslog" {
    elasticsearch {
       index => "syslog-%{+YYYY.MM.dd}"
       hosts => ["search-*********"]
       ssl => true
    }
 }
}

A new Index (syslog-) will now be created. Delete the existing Index pattern in Kibana and create a new one for syslog-*, using @timestamp as the default time field. Once this has been created, Kibana will obtain an updated field list (after a few seconds), and in this, you should see that “metric-value” now has a data type of “Number”.

(For neatness, you may want to replace the “test-syslog-” index with a properly named index even if your data type for “metric-value” is already “Number”).

Now that you have the data you need in Elasticsearch, you can graph it with a visualisation.

First, set your interval to “Last Hour” and create/save a Search for what you want to graph, eg:

metric-name: "LoadAverage" AND hostname: "server1"

Now, create a Line Graph visualisation for that Search, setting the Y-Axis to Average for field “metric-value” and the X-axis to Date Histogram. Click “Apply” and you should see a graph like below:

[Screenshot: the resulting Kibana line graph, captured 2016-06-22]



Migrating MySQL from AWS RDS to EC2

Applications that use MySQL as their underlying RDBMS commonly evolve as follows:

  1. Application and MySQL server on same EC2 instance
  2. Application balanced between multiple EC2 instances and MySQL server moved to RDS instance
  3. MySQL server moved back to EC2 with DIY High Availability infrastructure
  4. MySQL server moved to Bare Metal with DIY High Availability infrastructure in Co-Lo data centre

As each of these migration steps arrives, the size of the dataset under MySQL management is larger, and the availability of the application more critical, making each step exponentially more complex.

In recent months, I have had to manage Step 3 in this life cycle (the migration back to EC2 from RDS). The following is an account of my experience.

The dataset involved was 4TB in size. That isn’t huge by today’s standards, but it’s large enough to involve multiple days of data transfer and to require something more than a mysqldump and import in your planning.

The dataset was also highly volatile, in that it was being augmented 24/7, and relied on stored procedures to aggregate data on a daily basis on which commercial SLAs were based. In other words, stopping updates to the dataset for anything more than a couple of hours was not an option.

Time pressure was a further consideration. RDS has a hard limit of 6TB of disk space for an instance (and a 2TB file size limit), and our application was due to introduce new functionality that would increase the rate of data accumulation dramatically. We estimated that we had 2-3 months to complete the transition before the 6TB limit appeared on the horizon.

We did our research and decided on a strategy. We would create a Read Replica of our RDS master and allow it to come into sync. When it was in sync, we would promote it to a standard RDS instance and note the replication point in the Bin Log. We would then do a full mysqldump of the database and inject that directly into our EC2 master, which we estimated would take 96 hours. When this was complete, we would make the EC2 master a slave of the RDS master, and start replication from the point in the Bin Log we had previously noted. We estimated that the data gap would take 18-20 hours to fill, after which we would have a full and intact dataset in EC2.

This plan was fine except for one detail. Because our data relies extensively on stored procedures, it requires a lot of RAM and CPU grunt to get through its workload. Under normal circumstances, we maintained a Read Replica for the RDS master, to allow for intensive read queries that would not impact on the processing capability of the RDS master. On occasion, when there were replication issues, the Bin Log on the RDS would grow rapidly, consuming several hundred GBs of disk space. This isn’t supposed to be an issue in MySQL, but the internal mechanics of RDS and how the Bin Log is managed seem to make it an issue. When we saw the Bin Log growing to this extent, performance on the RDS master rapidly degraded, requiring us to terminate replication completely (in order that RDS would flush the Bin Log).

Given that our plan involved allowing the Bin Log to grow over 96 hours, we were obviously concerned. We discussed this with our support partners, Percona, who recommended an alternative strategy.

They suggested using the MySQL Bin Log utility (mysqlbinlog) to back up the Bin Log to a location outside RDS, which we could then stream into our EC2 master. This would involve extra steps in the process, and tighter co-ordination, but it seemed to be a lot less risky in terms of impacting on the RDS master. Our new plan was therefore as follows:

  1. Ensure all applications are using a DNS record for MySQL server that has 0 sec TTL
  2. Create a Read Replica of the RDS master and allow to come in sync
  3. Stop replication on the replica, note the replication point and promote to master
  4. Configure RDS master to retain at least 12 hours of Bin Log, and wait for 12 hours (ensuring that Bin Log growth does not impact on performance during this time)
  5. Start Bin Log backup from RDS master to disk on EC2 master
  6. Commence mysqldump from RDS master and inject directly into EC2 master
  7. On completion of mysqldump and injection, start restore of Bin Log file into EC2 master
  8. Verify that RDS master and EC2 master are approximately in sync
  9. Pause updates to dataset in RDS master for approx. 1 hour
  10. Verify that RDS master and EC2 master are fully in sync
  11. Stop Bin Log backup and Bin Log restore
  12. Re-create stored procedures on EC2 master
  13. Change DNS record for MySQL Server to point to EC2 master
  14. Re-commence updates to dataset
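Steps 4 and 5 are the ones that differ from a conventional migration, so they are worth sketching. The Python below only assembles the statement and command involved; it does not run anything. The hostname, user, Bin Log file name and destination path are placeholders rather than our real values, and credentials are assumed to come from an option file such as ~/.my.cnf.

```python
def retention_sql(hours):
    """Statement to run on the RDS master so Bin Logs are kept for `hours` (step 4)."""
    return f"CALL mysql.rds_set_configuration('binlog retention hours', {int(hours)});"


def binlog_backup_command(host, user, first_binlog, dest_prefix):
    """mysqlbinlog invocation that streams raw Bin Log files to local disk (step 5)."""
    return [
        "mysqlbinlog",
        "--read-from-remote-server",       # pull the Bin Log over the network
        f"--host={host}",
        f"--user={user}",
        "--raw",                           # write binary log files, not SQL text
        "--stop-never",                    # keep following the master as it writes
        f"--result-file={dest_prefix}",    # with --raw this is a file name prefix
        first_binlog,                      # file containing the noted replication point
    ]


print(retention_sql(12))
print(" ".join(binlog_backup_command(
    "rds-master.example.com", "repl_user",
    "mysql-bin-changelog.000123", "/data/binlog-backup/",
)))
```

Keeping the backup running on the EC2 side means the RDS master only ever has to retain a few hours of Bin Log, rather than the 96 hours the original plan would have required.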

On completion of this process, we had moved our 4TB dataset from RDS to EC2 with only a 1 hour interruption in the data update process. For High Availability, we created 2 slaves and managed these with MySQL Utilities. We placed 2 HA Proxy nodes in front of this MySQL server farm and balanced traffic to the HA Proxy nodes with an Elastic Load Balancer listening for TCP (rather than HTTP) connections.

It’s probably also worth mentioning that EC2 also has disk limits. A single EBS volume can have a maximum size of 16TB. To overcome this, you can combine multiple EBS volumes into an LVM set, or use software based RAID 0. We were initially concerned about using these sorts of virtual disks for storing data, but this should be less of a concern when you remember that EBS itself has multiple layers of redundancy. We went for an LVM configuration.




Rackspace – Engineered to Fail

I’ve read several articles over the last few years comparing the cloud infrastructure services offered by Rackspace and Amazon AWS.

Typically, these articles arrive at no firm conclusion as to which is better, referring to issues like cost, support, availability etc.

As someone who has used both services for over 5 years, I find these articles incomprehensible. From a technical viewpoint, there is no comparison between these services. It’s like comparing an iPhone 6 to a pocket calculator. Both have a screen, a battery and a digital pulse, but when it comes to sophistication and functionality, they are for all intents and purposes different services.

To put it bluntly, Rackspace is a truly awful experience. They position themselves as a “managed” cloud services provider, which should begin to give an indication of the problem. The beauty of cloud services is that they don’t need to be managed. You buy them, consume them and dispose of them.

Being a “managed” cloud services provider is like being a “managed” self-service car wash. If the car wash machine is so complex, inflexible and unreliable as to require the constant attention of a human being to ensure that users can wash their cars, then those users might as well just go home and wash their cars in the drive (ie have on-premises infrastructure).

From what I can see, the difference between Amazon and Rackspace in this regard stems from their inception.

Amazon’s AWS platform was a spin-off from their shopping function. They had lots of spare compute capacity outside peak periods and decided to hire it out, along with the tools they used to manage it. As such, it was a battle-hardened infrastructure that was used in real, live-fire web environments, and felt familiar and well-designed to actual system engineers.

Rackspace’s service seems to have been designed by marketing professionals. It’s ridiculously basic, doesn’t seem to accommodate any future-proofing, and is totally inflexible. Much more attention seems to have gone into the marketing strategy (check out the number of pretty people on the Rackspace website, compared to the JPG-free Amazon AWS site) than their actual technology.

To illustrate this, I’m going to give a specific example. As well as backing up the points made above, I’m also hoping that this article will be picked up by search engines, as it highlights a major flaw in a certain Rackspace functionality, which would cause problems for Rackspace users if not addressed.

When you create a Rackspace Cloud server, you are given an option to schedule daily imaging of the server. That means you can create an offline copy of the server at a point in time, which you can restore at a later time to re-establish the functionality of that server.

To most people working in infrastructure operations, this means one thing: backup.

You think: “If I can make a daily image of my server, and hold the 7 most recent images, my backups are sorted.” Inevitably, that’s what a lot of Rackspace users are doing.

But here’s the thing (that you only find out when you ask the question): because of the way this imaging process works, the creation of the images will inevitably start to fail, and there is no mechanism in the platform to alert you when they do fail.

The explanation of the technology is given here:

To summarise:

A Rackspace server image is composed of 2 parts: the base image file when the image was first created, and the extended image file that contains all the changes to the image that have been made since the base was created.

That means that when you restore the image, the base is restored first, and then all the changes are applied to the base from the extended image.

That means that if the data on your server is changing, but not necessarily growing (eg you could be writing huge logs, or a huge database, but pruning effectively) the size of your image is constantly growing. For First Generation Rackspace servers, if an image gets to greater than 160GB/250GB (which is peanuts in today’s Big Data world) the imaging will fail. For Next Generation servers, there is apparently no limit, but check out the comments of the “Racker” on this Rackspace support thread:

“Next Gen has no limits for either Windows or Linux, but as an image gets really large, there may be an increased chance of the process failing (Things sometimes go wrong when you are talking about moving hundreds of GBs of data).”

Wow! Like who would need to manage “hundreds of GBs” of data in 2016?! What is this? Star Trek!?

This is consistent with what I was told on a support thread by another Racker, namely that imaging is offered on a “Best Effort” basis. Remember, this is bits and bytes technology we’re talking about here, where stuff normally works or doesn’t. We’re not talking about nuclear fusion.

The same Racker goes on to say:

“For customers who run into these limits, there is generally a larger issue though. The truth is that you really should NOT be using imaging as a backup solution. Think about it, does it really make sense to backup tons and tons of data every day when only a few things changed on the server? Do you really want to spin up a new Cloud Server just to recover a single file?”

That’s a sort of valid point, but here’s a question: if scheduled daily imaging isn’t suitable for backup, why the hell is scheduling daily imaging made available as a feature, inviting hapless Ops Engineers to think that their servers are being reliably backed up when really they are not? What exactly is the purpose of scheduled daily imaging if not for backup?

And the reason the point is only “sort of” valid is that there are times when you will need to make a full daily image of a server. Let’s say you have a MySQL server that has a 200GB data payload. You can’t run a mysqldump against that every night, because it will grind the server to a standstill. You have to do a bit-for-bit image of the system to back it up (as recognised by Amazon’s RDS service, where you can schedule daily snapshots of your RDS instances).

It actually gets worse.

Imaging can fail not only because of image size, but also because of “bugs” in the Rackspace platform. A few weeks back, I noticed that imaging on one of our smaller Rackspace servers had stopped working. I dialled up a support chat and asked the guy who responded what was going on.

Theodore R: Garreth! thank you for holding. We have a known bug in ORD that we've seen a few failures on scheduled images. To help with this. Go ahead and cancel the two jobs stuck at 0%. Then de-activiate the schedule then re-enable the schedule. I'm sorry about this it is a known issue we are working on resolving this.

Me: If you knew there was a bug why didn't you tell your customers?

Theodore R: I don't have that answer. As I'm front line support but I will bring that up to my manager in our team meeting today.

Theodore R: I do apologize about this

So they had a bug in their platform that has probably disabled scheduled images for hundreds of customers, which isn’t alerted, and they haven’t told anyone!
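Since the platform won't tell you when imaging stops, one option is to check the age of your newest image yourself. What follows is a minimal sketch in Python, not anything Rackspace provides. It assumes you have already fetched the creation times of your images by some means; the function name, the 26 hour threshold and the sample timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def newest_image_is_stale(image_times, now, max_age=timedelta(hours=26)):
    """Return True if there are no images, or the newest is older than max_age."""
    if not image_times:
        return True
    return now - max(image_times) > max_age


# Hypothetical data: daily images that silently stopped three days ago.
now = datetime(2016, 3, 10, 9, 0, tzinfo=timezone.utc)
images = [
    datetime(2016, 3, 6, 2, 0, tzinfo=timezone.utc),
    datetime(2016, 3, 7, 2, 0, tzinfo=timezone.utc),
]

if newest_image_is_stale(images, now):
    print("WARNING: newest server image is older than expected")
```

Run from cron and wired to an email, something along these lines would at least give you the alert that the platform doesn't.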

This is just a sample of the grind I go through with Rackspace every week. While writing this, I am monitoring a ticket they’ve opened to tell me that one of my servers has failed and they are working on it. I have been instructed:

“Please do not access or modify ‘<server-name>’ during this process.”

Of course, it doesn’t seem to dawn on them that this could be a public web server, with thousands of users knocking on the http door all the time, and the only way I can stop this is to log in to the server to shut down the web server, which I am apparently not supposed to do.

If you still don’t believe me, you can look at another piece of evidence. For the last year, Rackspace have been offering a service called “Fanatical Support for Amazon AWS” (Pretty People on web page? Check.)

Yes, you can pay Rackspace to “manage” your investment in their main competitor. This is basically Rackspace saying “Yes, we know our service is dogfood, but in order to keep the lights on, we’re going to try and squeeze a few dollars out of customers who’ve seen the light and are moving elsewhere.”

Like I said at the start, ignore the clickbait “comparison” articles. Rackspace is something you should avoid in your IT organisation in the same way you avoid IE6 and Blackberrys.

Never presume it’s safe to use your credit card online

In light of the recent information security breach at TalkTalk, I thought it would be a good opportunity to share my thoughts on information security and the use of credit cards to purchase goods and services online.

I am not by any means an information security expert, but I can credibly claim to have more knowledge of the subject than the average member of the public, and probably the average IT professional too.

My experience derives primarily from managing systems used to book flights and hotel rooms, in which customer credit cards are used as the method of payment. In my most recent role in this regard, over €40m of revenue per month was flowing through systems under my control.

To understand the underlying risk in passing sensitive data to a computer system you encounter online, the most important point to understand is that for every commercial organisation that ever existed, information security is a drag on profitability.

In itself, that isn’t surprising or unique. There are lots of business functions that are a drag on profitability. The difference with information security is that while its significance is generally understood by decision makers, the complexity of the risk involved is not, which means that when its drag factor is considered with every other drag factor, it tends to get bumped down the list more easily when decisions have to be made about priorities.

For instance, if a Marketing Director says to a CEO that a product launch needs to be delayed to develop a new advertising campaign, because a focus group indicated that the original marketing campaign was not appealing, the logic is immediately accessible to the CEO, and the decision is relatively simple.

If, on the other hand, an IT Director says to a CEO that a new product launch has to be delayed because the software underpinning the payment system for the product hasn’t been penetration tested for Cross Site Scripting vulnerabilities, the logic is less accessible, and the CEO will probably consider the situation in terms of probability, not logic.

“What are the chances of our software being targeted when there are millions of online systems for hackers to choose from? We’ve released products with this software before and everything was fine, so why not this time?”

This type of thinking is pervasive in corporations that rely on information systems and ask us to trust them with our data. When it comes to information security, more often than not, risk is considered in terms of what is probable, not what is possible. The critical flaw in this is that a decision maker’s estimation of probability is always influenced both by their experience and their wider commercial objectives. If they keep subconsciously diluting risk because it interferes with their commercial objectives, and nothing ever goes wrong, it becomes easier to dilute that risk further and further each time. More often than not they’ll get away with this. There are millions of online systems, and you need to be unlucky to be targeted, but you’re just as likely to be targeted as anyone else.

A real life example of this is readily available. I recently had to make a trip to England, and needed to book 4 days parking at Dublin Airport, which would cost in the region of €30, which I knew I would have to pay for with my credit card.

Being relatively familiar with what goes on behind the scenes on such sites, I tend to rank the security they offer higher than more obvious considerations like price. The gold standard for me is the availability of PayPal as a payment option. There was a time when asking users to pay using PayPal was seen as second rate, which resulted in many online retailers discounting it in favour of custom solutions that made their online presence seem more sophisticated.

Thankfully, those days are gone. PayPal have a proven track record when it comes to credit card security, and my advice would be to always choose PayPal as your payment option rather than entering your card details anew into an integrated payment system.

In the absence of PayPal, I would always favour payment solutions that link into payment gateways like Realex and WorldPay. These solutions require you to enter your credit card details, but these are either forwarded directly to the payment gateway provider, or entered on a page provided by the gateway provider. The key thing to understand is that the organisation you are purchasing the product or service from has very limited responsibility in terms of managing your credit card data. This is a good thing, because it means they have to deal with the issue of information security being a drag on profitability less frequently than organisations that attempt to process credit card payments themselves.

When neither of these options is obviously available, I will always look for an online statement from the product or service provider about how they deal with credit card payments. The key element to look for in such a statement is either attestation to or certification of what is known as PCI DSS compliance.

PCI DSS is a set of standards agreed by the credit card industry, which organisations storing or communicating credit card data are supposed to adhere to. There is no law or industry requirement that they should, although many banks will refuse to deal with organisations if they don’t.

Implementing PCI DSS compliance in an organisation is an onerous and expensive task. Not only is the baseline standard difficult to achieve, particularly if the organisation has grown rapidly and the standard has to be implemented retrospectively, but it is also evolving all the time, requiring organisations to devote dedicated resources to ensuring that processes and procedures are up to date. It’s also a standard that involves much more than software. It touches every part of the organisation, from HR to Accounting to Marketing to IT.

Any organisation that has been through this knows the pain involved, and when you achieve compliance, it’s something you want to tell people about. As such, if I were dealing with an organisation that has full PCI DSS compliance in place, I’d expect them to make that clear on their website.

Let’s take a look at the parking options available at Dublin Airport to see how they fared in relation to the payment options available to me.

The first one I tried, because it was the cheapest, was QuickPark. I got my quote and clicked through the booking process to where my credit card details were required. There was no PayPal option, and no evidence of the site using a payment gateway, so I started searching for some evidence of PCI DSS compliance. This is what I found on their Frequently Asked Questions page:

How do I know my booking details are secure?

You can rest assured that your personal data is safe with us. Every booking is encrypted via SSL protocol. Along with encryption we take all the measures required to keep all your personal data safe.

This is an incredibly anaemic statement for a website asking you to enter your credit card details. Their reference to the “SSL protocol” means that your details are encrypted while they are in transit from your computer to theirs (which is a very basic standard), but no information is provided regarding what happens to your details once they arrive at the other end. The reference to “SSL” is also telling. The primary protocols used in the transfer of data across the web are SSL and TLS. Up until about a year ago, SSL was predominant, until a flaw was discovered in it, resulting in the well-managed websites switching to TLS. The fact that QuickPark’s web site continues to refer to “SSL” is not encouraging, never mind the total absence of any reference to the PCI DSS standard.
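If you're comfortable with a little Python, you can check which protocol a site actually negotiates, rather than relying on the wording of its FAQ. This is a rough sketch using only the standard library. The function names are mine, and treating TLS 1.2 and above as acceptable is my own assumption rather than anything taken from a standard.

```python
import socket
import ssl


def negotiated_protocol(host, port=443, timeout=5.0):
    """Connect to host and return the protocol name agreed, e.g. 'TLSv1.3'.

    Uses Python's default (verifying) context, so a server that only offers
    protocols the local ssl library refuses will raise ssl.SSLError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()


def is_modern(protocol):
    """Assumption for this sketch: only TLS 1.2 and 1.3 count as acceptable."""
    return protocol in {"TLSv1.2", "TLSv1.3"}
```

You would call negotiated_protocol with the hostname of the site you are about to pay on, and pass the result to is_modern. Bear in mind this only tells you about the data in transit; it says nothing about what happens to your card details at the other end.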

My next attempt was on the Dublin Airport Authority’s website, where you can book parking in the parking areas owned by the airport. This was at

Again, there was no PayPal option, and no reference to a payment gateway provider, so I again went searching for a statement on information security. This is what I found on their Pre-Booking Frequently Asked Questions page:

How do I know my booking details are secure?

The details you provide are encrypted to prevent them being read over the internet. This is indicated by the GeoTrust icon on the car parking page. You can click on the icon for more detail.

This statement is similarly anaemic to the one provided by QuickPark, which surprised me, given that the Dublin Airport Authority is a long-standing semi-state body, compared to QuickPark, which is a relatively small private company.

Again there is reference to the transmission of data over the internet, but no reference to the management of data on the other side. The reference to the GeoTrust logo is meaningless, but was presumably included as it features the word “Trust”. Obtaining a GeoTrust logo for your website, or a logo from any one of hundreds of similar providers, costs about $20 per annum. All it signifies is from whom you bought your digital certificate to encrypt your data transmission. It means nothing in terms of how your data is managed by the company once it gets to its destination.

At this point, I decided to change tack, and did a Google search for “Dublin airport parking pci dss”.

The first result I got was for the Clayton Hotel, which is just off the motorway near the exit for the airport, and which offers car parking to airport users.

On their parking Frequently Asked Questions page they state:

How do I know my booking details are secure?

To ensure that you are trading in a secure environment, Clayton Hotel Dublin Airport has contracted the services of Advam. Advam is the leading provider of integrated global card services for private enterprise and government agencies in Australia and around the globe. Advam is a Tier 1 payment processor which adheres to the most stringent of industry accreditations including Level 1 PCI DSS compliance, EMV certification and ISO 9001 accreditation. When you enter your payment details online, you will notice that you will are using a secure site which uses 1024 bit tunnelling encryption to protect your information during transmission. Every transaction processed through Advam’s payment switch is protected by the latest in encryption technology and a combination of state of the art firewalls and intrusion detection systems guard every point of ingress and egress on the Advam network.

This was obviously cut and pasted from another document provided by the company referenced in it, Advam, but that in itself isn’t a problem. From this statement I can see that not only is the transmission of the data secure, but that responsibility for the data has been handed off to a third party who specialise in credit card security and who have achieved PCI DSS compliance. This is the type of statement I would expect to see from an organisation asking for my credit card data, and this was the option I chose.

In considering this example, I need to go back to my earlier point about information security being a drag on profitability. As noted, it’s one drag in a mix of many different drags, but it’s a drag that tends to get pushed aside because it isn’t one that decision makers can easily relate to.

When this is not the case, or in other words when a decision maker has decided to sufficiently prioritise information security, it’s a painful process for the organisation involved, and part of the payback is making sure that everyone knows the effort you’ve gone to, particularly if your competitors haven’t.

From that point of view, finding anaemic statements like those referred to above turns on a warning light for me. The absence of more comprehensive information about information security doesn’t mean that these organisations are insecure, but it does mean that they aren’t particularly bothered about promoting information security as a feature of their service offering, which suggests they haven’t invested particularly deeply in it.

At this juncture, it would be nice to be able to go back and view what TalkTalk said about their information security before they were hacked, but at the time of writing, the entire TalkTalk website is just one big blurb about how TalkTalk take information security more seriously than anything else. Presumably, it will be that way for some time.

That said, it’s unlikely that very many of TalkTalk’s customers ever bothered checking out their statements about information security.

If you don’t want to be in the same position they are in today, it’s probably a habit you should get into.



Thoughts on Ansible variables

If you want to use Ansible to really empower your configuration management function, it’s important to have a solid understanding of how variables work.

Here are a few must-knows:

Values in ansible.cfg are configuration settings, not script variables

The ansible.cfg file is provided to allow the user to set default values that are used when ansible is executed from a local environment. This file isn’t a YAML file (it uses INI syntax), which is why assignments use “=” rather than “:”.

The values in this file configure Ansible itself when it runs; each can also be overridden by a corresponding environment variable (eg ANSIBLE_REMOTE_USER). You cannot access these values directly as script variables eg

remote_user = root

does not provide you with a

{{ remote_user }}

variable in your playbooks.
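To make the point about the file format concrete, here is a small sketch that reads an ansible.cfg style snippet with Python's standard configparser module. It is only an illustration of the syntax: the sample text is made up, and this is not a claim about how Ansible itself loads the file.

```python
import configparser

# Made-up sample in ansible.cfg style: INI syntax, settings under [defaults].
SAMPLE_CFG = """
[defaults]
remote_user = root
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# The setting lives in a section of a config file, not in any
# playbook variable namespace.
remote_user = parser["defaults"]["remote_user"]
print(remote_user)
```

The value is a setting keyed by section and option name, which is a different thing from a variable you can reference with double curly braces in a play.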

It’s important to create your variables in the right place: inventory or play

Generally, a variable will apply to either a host (or group of hosts), or to a task (play) within a playbook. Decide early where your variable applies and create it in the right place.

For variables that apply to hosts (eg a username to login with) create the variable in either:

Your inventory file:


server1 ansible_ssh_user=admin

Under your group_vars directory:

#file: ./group_vars/server_group_1


Under your host_vars directory:

#file: ./host_vars/server1


You can also create host-related variables deeper in your playbook:

- hosts: webservers
  remote_user: admin

but I don’t recommend this. Ansible provides sufficient functionality to create an abstraction layer for variables above the play/task level, and it makes sense to use it.

For variables that are specific to plays, the value can be set closer to the point of execution, for example:

After the hosts specification:

- hosts: webservers
  vars:
    app_version: 12.03

As a parameter for the role that is being applied to the hosts:

- hosts: webservers
  roles:
    - { role: app, app_version: 12.03 }

Variables in Ansible have precedence rules

Particular care needs to be paid to precedence. In some instances, you may want a variable to have an absolute value which cannot be changed by an assignment in any other part of the playbook or from the command line. In other instances you may wish to allow a variable to be changed. These behaviours are controlled by where you create the assignment of the variable.
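As a rough mental model of how this plays out, the sketch below resolves a variable that has been set in several places. It is a toy in Python, not Ansible code, and the list is a deliberately simplified subset of Ansible's documented precedence order (the real list has many more levels). The level names and sample values are illustrative.

```python
# Simplified subset of Ansible's variable precedence, lowest to highest.
PRECEDENCE = [
    "role defaults",
    "inventory group_vars",
    "inventory host_vars",
    "play vars",
    "role params",
    "extra vars (-e)",
]


def resolve(name, sources):
    """Return (value, winning_level) for a variable set at several levels."""
    value, winner = None, None
    for level in PRECEDENCE:  # later (higher) levels override earlier ones
        if name in sources.get(level, {}):
            value, winner = sources[level][name], level
    return value, winner


sources = {
    "inventory group_vars": {"app_version": "12.02"},
    "play vars": {"app_version": "12.03"},
}
print(resolve("app_version", sources))

# Passing the variable on the command line beats everything else.
sources["extra vars (-e)"] = {"app_version": "12.04"}
print(resolve("app_version", sources))
```

The practical upshot: a value you want to be easy to override belongs low in the list, and nothing set in inventory or in a play will survive a value passed on the command line.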


Installing Passenger for Puppet on Amazon Linux


Puppet ships with a web server called WEBrick. This is fine for testing and for use with a small number of nodes, but will cause problems with larger fleets of nodes. It is recommended to use the Ruby application server, Passenger, to run Puppet in production environments.


Provision a new server instance.

Install required RPMs. Use Ruby 1.8 rather than Ruby 2.0. Both are shipped with the Amazon Linux AMI at the time of writing, but you need to set up the server to use version 1.8 by default.

sudo yum install -y ruby18 httpd httpd-devel mod_ssl ruby18-devel rubygems18 gcc mlocate
sudo yum install -y gcc-c++ libcurl-devel openssl-devel zlib-devel git

Make Ruby 1.8 the default

sudo alternatives --set ruby /usr/bin/ruby1.8

Set Apache to start at boot

sudo chkconfig httpd on

Install Passenger gem

sudo gem install rack passenger

Update the location DB (you will need this to find files later)

sudo updatedb

Find the path to the installer and add this to the path

locate passenger-install-apache2-module
sudo vi /etc/profile.d/
export PATH=$PATH:/usr/lib/ruby/gems/1.8/gems/passenger-5.0.10/bin/
sudo chmod 755 /etc/profile.d/

Make some Linux swap space (the installer will fail on smaller instances if this doesn’t exist)

sudo dd if=/dev/zero of=/swap bs=1M count=1024
sudo mkswap /swap
sudo chmod 0600 /swap
sudo swapon /swap

At this point, open a separate shell to the server (you should have 2 shells). This isn’t absolutely essential, but the installer will ask you to update an Apache file mid-flow, so if you want to do things to the letter of the law, a second shell helps.

Next, run the installer, and accept the default options.

sudo /usr/lib/ruby/gems/1.8/gems/passenger-5.0.10/bin/passenger-install-apache2-module

The installer will ask you to add some Apache configuration before it completes. Do this in your second shell. Add the config to a file called /etc/httpd/conf.d/puppet.conf. You can ignore any warning about the PATH.

<IfModule mod_passenger.c>
  PassengerRoot /usr/lib/ruby/gems/1.8/gems/passenger-5.0.10
  PassengerDefaultRuby /usr/bin/ruby1.8
</IfModule>

Restart Apache after you add this and then press Enter to complete the installation

Next, make the necessary directories for the Ruby application

sudo mkdir -p /usr/share/puppet/rack/puppetmasterd
sudo mkdir /usr/share/puppet/rack/puppetmasterd/public /usr/share/puppet/rack/puppetmasterd/tmp

Copy the application config file to the application directory and set the correct permissions

sudo cp /usr/share/puppet/ext/rack/files/ /usr/share/puppet/rack/puppetmasterd/
sudo chown puppet:puppet /usr/share/puppet/rack/puppetmasterd/

Add the necessary SSL config for the ruby application to Apache. You can append this to the existing puppet.conf file you created earlier. Note that you need to update this file to specify the correct file names and paths for your Puppet certs (puppet.pem in the example below). The entire file should now look like below:

LoadModule passenger_module /usr/lib/ruby/gems/1.8/gems/passenger-5.0.10/buildout/apache2/
<IfModule mod_passenger.c>
  PassengerRoot /usr/lib/ruby/gems/1.8/gems/passenger-5.0.10
  PassengerDefaultRuby /usr/bin/ruby1.8
</IfModule>
# And the passenger performance tuning settings:
# Set this to about 1.5 times the number of CPU cores in your master:
PassengerMaxPoolSize 12
# Recycle master processes after they service 1000 requests
PassengerMaxRequests 1000
# Stop processes if they sit idle for 10 minutes
PassengerPoolIdleTime 600
Listen 8140
<VirtualHost *:8140>
    # Make Apache hand off HTTP requests to Puppet earlier, at the cost of
    # interfering with mod_proxy, mod_rewrite, etc. See note below.
    PassengerHighPerformance On
    SSLEngine On
    # Only allow high security cryptography. Alter if needed for compatibility.
    SSLProtocol ALL -SSLv2 -SSLv3
    SSLHonorCipherOrder     on
    SSLCertificateFile      /var/lib/puppet/ssl/certs/puppet.pem
    SSLCertificateKeyFile   /var/lib/puppet/ssl/private_keys/puppet.pem
    SSLCertificateChainFile /var/lib/puppet/ssl/ca/ca_crt.pem
    SSLCACertificateFile    /var/lib/puppet/ssl/ca/ca_crt.pem
    SSLCARevocationFile     /var/lib/puppet/ssl/ca/ca_crl.pem
    #SSLCARevocationCheck   chain
    SSLVerifyClient         optional
    SSLVerifyDepth          1
    SSLOptions              +StdEnvVars +ExportCertData
    # Apache 2.4 introduces the SSLCARevocationCheck directive and sets it to none
    # which effectively disables CRL checking. If you are using Apache 2.4+ you must
    # specify 'SSLCARevocationCheck chain' to actually use the CRL.
    # These request headers are used to pass the client certificate
    # authentication information on to the puppet master process
    RequestHeader set X-SSL-Subject %{SSL_CLIENT_S_DN}e
    RequestHeader set X-Client-DN %{SSL_CLIENT_S_DN}e
    RequestHeader set X-Client-Verify %{SSL_CLIENT_VERIFY}e
    DocumentRoot /usr/share/puppet/rack/puppetmasterd/public
    <Directory /usr/share/puppet/rack/puppetmasterd/>
      Options None
      AllowOverride None
      # Apply the right behavior depending on Apache version.
      <IfVersion < 2.4>
        Order allow,deny
        Allow from all
      </IfVersion>
      <IfVersion >= 2.4>
        Require all granted
      </IfVersion>
    </Directory>
    ErrorLog /var/log/httpd/puppet-server.example.com_ssl_error.log
    CustomLog /var/log/httpd/puppet-server.example.com_ssl_access.log combined
</VirtualHost>
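The comment in the config suggests setting PassengerMaxPoolSize to about 1.5 times the number of CPU cores, so the value of 12 shown corresponds to an 8 core master. If your instance is a different size, the arithmetic can be sketched as below. The function name is mine, and rounding up with a minimum of 1 is my own choice, not something the Passenger or Puppet documentation prescribes.

```python
import math


def suggested_pool_size(cpu_cores, factor=1.5):
    """Round factor * cores up to a whole number of processes, minimum 1."""
    return max(1, math.ceil(cpu_cores * factor))


for cores in (1, 2, 4, 8):
    print(cores, "cores -> PassengerMaxPoolSize", suggested_pool_size(cores))
```

Treat the result as a starting point and adjust it against the memory available on the instance.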

The ruby application is now ready. Install the puppet master application. Note, do NOT start the puppetmaster service or set it to start at boot.

sudo yum install -y puppet-server

Restart Apache and test using a new puppet agent. You can also import the ssl assets from an existing puppet master into /var/lib/puppet/ssl. This will allow your existing puppet agents to continue to work.

Allowing puppet agents to manage their own certificates


Why would you want to allow a puppet agent to manage the certificates the puppet master holds for that agent? Doesn’t that defeat the whole purpose of certificate based authentication in puppet?

Well, yes, it does, but there are situations in which this is useful, but only where security is not a concern!!

Enter Cloud Computing.

Servers in Cloud Computing environments are like fruit flies. There are millions of them all over the world being born and dying at any given time. In an advanced Cloud configuration they can have lifespans of hours, if not minutes.

As puppet generally relies on fully qualified domain names to match agent requests to stored certificates, this can become a bit of a problem, as server instances that come and go in something like Amazon AWS can sometimes be required to have the same hostname at each launch.

Imagine the following scenario:

You are running automated performance testing, in which you want to test the amount of time it takes to re-stage an instance with a specific hostname and run some tests against it. Your script both launches the instance and expects the instance to contact a puppet master to obtain its application.

In this case, the first time the instance launches, the puppet agent will generate a client certificate signing request, send that to the master, get it signed and pull the necessary catalog. The puppet master will then have a certificate for that agent.

Now, you terminate the instance and re-launch it. The agent presents another signing request, with the same hostname, but this time the puppet master refuses to play, telling you that it already has a certificate for that hostname, and the one you are presenting doesn’t match.

You’re snookered.

Or so you think. The puppet master has a REST API that is disabled by default, but which you can open up to receive HTTP requests to manage certificates. To enable the necessary feature, add the following to your auth.conf file:

path /certificate_status
auth any
method find, save, destroy
allow *

Restart the puppet master when you’ve done this.

sudo service puppetmaster restart

Next, when you start your server instance, include the following script at boot. It doesn’t actually matter when this is run, provided it is run after the hostname of the instance has been set.


curl -k -X PUT -H "Content-Type: text/pson" --data '{"desired_state":"revoked"}' https://puppet:8140/production/certificate_status/$HOSTNAME

curl -k -X DELETE -H "Accept: pson"  https://puppet:8140/production/certificate_status/$HOSTNAME

rm -Rf /var/lib/puppet/ssl/*

puppet agent -t

This will revoke and delete the agent certificate on the master, delete the agent’s copy of the certificate and renew the signing process, giving you new certs on the agent and master and allowing the catalog to be ingested into the agent.
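If your tooling is in Python rather than shell, the two curl calls in the boot script can be expressed with the standard library along the following lines. This is a sketch rather than a tested replacement. The master name puppet, port 8140 and the production environment are taken from the curl commands; the function names are mine, and the sending function turns off certificate verification in the same way that curl -k does.

```python
import json
import ssl
import urllib.request


def build_cert_cleanup_requests(hostname, master="puppet", port=8140,
                                environment="production"):
    """Build the revoke (PUT) and delete (DELETE) requests for one hostname."""
    url = f"https://{master}:{port}/{environment}/certificate_status/{hostname}"
    revoke = urllib.request.Request(
        url,
        data=json.dumps({"desired_state": "revoked"}).encode("utf-8"),
        headers={"Content-Type": "text/pson"},
        method="PUT",
    )
    delete = urllib.request.Request(
        url, headers={"Accept": "pson"}, method="DELETE"
    )
    return [revoke, delete]


def send_insecurely(requests):
    """Send the requests without verifying the master's certificate (like curl -k)."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    for request in requests:
        with urllib.request.urlopen(request, context=context) as response:
            print(request.get_method(), request.full_url, response.status)
```

Building the requests separately from sending them means you can inspect the URLs before anything touches the master. The agent-side steps (removing /var/lib/puppet/ssl and running puppet agent -t) still have to happen on the instance afterwards.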

You can also pass a script like this as part of the Amazon EC2 process of launching an instance.

aws ec2 run-instances  --user-data file://./

Here, the file:// argument names the locally saved script file, which is assumed to be in your working directory (otherwise include the absolute path).

With this in place, each time you launch a new instance, regardless of its hostname, it will revoke any existing cert that has the same hostname, and generate a new one.

Obviously, if you are launching hundreds of instances at the same time, you may have concurrency issues, and some other solution will be required.

Again, this is only a solution for environments where security is not an issue.