Category Archives: Data

Migrating MySQL from AWS RDS to EC2

Applications that use MySQL as their underlying RDBMS commonly evolve as follows:

  1. Application and MySQL server on same EC2 instance
  2. Application balanced between multiple EC2 instances and MySQL server moved to RDS instance
  3. MySQL server moved back to EC2 with DIY High Availability infrastructure
  4. MySQL server moved to Bare Metal with DIY High Availability infrastructure in Co-Lo data centre

As each of these migration steps arrives, the size of the dataset under MySQL management is larger, and the availability of the application more critical, making each step significantly more complex than the last.

In recent months, I have had to manage Step 3 in this life cycle (the migration back to EC2 from RDS). The following is an account of my experience.

The dataset involved was 4TB in size. That isn’t huge by today’s standards, but it’s large enough to involve multiple days of data transfer and to require something more than a mysqldump and import in your planning.

The dataset was also highly volatile, in that it was being augmented 24/7, and it relied on stored procedures to produce daily aggregations on which commercial SLAs were based. In other words, stopping updates to the dataset for anything more than a couple of hours was not an option.

Time pressure was a further consideration. RDS has a hard limit of 6TB of disk space for an instance (and a 2TB file size limit), and our application was due to introduce new functionality that would increase the rate of data accumulation dramatically. We estimated that we had 2-3 months to complete the transition before the 6TB limit appeared on the horizon.

We did our research and decided on a strategy. We would create a Read Replica of our RDS master and allow it to come into sync. When it was in sync, we would promote it to a standard RDS instance and note the replication point in the Bin Log. We would then do a full mysqldump of the database and inject that directly into our EC2 master, which we estimated would take 96 hours. When this was complete, we would make the EC2 master a slave of the RDS master, and start replication from the point in the Bin Log we had previously noted. We estimated that the data gap would take 18-20 hours to fill, after which we would have a full and intact dataset in EC2.
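
For illustration, the core of that plan boils down to a handful of commands. The following is a minimal sketch only; the hostnames, users and binlog coordinates are placeholders rather than our real values:

# Sketch of the original plan (all hostnames, credentials and binlog coordinates are placeholders)
# 1. On the promoted replica, note the point at which it stopped replicating from the RDS master:
mysql -h rds-replica.example.com -u admin -p -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Relay_Master_Log_File|Exec_Master_Log_Pos'

# 2. Dump the promoted replica and pipe it straight into the EC2 master
#    (--single-transaction takes a consistent InnoDB snapshot without locking):
mysqldump -h rds-replica.example.com -u admin -p \
  --single-transaction --routines --triggers --databases mydb \
  | mysql -h ec2-master.example.com -u admin -p

# 3. Make the EC2 master a slave of the RDS master from the noted position:
mysql -h ec2-master.example.com -u admin -p -e "
  CHANGE MASTER TO
    MASTER_HOST='rds-master.example.com',
    MASTER_USER='repl',
    MASTER_PASSWORD='<password>',
    MASTER_LOG_FILE='mysql-bin-changelog.012345',
    MASTER_LOG_POS=120;
  START SLAVE;"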

This plan was fine except for one detail. Because our data processing relies extensively on stored procedures, it requires a lot of RAM and CPU grunt to get through its workload. Under normal circumstances, we maintained a Read Replica of the RDS master, to allow for intensive read queries that would not impact the processing capability of the RDS master. On occasion, when there were replication issues, the Bin Log on the RDS master would grow rapidly, consuming several hundred GBs of disk space. This isn’t supposed to be an issue in MySQL, but the internal mechanics of RDS and how the Bin Log is managed seem to make it one. When we saw the Bin Log growing to this extent, performance on the RDS master rapidly degraded, requiring us to terminate replication completely (so that RDS would flush the Bin Log).

Given that our plan involved allowing the Bin Log to grow over 96 hours, we were obviously concerned. We discussed this with our support partners, Percona, who recommended an alternative strategy.

They suggested using the MySQL Bin Log utility (mysqlbinlog) to back up the Bin Log to a location outside RDS, which we could then stream into our EC2 master. This would involve extra steps in the process, and tighter co-ordination, but it seemed a lot less risky in terms of impact on the RDS master. Our new plan was therefore as follows (the key commands are sketched after the list):

  1. Ensure all applications are using a DNS record for MySQL server that has 0 sec TTL
  2. Create a Read Replica of the RDS master and allow it to come into sync
  3. Stop replication on the replica, note the replication point and promote to master
  4. Configure RDS master to retain at least 12 hours of Bin Log, and wait for 12 hours (ensuring that Bin Log growth does not impact on performance during this time)
  5. Start Bin Log backup from RDS master to disk on EC2 master
  6. Commence mysqldump from RDS master and inject directly into EC2 master
  7. On completion of mysqldump and injection, start restore of Bin Log file into EC2 master
  8. Verify that RDS master and EC2 master are approximately in sync
  9. Pause updates to dataset in RDS master for approx. 1 hour
  10. Verify that RDS master and EC2 master are fully in sync
  11. Stop Bin Log backup and Bin Log restore
  12. Re-create stored procedures on EC2 master
  13. Change DNS record for MySQL Server to point to EC2 master
  14. Re-commence updates to dataset
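
A rough sketch of the Bin Log related steps (4, 5 and 7) is below. Hostnames and binlog file names are placeholders, mysql.rds_set_configuration is the RDS-provided procedure for setting binlog retention, and the --raw and --stop-never options require a mysqlbinlog client from MySQL 5.6 or later:

# Step 4: ask RDS to retain binary logs for at least 12 hours
mysql -h rds-master.example.com -u admin -p \
  -e "CALL mysql.rds_set_configuration('binlog retention hours', 12);"

# Step 5: stream the RDS master's Bin Log to local disk on the EC2 master
# (left running for the duration, e.g. inside screen; the starting file is the one noted at promotion)
mysqlbinlog --read-from-remote-server --raw --stop-never \
  --host=rds-master.example.com --user=repl --password \
  mysql-bin-changelog.012345

# Step 7: replay the captured Bin Log files into the EC2 master,
# starting from the position noted when the replica was promoted
mysqlbinlog --start-position=120 mysql-bin-changelog.* \
  | mysql -h ec2-master.example.com -u admin -p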

On completion of this process, we had moved our 4TB dataset from RDS to EC2 with only a 1 hour interruption in the data update process. For High Availability, we created 2 slaves and managed these with MySQL Utilities. We placed 2 HAProxy nodes in front of this MySQL server farm and balanced traffic to the HAProxy nodes with an Elastic Load Balancer listening for TCP (rather than HTTP) connections.
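
The HAProxy side of that is conceptually simple. A minimal sketch of a TCP frontend for the MySQL farm might look like the following (the IPs, server names and the question of separating reads from writes are illustrative only; a production config would be more involved):

listen mysql-farm
    bind *:3306
    mode tcp
    option tcplog
    balance leastconn
    server mysql-node1 10.0.1.11:3306 check
    server mysql-node2 10.0.1.12:3306 check
    server mysql-node3 10.0.1.13:3306 check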

It’s probably also worth mentioning that EC2 has disk limits too. A single EBS volume can have a maximum size of 16TB. To overcome this, you can combine multiple EBS volumes into an LVM set, or use software-based RAID 0. We were initially concerned about using this sort of virtual disk for storing data, but this should be less of a concern when you remember that EBS itself has multiple layers of redundancy. We went for an LVM configuration.
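
As a sketch, building an LVM set from three hypothetical EBS volumes looks something like this (device names vary by instance type, and you would size and stripe to suit your own workload):

# Combine three EBS volumes into one striped logical volume and mount it for MySQL
pvcreate /dev/xvdf /dev/xvdg /dev/xvdh
vgcreate vg_mysql /dev/xvdf /dev/xvdg /dev/xvdh
lvcreate -i 3 -I 256 -l 100%FREE -n lv_data vg_mysql   # -i 3 stripes across the three volumes
mkfs.xfs /dev/vg_mysql/lv_data
mount /dev/vg_mysql/lv_data /var/lib/mysql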


Managing Logstash with the Redis Client

Users of Logstash will be familiar with the stack of technologies required to implement a logstash solution:

  1. The client that ships the logs to Redis
  2. Redis, which queues up the files for indexing
  3. Logstash, which creates the indices
  4. Elasticsearch, which stores the indices
  5. Kibana, which queries Elasticsearch

When you’re dealing with multiple components like this, things will inevitably go wrong.

For instance, say for some reason your client stops, and then you start it again 4 days later; now the stack has to process 4 days of old log files before letting you search the latest ones.

One of the best ways to deal with this is to set up the Redis queue (“list” is the correct term) so that you can selectively remove entries from it, allowing chunks of old logs to be skipped.

Take a look at this config from the logstash shipper:


output {
  stdout { debug => false debug_format => "json"}
  redis { host => "172.32.1.172" data_type => "channel" key => "logstash-%{@type}-%{+yyyy.MM.dd.HH}" }
}

You’ll see here that I’ve modified the default key value for logstash by adding the log file type and date stamp to the key. The default key value in the Logstash documentation is “logstash”, which means every entry goes into Redis with the same key value.

You will also notice that I have changed the data_type from the default “list” to “channel”, more on which in a moment.

To see what this means, you should now log in to your Redis server with the standard redis-cli command line interface.

To list all available keys, just type


KEYS *logstash*

and you will get something like


redis 127.0.0.1:6379> keys *logstash*
 1) "logstash-nodelog-2014.03.07.17"
 2) "logstash-javalog-2014.03.07.15"
 3) "logstash-applog-2014.03.07.14"
 4) "logstash-catalina-2014.03.08.23"
 5) "logstash-applog-2014.03.08.23"
 6) "logstash-catalina-2014.03.07.15"
 7) "logstash-nodelog-2014.03.07.14"
 8) "logstash-javalog-2014.03.07.14"
 9) "logstash-nodelog-2014.03.08.23"
10) "logstash-applog-2014.03.07.15"
11) "logstash-javalog-2014.03.08.23"

This shows that your log data is now stored in Redis according to log file type, date and hour, rather than all under the single default “logstash” key.

You also need to change the indexer configuration at this point, so that it looks for multiple keys in Redis rather than just the “logstash” key.


input {
  redis {
    host => "127.0.0.1"
    type => "redis-input"
    # these settings should match the output of the agent
    data_type => "pattern_channel"
    key => "logstash*"

    # We use json_event here since the sender is a logstash agent
    format => "json_event"
  }
}

For data_type here, I am using “pattern_channel”, which means the indexer will ingest data from any key matching the pattern “logstash*”.

If you don’t change this, and you have changed your shipper, none of your data will get to Elasticsearch.

Using Redis in this way also requires a change to the default Redis configuration. When Logstash keys are stored in Redis in a List format, the List is constantly popped by the Logstash indexer, so it remains in a steady state in terms of memory usage.

When the Logstash indexer pulls data from a Redis channel, the data isn’t removed from Redis, so memory usage grows.

To deal with this, you need to set up memory management in Redis, namely:

maxmemory 500mb
maxmemory-policy allkeys-lru

What this means is that when Redis reaches a limit of 500mb of used memory, it will drop keys according to a “Least Recently Used” algorithm. The default policy is volatile-lru, which depends on the TTL value of the key, but as Logstash doesn’t set a TTL on its Redis keys, we need to use the allkeys-lru alternative instead.
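
If you would rather not restart Redis, the same settings can also be applied at runtime from redis-cli (on reasonably recent Redis versions), for example:

redis 127.0.0.1:6379> CONFIG SET maxmemory 500mb
OK
redis 127.0.0.1:6379> CONFIG SET maxmemory-policy allkeys-lru
OK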

Now, if you want to remove a particular log file type for a particular date and hour from the Logstash process, you can simply delete that data from Redis:


DEL logstash-javalog-2014.03.08.23

You can also check the length of individual lists by using LLEN, to give you an idea of which logs from which dates and times will take the longest to process:


redis 127.0.0.1:6379> llen logstash-javalog-2014.03.08.23
(integer) 385460

You can also check your memory consumption in Redis with:

redis 127.0.0.1:6379> info