Applications that use MySQL as their underlying RDBMS commonly evolve as follows:
- Application and MySQL server on same EC2 instance
- Application balanced between multiple EC2 instances and MySQL server moved to RDS instance
- MySQL server moved back to EC2 with DIY High Availability infrastructure
- MySQL server moved to Bare Metal with DIY High Availability infrastructure in Co-Lo data centre
As each of these migration steps arrives, the size of the dataset under MySQL management is larger, and the availability of the application more critical, making each step exponentially more complex.
In recent months, I have had to manage Step 3 in this live cycle (the migration back to EC2 from RDS). The following is an account of my experience.
The dataset involved was 4TB in size. That isn’t huge by today’s standards, but its large enough to involve multiple days of data transfer and to require something more than a mysqldump and import in your planning.
The dataset was also highly volatile, in that it was being augmented 24/7, and relied on stored procedures to aggregate data on a daily basis on which commercial SLAs were based. In other words, stopping updates to the dataset for anything more than a couple of hours was not an option.
Time pressure was a further consideration. RDS has a hard limit of 6TB of disk space for an instance (and a 2TB file size limit), and our application was due to introduce new functionality that would increase the rate of data accumulation dramatically. We estimated that we had 2-3 months to complete the transition before the 6TB limit appeared on the horizon.
We did our research and decided on a strategy. We would create a Read Replica of our RDS master and allow it to come into sync. When it was in sync, we would promote it to a standard RDS instance and note the replication point in the Bin Log. We would then do a full mysqldump of the database and inject that directly into our EC2 master, which we estimated would take 96 hours. When this was complete, we would make the EC2 master a slave of the RDS master, and start replication from the point in the Bin Log we had previously noted. We estimated that the data gap would take 18-20 hours to fill, after which we would have a full and intact dataset in EC2.
This plan was fine except for one detail. Because of data relies extensively on stored procedures, it requires a lot of RAM and CPU grunt to get through its workload. Under normal circumstances, we maintained a Read Replica for the RDS master, to allow for intensive read queries that would not impact on the processing capability of the RDS master. On occasion, when there were replication issues, the Bin Log on the RDS would grow rapidly, consuming several hundred GBs of disk space. This isn’t supposed to be an issue in MySQL, but the internal mechanics of RDS and how the Bin Log is managed seem to make it an issue. We we saw the Bin Log growing to this extent, performance on the RDS master rapidly degraded, requiring us to terminate replication completely (in order that RDS would flush the Bin Log).
Given that our plan involved allowing the Bin Log to grow over 96 hours, we were obviously concerned. We discussed this with our support partners, Percona, who recommend an alternative strategy.
They suggested using the MySQL Bin Log utility to back up the Bin Log to location outside RDS, which we could then stream into our EC2 master. This would involved extra steps in the process, and tighter co-ordination, but it seemed to be a lot less riskier in terms of impacting on the RDS master. Our new plan was therefore as follows:
- Ensure all applications are using a DNS record for MySQL server that has 0 sec TTL
- Create a Read Replica of the RDS master and allow to come in sync
- Stop replication on the replica, note the replication point and promote to master
- Configure RDS master to retain at least 12 hours of Bin Log, and wait for 12 hours (ensuring that Bin Log growth does not impact on performance during this time)
- Start Bin Log backup from RDS master to disk on EC2 master
- Commence mysqldump from RDS master and inject directly into EC2 master
- On completion of mysqldump and injection, start restore of Bin Log file into EC2 master
- Verify that RDS master and EC2 master are approximately in sync
- Pause updates to dataset in RDS master for approx. 1 hour
- Verify that RDS master and EC2 master are fully in sync
- Stop Bin Log backup and Bin Log restore
- Re-create stored procedures on EC2 master
- Change DNS record for MySQL Server to point to EC2 master
- Re-commence updates to dataset
On completion of this process, we had moved our 4TB dataset from RDS to EC2 with only a 1 our interruption in the data update process. For High Availability, we created 2 slaves and managed these with MySQL Utilities. We placed 2 HA Proxy nodes in front of this MySQL server farm and balanced traffic to the HA Proxy nodes with an Elastic Load Balancer listening for TCP (rather than HTTP) connections.
Its probably also worth mentioned that EC2 also has disk limits. A single EBS volumes can have a maximum size of 16TB. To overcome this, you can combine multiple EBS volumes into an LVM set, or use software based RAID 0. We were initially concerned about using these sort of virtual disks for storing data, but this should be less of a concern when you remember than EBS itself has multiple layers of redundancy. We went for an LVM configuration.