Rackspace – Engineered to Fail

I’ve read several articles over the last few years comparing the cloud infrastructure services offered by Rackspace and Amazon AWS.

Typically, these articles arrive at no firm conclusion as to which is better, citing issues like cost, support and availability.

As someone who has used both services for over 5 years, I find these articles incomprehensible. From a technical viewpoint, there is no comparison between these services. It’s like comparing an iPhone 6 to a pocket calculator. Both have a screen, a battery and a digital pulse, but when it comes to sophistication and functionality, they are, for all intents and purposes, completely different services.

To put it bluntly, Rackspace is a truly awful experience. They position themselves as a “managed” cloud services provider, which should begin to give an indication of the problem. The beauty of cloud services is that they don’t need to be managed. You buy them, consume them and dispose of them.

Being a “managed” cloud services provider is like being a “managed” self-service car wash. If the car wash machine is so complex, inflexible and unreliable that it requires the constant attention of a human being to ensure that users can wash their cars, then those users might as well just go home and wash their cars in the drive (i.e. run on-premises infrastructure).

From what I can see, the difference between Amazon and Rackspace in this regard stems from their inception.

Amazon’s AWS platform was a spin-off from their shopping function. They had lots of spare compute capacity outside peak periods and decided to hire it out, along with the tools they used to manage it. As such, it was a battle-hardened infrastructure that was used in real, live-fire web environments, and felt familiar and well-designed to actual system engineers.

Rackspace’s service seems to have been designed by marketing professionals. It’s ridiculously basic, doesn’t seem to accommodate any future-proofing, and is totally inflexible. Much more attention seems to have gone into the marketing strategy (check out the number of pretty people on the Rackspace website, compared to the JPG-free Amazon AWS site) than into the actual technology.

To illustrate this, I’m going to give a specific example. As well as backing up the points made above, I’m also hoping that this article will be picked up by search engines, as it highlights a major flaw in a certain Rackspace functionality, which would cause problems for Rackspace users if not addressed.

When you create a Rackspace Cloud server, you are given an option to schedule daily imaging of the server. That means you can create an offline copy of the server at a point in time, which you can restore at a later time to re-establish the functionality of that server.

To most people working in infrastructure operations, this means one thing: backup.

You think: “If I can make a daily image of my server, and hold the 7 most recent images, my backups are sorted.” Inevitably, that’s what a lot of Rackspace users are doing.

But here’s the thing (which you only find out when you ask the question): because of the way this imaging process works, the creation of the images will inevitably start to fail, and there is no mechanism in the platform to alert you when they do.
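Because the platform won’t tell you when scheduled imaging has quietly stopped working, the only safe option is to check for it yourself, from outside. Below is a minimal sketch in Python of the kind of check I mean, assuming you can pull the server’s image records (a status and an ISO-8601 creation timestamp) from the provider’s images API; the field names, the fetch step and the 36-hour threshold are my own assumptions rather than anything Rackspace documents.

from datetime import datetime, timedelta

MAX_IMAGE_AGE = timedelta(hours=36)  # alert if no completed image in roughly 1.5 days

def newest_active_image(images):
    # Keep only images whose status says the build actually completed.
    active = [img for img in images if img.get("status", "").upper() == "ACTIVE"]
    return max(active, key=lambda img: img["created"]) if active else None

def check_images(images, now=None):
    # Returns (ok, detail) describing whether scheduled imaging looks healthy.
    now = now or datetime.utcnow()
    newest = newest_active_image(images)
    if newest is None:
        return False, "no successfully completed images found"
    created = datetime.strptime(newest["created"], "%Y-%m-%dT%H:%M:%SZ")
    detail = "newest completed image was created " + newest["created"]
    return (now - created) <= MAX_IMAGE_AGE, detail

if __name__ == "__main__":
    # Illustrative records only; in practice you would fetch these from the images API.
    images = [
        {"status": "ACTIVE", "created": "2016-05-01T02:00:00Z"},
        {"status": "ERROR",  "created": "2016-05-02T02:00:00Z"},
    ]
    ok, detail = check_images(images)
    print(("OK: " if ok else "ALERT: ") + detail)

Run something like that from cron on a machine outside Rackspace and wire the ALERT output into whatever paging you already use.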

The explanation of how the imaging technology works is given here:

https://community.rackspace.com/products/f/25/t/3778

To summarise:

A Rackspace server image is composed of two parts: the base image file, created when the image was first taken, and the extended image file, which contains all the changes made to the server since the base was created.

That means that when you restore the image, the base is restored first, and then all the changes from the extended image are applied on top of it.

That means that if the data on your server is changing, but not necessarily growing (e.g. you could be writing huge logs, or a huge database, but pruning effectively), the size of your image is constantly growing. For First Generation Rackspace servers, if an image gets to greater than 160GB/250GB (which is peanuts in today’s Big Data world), the imaging will fail. For Next Generation servers, there is apparently no limit, but check out the comments of the “Racker” on this Rackspace support thread:

https://community.rackspace.com/general/f/34/t/461?pi1830=1

“Next Gen has no limits for either Windows or Linux, but as an image gets really large, there may be an increased chance of the process failing (Things sometimes go wrong when you are talking about moving hundreds of GBs of data).”

Wow! Like who would need to manage “hundreds of GBs” of data in 2016?! What is this? Star Trek!?
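Back to those First Generation limits for a second: if you are still on one of those servers, it is worth knowing how much headroom you have before trusting scheduled imaging at all. Here is a rough sketch; treating used disk space as a proxy for eventual image size is my assumption, and the 160GB figure is just the lower of the two limits quoted above, so adjust both to your situation.

import shutil

FIRST_GEN_LIMIT_GB = 160  # lower of the two limits quoted above; an assumption, adjust as needed

def headroom_gb(path="/", limit_gb=FIRST_GEN_LIMIT_GB):
    # Used disk space in GB, subtracted from the quoted image-size ceiling.
    used_gb = shutil.disk_usage(path).used / (1024 ** 3)
    return limit_gb - used_gb

if __name__ == "__main__":
    remaining = headroom_gb()
    if remaining <= 0:
        print("ALERT: used disk space already exceeds the First Gen image limit")
    else:
        print("%.1f GB of headroom before the First Gen image limit" % remaining)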

That Racker’s comment is consistent with what I was told on a support thread by another Racker, namely that imaging is offered on a “Best Effort” basis. Remember, this is bits-and-bytes technology we’re talking about here, where stuff normally either works or it doesn’t. We’re not talking about nuclear fusion.

The same Racker goes on to say:

“For customers who run into these limits, there is generally a larger issue though. The truth is that you really should NOT be using imaging as a backup solution. Think about it, does it really make sense to backup tons and tons of data every day when only a few things changed on the server? Do you really want to spin up a new Cloud Server just to recover a single file?”

That’s a sort-of-valid point, but here’s a question: if scheduled daily imaging isn’t suitable for backup, why the hell is it made available as a feature, inviting hapless Ops Engineers to think that their servers are being reliably backed up when really they are not? What exactly is the purpose of scheduled daily imaging, if not backup?

And the reason the point is only “sort of” valid is that there are times when you will need to take a full daily image of a server. Let’s say you have a MySQL server with a 200GB data payload. You can’t run a mysqldump against that every night, because it will grind the server to a standstill. You have to take a bit-for-bit image of the system to back it up (as recognised by Amazon’s RDS service, which lets you schedule daily snapshots of your RDS instances).
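For contrast, this is roughly what scheduling those daily RDS snapshots looks like with boto3: set a backup retention period and RDS takes an automated snapshot every day in the window you choose. A sketch only; the region, instance identifier and backup window below are placeholders.

import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # region is a placeholder

# Keep 7 days of automated daily snapshots, taken in the given backup window.
rds.modify_db_instance(
    DBInstanceIdentifier="example-mysql-instance",  # placeholder instance name
    BackupRetentionPeriod=7,
    PreferredBackupWindow="02:00-03:00",
    ApplyImmediately=True,
)

One call like that and the platform handles both the daily snapshot and the retention; that is the bar a “managed” provider’s scheduled imaging should be measured against.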

Back on Rackspace, it actually gets worse.

Imaging can fail not only because of image size, but also because of “bugs” in the Rackspace platform. A few weeks back, I noticed that imaging on one of our smaller Rackspace servers had stopped working. I dialled up a support chat and asked the guy who responded what was going on.

Theodore R: Garreth! thank you for holding. We have a known bug in ORD that we've seen a few failures on scheduled images. To help with this. Go ahead and cancel the two jobs stuck at 0%. Then de-activiate the schedule then re-enable the schedule. I'm sorry about this it is a known issue we are working on resolving this.

Me: If you knew there was a bug why didn't you tell your customers?

Theodore R: I don't have that answer. As I'm front line support but I will bring that up to my manager in our team meeting today.

Theodore R: I do apologize about this

So they had a bug in their platform that has probably disabled scheduled images for hundreds of customers, which isn’t alerted on, and they haven’t told anyone!

This is just a sample of the grind I go through with Rackspace every week. While writing this, I am monitoring a ticket they’ve opened to tell me that one of my servers has failed and they are working on it. I have been instructed:

“Please do not access or modify ‘<server-name>’ during this process.”

Of course, it doesn’t seem to dawn on them that this could be a public web server, with thousands of users knocking on the HTTP door all the time, and that the only way I can stop this is to log in to the server and shut down the web server, which I am apparently not supposed to do.

If you still don’t believe me, you can look at another piece of evidence. For the last year, Rackspace have been offering a service called “Fanatical Support for Amazon AWS”:

https://www.rackspace.com/managed-aws (Pretty People on web page? Check.)

Yes, you can pay Rackspace to “manage” your investment in their main competitor. This is basically Rackspace saying: “Yes, we know our service is dogfood, but in order to keep the lights on, we’re going to try and squeeze a few dollars out of customers who’ve seen the light and are moving elsewhere.”

Like I said at the start, ignore the clickbait “comparison” articles. Rackspace is something you should avoid in your IT organisation, in the same way you avoid IE6 and BlackBerrys.

One thought on “Rackspace – Engineered to Fail”

  1. Adam

    I have experienced this exact thing. Rackspace is hugely overpriced and limiting. We have slowly moved to Digital Ocean (as have many DevOps people I know), where the feature set is small but growing; the key thing is that it all works perfectly.

    Rackspace seems to have a good name with clients (as you mention, great marketing) and it’s hard to explain to them that it doesn’t really offer more. It’s very annoying that these clients are comforted by the presence of Rackspace in the solution even though they have no idea what is best for them.

    Hopefully, over time, they will see Rackspace is just an overpriced mess.
