0

Amazon EC2 has gone down. What would a preferred hosting platform be? (Notice that Focus is working!)

Best Answer

1
Andrew Baker
Director, Service Operations, SWN Communications Inc.

A few have said it already, but it definitely bears repeating:

A successful site -- particularly a complex site -- is dependent on good infrastructure architecture AND good application architecture. You can't simply throw redundant hardware at a poorly/inadequately built app and expect everything to work, any more than you can develop the best application without consideration of the underlying infrastructure, and expect miracles.

Three key lessons from this experience should be:

-- Find ways to verify or simulate failure conditions in the cloud, as you would with your own hosted infrastructure, so that you can be sure that your DR is appropriate

-- Ensure that the SLA you are provided matches your business needs, and that you mitigate as many operational risks as you can. Mitigation costs always seem high *until* you have a 24-48 hour outage, and then everyone is asking, "why didn't we pay for xxxxx?!?"

-- The cloud does not absolve you of your responsibility to plan for DR.
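
To put the SLA point above in numbers, here is a minimal sketch that converts an availability figure into permitted downtime, and an outage length back into achieved availability. The function names and the 99.95% figure are illustrative assumptions, not taken from any provider's contract.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted over the period at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return period_hours * (1.0 - availability)


def achieved_availability(outage_hours: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period the service was up, given total outage hours."""
    if not 0.0 <= outage_hours <= period_hours:
        raise ValueError("outage_hours must be between 0 and period_hours")
    return 1.0 - outage_hours / period_hours


if __name__ == "__main__":
    # 99.95% is an example target, not a quoted SLA.
    print(f"{allowed_downtime_hours(0.9995):.2f} hours/year allowed at 99.95%")
    # 48 hours is the upper end of the outage length mentioned above.
    print(f"{achieved_availability(48) * 100:.3f}% achieved after a 48-hour outage")
```

Comparing the two numbers makes the gap between a promised availability and a single long outage concrete before anyone asks why the mitigation was not paid for.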

-ASB: http://xeesm.com/AndrewBaker

0
Ashley Davies
Ashley Davies Replied on May 4, 2011

A fantastic response, Andrew. In your opinion, would you tend to stay away from the 'cloud' and invest in architecture to support a website? Or would you look to outsource to a 3rd party/cloud to handle that for you?

0
Andrew Baker
Andrew Baker Replied on May 4, 2011

Thanks, Ashley. Depends on the situation. :)

I am a proponent of using the cloud -- intelligently -- as a sourcing option for various services. Whether I would use the cloud or not for hosting a given website would depend on what that website needed to provide, how critical it was, how much infrastructure I am already managing (and how), and what kind of budget I was dealing with. If there is tentative demand, moderate risk, variability of resource needs, and average budget, I would at least consider the cloud to manage both consumption and financial risks. Especially if I didn't already have infrastructure that would be adequate to support the venture.

2
Dan McComas
Director, Engineering, Focus

The fact that Focus is working is pretty much luck at this point. Our primary Amazon center is in the west and their issues are in their Virginia center. That being said, I noticed that SimpleGeo stayed up and saw a tweet this morning from @joestump about how they distribute across multiple Amazon zones (we have a database slave in the east for this same purpose).

The only way around this is to plan ahead and have failover mechanisms, which it seems reddit, quora and many others do not. There's usually a reason why they don't: it is expensive!

0
Michael Schmier
Michael Schmier Replied on April 21, 2011

Seems like failover used to be a pretty expensive thing for start-ups pre-cloud. Can failover be achieved with a single provider in the cloud now? Or should failover include different providers?

0
John Haugeland
John Haugeland Replied on April 22, 2011

You're saying the only reason you're still working is dumb luck, but your CTO is saying "but this was still the best choice."

It becomes clear that you guys don't actually know what you're doing.

0
John Haugeland
John Haugeland Replied on April 22, 2011

That you're holding up failover as expensive when dedicated with double failover is generally radically cheaper than Amazon means that you also haven't even begun to look at the actual cost structure of any alternatives.

Reddit could have prevented these several days of downtime for about two hundred dollars a month. If you think they haven't lost a year at that rate on ad revenue, or if you think it'll be a year before their next outage given that in the last year they've had five of 12 hours or more, then great, You're Doing It Right.

When you start actually doing the numbers, you'll begin to realize that Amazon is to hosting what Bose is to headphones - a marketing company taking advantage of clueless consumers who don't know how to price compare.
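
The break-even argument in this reply can be checked with simple arithmetic. A minimal sketch follows. The $200 monthly figure and the five outages of 12 hours are the numbers claimed above; the function name is hypothetical, and real revenue figures would have to be supplied by whoever runs the site.

```python
def break_even_revenue_per_hour(monthly_cost: float,
                                outage_hours_per_year: float) -> float:
    """Hourly revenue at which a year of mitigation cost equals a year of outage losses."""
    if outage_hours_per_year <= 0:
        raise ValueError("outage_hours_per_year must be positive")
    return (monthly_cost * 12) / outage_hours_per_year


if __name__ == "__main__":
    # 5 outages x 12 hours = 60 hours/year, the lower bound claimed in the reply above.
    print(break_even_revenue_per_hour(monthly_cost=200, outage_hours_per_year=60))
```

If the site earns more per hour than the value returned, the mitigation pays for itself under these assumptions; if it earns less, it does not.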

0
Dan McComas
Dan McComas Replied on April 22, 2011

If you think reddit could have prevented this from happening for 200 a month you clearly don't know what you're doing either. You clearly have issues with amazon and your comments here are anything but productive. Take your troll comments elsewhere.

0
John Haugeland
John Haugeland Replied on April 22, 2011

Actually, Dan, I know exactly what I'm doing, and that number isn't pulled out of thin air. I made the case to them almost 18 months ago.

Don't call me a troll just because I have an opinion that isn't the same as yours, sir, and don't call me wrong just because I have an opinion that you've discarded without actually asking what it is.

I own a web host, and I have customers that do quite a bit more traffic than focus.com.

There's a reason you and your staff disagree on the wisdom of your own hosting choices: you haven't actually taken the time to set metrics or figure out what your options are.

It is not a sign of wisdom to argue with a plan without asking what it is, Mr McComas.

0
John Haugeland
John Haugeland Replied on April 22, 2011

Incidentally, you are misapplying internet terminology. Trolls are people who show up and question your sexuality, or other such complete side issues, in order to pick fights.

Someone showing up with specific fact-driven criticisms of your public claims is not a troll; they're just someone who disagrees.

Your attempt to diminish someone for disagreeing with you politely, in effort to ignore that your own CTO is making claims you cannot agree with about your own infrastructure, speaks to your character, sir.

I confess to disappointment that your reaction to "you and your CTO don't agree, and here's how you can do this cheap" is "go away, troll." A better response might have been something like "I don't see how it's possible to keep Reddit up on $200; that seems completely outside the market possibilities I've seen. Could you explain, please?"

Sadly, some people on the internet simply cannot cope with the idea that they might not actually have the best answer to every single question, and will lash out rather than say "thank you for providing me information which will protect me from the next Amazon outage, and at the same time save my company a lot of money."

C'est la vie.

If you choose to apologize for the inappropriate response you gave, I'll show you how to get past this vulnerability and expense pair. There's a reason all your other readers are saying the same things I am, and that reason is not that I am wrong, that I am a troll, or that you've got the only answer on the books.

C'mon, guy. Criticism isn't that hard to take decently.

0
Dan McComas
Dan McComas Replied on April 22, 2011

Actually, you came in and said this:
"It becomes clear that you guys don't actually know what you're doing."

That is a troll statement, no two ways about it. Once you make a statement like that, the conversation is over. Your intentions are clear: you do not wish to have a conversation about technology, you wish to impose your opinions on me, and I'm not interested in that. Next time, if you want to have a real conversation about something, don't insult someone and then expect them to respond favorably. You should know better.

0
John Haugeland
John Haugeland Replied on April 22, 2011

Observing that the owner and the person responsible for maintenance are giving diametrically opposed, mutually incompatible views, and concluding from it that you don't know what you're doing, is not a troll statement. The only way it leads to the conversation being over is if you aren't willing to face the criticism.

Not all criticism is trolling, Dan.

It's not really appropriate for you to complain about insulting people while you're calling them a troll and saying that a plan you've never heard or seen is wrong, Dan. You're doing what you're complaining about, at a more serious level than what you're complaining about.

We should both know better, it seems.

Why is it that you say you're up through dumb luck, yet your CTO says "doing it this way is probably still the best way?"

Do you simply not see the incompatible contrast there?

0
John Haugeland
John Haugeland Replied on April 22, 2011

But hey, easier to keep complaining than to actually address what's said to you, right? That makes saving face much easier.

0
Dan McComas
Dan McComas Replied on April 22, 2011

I have obviously fed the troll too much and will not be participating in this conversation any longer. Happy trolling!

0
John Haugeland
John Haugeland Replied on April 22, 2011

You sure spend a lot of time spreading insults, for someone who wants to complain about perceived insults.

My opinion at this point is that you do not have the engineering skills to summon a legitimate response to the criticisms made.

You can keep saying troll, or you can hide silently, *or* you can make a legitimate response and prove me wrong.

You and I both know which one you're going to do, and how your readers will interpret it.

Be clear: you just don't know enough to respond legitimately. That's why you keep saying troll: it gives you an emotional excuse to not be embarrassed that you're out of your depth in something this basic.

Have a nice day, Dan. Maybe your dumb luck will hold the next time Amazon goes down, which statistically will be in seven weeks.

2
Remy Bergsma
Community Manager, MailPlus

Preferred hosting platform would be Akamai: they have been hosting some of the biggest platforms/websites in the world. Not the cheapest and stuff can still break (it's all machinery after all, built by humans), but they seem to have done a pretty decent job over the years.

Note that Amazon EC2 is not a small platform either, but scale does not guarantee uptime, as you have noticed.

2

We have a very different architecture than most people at SimpleGeo. First off, our API services are *completely* sectioned off from our website. Our website can go down without affecting our API.

Our API is built across two AZs. We were in three, but brought one down to save cash for the near-term. We use an ELB out front, which is region-wide and, from what I've gathered, sits above any region's AZ infrastructure. This allows API servers and infrastructure in one AZ to crap its pants, while the other one stays up.

Behind the ELB/API servers sit three Cassandra clusters, a bunch of isolated/share-nothing services (geocoders and the like), and RabbitMQ servers. The Cassandra clusters have a replication strategy, which in Cassandra is a pluggable component, that replicates each piece of data in each AZ. This way we can keep reads local to a specific AZ, and writes are propagated around the region to other AZs.
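
The per-AZ replication described above can be illustrated with a toy placement function. This is only a sketch of the idea, not Cassandra's actual replication strategy API; the node names, AZ labels, and the hashing choice are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def place_replicas(key: str, nodes_by_az: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick one node in every AZ to hold a replica of `key`."""
    # A stable hash so the same key always maps to the same nodes.
    digest = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)
    placement = {}
    for az, nodes in sorted(nodes_by_az.items()):
        if not nodes:
            raise ValueError(f"AZ {az!r} has no nodes")
        placement[az] = nodes[digest % len(nodes)]
    return placement


def read_node(key: str, nodes_by_az: Dict[str, List[str]], local_az: str) -> str:
    """Serve a read from the replica in the caller's own AZ."""
    return place_replicas(key, nodes_by_az)[local_az]


if __name__ == "__main__":
    # Hypothetical cluster layout.
    cluster = {"az-a": ["a1", "a2"], "az-b": ["b1", "b2", "b3"]}
    print(place_replicas("user:42", cluster))
    print(read_node("user:42", cluster, "az-b"))
```

Because every AZ holds a copy, losing one AZ leaves a full replica in each of the others, and reads never need to cross AZ boundaries.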

All writes go through the queues, which allows us to continue accepting writes even if our entire storage clusters go offline (this has saved us a few times).
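
A minimal sketch of that write path, assuming an in-memory queue in place of RabbitMQ and a stand-in store that can be switched offline. The class and method names are hypothetical, and a real queue would need to be durable.

```python
from collections import deque


class FlakyStore:
    """In-memory stand-in for a storage cluster that can be switched offline."""

    def __init__(self):
        self.data = {}
        self.online = True

    def put(self, key, value):
        if not self.online:
            raise ConnectionError("storage offline")
        self.data[key] = value


class BufferedWriter:
    """Accept writes into a queue and drain them to storage when it is reachable."""

    def __init__(self, store):
        self._store = store
        self._pending = deque()

    def write(self, key, value):
        """Always succeeds: the write is queued, never sent to storage directly."""
        self._pending.append((key, value))

    def drain(self):
        """Flush queued writes in order; stop at the first failure and keep the rest."""
        flushed = 0
        while self._pending:
            key, value = self._pending[0]
            try:
                self._store.put(key, value)
            except ConnectionError:
                break
            self._pending.popleft()
            flushed += 1
        return flushed

    @property
    def pending(self):
        return len(self._pending)
```

Writes accepted while the store is offline stay queued and are applied in their original order once `drain()` can reach storage again.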

The key to surviving these kinds of outages is, as usual, share nothing, redundancy, automated failover, automated replication, etc.

1

DR/BC needs to be a primary objective when planning and implementing any IT project, outsourced or not. The 'cloud' isn't magic, the 'cloud' isn't fail-proof, the 'cloud' requires hardware, software, networking, security, support and execution just like anything else. All the fancy marketing speak, recommendations and free trials can't replace the need to do obsessive due diligence before trusting any provider no matter how big and awesome they may seem.

Why do DCs have UPS and diesel generators on-site? They know power companies can and do fail.

Why do we buy servers with dual power supplies? We know they can and do fail.

Why do we implement RAID? We know hard drives can and do fail.

Vendors can and do fail. Prepare for the worst, period.

0
John Haugeland
John Haugeland Replied on April 22, 2011

Sadly, if you tell this to the Focus staff, their response is "go away, troll."

Some people don't even learn when it goes south. :(

1

Putting all of your eggs in one cloud, so to speak, no matter how much redundancy they say they have, seems short-sighted in my opinion. If you are utilizing an MSP, HSP, CSP, IaaS, SaaS, PaaS, et al. to attract/increase/fulfill a large percentage of your revenue, or all of your revenue as many companies are doing nowadays, then you need to assume that all vendors will eventually have an issue like this that affects your overall uptime, brand and churn rate. A blip here and there is tolerable, but they started reporting issues 10+ hours ago.

1
Ben Kepes
Director, Diversity Analysis

Cloud doesn't completely mean that software companies can forget about infrastructure - they still need to think about redundancy and resiliency. Notice that only one AWS zone went down? Smart money works across multiple zones.

Even smarter money looks to leverage multiple providers in multiple locations....

0
Andrew Mosson
CTO, Focus

To directly answer the question - Amazon may still be the best place to host.

IT systems fail, but they tend not to all fail at once. Amazon has lots of redundancy built in. They have 4 data centers which are independent and within each data center they have separate isolated infrastructures that can fail separately.

In this case, it appears that the failure mode took out a wide swath of the Virginia data center (although it isn't all of it, as Rightscale has been up and down this morning - I am assuming they are hosted there).

If you host at Amazon, you can plan for the failure. In our case, we keep database replicas in multiple data centers and use Rightscale to quickly launch an environment in the case of a failure. This sort of planning is much easier to do via a cloud provider.
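
As a rough illustration of planning for the failure, here is a sketch that picks which site should serve traffic based on health checks. It does not use Rightscale's or Amazon's APIs; the site names and the `is_healthy` callback are hypothetical, and a real check would probe the replica and the launched environment.

```python
from typing import Callable, Iterable, Optional


def pick_active_site(sites: Iterable[str],
                     is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first site, in priority order, whose health check passes."""
    for site in sites:
        try:
            if is_healthy(site):
                return site
        except Exception:
            # A health check that errors out is treated the same as a failed one.
            continue
    return None  # nothing is healthy: time to page a human


if __name__ == "__main__":
    # Hypothetical priority list: primary first, standby with the replica second.
    sites = ["primary-west", "standby-east"]
    down = {"primary-west"}
    print(pick_active_site(sites, lambda s: s not in down))
```

The ordering of the list encodes the failover policy, so adding a third location or a second provider is a one-line change.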

One can and should argue about the failure rates at the various cloud providers, but that should only be one of the considerations.

0
John Haugeland
John Haugeland Replied on April 22, 2011

"They have 4 data centers which are independent and within each data center they have separate isolated infrastructures that can fail separately."

The germane issue isn't whether they can fail separately; it's whether they can fail together. Considering that that's already happened twice in Amazon's five outages this year, and considering Amazon's stratospheric prices, I have a hard time understanding why you think Amazon is a better choice than some traditional datacenter like CalPOP, who will charge you less for a half-rack with a dedicated gigabit than Amazon does for modest servers.

0
Andrew Mosson
Andrew Mosson Replied on April 22, 2011

What I tried to point out in my answer, but maybe not well enough, is that one can, within Amazon, architect around this type of failure. In our case, we replicate our data to a separate data center and are able to automatically launch our entire environment if need be. If we had been affected by this, we would have been down for a short time while our environment launched in another data center.

0
Krishnan Subramanian
Industry Analyst

Not sure what others mean by zones. If it is availability zones, they didn't help people avoid the outage. It has to be across multiple regions or multiple cloud providers. Multiple availability zones were of no use in this incident. But redundancy across multiple regions or multiple cloud providers has high costs and latency issues.

0

That is false. This particular outage was solely isolated to EBS and services based on top of EBS (e.g. RDS).

SimpleGeo was up in 3 AZs, does not make use of EBS or RDS for major production services, and didn't have a single issue with downtime. This was due, in part, to being able to route around the fully downed AZ and degraded issues in other AZs.

0
Krishnan Subramanian
Industry Analyst

@Joe, the folks at GoodData told me that they had their issues in spite of spreading it over different availability zones

http://www.cloudave.com/11886/some-lessons-from-aws-outage/#comment-13191

0
Dennis Morgan
CEO/Consultant, DK Morgan Group

I would look to the Gartner Magic Quadrant first and then engage a good consultant to help out. Otherwise, companies have to make their own decisions about DR and application redundancy. Remember, you must test your DR and application recoverability. Do not wait until you need it. You may be disappointed.
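
One way to make "test your DR" concrete is a scheduled restore drill: take a backup, restore it into a fresh copy, and verify the copy matches live data. The sketch below does this with in-memory JSON as a stand-in for a real backup system; all function names are hypothetical, and records are assumed to be JSON-serializable with string keys.

```python
import hashlib
import json


def checksum(records: dict) -> str:
    """Stable fingerprint of a dataset, independent of key order."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def take_backup(records: dict) -> str:
    """Serialize the dataset; a real system would ship this off-site."""
    return json.dumps(records, sort_keys=True)


def restore(backup: str) -> dict:
    """Rebuild a dataset from a backup into a fresh object."""
    return json.loads(backup)


def dr_drill(live: dict) -> bool:
    """Back up, restore into a fresh copy, and confirm the copy matches live data."""
    restored = restore(take_backup(live))
    return checksum(restored) == checksum(live)


if __name__ == "__main__":
    # Made-up sample data.
    print(dr_drill({"user:1": "alice", "user:2": "bob"}))
```

A drill that returns False on a quiet Tuesday is far cheaper than discovering the same thing during an outage.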

0
Brielle Nikaido
Manager, Market Strategy, Salesforce.com

Hey Everyone,

Thanks for participating in this incredible conversation! I wanted to let everyone know that we will be hosting a Focus Roundtable on Amazon Web Services' recent outage on May 9th at 12pm PT / 3pm ET. This event has garnered a lot of attention and Andrew Baker, Ben Kepes, John McCoy, George Reese and Christian Reilly aim to separate the facts from FUD (fear, uncertainty and doubt).

Here is the link to the event page: http://www.focus.com/events/information-technology/focus-roundtable-amazon-we...

0

Important topic, but I think the real question is not "what would a preferred hosting platform be" but "what can we do to promote failure-proof design that will outlive an outage like this one".

The "focus roundtable" that Brielle mentioned sounds interesting. In the same mindset (maybe a bit more geeky and hands on) we are having a meetup about this topic.

If you are in SF next week (for Google I/O?) I would love to invite you to the 'T1000 Gathering' meetup on May 9th at 5.30pm to discuss what strategies we can promote in 'startup land' (and in general) to build new things with technical failure in mind and how to prevent issues like the ones that the EBS outage generated.

It is an informal meetup with no set speakers but a big topic to discuss - would love your input http://bit.ly/mpU5tT

0

Brightcove is another cloud option. They have this App Cloud that's built on open standards like HTML5, JavaScript, and CSS3. And your apps can be distributed across all kinds of devices, like PCs, tablets, and smartphones, and they provide video analytics. If you're serious about looking at other options, I'd check them out. (http://www.brightcove.com/en/content-app-platform)

-1

Amazon's downtime is stratospherically high, and their prices are spectacularly inflated. Their ping times are terrible and they offer little that anyone else doesn't offer. Anyone holding them up as a good solution without an explanation has no idea what they're talking about.

The same hosting platform as always is preferred: dedicated boxes at redundant geographically disparate locations managed by different companies. That way when host 1 shits the bed, hosts 2 and 3 keep churning.

Nobody who has even a rudimentary best-practice hosting setup has been affected by the Amazon outage in any way other than a speed hit as their resources shift to a secondary center.

Stop following the new-media goons around. They don't know what they're doing. There's a reason they're down twice a month and making excuses.

Ask a gray-beard.
