Amazon EC2 has gone down - what would a preferred hosting platform be? (notice that Focus is working!)
Best Answer
- Recommended by:
- Michael Schmier,
- Brielle Nikaido,
- Justin Pirie
The fact that Focus is working is pretty much luck at this point. Our primary Amazon center is in the west and their issues are in their Virginia center. That being said, I noticed that SimpleGeo stayed up and saw a tweet this morning from @joestump about how they distribute across multiple Amazon zones (we have a database slave in the east for this same purpose).
The only way around this is to plan ahead and have failover mechanisms, which it seems reddit, quora and many others do not. There's usually a reason why they don't: it is expensive!
- Recommended by:
- John McCoy,
- John Haugeland
Preferred hosting platform would be Akamai: they have been hosting some of the biggest platforms/websites in the world. Not the cheapest and stuff can still break (it's all machinery after all, built by humans), but they seem to have done a pretty decent job over the years.
Note that Amazon EC2 is not a small platform either, but scale does not guarantee uptime, as you have noticed.
- Recommended by:
- Brielle Nikaido,
- Dan McComas,
- Todd Hoff
We have a very different architecture than most people at SimpleGeo. First off, our API services are *completely* sectioned off from our website. Our website can go down without affecting our API.
Our API is built across two AZs. We were in three, but brought one down to save cash for the near-term. We use an ELB out front, which is region-wide and, from what I've gathered, sits above any region's AZ infrastructure. This allows API servers and infrastructure in one AZ to crap its pants, while the other one stays up.
Behind the ELB/API servers sit three Cassandra clusters, a bunch of isolated/share-nothing services (geocoders and the like), and RabbitMQ servers. The Cassandra clusters have a replication strategy, which in Cassandra is a pluggable component, that replicates each piece of data in each AZ. This way we can keep reads local to a specific AZ, and writes are propagated around the region to other AZs.
All writes go through the queues, which allows us to continue accepting writes even if our entire storage clusters go offline (this has saved us a few times).
The keys to surviving these kinds of outages are, as usual, share nothing, redundancy, automated failover, automated replication, etc.
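The "all writes go through the queues" idea above can be sketched roughly as follows. This is a minimal, in-memory illustration only, not SimpleGeo's actual code: `FlakyStore`, `BufferedWriter` and `StorageUnavailable` are hypothetical names, and a real setup would use a durable broker such as RabbitMQ rather than a Python deque.

```python
from collections import deque


class StorageUnavailable(Exception):
    """Raised by the hypothetical storage backend when it is offline."""


class FlakyStore:
    """Stand-in for a storage cluster that can be switched offline."""

    def __init__(self):
        self.online = True
        self.rows = {}

    def put(self, key, value):
        if not self.online:
            raise StorageUnavailable(key)
        self.rows[key] = value


class BufferedWriter:
    """Accepts every write into a queue and drains it when storage allows."""

    def __init__(self, store):
        self.store = store
        self.pending = deque()

    def write(self, key, value):
        # Accepting a write only appends to the queue; storage is not touched.
        self.pending.append((key, value))

    def drain(self):
        """Apply queued writes in order; stop at the first failure."""
        applied = 0
        while self.pending:
            key, value = self.pending[0]
            try:
                self.store.put(key, value)
            except StorageUnavailable:
                break  # leave the write queued and retry on the next drain
            self.pending.popleft()
            applied += 1
        return applied
```

With the store offline, `write()` still succeeds and `drain()` applies nothing; once the store is back, the next `drain()` flushes the backlog in the original order.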
- Recommended by:
- John Haugeland
DR/BC needs to be a primary objective when planning and implementing any IT project, outsourced or not. The 'cloud' isn't magic, the 'cloud' isn't fail-proof, the 'cloud' requires hardware, software, networking, security, support and execution just like anything else. All the fancy marketing speak, recommendations and free trials can't replace the need to do obsessive due diligence before trusting any provider, no matter how big and awesome they may seem.
Why do DCs have UPS and diesel generators on-site? They know power companies can and do fail.
Why do we buy servers with dual power supplies? We know they can and do fail.
Why do we implement RAID? We know hard drives can and do fail.
Vendors can and do fail. Prepare for the worst, period.
- Recommended by:
- John Haugeland
Putting all of your eggs in one cloud, so to speak, no matter how much redundancy they say they have, seems short-sighted in my opinion. If you are utilizing an MSP, HSP, CSP, IAAS, SAAS, PAAS, et al. to attract/increase/fulfill a large percentage of your revenue, or all of your revenue like many companies are doing nowadays, then you need to assume that all vendors will eventually have an issue like this that affects your overall uptime, brand and churn rate. A blip here and there is tolerable, but they started reporting issues 10+ hours ago.
- Recommended by:
- John Haugeland
Cloud doesn't completely mean that software companies can forget about infrastructure - they still need to think about redundancy and resiliency. Notice that only one AWS zone went down? Smart money works across multiple zones.
Even smarter money looks to leverage multiple providers in multiple locations....
- Recommended by:
- Dan McComas
To directly answer the question - Amazon may still be the best place to host.
IT systems fail, but they tend not to all fail at once. Amazon has lots of redundancy built in. They have 4 data centers which are independent, and within each data center they have separate isolated infrastructures that can fail separately.
In this case, it appears that the failure mode took out a wide swath of the Virginia data center (although it isn't all of it, as Rightscale has been up and down this morning - I am assuming they are hosted there).
If you host at Amazon, you can plan for the failure. In our case, we keep database replicas in multiple data centers and use Rightscale to quickly launch an environment in the case of a failure. This sort of planning is much easier to do via a cloud provider.
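As a rough sketch of the kind of planning described above, the snippet below picks which database replica to promote when one location fails. It is an illustration under assumptions, not the poster's actual setup: the `Replica` fields, the replica names and `pick_failover_target` are all hypothetical, and in practice the health and lag figures would come from monitoring.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str           # hypothetical replica identifier
    region: str         # data center / region the replica lives in
    healthy: bool       # result of the latest health check
    lag_seconds: float  # replication lag behind the primary


def pick_failover_target(replicas, failed_region):
    """Return the healthy replica outside the failed region with the least lag."""
    candidates = [
        r for r in replicas
        if r.healthy and r.region != failed_region
    ]
    if not candidates:
        return None  # nothing safe to promote; a human has to decide
    return min(candidates, key=lambda r: r.lag_seconds)
```

Choosing the replica with the least lag keeps data loss on promotion as small as possible; excluding the failed region outright avoids promoting a replica that merely looks healthy for the moment.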
One can and should argue about the failure rates at the various cloud providers, but that should only be one of the considerations.
Not sure what others mean by zones. If it is availability zones, they didn't help people avoid the outage. It has to be across multiple regions or multiple cloud providers. Multiple availability zones were of no use in this incident. But redundancy across multiple regions or multiple cloud providers has high costs and latency issues.
That is false. This particular outage was solely isolated to EBS and services based on top of EBS (e.g. RDS).
SimpleGeo was up in 3 AZs, does not make use of EBS or RDS for major production services, and didn't have a single issue with downtime. This was due, in part, to being able to route around the fully downed AZ and degraded issues in other AZs.
@Joe, the folks at GoodData told me that they had their issues in spite of spreading it over different availability zones
http://www.cloudave.com/11886/some-lessons-from-aws-outage/#comment-13191
I would look to the Gartner Magic Quadrant first and then engage a good consultant to help out. Otherwise, companies have to make their own decisions about DR and application redundancy. Remember you must test your DR and application recoverability. Do not wait until you need it. You may be disappointed.
Hey Everyone,
Thanks for participating in this incredible conversation! I wanted to let everyone know that we will be hosting a Focus Roundtable on Amazon Web Services' recent outage on May 9th at 12pm PT / 3pm ET. This event has garnered a lot of attention and Andrew Baker, Ben Kepes, John McCoy, George Reese and Christian Reilly aim to separate the facts from FUD (fear, uncertainty and doubt).
Here is the link to the event page: http://www.focus.com/events/information-technology/focus-roundtable-amazon-we...
Important topic, but I think the real question is not "what would a preferred hosting platform be" but "what can we do to promote failure-proof design that will outlive an outage like this one".
The "focus roundtable" that Brielle mentioned sounds interesting. In the same mindset (maybe a bit more geeky and hands on) we are having a meetup about this topic.
If you are in SF next week (for Google I/O?) I would love to invite you to the 'T1000 Gathering' meetup on May 9th at 5.30pm to discuss what strategies we can promote in 'startup land' (and in general) to build new things with technical failure in mind and how to prevent issues like the ones that the EBS outage generated.
It is an informal meetup with no set speakers but a big topic to discuss - would love your input http://bit.ly/mpU5tT
Brightcove is another cloud option. They have this App Cloud that’s built on open standards like HTML5, Javascript, and CSS3. And your apps can be distributed across all kinds of devices, like PCs, tablets, and smartphones, and they provide video analytics. If you're serious about looking at other options, I'd check them out. (http://www.brightcove.com/en/content-app-platform)
Amazon's downtime is stratospherically high, and their prices are spectacularly inflated. Their ping times are terrible and they offer little that anyone else doesn't offer. Anyone holding them up as a good solution without an explanation has no idea what they're talking about.
The same hosting platform as always is preferred: dedicated boxes at redundant geographically disparate locations managed by different companies. That way when host 1 shits the bed, hosts 2 and 3 keep churning.
Nobody who has even a rudimentary best-practice hosting setup has been affected by the Amazon outage in any way other than a speed hit as their resources shift to a secondary center.
Stop following the new-media goons around. They don't know what they're doing. There's a reason they're down twice a month and making excuses.
Ask a gray-beard.
A few have said it already, but it definitely bears repeating:
A successful site -- particularly a complex site -- is dependent on good infrastructure architecture AND good application architecture. You can't simply throw redundant hardware at a poorly/inadequately built app and expect everything to work, any more than you can develop the best application without consideration of the underlying infrastructure, and expect miracles.
Three key lessons from this experience should be:
-- Find ways to verify or simulate failure conditions in the cloud, as you would with your own hosted infrastructure, so that you can be sure that your DR is appropriate.
-- Ensure that the SLA you are provided matches your business needs, and that you mitigate as many operational risks as you can. Mitigation costs always seem high *until* you have a 24-48 hour outage, and then everyone is asking, "why didn't we pay for xxxxx?!?"
-- The cloud does not absolve you of your responsibility to plan for DR.
-ASB: http://xeesm.com/AndrewBaker
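For the point above about simulating failure conditions, here is a toy sketch of a failure drill: take each zone down in turn and check that something else still serves the request. Everything in it is hypothetical (`route`, `failure_drill`, the zone names); a real drill would disable actual instances or network paths rather than entries in a list.

```python
def route(zones, down=frozenset()):
    """Return the first zone not in `down`, or None if every zone is down."""
    for zone in zones:
        if zone not in down:
            return zone
    return None


def failure_drill(zones):
    """Take each zone down in turn and record which zone served instead."""
    results = {}
    for victim in zones:
        results[victim] = route(zones, down={victim})
    return results
```

Any `None` in the drill results marks a single point of failure: with that zone gone, nothing was left to answer.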