What's the biggest, baddest IT failure you've ever seen?

IT failures are everywhere -- what's the biggest one you have personally witnessed? Tell us what happened and why it went wrong.

2
@chrisspivey
Posted on Aug. 5, 2010

A "crisis situation" for a botched HR Payroll implementation at a division of PepsiCo. The CFO, CIO and SVP of HR sent a video to every employee in the division apologizing. I got to play a part in cleaning up the mess...

The video can be seen on my site www.SpiveyNco.com

1
Matt Davis
Desktop VDI and Cloud Specialist, Microsoft
Posted on Aug. 9, 2010

Cable-seeking backhoe in downtown Dallas. Took out FDIC communications as well as those of several other companies. AT&T had armed guards securing the ditch. I'm not sure what ever happened to the backhoe operator or subcontractor.

0
@CantDisplayForFear
Posted on Aug. 5, 2010

I've seen the root cause of trading operations halted for 2 days at a top tier bank, I've seen cable operators go black for an entire day. I've seen some crazy stuff that I WISH I could discuss, but if I did I would never work in this town again.

0
dc
Posted on Aug. 5, 2010

Fire suppression release in data center
Sprinkler head release on top of production systems

0
CL
Posted on Aug. 5, 2010

Setting up Disaster Recovery site (6 month project) and 30 days later COLO declares bankruptcy.

0
eek
Posted on Aug. 6, 2010

A cleaner unclipping a series Ethernet connection, taking down the entire network for a day - A "test" database that was actually live, and the deletion of 10 years of active service records - A basement flood taking out a national server bank. All eventually recoverable, albeit with some pain.

0
Jim Wasson
Posted on Aug. 6, 2010

Client decided to create their own data center (on the fly) within their leased space without consulting the professionals. After a botched electrical installation, the space caught fire over a weekend, with no fire suppression in place. 13 cabinets were destroyed, with no backups.
It was catastrophic for the company.

0
Michael Krigsman
CEO, Asuret Inc.
Posted on Aug. 6, 2010

Folks, these are great answers, but please share more details. What happened and WHY?

0
Brian M
Posted on Aug. 6, 2010

Rollout of hospital patient care software. It tested fine with about 10 users. When it went live across 9 hospitals and countless clinics, everyone wanted to see it at once, and that crashed the servers it was on. Since we desktop techs were the only ones on site, we had to walk around and tell everyone to use the old-fashioned method: paper and pencil. Not enough testing, and no stress test on the servers or network. After a day of it going up and down, they finally got smart and brought the hospitals on one at a time.

0
Anthony Freed
Managing Editor, Infosec Island Network
Posted on Aug. 6, 2010

Heartland Payment Systems - Management knew of a breach to corporate systems in late 2007. Processing systems were breached somewhere in early to mid 2008, with the most records lost since TJX. The breach lasted until fall 2008, when one of the card issuers alerted Heartland to anomalies. The breach was announced in January of 2009. More than 100 million consumer credit card records were compromised.

0
Michael Krigsman
CEO, Asuret Inc.
Posted on Aug. 6, 2010

Still trying to understand WHY these projects failed. Were the failures caused by a crazy CIO, faulty technology, insufficient budget, etc?

0
Anthony Freed
Managing Editor, Infosec Island Network
Posted on Aug. 6, 2010

@Michael

My opinion is that it is because there is no "security" - only best efforts at mitigation preparedness in case there is a data loss event.

0
Michael Krigsman
CEO, Asuret Inc.
Posted on Aug. 6, 2010

Anthony, I'm trying to get a better understanding of the underlying dynamics that created these failures. Did I not ask the question properly?

0
Anthony Freed
Managing Editor, Infosec Island Network
Posted on Aug. 6, 2010

Your question was fine - there is just going to be a long long list of answers - you started a good list in your query - I was only seeking to be succinct.

Vulnerabilities inevitably outnumber solutions - at least at this point in systems evolution. The best we can do is work hard to stay one step behind the bad guys.

And sometimes the only reason an organization has not experienced a data loss event may be that no one has targeted them, committed enough resources, or simply tried hard enough.

0
Michael Krigsman
CEO, Asuret Inc.
Posted on Aug. 6, 2010

I guess it naturally leads to the question of how to prevent these kinds of problems? That seems more complicated.

0
Anthony Freed
Managing Editor, Infosec Island Network
Posted on Aug. 6, 2010

Yes - unfortunately, I doubt I could offer you any insight, given that your expertise eclipses my own significantly!

I would offer for those not in the 'know' that there are a multitude of vulnerabilities that remain unaddressed for which there are open-source and commercial solutions, which is never a bad place to begin.

And further, there remain issues of lax coding and development that could be resolved with more diligence in the industry and the implementation of uniform best practices, another good jumping-off point.

There is also a disconnect between compliance and security for management, and sometimes a failure on the IT side to effectively translate IT risk into actionable items for the non-techies.

Lastly, I am a firm believer that Data Loss Prevention, or DLP, is only one side of the security coin - DLR, or Data Loss Resiliency, is the other: detect, isolate, and mitigate for the sake of business continuity.

Cheers!

(PS Michael - we would enjoy the opportunity to publish you at InfosecIsland.com too)

0
Steve k
Posted on Aug. 8, 2010

I think a lot of these issues have a technology aspect (the cable was cut, the server ran too hot, and so on), but those are only the symptoms of management's failure to craft and follow good policies and procedures. For example, in the case I have with the bank going offline, why didn't they have a proper DR solution in place? If they did, why didn't it fail over? Maybe they should run production from DR for 3 months out of the year to verify that DR is truly DR and not a pipe dream.

0
Steve Heusser
Operations Manager, SolutionPro Inc
Posted on Aug. 11, 2010

In my experience the vast majority of catastrophic IT Failures I have seen have been started by one or more of these three main causes.
1. Lack of proper planning
2. Inadequate testing
3. Poor project management

Any one of these can cause a catastrophe in the IT world and I have seen each of these in action.

0
Edwin D'Hondt
Posted on Aug. 13, 2010

Silence in a (big) data center room.
Due to a power surge (far above the regular 220 volts), the circuit boards of all the MS servers burned out and the servers went down. Result = silence.
Real problem: the no-break (UPS) installation was not in sync with a change at the external power station. None of the experts, internal or external, were working proactively.

0

The worst IT failure I've seen involved a data center provider who shall remain nameless. Our DC was a little over three hours away from our corporate office. I had made a trip one day to perform some routine maintenance and install a new Citrix CAG.
In the DC building's parking garage I got an email on my Blackberry from a tech saying that a server had become unresponsive, along with notes on other concerns. I got to the floor of the building where our DC was and discovered that the primary HVAC had failed. The secondary HVAC was not wired properly and failed when fail-over was supposed to occur, and the circuit was "stuck." The DC was honestly chaos, with DC personnel and outside vendors running here and there, some with ladders. When I got to my racks, some equipment was already down, some was in the process of going down, alert LEDs were lit everywhere, and equipment was running HOT...one of the worst scenarios I could have imagined. I called the IT Manager and stated I had to bring the company down immediately. All company sites connect to our DC for Citrix and other services.
I found out later the HVAC had failed about 2 hours before my arrival and was not restored until about 2 hours after my arrival. My company was down until about 9:00 that evening as I had to allow the equipment to cool down.

0

James Martin did some beautiful research decades ago - still true today - that suggests 82% of "big bad IT failures" go like this:

1. Analysts interview over-burdened business folks and produce lots of text "requirements" and weird diagrams the business people don't really understand.

2. Over-burdened business folks, frustrated nothing's happening, eventually sign off these requirements to get something happening.

3. IT build it/deliver it, but it's not what the business wanted.

(Oh, and - 4. IT get blamed.)

0

We experienced a complete power outage due to an incident in our main datacenter. While the UPS units were able to provide electricity to all racks, the AC units were not powered: they stopped, and the datacenter became hot in less than 10 minutes.
The servers all began to shut down one after the other.
Fortunately, we had received the alerts on the power outage before the monitoring servers stopped, so we were able to get to the building quickly and manually shut down the remaining systems, while our colleagues in Energy were busy fixing the power issue.
The company lost a lot of money that day. And one of our storage systems went haywire, and it was only after long hours on the line with support that we could fix the issue!
This incident comes back to my mind every time I see a request for change on power equipment in a datacenter!
