Failures, Cloud Disaster Recovery, New services, NAT traversal stuff
Post date: May 18, 2014 8:54:34 AM
- A massive IT fail, my own thoughts and a personal confession. I'm so glad that I haven't done anything quite like that.
- Well, I have had one very close call, but I recovered from it so that the customer didn't even notice.
- Once I screwed up production data in one database table. But that happened because I was sick at home, feverish and almost like drunk, and the customer applied heavy pressure to make immediate changes straight into production. Technically the screwup was really small, only one missing newline. But it got replicated to tons of computers, and the data being collected was affected by it as well. Of course it was possible to clean it up afterwards, and it didn't stop production, but the cleanup was painful as usual. The most annoying thing was that I actually noticed my mistake during the run, just by checking the data being processed, and I tried to stop the update at that point, but it was already partially done and had started to replicate. I just wish I had reversed the order: first check the data, then process it, instead of first starting the process and then checking the data while waiting for it to finish. Clear fail. There was no reason why I couldn't have done that before actually updating the tables, because I was able to run the process in steps. That's the thing that really bugged me personally: a clear failure to verify a key thing. But there's one more key point. The failure I experienced didn't actually have anything to do with the change I made for the customer. It was caused by a change done earlier which just wasn't in production yet. So I did check the things I expected to be worth checking, meaning the changes I made and the things affected by them, but I walked into a secondary trap laid in the code base several weeks earlier, which clearly wasn't properly checked at the time that change was made. I could have avoided the problem very easily by checking all the data in the processing steps and verifying it before running the final update to production. So this error is very human: hurry, pressure, not feeling well, let's just get it done quickly and that's it.
- This could be a perfect example from the Mayday / Air Crash Investigation TV show of how to make things fail catastrophically.
- Fixing the data issue in the database on the primary server took only about 15 minutes, but I'm still quite sure there were hidden ripple effects from this event which indirectly cost about two days of work. Having a database backup would have been one solution, or using a test environment, but neither was available due to the time pressure and me being at home. And because the production system was live, a backup would have been fairly worthless anyway, because restoring it would have 'rolled back' far too many transactions. A small sketch of the check-then-commit workflow I should have used follows below.
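Here's a minimal Python sketch of that workflow, using sqlite3 as a stand-in database; the database file, table and column names are made up and this is not the actual system. The point is the order of operations: preview how many rows the change will touch, apply it inside an open transaction, verify the processed data, and only then commit, so nothing half-done ever becomes visible or gets replicated.

```python
# Sketch of "check first, then commit": verify before and after the update,
# and only commit if both checks pass. Hypothetical table/column names.
import sqlite3

EXPECTED_ROWS = 1  # how many rows the change is supposed to touch

conn = sqlite3.connect("production_copy.db")
try:
    cur = conn.cursor()

    # Step 1: preview what the update would touch, before changing anything.
    cur.execute("SELECT COUNT(*) FROM products WHERE price_list = ?", ("2014-05",))
    to_change = cur.fetchone()[0]
    if to_change != EXPECTED_ROWS:
        raise RuntimeError(f"Expected {EXPECTED_ROWS} rows, would touch {to_change}")

    # Step 2: apply the change inside the open transaction.
    cur.execute("UPDATE products SET price = price * 1.10 WHERE price_list = ?",
                ("2014-05",))

    # Step 3: verify the processed data *before* it becomes visible anywhere.
    cur.execute("SELECT COUNT(*) FROM products WHERE price_list = ? AND price <= 0",
                ("2014-05",))
    broken = cur.fetchone()[0]
    if broken:
        raise RuntimeError(f"{broken} rows look broken, refusing to commit")

    conn.commit()    # only now does anything reach the table permanently
except Exception:
    conn.rollback()  # nothing to clean up, nothing gets replicated
    raise
finally:
    conn.close()
```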
- Yet another really dangerous way of doing things is remoting into a workstation, opening database management software there and connecting it back to the server. In that situation it's very easy to accidentally give commands to the server while thinking you're commanding the workstation. Luckily I have never failed with that, but I have often recognized the risk of a major failure, and so have my colleagues. A small guard sketch for that follows below.
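One possible guard, sketched here purely hypothetically (not any real tool, and the host names are made up), is a thin wrapper around the database cursor that refuses to run destructive statements against a known production host without an explicit confirmation of the target.

```python
# Hypothetical "wrong window" guard for any DB-API style cursor: demand
# that the operator retype the production host name before destructive SQL.
DESTRUCTIVE = ("DROP", "DELETE", "TRUNCATE", "UPDATE", "ALTER")
PRODUCTION_HOSTS = {"db-prod-01", "db-prod-02"}  # made-up host names

def guarded_execute(cursor, target_host: str, sql: str) -> None:
    """Run sql on cursor, but require confirmation when the statement is
    destructive and the connection points at a production host."""
    first_word = sql.lstrip().split(None, 1)[0].upper()
    if first_word in DESTRUCTIVE and target_host in PRODUCTION_HOSTS:
        answer = input(f"About to run {first_word} on PRODUCTION host "
                       f"{target_host!r}. Type the host name to continue: ")
        if answer != target_host:
            raise RuntimeError("Destructive statement aborted by guard")
    cursor.execute(sql)
```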
- Cloud DR and RTO:
- In many cases having the data isn't the problem. If it's in some application-specific format, accessing it can be the real problem when the primary system isn't working. Let's say you're using accounting system XYZ. They provide you an off-line backup option where you get all the data. Then something very bad happens and the company / their systems disappear. Great, now you have the data, but accessing and using it is a whole other story. Let's say they used something semi-common, like MSSQL Server or PostgreSQL, and you got gigabytes of schema dump. Nothing is lost, but basically it's totally inaccessible to everyone. If you have a source code escrow agreement, great. Then starts the very slow and painful process of rebuilding a system which can utilize that data. Of course, if you have competent IT staff, they can probably hand-pick the "most important vital records" from that data, but that's nowhere near the level needed for normal operations. So the RTO can be very long, like I said earlier. I'm sure most small customers don't have their own data at all, nor do they have escrow to gain access to the application in case of a major failure. A rough sketch of that hand-picking step follows below.
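As a concrete illustration of the hand-picking step, here's a rough Python sketch. It assumes the escrowed data is a plain-SQL PostgreSQL dump and that the PostgreSQL command line tools are installed locally; the dump file name and the "vital" table names are purely made up. The idea is just to restore the dump into a scratch database and export a couple of key tables to CSV so somebody can at least read them in a spreadsheet.

```python
# Hypothetical sketch: restore an escrowed PostgreSQL dump into a scratch
# database and export a few key tables to CSV using the psql/createdb tools.
import subprocess

DUMP_FILE = "escrow_dump.sql"           # assumed plain-SQL dump from escrow
SCRATCH_DB = "escrow_scratch"
VITAL_TABLES = ["customers", "invoices"]  # made-up table names

# Restore the dump into a throwaway database.
subprocess.run(["createdb", SCRATCH_DB], check=True)
subprocess.run(["psql", "-d", SCRATCH_DB, "-f", DUMP_FILE], check=True)

# Export the vital records so they can at least be read without the application.
for table in VITAL_TABLES:
    copy_cmd = f"\\copy {table} TO '{table}.csv' CSV HEADER"
    subprocess.run(["psql", "-d", SCRATCH_DB, "-c", copy_cmd], check=True)
```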
- Let's just all hope that nothing that bad happens, because it'll be painful even if you're well prepared. I have several systems where I do have the data and escrow, or even the source. But I assume setting the system up again would take at least several days, even in the cases where I do have the source code for the project(s). In some cases the situation could be much worse. Let's say the service provider was using PaaS and the PaaS failed and caused the problem. Now you have software built on AWS, App Engine, Heroku or something similar, but the primary platform to run the system isn't available anymore. Yet again, you can expect a very long RTO. But competent staff will get it going at some point, assuming you have the code and the data.
- Checked out services like: Pandoo "web operating system", Digital Shadows "digital attack protection & detection", Wallarm "threat and attack detection & protection", ThetaRay "hyper-dimensional big data threat detection", Divide "BYOD", CyberGhost "VPN", and Lavaboom "private email".
- Studied a few more protocols: PCP and NAT-PMP. IPv6 should eventually make all these work-around protocols unnecessary. I hope nobody is really going to use NPTv6. A tiny NAT-PMP example follows below.
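NAT-PMP (RFC 6886) is simple enough that a tiny Python sketch shows roughly all there is to it: one UDP packet to the gateway on port 5351 asking for a port mapping, and a 16-byte reply. The gateway address, ports and lifetime below are assumptions, and real code would also retransmit on timeout and check the result code.

```python
# Minimal NAT-PMP sketch: ask the gateway to map an external TCP port to us.
import socket
import struct

GATEWAY = "192.168.1.1"       # assumed default gateway running NAT-PMP
NATPMP_PORT = 5351
INTERNAL_PORT = 8080          # made-up local service port
SUGGESTED_EXTERNAL = 8080
LIFETIME = 3600               # requested mapping lifetime in seconds

# Request: version=0, opcode=2 (map TCP), 2 reserved bytes,
# internal port, suggested external port, requested lifetime.
request = struct.pack("!BBHHHI", 0, 2, 0,
                      INTERNAL_PORT, SUGGESTED_EXTERNAL, LIFETIME)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)
sock.sendto(request, (GATEWAY, NATPMP_PORT))

# Response: version, opcode (128 + 2), result code, seconds since start of
# epoch, internal port, mapped external port, granted lifetime.
data, _ = sock.recvfrom(16)
version, opcode, result, epoch, internal, external, lifetime = \
    struct.unpack("!BBHIHHI", data)
print(f"result={result} external_port={external} lifetime={lifetime}s")
sock.close()
```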