So if I have a list of 50 tests to carry out, I carry out all 50 successfully and something still breaks, does this mean my testing was wrong?
No, it means something happened outside the parameters of your testing - an unforeseen error that couldn't have been tested against.
I say this as someone who works with both software and networking and has experienced exactly the same situation the guys at Evo are having - prior to going into the wild, everything checks off against all the tests you think or are told you need to run.
You go live, everything goes titsup, and you've no idea why and need to find the fault. The problem is that in order to find the fault you need live users, but no-one can get live because of the fault.
I was lucky - I only had 50 users to worry about, and could roll the remaining 300 or so back to the pre-fuckup instance quickly, but it still took us a week to find the problem and fix it. We'd even run a live-environment test on the production server the weekend before, and it was working then.
Sometimes shit doesn't go wrong until it's live in the wild, and then you end up looking like complete cocks, irrespective of what you've done pre-launch.
I understand that people are annoyed about this - I am too - but I'm also sick of the armchair technical analysis over this. There is no way you can know how this error occurred, so please just stop with the 'They fucked up, how could they not know this was going to happen' stuff.
tl;dr - sometimes testing and simulation won't show up a fault that only happens in a wholly live environment.