I apologize for the possible length of this post. Also, I say fuck a lot when I'm angry or excited. Doubly when I'm both. Sorry.
...plus I've had a lot of coffee today.
OH MY GOD, FINALLY.
I've been wanting to start a new thread about this on GAF for ages, but I'm a lowly man who cannot create threads yet. Today is my magic day to talk about this subject. And by "talk" I mean "rant."
AAA game development studios (and the large publishers that own/deal with most of them) are fucking dumb when it comes to the kind of infrastructure Amazon just announced with LumberYard and GameLift. I know that sounds harsh, but I'll explain, and hopefully that will shed some light on my frustration.
Background
Building public-facing and private systems infrastructure is what I do for a living. I've been at it for, I don't know, ~8 years. I've spent a lot of time screaming and being unbelievably frustrated with traditional hosting solutions like the massive Rackspace, as well as helping them test Rackspace Cloud in its infancy. It's terrible. They're terrible. The benefit you get for the cost is atrociously bad. We're talking about a world where requesting a new web server to account for increased load can take a week depending on how specific your needs are. A week. A fucking week! What happens if you are hit unexpectedly by a huge traffic spike? You're screwed, that's what. Having locally hosted hardware (like if Sony has their own server farm) obviously improves that response time, but we're still talking hours or days. That's preposterous.
My old job's infrastructure was pretty large. Not Netflix large or anything (not even close), but our monthly bill at Rackspace fluctuated between $75k-175k/mo depending on the time of year and what we were expecting for load. Now, when you're paying that much damn money for a managed infrastructure ("managed" in the sense of support when you're asleep or whatever), you'd expect that things would break less often. Sadly, we were constantly hit with downtime from technician muck-ups, bad deployments (of servers, not code), bad, un-tested OS updates that we never asked for... you name it. I digress... my point is that getting caught off-guard by high traffic times was a huge pain in the ass simply due to the nature of dealing with physical servers running everything. Serving traffic, doing load balancing, SQL masters and slaves, whatever. Any change in capacity meant a TON of planning with Rackspace to make sure they had the hardware available. That's not acceptable these days. Like at all. Not for any industry that constantly needs to deal with surprise peak traffic. Hell, even for planned peak traffic it's a shitty system.
What the answer isn't
Relying on the cloud and virtual machines.
What the answer IS
Relying on the cloud and virtual machines.
I know that sounds silly, but I'll explain.
You can't pretend that just moving your physical servers to VMs (virtual machines) is going to fix all your problems. It can't and it won't. It's not a 1:1 solution at all because it's nowhere near that simple.
That said, AWS (and hopefully someday Azure *laughs* and Google's Compute Engine) is absolutely the answer. Not because it's in the cloud, not because it's virtualized, but because it is both of those things + a massively robust and capable set of tools and features that can and will make your infrastructure a fucking monster when it has to be and a tiny mouse when it can be.
At my old, aforementioned job, we eventually moved *everything* to AWS. We optimized code on our end to take advantage of the insanely useful ecosystem that AWS offers and we saw the following:
- Cost reduction (down to $12k/mo)
- 3-4x performance increase on one web app
- 2-3x performance increase on an older, antiquated web app
Performance being pageload times, response times, you name it. Data access from any of our databases was hilariously faster; the list goes on and on...
...but that list doesn't even include the things I loved most (and the things most relevant to the gaming world) like the ability to modularize your services and functions to eliminate single points of failure or automatically horizontally scale your infrastructure with demand in real time if you make it/let it.
What that means is this:
Say 20,000 concurrent users is the most your infrastructure can handle before it completely shits the bed. In a typical server infrastructure, physical or VM, the most you can really do to put out the fire is add more servers. Physically, as I mentioned, this takes time and a lot of it. With VMs you at least have the pleasure of easily spinning up a new VM (especially if you have templates on hand), but you still run into the same dumb shit in the end: it's all manual, it all takes time, and if you are hosting your own VMs, you run the risk of no longer having the hardware available to spin up new VMs.
^^ that's exactly what almost all game studios and their publishers are doing. They have locally and remotely hosted physical boxes, locally or remotely hosted virtual boxes (AWS, Azure, Dreamhost [maybe that was a joke, but I can't tell] or wherever) or a combination of those options. It's shitty, it's slow for responding to crises or dealing with issues and, in this day and age, it's fucking lazy and irresponsible.
Taking that same 20k concurrent user limit into account, the scenario on AWS is much, much different. As a very small example, you could set up a cluster of servers that operate on a set of rules. Those rules define that once the load on those servers reaches a limit, new servers are automatically created to compensate. Traffic slows down? Automatically decommission those servers and let the active connections drain so that new connections hit the remaining pool of servers. Traffic dies down even more? Decommission more automatically. You can set limits on this stuff, too, so that you set standards for a base amount of servers and also a ceiling amount if you're worried about cost going through the roof.
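If it helps to see the rules idea written down, here's a rough sketch in plain Python. It is not real AWS code: the function name, the thresholds and the floor/ceiling numbers are all made up for illustration. On AWS the real thing is an Auto Scaling group with a minimum size, a maximum size and scaling policies; this only shows the decision being made on each check.

```python
def desired_server_count(current, avg_load, scale_out_at=0.75,
                         scale_in_at=0.30, floor=2, ceiling=20):
    """Return how many servers the cluster should have after one rule check.

    avg_load is the average utilisation per server, from 0.0 to 1.0.
    All thresholds and limits here are hypothetical example values.
    """
    if avg_load >= scale_out_at:
        target = current + 1   # busy: add a server
    elif avg_load <= scale_in_at:
        target = current - 1   # quiet: drain one server and retire it
    else:
        target = current       # comfortable band: leave the pool alone
    # Never drop below the base fleet or climb past the cost ceiling.
    return max(floor, min(ceiling, target))
```

Run that check every minute or so against your real load numbers and you get the grow-when-slammed, shrink-when-quiet behavior without a human touching anything.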
That's amazing as hell and, if utilized properly, could save the gaming studios SO much money with crazy amounts of added benefit. Hey, you know all those shitty times that services or MP games go down because of high load? *cough*
But that's just the tip of the iceberg.
You can do the exact same shit with databases, not just webservers.
You can do the same thing with logic tasks.
You can do the same with notification delivery and email delivery.
You can do the same with security. Hell, you can even easily isolate which applications, services and servers have access to each other to mitigate security risks with REALLY easy rulesets.
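To give a feel for how simple those rulesets can be, here's a toy version in Python. The service names are hypothetical and this isn't any AWS API; it just shows the "deny everything unless a rule allows it" idea that this kind of isolation is built on.

```python
# Hypothetical services. Each key lists who is allowed to connect TO it.
ALLOWED_INBOUND = {
    "database": {"game-api"},
    "game-api": {"load-balancer"},
    "load-balancer": {"internet"},
}

def can_connect(source, destination):
    """Default deny: a connection is allowed only if a rule says so."""
    return source in ALLOWED_INBOUND.get(destination, set())
```

With rules like these, the internet can reach the load balancer but can never talk to the database directly, which is the whole point.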
Then there are the availability zones. Need high-availability for your application or service? Have it live in multiple availability zones so that devices and connections can fail over to each other if something takes a dump. You can make it so that load balancers can balance connections across these zones so you can be making use of all zones simultaneously. You can deploy in different regions of the world and bake in connection logic so that people are routed to the nearest region. It's glorious!
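The "route people to the nearest region, fail over if it dies" logic is roughly this, sketched in plain Python. The region names are real AWS region identifiers, but the latency numbers and the function itself are invented for the example; in practice you'd lean on DNS or a load balancer to do this for you.

```python
def pick_region(latency_ms, healthy):
    """Return the lowest-latency region that is currently healthy, or None."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        return None  # nothing healthy anywhere
    return min(candidates, key=candidates.get)

# Made-up latencies as measured from one player's connection.
player_latency = {"us-east-1": 35, "us-west-2": 80, "eu-west-1": 120}
```

If us-east-1 is up, the player lands there. If it takes a dump, the same call with it removed from the healthy set sends them to us-west-2 instead.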
Deploy shitty code or a change that broke something? Roll it back. You can keep things versioned and store recent versions of machine images to quickly roll back to. You can keep versions of configuration of server clusters that you can roll back to. You can fix a shitty deploy in minutes, and when downtime costs you money that's a whole lot of money you're saving.
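The rollback idea boils down to keeping a short history of known versions and stepping back one when a deploy goes bad. A bare-bones sketch in Python, with a hypothetical class and made-up version labels (the real versions would be machine images or cluster configs):

```python
class DeployHistory:
    """Keeps the last few deployed versions so a bad one can be undone."""

    def __init__(self, keep=5):
        self.keep = keep      # how many recent versions to hold on to
        self.versions = []    # oldest first, live version last

    def deploy(self, version):
        self.versions.append(version)
        self.versions = self.versions[-self.keep:]  # trim old history

    @property
    def live(self):
        return self.versions[-1] if self.versions else None

    def rollback(self):
        if len(self.versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.versions.pop()   # discard the broken deploy
        return self.live      # the previous version is live again
```

Deploy "image-v1", then "image-v2", find out v2 is broken, call rollback() and you're back on v1.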
If you account for these features in your development and deployment pipeline, you can even use services that will know what size and kind of server you want spun up based on which application is running. Again, I'm not even getting into the nitty gritty here; this is all just basic "shit you can do with cloud infrastructure."
There's a reason that Netflix uses the holy bejesus out of AWS. They have tens of thousands of virtual instances serving their content to your home. They have many hundreds of instances just to handle logging of your issues when they occur. Can you imagine doing something that robust in a traditional hosted world? Fucking barf.
The gaming industry should have been smart enough to take note of the shining beacon of "holy shit, that's possible?!" scaling infrastructure that has been Netflix, and they should also be ashamed that they didn't.
That being said...
I'm not a huge XB1 fan (sorry!) but what Microsoft Studios is doing with the new Crackdown game is only a fraction of what's possible in the world of server and computing infrastructure in the cloud right now. It's more cost effective, more capable, more easily monitorable and manageable than anything the traditional server world can offer... so why the fuck has no one charged head-first into this? It's absolutely insane to me that seemingly no one/almost no one in the gaming studio world has gotten out in front of this yet.
And I don't mean Microsoft's initial "everyone is online all the time, so we just always use 'the power of the cloud' to make games better" so much as I'm referring to common sense shit like capacity planning and management so that multiplayer games aren't a steaming sack of shit on day one. Remember every single launch of GTA V (last gen, current gen, PC)? Every single time was a shit show of wondering if you were going to connect to servers or if you'd be booted from sessions, stuck in loading hell, or unable to play in general. Why do we live in a world where a development team of 1,000+ people didn't have the foresight to spend some of those development resources on a better online infrastructure? That blows my mind and makes me so sad as both a gamer and a technologist.
Can you imagine a world where your multiplayer sessions on consoles no longer run like soggy dick because the development and engineering teams were smart enough to take advantage of all these tools?
Have I mentioned how cheap this is compared to the old hat way of doing things?
Okay... I'm going to shut up in a minute.
The point is that I'm really, really excited for what Amazon just announced, if only because I'm so damn relieved to see that someone is addressing this gaping anus in the game development world. It sucks that GameLift is only available for developers using LumberYard, but there's good news: it doesn't fucking matter. All the stuff that GameLift does can be done already. Today. There's an AWS SDK for almost any modern programming language in use today, so this sort of functionality doesn't have to mean studios moving to an entirely new engine; they can very, very easily bake it into their own.
So if there are any AAA game devs on this thread and you happen to read this post:
Where the fuck have you guys been? Is there a legitimate reason other than investment of time and initial cost that this isn't being more widely and quickly adopted in the industry? How are things like better reliability, better uptime and massive cost savings not a crazy motivational force to the CEOs and CIOs and CTOs to do this?
Okay bye. I love you. I'm sorry this was so long and I said fuck and anus.
Coffee.