Server disconnects, black screens, DDoS attacks, login failures, endless queues – the Overwatch 2 launch didn’t always go as planned, did it? Players now seem to have reached a stage where they expect the online multiplayer features of their AAA games to break down at launch.
Forza Horizon 5, Splatoon 3, Halo Infinite and Battlefield 2042 are just a few of the games released last year that suffered major network infrastructure issues at launch. Overwatch 2 may be the latest game to be affected, but it certainly won’t be the last.
With that in mind, it’s fair to say that many are wondering where things go wrong and what studios can do to avoid network problems like those Overwatch 2 experienced.
Why do AAA games continue to struggle with infrastructure issues?
The reason infrastructure problems often persist for weeks and sometimes months after a new game’s launch is that these problems aren’t easy to diagnose.
Developers often have to dig through a huge number of potential causes to figure out what the problem is. Moreover, solving the first problem can lead to others. Take Overwatch 2’s launch, for example. Blizzard suffered massive server outages, and even weeks after release, some of those issues were still ongoing.
So what is causing these problems? There’s no way to pin Overwatch 2’s infrastructure issues on a single cause, but I can speak to the standard network infrastructure of AAA shooters like Overwatch and a few things I witnessed first-hand while working at AAA studios.
More often than not, these issues boil down to a handful of specific problems.
Most AAA games use centralized servers
Most game studios use multiple centralized servers to handle all of the game’s major data processing and management functions. In general, centralized servers have advantages such as being more cost-effective and easier to manage and deploy, but they also have significant downsides.
All the most important data is stored in one (or very few) places. This makes you a huge target for DDoS attacks, since taking down a centralized server can bring down the entire network in that region. Additionally, centralized servers can become a bottleneck for players. When a player complains of network congestion in-game, it’s usually not a ‘running out of servers’ issue; it’s heavy traffic hitting a single node on a centralized server.
Specific Hardware Requirements
Many large studios run their games on highly specialized hardware with low availability worldwide, such as 4GHz-or-faster CPUs for Unreal servers, which makes the game difficult to scale. Game instances that host many players require faster and more powerful CPUs, making resources harder to come by.
Things get even tougher when QA teams only certify game servers on specific server models, leaving DevOps/LiveOps teams struggling during traffic spikes. Even though other models would work, teams tell providers and vendors, “We’ve always done it this way, and we need to keep doing it this way,” because they want to follow specific QA procedures. That’s frustrating when it prevents you from scaling.
Some studios may be tempted to cut costs by using big servers with as many CPU cores as possible. The result is a very high density of players per physical node, so an attack on or problem with one node can spill over to thousands of players. This carries the same risks as using a centralized server: server density and single points of failure make easy targets for hackers and DDoS attacks.
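The density trade-off above is easy to make concrete with a back-of-the-envelope sketch (all player counts and node sizes below are illustrative assumptions, not figures from any real deployment):

```python
# Rough sketch: how node density changes the "blast radius" of a single
# node failure. All numbers are illustrative assumptions.

def blast_radius(total_players: int, players_per_node: int) -> int:
    """Players affected when one fully loaded node goes down."""
    return min(total_players, players_per_node)

TOTAL_PLAYERS = 100_000

# One big high-core-count box packed with game instances...
dense = blast_radius(TOTAL_PLAYERS, players_per_node=10_000)
# ...versus many smaller, widely available machines.
spread = blast_radius(TOTAL_PLAYERS, players_per_node=500)

print(f"Dense nodes:  {dense} players dropped per node failure")
print(f"Spread nodes: {spread} players dropped per node failure")
```

The point is not the exact numbers but the ratio: packing twenty times more players onto a node makes every incident twenty times bigger.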
What are the answers to these infrastructure problems?
No matter the size of your studio, planning for infrastructure issues is difficult.
That said, there are some straightforward answers to questions like the ones above. Of course, their usefulness will vary depending on whether you’re using centralized servers and what kind of hardware your network runs on.
Distributed rather than centralized servers
In a distributed network, all data processing and network management is spread across the network rather than concentrated in one place. Distributed networks are highly flexible and scalable – new servers can be added whenever the need arises – and they can also be distributed geographically across multiple locations. This spreads server load evenly, reduces the risk of bottlenecks, and avoids the kind of massive outages that DDoS attacks can cause on centralized networks. In addition, because you can distribute across multiple providers, the risk of a service outage when one provider has a problem is reduced.
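One common technique for spreading players across a distributed fleet, while keeping the flexibility to add servers on demand, is consistent hashing. A minimal sketch follows; the region names and virtual-node count are hypothetical, and a production system would layer health checks and capacity limits on top:

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Minimal consistent-hash ring: adding a server only remaps a small
    fraction of players, which is what makes scaling out cheap."""

    def __init__(self, servers, vnodes: int = 100):
        self.ring = []  # sorted list of (hash, server) pairs
        self.vnodes = vnodes
        for s in servers:
            self.add(s)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server: str) -> None:
        # Each server gets many virtual points on the ring for even spread.
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def server_for(self, player_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash.
        idx = bisect(self.ring, (self._hash(player_id),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["eu-west-1", "us-east-1", "ap-ne-1"])
print(ring.server_for("player-42"))
```

Because only the keys near the new server’s ring positions move when capacity is added, scaling out does not force a global reshuffle of active players.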
Predicting how many players will play your next new game is like accurately predicting the weather two months from now
Using cloud/edge infrastructure providers
One of the easiest ways to reduce the connection drops and other network issues players experience is to integrate with a cloud-based or edge infrastructure provider. Edge servers deliver low latency and high bandwidth by connecting players to servers closer to them, reducing the distance data has to travel.
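The latency win from moving servers closer can be estimated from first principles: light in fiber travels at roughly 200 km per millisecond, so round-trip time scales with distance. The distances below are illustrative, and real-world latency adds routing and processing overhead on top of this floor:

```python
FIBER_SPEED_KM_PER_MS = 200.0  # roughly 2/3 the speed of light in a vacuum

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical best-case round-trip time over fiber. Real latency is
    higher once routing, queuing and server processing are added."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_MS

# A player routed to a distant central region vs a nearby edge location.
print(f"Central server 4,000 km away: {min_rtt_ms(4000):.0f} ms minimum RTT")
print(f"Edge server 300 km away:      {min_rtt_ms(300):.1f} ms minimum RTT")
```

For a fast shooter, cutting tens of milliseconds off the physical floor is the difference players actually feel.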
The testing phase for online games is equally important and terrifying. Minor problems always arise. But by running and testing your game on multiple infrastructure providers and widely available machines, you can reduce the risk of disruptions and catch potential problems up front. For example, you can save time and money by working with platforms and partners that don’t require additional internal resources from multiple engineers or DevOps teams.
Automate, Automate, Automate
Today’s infrastructure, combined with the diversity of services studios offer, often results in a complex puzzle. What was once managed by a few scripts written by system administrators no longer can be. Automation and deployment solutions such as Kubernetes, containerized payloads, microservices and CI/CD solve problems, but they also introduce new challenges.
The only way to take advantage of better (and more complex) infrastructure is full-scale automation. A studio may have a great team of engineers, but those engineers should be focused on building the best games. I can’t say that rebuilding tools in-house to automate infrastructure, when tools on the market today do exactly that, is the best use of studio resources.
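At its core, the automation described above is a control loop: measure player demand, compare it to running capacity, and scale without a human in the path. A toy sketch follows; the slot counts, headroom factor and scaling policy are all assumptions for illustration, not any real system’s defaults:

```python
import math

SLOTS_PER_SERVER = 60  # players one game server can host (assumed)
HEADROOM = 1.25        # keep 25% spare capacity for sudden spikes

def desired_servers(current_players: int) -> int:
    """Target fleet size for the current player count, with headroom."""
    return max(1, math.ceil(current_players * HEADROOM / SLOTS_PER_SERVER))

def reconcile(running: int, current_players: int) -> int:
    """One iteration of the control loop: how many servers to add
    (positive) or drain (negative). A real autoscaler would also
    rate-limit scale-downs and respect provider quotas."""
    return desired_servers(current_players) - running

print(reconcile(running=100, current_players=9_000))  # positive: scale up
print(reconcile(running=200, current_players=6_000))  # negative: scale down
```

Tools like Kubernetes-based game-server fleets implement exactly this reconcile-toward-desired-state pattern, which is why buying rather than rebuilding them is usually the better trade.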
Reputational Damage and Financial Costs of Poor Infrastructure Management
When a popular multiplayer game goes down for even a short period, the financial impact is huge. Blizzard, for example, has moved Overwatch to a free-to-play model. With the upfront price tag gone, Blizzard’s main source of revenue is in-game purchases. Imagine how much was lost in those first few weeks, when players couldn’t connect to a match, or simply abandoned the game because of the problems and didn’t come back until they were fixed.
As another example, take Roblox, which makes over $5 million in revenue each day. What kind of financial impact did the three-day outage in November 2021, which coincided with the Chipotle collaboration, have? (Roblox says the outage had nothing to do with the Chipotle collaboration.)
Keep an open mind and use infrastructure flexibly, rather than relying on fixed infrastructure with no escape route when problems arise
And just as important is efficiently provisioning enough servers to meet the demand of new game releases and updates. Predicting how many players will play your next new game is like being asked to accurately predict the weather two months from now. I’ve seen major AAA titles that could have drawn millions of players fail at launch, and I’ve watched a one-man military game gain millions of players overnight.
What is the solution?
Again, flexibility is the answer. Some studios may want to over-provision significantly, with upper management locking in hundreds of additional servers under a long-term contract. But what if your game has a successful launch and the player count drops off shortly afterwards (which happens, by the way)? You’re left with hundreds of servers you pay for but don’t use.
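The over-provisioning trap is easy to quantify. Here is a hedged sketch comparing a fixed long-term commitment against paying only for what each month actually needs; every price and server count is invented for illustration:

```python
SERVER_COST_PER_MONTH = 300  # hypothetical flat rate per server
MONTHS = 12

def committed_cost(contracted_servers: int) -> int:
    """Long-term contract: pay for every server, used or not."""
    return contracted_servers * SERVER_COST_PER_MONTH * MONTHS

def elastic_cost(servers_needed_per_month: list[int]) -> int:
    """Flexible capacity: pay only for what each month actually needs."""
    return sum(n * SERVER_COST_PER_MONTH for n in servers_needed_per_month)

# A launch spike that falls off after the first two months.
demand = [500, 350, 120, 100, 90, 80, 80, 75, 70, 70, 65, 60]

print(f"Committed to 500 servers: ${committed_cost(500):,}")
print(f"Elastic provisioning:     ${elastic_cost(demand):,}")
```

With this demand curve the committed contract costs several times the elastic spend, all of it paid on idle hardware after month two.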
Infrastructure problems, however short-term, can have long-lasting negative effects on your game. With so many new games entering the market, players who experience major issues at launch may abandon the game and move on to something else.
For example, look at the server issues in the recently released World War 3. Angry players took to the Steam marketplace and review-bombed the game to express their frustration. This damages the game’s reputation with other potential players, lowers its rank in the marketplace, and reduces its visibility.
What will Overwatch 2 do next?
If Microsoft’s acquisition of Activision Blizzard is successful, the company may look to shift more resources to Azure’s cloud infrastructure. However, relying on a single cloud provider is not a perfect solution, as Halo Infinite continues to have infrastructure issues on Azure.
It makes sense for Blizzard to use Azure as its primary service provider, but the company will need to pursue infrastructure partnerships beyond that relationship, given its long-term plans for Overwatch 2. Adding providers other than Azure gives the studio somewhere to switch or fail over to when things go wrong, reducing the risk of infrastructure issues in Overwatch 2 and improving the overall quality of service and player experience.
It’s important to stay open-minded and use infrastructure flexibly, rather than relying on a fixed setup with no escape route when problems inevitably arise. Today’s cloud and server providers want a stable business with predictable forecasts and revenues, but the traffic of almost every online game doubles and triples within its first 24 hours, making it a nightmare to keep up with, both technically and commercially.