Xbox One: The first HPC gaming console

I just read an interesting article on Ars Technica and it got me thinking a bit... the article is linked below for reference:

In the article, the author describes how the Xbox One (XB1) will use cloud computing to offload "latency-insensitive" calculations from the console, based on an interview with Matt Booty (General Manager of Redmond Game Studios and Platforms).


IBM Blue Gene and Xbox One: the console as an HPC system

The most interesting detail about Microsoft's plans for the XB1 is how they intend to use their cloud resources to supplement the XB1 hardware. Rather than treating the cloud as a virtual machine environment that streams a game running on central servers to a thin client (which appears to be the model Sony will use with the PS4), Microsoft is approaching the next-gen console problem with a classic high-performance computing (HPC) distributed processing model.

In this processing model, every XB1 is an HPC head node running an interactive application (the game), which submits certain heavy visualization tasks to centralized compute resources for processing and receives results back to update the visualization model in real time. The article states that Microsoft anticipates having approximately three compute resources in the cloud for every XB1 (head node), effectively making every XB1 its own very small HPC compute cluster. This ratio may change over time as they figure out how to make greater use of their backend infrastructure for XB1 visualization tasks.


Visualization problems on the WAN: How do we enable "cloud-based consoling"?

Looking at the two proposed solutions for next-gen consoles from Microsoft and Sony, we have two basic strategies for the use of cloud resources: Microsoft advocates extra compute resources made available in the cloud to the local console, while Sony champions a faster local console paired with a hosted cloud game-streaming solution (using Gaikai's technology). Both models demand a very fast, low-jitter, low-latency network connection, which isn't currently a realistic expectation on most Internet connections (or "wide area network/WAN links") available to consumers. This creates problems for both proposed models.

To cover a bit of background on the issue, I should mention that I work in research computing for a major research university, and part of my team's job is to provide HPC resources for real-time dynamic visualization systems. Essentially, the classic HPC model is a distributed computing model similar to running SETI@home across multiple PCs: a "head node" creates tasks and gathers/displays results, while a task management system keeps all the compute nodes working together and delivering results back to the head node.
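As an illustrative sketch of that head node/compute node pattern (not Microsoft's actual infrastructure), here worker threads stand in for remote compute nodes and Python's `concurrent.futures` plays the role of the task management system; the task itself (`render_chunk`) is a made-up placeholder:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for a compute node's work: in a real cluster this
# function would run on a remote machine; here worker threads play that role.
def render_chunk(chunk_id):
    # Pretend to compute one piece of a visualization (e.g. a lighting pass).
    return chunk_id, sum(i * i for i in range(10_000))

def head_node(num_chunks=8, num_workers=4):
    """The 'head node': creates tasks, hands them to the task manager,
    and gathers results back to update the local visualization model."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:  # task manager
        futures = [pool.submit(render_chunk, c) for c in range(num_chunks)]
        for fut in as_completed(futures):   # results arrive as workers finish
            chunk_id, value = fut.result()
            results[chunk_id] = value       # fold the result into the model
    return results

if __name__ == "__main__":
    print(sorted(head_node().keys()))   # every chunk accounted for
```

The key property for the discussion that follows is that the head node stays interactive while results trickle back in whatever order the workers finish.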

Usually when we try to solve dynamic visualization problems using standard HPC systems, the default plan is to use compute resources that are local to each other and build very low latency connectivity (InfiniBand, etc.) between the head node, storage, and compute resources. The low latency lets us throw multiple problem types at an HPC system while still rendering close enough to real time for the visualization to be useful. The whole system is usually a distributed memory and processing architecture (lots of individual servers), but we tie it together with a very fast interconnect (network) to minimize latency issues, which makes it appear to be a single visualization system to the user running an application on the head/visualization node.

My team also supports high-bandwidth video conferencing - think HP's old Halo telepresence system, full Cisco TelePresence rooms, or similar low-latency, high-bandwidth systems. These systems have issues with network jitter and latency, and the problems a less-than-perfect network introduces are very similar to what you see with cloud servers running games and streaming them to a thin client over the Internet, as Sony intends with the Gaikai/PS4 model. My team actually runs a closed private network across multiple Provinces (States) in order to manage the network jitter, latency, and traffic contention issues that occur when you need real-time, two-way streaming of large amounts of visualization data. If your application is at all sensitive to real-time interaction, any network jitter or latency problem will break high-quality bidirectional use: your connection will stutter like an HD YouTube video on a bad DSL connection, even if you have plenty of bandwidth.
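A toy model (with entirely made-up numbers) shows why jitter, rather than raw bandwidth, is the killer here: a constant network latency can be hidden by simply delaying playback, but per-frame jitter larger than one frame time pushes frames past their display slot and shows up as stutter:

```python
import random

def stutter_rate(n_frames=600, frame_ms=16.7, jitter_ms=20.0, seed=1):
    """Fraction of streamed frames that miss their display slot in a naive,
    unbuffered streaming model. Constant latency is absorbed by delaying
    playback, so only the random per-frame jitter matters: any frame whose
    jitter exceeds one frame time (16.7 ms at 60 fps) arrives late."""
    random.seed(seed)
    missed = sum(
        1 for _ in range(n_frames)
        if random.uniform(0.0, jitter_ms) > frame_ms  # arrives after its slot
    )
    return missed / n_frames
```

Under these assumed numbers, 20 ms of jitter against a 16.7 ms frame budget leaves roughly a sixth of frames late no matter how much bandwidth is available, while jitter below one frame time produces no stutter at all.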


The promise of a console that scales

The cycle of "hardcore" PC/console tribal warfare often revolves around a leapfrog performed by consoles every 8-10 years or so, in which for a brief moment in their lifecycle consoles eclipse the steady march of PC hardware improvements and shine as the best graphical experience available to a consumer. This is usually a short-lived victory; within a year or two, PC hardware continues its unyielding press toward the future and once again surpasses the best visuals of a given console generation. Arguments then descend back into the standard volleys: console ease of use, PC flexibility, price/performance ratios, the fact that many of the best games come only to consoles, and the fact that large-scale gaming (MMOs, etc.) generally shows up only on PCs.

With this new generation of console, I believe that Microsoft is attempting to build not just a console that takes over the living room, but one that will scale to match the PC's pace in computational prowess over time. This may seem a strange premise considering the relatively anemic hardware that the Xbox One will ship with, but hear me out.

In the research community, visualization tasks often vastly outstrip locally available HPC resources (try visualizing/modeling all the physical structure changes a major earthquake inflicts on a metropolitan area, for instance), and this problem has created a push to solve real-time visualization tasks on distributed HPC systems. A current focus of the research community is cloud/grid-computing models for visualization, which separate out processes that are latency-insensitive and can be offloaded to remote resources. Separating out those processes for distributed execution allows a much greater set of resources to be applied to a given visualization problem. HPC research groups are very much interested in "cloud computing" to support expanded visualization efforts that cannot be handled on local research systems.

It's a difficult problem to solve: the CPU/GPU workloads must be parallelized, and then the tasks that can tolerate greater latency without negatively affecting the visualization app running on the interactive (head) node must be identified. Those tasks are shipped off over the network to various HPC compute nodes, and the results are delivered back to the interactive visualization application as the tasks complete. This appears to be Microsoft's strategy with the Xbox One. From Microsoft's description of the XB1 backend architecture, they already have a standard HPC head/compute node model and tasking system built into the XB1 infrastructure. If they distribute a "head node" unit (the XB1) to every user and keep that unit focused on critical low-latency calculations, then their primary task for increasing the XB1's performance in future years is building out an HPC queuing model for high-latency processing/visualization methods.
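The task-separation step above can be sketched minimally; the task names, latency budgets, and timing constants here are all hypothetical illustrations, not anything from Microsoft's actual system:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    latency_budget_ms: float  # how stale a result can be before the frame suffers

def partition(tasks, network_rtt_ms, remote_compute_ms=30.0):
    """Split a frame's workload between the local console and remote compute
    nodes: a task goes remote only if the network round trip plus remote
    compute time still fits inside its latency budget."""
    local, remote = [], []
    for t in tasks:
        if network_rtt_ms + remote_compute_ms <= t.latency_budget_ms:
            remote.append(t)
        else:
            local.append(t)
    return local, remote

# Example frame: input response must stay local, while slow-changing
# world simulation can tolerate a round trip to the cloud.
frame = [
    Task("player input / camera", 16.0),
    Task("ambient physics (debris, cloth)", 250.0),
    Task("global lighting update", 500.0),
]
local, remote = partition(frame, network_rtt_ms=80.0)
```

With an assumed 80 ms round trip, only the input/camera task stays on the console; the physics and lighting tasks have enough slack in their budgets to go remote, which is exactly the kind of split Booty's examples describe.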

In the Ars article linked above, Matt Booty gives several examples of how Microsoft plans to offload current high-latency visualization tasks to their central compute resources. Future efforts will likely involve tweaking where CPU/graphics subsystem processes run (local or remote) as network conditions for the average user change over the next decade, and as the HPC research community creates better methods for passing visualization tasks to remote systems. The good news is that a lot of people are working on this problem for purposes other than gaming. Real-time distributed visualization over a WAN is an active focus of the HPC community, so Microsoft won't be solving these issues on their own.

The advantage of this strategy is that the XB1 console can gradually shift more and more of its visualization processing to "the cloud" as time goes on, allowing the console to keep pace with graphics technology developments beyond simply writing more efficient code and squeezing resources out of the existing XB1 hardware. This does presume that the research community makes progress on HPC distributed task models for visualization. Effectively, the XB1 could be 10x as powerful in 8 years as it is today, as long as the problems of separating latency-tolerant tasks and building out an HPC visualization model over WAN links are solved efficiently over time.

Consider the following entirely hypothetical situation (with arbitrarily made-up numbers) as an illustration. Matt Booty states in his Ars interview that, "A rule of thumb we like to use is that [for] every Xbox One available in your living room we’ll have three of those devices in the cloud available." Let's presume that Microsoft plans to keep that hardware ratio the same for the life of the console (they have it budgeted in their infrastructure costs), and that they replace their server infrastructure on the corporate-standard 3-year cycle. This means that every year, 1/3 of the servers are replaced with new equipment. Every year, each Xbox One potentially has 1/3 of its available remote hardware resources upgraded. Every year, the console gets to take advantage of Moore's Law.
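Continuing with those made-up numbers, the upgrade treadmill can be put into a quick model: three servers per console, one replaced each year (so the fleet holds machines bought this year, last year, and the year before), with per-server performance doubling every two years as a loose stand-in for Moore's Law:

```python
def cloud_power(year, servers_per_console=3, doubling_period_years=2.0):
    """Relative cloud compute behind one console in a given year, under the
    post's hypothetical assumptions: a 3:1 server ratio, a 3-year refresh
    cycle (server ages 0, 1, and 2 years), and per-server performance
    doubling every two years. Performance at launch (year 0) is normalized
    so each server contributes 1 unit."""
    install_years = [max(0, year - age) for age in range(servers_per_console)]
    return sum(2 ** (y / doubling_period_years) for y in install_years)

# Growth in cloud compute per console over an 8-year console generation.
growth = cloud_power(8) / cloud_power(0)
```

Under these assumptions the cloud side of a console grows roughly 12x over eight years without the console itself changing at all, which is where the earlier "10x as powerful in 8 years" back-of-the-envelope figure comes from.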

It appears that Microsoft is playing to its company strengths with the Xbox One, focusing on a "cloud-based consoling" technology where its years of enterprise experience become an asset. For this model to work, a solid enterprise infrastructure background and research teams that understand and can push the boundaries of enterprise HPC technologies will be critical. Microsoft, with decades of enterprise focus, has the background to support a distributed computing infrastructure for its console, and has demonstrated as much with the Xbox 360's Live service. The next version of that infrastructure is going to be a high-performance computing infrastructure, providing direct computational resources to consoles that can grow over time. The PC might not be nearly as far ahead at the end of the next console cycle as it is today.



Console scale up... console scale down

In addition to the ability to compete on the high end with better processing capabilities long-term, an HPC model for their console enables Microsoft to scale down as well as up. This might explain why the Xbox One only has the processing capabilities that it has, which are indisputably mid-range for CPU/GPU technology. As a front-end to a larger distributed computation system, the console doesn't have to contain all of the CPU/GPU resources to provide a high-end experience. More importantly, by scaling back the required front-end equipment, you get to mobile device packaging sooner.

I would not be surprised at all to see a new package from Microsoft in a couple of years... something with the same SoC from AMD in a different die size, running the same hypervisor and two virtual machines, one Windows 8 for applications, the other the Xbox OS for games. An identical hardware environment from an Xbox developer perspective, but running on something the size and shape of a Surface Pro. The only difference would be the built-in screen and the fact that the Windows 8 VM will probably have a larger partition size (and get a lot more use). The same HPC infrastructure would provide additional compute resources to this full portable Xbox as well (presuming they figure out how to deal with wireless latency issues).


So when does the Xbox One get the processing power needed for a Holodeck?

Strategically, if Microsoft succeeds in building out the next Xbox as a functional distributed HPC system, with every XB1 essentially functioning as a head node, it will in my opinion be a clear win for them in this console generation. The disadvantage of the XB1's somewhat slower front-end hardware becomes a non-issue as tasks are transferred to dedicated compute resources, and Microsoft gains a lower buy-in cost for the consumer, along with a more scalable system that is less dependent on low network latency than a pure thin-client "cloud gaming" model like Sony's Gaikai/PS4 strategy.

That's not to say that Sony couldn't run the same computing model on the PS4 once the major problems with offloading visualization workloads to a WAN-based HPC cluster are solved - with a more powerful "head node" as a bonus, if that power is needed. Sony would, however, be playing catch-up in critical aspects of distributed infrastructure design, process queuing/task management, and developer toolsets. It would be a hard conversion to make partway through a console life cycle, and I highly doubt they could take the performance lead with this kind of computational model unless they start working on it now.

So... after thinking about it a bit, and looking at the research world's focus on distributed visualization tasks using an HPC framework, I think Microsoft actually has the better strategy for this console generation from a potential performance perspective. They're taking more of a risk with their computing model than Sony, but if it pays off and they can draw on one of the major focus areas in research computing to supplement their console's processing abilities, I expect the performance difference between the two platforms (in both processing and responsiveness) to become significant over time.

So if the goal is to build a holodeck... they appear to be working toward it by setting up the Xbox One/Kinect as the head node and primary interface for an HPC supercomputing infrastructure. With IllumiRoom/Oculus Rift technologies waiting in the wings, here's hoping they figure out the remaining pieces during this next console generation.