We are on the cusp of a new console generation, and many gamers want to talk about technical specifications. As we saw with the recent confirmation of 8 GB of GDDR5 in PS4, not everyone understands what those specifications actually mean. Instead of complaining about that, I thought it would be a good idea to try to explain some basic concepts, hopefully in a manner that is easy to follow.
A few things first:
- I've never shipped a commercial game, and I have no insider knowledge of any kind about existing or upcoming consoles
- As per the above, some things I write here could well be incomplete or wrong; please point that out so I can continue improving this OP
- I hope this thread can also serve as a central discussion point about what specs mean, without degenerating too much into console (or general platform) warfare
- The explanations in this thread will focus on a gaming perspective, and what the individual components and their specifications mean for gaming
- Due to the number of topics touched upon, this will by necessity be a very shallow overview. If you want slightly more detail and a more general view of these topics, Wikipedia is a good starting point
CPU
The CPU serves to perform general purpose processing in a console. What does this mean? It means performing tasks such as deciding for each non-player actor in an FPS what their next action will be, or packaging data to send over the network in any online game, or orchestrating (but not actually performing) the display of graphics and the playback of sound.
Specifications and terms which might turn up when discussing CPU performance include:
- Cores: Most modern CPUs have multiple independent processing cores. This means that they can work on more than one program stream (often called a thread) at the same time. The number of cores determines the number of such threads that can be processed simultaneously. Cores can be symmetrical or asymmetrical: the former is the traditional and usual case, while the latter means that the individual cores differ from each other in architecture. For example, Xbox 360 has 3 symmetrical cores, PS4 will have 8 symmetrical cores, PS3 uses 8 asymmetrical cores (one general-purpose PPE and seven specialized SPEs), and Wii has a single core.
- Hardware threads: As explained above, cores execute program streams called threads. On some architectures, more than one thread may be active per core at a time in hardware, a feature widely known as SMT (simultaneous multithreading). This usually serves to achieve better utilization of the core. Note that this is not comparable in performance to having an additional full core, and also note that any number of software threads can run on any core by distributing time slices. Examples of hardware multithreading include the Pentium 4, the Xbox 360 cores and the PS3 PPE, which could all execute 2 threads per core, or IBM's POWER7, which supports up to 4-way SMT.
- Clock frequency: Perhaps the best-known factor in CPU performance. The clock rate describes the frequency (that is, cycles per second) at which instructions are moved through the processor pipeline. All else being equal, a processor clocked at twice the frequency will be able to perform twice as much work per unit of time. The PS3 and 360 CPUs were both clocked at 3.2 GHz, while PS4 and 720 are rumored to clock at 1.6 GHz. Modern desktop PC processors clock anywhere from 2.5 to 4.2 GHz.
- Instruction-level parallelism (ILP): Almost all modern processor architectures are superscalar. This means that they are capable of executing more than one instruction per clock cycle by distributing them across individual functional units on the processor. A closely related metric is IPC (instructions per clock), describing the number of instructions that one core can complete in a single clock cycle. This is an often overlooked factor in discussions on message boards, and one of the reasons why e.g. a Sandy Bridge (Core i7) core clocked at half the frequency of an Atom or ARM core would still easily outperform them.
- Pipeline stages: To enable high clock frequencies, instructions in modern CPUs are pipelined, which means that they are executed in small chunks over multiple clock cycles. This is generally useful, but can lead to problems, e.g. when a result needs to be available before the processor can decide how program execution will continue. Such situations cause pipeline stalls, and the magnitude of their impact on performance depends on the length of the pipeline.
- Out-of-order execution: In the simple case, a processor will execute instructions in exactly the order provided by the instruction stream of the program -- this is known as in-order execution. However, it was discovered quite early in the history of computer architecture that it can often be advantageous to execute instructions out of order. This can lead to much better utilization of the available processor resources, especially in cases where compiler optimization fails. All modern desktop CPUs and the next-gen console CPUs support out-of-order execution, while the 360 and PS3 were both in-order.
- Single instruction, multiple data (SIMD): SIMD instructions and units allow a processor to operate on multiple pieces of data with a single instruction, in a single clock cycle. SIMD units are usually characterized by their width in bits. For example, a 128-bit SIMD unit can work on 4 single-precision floating point values (32 bits each) at a time. By definition, this works well only when performing the same sequence of instructions on many data elements.
- Cache size: CPUs run at multiple GHz, and it can take dozens or even hundreds of cycles to get data from main memory. Caches serve as intermediate storage for frequently accessed data and greatly reduce access latency. Often there are multiple levels of cache, with smaller, faster levels fed by larger, slower ones. One important factor is whether individual cache levels are shared by all cores on a chip or exclusive to a core. Shared caches can be used to speed up communication between cores and reduce the penalty for moving threads between them.
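To make the SIMD point above a bit more concrete, here is a toy sketch in plain Python. Real SIMD is a hardware feature, so this just models the lane arithmetic: a 128-bit unit has 128 / 32 = 4 single-precision lanes, so one vector add produces 4 results at once.

```python
# Conceptual sketch only: a 128-bit SIMD register holds 128 / 32 = 4
# single-precision floats, so one SIMD add yields 4 results per cycle.
SIMD_WIDTH_BITS = 128
FLOAT_BITS = 32
LANES = SIMD_WIDTH_BITS // FLOAT_BITS  # 4 lanes

def simd_add(a, b):
    """One 'instruction': add all lanes of two vectors element-wise."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]

# Adding two arrays of 8 floats takes 8 scalar adds,
# but only 8 / 4 = 2 SIMD adds.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [0.5] * 8
result = []
for i in range(0, len(a), LANES):
    result.extend(simd_add(a[i:i + LANES], b[i:i + LANES]))
# result == [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
```

This is also why SIMD only pays off when the data is laid out so that all lanes do the same operation: the unit has no way to run a different instruction per lane.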
What does all of this mean? To understand how a processor will perform at a task we first need to decide how that task will be impacted by the individual performance characteristics of the processor. Will there be a lot of SIMD-friendly number crunching? Can the task be distributed across multiple threads efficiently? Will there be lots of data-dependent unpredictable branching?
Let's look at some examples to get an idea of how different architectures would fare:
- Multiplying 2 dense matrices (e.g. some core graphics-, audio-, or physics-related engine code). Here, we know exactly what we are going to do, and the task is easily parallelized across cores and on SIMD units. On the other hand, a long pipeline or even a lack of out-of-order execution will not hurt much (given a competent compiler).
- Interpreting a scripting language (e.g. what Skyrim does for much of its actor behavior). We may interpret multiple separate scripts on different cores, but we can't multi-thread a single script. SIMD is likely to be mostly useless, and long pipelines are likely to hurt our throughput. Out-of-order execution will improve performance by a significant degree.
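The matrix example above can be sketched in a few lines of Python; the point is structural, not about speed. Every output element of C = A*B depends only on one row of A and one column of B, so different rows can be handed to different cores with no synchronization, and the inner dot product maps directly onto SIMD lanes.

```python
def matmul(A, B):
    """Naive dense matrix multiply: C[i][j] = sum over k of A[i][k] * B[k][j].
    Each output row i depends only on row i of A and all of B, so rows
    can be computed on separate cores independently; the inner sum is a
    SIMD-friendly dot product with no data-dependent branching."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)  # [[19, 22], [43, 50]]
```

Contrast this with a script interpreter, where the next operation depends on the value just computed: there is nothing to hand to other cores or SIMD lanes, which is exactly why that workload favors high IPC and out-of-order execution instead.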
GPU
Traditionally, the GPU's job is to render graphics onto the screen, and that's it. This is still its main purpose; however, since around 2006 the field of general-purpose computation on GPUs (GPGPU) has grown in importance. This means taking tasks that would traditionally have been performed on the CPU and letting the GPU work on them instead.
The most important GPU specifications are:
- Shader processors: The main processing elements on a GPU. You can think of them as simple CPU cores with low frequency, very wide SIMD units and high penalties for branching code. They execute all the pixel, vertex, geometry and hull shader code that a modern 3D engine throws at the GPU, and thus limit the computational complexity of these effects. Since different vendors count these units differently, it makes the most sense to me to just look at the number of floating point operations that can be performed per cycle on the whole GPU.
- Clock frequency: Just like on CPUs, this determines the number of hardware cycles per second. It is usually around 1 GHz on current high-end PC hardware, and ~500 MHz on PS3 and 360. Obviously, the frequency impacts the performance of all the other components of the GPU.
- Render Output Units (ROPs, also known as raster operations pipelines): These take shader output and write/blend it to buffers in memory. The number of ROPs therefore limits the maximum number of pixels that can be rendered per unit of time. Modern ROPs can perform multiple Z operations for each color operation, which allows the GPU to more quickly discard pixels that will not be visible in the final rendered image (because they are behind some occluding geometry).
- Texture Mapping Units (TMUs): TMUs gather and filter texture data which is used as one of the inputs to various shader programs. The number of these units available determines the detail and filtering quality of textures you can use in a 3D scene.
- Caches and local storage: Just like CPUs, modern GPUs feature caches to alleviate external memory bandwidth and latency issues. Unlike most CPUs, they also reserve small local memory spaces for programmers to actively use for communication or caching. This is similar to the local store on each Cell SPE.
Now let's again look at some examples and see how the individual specs impact them:
- Increased geometric asset detail. This will put more vertex processing strain on our GPU, but will leave ROPs and TMUs largely unaffected.
- Increased rendering resolution. Here we will increase the amount of pixel processing required, while other shader processing should stay at similar levels. Texturing load will also increase, but we may get better texture caching. ROP performance requirements will increase significantly.
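A quick back-of-the-envelope calculation shows why a resolution bump hits pixel shaders and ROPs so hard. The resolutions below are just the common 720p/1080p figures, not tied to any particular console:

```python
# Pixels per frame at two common resolutions.
pixels_720p = 1280 * 720    # 921,600 pixels
pixels_1080p = 1920 * 1080  # 2,073,600 pixels

# Every one of these pixels has to be shaded and written by the ROPs,
# so the per-frame pixel workload scales with this ratio.
scale = pixels_1080p / pixels_720p  # 2.25x
```

In other words, going from 720p to 1080p means 2.25 times as many pixels to shade and write per frame, while vertex work and game logic are essentially unchanged.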
Memory
Memory is used to store data needed by your game/OS. This may seem obvious, but it's worth stating explicitly that memory, by itself, performs no computation. For games, the majority of memory is usually taken up by graphics asset data, such as textures, models and animation data. Audio and gameplay-related data are usually comparatively small. Of course, this also depends on the type of game: a corridor shooter will require less memory for gameplay data than an RTS or a large-scale open world game with many active actors.
Memory is characterized by several distinct aspects, each of which is individually important:
- Capacity: Very straightforward, this is the amount of data that can be stored in a given block of memory.
- Bus width: The number of bits that can be transferred to/from memory per cycle. This is usually limited by (and limits) the number of memory chips needed to implement a given capacity.
- Clock frequency: Just like CPU and GPU, memory will also operate at some clock rate. Together with the bus width, this determines the bandwidth of the memory, and together with the delay in clock cycles of various operations it determines the latency.
- Bandwidth: The amount of data that can be transferred to and from memory in a given unit of time. In some cases the quoted bandwidth is shared between the two directions (reads and writes compete for it), while in other cases each direction has its own dedicated bandwidth.
- Latency: The time it takes to access any given location in memory. In practice this determines, once the CPU or GPU requests some value that is not in cache, how long it will take until this value is accessible to it.
- Layout: Memory can be set up in any number of blocks of different types. If the main memory block uses a single memory type and is accessible by both the CPU and GPU, the layout is usually called unified. A unified layout is easier to program and more straightforward to implement, but restricts you to a single memory type and limits high-end performance. PS3 and PCs use a split memory layout with separate main and graphics memory, while PS4 uses a unified layout. Xbox 360 and Wii U are unified in terms of main memory, but add a separate embedded memory pool.
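Bus width and clock combine into bandwidth in a simple way. A rough sketch (the 256-bit bus and 5.5 GT/s effective data rate below match commonly reported PS4 GDDR5 figures, but treat the numbers as illustrative):

```python
def peak_bandwidth_gb_per_s(bus_width_bits, effective_rate_gt_per_s):
    """Peak bandwidth = bytes per transfer * transfers per second.
    GDDR5 is usually quoted by its effective data rate (transfers/s),
    which already folds in the clock and the data rate multiplier."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_gt_per_s

# e.g. a 256-bit GDDR5 bus at 5.5 GT/s effective:
peak = peak_bandwidth_gb_per_s(256, 5.5)  # 176.0 GB/s
```

Note this is the theoretical peak; real-world transfers achieve somewhat less, and latency is a separate question entirely.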
For memory, there are only a few types that are in general use, so let's go over them quickly:
- DDR3: Largest capacity per chip, low bandwidth, medium latency. Main system memory on PC and Wii U, rumored to be used in the next Xbox.
- GDDR5: Lower capacity per chip, higher bandwidth, higher per-clock latency (partially offset by higher clock). Used in all high-end GPUs on PC as well as in PS4.
- eDRAM: Very low capacity due to being embedded on-chip, low latency and potentially high bandwidth (with a wide bus). Used for the 360 GPU framebuffer and on Wii U.
- eSRAM: Even lower latency and capacity, used to implement caches.
Let's finish this section up by once more looking at a couple of use cases and how they impact memory:
- Increasing the framerate. Going e.g. from 30 to 60 FPS will not require any additional memory capacity, but significantly higher bandwidth for the GPU and potentially also lower latency.
- Increasing level size. This will mostly impact capacity, since you need to keep a larger set of assets in memory. However, since the set of assets used in each individual frame is not likely to increase much in size, bandwidth requirements are mostly unaffected (and so is latency).
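To see why asset data dominates capacity, here is a quick sketch of uncompressed texture sizes (the 4-bytes-per-pixel RGBA format and the one-third mipmap overhead are standard rules of thumb; the 500-texture level is a made-up illustration):

```python
def texture_bytes(width, height, bytes_per_pixel=4, mipmaps=True):
    """Uncompressed texture size in bytes; a full mipmap chain
    (progressively halved copies) adds roughly one third on top."""
    base = width * height * bytes_per_pixel
    return int(base * 4 / 3) if mipmaps else base

# One uncompressed 2048x2048 RGBA texture with mipmaps:
one_tex = texture_bytes(2048, 2048)  # ~22 MB

# A hypothetical level using 500 such textures would need ~11 GB
# uncompressed, which is why texture compression and streaming
# are essential even with 8 GB of total memory.
level_estimate = 500 * one_tex
```

Note that this is a capacity question only: a single frame samples a much smaller working set of these textures, which is why the bandwidth requirement barely moves when levels grow.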
Other hardware
Consoles used to include lots of special fixed function hardware to perform a variety of tasks. While the tendency in general in hardware has been towards programmability and more general purpose computation, a few components that get mentioned often should be discussed:
- Audio DSPs: Digital Signal Processors are very efficient at the kind of processing required for e.g. audio. This is particularly important when your main CPU is comparatively weak at such tasks. Wii U, PS4 and the next Xbox are all rumored to feature some dedicated audio hardware, while on PCs audio processing is now mostly done entirely on the CPU.
- Video Encoding Hardware: Video encoding is a very performance intensive task, and one which can be accelerated significantly by dedicated hardware. Wii U uses dedicated video compression hardware for streaming to the gamepad, and PS4 also includes such hardware for its streaming and recording features. Nvidia plans to use hardware in the 600-series GPUs to enable streaming to Shield.
General Terms
Here I planned to list a few more terms that are used for different components, and often crop up in discussions, but there's really only one I can think of that isn't covered yet:
- GFLOPS: Giga (billions of) floating point operations per second. In console discussions, this usually refers to the theoretical maximum number of single-precision floating point operations possible on some piece of hardware per second. With the information about CPUs and GPUs outlined above, we can see that this number is roughly a function of (core count) * (ILP) * (SIMD width) * (clock frequency).
An 8-core Jaguar CPU at 1.6 GHz performs ~100 GFLOPS, Cell in PS3 manages around 200, the Xenos GPU in Xbox 360 managed 240, and the GPU in PS4 does 1800. AMD's current high-end graphics chip does 4300 and Nvidia's Titan does 4500.
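Plugging the formula above into the Jaguar example checks out. The 8 FLOPs per cycle per core is my assumption (a 128-bit SIMD unit doing a 4-wide multiply-add, counted as two operations per lane); the core count and clock are from the figures above:

```python
def peak_gflops(cores, flops_per_cycle_per_core, clock_ghz):
    """Theoretical peak = cores * FLOPs per cycle per core * cycles/s.
    The middle factor bundles ILP and SIMD width together."""
    return cores * flops_per_cycle_per_core * clock_ghz

# 8-core Jaguar at 1.6 GHz, assuming 8 single-precision FLOPs
# per cycle per core (4 SIMD lanes * 2 ops for a multiply-add):
jaguar = peak_gflops(8, 8, 1.6)  # 102.4 GFLOPS, i.e. the ~100 figure above
```

Keep in mind these are theoretical peaks: real code rarely keeps every SIMD lane of every core busy every cycle, which is exactly why the other factors discussed in this post matter.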
Congratulations if you read all of the above. It should now be clear to you why e.g. doubling the memory capacity of PS4 will not automatically increase the framerates of games running on it, or why the Wii U CPU, at a much lower clock rate, can still keep up with the Xbox 360 CPU in some tasks -- and why others are problematic for it.
I geeked out a bit too much when I started writing this, especially in the CPU part. In the interest of focusing only on what's useful to gamers, keeping this at least somewhat readable, and keeping the writing time below 4 hours, I got more focused/shallow in the other parts.