People keep saying GPUs can't do hardware tiling/swizzling, that the driver usually does it on the CPU, and that MS therefore added dedicated hardware for the task. They can.
In short, the XBONE's DMEs are just jacked-up DMA units; two of them clearly came from a GCN GPU, as they contain the swizzling hardware that is already present in all modern GPUs.
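To make it concrete, the tiling/swizzling in question is just an address remap applied while a texture is copied. A minimal CPU-side sketch of that remap, assuming a square power-of-two texture, 32-bit texels, and a plain Morton/Z-order layout (real GCN tiling modes are more involved than this):

#include <stdint.h>

/* Interleave the low 16 bits of x and y into a 32-bit Morton (Z-order) index.
   Z-order is the textbook address swizzle; actual GPU tiling modes are fancier. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    uint32_t z = 0;
    for (unsigned i = 0; i < 16; i++) {
        z |= ((x >> i) & 1u) << (2 * i);
        z |= ((y >> i) & 1u) << (2 * i + 1);
    }
    return z;
}

/* Copy a linear (row-major) texture into the swizzled/tiled layout.
   This is roughly the work that falls to the CPU when no DMA/copy engine
   can do the remap. Assumes width == height, both powers of two, 32-bit texels. */
static void tile_texture(const uint32_t *linear, uint32_t *tiled,
                         uint32_t width, uint32_t height)
{
    for (uint32_t y = 0; y < height; y++)
        for (uint32_t x = 0; x < width; x++)
            tiled[morton2d(x, y)] = linear[y * width + x];
}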
http://libsh.org/ref/online/onlinese12.html
On many GPUs swizzling is free or at least cheap, and smart use of swizzling can make your code more efficient. However, inappropriate use of swizzling can also make your code incredibly hard to read, and may make it hard for the compiler to optimize it.
http://www.opengl.org/wiki/GLSL_Optimizations
Swizzle masks are essentially free in hardware. Use them where possible.
http://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
While somewhat awkward in software, this kind of bit-interleaving is relatively easy and cheap to do in hardware since no logic is required (it does affect routing complexity, though).
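In software the same bit-interleave is usually done with a few shift-and-mask steps per coordinate; a rough sketch of that (mine, not the article's code):

#include <stdint.h>

/* Spread the low 16 bits of v into the even bit positions, e.g.
   ....abcd -> .a.b.c.d. The classic "magic number" expansion. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton/Z-order index: x bits land in the even positions, y bits in the odd ones.
   In hardware this is literally just wiring; in software it costs a handful of
   shifts, ORs and ANDs. */
static uint32_t morton2d_fast(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}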
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter35.html
Not only does this save on arithmetic in the vertex processor, but it saves interpolants as well. Further, it avoids the construction of vectors in the fragment program: swizzles are free (on NVIDIA GeForce FX and GeForce 6 Series GPUs), but the move instructions required to construct vectors one channel at a time are not. We saw this same issue in Section 35.1.1.
http://www.math.bas.bg/~nkirov/2008/NETB101/projects/Cg-HTMLs_and_Files-Referat-F40215/page4.html
Because the swizzle operator is implemented efficiently in the GPU hardware, its use is usually free.