Because they made a mistake in part of it. It is standard GCN, and that is obvious to anyone; that's why. It is not 3072 ops/clock, and we have been over this multiple times on the forum.
Also, you got it way wrong.
The L2 is 512 KB (shared between all the CUs); the L1 is 16 KB per CU.
As per this link.
http://www.vgleaks.com/durango-gpu-2/2/
As per the same link, under 'compute' you'll see that each SIMD executes an array of 16 threads. A "SIMD vector" could, albeit with bad terminology, be interpreted as a vector of SIMD units.
From vgleaks:
"Each shader core has a local 64-way L1 cache of 16 KB, composed of 256 64-byte cache lines. "
What did I get wrong?
64-way, 16 KB, 256 lines.
"L2 Cache
The GPU contains four separate 8-way L2 caches of 128 KB, each composed of 2048 64-byte cache lines."
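For what it's worth, the figures in those two quotes are self-consistent. Here is a quick Python sketch of the arithmetic. The helper name cache_geometry is my own, not something from vgleaks, and it assumes the usual set-associative relation (sets = lines / ways):

```python
def cache_geometry(lines, line_bytes, ways):
    """Return (size in KB, number of sets) for a set-associative cache.

    Assumes sets = lines / ways; this is an illustrative helper,
    not anything taken from the vgleaks article.
    """
    size_bytes = lines * line_bytes
    sets = lines // ways
    return size_bytes // 1024, sets


# L1 per shader core: 256 lines of 64 bytes, 64-way (figures from the quote)
l1_kb, l1_sets = cache_geometry(lines=256, line_bytes=64, ways=64)

# One L2 cache: 2048 lines of 64 bytes, 8-way; the GPU has four of them
l2_kb, l2_sets = cache_geometry(lines=2048, line_bytes=64, ways=8)
l2_total_kb = 4 * l2_kb

print(f"L1 per shader core: {l1_kb} KB in {l1_sets} sets")
print(f"L2 per cache: {l2_kb} KB in {l2_sets} sets, total {l2_total_kb} KB")
```

So the quoted line counts give 16 KB of L1 per shader core and 4 x 128 KB = 512 KB of L2 in total, in kilobytes rather than kilobits.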
From the GCN white paper:
"L1 instruction cache that is 4-way associative and backed by the L2 cache. Cache lines are 64B long and typically hold 8 instructions"
That instruction cache is shared across 4 CUs.