Because with rasterization, triangles/primitives are mapped to the pixel grid before pixel shading occurs. The ALUs (Arithmetic Logic Units) within the CUs (Compute Units) work on these portions of the pixel grid: each polygon is broken down further into pixel fragments, and the shading work is handed to the ALUs on a pixel-by-pixel basis.
To get the maximum level of parallelism and keep your ALUs doing useful work, you want triangles that span a large number of pixels, so that when you shade in batches of pixels sized to match the width of the ALUs, you're maximizing ALU utilization.
A crude example is as follows:
Let's say I have a 4x4 block of pixel fragments passed to my 16-wide SIMD unit within the CU.

If the polygon spans 14 of the 16 pixels, then 14 out of 16 pixels get shaded in a single clock cycle (87.5% utilization).

If the polygon is smaller and spans only 4 of the 16 pixels, only 4 pixels get shaded per clock cycle (25% utilization), so the efficiency is significantly lower.
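If you want the same arithmetic in code, here's a toy Python sketch. The 16-wide SIMD unit and the pixel counts are just the assumptions from the example above, not the specs of any particular GPU:

```python
SIMD_WIDTH = 16  # lanes per SIMD unit, matching the 4x4 pixel block above

def shading_utilization(covered_pixels: int, simd_width: int = SIMD_WIDTH) -> float:
    """Fraction of SIMD lanes doing useful work for one batch of pixels.

    The hardware issues the full batch every cycle regardless; lanes whose
    pixel falls outside the polygon are masked off and contribute nothing.
    """
    return covered_pixels / simd_width

print(f"Large triangle (14/16 pixels covered): {shading_utilization(14):.1%}")  # 87.5%
print(f"Small triangle ( 4/16 pixels covered): {shading_utilization(4):.1%}")   # 25.0%
```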
So in the above example scene with many tiny 4-pixel triangles, adding more CUs to my GPU barely helps real-world shading performance at all, because I'm only getting 25% utilization out of my ALUs. Increasing GPU clock speed, however, will help overall performance more: in a 30fps game with a 33ms frame-time budget, a higher clock means I can shade more of those 4-pixel polygons within the budget.
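To put rough numbers on the clock-speed side of that argument, here's a toy model in the same vein. All the figures (one batch per cycle per SIMD unit, the 1.8 and 2.2 GHz clocks) are made up for illustration, not taken from real hardware:

```python
FRAME_TIME_S = 1 / 30     # ~33 ms budget in a 30fps game
PIXELS_PER_TRIANGLE = 4   # tiny triangles from the example above

def tiny_triangles_per_frame(clock_hz: float) -> float:
    """Toy model: each tiny triangle occupies one SIMD batch for one cycle,
    so the cycle count in the frame budget bounds how many such triangles
    a single SIMD unit can shade per frame."""
    return clock_hz * FRAME_TIME_S  # one triangle (one batch) per cycle

for ghz in (1.8, 2.2):
    tris = tiny_triangles_per_frame(ghz * 1e9)
    print(f"{ghz:.1f} GHz: ~{tris:,.0f} tiny triangles "
          f"(~{tris * PIXELS_PER_TRIANGLE:,.0f} useful pixels) "
          f"per frame per SIMD unit")
```

The ratio between the two clocks (~22% here) is the whole point: the masked-off lanes stay wasted either way, but a faster clock pushes more batches through the same frame budget.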
The above is kind of a gross oversimplification, but it's just to give you the gist.