
First Cell demo (MPEG2 decoding)

Another thing that was discussed at B3D. That sort of stuff will in all likelihood also be done on the SPEs. Sure, AI routines are easier to write using heavy integer computation, but there is nothing that makes floating-point-based AI routines impossible (or so they said).

Thanks, I suppose I'll have to agree with them on that one; it just goes against instinct.
You don't happen to remember what the thread title was, do you?
If it is buried in another "R520 Infomania" thread, forget about it. :)
 
It was probably in one of the mammoth Cell threads, but I have no idea which one.

I just did a test with my 350 MHz P2. I ran three MPEG2 videos and one AVI video in multiple instances of BSPlayer. Here's what I concluded from the test:
Whatever BSPlayer is doing, it's incredibly impressive if all those videos are in fact 720x480 DVD VOB files. I don't think I can run four MPEG1 videos smoothly on my Cel633 using the regular media player.
 
Good job missing the point and turning what I said into something you wanted it to sound like.

Gofreak is much more on the spot with what he said. The things I said apply to almost every chip; I'm simply stating that the performance gain is not necessarily as much as people claim it is. I never disagreed that it's a step up. Did they even disclose that the PS3 will use an 8-SPE Cell? I never heard, so if that's true, enlighten me.

People always claim their stuff is easy to program, but the point is that the learning curve is different. You can't deny that whenever a new console comes out, it takes time for developers to reach its true potential. Sony already said they will use a variation of the OpenGL library, so who knows how similar they are. It's not the same thing as developing for the PC, which is what MS is doing with XNA. Sony said the same thing about the PS2, and I don't think it needs to be said what that's like now.
 
sonycowboy said:
So, that would mean a 5.3 GHz system could do 20 MPEG2s vs the 48 shown here? Of course, I would assume more efficient decoders have been written since then, but then again, we don't know how efficiently the 266 actually handled the MPEG2 decoding. That was a loooong time ago, and with progressive scan added to MPEG2 since then, we're still screwed as to what kind of processing power is needed.

I just did a test: I built a DShow filtergraph outputting an SD MPEG2 stream to the Null renderer (to take the video card out of the equation), and on this old 1.4 GHz P4 with a 400 MHz FSB (it's an engineering sample I got from Intel many years ago), it consumed about 20% of the CPU. That's with all the overhead of DirectShow/Kernel Streaming/etc.

So a modern 3.4 GHz machine could probably decode at least 12 MPEG2 streams simultaneously. A dual core could probably do at least 24. Also, I was using an older InterVideo MPEG2 codec, so I don't know if a better one has come out since.

You could probably squeeze out some more if you removed the OS overhead from the picture.

If I get the time, I'll run the test again on a better machine and see how it goes.
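
For what it's worth, here's a minimal back-of-the-envelope sketch of that scaling, assuming decode cost scales linearly with clock speed and core count (a rough assumption, since memory bandwidth and decoder efficiency also matter):

```cpp
#include <iostream>

int main() {
    // Measured above: one SD MPEG2 stream took ~20% of a 1.4 GHz P4
    // when decoded through DirectShow to the Null renderer.
    const double measured_clock_ghz = 1.4;
    const double cpu_share_per_stream = 0.20;

    // Streams one core can handle at the measured clock (~5).
    const double streams_at_measured = 1.0 / cpu_share_per_stream;

    // Naive linear scaling to a faster clock and a second core.
    const double target_clock_ghz = 3.4;
    const int cores = 2;
    const double streams_one_core  = streams_at_measured * (target_clock_ghz / measured_clock_ghz);
    const double streams_two_cores = streams_one_core * cores;

    std::cout << "One core @ 3.4 GHz:  ~" << streams_one_core  << " streams\n";  // ~12
    std::cout << "Two cores @ 3.4 GHz: ~" << streams_two_cores << " streams\n";  // ~24
    return 0;
}
```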
 
I have an Athlon 2 GHz (equivalent to a 2.8 GHz P4).
I could get 8 AVIs running before it started looking bad,

but the AVI I was playing is DivX, so I guess that's more
than SDTV resolution.
 
koam said:
Is cell going to ever hit the pc/mac market?

I'd rather see IBM and Sony challenge the Wintel monopoly with a line of Cell-based personal computers, myself. It'd be nice to see some truly new computers come to market as we move into the 21st century, ones that aren't saddled with decades of legacy hardware and software architecture.
 
Tellaerin said:
I'd rather see IBM and Sony challenge the Wintel monopoly with a line of Cell-based personal computers, myself. It'd be nice to see some truly new computers come to market as we move into the 21st century, ones that aren't saddled with decades of legacy hardware and software architecture.

And when that general-purpose Cell-based computer grinds to a halt because it can't do out-of-order operations, I'm sure you'll be smiling then.
 
seismologist said:
I have an Athlon 2 GHz (equivalent to a 2.8 GHz P4).
I could get 8 AVIs running before it started looking bad,

but the AVI I was playing is DivX, so I guess that's more
than SDTV resolution.

DivX consumes a lot more CPU power than MPEG2.
 
So to clarify, they're running 48 tiled MPEG2 streams at roughly 240 x 180 resolution per tile.

1920 x 1080 = 2,073,600 pixels

2,073,600 / 48 = 43,200 pixels per tile

43,200 = 240 x 180 per tile

240 / 180 ≈ 1.33, which equates to a 4:3 aspect ratio (the tiles appear to be in a 4:3 format)

Give or take a few pixels, does that estimate sound about right?
 
rastex said:
And when that general-purpose Cell-based computer grinds to a halt because it can't do out-of-order operations, I'm sure you'll be smiling then.

Yes, I'm sure such machines would 'grind to a halt' frequently and be practically unusable, much like the proposed Cell-based CGI workstations will. :lol
 
HokieJoe said:
So to clarify, they're running 48 tiled MPEG2 streams at roughly 240 x 180 resolution per tile.

1920 x 1080 = 2,073,600 pixels

2,073,600 / 48 = 43,200 pixels per tile

43,200 = 240 x 180 per tile

240 / 180 ≈ 1.33, which equates to a 4:3 aspect ratio (the tiles appear to be in a 4:3 format)

Give or take a few pixels, does that estimate sound about right?

No, they said the video is scaled down, so the native resolution is higher than that.
 
gofreak said:
You're thinking about things on way too high a level. Forget about concurrency between two different programs for a second, and think about concurrency within a program. There's a lot of concurrency in games waiting to be unlocked. Just because most games currently are single-threaded (and for good reason - hardware threading is a very recent thing, and multiple cores even more recent), doesn't mean they can't be concurrent going forward. They'll have to be, at least if they want to take advantage of the performance on offer...as Dr. Dobb's said, the free lunch is over as far as computing performance is concerned.

gofreak is a programmer, I can tell :)

Multithreaded games need to be the wave of the future. Clock speeds can only get so high before you start running into physical limitations.
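
As a rough illustration of the kind of in-game concurrency gofreak is describing, here's a minimal sketch; the subsystem functions are hypothetical stubs, and modern C++ threads are used purely for readability, not as a claim about any console's actual threading API:

```cpp
#include <thread>

// Hypothetical per-frame subsystem work (empty stubs for the sketch).
// A real engine would double-buffer shared state to avoid data races.
void update_physics(double /*dt*/) { /* integrate rigid bodies */ }
void update_ai(double /*dt*/)      { /* run behaviour logic    */ }
void update_audio(double /*dt*/)   { /* mix and stream audio   */ }
void build_render_commands()       { /* fill the command list  */ }

void run_frame(double dt) {
    // Kick off independent subsystems concurrently instead of serially.
    std::thread physics(update_physics, dt);
    std::thread ai(update_ai, dt);
    std::thread audio(update_audio, dt);

    // Render-command generation overlaps on the calling thread.
    build_render_commands();

    // Wait for every subsystem before presenting the frame.
    physics.join();
    ai.join();
    audio.join();
}

int main() {
    run_frame(1.0 / 60.0);  // one 60 Hz frame's worth of work
    return 0;
}
```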
 
Tellaerin said:
Yes, I'm sure such machines would 'grind to a halt' frequently and be practically unusable, much like the proposed Cell-based CGI workstations will. :lol
Did I say unusable? And a CGI workstation is VERY different from a general-purpose computer.
 
Only 6 SPEs were doing the decode, BTW, and one was sitting idle. Any word yet on what process the CPU and GPU will be on? I trust Cell should appear at 65nm, but what about the GPU? I thought it would be 65nm as well, but last I heard, it was more likely to be 90nm. PEACE.
 
rastex said:
Did I say unusable? And a CGI workstation is VERY different from a general-purpose computer.

No, you didn't say it, you just implied it rather strongly. :) And the real point to that post (which I probably should've stated explicitly, I'll admit) was that I don't find the idea of an inexpensive multimedia personal computer based on Cell to be all that implausible, and feel that any drawbacks inherent to the processor itself can be compensated for easily enough in the overall design. Games and multimedia apps are probably the most hardware-intensive uses of your average home PC, and it's fairly safe to say that these areas happen to be where Cell excels. Besides, the status quo's long overdue for a shakeup. :) Apple seems content to remain in the comfortable niche they've carved for themselves, so if change is going to happen at all at this point, it'd have to originate with someone besides the current players. (I'll also admit that I feel poor IBM's been marginalized in a space they helped create, and it'd do my heart good to see them claw their way back to center stage. Romanticism on my part, but hey.)
 
rastex said:
And when that general-purpose Cell-based computer grinds to a halt because it can't do out-of-order operations, I'm sure you'll be smiling then.

And your argument itself is so very valid, as the future of media and consumer electronics will see workloads bounded by how fast you can run MS Word and SPEC benchmarks. Because, let's face it, the future is in static processing and making sure I can open Thunderbird faster than I can physically perceive the difference.

It's been almost 10 years since Keith Diefendorff's paper on future processing requirements, and almost 5 years since STI started looking at processing needs. I happen to think they're right, and you're just the last in a long line of people who find excuses for why we need to keep throwing more and more transistors (for increasingly diminishing returns, to boot) at expanding x86 for tasks which are, for the vast majority of uses, already beyond a human's perceptual needs. OOO is a great example of throwing a huge number of transistors at a problem for returns that don't justify the area cost given the task loads the average consumer runs. And the ironic thing is that even Intel is moving in this direction, as their R&D roadmaps all show Cell-esque devices due around 2010 which use one or two large cores surrounded by reductionist|specialized cores reminiscent of the SPU.
 
seismologist said:
No, they said the video is scaled down, so the native resolution is higher than that.


So basically they're starting with roughly 48 picture streams at around 480 x 360 and down-rezzing them to roughly 240 x 180 each to fit on a 1920x1080 display device.

Also noteworthy is that the actual resolution they're down-rezzing from could be higher than 480x360, since Dish Network's MPEG2 DVB stream is 544x480.
 
Vince said:
And your argument itself is so very valid, as the future of media and consumer electronics will see workloads bounded by how fast you can run MS Word and SPEC benchmarks. Because, let's face it, the future is in static processing and making sure I can open Thunderbird faster than I can physically perceive the difference.
That's the future for limited-function devices like a media server, sure. It makes sense for a device only doing media loads & decodes to have a crippled primary core scheduling tasks and secondary vector cores running them. CELL taking over in these scenarios is the consumer electronics equivalent of outsourcing: Your job is simple and doesn't require brains, just a lot of easily repetitive tasks that go parallel nicely.

I don't believe that's the future for PCs (any kind of PC with direct user interaction: home PCs, workstations, etc).

PC users aren't just running one consistent process at any given point in time. Even PC users that just run one visible application at any given time are still running background services created by a large number of different developers.

Do you expect companies like Symantec to go to great lengths to ensure that their background virus scanner's memory access patterns aren't stalling the CPU or that their cache behavior is optimal for the current environment?

Do you seriously believe that an in-order CPU with strict limitations on latency and instruction dispatch can function well in an environment where the machine's owner expects to browse random websites and compose documents, all while their favorite IM client and mp3 player are running in the background?

CELL is a neat idea for the theoretical world where all developers care about going parallel in a generic way to maximize utilization and a single process has full control of the machine.

Vince said:
And the ironic thing is that even Intel is moving in this direction as their R&D roadmaps all show Cell-esque devices due around 2010 which use one or two large cores surrounded by reductionist|specialized cores reminescent of the SPU.
At least Intel has large cores which keep the PC user experience functioning as expected by customers.

The CELL approach starts by gutting the features from the primary core that enable fast-response general-purpose multi-tasking on today's PCs. Of course, the die space that used to enable this is replaced by special-purpose vector CPUs operating against a new instruction set which requires more education, doesn't work with any of the profiling tools developers are familiar with and places strict requirements on how software accesses hardware resources.

Having general-purpose C/C++ code perform poorly and requiring SPE use for effective utilization is not a recipe for knocking Intel and AMD off the PC map. You can't just make up for that with a high FLOPS rating reached in a lab environment using a pathological application.

That won't stop the analysts today, though. It would probably take a PC world example like Intel doing something monumentally stupid (e.g. crippling the P4 in favor of SSE functionality) to make them realize why CELL architecture is not a viable future for the desktop. No one would be dumb enough to think that a Ferrari would make an effective monster truck, and this is no different.
 
PizzaFarmer said:
That's the future for limited-function devices like a media server, sure. It makes sense for a device only doing media loads & decodes to have a crippled primary core scheduling tasks and secondary vector cores running them. CELL taking over in these scenarios is the consumer electronics equivalent of outsourcing: Your job is simple and doesn't require brains, just a lot of easily repetitive tasks that go parallel nicely.

I don't believe that's the future for PCs (any kind of PC with direct user interaction: home PCs, workstations, etc).

So, you're saying that if you were to do an abstract asymptotic analysis of future consumer electronics computational demands, you'd find that they're bounded by static processing tasks?

I don't even need to respond to this, it's that asinine. There is both direct and anecdotal evidence of this everywhere, from academic papers on the topic (from Diefendorff to Hofstee) to the shift in Intel's architectural strategy. Or look at the movement in OS design and the move in Longhorn and OS X towards centralizing|focusing on the utilization and manipulation of digital media.

Your comment about running all those background applications as justification is highly fallacious for several reasons. First off, running all those apps on current processors utilizes minimal percentages of the aggregate resources and scales linearly or sublinearly (e.g., they aren't a limit on performance); secondly, they aren't inherently incapable of being run efficiently on the BPA.

How many people are upgrading their Pentium 4 or even their PII because it doesn't browse the web fast enough or because they can't use MS Word well enough? Is there anyone who believes these apps are driving demand? Of course not; you're insane if you believe this, and the proof is in the talk of Longhorn becoming a demand driver and forcing turn-over. Why is this? Perhaps their focus on digital media and the demands of dynamic processing requirements? The GPU will pick up a lot of these demands, but Cell|BPA would be capable too, as we've seen.

At least Intel has large cores which keep the PC user experience functioning as expected by customers.

Intel is also 5 years behind; let's compare the 3rd-generation Cell processors being fabbed at 32nm and then see how they relate. The difference is that STI has had Cell ICs in developers' and DCC hands since 2004; Intel has a few PowerPoint presentations with cool-looking pictures next to the year 2010.

Do you seriously believe that an in-order CPU with strict limitations on latency and instruction dispatch can function well in an environment where the machine's owner expects to browse random websites and compose documents, all while their favorite IM client and mp3 player are running in the background?

Uh, yes. Let's just say you'll lose a net 50% performance by getting rid of OOOE when running your tasks, mostly as a result of L2 cache misses and the latency penalty to main RAM... If you miss the L2, all processors will take similar hits of upwards of a thousand cycles. I mean, the 1st-generation BPA's PPE (forgetting about SMT gains, etc.) is clocked at over 4GHz. You're saying it won't be able to run MS Word, AIM and your shitty virus scanner to the point that the average consumer will notice the difference? This being the same consumer who is likely running a Celeron. And from an architectural PoV, OoO is a net loss of efficiency compared with the area it uses - it's a losing proposition.

Having general-purpose C/C++ code perform poorly and requiring SPE use for effective utilization is not a recipe for knocking Intel and AMD off the PC map. You can't just make up for that with a high FLOPS rating reached in a lab environment using a pathological application.

Totally fallacious comment. Outside of stating that you program the SPUs in C/C++ and that the SPU architecture was designed by guys like Gschwind, who had the compilers in mind as they wrote most of them, I can only offer anecdotal evidence at this point. But JFYI, the recent Cool Chips demonstration by Toshiba, which showed off 6 SPUs decoding 48 MPEG2 streams concurrently and resizing them to fit in a composite 1920x1080 screen, was written without the programmer doing any explicit thread scheduling. Quite the nightmare development environment, huh?

____________
PS. And BTW, the Pentium4 (Northwood) wasn't crippled because of favoring SSE. It was due to the engineering going out of control in terms of transistor budget and what was becoming an IC that was so far gone in terms of diminishing returns per area cost that Intel's Dan Boggs capped the area at that of the PPro and cut the shit out of it.

And, actually, it's a perfect example of how x86 and the type of processor you're advocating just isn't feasible for the consumer marketplace (as proven by Intel's actions on the P4). They cut out the following:

They went from two full FPUs to one full FPU and one that was incapable of executing MMX, SSE or SSE2 - almost a two-fold reduction in FPU logic area and power consumption for a net performance loss of only ~5%.

They cut the L1 cache in half and lowered the load capacity to 1 per clock, cut the physical size of the trace cache and implemented a compression algorithm instead, and doubled the dense L2 but cut out the 1MB of L3 that was to be strapped to the IC. All disclosed at Micro-33.
 
Srider said:
The Cell will be good at doing what it does best; people like MS are obviously not going to put out something that isn't competitive. From a practicality viewpoint, the Xbox 2 will likely have an advantage over the PS3 in terms of development at launch, since it's a familiar API to the developers. It'd be interesting to see if the PS3 gets delayed due to not having enough games available at launch. We'll see at E3.
But the PS3 has OpenGL ES and Cg, which are familiar tools too, or more familiar to some than the Windows-only APIs...

Srider said:
Remember when the Emotion Engine came out and Sony was boasting about how it could do crazy calculations, comparing it to supercomputers? Look at the PS2 now. Most of these things are simply marketing to make it appealing to potential clients. I'll believe it when a game developer gets his/her hands on the Cell and puts out some good-looking stuff.
So what? This MPEG2 decoding demo shows the real-world performance of Cell, not just numbers on paper.
 
Very interesting, if not useless. Does anyone honestly think the same chip will be in the PS3 for less than $300?
 
Srider said:
I'm simply stating that the performance gain is not necessarily as much as people claim it is.

Invoking DP performance as a reason why performance gain "isn't as much as people claim" isn't very sound. Again, compare Cell's DP performance with that of a desktop CPU.

As for Cell in PCs, I don't see it happening in the short/medium term. Apple is probably the only short/medium-term avenue for getting Cell into desktop PCs, and even then it'd be as an add-on processor rather than a CPU replacement. Technicalities aside, there's a huge software mountain to overcome. Besides, STI has barely discussed its chances as a desktop replacement themselves... it's not a market they're targeting for now.

Srider said:
Remember when the Emotion Engine came out and Sony was boasting about how it could do crazy calculations, comparing it to supercomputers? Look at the PS2 now

No one here is comparing an 8-SPE Cell chip to a supercomputer. The closest STI has come to that is describing it as having "supercomputer-like" performance, which isn't quite the same thing (they characterise it that way because of the relatively massive floating-point performance rather than because of absolute numbers), and it's simply PR fluff anyway. Every company engages in that.

Ryudo said:
Very interesting, if not useless. Does anyone honestly think the same chip will be in the PS3 for less than $300?

Well, we don't know the clock speed of the chip Toshiba ran this demo on, so that makes this question harder to answer. From the viewpoint of SPE quantity, I don't think it'll be exactly the same, but close. From a cost perspective, we don't know how much it costs to manufacture this thing... the die size is roughly the same as the EE's was at first, so...
 
gofreak said:
Invoking DP performance as a reason why performance gain "isn't as much as people claim" isn't very sound. Again, compare Cell's DP performance with that of a desktop CPU.

As for Cell in PCs, I don't see it happening in the short/medium term. Apple is probably the only short/medium-term avenue for getting Cell into desktop PCs, and even then it'd be as an add-on processor rather than a CPU replacement. Technicalities aside, there's a huge software mountain to overcome. Besides, STI has barely discussed its chances as a desktop replacement themselves... it's not a market they're targeting for now.
"PS3 Linux Kit" is the way to go :)
 
gofreak said:
As for Cell in PCs, I don't see it happening in the short/medium term. Apple is probably the only short/medium-term avenue for getting Cell into desktop PCs, and even then it'd be as an add-on processor rather than a CPU replacement. Technicalities aside, there's a huge software mountain to overcome. Besides, STI has barely discussed its chances as a desktop replacement themselves... it's not a market they're targeting for now.

Absolutely, but there is a difference between what's feasible due to technology and what's feasible due to marketplace economics, of which I'm only addressing the former. But I'd agree, the Apple and Linux crowd are the obvious (and only) major inroads I can see that are feasible due to the marketplace.

EDIT: Beat by ThirdEye :)
 
Vince said:
So, you're saying that if you were to do an abstract asymptotic analysis of future consumer electronics computational demands, you'd find that they're bounded by static processing tasks?
Yes, static processing tasks performed in arbitrary order and with an arbitrary number of them running at any given time.

Vince said:
I don't even need to respond to this, it's that asinine.
Well, you did respond, because clearly you're above the rest of us discussing the application of common sense and user scenarios to theory and expectations.

Vince said:
First off, running all those apps on current processors utilized under 10% of the aggregate resources (e.g., they aren't a limit on performance); secondly, they aren't inherently incapable of being run efficiently on the BPA.
Mind sharing how much experience you have trying to run today's general-purpose code on an in-order processor with tight cache restrictions?

Your 10% data point proves nothing about switching architectures. It is conjecture that CELL shouldn't have problems handling that.

Vince said:
How many people are upgrading their Pentium 4 or even their P2 because it doesn't browse the web fast enough or because they can't use MS Word well enough? Is there anyone who believes these apps are driving demand? Of course not; you're insane if you believe this, and the proof is in the talk of Longhorn becoming a demand driver and forcing turn-over -- and why is this... perhaps their focus on digital media and the demands of dynamic processing requirements? The GPU will pick up a lot of these demands, but Cell|BPA is more than capable.
I used Word and web-surfing as a few popular examples of "modern general-purpose applications written in C/C++". More conjecture about CELL being powerful enough and about your media-processing-based future.

Then again, maybe all people will care about in five years is running Mathematica and watching ten DVDs at the same time, which would make you right and the rest of us bored out of our minds. That, and BeOS would rise from the dead to be the OS of choice for parallel mpeg-decode lovers everywhere ;)

Vince said:
Intel is also 5 years behind; let's compare the 3rd-generation Cell processors being fabbed at 32nm and then see how they relate. The difference is that STI has had Cell ICs in developers' and DCC hands since 2004; Intel has a few PowerPoint presentations with cool-looking pictures next to the year 2010.
In theory, this matters how? x86 has the userbase. x64 has a great migration story. Intel and AMD are supposed to be shaking in their boots because of a different architecture slanted towards running software scenarios people aren't willing to pay for yet?

Vince said:
Uh, yes. Let's just pull a random number and say you'll lose a net 50% performance by getting rid of OoO when running your tasks; the 1st-generation BPA's PPE (forgetting about SMT, etc.) is clocked at over 4GHz. You're saying it won't be able to run MS Word, AIM and your shitty virus scanner to the point that the average consumer will notice the difference? This being the same consumer who is likely running a Celeron. And from an architectural PoV, OoO is a net loss of efficiency compared with the area it uses - it's a losing proposition.
If only you could randomly pull a constant number for the perf loss. The problem is that you've introduced bottlenecks throughout the design that didn't exist before, with a potentially huge number of existing inputs to the system that affect the resulting numbers.

Vince said:
Totally fallacious comment. I can only offer anecdotal evidence at this point, but JFYI, the recent Cool Chips demonstration by Toshiba, which showed off 6 SPUs decoding 48 MPEG2 streams concurrently and resizing them to fit in a composite 1920x1080 screen, was written without the programmer doing any explicit thread scheduling. Quite the nightmare development environment, huh?
That's comedy gold right there.

I pointed out that CELL is slanted towards pathological lab scenarios and your Exhibit A is a pathological lab scenario. You sure proved me wrong!

And "duh" to your comment on thread scheduling. Do you really expect today's well-understood model of context-switching and thread-handling are going to work well in a CELL environment that places shackles on RAM access? I'm sure you remember the last section of the CELL GDC slides ...

Vince said:
PS. And BTW, the Pentium4 (Northwood) wasn't crippled because of favoring SSE. It was due to the engineering going out of control in terms of transistor budget and what was becoming an IC that was so far gone in terms of diminishing returns per area cost that Intel's Dan Boggs capped the area at that of the PPro and cut the shit out of it.
In your rush to show us how much you know about the history of the P4, you read my post too quickly. I didn't say the P4 was crippled for favoring SSE. I described a possible foolish path Intel could take by crippling the rest of the P4 in favor of more SSE functionality.

No offense - you're clearly a smart guy, but your obvious passion for PS3 and CELL seems to have driven you to use way too much frog DNA and duct tape to fill in the gaps supporting the 100% guaranteed greatness of CELL in every scenario. Everything has tradeoffs, even cool new technology from engineers we admire.
 
And when that general-purpose Cell-based computer grinds to a halt because it can't do out-of-order operations, I'm sure you'll be smiling then.
At least without OOE I don't have to worry about my FPU calculations being unpredictable. :P
 
PizzaFarmer said:
Yes, static processing tasks performed in arbitrary order and with an arbitrary number of them running at any given time.

And which task, exactly, will bring the PPEs in Cell or the XCPU to their knees? We seem to have fundamentally differing views on the future of programming. I see the future based around digital media and interactivity, replacing the static pages and UIs of today with dynamic ones. You obviously don't see this as the future, because if you did, the bounds on what limits performance would be more aligned with what I'm talking about, or with what was stated in the links I provided.

I used Word and web-surfing as a few popular examples of "modern general-purpose applications written in C/C++". More conjecture about CELL being powerful enough and about your media-processing-based future.

It's hardly conjecture; read the full papers in the links I posted for an academic analysis. I hardly consider it conjecture when there is a functional web browser on the PS2 using the R5900 core -- a 300MHz, in-order processor with 24KB of L1 cache and no L2. I mean, here's a pathological case where the lack of OOE should be downright destructive. Never mind that the PPE has SMT and is clocked at 4+GHz.

Then again, maybe all people will care about in five years is running Mathematica and watching ten DVDs at the same time, which would make you right and the rest of us bored out of our minds. That, and BeOS would rise from the dead to be the OS of choice for parallel mpeg-decode lovers everywhere ;)

Haha, well, I do believe the future will be in UIs that are interactive and more dynamic. I want to be able to scroll across my XMB interface (or whatever) and move across the actual HDTV program playing in little thumbnails on my 1080p TV, maybe record and watch at the same time... whatever. I can sit here all night and throw out ideas.

If only you could randomly pull a constant number for the perf loss. The problem is that you've introduced bottlenecks throughout the design that didn't exist before, with a potentially huge number of existing inputs to the system that affect the resulting numbers.

Actually, that's not a random number, and I edited it when I realized that, as I believe I took it from Marco. It's actually quite a good guess at the hit you'll take without OoO, and the other part about introducing bottlenecks is generalized critique that's nondescript enough that I can't comment, as it's all unknown to the rest of us.

I pointed out that CELL is slanted towards pathological lab scenarios and your Exhibit A is a pathological lab scenario. You sure proved me wrong!

If it's only a "pathological lab scenario" - implying that it's not a real-world example - then why does ATI have an entire line of ICs (Xilleon being the most prominent) whose primary function is nothing but HDTV and multi-stream SDTV decoding and rescaling? Albeit to a much lesser degree than the "pathological" Cell demo, which will manifest itself in quite real products that decode multiple HDTV streams. As I said before, I think we have different views of where the future of computing is...

In your rush to show us how much you know about the history of the P4, you read my post too quickly. I didn't say the P4 was crippled for favoring SSE. I described a possible foolish path Intel could take by crippling the rest of the P4 in favor of more SSE functionality.

And I stated that your foolish path (you said "e.g. crippling the P4 in favor of SSE functionality") isn't what happened or what they're planning on doing. But then I realized that the P4 story is somewhat analogous, as it showed that Intel realized how far into diminishing returns they were and cut out a huge part of their processor for a minimal performance differential, which is why I stated it. It's a prime example of what I'm talking about: almost half of the FPU area and power consumption was taken out for a 5% loss in performance...

And I never said that Cell was the best architecture or the only one; far from it, as I stated in subsequent posts that it won't move into the established x86 realm. But, like you said of x86 during its time, Cell is an architecture which fits its period and its unique requirements... and does it well.
 
Vince said:
And which task, exactly, will bring the PPEs in Cell or the XCPU to their knees? We seem to have fundamentally differing views on the future of programming. I see the future based around digital media and interactivity, replacing the static pages and UIs of today with dynamic ones. You obviously don't see this as the future, because if you did, the bounds on what limits performance would be more aligned with what I'm talking about, or with what was stated in the links I provided.
I actually don't think our views on the user scenarios differ that much.

I'm just more concerned about the near-term effects of pursuing in-order processors combined with current software trends (tons of processes running simultaneously, always-available services, etc). Also, there's no question about the value of a good vector unit; the question is rather how far the tilt should go towards lots of custom vector units replacing general-purpose processors.

Your editing frequency is a force to be reckoned with. I give thanks to the preview button :)

Vince said:
Outside of stating that you program the SPUs in C/C++ and that the SPU architecture was designed by guys like Gschwind, who had the compilers in mind as they wrote most of them
And here's an edit of my own now that I noticed you editing old posts ;)

In a sense, you can program the SPEs in C/C++, but intrinsics don't count. I'm talking about absolute general-purpose code being offloaded through the magic of copy/paste in a text editor. Have you seen anything from Sony that indicates otherwise?
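
To make the distinction concrete, here's a rough sketch of the same loop written as ordinary scalar C++ and as hand-unrolled four-wide code of the shape a 128-bit SIMD datapath favours. This is illustrative only; it uses no actual SPE intrinsics, it just shows why "copy/paste" general-purpose code and intrinsic-style code are different beasts:

```cpp
#include <cstddef>

// Ordinary general-purpose code: one element at a time, nothing about the
// target's vector hardware is expressed.
void scale_scalar(float* out, const float* in, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}

// The shape a 128-bit SIMD unit wants: four floats per step. On a vector
// ISA each group of four would become one vector multiply (via intrinsics
// or vector types), which is exactly the rewriting effort at issue here.
void scale_four_wide(float* out, const float* in, float k, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i + 0] = in[i + 0] * k;
        out[i + 1] = in[i + 1] * k;
        out[i + 2] = in[i + 2] * k;
        out[i + 3] = in[i + 3] * k;
    }
    for (; i < n; ++i)   // scalar tail for leftover elements
        out[i] = in[i] * k;
}

int main() {
    float in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    scale_scalar(out, in, 2.0f, 8);
    scale_four_wide(out, in, 2.0f, 8);
    return 0;
}
```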
 
In a sense, you can program the SPEs in C/C++, but intrinsics don't count.
That's how GCC works with the SIMD aspects of the SPEs - that doesn't mean every other compiler does or will do the same. Some compilers DO come with native support for SIMD types (not talking specifically about Cell here), you know.

I'm talking about absolute general-purpose code being offloaded through the magic of copy/paste in a text editor.
There's nothing stopping SPEs from running general-purpose code. Whether that's an efficient use of the processor is up for debate, but then most Windows software we run today doesn't utilize hardware in the most efficient ways either.
 
Fafalada said:
That's how GCC works with the SIMD aspects of the SPEs - that doesn't mean every other compiler does or will do the same. Some compilers DO come with native support for SIMD types (not talking specifically about Cell here), you know.
Definitely a fair point considering that there will probably be other compilers available for PS3 that expose the SPEs in different ways. Just amusing to see Sony trumpeting around the "higher-level languages this time around!" banner despite the actual implementation described so far.

Fafalada said:
There's nothing stopping SPEs from running general-purpose code. Whether that's an efficient use of the processor is up for debate, but then most Windows software we run today doesn't utilize hardware in the most efficient ways either.
How good is integer/branch support on the SPEs? I haven't seen much coverage of that besides commentary on the lack thereof...

And of course, there's a difference between general-purpose PC code utilizing an x86 CPU poorly because the code wasn't designed for it, and general-purpose code trying to run well on a CPU designed around vector operations.
 
Sorry for interrupting your discussion, but judging by the lack of "ooohs" and "aaahs" in this thread, I'm sensing that the demo isn't that impressive. Or is it? Hypothetically, what processor available today could carry out the same task with ease?
 
sly said:
Sorry for interrupting your discussion, but judging by the lack of "ooohs" and "aaahs" in this thread, I'm sensing that the demo isn't that impressive. Or is it? Hypothetically, what processor available today could carry out the same task with ease?

I think it's the lack of comparable data on other processors that is holding back the "OMG Cell am win" proclamations. Though the most optimistic estimate we've had on the PC side thus far in this thread is that a high-end dual-core could maybe handle 24 streams simultaneously (though it's debatable whether things would scale up linearly from aaaaa0's data), so I suppose you can draw your own conclusions...
 
I do not see OOOe alone doing much to make an environment with several applications running at the same time more responsive: multi-threading yes, OOOe alone no.

I see the direction Intel is taking with IPF as Intel's future: it is likely that MT and multi-core are what is in store for IPF in both the short term and the long term of Intel's strategy.

Raising the frequency, going to a single banger-core, and implementing SMT in the core while adding more cores per die strikes me as a potentially winning approach.

Intel's dedication to Itanium 2, and the performance obtained with it, made me think again about what in-order cores can do, especially when they are helped by a well-thought-out ISA and clever compilers. Their x86 emulation software (IA-32 EL) also seems quite good, and in the future it could ease the transition to the desktop space if Intel decides to push for it.

We have yet to see what the ex EV7 and EV8 guys can do with the next generation Itanium core, so my hopes are quite high for this architecture.

Cost-wise? Tightly surrounding the cores with lots of custom SRAM enabled Intel to manufacture 400+ mm^2 chips at less than $140 per chip, thanks to the much lower leakage problems with SRAM blocks and to the great deal of redundancy you can quite easily achieve that way (the Itanium 2 revision in question is, I think, the 6 MB L3 cache Madison, but I might be wrong as I am going completely off memory). This allowed them to relax issues related to in-order scheduling, data dependencies and the load-use latency penalties that stall execution units.

They brought memory closer, which is also what CELL has tried to do.

About the PPE: I think they realize well what its role is, and I do not think it is an accident that it basically doubled in size going from ISSCC's DD1 revision to Cool Chips' DD2 revision, as evidenced by comparing the two chips' die micrographs.

The Broadband Engine is a chip based on the CELL/BPA architecture; it is only one implementation. The CELL architecture places no limits on the PPE core: we could have OOOe if an implementation required it (maintaining ISA compatibility with the Broadband Engine), we could have a 3-4 issue core with HW support for more threads, we could have even more advanced branch prediction, etc...

Also, we are not limited to the PPE:SPE ratio present in the Broadband Engine: a chip with 2 PPEs and 4 SPEs per PPE (or more or fewer... it depends on the manufacturing process, the final clock speed and your power and chip-area budgets) would be possible. If you have an efficient switch design, you can increase the number of PPE blocks with SPEs attached to each PPE block (each Processor Element, if we can still call it that): 4 PPEs with 2-4 SPEs attached to each of them and a switch connecting the two pairs of Processor Elements (1 PPE + x SPEs = 1 PE).

If yours is a general distaste for statically scheduled processors (although neither the PPE nor the Itanium chips are really statically scheduled; if we want to be nit-picky, branch prediction and load-store re-ordering stretch the definition of an in-order core a little bit IMHO ;)), I invite you to look at the growing importance of cores based on ARM chips, Renesas/Hitachi's SH-x series, Intel's Itanium technology, Fujitsu's SPARC64 cores, etc... they are not all at the level of Sun's UltraSPARC chips :).
 
Definitely a fair point considering that there will probably be other compilers available for PS3 that expose the SPEs in different ways. Just amusing to see Sony trumpeting around the "higher-level languages this time around!" banner despite the actual implementation described so far.
We'll see as they reveal more. Anyway, GCC as a starting solution obviously isn't the top choice, but I gather Sony chose it over paying royalties for IBM's solutions :P
And GCC will improve over time, as always...

How good is integer/branch support on the SPEs? I haven't seen much coverage of that besides commentary on the lack thereof...
It's pretty complete from what I've heard so far. Admittedly, I am hesitant to call any SoA SIMD solution "complete", but that's an FPU-specific complaint.
Integer SIMD has never tried touching AoS much in the past; I guess that could also be part of the answer to why the SPE ISA was designed as it is...

And of course, there's a difference between general-purpose PC code utilizing an x86 CPU poorly because the code wasn't designed for it, and general-purpose code trying to run well on a CPU designed around vector operations.
Well, the analogy is running code not particularly designed for the target CPU ;)
But my point was that for desktop system use, efficiency isn't that big of a deal. Maybe SPE thread performance wouldn't be too great, but if a bunch of tasks are running on those 8 extra hardware threads, that'll still speed things up overall nicely :P
 
In theory, this matters how? x86 has the userbase. x64 has a great migration story. Intel and AMD are supposed to be shaking in their boots because of a different architecture slanted towards running software scenarios people aren't willing to pay for yet?
I see what you mean, but games are not what I'd call 'software people are not willing to pay for', and assuming that Cell indeed performs really well at that task, at a lower price, that alone justifies the existence of the architecture IMO. Not just games: things like multiple realtime thumbnail previews of HDTV streams seem plain handy for future TVs. I agree that Intel has nothing to worry about anytime soon, but there is obviously something to gaming with multicore CPUs, considering that even MS abandoned Intel as its Xbox supplier in favor of another multicore chip from IBM.
 
gofreak said:
I thought I saw at one point a Longhorn demo that showed multiple movie streams on a desktop, all rotated and skewed in different ways (just to show off the new GUI)... it had more than 2 on screen, though not 48 ;) Anyone else remember that? I can't remember if they were all the same stream, just replicated, or not though...

I guess I'm thinking of the Tiger demo, but they had five or six or so MPEG4 video streams resizing and playing on screen. But also, that was (presumably) video-card-assisted.
 