
Crytek Interview - quite a lot on the PS3/X360 (excellent read)

gofreak

GAF's Bob Woodward
Spotted by Nemo80 on B3D..

A German game magazine has posted a PDF of an interview with Cevat Yerli, CEO of Crytek. It's quite an interesting read, and goes into more technical detail about the next-gen systems, and working with them, than most interviews of this type.

It's in German, and available here:

http://www.gamestar.de/dev/pdfs/crytek.pdf

Here's the babelfish, it's fairly readable but if anyone wants to take a crack at a "proper" translation, please do! I've bolded some parts of particular interest (well to me anyway).

You're working on your next game and on the CryEngine 2 at the same time...

... did we say something about that?

You did *g*. So where is the technical journey heading?

Broadly, the journey is heading towards a streaming architecture, cross-platform, multi-threading - so multi-core and multi-CPU - and at least one shader will run over every pixel.

Are you staying with SM3.0 or jumping directly to SM4.0?

We support SM2.0 and upward.

And upward?

And upward.

The new PC processors and the upcoming consoles rely heavily on multi-threading. How do you use that potential?

We scale the individual modules such as animation, physics and parts of the graphics with the CPU, depending on how many threads the hardware offers. We support multi-CPU systems as well as multi-threading and multi-core. With three CPUs with two hardware threads each (dual-core CPUs), it may be that we scale to six threads. We may also do without that, depending on how fast the individual cores or CPU threads run. But we are developing a system that analyzes how much threading power is available and then scales the engine accordingly.

x86, PowerPC and PowerPC plus Cell: all architectures have their own threading organization.

The 360 resembles Hyper-Threading. In principle there are three CPUs with two hyper-threads each. If you ask the hardware manufacturers, of course it's not like that. But if you analyze it as a software developer, it's nothing other than Hyper-Threading. That is, you have six threads, but in practice only three times 1.5 threads. On the PlayStation 3 with Cell it looks different: the main CPU has two threads (somewhat better than Hyper-Threading), and on top of that come seven synergistic processors. The eighth SPU present in the design was omitted.

Because of yields?

A pure manufacturing matter. The SPUs are not as flexible as a conventional CPU, so we scale differently there.

Hardly any code can be transferred wholesale between the individual architectures. At the least, low-level calls have to be modified between x86 and PowerPC - but how portable is code between the two PowerPC-based consoles?

Also not really portable. We are, I believe, the only German company with PS3 devkits. So we can look at the hardware ourselves and experiment instead of only speculating. The PS3 is a system that needs special adjustments tailored to the PS3 architecture - a simple port doesn't work.

So it's not the case, as Sony claimed, that you have a nice layer, you throw your code at it, and everything works wonderfully on Cell?

They wish. But that's still a long way off. The devkits aren't that far along yet either. Based on the information provided with the devkits, you have to work at quite a low level to get anything out of the hardware.

On the topic of inter-process communication: how relevant are the differences between the multi-threading architectures?

A very relevant question. The following things matter for multi-threading on the hardware side: do the threads run on genuine cores (does each have its own register set)? Or is there a hardware abstraction, as with the PowerPC, where the threads - here two - have genuinely separate register sets but still sit on the same core, so that when instructions are issued to units that exist only once, both threads cannot work at the same time. Multi-threading with Hyper-Threading only tries to keep distributing instructions across the different superscalar units (math operations on the integer and floating-point units, load/store etc.).

With several cores there is also the question of the bus connection: do they share the same bus to the peripherals and to main memory? How are the caches implemented - do all threads share the same cache? And: shared memory vs. independent local memory. A complete PowerPC core is also part of the Cell system. It is not only an independent processing unit, though, but also the host for the individual Cell cores. In the Cell architecture, besides the PowerPC core, there are individual Cell cores connected by an ultra-fast bus, and they can communicate with one another independently. If you exploit the parallelism optimally, this parallel architecture and the super-fast bus can, at full utilization, achieve linear scaling with the number of Cell cores.

How do multi-threaded systems behave on single-core PCs?

As developers we are in the dumbest situation we could be in. We have to support 32 and 64 bit, single and multi-CPU, single and multi-thread, Cell and non-Cell, as well as OpenGL and DirectX. The effort of developing a technology that uses all these parameters optimally is extremely high. The technical effort is at least twice as high as with the first CryEngine.

The step from 32 to 64 bit was surely simpler, programming-wise, than from single- to multi-threading.

Unfortunately it isn't a step. We can't take the step yet - we have to support both.

Are there performance problems when the multi-threaded CryEngine 2 runs on a single-core PC?

The code can run sequentially. You lose some efficiency that way, but what you gain through further optimizations is greater. So the price of losing some framerate when the code runs on a single-thread CPU is so marginal that you can win it back through other code improvements. Most PC games aren't properly optimized anyway, including Far Cry.

In the console segment, developers often get amazing results out of relatively modest hardware, because they have to. If the same were true on the PC, you'd probably only need a GeForce 4 Ti for Doom 3.

That is the point. Hardware development is so rapid that you only get a short time to get the most out of it. It's exactly the same with CPUs. If you put a multi-threaded renderer on a CPU, you simply have to take the time to optimize. The biggest problem there is cache misses. Additionally, you should avoid global memory shared between the individual threads. Simply put: if we both access the same pot, the pot must not change. If I access an element just before you do, you no longer get it, or it's no longer what you expected. To get around that, ideally you change something in one step, then pass the result on to the open memory and release it for the other CPUs (unlocking).

How do I keep physics consistent across six threads?

You can solve that not only technically, but also creatively. You have to take new paths. The basic question is: how can we scale our game from a single thread to eight threads qualitatively? Either in such a way that the gameplay stays the same but the game looks qualitatively better, or in such a way that I can play it better. On the PC you have to restrict your choices to cosmetic improvements, since not everyone has the power, and there would otherwise be serious differences in gameplay. On the technically fixed consoles, you could also allow better gameplay with the appropriate optimizations. Accordingly, as a multi-platform developer you have to examine two scaling axes: FX and gameplay. On the PC, often only FX is optimized (higher texture resolution etc.). We would like to scale both FX and gameplay - through the intensity of the logic or through the quality of the shaders, for example.

How about portability between x86 PC and Xbox 360?

The architecture is different in principle, but still a lot more similar than Xbox 360 to PS3. The CPUs of the PC, Xbox 360 and PS3 really have only one relevant similarity - and that's multi-threading. As a generic CPU the 360 processor is the most efficient, but if you add the seven SPUs of the PS3, the performance picture looks different again. Before we had the PS3 devkits, we thought that PS3 and Xbox 360 would be closer to each other than PC and console - development-wise. Doesn't seem to be so, though *laughs*

So you're optimizing your engine for the SPUs of the PS3?

Definitely. That's a must for us, since we want to use the power of the PS3 completely. Accordingly, the PS3 gets almost a completely separate engine architecture, a kind of sub-architecture within the CryEngine 2.


Development of the console title Far Cry Instincts was still handed to an outside studio. Is it being ported this time?

Ähem ...

... the graphics interfaces do cause additional effort, though?

Yes - because of OpenGL ES on the PS3 we have to rewrite the entire renderer. Strictly speaking, the CryEngine 2 will have a special solution for every special problem. The technology is optimized very specifically - otherwise you can't use the power perfectly. Of course you can abstract it so that it runs on all systems, but then the strongest platform loses the most.

A couple of things stuck out for me:

- He characterises the X360 dual-threading as being more like "1.5" threads (not a massively controversial idea, the threads are sharing hardware after all), but timely to mention I suppose given how people just compare thread counts on one system vs another as if they were the exact same thing.

- Threading on the PPE in Cell is different from the XeCore (and seemingly better) - this surprises me, I didn't think there was much difference at all between the two

- They have PS3 kits and are making an "engine within an engine" so to speak, just for PS3.

- The gap between PS3 and X360 from a porting point of view is bigger than PC to X360.

- His points on the factors affecting multithreading and parallelism seem in particular to highlight challenges on X360..

- They want to use physics to scale both gameplay and presentation (relevant to recent discussion i guess).
 
segasonic said:
paypal me 10 Dollars and I'll translate it for you :D

Babelfish has done half the work for you already :P

It's easy enough to get the gist of what's being said, but a natural translation would be better (and so so appreciated :)).
 
So it's not the case, as Sony claimed, that you have a nice layer, you throw your code at it, and everything works wonderfully on Cell?

They wish. But that's still a long way off. The devkits aren't that far along yet either. Based on the information provided with the devkits, you have to work at quite a low level to get anything out of the hardware.

Crappy documentation again Sony? Shame on you!
 
Is any upcoming game using the Crytek engine? I am wondering, as this engine has so much potential and every game seems to use either id's or Epic's engines. Or has EA forbidden licensing?
 
Kangu said:
Crappy documentation again Sony? Shame on you!

That's not what he's saying, I don't think. I think he's saying the documentation suggests that you want to work at a relatively low level to get the best out of the processor - which is unsurprising.
 
"Threading on the PPE in Cell is different from the XeCore (and seemingly better) - this surprises me, I didn't think there was much difference at all between the two"

I didn't really read it like that. He was saying that the PPE has two threads, which is the same as the XeCPU on one core. Regarding the 1.5 threads, I would believe that the efficiency is more like 1.5 threads instead of 2 threads because of cache sharing etc., and the lack of certain cache locks etc. I am speculating here, but I do not believe there are any fundamental differences between the PPE and ONE XeCPU core if we are comparing those two.
 
Shompola said:
"Threading on the PPE in Cell is different from the XeCore (and seemingly better) - this surprises me, I didn't think there was much difference at all between the two"

I didn't really read it like that. He was saying that the PPE has two threads, which is the same as the XeCPU on one core. Regarding the 1.5 threads, I would believe that the efficiency is more like 1.5 threads instead of 2 threads because of cache sharing etc., and the lack of certain cache locks etc. I am speculating here, but I do not believe there are any fundamental differences between the PPE and ONE XeCPU core if we are comparing those two.


"That is, one has six Threads, actually however only three times 1.5 Threads. On the PlayStation 3 it looks differently with Cell: The head CPU has two Threads (somewhat better than Hyperthreading)"

The "head CPU" is the PPE.

The "dual threads" behaving like 1.5 threads has more to do with that they're simply sharing one core than anything else. It's about the performance gain saved from "wasting" execution cycles when one thread would block.

I don't know how the Cell PPE is different/better, but judging from his comments it is.
 
gofreak said:
"That is, one has six Threads, actually however only three times 1.5 Threads. On the PlayStation 3 it looks differently with Cell: The head CPU has two Threads (somewhat better than Hyperthreading)"

The "head CPU" is the PPE.

The "dual threads" behaving like 1.5 threads has more to do with that they're simply sharing one core than anything else. It's about the performance gain saved from "wasting" execution cycles when one thread would block.

I don't know how the Cell PPE is different/better, but judging from his comments it is.

Sorry, for whatever reason I thought you were commenting on the PPE and not CELL itself.
 
Shompola said:
"Threading on the PPE in Cell is different from the XeCore (and seemingly better) - this surprises me, I didn't think there was much difference at all between the two"

I didn't really read it like that. He was saying that the PPE has two threads, which is the same as the XeCPU on one core. Regarding the 1.5 threads, I would believe that the efficiency is more like 1.5 threads instead of 2 threads because of cache sharing etc., and the lack of certain cache locks etc. I am speculating here, but I do not believe there are any fundamental differences between the PPE and ONE XeCPU core if we are comparing those two.

I remember reading Deano's GDC Europe presentation, where Deano stated that one should think of the two issue core for the CELL running at 3.2GHz, as being two cores running at half the speed (1.6Ghz) that share cache. Basically, that the hardware allows two threads, but you don't magically get twice the speed. I would assume the same is true of the three core XCPU.
 
Shompola said:
Sorry, for whatever reason I thought you were commenting on the PPE and not CELL itself.

Hehe, I'm a little confused now, but to clarify I was talking about the PPE in Cell.

I assumed up until now that it was virtually identical to the XeCore, but not so sure now, at least regarding this aspect.
 
gofreak said:
Hehe, I'm a little confused now, but to clarify I was talking about the PPE in Cell.

I assumed up until now that it was virtually identical to the XeCore, but not so sure now, at least regarding this aspect.

It can be a software issue, in the OS and the scheduler there. Longshot yah but who knows why they are different.
 
gofreak said:
That's not what he's saying, I don't think. I think he's saying the documentation suggests that you want to work at a relatively low level to get the best out of the processor - which is unsurprising.

Well, he actually says that based on the documentation "to get anything out of the hardware" you have to work "a lot" on low level. "A lot" can mean many things I suppose
 
hadareud said:
Well, he actually says that based on the documentation "to get anything out of the hardware" you have to work "a lot" on low level. "A lot" can mean many things I suppose

Thanks, I guess my point just was it wasn't a reflection on the documentation.

Anyone got a good translation yet? ;)

Some bits are still confusing, like how they jumped from talking about Far Cry Instincts to OpenGL on the PS3 :P
 
gofreak said:
Thanks, I guess my point just was it wasn't a reflection on the documentation.

Anyone got a good translation yet? ;)

Some bits are still confusing, like how they jumped from talking about Far Cry Instincts to OpenGL on the PS3 :P

My German is a bit rusty but I think that the "Ähem ..." was a rough translation of "We don't have anything to do with that piece of shit port, next question".

(and no, I don't really know German :))
 
sonycowboy said:
I remember reading Deano's GDC Europe presentation, where Deano stated that one should think of the two issue core for the CELL running at 3.2GHz, as being two cores running at half the speed (1.6Ghz) that share cache. Basically, that the hardware allows two threads, but you don't magically get twice the speed. I would assume the same is true of the three core XCPU.
Based on what I know of the Cell, this is, essentially, exactly correct. However, based on what I know of similar architecture to the Xe360 CPU, it isn't the same for the 360.

The Cell is, hands down in my opinion, a much better design for gaming. The design of the PPE and SPUs isn't that they work in conjunction (a better phrase would be that they aren't in any way reliant upon one another), but rather that they work independently from each other, churning away at specific code that they each excel at. This is why I'm so impressed with the Cell over the XCPU. The PPE blows through code that doesn't need a 3.2GHz processor, such as general game code, linear AI and the like. Sony states the PPE runs at 3.2GHz simply for PR reasons; it doesn't technically. Unlike one of the XCPU cores, however, the two threads on the PPE can run concurrently (eating up separate code at the same time, during the same cycle), both running at speeds that add up to 3.2GHz (it doesn't have to be a 50/50 even split, aka 1.6GHz; it can be 2.5GHz and 700MHz, etc.). It all depends on what each dev wants to run, and when. The SPUs of course are left to churn through physics, non-linear AI, vertex data, etc., but we should all understand how cool the SPUs are by now.

The XCPU on the other hand does indeed have 2 threads per core, but they can't run concurrently. The second thread can pick up any available cycles, but it can't run separate code at the same time. I'm not completely versed in the XCPU though, so that's really all I'll add about that.

My point is simply that the PPE is quite a different design than the XCPU cores, and is one that I (and as we see here, Crytek do as well) think is a superior architecture when you examine the needs of a gaming engine.

Sorry for the long post.
 
gofreak said:
Anyone got a good translation yet? ;)
Quickly did the beginning, I'm trying to do the whole thing a bit later ...

/GameStar/dev: So where's the technical journey leading?

Cevat Yerli: In general it's leading to multi-threading, so multi-core and multi-cpu, a streaming architecture, crossplatform, and for each pixel there will be at least one shader.

/GameStar/dev: Are you staying with SM3.0 or are you jumping to SM4.0 straight away?

Cevat Yerli: We're gonna support SM2.0 and above

/GameStar/dev: And above?

Cevat Yerli: And above

/GameStar/dev: The new PC processors and the upcoming consoles are heavily applying multithreading. How are you gonna utilize the potential?

Cevat Yerli: We're scaling the individual modules, like animation, physics and parts of the graphics with the cpu, depending how many threads the hardware has to offer. We're going to support both multi-cpu systems and multithreading and multicore. With 3 cpu's with 2 hardware threads each (dual core cpu's) it's possible that we are going to scale for 6 threads. Maybe we're not gonna do it though, depending how fast the individual cores or the cpu-threads are running respectively. We're developing a system that's analyzing how much threading power is available and we are going to scale accordingly.

/GameStar/dev: x86, PowerPC and PowerPc + Cell. All architectures have their own threading organisation ...

Cevat Yerli: The 360 solution resembles Hyperthreading. In principle it's 3 cpu's with 2 Hyperthreads each. If you're asking the hardware manufacturers, that's not the case apparently. But analyzing it from a software developer's standpoint it's no different from hyperthreading. That means that you're supposed to have 6 threads, but it's only 1.5 threads by 3 in reality. With PS3's cell things are looking differently: the main cpu has 2 threads (slightly better than hyperthreading) and then you're getting the seven synergetic processors. The 8th spu was cut.

/GameStar/dev: Yields?

Cevat Yerli: A pure manufacturing reason. The spu's are not as flexible as your conventional cpu, that's why we have to scale differently.
 
hadareud said:
Quickly did the beginning, I'm trying to do the whole thing a bit later ...

*snip*

Cheers! I appreciate this must take some time, don't worry if you don't get round to it :)

On a different note, I guess this interview tells us what another developer is using more CPU power for next-gen. "animation, physics and parts of the graphics". Sounds familiar (and right up Cell's alley).

Heian-kyo said:
Based on what I know of the Cell, this is, essentially, exactly correct. However, based on what I know of similar architecture to the Xe360 CPU, it isn't the same for the 360.

The Cell is, hands down in my opinion, a much better design for gaming. The design of the PPE and SPUs isn't that they work in conjunction (a better phrase would be that they aren't in any way reliant upon one another), but rather that they work independently from each other, churning away at specific code that they each excel at. This is why I'm so impressed with the Cell over the XCPU. The PPE blows through code that doesn't need a 3.2GHz processor, such as general game code, linear AI and the like. Sony states the PPE runs at 3.2GHz simply for PR reasons; it doesn't technically. Unlike one of the XCPU cores, however, the two threads on the PPE can run concurrently (eating up separate code at the same time, during the same cycle), both running at speeds that add up to 3.2GHz (it doesn't have to be a 50/50 even split, aka 1.6GHz; it can be 2.5GHz and 700MHz, etc.). It all depends on what each dev wants to run, and when. The SPUs of course are left to churn through physics, non-linear AI, vertex data, etc., but we should all understand how cool the SPUs are by now.

The XCPU on the other hand does indeed have 2 threads per core, but they can't run concurrently. The second thread can pick up any available cycles, but it can't run separate code at the same time. I'm not completely versed in the XCPU though, so that's really all I'll add about that.

My point is simply that the PPE is quite a different design than the XCPU cores, and is one that I (and as we see here, Crytek do as well) think is a superior architecture when you examine the needs of a gaming engine.

Sorry for the long post.

There's been some speculation about this, particularly recently. Although the idea of an arbitrary clock split is new. Is this your own speculation, or more than that..?
 
gofreak said:
There's been some speculation about this, particularly recently. Although the idea of an arbitrary clock split is new. Is this your own speculation, or more than that..?
Sorry, I should have been more specific; it is my own speculation, mainly due to what I've seen of the Cell documentation, and how STI alludes to various ways of maximizing PPE usage whilst not overloading it with code the SPUs can blow through much more efficiently. A variable clock split is, from what I know, pretty difficult to implement, but it would certainly be easier to do in a locked-down console versus a PC (wherein implementing such a feature would be kinda pointless).

Nevertheless, it's very possible that it won't be variable, and the two threads will indeed be locked (likely 1.6 and 1.6), but it would be really cool if STI implements this, simply for the fact that the Cell is such a parallel computing beast, and this would be one more option to devs. In reality though, a variable clock speed isn't that big of an advantage over defined limits. It'd just be cool. :D
 
It's nice that the Cell can do multiple things with its SPU/SPEs, but what if there isn't much to do? What if it were more efficient to split ONE task into little ones? Is the SPEs' nature going to make it hard to share information between them? I think that's what a lot of programmers are complaining about: memory management on the chip is lacking at the transistor level, which means it's up to the programmer. Not many people like messing with memory at that low a level... boy can it be a bitch, but I have confidence that conventions and solutions will pop up. The sheer volume of software coming to the PS3 ensures this anyway.

Anyway, all I see with the new consoles is potential at the cost of housekeeping ease. The question is... which method will be more successful in the long run: a bunch of fully functional cores like a dual-core AMD64 *which will go to quad, sixteen etc. cores*, or the Cell/360 method of stripping down logic circuits and cramming in as much potential as possible with added complexity?
 
DonasaurusRex said:
It's nice that the Cell can do multiple things with its SPU/SPEs, but what if there isn't much to do? What if it were more efficient to split ONE task into little ones? Is the SPEs' nature going to make it hard to share information between them?

There's a very fast ring bus connecting all the SPEs together. They're designed to share data with one another, and cooperate if the dev wants. You can have them work together on shared data (mutual exclusion required obviously), or you can set them up as a pipeline, one feeding the other or whatever. Splitting one task up into smaller multiple tasks to assign to SPUs is actually ideal for Cell.
 
Shompola said:
It can be a software issue, in the OS and the scheduler there. Longshot yah but who knows why they are different.

Likely. Given the huge size of the DD2 revision of the PPE, the PPE probably has more duplicated resources, allowing each thread to step on the other's feet much less and thus avoiding more stalls.

If you wonder why many CPU vendors advertise HT/SMT and then say it only adds ~10-15% to the CPU die area (to add another hardware thread), remember that much logic, and thus many resources, are simply shared between threads, duplicating as few things as they can get away with.

An HT/SMT-enabled core with two threads but a small number of execution units to feed (a not very wide execution core) will likely still provide some form of speed-up, but having two separate cores, each with its own caches, its own memory interface to main RAM, etc., would be better.

The more resources you duplicate, the better performance will be (to a certain degree, let's not extrapolate too much from this), but then you get closer and closer to a dual-core solution in terms of die area needed (in some cases that is preferable: both Intel and AMD quickly jumped to the dual-core scenario, with AMD never jumping on the HT/SMT bandwagon to begin with).

In a way, if the PPE is running two threads at any given time (50/50), then you could see it as a dual 1.6 GHz core (though I do not like this idea much: each would be an awfully designed 1.6 GHz core... uber-long pipeline and awful branch misprediction and cache latency penalties), or as two 3.2 GHz cores with one iALU each and shared FPU and VMX units as well as shared cache...

Probably the XeCPU cores have more stalls if two threads are active at the same time, both trying to do work: there might be fetching restrictions, etc. I heard similar things about when a new thread can fetch instructions and how the two can cooperate... the weird thing is, I read them about CELL's PPE a long while ago.

From that time the PPE almost doubled in size, though, from the DD1 revision to the DD2 revision.
 
gofreak said:
There's a very fast ring bus connecting all the SPEs together. They're designed to share data with one another, and cooperate if the dev wants. You can have them work together on shared data (mutual exclusion required obviously), or you can set them up as a pipeline, one feeding the other or whatever. Splitting one task up into smaller multiple tasks to assign to SPUs is actually ideal for Cell.


Ahh thanks, sounds like the problem then is coordinating them all. IF you use them all it could be one hell of a balancing act, but damn, the possibilities.
 
DemonCleaner said:
I guess that means we'll see far fewer multiplatform games next gen

Was it any different last gen? Nah. The differences were just as big if not bigger.

And case in point, this dev sees the differences and challenges therein, yet they've not shied away from any of the platforms (in fact making a specific version of their engine just for PS3 to go alongside the others).
 
gofreak said:
Cheers! I appreciate this must take some time, don't worry if you don't get round to it :)

For some strange reason I couldn't stop once i started, so I translated the whole thing.

There are mistakes in it; translating German tech talk into English tech talk is not the easiest thing to do, plus by the end I think I'd lost both my English and my German. Here we go:

/GameStar/dev: So where's the technical journey leading?

Cevat Yerli: In general it's leading to multi-threading, so multi-core and multi-cpu, a streaming architecture, cross platform, and for each pixel there will be at least one shader.

/GameStar/dev: Are you staying with SM3.0 or are you jumping to SM4.0 straight away?

Cevat Yerli: We're gonna support SM2.0 and above

/GameStar/dev: And above?

Cevat Yerli: And above

/GameStar/dev: The new PC processors and the upcoming consoles are heavily applying multithreading. How are you gonna utilize the potential?

Cevat Yerli: We're scaling the individual modules, like animation, physics and parts of the graphics with the cpu, depending how many threads the hardware has to offer. We're going to support both multi-cpu systems and multithreading and multicore. With 3 cpu's with 2 hardware threads each (dual core cpu's) it's possible that we are going to scale for 6 threads. Maybe we're not gonna do it though, depending how fast the individual cores or the cpu-threads are running respectively. We're developing a system that's analyzing how much threading power is available and we are going to scale accordingly.

/GameStar/dev: x86, PowerPC and PowerPC + Cell. All architectures have their own threading organization ...

Cevat Yerli: The 360 solution resembles Hyper threading. In principle it's 3 cpu's with 2 Hyper threads each. If you're asking the hardware manufacturers, that's not the case though. But analyzing it from a software developer's standpoint it's no different from hyper threading. That means that you're supposed to have 6 threads, but it's only 1.5 threads by 3 in reality. With PS3's cell things are looking differently: the main cpu has 2 threads (slightly better than hyper threading) and then you're getting the seven synergetic processors. The 8th spu was cut.

/GameStar/dev: Yields?

Cevat Yerli: A pure manufacturing reason. The spu's are not as flexible as your conventional cpu, that's why we have to scale differently.

/GameStar/dev: Between the individual architectures it's almost impossible to port the whole code. You have to at least modify the low level commands. x86 and PowerPC are different, but how easily can you port from one console to the other considering they're both PowerPC based?

Cevat Yerli: You can't really port it either. We're the only German developer with a PS3 devkit, I believe. Accordingly, we can look at the real hardware rather than speculating about it. The PS3 is a system that needs adaptations written specifically for the PS3 architecture - simply porting just doesn't work.

/GameStar/dev: So Sony's claim - you have a nice layer, you throw your code at it, and everything runs beautifully on Cell - is not the case?

Cevat Yerli: They wish. But it's still a long way off. The devkits aren't that far along yet either. Based on what's in the devkits, you have to do a lot of low-level work to get something out of this hardware.

/GameStar/dev: Regarding interprocess-communication: How relevant are the differences between the different multi-threading approaches?

Cevat Yerli: An important question. The following things are very important for hardware-based multithreading: Are the threads running on real cores (do they have their own register set)? Or is there a hardware abstraction, as with the PowerPC, where the - in this case two - threads have their own register sets but still share the same core, so that when instructions are issued to the individual units, both threads can't work at the same time. With hyperthreading, multithreading only tries to distribute the instructions across the superscalar units (math operations to integer and float units, load/store etc.). With multiple cores there's also the question of the bus connection to the peripherals and to main memory. How is the cache implemented - do all threads share the same cache? Plus: shared memory vs. standalone local memory. A complete PowerPC core is also part of the Cell system. But it's not only an independent processing unit, it's also the host for the individual Cell cores. In the Cell architecture you have, apart from the PowerPC core, individual Cell cores that are connected through an ultra-high-end bus and can communicate with each other independently. If you exploit the optimized parallelizing, you can achieve linear scaling over all Cell cores.

/GameStar/dev: How are multi-core systems behaving on single-core pc's?

Cevat Yerli: As developers we are in the worst possible situation. We have to support 32 and 64 bit, single- and multicore CPUs, single- and multithreading, Cell and non-Cell, and OpenGL and DirectX. The effort to develop one technology that utilizes all those parameters perfectly is extremely high. The technical effort is at least twice as high as for CryEngine 1.

/GameStar/dev: The step from 32 to 64 bit was technically surely easier than going from single to multithreading.

Cevat Yerli: Unfortunately there is no step. We can't really take the step; we have to support both.

/GameStar/dev: Are there performance issues with the multi-threaded CryEngine 2 running on single core pc's?

Cevat Yerli: The code can run sequentially. You lose a bit of efficiency, but what you gain through optimization is higher. So the frame rate penalty of running on a single-threaded PC is so small that the optimizations easily win it back. Most PC games are not optimized anyway - Far Cry isn't either.
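The sequential fallback Yerli mentions can be sketched as a job runner that either fans a job list out across worker threads or, on a single-core machine, runs the same jobs in order on the calling thread. This is an illustrative sketch under that assumption, not Crytek code:

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run every job in `jobs`, either sequentially (worker_count <= 1)
// or spread across `worker_count` threads.
void run_jobs(const std::vector<std::function<void()>>& jobs,
              unsigned worker_count) {
    if (worker_count <= 1) {
        // Single-core path: identical code, executed in order.
        for (const auto& job : jobs) job();
        return;
    }
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) {
        // Each worker strides through the job list, so no two
        // workers ever touch the same job.
        workers.emplace_back([&, w] {
            for (std::size_t i = w; i < jobs.size(); i += worker_count)
                jobs[i]();
        });
    }
    for (auto& t : workers) t.join();
}
```

The single-thread path pays only the cost of the loop itself, which is the "small price" the answer refers to.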

/GameStar/dev: With consoles, developers get astounding performance out of average hardware, because they have to. If that were the case with PCs, you'd probably only need a GeForce4 Ti to run Doom 3.

Cevat Yerli: My point exactly. Hardware evolves at such a fast rate that you don't get to work with it for long. It's the same with CPUs - you have to take your time to optimize. The biggest problem there is cache misses. Also, you should avoid global memory shared between the individual threads. Simply put: if we are reaching into the same pot, the pot must not change underneath us. If I grab an element before you, you don't get it anymore - or at least not the one you expected. To get around this, you ideally change something in one step, write the result back to shared memory, and release it for the other CPUs (unlocking).

/GameStar/dev: How do you keep the physics consistent across six threads?

Cevat Yerli: You can solve that not only technically but also creatively. You have to go new ways: how can we - and that's the main thing - scale our game from one to eight threads qualitatively, so that the gameplay stays the same but the game either looks better or plays better. On PC it will only be cosmetic changes, because everyone has the power, but there are big differences in gameplay. On consoles you can get better gameplay out of optimizing. That's why, as a multi-platform developer, you have to tune two scales: FX and gameplay. On PC you often only optimize FX (higher resolution etc.). We want to scale both FX and gameplay, though, for example via the intensity of the AI and the shaders.
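The two scaling axes he names - FX and gameplay - could be sketched as a settings struct derived from the available thread count. The struct, function, and the specific multipliers are hypothetical illustrations of the idea, not values from the interview:

```cpp
#include <algorithm>

// Two quality axes scaled from the available thread count.
struct QualitySettings {
    int shader_detail;  // FX axis: cosmetic shader detail tier
    int ai_agents;      // gameplay axis: how many AI agents run at once
};

QualitySettings scale_for_threads(unsigned threads) {
    // The interview talks about scaling from 1 to 8 threads.
    unsigned t = std::clamp(threads, 1u, 8u);
    return QualitySettings{
        /*shader_detail=*/ static_cast<int>(t),      // one detail tier per thread
        /*ai_agents=*/     static_cast<int>(4 * t),  // assumed 4 agents per thread
    };
}
```

The key point the sketch preserves is that the rules of the game stay the same at every thread count; only the intensity of the AI and the shaders varies.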

/GameStar/dev: How does porting work from x86 to X360?

Cevat Yerli: In general the architecture is different, of course, but they are quite a bit more similar than Xbox 360 and PS3. The CPUs of PC, 360 and PS3 have only one similarity - multithreading. As a generic CPU the Xbox 360 processor is the most powerful one; if you take the seven SPUs of the PS3 into account, things change. Before we had the PS3 devkits we thought PS3 and Xbox 360 were closer in design than PC and console. That's not the case though *laughs*

/GameStar/dev: So you are optimizing your engine for the cell spu's?

Cevat Yerli: Sure. We have to, because we want to utilize the PS3's power in full. Accordingly, the PS3 will get its own engine architecture, a kind of sub-architecture of CryEngine 2.

/GameStar/dev: You outsourced the Far Cry Instincts port. Now you have to port yourselves?

Cevat Yerli: Umh ...

/GameStar/dev: ... the graphics interfaces require an extra effort?

Cevat Yerli: Yeah - because of OpenGL ES on the PS3 we have to recode our whole rendering. If you look at it closely, CryEngine 2 will have two solutions for each system in total. If a developer abstracts that, the technology is optimized very specifically; otherwise you can't utilize the full power. Alternatively you can abstract it in a way that runs on all systems, but then the strongest platform loses out the most ...
 
Thanks hadareud. Always good to get human translations of tech briefs. It's not easy, as you pointed out, so thanks.

Sounds interesting about the PPE. Is that the reason for the big increase in size from DD1 to DD2? There was a lot of speculation about why it looked like STI cloned certain portions of the original PPE, but not the full thing. I'd like to hear more about how those 2 threads run concurrently as opposed to the XeCPU, which really is like hyperthreading. I mean, when a thread stalls, another kicks in on a XeCPU core. But how's it work for the PPE in Cell? Are the two threads swapping clock ticks? Are they both running the same time and only taking turns using the VMX/FPU/ALU? I know Pana's asked Hofstee about some of this, but I don't believe there was ever a clear response. PEACE.
 
Pimpwerx said:
Thanks hadareud. Always good to get human translations of tech briefs. It's not easy, as you pointed out, so thanks.

Sounds interesting about the PPE. Is that the reason for the big increase in size from DD1 to DD2? There was a lot of speculation about why it looked like STI cloned certain portions of the original PPE, but not the full thing. I'd like to hear more about how those 2 threads run concurrently as opposed to the XeCPU, which really is like hyperthreading. I mean, when a thread stalls, another kicks in on a XeCPU core. But how's it work for the PPE in Cell? Are the two threads swapping clock ticks? Are they both running the same time and only taking turns using the VMX/FPU/ALU? I know Pana's asked Hofstee about some of this, but I don't believe there was ever a clear response. PEACE.

In a hyperthreaded design - POWER5 and Pentium 4, to take two examples - both threads can normally be working at the same time. On the Pentium 4, the trace cache contains u-ops tagged by thread identifier, and the re-order and issue logic can decide to pull instructions from more than the main executing thread if it sees functional units left idle (due to a dependency-induced stall, or other reasons like... not much work to be done in the main thread). If it manages to find and issue an instruction from another thread which can safely execute, CPU utilization rises.

The big point is "being able to safely execute": keep 100% of the resources shared (extreme case of HT/SMT = no HT/SMT at all hehe, back to good ol' fashioned context-switching ;)) or share less and less resources till the 0% extreme case of having a dual-core solution.
 
There's a big difference between Intel's Hyperthreading and IBM's PowerPC SMT and how they interact with the OS, and I think that's what this interview is alluding to, even though it's not a perfect comparison.

In Intel's cores, there are queues and buffers (such as the uop scheduling queue and the load and store buffers) that are cut in half when you run 2 threads on the same core - no thread can use more than half the buffer space. This partitioning exists regardless of the performance of either thread, and threads stall whenever they fill their half of the partition. If, for example, one thread has more than half the total available load requests outstanding, it will stall while the other thread will continue to allocate uops, including load requests the other thread could have used. In general, Intel will throttle threads to prevent overuse of buffers independently of the OS, so that Linux and Windows don't have to worry about it, but it also lacks a lot of flexibility. A thread can be throttled in Hyperthreading mode that is not throttled when run as a standalone thread, or it can use pipeline resources inefficiently.

IBM took a completely different approach starting with Power5 - it can throttle each thread at the decode stage based on OS feedback. For example, if a thread is making a large number of L2 cache misses and introducing blocking dependencies, the OS can signal to decode instructions for that thread at a slower rate, while letting the more efficient thread decode instructions at a faster rate (this is even more important for these in-order console cores than for out-of-order Power5 cores.) This gives much more efficient SMT performance than Intel, since it is based on empirical feedback; the problem is, the processor is dependent either on the OS or the low-level virtualization layer called the 'hypervisor' http://researchweb.watson.ibm.com/hypervisor/Research_Hypervisor.shtml to make the thread-throttling decisions, since the core has limited visibility to do it on its own.

Here is where Microsoft may be screwed two ways. The first is that Sony is working closely with IBM on the hypervisor layer for the Cell, and gains from IBM's experience with SMT control in its hypervisor research as well as AIX and Linux. Xbox360 may be dependent on whatever scheduling layer Microsoft can come up with before the system's release, possibly with minimal collaboration with IBM.
The second is that some of the empirical data a Power5 core uses to determine relative thread efficiency within a core is hindered by the Xbox360's shared L2 cache design. If a thread has many L2 cache misses, it may not be entirely that thread's working set's fault - it may just have an access pattern that works badly with another thread on another core. Throttling that thread may actually be the worse option for SMT performance. To avoid those problems, the Xbox360 may be forced to use more brute force processor partitionings that are closer to Hyperthreading.

If you're interested, here's a good article on hyperthreading and SMT, and throttling based on L2 cache hit rate:
http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-619.pdf
Ars technica also discussed the differences some time back:
http://arstechnica.com/articles/paedia/cpu/POWER5.ars/2
 
no problem everyone. The translation only took me an hour or so - the real problem was that I dreamt the whole night about this interview :D
 
So for the XeCPU, would it be fair to say that without significant profiling and optimisation, it would be risky to put two time-critical threads on one core? Effectively you only have three cores, with some spare time available that can be taken advantage of when the primary threads stall?

Are those stalls predictable enough to actually code a companion thread and be able to predict its throughput?



The PPE on CELL sounds odd though. Surely 2x 1.6GHz threads might as well be 1x 3.2GHz in terms of achievable power? Aren't you adding complexity without reason?

Any downside to this? 1.6GHz is a lot less than 3.2 of the XeCPU.



Its sounding like both machines are more different than they are the same.
 
So SONY and IBM decided to do a dual core more or less to get around some problems with hardware threads on a single core? That doesn't sound like a clever design at all, and it makes even less sense if the clock frequency is divided evenly between the semi-cores (as I like to call them), if this is true. What antipode said is very interesting though.
 
so XeCPU will be shit as it's just hyperthreading with shared cache to provide lots of inefficiency.

and CELL PPE will be shit because it's dual core but only half the speed?

Well I guess the PC bots were partially right - I'd expect raw PC CPU power to beat the main processors of these two fairly quickly. The only saving grace seems to be the SPEs on CELL.
 