
Programming |OT| C is better than C++! No, C++ is better than C

Next week I'll have to pass the OCA Java SE 7 Programmer 1 certification. I have some mock exams to do but it seems so fucking hard... I mean, I know how to code in Java (SE & EE) but there are some weird-ass questions >< Plus, the whole thing is in English only, so some questions are a bit hard to understand for me :/
Got 58% on my first mock and 62% on my second, but I can't help but think I got lucky on some answers...

Any tips or good websites to study? :p

1) If you don't have at least a consistent 70%+, don't bother. Are you sure you HAVE to take the exam next week? You might still pass if you cram, but...
2) Did you study the material? Even if you are an expert Java developer, the books will warn you about the many, many pitfalls in the exam. I didn't take the exam, but I studied this book http://www.amazon.it/dp/1617291048/, it's very good. Also this site might help you: http://www.techfaq360.com/scjp7Isuccesskit.jsp

In general, most questions are divided into:
a) trick questions like "you spent 2 minutes looking at this code but you didn't realize a semicolon was missing, dummy".
b) terminology.
c) actual code execution.

good luck.
 
1) If you don't have at least a consistent 70%+, don't bother. Are you sure you HAVE to take the exam next week? You might still pass if you cram, but...
2) Did you study the material? Even if you are an expert Java developer, the books will warn you about the many, many pitfalls in the exam. I didn't take the exam, but I studied this book http://www.amazon.it/dp/1617291048/, it's very good. Also this site might help you: http://www.techfaq360.com/scjp7Isuccesskit.jsp

In general, most questions are divided into:
a) trick questions like "you spent 2 minutes looking at this code but you didn't realize a semicolon was missing, dummy".
b) terminology.
c) actual code execution.

good luck.

Actually it's not next week but on March 3rd, but I'm going all in from now on. I've studied it a bit while working, but from my two mock exams I realize there are many things I didn't know. It's tricky since you almost never encounter what's in those questions. And most syntax mistakes in the code are highlighted by my IDE ><

Anyway, thanks for the references!
 

iapetus

Scary Euro Man
Have they changed JCP a lot recently? Because when I took it, the requirements for a pass mark were basically that you had a pulse and could read a bit. Think I was out in 10-15 minutes of the exam, after double-checking all answers and making sure I hadn't missed a whole section of the exam... And I worked with some people who only managed 50/50 on the pulse and read a bit requirements but still had Java certification. Seriously, people who couldn't actually launch a Java program, but still managed to get certification.
 

Makai

Member
Have they changed JCP a lot recently? Because when I took it, the requirements for a pass mark were basically that you had a pulse and could read a bit. Think I was out in 10-15 minutes of the exam, after double-checking all answers and making sure I hadn't missed a whole section of the exam... And I worked with some people who only managed 50/50 on the pulse and read a bit requirements but still had Java certification. Seriously, people who couldn't actually launch a Java program, but still managed to get certification.

I need to learn some light Java--not developer level--for a job opportunity.

What's the best way to learn online?
lol

In the meantime: https://www.codecademy.com/learn/learn-java
 
Currently working through an assignment for our Parallel Computing class using C++ and OpenMP along with an Intel Xeon Phi to parallelize a real world TSP problem (hit every city possible using flight info from http://openflights.org/).

Super interesting, but end-of-quarter crunch is hitting me like a freight train; wish I had more time to devote to these sorts of projects... might end up going back and cleaning it up to put up on GitHub for my resume over Spring Break, I 'spose, but for now getting the minimum requirements done seems like a sound goal :p
 
Ok so I am making a game that has battles between various Monster classes.

The base class is Monster. We have several types. Spiders, Gorgons, Basilisks, etc...

I have a fight function that takes two pointers.

Code:
void fight(Monster* m1, Monster* m2)
{
    // stuff
}

Now I use pointers to make use of polymorphism.

I want to give the user the ability to select which types of monsters fight. Is there a way to do this without dynamic memory allocation?

Because I'm currently doing it like this:

Code:
int main()
{
    Monster* m1;
    Monster* m2;

    // Print the menu:
    // Which monsters would you like to fight?
    // Choose one at a time from the options given.
    //
    // Select monster one:
    // 1. Gorgon
    // 2. Spider
    // 3. etc...

    if (/* user selected choice 1 */)
    {
        m1 = new Gorgon;
    }

    // etc...

But I've heard that using dynamic memory allocation like this is bad? Or did I just misunderstand a post on stack overflow?

http://stackoverflow.com/questions/22146094/why-should-i-use-a-pointer-rather-than-the-object-itself

It's very unfortunate that you see dynamic allocation so often. That just shows how many bad C++ programmers there are.
 

Koren

Member
Now I use pointers to make use of polymorphism.
Polymorphism doesn't force you to use pointers as arguments. It's probably not the best solution either... You could use references, for example.

I can understand that you'd want to avoid copying the object too often, though (especially if it's a large object and/or if you don't want to spend the time defining proper copy-constructors, etc.), and even more, slicing it.

Is there a way to do this without dynamic memory allocation?
Your link gives you some insight on this:
Pointers

However, there are other more general uses for raw pointers beyond dynamic allocation, but most have alternatives that you should prefer. [...] You need polymorphism. You can only call functions polymorphically (that is, according to the dynamic type of an object) through a pointer or reference to the object. If that's the behaviour you need, then you need to use pointers or references.

Besides that, it depends on how exactly you need polymorphism there.

But I've heard that using dynamic memory allocation like this is bad? Or did I just misunderstand a post on stack overflow?
It's bad if you use it without good reasons...

Imagine that each time you need a loop you use
int* i = new int
...

Some of the problems with dynamic allocations are discussed in your link. Among others, when you use dynamic allocation, you're in charge of the memory... and responsible for the leaks!

But dynamic allocation exists for good reasons... You often need it.

I'd suggest looking into unique_ptr and shared_ptr semantics, though; you could find those useful.
 

Somnid

Member
Ok so I am making a game that has battles between various Monster classes.

The base class is Monster. We have several types. Spiders, Gorgons, Basilisks, etc...

I have a fight function that takes two pointers.

Now I use pointers to make use of polymorphism.

I want to give the user the ability to select which types of monsters fight. Is there a way to do this without dynamic memory allocation?

You need to allocate instances of your classes, which requires dynamic memory allocation. You aren't creating any more than you need, so this seems perfectly efficient to me; it's more about the mutability factor. I think in C++ you'd use a reference instead of a pointer, though.
 
Ok so I am making a game that has battles between various Monster classes.

The base class is Monster. We have several types. Spiders, Gorgons, Basilisks, etc...

I have a fight function that takes two pointers.

Code:
void fight(Monster* m1, Monster* m2)
{
    // stuff
}

Now I use pointers to make use of polymorphism.

I want to give the user the ability to select which types of monsters fight. Is there a way to do this without dynamic memory allocation?

Because I'm currently doing it like this:

Code:
int main()
{
    Monster* m1;
    Monster* m2;

    // Print the menu:
    // Which monsters would you like to fight?
    // Choose one at a time from the options given.
    //
    // Select monster one:
    // 1. Gorgon
    // 2. Spider
    // 3. etc...

    if (/* user selected choice 1 */)
    {
        m1 = new Gorgon;
    }

    // etc...

But I've heard that using dynamic memory allocation like this is bad? Or did I just misunderstand a post on stack overflow?

http://stackoverflow.com/questions/22146094/why-should-i-use-a-pointer-rather-than-the-object-itself

You could always create a static pool of monster objects and then re-cast them to the correct type. Although my C++ is rusty, so I don't know the full ramifications of this on memory.
 

Haly

One day I realized that sadness is just another word for not enough coffee.
Not that I think you should avoid dynamic allocation in this case, your situation and usage is fine, but if you wanted to do it another way, what you could do is give the monster class a function called "init()" or "initialize()" that takes a bunch of parameters and gives the monster the stats you want.

Eg:
Code:
public function init(string name, int hp, int atk, int def) {
  this.name = name;
  this.hp = hp;
  ...
}

This way of doing it is limiting in that you can't use polymorphism. The upside is that you can create any kind of monster you want without creating a whole new class for it. If your application allowed the user to dictate an arbitrary type of monster, like for a monster maker, you might use an approach like this instead of concrete classes.

Statically allocating a monster class is bad for a real game because you'll have a lot of objects (graphics, physics and UI) that will want access to your monsters. Your components can't share their resources with only a single scope unless you stuff all your game code in one file, and that's just bad architecture. But for a light exercise like this where you only have one file, there are ways around dynamic allocation.


EDIT: On second thought maybe this is not the best general advice so let it be stricken.
 

JeTmAn81

Member
Not that I think you should avoid dynamic allocation in this case, your situation and usage is fine, but if you wanted to do it another way, what you could do is give the monster class a function called "init()" or "initialize()" that takes a bunch of parameters and gives the monster the stats you want.

Eg:
Code:
public function init(string name, int hp, int atk, int def) {
  this.name = name;
  this.hp = hp;
  ...
}

This way of doing it is limiting in that you can't use polymorphism. The upside is that you can create any kind of monster you want without creating a whole new class for it. If your application allowed the user to dictate an arbitrary type of monster, like for a monster maker, you might use an approach like this instead of concrete classes.

Statically allocating a monster class is bad for a real game because you'll have a lot of objects (graphics, physics and UI) that will want access to your monsters. Your components can't share their resources with only a single scope unless you stuff all your game code in one file, and that's just bad architecture. But for a light exercise like this where you only have one file, there are ways around dynamic allocation.


EDIT: On second thought maybe this is not the best general advice so let it be stricken.



Where's your confidence
 

Makai

Member
I transcribed some code I wrote in F# into C# for a code interview. I didn't realize how much I hate C syntax now. :,(

No tuples, unions, or pattern matching is a huge deal.
 

Aikidoka

Member
I did a straightforward implementation of pthreads for a double-precision general matrix multiplication algorithm. With 10 threads, the program is a bit faster than doing it without threading, as I'd expect. However, after increasing the thread count to 20, the performance is on par with (if not slightly worse than) no threading. I don't quite understand why adding more threads seems to be slower than not having any threads at all.

I am a physics Ph.D. student, but my research must be done on supercomputers (specifically Titan), so I am taking a graduate computer science course in parallel programming. I haven't had any real "formal" training in computer science, so analysing performance seems a bit out of my depth.
 

Mr.Mike

Member
I did a straightforward implementation of pthreads for a double-precision general matrix multiplication algorithm. With 10 threads, the program is a bit faster than doing it without threading, as I'd expect. However, after increasing the thread count to 20, the performance is on par with (if not slightly worse than) no threading. I don't quite understand why adding more threads seems to be slower than not having any threads at all.

I am a physics Ph.D. student, but my research must be done on supercomputers (specifically Titan), so I am taking a graduate computer science course in parallel programming. I haven't had any real "formal" training in computer science, so analysing performance seems a bit out of my depth.

Creating the threads has an overhead. And depending on your machine you might be making more threads than your machine can actually do at one time, in which case you wouldn't really gain anything by having more, but you would still be spending processing time to create the threads. Also, if you're not actually working with that big of a test case you can definitely spend more time creating the threads than just carrying out the operation on the small set, even if your computer can do a lot of threads in parallel.

If you're doing this on a laptop your computer can probably process two threads at a time, on a desktop with an i3 or i5 it's 4 I think, and with an i7 you can have 8. If you're doing this on your school's supercomputers, then I have no idea.
 

Aikidoka

Member
Creating the threads has an overhead. And depending on your machine you might be making more threads than your machine can actually do at one time, in which case you wouldn't really gain anything by having more, but you would still be spending processing time to create the threads. Also, if you're not actually working with that big of a test case you can definitely spend more time creating the threads than just carrying out the operation on the small set, even if your computer can do a lot of threads in parallel.

If you're doing this on a laptop your computer can probably process two threads at a time, on a desktop with an i3 or i5 it's 4 I think, and with an i7 you can have 8. If you're doing this on your school's supercomputers, then I have no idea.

I'm just doing this on my desktop, which has an Intel i7-5820K CPU. Is the number of cores the deciding factor for how many threads run concurrently? So if my CPU has 6 cores, then I should only expect 6 threads to run at once. My Task Manager shows that I have around 2000 threads at any given time, are all but 6 asleep/spinning then?
 

Mr.Mike

Member
I'm just doing this on my desktop, which has an Intel i7-5820K CPU. Is the number of cores the deciding factor for how many threads run concurrently? So if my CPU has 6 cores, then I should only expect 6 threads to run at once. My Task Manager shows that I have around 2000 threads at any given time, are all but 6 asleep/spinning then?

High-end Intel processors support Hyper-Threading, which lets each core run two threads at once, so you would actually be able to do 12 threads at a time. And yeah, the vast majority of those threads are asleep.
 

Koren

Member
I did a straightforward implementation of pthreads for a double-precision general matrix multiplication algorithm. With 10 threads, the program is a bit faster than doing it without threading, as I'd expect. However, after increasing the thread count to 20, the performance is on par with (if not slightly worse than) no threading. I don't quite understand why adding more threads seems to be slower than not having any threads at all.

I am a physics Ph.D. student, but my research must be done on supercomputers (specifically Titan), so I am taking a graduate computer science course in parallel programming. I haven't had any real "formal" training in computer science, so analysing performance seems a bit out of my depth.
As Mr Mike said, you're limited to 1 or 2 threads per core at best (2 threads on a single core will only be interesting if the threads are sometimes waiting for I/O), thus anything between 1 (on older computers) and 16 (on some i7).

Besides the thread-handling overhead, also keep in mind that each core has its own cache, and if they're working on the same data, you may increase RAM accesses (each core needs to fetch the same data), slowing the process down. Especially avoid writes that could invalidate caching...

With parallelism over multiple machines, the data transfers between the machines can sometimes nullify or worsen your computation time...
 

luoapp

Member
I did a straightforward implementation of pthreads for a double-precision general matrix multiplication algorithm. With 10 threads, the program is a bit faster than doing it without threading, as I'd expect. However, after increasing the thread count to 20, the performance is on par with (if not slightly worse than) no threading. I don't quite understand why adding more threads seems to be slower than not having any threads at all.

I am a physics Ph.D. student, but my research must be done on supercomputers (specifically Titan), so I am taking a graduate computer science course in parallel programming. I haven't had any real "formal" training in computer science, so analysing performance seems a bit out of my depth.

What's your language? How did you benchmark the program? There are also different considerations between programming for a supercomputer (multi-core/multi-CPU/multi-node with GPU acceleration) and a local single CPU with multiple cores. From my experience, you should focus on mid- to high-level programming, like when to move data to local memory, how many threads to use, when to start/sync them, etc., and leave the low-level stuff (like matrix multiplication or FFT) to libraries. Let's be honest, you're not gonna do better than that.
 

JeTmAn81

Member
Isn't there a decent amount of overhead generated by context switching between threads as well? There's a lot of nuance being handled by the operating system when it comes to this stuff.
 

Koren

Member
Isn't there a decent amount of overhead generated by context switching between threads as well? There's a lot of nuance being handled by the operating system when it comes to this stuff.
Not sure what you're referring to exactly...

If you're doing heavy computations, the OS won't mess that much with your code. It'll require some processor cycles from time to time, of course, but it won't stop your threads each microsecond, so the overhead will be limited.

The cores will crunch your numbers most of the time. You can still run two threads per core, but the switch between them will be handled by HT, and the overhead is far smaller (since all registers are efficiently switched by the processor) and the switch will only occur when the core is waiting for something.

That being said, two threads per core is not always so efficient when your code doesn't stall often for I/O, and worse, there are *many* cache problems, so two threads on a single core can often be slower than a single one.

From my experience, you should focus on mid- to high-level programming, like when to move data to local memory, how many threads to use, when to start/sync them, etc., and leave the low-level stuff (like matrix multiplication or FFT) to libraries. Let's be honest, you're not gonna do better than that.
Yes, but... he's *studying* parallel programming. It's not a bad problem to study when you're trying to understand how to parallelize an algorithm. If he writes high-level software and lets libraries do the low-level stuff, he won't learn much about the way those things work.

Also, mid/high-level is nice when you have the low-level algorithms implemented, but, especially when you're doing research, there's quite often no available solution to perform a given task. It's not always FFT or matrix multiplication...

(and without research on those topics, you wouldn't get improvements on those things either... the story behind the FFT is especially interesting: when they published the algorithm for the first time, they apparently hadn't even realized they had quasi-linear complexity instead of a quadratic one)
 

luoapp

Member
Yes, but... he's *studying* parallel programming. It's not a bad problem to study when you're trying to understand how to parallelize an algorithm. If he writes high-level software and lets libraries do the low-level stuff, he won't learn much about the way those things work.

Also, mid/high-level is nice when you have the low-level algorithms implemented, but, especially when you're doing research, there's quite often no available solution to perform a given task. It's not always FFT or matrix multiplication...

He is doing physics research, not CS research, though. Anyway, if he's interested, by all means, knock yourself out on those in-place/out-of-place butterflies. One thing I found out is that a great number of the numerical problems in research, if they can be helped by parallelization at all, can generally be treated by (a) using all the math libraries very well, and (b) doing "embarrassingly parallel" very well.
 

Mr.Mike

Member
So in doing genetic algorithms I've decided that I ought to, if possible, keep the population size small enough to keep it all in the cache. I figure this should improve performance a lot.

So my CPU L2 cache (per core) is 256 KB. Does that mean I can have 256 KB of stuff in there, or do I have to consider the overhead of the index? Like, can I expect to be able to have 256 thousand chars in there, or do I have to keep in mind the space needed to keep track of memory addresses?
 

JeTmAn81

Member
Not sure what you're referring to exactly...

If you're doing heavy computations, the OS won't mess that much with your code. It'll require some processor cycles from time to time, of course, but it won't stop your threads each microsecond, so the overhead will be limited.

The cores will crunch your numbers most of the time. You can still run two threads per core, but the switch between them will be handled by HT, and the overhead is far smaller (since all registers are efficiently switched by the processor) and the switch will only occur when the core is waiting for something.

That being said, two threads per core is not always so efficient when your code doesn't stall often for I/O, and worse, there are *many* cache problems, so two threads on a single core can often be slower than a single one.

I guess I was thinking of 8+ threads running on a single core, when really, what's the point of that? Single-core multithreading doesn't give you parallelism, right? You need more physical units in order to perform operations in parallel. But context switching also costs you more than just the registers, right? I thought you had to save the process stack as well.

It's also not clear that what he wants to do will benefit from multi-threading.
 

luoapp

Member
So in doing genetic algorithms I've decided that I ought to, if possible, keep the population size small enough to keep it all in the cache. I figure this should improve performance a lot.

So my CPU L2 cache (per core) is 256 KB. Does that mean I can have 256 KB of stuff in there, or do I have to consider the overhead of the index? Like, can I expect to be able to have 256 thousand chars in there, or do I have to keep in mind the space needed to keep track of memory addresses?
I don't think you have direct access to the CPU cache. Data will be swapped out when the CPU thinks it's necessary.
 
So in doing genetic algorithms I've decided that I ought to, if possible, keep the population size small enough to keep it all in the cache. I figure this should improve performance a lot.

So my CPU L2 cache (per core) is 256 KB. Does that mean I can have 256 KB of stuff in there, or do I have to consider the overhead of the index? Like, can I expect to be able to have 256 thousand chars in there, or do I have to keep in mind the space needed to keep track of memory addresses?

Caches are organized as a series of "lines", frequently 64 bytes.

When you load an address, you load an entire cache line. So in the worst case scenario, you could only fit 256KB / 16 = 16KB of data in cache, if each 4 byte integer you cared about were not in the same contiguous 64 byte block of memory.

This is why people try to write their algorithms such that you process data that is physically close together at the same time, because it reduces the number of cache misses. This is still considered a micro-optimization though, so you don't have to spend too much time worrying about it unless you profile your code and find your cache miss rate is high.
 

Slavik81

Member
So in doing genetic algorithms I've decided that I ought to, if possible, keep the population size small enough to keep it all in the cache. I figure this should improve performance a lot.

So my CPU L2 cache (per core) is 256 KB. Does that mean I can have 256 KB of stuff in there, or do I have to consider the overhead of the index? Like, can I expect to be able to have 256 thousand chars in there, or do I have to keep in mind the space needed to keep track of memory addresses?
To perfectly fill the cache, you would need to consider the memory used to store your code, and all the other memory used for local variables and other such things.

You'd also have to recognize that a real cache is not usually fully associative. That is to say, certain regions of memory are assigned to certain parts of the cache. Data from some particular place in main memory can't just go anywhere when it gets cached; there's a limited number of slots it can be stored in. Though, I don't think that's commonly considered during optimization. Languages don't generally give you tools to deal with it anyways.

Actually achieving 100% cache utilization is a pipe dream. You are best off just trying to keep your data small and your memory accesses near to each other. Try to keep your reads and your writes somewhat separated, too. And profile. On the micro level, the performance impacts of various sorts of changes can be quite unpredictable.
 

Koren

Member
He is doing physics research, not CS research though.
True enough. He most probably won't create a better FFT ^_^ But sometimes, research in physics is more about CS than actual physics ^_^

One thing I found out is that a great number of the numerical problems in research, if they can be helped by parallelization at all, can generally be treated by (a) using all the math libraries very well, and (b) doing "embarrassingly parallel" very well.
I agree. Although quite often, boundary conditions when you cut the problem into parts can make your parallel algorithm explode, so it's always interesting to understand the problems in parallelization in order to do the "(b)" case correctly.

Actually achieving 100% cache utilization is a pipe dream. You are best off just trying to keep your data small and your memory accesses near to each other. Try to keep your reads and your writes somewhat separated, too. And profile.
This, 100%.


I guess I was thinking of 8+ threads running on a single core, when really, what's the point of that? Single-core multithreading doesn't give you parallelism, right? You need more physical units in order to perform operations in parallel.
Indeed. Two threads may be interesting in the case of HT cores, in some cases, but anything over this will decrease your performances.
 

Aikidoka

Member
Thanks everyone for the helpful discussion. Cache and memory access is also something I'm not totally familiar with and would appreciate some clarification.
First, to try and analyse my code's performance I am using the clock() function from <time.h>, which, I believe, gives me the CPU time of the algorithm (is CPU time more or less useful for threaded programs?). Well, the problem I am having is that the time elapsed varies wildly between runs with the same parameters. Googling (basically reading Stack Exchange) reveals that all fluctuations can be attributed to memory/cache issues. I guess they mean that variables are stored in different physical locations from run to run. Is there a way to minimize this effect, other than simply averaging over multiple runs (making sure to clear the cache each time)?

Also, I am expected to do some optimization for this matrix multiplication. The most straightforward way, I guess, is to tile or block the matrices. The Wikipedia article for this states that the optimal block size would be sqrt(M), where M is the number of cache lines. Task Manager just states "L1 cache: 384 KB; L2 cache: 1.5 MB; L3 cache: 15.0 MB". How do I find the number of cache lines, and which level of cache should I optimize for? Is there a way to make the code itself calculate this value or does it need to be hand-tuned? I am also looking at other optimizations that are cache-oblivious, but blocking just seems so straightforward that I might as well do it.


True enough. He most probably won't create a better FFT ^_^ But sometimes, research in physics is more about CS than actual physics ^_^

This has been fairly true - granted, I'm still at the beginning of Ph.D. research - porting my code to work on a GPU/CPU hybrid architecture is a current goalpost, and while reinventing the wheel is pretty foolish, I don't think there's really any library that can load-balance your code for you or guarantee that weak scaling is achieved. Though I'm not even sure there are any current linear algebra libraries with complex matrix-matrix multiplication optimized for GPUs yet (I know single-prec works great).
 
Thanks everyone for the helpful discussion. Cache and memory access is also something I'm not totally familiar with and would appreciate some clarification.
First, to try and analyse my code's performance I am using the clock() function from <time.h>, which, I believe, gives me the CPU time of the algorithm (is CPU time more or less useful for threaded programs?). Well, the problem I am having is that the time elapsed varies wildly between runs with the same parameters. Googling (basically reading Stack Exchange) reveals that all fluctuations can be attributed to memory/cache issues. I guess they mean that variables are stored in different physical locations from run to run. Is there a way to minimize this effect, other than simply averaging over multiple runs (making sure to clear the cache each time)?

Also, I am expected to do some optimization for this matrix multiplication. The most straightforward way, I guess, is to tile or block the matrices. The Wikipedia article for this states that the optimal block size would be sqrt(M), where M is the number of cache lines. Task Manager just states "L1 cache: 384 KB; L2 cache: 1.5 MB; L3 cache: 15.0 MB". How do I find the number of cache lines, and which level of cache should I optimize for? Is there a way to make the code itself calculate this value or does it need to be hand-tuned? I am also looking at other optimizations that are cache-oblivious, but blocking just seems so straightforward that I might as well do it.




This has been fairly true - granted, I'm still at the beginning of Ph.D. research - porting my code to work on a GPU/CPU hybrid architecture is a current goalpost, and while reinventing the wheel is pretty foolish, I don't think there's really any library that can load-balance your code for you or guarantee that weak scaling is achieved. Though I'm not even sure there are any current linear algebra libraries with complex matrix-matrix multiplication optimized for GPUs yet (I know single-prec works great).
There's a great talk by Andrei Alexandrescu called Writing Fast Code that might be interesting to you.

If you want to optimize for cache friendliness specifically, I'd recommend cachegrind (part of Valgrind) if you're on a Unix-based operating system. I don't know of similar tools for Windows, unfortunately.

The cache is super important because a cache hit is an order of magnitude or two faster than going out to main memory. The less time your memory accesses take, the less the CPU will stall (wait while data is fetched from RAM) and the more throughput you'll get.

Memory layout actually plays a big role in whether the CPU can put data in the cache. Say you have an array of doubles arranged linearly in memory, one double after the other. When you loop over that array, the CPU can very easily predict which areas of memory you'll access next and put as much of the array in the cache as possible (prefetching the data). If you instead had an array of pointers to doubles for whatever reason, the CPU would first have to dereference each pointer to get to the data, making prefetching much harder.
 

upandaway

Member
Need a little bit of help... we had this question on a test

Optimize the following function (nonspecific CPU):
Code:
void memcpy(char *src, char *dst, unsigned num_bytes) {
    unsigned i;
    for (i = 0; i < num_bytes; i++) {
        *dst++ = *src++;
    }
}

Short of assembly injection in there, what possible optimizations can you think of?
No one in my class got full marks for the question, and we can't get an official answer, so I have no clue. It looks like a relatively simple question, but yeah.

For the record, my answer was
Code:
#define BLOCK_SIZE 4
void memcpy(register char *src, register char *dst, uint_fast32_t num_bytes) {
    register uint_fast32_t i;
    for (i = num_bytes / BLOCK_SIZE; i; --i) {
        *dst = *src;     //1
        ++dst;
        ++src;
        *dst = *src;     //2
        ++dst;
        ++src;
        *dst = *src;     //3
        ++dst;
        ++src;
        *dst = *src;     //4
        ++dst;
        ++src;
    }
    for (i = num_bytes % BLOCK_SIZE; i; --i) {
        *dst = *src;
        ++dst;
        ++src;
    }
}

I also thought about casting to int* and using a union to read 4 chars at once (assuming a specific endianness)
Code:
union buffer {
    char b[4];
    int i;
};
But apparently that's not enough either.
 

upandaway

Member
We're not supposed to change how the method works, but yeah, copying by 4 bytes is what I was referring to with the union; he didn't give full marks on that either.
 

Somnid

Member
A 4-byte boundary is common, but I feel "nonspecific CPU" means algorithmic optimization, since there's no reason it has to be 4 bytes. And on that thought, I can't think of a way to optimize, because you need to copy each element, so there's no way to avoid looping through all of them.
 

ricki42

Member
Need a little bit of help... we had this question on a test

Optimize the following function (nonspecific CPU):
Code:
void memcpy(char *src, char *dst, unsigned num_bytes) {
    unsigned i;
    for (i = 0; i < num_bytes; i++) {
        *dst++ = *src++;
    }
}

Short of assembly injection in there, what possible optimizations can you think of?

Don't know how much of a difference it would make, but I think you could get rid of the unsigned i by looping backwards over num_bytes instead: (; num_bytes > 0; --num_bytes).
 
Need a little bit of help... we had this question on a test

Optimize the following function (nonspecific CPU):
Code:
void memcpy(char *src, char *dst, unsigned num_bytes) {
    unsigned i;
    for (i = 0; i < num_bytes; i++) {
        *dst++ = *src++;
    }
}

Short of assembly injection in there, what possible optimizations can you think of?
No one in my class got full marks for the question, and we can't get an official answer, so I have no clue. It looks like a relatively simple question, but yeah.

For the record, my answer was
Code:
#define BLOCK_SIZE 4
void memcpy(register char *src, register char *dst, uint_fast32_t num_bytes) {
    register uint_fast32_t i;
    for (i = num_bytes / BLOCK_SIZE; i; --i) {
        *dst = *src;     //1
        ++dst;
        ++src;
        *dst = *src;     //2
        ++dst;
        ++src;
        *dst = *src;     //3
        ++dst;
        ++src;
        *dst = *src;     //4
        ++dst;
        ++src;
    }
    for (i = num_bytes % BLOCK_SIZE; i; --i) {
        *dst = *src;
        ++dst;
        ++src;
    }
}

I also thought about casting to int* and using a union to read 4 chars at once (assuming a specific endianness)
Code:
union buffer {
    char b[4];
    int i;
};
But apparently that's not enough either.

::memcpy(dst, src, size);

No way you're going to get faster than the builtin version

Edit: OK, fine, I know that's not in the spirit of the question. So in that case I would stick with your original idea of casting to int*, but there are some improvements you could make. First, why do you need the union? Just use a 4-byte type; the union is unnecessary. Second, what if you happen to be on a 64-bit CPU? Then you have native support for 8-byte moves, and 4-byte moves will be slower. Putting all this together:

Code:
#include <stdint.h>

void memcpy32(uint32_t *src, uint32_t *dst, unsigned num_longs)
{
    for (unsigned i=0; i < num_longs; ++i)
        *dst++ = *src++;
}

void memcpy64(uint64_t *src, uint64_t *dst, unsigned num_uint64s)
{
    for (unsigned i=0; i < num_uint64s; ++i)
        *dst++ = *src++;
}

void memcpy(char *src, char *dst, unsigned num_bytes)
{
    // If this CPU supports 64-bit addressing, consume as many bytes
    // as possible using 64-bit addressing
    if (sizeof(void*) == 8)
    {
        unsigned num_uint64s = num_bytes / 8;
        memcpy64((uint64_t*)src, (uint64_t*)dst, num_uint64s);
        unsigned bytes_consumed = num_uint64s*8;
        dst += bytes_consumed;
        src += bytes_consumed;
        num_bytes -= bytes_consumed;
    }

    // There may still be uncopied bytes.  For example, maybe num_bytes==12
    // or maybe this isn't a 64-bit CPU anyway and that entire if branch was skipped.
    // So copy the remaining bytes in the fastest possible way.
    if (num_bytes > 0)
    {
        unsigned num_uint32s = num_bytes / 4;
        memcpy32((uint32_t*)src, (uint32_t*)dst, num_uint32s);
        unsigned bytes_consumed = num_uint32s*4;
        dst += bytes_consumed;
        src += bytes_consumed;
        num_bytes -= bytes_consumed;
    }

    // There may still be remaining bytes (for example if num_bytes==13), so
    // copy the leftovers 1 at a time, since there can be a maximum of 3 we
    // don't have to worry about doing this efficiently.
    for (unsigned i=0; i < num_bytes; ++i)
        *dst++ = *src++;
}

Note that the builtin implementation is *still* much faster than this, because it takes advantage of SSE / SIMD instructions on your CPU to copy multiple chunks in parallel.
 

luoapp

Member
We're not supposed to change how the method works, but yeah, copying by 4 bytes is what I was referring to with the union; he didn't give full marks on that either.

The union is a poor implementation of the idea. The user, not the library, needs to declare the union. Pointer casting is better.
 
Need a little bit of help... we had this question on a test

Optimize the following function (nonspecific CPU):
Code:
void memcpy(char *src, char *dst, unsigned num_bytes) {
    unsigned i;
    for (i = 0; i < num_bytes; i++) {
        *dst++ = *src++;
    }
}

The function declaration as currently written implies that the two pointers may alias each other. The loop optimizer in the compiler therefore cannot determine statically whether the two memory regions overlap, so it cannot safely vectorize the copy loop. If you change the function definition to:
Code:
void memcpy(char *__restrict src, char *__restrict dst, unsigned num_bytes)
and then examine the instructions the compiler generates, you should see that it has vectorized the loop. If you are using MSVC, you may see that for the version that permits aliasing, the compiler inserts instructions to detect aliasing dynamically and falls back to a scalar version. This only yields benefits on chips with vector instructions, though, so it may not be all that useful.
 

Koren

Member
::memcpy(dst, src, size);

No way you're going to get faster than the builtin version

Edit: Ok, fine I know that's not in the spirit of the question.
I also find the question really strange...

I don't think you can optimize such a thing in C without knowing the underlying architecture. I mean, besides the 32/64-bit question, and without even taking "recent" technologies into account, in x86 assembly as old as the 386 you would use, for example,
rep movsd
which would beat any loop you could write.

Granted, there's a chance the compiler does some optimization to get to this, but what kind of optimization would you expect?



So, just for the sake of the question:

*) I agree that moving 32-bit or 64-bit values is better. But I would first move enough bytes so that the addresses are aligned, then move 32- or 64-bit values, then move the remaining bytes.

If the difference between the two pointers is not a multiple of 4/8, I'm not sure whether you should align the source or the destination, though...

Also, I wouldn't try to move 32-bit values after moving 64-bit ones; at best you'll move one, and I think the cast and the arithmetic would take more time than moving 4 bytes.

*) I think ricki42 is right: counting DOWN to 0 is far better than counting up, since, on most processors, the "dec" opcode sets the zero flag, so the test doesn't need a comparison.

*) I'd also use x>>3 instead of x/8, x&7 instead of x%8, etc.

Yes, most compilers will do this properly, but since I don't really understand what he's looking for...

So, quick'n dirty (and maybe buggy):
Code:
void memcpy(uint8_t* src, uint8_t* dst, unsigned num_bytes) {
    // Align
    #ifdef CPU64
        if (num_bytes >> 3) {
            unsigned x = ((unsigned long) src) & 0x7;
            if (x!=0) {
                x = 8-x;
                movebytes(src, dst, x);
                src += x;
                dst += x;
                num_bytes -= x;
            }
        }
    #else
        if (num_bytes >> 2) {
            unsigned x = ((unsigned long) src) & 0x3;
            if (x!=0) {
                x = 4-x;
                movebytes(src, dst, x);
                src += x;
                dst += x;
                num_bytes -= x;
            }
        }
    #endif
    
    // Move words
    #ifdef CPU64
        unsigned nb64 = num_bytes >> 3;
        uint64_t* src64 = (uint64_t*)src;
        uint64_t* dst64 = (uint64_t*)dst;
        memcpy64(src64, dst64, nb64);
        nb64 <<= 3;
        src += nb64;
        dst += nb64;
        num_bytes &= 0x07;
    #else
        unsigned nb32 = num_bytes >> 2;
        uint32_t* src32 = (uint32_t*)src;
        uint32_t* dst32 = (uint32_t*)dst;
        memcpy32(src32, dst32, nb32);
        nb32 <<= 2;
        src += nb32;
        dst += nb32;
        num_bytes &= 0x03;
    #endif
    
    // Move remaining bytes
    movebytes(src, dst, num_bytes);
}
 
I did a straightforward pthreads implementation of a double-precision general matrix multiplication algorithm. With 10 threads, the program is a bit faster than without threading, as I'd expect. However, after increasing the thread count to 20, the performance is on par with (if not slightly worse than) no threading. I don't quite understand why adding more threads seems to be slower than not having any threads at all.

I am a physics Ph.D. student, but my research must be done on supercomputers (specifically Titan), so I am taking a graduate computer science course in parallel programming. I haven't had any real "formal" training in computer science, so analysing performance feels a bit out of my depth.

A few questions:

1. What compiler are you using, and what version is the compiler?
2. Are you compiling the program on the Titan itself, or compiling it on your workstation and uploading it?
3. Are you able to exclusively reserve nodes for testing your program?

With those questions out of the way: the slowdown is almost certainly due to the memory overhead of migrating threads between cores, or more specifically migrating the memory that each thread touches. I don't know what the interconnect is on your system (I haven't used Titan before, but I imagine it is similar to EU supercomputers), but if your threads are regularly moving between cores, the resulting traffic at the memory controller will slow down the program massively.

A simple change you can make is to pin each thread to a core, and then allocate memory at the node the thread is pinned on (here I am assuming that your matrices are big enough to require dynamic memory allocation). To do this you need to know the topology of your system. For that you can use the HWLOC library, which is a standard part of HPC Linux installs. You can query the core number and then pin the thread to that core using pthreads (or the OS equivalent if you want).

For memory allocation you should explore the concept of Non-Uniform Memory Access, or NUMA. Most SCs with complex topologies and large interconnects have many memory banks, each with different access properties. If you use the system-wide memory allocation functions, the OS allocator will decide for you, and it doesn't typically make a good choice. Examine the libnuma library to aid in this. Here are some links to the libraries I have mentioned.

HWLOC: https://www.open-mpi.org/projects/hwloc/
Libnuma: http://linux.die.net/man/3/numa

Another thing you can look at is vectorization, but it tends to be more intrusive on code; it will likely force a change in your data layout, which may not be acceptable. Anyway, some other tips: make sure you are compiling with the following flags:
Code:
-O3 -march=native

If you are compiling your program on your workstation/laptop/whatever and then uploading then change the flags to this:

Code:
-O3 -march=bdver1

If possible, endeavor to use the most recent version of your compiler. Try these things out, remeasure, and then let me know if that did anything. If you need more in-depth advice, you might be able to ask someone in the CS department, an HPC researcher or something.
 

cyborg009

Banned
can someone help me real quick with this problem I'm having? I'm trying to pass my struct to a function but it's not working.

header.h
Code:
#define CARDS 13
#define SUITS 4
#define DECK 52
struct card{
        char *suit;
        int *value;
    }deck[DECK];


void make(struct Card *);

maker class to create decks
Code:
void make(struct Card d[]){
    char *suits[] ={"Spade","Hearts","Club","Diamond"};
    //enum Face {ACE =14}:
    char num[13];
    
//print the struct items
     printf("\n");
    for (int j = 0; j < SUITS; j++) {
        
        for (int i = 0; i < CARDS; i++) {
            d[j].suit =suits[j];
            d[j].value=&i;
            printf("%s\t ",d[j].suit);
            printf("%d ",*d[j].value+2);
        }
        printf("\n");
    }
}

My main is just a simple call to the function passing the struct array.

But I'm getting error: array type has incomplete element type 'struct Card'.
 

*) I think Ricki42 is right: counting DOWN to 0 is far better than counting up, since, on most processors, the "dec" opcode will set the zero flag, so the test don't need a comparison.

Moving backwards is literally the worst-case scenario for cache misses

Edit: Ahh you just meant the index variable. I don't think that's a worthy optimization. Compared to the cost of moving an arbitrary amount of memory, the cost of 1 cpu instruction is negligible
 

upandaway

Member
Thanks a lot everyone, you guys are super helpful.

First, why do you need the union? Just use a 4-byte type, the union is unnecessary.
I just meant for the union to switch the byte order in case the endianness switches them around when loading the int, but I forgot it switches things back when writing it to dst too, whoops.

Didn't know CPUs could have vector instructions
 

Koren

Member
Moving backwards is the literally the worst case scenario for cache misses

Edit: Ahh you just meant the index variable. I don't think that's a worthy optimization. Compared to the cost of moving an arbitrary amount of memory, the cost of 1 cpu instruction is negligible
Yes, I meant the index variable.

I don't think it's such a minor change. Even more so because I've already used it in real operations (on images) and got a couple percent increase in speed. If the loop body is short, using a zero test on the loop counter is slightly faster.

If you want more proof, take those two functions:
Code:
int f(int x) {
	for(int i=0; i<1000000000; ++i) {
		x = x ^ 0x33333;
	}
	return x;
}

int g(int x) {
	for(int i=1000000000; i!=0; --i) {
		x = x ^ 0x33333;
	}
	return x;
}

Here is the result of gprof:
Code:
                3.71    0.00       1/1           f(int) [2]
                3.26    0.00       1/1           g(int) [3]


In the case above, the "normal loop", with an increasing i up to max, in pseudo-asm is :
Code:
S: mov src > dst
    inc src
    inc dst
    inc i
    sub max, i
    jump_not_zero S

Counting in reverse from max to 0 makes it
Code:
S: mov src > dst
    inc src
    inc dst
    dec i
    jump_not_zero S

Assuming all operations are 1 cycle and not taking pairable opcodes into account, you gain 16% of execution time, since you're going from 6.n to 5.n cycles (it's less than that in reality, but you see the idea).


That being said, it's still stupid, since
Code:
rep movsb
on x86, for example, has a cost of 3+n on the Pentium (plus setting the counter, which is O(1)), so it's 5 times faster than any "normal" loop you could write like those above, not even taking SSE and other parallel opcodes into account. The limiting factor, even without SSE, will be memory, not code...
 

NotBacon

Member
can someone help me real quick with this problem I'm having? I'm trying to pass my struct to a function but it's not working.

header.h
Code:
#define CARDS 13
#define SUITS 4
#define DECK 52
struct card{
        char *suit;
        int *value;
    }deck[DECK];


void make(struct Card *);

maker class to create decks
Code:
void make(struct Card d[]){
    char *suits[] ={"Spade","Hearts","Club","Diamond"};
    //enum Face {ACE =14}:
    char num[13];
    
//print the struct items
     printf("\n");
    for (int j = 0; j < SUITS; j++) {
        
        for (int i = 0; i < CARDS; i++) {
            d[j].suit =suits[j];
            d[j].value=&i;
            printf("%s\t ",d[j].suit);
            printf("%d ",*d[j].value+2);
        }
        printf("\n");
    }
}

My main is just a simple call to the function passing the struct array.

But I'm getting error: array type has incomplete element type 'struct Card'.

Your function takes 'struct Card', when you declared it 'struct card'.
 

upandaway

Member
@Koren
Yeah, when I was doing optimization homework, changing the loop condition to a 0 comparison decreased the "for" line's cycles by about half in the profiler.
 

Koren

Member
can someone help me real quick with this problem I'm having? I'm trying to pass my struct to a function but it's not working.
Well...

Code:
struct card{

Code:
void make(struct Card *);

Remember, it's case sensitive ;)


Besides that, I find your choices quite strange, and you'll probably get nasty surprises... If your table of suit strings is destroyed at the end of the make function, congratulations, you have char* pointers that point to a random place in memory.

Even worse, all your values point to an integer that changes... At the end of the function, all the values of all the cards are the same, and when you exit the function, all of them point to an invalid address.

Why, why, do you want to use an int* to store the value?!
 

Koren

Member
@Koren
Yeah, when I was doing optimization homework, changing the loop condition to a 0 comparison decreased the "for" line's cycles by about half in the profiler.
It's only interesting if the inside of the loop is really short, but yes, I'm convinced it can be useful.

There's usually plenty of other things to optimize first, but it's such a simple change that it's often a good trick.
 

cyborg009

Banned
Your function takes 'struct Card', when you declared it 'struct card'.

Damn, didn't notice that, but now it's saying error: array type has incomplete element type 'struct Card' for the function header in the source file.

Well...

Code:
struct card{

Code:
void make(struct Card *);

Remember, it's case sensitive ;)


Besides that, I find your choices quite strange, and you'll probably get nasty surprises... If your table of suit strings is destroyed at the end of the make function, congratulations, you have char* pointers that point to a random place in memory.

Even worse, all your values point to an integer that changes... At the end of the function, all the values of all the cards are the same, and when you exit the function, all of them point to an invalid address.

Why, why, do you want to use an int* to store the value?!

Well, I was originally going to use a plain int, but once a card has a value over 10 I have to change it to a King, Ace, etc...
 