::memcpy(dst, src, size);
No way you're going to get faster than the builtin version.
Edit: Ok, fine I know that's not in the spirit of the question.
I also find the question really strange...
I don't think you can optimize such a thing in C without knowing the underlying architecture. I mean, besides the 32/64-bit question, and without even taking "recent" technologies into account, in x86 assembly as old as the 386 you would use, for example,
which would be better than any loop you could write.
Granted, there's a chance that compilers do some optimization to get to this, but what kind of optimization would you expect?
So, just for the sake of the question:
*) I agree that moving 32-bit or 64-bit values is better. But I would first move enough bytes so that the addresses are aligned, then move 32- or 64-bit values, then move the remaining bytes.
If the difference between the two pointers is not a multiple of 4/8, I'm not sure whether you should align the source or the destination, though...
Also, I wouldn't try to move 32-bit values after moving the 64-bit ones: at best you'll move one, and I think the cast and the arithmetic would take more time than moving 4 bytes.
*) I think Ricki42 is right: counting DOWN to 0 is far better than counting up, since, on most processors, the "dec" opcode will set the zero flag, so the test doesn't need a comparison.
*) I'd also use x>>3 instead of x/8, x&7 instead of x%8, etc. (for unsigned x).
Yes, most compilers will do this properly, but since I don't really understand what he's looking for...
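To make the points above a bit more concrete, here is a rough sketch in C of the bookkeeping they describe: how many bytes to move one at a time before the source is aligned, how many 8-byte words follow, and how many bytes are left over, all using shifts and masks. The names `copy_plan`, `plan_copy` and `mutually_aligned` are made up for illustration, and 8-byte words are assumed (use 3 and 2 instead of 7 and 3 for 4-byte words).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how a copy of `len` bytes splits up. */
struct copy_plan {
    size_t head;   /* single bytes moved until the source is 8-byte aligned */
    size_t words;  /* number of 8-byte moves */
    size_t tail;   /* single bytes left after the last 8-byte move */
};

/* Split a copy starting at address `src_addr` into head / words / tail. */
static struct copy_plan plan_copy(uintptr_t src_addr, size_t len)
{
    struct copy_plan p = {0, 0, 0};
    size_t misalign = (size_t)(src_addr & 7u);   /* same as src_addr % 8 */

    /* Only bother aligning if there is at least one full word to move. */
    if ((len >> 3) != 0 && misalign != 0) {
        p.head = 8 - misalign;
    }
    len -= p.head;
    p.words = len >> 3;                          /* same as len / 8 */
    p.tail  = len & 7u;                          /* same as len % 8 */
    return p;
}

/* Nonzero if aligning one pointer to 8 bytes also aligns the other,
   i.e. the two addresses differ by a multiple of 8. */
static int mutually_aligned(uintptr_t src_addr, uintptr_t dst_addr)
{
    return ((src_addr ^ dst_addr) & 7u) == 0;
}
```

For example, `plan_copy(0x1003, 100)` gives 5 head bytes, 11 words and 7 tail bytes. When `mutually_aligned` returns 0, only one of the two pointers can be aligned, which is exactly the case I'm unsure about above.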
So, quick'n dirty (and maybe buggy):
Code:
#include <stdint.h>

/* Renamed from memcpy to avoid clashing with the standard function.
   Note the argument order here is (src, dst), unlike the standard memcpy. */
void my_memcpy(uint8_t* src, uint8_t* dst, unsigned num_bytes) {
    // Align the source pointer (only if at least one full word will be moved)
#ifdef CPU64
    if (num_bytes >> 3) {
        unsigned x = (unsigned)(((uintptr_t) src) & 0x7);
        if (x != 0) {
            x = 8 - x;
            movebytes(src, dst, x);
            src += x;
            dst += x;
            num_bytes -= x;
        }
    }
#else
    if (num_bytes >> 2) {
        unsigned x = (unsigned)(((uintptr_t) src) & 0x3);
        if (x != 0) {
            x = 4 - x;
            movebytes(src, dst, x);
            src += x;
            dst += x;
            num_bytes -= x;
        }
    }
#endif
    // Move words
#ifdef CPU64
    unsigned nb64 = num_bytes >> 3;
    uint64_t* src64 = (uint64_t*)src;
    uint64_t* dst64 = (uint64_t*)dst;
    memcpy64(src64, dst64, nb64);
    nb64 <<= 3;
    src += nb64;
    dst += nb64;
    num_bytes &= 0x07;
#else
    unsigned nb32 = num_bytes >> 2;
    uint32_t* src32 = (uint32_t*)src;
    uint32_t* dst32 = (uint32_t*)dst;
    memcpy32(src32, dst32, nb32);
    nb32 <<= 2;
    src += nb32;
    dst += nb32;
    num_bytes &= 0x03;
#endif
    // Move remaining bytes
    movebytes(src, dst, num_bytes);
}
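The code above calls movebytes, memcpy32 and memcpy64 without showing them. Here is a minimal sketch of what they might look like, assuming the (src, dst, count) argument order used above and counting down to zero as suggested earlier. These are my guesses at the helpers, not tuned in any way, and they would have to be declared before the function that calls them.

```c
#include <stdint.h>

/* Copy n single bytes, counting down to zero. */
static void movebytes(const uint8_t *src, uint8_t *dst, unsigned n)
{
    while (n != 0) {
        *dst++ = *src++;
        --n;
    }
}

/* Copy n 32-bit words, counting down to zero. */
static void memcpy32(const uint32_t *src, uint32_t *dst, unsigned n)
{
    while (n != 0) {
        *dst++ = *src++;
        --n;
    }
}

/* Copy n 64-bit words, counting down to zero. */
static void memcpy64(const uint64_t *src, uint64_t *dst, unsigned n)
{
    while (n != 0) {
        *dst++ = *src++;
        --n;
    }
}
```

Keep in mind that the caller only aligns the source, so the destination word pointer may still be misaligned when the two addresses differ by something other than a multiple of the word size. That is tolerated on x86 but can fault on stricter architectures, and casting a byte pointer to a word pointer is not strictly portable C either.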