I find myself tuning a piece of code where memory is copied using memcpy and the third parameter (size) is known at compile time.
The consumer of the function calling memcpy does something similar to this:
template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy(dstMemory, srcMemory, S);
}
Now, I would have expected the memcpy intrinsic to be smart enough to realise that a call like this:
foo<4>()
... can replace the memcpy inside the function with a 32-bit integer assignment. However, to my surprise I see a >2x speedup doing this:
#include <cstddef>  // size_t
#include <cstdint>  // uint32_t
#include <cstring>  // memcpy

template<size_t size>
inline void memcpy_fixed(void* dst, const void* src) {
    memcpy(dst, src, size);
}
template<>
inline void memcpy_fixed<4>(void* dst, const void* src) { *((uint32_t*)dst) = *((uint32_t*)src); }
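The same trick extends to other small, power-of-two sizes; for example, an 8-byte specialization along the same lines (a sketch, assuming an unaligned 8-byte access is acceptable on the target, just as the 4-byte case assumes) would be:
// Sketch of an 8-byte variant, following the 4-byte pattern above.
template<>
inline void memcpy_fixed<8>(void* dst, const void* src) { *((uint64_t*)dst) = *((uint64_t*)src); }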
And rewriting foo to:
template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy_fixed<S>(dstMemory, srcMemory);
}
Both tests are on clang (OS X) with -O3. I really would have expected the memcpy intrinsic to be smarter about the case where the size is known at compile time.
My compiler flags are:
-gline-tables-only -O3 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
Am I asking too much of the C++ compiler, or is there some compiler flag I am missing?
                        
If both source and destination buffers are provided as function parameters, then clang++ 3.5.0 uses memcpy only when S is big, but it uses the movl instruction when S = 4. However, your source and destination addresses are not parameters of this function, and this seems to prevent the compiler from making this aggressive optimization.
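For illustration, a minimal sketch of the case described above (the function name foo_params and the explicit instantiation are only there so the generated code can be inspected; they are not from the original post), where both pointers are parameters of the function doing the copy:
#include <cstddef>  // size_t
#include <cstring>  // memcpy

template <size_t S>
void foo_params(void* dstMemory, const void* srcMemory) {
    // With S = 4 and both buffers visible as parameters, clang++ 3.5.0
    // at -O3 lowers this to a single 32-bit load/store (movl) instead
    // of emitting a call to memcpy.
    memcpy(dstMemory, srcMemory, S);
}

// Explicit instantiation so the emitted assembly for S = 4 can be examined:
template void foo_params<4>(void*, const void*);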