Do strncpy/memcpy/memmove copy the data byte by byte or in another, more efficient way?

Question

5.1k views Asked by Leon At 22 January 2019 at 10:48

As we know, on a computer with multi-byte words such as x86/x86_64, it is more efficient to copy or move a large block of memory word by word (4 or 8 bytes per step) than byte by byte.

I'm curious which approach strncpy/memcpy/memmove take, and how they deal with memory word alignment.

char buf_A[8], buf_B[8];

// I often want to write this
*(double*)buf_A = *(double*)buf_B;

// instead of this
strcpy(buf_A, buf_B);
// but it hurts the readability of my code.

There are 6 answers

AudioBubble On 22 January 2019 at 10:51

From cppreference:

Copies count bytes from the object pointed to by src to the object pointed to by dest. Both objects are reinterpreted as arrays of unsigned char.

NOTES

std::memcpy is meant to be the fastest library routine for memory-to-memory copy. It is usually more efficient than std::strcpy, which must scan the data it copies or std::memmove, which must take precautions to handle overlapping inputs.

Several C++ compilers transform suitable memory-copying loops to std::memcpy calls.

Where strict aliasing prohibits examining the same memory as values of two different types, std::memcpy may be used to convert the values.

So it should be the quickest way to copy data. Be aware, however, that there are several cases where the behavior is undefined:

If the objects overlap, the behavior is undefined.

If either dest or src is a null pointer, the behavior is undefined, even if count is zero.

If the objects are potentially-overlapping or not TriviallyCopyable, the behavior of memcpy is not specified and may be undefined.
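
To make the overlap rule concrete, here is a minimal sketch. The helper name shift_right_by_two is hypothetical and not part of the quoted reference. The source and destination ranges lie inside the same buffer and overlap, so std::memmove is used; calling std::memcpy with these arguments would be undefined behavior.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies the first (s.size() - 2) characters two places
// to the right, inside the same buffer. Source and destination overlap, so
// std::memmove is required; std::memcpy would be undefined behavior here.
std::string shift_right_by_two(std::string s) {
    if (s.size() > 2) {
        std::memmove(&s[2], &s[0], s.size() - 2);
    }
    return s;
}
```

For the input "abcdef", the first two characters stay in place and "abcd" lands at offset 2, giving "ababcd".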

eerorika On 22 January 2019 at 11:10

Do strcpy/strncpy copy the data byte by byte or in another, more efficient way?

Neither the C++ nor the C standard specifies exactly how strcpy/strncpy are implemented. They only describe the behaviour.

There are multiple standard library implementations, and each chooses how to implement its functions. It is possible to implement both of those using memcpy. The standards don't describe the implementation of memcpy either, and the existence of multiple implementations applies to it just as well.

memcpy can be implemented taking advantage of full word copy. A short pseudocode of how memcpy could be implemented:

if len >= 2 * word size
    copy bytes until destination pointer is aligned to word boundary
    if len >= page size
        copy entire pages using virtual address manipulation
    copy entire words
copy the trailing bytes that are not aligned to word boundary
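
The pseudocode above could be sketched in C++ roughly as follows. This is an illustration under stated assumptions, not how any particular library does it: the name wordwise_copy is hypothetical, std::uintptr_t is assumed to match the machine word, and the page-level step is left out because it depends on operating system facilities.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative only: real library implementations are far more elaborate
// (hand-written assembly, SIMD, CPU dispatch). The name is hypothetical.
void* wordwise_copy(void* dest, const void* src, std::size_t len) {
    using word = std::uintptr_t;  // assumed to be the machine word
    auto* d = static_cast<unsigned char*>(dest);
    const auto* s = static_cast<const unsigned char*>(src);

    if (len >= 2 * sizeof(word)) {
        // 1. Copy single bytes until the destination is word-aligned.
        while (reinterpret_cast<std::uintptr_t>(d) % alignof(word) != 0) {
            *d++ = *s++;
            --len;
        }
        // 2. Copy whole words. The fixed-size std::memcpy calls sidestep
        //    aliasing and source-alignment problems; compilers typically
        //    turn each one into a plain load or store.
        while (len >= sizeof(word)) {
            word w;
            std::memcpy(&w, s, sizeof w);
            std::memcpy(d, &w, sizeof w);
            d += sizeof w;
            s += sizeof w;
            len -= sizeof w;
        }
    }
    // 3. Copy the trailing bytes that do not fill a whole word.
    while (len > 0) {
        *d++ = *s++;
        --len;
    }
    return dest;
}
```

Like memcpy, this sketch assumes the two ranges do not overlap.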

To find out how a particular standard library implementation implements strcpy/strncpy/memcpy, you can read the source code of the standard library - if you have access to it.

Furthermore, when the length is known at compile time, the compiler might not call the library memcpy at all and instead perform the copy inline. The documentation of your compiler will tell you whether it has built-in definitions for standard library functions.

Ruslan On 22 January 2019 at 11:21

In general, you don't have to think too much about how memcpy or other similar functions are implemented. You should assume they are efficient unless your profiling proves you wrong.

In practice it indeed is optimized nicely. See e.g. the following test code:

#include <cstring>

void test(char (&a)[8], char (&b)[8])
{
    std::memcpy(&a,&b,sizeof a);
}

Compiling it with g++ 7.3.0 with the command g++ test.cpp -O3 -S -masm=intel we can see the following assembly code:

test(char (&) [8], char (&) [8]):

    mov     rax, QWORD PTR [rsi]
    mov     QWORD PTR [rdi], rax
    ret

As you can see, the copy is not only inlined, but also collapsed into a single 8-byte read and write.

Oliv On 22 January 2019 at 11:51

In this case you may prefer to use memcpy as this is the equivalent of *(double*)buf_A = *(double*)buf_B; without undefined behavior.
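
A minimal sketch of that replacement, assuming 8-byte buffers as in the question. The helper names copy8 and load_double are hypothetical.

```cpp
#include <cstring>

// Copies the 8 bytes of buf_B into buf_A, like the cast-and-assign in the
// question, but without accessing a char array through a double lvalue.
void copy8(char (&buf_A)[8], const char (&buf_B)[8]) {
    std::memcpy(buf_A, buf_B, sizeof buf_A);
}

// Reads the bytes as a double when the value itself is needed.
// Assumes sizeof(double) == 8, which holds on x86/x86_64.
double load_double(const char (&buf)[8]) {
    static_assert(sizeof(double) == 8, "sketch assumes an 8-byte double");
    double d;
    std::memcpy(&d, buf, sizeof d);
    return d;
}
```

Because the size is a compile-time constant, an optimizing compiler is free to turn each of these calls into a single 8-byte load and store.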

You should not worry about calling memcpy, because by default the compiler assumes that a call to memcpy has the meaning defined by the C library. So, depending on the types of the arguments and/or whether the size of the copy is known at compile time, the compiler may choose not to call the C library function and instead inline a better-suited copy strategy. On gcc you can disable this behavior with the -fno-builtin compiler option.

This replacement of the memcpy call by the compiler is desirable, because the library memcpy checks the size and the alignment of the pointers in order to pick the most efficient copy strategy (it may copy anything from blocks as small as a single char up to very large blocks using AVX-512 instructions, for example). These checks, as well as the call to memcpy itself, have a cost.

Also, if you are looking for efficiency, you should be concerned about memory alignment. So you may want to declare the alignment of your buffer:

alignas(8) char buf_A[8];

Victor Gubin On 22 January 2019 at 12:16

It depends on the compiler and the C run-time library you are using. In most cases, string.h functions like memcmp, memcpy, strcpy, memset, etc. are implemented in assembly, optimized for the CPU.

You can find the GNU libc implementations of those functions for the AMD64 architecture. As you can see, they may use SSE or AVX instructions to copy 128 to 512 bits per iteration. Microsoft also bundles the source code of its CRT with Visual Studio (mostly the same approaches; MMX, SSE, and AVX loops are supported).

Compilers also apply special optimizations to such functions; GCC calls them builtins, other compilers call them intrinsics. That is, the compiler may choose either to call a library function or to generate CPU-specific code that is optimal for the current context. For example, when the size argument of memcpy is a constant, as in memcpy(dst, src, 128), the compiler may generate inline code (something like mov rcx, 16; cld; rep movsq), and when it is a variable, as in memcpy(dst, src, bytes), the compiler may insert a call to the library function (something like call _memcpy).
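
The two cases can be compared with a pair of small functions such as the ones below. The names copy_fixed and copy_variable are hypothetical, and what the compiler actually emits depends on the compiler, its version, and its flags. Compiling with something like g++ -O3 -S -masm=intel (the command used in Ruslan's answer) and reading the generated assembly shows which choice was made.

```cpp
#include <cstddef>
#include <cstring>

// Size is a compile-time constant: an optimizing compiler may expand the
// copy inline instead of calling the library.
void copy_fixed(char* dst, const char* src) {
    std::memcpy(dst, src, 128);
}

// Size is only known at run time: the compiler may emit a call to the
// library memcpy instead.
void copy_variable(char* dst, const char* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}
```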

**Leon** · Accepted Answer · 2019-01-22T15:14:03+00:00

I think all of the opinions and advice on this page are reasonable, but I decided to try a little experiment.

To my surprise, the fastest method isn't the one we would expect in theory.

I tried the following code.

#include <cstdlib>   // EXIT_SUCCESS
#include <cstring>
#include <iostream>
#include <string>
#include <chrono>

using std::string;
using std::chrono::system_clock;

inline void mycopy( double* a, double* b, size_t s ) {
   while ( s > 0 ) {
      *a++ = *b++;
      --s;
   }
}

// checks that every bit of the buffer has been set
bool assertAllTrue( unsigned char* a, size_t s ) {
   unsigned char v = 0xFF;
   while ( s > 0 ) {
      v &= *a++;
      --s;
   }
   return v == 0xFF;
}

int main( int argc, char** argv ) {
   alignas( 16 ) char bufA[512], bufB[512];
   memset( bufB, 0xFF, 512 );  // no zero bytes, so strncpy does not stop prematurely
   system_clock::time_point startT;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      strncpy( bufA, bufB, sizeof( bufA ) );
   std::cout << "strncpy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      memcpy( bufA, bufB, sizeof( bufA ) );
   std::cout << "memcpy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      memmove( bufA, bufB, sizeof( bufA ) );
   std::cout << "memmove:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      mycopy( ( double* )bufA, ( double* )bufB, sizeof( bufA ) / sizeof( double ) );
   std::cout << "mycopy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   return EXIT_SUCCESS;
}

The result (one of many similar results):

strncpy:52840919, AllTrue:true

memcpy:57630499, AllTrue:true

memmove:57536472, AllTrue:true

mycopy:57577863, AllTrue:true

It looks like:

memcpy, memmove, and my own method give similar results.
What magic does strncpy do to make it the fastest, even faster than memcpy?

Funny, isn't it?
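
The experiment above times with system_clock, which is not guaranteed to be monotonic, and prints raw tick counts. As a hedged sketch, the timing part could be factored into a helper based on steady_clock. The name time_copies_us is hypothetical, and this does not explain the strncpy result; it only makes repeated measurements easier to compare.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper: runs `copy` `iterations` times and returns the elapsed
// time in microseconds, measured with the monotonic steady_clock.
template <typename Copy>
long long time_copies_us(Copy copy, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        copy();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

It could be called as time_copies_us([&] { memcpy(bufA, bufB, sizeof(bufA)); }, 1024 * 1024). Note that an optimizing compiler is allowed to drop repeated identical copies, so the generated assembly is worth checking before trusting any of the numbers.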
