I want to create a thread via Linux clone() and wait for it to finish. This seemingly simple case has turned out to be difficult because I don't know how to wait in the calling thread for the created thread to end. Linux wait() does not work for threads, only for processes. So I started studying how the pthread library is implemented in various libc implementations.
For example, here is a small program that calls pthread_join():
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void* waited_foo(void* p)
{
    (void)p;
    sleep(1); //EDIT
    printf("1111\n");
    return NULL;
}

int main(int argc, char* argv[])
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    if (pthread_create(&tid, &attr, &waited_foo, NULL))
    {
        fprintf(stderr, "Error creating thread\n");
        return 1;
    }
    pthread_join(tid, NULL); // block until waited_foo() has returned
    sleep(1);
    return 0;
}
I trace all the syscalls via strace:
strace -f -e trace=\!brk,mmap,mprotect,munmap,rt_sigprocmask ./a.out
On Alpine Linux with musl libc:
clone(child_stack=0x7f9f63666af8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000strace: Process 163 attached
, parent_tid=[163], tls=0x7f9f63666b38, child_tidptr=0x7f9f636fdf90) = 163
[pid 163] nanosleep({tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 162] futex(0x7f9f63666b70, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 163] <... nanosleep resumed>0x7f9f63666aa0) = 0
[pid 163] ioctl(1, TIOCGWINSZ, {ws_row=39, ws_col=231, ws_xpixel=0, ws_ypixel=0}) = 0
[pid 163] writev(1, [{iov_base="1111", iov_len=4}, {iov_base="\n", iov_len=1}], 21111) = 5
[pid 163] futex(0x7f9f63666b70, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 162] <... futex resumed>) = 0
[pid 163] <... futex resumed>) = 1
[pid 162] futex(0x7f9f636fdf90, FUTEX_WAIT, 163, NULL <unfinished ...>
[pid 163] exit(0) = ?
[pid 162] <... futex resumed>) = 0
[pid 163] +++ exited with 0 +++
nanosleep({tv_sec=1, tv_nsec=0}, 0x7fff336faca0) = 0
On Debian Linux with GNU libc:
clone(child_stack=0x7f4859865e30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f48598669d0, tls=0x7f4859866700, child_tidptr=0x7f48598669d0) = 11383
futex(0x7f48598669d0, FUTEX_WAIT, 11383, NULLstrace: Process 11383 attached
<unfinished ...>
[pid 11383] set_robust_list(0x7f48598669e0, 24) = 0
[pid 11383] nanosleep({tv_sec=1, tv_nsec=0}, 0x7f4859865d50) = 0
[pid 11383] write(1, "1111\n", 51111) = 5
[pid 11383] madvise(0x7f4859066000, 8368128, MADV_DONTNEED) = 0
[pid 11383] exit(0) = ?
[pid 11383] +++ exited with 0 +++
<... futex resumed> ) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffd9d9d3460) = 0
And now I really don't understand:
- Why does musl need two futex waits?
- Why does glibc use a single FUTEX_WAIT without any FUTEX_WAKE? What is the magic?
- Why does FUTEX_WAIT use the PID of the new thread?
Under Linux, the futex() system call is a sort of Swiss army knife, as it can perform various operations. In the context of the pthread library, it is used in conjunction with the clone() system call.
Behavior in Linux/GLIBC
The explanation below is based on the strace output obtained with the GNU C library (shown in the question).
In the GLIBC, pthread_create() calls clone() with the following flags: CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID. The last two flags are the ones to focus on to answer the question.
The manual describes CLONE_PARENT_SETTID as storing the child thread ID at the location pointed to by parent_tid in the parent's memory.
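In the strace output, the identifier of the newly created thread is stored at parent_tidptr=0x7f48598669d0.
The manual describes CLONE_CHILD_CLEARTID as clearing (zeroing) the child thread ID at the location pointed to by child_tid in the child's memory when the child exits, and doing a wakeup on the futex at that address.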
In other words, when the thread exits, the kernel resets the memory area pointed to by the parameter child_tidptr=0x7f48598669d0 and performs the equivalent of a futex() FUTEX_WAKE operation on it (this is the meaning of "and do a wakeup on the futex" in the manual).
Note that both flags refer to the same address, 0x7f48598669d0: the thread identifier is stored there at creation time and reset at thread exit time.
pthread_join() relies on this mechanism to wait for the end of the thread. It calls futex() with the FUTEX_WAIT operation. As the manual presents it, this operation checks that the memory area pointed to by the 1st parameter uaddr still contains the value given by the 3rd parameter val and, if so, suspends the caller until a FUTEX_WAKE operation is performed on that address. In your example, it checks that the value 11383 is at address 0x7f48598669d0. The value 11383 is nothing other than the thread identifier (at kernel level) returned by clone(). And the address 0x7f48598669d0 is the one passed to clone(), which the kernel resets and on which it performs a FUTEX_WAKE operation when the thread finishes.
That is why the strace output shows the FUTEX_WAIT operation resuming right after the termination (exit) of the thread. The FUTEX_WAKE operation is performed by the kernel, as requested by the CLONE_CHILD_CLEARTID flag passed to clone().
Behavior in Linux/MUSL
The explanation below is based on the strace output obtained with the MUSL C library (shown in the question).
In the source code, pthread_create() passes the following set of flags to clone(): CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED. Compared to GLIBC, there is one additional flag: CLONE_DETACHED (displayed as 0x400000 in the strace output). But according to the manual, this flag has been superseded by CLONE_THREAD.
At creation time, the detach_state field in the thread's descriptor is set to DT_JOINABLE (this is the default when nothing else is specified in the creation attributes passed to pthread_create()). The value of this constant is 2, as defined in the internal/pthread_impl.h file.
The thread entry function passed to clone() is defined in src/thread/pthread_create.c. It calls the entry point specified by the user through args->start_func(args->start_arg) and passes its result to the internal __pthread_exit(). This is where the detach_state field of the thread's descriptor is first reset to 0 (the value of DT_EXITED), after which futex() is called with the FUTEX_WAKE_PRIVATE operation and a count of 1 to wake up one waiting thread. The internal __wake() function hides the call to futex() with the FUTEX_WAKE(_PRIVATE) operation.
The service pthread_join(), defined in src/thread/pthread_join.c, calls futex() with the FUTEX_WAIT_PRIVATE operation and the current value of the detach_state field in the thread descriptor, that is to say 2, as this field is set to DT_JOINABLE at thread creation time. The call to futex() is done by __timedwait_cp(), defined in src/thread/__timedwait.c.
To sum up, on the newly created thread's side, the detach_state field of the thread's descriptor is first set to 2 (DT_JOINABLE) by the caller of pthread_create(). Once the entry point returns, the thread sets the field to 0 (DT_EXITED) and calls futex() with the FUTEX_WAKE_PRIVATE operation and the value 1 to wake up one thread waiting on this field.
With pthread_join(), a call to futex() with the FUTEX_WAIT_PRIVATE operation is made to wait as long as the detach_state field is equal to 2 (DT_JOINABLE). So, the service returns once the thread sets the field to 0 and calls FUTEX_WAKE_PRIVATE.
Conclusion
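As opposed to GLIBC, MUSL doesn't rely on the CLONE_PARENT_SETTID and CLONE_CHILD_CLEARTID flags for the main join handshake: it uses the detach_state field instead. Note, however, that the second futex() call in the MUSL trace (FUTEX_WAIT on 0x7f9f636fdf90 with the value 163) targets the child_tidptr address passed to clone(), so it appears to wait for the kernel's CLONE_CHILD_CLEARTID wake-up as well, presumably to make sure that the thread has really exited.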
So, why does the GLIBC use FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE, given that we are inside a single process whose threads share memory? I guess there are two reasons:
- clone() is general purpose: it is not limited to inter-thread synchronization and could also be used for inter-process synchronization. So, the kernel uses the non-private FUTEX_WAKE, and the waiting side has to use the matching FUTEX_WAIT.