Arguments of clone() system call
This post draws most of its content from Section 28.2 of The Linux Programming Interface.
The clone() system call has the following signature:
int clone(int (*func) (void *), void *child_stack, int flags, void *func_arg, ... /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
func is the address of the entry function, and func_arg is the argument passed to it; child_stack is the pointer from which the child's stack will be built (the top of the allocated block on architectures where the stack grows downward); flags is a bit mask that specifies the behavior of this clone operation, with the low byte holding the signal sent to the parent when the child terminates (typically SIGCHLD). We focus on some of the flags. Before delving into specific flags, it's useful to have a general understanding of what a process and a thread are in Linux. As far as I have read, the following quote from The Linux Programming Interface gives the most concise and accurate description:
At this point, it is worth remarking that, to some extent, we are playing with words when trying to draw a distinction between the terms thread and process. It helps a little to introduce the term kernel scheduling entity (KSE), which is used in some texts to refer to the objects that are dealt with by the kernel scheduler. Really, threads and processes are simply KSEs that provide for greater and lesser degrees of sharing of attributes (virtual memory, open file descriptors, signal dispositions, process ID, and so on) with other KSEs. The POSIX threads specification provides just one out of various possible definitions of which attributes should be shared between threads.
I could not describe the difference and relationship between process and thread better than this quote, so I will say no more about them. Let's go into some of the flags. Different flag combinations in the clone() call create KSEs that share resources with the calling KSE to different degrees and in different respects. In the following discussion we avoid the words process and thread to prevent ambiguity; instead, KSE denotes the entity created by clone().
CLONE_FILES
If specified, the returned KSE shares the calling KSE's file descriptor table, so a descriptor opened or closed by one of them is immediately opened or closed for the other. For example, if the calling KSE creates a new socket, it is automatically usable in the returned KSE. With this flag the two KSEs share not only the open file descriptions but also the file descriptors themselves. Note the difference: an open file description can be referenced by multiple file descriptors, in the same process or in different processes. If this flag is not specified, the returned KSE gets a copy of the calling KSE's file descriptor table, which increments the reference count of each open file description that the descriptors point to. In that scenario, two different file descriptors point to the same open file description (a system-wide resource) and share the properties it holds, such as the read/write position, but they are distinct descriptors: if one KSE closes its descriptor, the descriptor in the other KSE is still usable. With CLONE_FILES, the calling KSE and the returned KSE share one descriptor table rather than each holding a copy.
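The difference between a shared and a copied descriptor table can be observed directly. The following is a minimal sketch, not taken from the book: the helper names close_fd and fd_survives_child_close are made up for illustration, and /dev/null is just a convenient file to open. The child closes a descriptor number, and the parent then checks whether its own descriptor with that number is still open.

```cpp
#include <fcntl.h>
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

#include <csignal>
#include <vector>

// Child entry point (hypothetical helper): close the descriptor number that
// is passed through func_arg.
static int close_fd(void *arg) {
  return close(*static_cast<int *>(arg)) == 0 ? 0 : 1;
}

// Hypothetical helper. Returns 1 if the parent's descriptor is still open
// after the child closed the same descriptor number, 0 if it is gone, and
// -1 if the setup failed.
int fd_survives_child_close(int extra_flags) {
  int fd = open("/dev/null", O_RDONLY);
  if (fd == -1) return -1;

  // Child stack; the stack grows downward, so pass the end of the buffer.
  std::vector<char> stack(64 * 1024);
  pid_t pid = clone(close_fd, stack.data() + stack.size(),
                    extra_flags | SIGCHLD, &fd);
  if (pid == -1 || waitpid(pid, nullptr, 0) == -1) {
    close(fd);
    return -1;
  }

  // F_GETFD fails with EBADF once the descriptor number is no longer open.
  int still_open = fcntl(fd, F_GETFD) != -1 ? 1 : 0;
  if (still_open) close(fd);
  return still_open;
}
```

Calling fd_survives_child_close(0) exercises the copied table, so the parent's descriptor should survive; calling it with CLONE_FILES exercises the shared table, so the child's close() should also remove the parent's descriptor.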
CLONE_FS
If specified, the calling KSE and the returned KSE share the current working directory and root directory (and the umask). If either of them changes one of these values, the other sees the change. Again, if the flag is not specified, the returned KSE gets a copy of the calling KSE's values, and from then on each has its own working directory and root directory, with changes in one not affecting the other.
CLONE_VM
If specified, the calling KSE and the returned KSE share the same virtual memory (the same page tables), so a write by one is visible to the other. Otherwise, the returned KSE gets a copy of the calling KSE's virtual memory, as with fork().
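A small experiment makes the effect visible. This is a sketch under stated assumptions: shared_counter, bump_counter and counter_after_child are illustrative names that do not come from the book. The child writes to a global variable, and the parent reads it after the child has terminated.

```cpp
#include <sched.h>
#include <sys/wait.h>

#include <csignal>
#include <vector>

// Global written by the child; volatile so the parent always re-reads it.
static volatile int shared_counter = 0;

// Child entry point (hypothetical helper): modify the global and exit.
static int bump_counter(void *) {
  shared_counter = 42;
  return 0;
}

// Hypothetical helper. Returns the value the parent sees in shared_counter
// after the child has run, or -1 if clone()/waitpid() failed.
int counter_after_child(int extra_flags) {
  shared_counter = 0;

  // Child stack; the stack grows downward, so pass the end of the buffer.
  std::vector<char> stack(64 * 1024);
  pid_t pid = clone(bump_counter, stack.data() + stack.size(),
                    extra_flags | SIGCHLD, nullptr);
  if (pid == -1) return -1;
  if (waitpid(pid, nullptr, 0) == -1) return -1;

  return shared_counter;
}
```

Without CLONE_VM the child modifies its own copy, so the parent should still read 0; with CLONE_VM both KSEs use the same memory, so the parent should read the value written by the child.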
CLONE_SIGHAND
If specified, the calling KSE and the returned KSE share the same table of signal dispositions. If not specified, the returned KSE gets a copy of the calling KSE's current dispositions, and a later change made by either one is not seen by the other.
Pending signals and signal masks are NOT shared between the calling KSE and the returned KSE, even if this flag is specified; they are specific to each KSE.
Imagine that this flag is specified and both KSEs share the same signal handler. When one of them changes the handler, so that the handler address changes, what happens if the other KSE receives the signal and needs to call this handler (which now points to a different address)? This only works if the two KSEs share the same virtual address space. Say one of the KSEs loads a library into its virtual memory and sets the handler to an address in that region: if the two KSEs share virtual memory, the other one can safely call the handler; otherwise a segmentation fault is the likely outcome. So if CLONE_SIGHAND is specified, CLONE_VM must also be specified.
CLONE_THREAD
If specified, the returned KSE has the same thread group ID as the calling KSE; otherwise a new thread group ID is created for the returned KSE. The thread group ID is the same thing as the process ID. The following diagram illustrates the relationship between different KSEs and what a POSIX thread is:
[Figure: POSIX thread, KSE, PID/TGID/TID relationship]
There are some key points about the effect of this flag:
- KSEs created with the CLONE_THREAD flag can be called threads.
- No signal is sent to the calling KSE when a thread terminates, so a thread cannot be waited on like a process; the right way to wait for a thread to terminate is through join() semantics. The cornerstone of join is the futex, which we discuss under futex. How join works and behaves will be covered in another post. For now, we need to know that a thread created with CLONE_THREAD cannot be waited on using wait()/waitpid(); join must be used instead.
- When all KSEs inside one TGID (PID) terminate, a SIGCHLD signal is sent to the parent process of this TGID.
- If any thread inside a thread group calls exec(), all other threads except the thread group leader are terminated and the new program is executed in the thread group leader.
- If any thread inside a thread group calls fork(), any other thread in this thread group can call wait() on the resulting child.
  - fork() only involves the calling thread; apart from the wait() operation above, the other threads have little to do with the forked process.
- If CLONE_THREAD is specified, CLONE_SIGHAND must also be specified, which in turn requires CLONE_VM.
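The two dependency rules (CLONE_SIGHAND needs CLONE_VM, and CLONE_THREAD needs CLONE_SIGHAND) are enforced by the kernel, which rejects an inconsistent flag set with EINVAL. The sketch below probes this; clone_errno and do_nothing are hypothetical helper names. Because the helper reaps the child with waitpid(), it is only meant for rejected combinations or for valid combinations without CLONE_THREAD.

```cpp
#include <sched.h>
#include <sys/wait.h>

#include <cerrno>
#include <csignal>
#include <vector>

// Child entry point (hypothetical helper): exit immediately.
static int do_nothing(void *) { return 0; }

// Hypothetical helper. Returns 0 if clone() accepted the flag set, otherwise
// the errno value reported by clone().
int clone_errno(int flags) {
  // Child stack; the stack grows downward, so pass the end of the buffer.
  std::vector<char> stack(64 * 1024);
  pid_t pid =
      clone(do_nothing, stack.data() + stack.size(), flags | SIGCHLD, nullptr);
  if (pid == -1) return errno;

  // Reap the child so that its stack can be released safely.
  waitpid(pid, nullptr, 0);
  return 0;
}
```

For instance, clone_errno(CLONE_SIGHAND) and clone_errno(CLONE_THREAD | CLONE_VM) should both report EINVAL, while clone_errno(CLONE_VM | CLONE_SIGHAND) should succeed.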
CLONE_PARENT_SETTID/CLONE_CHILD_SETTID/CLONE_CHILD_CLEARTID
These flags exist to support POSIX threads.
- CLONE_PARENT_SETTID: clone() writes the ID of the returned KSE to the location given by the argument pid_t *ptid. The value is the same as the return value of clone().
  - As the book describes it, the value is written before the virtual memory is duplicated, so even if CLONE_VM is not specified, both child and parent can see the newly created ID. (CLONE_VM is required for POSIX threads anyway.)
  - Getting the ID through the argument differs from getting it through the return value. For example, if the returned KSE terminates before the parent has had a chance to assign the return value to a variable, and the parent's SIGCHLD handler uses that variable, the ID is invalid because it has not been assigned yet. If the ID is acquired through ptid, the write to ptid is done before clone() returns, so the parent's SIGCHLD handler can safely use it.
- CLONE_CHILD_SETTID: clone() writes the ID of the newly created KSE into the child's memory at the location given by the argument pid_t *ctid. Note that if CLONE_VM is specified, the write is visible to the parent as well. Since POSIX threads require CLONE_VM, CLONE_PARENT_SETTID and CLONE_CHILD_SETTID overlap in functionality for a POSIX thread implementation.
- CLONE_CHILD_CLEARTID: the kernel zeros the memory pointed to by pid_t *ctid when the child terminates.
pthread_join under the hood
On Linux, pthread_join/pthread_create are implemented on top of these three flags. When pthread_create creates a thread, CLONE_PARENT_SETTID and CLONE_CHILD_CLEARTID are used, and pid_t *ptid and pid_t *ctid are set to point to the same location. CLONE_CHILD_SETTID is irrelevant because POSIX threads require CLONE_VM. The kernel does the following trick to support POSIX threads:
- The kernel treats the memory pointed to by pid_t *ptid and pid_t *ctid as a futex.
- When pthread_join joins a thread, it actually performs FUTEX_WAIT on this futex; if the word still holds the thread ID (the thread has not terminated yet), the calling thread is blocked.
- When the KSE terminates, since CLONE_CHILD_CLEARTID was specified, the futex word is cleared and FUTEX_WAKE is performed on it, which wakes up the thread waiting on this futex. This mechanism ensures that the thread calling pthread_join is blocked until the joined thread terminates.
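The futex handshake described above can be reproduced without pthreads. The following is a minimal sketch, not glibc's actual implementation: ThreadReport, report_ids and spawn_and_join are hypothetical names, and CLONE_SETTLS is deliberately left out because it needs a valid TLS descriptor, so the new KSE simply reuses the caller's thread-local storage for the few instructions it runs. ptid and ctid point to the same word, and the caller "joins" by waiting on that word until the kernel clears it.

```cpp
#include <linux/futex.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <vector>

// Illustrative record of what the caller and the new KSE observed.
struct ThreadReport {
  pid_t clone_result = -1;         // value returned by clone() in the caller
  pid_t child_tgid = -1;           // getpid() as seen inside the new KSE
  pid_t child_tid = -1;            // gettid() as seen inside the new KSE
  pid_t tid_word_after_join = -1;  // futex word once the join has finished
};

// Entry point of the new KSE (hypothetical helper): record its IDs and exit.
static int report_ids(void *arg) {
  auto *report = static_cast<ThreadReport *>(arg);
  report->child_tgid = static_cast<pid_t>(syscall(SYS_getpid));
  report->child_tid = static_cast<pid_t>(syscall(SYS_gettid));
  return 0;
}

// Hypothetical helper: create a thread-like KSE and join it through a futex.
ThreadReport spawn_and_join() {
  ThreadReport report;
  pid_t tid_word = 0;  // ptid and ctid both point here
  std::vector<char> stack(64 * 1024);

  const int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                    CLONE_THREAD | CLONE_SYSVSEM | CLONE_PARENT_SETTID |
                    CLONE_CHILD_CLEARTID;
  report.clone_result = clone(report_ids, stack.data() + stack.size(), flags,
                              &report, &tid_word, nullptr, &tid_word);
  if (report.clone_result == -1) return report;

  // "join": sleep while the word still holds the thread ID. The kernel
  // zeroes the word and performs a FUTEX_WAKE on it when the KSE exits.
  for (;;) {
    pid_t seen = __atomic_load_n(&tid_word, __ATOMIC_ACQUIRE);
    if (seen == 0) break;
    syscall(SYS_futex, &tid_word, FUTEX_WAIT, seen, nullptr, nullptr, 0);
  }

  report.tid_word_after_join = tid_word;
  return report;
}
```

After spawn_and_join() returns, child_tgid should equal the caller's getpid() (same thread group), child_tid should equal the value returned by clone(), and the futex word should be 0. The non-private FUTEX_WAIT is used on purpose, since the wake-up issued by the kernel at exit is not a process-private one.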
CLONE_SETTLS
If specified, the argument struct user_desc *tls describes the thread-local storage of the newly created KSE. This storage is intended to be used only by that KSE.
Use clone to implement fork and POSIX threads
fork and POSIX threads can be implemented by clone with different flags specified:
- fork corresponds to the flag combination: SIGCHLD (the combination CLONE_VM | CLONE_VFORK | SIGCHLD corresponds to vfork, not fork)
- POSIX threads correspond to the flag combination: CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND | CLONE_THREAD | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
// Demonstrate the use of clone(..) to emulate fork and std::thread
#include <chrono>
#include <csignal>
#include <cstddef>
#include <ctime>
#include <iostream>
#include <mutex>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>
#include <vector>

#define STACK_SIZE 65536

// _exit() is async-signal-safe, unlike exit()
void sig_handler(int) { _exit(0); }

// mutex to synchronize output
std::mutex mtx;

// entry function for clone(..)
int clone_func(void *) {
  for (;;) {
    {
      std::lock_guard<std::mutex> lck{mtx};
      std::cout << "clone thread: " << getpid() << std::endl;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}

// level two std::thread function, used to demonstrate that even when threads
// are created nested, they are peers of the thread that created them
void level_two() {
  for (;;) {
    {
      std::lock_guard<std::mutex> lck{mtx};
      std::cout << "POSIX thread, id: " << std::this_thread::get_id()
                << std::endl;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}

// level one std::thread function
void level_one() {
  auto t = std::thread(level_two);
  t.join();
}

// locations passed to clone(..) as ctid and ptid
pid_t child_pid;
pid_t parent_pid;

int main(int argc, char *argv[]) {
  std::signal(SIGINT, sig_handler);
  // Stack for the new thread
  char *stack;
  // Top of the stack
  char *stackTop;
  pid_t pid;
  // Allocate memory for the stack
  stack = (char *)malloc(STACK_SIZE);
  if (stack == NULL) {
    exit(EXIT_FAILURE);
  }
  // Calculate the top of the stack (the stack grows downward)
  stackTop = stack + STACK_SIZE;
  // use `ps --pid <pid> -O tid,lwp,nlwp -L` to see the difference
  if (argc > 1) {
    // Emulate a POSIX thread, like std::thread: CLONE_THREAD prevents the
    // creation of a new thread group ID (the same as process ID), so this
    // KSE is a peer of the threads created by std::thread.
    // CLONE_SETTLS is left out in this demo: it needs a valid TLS descriptor
    // in the tls argument, and passing NULL would leave the new KSE without
    // usable thread-local storage. The new KSE reuses the caller's TLS.
    pid = clone(clone_func, stackTop,
                CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND |
                    CLONE_THREAD | CLONE_PARENT_SETTID |
                    CLONE_CHILD_CLEARTID | CLONE_SYSVSEM,
                NULL, &parent_pid, NULL, &child_pid);
    if (pid == -1) {
      exit(EXIT_FAILURE);
    }
  } else {
    // Emulate fork: a new thread group ID, aka a new process ID, is created
    pid = clone(clone_func, stackTop, SIGCHLD, NULL);
    if (pid == -1) {
      exit(EXIT_FAILURE);
    }
  }
  {
    std::lock_guard<std::mutex> lck{mtx};
    printf("Parent process: Created child thread with PID = %d\n", pid);
    printf("Parent process: PID = %d\n", getpid());
  }
  // standard POSIX conforming threads
  std::vector<std::thread> threads;
  for (int i = 0; i < 2; ++i) {
    threads.push_back(std::thread(level_one));
  }
  // wait for signals
  pause();
  return 0;
}