Handling timeouts in child processes

Building a mutation testing tool comes with many challenges, and sandboxing is one of them. We trust the original code of a program, but we cannot trust mutated code: a mutant can crash or run into an infinite loop. The most obvious solution is to run the mutated code in a child process and limit its execution time.

In this article I want to describe several approaches to handling timeouts in child processes. Please let me know if you see any flaws in these solutions.

Timer, Worker, and Watchdog

I have found one of the solutions on the internet. I find it very elegant!

The parent process, the watchdog, forks two processes: a timer and a worker. The timer sleeps for some time, while the worker does its job. The watchdog waits for either of them to finish. If the timer finishes first, the worker has timed out. If the worker finishes first, it completed in time.

Here is an illustration of this idea:

It looks very straightforward, but there are a few more details when it comes to the implementation. Let’s look at them (the full code listing is available online).

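#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*work_t)(void);

/// Defined further below
pid_t waitpid_eintr(void);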

void watchdog_worker_timer(work_t work) {
  const pid_t timer_pid = fork();
  if (timer_pid == -1) {
    perror("fork timer");
    abort();
  }

  if (timer_pid == 0) {
    /// Timer process
    sleep(1);
    exit(0);
  }

  const pid_t worker_pid = fork();
  if (worker_pid == -1) {
    perror("fork worker");
    abort();
  }
  if (worker_pid == 0) {
    /// Worker process
    work();
    exit(0);
  }

  const pid_t finished_first = waitpid_eintr();
  if (finished_first == timer_pid) {
    printf("timed out\n");
    kill(worker_pid, SIGKILL);
  } else if (finished_first == worker_pid) {
    printf("all good\n");
    kill(timer_pid, SIGKILL);
  } else {
    assert(0 && "Something went wrong");
  }

  waitpid_eintr();
}

This function does exactly what is described above. It takes a pointer to a function that does the actual work and sets the timeout to 1 second. The tricky part, however, is the call to waitpid_eintr. Here is its body:

pid_t waitpid_eintr(void) {
  pid_t pid = 0;
  while ((pid = waitpid(WAIT_ANY, NULL, 0)) == -1) {
    if (errno == EINTR) {
      continue;
    } else {
      perror("waitpid");
      abort();
    }
  }
  return pid;
}

A call to waitpid can fail for many reasons. One that is likely to happen is EINTR, or ‘interrupted function call’. You can get more details here and from the man pages: man 2 intro on macOS and man 3 errno on Linux.

In our case there is no need for any special treatment: we just keep calling waitpid until it succeeds or fails for some other reason.
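The watchdog above prints "all good" whenever the worker finishes first, even if the worker crashed. Since a mutant can crash, it may be useful to look at the exit status as well. Here is a minimal sketch of such a variant, assuming the watchdog has no other children. The names outcome_t, wait_any_child, fork_or_abort, run_with_timeout, and the three sample workers are hypothetical and are not part of the original listing. The timeout is also passed as a parameter instead of being hardcoded.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*work_t)(void);

/* Hypothetical outcome codes, introduced for this sketch only. */
typedef enum { WORK_OK, WORK_FAILED, WORK_TIMED_OUT } outcome_t;

/* Wait for any child, retrying on EINTR; the wait status goes to *status. */
static pid_t wait_any_child(int *status) {
  pid_t pid;
  while ((pid = waitpid(-1, status, 0)) == -1) {
    if (errno != EINTR) {
      perror("waitpid");
      abort();
    }
  }
  return pid;
}

static pid_t fork_or_abort(const char *what) {
  fflush(NULL); /* do not let the child inherit unflushed stdio output */
  const pid_t pid = fork();
  if (pid == -1) {
    perror(what);
    abort();
  }
  return pid;
}

/* Assumption: the calling process has no other children to reap. */
outcome_t run_with_timeout(work_t work, unsigned timeout_seconds) {
  const pid_t timer_pid = fork_or_abort("fork timer");
  if (timer_pid == 0) {
    sleep(timeout_seconds);
    exit(0);
  }

  const pid_t worker_pid = fork_or_abort("fork worker");
  if (worker_pid == 0) {
    work();
    exit(0);
  }

  int status = 0;
  outcome_t outcome;
  const pid_t finished_first = wait_any_child(&status);
  if (finished_first == timer_pid) {
    kill(worker_pid, SIGKILL);
    outcome = WORK_TIMED_OUT;
  } else {
    assert(finished_first == worker_pid);
    kill(timer_pid, SIGKILL);
    /* Anything but a normal exit with code 0 counts as a failure. */
    outcome = (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? WORK_OK
                                                              : WORK_FAILED;
  }

  wait_any_child(&status); /* reap the process we have just killed */
  return outcome;
}

/* Sample workers for trying the sketch out. */
void quick_work(void) {}
void crashing_work(void) { abort(); }
void endless_work(void) { for (;;) pause(); }
```

With these workers, run_with_timeout(quick_work, 1) should report WORK_OK, crashing_work should report WORK_FAILED because the child is terminated by SIGABRT, and endless_work should report WORK_TIMED_OUT after about one second.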

This solution is elegant and easy to understand, though it has one disadvantage: we need to create an extra process, which wastes system resources. Fortunately, there is another approach.

System timers
