【Paper】KRR: Efficient and Scalable Kernel Record Replay

Introduction

Different RRs

Modern operating system kernels like Linux are enormous and prone to contain bugs. In order to locate these bugs, developers created Virtual Machine Record Replay (VM-RR) to record a faulty kernel execution, and then replay it to identify the bug. However, current implementations of RR are for either application or whole VMs, but there is no dedicated RR for kernels.

Kernel Record Replay

Why do we need a kernel-only RR mechanism? The key reason is that kernel developers are only concerned the kernel itself, not the whole VM. Traditional whole-VM RR records the execution of the whole VM, which requires the execution to be serial (i.e., no parallelization). Serializing the execution effectively imposes a big kernel lock, disabling parallelization and degrading performance to that of a single core or even worse.

To better understand the performance cost, it is crucial to distinguish between parallelization and concurrency. Parallelization means two or more threads progress at the same time, usually on different CPU cores; while concurrency means one thread can run on a CPU core while another thread has started but not finished. Generally, parallelization implies concurrency.

KRR An overview of KRR. But I won't go through all the details here.

Based on this idea, Kernel Record Replay (KRR) records only the kernel execution. This yields much better performance because of two key reasons:

First, KRR requires only the kernel to be serial, allowing user-space execution to remain parallelized. This is an advantage over VM RR because a vCPU runs in user-space most of the time; thus, KRR avoids the overhead incurred by serial user-space execution.

Second, the authors stated that data centers increasingly use kernel-bypassing techniques to avoid kernel overhead. This is also an advantage for KRR because it does not have to record these I/O operations.

In summary, by focusing only on kernel execution, we can reduce the record scope to kernel only and thereby avoid the overhead of serializing user-space execution.

Evaluation

To be an effective RR, KRR must satisfy two criteria:

  1. It must reproduce kernel bugs, and
  2. It must have low record overhead and fast replay.

For bug reproducibility, the authors reproduced 16 out of 17 randomly chosen kernel bugs and successfully reproduced all 5 chosen CVEs. These result confirms the effectiveness of bug reproduction.

Regarding record overhead, KRR incurs 1.52x to 2.79x slowdown compared to native execution, while VM-RR's slowdown is much higher: 8.97x to 29.94x. This clearly shows that we can significantly reduce the overhead.

Ideas

More Fine-grained RR

This paper earned three USENIX badges and the source code is available. Looking at the repository, one of my lab members noted that this project took a PhD student more than two years to complete. It's likely that a much more fine-grained RR would require even more development time. This leads to a key research question: How can we develop a complex RR mechanism more effortlessly?

Why We Can't Record True Parallel Execution?

The biggest challenge for KRR and VM-RR is parallelism. They must serialize the execution of their record target. This begs a natural question would be:

Do we truly need serialization, or can we record truly parallel execution?

I suspect we can't record truly parallel execution. Even two cores reach the same record point at the same time, there's no guarantee that they would get exact same timestamp. This problem looks like it might be a undecidable problem.