This is a detailed writeup from Project Zero about a bug in the Linux kernel, introduced by an optimization that overlooked an edge case.
A quick summary of the relevant OS concepts: Virtual Memory Areas (VMAs), which include the code segment, data segment, and heap, have special rules[1] that must be applied when a page fault occurs. Each VMA is described by a vm_area_struct; these are stored in a red-black tree, and recently used ones may also be cached in a per-thread 4-entry array. Whenever a VMA is freed in any thread of the process, all of these per-thread caches must be invalidated.
To avoid having to do this expensive operation, each cache is tagged with a seqnum equal to the seqnum of the per-process mm_struct at the time the cache was last synced. Freeing a VMA increments the mm_struct seqnum; when the two seqnums are out of sync, we know to invalidate the cache. Because the seqnum can overflow, though, we still need to flush all the caches every time that happens.
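A minimal sketch of the seqnum tagging described above, in plain userspace C. The struct and function names here (struct mm, struct thread_cache, mm_invalidate, cache_lookup) are simplified stand-ins invented for illustration, not the kernel's actual definitions, and the overflow flush is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 4  /* the per-thread cache holds four entries */

/* Illustrative stand-ins; the real kernel structures are far larger. */
struct vma { unsigned long start, end; };

struct mm {                /* per-process */
    uint32_t seqnum;
};

struct thread_cache {      /* per-thread */
    uint32_t seqnum;
    struct vma *slots[CACHE_SLOTS];
};

/* Freeing a VMA costs one increment instead of a walk over every thread. */
static void mm_invalidate(struct mm *mm)
{
    assert(mm);
    mm->seqnum++;
}

/* The cache is trusted only while its tag matches the process-wide counter. */
static struct vma *cache_lookup(struct mm *mm, struct thread_cache *tc,
                                unsigned long addr)
{
    assert(mm && tc);
    if (tc->seqnum != mm->seqnum) {
        /* Stale: drop every slot and resync the tag. */
        memset(tc->slots, 0, sizeof(tc->slots));
        tc->seqnum = mm->seqnum;
        return NULL;  /* caller falls back to the slow tree lookup */
    }
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct vma *v = tc->slots[i];
        if (v && v->start <= addr && addr < v->end)
            return v;
    }
    return NULL;
}
```

The point of the design is that invalidation is lazy: each thread pays for clearing its own cache only the next time it does a lookup.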
In the patch in question, someone added the further optimization of skipping this flush if the process is single-threaded. They reasoned that the thread's next VMA lookup would bring the cache seqnum back in sync with the mm_struct seqnum anyway, avoiding the need for the flush.
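However, this introduced a (normally) rather unlikely bug: if a second thread is created immediately after the mm_struct seqnum wraps to zero, before the first thread has resynced its cache, the first thread's cache still carries the seqnum 0xffffffff. The second thread can then push the mm_struct seqnum back up to 0xffffffff, making the stale cache in the first thread, dangling pointers included, appear valid again.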
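The sequence can be replayed in a toy userspace model. This is only a sketch under stated assumptions: the names (struct mm, users, holds_freed_vma, replay) are hypothetical, and the roughly 2^32 invalidations the second thread would need are fast-forwarded by setting the counter directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct mm { uint32_t seqnum; int users; };   /* users: number of threads */
struct thread_cache { uint32_t seqnum; bool holds_freed_vma; };

/* Bump the counter; on wraparound, flush every cache unless single-threaded. */
static void mm_invalidate(struct mm *mm, struct thread_cache *caches, int n)
{
    assert(mm && caches);
    mm->seqnum++;
    if (mm->seqnum != 0)
        return;                      /* no wrap, lazy invalidation suffices */
    if (mm->users == 1)
        return;                      /* the shortcut: skip the flush */
    for (int i = 0; i < n; i++)
        caches[i].holds_freed_vma = false;   /* flush: drop cached pointers */
}

/* Returns true if thread A ends up trusting a cache with a dangling entry. */
static bool replay(int users_at_wrap)
{
    struct mm mm = { .seqnum = UINT32_MAX, .users = users_at_wrap };
    struct thread_cache a = { .seqnum = UINT32_MAX, .holds_freed_vma = false };

    /* Thread A frees a VMA that is still in its cache; the counter wraps to 0. */
    a.holds_freed_vma = true;
    mm_invalidate(&mm, &a, 1);

    /* A second thread appears before A performs another lookup. */
    mm.users = 2;

    /* Fast-forward: thread B drives the counter almost all the way around. */
    mm.seqnum = UINT32_MAX - 1;
    mm_invalidate(&mm, &a, 1);       /* now 0xffffffff again, no wrap */

    /* A's tag matches once more, so its cache would pass the validity check. */
    return a.seqnum == mm.seqnum && a.holds_freed_vma;
}
```

In this model, replay(1) reports a trusted dangling entry, while replay(2), where the process already had two threads at the moment of the wrap, does not, because the flush runs.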
The authors exploit the bug by first doing a large number of VMA remappings to trigger the overflow, then creating a fake VMA using pointers from the deallocated one (leaked to the logs by a warning message) to bypass kernel controls. The fake VMA contains a fault handler that is invoked on page faults, which can be used, for instance, to run a binary with root privileges.
[1] http://students.mimuw.edu.pl/SO/Linux/Kod/include/linux/mm.h.html