DragonFly commits List (threaded) for 2004-11
cvs commit: src/sys/vm
dillon 2004/11/10 09:39:20 PST
DragonFly src repository
Fix a very serious bug in contigmalloc() which we inherited from FreeBSD-4.x.
The contigmalloc() code incorrectly assumes that a page in PQ_CACHE can
be reused without having to do any further checks and it unconditionally
busies and frees such pages, and assumes that the page becomes PQ_FREE even
though it might actually have gone to a PQ_HOLD state. Additionally, the
contigmalloc() code unconditionally sets m->object to NULL, ignoring the
fact that the page will be in the VM page bucket hash table if object
happens to not be NULL, leading to page bucket hash table corruption.
The fix is twofold. First, we add checks for m->busy, (m->flags & PG_BUSY),
m->wire_count, and m->hold_count and do not reuse a page with any of those
set. We do this for all pages, not just PQ_CACHE pages, though it is
believed that it only needs to be done for PQ_CACHE pages. Second, we
replace the m->object = NULL assignment with an assertion that it is
already NULL, since it had better be NULL and we cannot just set it to NULL
unconditionally without blowing up the VM page hash table.
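The two halves of the fix can be sketched in userland C as follows. This is only an illustration under stated assumptions, not the actual vm_contig.c change: struct vm_page_model, page_is_reusable(), try_claim_page() and the PG_BUSY value are hypothetical stand-ins for the kernel definitions, and the page is assumed to have already been freed from its object before it is claimed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel definitions. */
#define PG_BUSY 0x0001          /* illustrative value only */

struct vm_page_model {
    void *object;               /* owning VM object; non-NULL means hashed */
    unsigned short flags;       /* PG_* flags */
    unsigned short busy;        /* soft-busy count */
    unsigned int wire_count;    /* wirings outstanding */
    unsigned int hold_count;    /* holds outstanding */
};

/*
 * First half of the fix: a page that is busy, wired, or held must not
 * be reused, regardless of which queue it is on.
 */
static bool
page_is_reusable(const struct vm_page_model *m)
{
    if (m->busy || (m->flags & PG_BUSY))
        return false;
    if (m->wire_count || m->hold_count)
        return false;
    return true;
}

/*
 * Second half of the fix: assert that the page is no longer associated
 * with an object instead of blindly clearing m->object, which would
 * leave a stale entry in the page hash table.
 */
static bool
try_claim_page(struct vm_page_model *m)
{
    if (!page_is_reusable(m))
        return false;
    assert(m->object == NULL);
    return true;
}
```

A caller scanning a candidate range would abandon the range as soon as try_claim_page() returns false for any page in it.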
Symptoms of the bug include:
* Filesystem corruption, in particular with slower disk drivers (e.g.
the 'twe' driver), or in systems with drivers which use
contigmalloc() a lot (e.g. drivers that require bounce buffers).
This shows up as mangled directory entries, bad indirect blocks
(containing data instead of indirect block pointers), and files
containing other files' data.
* 'page not found in hash' panic.
This is the last major VM issue in DragonFly, one that has plagued
David Rhodus in particular (he is a heavy user of the 'twe' driver) for over
a year. I would never have found this bug if not for DR's persistence and
the dozens of kernel cores he was able to provide me over the last year. We
finally got a core with a 'smoking gun'. After we wrote a program
(/usr/src/test/debug/vmpageinfo.c) to run through all the VM pages and check
their hash table association for correctness, it became obvious that pages
were being reused without being removed from the hash table, which finally
led to contigmalloc*().
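The kind of consistency check described above can be modelled roughly as follows. This is not vmpageinfo.c, which inspects a real kernel. It is a self-contained sketch in which struct page_model, bucket_of(), NBUCKETS and the hash function itself are all hypothetical. The idea is simply that every page claiming an object must be reachable through its hash bucket, and every hashed page must still claim an object that maps to that bucket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8              /* hypothetical table size */

struct page_model {
    void *object;               /* owning object, or NULL if unassociated */
    size_t pindex;              /* page index within the object */
    struct page_model *hnext;   /* next page in the same hash bucket */
};

/* Hypothetical hash function; the kernel's differs. */
static size_t
bucket_of(const void *object, size_t pindex)
{
    return (((uintptr_t)object >> 4) + pindex) % NBUCKETS;
}

/* Is page m reachable through the bucket its (object, pindex) maps to? */
static bool
page_in_hash(struct page_model **buckets, const struct page_model *m)
{
    const struct page_model *scan;

    for (scan = buckets[bucket_of(m->object, m->pindex)]; scan != NULL;
        scan = scan->hnext) {
        if (scan == m)
            return true;
    }
    return false;
}

/* Count pages whose hash table association is inconsistent. */
static size_t
count_hash_errors(struct page_model **buckets,
    const struct page_model *pages, size_t npages)
{
    const struct page_model *scan;
    size_t errors = 0, i, b;

    assert(pages != NULL || npages == 0);

    /* A page that claims an object must be found in its bucket. */
    for (i = 0; i < npages; i++) {
        if (pages[i].object != NULL && !page_in_hash(buckets, &pages[i]))
            errors++;
    }
    /* A hashed page must still claim an object mapping to this bucket. */
    for (b = 0; b < NBUCKETS; b++) {
        for (scan = buckets[b]; scan != NULL; scan = scan->hnext) {
            if (scan->object == NULL ||
                bucket_of(scan->object, scan->pindex) != b)
                errors++;
        }
    }
    return errors;
}
```

In this model, clearing a hashed page's object pointer without unlinking it from its bucket, which is what the old contigmalloc() code effectively did, shows up as an error in the second pass.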
Many thanks to: David Rhodus! Free gift enclosed!
Revision  Changes    Path
1.11      +22 -14    src/sys/vm/vm_contig.c