[netsa-tools-discuss] segfaults in rwflowpack/rwflowappend

Michael Welsh Duggan mwd at cert.org
Wed Dec 3 17:43:16 EST 2014


We found it!  You were correct.  There is an extremely subtle bug
involving the redblack code and multithreading.  The redblack
implementation we are using uses a global null element (called RBNULL in
the code).  This null element, startlingly, is not necessarily
constant.  When deleting a node, the code can write the null element's
color and up pointer and then read them back.  If another thread is
deleting an element from a tree at the same time, both threads can be
modifying this null element at once.
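
To make the failure mode concrete, here is a minimal C sketch of a
sentinel-based deletion step.  It follows the textbook pattern, not the
actual redblack.c source: every name prefixed with sketch_ is
hypothetical, and the two "threads" are replayed in sequence on a single
thread so that the bad interleaving is deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified node; NOT the real redblack.c layout. */
struct sketch_node {
    struct sketch_node *left, *right, *up;
    int                 black;   /* 1 = black, 0 = red */
    int                 key;
};

/* One sentinel shared by every tree in the process (stands in for RBNULL). */
static struct sketch_node sketch_null = {
    &sketch_null, &sketch_null, &sketch_null, 1, 0
};

static void
sketch_init(struct sketch_node *n, struct sketch_node *up, int key)
{
    n->left = n->right = &sketch_null;
    n->up = up;
    n->black = 1;
    n->key = key;
}

/* First step of a textbook deletion for a node with at most one real
 * child: splice it out and return its replacement, which may be the
 * shared sentinel. */
static struct sketch_node *
sketch_unlink(struct sketch_node **rootp, struct sketch_node *y)
{
    struct sketch_node *x;

    assert(y != &sketch_null);
    x = (y->left != &sketch_null) ? y->left : y->right;

    x->up = y->up;      /* writes the shared sentinel when x == &sketch_null */
    if (y->up == &sketch_null) {
        *rootp = x;
    } else if (y == y->up->left) {
        y->up->left = x;
    } else {
        y->up->right = x;
    }
    return x;
}

/* Replays an interleaving that two threads could produce on two
 * unrelated trees.  Returns 1 if the first deletion would then follow
 * the sentinel's up pointer into the other tree. */
static int
sketch_interleaving_corrupts(void)
{
    struct sketch_node root_a, left_a, right_a;
    struct sketch_node root_b, left_b, right_b;
    struct sketch_node *tree_a = &root_a, *tree_b = &root_b;
    struct sketch_node *x_a;
    int corrupted;

    sketch_init(&root_a, &sketch_null, 20);
    sketch_init(&left_a, &root_a, 10);
    sketch_init(&right_a, &root_a, 30);
    root_a.left = &left_a;
    root_a.right = &right_a;

    sketch_init(&root_b, &sketch_null, 200);
    sketch_init(&left_b, &root_b, 100);
    sketch_init(&right_b, &root_b, 300);
    root_b.left = &left_b;
    root_b.right = &right_b;

    x_a = sketch_unlink(&tree_a, &left_a);   /* "thread 1", tree A */
    (void)sketch_unlink(&tree_b, &left_b);   /* "thread 2", tree B */

    /* Thread 1 now moves on to its rebalancing step and follows
     * x_a->up.  It expects root_a but finds root_b. */
    corrupted = (x_a == &sketch_null && x_a->up == &root_b);

    sketch_null.up = &sketch_null;   /* do not leave a dangling pointer */
    return corrupted;
}
```

In this sketch sketch_interleaving_corrupts() returns 1: the first
deletion's replacement node is the sentinel, and by the time it is
consulted again its up pointer refers to a node in the other tree.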

And here's the fun part: the bug can happen even if the deletions are
happening in different trees!  This is what seems to be happening in
this case.  The code in rwflowappend and the skpolldir code both do tree
deletions.  We haven't seen this bug before because it requires multiple
threads with different trees deleting items frequently enough that they
step on each other's toes.  Under most of our testing conditions, it is
likely that rwflowappend wasn't updating its tree frequently enough to
trigger this problem.

More below.

John Green <John.Green at jisc.ac.uk> writes:

> On Mon, 2014-12-01 at 14:13 -0500, Michael Welsh Duggan wrote:
>> In this case, the thread that modifies the tree is normally the only
>> thread that owns and modifies the tree.  The one possible counterexample
>> that we have found should only happen during a controlled shutdown.  I'm
>> assuming that you were not attempting to shut down the daemon when you
>> encountered this problem?
[...]
>> Thank you for experimenting.  If I could ask a few questions... 
>>  Where did you insert these mutexes?
> I was really just trying to eliminate this as the cause of the
> corruption.  I added a mutex to the struct rbtree in redblack.h and
> added lock/unlock to the prolog/epilogue of every "public" redblack
> function (rbsearch/rbdelete etc).   Sounds like this wasn't the cause
> though and it is more general memory corruption (tending to show up in
> the redblack structures due to their frequency of use?). 

Please note that using a mutex per tree won't fix this problem, since
the RBNULL entry is shared between trees.
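
As an illustration of why the lock has to be wider than one tree, the
sketch below serializes every sentinel-touching operation behind a
single process-wide mutex.  This is only one possible approach and is
not necessarily the fix being tested; the names sketch_sentinel_lock,
sketch_tree_op, sketch_locked_op and sketch_fake_delete are all
hypothetical, and the fake deletion merely counts calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical process-wide lock: one for ALL trees, because all trees
 * share the same sentinel. */
static pthread_mutex_t sketch_sentinel_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical signature for any tree operation that may read or write
 * the shared sentinel. */
typedef int (*sketch_tree_op)(void *tree, const void *key);

/* Run op on tree while holding the global lock, so that no two
 * operations, even on different trees, overlap. */
static int
sketch_locked_op(sketch_tree_op op, void *tree, const void *key)
{
    int rv;

    assert(op != NULL);
    pthread_mutex_lock(&sketch_sentinel_lock);
    rv = op(tree, key);
    pthread_mutex_unlock(&sketch_sentinel_lock);
    return rv;
}

/* Stand-in for a real deletion: treats the "tree" as a call counter. */
static int
sketch_fake_delete(void *tree, const void *key)
{
    (void)key;
    ++*(int *)tree;
    return 0;
}
```

A caller would route each deletion through the wrapper, for example
sketch_locked_op(sketch_fake_delete, &counter, NULL).  The cost is that
unrelated trees now contend on one lock; giving each tree its own
sentinel would avoid that, at the price of changing the tree code
itself.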

We are testing a fix for this, and intend to make a new SiLK release
that includes this fix before the end of the month.  Thank you for
reporting this bug, and for your careful analysis thereof.

-- 
Michael Welsh Duggan
(mwd at cert.org)

