[netsa-tools-discuss] segfaults in rwflowpack/rwflowappend

Mon Dec 1 14:13:48 EST 2014

Thank you for your interest in SiLK, and your informative bug report.
The rest of my reply is in-line.

John Green <John.Green at jisc.ac.uk> writes:

> Before I dig any deeper, does anyone else get very occasional segfaults
> in the redblack tree routines of rwflowpack and rwflowappend (silk-3.9.0
> and libfixbuf-1.6.1 compiled as per the spec file)?

We have yet to encounter any segfaults similar to the traces you include
in this email.

> Both look to be corrupted data structures - perhaps a thread
> synchronization issue?  Alternatively it could be an issue with my
> server.

It's hard to say for sure.  We looked carefully at the code related to
your backtraces, and in each case we believe the red-black tree involved
is either protected by a mutex or accessed from only a single thread.

> rwflowpack occurs more frequently (once every few days)
>
> /usr/sbin/rwflowpack --input-mode=fcfiles
> --sensor-configuration=/etc/silk/sensor.conf --pack-interfaces
> --flush-timeout=360 --file-cache-size=4096 --polling-interval=30
> --site-config-file=/etc/silk/silk.conf
> --incoming-directory=/netflow/silk/rwflowpack/incoming
> --output-mode=incremental-files
> --incremental-directory=/netflow/silk/rwflowpack/sender-incoming
> --pidfile=/netflow/silk/rwflowpack/log/rwflowpack.pid --log-level=debug
> --log-directory=/netflow/silk/rwflowpack/log --log-basename=rwflowpack
>
> (gdb) bt
> #0  __strcmp_sse2_unaligned ()
>     at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:30
> #1  0x00007fd023ac8bee in rb_traverse (insert=insert@entry=0, key=0x3c84e00,
>     rbinfo=rbinfo@entry=0x2526510) at redblack/redblack.c:264
> #2  0x00007fd023ac9017 in rbdelete (key=<optimized out>, rbinfo=0x2526510)
>     at redblack/redblack.c:177
> #3  0x00007fd023d1d850 in remove_unseen (pd=0x25266a0, pd=0x25266a0)
>     at skpolldir.c:294
> #4  pollDir (vpd=0x25266a0) at skpolldir.c:528
> #5  0x00007fd023d1e3be in sk_timer_thread (v_timer=0x3a80110)
>     at sktimer.c:193
> #6  0x00007fd0227e40a4 in start_thread (arg=0x7fd021db8700)
>     at pthread_create.c:309
> #7  0x00007fd022518ccd in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

In this case, the tree is normally owned and modified by a single
thread.  The one possible counterexample that we have found should only
happen during a controlled shutdown.  I assume you were not attempting
to shut down the daemon when you encountered this problem?

> rwflowappend has occurred once
> /usr/sbin/rwflowappend
> --incoming-directory=/netflow/silk/rwflowappend/incoming
> --error-directory=/netflow/silk/rwflowappend/error
> --root-directory=/netflow/silk/data
> --site-config-file=/etc/silk/silk.conf --compression-method=best
> --pidfile=/netflow/silk/rwflowappend/log/rwflowappend.pid
> --log-level=warning --log-destination=syslog --threads=16
>
> (gdb) bt
> #0  0x00007f426905f0ea in rb_delete_fix (x=0x7f4269298ac0 <rb_null>,
>     rootp=<optimized out>) at redblack/redblack.c:704
> #1  rb_delete (z=<optimized out>, rootp=0x19d1090)
>     at redblack/redblack.c:655
> #2  rbdelete (key=key@entry=0x19e1da0, rbinfo=0x19d1080)
>     at redblack/redblack.c:183
> #3  0x0000000000403146 in destroyOutputStream (state=state@entry=0x19e1da0)
>     at rwflowappend.c:778
> #4  0x0000000000403ccc in appender_main (vstate=vstate@entry=0x19e1da0)
>     at rwflowappend.c:1207
> #5  0x000000000040408a in appender_main (vstate=0x19e1da0)
>     at rwflowappend.c:1007
> #6  0x00007f42684990a4 in start_thread (arg=0x7f42650e1700)
>     at pthread_create.c:309
> #7  0x00007f42681cdccd in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

This one is even more confusing, since as far as we can determine every
access to this red-black tree is carefully protected by the
'appender_tree_mutex'. 
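
To make the locking pattern described here concrete, a minimal sketch
follows.  It is not SiLK's code: POSIX tsearch()/tfind()/tdelete() stand
in for the redblack library, and tree_root, tree_mutex, and the tree_*
helpers are hypothetical names chosen for illustration.  The point is
only that every lookup, insert, and delete takes the same mutex, so no
thread can see the tree in the middle of a rebalance.

```c
#include <pthread.h>
#include <search.h>
#include <string.h>

/* Hypothetical names; the POSIX tsearch family substitutes for
 * the redblack library used by SiLK. */
static void *tree_root = NULL;
static pthread_mutex_t tree_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Order keys as NUL-terminated strings. */
static int key_cmp(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Returns 1 if key was added, 0 if already present, -1 if out of
 * memory.  The tree stores the pointer itself, so the caller must
 * keep the key alive until it is removed. */
int tree_insert(const char *key)
{
    int rv;

    pthread_mutex_lock(&tree_mutex);
    if (tfind(key, &tree_root, key_cmp) != NULL) {
        rv = 0;
    } else if (tsearch(key, &tree_root, key_cmp) == NULL) {
        rv = -1;
    } else {
        rv = 1;
    }
    pthread_mutex_unlock(&tree_mutex);
    return rv;
}

/* Returns 1 if key is in the tree, 0 otherwise. */
int tree_contains(const char *key)
{
    int found;

    pthread_mutex_lock(&tree_mutex);
    found = (tfind(key, &tree_root, key_cmp) != NULL);
    pthread_mutex_unlock(&tree_mutex);
    return found;
}

/* Returns 1 if key was removed, 0 if it was not in the tree. */
int tree_remove(const char *key)
{
    int removed;

    pthread_mutex_lock(&tree_mutex);
    removed = (tdelete(key, &tree_root, key_cmp) != NULL);
    pthread_mutex_unlock(&tree_mutex);
    return removed;
}
```

Note that a mutex of this kind protects only the tree's structure.  If a
key is freed or overwritten while it is still stored in the tree, the
comparison function can still read invalid memory during a later
traversal, with or without the lock.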

> I'll try rebuilding without optimization to enable better debugging.

John Green <John.Green at jisc.ac.uk> writes:

> On Tue, 2014-11-25 at 14:23 +0000, John Green wrote:
>
> To followup - I added a mutex to isolate modifications to the redblack
> tree.  I haven't experienced any further segfaults, although they were
> sporadic before anyhow. 

Thank you for experimenting.  If I could ask a few questions:  What OS
and OS version are you running?  What configure flags did you use when
building SiLK?  Where did you insert these mutexes?  Were you running
normally at the time of the crashes, or were you attempting to shut the
daemons down?  Also, could we get a section of the logs leading up to a
crash?  This would be especially helpful if you can trigger the errors
with debug-level logging (--log-level=debug).

-- 
Michael Welsh Duggan
(mwd at cert.org)