[netsa-tools-discuss] segfaults in rwflowpack/rwflowappend

John Green John.Green at jisc.ac.uk
Tue Dec 2 10:09:39 EST 2014


On Mon, 2014-12-01 at 14:13 -0500, Michael Welsh Duggan wrote:
> In this case, the thread that modifies the tree is normally the only
> thread that owns and modifies the tree.  The one possible counterexample
> that we have found should only happen during a controlled shutdown.  I'm
> assuming that you were not attempting to shut down the daemon when you
> encountered this problem?

Hi Michael,

Thanks for looking at this issue.  Yes, it occurs during normal
operation.

> Thank you for experimenting.  If I could ask a few questions... 

>  What OS and OS version are you running? 

It is running on Debian Wheezy (64-bit).  I *was* building the RPMs on
Fedora 20 and converting them to debs using alien, which probably added
more complexity than necessary.  However, I have just compiled from
source on the Debian box and got a further crash within the redblack
code (see below).

>  What configure flags did you use when
> building SiLK?

./configure --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin
--sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share
--includedir=/usr/include --libdir=/usr/lib --libexecdir=/usr/libexec
--localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
--infodir=/usr/share/info --enable-data-rootdir=/data --disable-assert
--enable-ipv6 --enable-output-compression=none --without-adns
--without-c-ares --without-libipa --without-pcap
--with-python=/usr/bin/python --disable-static

resulting in
    * Configured package:           SiLK 3.9.0
    * Host type:                    x86_64-unknown-linux-gnu
    * Source files ($top_srcdir):   .
    * Install directory:            /usr
    * Root of packed data tree:     /data
    * Packing logic:                via run-time plugin
    * Timezone support:             UTC
    * Default compression method:   SK_COMPMETHOD_NONE
    * IPv6 network connections:     YES
    * IPv6 flow record support:     YES
    * IPFIX collection support:     YES (-lfixbuf -lgthread-2.0 -pthread -lglib-2.0)
    * NetFlow9 collection support:  YES
    * sFlow collection support:     YES
    * Fixbuf compatibility:         libfixbuf-1.6.1 >= 1.6.0
    * Transport encryption support: YES (-lgnutls -lgcrypt)
    * IPA support:                  NO
    * ZLIB support:                 YES (-lz)
    * LZO support:                  YES (-llzo2)
    * LIBPCAP support:              NO
    * C-ARES support:               NO
    * ADNS support:                 NO
    * Python interpreter:           /usr/bin/python
    * Python support:               YES (-Wl,-z,relro -Xlinker -export-dynamic -Wl,-O1 -Wl,-Bsymbolic-functions -L/usr/lib -lz -ldl -lutil -lm -Wl,-z,relro -L/usr/lib/python2.7/config-x86_64-linux-gnu -lpython2.7 -pthread)
    * Python package destination:   /usr/lib/python2.7/dist-packages
    * Build analysis tools:         YES
    * Build packing tools:          YES
    * Compiler (CC):                gcc
    * Compiler flags (CFLAGS):      -I$(srcdir) -I $(top_builddir)/src/include -I$(top_srcdir)/src/include -DNDEBUG -D_ALL_SOURCE=1 -D_GNU_SOURCE=1 -O3 -fno-strict-aliasing -Wall -W -Wmissing-prototypes -Wformat=2 -Wdeclaration-after-statement -Wpointer-arith
    * Linker flags (LDFLAGS):       
    * Libraries (LIBS):             -llzo2 -lz -ldl -lm


>  Where did you insert these mutexes?

I was really just trying to eliminate this as the cause of the
corruption.  I added a mutex to the struct rbtree in redblack.h and
added lock/unlock calls to the prologue/epilogue of every "public"
redblack function (rbsearch, rbdelete, etc.).  It sounds like this
wasn't the cause, though, and that it is more general memory corruption
(tending to show up in the redblack structures due to their frequency
of use?).
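
For illustration only, the locking pattern described above might look
roughly like the sketch below.  It is not the actual patch and does not
use the real redblack code: struct locked_tree, lt_init, lt_search and
lt_find are hypothetical stand-ins (a plain unbalanced binary tree of
ints), chosen so that the example is self-contained.  The only point is
the mutex stored in the tree struct and the lock/unlock pair at the
entry and exit of every public function.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the redblack node: a plain binary tree. */
struct node {
    int key;
    struct node *left;
    struct node *right;
};

/* Hypothetical stand-in for struct rbtree, with the added mutex. */
struct locked_tree {
    struct node *root;
    pthread_mutex_t lock;
};

/* Returns 0 on success, like pthread_mutex_init(). */
int lt_init(struct locked_tree *t)
{
    t->root = NULL;
    return pthread_mutex_init(&t->lock, NULL);
}

/* Insert-if-absent.  Returns 1 if the key is in the tree on return,
 * 0 if memory allocation failed. */
int lt_search(struct locked_tree *t, int key)
{
    struct node **link;
    int ok = 1;

    pthread_mutex_lock(&t->lock);            /* prologue */
    link = &t->root;
    while (*link != NULL && (*link)->key != key) {
        link = (key < (*link)->key) ? &(*link)->left : &(*link)->right;
    }
    if (*link == NULL) {
        struct node *n = calloc(1, sizeof(*n));
        if (n == NULL) {
            ok = 0;
        } else {
            n->key = key;
            *link = n;
        }
    }
    pthread_mutex_unlock(&t->lock);          /* epilogue */
    return ok;
}

/* Lookup only.  Returns 1 if the key is present, 0 otherwise. */
int lt_find(struct locked_tree *t, int key)
{
    const struct node *n;
    int found;

    pthread_mutex_lock(&t->lock);            /* prologue */
    n = t->root;
    while (n != NULL && n->key != key) {
        n = (key < n->key) ? n->left : n->right;
    }
    found = (n != NULL);
    pthread_mutex_unlock(&t->lock);          /* epilogue */
    return found;
}
```

Note that locking per call only serializes individual operations.  A
traversal that spans several calls (an open/read/close list sequence,
for example) is not made atomic by it, and it does nothing about memory
being overwritten from outside the tree code.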
  
If there are no other reports, then I guess it is something specific to
our local setup or hardware.  I'll continue investigating.

>   Were you running
> normally at the time of the crashes or were you attempting to shut the
> daemons down.  

Running normally.

> Also, could we get a section of logs that lead up to a
> crash?  Especially if you can trigger the errors with debug-level
> logging (--log-level=debug).

The log for the most recent crash (compiled from source on Debian
Wheezy) finished with:

Dec  2 11:12:31 server rwflowpack[7726]: /netflow/silk/rwflowpack/sender-incoming/iw-ORG27499_20141202.11.30IhPN
Dec  2 11:12:31 server rwflowpack[7726]: /netflow/silk/rwflowpack/sender-incoming/ow-ORG27499_20141202.10.fkqqO0
Dec  2 11:12:31 server rwflowpack[7726]: /netflow/silk/rwflowpack/sender-incoming/ow-ORG27499_20141202.11.gymZuJ
Dec  2 11:12:31 server rwflowpack[7726]: Successfully moved 1694/1694 files.
Dec  2 11:13:00 server rwflowpack[7726]: FlowCap Files Reader processing 20141202111033_P1.2.3.4.Y64MSu

(gdb) bt
#0  rb_openlist (rootp=<optimized out>) at redblack/redblack.c:764
#1  rbopenlist (rbinfo=0x0) at redblack/redblack.c:206
#2  0x00007f269aa82c61 in remove_unseen (pd=<optimized out>) at skpolldir.c:261
#3  pollDir (vpd=0x21c36a0) at skpolldir.c:528
#4  0x00007f269aa83838 in sk_timer_thread (v_timer=0x2205700) at sktimer.c:193
#5  0x00007f269952c0a4 in start_thread (arg=0x7f2698d08700) at pthread_create.c:309
#6  0x00007f2699260ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thanks
John



