[netsa-tools-discuss] Help troubleshooting rwflowpack not recording flows

Mark Thomas mthomas at cert.org
Fri Nov 3 16:49:34 EDT 2017


Eric-

That's an interesting problem you are seeing.

I am sure you have thought of some of these items, but here is a list of
some things I would check:

* When rwflowpack flushes the files, there should be a log message
  similar to the following for each sensor:

    'S1': forward 0, reverse 0, ignored 0

  If that is not appearing for this "rogue" router, it may indicate
  that a thread in rwflowpack has been deadlocked or is blocked
  (perhaps in a read call).

* If all the counts in that log message are 0, it means rwflowpack
  either is not receiving the data or is unable to parse it.

* If the counts in that message are non-zero, is rwflowpack ignoring
  all the flow records?  (If so, there should be lots of messages
  about it in the log.)

* Check the log messages to see which internal template rwflowpack
  is using for the data.  If it says it is processing data with the
  "ignore template", then it is ignoring the records for some reason
  (usually because the template lacks some element).

* Try setting the SILK_IPFIX_PRINT_TEMPLATES environment variable
  prior to starting rwflowpack, which causes rwflowpack to write
  each IPFIX template it receives to the log file.  Since this is a
  UDP connection, the Juniper should be re-sending them every few
  minutes.  If the templates do not appear, it may indicate a
  network connection issue or a threading issue in rwflowpack.

* Since your connection is UDP, the following is not an issue,
  but....  When data arrives at rwflowpack via TCP, there can be
  conditions when rwflowpack cannot keep up with the data and the
  TCP connection blocks.

* Look at the output from netstat on your platform to see if there
  is some saturation issue.  (I would not think this would be a
  problem for UDP.)

* Ensure the SILK_LIBFIXBUF_SUPPRESS_WARNINGS environment variable
  is not set.  That environment variable disables warnings from
  libfixbuf, and it is possible (though unlikely) that libfixbuf is
  reporting errors that you are not seeing.

* Depending on how much you want to debug this, consider enabling
  the trace-level compile-time logging by #define-ing the
  SKIPFIXSOURCE_TRACE_LEVEL C macro in
  silk/src/libflowsource/ipfixsource.c and the SKIPFIX_TRACE_LEVEL C
  macro in the skipfix.c file in that same directory.  Recompile and
  reinstall SiLK.  Those messages are printed at --log-level=debug.

* If you are going to rebuild rwflowpack, enable debugging symbols
  (configure --enable-debugging).  Attach a debugger to rwflowpack
  and see if a thread appears to be blocked.  (I know this is
  difficult to tell if you do not know the SiLK C code.)

* Does re-arranging the order and/or ports to which samplicator
  sends data have an effect?
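
The first three checks above can be applied mechanically to a saved
rwflowpack log.  The Python sketch below assumes the per-sensor summary
lines look like the example in the first bullet (real messages may
carry a timestamp prefix or slightly different wording depending on
the SiLK version) and that you know which sensor names to expect; the
sample log lines and the sensor names S1 to S3 are hypothetical.

```python
import re
from collections import defaultdict

# Modeled on the example summary line "'S1': forward 0, reverse 0, ignored 0".
# The exact wording may differ between SiLK versions; adjust if needed.
SUMMARY_RE = re.compile(
    r"'(?P<sensor>[^']+)': forward (?P<fwd>\d+), "
    r"reverse (?P<rev>\d+), ignored (?P<ign>\d+)"
)


def classify_sensors(log_text, expected_sensors):
    """Map each expected sensor name to a rough diagnosis from its summaries."""
    totals = defaultdict(lambda: [0, 0, 0])  # sensor -> [forward, reverse, ignored]
    for line in log_text.splitlines():
        m = SUMMARY_RE.search(line)
        if m:
            counts = totals[m.group("sensor")]
            counts[0] += int(m.group("fwd"))
            counts[1] += int(m.group("rev"))
            counts[2] += int(m.group("ign"))

    diagnosis = {}
    for name in expected_sensors:
        if name not in totals:
            # No summary line at all: a reader thread may be blocked.
            diagnosis[name] = "no summary lines (blocked thread?)"
            continue
        fwd, rev, ign = totals[name]
        if fwd == rev == ign == 0:
            diagnosis[name] = "all zero (not receiving or not parsing)"
        elif fwd == rev == 0:
            diagnosis[name] = "all records ignored (check log for reasons)"
        else:
            diagnosis[name] = "ok"
    return diagnosis


if __name__ == "__main__":
    # Hypothetical log excerpt; sensor names are placeholders.
    sample = (
        "rwflowpack: 'S1': forward 1500, reverse 0, ignored 0\n"
        "rwflowpack: 'S2': forward 0, reverse 0, ignored 0\n"
    )
    for sensor, verdict in classify_sensors(sample, ["S1", "S2", "S3"]).items():
        print(f"{sensor}: {verdict}")
    # Expected output:
    #   S1: ok
    #   S2: all zero (not receiving or not parsing)
    #   S3: no summary lines (blocked thread?)
```

Counts are summed across all flushes in the excerpt, so run it over a
window long enough to include several flushes from the suspect sensor.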

Good luck finding the cause of the issue.  If I can be of further
help please let me know, and also let me know what you discover.

Thanks,

-Mark


-----Original Message-----
From: Junk Mail <inetjunkmail at icloud.com>
Date: Thu, 2 Nov 2017 17:36:01 +0000
To: <netsa-tools-discuss at cert.org>
Subject: [netsa-tools-discuss] Help troubleshooting rwflowpack not recording
	flows

Hello:

We've been using SiLK for a few years.  Recently, we upgraded our
hardware and performed a fresh install with a more current version.
As we began building the new server, we noticed on the old server that
traffic from one particular interface on one of our routers was not
being written to disk.  We took PCAPs on the collector of the data
being sent to the rwflowpack port and could see the interface data in
question in Wireshark but, for some reason, it wouldn't show up on
disk after being processed by rwflowpack (it was working for about a
year up until this point and the router's flow configuration and
software version were unchanged).  Lots of service/server restarts and
other steps later, we were at a loss.

We use UDP for flow data and use samplicator to send the flow data to
multiple processes, so we tried sending a copy of the flow data from
the broken router to the new server we were building and, lo and
behold, the data was being written to disk.  Thinking we had found
some strange bug in the old version and knowing that we were moving to
the new server, we gave up on diagnosis and escalated the move to the
new hardware.

Now, after a few months on the new server, we are seeing something
similar.  It's the same router, but this time we are not writing _ANY_
flows for that device.  Again, PCAPs confirm the data is reaching
the rwflowpack listener.  We've tried adding --log-level=debug to the
process but didn't see anything interesting in the logs.  I realize
that its being the same router suggests that the router may be the
issue...and it may be, but the PCAP data suggests otherwise.  Since
Wireshark can use the template and parse the data, I'm inclined to
think the data is good.
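
One reason the netstat check in Mark's reply is still worth running
for UDP: a packet capture sees datagrams before the kernel queues them
on rwflowpack's socket.  If the reader for a port is blocked, as the
first bullet of the reply speculates, the capture keeps showing the
data while the kernel discards it once that socket's receive buffer
fills.  The Python sketch below is Linux-specific; it assumes
/proc/net/snmp (one of the files netstat -s summarizes) and its
"Udp:" header/value line pair, and the function names are
hypothetical.

```python
def parse_udp_counters(snmp_text):
    """Return the 'Udp:' counters from the text of Linux /proc/net/snmp."""
    udp_rows = [line.split() for line in snmp_text.splitlines()
                if line.split()[:1] == ["Udp:"]]
    if len(udp_rows) < 2:
        raise ValueError("no Udp: header/value line pair found")
    # First "Udp:" line holds the counter names, the second their values.
    names, values = udp_rows[0][1:], udp_rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def read_udp_counters(path="/proc/net/snmp"):
    """Read a host-wide snapshot of the kernel's UDP counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


def udp_drop_delta(before, after):
    """Growth of the UDP error counters between two snapshots.

    RcvbufErrors counts datagrams discarded because a socket's receive
    buffer was full; InErrors is the broader count of UDP receive errors.
    """
    return {key: after.get(key, 0) - before.get(key, 0)
            for key in ("InErrors", "RcvbufErrors")}
```

On the collector, take read_udp_counters() once, wait a minute or so,
take it again, and pass both snapshots to udp_drop_delta().  Steady
growth in RcvbufErrors while the router is exporting would point at a
reader that is not draining its socket.  These counters are host-wide;
per-socket drop counts appear in the "drops" column of /proc/net/udp.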

I don't have any info on the versions installed on the original server,
but the new server has SiLK version 3.12.2 and libfixbuf version
1.7.1.  The old server was probably running a version a couple of years
older, so if it is a bug, it's been around a while.  This is IPFIX data
from a JUNOS version 13.3 router.
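
The observation that Wireshark decodes the data can be pushed one step
further by checking which templates the router actually sends and which
template IDs its data sets reference.  The Python sketch below walks
one IPFIX message using the header and set layout from RFC 7011
(version 10, set ID 2 for templates, 3 for options templates, 256 and
above for data sets).  It assumes the UDP payloads have already been
extracted from the capture; it decodes template field lists only, not
data records or options templates, and the function names are
hypothetical.

```python
import struct

IPFIX_VERSION = 10
TEMPLATE_SET_ID = 2
OPTIONS_TEMPLATE_SET_ID = 3


def _parse_template_set(body):
    """Return {template_id: [(enterprise_or_None, ie_id), ...]} for a template set."""
    result, pos = {}, 0
    while pos + 4 <= len(body):
        tid, count = struct.unpack("!HH", body[pos:pos + 4])
        pos += 4
        if tid < 256:
            break  # zero padding at the end of the set; template IDs start at 256
        fields = []
        for _ in range(count):
            ie_id, _field_len = struct.unpack("!HH", body[pos:pos + 4])
            pos += 4
            pen = None
            if ie_id & 0x8000:  # enterprise bit: a 4-byte enterprise number follows
                (pen,) = struct.unpack("!I", body[pos:pos + 4])
                pos += 4
            fields.append((pen, ie_id & 0x7FFF))
        result[tid] = fields
    return result


def scan_ipfix_message(payload):
    """Summarize the templates and data sets in one IPFIX message."""
    if len(payload) < 16:
        raise ValueError("too short for an IPFIX message header")
    version, length, _export_time, seq, domain = struct.unpack("!HHIII", payload[:16])
    if version != IPFIX_VERSION:
        raise ValueError(f"not IPFIX (version {version})")
    length = min(length, len(payload))  # tolerate a truncated capture
    templates, data_set_ids, options_sets = {}, [], 0
    offset = 16
    while offset + 4 <= length:
        set_id, set_len = struct.unpack("!HH", payload[offset:offset + 4])
        if set_len < 4 or offset + set_len > length:
            raise ValueError("malformed set length")
        body = payload[offset + 4:offset + set_len]
        if set_id == TEMPLATE_SET_ID:
            templates.update(_parse_template_set(body))
        elif set_id == OPTIONS_TEMPLATE_SET_ID:
            options_sets += 1  # counted, not decoded
        elif set_id >= 256:
            data_set_ids.append(set_id)  # a data set's ID is its template ID
        offset += set_len
    return {"domain": domain, "sequence": seq, "templates": templates,
            "data_sets": data_set_ids, "options_template_sets": options_sets}
```

Since templates arrive only every few minutes over UDP, scan enough
messages to cover at least one refresh.  Data sets whose IDs never show
up as templates would suggest the templates are not getting through,
and comparing the element lists with those from a router that is being
recorded could show whether this template lacks an element, which Mark
notes is the usual reason rwflowpack falls back to its ignore template.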

Are there any ideas of things we can look at next?

Thanks for any help,
Eric

