Tuesday, October 16, 2012

Basic Pig usage to process Argus data

Some quick notes on testing out Pig in local mode to process some basic Argus data.

Argus
  • Capture a sampling of network traffic with Argus
    • argus -w capture.arg -i eth0
  • Pre-process the Argus data
    • ra -r capture.arg -nn -c, -s proto saddr sport daddr dport bytes - tcp or udp > capture.csv
Install Hadoop and Pig
  • cd /usr/local
  • tar -xvzf hadoop-0.20.2.tar.gz 
  • ln -s /usr/local/hadoop-0.20.2 /usr/local/hadoop
  • tar -xvzf pig-0.10.0.tar.gz 
  • ln -s /usr/local/pig-0.10.0 /usr/local/pig
  • Add to your .bash_profile
    • export JAVA_HOME=/usr
    • export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/pig/bin
Run your Pig scripts like
  • pig -x local <blah>.pig
Sample Pig scripts

sum_dport.pig


argfile = load 'capture.csv' using PigStorage(',') as (Proto,SrcAddr,Sport,DstAddr,Dport,TotBytes:int);
grouped = group argfile by Dport;
mysum   = foreach grouped generate group, SUM(argfile.TotBytes);
store mysum into 'arg_pig_out';
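
To run it and look at the results (assuming capture.csv is in the current directory - in local mode Pig writes its output to part files under the directory named in the store statement, so the glob is the safe bet):

pig -x local sum_dport.pig
cat arg_pig_out/part-*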

sum_dport_by_saddr.pig

argfile = load 'capture.csv' using PigStorage(',') as (Proto,SrcAddr,Sport,DstAddr,Dport,TotBytes:int);
grouped = group argfile by (SrcAddr,Dport);
mysum   = foreach grouped generate group, SUM(argfile.TotBytes);
bysum   = order mysum by $1 desc;
top10   = limit bysum 10;
dump top10;

sum_dport_by_saddr_filter22.pig

argfile = load 'capture.csv' using PigStorage(',') as (Proto,SrcAddr,Sport,DstAddr,Dport:chararray,TotBytes:int);
onlyssh = filter argfile by Dport matches '22';
grouped = group onlyssh by (SrcAddr,Dport);
mysum   = foreach grouped generate group, SUM(onlyssh.TotBytes);
bysum   = order mysum by $1 desc;
top10   = limit bysum 10;
dump top10;

unique_srcip_dstip_dstport.pig

argfile = load 'capture.csv' using PigStorage(',') as (Proto,SrcAddr,Sport,DstAddr,Dport,TotBytes:int);
triples = foreach argfile generate SrcAddr, DstAddr, Dport;
uniq    = distinct triples;
top10   = limit uniq 10;
dump top10;

Tuesday, January 10, 2012

brain dump #1

push mapred-site.xml to all Hadoop slaves:
for i in $(cat /usr/local/hadoop/conf/slaves); do scp mapred-site.xml $i:/usr/local/hadoop/conf/mapred-site.xml; done

add /usr/local/lib to ld.so.conf if it is not already there:
if (grep -q "^/usr/local/lib$" /etc/ld.so.conf); then echo ""; else echo "/usr/local/lib" >> /etc/ld.so.conf; fi

ftp over ssl with lftp:
lftp :~> set ftp:ssl-force true
lftp :~> set ftp:ssl-protect-data true
lftp :~> set ssl:verify-certificate true
lftp :~> connect ftp.domain.tld
lftp ftp.domain.tld:~> login my_username

test ssl anonymous auth:
openssl s_client -starttls smtp -crlf -connect some_mail_server:25 -cipher aNULL

test ssl on smtp:
openssl s_client -starttls smtp -crlf -connect some_mail_server:25

low cipher checks:
openssl s_client -connect some_web_server:443 -cipher LOW:EXP

shadow hash:
openssl passwd -1 -salt XXYYZZ11

long-running rsync inside screen:
screen rsync -avz /source/dir/ /target/dir/

rsync over ssh to a remote host:
rsync -avz -e ssh /source/dir/ user@target_system:/target/dir/

mirror, deleting files that no longer exist on the source:
rsync -avz --delete /source/dir/ /target/dir/

skip .snapshot directories:
rsync -avz --exclude=.snapshot /source/dir/ /target/dir/

linux performance troubleshooting notes


vmstat <# in seconds>
procs
r = # of processes waiting for cpu
b = # of processes waiting for i/o
swap
si and so = swap in and swap out;
should stay close to 0, no more than 10 blocks/sec
io
bi and bo = disk i/o
system
in = # of interrupts / sec
cs = # of context switches / sec
cpu
us = % user
sy = % system
id = % idle
wa = % i/o wait



iostat -dx <# in seconds>
various rw/sec options
await = # of milliseconds required to respond to requests
%util = device utilization


top
load = 1m 5m 15m averages
(average # of processes waiting on cpu time;
a load of 4 on a quad core is roughly the same as a load of 1 on a single core)
cpu = us/user sy/system id/idle wa/io_wait
avail ram = free mem + cached

'C' = sort processes by %CPU
'M' = sort processes by %MEM


* high vmstat us (user cpu), processes piled up in the
procs r column, and low disk util in iostat point to
a cpu-bound system
* a high vmstat procs b column combined with high disk
util in iostat points to an i/o-bound system
* high values in vmstat swap si & so indicate swapping
* idle machines show low r/b in vmstat procs and a
high % in cpu id
* use iostat to track down which device is getting
heavy reads/writes
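
For example, to sample a box for a minute while reproducing a slowdown (5 second intervals, 12 samples):

vmstat 5 12
iostat -dx 5 12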

Friday, November 9, 2007

Syslog-NG Performance Tuning


I figured I would post some general tuning options that really improve performance on busy central syslog-ng servers. The following settings are for 2.x, although most will work in some earlier versions as well. They work well for me in a tiered environment where client servers send over both tcp and udp, from standard syslog and syslog-ng, to central servers running syslog-ng 2.0.5. These settings are used both in heavy-volume (25+ GB / day) situations and in environments with plenty of hosts (900+).

On to the configuration choices for your central log servers...

Name Resolution

You will most likely want to resolve the IP addresses of client hosts to their hostnames, so name lookups via use_dns(yes) are probably already turned on. However, you should make sure you are using the cache properly. Adding dns_cache(1500) and dns_cache_expire(86400) allows a cache of 1500 entries and sets the expiration of entries in the cache to 24 hours. Keep in mind to allow for enough entries, and account for how often your hosts change IP addresses - such as in dynamic dns environments. These numbers are just an example; tailor them to your situation.
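
As a rough sketch (these are just the example values from above, not recommendations), the relevant pieces of the global options block would look something like:

options {
    use_dns(yes);
    dns_cache(1500);
    dns_cache_expire(86400);
};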

If you would rather use the hosts file instead, look into use_dns(persist_only) and dns_cache_hosts(/etc/hosts).

Message Size

Not so much a performance tuning option, but one that needs addressing anyhow. If you are only collecting system logs, the default setting of 8192 bytes is probably enough - but if you collect application logs, you will need to plan accordingly with the log_msg_size(#) option. If messages go beyond this length, you will see indications in your logs of messages being split because they are too long.
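
For instance, to raise the limit globally for long application log lines (16384 here is just an illustrative value, not a recommendation):

options {
    log_msg_size(16384);
};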

Output Buffers

Here is an extremely important setting - log_fifo_size(#). The log_fifo_size(#) setting sizes the output buffer, which every destination has. The output buffer must be large enough to store the incoming messages of every source. This setting can be set globally or per destination.

For log_fifo_size(#), the number indicated is the number of lines/entries/messages the buffer can hold. By default, it is set globally and extremely conservatively - if you do any real amount of traffic, you will end up seeing dropped messages at some point. Unless you have altered the interval, the statistics that include dropped messages are printed to syslog every 10 minutes. The statistics line tells you which destination is dropping messages and how many, which lets you decide whether to increase the buffer globally or per destination, and gives you an idea of how much larger you need to make it.
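
A sketch of both approaches - a larger global default plus an even larger buffer on one busy file destination (the numbers and the d_apps destination are made up for illustration):

options {
    log_fifo_size(2000);
};

destination d_apps {
    file("/var/log/apps.log" log_fifo_size(8000));
};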

Flushing Buffers with sync

From the syslog-ng documentation: "The syslog-ng application buffers the log messages to be sent in an output queue. The sync() parameter specifies the number of messages held in this buffer."

By default, sync(#) is set to 0, which flushes messages immediately - which, depending on your logging volume, can be fairly taxing. Increasing this number gently, say to 10 or 20, will hold that many messages in the buffer before they are written to their destination.
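
Continuing the hypothetical d_apps destination from above, setting it per destination would look something like:

destination d_apps {
    file("/var/log/apps.log" sync(20) log_fifo_size(8000));
};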

Other Important Considerations

If you are still having trouble with dropped messages, look into using flow control within syslog-ng. Flow control allows you to finely tune the number of messages accepted from a source. There are other issues to account for, though, such as slowing down the source application if it cannot hand off its log messages.

Users with traditional syslog clients sending their logs via UDP, should have a look at this page on UDP Buffer Sizing.

Also, sync() and log_fifo_size() should be tweaked on your client servers as necessary if they are running syslog-ng and handle heavy loads, sporadic sources, etc. Remember to use your statistics log entries to help you identify problems and gauge load effectively.

Thursday, November 1, 2007

Learning with Honeypots

Honeypot Layout

I've recently rented some dedicated server and public IP resources for running some honeypot/honeynet/whatever setups, more or less to learn, so I figured I would post my game plan here.

My basic idea is not earth shattering or anything new, I just hope to gain new insight, see what works or doesn't work, and find ways of using honeypots for intrusion detection or as an early warning system for a piece of the overall security monitoring puzzle.

As we all know, any traffic hitting a honeypot system is suspicious, or unwarranted at best, which whittles down the amount of traffic we have to look at compared to a production host. However, if you have ever looked at the logs or traffic of a publicly accessible, non-production machine, this "whittled down" traffic can still be quite large - everything from propagating worms to your annoying SSH brute force scans. So how do we look for the unknown nasties while not wasting time on the redundant, now passé, routine malicious scans? One way is by filtering and tiering our honeypot architecture.

Filtering, Tiering and Multiple Tools

Fortunately, there are many great tools out there for honeypots and analysis:

honeyd: http://www.honeyd.org/
nepenthes: http://nepenthes.mwcollect.org/
honeyc: https://www.client-honeynet.org/honeyc.html
Capture-HPC: https://www.client-honeynet.org/creleases.html
Honeywall: http://www.honeynet.org/tools/cdrom/

Combine these with your standard monitoring and access control tools such as snort, tshark and iptables, and you come away with many ways to watch, contain and direct how things happen.

I plan to make heavy use of VMware for the virtualization aspects of both the high interaction honeypots and some of the low interaction honeypots. Tiering from filters to low interaction honeypots, then on to high interaction honeypots, reduces the load and catches known misuse early on with the least amount of resources squandered.

The Plan

So, here's what I intend to do as a starting point.

An initial box will run VMware, IPtables, and monitoring software (such as tshark/argus/snort or possibly sguil). This box will filter traffic and pass the pre-defined portions on to a set of IP addresses exposed to an instance of honeyd.

This honeyd machine controls a set number of public IP addresses that I intend to bind to various templates at various times - floating between Linux, Windows and dynamic emulations based on honeyd's passive fingerprinting capabilities provided by p0f signatures and other abilities (how about blacklisted source IPs for instance).

At this point, honeyd will offer some custom service emulation scripts, watch for probes and pokes on the various tcp and udp ports defined, and then, with the help of some perl glue, make a determination of what to do with it. The "what to do with it" part will be either to drop it on the floor, pass it to nepenthes, or send it to a high interaction honeypot (a Windows one if it is most likely a Windows exploit, a Linux one if it is most likely a Linux exploit, etc.).
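
To make that concrete, a rough honeyd.conf sketch might look like the following. The personality strings, script paths and the router.pl glue script are placeholders for illustration, not the actual config:

create linux
set linux personality "Linux 2.4.20"
set linux default tcp action reset
add linux tcp port 22 "sh /usr/local/share/honeyd/scripts/ssh-emul.sh $ipsrc $dport"

create windows
set windows personality "Microsoft Windows XP Professional SP1"
set windows default tcp action reset
add windows tcp port 445 "perl /usr/local/honeyd/router.pl $ipsrc $dport"

bind 192.0.2.10 linux
bind 192.0.2.11 windows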

The virtual machines running nepenthes and the high interaction honeypots will be on a NAT'ed network, funneled through the public IP space offered up by the front end of this setup. Nepenthes will provide a second line of defense, catching worms and malware that are already known. If nepenthes does not recognize the traffic, or if the initial honeyd setup determines that it should go elsewhere, the traffic will be destined for an appropriate virtual machine running an OS most likely to match the intended target, or potentially to an emulated service.

In addition, custom perl scripts will handle SMTP service emulation, to both capture and analyze spam and the resulting links and attachments it contains. Tarpitting and using client honeypot tools to visit the linked websites are on the agenda as well.

Things to Watch For

So many things come to mind as needing that extra care and attention, or that will just be plain fun to mess around with. Here's my list:

* Routing the traffic. Both the honeyd aspect, and the perl glue that will be used to make other determinations, etc.

* Automation. How to maintain my sanity while still providing a valuable learning environment.

* Control. As with any honeypot setup, maintaining control of the various aspects as things are exploited and probed.

* Keeping the various parts of this setup from being fingerprinted and identified as "not real".

* Building a database of everything learned, and providing a usable interface to this data.

Final Thoughts

I intend for this post to be a starting point for what I learn works or doesn't work, interesting tidbits found, etc. Both documenting things I'd like to keep tabs on and sharing with other interested parties. As always, comments and thoughts are welcome.

Many of the ideas and much of the technical know-how came from the recent, and excellent, book on Virtual Honeypots; I highly recommend you check it out.

Tuesday, September 11, 2007

Capturing flow data from your Linksys at home


As a big believer in flow/session data collection in all NIDS locations, it is only right that there be an easy way to do so at home without putting a full-time IDS in place. So with a trusty Linksys router re-flashed with DD-WRT, an extra package installed on the router, and a suite of flow collection/analysis tools on your primary Linux desktop, we can easily achieve this.

On your Linksys:

  1. First things first. In this scenario we re-flashed a Linksys router with DD-WRT, following these instructions.
  2. Next, via the DD-WRT web interface, we enabled JFFS2 support and SSH located in subsections of the Administration tab.
  3. Moving on, update your ipkg configuration with: ipkg update. Then install fprobe via ipkg: ipkg install fprobe.
  4. Finally, add a shell script to /jffs/etc/config/fprobe.startup. Change permissions: chmod 700 fprobe.startup and reboot your router. The file should contain the following command: fprobe -i br0 -f ip 192.168.1.100:9801
A brief discussion of the fprobe command is needed:

  • -i specifies the interface you are interested in watching flows on. I chose my internal interface.
  • -f specifies a bpf filter. In this scenario, I chose to only create flow records for IP traffic.
  • IP:Port is the remote IP address and UDP port that your flow collector is listening on - this will be set up later on your desktop Linux box.
On your Linux box:

  1. Install flow-tools from here. All that is needed is a standard: configure; make; make install. *There is one caveat to watch out for: if you use gcc 4.x, you will need the patch available where you downloaded the tarball.
  2. Create a directory to store your flow data: mkdir -p /data/flows/internal
  3. If you run IPTables or some other host-based firewall, make sure to allow UDP 9801 connections from your router (see the example rule after this list).
  4. Finally, both run the following command and add it somehow to your system startup (via /etc/rc.local, for example): /usr/local/netflow/bin/flow-capture 192.168.1.100/192.168.1.1/9801 -w /data/flows/internal
A brief discussion of the flow-capture command is needed:

  • You specify the local IP address you want your collector to listen on, then the address of the flow probe, followed by the UDP port to use - all in a local/remote/port format.
  • -w specifies the directory to write flow files to. By default, flow-capture starts a new file for every 15 minute chunk of time.
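
For step 3, a minimal IPTables rule (assuming the router is 192.168.1.1 and the desktop is 192.168.1.100, as in the examples above) might look like:

iptables -A INPUT -p udp -s 192.168.1.1 -d 192.168.1.100 --dport 9801 -j ACCEPT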
So now that we have some flow data being collected to your machine, what are some cool things we can do with it? Looking in flow-tools default install directory for binaries, /usr/local/netflow/bin, we see numerous flow-* tools. We'll look at a few briefly below.

Using flow-print:

flow-print < ft-v05.2007-09-11.080001-0400

The above command will print out the records contained in that particular flow file. The columns contain srcIP/dstIP/protocol/srcPort/dstPort/octets/packets. The octets column is the equivalent of bytes. This is your standard session/flow data.

Adding a "-f 1" flag will produce timestamps among other things. The -f flag allows for numerous types of formatting and additional columns, etc.

On a sidenote, standard *nix tools - such as awk and grep can be very useful in pulling data from plain old dumps of the flow records.
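
For example, to total the octets column for all flows involving one host (192.168.1.50 is just a stand-in here; the column positions follow the default flow-print format described above):

flow-print < ft-v05.2007-09-11.080001-0400 | grep 192.168.1.50 | awk '{sum += $6} END {print sum}'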

Using flow-cat and flow-stat:

Much like Argus, with flow-tools you stack the various utilities together to get the output you want.

flow-cat ft-v05.2007-09-11.0* | flow-stat -f9 -S2

In the above set of commands, flow-cat is used to concatenate all the files whose names match that pattern. The resulting output is passed to flow-stat for crunching and display. The flow-stat command generates reports, taking formatting options via the -f flag and sorting options via -S and -s. Our example specified a report format keyed on the Source IP address, sorted on the Octets (ie. Bytes) field (have a look at the man page for flow-stat to see all the various options). Thus, we now have detailed output from all those files, showing the *noisiest* source hosts listed by most bytes transferred.

Utilizing your desktop and a router, things you probably already have at home, you too can watch/collect/analyze flow data to keep a watchful eye on your network - without deploying a dedicated NIDS or NSM sensor.
