Tag Archives: Apache log

[repost ]exploring apache log files using hive and hadoop


if you’re exploring hive as at technology, and are looking to move beyond “hello, world”, here’s a little recipe for a simple but satisfying first task using hive and hadoop. we’ll work through setting up a clustered installation of hive and hadoop, and then import an apache log file and query it using hive’s SQL-like language.

unless you happen to have three physical linux servers at your disposal, you may want to create your base debian linux servers using a virtualization technology such as xen. for a good guide on setting up xen, go here. for the remainder of this tutorial, i’ll assume that you have three debian (lenny) servers at your disposal.

let’s get started

setting up hadoop

first, we need to set up hadoop. we’re going to install hadoop at /var/hadoop. execute the following commands as root :

# apt-get install sun-java6-jre
# cd /var
# wget http://www.bizdirusa.com/mirrors/apache/hadoop/core/stable/hadoop-0.18.3.tar.gz
# tar -xvf hadoop-0.18.3.tar.gz
# mv hadoop-0.18.3 hadoop
# rm hadoop-0.18.3.tar.gz

now, vi conf/hadoop-env.sh and set the JAVA_HOME variable appropriately. additionally, if you want to run hadoop as a different user, change the hadoop directory permissions appropriately. for example :

# chgrp -R cailin.cailin /var/hadoop

repeat this section for all three of your hadoop servers.

configuring hadoop

first, edit /etc/hosts on all three servers and make sure that they are all aware of each other’s existence. for example, my servers are named haddop1haddop2 and haddop3 and my /etc/hostslooks like this :    haddop1    haddop2    haddop3

now make sure that you have password-free SSH access between all servers.

finally, on all three servers, modify /var/hadoop/hadoop-sites.xml to contain the following

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->





and /var/hadoop/conf/slaves to contain


starting up hadoop

in the previous configuration, we specific that haddop1 was our !NameNode, haddop2 was our !JobTracker and haddop3 was both a DataNode and TaskTracker.

on the !NameNode, issue the command to format a new distributed file system (HDFS) and then startup the HDSF.

$ bin/hadoop namenode -format
$ bin/start-dfs.sh

on the JobTracker start up MapReduce :


testing hadoop (optional)

if you’re like me, even though our ultimate goal here is hive . . . you can’t proceed until you’ve double checked to make sure that everything works so far. okay, fine, we’ll make a short stop to test hadoop. all commands should be executed as your non-root user.

first, make yourself some test data. create a directory in your home directory and put in two or more text files containing a few lines of text each. i created mine at /home/cailin/input/wordcount

On your JobTracker node copy some example data into the HDFS with HDFS directory name input-wordcount

bin/hadoop dfs -copyFromLocal /home/cailin/input/wordcount input-wordcount

Now, kick off a job to count the number of instances of each word in your input files. Put the output of the job in an HDFS directory with name output-wordcount

$ bin/hadoop jar hadoop-0.18.4-dev-examples.jar wordcount input-wordcount output-wordcount

Now, copy the input out of HDFS to your local filesystem and take a look

$ bin/hadoop dfs -get output-wordcount /home/cailin/output/wordcount
$ cat /home/cailin/output/wordcount/*

If you want to run the example again, you need to clean up after your first run. Delete the output directory from the HDFS and your local copy too. Also, delete the input HDFS directory.

$ bin/hadoop dfs -rmr output-wordcount
$ rm -r /home/cailin/output/wordcount
$ bin/hadoop dfs -rmr input-wordcount

shutting down hadoop

it’s probably best if you shut-down hadoop while we’re setting up hive. on NameNode

$ bin/stop-dfs.sh

on the JobTracker

$ bin/stop-mapred.sh

installing hive

on each of the three servers, execute the following as root

   # mkdir /var/hive
# cd /tmp
# svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
#  cd hive
#  ant -Dhadoop.version="0.18.3" package
#  cd /build/dist
# mv * /var/hive/.

and, if you want to run hive as somebody other than root, through in something like the following :

# sudo chown -R cailin.cailin /var/hive

now, vi /etc/profile and make it aware of the following three environment variables, changing the value of JAVA_HOME as appropriate.

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/var/hadoop
export HIVE_HOME=/var/hive

logout and login to all three servers to “activate” the changes in /etc/profile

finally, on the NameNode, as your non-root user, execute the following commands to create some necessary directories. (if you get an error indicating that the directory already exists, that’s okay.

$ bin/hadoop fs -mkdir /tmp
$ bin/hadoop fs -mkdir /user/hive/warehouse
$ bin/hadoop fs -chmod g+w /tmp
$ bin/hadoop fs -chmod g+w /user/hive/warehouse

testing hive

first, start up hadoop again, following the instructions above.

now, as the non-root user on your JobTracker start up the hive command line interface (cli)

$ cd /var/hive
$ bin/hive

now, in the hive cli, execute the following trivial series of commands, just to make sure everything is in working order

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> DROP TABLE pokes;

exploring an apache log file using hive

finally, we’re able to get to the point!

first, copy an apache log file to /tmp/apache.log on the JobTracker

still on your JobTracker in the hive cli, create the table and load in the data

hive> CREATE TABLE apachelog (
ipaddress STRING, identd STRING, user STRING,finishtime STRING,
requestline string, returncode INT, size INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
'field.delim'=' ',

hive> LOAD DATA LOCAL INPATH '/tmp/apache.log' OVERWRITE INTO TABLE apachelog;

yee hoo! your apache log file is now available to query using the Hive QL. try a few simple queries like :

SELECT * from apachelog  WHERE ipaddress = '';

now, suppose we want to execute a simple task such as determining the biggest apache offender (e.g. the ip address with the most apache requests.) in MySQL we would write something like SELECT ipaddress, COUNT(1) AS numrequest FROM ipaddress GROUP BY ipaddress ORDER BY numrequest DESC LIMIT 1. the closest approximation to this in HIVE QL is

hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;

however, this may not give you the answer you were expecting! in hive, the SORT BY command indicates only that the data is sorted within a reducer. so, to enforce a global sort over all data, you would have to set the number of reducers to 1. this may not be realistic for a large data set. to see what happens with more than one reducer, force the number of reducers to 2.

hive> set mapred.reduce.tasks=2;
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;

to get the right answer, with greater than one reducer, it is necessary to first create a temporary table and populate it with the number of requests per ip address.

CREATE TABLE ipsummary (ipaddress STRING, numrequest INT);
INSERT OVERWRITE TABLE ipsummary SELECT ipaddress, COUNT(1) FROM apachelog GROUP BY ipaddress;

now, with the use of a Hive QL subquery, we can extract the information we seek, even with > 1 reducer.

SELECT ipsummary.ipaddress, ipsummary.numrequest FROM (SELECT MAX(numrequest) AS themax FROM ipsummary) ipsummarymax JOIN ipsummary ON ipsummarymax.themax = ipsummary.numrequest;

tee hee. that was alot of work. next time, just use google analytics.

[repost ]Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig


Big Data is the hotness, there is no doubt about it.  Every year its just gotten bigger and bigger and shows no sign of slowing.  There is a lot out there about big data, but despite the hype, there isn’t a lot of good technical content for those who want to get started.  The lack of technical how-to info is made worse by the fact that many Hadoop projects have moved their documentation around over time and Google searches commonly point to obsolete docs.  My intent here is to provide some solid guidance on how to actually get started with practical uses of Hadoop and to encourage others to do the same.

From an SA perspective, the most interesting Hadoop sub-projects have been those for log transport, namely Scribe, Chukwa, and Flume.  Lets examine each.

Log Transport Choices

Scribe was created at Facebook and got a lot of popularity early on due to adoption at high profile sites like Twitter, but development has apparently ceased  and word is that Facebook stopped using it themselves.  So Scribe is off my list.

Chukwa is a confusing beast, its said to be distributed with Hadoop’s core but its just an old version in the same sub-directory of the FTP site, the actual current version is found under the incubator sub-tree.  It is a very comprehensive solution, including a web interface for log analysis, but that functionality is based on HBase, which is fine if you want to use HBase but may be a bit more than you wish to chew off for simple Hive/Pig analysis.  Most importantly, the major Hadoop distributions from HortonWorks,MapR, and Cloudera use Flume instead.  So if your looking for a comprehensive toolset for log analysis, Chukwa is worth checking out, but if you simply need to efficiently get data into Hadoop for use by other Hadoop components, Flume is the clear choice.

That brings us to Flume, more specifically Flume-NG.  The first thing to know about Flume is that there were major changes to Flume pre and post 1.0, major enough that they took to refering to pre 1.0 as “Flume OG” (“Old generation” or “Origonal Gangsta” depending on your mood) and the new post 1.0 releases as “Flume NG”.  Whenever looking at documentation or help on the web about Flume be certain as to which you are looking at!  In particular, stay away from the Flume CWiki pages,  refer only to theflume.apache.org.  I say that because there is so much old cruft in the CWiki pages that you can be easily mislead and become frustrated, so just avoid it.

Now that we’ve thinned out the available options, what can we do with Flume?

Getting Started with Flume

Flume is a very sophisticated tool for transporting data.  We are going to focus on log data, however it can transport just about anything you throw at it.  For our purposes we’re going to use it to transport Apache log data from a web server back to our Hadoop cluster and store it in HDFS where we can then operate on it using other Hadoop tools.

Flume NG is a java application that, like other Hadoop tools, can be downloaded, unpacked, configured and run, without compiling or other forms of tinkering.  Download the latest “bin” tarball and untar it into /opt and rename or symlink to “/opt/flume” (it doesn’t matter where you put it, this is just my preference).  You will need to have Java already installed.

Before we can configure Flume its important to understand its architecture.  Flume runs as an agent.  The agent is sub-divided into 3 categories: sourceschannels, and sinks.  Inside the Flume agent process there is a pub-sub flow between these 3 components.  A source accepts or retrieves data and sends it into a channel.  Data then queues in the channel.  A sink takes data from the channel and does something with it.  There can be multiple sources, multiple channels, and multiple sinks per agent.  The only important thing to remember is that a source can write to multiple channels, but a sink can draw from only one channel.

Lets take an example.  A “source” might tail a file.  New log lines are sent into a channel where they are queued up.  A “sink” then extracts the log lines from the channel and writes them into HDFS.

At first glance this might appear overly complicated, but the distinct advantage  here is that the channel de-couples input and output, which is important if you have performance slowdowns in the sinks.  It also allows the entire system to be plugin-based.  Any number of new sinks can be created to do something with data… for instance, Casandra sinks are available, there is an IRC sink for writing data into an IRC channel.  Flume is extremely flexible thanks to this architecture.

In the real world we want to collect data from a local file, send it across the network and then store it centrally.  In Flume we’d accomplish this by chaining agents together.  The “sink” of one agent sends to the “source” of another.  The standard method of sending data across the network with Flume is using Avro.  For our purposes here you don’t need to know anything about Avro except one of the things it can do is to move data over the network.  Here is what this ultimately looks like:

So on our web server, we create a /opt/flume/conf/flume.confthat looks like this:

## Flume NG Apache Log Collection
## Refer to https://cwiki.apache.org/confluence/display/FLUME/Getting+Started
# http://flume.apache.org/FlumeUserGuide.html#exec-source
agent.sources = apache
agent.sources.apache.type = exec
agent.sources.apache.command = gtail -F /var/log/httpd/access_log
agent.sources.apache.batchSize = 1
agent.sources.apache.channels = memoryChannel
agent.sources.apache.interceptors = itime ihost itype
# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
agent.sources.apache.interceptors.itime.type = timestamp
# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
agent.sources.apache.interceptors.ihost.type = host
agent.sources.apache.interceptors.ihost.useIP = false
agent.sources.apache.interceptors.ihost.hostHeader = host
# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
agent.sources.apache.interceptors.itype.type = static
agent.sources.apache.interceptors.itype.key = log_type
agent.sources.apache.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on (Hadoop Slave Node)
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
agent.sinks = AvroSink
agent.sinks.AvroSink.type = avro
agent.sinks.AvroSink.channel = memoryChannel
agent.sinks.AvroSink.hostname =
agent.sinks.AvroSink.port = 4545

## Debugging Sink, Comment out AvroSink if you use this one
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
#agent.sinks = localout
#agent.sinks.localout.type = file_roll
#agent.sinks.localout.sink.directory = /var/log/flume
#agent.sinks.localout.sink.rollInterval = 0
#agent.sinks.localout.channel = memoryChannel

This configuration looks overwhelming at first, but it breaks down simply into an “exec” source, a “memory” channel, and an “Avro” sink, with additional parameters specified for each. The syntax for each is in the following form:

agent_name.sources = source1 source2 ...
agent_name.sources.source1.type = exec

agent_name.channel = channel1 channel2 ...
agent_name.channel.channel1.type = memory

agent_name.sinks = sink1 sink2 ...
agent_name.sinks.sink1.type = avro

In my example the agent name was “agent”, but you can name it anything you want. You will specify the agent name when you start the agent, like this:

$ cd /opt/flume
$ bin/flume-ng agent -f conf/flume.conf -n agent

Now that our agent is running on the web server, we need to setup the other agent which will deposit logs lines into HDFS. This type of agent is commonly called a “collector”. Here is the config:

## Sources #########################################################
## Accept Avro data In from the Edge Agents
# http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind =
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ########################################################
## Source writes to 2 channels, one for each sink (Fan Out)
collector.channels = mc1 mc2

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ###########################################################
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem (Debugging)
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
# http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /flume/events/%{log_type}/%{host}/%y-%m-%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

This configuration is a little different because the source accepts Avro network events and then sends them into 2 memory channels (“fan out”) which feed 2 different sinks, one for HDFS and another for a local log file (for debugging). We start this agent like so:

# bin/flume-ng agent -f conf/flume.conf -n collector

Once both sides are up, you should see data moving. Use “hadoop fs -lsr /flume” to examine files there and if you included the file_roll sink, look in /var/log/flume.

# hadoop fs -lsr /flume/events
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com
drwxr-xr-x   - root supergroup          0 2012-12-24 09:50 /flume/events/apache_access_combined/cuddletech.com/12-12-24
-rw-r--r--   3 root supergroup     224861 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845948
-rw-r--r--   3 root supergroup      85437 2012-12-24 06:27 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845949
-rw-r--r--   3 root supergroup     195381 2012-12-24 06:37 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845950

Flume Tunables & Gotcha’s

There are a lot of tunables to play with and carefully consider in the example configs above. I included the documentation links for each component and I highly recommend you review it. Lets specifically look at some things that might cause you frustration while getting started.

First, interceptors. If you look at our HDFS sink path, you’ll see the path includes “log_type”, “host”, and a date. That data is associated with an event when the source grabs it, it is meta-data headers on each event. You associate that data with the event using an “interceptor”. So look back at the source where we ‘gtail’ our log file and you’ll see that we’re using interceptors to associate the log_type, “host”, and date with each event.

Secondly, by default Flume’s HDFS sink writes out SequenceFiles. This seems fine until you run Pig or Hive and get inconsistent or usual results back. Ensure that you specify the “fileType” as “DataStream” and the “writeFormat” as “Text”.

Lastly, there are 3 triggers that will cause Flume to “roll” the HDFS output file: size, count, and interval. When Flume writes data, if any one of the triggers is true it will roll to use a new file. By default the count is 30 (seconds), size is 1024 (bytes), and count is 10. Think about that, if any of those is true the file is rolled. So you end up with a LOT of HDFS files, which may or may not be what you want. Setting any value to 0 disables that type of rolling.

Analysis using Pig

Pig is a great tool for the Java challenged. Its quick, easy, and repeatable. The only real challenge is in accurately describing the data your asking it to chew on.

The PiggyBank library can provide you with a set of loaders which can save you from regex hell. The following is an example of using Pig on my Flume ingested Apache combined format logs using thePiggyBank “CombinedLogLoader”:

# cd /opt/pig
# ./bin/pig
2012-12-23 10:32:56,053 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0-SNAPSHOT (r: unknown) compiled Dec 23 2012, 10:29:56
2012-12-23 10:32:56,054 [main] INFO  org.apache.pig.Main - Logging error messages to: /opt/pig-0.10.0/pig_1356258776048.log
2012-12-23 10:32:56,543 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://
2012-12-23 10:32:57,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at:

grunt> REGISTER /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar;
grunt> raw = LOAD '/flume/events/apache_access_combined/cuddletech.com/12-12-24/''
    USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader
    AS (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer, userAgent);
grunt> agents = FOREACH raw GENERATE userAgent;
grunt> agents_uniq = DISTINCT agents;
grunt> DUMP agents_uniq;

(Mozilla/5.0 ())
(Recorded Future)

While Pig is easy enough to install (unpack and run), you must build the Piggybank JAR, which means you’ll need a JDK and Ant. On a SmartMachine with Pig installed in /opt/pig, it’d look like this:

# pkgin in sun-jdk6-6.0.26 apache-ant
# cd /opt/pig/
# ant
# cd /opt/pig/contrib/piggybank/java
# ant
     [echo]  *** Creating pigudf.jar ***
      [jar] Building jar: /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar

Total time: 5 seconds

Analysis using Hive

Similar to Pig, the challenge with Hive is really just describing the schema around the data. Thankfully there is assistance out therefor just this problem.

[root@hadoop02 /opt/hive]# bin/hive
Logging initialized using configuration in jar:file:/opt/hive-0.9.0-bin/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201212241029_318322444.txt
    >   host STRING,
    >   identity STRING,
    >   user STRING,
    >   time STRING,
    >   request STRING,
    >   status STRING,
    >   size STRING,
    >   referer STRING,
    >   agent STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    >   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
    >   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    > )
    > LOCATION '/flume/events/apache_access_combined/cuddletech.com/12-12-24/';
Time taken: 7.514 seconds

Now you can query to your hearts content. Please note that in the above example if you omit the “EXTERNAL” keyword when creating the table that Hive will move your data into its own data warehouse directory, which may not be what you want.

Next Steps

Hadoop provides an extremely powerful set of tools to solve very big problems. Pig and Hive are easy to use and very powerful. Flume-NG is an excellent tool for reliably moving data and extremely extensible. There is a lot I’m not getting into here, like using file-backed or database backed channels in Flume to protect against node failure thus increasing delivery reliability, or using multi-tiered aggregation by using intermediate Flume agents (meaning, Avro Source to Avro Sink)… there is a lot of fun things to explore here. My hope is that I’ve provided you with an additional source of data to help you on your way.

If you start getting serious with Hadoop, I highly recommend you buy the following O’Reilly books for Hadoop, which are very good and will save you a lot of time wasted in trial-and-error:

A Friendly Warning

In closing, I feel it necessarily to point out the obvious. For most people there is no reason to do any of this. Hadoop is a Peterbiltfor data. You don’t use a Peterbilt for a job that can be done with a Ford truck, its not worth the time, money and effort.

When I’ve asked myself “How big must data be for it to be big data?” I’ve come up with the following rule: If a “grep” of a file takes more than 5 minutes, its big. If the file can not be reasonably sub-divided to be smaller files or any query requires examining multiple files, then it might be Hadoop time.

For most logging applications, I strongly recommend either Splunk(if you can afford it) or using Rsyslog/Logstash and ElasticSearch, they are far more suited to the task with less hassle, less complexity and much more functionality.

This entry was posted on Thursday, December 27th, 2012 at 1:20 am and is filed under DevOpsSysAdmin. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

[repost ]Analyzing Apache logs with Pig


(guest blog post by Dmitriy Ryaboy)

A number of organizations donate server space and bandwidth to the Apache Foundation; when you download Hadoop, Tomcat, Maven, CouchDB, or any of the other great Apache projects, the bits are sent to you from a large list of mirrors. One of the ways in which Cloudera supports the open source community is to host such a mirror.

In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as apt-get install pig or yum install hadoop-pig.

There are many software packages that can do this kind of analysis automatically for you on average-sized log files, of course. However, many organizations log so much data and require such custom analytics that these ordinary approaches cease to work. Hadoop provides a reliable method for scaling storage and computation; PigLatin provides an expressive and flexible language for data analysis.

Our log files are in Apache’s standard CombinedLogFormat. It’s a tad more complicated to parse than tab- or comma- delimited files, so we can’t just use the built-in PigLoader().  Luckily, there is already a custom loader in the Piggybank built specifically for parsing these kinds of logs.

First, we need to get the PiggyBank from Apache. The PiggyBank is a collection of useful add-ons (UDFs) for Pig, contributed by the Pig user community. There are instructions on the Pig website for downloading and compiling the PiggyBank. Note that you will need to make sure to add pig.jar to your CLASSPATH environment variable before running ant.

Now, we can start our PigLatin script by registering the piggybank jarfile and defining references to methods we will be using.

1.register /home/dvryaboy/src/pig/trunk/piggybank.jar;
2.DEFINE LogLoader
4.DEFINE DayExtractor

By the way — the PiggyBank contains another useful loader, called MyRegExLoader, which can be instantiated with any regular expression when you declare it with a DEFINE statement. Useful in a pinch.

While we are working on our script, it may be useful to run in local mode, only reading a small sample data set (a few hundred lines). In production we will want to run on a different file. Moreover, if we like the reports enough to automate them, we may wish to run the report every day, as new logs come in. This means we need to parameterize the source data location. We will also be using a database that maps geographic locations to IPs, and we probably want to parametrize that as well.

1.%default LOGS 'access_log.small'
2.%default GEO 'GeoLiteCity.dat'

To specify a different value for a parameter, we can use the -param flag when launching the pig script:

pig -x mapreduce -f scripts/blogparse.pig -param LOGS='/mirror.cloudera.com/logs/access_log.*'

For mapping IPs to geographic locations, we use a third-party database from MaxMind.  This database maps IP ranges to countries, regions, and cities.  Since the data from MaxMind lists IP ranges, and our logs list specific IPs, a regular join won’t work for our purposes. Instead, we will write a simple script that takes a parsed log as input, looks up the geo information using MaxMind’s Perl module, and outputs the log with geo data prepended.

The script itself is simple — it reads in a tuple representing a parsed log record, checks the first field (the IP) against the database, and prints the data back to STDOUT :

01.#!/usr/bin/env perl
02.use warnings;
03.use strict;
04.use Geo::IP::PurePerl;
06.my ($path)=shift;
07.my $gi = Geo::IP::PurePerl->new($path);
09.while (<>) {
11.if (/([^\t]*)\t(.*)/) {
12.my ($ip, $rest) = ($1, $2);
13.my ($country_code, undef, $country_name, $region, $city)
14.= $gi->get_city_record($ip);
15.print join("\t", $country_code||'', $country_name||'',
16.$region||'', $city||'', $ip, $rest), "\n";

Getting this script into Pig is a bit more interesting. The Pig Streaming interface provides us with a simple way to ship scripts that will process data, and cache any necessary objects (such as the GeoLiteCity.dat file we downloaded from MaxMind).  However, when the scripts are shipped, they are simply dropped into the current working directory. It is our responsibility to ensure that all dependencies—such as the Geo::IP::PurePerl module—are satisfied. We could install the module on all the nodes of our cluster; however, this may not be an attractive option. We can ship the module with our script—but in Perl, packages are represented by directories, so just dropping the .pm file into cwd will not be sufficient, and Pig doesn’t let us ship directory hierarchies.  We solve this problem by packing the directory into a tarball, and writing a small Bash script called “ipwrapper.sh” that will set up our Perl environment when invoked:

1.#!/usr/bin/env bash
2.tar -xzf geo-pack.tgz
3.PERL5LIB=$PERL5LIB:$(pwd) ./geostream.pl $1

The geo-pack.tgz tarball simply contains geostream.pl and Geo/IP/PurePerl.pm .

We also want to make the GeoLiteCity.dat file available to all of our nodes. It would be inefficient to simply drop the file in HDFS and reference it directly from every mapper, as this would cause unnecessary network traffic.  Instead, we can instruct Pig to cache a file from HDFS locally, and use the local copy.

We can relate all of the above to Pig in a single instruction:

1.DEFINE iplookup `ipwrapper.sh $GEO`
2.ship ('ipwrapper.sh')

We can now write our main Pig script. The objective here is to load the logs, filter out obviously non-human traffic, and using the rest, calculate the distribution of downloads by country and by Apache project.

Load the logs:

1.logs = LOAD '$LOGS' USING LogLoader as
2.(remoteAddr, remoteLogname, user, time, method,
3.uri, proto, status, bytes, referer, userAgent);

Filter out records that represent non-humans (Googlebot and such), aren’t Apache-related, or just check the headers and do not download contents.

01.logs = FILTER logs BY bytes != '-' AND uri matches '/apache.*';
03.-- project just the columns we will need
04.logs = FOREACH logs GENERATE
06.DayExtractor(time) as day, uri, bytes, userAgent;
08.-- The filtering function is not actually in the PiggyBank.
09.-- We plan on contributing it soon.
10.notbots = FILTER logs BY (NOT

Get country information, group by country code, aggregate.

01.with_country = STREAM notbots THROUGH `ipwrapper.sh $GEO`
02.AS (country_code, country, state, city, ip, time, uri, bytes, userAgent);
04.geo_uri_groups = GROUP with_country BY country_code;
06.geo_uri_group_counts = FOREACH geo_uri_groups GENERATE
08.COUNT(with_country) AS cnt,
09.SUM(with_country.bytes) AS total_bytes;
11.geo_uri_group_counts = ORDER geo_uri_group_counts BY cnt DESC;
13.STORE geo_uri_group_counts INTO 'by_country.tsv';

The first few rows look like:

Country Hits Bytes
USA 8906 2.0458781232E10
India 3930 1.5742887409E10
China 3628 1.6991798253E10
Mexico 595 1.220121453E9
Colombia 259 5.36596853E8

At this point, the data is small enough to plug into your favorite visualization tools. We wrote a quick-and-dirty python script to take logarithms and use the Google Chart API to draw this map:

Bytes by Country

This is pretty interesting. Let’s do a breakdown by US states.

Note that with the upcoming Pig 0.3 release, you will be able to have multiple stores in the same script, allowing you to re-use the loading and filtering results from earlier steps. With Pig 0.2, this needs to go in a separate script, with all the required DEFINEs, LOADs, etc.

01.us_only = FILTER with_country BY country_code == 'US';
03.by_state = GROUP us_only BY state;
05.by_state_cnt = FOREACH by_state GENERATE
07.COUNT(us_only.state) AS cnt,
08.SUM(us_only.bytes) AS total_bytes;
10.by_state_cnt = ORDER by_state_cnt BY cnt DESC;
12.store by_state_cnt into 'by_state.tsv';

Theoretically, Apache selects an appropriate server based on the visitor’s location, so our logs should show a heavy skew towards California. Indeed, they do (recall that the intensity of the blue color is based on a log-scale).

Bytes by US State

Now, let’s get a breakdown by project. To get a rough mapping of URI to Project, we simply get the directory name after /apache in the URI. This is somewhat inaccurate, but good for quick prototyping. This time around, we won’t even bother writing a separate script — this is a simple awk job, after all! Using streaming, we can process data the same way we would with basic Unix utilities connected by pipes.

01.uris = FOREACH notbots GENERATE uri;
03.-- note that we have to escape the dollar sign for $3,
04.-- otherwise Pig will attempt to interpret this as a Pig variable.
05.project_map = STREAM uris
06.THROUGH `awk -F '/' '{print \$3;}'` AS (project);
08.project_groups = GROUP project_map BY project;
10.project_count = FOREACH project_groups GENERATE
12.COUNT(project_map.project) AS cnt;
14.project_count = ORDER project_count BY cnt DESC;
16.STORE project_count INTO 'by_project.tsv';

We can now take the by_project.tsv file and plot the results (in this case, we plotted the top 18 projects, by number of downloads).
Downloads by Project

We can see that Tomcat and Httpd dwarf the rest of the projects in terms of file downloads, and the distribution appears to follow a power-law.

We’d love to hear how folks are using Pig to analyze their data. Drop us a line, or comment below!