Tag Archives: analysis

[repost ]Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig


Big Data is the hotness, there is no doubt about it.  Every year its just gotten bigger and bigger and shows no sign of slowing.  There is a lot out there about big data, but despite the hype, there isn’t a lot of good technical content for those who want to get started.  The lack of technical how-to info is made worse by the fact that many Hadoop projects have moved their documentation around over time and Google searches commonly point to obsolete docs.  My intent here is to provide some solid guidance on how to actually get started with practical uses of Hadoop and to encourage others to do the same.

From an SA perspective, the most interesting Hadoop sub-projects have been those for log transport, namely Scribe, Chukwa, and Flume.  Lets examine each.

Log Transport Choices

Scribe was created at Facebook and got a lot of popularity early on due to adoption at high profile sites like Twitter, but development has apparently ceased  and word is that Facebook stopped using it themselves.  So Scribe is off my list.

Chukwa is a confusing beast, its said to be distributed with Hadoop’s core but its just an old version in the same sub-directory of the FTP site, the actual current version is found under the incubator sub-tree.  It is a very comprehensive solution, including a web interface for log analysis, but that functionality is based on HBase, which is fine if you want to use HBase but may be a bit more than you wish to chew off for simple Hive/Pig analysis.  Most importantly, the major Hadoop distributions from HortonWorks,MapR, and Cloudera use Flume instead.  So if your looking for a comprehensive toolset for log analysis, Chukwa is worth checking out, but if you simply need to efficiently get data into Hadoop for use by other Hadoop components, Flume is the clear choice.

That brings us to Flume, more specifically Flume-NG.  The first thing to know about Flume is that there were major changes to Flume pre and post 1.0, major enough that they took to refering to pre 1.0 as “Flume OG” (“Old generation” or “Origonal Gangsta” depending on your mood) and the new post 1.0 releases as “Flume NG”.  Whenever looking at documentation or help on the web about Flume be certain as to which you are looking at!  In particular, stay away from the Flume CWiki pages,  refer only to theflume.apache.org.  I say that because there is so much old cruft in the CWiki pages that you can be easily mislead and become frustrated, so just avoid it.

Now that we’ve thinned out the available options, what can we do with Flume?

Getting Started with Flume

Flume is a very sophisticated tool for transporting data.  We are going to focus on log data, however it can transport just about anything you throw at it.  For our purposes we’re going to use it to transport Apache log data from a web server back to our Hadoop cluster and store it in HDFS where we can then operate on it using other Hadoop tools.

Flume NG is a java application that, like other Hadoop tools, can be downloaded, unpacked, configured and run, without compiling or other forms of tinkering.  Download the latest “bin” tarball and untar it into /opt and rename or symlink to “/opt/flume” (it doesn’t matter where you put it, this is just my preference).  You will need to have Java already installed.

Before we can configure Flume its important to understand its architecture.  Flume runs as an agent.  The agent is sub-divided into 3 categories: sourceschannels, and sinks.  Inside the Flume agent process there is a pub-sub flow between these 3 components.  A source accepts or retrieves data and sends it into a channel.  Data then queues in the channel.  A sink takes data from the channel and does something with it.  There can be multiple sources, multiple channels, and multiple sinks per agent.  The only important thing to remember is that a source can write to multiple channels, but a sink can draw from only one channel.

Lets take an example.  A “source” might tail a file.  New log lines are sent into a channel where they are queued up.  A “sink” then extracts the log lines from the channel and writes them into HDFS.

At first glance this might appear overly complicated, but the distinct advantage  here is that the channel de-couples input and output, which is important if you have performance slowdowns in the sinks.  It also allows the entire system to be plugin-based.  Any number of new sinks can be created to do something with data… for instance, Casandra sinks are available, there is an IRC sink for writing data into an IRC channel.  Flume is extremely flexible thanks to this architecture.

In the real world we want to collect data from a local file, send it across the network and then store it centrally.  In Flume we’d accomplish this by chaining agents together.  The “sink” of one agent sends to the “source” of another.  The standard method of sending data across the network with Flume is using Avro.  For our purposes here you don’t need to know anything about Avro except one of the things it can do is to move data over the network.  Here is what this ultimately looks like:

So on our web server, we create a /opt/flume/conf/flume.confthat looks like this:

## Flume NG Apache Log Collection
## Refer to https://cwiki.apache.org/confluence/display/FLUME/Getting+Started
# http://flume.apache.org/FlumeUserGuide.html#exec-source
agent.sources = apache
agent.sources.apache.type = exec
agent.sources.apache.command = gtail -F /var/log/httpd/access_log
agent.sources.apache.batchSize = 1
agent.sources.apache.channels = memoryChannel
agent.sources.apache.interceptors = itime ihost itype
# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
agent.sources.apache.interceptors.itime.type = timestamp
# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
agent.sources.apache.interceptors.ihost.type = host
agent.sources.apache.interceptors.ihost.useIP = false
agent.sources.apache.interceptors.ihost.hostHeader = host
# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
agent.sources.apache.interceptors.itype.type = static
agent.sources.apache.interceptors.itype.key = log_type
agent.sources.apache.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on (Hadoop Slave Node)
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
agent.sinks = AvroSink
agent.sinks.AvroSink.type = avro
agent.sinks.AvroSink.channel = memoryChannel
agent.sinks.AvroSink.hostname =
agent.sinks.AvroSink.port = 4545

## Debugging Sink, Comment out AvroSink if you use this one
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
#agent.sinks = localout
#agent.sinks.localout.type = file_roll
#agent.sinks.localout.sink.directory = /var/log/flume
#agent.sinks.localout.sink.rollInterval = 0
#agent.sinks.localout.channel = memoryChannel

This configuration looks overwhelming at first, but it breaks down simply into an “exec” source, a “memory” channel, and an “Avro” sink, with additional parameters specified for each. The syntax for each is in the following form:

agent_name.sources = source1 source2 ...
agent_name.sources.source1.type = exec

agent_name.channel = channel1 channel2 ...
agent_name.channel.channel1.type = memory

agent_name.sinks = sink1 sink2 ...
agent_name.sinks.sink1.type = avro

In my example the agent name was “agent”, but you can name it anything you want. You will specify the agent name when you start the agent, like this:

$ cd /opt/flume
$ bin/flume-ng agent -f conf/flume.conf -n agent

Now that our agent is running on the web server, we need to setup the other agent which will deposit logs lines into HDFS. This type of agent is commonly called a “collector”. Here is the config:

## Sources #########################################################
## Accept Avro data In from the Edge Agents
# http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind =
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ########################################################
## Source writes to 2 channels, one for each sink (Fan Out)
collector.channels = mc1 mc2

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ###########################################################
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem (Debugging)
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
# http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /flume/events/%{log_type}/%{host}/%y-%m-%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

This configuration is a little different because the source accepts Avro network events and then sends them into 2 memory channels (“fan out”) which feed 2 different sinks, one for HDFS and another for a local log file (for debugging). We start this agent like so:

# bin/flume-ng agent -f conf/flume.conf -n collector

Once both sides are up, you should see data moving. Use “hadoop fs -lsr /flume” to examine files there and if you included the file_roll sink, look in /var/log/flume.

# hadoop fs -lsr /flume/events
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com
drwxr-xr-x   - root supergroup          0 2012-12-24 09:50 /flume/events/apache_access_combined/cuddletech.com/12-12-24
-rw-r--r--   3 root supergroup     224861 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845948
-rw-r--r--   3 root supergroup      85437 2012-12-24 06:27 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845949
-rw-r--r--   3 root supergroup     195381 2012-12-24 06:37 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845950

Flume Tunables & Gotcha’s

There are a lot of tunables to play with and carefully consider in the example configs above. I included the documentation links for each component and I highly recommend you review it. Lets specifically look at some things that might cause you frustration while getting started.

First, interceptors. If you look at our HDFS sink path, you’ll see the path includes “log_type”, “host”, and a date. That data is associated with an event when the source grabs it, it is meta-data headers on each event. You associate that data with the event using an “interceptor”. So look back at the source where we ‘gtail’ our log file and you’ll see that we’re using interceptors to associate the log_type, “host”, and date with each event.

Secondly, by default Flume’s HDFS sink writes out SequenceFiles. This seems fine until you run Pig or Hive and get inconsistent or usual results back. Ensure that you specify the “fileType” as “DataStream” and the “writeFormat” as “Text”.

Lastly, there are 3 triggers that will cause Flume to “roll” the HDFS output file: size, count, and interval. When Flume writes data, if any one of the triggers is true it will roll to use a new file. By default the count is 30 (seconds), size is 1024 (bytes), and count is 10. Think about that, if any of those is true the file is rolled. So you end up with a LOT of HDFS files, which may or may not be what you want. Setting any value to 0 disables that type of rolling.

Analysis using Pig

Pig is a great tool for the Java challenged. Its quick, easy, and repeatable. The only real challenge is in accurately describing the data your asking it to chew on.

The PiggyBank library can provide you with a set of loaders which can save you from regex hell. The following is an example of using Pig on my Flume ingested Apache combined format logs using thePiggyBank “CombinedLogLoader”:

# cd /opt/pig
# ./bin/pig
2012-12-23 10:32:56,053 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0-SNAPSHOT (r: unknown) compiled Dec 23 2012, 10:29:56
2012-12-23 10:32:56,054 [main] INFO  org.apache.pig.Main - Logging error messages to: /opt/pig-0.10.0/pig_1356258776048.log
2012-12-23 10:32:56,543 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://
2012-12-23 10:32:57,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at:

grunt> REGISTER /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar;
grunt> raw = LOAD '/flume/events/apache_access_combined/cuddletech.com/12-12-24/''
    USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader
    AS (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer, userAgent);
grunt> agents = FOREACH raw GENERATE userAgent;
grunt> agents_uniq = DISTINCT agents;
grunt> DUMP agents_uniq;

(Mozilla/5.0 ())
(Recorded Future)

While Pig is easy enough to install (unpack and run), you must build the Piggybank JAR, which means you’ll need a JDK and Ant. On a SmartMachine with Pig installed in /opt/pig, it’d look like this:

# pkgin in sun-jdk6-6.0.26 apache-ant
# cd /opt/pig/
# ant
# cd /opt/pig/contrib/piggybank/java
# ant
     [echo]  *** Creating pigudf.jar ***
      [jar] Building jar: /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar

Total time: 5 seconds

Analysis using Hive

Similar to Pig, the challenge with Hive is really just describing the schema around the data. Thankfully there is assistance out therefor just this problem.

[root@hadoop02 /opt/hive]# bin/hive
Logging initialized using configuration in jar:file:/opt/hive-0.9.0-bin/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201212241029_318322444.txt
    >   host STRING,
    >   identity STRING,
    >   user STRING,
    >   time STRING,
    >   request STRING,
    >   status STRING,
    >   size STRING,
    >   referer STRING,
    >   agent STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    >   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
    >   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    > )
    > LOCATION '/flume/events/apache_access_combined/cuddletech.com/12-12-24/';
Time taken: 7.514 seconds

Now you can query to your hearts content. Please note that in the above example if you omit the “EXTERNAL” keyword when creating the table that Hive will move your data into its own data warehouse directory, which may not be what you want.

Next Steps

Hadoop provides an extremely powerful set of tools to solve very big problems. Pig and Hive are easy to use and very powerful. Flume-NG is an excellent tool for reliably moving data and extremely extensible. There is a lot I’m not getting into here, like using file-backed or database backed channels in Flume to protect against node failure thus increasing delivery reliability, or using multi-tiered aggregation by using intermediate Flume agents (meaning, Avro Source to Avro Sink)… there is a lot of fun things to explore here. My hope is that I’ve provided you with an additional source of data to help you on your way.

If you start getting serious with Hadoop, I highly recommend you buy the following O’Reilly books for Hadoop, which are very good and will save you a lot of time wasted in trial-and-error:

A Friendly Warning

In closing, I feel it necessarily to point out the obvious. For most people there is no reason to do any of this. Hadoop is a Peterbiltfor data. You don’t use a Peterbilt for a job that can be done with a Ford truck, its not worth the time, money and effort.

When I’ve asked myself “How big must data be for it to be big data?” I’ve come up with the following rule: If a “grep” of a file takes more than 5 minutes, its big. If the file can not be reasonably sub-divided to be smaller files or any query requires examining multiple files, then it might be Hadoop time.

For most logging applications, I strongly recommend either Splunk(if you can afford it) or using Rsyslog/Logstash and ElasticSearch, they are far more suited to the task with less hassle, less complexity and much more functionality.

This entry was posted on Thursday, December 27th, 2012 at 1:20 am and is filed under DevOpsSysAdmin. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

[repost ]Social Network Analysis的作品





在Social Network的Influence、Diffusion方面做得比较好的,Yahoo的Francesco Bonchihttp://t.cn/zYpBh21,微软的Wei Chen http://t.cn/zYpBh2u, 清华的Jie Tanghttp://t.cn/zYpBh23 还有哪些人的作品有些独有的思路和风格?

青春的sikun:Social Network各大实验组(含paper+data)SNAP-http://t.cn/zY0L8KBBen Zhao@UCSB-http://t.cn/zY0L8Ku Huan Liu@ASU-http://t.cn/zY0L8Km Jiawei Han@UIUC-http://t.cn/zY0L8K3 Philip Yu@UIC; Gummandi@MPI-http://t.cn/zY0L8K1

刘知远THU:赞,不过竟然不提Jon Kleinberg和Jure Leskovec。

[repost ]R, Octave, and Python: Which Suits Your Analysis Needs?


Analysts and engineers on a budget are turning to R, Octave and Python instead of data analysis packages from proprietary vendors. But which of those is right for your needs?

Some businesses want all the benefits of a top-shelf data analysis package, but lack the budget to purchase one from SAS Institute, MathWorks, or another established, proprietary vendor.

However, analysts can still rely on open-source software and online-learning resources to bring data-mining capabilities into their organization. In fact, many are turning to ROctave and Python with exactly this goal in mind.

Why Those Three?

When it comes to machine learning (the creation of algorithms that allow machines to recognize and react to patterns), matrix decomposition algorithms are critical. R, Octave and Python are flexible and easy to use for vectorization and matrix operations; they’re not just data-analysis packages, but also programming languages for creating one’s own functions or packages.

For analysts who lack the time to engage in extensive coding, these open-source packages also offer some very handy built-in functions and toolboxes. For example, both R and Octave have simple zscore functions for computing Z-Score; for Python, the function can be defined in a very straightforward manner:

def zscore(X):
mu = mean(X,None)
sigma = samplestd(X)
return (array(X)-mu)/sigma

If you want to use MCMC Bayesian estimation, R boasts MCMCpack, Octave includes pmtk3, and Python has PyMC.

All three options feature large and growing user communities (i.e., the R mailing list) that serve as vital hubs for sharing information and exchanging experiences.

Which Software Package to Choose?

Can any one of these packages do more than the other two? The answer is probably no; the three functionalities have a lot in common. That being said, R is popular among statisticians thanks to its emphasis on statistical computing. Octave has a number of industry and academic applications, and engineers and analysts often utilize Python for building software platforms. It would definitely prove easier for someone who has worked with Matlab to pick up Octave, as Octave is often described as the open source “clone” for Matlab.

My suggestion is to try all three, and see which offering’s toolbox solves your specific problems. As previously mentioned, R’s strength is in statistical analysis. Octave is good for developing Machine Learning algorithms for numeric problems. Python is a general programming language strong in algorithm building for both number and text mining.

Based on my own user experience and research, here is a high-level summary for the three:

If you don’t have time or need to learn an entire programming language, an online universe of open-source software can provide you with multiple solutions for your specific needs. Take a little time to experiment and find the one that fits best. When searching for open source solutions, it’s a good idea to search both for the broad terms such as machine learning, data mining or artificial intelligence, along with specific implementations such as neural networks.

No matter what your skill level, open source software may have a solution for you. Open source software can range from all-in-one solutions to code libraries for sophisticated users who want a more customized solution. So whether you’re looking to learn simple regression or robotic vision, open source may have an ideal solution for you.

slashdot (http://s.tt/1vWxH)

[repost ]Easy, Real-Time Big Data Analysis Using Storm


Conceptually straightforward and easy to work with, Storm makes handling big data analysis a breeze.
Today, companies regularly generate terabytes of data in their daily operations. The sources include everything from data captured from network sensors, to the Web, social media, transactional business data, and data created in other business contexts. Given the volume of data being generated, real-time computation has become a major challenge faced by many organizations. A scalable real-time computation system that we have used effectively is the open-source Storm tool, which was developed at Twitter and is sometimes referred to as “real-time Hadoop.” However, Storm is far simpler to use than Hadoop in that it does not require mastering an alternate universe of new technologies simply to handle big data jobs.

This article explains how to use Storm. The example project, called “Speeding Alert System,” analyzes real-time data and raises a trigger and relevant data to a database, when the speed of a vehicle exceeds a predefined threshold.


Whereas Hadoop relies on batch processing, Storm is a real-time, distributed, fault-tolerant, computation system. Like Hadoop, it can process huge amount of data—but does so in real time — with guaranteed reliability; that is, every message will be processed. Storm also offers features such as fault tolerance and distributed computation, which make it suitable for processing huge amounts of data on different machines. It has these features as well:

  • It has simple scalability. To scale, you simply add machines and change parallelism settings of the topology. Storm’s usage of Hadoop’s Zookeeper for cluster coordination makes it scalable for large cluster sizes.
  • It guarantees processing of every message.
  • Storm clusters are easy to manage.
  • Storm is fault tolerant: Once a topology is submitted, Storm runs the topology until it is killed or the cluster is shut down. Also, if there are faults during execution, reassignment of tasks is handled by Storm.
  • Topologies in Storm can be defined in any language, although typically Java is used.

To follow the rest of the article, you first need to install and set up Storm. The steps are straightforward:

  • Download the Storm archive from the official Storm website.
  • Unpack the bin/ directory onto your PATH and make sure the bin/storm script is executable.

Storm Components

A Storm cluster mainly consists of a master and worker node, with coordination done by Zookeeper.

Master Node:

The master node runs a daemon, Nimbus, which is responsible for distributing the code around the cluster, assigning the tasks, and monitoring failures. It is similar to the Job Tracker in Hadoop.

Worker Node:

The worker node runs a daemon, Supervisor, which listens to the work assigned and runs the worker process based on requirements. Each worker node executes a subset of a topology. The coordination between Nimbus and several supervisors is managed by a Zookeeper system or cluster.


Zookeeper is responsible for maintaining the coordination service between the supervisor and master. The logic for a real-time application is packaged into a Storm “topology.” A topology consists of a graph of spouts (data sources) and bolts (data operations) that are connected with stream groupings (coordination). Let’s look at these terms in greater depth.


In simple terms, a spout reads the data from a source for use in the topology. A spout can either be reliable or unreliable. A reliable spout makes sure to resend a tuple (which is an ordered list of data items) if Storm fails to process it. An unreliable spout does not track the tuple once it’s emitted. The main method in a spout is nextTuple(). This method either emits a new tuple to the topology or it returns if there is nothing to emit.


A bolt is responsible for all the processing that happens in a topology. Bolts can do anything from filtering to joins, aggregations, talking to files/databases, and so on. Bolts receive the data from a spout for processing, which may further emit tuples to another bolt in case of complex stream transformations. The main method in a bolt is execute(), which accepts a tuple as input. In both the spout and bolt, to emit the tuple to more than one stream, the streams can be declared and specified in declareStream().

Stream Groupings:

A stream grouping defines how a stream should be partitioned among the bolt’s tasks. There are built-in stream groupings provided by Storm: shuffle grouping, fields grouping, all grouping, one grouping, direct grouping, and local/shuffle grouping. Custom implementation by using theCustomStreamGrouping interface can also be added.


For our use case, we designed one topology of spout and bolt that can process a huge amount of data (log files) designed to trigger an alarm when a specific value crosses a predefined threshold. Using a Storm topology, the log file is read line by line and the topology is designed to monitor the incoming data. In terms of Storm components, the spout reads the incoming data. It not only reads the data from existing files, but it also monitors for new files. As soon as a file is modified, spout reads this new entry and, after converting it to tuples (a format that can be read by a bolt), emits the tuples to the bolt to perform threshold analysis, which finds any record that has exceeded the threshold.

The next section explains the use case in detail.

Threshold Analysis

In this article, we will be mainly concentrating on two types of threshold analysis: instant threshold and time series threshold.

  • Instant threshold checks if the value of a field has exceeded the threshold value at that instant and raises a trigger if the condition is satisfied. For example, raise a trigger if the speed of a vehicle exceeds 80 km/h.
  • Time series threshold checks if the value of a field has exceeded the threshold value for a given time window and raises a trigger if the same is satisfied. For example, raise a trigger if the speed of a vehicle exceeds 80 km/h more than once in last five minutes.

Listing One shows a log file of the type we’ll use, which contains vehicle data information such as vehicle number, speed at which the vehicle is traveling, and location in which the information is captured.


Listing OneA log file with entries of vehicles passing through the checkpoint.

AB 123, 60, North city
BC 123, 70, South city
CD 234, 40, South city
DE 123, 40, East city
EF 123, 90, South city
GH 123, 50, West city

A corresponding XML file is created, which consists of the schema for the incoming data. It is used for parsing the log file. The schema XML and its corresponding description are shown in the table.

The XML file and the log file are in a directory that is monitored by the spout constantly for real-time changes. The topology we use for this example is shown in Figure 1.

Figure 1: Topology created in Storm to process real-time data.

As shown in Figure 1, the FileListenerSpout accepts the input log file, reads the data line by line, and emits the datea to the ThresoldCalculatorBolt for further threshold processing. Once the processing is done, the contents of the line for which the threshold is calculated is emitted to the DBWriterBolt, where it is persisted in the database (or an alert is raised). The detailed implementation for this process is explained next.

Spout Implementation

Spout takes a log file and the XML descriptor file as the input. The XML file consists of the schema corresponding to the log file. Let us consider an example log file, which has vehicle data information such as vehicle number, speed at which the vehicle is travelling, and location in which the information is captured. (See Figure 2.)

Figure 2: Flow of data from log files to Spout.


Conceptually straightforward and easy to work with, Storm makes handling big data analysis a breeze.
Listing Two shows the specific XML file for a tuple, which specifies the fields and the delimiter separating the fields in a log file. Both the XML file and the data are kept in a directory whose path is specified in the spout.

Listing TwoAn XML file created for describing the log file.





An instance of Spout is initialized with constructor parameters of DirectoryPath, andTupleInfo object. The TupleInfo object stores necessary information related to log file such as fields, delimiter, and type of field. This object is created by serializing the XML file usingXStream.

Spout implementation steps are:

  • Listen to changes on individual log files. Monitor the directory for the addition of new log files.
  • Convert rows read by the spout to tuples after declaring fields for them.
  • Declare the grouping between spout and bolt, deciding the way in which tuples are given to bolt.

The code for Spout is shown in Listing Three.

Listing ThreeLogic in Open, nextTuple, and declareOutputFields methods of Spout.

public void open( Map conf, TopologyContext context,SpoutOutputCollector collector ) 
    _collector = collector;
	fileReader  =  new BufferedReader(new FileReader(new File(file)));
    catch (FileNotFoundException e) 

public void nextTuple() 
    protected void ListenFile(File file) 
	RandomAccessFile access = null; 
	String line = null;                  
           while ((line = access.readLine()) != null)
               if (line !=null)
                    String[] fields=null;
                    if (tupleInfo.getDelimiter().equals("|"))
                       fields = line.split("\\"+tupleInfo.getDelimiter());
                    else                         		                                                                			 fields = line.split(tupleInfo.getDelimiter());                                                
                    if (tupleInfo.getFieldList().size() == fields.length)
                       _collector.emit(new Values(fields)); 
      catch (IOException ex) { } 	          

public void declareOutputFields(OutputFieldsDeclarer declarer) 
     String[] fieldsArr = new String [tupleInfo.getFieldList().size()];
     for(int i=0; i<tupleInfo.getFieldList().size(); i++)
         fieldsArr[i] = tupleInfo.getFieldList().get(i).getColumnName();
     declarer.declare(new Fields(fieldsArr));

declareOutputFields() decides the format in which the tuple is emitted, so that the bolt can decode the tuple in a similar fashion. Spout keeps on listening to the data added to the log file and as soon as data is added, it reads and emits the data to the bolt for processing.

Conceptually straightforward and easy to work with, Storm makes handling big data analysis a breeze.

Bolt Implementation

The output of Spout is given to Bolt for further processing. The topology we have considered for our use case consists of two bolts as shown in Figure 3.

Figure 3: Flow of data from Spout to Bolt.


The tuples emitted by spout is received by the ThresholdCalculatorBolt for threshold processing. It accepts several of inputs for threshold check. The inputs it accepts are:

  • Threshold value to check
  • Threshold column number to check
  • Threshold column data type
  • Threshold check operator
  • Threshold frequency of occurrence
  • Threshold time window

A class, shown Listing Four, is defined to hold these values.

Listing FourThresholdInfo class.

public class ThresholdInfo implements Serializable
    private String action;
    private String rule;
    private Object thresholdValue;
    private int thresholdColNumber;
    private Integer timeWindow;
    private int frequencyOfOccurence;


Based on the values provided in fields, the threshold check is made in the execute() method as shown in Listing Five. The code mostly consists of parsing and checking the incoming values.

Listing FiveCode for threshold check.

public void execute(Tuple tuple, BasicOutputCollector collector)
        List<Object> inputTupleList = (List<Object>) tuple.getValues();
        int thresholdColNum = thresholdInfo.getThresholdColNumber();
        Object thresholdValue = thresholdInfo.getThresholdValue();
        String thresholdDataType =
        Integer timeWindow = thresholdInfo.getTimeWindow();
        int frequency = thresholdInfo.getFrequencyOfOccurence();
            String valueToCheck = inputTupleList.get(thresholdColNum-1).toString();
            String frequencyChkOp = thresholdInfo.getAction();
                long curTime = System.currentTimeMillis();
                long diffInMinutes = (curTime-startTime)/(1000);
                             if(count.get() > frequency)
                    else if(frequencyChkOp.equals("!="))
                             if(count.get() > frequency)
                         System.out.println("Operator not supported");
                         if(count.get() > frequency)
                 else if(frequencyChkOp.equals("!="))
                          if(count.get() > frequency)
           else if(thresholdDataType.equalsIgnoreCase("int") ||
                   thresholdDataType.equalsIgnoreCase("double") ||
                   thresholdDataType.equalsIgnoreCase("float") ||
                   thresholdDataType.equalsIgnoreCase("long") ||
               String frequencyChkOp = thresholdInfo.getAction();
                    long valueToCheck =
                    long curTime = System.currentTimeMillis();
                    long diffInMinutes = (curTime-startTime)/(1000);
                    System.out.println("Difference in minutes="+diffInMinutes);
                             if(valueToCheck < Double.parseDouble(thresholdValue.toString()))
                                  if(count.get() > frequency)
                         else if(frequencyChkOp.equals(">"))
                              if(valueToCheck > Double.parseDouble(thresholdValue.toString()))
                                  if(count.get() > frequency)
                          else if(frequencyChkOp.equals("=="))
                             if(valueToCheck == Double.parseDouble(thresholdValue.toString()))
                                 if(count.get() > frequency)
                          else if(frequencyChkOp.equals("!="))
   . . .
          System.err.println("Emitting null in bolt");


Conceptually straightforward and easy to work with, Storm makes handling big data analysis a breeze.
The tuples emitted by the threshold bolt are passed to the next corresponding bolt, which is the DBWriterBolt bolt in our case.


The processed tuple has to be persisted for raising a trigger or for further use. DBWriterBoltdoes the job of persisting the tuples into the database. The creation of a table is done inprepare(), which is the first method invoked by the topology. Code for this method is given in Listing Six.

Listing SixCode for creation of tables.

public void prepare( Map StormConf, TopologyContext context ) 
    catch (ClassNotFoundException e) 
        System.out.println("Driver not found");

       connection driverManager.getConnection( 
           "jdbc:mysql://"+databaseIP+":"+databasePort+"/"+databaseName, userName, pwd);
       connection.prepareStatement("DROP TABLE IF EXISTS "+tableName).execute();

       StringBuilder createQuery = new StringBuilder(
           "CREATE TABLE IF NOT EXISTS "+tableName+"(");
       for(Field fields : tupleInfo.getFieldList())
               createQuery.append(fields.getColumnName()+" VARCHAR(500),");
               createQuery.append(fields.getColumnName()+" "+fields.getColumnType()+",");
       createQuery.append("thresholdTimeStamp timestamp)");

       // Insert Query
       StringBuilder insertQuery = new StringBuilder("INSERT INTO "+tableName+"(");
       String tempCreateQuery = new String();
       for(Field fields : tupleInfo.getFieldList())
       insertQuery.append("thresholdTimeStamp").append(") values (");
       for(Field fields : tupleInfo.getFieldList())

       prepStatement = connection.prepareStatement(insertQuery.toString());
    catch (SQLException e) 

Insertion of data is done in batches. The logic for insertion is provided in execute() as shown in Listing Seven, and consists mostly of parsing the variety of different possible input types.

Listing Seven: Code for insertion of data.

public void execute(Tuple tuple, BasicOutputCollector collector) 
       List<Object> inputTupleList = (List<Object>) tuple.getValues();
       int dbIndex=0;
       for(int i=0;i<tupleInfo.getFieldList().size();i++)
           Field field = tupleInfo.getFieldList().get(i);
           try {
               dbIndex = i+1;
                   prepStatement.setString(dbIndex, inputTupleList.get(i).toString());
               else if(field.getColumnType().equalsIgnoreCase("int"))
               else if(field.getColumnType().equalsIgnoreCase("long"))
               else if(field.getColumnType().equalsIgnoreCase("float"))
               else if(field.getColumnType().equalsIgnoreCase("double"))
               else if(field.getColumnType().equalsIgnoreCase("short"))
               else if(field.getColumnType().equalsIgnoreCase("boolean"))
               else if(field.getColumnType().equalsIgnoreCase("byte"))
               else if(field.getColumnType().equalsIgnoreCase("Date"))
                  Date dateToAdd=null;
                  if (!(inputTupleList.get(i) instanceof Date))  
                       DateFormat df = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
                           dateToAdd = df.parse(inputTupleList.get(i).toString());
                       catch (ParseException e) 
                           System.err.println("Data type not valid");
			dateToAdd = (Date)inputTupleList.get(i);
			java.sql.Date sqlDate = new java.sql.Date(dateToAdd.getTime());
			prepStatement.setDate(dbIndex, sqlDate);
		catch (SQLException e) 
	Date now = new Date();			
	    prepStatement.setTimestamp(dbIndex+1, new java.sql.Timestamp(now.getTime()));
	    if (counter.get()== batchSize) 
	catch (SQLException e1) 
        long curTime = System.currentTimeMillis();
       long diffInSeconds = (curTime-startTime)/(60*1000);
       if(counter.get()<batchSize && diffInSeconds>batchTimeWindowInSeconds)
            try {
                startTime = System.currentTimeMillis();
            catch (SQLException e) {

public void executeBatch() throws SQLException
    counter = new AtomicInteger(0);

Once the spout and bolt are ready to be executed, a topology is built by the topology builder to execute it. The next section explains the execution steps.

Running and Testing the Topology in a Local Cluster

  • Define the topology using TopologyBuilder, which exposes the Java API for specifying a topology for Storm to execute.
  • Using Storm Submitter, we submit the topology to the cluster. It takes name of the topology, configuration, and topology as input.
  • Submit the topology.

Listing EightBuilding and executing a topology.

public class StormMain
     public static void main(String[] args) throws AlreadyAliveException, 
          ParallelFileSpout parallelFileSpout = new ParallelFileSpout();
          ThresholdBolt thresholdBolt = new ThresholdBolt();
          DBWriterBolt dbWriterBolt = new DBWriterBolt();
          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("spout", parallelFileSpout, 1);
          builder.setBolt("thresholdBolt", thresholdBolt,1).shuffleGrouping("spout");
          if(this.argsMain!=null && this.argsMain.length > 0) 
                   this.argsMain[0], conf, builder.createTopology());
              Config conf = new Config();
              LocalCluster cluster = new LocalCluster();
              "Threshold_Test", conf, builder.createTopology());


After building the topology, it is submitted to local cluster. Once the topology is submitted, it runs until it is explicitly killed or the cluster is shut down without requiring any modifications. This is another big advantage of Storm.

This comparatively simple example shows the ease with which it’s possible to set up and use Storm once you understand the basic concepts of topology, spout, and bolt. The code is straightforward and both scalability and speed are provided by Storm. So, if you’re looking to handle big data and don’t want to traverse the Hadoop universe, you might well find that using Storm is a simple and elegant solution.

[repost ]IBM LanguageWare Resource Workbench



An Eclipse application for building custom language analysis into IBM LanguageWare resources and their associated UIMA annotators.


IBM LanguageWare Resource WorkbenchActions

Update: July 20, 2012:
Studio 3.0 is out and it is officially bundled with ICA 3.0. If you are a Studio 3.0 user, please use ICA forum instead of LRW forum. LRW is a fixpack that resolves issues in various areas including the Parsing Rules editor, PEAR file export and Japanese/Chinese language support. LRW is still available for download on the Downloads link for IBM OmniFind Enterprise Edition V9.1 Fix Pack users.

What is IBM LanguageWare?

IBM LanguageWare is a technology which provides a full range of text analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions which focus on mining facts from large repositories of text. LanguageWare is the ideal solution for extracting the value locked up in unstructured text information and exposing it to business applications. With the emerging importance of Business Intelligence and the explosion in text-based information, the need to exploit this “hidden” information has never been so great. LanguageWare technology not only provides the functionality to address this need, it also makes it easier than ever to create, manage and deploy analysis engines and their resources.

It comprises Java libraries with a large set of features and the linguistic resources that supplement them. It also comprises an easy-to-use Eclipse-based development environment for building custom text analysis applications. In a few clicks, it is possible to create and deploy UIMA (Unstructured Information Management Architecture) annotators that perform everything from simple dictionary lookups to more sophisticated syntactic and semantic analysis of texts using dictionaries, rules and ontologies.

The LanguageWare libraries provide the following non-exhaustive list of features: dictionary look-up and fuzzy look-up, lexical analysis, language identification, spelling correction, hyphenation, normalization, part-of-speech disambiguation, syntactic parsing, semantic analysis, facts/entities extraction and relationship extraction. For more details see the documentation.

The LanguageWare Resource Workbench provides a complete development environment for the building and customization of dictionaries, rules, ontologies and associated UIMA annotators. This environment removes the need for specialist knowledge of the underlying technologies of natural language processing or UIMA. In doing so, it allows the user to focus on the concepts and relationships of interest, and to develop analyzers which extract them from text without having to write any code. The resulting application code is wrapped as UIMA annotators, which can be seamlessly plugged into any application that is UIMA-compliant. Further information about UIMA is available on the UIMA Apache site

LanguageWare is used in such various products as Lotus Notes and Domino, Information Integrator OmniFind Edition (IBM’s search technology), and more.

The LanguageWare Resource Workbench technology runs on Microsoft Windows and Linux. The core LanguageWare libraries support a much broader list of platforms. For more details on platform support please see the product documentation.

How does it work?

The LanguageWare Resource Workbench allows users to easily:

  • Develop rules to spot facts, entities and relationships using a simple drag and drop paradigm
  • Build language and domain resources into a LanguageWare dictionary or ontology
  • Import and export dictionary data to/from a database
  • Browse the dictionaries to assess their content and quality
  • Test rules and dictionaries in real-time on documents
  • Create UIMA annotators for annotating text with the contents of dictionaries and rules
  • Annotate text and browse the contents of each annotation.

The Workbench contains the following tools:

  • A dictionary viewer/editor
  • An XML-based dictionary builder
  • A Database-based dictionary builder (IBM DB2 and Apache Derby support are provided)
  • A dictionary comparison tool
  • A rule viewer/editor/builder
  • A UIMA annotator generator, which allows text documents to be annotated and the results displayed.
  • A UIMA CAS (common annotation structure) comparator, which allows you to compare the results of two different analyses through comparing the CASes generated by each run.

The LanguageWare Resource Workbench documentation is available online and is also installed using the Microsoft Windows or Linux installers or using the respective .zip files.

What type of application is LanguageWare suitable for?

LanguageWare technology can be used in any application that makes use of text analytics. Good examples are:

  • Business Intelligence
  • Information Search and Retrieval
  • The Semantic Web (in particular LanguageWare supports semantic analysis of documents based on ontologies)
  • Analysis of Social Networks
  • Semantic tagging applications
  • Semantic search applications
  • Any application wishing to extract useful data from unstructured text

For Web-based semantic query of the LanguageWare text analytics, you might be interested in checking out IBM Data Discovery and Query Builder. When used together, these two technologies can provide a full range of data access services including UI presentation, security and auditing of users, structured and unstructured data access through semantic concepts and deep text analytics of unstructured data elements.

More information

About the technology author(s)

LanguageWare is a worldwide organization comprising a highly qualified team of specialists with a diverse combination of backgrounds: linguists, computer scientists, mathematicians, cognitive scientists, physicists, and computational linguists. This team is responsible for developing innovative Natural Language Processing technology for IBM Software Group.

LanguageWare, along with LanguageWare Resource Workbench, is a collaborative project combining skills, technologies, and ideas gathered from various IBM product teams and IBM Research division.

Platform requirements

Operating systems: Microsoft Windows XP, Microsoft Windows Vista, or SUSE Linux Enterprise Desktop 11

Hardware: Intel 32-bit platforms (tested)


  • Java 5 SR5 or above (not compatible with Java 5 SR4 or below)
  • Apache UIMA SDK 2.3 (required by LanguageWare Annotators in order to run outside the Workbench)

Notes: Other platforms and JDK implementations may work but have not been significantly tested.

Installation instructions

Installing LanguageWare Resource Workbench and LanguageWare Demonstrator

On each platform, there are two methods of installation.
On Microsoft Windows:

  • Download the lrw.win32.install.exe or Demonstrator.exe and launch the installation.

On Linux:

  • Download the lrw.linux.install.bin or Demonstrator.bin and launch the installation (e.g., by running sh ./lrw.linux.install.bin).

LanguageWare Training and Enablement

The LanguageWare Training Material is a set of presentations that walk you through the use of LanguageWare Resource Workbench. It uses a step by step approach with examples and practice exercises to build up your knowledge of the application and understand data modeling. Please follow the logical succession of the material, and make sure you finish the sample exercise.
Please use this link to access the presentation decks.


  • Intel is the trademarks or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
  • Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
  • Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
  • Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.


Tab navigation

1. What is the LanguageWare Resource Workbench? Why should I use it?The LanguageWare Resource Workbench is a comprehensive Eclipse-based environment for developing UIMA Analyzers. It allows you to build Domain Extraction Models (lexical resources and parsing rules, see FAQ 8) which describe the entities and relationships you wish to extract, creating custom annotators tailored to your specific needs. These annotators can be easily exported as PEAR files (see FAQ 10) and installed into any UIMA pipeline or deployed onto an ICA server. If you satisfy the following criteria, then you will want to use LanguageWare Resource Workbench:

  • You need a robust open standards (UIMA) text analyzer that can be easily customized to your specific domain and analysis challenges.
  • You need a technology that will enable you to exploit your existing structured data repositories in the analysis of the unstructured sources.
  • * You need a technology that allows you to build custom domain models that become your intellectual property and differentiation in the marketplace.
  • need a technology that is multi-lingual, multi-platform, multi-domain, and high performance.

Back to top

2. What’s new in this version of LanguageWare Resource Workbench?

All new features are outlined in the LanguageWare Resource Workbench Release Notes, ReleaseNotes.htm, which is located in the Workbench installation directory.

Back to top

3. Where should I start with LanguageWare?

The best way to get started with LanguageWare is to install the LanguageWare Resource Workbench. Check the training videos provided above or the training material (to be posted soon); it will introduce you to the Workbench and show you how it works.

Back to top

4. What documentation is available to help me use LanguageWare?

Context-sensitive help is provided as part of the LanguageWare Resource Workbench. There is an online help system shipped with the LanguageWare Resource Workbench (under Help / Help Contents). Check the training videos provided above or the training material (to be posted soon). More detailed information about the underlying APIs will be provided for fully-licensed users of the technology.

Back to top

5. What are the known limitations with this release of the LanguageWare Resource Workbench?

Any problems or limitations are outlined in the LanguageWare Resource Workbench Release Notes, ReleaseNote.htm, which is located in the LanguageWare Resource Workbench installation directory and is part of the LanguageWare Resource Workbench Help System.

Back to top

6. What version of UIMA do I need to use the LanguageWare annotators?

LanguageWare Resource Workbench ships with, and has been tested against, Apache UIMA, Version 2.3. They should work with newer versions of Apache UIMA; however, they have not been extensively tested for compatibility. Therefore, we would recommend Apache UIMA v2.3. The LanguageWare annotators are not compatible with versions of UIMA prior to 2.1. These were released by IBM and have namespace conflict with Apache UIMA.

Back to top

7. What is a Domain Extraction Model (or “model,” “annotator,” “analyzer”)? How do I build a good Domain Extraction Model?

A “model” is the set of resources you build to describe what you want to extract from the data. The models are a combination of:

  • The morphological resources, which describe the basic language characteristics
  • The lexical resources, which describe the entities/concepts that you want to recognize
  • The POS tagger resource
  • The parsing rules, which describe how concepts combine to generate new entities and relationships.

The process of building data models is an iterative process within the LanguageWare Resource Workbench.

Back to top

8. How do I change the default editor for new file types in the LanguageWare Resource Workbench?

Go to Window / Preferences / General / Editors / File Associations. If the content type is already listed, just add a new editor and pick the LanguageWare Text Editor. You can set this to be the default, or alternatively leave it as an option that you can choose, on right click, whenever you open a file of that type. You will need to restart the LanguageWare Resource Workbench before this comes into effect. Note: Eclipse remembers the last viewer you used for a file type so if you opened a document with a different editor beforehand you may need to right-click on the file and explicitly choose the LanguageWare Resource Workbench Text Editor the first time on restart.

Back to top

9. How do I integrate the UIMA Analyzers that I develop in the LanguageWare Resource Workbench?

Once you have completed building your Domain Extraction Models (dictionaries and rules), the LanguageWare Resource Workbench provides an “Export as UIMA Pear” function under File / Export. This will generate a PEAR file that contains all the code and resources required to run your pipeline in any UIMA-enabled application, that is, in a UIMA pipeline.

Back to top

10. How is my data stored?

The LanguageWare Resource Workbench is designed to primarily help you to build your domain extraction models and this includes databases in which you can store your data. The LanguageWare Resource Workbench ships with an embedded database (Derby, open source); however, it can also connect to an enterprise database, such as DB2.

Back to top

11. What licensing conditions apply for LanguageWare, for academic purposes, or for commercial use?LanguageWare is licensed through the IBM Content Analytics License at http://www-01.ibm.com/software/data/cognos/products/cognos-content-analytics/.

Back to top

13. Is Language Identification identifying the wrong language?

Sometimes the default amount of text (1024 characters) used by Language Identification is not enough to disambiguate the correct language. This happens specially when languages are quite close or when the text analysed may include text in more than one language. In this case, it may help to increase the MaxCharsToExamine parameter. To do this, select from the LWR menu: Window > Preferences > LanguageWare > UIMA Annotation Display. Enable the checkbox for “Show edit advanced configuration option on pipeline stages.” Select “Apply” and “OK.” Next time you open a UIMA Pipeline Configuration file, you will notice an Advanced Configuration link at the Document Language stage. Click on it to expand and display its contents, notice the MaxCharsToExamine parameter can be edited. Change the default number displayed to a bigger threshold. Save your changes and try again to see if the Language Identification has improved.

Back to top

13. Why is the LanguageWare Resource Workbench shipped as an Eclipse-based application?

We built the LanguageWare Resource Workbench on Eclipse because it provides a collaborative framework through which we can share components with other product teams across IBM, with our partners, and with our customers. This version of the LanguageWare Resource Workbench is a complete, stand-alone application. However, users can still get the benefits of the Eclipse IDE by installing Eclipse features into the Workbench. Popular features include the Eclipse CVS feature for managing shared projects and the Eclipse XML feature for full XML editing support. See the Eclipse online help for more information about finding and installing new features. It is important to understand that while the LanguageWare Resource Workbench is Eclipse-based, the Annotators that are exported from the LanguageWare Resource Workbench (under File / Export) can be installed into any UIMA pipeline and can be deployed in a variety of ways. The LanguageWare Resource Workbench team, as part of the commercial LanguageWare Resource Workbench license, provides integration source code to simplify the overall deployment and integration effort. This includes UIMA serializers, CAS consumers, and APIs for integrating into through C/JNI, Eclipse, Web Services (REST), and others.

Back to top

14. What languages are supported by the LanguageWare Resource Workbench?The following languages are fully supported, ie, LanguageID, Lexical Analysis and Part of Speech Disambiguation.






Chinese (Simplified)


Chinese (Traditional)





















For the following languages, a lexical dictionary without part of speech disambiguation can be made available upon request: Afrikaans, Catalan, Greek, Norwegian (Bokmal), Norwegian (Nynorsk), Russian and Swedish. These dictionaries are provided “AS-IS” (i.e. they have not been maintained and will not be supported. While feedback on them is much appreciated, requests for changes, fixes or queries will only be addressed if adequately planned and sufficiently funded).
Back to top

15. Does LanguageWare support GB18030?

LanguageWare annotators support UTF-16, and this qualifies as GB18030 support. This does mean that you need to translate the text from GB18030 to UTF-16 at the document ingestion stage. Java will do this automatically for you (in the collection reader stage) as long as the correct encoding is specified when reading files.

Please note that text in GB18030 extension B may contain characters outside the Unicode Basic Multilingual Plane. Currently the default LanguageWare break rules would incorrectly split such characters into two tokens. If support for these rare characters is required, the attached break rules file can be used to ensure the proper handling of 4-byte characters. (Note that the file zh-surrogates.dic for Chinese is wrapped in zh-surrogates.zip.)

Back to top

16. Are you experiencing problems with the LanguageWare Resource Workbench UI on Linux platforms?

There is a known issue on Ubuntu with Eclipse and the version of the GTK+ toolkit that prevents toolbars being drawn properly or buttons working properly with mouse clicks. The fix is explained here:
To provide a work around you need to create an environment variable “GDK_NATIVE_WINDOWS=1” before loading up the LanguageWare Resource Workbench. Another issue was reported for Ubuntu 9.10 (Karmic) with LanguageWare Resource Workbench showing an empty dialog window when starting. The issue is explained here:
To provide a work around you need to add “-Dorg.eclipse.swt.browser.XULRunnerPath=/dev/null” line to lrw.ini file in the LanguageWare Resource Workbench installation folder.

Back to top

17. Any other questions not covered here?

Please use the “Forum” to post your questions and we will get back to you.

Back to top

18. How do I upgrade my version of the LRW?

On Windows and Linux, each version of the LRW is a separate application. You should not install a new version over a previous one, instead make sure to either uninstall your previous version, or ensure that all versions are installed in separate locations. If in doubt, the default use of the LRW installers will ensure this behaviour.

The projects you created with older versions of the LRW will never be removed by the uninstall process. You can point your new version of the LRW at the same data workspace during startup.

Back to top


Bookmarks logo

Download Text Analytics Tools and Runtime for IBM LanguageWare

Updated by AlexisTipper|Dec 12 2011|Tags: download
Bookmarks logo

LanguageWare Forum

Updated by Amine_Akrout|Sep 14 2011|Tags:
Bookmarks logo


Updated by iron-horse|Jun 21 2011|Tags: alphaworks
Bookmarks logo

developerWorks Community

Updated by iron-horse|Jun 21 2011|Tags: dw
View All

[repost ]PIN analysis


A good friend of mine, Ian, recently forwarded me an internet joke. The headline was something like:

“All credit card PIN numbers in the World leaked”

The body of the message simply said 0000 0001 0002 0003 0004 

Ian’s messages made me chuckle. Then, later the same day, I read this XKCD cartoon. The merging of these two humorous topics created the seed for this article.


I love Randall’s work. My favorite, to date, is this one. I have a signed copy of it on my office wall.

Like many of his creations, this cartoon is excellent at bifurcating readers; people read it, then either smile and chuckle, or stare blankly at it followed by a “Huh? I don’t get it!” comment. Then you explain it, and get a reply“Yeeaaaaaa…no, I still don’t get it!”

Esoteric humor in action.

You can be cool and buy his signed artwork too.


What is the least common PIN number?

There are 10,000 possible combinations that the digits 0-9 can be arranged to form a 4-digit pin code. Out of these ten thousand codes, which is the least commonly used?

Which of these pin codes is the least predictable?

Which of these pin codes is the most predictable?

If you were given the task of trying to crack a random credit card by repeatedly trying PIN codes, what order should you try guessing to maximize your chances of selecting the correct number in the shortest time?

If you had to make predication about what the least commonly used 4-digit PIN is, what would be your guess?

This tangentially relates to the XKCD cartoon. In Randall’s cartoon, the perpetrator’s plan backfired because his selected license plate was so unique that it was very memorable. What is the least memorable license plate? Ask any spy you know (snigger) what the best way to blend into a crowd is. Their answer will be not stand out, to appear “normal”, and not be notable in any way.

People are notoriously bad at generating random passwords. I hope this article will scare you into being a little more careful in how you select your next PIN number.

Are you curious about what the least commonly used PIN number might be?

How about the most popular?

Read on …


This article is not intended to be a hacker bible, or to be used as a utility, resource, or tool to help would-be thieves perform nefarious actions. I will only disclose data sufficient to make my points, and will try to avoid giving specific data outside of the obvious examples. I do not want to be an enabler for script-kiddies. Please do not email me asking for the database I used; if you do, you will be wasting your time as I’m not going to respond. I’m not going to sell, donate or release the source data – don’t ask!


Obviously, I don’t have access to a credit card PIN number database. Instead I’m going to use a proxy. I’m going to use data condensed from released/exposed/discovered password tables and security breaches.

Soap Box – Password Database Exposures

Over the years, there have been numerous password table security breaches: Some very high profile, some low profile, but all embarrassing (and many exceedingly expensive; both in direct fines and indirect loss of business through erosion of trust and reputation).

Fool me once, well, no, even that’s not really acceptable, but fool me twice … I’ll go even further: Any developer who stores the password table of their database in clear text should be so mortified by this lack of security that they should not be sleeping at night until they fix it. Ignoring the fact that you should never have ever coded it this way, you have an obligationto learn from these past breaches.

If you work for a company and are knowledgeable that your customer database is“protected” by such lightweight security then run, don’t walk, to your CEO/Presidents office, pound on the door and insist (s)he puts out a mandate to fix the matter with extreme prejudice. Don’t leave until you get an affirmative response. Badger, badger then badger them again. Make yourself a proverbial thorn in their side.

I’m not trying to sell my services as a consultant here (though if you are interested, my rates are very reasonable compared to the cost of legal defense, potential FTC sanctions, class action suits, shareholder backlash, fines, loss of reputation and business …) There are plenty of security experts in the industry who can help you (if you need help filtering them and don’t have referrals, someone who has CISSP qualifications is a good place to start).

 Bottom line  Security strengthens with layers, and the simple application of encryption on your database table can help protect your customer’s data if this table is exposed. It does not defend against all possible attacks, but it does nothing but good things. What possible reason is there store things in clear-text?

Back to the data

By combining the exposed password databases I’ve encountered, and filtering the results to just those rows that are exactly four digits long [0-9] the output is a database of all the four digit character combinations that people have used as their account passwords.

Given that users have a free choice for their password, if users select a four digit password to their online account, it’s not a stretch to use this as a proxy for four digit PIN codes.

The Data

I was able to find almost 3.4 million four digit passwords. Every single one of the of the 10,000 combinations of digits from 0000 through to9999 were represented in the dataset.

The most popular password is  1234  …
… it’s staggering how popular this password appears to be. Utterly staggering at the lack of imagination …
… nearly 11% of the 3.4 million passwords are  1234  !!!
The next most popular 4-digit PIN in use is  1111  with over 6% of passwords being this.

In third place is  0000  with almost 2%.

A table of the top 20 found passwords in shown at the right. A staggering 26.83% of all passwords could be guessed by attempting these 20 combinations!

(Statistically, with 10,000 possible combination, if passwords were uniformly randomly distributed, we would expect the these twenty passwords to account for just 0.2% of the total, not the 26.83% encountered)

Looking more closely at the top few records, all the usual suspects are present  1111   2222  3333  …  9999  as well as  1212  and (snigger)  6969 .

It’s not a surprise to see patterns like  1122  and  1313  occurring high up in the list, nor  4321  or 1010 .

2001  makes an appearance at #19.  1984  follows not far behind in position #26, and James Bond fans may be interested to know  0007  is found between the two of them in position #23 (another variant  0070  follows not much further behind at #28).

PIN Freq
#1 1234 10.713%
#2 1111 6.016%
#3 0000 1.881%
#4 1212 1.197%
#5 7777 0.745%
#6 1004 0.616%
#7 2000 0.613%
#8 4444 0.526%
#9 2222 0.516%
#10 6969 0.512%
#11 9999 0.451%
#12 3333 0.419%
#13 5555 0.395%
#14 6666 0.391%
#15 1122 0.366%
#16 1313 0.304%
#17 8888 0.303%
#18 4321 0.293%
#19 2001 0.290%
#20 1010 0.285%

The first “puzzling” password I encountered was  2580  in position #22. What is the significance of these digits? Why should so many people select this code to make it appear so high up the list?

Then I realized that  2580 is a straight down the middle of a telephone keypad!
(Interestingly, this is very compelling evidence confirming the hypothesis that a 4-digit password list is a great proxy for a PIN number database. If you look at the numeric keypad on a PC-keyboard you’ll see that 2580 is slightly more awkward to type on the PC than a phone because the order of keys on a keyboard is the inverted. Cash machines and other terminals that take credit cards use a phone style numeric pads. It appears that many people have an easy to type/remember PIN number for their credit card and are re-using the same four digits for their online passwords, where the “straight down the middle” mnemonic no longer applies).

(Another fascinating piece of trivia is that people seem to prefer even numbers over odd, and codes like  2468  occur higher than a odd number equivalent, such as  1357 ).

Cumulative Frequency

As noted above, the more popular password selections dominate the frequency tables. The most popular PIN code of  1234  is more popular than the lowest 4,200 codes combined!

That’s right, you might be able to crack over 10% of all codes with one guess! Expanding this, you could get 20% by using just five numbers!

Below is a cumulative frequency graph:

Statistically, one third of all codes can be guessed by trying just 61 distinct combinations!

The 50% cumulative chance threshold is passed at just 426 codes (far less than the 5,000 that a random uniformly distribution would predict). Paranoid yet?

Bottom of the pile

OK, we’ve investigated most frequently used PINS and found they tend to be predictable and easy to remember, let’s turn for a second to the bottom of the pile.

What are the least “interesting” (least used) PINS?

In my dataset the answer is  8068  with just 25 occurrences in 3.4 million (this equates to 0.000744%, far, far fewer than random distribution would predict, and five orders of magnitude behind the most popular choice).

To the right are the twenty least popular 4-digit passwords encountered.

 Warning  Now that we’ve learned that, historically, 8068  is (was?) the least commonly used password 4-digit PIN, please don’t go out and change yours to this! Hackers can read too! They will also be promoting 8068 up their attempt trees in order to catch people who read this (or similar) articles.

Check out about the Nash Equilibrium

PIN Freq
#9980 8557 0.001191%
#9981 9047 0.001161%
#9982 8438 0.001161%
#9983 0439 0.001161%
#9984 9539 0.001161%
#9985 8196 0.001131%
#9986 7063 0.001131%
#9987 6093 0.001131%
#9988 6827 0.001101%
#9989 7394 0.001101%
#9990 0859 0.001072%
#9991 8957 0.001042%
#9992 9480 0.001042%
#9993 6793 0.001012%
#9994 8398 0.000982%
#9995 0738 0.000982%
#9996 7637 0.000953%
#9997 6835 0.000953%
#9998 9629 0.000953%
#9999 8093 0.000893%
#10000 8068 0.000744%

Memorable Years

Many of the high frequency PIN numbers can be interpreted as years, e.g.  1967   1956   1937  … It appears that many people use a year of birth (or possibly an anniversary) as their PIN. This will certainly help them remember their code, but it greatly increases its predictability.

Just look at the stats: Every single  19??  combination can be found in the top fifth of the dataset!

Below is a plot of this in graphical format. In this chart, each yellow line represents a PIN number that starts  19??

If all the passwords were uniformly distributed, there should be no significant difference between the frequency of occurrence of, for instance,  1972  and any other PIN ending in seventy two  ??72 . However, as we shall see, this is not the case at all.

1972  occurs in ordinal position #76 (with a frequency 0.099363%). Here’s a histogram for the occurrences of all  ??72  probabilities.

You can clearly see the spike at  1972  (with smaller spikes at  7272  and  1472 )

If you calculate the ratio of the peak of  1972  to the average of all the other  ??72  PINS you get the ratio of  22:1

PINS starting with  19??  are much more likley to occur. Of course, it’s not just 1972. Here is plot of the ratio of 19 to non-19 for all hundred combinations. Along the x-axis are all the combinations of last two digits –XX, and for each of these the ratio of the 19XX to average of all the other ??XX occurrences has been calculated. Here’s the chart:

It’s a pretty good approximation for a demographic chart! (suggested by the red-dashed trend line) which would probably allow a fair estimation of the ages (years of birth) of the people using the various websites. (Of course, hackers invert this strategy and use the age of a target to try and give information to guess a user’s PIN. Looking at this graph, this might give them up to a 40x advantage!)

Just about all the ratios are above 1.0. The noteable exceptions are  ??34  and  ??00  (which are easy to explain, since the massive popularity of  1234  and  0000  dwarf  1934  and  1900 respectively). Simiarly  33   44   55   66  … are lower than expected as the quad codes like  3333  mask out even the  1933  boost.

There are also spikes in the graph corresponding to the popular PINS of  1919   1984  and  1999

Patterns in data

I love pretty ways to graphically vizualize data. Pictures really do paint thousands of words.

Another interesting way to visualize the PIN data is in this grid plot of the distribution. In this heatmap, the x-axis depicts the left two digits from [00] to [99] and the y-axis depicts the right two digits from[00] to [99]. The bottom left is  0000  and the top right is  9999 .

Color is used to represent frequency. The higher frequency occurences are yellow to white hot, and the lower frequency occurences are red, through dark red to black.

 Geek Note  The scaling is logarithmic.

You could look at this plot all day!

The bright line for the leading diagonal shows the repeated couplets that people love to use for their PIN numbers  0000   0101   0202  … 5454   5555   5656  …  9898   9999 .

Every eleventh dot on the leading diagonal is brighter corresponding to the quad numbers e.g.  4444   5555 . Here is a larger scale version:

Interesting things

There are so many interesting things to learn from this heatmap. Here are just a couple:

The first is the interesting harmonics of shading (seen here more easily in a gray scale plot).

You can make out a “grid pattern” in the plot.

The lighter areas corresponding to couplets of numbers that are close to each other. For some reason, people don’t like to select pairs of numbers that have larger numerical gaps between them. Combinations like  45  and  67  occur much more frequently than things like  29  and  37

Here we see the line corresponding to  19XX . The intensity the dots relates to the chart we plotted earlier

There are a large number of codes starting with 19, especially towards the higher end.

There is a strong bias towards the lower left quadrant. People love to start their PIN numbers with 0, and even more so with the digit 1.

The chart on the right shows the relative frequency of the first digit of 4-digit pin codes.

As you can see, the digit 1dominates (and it’s not all down to the  19XX  phenomenon.)

Little bright specs dot the plot in places corresponding to numerical runs (both ascending and descending) such as  2345  ,  4321  and  5678 .

I’ve highlighted just a couple on the plot to the left.

Jumps in steps of two are also visible e.g.  2468

Repeated-pair couplets of numbers are very common, such as  XYXY

The hundred sets of repeating couplet pairs represent a staggering 17.8% of all observed PIN numbers.

More than four

The purpose of this posting was to investigate patterns and frequency of four digit PIN numbers. However, the database I collected also hasall-numeric password of different lengths. It’s worth taking a quick look at these too.

I found close to 7 million all-numeric passwords. Approximately half of these were the four-digit codes we’ve just examined.

Six digit codes are the next most popular length, followed eight.

I hope, hope that the people who have passwords of nine digits long are not using their Social Security Numbers!


Below are the top 20 passwords for the various lengths, along with their share of their same-size namespace.

# 5 6 7 8 9 10
#1 12345 22.802% 123456 11.684% 1234567 3.440% 12345678 11.825% 123456789 35.259% 1234567890 20.431%
#2 11111 4.484% 123123 1.370% 7777777 1.721% 11111111 1.326% 987654321 3.661% 0123456789 2.323%
#3 55555 1.769% 111111 1.296% 1111111 0.637% 88888888 0.959% 123123123 1.587% 0987654321 2.271%
#4 00000 1.258% 121212 0.623% 8675309 0.465% 87654321 0.815% 789456123 1.183% 1111111111 2.087%
#5 54321 1.196% 123321 0.591% 1234321 0.220% 00000000 0.675% 999999999 0.825% 1029384756 1.293%
#6 13579 1.112% 666666 0.577% 0000000 0.188% 12341234 0.569% 147258369 0.591% 9876543210 0.971%
#7 77777 0.618% 000000 0.521% 4830033 0.158% 69696969 0.348% 741852963 0.455% 0000000000 0.942%
#8 22222 0.454% 654321 0.506% 7654321 0.154% 12121212 0.320% 111111111 0.425% 1357924680 0.479%
#9 12321 0.412% 696969 0.454% 5201314 0.128% 11223344 0.293% 123454321 0.413% 1122334455 0.441%
#10 99999 0.397% 112233 0.417% 0123456 0.124% 12344321 0.275% 123654789 0.378% 1234512345 0.402%
#11 33333 0.338% 159753 0.283% 2848048 0.124% 77777777 0.262% 147852369 0.356% 1234554321 0.380%
#12 00700 0.261% 292513 0.250% 7005425 0.120% 99999999 0.223% 111222333 0.304% 5555555555 0.259%
#13 90210 0.244% 131313 0.235% 1080413 0.111% 22222222 0.219% 963852741 0.255% 1212121212 0.244%
#14 88888 0.217% 123654 0.228% 7895123 0.107% 55555555 0.205% 321654987 0.253% 9999999999 0.231%
#15 38317 0.216% 222222 0.212% 1869510 0.102% 33333333 0.176% 420420420 0.241% 2222222222 0.219%
#16 09876 0.185% 789456 0.209% 3223326 0.100% 44444444 0.165% 007007007 0.227% 7777777777 0.206%
#17 44444 0.179% 999999 0.194% 1212123 0.096% 66666666 0.160% 135792468 0.164% 3141592654 0.195%
#18 98765 0.169% 101010 0.190% 1478963 0.088% 11112222 0.140% 397029049 0.158% 3333333333 0.186%
#19 01234 0.160% 777777 0.188% 2222222 0.085% 13131313 0.131% 012345678 0.154% 7894561230 0.165%
#20 42069 0.154% 007007 0.186% 5555555 0.082% 10041004 0.127% 123698745 0.152% 1234567891 0.161%

Some interesting observations (and a little speculation)

 For five digit passwords, users appear to have even less imagination in selecting their codes (22.8% select 12345). All the usual suspects occur, but a new addition is the puerile addition in position #20 of the concatenation of 420 and 69.

 For six digit password, again 696969 appears highly. Also of note is 159753 (a “X” mark over the numeric keypad). James Bond returns with 007007.

 For seven digits, the standby of 1234567 is a much lower frequency (though still the top). I speculate that this is because many people may be using their telephone number (without area code) as a seven digit password. Telephone numbers are fairly distinct, and already memorized, so when a seven digit code is needed, they spring to mind easily. The higher frequency of usage of telephone numbers reduces the need to use imagination (or lack thereof) and select something else.

 Is Jenny there? The fouth most popular seven digit password is 8675309 (It’s a popular 80’s song).

 Eight digit passwords are just as expected. Lots of pattern, and lots of repetition.

 Common nine digit passwords also follow patterns and repetition. 789456123 appears as an easy “Along the top, middle and bottom of the keypad” 147258369 is related in the vertical direction (and other variants appear high up). Again we get a 420 moment with 420420420, and also the shaken, not stirred, but repeated 007007007 returns.

 Interestingly for ten digits 1029384756 appears (alternating ascending/descending digits), as well as the odd/even 1357924680.

 Hurrah for math! In position #17 of the ten digit password list we get 3141592654 (The first few digits of Pi)


If you are a  developer  tester  or  executive  I hope you are sufficiently paranoid that you will immediatelycheck to see that your systems do not store sensitive information, like passwords, unencrypted. The entire reason I was able to perform this analysis is because dumb stupid and lazy coders stored information in clear text. Your lazyness has the potential to impact millions.

If you are a  consumer  and your recognize any of the numbers I’ve used in this article to be your passwords/pins I hope you apply common sense and immediately change them to something a little less predictable. Alternatively, you could be lazy and not change things (In that case, at least the only person you are harming with this apathy is yourself.)


Since publishing this article, it’s been brought to my attention that, of course, in addition to anniversary years, many people encapsulate dates in the format MMDD (such as birthdays …) for their PIN codes.

This clearly explains the lower left corner where, if you look at the heatmap, there is a huge contrast change at the height of around 30-31 (the number of days in a month), extending to 12 on the x-axis. (Thanks to zero79 for first pointing this out).

Many people also asked the significance of 1004 in the four character PIN table. This comes from Korean speakers. When spoken, “1004” is cheonsa (cheon = 1000, sa=4).

“Cheonsa” also happens to be the Korean word for Angel.

Another XKCD cartoon

It only seems appropriate to end with another XKCD cartoon. This one is Password Strength

Check out other interesting blogarticles.






[repost ]Keyword based Social Networks: Models, Algorithms and Analysis



Ankush Garg
September 2009
Computer Science
Keyword based Social Networks: Models, Algorithms and Analysis
In online social networks, users create a pro le by setting some attributes that
help in characterizing the user. We call these pro le attributes the keywords of the
user. In this work, we focus on social networks based on keywords and study how
keywords can be e ectively used to model and search in such networks. We make
two main contributions in this thesis.
First, we study the relationship between semantic similarity of user keywords
and the social network topology. We present a forest' model to categorize keywords
and de ne similarity functions between a pair of users. Based on that, we model
the social network topology and validate the model against a real life social network
Second, we study unstructured keyword based social networks and propose
an information
ow model to disseminate keywords. Then, based on the given
ow, we design and develop a search algorithm to nd users who have
keywords, present in the search query, as part of their pro le attributes. The
search algorithm is based on a linear combination of topological distance and trust
metrics. We observe an improvement in orders of magnitude when the search
algorithm is compared to breadth rst search.Keyword based Social Networks: Models,
Algorithms and Analysis
Ankush Garg
B.Tech., Computer Science and Engineering (IIT Delhi, India) 2007
Submitted in partial satisfaction of the requirements for the degree of
in the
of the
Approved by the Committee in Charge:
Dr. S. Felix Wu
Committee Chair and Professor of Computer Science
Dr. Charles U. Martel
Professor of Computer Science
Dr. Norman S. Matlo
Professor of Computer Science
iTo my parents: Smt. Sheela Garg and Late. Dr. K. C. Garg
मातदेवो भव। �पत ृ देवो भव॥ ृ - तै��र�योप�नषद १.११.२
Let your mother and father be like God unto you. – Taittiriya Upanishad 1.11.2
यं माता �पतरौ क्ेें सेते ंभवे नृृ ाम।म
न तसय �ननषृ�त� ेकया षत वष क ेतैर म ै�प॥ - मन स म�त ृ २.२२७
That trouble (and pain) which the parents undergo on the birth of (their)
children, cannot be compensated even in a hundred years. – Manu Smriti 2.227
In online social networks, users create a pro le by setting some attributes that
help in characterizing the user. We call these pro le attributes the keywords of the
user. In this work, we focus on social networks based on keywords and study how
keywords can be e ectively used to model and search in such networks. We make
two main contributions in this thesis.
First, we study the relationship between semantic similarity of user keywords
and the social network topology. We present a
forest’ model to categorize keywords
and de ne similarity functions between a pair of users. Based on that, we model
the social network topology and validate the model against a real life social network
Second, we study unstructured keyword based social networks and propose
an information
ow model to disseminate keywords. Then, based on the given
ow, we design and develop a search algorithm to nd users who have
keywords, present in the search query, as part of their pro le attributes. The
search algorithm is based on a linear combination of topological distance and trust
metrics. We observe an improvement in orders of magnitude when the search
algorithm is compared to breadth rst search.
I would like to express my gratitude to my advisors, Dr. S. Felix Wu and
Dr. Charles U. Martel, for guiding this work with utmost interest and care. I
am grateful to them for giving me the freedom to explore my ideas. I also thank
them for teaching excellent courses on Social Networks and Advanced Algorithms
respectively, that laid the foundation for my decision to work with them.
I would like to thank Dr. Norman S. Matlo for nding the time to serve on
my thesis committee and providing his valuable feedback.
I am thankful to my colleague, Prantik Bhattacharyya, for the assistance and
support provided by him. It was a nice experience working with him on various
concepts and writing papers together. I extend my thanks to members of the DSL
research team, namely Lerone Banks, Juan Lang, Shaozhi Shawn Ye, Thomas
Tran, Kelcey Chan, Xiaoming Lu, Matthew Spear, George and Ken Gribble for
providing an enjoyable research environment. A special thanks to Mika, Xiaoming
and Matt’s daughter, for her cherubic responses during the weekly meetings.
I owe a special gratitude to my brother, Vandit Garg, for supporting and mo-
tivating me during tough academic times. The encouragement provided by him
went a long way in helping me to move ahead in life in spite of the stuttering
problem. I would also like to thank my mother, Smt. Sheela Garg, for being a
constant source of strength in my life.
Finally, I would like to express my gratitude to my friend Sampath Goud who
helped me in zeroing in on the Sanskrit verses used on the dedication page.
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction 1
1.1 The Role of Keywords . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Social Network Model based on Keyword Categorization . . 3
1.2.2 Information Flow and Search in Unstructured Keyword based
Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Social Network Model based on Keyword Categorization 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Why Categorize Keywords? . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Social Network Modeling . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Forest Structure . . . . . . . . . . . . . . . . . . . . . . . . . 11
v2.3.2 Similarity Functions . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Social Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Analyzing the Facebook Network . . . . . . . . . . . . . . . 16
2.4.2 Analyzing the Simulated Network . . . . . . . . . . . . . . . 17
2.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 21
3 Information Flow and Search in Unstructured Keyword based So-
cial Networks 22
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Information Flow Model . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Keyword Propagation Process . . . . . . . . . . . . . . . . . 28
3.3.2 Identity Privacy Issues . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Information Maintenance . . . . . . . . . . . . . . . . . . . . 31
3.4 Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Selecting Topologically Closer Nodes . . . . . . . . . . . . . 32
3.4.2 Selecting Trustworthy Nodes . . . . . . . . . . . . . . . . . . 34
3.4.3 Querying with Keywords . . . . . . . . . . . . . . . . . . . . 35
3.5 Simulation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.1 Graph Generation . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.2 Trust Distribution . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.3 Information Propagation: Keywords and Policies . . . . . . 40
3.5.4 Query Distribution . . . . . . . . . . . . . . . . . . . . . . . 41
vi3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.1 Performance at various values . . . . . . . . . . . . . . . . 42
3.6.2 Performance at various hop values . . . . . . . . . . . . . . . 43
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Conclusion 48
References 54
viiList of Figures
2.1 Number of Distinct Keywords vs Keyword Frequency on log-log scale 9
2.2 Keyword Frequency Distribution . . . . . . . . . . . . . . . . . . . 10
2.3 Forest with two component trees . . . . . . . . . . . . . . . . . . . 12
2.4 s(u,v) vs k(u,v): All User Pairs . . . . . . . . . . . . . . . . . . . . 19
2.5 n(u,v) vs k(u,v): Direct Friends . . . . . . . . . . . . . . . . . . . . 20
3.1 Example of the community based social network . . . . . . . . . . . 26
3.2 Network with restrictive propagation policy . . . . . . . . . . . . . 42
(a) No. of query messages vs . . . . . . . . . . . . . . . . . . . 42
(b) Targets found/query message vs . . . . . . . . . . . . . . . . 42
3.3 Network with liberal propagation policy . . . . . . . . . . . . . . . 43
(a) No. of query messages vs . . . . . . . . . . . . . . . . . . . 43
(b) Targets found/query message vs . . . . . . . . . . . . . . . . 43
3.4 Analysis of network (liberal policy) for varying hops . . . . . . . . . 44
(a) %age of targets found vs hops . . . . . . . . . . . . . . . . . . 44
(b) Targets found/query message vs hops . . . . . . . . . . . . . . 44
viiiList of Tables
2.1 Sample Users with Keywords . . . . . . . . . . . . . . . . . . . . . 18
2.2 Weak Similarity with respect to Z . . . . . . . . . . . . . . . . . . . 18
3.1 Keyword Forwarding Table . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Keyword Received Table . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Query Messages and Targets Found per query at di erent hops for
the network with liberal policy and = 0:5 . . . . . . . . . . . . . . 45
Chapter 1
Social interactions form an integral part of the human behavior. With the advent
of the internet, a new platform has emerged for human beings to interact. A
whole range of online social networks (OSNs) like Facebook [1], Orkut [2], LinkedIn
[3], etc. have emerged that allow users to discover, establish and maintain their
relationships with others. Based on pro le information and mutual interest, users
become friends and join communities on the social network that allows them to
interact and share information with each other. In this work, we focus on social
networks based on keywords and study how keywords can be e ectively used to
model and search in social networks.
1.1 The Role of Keywords
In online social networks, users have to create a pro le by setting some attributes
that help in characterizing the user on the social network. We call these pro le
attributes the keywords of the user. These keywords not only de ne the interests,1.2 Research Contributions 2
passions and other traits of a user but also allow users to become friends with
people sharing common attributes. An important property of social networks is
that people tend to have attributes similar to those of their friends. Moreover,
people adopt activities based on the activities of the people that they are currently
friends with, and they form new friendships as a result of their existing activities
In Davis Social Links (DSL) [5, 6], keywords are being used to develop a new
internet model. The current communication model on the internet depends on
the concept of routable identity', i.e. an identity is all we need to communicate
with the owner of the identity, e.g. IP address, email address, etc. The key idea
of DSL is to route packets from one network entity to another only via the social
network paths between them. Further, the social paths are based on the keywords
that are set by users as their pro le attributes. Depending on the ranking of
social paths the receiver can eciently prioritize incoming packets (or messages)
thereby having a better control over who can contact him. Thus, we can say that
DSL tries to mimic the communication model in our human society and leverages
online social networks for designing the next generation internet communication
model. Deriving motivation from the DSL model, we study the role that keywords
play in social networks from di erent perspectives.
1.2 Research Contributions
In this thesis, we address two important questions. First, how to categorize key-
words e ectively so that they can be used to quantify the similarity between di er-
ent users and thereby how to model the social network? Second, how to disseminate1.2 Research Contributions 3
keywords and search for users in an unstructured keyword based social network
(e.g. DSL)? We summarize the main research contributions made by this thesis in
the following subsections.
1.2.1 Social Network Model based on Keyword Catego-
A user pro le on an online social network is characterized by its pro le entries or
keywords. We study the relationship between semantic similarity of user keywords
and the social network topology. First, we present a
forest’ model to categorize
keywords and de ne the notion of distance between keywords across multiple cat-
egorization trees (i.e., a forest). Second, we use the keyword distance to de ne
similarity functions between a pair of users and show how social network topology
can be modeled accordingly. Third, we validate our social network topology model,
using a simulated social graph, against a real life social graph dataset.
1.2.2 Information Flow and Search in Unstructured Key-
word based Social Networks
In online social networks (OSNs), user connections can be represented as a graph.
The network formed has distinct properties that distinguish it from other graph
topologies. In this work, we consider an unstructured keyword based social network
topology where each edge has a trust value associated with it to represent the
mutual relationship between the corresponding nodes. Users have keywords as
their pro le attributes that have policies associated with them to de ne abstractly
ow of keyword information and the accessibility to other users in the network.1.3 Organization 4
We also address privacy concerns as outlined in works on future OSN architectures.
Two key contributions are made in this part of the work. First, we develop
an information
ow model to disseminate keyword information when users add
keywords as their pro le attributes. Second, for keyword based queries, we design
and develop a search algorithm to nd users with the speci ed keywords in their
pro le attributes. It is based on a linear combination of topological distance and
trust metrics. It is also dynamic in nature such that it adapts itself for each
individual node during the search process. We observe an improvement in orders
of magnitude when the search algorithm is compared to breadth rst search.
1.3 Organization
In chapter 2, we discuss the proposed social network model based on keywords.
We rst analyze Facebook pro les to understand the keyword usage patterns and
then, develop the metrics for measuring the similarity between user pro les. This
work has been presented at the International Conference on Advances in Social
Networks Analysis and Mining (ASONAM ’09) [7] held at Athens, Greece in July,
2009 and appears in the conference proceedings.
Chapter 3 presents our proposed algorithms for information
ow and search in
unstructured keyword based social networks. We compare the search algorithm
with breadth rst search (BFS) for di erent keyword policies. We also present
a brief description of the search algorithms from the literature related to peer-to-
peer networks. This work has been accepted for publication in Workshop on Social
Mobile Web (SMW ’09) [8] to be held in conjunction with the IEEE International
Conference on Social Computing (SocialCom ’09) at Vancouver, Canada in August,1.3 Organization 5
2009 and will appear in the proceedings of SocialCom ’09.
We conclude the thesis in chapter 4 with a summary of the results. Here, we
also discuss the possibility of combining both the techniques and the development
of new search algorithms.6
Chapter 2
Social Network Model based on
Keyword Categorization
2.1 Introduction
In real life, individuals become friends when they share common interests or pas-
sions. Sociologists have termed this tendency of human beings as homophily'.
Similarly, on online social networks (OSNs), like Facebook [1] or Orkut [2], users
establish friendships when they discover similar pro le characteristics. The growth
of LinkedIn [3], a social networking website, demonstrates the impact of pro le
information very well. Its purpose is to help people build professional networks
and nd career development opportunities. Using LinkedIn, employers can look
into the pro le information of users to search for potential employees. Similarly,
it helps employees look for potential employers. We feel that categorizing pro le
information and correlating it with network topology constitutes an important step
towards the study of OSNs.2.1 Introduction 7
Social networks is a widely researched area. Milgram [9] tried to ascertain if
people in the society are linked by small chains. He asked people to forward letters
to their friends who they thought were likely to know the target person. Thus,
people implicitly made decisions based on their view of the geographical location or
professional links of their friends and the associated likelihood of successful delivery
of the letter. Lattice Model [10] uses geographical distance, a user trait, to model
social networks. Models based on interest [11] and hierarchy [12] have also been
proposed to model the friendship behavior of people. In Davis Social Links (DSL)
[5], the social map is de ned on the basis of keywords that are set by social peers
as their pro le attributes. Information transfer takes place only when a social path
exists between the end users. Thus, it seems that keywords will play an important
role in the development of future OSNs.
A typical user pro le on an OSN is characterized by its pro le entries (key-
words) like location, hometown, activities, interests, music, etc. It is important to
understand the use of keywords and how they can be used e ectively to classify
content in OSNs. Consider the scenario, where a newcomer in the city, say Bob,
would like to nd people interested in soccer. As he doesn't know anyone yet, he
tries his OSN pro le to search for soccer enthusiasts in the city but uses the word
football’ for the query. Though, both the words soccer' andfootball’ can refer
to the same sport, Bob’s query returns no successful results because traditional
residents use the word soccer' for the game. The system fails to understand the
underlying semantic relationship between the keyword entered by Bob and pro le
entries of other users. This shows the importance of extracting relationship(s) from
the diverse information provided by users.
Linguists have long been studying such relationships between words. Methods2.2 Why Categorize Keywords? 8
like Latent Semantic Indexing [13] explored semantics based relationship among
digital data. Similarity between users as a function of their topological distance
was studied in [14]. Here, we study the relationship between semantic similarity of
user keywords and the social network topology. First, we present a
forest’ model
to categorize keywords and de ne the notion of distance between keywords across
multiple categorization trees (i.e., a forest). Second, we use the keyword distance
to de ne similarity functions between a pair of users and show how social net-
work topology can be modeled accordingly. Third, we validate our social network
topology model, using a simulated social graph, against Facebook [1] data.
Section 2.2 presents our ndings on keyword usage patterns and discusses the
need to categorize keywords. Section 2.3 describes the forest' model to categorize
keywords. Here, we also propose functions to quantify similarity between users and
a social network model based on those functions. Section 2.4 deals with the meth-
ods that we used to evaluate and validate our model. Section 2.5 presents results
of experiments analyzing the proposed model and realistic data. We conclude in
section 2.6 with possible extensions.
2.2 Why Categorize Keywords?
A typical pro le on any OSN consists of numerous sections (e.g. Orkut [2] has
Social, Professional and Personal sections; Facebook has Basic, Education &amp; Work
and Personal Information sections) that characterize the user. These sections are
further sub-divided into various elds, e.g. the Personal section on Facebook [1]
has Interests, Activities, Favorite Movies, Books, etc. as elds. We call the entries
in these elds Keyword(s) as they represent user attributes. To understand the2.2 Why Categorize Keywords? 9
keyword usage patterns we analyzed 1265 unique Facebook
user pro les [15]. Most
of the elds contained proper nouns (e.g. movie names, albums, etc.) as entries,
hence, for all evaluation purposes, we restricted ourselves to keywords found in the
Interests (which contained words mostly from an English dictionary) eld.
10^0.00 10^0.95 10^1.23 10^1.40 10^1.58 10^1.74 10^1.88 10^2.22
Number of Distinct Keywords
Keyword Frequency
Frequency Distribution
Log (Frequency Distribution)
Figure 2.1: Number of Distinct Keywords vs Keyword Frequency on log-log scale
We looked at the magnitude of information given by users and how keywords
can be processed to extract meaningful knowledge. On average, each user provided
5.8 keywords for the Interests eld and the keyword set contained 1573 unique
keywords. To analyze the distribution of keywords, we plotted the number of
distinct keywords for a given keyword frequency on a log-log scale (see gure 2.1).
We divided the keyword frequency into four categories to represent keywords with
di erent frequencies (see gure 2.2).
The trend line (solid continuous line) for the graph in gure 2.1 shows an
exponential drop in the number of distinct keywords as the keyword frequency
increases. The distribution shows consistency with similar results on tag distribu-
tion over web applications [16]. It follows the Zipf's Law because the occurrence
1We are thankful to Matthew Spear for providing us the Facebook data. Readers are directed
to [15] for details.2.2 Why Categorize Keywords? 10
frequency of a keyword increases as its popularity increases in the frequency list.
Thus, we can infer that most of the keywords entered by users are distinct. Figure
2.2 also substantiates this observation as only 14% of the keywords belong to the
high frequency category. This means that only a fraction of keywords are repeat-
edly used by di erent users and a large percentage of keywords (44% of the total)
occur with very low frequency.
High Freq. ≥ 100 Instances
Medium Freq. ≥ 50 Instances
Low Freq. ≥ 15 Instances
Very Low Freq. ≥ 1 Instances
Figure 2.2: Keyword Frequency Distribution
Two important conclusions can be drawn using the above discussion. First, as
the number of unique keywords is large there may be some relationship between
these di erent keywords. This observation uses the fact that di erent topics in
which users are interested can be generalized to a small number as there are limited
community’ categories in OSNs. The large number of unique keywords must
also be related to these limited categories. Second, it is possible that the very
low frequency keywords (which constitute almost half of the total) aren’t very
dissimilar either with each other or to the other 56% of keywords due to the
extensive usage scope of English words. Thus, to develop a social network model
based on keywords, there is a need to explore the hidden relations among keywords
and to categorize them. For instance, in Bob’s case, if the OSN could understand
the relationship between soccer' andfootball’ it might give better results for Bob’s
query.2.3 Social Network Modeling 11
2.3 Social Network Modeling
In this section, we rst describe a method of categorizing keywords in a data
structure to utilize the underlying relationship amongst them. Then, we de ne
functions to quantify the distance between keywords and similarity among social
peers based on the distance between their keyword pairs. Finally, we talk about
correlating social network topology with keywords using the similarity functions.
2.3.1 Forest Structure
We want a data structure that can help de ne distance between keywords by
capturing hidden relations between them. It must employ methods to clearly
distinguish between related and unrelated keywords. A single hierarchical structure
(e.g. by a modi cation of [12]) will be insucient as it will fail to capture important
characteristics of keywords. First, it is not always possible to relate all the words,
e.g. earthquake' andsoccer’, in a single structure. The distance between such
unrelated words must come out to be relatively larger than that between related
words. Second, the data structure must capture all meanings of a word as it can be
used in di erent contexts (or in di erent syntactic categories). E.g., according to
WordWeb [17], the word stern' could meansevere’ as an adjective and rear part
of a ship' as a noun. We propose a forest structure to store keywords where each
tree in the forest contains related keywords. As a keyword could have more than
one meaning it could occur in di erent trees. This way, we use multiple hierarchical
trees (i.e. a forest) to measure distance between keywords.
Di erent methods could be used for arranging the keywords in the forest. A
method based on etymological relations can be used to construct the forest. For2.3 Social Network Modeling 12
instance, in a language like English, which has been derived from Latin, Greek,
etc., most of the words have a root associated with them. Wordinfo [18] lists
61,362 English words which have either Latin or Greek roots. A root word with
its derived words can be put in a single tree. E.g. the words
equine’ (horse),
equestrian' (horse rider),equestrienne’ (female horse rider) and equestrianism'
(horsemanship) that come from the Latin root
equus’ (a horse) can form one tree.
All such trees taken together can form the forest.
Figure 2.3: Forest with two component trees
Another way is to place semantically related keywords in the same tree. E.g. a
tree can be made with the keywords related to sports'. The next levels could
contain various sports like
football’, racing', etc., each having its own sub-tree.
Similarly, we can have another tree for all the countries under the keyword
Nations’ (or UN') as shown in gure 2.3. Since,soccer’ is a hyponym of football',
Bob's query will get better results as the semantic similarity between the two
keywords has been captured by this structure.2.3 Social Network Modeling 13
2.3.2 Similarity Functions
Now we de ne the notion of distance between keywords based on the forest struc-
ture. Let there be t trees (T1; T2; :::; Tt) in the forest F . Consider two keywords Ka
and Kb such that both of them belong to the same tree. Let LCA be the least com-
mon ancestor of Ka and Kb. Also, assume d(LCA; Ka) to be the depth of Ka from
the LCA. E.g., in gure 2.3, if Ka = soccer and Kb = racing then LCA = sports
and d(LCA; Ka) = 2. Let us also de ne the maximum of the distances of Ka and
Kb from the LCA as dLCA i.e. dLCA(Ka; Kb) = max(d(LCA; Ka); d(LCA; Kb)).
De nition 1. If K1 and K2 are two keywords, then the distance, D(K1; K2),
between them is given as:
D(K1; K2) =
dLCA(K1; K2) if 9Ti s.t. K1; K2 2 Ti
1 if no such Ti exists
If more than one such Ti exists, then the distance is set to the minimum of all the
corresponding dLCA's.
If K1 and K2 don't have any relation then D(K1; K2) is 1. Also, the minimum
of all dLCA's is used to account for multiple occurrences of keywords in F . These
observations justify the bene ts of a forest structure (as explained in section 2.3.1)
over a simple hierarchical model to store keywords. A possible metric to de ne
the distance between two keywords could have been the sum of the depths of the
keywords from their LCA (i.e. D(K1; K2)=d(LCA; K1) + d(LCA; K2)). But, we
believe that to capture the distance between keywords from a generic common
point (i.e. the LCA), max(d(LCA; K1); d(LCA; K2)) is more appropriate as
function gives the farthest distance from the generic point.2.3 Social Network Modeling 14
Now, we will de ne the similarity functions between social peers. Let the set
of nodes (or social peers) in the social graph be V . Assume that a social peer w
(2 V ) has Nw keywords and let Kw
(1  i  Nw) be his/her keywords. Consider
two peers u and v on the network. Let k(u; v) (Nu  Nv) be the total number of
keyword pairs that they have. Also, let n(u; v) be the number of keyword pairs
; Kv
) such that Ku
and Kv
j belong to the same tree in F .
De nition 2. For two social peers u and v on the network, the weak similarity',
s(u,v), between them is:
s(u; v) =
n(u; v)
k(u; v)
De nition 3. For two social peers u and v on the network, the
strong similarity’,
S(u,v), between them is:
S(u; v) =
k(u; v)
We call the function s weak similarity', as it doesn't take into account the
position of keywords in a tree, i.e. keywords with distinct distance values will
contribute equally towards the weak similarity. The function S is called
similarity’, as it also considers the relative positions of keywords in the forest
as keywords with greater distance contribute less towards the similarity value.
Exponential function was a natural choice for the de nition because it has a nite
value at the boundary conditions for D(Ku
; Kv
) (as e
0 = 1 and e
1 = 0).
The value of S(u; v) decreases as the distance between the keywords increases
implying that u and v share lesser interests or attributes. It may happen that