
[repost ]Analyzing Big Data » Getting Started: Sentiment Analysis

original:http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-sentiment-tutorial.html

Getting Started: Sentiment Analysis

Sentiment analysis refers to various methods of examining and processing data in order to identify a subjective response, usually a general mood or a group’s opinions about a specific topic. For example, sentiment analysis can be used to gauge the overall positivity of a blog or a document, or to capture constituent attitudes toward a political candidate.

Sentiment data is often derived from social media services and similar user-generated content, such as reviews, comments, and discussion groups. The data sets thus tend to grow large enough to be considered “big data.”

Suppose your company recently released a new product and you want to assess its reception among consumers. You know that social media can help you capture a broad sample of public opinion, but you don’t have time to monitor every mention. You need a better way to determine aggregate sentiment.

Amazon EMR integrates open-source data processing frameworks with the full suite of Amazon Web Services. The resulting architecture is scalable, efficient, and ideal for analyzing large-scale sentiment data, such as tweets over a given time period.

In this tutorial, you’ll launch an AWS CloudFormation stack that provides a script for collecting tweets. You’ll store the tweets in Amazon S3 and customize a mapper file for use with Amazon EMR. Then you’ll create an Amazon EMR cluster that uses a Python natural language toolkit, implemented with a Hadoop streaming job, to classify the data. Finally, you’ll examine the output files and evaluate the aggregate sentiment of the tweets.

This tutorial typically takes less than an hour to complete. You pay only for the resources you use. The tutorial includes a cleanup step to help ensure that you don’t incur additional costs. You may also want to review the Pricing topic.

Important

Before you begin, make sure you’ve completed the steps in Getting Set Up.

Click Next to start the tutorial.

Step 1: Create a Twitter Developer Account

In order to collect tweets for analysis, you’ll need to create an account on the Twitter developer site and generate credentials for use with the Twitter API.

To create a Twitter developer account

  1. Go to https://dev.twitter.com/user/login and log in with your Twitter user name and password. If you do not yet have a Twitter account, click the Sign up link that appears under the Username field.
  2. If you’ve already used the Twitter developer site to generate credentials and register applications, skip to the next step.

    If you have not yet used the Twitter developer site, you’ll be prompted to authorize the site to use your account. Click Authorize app to continue.

  3. Go to the Twitter applications page at https://dev.twitter.com/apps and click Create a new application.
  4. Follow the on-screen instructions. For the application Name, Description, and Website, you can enter any text — you’re simply generating credentials to use with this tutorial, rather than creating a real application.
  5. On the details page for your new application, you’ll see a Consumer key and Consumer secret. Make a note of these values; you’ll need them later in this tutorial. You may want to store your credentials in a text file.
  6. At the bottom of the application details page, click Create my access token. Make a note of the Access token and Access token secret values that appear, or add them to the text file you created in the preceding step.

If you need to retrieve your Twitter developer credentials at any point, you can go to https://dev.twitter.com/apps and select the application you created for the purposes of this tutorial.

Step 2: Create an Amazon S3 Bucket for the Amazon EMR Files

Amazon EMR jobs typically use Amazon S3 buckets for input and output data files, as well as for any mapper and reducer files that aren’t provided by open-source tools. For the purposes of this tutorial, you’ll create your own Amazon S3 bucket in which you’ll store the data files and a custom mapper.

To create an Amazon S3 bucket using the console

  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/home.
  2. Click Create Bucket.
  3. Enter a name for your bucket, such as mysentimentjob.

    Note

    To meet Hadoop requirements, Amazon S3 bucket names used with Amazon EMR are restricted to lowercase letters, numbers, periods (.), and hyphens (-).

  4. Leave the Region set to US Standard and click Create.
  5. Click the name of your new bucket in the All Buckets list.
  6. Click Create Folder, then type input. Press Enter or click the check mark.
  7. Repeat this step to create another folder called mapper at the same level as the input folder.
  8. For the purposes of this tutorial (to ensure that all services can use the folders), you should make the folders public. Select the check boxes next to your folders. Click Actions, then click Make Public. Click OK to confirm that you want to make the folders public.

Make a note of your bucket and folder names — you’ll need them in later steps.

Step 3: Collect and Store the Sentiment Data

In this step, you’ll use an AWS CloudFormation template to launch an instance, then use the tools on the instance to collect data via the Twitter API. You’ll also use a command-line tool to store the collected data in the Amazon S3 bucket you created.

To launch the AWS CloudFormation stack

  1. Open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation.
  2. Make sure US East (N. Virginia) is selected in the region selector of the navigation bar.
  3. Click Create Stack.
  4. In the Stack Name box, type any name that will help you identify your stack, such as MySentimentStack.
  5. Under Template, select Provide a Template URL. Type https://s3.amazonaws.com/awsdocs/gettingstarted/latest/sentiment/sentimentGSG.template in the box (or copy the URL from this page and paste it in the box). Click Continue.
  6. On the Specify Parameters page, enter your AWS and Twitter credentials. The Key Pair name must match the key pair you created in the US-East region in Step 2: Create a Key Pair.

    For best results, copy and paste the Twitter credentials from the Twitter developer site or the text file you saved them in.

    Note

    The order of the Twitter credential boxes on the Specify Parameters page may not match the display order on the Twitter developer site. Make sure you’re pasting the correct value in each box.

  7. Select the check box to acknowledge that the template may create IAM resources, then click Continue. Click Continue again on the Add Tags page.
  8. Review your settings, making sure your Twitter credentials are correct. You can make changes to the settings by clicking the Edit link for a specific step in the process.
  9. Click Continue to launch the stack. A confirmation window opens. Click Close.
  10. The confirmation window closes, returning you to the AWS CloudFormation console. Your new AWS CloudFormation stack appears in the list with its status set to CREATE_IN_PROGRESS.

    Note

    Your stack will take several minutes to launch. Make sure to click Refresh on the Stacks page to see whether the stack has been successfully created.

For more information about AWS CloudFormation, go to Walkthrough: Updating a Stack.

To collect tweets using your AWS CloudFormation stack

When your stack shows the status CREATE_COMPLETE, it’s ready to use.

  1. Click the Outputs tab in the bottom pane to get the IP address of the Amazon EC2 instance that AWS CloudFormation created.
  2. Connect to the instance via SSH, using the user name ec2-user. For more information about connecting to an instance and configuring your SSH credentials and tools, see Connecting to Your Linux/UNIX Instances Using SSH or Connecting to Linux/UNIX Instances from Windows Using PuTTY. (Disregard the sections that describe how to transfer files.)
  3. In the SSH window, type the following command:
    cd sentiment
  4. The instance has been preconfigured with Tweepy, an open-source package for use with the Twitter API. Python scripts for running Tweepy appear in the sentiment directory (a rough sketch of this kind of collector script appears after this procedure). To ensure that they are present, type the following command:
    ls

    You should see files named collector.py and twaiter.py, as well as twitterparams.py.

  5. To collect tweets, type the following command, where term1 is your search term.
    python collector.py term1

    To use a multi-word term, enclose it in quotation marks. Examples:

    python collector.py kindle
    python collector.py "kindle fire"

    The collector script is not case sensitive.

  6. Press Enter to run the collector script. Your SSH window should show the message “Collecting tweets. Please wait.”

    The script collects 500 tweets, which may take several minutes. If you’re searching for a subject that is not currently popular on Twitter (or if you edited the script to collect more than 500 tweets), the script will take longer to run. You can interrupt it at any time by pressing Control+C.

    When the script has finished running, your SSH window will show the message “Finished collecting tweets.”

    Note

    If your SSH connection is interrupted while the script is still running, reconnect to the instance and run the script with nohup (e.g., nohup python collector.py > /dev/null &).
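For reference, the kind of logic a script like collector.py implements can be sketched with Tweepy. This is only an illustration, not the script shipped with the stack: the output file naming and the 500-tweet limit are assumptions, and the sketch targets the older (pre-4.0) Tweepy search API.

    #!/usr/bin/env python
    # Hypothetical sketch of a Tweepy-based collector (pre-4.0 Tweepy API).
    # The actual collector.py provided by the CloudFormation stack may differ.
    import sys
    import time
    import tweepy

    consumer_key, consumer_secret = "YOUR_KEY", "YOUR_SECRET"
    access_token, access_token_secret = "YOUR_TOKEN", "YOUR_TOKEN_SECRET"

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    term = sys.argv[1]
    outfile = "tweets.%s.txt" % time.strftime("%b%d-%H%M")

    print("Collecting tweets. Please wait.")
    with open(outfile, "w") as out:
        # Collect up to 500 tweets that mention the search term.
        for status in tweepy.Cursor(api.search, q=term).items(500):
            out.write("%s\t%s\n" % (status.id, status.text.replace("\n", " ")))
    print("Finished collecting tweets.")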

To store the collected tweets in Amazon S3

Your sentiment analysis stack has been preconfigured with s3cmd, a command-line tool for Amazon S3. You’ll use s3cmd to store your tweets in the bucket you created earlier. (A Python alternative using boto3 is sketched after this procedure.)

  1. In your SSH window, type the following command. (The current directory should still be sentiment. If it’s not, use cd to navigate to the sentiment directory.)
    ls

    You should see a file named tweets.date-time.txt, where date and time reflect when the script was run. This file contains the ID numbers and full text of the tweets that matched your search terms.

  2. To copy the Twitter data to Amazon S3, type the following command, where tweet-file is the file you identified in the previous step and your-bucket is the name of the Amazon S3 bucket you created earlier.
    s3cmd put tweet-file s3://your-bucket/input/

    Example:

    s3cmd put tweets.Nov12-1227.txt s3://mysentimentjob/input/

    Important

    Be sure to include the trailing slash, to indicate that input is a folder. Otherwise, Amazon S3 will create an object called input in your base S3 bucket.

  3. To verify that the file was uploaded to Amazon S3, type the following command:
    s3cmd ls s3://your-bucket/input/

    You can also use the Amazon S3 console at https://console.aws.amazon.com/s3/ to view the contents of your bucket and folders.
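If you prefer to script the upload in Python rather than use s3cmd, a rough boto3-based equivalent is sketched below. The bucket and file names are the examples from this tutorial, and boto3 must be installed and able to find your AWS credentials.

    # Sketch: upload the collected tweets to the S3 input folder with boto3.
    import boto3

    bucket = "mysentimentjob"              # your bucket name
    tweet_file = "tweets.Nov12-1227.txt"   # the file produced by the collector

    s3 = boto3.client("s3")
    # The key ends with "input/<file>" so the object lands in the input folder.
    s3.upload_file(tweet_file, bucket, "input/" + tweet_file)

    # Verify the upload by listing the folder contents.
    response = s3.list_objects_v2(Bucket=bucket, Prefix="input/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])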

Step 4: Customize the Amazon EMR Mapper

When you create your own Hadoop streaming programs, you’ll need to write mapper and reducer executables as described in Process Data with a Streaming Cluster in the Amazon Elastic MapReduce Developer Guide. For this tutorial, we’ve prepopulated an Amazon S3 bucket with a mapper script that you can customize for use with your Twitter search term.
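To make the streaming contract concrete, here is a deliberately simplified sketch of what a sentiment mapper for the aggregate reducer (used later in this tutorial) could look like. It is not the tutorial’s sentiment.py: the word lists are placeholders, and the real mapper classifies tweets with the NLTK classifier distributed via -cacheFile. The one fixed point is the output format, since keys prefixed with LongValueSum: are summed by Hadoop’s aggregate package.

    #!/usr/bin/env python
    # Hypothetical streaming mapper sketch (not the tutorial's sentiment.py).
    # Reads tweets from stdin and emits counts for Hadoop's "aggregate" reducer.
    import sys

    subj1 = "kindle"   # the search term you edit in Step 4

    # Placeholder word lists; the real mapper uses an NLTK classifier instead.
    POSITIVE = {"love", "great", "awesome"}
    NEGATIVE = {"hate", "awful", "broken"}

    for line in sys.stdin:
        text = line.lower()
        if subj1 not in text:
            print("LongValueSum:No match\t1")
            continue
        words = set(text.split())
        if words & POSITIVE:
            print("LongValueSum:%s: positive\t1" % subj1)
        elif words & NEGATIVE:
            print("LongValueSum:%s: negative\t1" % subj1)
        else:
            print("LongValueSum:No match\t1")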

To customize the mapper

  1. Download the mapper file from https://s3.amazonaws.com/awsdocs/gettingstarted/latest/sentiment/sentiment.py.
  2. Use a text editor of your choice to edit the following line in the file:
    subj1 = "term1"

    Replace term1 with the search term you used in Step 3: Collect and Store the Sentiment Data. Example:

    subj1 = "kindle"

    Important

    Make sure you don’t change any of the spacing in the file. Incorrect indentation will cause the Hadoop streaming program to fail.

    Save the edited file. You may also want to review the file generally, to get a sense of how mappers can work.

    Note

    In your own mappers, you’ll probably want to fully automate the configuration. The manual editing in this tutorial is for purposes of illustration only. For more details about creating Amazon EMR work steps and bootstrap actions, go to Create Bootstrap Actions to Install Additional Software and Steps in the Amazon Elastic MapReduce Developer Guide.

  3. Go to the Amazon S3 console at https://console.aws.amazon.com/s3/ and locate the mapper folder you created in Step 2: Create an Amazon S3 Bucket for the Amazon EMR Files.
  4. Click Upload and follow the on-screen instructions to upload your customized mapper file.
  5. Make the mapper file public: select it, then select Actions and then Make Public.

Step 5: Create an Amazon EMR Cluster

Important

This tutorial reflects changes made to the Amazon EMR console in November 2013. If your console screens do not match the images in this guide, switch to the new version by clicking the link that appears at the top of the console.

Amazon EMR allows you to configure a cluster with software, bootstrap actions, and work steps. For this tutorial, you’ll run a Hadoop streaming program. When you configure a cluster with a Hadoop streaming program in Amazon EMR, you specify a mapper and a reducer, as well as any supporting files. The following list provides a summary of the files you’ll use for this tutorial.

  • For the mapper, you’ll use the file you customized in the preceding step.
  • For the reducer method, you’ll use the predefined Hadoop package aggregate. For more information about the aggregate package, go to the Hadoop documentation.
  • Sentiment analysis usually involves some form of natural language processing. For this tutorial, you’ll use the Natural Language Toolkit (NLTK), a popular Python platform. You’ll use an Amazon EMR bootstrap action to install the NLTK Python module. Bootstrap actions load custom software onto the instances that Amazon EMR provisions and configures. For more information, go to Create Bootstrap Actions in the Amazon Elastic MapReduce Developer Guide.
  • Along with the NLTK module, you’ll use a natural language classifier file that we’ve provided in an Amazon S3 bucket.
  • For the job’s input data and output files, you’ll use the Amazon S3 bucket you created (which now contains the tweets you collected).

Note that the files used in this tutorial are for illustration purposes only. When you perform your own sentiment analysis, you’ll need to write your own mapper and build a sentiment model that meets your needs. For more information about building a sentiment model, go to Learning to Classify Text in Natural Language Processing with Python, which is provided for free on the NLTK site.
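The classifier file used in this tutorial is provided for you, but as a rough illustration of how NLTK builds such a model, the sketch below trains a tiny Naive Bayes classifier on a handful of hand-labeled examples and pickles it. The training sentences and feature extraction are toy assumptions; a real sentiment model needs a much larger labeled corpus, as the NLTK book explains.

    # Toy sketch of training and pickling an NLTK sentiment classifier.
    # The training sentences and features are invented for illustration only.
    import pickle
    import nltk

    def features(text):
        # Bag-of-words features: which words appear in the text.
        return {"contains(%s)" % w: True for w in text.lower().split()}

    train = [
        (features("I love my kindle, it is great"), "positive"),
        (features("best purchase I have made"), "positive"),
        (features("the screen broke, terrible device"), "negative"),
        (features("I hate the battery life"), "negative"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(features("what a great screen")))  # likely "positive"

    # Persist the trained model, similar in spirit to the provided classifier.p file.
    with open("classifier.p", "wb") as f:
        pickle.dump(classifier, f)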

To create an Amazon EMR cluster using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
  2. Click Create cluster.
  3. In the Cluster Configuration section, type a Cluster name or use the default value of My cluster. Set Termination protection to No and clear the Logging enabled check box.

    Note

    In a production environment, logging and debugging can be useful tools for analyzing errors or inefficiencies in Amazon EMR steps and applications. For more information on how to use logging and debugging in Amazon EMR, go to Troubleshooting in the Amazon Elastic MapReduce Developer Guide.

  4. In the Software Configuration section, leave the default Hadoop distribution setting: Amazon and latest AMI version. Under Applications to be installed, click each X to remove Hive and Pig from the list.
  5. In the Hardware Configuration section, leave the default settings. The default instance types, an m1.small master node and two m1.small core nodes, will help keep the cost of this tutorial low.

    Note

    When you analyze data in a real application, you may want to increase the size or number of these nodes to improve processing power and optimize computational time. You may also want to use spot instances to further reduce your Amazon EC2 costs. For more information about spot instances, go to Lowering Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.

  6. In the Security and Access section, select the EC2 key pair you created earlier. Leave the default IAM settings.
  7. In the Bootstrap Actions section, in the Add bootstrap action list, select Custom action. You’ll add a custom action that installs and configures the Natural Language Toolkit on the cluster.
  8. In the Add Bootstrap Action popup, enter a Name for the action or leave it set to Custom action. In the Amazon S3 Location box, type s3://awsdocs/gettingstarted/latest/sentiment/config-nltk.sh (or copy and paste the URL from this page), and then click Add. (You can also download and review the shell script, if you’d like.)

    The Bootstrap Actions section should now show the custom action you added.

  9. In the Steps section, you’ll define the Hadoop streaming job. In the Add step list, select Streaming program, then click Configure and add.
  10. In the Add Step popup, configure the job as follows, replacing your-bucket with the name of the Amazon S3 bucket you created earlier:
    Name Sentiment analysis
    Mapper s3://your-bucket/mapper/sentiment.py
    Reducer aggregate
    Input S3 location s3://your-bucket/input
    Output S3 location s3://your-bucket/output (make sure this folder does not yet exist)
    Arguments -cacheFile s3://awsdocs/gettingstarted/latest/sentiment/classifier.p#classifier.p
    Action on failure Continue

    Click Add. The Steps section should now show the parameters for the streaming program.

  11. Below the step parameters, set Auto-terminate to Yes.
  12. Review the cluster settings. If everything looks correct, click Create cluster.

A summary of your new cluster will appear, with the status Starting. It will take a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster.

Step 6: Examine the Sentiment Analysis Output

When your cluster’s status in the Amazon EMR console is Waiting: Waiting after step completed, you can examine the results.

To examine the streaming program results

  1. Go to the Amazon S3 console at https://console.aws.amazon.com/s3/home and locate the bucket you created in Step 2: Create an Amazon S3 Bucket for the Amazon EMR Files. You should see a new output folder in your bucket. You may need to click the refresh arrow in the top right corner to see the new bucket.
  2. The job output will be split into several files: an empty status file named _SUCCESS and several part-xxxxx files. The part-xxxxx files contain sentiment measurements generated by the Hadoop streaming program.
  3. To download an output file, select it in the list, then click Actions and select Download. Right-click the link in the pop-up window to download the file.

    Repeat this step for each output file.

  4. Open the files in a text editor. You’ll see the total number of positive and negative tweets for your search term, as well as the total number of tweets that did not match any of the positive or negative terms in the classifier (usually because the subject term was in a different field, rather than in the actual text of the tweet).

    Example:

    kindle: negative      13
    kindle: positive     479
    No match:              8

    In this example, the sentiment is overwhelmingly positive. In most cases, the positive and negative totals will be closer together. For your own sentiment analysis work, you’ll want to collect and compare data over several time periods, possibly using several different search terms, to get as accurate a measurement as possible. A small script for tallying the downloaded output files is sketched after this procedure.
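Assuming the part files have been downloaded to the current directory, the following sketch sums the counts across all of them; the line format matches the aggregate output shown in the example above.

    # Sketch: sum sentiment counts across the downloaded part-xxxxx files.
    import glob
    from collections import Counter

    totals = Counter()
    for path in glob.glob("part-*"):
        with open(path) as f:
            for line in f:
                # Each line is "<label>\t<count>", e.g. "kindle: positive\t479".
                label, count = line.rsplit("\t", 1)
                totals[label.strip()] += int(count)

    for label, count in totals.most_common():
        print("%-20s %6d" % (label, count))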

Step 7: Clean Up

To prevent your account from accruing additional charges, you should terminate the resources you used in this tutorial.

To delete the AWS CloudFormation stack

  1. Go to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation.
  2. In the AWS CloudFormation Stacks section, select your sentiment stack.
  3. Either click the Delete Stack button, or right-click your selected stack and click Delete Stack.
  4. Click Yes, Delete in the confirmation dialog that appears.

    Note

    After stack deletion has begun, you can’t cancel the process. The stack will proceed to the state DELETE_IN_PROGRESS. After the stack has been deleted, it will have the state DELETE_COMPLETE.

Because you ran a Hadoop streaming program and set it to auto-terminate after running the steps in the program, the cluster should have been automatically terminated when processing was complete.

To ensure the Amazon EMR cluster was terminated

  1. If you are not already viewing the cluster list, click Cluster List in the Elastic MapReduce menu at the top of the Amazon Elastic MapReduce console.
  2. In the cluster list, make sure the Status of your cluster is Terminated.
To terminate an Amazon EMR cluster

  1. If you are not already viewing the cluster list, click Cluster List in the Elastic MapReduce menu at the top of the Amazon Elastic MapReduce console.
  2. In the cluster list, select the box to the left of the cluster name, and then click Terminate. In the confirmation pop-up that appears, click Terminate.

The next step is optional. It deletes the key pair you created earlier. You are not charged for key pairs. If you are planning to explore Amazon EMR further or complete the other tutorial in this guide, you should retain the key pair.

To delete a key pair

  1. In the Amazon EC2 console navigation pane, select Key Pairs.
  2. In the content pane, select the key pair you created, then click Delete.

The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the cluster. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.

To delete Amazon EMR security groups

  1. In the Amazon EC2 console navigation pane, click Security Groups.
  2. In the content pane, click the ElasticMapReduce-slave security group.
  3. In the details pane for the ElasticMapReduce-slave security group, click the Inbound tab. Delete all actions that reference ElasticMapReduce. Click Apply Rule Changes.
  4. In the content pane, click ElasticMapReduce-slave, and then click Delete. Click Yes, Delete to confirm. (This group must be deleted before you can delete the ElasticMapReduce-master group.)
  5. In the content pane, click ElasticMapReduce-master, and then click Delete. Click Yes, Delete to confirm.

You’ve completed the sentiment analysis tutorial. Be sure to review the other topics in this guide for more information about Amazon Elastic MapReduce.

 

[repost ]Analyzing Big Data » Getting Started: Web Server Log Analysis

original:http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-tutorial.html

Getting Started: Web Server Log Analysis

Suppose you host a popular e-commerce website. In order to understand your customers better, you want to analyze your Apache web logs to discover how people are finding your site. You’d especially like to determine which of your online ad campaigns are most successful in driving traffic to your online store.

The web server logs, however, are too large to import into a MySQL database, and they are not in a relational format. You need another way to analyze them.

Amazon EMR integrates open-source applications such as Hadoop and Hive with Amazon Web Services to provide a scalable and efficient architecture for analyzing large-scale data, such as Apache web logs.

In the following tutorial, we’ll import data from Amazon S3 and create an Amazon EMR cluster from the AWS Management Console. Then we’ll connect to the master node of the cluster, where we’ll run Hive to query the Apache logs using a simplified SQL syntax.

This tutorial typically takes less than an hour to complete. You pay only for the resources you use. The tutorial includes a cleanup step to help ensure that you don’t incur additional costs. You may also want to review the Pricing topic.

Important

Before you begin, make sure you’ve completed the steps in Getting Set Up.

Click Next to start the tutorial.

Step 1: Create a Cluster Using the Console

Important

This tutorial reflects changes made to the Amazon EMR console in November 2013. If your console screens do not match the images in this guide, switch to the new version by clicking the link that appears at the top of the console.

To create a cluster using the console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
  2. Click Create Cluster.
  3. In the Cluster Configuration section, type a Cluster name or use the default value of My cluster. Set Termination protection to No and clear the Logging check box.

    Note

    In a production environment, logging and debugging can be useful tools for analyzing errors or inefficiencies in Amazon EMR steps or programs. For more information on how to use logging and debugging in Amazon EMR, go to Troubleshooting in the Amazon Elastic MapReduce Developer Guide.

  4. In the Software Configuration section, leave the default Hadoop distribution setting: Amazon and latest AMI version. Under Applications to be installed, leave the default Hive settings. Click the X to remove Pig from the list.
  5. In the Hardware Configuration section, leave the default settings. The default instance types, an m1.small master node and two m1.small core nodes, will help keep the cost of this tutorial low.

    Note

    When you analyze data in a real application, you may want to increase the size or number of these nodes to improve processing power and reduce computational time. You may also want to use spot instances to further reduce your Amazon EC2 costs. For more information about spot instances, go to Lowering Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.

  6. In the Security and Access section, select the EC2 key pair you created in the preceding step. Leave the default IAM settings.

    Leave the default Bootstrap Actions and Steps settings. Bootstrap actions and steps allow you to customize and configure your application. For this tutorial, we will be using Hive, which is already installed on the AMI, so no additional configuration is needed.

  7. Review the settings. If everything looks correct, click Create cluster.

    A summary of your new cluster will appear, with the status STARTING. It will take a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster.

Step 2: Connect to the Master Node

When the cluster in the Amazon EMR console is WAITING, the master node is ready for you to connect to it. First you’ll need to get the DNS name of the master node and configure your connection tools and credentials.

To locate the DNS name of the master node

  • If you’re not currently viewing the Cluster Details page, first select the cluster on the Cluster List page.

    On the Cluster Details page, you’ll see the Master public DNS name. Make a note of the DNS name; you’ll need it in the next step.

You can use secure shell (SSH) to open a terminal connection to the master node. An SSH application is installed by default on most Linux, Unix, and Mac OS installations. Windows users can use an application called PuTTY to connect to the master node. Platform-specific instructions for configuring a Windows application to open an SSH connection are provided later in this topic.

You must first configure your credentials, or SSH will return an error message saying that your private key file is unprotected, and it will reject the key. You need to do this step only the first time you use the private key to connect.

To configure your credentials on Linux/Unix/Mac OS X

  1. Open a terminal window. On most computers running Mac OS X, you’ll find the terminal at Applications/Utilities/Terminal. On many Linux distributions, the path is Applications/Accessories/Terminal.
  2. Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permissions to access the key. For example, if you saved the file as mykeypair.pem in your home directory, you can use this command:
    chmod og-rwx ~/mykeypair.pem
To connect to the master node using Linux/Unix/Mac OS X

  1. In the terminal window, enter the following command, where the value of the -i parameter indicates the location of the private key file you saved in Step 2: Create a Key Pair. In this example, the key is assumed to be in your home directory.
    ssh hadoop@master-public-dns-name \
    -i ~/mykeypair.pem
  2. You’ll see a warning that the authenticity of the host can’t be verified. Type yes to continue connecting.

If you’re using a Windows-based computer, you’ll need to install an SSH client in order to connect to the master node. In this tutorial, we’ll use PuTTY. If you have already installed PuTTY and configured your key pair, you can skip this procedure.

To install and configure PuTTY on Windows

  1. Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
  2. Launch PuTTYgen.
  3. Click Load. Select the PEM file you created earlier. You may have to change the file type from “PuTTY Private Key Files (*.ppk)” to “All Files (*.*)”.
  4. Click Open.
  5. On the PuTTYgen Notice telling you the key was successfully imported, click OK.
  6. To save the key in the PPK format, click Save private key.
  7. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
  8. Enter a name for your PuTTY private key, such as mykeypair.ppk.
To connect to the master node using Windows/PuTTY

  1. Start PuTTY.
  2. In the Category list, click Session. In the Host Name box, type hadoop@DNS. The input will look similar to hadoop@ec2-184-72-128-177.compute-1.amazonaws.com.
  3. In the Category list, expand Connection, expand SSH, and then click Auth.
  4. In the Options controlling SSH authentication pane, click Browse for Private key file for authentication, and then select the private key file that you generated earlier. If you are following this guide, the file name is mykeypair.ppk.
  5. Click Open.
  6. To connect to the master node, click Open.
  7. In the PuTTY Security Alert window, click Yes.

Note

For more information about how to install PuTTY and use it to connect to an EC2 instance, go to Connecting to Linux/UNIX Instances from Windows Using PuTTY in the Amazon Elastic Compute Cloud User Guide.

When you’ve successfully connected to the master node via SSH, you’ll see a welcome message and prompt similar to the following:

-----------------------------------------------------------------------------

Welcome to Amazon EMR running Hadoop and Debian/Lenny.

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands: 

  JobTracker    lynx http://localhost:9100/
  NameNode      lynx http://localhost:9101/

-----------------------------------------------------------------------------
hadoop@ip-10-245-190-34:~$

 

Step 3: Start and Configure Hive

Apache Hive is a data warehouse application you can use to query Amazon EMR cluster data with a SQL-like language. Because Hive was listed in the Applications to be installed when we created the cluster, it’s ready to use on the master node.

To use Hive interactively to query the web server log data, you’ll need to load some additional libraries. The additional libraries are contained in a Java archive file named hive_contrib.jar on the master node. When you load these libraries, Hive bundles them with the map-reduce job that it launches to process your queries.

To learn more about Hive, go to http://hive.apache.org/.

To start and configure Hive on the master node

  1. On the command line of the master node, type hive, and then press Enter.
  2. At the hive> prompt, type the following command, and then press Enter.
    hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;

    Wait for a confirmation message similar to the following:

    Added /home/hadoop/hive/lib/hive_contrib.jar to class path
    Added resource: /home/hadoop/hive/lib/hive_contrib.jar

Step 4: Create the Hive Table and Load Data into HDFS

In order for Hive to interact with data, it must translate the data from its current format (in the case of Apache web logs, a text file) into a format that can be represented as a database table. Hive does this translation using a serializer/deserializer (SerDe). SerDes exist for a variety of data formats. For information about how to write a custom SerDe, go to the Apache Hive Developer Guide.

The SerDe we’ll use in this example uses regular expressions to parse the log file data. It comes from the Hive open-source community and can be found at https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java. (This link is provided for reference only; for the purposes of this tutorial, you do not need to download the SerDe.)

Using this SerDe, we can define the log files as a table, which we’ll query using SQL-like statements later in this tutorial.
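If it helps to see the same parsing idea outside Hive, the short sketch below applies a simplified combined-log regular expression to a single sample line in Python. The regex is not byte-for-byte identical to the SerDe’s (for example, it strips the brackets around the timestamp), and the sample line is invented for illustration.

    # Sketch: parsing one CombinedLogFormat line with a simplified regex.
    import re

    LOG_RE = re.compile(
        r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}|-) (\d+|-)'
        r'(?: "([^"]*)" "([^"]*)")?'
    )

    line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
            '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
            '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98)"')

    host, identity, user, time, request, status, size, referer, agent = \
        LOG_RE.match(line).groups()
    print(host, status, size, request)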

To translate the Apache log file data into a Hive table

  • Copy the following multiline command. At the hive command prompt, paste the command, and then press Enter.
    CREATE TABLE serde_regex(
      host STRING,
      identity STRING,
      user STRING,
      time STRING,
      request STRING,
      status STRING,
      size STRING,
      referer STRING,
      agent STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
      "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    ) 
    LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';

In the command, the LOCATION parameter specifies the location of a set of sample Apache log files in Amazon S3. To analyze your own Apache web server log files, you would replace the Amazon S3 URL above with the location of your own log files in Amazon S3. To meet the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-).

After you run the command above, you should receive a confirmation like this one:

Found class for org.apache.hadoop.hive.contrib.serde2.RegexSerDe
OK
Time taken: 12.56 seconds
hive>

Once Hive has loaded the data, the data will persist in HDFS storage as long as the Amazon EMR cluster is running, even if you shut down your Hive session and close the SSH connection.

 

Step 5: Query Hive

You’re ready to start querying the Apache log file data. Here are some sample queries to run.

Count the number of rows in the Apache webserver log files.

select count(1) from serde_regex;

Return all fields from one row of log file data.

select * from serde_regex limit 1;

Count the number of requests from the host with an IP address of 192.168.1.198.

select count(1) from serde_regex where host="192.168.1.198";

To return query results, Hive translates your query into a Hadoop MapReduce job and runs it on the Amazon EMR cluster. Status messages will appear as the Hadoop job runs.

Hive SQL is a subset of SQL; if you know SQL, you’ll be able to easily create Hive queries. For more information about the query syntax, go to the Hive Language Manual.

Step 6: Clean Up

To prevent your account from accruing additional charges, you should terminate the cluster when you are done with this tutorial. Because you used the cluster interactively, it has to be manually terminated.

To disconnect from Hive and SSH

  1. In your SSH window or client, press CTRL+C to exit Hive.
  2. At the SSH command prompt, type exit, and then press Enter. You can then close the terminal or PuTTY window.
    exit
To terminate an Amazon EMR cluster

  1. If you are not already viewing the cluster list, click Cluster List at the top of the Amazon Elastic MapReduce console.
  2. In the cluster list, select the box to the left of the cluster name, and then click Terminate. In the confirmation pop-up that appears, click Terminate.

The next step is optional. It deletes the key pair you created earlier. You are not charged for key pairs. If you are planning to explore Amazon EMR further or complete the other tutorial in this guide, you should retain the key pair.

To delete a key pair

  1. In the Amazon EC2 console navigation pane, select Key Pairs.
  2. In the content pane, select the key pair you created, then click Delete.

The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the cluster. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.

To delete Amazon EMR security groups

  1. In the Amazon EC2 console navigation pane, click Security Groups.
  2. In the content pane, click the ElasticMapReduce-slave security group.
  3. In the details pane for the ElasticMapReduce-slave security group, click the Inbound tab. Delete all actions that reference ElasticMapReduce. Click Apply Rule Changes.
  4. In the content pane, click ElasticMapReduce-slave, and then click Delete. Click Yes, Delete to confirm. (This group must be deleted before you can delete the ElasticMapReduce-master group.)
  5. In the content pane, click ElasticMapReduce-master, and then click Delete. Click Yes, Delete to confirm.

 

[repost ]Analyzing Billions Of Credit Card Transactions And Serving Low-Latency Insights In The Cloud

original:http://highscalability.com/blog/2013/1/7/analyzing-billions-of-credit-card-transactions-and-serving-l.html

This is a guest post by Ivan de Prado and Pere Ferrera, founders ofDatasalt, the company behind Pangool and Splout SQL Big Data open-source projects.

The amount of payments performed using credit cards is huge. It is clear that there is inherent value in the data that can be derived from analyzing all the transactions. Client fidelity, demographics, heat maps of activity, shop recommendations, and many other statistics are useful to both clients and shops for improving their relationship with the market. At Datasalt we have developed a system in collaboration with the BBVA bank that is able to analyze years of data and serve insights and statistics to different low-latency web and mobile applications.

The main challenge we faced besides processing Big Data input is that the output was also Big Data, and even bigger than the input. And this output needed to be served quickly, under high load.

The solution we developed has an infrastructure cost of just a few thousand dollars per month thanks to the use of the cloud (AWS), Hadoop, and Voldemort. In the following lines we will explain the main characteristics of the proposed architecture.

Data, Goals And First Decisions

The system uses BBVA’s credit card transactions performed in shops all around the world as input source for the analysis. Obviously, the data is anonymous, impersonal and dissociated to prevent any privacy issue. Credit card numbers are hashed. Any resultant insights are always an aggregation, so no personal information can be derived from them.

We calculate many statistics and other data for each shop and for different time periods. These are some of them:

  • Histogram of payment amounts for each shop
  • Client fidelity
  • Client demographics
  • Shop recommendations (clients buying here also buy at …). Filtered by location, shop category and so on.

The main goal of the project was to offer all this information to the different agents (shops, clients) through low-latency web and mobile applications. So one demanding requirement was to be able to serve the results with sub-second latencies under high load. And because this was a research project, a high degree of flexibility in the code and in the requirements needed to be handled.

Because updating the data only once a day was not a problem, we opted for a batch-oriented architecture (Hadoop). And we chose Voldemort, a simple yet super-fast key/value store that integrates well with Hadoop, as a read-only store for serving the Hadoop-generated insights.

The Platform

The system was built on top of Amazon Web Services. Specifically, we used S3 for storing raw input data, Elastic MapReduce (Amazon’s provided Hadoop) for the analysis, and EC2 for serving the results. Using cloud technologies allowed us to iterate fast and deliver functional prototypes quickly, which is exactly what we needed for this kind of project.

The Architecture

The architecture has three main parts:

  • Data storage: Used to maintain raw data (credit card transactions) and the resulting Voldemort stores.
  • Data processing: A Hadoop workflow running on EMR that performs all the computations and creates the data stores needed by Voldemort.
  • Data serving: A Voldemort cluster that serves the precomputed data from the data processing layer.

Every day, the bank uploads all the transactions that happened on that day to a folder in S3. This allows us to keep all historical data – all credit card transactions performed every single day. All that data is the input for the processing layer, so we recompute everything, every day. Reprocessing all the data allows us to be very agile. If requirements change or if we find a silly bug, we just update the project code and all data is fixed after the next batch. It is a development decision that brings us:

  • A simplified code base & architecture,
  • Flexibility & adaptability to changes,
  • Easy handling of human errors (Just fix the bug and relaunch the process).

Once a day the controller starts a new Hadoop cluster on EMR and launches the processing flow. This flow is composed of approximately 16 Tuple MapReduce jobs that calculate various insights. The last part of the flow (Voldemort indexer) is in charge of building the data store files that will later be deployed to Voldemort. Once the flow has finished, the resulting data store files are uploaded to S3. The controller shuts down the Hadoop cluster, and sends a deploy request to Voldemort. Then, Voldemort downloads the new data stores from S3 and performs a hot-swap, replacing the old ones entirely.
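The controller code itself is not shown in this post, but the daily pattern it implements (start a transient EMR cluster, run the flow, let the cluster shut itself down) can be sketched roughly with today’s boto3 API. Every value below, from the cluster size to the step JAR and its arguments, is an assumption for illustration; the original system predates boto3 and used its own tooling.

    # Rough sketch of a daily "controller" that launches a transient EMR cluster.
    # All names, sizes, and paths are illustrative assumptions.
    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    response = emr.run_job_flow(
        Name="daily-card-analytics",
        ReleaseLabel="emr-5.36.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 24,
            "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the flow finishes
        },
        Steps=[{
            "Name": "analytics-flow",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-analytics-bucket/jobs/analytics-flow.jar",
                "Args": ["--input", "s3://my-analytics-bucket/transactions/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Launched cluster:", response["JobFlowId"])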

The Technologies

Hadoop And Pangool

The whole analytics and processing flow is implemented using Pangool Jobs on top of Hadoop. This gives us a good balance between performance, flexibility, and agility. The use of tuples allows us to carry information through the flow using simple data types (int, string) while also including more complex objects (like histograms) with their own custom serialization.

Also, because Pangool is still a low-level API, we can fine-tune every single job extensively when needed.

Voldemort

Voldemort is a key/value NoSQL database developed by LinkedIn, based on Amazon’s Dynamo concepts.

The main idea behind Voldemort is dividing data into chunks. Each chunk is replicated and served on the nodes of the Voldemort cluster. Each Voldemort daemon is able to route queries to the node that keeps the value for a particular key. Voldemort supports fast reads as well as random writes, but for this project we use Voldemort as a read-only datastore, replacing all data chunks after each batch process. Because the data stores are pre-generated by Hadoop, query serving is not affected by the deployment process. This is one of the advantages of the read-only, batch approach. We also have the flexibility to change the cluster topology and rebalance the data when needed.

Voldemort provides a Hadoop MapReduce Job that creates the data stores in a distributed cluster. Each chunk of data is just a Berkeley DB B-tree.

Voldemort’s interface is TCP, but we wanted to serve data using HTTP. VServ is a simple HTTP server that transforms incoming HTTP requests into Voldemort TCP requests. A load balancer is in charge of distributing queries among all the VServs.

The Computed Data

Statistics

Part of the analysis consists of calculating simple statistics: averages, max, min, stdev, unique counts, and so on. They are implemented using well-known MapReduce approaches. But we also compute some histograms. In order to implement them efficiently in Hadoop, we created a custom histogram that can be computed in only one pass. Moreover, we can compute all the simple statistics for each shop, together with the associated histograms, in just one MapReduce step, and for an arbitrary number of time periods.
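As a sketch of the one-pass idea (in plain Python rather than Datasalt’s actual Pangool code), the accumulator below folds count, sum, sum of squares, min, max, and a fixed-bin histogram into a single mergeable structure that a mapper or combiner could emit per shop and time period; combining two partial results in the reducer is then trivial.

    # Sketch: a mergeable one-pass accumulator for per-shop payment statistics.
    # Bin width and count are illustrative, not the actual Pangool job's values.
    import math

    class StatsAccumulator:
        def __init__(self, bin_width=10.0, num_bins=100):
            self.count = 0
            self.total = 0.0
            self.sum_sq = 0.0
            self.minimum = float("inf")
            self.maximum = float("-inf")
            self.bins = [0] * num_bins          # fixed-width histogram bins
            self.bin_width = bin_width

        def add(self, amount):
            self.count += 1
            self.total += amount
            self.sum_sq += amount * amount
            self.minimum = min(self.minimum, amount)
            self.maximum = max(self.maximum, amount)
            idx = min(int(amount / self.bin_width), len(self.bins) - 1)
            self.bins[idx] += 1

        def merge(self, other):
            # Combine two partial results (e.g. in a combiner or reducer).
            self.count += other.count
            self.total += other.total
            self.sum_sq += other.sum_sq
            self.minimum = min(self.minimum, other.minimum)
            self.maximum = max(self.maximum, other.maximum)
            self.bins = [a + b for a, b in zip(self.bins, other.bins)]

        def mean(self):
            return self.total / self.count if self.count else 0.0

        def stdev(self):
            if not self.count:
                return 0.0
            return math.sqrt(max(self.sum_sq / self.count - self.mean() ** 2, 0.0))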

In order to reduce the amount of storage used by histograms and to improve their visualization, the original computed histograms, which have many bins, are transformed into variable-width bin histograms. The following diagram shows the optimal 3-bin histogram for a particular input histogram.

The optimal histogram is computed using a random-restart hill-climbing approximation algorithm. The following diagram shows the possible movements on each hill-climbing iteration.

The algorithm has proven to be very fast and accurate: we achieved 99% accuracy compared to an exact dynamic programming algorithm (implemented from this paper), while being substantially faster.

Commerce Recommendations

Recommendations are computed using co-occurrences. That is, if somebody bought in both shop A and shop B, then a co-occurrence between A and B exists. Only one co-occurrence is taken into account even if a buyer bought several times in both A and B. The top co-occurring shops for a given shop are the recommendations for that shop.

But some improvements need to be applied to that simple idea of co-occurrences. First, the most popular shops are filtered out using a simple frequency cut, because almost everybody buys in them, so there is no value in recommending them. Filtering recommendations by location (shops close to each other), by shop category, or by both also improves the recommendations. Time-based co-occurrences produce hotter recommendations than “always true” recommendations. Limiting the time window in which a co-occurrence can happen results in recommendations of shops where people bought right after buying in the first one.

Hadoop and Pangool are the perfect tools to compute the co-occurrences and generate the recommendations, although some challenges are not easy to overcome. In particular, if one buyer pays in many shops, the number of co-occurrences for that credit card grows quadratically, so the analysis does not scale linearly. Because this is a rare case, we simply limit the number of co-occurrences per card, considering only the shops where the buyer bought the most.
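A minimal sketch of the co-occurrence counting itself (in plain Python rather than a Pangool job, with the per-card cap as an illustrative parameter) might look like this:

    # Sketch: counting shop co-occurrences from (card, shop) purchase records.
    from collections import defaultdict
    from itertools import combinations

    MAX_SHOPS_PER_CARD = 50   # illustrative cap to avoid quadratic blow-up

    def cooccurrences(purchases):
        """purchases: iterable of (card_id, shop_id) pairs."""
        shops_by_card = defaultdict(set)
        for card, shop in purchases:
            shops_by_card[card].add(shop)      # repeat purchases count once

        counts = defaultdict(int)
        for shops in shops_by_card.values():
            # The real system keeps only the shops where the buyer bought the
            # most; here we simply truncate a sorted list.
            shops = sorted(shops)[:MAX_SHOPS_PER_CARD]
            for a, b in combinations(shops, 2):
                counts[(a, b)] += 1            # one co-occurrence per card, per pair
        return counts

    data = [("c1", "A"), ("c1", "B"), ("c1", "A"), ("c2", "A"), ("c2", "B")]
    print(cooccurrences(data))                 # {('A', 'B'): 2}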

The Cost & Some Numbers

The amount of information to serve in Voldemort for one year of BBVA’s credit card transactions in Spain is 270 GB. The whole processing flow runs in 11 hours on a cluster of 24 “m1.large” instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data, would cost approximately $3500/month.

There is still room for optimizations. But considering that the solution is agile, flexible, and in the cloud, the price is quite reasonable. Running the system on in-house infrastructure would be much cheaper.

Conclusions & Future

Thanks to technologies like Hadoop, Amazon Web Services, and NoSQL databases, it is possible to quickly develop solutions that are scalable, flexible, and able to withstand human errors, all at a reasonable cost.

Future work would involve the replacement of Voldemort by Splout SQL, which allows you to deploy Hadoop-generated datasets and extends low-latency key/value access to low-latency SQL. It would reduce the analysis time and the amount of data to serve, since many aggregations could be performed “on the fly”. For example, it would allow for aggregated statistics over arbitrary time periods, which is something impossible to pre-compute.

[repost ]Averages, web performance data, and how your analytics product is lying to you

original:http://highscalability.com/blog/2012/5/23/averages-web-performance-data-and-how-your-analytics-product.html

This guest post is written by Josh Fraser, co-founder and CEO of Torbit. Torbit creates tools for measuring, analyzing and optimizing web performance.  

Did you know that 5% of the pageviews on Walmart.com take over 20 seconds to load? Walmart discovered this recently after adding real user measurement (RUM) to analyze their web performance for every single visitor to their site. Walmart used JavaScript to measure their median load time as well as key metrics like their 95th percentile. While 20 seconds is a long time to wait for a website to load, the Walmart story is actually not that uncommon. Remember, this is the worst 5% of their pageviews, not the typical experience.

Walmart’s median load time was reported at around 4 seconds, meaning half of their visitors loaded Walmart.com faster than 4 seconds and the other half took longer than 4 seconds to load. Using this knowledge, Walmart was prepared to act. By reducing page load times by even one second, Walmart found that they would increase conversions by up to 2%.

The Walmart case study highlights how important it is to use RUM and look beyond averages if you want an accurate picture of what’s happening on your site. Unlike synthetic tests, which load your website from random locations around the world, RUM allows you to collect real data from your actual visitors. If Walmart hadn’t added RUM and started tracking their 95th percentile, they might never have known about the performance issues that were costing them customers. After all, nearly every performance analytics product on the market just gives you an average loading time. If you only look at Walmart’s average loading time of 7 seconds, it’s not that bad, right? But as you just read, averages don’t tell the whole story.

There are three ways to measure the central tendency of any data set: the average (or mean), median, and the mode – in this post we’re only going to focus on the first two. We’re also going to focus on percentiles, all of which are reported for you in our real user measurement tool.

It may have been some time since you dealt with these terms so here’s a little refresher:

  • Average (mean): The sum of every data value in your set, divided by the total number of data points in that set. Skewed data or outliers can pull the average away from the center, which could lead you to incorrect interpretations.
  • Median: If you lined up each value in a data set in ascending order, the median is the single value in the exact middle. In page speed analytics, using the median gives you a more accurate representation of page load times for your visitors since it’s not influenced by skewed data or outliers. The median represents a load time where 50% of your visitors load the page faster than the median value and 50% load the page slower than that value.
  • Percentiles: Percentiles divide your ordered data into 100 equal-sized groups. Usually, we hear, “You’re in the 90th percentile,” which means that your data is better than 90 percent of the data in question. In real user measurement, the 90th percentile represents a time value such that 90 percent of your audience loads the page at that value or faster. Percentiles show you a time value that you can expect some percentage of your visitors to beat in their load times.

 

Example histogram showing the log-normal distribution of loading times

Look at this example histogram showing the loading times for one of our customers. If you’ve studied probability theory, you may recognize this as a log-normal distribution. This means the distribution is the multiplicative product of multiple independent random variables. When dealing with performance data, a histogram is one of your most helpful visualizations.

In this example, other products that only report the average load time would show that their visitors load the site in 5.76 seconds. While the average page load is 5.76 seconds, the median load time is 3.52 seconds. Over half of visitors load the site faster than 5.76 seconds, but you’d never know that just looking at averages. Additionally, the 90th percentile here is over 11 seconds! Most people are experiencing load times faster than that, but of course, that 10% still matters.
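As a quick numeric illustration of why these metrics diverge on skewed data, the snippet below draws a log-normal sample of load times and reports the mean, median, and 90th/95th percentiles. The distribution parameters are arbitrary, so the numbers will not match the histogram above.

    # Sketch: mean vs. median vs. percentiles on skewed (log-normal) load times.
    import random
    import statistics

    random.seed(42)
    # Arbitrary log-normal parameters chosen only to produce a skewed sample.
    load_times = [random.lognormvariate(1.2, 0.7) for _ in range(10000)]

    def percentile(data, p):
        ordered = sorted(data)
        index = int(round(p / 100.0 * (len(ordered) - 1)))
        return ordered[index]

    print("mean   %.2f s" % statistics.mean(load_times))
    print("median %.2f s" % statistics.median(load_times))
    print("p90    %.2f s" % percentile(load_times, 90))
    print("p95    %.2f s" % percentile(load_times, 95))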

For people who care about performance, it’s important to use a RUM product that gives you a complete view into what’s going on. You should be able to see a histogram of the loading times for every visitor to your site. You should be able to see your median load time, your 99th percentile and lots of other key metrics that are far more actionable than just looking at an average.

For any business making money online, you know that every visitor matters. For most sites, it’s not acceptable for 10% of your visitors to have a terrible experience. Those 10% were potential customers that you lost, perhaps for good, simply because your performance wasn’t as great as it should have been. But how do you quantify that?

It all begins with real user measurement.

If you want to accurately measure the speed on your site, it’s important to include RUM in your tool belt. Neither synthetic tests nor averages tell the full story. Without RUM, you’re missing out on important customer experience data that really matters for your business.


[repost ]Analyzing Apache logs with Pig

original:http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/

(guest blog post by Dmitriy Ryaboy)

A number of organizations donate server space and bandwidth to the Apache Foundation; when you download Hadoop, Tomcat, Maven, CouchDB, or any of the other great Apache projects, the bits are sent to you from a large list of mirrors. One of the ways in which Cloudera supports the open source community is to host such a mirror.

In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as apt-get install pig or yum install hadoop-pig.

There are many software packages that can do this kind of analysis automatically for you on average-sized log files, of course. However, many organizations log so much data and require such custom analytics that these ordinary approaches cease to work. Hadoop provides a reliable method for scaling storage and computation; PigLatin provides an expressive and flexible language for data analysis.

Our log files are in Apache’s standard CombinedLogFormat. It’s a tad more complicated to parse than tab- or comma-delimited files, so we can’t just use the built-in PigStorage() loader. Luckily, there is already a custom loader in the Piggybank built specifically for parsing these kinds of logs.

First, we need to get the PiggyBank from Apache. The PiggyBank is a collection of useful add-ons (UDFs) for Pig, contributed by the Pig user community. There are instructions on the Pig website for downloading and compiling the PiggyBank. Note that you will need to make sure to add pig.jar to your CLASSPATH environment variable before running ant.

Now, we can start our PigLatin script by registering the piggybank jarfile and defining references to methods we will be using.

register /home/dvryaboy/src/pig/trunk/piggybank.jar;
DEFINE LogLoader
    org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE DayExtractor
    org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');

By the way — the PiggyBank contains another useful loader, called MyRegExLoader, which can be instantiated with any regular expression when you declare it with a DEFINE statement. Useful in a pinch.

While we are working on our script, it may be useful to run in local mode, only reading a small sample data set (a few hundred lines). In production we will want to run on a different file. Moreover, if we like the reports enough to automate them, we may wish to run the report every day, as new logs come in. This means we need to parameterize the source data location. We will also be using a database that maps geographic locations to IPs, and we probably want to parametrize that as well.

%default LOGS 'access_log.small'
%default GEO 'GeoLiteCity.dat'

To specify a different value for a parameter, we can use the -param flag when launching the pig script:

pig -x mapreduce -f scripts/blogparse.pig -param LOGS='/mirror.cloudera.com/logs/access_log.*'

For mapping IPs to geographic locations, we use a third-party database from MaxMind.  This database maps IP ranges to countries, regions, and cities.  Since the data from MaxMind lists IP ranges, and our logs list specific IPs, a regular join won’t work for our purposes. Instead, we will write a simple script that takes a parsed log as input, looks up the geo information using MaxMind’s Perl module, and outputs the log with geo data prepended.

The script itself is simple — it reads in a tuple representing a parsed log record, checks the first field (the IP) against the database, and prints the data back to STDOUT:

#!/usr/bin/env perl
use warnings;
use strict;
use Geo::IP::PurePerl;

my ($path) = shift;
my $gi = Geo::IP::PurePerl->new($path);

while (<>) {
  chomp;
  if (/([^\t]*)\t(.*)/) {
    my ($ip, $rest) = ($1, $2);
    my ($country_code, undef, $country_name, $region, $city)
      = $gi->get_city_record($ip);
    print join("\t", $country_code||'', $country_name||'',
      $region||'', $city||'', $ip, $rest), "\n";
  }
}

Getting this script into Pig is a bit more interesting. The Pig Streaming interface provides us with a simple way to ship scripts that will process data, and cache any necessary objects (such as the GeoLiteCity.dat file we downloaded from MaxMind).  However, when the scripts are shipped, they are simply dropped into the current working directory. It is our responsibility to ensure that all dependencies—such as the Geo::IP::PurePerl module—are satisfied. We could install the module on all the nodes of our cluster; however, this may not be an attractive option. We can ship the module with our script—but in Perl, packages are represented by directories, so just dropping the .pm file into cwd will not be sufficient, and Pig doesn’t let us ship directory hierarchies.  We solve this problem by packing the directory into a tarball, and writing a small Bash script called “ipwrapper.sh” that will set up our Perl environment when invoked:

#!/usr/bin/env bash
tar -xzf geo-pack.tgz
PERL5LIB=$PERL5LIB:$(pwd) ./geostream.pl $1

The geo-pack.tgz tarball simply contains geostream.pl and Geo/IP/PurePerl.pm.

We also want to make the GeoLiteCity.dat file available to all of our nodes. It would be inefficient to simply drop the file in HDFS and reference it directly from every mapper, as this would cause unnecessary network traffic.  Instead, we can instruct Pig to cache a file from HDFS locally, and use the local copy.

We can relate all of the above to Pig in a single instruction:

DEFINE iplookup `ipwrapper.sh $GEO`
  ship ('ipwrapper.sh')
  cache('/home/dvryaboy/tmp/$GEO#$GEO');

We can now write our main Pig script. The objective here is to load the logs, filter out obviously non-human traffic, and use the rest to calculate the distribution of downloads by country and by Apache project.

Load the logs:

logs = LOAD '$LOGS' USING LogLoader as
  (remoteAddr, remoteLogname, user, time, method,
   uri, proto, status, bytes, referer, userAgent);

Filter out records that represent non-human traffic (Googlebot and the like), that aren’t Apache-related, or that only check the headers without downloading any content.

logs = FILTER logs BY bytes != '-' AND uri matches '/apache.*';

-- project just the columns we will need
logs = FOREACH logs GENERATE
  remoteAddr,
  DayExtractor(time) as day, uri, bytes, userAgent;

-- The filtering function is not actually in the PiggyBank.
-- We plan on contributing it soon.
notbots = FILTER logs BY (NOT
  org.apache.pig.piggybank.filtering.IsBotUA(userAgent));

Get country information, group by country code, aggregate.

with_country = STREAM notbots THROUGH `ipwrapper.sh $GEO`
  AS (country_code, country, state, city, ip, time, uri, bytes, userAgent);

geo_uri_groups = GROUP with_country BY country_code;

geo_uri_group_counts = FOREACH geo_uri_groups GENERATE
  group,
  COUNT(with_country) AS cnt,
  SUM(with_country.bytes) AS total_bytes;

geo_uri_group_counts = ORDER geo_uri_group_counts BY cnt DESC;

STORE geo_uri_group_counts INTO 'by_country.tsv';

The first few rows look like:

Country    Hits    Bytes
USA        8906    2.0458781232E10
India      3930    1.5742887409E10
China      3628    1.6991798253E10
Mexico      595    1.220121453E9
Colombia    259    5.36596853E8

At this point, the data is small enough to plug into your favorite visualization tools. We wrote a quick-and-dirty Python script to take logarithms and use the Google Chart API to draw this map:

Bytes by Country
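
That script isn’t included in the post; a minimal sketch of the log-scaling step might look like the following, assuming the tab-separated country/hits/bytes layout shown above (the actual Google Chart API URL construction is left out):

#!/usr/bin/env python
# Hypothetical sketch of the log-scaling step: read the (country, hits, bytes)
# rows produced by the Pig script above, take log10 of the byte counts, and
# normalize them to 0-100 so they can be fed to a mapping/charting tool.
# Building the Google Chart API URL itself is omitted here.
import math
import sys

rows = []
for line in sys.stdin:
    country, hits, byte_count = line.rstrip("\n").split("\t")
    rows.append((country, math.log10(float(byte_count))))

lo = min(v for _, v in rows)
hi = max(v for _, v in rows)
for country, v in rows:
    scaled = 100.0 * (v - lo) / (hi - lo) if hi > lo else 0.0
    print("%s\t%.1f" % (country, scaled))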

This is pretty interesting. Let’s do a breakdown by US states.

Note that with the upcoming Pig 0.3 release, you will be able to have multiple stores in the same script, allowing you to re-use the loading and filtering results from earlier steps. With Pig 0.2, this needs to go in a separate script, with all the required DEFINEs, LOADs, etc.

us_only = FILTER with_country BY country_code == 'US';

by_state = GROUP us_only BY state;

by_state_cnt = FOREACH by_state GENERATE
  group,
  COUNT(us_only.state) AS cnt,
  SUM(us_only.bytes) AS total_bytes;

by_state_cnt = ORDER by_state_cnt BY cnt DESC;

STORE by_state_cnt INTO 'by_state.tsv';

Theoretically, Apache selects an appropriate server based on the visitor’s location, so our logs should show a heavy skew towards California. Indeed, they do (recall that the intensity of the blue color is based on a log-scale).

Bytes by US State

Now, let’s get a breakdown by project. To get a rough mapping of URI to Project, we simply get the directory name after /apache in the URI. This is somewhat inaccurate, but good for quick prototyping. This time around, we won’t even bother writing a separate script — this is a simple awk job, after all! Using streaming, we can process data the same way we would with basic Unix utilities connected by pipes.

uris = FOREACH notbots GENERATE uri;

-- note that we have to escape the dollar sign for $3,
-- otherwise Pig will attempt to interpret this as a Pig variable.
project_map = STREAM uris
  THROUGH `awk -F '/' '{print \$3;}'` AS (project);

project_groups = GROUP project_map BY project;

project_count = FOREACH project_groups GENERATE
  group,
  COUNT(project_map.project) AS cnt;

project_count = ORDER project_count BY cnt DESC;

STORE project_count INTO 'by_project.tsv';

We can now take the by_project.tsv file and plot the results (in this case, we plotted the top 18 projects, by number of downloads).
Downloads by Project

We can see that Tomcat and Httpd dwarf the rest of the projects in terms of file downloads, and the distribution appears to follow a power law.

We’d love to hear how folks are using Pig to analyze their data. Drop us a line, or comment below!

[repost ]Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

original:http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/

Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.

Background: Adverse Drug Events

An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.

Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together lead to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.

Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases in order to track adverse events that occur after drugs have been approved for market. In the United States, the FDA uses the Adverse Event Reporting System (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced.  The FDA makes a well-formatted sample of the reports available for download from their website, to the benefit of data scientists everywhere.

Methodology

Identifying ADEs is primarily a signal detection problem: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a 2×2 contingency table:

For All Drugs/Reactions    Reaction = Rj    Reaction != Rj    Total
Drug = Di                  A                B                 A + B
Drug != Di                 C                D                 C + D
Total                      A + C            B + D             A + B + C + D


Based on the values in the cells of the tables, we can compute various measures of disproportionality to find drug-reaction pairs that occur more frequently than we would expect if they were independent.
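
As an illustration, here is a minimal Python sketch of two widely used disproportionality measures computed from such a table, the proportional reporting ratio (PRR) and the reporting odds ratio (ROR); the cell values below are made up:

# Minimal sketch: disproportionality measures from a 2x2 drug/reaction table.
# A = reports with drug Di and reaction Rj, B = drug Di without Rj,
# C = reaction Rj without drug Di, D = neither. The counts below are invented.

def prr(a, b, c, d):
    """Proportional reporting ratio: rate of Rj among Di reports
    relative to the rate of Rj among all other reports."""
    return (a / (a + b)) / (c / (c + d))

def ror(a, b, c, d):
    """Reporting odds ratio: odds of Rj given Di versus odds of Rj without Di."""
    return (a / b) / (c / d)

if __name__ == "__main__":
    a, b, c, d = 40, 960, 200, 98800   # hypothetical report counts
    print("PRR = %.2f" % prr(a, b, c, d))
    print("ROR = %.2f" % ror(a, b, c, d))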

For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method described in the paper “Empirical Bayes screening for multi-item associations” by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses, and analyzing defects in automobiles.

Solving the Hard Problem with Apache Hadoop

At first glance, it doesn’t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a combinatorial explosion in the number of possible drug interactions we must consider. Even restricting ourselves to analyzing pairs of drugs, there are more than 3 trillion potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. Even including the iterative Expectation Maximization algorithm that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.

The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.
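
To make the shape of that counting stage concrete, here is a deliberately stripped-down Hadoop Streaming-style sketch in Python. The input format and script name are invented for illustration only; the pipeline described above is actually a sequence of several cleaning, aggregation, and join jobs.

#!/usr/bin/env python
# Hypothetical Hadoop Streaming sketch of the counting stage: emit one record
# per (drug, drug, reaction) triple found in a report, then sum the counts.
# The input layout (report_id, semicolon-separated drugs, semicolon-separated
# reactions, tab-separated) is an assumption for illustration only.
import sys
from itertools import combinations

def mapper():
    for line in sys.stdin:
        _, drugs, reactions = line.rstrip("\n").split("\t")
        drug_list = sorted(set(drugs.split(";")))
        for d1, d2 in combinations(drug_list, 2):
            for r in reactions.split(";"):
                print("%s|%s|%s\t1" % (d1, d2, r))

def reducer():
    # Hadoop's shuffle delivers the keys sorted, so we can sum by run of equal keys.
    current, count = None, 0
    for line in sys.stdin:
        key, n = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

With Hadoop Streaming, the same (hypothetical) file would be passed as both -mapper 'count_triples.py map' and -reducer 'count_triples.py reduce'.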

Visualizing the Results

The output of our analysis is a collection of drug-drug-reaction triples that have very large disproportionality scores. But as we all know, correlation is not causation. The output of our analysis provides us with useful information that should be filtered and evaluated by domain experts and used as the basis for further study using controlled experiments.

With that caveat in mind, our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.

Even with restrictive filters applied to the drug-drug-reaction triples, we still end up with tens of thousands of triples that score high enough to merit further investigation. In addition to looking at individual triples, we can also use graph visualization tools like Gephi to explore the macro-level structure of the data. Gephi has a number of powerful layout algorithms and filtering tools that allow us to impose structure on an undifferentiated mass of data points. Here is a graph in which the vertices are drugs and the thickness of the edges represent the number of high scoring adverse reactions that feature each pair of drugs:

 

We can also pan and zoom to different regions of the graph and highlight clusters of drug interactions. Here is a cluster of drugs that are used in treating HIV:

A cluster of HIV-related drugs

 

And here is a cluster of drugs that are used to fight cancer:

A cluster of cancer-related drugs

 

The combination of Apache Hadoop, R, and Gephi changes the way we think about analyzing adverse drug events. Instead of focusing on a handful of outcomes, we can process all of the events in the data set at the same time. We can try out hundreds of different strategies for cleaning records, stratifying observations into clusters, and scoring drug-reaction tuples, run everything in parallel, and analyze the data at a fraction of the cost of a traditional supercomputer. We can render the results of our analyses using visualization tools that can be used by domain experts to explore relationships within our data that they might never have thought to look for. By dramatically reducing the costs of exploration and experimentation, we foster an environment that enables innovation and discovery.

Open Data, Open Analysis

This project was possible because the FDA’s Center for Drug Evaluation and Research makes a portion of their data open and available to anyone who wants to download it. In turn, we are releasing a well-commented version of the code we used to analyze that data – a mixture of Java, Pig, R, and Python – on the Cloudera github repository under the Apache License. We also contributed the most useful Pig function developed for this project, which computes approximate quantiles for a stream of numbers, to LinkedIn’s datafu library. We hope to collaborate with the community to improve the models over time and create a resource for students, researchers, and fellow data scientists.

[repost ]Pixable Architecture – Crawling, Analyzing, And Ranking 20 Million Photos A Day

original:http://highscalability.com/blog/2012/2/21/pixable-architecture-crawling-analyzing-and-ranking-20-milli.html

This is a guest post by Alberto Lopez Toledo, PhD, CTO of Pixable, and Julio Viera, VP of Engineering at Pixable.

Pixable aggregates photos from across your different social networks and finds the best ones so you never miss an important moment. That means currently processing the metadata of more than 20 million new photos per day: crawling, analyzing, ranking, and sorting them along with the other 5+ billion that are already stored in our database. Making sense of all that data has challenges, but two in particular rise above the rest:

  1. How to access millions of photos per day from Facebook, Twitter, Instagram, and other services in the most efficient manner.
  2. How to process, organize, index, and store all the meta-data related to those photos.

Sure, Pixable’s infrastructure is changing continuously, but there are some things that we have learned over the last year. As a result, we have been able to build a scalable infrastructure that takes advantage of today’s tools, languages, and cloud services, all running on Amazon Web Services, where we have more than 80 servers running. This document provides a brief introduction to those lessons.

Backend Architecture – Where Everything Happens

Infrastructure – Loving Amazon EC2

We maintain all of our servers on Amazon EC2 using a wide range of instances, from t1.micro to m2.2xlarge, with CentOS Linux. After setting up the server, we create our own internal AMIs–one for every type of server. We always have them ready for immediate deployment when the load increases, thus maintaining a minimum performance standard at all times.

To compensate for these load fluctuations, we developed our own auto-scaling technology that forecasts how many servers we need of each type according to the current and historic load for a particular time of the day. Then we launch or terminate instances just to keep the right level of provisioning. In this way, we are able to save money by scaling down our servers when they’re unnecessary. Auto-scaling in Amazon is not easy, since there are many variables to consider.

For example: it makes no sense to terminate an instance that has been running for just half an hour, since Amazon charges for whole hours. Also, Amazon can take 20+ minutes to launch a spot instance. So for sudden spikes in traffic, we do some clever launch scheduling of on-demand instances (which launch much faster), and then swap them out for spot instances in the next hour. This is the result of pure operational research, the objective of which is to extract the best performance for the right amount of money. Think of it like the film “Moneyball”, but with virtualized servers instead of baseball players.
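
A heavily simplified sketch of that hour-boundary reasoning is shown below; the forecast is assumed to have already produced a target server count, and none of the names or thresholds are Pixable’s actual rules.

# Simplified sketch of hour-aware scale-down logic. `needed` stands in for the
# output of the load forecast; instance records and the grace window are
# placeholders for illustration only.
from datetime import datetime, timedelta

BILLING_PERIOD = timedelta(hours=1)

def minutes_until_hour_boundary(launch_time, now):
    """Minutes left before this instance rolls into its next billed hour."""
    elapsed = now - launch_time
    into_current_hour = elapsed - (elapsed // BILLING_PERIOD) * BILLING_PERIOD
    return (BILLING_PERIOD - into_current_hour).total_seconds() / 60.0

def instances_to_terminate(running, needed, now, grace_minutes=10):
    """Pick surplus instances to kill, preferring ones whose paid hour is almost up."""
    surplus = max(0, len(running) - needed)
    near_boundary = [
        (iid, launch) for iid, launch in running
        if minutes_until_hour_boundary(launch, now) <= grace_minutes
    ]
    return [iid for iid, _ in near_boundary[:surplus]]

if __name__ == "__main__":
    now = datetime.utcnow()
    fleet = [("i-aaa", now - timedelta(minutes=55)),   # 5 minutes left in its billed hour
             ("i-bbb", now - timedelta(minutes=20))]   # 40 minutes already paid for
    print(instances_to_terminate(fleet, needed=1, now=now))  # ['i-aaa']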

Our web servers currently run Apache + PHP 5.3 (lately we have been fine-tuning some web servers to run nginx + php-fpm, which will soon become our standard configuration). The servers are evenly distributed across different availability zones behind Amazon’s Elastic Load Balancer, so we can absorb both downtime and price fluctuations on Amazon. Our static content is stored on Amazon CloudFront, and we use Amazon Route 53 for DNS services. Yes indeed… we love Amazon.

Work Queue – Jobs For Crawling And Ranking Photos, Sending Notifications, And More

Virtually all processing at Pixable is done via asynchronous jobs (e.g., crawling new photos for different users from Facebook, sending push notifications, calculating friend rankings, etc.). We have several dozen worker servers crawling metadata from photos from different services and processing that data. This is a continuous, around-the-clock process.

As expected, we have different types of jobs: some with high priority, such as real-time user calls, messaging, and crawling photos for currently active users. Lower-priority jobs include offline crawling and long, data-intensive deferred tasks. Although we use the very capable beanstalkd as our job queue server, we have developed our own management framework on top of it. We call it the Auto-Pilot, and it automatically manages the handling of priorities, e.g., by devoting worker time to high-priority jobs and pausing lower-priority ones when certain platform-wide conditions are met.

We developed very sophisticated rules to handle these priorities, considering metrics that affect the performance of the system and impact the user-perceived speed. Some are obvious, such as the average waiting time of jobs or the lag of the slaves (ssshhh, we never have lag on our slaves :-) ); others are more complex, such as the status of our own PHP synchronization mutex locks for distributed environments. We do as much as possible to make an equitable trade-off between efficiency and performance.
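
As a toy illustration of that gating idea (the metric names and thresholds below are invented, not Pixable’s actual rules):

# Toy sketch of Auto-Pilot-style priority gating: low-priority work is paused
# whenever platform-wide metrics cross a threshold. Metric names and thresholds
# are invented; the real system sits on top of beanstalkd and is far richer.
HIGH, LOW = "high-priority", "low-priority"

def tubes_to_serve(metrics):
    """Decide which job classes the worker servers may pull from right now."""
    overloaded = (
        metrics.get("avg_job_wait_seconds", 0) > 5
        or metrics.get("slave_lag_seconds", 0) > 1
        or metrics.get("mutex_contention_ratio", 0) > 0.5
    )
    return [HIGH] if overloaded else [HIGH, LOW]

print(tubes_to_serve({"avg_job_wait_seconds": 12}))   # ['high-priority']
print(tubes_to_serve({"avg_job_wait_seconds": 0.3}))  # ['high-priority', 'low-priority']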

Crawling Engine – Crawl New Photos Across Facebook, Twitter And More 24/7

We are constantly improving our crawling technology, which is a complex parallel algorithm that uses a mutex locking library, developed in-house, to synchronize all the processes for a particular user. This algorithm has helped us improve our Facebook crawling speed by at least 5x since launch. We can now easily fetch in excess of 20 million new photos every day. This is quite remarkable, considering the fact that any large data query to the Facebook API can take several seconds. We’ll go deeper into our crawling engine in a follow-up document.
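
The in-house locking library isn’t described in detail; as a rough sketch of one common approach, a per-user lock built on memcached’s atomic add (shown here with the pymemcache client as an assumed dependency) gives a feel for what such synchronization looks like:

# Rough sketch of a per-user crawl lock built on memcached's atomic "add".
# pymemcache is used purely as an example client; the in-house library from
# the post is not public and certainly differs from this.
import time
import uuid
from pymemcache.client.base import Client

client = Client(("localhost", 11211))

def acquire_user_lock(user_id, ttl=30, wait=5.0):
    """Try to become the only worker crawling this user for up to `ttl` seconds."""
    key = "crawl-lock:%s" % user_id
    token = uuid.uuid4().hex.encode()
    deadline = time.time() + wait
    while time.time() < deadline:
        # add() only succeeds if the key does not exist yet, giving mutual exclusion.
        if client.add(key, token, expire=ttl, noreply=False):
            return token
        time.sleep(0.1)
    return None

def release_user_lock(user_id):
    # A production version should verify the lock token before deleting.
    client.delete("crawl-lock:%s" % user_id)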

Data Storage – Indexing Photos And Metadata

Naturally, our data storage grows every day. Currently we store 90% of our data in MySQL (with a memcached layer on top of it), using two groups of servers. The first group is a 2 master – 2 slave configuration that stores the more normalized data accessed by virtually every system, such as user profile information, global category settings, and other system parameters.

The second server group contains manually sharded servers in which we store the data related to user photos, such as photo URLs. This metadata is highly de-normalized to the point where we virtually run the storage as a NoSQL solution like MongoDB, only in MySQL tables (NoSQL-in-MySQL). So you can guess where the other 10% of our data is stored, right? Yes, in MongoDB! We are moving parts of our storage to MongoDB, mainly because of the simple yet flexible sharding and replication solutions it offers.
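
For readers unfamiliar with manual sharding, the routing side usually boils down to something like this bare-bones sketch (shard count, host names, and hash choice are all assumptions, not Pixable’s actual scheme):

# Bare-bones sketch of manual shard routing for photo metadata. The shard count,
# connection strings, and hash choice are assumptions for illustration only.
import zlib

PHOTO_SHARDS = [
    "mysql://photos-shard-00.internal/pixable",
    "mysql://photos-shard-01.internal/pixable",
    "mysql://photos-shard-02.internal/pixable",
    "mysql://photos-shard-03.internal/pixable",
]

def shard_for_user(user_id):
    """Keep all of a user's photo rows on one shard so per-user queries stay local."""
    return PHOTO_SHARDS[zlib.crc32(str(user_id).encode()) % len(PHOTO_SHARDS)]

print(shard_for_user(42))  # one of the four DSNs above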

Logging, Profiling And Analytics

We developed a highly flexible logging and profiling framework that allows us to record events with high granularity–down to a single line of code. Every log event is categorized by a set of labels that we later query (e.g., events of user X, or calls in module Y). On top of that, we can dynamically profile the time between any of the logging events, allowing us to build real-time profiling of our entire system. The logging and profiling system puts a very heavy load on the storage system (several thousand updates per second), so we developed a mix of two-level MySQL tables (a memory-based fast buffer that serves as a leaky bucket for the actual data), combined with partitioned MySQL tables that are filled up later asynchronously. This architecture allows us to process more than 15,000 log entries per second. We also have our own event tracking system, wherein every user action, from logins to shares to individual clicks, is recorded so it can later be analyzed with complex queries.
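
A condensed sketch of that two-level buffering idea, with an in-memory queue standing in for the fast MySQL buffer table (class, table names, and thresholds are invented):

# Condensed sketch of the two-level idea: events land in a fast in-memory buffer
# (standing in for the memory-based MySQL buffer table) and a background thread
# drains them in batches into a partitioned durable table.
import threading
import time
from collections import deque

class LeakyLogBuffer:
    def __init__(self, flush_size=5000, flush_interval=1.0):
        self.buffer = deque()
        self.lock = threading.Lock()
        self.flush_size = flush_size
        self.flush_interval = flush_interval

    def log(self, labels, message):
        # Hot path: append only, no I/O, so thousands of events per second stay cheap.
        with self.lock:
            self.buffer.append((time.time(), labels, message))

    def flush_forever(self, write_batch):
        # Background thread: drain the buffer into the partitioned table in batches.
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                n = min(len(self.buffer), self.flush_size)
                batch = [self.buffer.popleft() for _ in range(n)]
            if batch:
                write_batch(batch)  # e.g. one multi-row INSERT into a partitioned log table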

We also rely heavily on the wonderful Mixpanel service, an event-based tracking system in which we perform most of our high-level analysis and reports.

Front End – Simple Visualization Devices

Pixable runs on multiple (front-end) devices, the most popular being the iPhone and the iPad. We also have web and mobile web sites that load a simple index page, and everything else is performed in the client by extensive use of jQuery and Ajax calls. All our web frontends will soon run a single codebase that automatically adapts to mobile or desktop screens (give it a try: http://new.pixable.com). This way, we can run the same code on our main site, on an Android device, or in a Chrome browser extension! In fact, our Android app is a combination of a native app with our mobile web frontend. The Android app renders a frame with some minimal controls, and simply presents the mobile web view inside of it.

It sounds a bit harsh, but all our frontends are “dumb”. All the heavy lifting is performed in the backend, with everything connected via our own private API. This allows us to do rapid development and deployment without having to change the installed user base. Agile, baby!

API – Connecting Our Front End And Back End

The API is the glue that keeps everything working together. Using Tonic PHP, we developed our own private RESTful API that exposes our backend functionality. We also extended Tonic to support built-in versioning. That way, it’s really easy for us to maintain backward compatibility when developing new features or changing response formats in the API. We just configure which versions of the API calls are supported for which device version. We have a no-device-left-behind policy, no matter how old the version is!
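
A simplified sketch of that version-mapping idea (endpoint names and version ranges below are made up; the real mapping lives in the extended Tonic layer):

# Simplified sketch of per-device API version routing. Endpoint names and
# version ranges are made up for illustration.
SUPPORTED_VERSIONS = {
    "photos.feed": {"ios": (1, 4), "android": (2, 4), "web": (3, 4)},
    "photos.rank": {"ios": (2, 4), "android": (2, 4), "web": (2, 4)},
}

def resolve_version(endpoint, device, requested):
    """Return the requested version if supported for this device, otherwise clamp
    it into the allowed range (no-device-left-behind)."""
    low, high = SUPPORTED_VERSIONS[endpoint][device]
    return min(max(requested, low), high)

print(resolve_version("photos.feed", "ios", requested=2))  # -> 2
print(resolve_version("photos.feed", "ios", requested=9))  # -> 4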

But the API does not do any actual work. It simply relays the information from the front-ends, passing it to the actual Pixable heart: the backend.