Getting Started: Sentiment Analysis
Sentiment analysis refers to various methods of examining and processing data in order to identify a subjective response, usually a general mood or a group’s opinions about a specific topic. For example, sentiment analysis can be used to gauge the overall positivity of a blog or a document, or to capture constituent attitudes toward a political candidate.
Sentiment data is often derived from social media services and similar user-generated content, such as reviews, comments, and discussion groups. The data sets thus tend to grow large enough to be considered “big data.”
Suppose your company recently released a new product and you want to assess its reception among consumers. You know that social media can help you capture a broad sample of public opinion, but you don’t have time to monitor every mention. You need a better way to determine aggregate sentiment.
Amazon EMR integrates open-source data processing frameworks with the full suite of Amazon Web Services. The resulting architecture is scalable, efficient, and ideal for analyzing large-scale sentiment data, such as tweets over a given time period.
In this tutorial, you’ll launch an AWS CloudFormation stack that provides a script for collecting tweets. You’ll store the tweets in Amazon S3 and customize a mapper file for use with Amazon EMR. Then you’ll create an Amazon EMR cluster that uses a Python natural language toolkit, implemented with a Hadoop streaming job, to classify the data. Finally, you’ll examine the output files and evaluate the aggregate sentiment of the tweets.
This tutorial typically takes less than an hour to complete. You pay only for the resources you use. The tutorial includes a cleanup step to help ensure that you don’t incur additional costs. You may also want to review the Pricing topic.
Before you begin, make sure you’ve completed the steps in Getting Set Up.
Click Next to start the tutorial.
Step 1: Create a Twitter Developer Account
In order to collect tweets for analysis, you’ll need to create an account on the Twitter developer site and generate credentials for use with the Twitter API.
- Go to https://dev.twitter.com/user/login and log in with your Twitter user name and password. If you do not yet have a Twitter account, click the Sign up link that appears under the Username field.
- If you’ve already used the Twitter developer site to generate credentials and register applications, skip to the next step.
If you have not yet used the Twitter developer site, you’ll be prompted to authorize the site to use your account. Click Authorize app to continue.
- Go to the Twitter applications page at https://dev.twitter.com/apps and click Create a new application.
- Follow the on-screen instructions. For the application Name, Description, and Website, you can enter any text — you’re simply generating credentials to use with this tutorial, rather than creating a real application.
- On the details page for your new application, you’ll see a Consumer key and Consumer secret. Make a note of these values; you’ll need them later in this tutorial. You may want to store your credentials in a text file.
- At the bottom of the application details page, click Create my access token. Make a note of the Access token and Access token secret values that appear, or add them to the text file you created in the preceding step.
If you need to retrieve your Twitter developer credentials at any point, you can go to https://dev.twitter.com/apps and select the application you created for the purposes of this tutorial.
Step 2: Create an Amazon S3 Bucket for the Amazon EMR Files
Amazon EMR jobs typically use Amazon S3 buckets for input and output data files, as well as for any mapper and reducer files that aren’t provided by open-source tools. For the purposes of this tutorial, you’ll create your own Amazon S3 bucket in which you’ll store the data files and a custom mapper.
- Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/home.
- Click Create Bucket.
- Enter a name for your bucket, such as mysentimentjob.
To meet Hadoop requirements, Amazon S3 bucket names used with Amazon EMR are restricted to lowercase letters, numbers, periods (.), and hyphens (-).
- Leave the Region set to US Standard and click Create.
- Click the name of your new bucket in the All Buckets list.
- Click Create Folder, then type input. Press Enter or click the check mark.
- Repeat this step to create another folder called mapper at the same level as the input folder.
- For the purposes of this tutorial (to ensure that all services can use the folders), you should make the folders public. Select the check boxes next to your folders. Click Actions, then click Make Public. Click OK to confirm that you want to make the folders public.
Make a note of your bucket and folder names — you’ll need them in later steps.
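The character restriction noted above can be checked with a short pattern before you create the bucket. This helper is illustrative only: it checks the lowercase/number/period/hyphen rule quoted in this step, not every Amazon S3 naming requirement.

```python
import re

# Characters allowed in Amazon EMR bucket names per the rule above:
# lowercase letters, numbers, periods (.), and hyphens (-).
BUCKET_NAME_RE = re.compile(r"^[a-z0-9.-]+$")

def is_valid_emr_bucket_name(name):
    """Return True if the name uses only characters Amazon EMR accepts."""
    return bool(BUCKET_NAME_RE.match(name))
```

For example, `mysentimentjob` passes, while a name with uppercase letters or underscores does not.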
Step 3: Collect and Store the Sentiment Data
In this step, you’ll use an AWS CloudFormation template to launch an instance, then use the tools on the instance to collect data via the Twitter API. You’ll also use a command-line tool to store the collected data in the Amazon S3 bucket you created.
- Open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation.
- Make sure US East (N. Virginia) is selected in the region selector of the navigation bar.
- Click Create Stack.
- In the Stack Name box, type any name that will help you identify your stack, such as MySentimentStack.
- Under Template, select Provide a Template URL. Type https://s3.amazonaws.com/awsdocs/gettingstarted/latest/sentiment/sentimentGSG.template in the box (or copy the URL from this page and paste it in the box). Click Continue.
- On the Specify Parameters page, enter your AWS and Twitter credentials. The Key Pair name must match the key pair you created in the US-East region in Step 2: Create a Key Pair.
For best results, copy and paste the Twitter credentials from the Twitter developer site or the text file you saved them in.
The order of the Twitter credential boxes on the Specify Parameters page may not match the display order on the Twitter developer site. Make sure you’re pasting the correct value in each box.
- Select the check box to acknowledge that the template may create IAM resources, then click Continue. Click Continue again on the Add Tags page.
- Review your settings, making sure your Twitter credentials are correct. You can make changes to the settings by clicking the Edit link for a specific step in the process.
- Click Continue to launch the stack. A confirmation window opens. Click Close.
- The confirmation window closes, returning you to the AWS CloudFormation console. Your new AWS CloudFormation stack appears in the list with its status set to CREATE_IN_PROGRESS.
Your stack will take several minutes to launch. Make sure to click Refresh on the Stacks page to see whether the stack has been successfully created.
For more information about AWS CloudFormation, go to Walkthrough: Updating a Stack.
When your stack shows the status CREATE_COMPLETE, it’s ready to use.
- Click the Outputs tab in the bottom pane to get the IP address of the Amazon EC2 instance that AWS CloudFormation created.
- Connect to the instance via SSH, using the user name ec2-user. For more information about connecting to an instance and configuring your SSH credentials and tools, see Connecting to Your Linux/UNIX Instances Using SSH or Connecting to Linux/UNIX Instances from Windows Using PuTTY. (Disregard the sections that describe how to transfer files.)
- In the SSH window, type the following command:
cd sentiment
- The instance has been preconfigured with Tweepy, an open-source package for use with the Twitter API. Python scripts for running Tweepy appear in the sentiment directory. To ensure that they are present, type the following command:
ls
You should see several Python files, including twaiter.py.
- To collect tweets, type the following command, where term1 is your search term:
python collector.py term1
To use a multi-word term, enclose it in quotation marks. Examples:
python collector.py kindle
python collector.py "kindle fire"
The collector script is not case sensitive.
- Press Enter to run the collector script. Your SSH window should show the message “Collecting tweets. Please wait.”
The script collects 500 tweets, which may take several minutes. If you’re searching for a subject that is not currently popular on Twitter (or if you edited the script to collect more than 500 tweets), the script will take longer to run. You can interrupt it at any time by pressing Control+C.
When the script has finished running, your SSH window will show the message “Finished collecting tweets.”
If your SSH connection is interrupted while the script is still running, reconnect to the instance and run the script with nohup (e.g., nohup python collector.py > /dev/null &).
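For orientation, the collection behavior described above can be sketched in a few lines. The real collector.py uses Tweepy and the Twitter API, which are not reproduced here; fetch_page below is a hypothetical stand-in for a call that returns the next page of matching tweets (or an empty list when the results run out).

```python
TARGET_TWEETS = 500  # the provided script stops after 500 tweets

def collect_tweets(fetch_page, target=TARGET_TWEETS):
    """Accumulate tweets until the target count is reached or the
    search results are exhausted."""
    tweets = []
    while len(tweets) < target:
        page = fetch_page()
        if not page:  # no more matching tweets for this search term
            break
        tweets.extend(page)
    return tweets[:target]
```

With a popular term the loop reaches 500 quickly; with a niche term, fewer results are available per page, which is why the script can take longer to finish.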
Your sentiment analysis stack has been preconfigured with s3cmd, a command-line tool for Amazon S3. You’ll use s3cmd to store your tweets in the bucket you created earlier.
- In your SSH window, type the following command. (The current directory should still be sentiment. If it's not, use cd to navigate to the sentiment directory.)
ls
You should see a file with a name like tweets.Nov12-1227.txt, where the date and time reflect when the script was run. This file contains the ID numbers and full text of the tweets that matched your search terms.
- To copy the Twitter data to Amazon S3, type the following command, where tweet-file is the file you identified in the previous step and your-bucket is the name of the Amazon S3 bucket you created earlier:
s3cmd put tweet-file s3://your-bucket/input/
Example:
s3cmd put tweets.Nov12-1227.txt s3://mysentimentjob/input/
Be sure to include the trailing slash, to indicate that input is a folder. Otherwise, Amazon S3 will create an object called input in your base S3 bucket.
- To verify that the file was uploaded to Amazon S3, type the following command:
s3cmd ls s3://your-bucket/input/
You can also use the Amazon S3 console at https://console.aws.amazon.com/s3/ to view the contents of your bucket and folders.
Step 4: Customize the Amazon EMR Mapper
When you create your own Hadoop streaming programs, you’ll need to write mapper and reducer executables as described in Process Data with a Streaming Cluster in the Amazon Elastic MapReduce Developer Guide. For this tutorial, we’ve prepopulated an Amazon S3 bucket with a mapper script that you can customize for use with your Twitter search term.
- Download the mapper file from https://s3.amazonaws.com/awsdocs/gettingstarted/latest/sentiment/sentiment.py.
- Use a text editor of your choice to edit the following line in the file:
subj1 = "term1"
Replace term1 with the search term you used in Step 3: Collect and Store the Sentiment Data. Example:
subj1 = "kindle"
Make sure you don’t change any of the spacing in the file. Incorrect indentation will cause the Hadoop streaming program to fail.
Save the edited file. You may also want to review the file generally, to get a sense of how mappers can work.
In your own mappers, you’ll probably want to fully automate the configuration. The manual editing in this tutorial is for purposes of illustration only. For more details about creating Amazon EMR work steps and bootstrap actions, go to Create Bootstrap Actions to Install Additional Software and Steps in the Amazon Elastic MapReduce Developer Guide.
- Go to the Amazon S3 console at https://console.aws.amazon.com/s3/ and locate the mapper folder you created in Step 2: Create an Amazon S3 Bucket for the Amazon EMR Files.
- Click Upload and follow the on-screen instructions to upload your customized mapper file.
- Make the mapper file public: select it, then select Actions and then Make Public.
Step 5: Create an Amazon EMR Cluster
This tutorial reflects changes made to the Amazon EMR console in November 2013. If your console screens do not match the images in this guide, switch to the new version by clicking the link that appears at the top of the console.
Amazon EMR allows you to configure a cluster with software, bootstrap actions, and work steps. For this tutorial, you’ll run a Hadoop streaming program. When you configure a cluster with a Hadoop streaming program in Amazon EMR, you specify a mapper and a reducer, as well as any supporting files. The following list provides a summary of the files you’ll use for this tutorial.
- For the mapper, you’ll use the file you customized in the preceding step.
- For the reducer, you'll use the predefined Hadoop package aggregate. For more information about the aggregate package, go to the Hadoop documentation.
- Sentiment analysis usually involves some form of natural language processing. For this tutorial, you’ll use the Natural Language Toolkit (NLTK), a popular Python platform. You’ll use an Amazon EMR bootstrap action to install the NLTK Python module. Bootstrap actions load custom software onto the instances that Amazon EMR provisions and configures. For more information, go to Create Bootstrap Actions in the Amazon Elastic MapReduce Developer Guide.
- Along with the NLTK module, you’ll use a natural language classifier file that we’ve provided in an Amazon S3 bucket.
- For the job’s input data and output files, you’ll use the Amazon S3 bucket you created (which now contains the tweets you collected).
Note that the files used in this tutorial are for illustration purposes only. When you perform your own sentiment analysis, you’ll need to write your own mapper and build a sentiment model that meets your needs. For more information about building a sentiment model, go to Learning to Classify Text in Natural Language Processing with Python, which is provided for free on the NLTK site.
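The aggregate reducer mentioned in the list above simply sums the values the mapper emits for each key. The real reducer runs inside Hadoop; this pure-Python illustration just shows the arithmetic for LongValueSum-prefixed mapper output.

```python
from collections import defaultdict

def long_value_sum(mapper_output):
    """Sum values per key, mimicking what Hadoop's aggregate package
    does for mapper lines prefixed with LongValueSum:."""
    totals = defaultdict(int)
    for line in mapper_output:
        key, value = line.rstrip("\n").split("\t")
        if key.startswith("LongValueSum:"):
            key = key[len("LongValueSum:"):]
        totals[key] += int(value)
    return dict(totals)
```

This is why the streaming step needs no custom reducer file: counting per-key totals is all the sentiment job requires.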
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Click Create cluster.
- In the Cluster Configuration section, type a Cluster name or use the default value of My cluster. Set Termination protection to No and clear the Logging enabled check box.
In a production environment, logging and debugging can be useful tools for analyzing errors or inefficiencies in Amazon EMR steps and applications. For more information on how to use logging and debugging in Amazon EMR, go to Troubleshooting in the Amazon Elastic MapReduce Developer Guide.
- In the Software Configuration section, leave the default Hadoop distribution setting: Amazon and latest AMI version. Under Applications to be installed, click each X to remove Hive and Pig from the list.
- In the Hardware Configuration section, leave the default settings. The default instance types, an m1.small master node and two m1.small core nodes, will help keep the cost of this tutorial low.
When you analyze data in a real application, you may want to increase the size or number of these nodes to improve processing power and optimize computational time. You may also want to use spot instances to further reduce your Amazon EC2 costs. For more information about spot instances, go to Lowering Costs with Spot Instances in the Amazon Elastic MapReduce Developer Guide.
- In the Security and Access section, select the EC2 key pair you created earlier. Leave the default IAM settings.
- In the Bootstrap Actions section, in the Add bootstrap action list, select Custom action. You’ll add a custom action that installs and configures the Natural Language Toolkit on the cluster.
- In the Add Bootstrap Action popup, enter a Name for the action or leave it set to Custom action. In the Amazon S3 Location box, type s3://awsdocs/gettingstarted/latest/sentiment/config-nltk.sh (or copy and paste the URL from this page), and then click Add. (You can also download and review the shell script, if you'd like.)
The Bootstrap Actions section should now show the custom action you added.
- In the Steps section, you’ll define the Hadoop streaming job. In the Add step list, select Streaming program, then click Configure and add.
- In the Add Step popup, configure the job as follows, replacing your-bucket with the name of the Amazon S3 bucket you created earlier:
Name: Sentiment analysis
Mapper: s3://your-bucket/mapper/sentiment.py
Reducer: aggregate
Input S3 location: s3://your-bucket/input
Output S3 location: s3://your-bucket/output (make sure this folder does not yet exist)
Arguments: -cacheFile s3://awsdocs/gettingstarted/latest/sentiment/classifier.p#classifier.p
Action on failure: Continue
Click Add. The Steps section should now show the parameters for the streaming program.
- Below the step parameters, set Auto-terminate to Yes.
- Review the cluster settings. If everything looks correct, click Create cluster.
A summary of your new cluster will appear, with the status Starting. It will take a few minutes for Amazon EMR to provision the Amazon EC2 instances for your cluster.
Step 6: Examine the Sentiment Analysis Output
When your cluster’s status in the Amazon EMR console is Waiting: Waiting after step completed, you can examine the results.
- Go to the Amazon S3 console at https://console.aws.amazon.com/s3/home and locate the bucket you created in Step 2: Create an Amazon S3 Bucket for the Amazon EMR Files. You should see a new output folder in your bucket. You may need to click the refresh arrow in the top right corner to see the new folder.
- The job output is split into several files: an empty status file (named _SUCCESS by Hadoop) and a set of part-xxxxx files. The part-xxxxx files contain the sentiment measurements generated by the Hadoop streaming program.
- To download an output file, select it in the list, then click Actions and select Download. Right-click the link in the pop-up window to download the file.
Repeat this step for each output file.
- Open the files in a text editor. You’ll see the total number of positive and negative tweets for your search term, as well as the total number of tweets that did not match any of the positive or negative terms in the classifier (usually because the subject term was in a different field, rather than in the actual text of the tweet).
kindle: negative 13
kindle: positive 479
No match: 8
In this example, the sentiment is overwhelmingly positive. In most cases, the positive and negative totals will be closer together. For your own sentiment analysis work, you’ll want to collect and compare data over several time periods, possibly using several different search terms, to get as accurate a measurement as possible.
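Once the part files are downloaded, totals like the ones above are easy to combine programmatically. The parsing below assumes the line format shown in the sample output (a label followed by a count as the final field); the function names are illustrative.

```python
def summarize(lines):
    """Parse output lines like 'kindle: positive 479' into a dict of counts."""
    counts = {}
    for line in lines:
        label, _, count = line.strip().rpartition(" ")
        counts[label.rstrip(":")] = int(count)
    return counts

def positive_share(counts):
    """Fraction of classified tweets that were positive."""
    pos = sum(v for k, v in counts.items() if k.endswith("positive"))
    neg = sum(v for k, v in counts.items() if k.endswith("negative"))
    return pos / float(pos + neg) if pos + neg else 0.0
```

For the sample output above, the positive share is 479 out of 492 classified tweets, confirming the overwhelmingly positive reading.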
Step 7: Clean Up
To prevent your account from accruing additional charges, you should terminate the resources you used in this tutorial.
- Go to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation.
- In the AWS CloudFormation Stacks section, select your sentiment stack.
- Either click the Delete Stack button, or right-click your selected stack and click Delete Stack.
- Click Yes, Delete in the confirmation dialog that appears.
After stack deletion has begun, you can’t cancel the process. The stack will proceed to the state DELETE_IN_PROGRESS. After the stack has been deleted, it will have the state DELETE_COMPLETE.
Because you ran a Hadoop streaming program and set it to auto-terminate after running the steps in the program, the cluster should have been automatically terminated when processing was complete.
- If you are not already viewing the cluster list, click Cluster List in the Elastic MapReduce menu at the top of the Amazon Elastic MapReduce console.
- In the cluster list, make sure the Status of your cluster is Terminated.
- If the status of your cluster is anything other than Terminated, terminate it manually:
- In the cluster list, select the box to the left of the cluster name, and then click Terminate. In the confirmation pop-up that appears, click Terminate.
The next step is optional. It deletes the key pair you created earlier. You are not charged for key pairs. If you are planning to explore Amazon EMR further or complete the other tutorial in this guide, you should retain the key pair.
- In the Amazon EC2 console navigation pane, select Key Pairs.
- In the content pane, select the key pair you created, then click Delete.
The next step is optional. It deletes two security groups created for you by Amazon EMR when you launched the cluster. You are not charged for security groups. If you are planning to explore Amazon EMR further, you should retain them.
- In the Amazon EC2 console navigation pane, click Security Groups.
- In the content pane, click the ElasticMapReduce-slave security group.
- In the details pane for the ElasticMapReduce-slave security group, click the Inbound tab. Delete all rules that reference ElasticMapReduce. Click Apply Rule Changes.
- In the content pane, click ElasticMapReduce-slave, and then click Delete. Click Yes, Delete to confirm. (This group must be deleted before you can delete the ElasticMapReduce-master group.)
- In the content pane, click ElasticMapReduce-master, and then click Delete. Click Yes, Delete to confirm.
You’ve completed the sentiment analysis tutorial. Be sure to review the other topics in this guide for more information about Amazon Elastic MapReduce.