Tag Archives: Amazon EC2

private cloud:Eucalyptus


What is Eucalyptus

In the previous article, Learn about cloud computing we introduced the basics of cloud computing and we discussed how clouds are classified according to service offerings or “styles” (i.e., IaaS, PaaS, SaaS) and “types” (i.e., public, private and hybrid). In this article we introduce the Eucalyptus cloud computing platform.

Eucalyptus enables the creation of on-premise private clouds, with no requirements for retooling the organization’s existing IT infrastructure or need to introduce specialized hardware. Eucalyptus implements an IaaS (Infrastructure as a Service) private cloud that is accessible via an API compatible with Amazon EC2 and Amazon S3. For more information on our API see our Developer’s Corner. This compatibility allows any Eucalyptus cloud to be turned into a hybrid cloud, capable of drawing compute resources from public cloud. And Eucalyptus is compatible with a wealth of tools and applications that also adhere to the de facto EC2 and S3 standards.

Here are some of the characteristics that make Eucalyptus the most widely deployed cloud platform for the private (on-premise) cloud:

Open Source
Eucalyptus is open source: if you want to modify it, contribute to it, assess its security or just learn from it you can download it and have the source code at your fingertips. The Eucalyptus development process is in the open, as are bug reports, community contributions and security advisories.
Eucalyptus’ design is modular. The Eucalyptus components have well-defined interfaces (via WSDL, since they are web services) and thus can be easily swapped out for custom components.
Eucalyptus allows its components to be installed strategically close to the needed/used resources. For example Walrus can be installed close to the storage, while the Cluster Controller can be installed close to the cluster it will manage.
Designed to Perform
Eucalyptus was designed from the ground up to be scalable and to achieve optimal performance in diverse environments (designed to overlay an existing infrastructure).
Eucalyptus is flexible and can be installed on a very minimal setup. Yet it can be installed on thousands of cores and terabytes of storage. And it can do so as an overlay on top of an existing infrastructure.
Eucalyptus is compatible with the most popular and widely used Cloud API currently available: Amazon EC2 and S3. Eucalyptus’ design allows for any other API to be implemented, but to date no other real Cloud API contender is as complete and as requested as Amazon’s.
Hypervisor Agnostic
Eucalyptus is designed to easily support most available and future hypervisors. Currently Eucalyptus fully supports KVM and Xen. Additionally, the Enterprise Edition supports the proprietary VMware hypervisor.
Hybrid Cloud
All of the above characteristics makes Eucalyptus easy to deploy as an hybrid cloud. An hybrid cloud combines resources drawn from multiple clouds, typically one private and one public. Eucalyptus compatibility with Amazon’s EC2 API allows for a natural hybrid cloud with the biggest public cloud available.

History of Eucalyptus

Eucalyptus was born as a University project in the MAYHEM labs of the Computer Science Dept. at UC Santa Barbara. The MAYHEM team’s experience in Grid Computing, HPC, and massively scalable systems (Rich Wolski’s team of NWS and EveryWare fame) made it the natural place for the birth of Eucalyptus.

The name EUCALYPTUS is an acronym and stands for Elastic Utility ComputingArchitecture for Linking Your Programs TUseful Systems. A brief description of this period can be read here. This is of course no coincidence as UCSB is a leading University in Cloud Computing research.

In 2009, the Eucalyptus team started a company (Eucalyptus Systems Inc.) to commercialize Eucalyptus. Currently there is Eucalyptus, the open source project, and Eucalyptus EE (Enterprise Edition), which is the commercial version of Eucalyptus.

Next Steps

Our talented development team has made installing Eucalyptus easy, but some planning is still required (e.g., Where is the storage for Walrus/S3? What about the storage and network for EBS? How many public IPs can Eucalyptus use? Is there a private network for the Node Controllers? How big will my cloud grow?). These questions are a normal part of the planning process that IT goes through in order to appropriately size the infrastructure for a deployment. Before embarking on a full-scale deployment we suggest familiarizing yourself with Eucalyptus by:

  • TestDrive our community cloud (ECC);
  • Reserve few machines, and follow our documentation in particular the administrator’s guide, to install Eucalyptus on your favorite distribution.

Our forum are always available if you need additional help. Or explore other ways toparticipate in our community.

[repost]Quickly Launch A Cassandra Cluster On Amazon EC2

Today's Workplace

If you have read my previous post, “Map-Reduce With Ruby Using Hadoop“, then you will know that firing up a Hadoop cluster is really simple when you use Whirr. Without even ssh’ing on the machines in the cloud you can start-up your cluster and interact with it. In this post I’ll show you that it is just as easy to fire up a Cassandra cluster on Amazon EC2.

Install Whirr

I will fly through the setup of Whirr quite quickly. All the commands you need are here, but if you want a more thorough explanation then see my other post, “Map-Reduce With Ruby Using Hadoop“.

I am assuming that you have Homebrew installed.

sudo brew update
sudo brew install maven
mkdir ~/src/cloudera
cd ~/src/cloudera
wget http://archive.cloudera.com/cdh/3/whirr-0.1.0+23.tar.gz
tar -xvzf whirr-0.1.0+23.tar.gz
cd whirr-0.1.0+23
mvn clean install
mvn package -Ppackage

Be patient with the above. There is a lot to install, so it will take some time. Maven installs a lot of dependencies if it is your first time using it.

The good news is that from here on you are setup to easily fire-up your Amazon EC2 cluster for Cassandra, or Hadoop if you choose.

Whirr Configuration File

We will need to make a configuration file for Whirr to tell it that we want to launch a Cassandra cluster with 3 nodes. If you are brave, patient and have the cash, then you could just as easily fire-up a 100 node cluster (leave a comment if you do – there may be prizes!).

You will need to create a cassandra.properties file with the following contents…

whirr.instance-templates=3 cassandra

Replace the obvious fields with your Amazon EC2 Access Key ID and Amazon EC2 Secret Access Key.

Launch Your Cluster

Now you are ready to fire-up your Cassandra cluster. Simply use the following command and then be prepared to wait 5-10 minutes while Amazon builds your machines. This time is variable. Sometimes Amazon is quick, sometimes not so quick.

bin/whirr launch-cluster --config cassandra.properties

Launching mycassandracluster cluster
Configuring template
Starting 3 node(s)
Nodes started: [[id=us-east-1/i-13f25e7f, providerId=i-13f25e7f, tag=mycassandracluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-74f0061d, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=true, description=amazon/amzn-ami-2010.11.1-beta.x86_64-ebs], userMetadata={}, state=RUNNING, privateAddresses=[], publicAddresses=[], hardware=[id=t1.micro, providerId=t1.micro, name=t1.micro, processors=[[cores=1.0, speed=1.0]], ram=630, volumes=[[id=vol-1657d47e, type=SAN, size=null, device=/dev/sda1, durable=true, isBootDevice=true]], supportsImage=hasRootDeviceType(ebs)]], [id=us-east-1/i-17f25e7b, providerId=i-17f25e7b, tag=mycassandracluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-74f0061d, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=true, description=amazon/amzn-ami-2010.11.1-beta.x86_64-ebs], userMetadata={}, state=RUNNING, privateAddresses=[], publicAddresses=[], hardware=[id=t1.micro, providerId=t1.micro, name=t1.micro, processors=[[cores=1.0, speed=1.0]], ram=630, volumes=[[id=vol-1457d47c, type=SAN, size=null, device=/dev/sda1, durable=true, isBootDevice=true]], supportsImage=hasRootDeviceType(ebs)]], [id=us-east-1/i-11f25e7d, providerId=i-11f25e7d, tag=mycassandracluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-74f0061d, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=true, description=amazon/amzn-ami-2010.11.1-beta.x86_64-ebs], userMetadata={}, state=RUNNING, privateAddresses=[], publicAddresses=[], hardware=[id=t1.micro, providerId=t1.micro, name=t1.micro, processors=[[cores=1.0, speed=1.0]], ram=630, volumes=[[id=vol-e857d480, type=SAN, size=null, device=/dev/sda1, durable=true, isBootDevice=true]], supportsImage=hasRootDeviceType(ebs)]]]
Authorizing firewall
Running configuration script
Completed launch of mycassandracluster
Started cluster of 3 instances
Cluster{instances=[Instance{roles=[cassandra], publicAddress=/, privateAddress=/}, Instance{roles=[cassandra], publicAddress=/, privateAddress=/}, Instance{roles=[cassandra], publicAddress=/, privateAddress=/}], configuration={}}

You now have your very own Cassandra cluster running in the cloud. Not so hard, hey!

Connect From Ruby

I will be following this post with step-by-step guide on how you can interact with your new cluster from your Ruby On Rails application. I recommend subscribing to the RSS feed to get updates to the blog.

Shutdown The Cluster

Here is how you can shutdown your cluster.

bin/whirr destroy-cluster --config cassandra.properties

Destroying mycassandracluster cluster
Cluster mycassandracluster destroyed


Whirr makes it very easy to start and stop a Cassandra cluster in the cloud without leaving the comfort of your laptop. What you do with that cluster is up to you, but I will be give you some ideas of what you could do in future posts. Stay tuned!


[repost]Quora’s Technology Examined

Magnifying the Circuit Board

Quora has taken the tech and entrepreneurial world by storm, providing a system that works so fluidly that it is sometimes hard to see what the big fuss is all about. This slick tool is powered, not only by an intelligent crowd of askers and answerers, but by a well-crafted backend created by co-founders who honed their skills at Facebook.

It is not surprising that, with all the smart people using this smart tool, there are many pondering on how it works so well. The NoSQL boffins scratch their heads and ponder such questions as, “Why does Quora use MySQL as the data store rather than NoSQLs such as Cassandra, MongoDB, CouchDB, etc?“.

In this blog post I will delve into the snippets of information available on Quora and look at Quora from a technical perspective. What technical decisions have they made? What does their architecture look like? What languages and frameworks do they use? How do they make that search bar respond so quickly?

Components Of Quora

The general components that make up Quora are…

  • You can ask questions
  • You can answer questions (anonymously if you desire)
  • You can comment on answered questions
  • You can vote-up or vote-down answers to questions
  • Questions can be assigned to topics
  • You can write a post (a informative statement, rather like a orphaned answer or blog post)
  • You can follow questions, topics or other users
  • A super-fast auto-complete search-box at the top, which doubles as the method for entering new questions

The last point, the super-fast auto-complete search-box, is one of the defining features of Quora. You can see immediately, as you begin to enter a question, whether somebody else has already asked that question or if there is a topic or post on the subject. Let’s start there…

What’s Cooking Under That Hood?

Only the questions, topic labels, user names or post titles are indexed and served up to the search-box. There is no full-text search, so searching the content of questions and answers will not work. The text that is indexed is tokenized so that words in a different order will still be matched. Prefix matching enables best matches to be shown before the entire word is entered. For instance, typing “mi” might immediately show “Microsoft” in the results.

There is some simple stemming of words, since “nears” matches “near”, but “pony” does not match “ponies”. “Topic-aliases” allow for similar matches on topic names, such as “startup” and “start-up”. These topic-aliases have been manually entered by users. Otherwise these would not match.

If a duplicate question is redirected to another question (a feature of Quora), then that original question will still appear in the search results, since it increases the chances of a match. There is no n-gram indexing, so slight mis-spellings will not match. For instance, “gooogle” (with an extra “o”) finds nothing.

Previously, they did use an open source search server, called Sphinx. It supports the features they are using above, but they have since moved from this due to real-time constraints. Their new solution is built in-house and allows them better prefix indexing and control over the matching algorithms. They built this in Python.

What libraries does Quora use for search?
Adam D’Angelo, Quora Founder (Nov 13, 2010)
Our search is custom-written. It doesn’t use any libraries aside from Thrift, and Python’s unicode library, which we use for unicode normalization.

Speedy Queries

Did I mention that the search-box is fast? I did some tests and found the responses to be impressive. Queries are sent over AJAX as a GET request. Responses come back as JSON with the rendered HTML embedded inside the JSON. Rendering of the results on the server-side, as opposed to rendering them in JavaScript, seems to be due to the need to highlight matching words in the text. This is sometimes too complex for JavaScript. For instance, typing “categories” might highlight the world “category” in the result text.

I was seeing responses of roughly 50 milliseconds per query from my Linode machine. Quora does not short-change you when sending requests. From within the browser, I found typing “Microsoft” (9 characters) would result in nine requests to the Quora search server, no matter how fast you type. As you will see later, the server is in control, so if it did become over-loaded, then it could update the results less frequently without changing the JavaScript.

Quora uses persistent connections. A HTTP connection is established with the server when you start typing the search query. This connection is kept open and further requests are made on this same open connection. The connection will terminate (times-out) if not used for 60 seconds. If a connection times-out then a new connection is established when typing begins.

To simulate the typing of a word into the search-box, I sent the following requests, character-by-character, across a persistent connection. For instance “butler” is six requests (“b”, “bu”, “but” … “butler”).

"butler" (6 chars) duration: 0.393 secs 0.065 secs per query
"butler monkeys" (14 chars) duration: 0.672 secs 0.048 secs per query
"fasdisajfosdffsa" (16 chars) duration: 0.746 secs 0.046 secs per query

That last query was used to test if there was a slow-down for a word that would obviously not be in a caching layer. I saw no slow-down. This means that they are not caching, caching is only used to take the load off the backend search engine or they are doing something smarter (e.g. if there is no match for “fasd” then there will be no match for “fasdi”, so abort).

Is Quora going to implement full-text search?
Adam D’Angelo, I made a lot of the early Quora … (Sep 1, 2010)
Yes, eventually. We haven’t implemented this yet because we’ve prioritized other things, but we will definitely do it in the future.

Webnode2 And LiveNode

Webnode2 and LiveNode are some of Quora’s internal systems, which were built for managing the content. Webnode2 generates HTML, CSS and JavaScript and is tightly coupled with LiveNode, which is responsible for managing the display of the content on the webpage. Charlie Cheever says that he were to start a similar project without LiveNode, then the first thing he would do is rebuild it.

They seem very pleased with the technology they have built and struggled to find its weaknesses. One weakness is that it is tricky for LiveNode to keep track of what is happening within the browser as it pushes changes from the server. If users A and B are viewing the same question then ones interactions will affect the other. For instance, if user A up-votes an answer then that answer will be promoted and will visibly move up the page. This display change will be pushed over AJAX to user B’s browser. Any prior browser-side change that user B made, such as expanding a comments section, might be lost.

LiveNode is written in Python, C++, and JavaScript. jQuery and Cython is also used.

While they would like to open-source LiveNode and have tried to keep code separation, doing so right now would be too much work and would take time away from their main goal, which is making Quora better.

Charlie Cheever points out that webnode2 is unrelated to the “free and easy website builder” called Webnode at webnode.com.

Steve Souders’ 14 rules are…

  • Make Fewer HTTP Requests
  • Use a Content Delivery Network
  • Add an Expires Header
  • Gzip Components
  • Put Stylesheets at the Top
  • Put Scripts at the Bottom
  • Avoid CSS Expressions
  • Make JavaScript and CSS External
  • Reduce DNS Lookups
  • Minify JavaScript
  • Avoid Redirects
  • Remove Duplicate Scripts
  • Configure ETags
  • Make AJAX Cacheable


Quora is a great example of a modern tech start-up. They are very small team who understand the technologies they are using very well. They have made considered choices in the technology they have selected and have a good vision of which components would be better written from scratch. They seem keen to share these in-house technologies with the open-source community and I look forward to when they have the time to make this a reality. I intend keep following Quora and writing about them more in future blog posts.

If you found this post useful then please leave a comment or follow this blog.



[repost]Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services


Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services

Can you really create an infinitely scalable infrastructure for less than $100 using Amazon’s storage, grid, and queuing services platform? It appears so, at least for the right application. Amazon beams a spot light on the future battle of the roll-your-own versus the connect-the-dots approach to building next generation websites using core external services. Their argument is strong. Using Amazon’s platform you can quickly build an infrastructure that would otherwise take an eternity to make, a pile of money to create, and an unbounded mass of people to implement and maintain. Yet Amazon doesn’t provide SLAs, so you can you really trust them with your crown jewels? Facebook recently leap frogged Amazon’s vision with an even more comprehensive set of services. The battle for the future is on.

Site: http://aws.amazon.com/

Information Sources

  • Slides: Building Highly Scalable Web Applications
  • Podcast: Technometria: Amazon Web Services
  • Amazon Services Home.


  • Amazon ECS (E-Commerce Service)
  • Amazon S3 (simple storage service)
  • Amazon SQS (simple queuing service)
  • Amazon EC2 (grid service)
  • Amazon Web Search Service
  • Amazon Flexible Payments Service (Amazon FPS)
  • REST and SOAP Service Interfaces

    What’s Inside?

    Why use external services?

  • Amazon’s services replace the boxes, wires, and disk drives part of the application stack.
  • Amazon has spent ten years and over $1 billion developing a world-class web service that millions of customers use every day. Maybe you can leverage that experience for your site?
  • Focus on the customer. 70% if Web Development isn’t about providing customer value. It’s about building and managing data centers. Your efforts would be better spent on your customers and not plumbing.
  • Quicker to market. Scaling is hard. Let someone else worry
    about that while you concentrate on adding user value.
  • Designing for peak load is expensive. So turn fixed costs into variable costs. Say you want to handle high traffic flows from slashdot or digg, or you have high seasonal demand, having the infrastructure in place to handle those loads is a high fixed cost. You could use that money better elsewhere. It make sense to create an infrastructure where you can automatically and
    temporarily scale resources to handle peak demand.
  • High reliability and availability. A dedicated service may be more reliable than a service you could create. It say “may” because Amazon doesn’t provide an SLA, so you wont get any guarantees. The idea is that Amazon is cheap enough and reliable enough that the few failures will be acceptable. Besides, SLAs usually just refund some money when things go wrong, they
    don’t really guarantee anything.
  • It’s a cheap CDN. Amazon’s storage network could serve a relatively inexpensive content delivery network. This option is discussed in
    Reducing Your Website’s Bandwidth Usage
    . The idea is that just the frequent downloading of a simple favicon.ico
    file can use a significant portion of your bandwidth. Using S3 for $2/month to offload 90% of your bandwidth to an external host is a good deal. However, without an SLA S3 can’t be thought of as a proper CDN.

    Amazon ECS (E-Commerce Service)

  • This service exposes Amazon’s product data and e-commerce functionality:
    Detailed Product Information on all Amazon.com Products, Access to Product Images, All Customer Reviews associated with a Product, etc.
  • Amazon products are aggressively priced.
  • I found this service disappointing. If you want to build a store on top of Amazon it seems great, but I didn’t see a way to add your own products to the store, so I don’t think it’s generally useful.

    Amazon S3 (simple storage service)

  • This service stores data in Amazon’s storage network.
  • $.15 per GPB per month storage
  • $.01 for 1000 to 10000 requests.
  • $.10 – $.17 per GB data transfer.
  • The service is: fast, relaible, scalable, redundant, dispersed.
  • You can have per object URLs. This means you can reference an image or other file directly with a URL, so it’s usable in a web page.
  • Typical use: CDN and backup storage.
  • Storage is distributed to multiple locations so you get a level of geographical distribution.

    Amazon SQS (simple queuing service)

  • This service provides an internet scale queuing service for storing messages. Distributed actors put work on the queue and take work off the queue.
  • $.10 per 1000 messages.
  • $.10 – $.18 per GB data transfer.
  • This service is: scalable, elastic, reliable, simple, secure.
  • Typical use: a centralized work queue. You put jobs on the queue and different actors can pop work of the queue and process them when they get CPU time.
  • Expected message latency, as of 2007, was 2-10 seconds. This is horrible for many applications, not bad for many others.
  • Part of scalability. Have any number of producers and consumers. You don’t worry about it.
  • Queues are spread across multiple machines and multiple data centers.

    Amazon EC2 (grid service)

  • This service provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
  • Basically you create a Xen image for your Linux distro and upload it into their “elastic compute cloud.” Using an API you can then start as many instances as you like.
  • Typical use: transcoding, audio work, load testing.
  • Root level access to the server and full control over the machine.
  • Can scale up and scale down on a minute-by-minute basis.
  • For real-time processing one criticism has been slow CPUs (1.75 Ghz Xeon). This probably won’t be a problem if your application is written to linearly scale.
  • An EC2 instance is not persistent so you can’t store a database there. You have some local storage, but it goes away when the instance goes away.
  • Takes a few minutes to start and stop images, so it’s not really on demand.
  • You can add anything you want to an image. If you want a database you can add it in.

    GigaVox Media Example Web-Scale Architecture

  • You can start to see how Amazon’s services can work together. Let’s say you have a large batch of MP2s you would like transcode to MP3s. You would store the original media into S3, queue the work request into SQS, and have instances running in EC2 to take work of the queue and perform the transcoding, storing the results back into S3. And this is exactly what GigaVox does.
  • GigaVox is a podcasting company.
    – They take original recordings and transcode them say from MP2 to MP3. Many other transcodings are also performed.
    – Then these chunks of media are assembled together into a delivery format based on building a show. For example, old podcasts can be reassembled each night with up to date advertisements.
    – To do this at scale would take a lot of costly resources.
  • Using Amazon’s services GigaVox gets geographically redundancy and failover for relatively inexpensive CPU, bandwidth, and storage charges,
    and bandwidth costs. You have no boxes or wires. No data center to manage. And you can grow with small fixed costs.
  • Messages are time stamped on the queue. If the message waited in the queue for too long then they can start more EC2 images. You can balance costs. You could also layer in a customer based priority mechanism.
  • They have each instance have its own messaging queue for command and control.
  • For security reasons they upload files through ftp to instances rather than going through S3.
  • All bandwidth withing the Amazon cloud is free. This is an important business consideration for making the services work together.
  • Another set of instances and queues handles assembling the delivered media.
  • Allows GigaVox to deliver value to their customer at a low startup cost.

    Lessons Learned

  • Build or buy is always a difficult decision. If a service doesn’t work then you may lose your customers and there’s nothing more you can do other send yet another urgent email to nobody in particular. This is a horrible feeling. Yet, if it does work you could be way ahead of the game. How to choose? That would be telling :-)
  • Build a layer of virtualization so you can switch to another provider when they become available or so you can replace it with your own service. This lessens your dependency on Amazon in the event they get tired of offering services or their performance deteriorates.
  • As a startup using Amazon services isn’t a big risk because you are already in a risky situation. And any risk is moderated by the very low cost of starting up and money is always an issue for startups.
  • For many use cases buying your own dedicated servers may still be a better approach as you get more control, lower latency, and the same hardware is usable for multiple purposes.
  • Software as a service is a powerful and practical idea. It changes how you build software. It forces you to layer your software around interfaces. And once your software is composed of interfaces you have loosely coupled components that can be easily replaced. You also have the basis for a platform API should you ever want to provide an API you your customers. The highest level of development would to use the same API you give your customers to build your service.
  • Loosely coupled, message based architectures combined with service interfaces allow you to think several levels up the abstraction layer. You don’t have to wallow in the muck, which frees you how to structure your application using large scale blocks of behavior.
  • Designing a UI for an asynchronous interactive interface poses some challenges. It may take a while to perform an operation, so how do you interact with the user to handle that?
  • Instinctively I doubted Amazon could deliver. But if you have the right type of problem, you really can do a lot of work cheaply using Amazon services.

    See Also

  • Flickr and YouTube also deal with service level APIs.