Tag Archives: administration

[repost] HBase Administration, Performance Tuning

original:http://www.packtpub.com/article/hbase-basic-performance-tuning

Performance is one of the most interesting characteristics of an HBase cluster’s behavior. Tuning it is a challenging task for administrators, because it requires a deep understanding not only of HBase but also of Hadoop, Java Virtual Machine garbage collection (JVM GC), and the important tuning parameters of the operating system.

The structure of a typical HBase cluster is shown in the following diagram:

There are several components in the cluster: the ZooKeeper cluster, the HBase master node, region servers, the Hadoop Distributed File System (HDFS), and the HBase client.

The ZooKeeper cluster acts as a coordination service for the entire HBase cluster, handling master selection, root region server lookup, node registration, and so on. The master node does not do heavy tasks; its job includes region allocation and failover, log splitting, and load balancing. Region servers hold the actual regions; they handle I/O requests to the hosted regions, flush the in-memory data store (MemStore) to HDFS, and split and compact regions. HDFS is where HBase stores its data files (StoreFiles) and write-ahead logs (WALs). We usually run an HBase region server on the same machine as an HDFS DataNode, but it is not mandatory.

The HBase client provides APIs to access the HBase cluster. To communicate with the cluster, clients need to find the region server holding a specific row key range; this is called a region lookup. HBase has two system tables to support region lookups: the -ROOT- table and the .META. table.

The -ROOT- table is used to refer to regions in the .META. table, while the .META. table holds references to all user regions. First, the clients query ZooKeeper to find the -ROOT- table’s location (the region server where it is deployed); they then query the -ROOT- table, and subsequently the .META. table, to find the region server holding a specific region. Clients also cache region locations to avoid querying ZooKeeper and the -ROOT- and .META. tables every time. With this background knowledge, we will describe in this article how to tune HBase for better performance.
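The lookup-and-cache flow can be sketched as a toy model in Python. This is only an illustration of the three-step lookup chain and the client-side cache; the class and method names are invented and have nothing to do with the real HBase client API.

```python
# Toy model of HBase client-side region lookups (illustrative only).
class ToyHBaseClient:
    def __init__(self, zookeeper, root_table, meta_table):
        self.zk = zookeeper            # {"-ROOT-": server}
        self.root = root_table         # {".META.": server}
        self.meta = meta_table         # {row key: region server}
        self.cache = {}                # row key -> region server
        self.lookups = 0               # count of remote round trips

    def locate(self, row_key):
        """Find the region server for row_key, caching the result."""
        if row_key in self.cache:
            return self.cache[row_key]                 # cache hit: no round trips
        self.lookups += 1; _ = self.zk["-ROOT-"]       # 1. ask ZooKeeper for -ROOT-
        self.lookups += 1; _ = self.root[".META."]     # 2. ask -ROOT- for .META.
        self.lookups += 1; server = self.meta[row_key] # 3. ask .META. for the user region
        self.cache[row_key] = server
        return server

client = ToyHBaseClient({"-ROOT-": "server-a"}, {".META.": "server-b"},
                        {"row-0001": "server-c"})
assert client.locate("row-0001") == "server-c"   # first call: 3 round trips
assert client.lookups == 3
assert client.locate("row-0001") == "server-c"   # second call: served from cache
assert client.lookups == 3
```

The cache is why a warmed-up client only pays the three-hop lookup cost once per region, not once per request.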

Besides HBase itself, other tuning points include the Hadoop configuration, the JVM garbage collection settings, and the OS kernel parameters. These are as important as tuning HBase itself, and we will include recipes for tuning these configurations as well.

In this article, by Yifeng Jiang, author of HBase Administration Cookbook, we will cover:

  • Setting up Hadoop to spread disk I/O
  • Using a network topology script to make the Hadoop rack-aware
  • Mounting disks with noatime and nodiratime
  • Setting vm.swappiness to 0 to avoid swap
  • Java GC and HBase heap settings
  • Using compression
  • Managing compactions
  • Managing a region split
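Several of the OS- and Hadoop-level recipes above (spreading disk I/O, noatime/nodiratime mounts, vm.swappiness) boil down to small configuration fragments like the following. Device names and mount points here are examples only, not recommendations:

```
# /etc/fstab: mount HBase/HDFS data disks with noatime and nodiratime
/dev/sdb1  /mnt/d0  ext4  defaults,noatime,nodiratime  0 0
/dev/sdc1  /mnt/d1  ext4  defaults,noatime,nodiratime  0 0

# /etc/sysctl.conf: keep the JVM heap out of swap
vm.swappiness = 0
```

Spreading disk I/O is done by listing every data disk in the dfs.datanode.data.dir property in hdfs-site.xml (for example, /mnt/d0/dfs/data,/mnt/d1/dfs/data), so the DataNode spreads new blocks across the disks.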

[repost] IBM LanguageWare Resource Workbench

original:https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=6adead21-9991-44f6-bdbb-baf0d2e8a673#09

Overview

An Eclipse application for building custom language analysis into IBM LanguageWare resources and their associated UIMA annotators.

 

IBM LanguageWare Resource Workbench

Update, July 20, 2012:
Studio 3.0 is out, and it is officially bundled with ICA 3.0. If you are a Studio 3.0 user, please use the ICA forum instead of the LRW forum. LRW 7.2.0.2 is a fix pack that resolves issues in various areas, including the Parsing Rules editor, PEAR file export, and Japanese/Chinese language support.
LRW 7.2.0.1 is still available for download on the Downloads link for IBM OmniFind Enterprise Edition V9.1 Fix Pack users.

What is IBM LanguageWare?

IBM LanguageWare is a technology which provides a full range of text analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions which focus on mining facts from large repositories of text. LanguageWare is the ideal solution for extracting the value locked up in unstructured text information and exposing it to business applications. With the emerging importance of Business Intelligence and the explosion in text-based information, the need to exploit this “hidden” information has never been so great. LanguageWare technology not only provides the functionality to address this need, it also makes it easier than ever to create, manage and deploy analysis engines and their resources.

It comprises Java libraries with a large set of features and the linguistic resources that supplement them. It also comprises an easy-to-use Eclipse-based development environment for building custom text analysis applications. In a few clicks, it is possible to create and deploy UIMA (Unstructured Information Management Architecture) annotators that perform everything from simple dictionary lookups to more sophisticated syntactic and semantic analysis of texts using dictionaries, rules and ontologies.

The LanguageWare libraries provide the following non-exhaustive list of features: dictionary look-up and fuzzy look-up, lexical analysis, language identification, spelling correction, hyphenation, normalization, part-of-speech disambiguation, syntactic parsing, semantic analysis, facts/entities extraction and relationship extraction. For more details see the documentation.

The LanguageWare Resource Workbench provides a complete development environment for building and customizing dictionaries, rules, ontologies, and associated UIMA annotators. This environment removes the need for specialist knowledge of the underlying technologies of natural language processing or UIMA. In doing so, it allows the user to focus on the concepts and relationships of interest, and to develop analyzers which extract them from text without having to write any code. The resulting application code is wrapped as UIMA annotators, which can be seamlessly plugged into any application that is UIMA-compliant. Further information about UIMA is available on the Apache UIMA site.

LanguageWare is used in various products, such as Lotus Notes and Domino, Information Integrator OmniFind Edition (IBM’s search technology), and more.

The LanguageWare Resource Workbench technology runs on Microsoft Windows and Linux. The core LanguageWare libraries support a much broader list of platforms. For more details on platform support please see the product documentation.

How does it work?

The LanguageWare Resource Workbench allows users to easily:

  • Develop rules to spot facts, entities and relationships using a simple drag and drop paradigm
  • Build language and domain resources into a LanguageWare dictionary or ontology
  • Import and export dictionary data to/from a database
  • Browse the dictionaries to assess their content and quality
  • Test rules and dictionaries in real-time on documents
  • Create UIMA annotators for annotating text with the contents of dictionaries and rules
  • Annotate text and browse the contents of each annotation.

The Workbench contains the following tools:

  • A dictionary viewer/editor
  • An XML-based dictionary builder
  • A Database-based dictionary builder (IBM DB2 and Apache Derby support are provided)
  • A dictionary comparison tool
  • A rule viewer/editor/builder
  • A UIMA annotator generator, which allows text documents to be annotated and the results displayed.
  • A UIMA CAS (common annotation structure) comparator, which allows you to compare the results of two different analyses through comparing the CASes generated by each run.

The LanguageWare Resource Workbench documentation is available online and is also installed using the Microsoft Windows or Linux installers or using the respective .zip files.

What type of application is LanguageWare suitable for?

LanguageWare technology can be used in any application that makes use of text analytics. Good examples are:

  • Business Intelligence
  • Information Search and Retrieval
  • The Semantic Web (in particular LanguageWare supports semantic analysis of documents based on ontologies)
  • Analysis of Social Networks
  • Semantic tagging applications
  • Semantic search applications
  • Any application wishing to extract useful data from unstructured text

For Web-based semantic queries over LanguageWare text analytics, you might be interested in checking out IBM Data Discovery and Query Builder. When used together, these two technologies can provide a full range of data access services, including UI presentation, security and auditing of users, structured and unstructured data access through semantic concepts, and deep text analytics of unstructured data elements.

More information

About the technology author(s)

LanguageWare is a worldwide organization comprising a highly qualified team of specialists with a diverse combination of backgrounds: linguists, computer scientists, mathematicians, cognitive scientists, physicists, and computational linguists. This team is responsible for developing innovative Natural Language Processing technology for IBM Software Group.

LanguageWare, along with LanguageWare Resource Workbench, is a collaborative project combining skills, technologies, and ideas gathered from various IBM product teams and IBM Research division.

Platform requirements

Operating systems: Microsoft Windows XP, Microsoft Windows Vista, or SUSE Linux Enterprise Desktop 11

Hardware: Intel 32-bit platforms (tested)

Software:

  • Java 5 SR5 or above (not compatible with Java 5 SR4 or below)
  • Apache UIMA SDK 2.3 (required by LanguageWare Annotators in order to run outside the Workbench)

Notes: Other platforms and JDK implementations may work but have not been significantly tested.

Installation instructions

Installing LanguageWare Resource Workbench and LanguageWare Demonstrator

On each platform, there are two methods of installation.
On Microsoft Windows:

  • Download the lrw.win32.install.exe or Demonstrator.exe and launch the installation.

On Linux:

  • Download the lrw.linux.install.bin or Demonstrator.bin and launch the installation (e.g., by running sh ./lrw.linux.install.bin).

LanguageWare Training and Enablement

The LanguageWare Training Material is a set of presentations that walk you through the use of the LanguageWare Resource Workbench. It uses a step-by-step approach, with examples and practice exercises, to build up your knowledge of the application and of data modeling. Please follow the logical succession of the material, and make sure you finish the sample exercise.
Please use this link to access the presentation decks.

Trademarks

  • Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
  • Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
  • Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
  • Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.

FAQs

1. What is the LanguageWare Resource Workbench? Why should I use it?

The LanguageWare Resource Workbench is a comprehensive Eclipse-based environment for developing UIMA analyzers. It allows you to build Domain Extraction Models (lexical resources and parsing rules; see FAQ 7) which describe the entities and relationships you wish to extract, creating custom annotators tailored to your specific needs. These annotators can be easily exported as PEAR files (see FAQ 9) and installed into any UIMA pipeline or deployed onto an ICA server. If you satisfy the following criteria, then you will want to use the LanguageWare Resource Workbench:

  • You need a robust open standards (UIMA) text analyzer that can be easily customized to your specific domain and analysis challenges.
  • You need a technology that will enable you to exploit your existing structured data repositories in the analysis of the unstructured sources.
  • You need a technology that allows you to build custom domain models that become your intellectual property and differentiation in the marketplace.
  • You need a technology that is multi-lingual, multi-platform, multi-domain, and high performance.


2. What’s new in this version of LanguageWare Resource Workbench?

All new features are outlined in the LanguageWare Resource Workbench Release Notes, ReleaseNotes.htm, which is located in the Workbench installation directory.


3. Where should I start with LanguageWare?

The best way to get started with LanguageWare is to install the LanguageWare Resource Workbench. Check the training videos provided above or the training material (to be posted soon); they will introduce you to the Workbench and show you how it works.


4. What documentation is available to help me use LanguageWare?

Context-sensitive help is provided as part of the LanguageWare Resource Workbench. There is an online help system shipped with the LanguageWare Resource Workbench (under Help / Help Contents). Check the training videos provided above or the training material (to be posted soon). More detailed information about the underlying APIs will be provided for fully-licensed users of the technology.


5. What are the known limitations with this release of the LanguageWare Resource Workbench?

Any problems or limitations are outlined in the LanguageWare Resource Workbench Release Notes, ReleaseNotes.htm, which is located in the LanguageWare Resource Workbench installation directory and is part of the LanguageWare Resource Workbench Help System.


6. What version of UIMA do I need to use the LanguageWare annotators?

LanguageWare Resource Workbench ships with, and has been tested against, Apache UIMA, Version 2.3. The annotators should work with newer versions of Apache UIMA; however, they have not been extensively tested for compatibility, so we recommend Apache UIMA v2.3. The LanguageWare annotators are not compatible with versions of UIMA prior to 2.1, which were released by IBM and have a namespace conflict with Apache UIMA.


7. What is a Domain Extraction Model (or “model,” “annotator,” “analyzer”)? How do I build a good Domain Extraction Model?

A “model” is the set of resources you build to describe what you want to extract from the data. The models are a combination of:

  • The morphological resources, which describe the basic language characteristics
  • The lexical resources, which describe the entities/concepts that you want to recognize
  • The POS tagger resource
  • The parsing rules, which describe how concepts combine to generate new entities and relationships.

The process of building data models is an iterative process within the LanguageWare Resource Workbench.


8. How do I change the default editor for new file types in the LanguageWare Resource Workbench?

Go to Window / Preferences / General / Editors / File Associations. If the content type is already listed, just add a new editor and pick the LanguageWare Text Editor. You can set this as the default, or alternatively leave it as an option that you can choose, on right-click, whenever you open a file of that type. You will need to restart the LanguageWare Resource Workbench before this comes into effect. Note: Eclipse remembers the last viewer you used for a file type, so if you opened a document with a different editor beforehand, you may need to right-click the file and explicitly choose the LanguageWare Resource Workbench Text Editor the first time after restarting.


9. How do I integrate the UIMA Analyzers that I develop in the LanguageWare Resource Workbench?

Once you have completed building your Domain Extraction Models (dictionaries and rules), the LanguageWare Resource Workbench provides an “Export as UIMA Pear” function under File / Export. This will generate a PEAR file that contains all the code and resources required to run your pipeline in any UIMA-enabled application, that is, in a UIMA pipeline.


10. How is my data stored?

The LanguageWare Resource Workbench is primarily designed to help you build your domain extraction models, and this includes databases in which you can store your data. The LanguageWare Resource Workbench ships with an embedded database (Derby, open source); however, it can also connect to an enterprise database, such as DB2.


11. What licensing conditions apply for LanguageWare, for academic purposes, or for commercial use?

LanguageWare is licensed through the IBM Content Analytics License at http://www-01.ibm.com/software/data/cognos/products/cognos-content-analytics/.


12. Is Language Identification identifying the wrong language?

Sometimes the default amount of text (1024 characters) used by Language Identification is not enough to disambiguate the correct language. This happens especially when languages are quite close, or when the analyzed text includes more than one language. In this case, it may help to increase the MaxCharsToExamine parameter. To do this, select from the LRW menu: Window > Preferences > LanguageWare > UIMA Annotation Display. Enable the checkbox for “Show edit advanced configuration option on pipeline stages.” Select “Apply” and “OK.” The next time you open a UIMA Pipeline Configuration file, you will notice an Advanced Configuration link at the Document Language stage. Click it to expand and display its contents; the MaxCharsToExamine parameter can now be edited. Change the default to a bigger threshold, save your changes, and try again to see if Language Identification has improved.


13. Why is the LanguageWare Resource Workbench shipped as an Eclipse-based application?

We built the LanguageWare Resource Workbench on Eclipse because it provides a collaborative framework through which we can share components with other product teams across IBM, with our partners, and with our customers. This version of the LanguageWare Resource Workbench is a complete, stand-alone application. However, users can still get the benefits of the Eclipse IDE by installing Eclipse features into the Workbench. Popular features include the Eclipse CVS feature for managing shared projects and the Eclipse XML feature for full XML editing support. See the Eclipse online help for more information about finding and installing new features. It is important to understand that while the LanguageWare Resource Workbench is Eclipse-based, the Annotators that are exported from the LanguageWare Resource Workbench (under File / Export) can be installed into any UIMA pipeline and can be deployed in a variety of ways. The LanguageWare Resource Workbench team, as part of the commercial LanguageWare Resource Workbench license, provides integration source code to simplify the overall deployment and integration effort. This includes UIMA serializers, CAS consumers, and APIs for integrating into through C/JNI, Eclipse, Web Services (REST), and others.


14. What languages are supported by the LanguageWare Resource Workbench?

The following languages are fully supported, i.e., with Language ID, lexical analysis, and part-of-speech disambiguation.

 

Language                Code
--------                ----
Arabic                  ar
Chinese (Simplified)    zh-CN
Chinese (Traditional)   zh-TW
Danish                  da
Dutch                   nl
English                 en
French                  fr
German                  de
Italian                 it
Japanese                ja
Portuguese              pt
Spanish                 es

For the following languages, a lexical dictionary without part-of-speech disambiguation can be made available upon request: Afrikaans, Catalan, Greek, Norwegian (Bokmål), Norwegian (Nynorsk), Russian, and Swedish. These dictionaries are provided “AS IS” (i.e., they have not been maintained and will not be supported; while feedback on them is much appreciated, requests for changes, fixes, or queries will only be addressed if adequately planned and sufficiently funded).

15. Does LanguageWare support GB18030?

LanguageWare annotators support UTF-16, and this qualifies as GB18030 support. This does mean that you need to translate the text from GB18030 to UTF-16 at the document ingestion stage. Java will do this automatically for you (in the collection reader stage) as long as the correct encoding is specified when reading files.

Please note that text in GB18030 extension B may contain characters outside the Unicode Basic Multilingual Plane. Currently the default LanguageWare break rules would incorrectly split such characters into two tokens. If support for these rare characters is required, the attached break rules file can be used to ensure the proper handling of 4-byte characters. (Note that the file zh-surrogates.dic for Chinese is wrapped in zh-surrogates.zip.)
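The encoding round trip is easy to check. In Python (used here for brevity; the same applies to Java’s charset converters), a GB18030 extension B character lands outside the Basic Multilingual Plane, takes four bytes in GB18030, and needs a surrogate pair in UTF-16:

```python
# U+20000 is a CJK Extension B ideograph: outside the Basic Multilingual Plane.
ch = "\U00020000"

gb = ch.encode("gb18030")          # 4-byte GB18030 extension B sequence
assert len(gb) == 4
assert gb.decode("gb18030") == ch  # lossless round trip

utf16 = ch.encode("utf-16-be")     # surrogate pair: two 16-bit code units
assert len(utf16) == 4
```

This is exactly the kind of character that would be split into two tokens by break rules that treat each 16-bit code unit as one character.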


16. Are you experiencing problems with the LanguageWare Resource Workbench UI on Linux platforms?

There is a known issue on Ubuntu with Eclipse and the version of the GTK+ toolkit that prevents toolbars from being drawn properly or buttons from working properly with mouse clicks. The fix is explained here:
http://git.gnome.org/browse/gtk+/commit/?id=a79f929dd6c89fceeaf0d9039e5a10cad9d87d2f.
As a workaround, set the environment variable GDK_NATIVE_WINDOWS=1 before starting the LanguageWare Resource Workbench. Another issue was reported on Ubuntu 9.10 (Karmic), with the LanguageWare Resource Workbench showing an empty dialog window when starting. The issue is explained here:
https://bugs.launchpad.net/bugs/429065.
As a workaround, add the line “-Dorg.eclipse.swt.browser.XULRunnerPath=/dev/null” to the lrw.ini file in the LanguageWare Resource Workbench installation folder.


17. Any other questions not covered here?

Please use the “Forum” to post your questions and we will get back to you.


18. How do I upgrade my version of the LRW?

On Windows and Linux, each version of the LRW is a separate application. You should not install a new version over a previous one; instead, make sure to either uninstall your previous version or install each version in a separate location. If in doubt, the default settings of the LRW installers will ensure this behaviour.

The projects you created with older versions of the LRW will never be removed by the uninstall process. You can point your new version of the LRW at the same data workspace during startup.



[repost] Product: ISPMan Centralized ISP Management System

original:

Product: ISPMan Centralized ISP Management System

From FreshPorts and their website:

ISPMan is ISP management software written in Perl, using an LDAP backend to manage virtual hosts for an ISP. It can be used to manage DNS, virtual hosts in the Apache config, Postfix configuration, Cyrus mailboxes, ProFTPD, etc.

ISPMan was written as a management tool for the network at 4unet, where between 30 and 50 domains are hosted and the number is crazily growing. Managing these domains and their users was a little time-consuming, and needed an administrator who knows Linux and these daemons fluently. Now the help desk can easily manage the domains and users.

LDAP data can be easily replicated site-wide, and mailbox servers can be scaled from 1 to n as required. An LDAP entry called maildrop tells the SMTP server (Postfix) where to deliver the mail. The SMTP servers can be load-balanced with one of many load-balancing techniques. The program is written with scalability and high availability in mind.
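The maildrop mechanism can be pictured as an LDIF fragment. This is a hypothetical sketch: the maildrop attribute is mentioned above, but the object classes, attribute names, and directory layout here are invented for illustration and may differ from ISPMan’s actual schema.

```
# Hypothetical user entry: the SMTP server reads maildrop to pick the delivery host
dn: uid=jdoe,ispmanDomain=example.com,o=ispman
objectClass: inetOrgPerson
uid: jdoe
mail: jdoe@example.com
maildrop: mailbox-server-2.example.com
```

Because the delivery host is just an attribute in the directory, moving a user to another mailbox server is a single LDAP modify.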

This may not be the right software for you if you want to run a small
ISP on a single box or if you want to use this software as an LDAP
editor or a DNS management software by itself.

ISPMan is written mostly in Perl and is based on four major components. All these components are based on open standards and are easily customizable.

  • LDAP-directory works as a central registry of information about users, hosts, DNS, processes, etc. All information related to resources is kept in this directory.

    The LDAP directory can be replicated to multiple machines to balance the load.

  • Ispman-webinterface is an intuitive interface to manage information about your ISP infrastructure. This interface allows you to edit your LDAP registry to change information about your resources, such as adding a new domain or deleting a user.

    The interface can run over HTTP or HTTPS and is only available after successful authentication as an ISPMan admin. Access to this interface can also be limited to designated IP addresses, either via Apache access control functions or via ISPMan ACLs.

  • Ispman-agent is a component of ISPMan that runs on hosts taking part in the ISP. These agents read the LDAP directory for processes assigned to them and take appropriate actions.

    Example: create directories for new domains, create mailboxes for users, etc. These agents are a very important part of the system and should run continuously.

    The agents are run via a fault-tolerant service manager called daemontools, which makes sure that the agents recover immediately in case of any failure.

  • ISPman-customer-control-panel is an interface targeted towards customers (domain owners). Using this interface, domain owners can manage their own DNS, web server settings, users, mailing lists, access control, etc.

[repost] Product: ScaleOut StateServer is Memcached on Steroids

original:

Product: ScaleOut StateServer is Memcached on Steroids

ScaleOut StateServer is an in-memory distributed cache across a server farm or compute grid. Unlike middleware vendors, StateServer aims at being a very good data cache; it doesn’t try to handle job scheduling as well.

StateServer is what you might get when you take Memcached and merge in all the value-added distributed caching features you’ve ever dreamed of. True, Memcached is free and ScaleOut StateServer is very far from free, but for those looking for a satisfying out-of-the-box experience, StateServer may be just the caching solution you are looking for. Yes, “solution” is one of those “oh my God I’m going to pay through the nose” indicator words, but it really applies here. Memcached is a framework, whereas StateServer has already prepackaged most features you would need to add through your own programming efforts.

Why use a distributed cache? Because it combines the holy quadrinity of computing: better performance, linear scalability, high availability, and fast application development. Performance is better because data is accessed from memory instead of through a database to a disk. Scalability is linear because as more servers are added, data is transparently load-balanced across the servers, so there is automated in-memory sharding. Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure. Application development is faster because there’s only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer, which means all a developer has to do is get and put data.
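The “get and put” programming model, with sharding and replication hidden behind it, can be sketched in a few lines of Python. This is a toy model of client-side hash sharding with one replica per object, invented for illustration; it is not ScaleOut’s API or placement algorithm.

```python
import hashlib

class ToyDistributedCache:
    """Toy in-memory cache: each key hashes to a primary server,
    and every value is also written to one replica server."""
    def __init__(self, servers):
        self.servers = {name: {} for name in servers}

    def _owners(self, key):
        names = sorted(self.servers)
        i = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(names)
        return names[i], names[(i + 1) % len(names)]   # primary, replica

    def put(self, key, value):
        for name in self._owners(key):   # write to primary and replica
            self.servers[name][key] = value

    def get(self, key):
        primary, replica = self._owners(key)
        store = self.servers[primary]
        if key not in store:                 # primary lost the key (e.g. crashed):
            store = self.servers[replica]    # fall back to the replica
        return store.get(key)

cache = ToyDistributedCache(["s1", "s2", "s3"])
cache.put("user:42", {"name": "Ada"})
assert cache.get("user:42") == {"name": "Ada"}
primary, _ = cache._owners("user:42")
cache.servers[primary].clear()               # simulate a server failure
assert cache.get("user:42") == {"name": "Ada"}   # replica still serves it
```

The application code only ever calls get and put; which server holds the data, and what happens when one disappears, is the cache’s problem.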

StateServer follows the RAM-is-the-new-disk credo. Memory is assumed to be the system of record, not the database. If you want data to be stored in a database and have the two kept in sync, then you’ll have to add that layer yourself. All the standard Memcached techniques should work as well for StateServer. Consider, however, that a database layer may not be needed. Reliability is handled by StateServer because it keeps multiple data copies, reroutes on failure, and has an option for geographical distribution for another layer of added safety. Storing to disk wouldn’t make you any safer.

Via email I asked them a few questions. The key question was how they stacked up against Memcached? As that is surely one of the more popular challenges they would get in any sales cycle, I was very curious about their answer. And they did a great job differentiating themselves. What did they say?


First, for an in-depth discussion of their technology, take a look at ScaleOut Software Technology, but here are a few of the highlights:

  • Platforms: .Net, Linux, Solaris
  • Languages: .Net, Java and C/C++
  • Transparent Services: server farm membership, object placement, scaling, recovery, creating and managing replicas, and handling synchronization on object access.
  • Performance: Scales with measured linear throughput gain to farms with 64 servers. StateServer was subjected to maximum access load in tests that ramped from 2 to 64 servers, with more than 2.5 gigabytes of cached data and a sustained throughput of over 92,000 accesses per second using a 20 Mbits/second Infiniband network. StateServer provided linear throughput increases at each stage of the test as servers and load were added.
  • Data cache only. Doesn’t try to become middleware layer for executing jobs. Also will not sync to your database.
  • Local Cache View. Objects are cached on the servers where they were most recently accessed. Application developers can view the distributed cache as if it were a local cache which is accessed by the customary add, retrieve, update, and remove operations on cached objects. Object locking for synchronization across threads and servers is built into these operations and occurs automatically.
  • Automatic Sharding and Load Balancing. Automatically partitions all of distributed cache’s stored objects across the farm and simultaneously processes access requests on all servers. As servers are added to the farm, StateServer automatically repartitions and rebalances the storage workload to scale throughput. Likewise, if servers are removed, ScaleOut StateServer coalesces stored objects on the surviving servers and rebalances the storage workload as necessary.
  • High Availability. All cached objects are replicated on up to two additional servers. If a server goes offline or loses network connectivity, ScaleOut StateServer retrieves its objects from replicas stored on other servers in the farm, and it creates new replicas to maintain redundant storage as part of its “self-healing” process. Uses a quorum-based updating scheme.
  • Flexible Expiration Policies. Optional object expiration after sliding or fixed timeouts, LRU memory reclamation, or object dependency changes. Asynchronous events are also available to signal object expiration.
  • Geographical Scaleout. Has the ability to automatically replicate to a remote cache using the ScaleOut GeoServer option.
  • Parallel Query. Perform fully parallel queries on cached objects. Developers can attach metadata or “tags” to cached objects and query the cache for all matching objects. ScaleOut StateServer performs queries in parallel across all caching servers and employs patent-pending technology to ensure that query operations are both highly available and scalable. This is really cool technology that leverages the advantage of in-memory databases. Sharding means you have a scalable system that can execute complex queries in parallel without you doing all the work you would normally do in a sharded system. And you don’t have to resort to the complicated logic needed for SimpleDB and BigTable type systems. Very nice.
  • Pricing:
    – Development Edition: No Charge
    – Professional Edition: $1,895 for 2 servers
    – Data Center Edition: $71,995 for 64 servers
    – GeoServer Option First two data centers $14,995, Each add’l data center $7,495.
    – Support: 25% of software license fee
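
    ScaleOut doesn’t publish its partitioning algorithm, but the behavior described above — only a fraction of stored objects move when a server joins or leaves the farm — is the classic property of consistent hashing. Here is a minimal, illustrative sketch (server names and vnode count are my own, not anything from the product):

    ```python
    import bisect
    import hashlib

    class ConsistentHashRing:
        """Toy consistent-hash ring: adding or removing a server moves
        only a fraction of keys, mimicking automatic rebalancing."""

        def __init__(self, servers, vnodes=100):
            self._ring = []          # sorted list of (hash, server)
            self._vnodes = vnodes    # virtual nodes smooth the distribution
            for s in servers:
                self.add_server(s)

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def add_server(self, server):
            for i in range(self._vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
            self._ring.sort()

        def remove_server(self, server):
            self._ring = [(h, s) for h, s in self._ring if s != server]

        def server_for(self, key):
            h = self._hash(key)
            idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
            return self._ring[idx][1]

    ring = ConsistentHashRing(["cache1", "cache2", "cache3"])
    before = {k: ring.server_for(k) for k in map(str, range(1000))}
    ring.add_server("cache4")   # farm grows; only some keys move
    after = {k: ring.server_for(k) for k in before}
    moved = sum(1 for k in before if before[k] != after[k])
    print(f"{moved} of 1000 keys moved")  # roughly a quarter, not all
    ```

    Note that every key that moves lands on the new server; nothing reshuffles between the surviving servers, which is what makes the repartitioning cheap.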

    Some potential negatives about ScaleOut StateServer:

  • I couldn’t find a developer forum. There may be one, but it eluded me. One thing I always look for is a vibrant developer community and I didn’t see one. So if you have problems or want to talk about different ways of doing things, you are on your own.
  • The sales group wasn’t responsive. I sent them an email with a question and they never responded. That always makes me wonder how I’ll be treated once I’ve put money down.
  • The lack of developer talk made it hard for me to find negatives about the product itself, so I can’t evaluate its quality in production.

    In the next section, the headings are my questions and the responses are from ScaleOut Software.

    Why use ScaleOut StateServer instead of Memcached?

    I’ve [Dan McMillan, VP Sales] included some data points below based on our current understanding of the Memcached product. We don’t use and haven’t tested Memcached internally, so this comparison is based in part on our own investigations and in part on what we are hearing from our own customers during their evaluations and comparisons. We are aware that Memcached is successfully being used on many large, high-volume sites. We believe strong demand for ScaleOut is being driven by companies that need a ready-to-deploy solution that provides advanced features and just works. We also hear that Memcached is often seen as a low-cost solution in the beginning, but development and ongoing management costs sometimes far exceed our licensing fees.

    What sets ScaleOut apart from Memcached (and other competing solutions) is that ScaleOut was architected from the ground up to be a fully integrated and automated caching solution. ScaleOut offers both scalability and high availability, where our competitors typically provide only one or the other. ScaleOut is considered a full-featured, plug-n-play caching solution at a very reasonable price point, whereas we view Memcached as a framework in which to build your own caching solution. Much of the cost in choosing Memcached will be in development and ongoing management. ScaleOut works right out of the box.

    I asked ScaleOut Software founder and chief architect, Bill Bain, for his thoughts on this. He is a long-time distributed caching and parallel computing expert and is the architect of ScaleOut StateServer. He had several interesting points to share about creating a distributed cache using an open source (i.e., build-it-yourself) solution versus ScaleOut StateServer.

    First, he estimates that it would take considerable time and effort for engineers to create a distributed cache that has ScaleOut StateServer’s fundamental capabilities. The primary reason is that the open source method only gives you a starting point, but it does not include most capabilities that are needed in a distributed cache. In fact, there is no built-in scalability or availability, the two principal benefits of a distributed cache. Here is some of the functionality that you would have to build:

  • Scalable storage and throughput. You need to create a means of storing objects across the servers in the farm in a way that will scale as servers are added, such as creating and managing partitions. Dynamic load balancing of objects is needed to avoid hot spots, and to our knowledge this is not provided in memcached.
  • High availability. To ensure that objects are available in the case of a server failure, you need to create replicas and have a means of automatically retrieving them in case a server fails. Also, just knowing that a server has failed requires you to develop a scalable heart-beating mechanism that spans all servers and maintains a global membership. Replicas have to be atomically updated to maintain the coherency of the stored data.
  • Global object naming. The storage, load-balancing, and high availability mechanisms need to make use of efficient, global object naming and lookup so that any client can access any object in the distributed cache, even after load-balancing or recovery actions.
  • Distributed locking. You need distributed locking to coordinate accesses by different clients so that there are no conflicts or synchronization issues as objects are read, updated, and deleted. Distributed locks have to automatically recover in case of server failures.
  • Object timeouts. You also will need to build the capability for the cache to handle object timeouts (absolute and sliding) and to make these timeouts highly available.
  • Eventing. If you want your application to be able to catch asynchronous events such as timeouts, you will need a mechanism to deliver events to clients, and this mechanism should be both scalable and highly available.
  • Local caching. You need the ability to internally cache deserialized data on the clients to keep response times fast and avoid deserialization overhead on repeated reads. These local caches need to be kept coherent with the distributed cache.
  • Management. You need a means to manage all of the servers in the distributed cache and to collect performance data. There is no built-in management capability in memcached, and this requires a major development effort.
  • Remote client support. ScaleOut currently offers both a standard configuration (installed as a Windows service on each web server) and a remote client configuration (installed on a dedicated cache farm).
  • ASP.Net/Java interoperability. Our Java/Linux release will offer true ASP.Net/Java interop, allowing you to share objects and manage sessions across platforms. Note: we just posted our “preview” release last week.
  • Indexed query functionality. Our forthcoming ScaleOut 4.0 release will contain this feature, which allows you to query the store to return objects based on metadata.
  • Multiple data center support. With our GeoServer product, you can automatically replicate cached information to up to 8 remote data centers. This provides a powerful solution for disaster recovery, or even “active-active” configurations. GeoServer’s replication is both scalable and highly available.
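
    To make one of the items above concrete, the absolute-versus-sliding timeout distinction mentioned in the list can be sketched in a few lines. This is a toy illustration of the concept only, not ScaleOut’s (or memcached’s) implementation:

    ```python
    import time

    class ExpiringCache:
        """Toy cache supporting absolute and sliding timeouts."""

        def __init__(self):
            self._store = {}   # key -> (value, expiry, timeout, sliding)

        def put(self, key, value, timeout, sliding=False):
            self._store[key] = (value, time.monotonic() + timeout,
                                timeout, sliding)

        def get(self, key):
            entry = self._store.get(key)
            if entry is None:
                return None
            value, expiry, timeout, sliding = entry
            now = time.monotonic()
            if now >= expiry:
                del self._store[key]   # expired: evict and report a miss
                return None
            if sliding:                # sliding timeout: each read renews the lease
                self._store[key] = (value, now + timeout, timeout, sliding)
            return value

    cache = ExpiringCache()
    cache.put("session", {"user": "dan"}, timeout=0.1, sliding=True)
    time.sleep(0.06); assert cache.get("session") is not None  # read renews
    time.sleep(0.06); assert cache.get("session") is not None  # still alive
    time.sleep(0.12); assert cache.get("session") is None      # idle too long
    ```

    The hard part in a distributed cache, as the list notes, is making these timeouts highly available: the expiry state has to survive the failure of the server holding the master copy.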

    In addition to the above, we hope that the fact ScaleOut Software provides a commercial solution that is reasonably priced, supported and constantly improved would be viewed as an important plus for our customers. In many cases, in-house and open source solutions are not supported or improved once the original developer is gone or is assigned to other priorities.

    Do you find yourself in competition with the likes of Terracotta, GridGain, GridSpaces, and Coherence type products?

    Our ScaleOut technology has previously been targeted to the ASP.Net space. Now that we are entering the Java/Linux space, we will be competing with companies like the ones you mentioned above, which are mainly Java/Linux focused as well.

    We initially got our start with distributed caching for ecommerce applications, but grid computing seems to be a strong growth area for us as well. We are now working with some large Wall Street firms on grid computing projects that involve some (very large) grid operations.

    I would like to reiterate that we are very focused on data caching only. We don’t try to do job scheduling or other grid computing tasks, but we do improve performance and availability for those tasks via our distributed data cache.

    What architectures are your customers using with your GeoServer product?

    GeoServer is a newer, add-on product designed to replicate the contents of two or more geographically separated ScaleOut object stores (caches). Typically a customer might use GeoServer to replicate object data between a primary data center and a DR site. GeoServer performs continuous (asynchronous) replication between sites, so if site A goes offline, site B is immediately available to handle the workload.

    Our ScaleOut technology offers three primary benefits: scalability, performance, and high availability. From a single web farm perspective, ScaleOut provides high availability by making either one or two (this is configurable) replica copies of each master object and storing each replica on an alternate host server in the farm. ScaleOut provides uniform access to the object from any server and protects the object in the case of a server failure. With GeoServer, these benefits are extended across multiple sites.

    It is true that distributed caches typically hold temporary, fast-changing data, but that data can still be very critical to ecommerce or grid computing applications. Loss of this data during a server failure, worker process recycle, or even a grid computation is unacceptable. We improve performance by keeping the data in memory, while still maintaining high availability.
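
    The one-or-two-replica scheme Dan describes can be pictured with a tiny placement sketch. The server names and the “next servers in the ring” placement rule below are my own illustration, not ScaleOut’s actual load-balancing algorithm:

    ```python
    import zlib

    def place(key, servers, replicas=1):
        """Pick a primary server for a key plus `replicas` copies on
        alternate hosts, so one server failure never loses the object.
        Hash-modulo placement is an illustrative stand-in for the
        real partitioning logic."""
        n = len(servers)
        primary = zlib.crc32(key.encode()) % n
        copy_idx = [(primary + i) % n for i in range(1, replicas + 1)]
        return servers[primary], [servers[i] for i in copy_idx]

    farm = ["web1", "web2", "web3", "web4"]
    primary, backups = place("cart:42", farm, replicas=2)
    # primary and replicas always land on distinct servers
    assert primary not in backups and len(set(backups)) == 2
    ```

    With one replica the farm tolerates any single-server failure; with two, any pair of simultaneous failures — which is why the replica count is the configurable knob mentioned above.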

    Related Articles

  • RAM is the new disk
  • Latency is Everywhere and it Costs You Sales – How to Crush it
  • A Bunch of Great Strategies for Using Memcached and MySQL Better Together
  • Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • Google’s Paxos Made Live – An Engineering Perspective
  • Industry Chat with Bill Bain and Marc Jacobs – Joe Rubino interviews William L. Bain, Founder & CEO of ScaleOut Software and Marc Jacobs, Director at Lab49, on distributed caches and their use within Financial Services.