
[project ]StreamBase CEP

original:http://www.streambase.com/products/streambasecep/#axzz1u5NAqqP9

StreamBase CEP

The StreamBase Complex Event Processing platform is a high-performance system for rapidly building applications that analyze and act on real-time streaming data. With StreamBase, organizations build real-time systems in record time and deploy them at a fraction of the cost and risk of alternatives.

StreamBase’s complex event processing (CEP) software is distinguished by bringing together three significant capabilities in one integrated platform: rapid development via the industry’s first and only graphical event-flow language, extreme performance with a low-latency, high-throughput event server, and the broadest connectivity to real-time and historical data. Industry leaders and StreamBase partners in capital markets, the intelligence/security sector, and other industries are benefiting from rapid, on-the-fly processing of complex data streams, and from the fastest time to prototype, test, and deploy real-time applications.

StreamBase Platform

The traditional approach for processing this complex event data has required costly, time-consuming custom-coding of the infrastructure and application logic, using specialized expertise to build a complete functional system. StreamBase eliminates these problems by offering commercial systems software designed to process, analyze and respond to those real-time data streams instantaneously, offering superior speed, scalability, and value compared to conventional infrastructures or custom-coded environments.

Rapid Time-to-Value with Graphical Development Environment

StreamBase Studio™ is an Eclipse-based integrated development environment (IDE) which provides tools for all stages of the development process, including design, test and deployment. Applications can be prototyped and built in just a few hours or days.

The operations in event processing inherently follow a workflow pattern, and StreamBase Studio provides an intuitive drag-and-connect graphical authoring environment with workflow orientation that eliminates the need for custom-coding application logic – which can be very time-consuming and expensive. StreamBase also makes it easy to modify applications when business needs change or data volumes increase.

High Performance Complex Event Processing

StreamBase applications achieve performance levels measured at hundreds of thousands of messages/second by virtue of the StreamBase Server, an ultra-low-latency application server optimized for real-time event processing. It uses an inbound processing architecture that queries data as it streams through the system. Business rules and rich application logic are applied in real time to deliver results in-flight as they are produced, enabling significant speed/performance gains compared to alternatives that require storing and indexing the data before queries are processed.
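To make the idea of inbound processing concrete, here is a minimal Python sketch. It is not StreamBase code or StreamSQL; the function name `rolling_average`, the window size, and the sample prices are all assumptions chosen for illustration. It shows a query (a rolling mean) being evaluated as each event arrives, rather than after the events have been stored and indexed.

```python
from collections import deque

def rolling_average(events, window=3):
    """Yield the mean of the last `window` values as each event arrives.

    Hypothetical illustration of in-flight processing: a result is
    produced per incoming event, with no store-then-query step.
    """
    recent = deque(maxlen=window)
    total = 0.0
    for price in events:
        if len(recent) == window:
            total -= recent[0]      # value about to fall out of the window
        recent.append(price)        # deque drops the oldest item when full
        total += price
        yield total / len(recent)

# Made-up tick prices, used only to exercise the sketch.
ticks = [10.0, 12.0, 14.0, 16.0]
averages = list(rolling_average(ticks))
print(averages)  # [10.0, 11.0, 12.0, 14.0]
```

Because the function is a generator, each average is available as soon as its event has been seen, which is the property the paragraph above attributes to processing data as it streams through.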

Enterprise Connectivity

StreamBase offers a broad set of connectivity options that allow for integration with a variety of data sources and enterprise systems. These include StreamBase Adapters to leading financial market data feeds (Thomson Reuters, Bloomberg, Lime, Activ, exchanges), messaging systems (e.g. Tibco RV, EMS/JMS messaging), JDBC-compliant databases, and real-time dashboard development environments like Adobe Flex, as well as connectivity to high-capacity databases and data warehouses (Kx, HP Vertica, Thomson Reuters Velocity Analytics, and DB2). StreamBase also offers published Java, C++, and .NET API support and a wizard-based rapid adapter development toolkit.

The Bottom Line

Before StreamBase, processing real-time data feeds with high throughput was a difficult undertaking. With StreamBase’s enterprise-class engine running StreamSQL, more and more organizations are now gaining:

  • Faster processing and reaction to real-time complex event streams
  • Shorter development cycles, with significantly easier maintenance
  • Dramatically lower development and programming costs
  • Flexibility to quickly adapt to changing business and analytic needs
  • Reduced hardware and operational expenses
  • Faster time-to-profit from real-time initiatives

For additional information about StreamBase, please see our answers to Frequently Asked Questions, or visit the support knowledge base.


[repost]Twitter’s Plan to Analyze 100 Billion Tweets

original:

Twitter’s Plan to Analyze 100 Billion Tweets

If Twitter is the “nervous system of the web” as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? That brain is the Twitter Analytics System and Kevin Weil, as Analytics Lead at Twitter, is the homunculus within in charge of figuring out what those over 100 billion tweets (approximately the number of neurons in the human brain) mean.

Twitter has only 10% of the expected 100 billion tweets now, but a good brain always plans ahead. Kevin gave a talk, Hadoop and Protocol Buffers at Twitter, at the Hadoop Meetup, explaining how Twitter plans to use all that data to answer key business questions.

What type of questions is Twitter interested in answering? Questions that help them better understand Twitter. Questions like:

  1. How many requests do we serve in a day?
  2. What is the average latency?
  3. How many searches happen in a day?
  4. How many unique queries, how many unique users, what is their geographic distribution?
  5. What can we tell about a user from their tweets?
  6. Who retweets more?
  7. How does usage differ for mobile users?
  8. What went wrong at the same time?
  9. Which features get users hooked?
  10. What is a user’s reputation?
  11. How deep do retweets go?
  12. Which new features worked?

And many many more. The questions help them understand Twitter, their analytics system helps them get the answers faster.

Hadoop and Pig are Used for Analysis

Any question you can think of requires analyzing big data for answers. 100 billion is a lot of tweets. That’s why Twitter uses Hadoop and Pig as their analysis platform. Hadoop provides: key-value storage on a distributed file system, horizontal scalability, fault tolerance, and map-reduce for computation. Pig is a query mechanism that makes it possible to write complex queries on top of Hadoop.
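As a rough illustration of the map-reduce model, the sketch below answers one of the questions above (how many searches happen in a day) in plain Python. It runs in a single process and does not use Hadoop or Pig; the log records, the record layout, and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (day, 1) for every search request; ignore other requests."""
    day, kind = record
    if kind == "search":
        yield day, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one day."""
    return key, sum(values)

# Made-up request log: (day, request type).
log = [
    ("2010-02-17", "search"),
    ("2010-02-17", "tweet"),
    ("2010-02-17", "search"),
    ("2010-02-18", "search"),
]

pairs = [pair for record in log for pair in map_phase(record)]
searches_per_day = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(searches_per_day)  # {'2010-02-17': 2, '2010-02-18': 1}
```

On a real cluster the map and reduce calls would run in parallel on different nodes; a higher-level query language such as Pig exists so that this kind of grouping and counting does not have to be spelled out by hand.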

Saying you are using Hadoop is really just the beginning of the story. The rest of the story is what is the best way to use Hadoop? For example, how do you store data in Hadoop?

This may seem an odd question, but the answer has big consequences. In a relational database you don’t store the data, the database stores you, er, it stores the data for you. APIs move that data around in a row format.

Not so with Hadoop. Hadoop’s key-value model means it’s up to you how data is stored. Your choice has a lot to do with performance, how much data can be stored, and how agile you can be in reacting to future changes.

Each tweet has 12 fields, 3 of which have sub structure, and the fields can and will change over time as new features are added. What is the best way to store this data?

Data is Stored in Protocol Buffers to Keep it Efficient and Flexible

Twitter considered CSV, XML, JSON, and Protocol Buffers as possible storage formats. Protocol Buffers is a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats. BSON (binary JSON) was not evaluated, but would probably not work because it doesn’t have an IDL (interface definition language). Avro is one potential option that they’ll look into in the future.

An evaluation matrix was created which declared Protocol Buffers the winner. Protocol Buffers won because it allows data to be split across different nodes; it is reusable for data other than just tweets (logs, file storage, RPC, etc); it parses efficiently; fields can be added, changed, and deleted without having to change deployed code; the encoding is small; it supports hierarchical relationships. All the other options failed one or more of these criteria.
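The claim that fields can be added without changing deployed code follows from the way Protocol Buffers tags each value on the wire with a field number and a wire type, so a reader can skip numbers it does not recognize. The Python sketch below hand-rolls a small subset of that encoding (non-negative varints and length-delimited UTF-8 strings only) to show the effect. It is not the protobuf library, and the field numbers and names (`id`, `text`, `geo`) are hypothetical, not Twitter's actual schema.

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_field(number, value):
    """Encode one field: wire type 0 for ints, 2 for strings."""
    if isinstance(value, int):
        return encode_varint((number << 3) | 0) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data

def decode(buf, known):
    """Decode a message, keeping only field numbers listed in `known`."""
    pos, out = 0, {}
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in known:           # unknown fields are skipped, not errors
            out[known[number]] = value
    return out

# A newer writer adds hypothetical field 3; an older reader still works.
message = encode_field(1, 42) + encode_field(2, "hello") + encode_field(3, "SF")
old_view = decode(message, {1: "id", 2: "text"})
new_view = decode(message, {1: "id", 2: "text", 3: "geo"})
print(old_view)  # {'id': 42, 'text': 'hello'}
print(new_view)  # {'id': 42, 'text': 'hello', 'geo': 'SF'}
```

The same tagging is part of why the encoding stays small: field names never appear in the data, only their numbers.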

IDL Used for Codegen

Surprisingly, efficiency, flexibility and other sacred geek metrics were not the only reasons Twitter liked Protocol Buffers. What is often considered a weakness, Protocol Buffers’ use of an IDL to describe data structures, is actually considered a big win by Twitter. Having to define data structures in an IDL is often seen as a useless waste of time. But from the IDL they generate, as part of the build process, all Hadoop-related code: Protocol Buffer InputFormats, OutputFormats, Writables, Pig LoadFuncs, Pig StoreFuncs, and more.

All the code that once was written by hand for each new data structure is now simply auto-generated from the IDL. This saves a ton of effort and the code is much less buggy. IDL actually saves time.
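The general technique, generating boilerplate from a declarative description, can be sketched in a few lines of Python. This is only a stand-in for the idea: Twitter's generators emit Hadoop and Pig classes from Protocol Buffer definitions, whereas the `SCHEMA` dictionary, the `generate_class` function, and the generated `Tweet` class below are invented for illustration.

```python
# Hypothetical declarative description standing in for an IDL file.
SCHEMA = {"Tweet": [("id", "int"), ("text", "str")]}

def generate_class(name, fields):
    """Return Python source for a class with a constructor and to_row()."""
    args = ", ".join(f"{field}: {type_name}" for field, type_name in fields)
    row = ", ".join(f"self.{field}" for field, _ in fields)
    lines = [f"class {name}:", f"    def __init__(self, {args}):"]
    lines += [f"        self.{field} = {field}" for field, _ in fields]
    lines += ["    def to_row(self):", f"        return ({row},)"]
    return "\n".join(lines) + "\n"

# Generate the source, then load it the way a build step would compile it.
source = generate_class("Tweet", SCHEMA["Tweet"])
namespace = {}
exec(source, namespace)

tweet = namespace["Tweet"](1, "hello")
print(tweet.to_row())  # (1, 'hello')
```

Adding a field to the description and regenerating updates the constructor and `to_row` together, which is the kind of hand-written, error-prone duplication the generated Hadoop code removes.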

At one point model-driven auto-generation was a common tactic on many projects. Then fashion moved to writing everything by hand. Codegen, it seems, wasn’t agile enough. Once you write everything by hand you start really worrying about the verbosity of your language, which moved everyone to more dynamic languages, and ironically DSLs were still often listed as an advantage of languages like Ruby. Another consequence of hand coding was framework-of-the-week-itis. Frameworks help blunt the damage caused by thinking everything must be written from scratch.

It’s good to see code generation coming into fashion again. There’s a lot of power in using a declarative specification and then writing highly efficient, system specific code generators. Data structures are the low hanging fruit, but it’s also possible to automate larger more complex processes.

Overall it was a very interesting and useful talk. I like seeing the careful evaluation of different options based on knowing what you want and why.  It’s refreshing to see how these smart choices can synergize and make a better and more stable system.

Related Articles

  1. Hadoop and Protocol Buffers at Twitter
  2. A Peek Under Twitter’s Hood – Twitter’s open source page goes live.
  3. Hadoop
  4. ProtocolBuffers
  5. Hadoop Bay Area User Group – Feb 17th at Yahoo! – RECAP
  6. Twitter says “Today, we are seeing 50 million tweets per day—that’s an average of 600 tweets per second.”