
[repost] Getting Real About Distributed System Reliability

original: http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability

There is a lot of hype around distributed data systems, some of it justified. It's true that the internet has centralized a lot of computation onto services like Google, Facebook, Twitter, LinkedIn (my own employer), and other large web sites. It's true that this centralization puts a lot of pressure on system scalability for these companies. It's true that incremental and horizontal scalability is a deep feature that requires redesign from the ground up and can't be bolted onto existing products. It's true that, if properly designed, these systems can be run with no planned downtime or maintenance intervals in a way that traditional storage systems make harder. It's also true that software that is explicitly designed to deal with machine failures is a very different thing from traditional infrastructure. All of these properties are critical to large web companies, and are what drove the adoption of horizontally scalable systems like Hadoop, Cassandra, Voldemort, etc. I was the original author of Voldemort and have worked on distributed infrastructure for the last four years or so. So insofar as there is a "big data" debate, I am firmly in the "pro" camp. But one thing you often hear is that this kind of software is more reliable than the traditional alternatives it replaces, and this just isn't true. It is time people talked honestly about this.

You hear this assumption of reliability everywhere. Now that scalable data infrastructure has a marketing presence, it has really gotten bad: if Hadoop or Cassandra or what-have-you can tolerate machine failures, then they must be unbreakable, right? Wrong.

Where does this come from? Distributed systems need to partition data or state over lots of machines to scale. Adding machines increases the probability that some machine will fail, and to address this these systems typically have some kind of replicas or other redundancy to tolerate failures. The core argument that gets used for these systems is that if a single machine has probability P of failure, and if the software can replicate data N times to survive N-1 failures, and if the machines fail independently, then the probability of losing a particular piece of data must be P^N. So for any desired reliability R and any single-node failure probability P you can pick some replication factor N so that P^N < R. This argument is the core motivation behind most variations on replication and fault tolerance in distributed systems. It is true that without this property the system would be hopelessly unreliable as it grew. But it leads people to believe that distributed software is somehow innately reliable, which unfortunately is utter hogwash.
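To make the arithmetic concrete (with illustrative numbers, not measurements from any real system): if a single machine fails in some window with probability P = 0.01 and data is replicated N = 3 ways, then under the independence assumption

 P^N = (0.01)^3 = 0.000001

so three replicas would, in theory, give a one-in-a-million chance of losing that item. Keep that number in mind when comparing it to the measured failure rates discussed below.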

Where is the flaw in the reasoning? Is it the dreaded Hadoop single points of failure? No, it is far more fundamental than that: the problem is the assumption that failures are independent. Surely no belief could possibly be more counter to our own experience or just common sense than believing that there is no correlation between failures of machines in a cluster. You take a bunch of pieces of identical hardware, run them on the same network gear and power systems, have the same people run and manage and configure them, and run the same (buggy) software on all of them. It would be incredibly unlikely that the failures on these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push bad config, it is usually game over no matter how many machines you push it to.

P^N is an upper bound on reliability, but one that you could never, never approach in practice. For example, Google has a fantastic paper that gives empirical numbers on system failures in Bigtable and GFS and reports groups of correlated failures at rates several orders of magnitude higher than the independence assumption would predict. This is what one of the best systems and operations teams in the world can get: your numbers may be far worse.

The actual reliability of your system depends largely on how bug free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. This isn't any different from traditional systems, except that the new software is far less mature. I don't mean this disparagingly; I work in this area; it is just a fact. Maturity comes with time and usage and effort. This software hasn't been around for as long as MySQL or Oracle, and, worse, the expertise to run it reliably is much less common. MySQL and Oracle administrators are plentiful, but folks with, say, serious production ZooKeeper operations experience are much rarer.

Kernel filesystem engineers say it takes about a decade for a new filesystem to go from concept to maturity. I am not sure these systems will be mature much faster—they are not easier systems to build and the fundamental design space is much less well explored. This doesn't mean they won't be useful sooner, especially in domains where they solve a pressing need and are approached with an appropriate amount of caution, but they are not yet by any means a mature technology.

Part of the difficulty is that distributed system software is actually quite complicated in comparison to single-server code. Code that deals with failure cases and is "cluster aware" is extremely tricky to get right. The root of the problem is that dealing with failures effectively explodes the possible state space that needs testing and validation. For example, it doesn't even make sense to expect a single-node database to be fast if its disk system suddenly gets really slow (how could it be?), but a distributed system does need to carry on in the presence of a single degraded machine, because with so many machines, one is sure to be degraded. These kinds of "semi-failures" are common and very hard to deal with. Correctly testing these kinds of issues in a realistic setting is brutally hard, and the newer generation of software doesn't have anything like the QA processes its more mature predecessors had. (If you get a chance, get someone who has worked at Oracle to describe the kind of testing a line of code goes through before it gets shipped to customers in their database.) As a result there are a lot of bugs. And of course these bugs are on all the machines, so they absolutely happen together.

Likewise distributed systems typically require more configuration and more complex configuration because they need to be cluster aware, deal with timeouts, etc. This configuration is, of course, shared; and this creates yet another opportunity to bring everything to its knees.

And finally these systems usually want lots of machines. And no matter how good you are, some part of operational difficulty always scales with the number of machines.

Let's discuss some real issues. We had a bug in Kafka recently that led to the server incorrectly interpreting a corrupt request as a corrupt log, and shutting itself down to avoid appending to a corrupt log. Single-machine log corruption is the kind of thing that should happen due to a disk error, and bringing down the corrupt node is the right behavior—it shouldn't happen on all the machines at the same time unless all the disks fail at once. But since this was due to corrupt requests, and since we had one client that sent corrupt requests, it was able to sequentially bring down all the servers. Oops. Another example is this Linux bug which causes the system to crash after ~200 days of uptime. Since machines are commonly restarted sequentially, this led to a situation where a large percentage of machines went hard down one after another. Likewise any memory management problems—either leaks or GC problems—tend to happen everywhere at once or not at all. Some companies do public post-mortems for major failures, and these are a real wealth of examples of failures in systems that aren't supposed to fail. This paper has an excellent summary of HDFS availability at Yahoo—they note how few of the problems are of the kind that high availability for the namenode would solve. This list could go on, but you get the idea.

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations in the same way you can buy good software—you might be able to hire good people (if you can find them) but this is more than just people; it is practices, monitoring systems, configuration management, etc.

These difficulties are one of the core barriers to adoption for distributed data infrastructure. LinkedIn and other companies that have a deep interest in doing creative things with data have taken on the burden of building this kind of expertise in-house—we employ committers on Hadoop and other open source projects on both our engineering and operations teams, and have done a lot of from-scratch development in this space where there were gaps. This makes it feasible to take full advantage of an admittedly valuable but immature set of technologies, and lets us build products we couldn't otherwise—but this kind of investment only makes sense at a certain size and scale. It may be too high a cost for small startups or companies outside the web space trying to bootstrap this kind of knowledge inside a more traditional IT organization.

This is why people should be excited about things like Amazon’s DynamoDB. When DynamoDB was released, the company DataStax that supports and leads development on Cassandra released a feature comparison checklist. The checklist was unfair in many ways (as these kinds of vendor comparisons usually are), but the biggest thing missing in the comparison is that you don’t run DynamoDB, Amazon does. That is a huge, huge difference. Amazon is good at this stuff, and has shown that they can (usually) support massively multi-tenant operations with reasonable SLAs, in practice.

I really think there is only one thing to talk about with respect to reliability: continuous hours of successful production operations. That's it. In some ways this is the most obvious thing, but it is not typically what you hear when people talk about these systems. I will believe the system can tolerate (some) failures when I see it tolerate those failures; I will believe it can run for a year without downtime when I see it run for a year without downtime. I call this empirical reliability (as opposed to theoretical reliability). And getting good empirical reliability is really, really hard. These systems end up being large hunks of monitoring, tests, and operational procedures with a teeny little distributed system strapped to the back.

You see this showing up in discussions of the CAP theorem all the time. The CAP theorem is a useful thing, but it applies more to system designs than to system implementations. A design is simple enough that you can maybe prove it provides consistency or tolerates partition failures under some assumptions. This is a useful lens to look at system designs. You can't hope to do this kind of proof with an actual system implementation, the thing you run. The difficulty of building these things means it is really unthinkable that these systems are, in actual reality, either consistent, available, or partition tolerant—they certainly all have numerous bugs that will break each of these guarantees. I really like this paper that takes the approach of actually trying to calculate the observed consistency of eventually consistent systems—they seem to do it via simulation rather than measurement, which is unfortunate, but the idea is great.

It isn't that system design is meaningless. It is worth discussing, as the design does act as a kind of limiting factor on certain aspects of reliability and performance as the implementation matures and improves, but don't take it too seriously as guaranteeing anything.

So why isn’t this kind of empirical measurement more talked about? I don’t know. My pet theory is it has to do with the somewhat rigid and deductive mindset of classical computer science. This is inherited from pure math, and conflicts with the attitude in scientific disciplines. It leads to a preference for starting with axioms, and then proving various properties that follow from these axioms. This world view doesn’t embrace the kind of empirical measurements you would expect to justify claims about reality (for another example see this great blog post on programming language productivity claims in programming language research). But that is getting off topic. Suffice it to say, when making predictions about how a system will work in the real world I believe in measurements of reality a lot more than arguments from first-principles.

I think we should insist on a little more rigor and empiricism in this area.

I would love to see claims in academic publications around practicality or reliability justified in the same way we justify performance claims—by doing it. I would be a lot more likely to believe an academic distributed system was practically feasible if it were run continuously under load for a year successfully and if information were reported on failures and outages. Maybe that isn't feasible for an academic project, but few other allegedly scientific disciplines get away with making claims about reality without evidence.

More broadly I would like to see more systems whose operation is verifiable. Many systems have the ability to log out information about their state in a way that makes programmatic checking of various invariants and guarantees possible (such as consistency or availability). An example of such an invariant for a messaging system is that all the messages sent to it are received by all the subscribers. We actually measure the truth of this statement in real time in production for our Kafka data pipeline, for all 450 topics and all their subscribers. The number of corner cases one uncovers with this kind of check, run against a few hundred use cases and a few billion messages per day, is truly astounding. I think this is a special case of a broad class of verification that can be done on the running system and that goes far, far deeper than what is traditionally considered either monitoring or testing. Call it unit testing in production, or self-proving systems, or next-generation monitoring, or whatever, but this kind of deep verification is what turns the theoretical claims a design makes into measured properties of the real running system.

Likewise, if you have a "NoSQL vendor," I think it is reasonable to ask them to provide hard information on customer outages. They don't need to tell you who the customer is, but they should let you know the observed real-life distribution of MTTF and MTTR they are able to achieve, not just highlight one or two happy cases. Make sure you understand how they measure this: do they run an automated test load, or just wait for people to complain? This is a reasonable thing for people paying for a service to ask for. To a certain extent, if they provide this kind of empirical data, it isn't clear why you should even care what their architecture is beyond intellectual curiosity.

Distributed systems solve a number of key problems at the heart of scaling large websites. I absolutely think this is going to become the default approach to handling state in internet application development. But no one benefits from the kind of irrational exuberance that currently surrounds the "big data" or NoSQL systems communities. This area is experiencing a boom—but nothing takes us from boom to bust like unrealistic hype and broken promises. We should be a little more honest about where these systems already shine and where they are still maturing.

[repost] Set up a Web server cluster in 5 easy steps

original: http://www.ibm.com/developerworks/linux/library/l-linux-ha/

Spreading a workload across multiple processors, coupled with various software recovery techniques, provides a highly available environment and enhances overall RAS (Reliability, Availability, and Serviceability) of the environment. Benefits include faster recovery from unplanned outages, as well as minimal effects of planned outages on the end user.

To get the most out of this article, you should be familiar with Linux and basic networking, and you should have Apache servers already configured. Our examples are based on standard SUSE Linux Enterprise Server 10 (SLES10) installations, but savvy users of other distributions should be able to adapt the methods shown here.

This article illustrates the robust Apache Web server stack with 6 Apache server nodes (though 3 nodes is sufficient for following the steps outlined here) as well as 3 Linux Virtual Server (LVS) directors. We used 6 Apache server nodes to drive higher workload throughputs during testing and thereby simulate larger deployments. The architecture presented here should scale to many more directors and backend Apache servers as your resources permit, but we haven’t tried anything larger ourselves. Figure 1 shows our implementation using the Linux Virtual Server and the linux-ha.org components.
Figure 1. Linux Virtual Servers and Apache

As shown in Figure 1, the external clients send traffic to a single IP address, which may exist on any of the LVS director machines. The director machines actively monitor the pool of Web servers they relay work to.

Note that the workload progresses from the left side of Figure 1 toward the right. The floating resource address for this cluster will reside on one of the LVS director instances at any given time. The service address may be moved manually through a graphical configuration utility, or (more commonly) it can be self-managing, depending on the state of the LVS directors. Should any director become ineligible (due to loss of connectivity, software failure, or similar) the service address will be relocated automatically to an eligible director.

The floating service address must span two or more discrete hardware instances in order to continue operation with the loss of one physical machine. With the configuration decisions presented in this article, each LVS director is able to forward packets to any real Apache Web server regardless of physical location or proximity to the active director providing the floating service address. This article shows how each of the LVS directors can actively monitor the Apache servers in order to ensure requests are sent only to operational back-end servers.

With this configuration, practitioners have successfully failed entire Linux instances with no interruption of service to the consumers of the services enabled on the floating service address (typically http and https Web requests).

New to Linux Virtual Server terminology?

LVS directors: Linux Virtual Server directors are systems that accept arbitrary incoming traffic and pass it on to any number of realservers. They then accept the response from the realservers and pass it back to the clients who initiated the request. The directors need to perform their task in a transparent fashion such that clients never know that realservers are doing the actual workload processing.

LVS directors themselves need the ability to float resources (specifically, a virtual IP address on which they listen for incoming traffic) between one another in order to not become a single point of failure. LVS directors accomplish floating IP addresses by leveraging the Heartbeat component from the Linux-HA (linux-ha.org) project. This allows each configured director that is running Heartbeat to ensure that one, and only one, of the directors lays claim to the virtual IP address servicing incoming requests.

Beyond the ability to float a service IP address, the directors need to be able to monitor the status of the realservers that are doing the actual workload processing. The directors must keep a working knowledge of what realservers are available for processing at all times. In order to monitor the realservers, the mon package is used. Read on for details on configuring Heartbeat and configuring mon.

Realservers: These systems are the actual Web server instances providing the HA service. It is vital to have more than one realserver providing the service you wish to make HA. In our environment, 6 realservers are implemented, but adding more is trivial once the rest of the LVS infrastructure is in place.

In this article, the realservers are all assumed to be running the Apache Web Server, but other services could just as easily have been implemented (in fact, it is trivially easy to enable SSH serving as an additional test of the methodology presented here).

The realservers used are stock Apache Web servers, with the notable exception that they were configured to respond as if they were addressed by the LVS directors' floating IP address, or by a virtual hostname corresponding to the floating IP address used by the directors. This is accomplished by altering a single line in the Apache configuration file.
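The exact line depends on your Apache layout; on a stock SLES10 Apache 2 installation it is most likely the ServerName directive, set to the virtual hostname that resolves to the floating IP (192.168.71.205 in this article). A sketch, with a placeholder hostname:

 # In httpd.conf (or the vhost file your distribution uses); the hostname below is a placeholder
 ServerName cluster.example.com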

You can duplicate our configuration using an entirely open source software stack consisting of Heartbeat technology components provided by linux-ha.org, and server monitoring via mon and Apache. As stated, we used SUSE Linux Enterprise Server for testing our configuration.

All of the machines used in the LVS scenario reside on the same subnet and use the Network Address Translation (NAT) configuration. Numerous other network topologies are described at the Linux Virtual Server Web site (see Resources); we favor NAT for simplicity. For added security, you should limit traffic across firewalls to only the floating IP address that is passed between the LVS directors.

The Linux Virtual Server suite provides a few different methods to accomplish a transparent HA back-end infrastructure. Each method has advantages and disadvantages. LVS-NAT operates on a director server by grabbing incoming packets that are destined for configuration-specified ports and rewriting the destination address in the packet header dynamically. The director does not process the data content of the packets itself, but rather relays them on to the realservers. The destination address in the packets is rewritten to point to a given realserver from the cluster. The packet is then placed back on the network for delivery to the realserver, and the realserver is unaware that anything has gone on. As far as the realserver is concerned, it has simply received a request directly from the outside world. The replies from the realserver are then sent back to the director where they are again rewritten to have the source address of the floating IP address that clients are pointed at, and are sent along to the original client.

Using the LVS-NAT approach means the realservers require only simple TCP/IP functionality. The other modes of LVS operation, namely LVS-DR and LVS-Tun, require more complex networking concepts. The major benefit behind the choice of LVS-NAT is that very little alteration is required to the configuration of the realservers. In fact, the hardest part is remembering to set the routing statements properly.

Step 1: Building realserver images

Begin by making a pool of Linux server instances, each running Apache Web server, and ensure that the servers are working as designed by pointing a Web browser to each of the realserver’s IP addresses. Typically, a standard install will be configured to listen on port 80 on its own IP address (in other words, on a different IP for each realserver).

Next, configure the default Web page on each server to display a static page containing the hostname of the machine serving the page. This ensures that you always know which machine you are connecting to during testing.
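One way to do that is a one-liner run on each realserver; the /srv/www/htdocs path is the SLES default DocumentRoot, so adjust it for your distribution:

 # Publish a page that identifies the machine serving it
 echo "<html><body><h1>Served by $(hostname)</h1></body></html>" > /srv/www/htdocs/index.html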

As a precaution, check that IP forwarding on these systems is OFF by issuing the following command:

# cat /proc/sys/net/ipv4/ip_forward

If it’s not OFF and you need to disable it, issue this command:

# echo "0" >/proc/sys/net/ipv4/ip_forward

An easy way to ensure that each of your realservers is properly listening on the http port (80) is to use an external system and perform a scan. From some other system with network connectivity to your server, you can use the nmap utility to make sure the server is listening.
Listing 1. Using nmap to make sure the server is listening

                
# nmap -P0 192.168.71.92

Starting nmap 3.70 ( http://www.insecure.org/nmap/ ) at 2006-01-13 16:58 EST
Interesting ports on 192.168.71.92:
(The 1656 ports scanned but not shown below are in state: closed)
PORT    STATE    SERVICE
22/tcp  open     ssh
80/tcp  filtered http
111/tcp open     rpcbind
631/tcp open     ipp

 

Be aware that some organizations frown on the use of port scanning tools such as nmap: make sure that your organization approves before using it.

Next, point your Web browser to each realserver’s actual IP address to ensure each is serving the appropriate page as expected. Once this is completed, go to Step 2.


Step 2: Installing and configuring the LVS directors

Now you are ready to construct the 3 LVS director instances needed. If you are doing a fresh install of SUSE Linux Enterprise Server 10 for each of the LVS directors, be sure to select the high availability packages relating to heartbeat, ipvsadm, and mon during the initial installation. If you have an existing installation, you can always use a package management tool, such as YaST, to add these packages after your base installation. It is strongly recommended that you add each of the realservers to the /etc/hosts file. This will ensure there is no DNS-related delay when servicing incoming requests.
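The realserver addresses used throughout this article are listed below; the hostnames are placeholders, so substitute the names of your own machines:

 # Example /etc/hosts entries on each director (hostnames are illustrative)
 192.168.71.220   realserver1
 192.168.71.150   realserver2
 192.168.71.121   realserver3
 192.168.71.145   realserver4
 192.168.71.185   realserver5
 192.168.71.186   realserver6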

At this time, double-check that each of the directors is able to perform a timely ping to each of the realservers:
Listing 2. Pinging the realservers

                
 # ping -c 1 $REAL_SERVER_IP_1
 # ping -c 1 $REAL_SERVER_IP_2
 # ping -c 1 $REAL_SERVER_IP_3
 # ping -c 1 $REAL_SERVER_IP_4
 # ping -c 1 $REAL_SERVER_IP_5
 # ping -c 1 $REAL_SERVER_IP_6

 

Once completed, install ipvsadm, Heartbeat, and mon from the native package management tools on the server. Recall that Heartbeat will be used for intra-director communication, and mon will be used by each director to maintain information about the status of each realserver.
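On SLES10 this installation can be done non-interactively with YaST; package names and the installer command may differ on other distributions or service packs:

 # Install the cluster packages on each director (SLES10)
 yast -i heartbeat ipvsadm mon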


Step 3: Installing and configuring Heartbeat on the directors

If you have worked with LVS before, keep in mind that configuring Heartbeat Version 2 on SLES10 is quite a bit different than it was for Heartbeat Version 1 on SLES9. Where Heartbeat Version 1 used files (haresources, ha.cf, and authkeys) stored in the /etc/ha.d/ directory, Version 2 uses the new, XML-based Cluster Information Base (CIB). The recommended approach for upgrading is to use the haresources file to generate the new cib.xml file. The contents of a typical ha.cf file are shown in Listing 3.

We took the ha.cf file from a SLES9 system and added the bottom 3 lines (respawn, pingd, and crm) for Version 2. If you have an existing version 1 configuration, you may opt to do the same. If you are using these instructions for a new installation, you can copy Listing 3 and modify it to suit your production environment.
Listing 3. A sample /etc/ha.d/ha.cf config file

                
 # Log to syslog as facility "daemon"
 use_logd on
 logfacility daemon

 # List our cluster members (the LVS directors)
 node litsha22
 node litsha23
 node litsha21

 # Send one heartbeat every 3 seconds
 keepalive 3

 # Warn when heartbeats are late
 warntime 5

 # Declare nodes dead after 10 seconds
 deadtime 10

 # Keep resources on their "preferred" hosts - needed for active/active
 #auto_failback on

 # The cluster nodes communicate on their heartbeat lan (.68.*) interfaces
 ucast eth1 192.168.68.201
 ucast eth1 192.168.68.202
 ucast eth1 192.168.68.203

 # Failover on network failures
 # Make the default gateway on the public interface a node to ping
 # (-m) -> For every connected node, add <integer> to the value set
 #  in the CIB, * Default=1
 # (-d) -> How long to wait for no further changes to occur before
 #  updating the CIB with a changed attribute
 # (-a) -> Name of the node attribute to set,  * Default=pingd
 respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s

 # Ping our router to monitor ethernet connectivity
 ping litrout71_vip

 #Enable version 2 functionality supporting clusters with  > 2 nodes
 crm yes

 

The respawn directive is used to specify a program to run and monitor while it runs. If this program exits with anything other than exit code 100, it will be automatically restarted. The first parameter is the user id to run the program under, and the second parameter is the program to run. The -m parameter sets the attribute pingd to 100 times the number of ping nodes reachable from the current machine, and the -d parameter delays 5 seconds before modifying the pingd attribute in the CIB. The ping directive is given to declare the PingNode to Heartbeat, and the crm directive specifies whether Heartbeat should run the 1.x-style cluster manager or the 2.x-style cluster manager that supports more than 2 nodes.

This file should be identical on all the directors. It is absolutely vital that you set the permissions appropriately such that the hacluster daemon can read the file. Failure to do so will cause a slew of warnings in your log files that may be difficult to debug.

For a release 1-style Heartbeat cluster, the haresources file specifies the node name and networking information (floating IP, associated interface, and broadcast). For us, this file remained unchanged:

litsha21 192.168.71.205/24/eth0/192.168.71.255

This file will be used only to generate the cib.xml file.

The authkeys file specifies a shared secret allowing directors to communicate with one another. The shared secret is simply a password that all the heartbeat nodes know and use to communicate with one another. The secret prevents unwanted parties from trying to influence the heartbeat server nodes. This file also remained unchanged:

auth 1

1 sha1 ca0e08148801f55794b23461eb4106db

The next few steps show you how to convert the version 1 haresources file to the new version 2 XML-based configuration format (cib.xml). Though it should be possible to simply copy and use the configuration file in Listing 4 as a starting point, it is strongly suggested that you follow along to tailor the configuration for your deployment.

To convert file formats to the XML-based CIB (Cluster Information Base) file you will use in deployment, issue the following command:

python /usr/lib64/heartbeat/haresources2cib.py /etc/ha.d/haresources > /var/lib/heartbeat/crm/test.xml

A configuration file similar to the one shown in Listing 4 will be generated and placed in /var/lib/heartbeat/crm/test.xml.
Listing 4. Sample CIB.xml file

                
 <cib admin_epoch="0" have_quorum="true" num_peers="3" cib_feature_revision="1.3"
  generated="true" ccm_transition="7" dc_uuid="114f3ad1-f18a-4bec-9f01-7ecc4d820f6c"
  epoch="280" num_updates="5205" cib-last-written="Tue Apr  3 16:03:33 2007">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <attributes>
            <nvpair id="cib-bootstrap-options-symmetric_cluster"
                   name="symmetric_cluster" value="true"/>
            <nvpair id="cib-bootstrap-options-no_quorum_policy"
                   name="no_quorum_policy" value="stop"/>
            <nvpair id="cib-bootstrap-options-default_resource_stickiness"
                   name="default_resource_stickiness" value="0"/>
            <nvpair id="cib-bootstrap-options-stonith_enabled"
                   name="stonith_enabled" value="false"/>
            <nvpair id="cib-bootstrap-options-stop_orphan_resources"
                   name="stop_orphan_resources" value="true"/>
            <nvpair id="cib-bootstrap-options-stop_orphan_actions"
                   name="stop_orphan_actions" value="true"/>
            <nvpair id="cib-bootstrap-options-remove_after_stop"
                   name="remove_after_stop" value="false"/>
            <nvpair id="cib-bootstrap-options-transition_idle_timeout"
                   name="transition_idle_timeout" value="5min"/>
            <nvpair id="cib-bootstrap-options-is_managed_default"
                   name="is_managed_default" value="true"/>
          </attributes>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node uname="litsha21" type="normal" id="01ca9c3e-8876-4db5-ba33-a25cd46b72b3">
          <instance_attributes id="standby-01ca9c3e-8876-4db5-ba33-a25cd46b72b3">
            <attributes>
              <nvpair name="standby" id="standby-01ca9c3e-8876-4db5-ba33-a25cd46b72b3"
                     value="off"/>
            </attributes>
          </instance_attributes>
        </node>
        <node uname="litsha23" type="normal" id="dc9a784f-3325-4268-93af-96d2ab651eac">
          <instance_attributes id="standby-dc9a784f-3325-4268-93af-96d2ab651eac">
            <attributes>
              <nvpair name="standby" id="standby-dc9a784f-3325-4268-93af-96d2ab651eac"
                     value="off"/>
            </attributes>
          </instance_attributes>
        </node>
        <node uname="litsha22" type="normal" id="114f3ad1-f18a-4bec-9f01-7ecc4d820f6c">
          <instance_attributes id="standby-114f3ad1-f18a-4bec-9f01-7ecc4d820f6c">
            <attributes>
              <nvpair name="standby" id="standby-114f3ad1-f18a-4bec-9f01-7ecc4d820f6c"
                     value="off"/>
            </attributes>
          </instance_attributes>
        </node>
      </nodes>
      <resources>
        <primitive provider="heartbeat" type="IPaddr" id="IPaddr_1">
          <operations>
            <op id="IPaddr_1_mon" interval="5s" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="IPaddr_1_inst_attr">
            <attributes>
              <nvpair id="IPaddr_1_attr_0" name="ip" value="192.168.71.205"/>
              <nvpair id="IPaddr_1_attr_1" name="netmask" value="24"/>
              <nvpair id="IPaddr_1_attr_2" name="nic" value="eth0"/>
              <nvpair id="IPaddr_1_attr_3" name="broadcast" value="192.168.71.255"/>
            </attributes>
          </instance_attributes>
        </primitive>
      </resources>
      <constraints>
        <rsc_location id="rsc_location_IPaddr_1" rsc="IPaddr_1">
          <rule id="prefered_location_IPaddr_1" score="200">
            <expression attribute="#uname" id="prefered_location_IPaddr_1_expr"
                   operation="eq" value="litsha21"/>
          </rule>
        </rsc_location>
        <rsc_location id="my_resource:connected" rsc="IPaddr_1">
          <rule id="my_resource:connected:rule" score_attribute="pingd">
            <expression id="my_resource:connected:expr:defined" attribute="pingd"
                   operation="defined"/>
          </rule>
        </rsc_location>
      </constraints>
    </configuration>
  </cib>

 

Once your configuration file is generated, move test.xml to cib.xml, change the owner to hacluster and the group to haclient, and then restart the heartbeat process.
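On a SLES10 director that sequence looks roughly like the following (use rcheartbeat or your distribution's equivalent init script if you prefer):

 # Promote the generated configuration to be the live CIB and fix its ownership
 mv /var/lib/heartbeat/crm/test.xml /var/lib/heartbeat/crm/cib.xml
 chown hacluster:haclient /var/lib/heartbeat/crm/cib.xml

 # Restart heartbeat so it picks up the new cib.xml
 /etc/init.d/heartbeat restart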

Now that the heartbeat configuration is complete, set heartbeat to start at boot time on each of the directors. To do this, issue the following command (or equivalent for your distribution) on each director:

# chkconfig heartbeat on

Restart each of the LVS directors to ensure the heartbeat service starts properly at boot. By halting the machine that holds the floating resource IP address first, you can watch as the other LVS Director images establish quorum, and instantiate the service address on a newly-elected primary node within a matter of seconds. When you bring the halted director image back online, the machines will re-establish quorum across all nodes, at which time the floating resource IP may transfer back. The entire process should take only a few seconds.
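While you run these restart tests, it can help to watch the cluster from a terminal as well. If your Heartbeat Version 2 packages include the crm_mon tool, it shows resource placement as it changes; 192.168.71.205 is the floating service IP used in this article:

 # On any director: watch resource placement, refreshing every 2 seconds
 crm_mon -i 2

 # From a remote client: watch for dropped packets while nodes fail over
 ping 192.168.71.205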

Additionally, at this time you may wish to use the graphical utility for the heartbeat process, hb_gui (see Figure 2), to manually move the IP address around in the cluster by setting various nodes to the standby or active state. Retry these steps numerous times, disabling and re-enabling various machines that are active or inactive. With the choice of configuration policy selected earlier, as long as quorum can be established and at least one node is eligible, the floating resource IP address will remain operational. During your testing, you can use simple pings to ensure that no packet loss occurs. When you have finished experimenting, you should have a strong feel for how robust your configuration is. Make sure you are comfortable with the HA configuration of your floating resource IP before continuing on.
Figure 2. Graphical configuration utility for the heartbeat process, hb_gui

Figure 2 illustrates the graphical console as it appears after login, showing the managed resources and associated configuration options. Note that you must log into the hb_gui console when you first launch the application; the credentials used will depend on your deployment.

Notice in Figure 2 how the nodes in the cluster, the litsha2* systems, are each in the running state. The system labeled litsha21 is the current active node, as indicated by the addition of a resource displayed immediately below and indented (IPaddr_1).

Also note that the selection labeled "No Quorum Policy" is set to the value "stop". This means that any isolated node releases the resources it would otherwise own. The implication of that decision is that at any given time, 2 heartbeat nodes must be active to establish quorum (in other words, a voting majority). Even if a single active, 100% operational node loses connection to its peer systems due to a network failure, or if both of the inactive peers halt simultaneously, the resource will be voluntarily released.


Step 4: Creating LVS rules with the ipvsadm command

The next step is to take the floating resource IP address and build on it. Because LVS is intended to be transparent to remote Web browser clients, all Web requests must be funneled through the directors and passed on to one of the realservers. Then any results need to be relayed back to the director, which then returns the response to the client who initiated the Web page request.

To accomplish that flow of requests and responses, first configure each of the LVS directors to enable IP forwarding (thus allowing requests to be passed on to the realservers) by issuing the following commands:

# echo "1" >/proc/sys/net/ipv4/ip_forward

# cat /proc/sys/net/ipv4/ip_forward

If all was successful, the second command returns a "1" as output to your terminal. To make this setting permanent, add:

IP_FORWARD="yes"

to /etc/sysconfig/sysctl.

Next, to tell the directors to relay incoming HTTP requests to the HA floating IP address on to the realservers, use the ipvsadm command.

First, clear the old ipvsadm tables:

# /sbin/ipvsadm -C

Before you can configure the new tables, you need to decide what kind of workload distribution you want the LVS directors to use. On receipt of a connect request from a client, the director assigns a realserver to the client based on a “schedule,” and you will set the scheduler type with the ipvsadm command. Available schedulers include:

  • Round Robin (RR): New incoming connections are assigned to each realserver in turn.
  • Weighted Round Robin (WRR): RR scheduling with additional weighting factor to compensate for differences in realserver capabilities such as additional CPUs, more memory, and so on.
  • Least Connected (LC): New connections go to the realserver with the least number of connections. This is not necessarily the least-busy realserver, but it is a step in that direction.
  • Weighted Least Connection (WLC): LC with weighting.

It is a good idea to use RR scheduling for testing, as it is easy to confirm. You may want to add WRR and LC to your testing routine to confirm that they work as expected. The examples shown here assume RR scheduling and its variants.

Next, create a script to enable ipvsadm service forwarding to the realservers, and place a copy on each LVS director. This script will not be necessary when the later configuration of mon is done to automatically monitor for active realservers, but it aids in testing the ipvsadm component until then. Remember to double-check for proper network and http/https connectivity to each of your realservers before executing this script.
Listing 5. The HA_CONFIG.sh file

                
 #!/bin/sh

 # The virtual address on the director which acts as a cluster address
 VIRTUAL_CLUSTER_ADDRESS=192.168.71.205

 REAL_SERVER_IP_1=192.168.71.220
 REAL_SERVER_IP_2=192.168.71.150
 REAL_SERVER_IP_3=192.168.71.121
 REAL_SERVER_IP_4=192.168.71.145
 REAL_SERVER_IP_5=192.168.71.185
 REAL_SERVER_IP_6=192.168.71.186

 # set ip_forward ON for vs-nat director (1 on, 0 off).
 cat /proc/sys/net/ipv4/ip_forward
 echo "1" >/proc/sys/net/ipv4/ip_forward

 # director acts as the gw for realservers
 # Turn OFF icmp redirects (1 on, 0 off), if not the realservers might be clever and
 #  not use the director as the gateway!
 echo "0" >/proc/sys/net/ipv4/conf/all/send_redirects
 echo "0" >/proc/sys/net/ipv4/conf/default/send_redirects
 echo "0" >/proc/sys/net/ipv4/conf/eth0/send_redirects

 # Clear ipvsadm tables (better safe than sorry)
 /sbin/ipvsadm -C

 # We install LVS services with ipvsadm for HTTP and HTTPS connections with RR
 #  scheduling
 /sbin/ipvsadm -A -t $VIRTUAL_CLUSTER_ADDRESS:http -s rr
 /sbin/ipvsadm -A -t $VIRTUAL_CLUSTER_ADDRESS:https -s rr

 # First realserver
 # Forward HTTP to REAL_SERVER_IP_1 using LVS-NAT (-m), with weight=1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_1:http -m -w 1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:https -r $REAL_SERVER_IP_1:https -m -w 1

 # Second realserver
 # Forward HTTP to REAL_SERVER_IP_2 using LVS-NAT (-m), with weight=1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_2:http -m -w 1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:https -r $REAL_SERVER_IP_2:https -m -w 1

 # Third realserver
 # Forward HTTP to REAL_SERVER_IP_3 using LVS-NAT (-m), with weight=1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_3:http -m -w 1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:https -r $REAL_SERVER_IP_3:https -m -w 1

 # Fourth realserver
 # Forward HTTP to REAL_SERVER_IP_4 using LVS-NAT (-m), with weight=1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_4:http -m -w 1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:https -r $REAL_SERVER_IP_4:https -m -w 1

 # Fifth realserver
 # Forward HTTP to REAL_SERVER_IP_5 using LVS-NAT (-m), with weight=1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_5:http -m -w 1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:https -r $REAL_SERVER_IP_5:https -m -w 1

 # Sixth realserver
 # Forward HTTP to REAL_SERVER_IP_6 using LVS-NAT (-m), with weight=1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_6:http -m -w 1
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:https -r $REAL_SERVER_IP_6:https -m -w 1

 # We print the new ipvsadm table for inspection
 echo "NEW IPVSADM TABLE:"
 /sbin/ipvsadm

 

As you can see in Listing 5, the script simply enables the ipvsadm services, then has virtually identical stanzas to forward Web and SSL requests to each of the individual realservers. We used the -m option to specify NAT, and weight each realserver equally with a weight of 1 (-w 1). The weights specified are superfluous when using normal round robin scheduling (as the default weight is always 1). The option is presented only so that you may deviate to select weighted round robin. To do so change rr to wrr on the 2 consecutive lines below the comment about using round robin, and of course do not forget to adjust the weights accordingly. For more information about the various schedulers, consult the man page for ipvsadm.
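For example, a weighted round robin variant of the relevant lines might look like the following; the uneven weights are purely illustrative and assume the first realserver has more capacity than the second:

 # Use weighted round robin for the two virtual services
 /sbin/ipvsadm -A -t $VIRTUAL_CLUSTER_ADDRESS:http -s wrr
 /sbin/ipvsadm -A -t $VIRTUAL_CLUSTER_ADDRESS:https -s wrr

 # Give the better-provisioned realserver twice the share of new connections
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_1:http -m -w 2
 /sbin/ipvsadm -a -t $VIRTUAL_CLUSTER_ADDRESS:http -r $REAL_SERVER_IP_2:http -m -w 1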

You have now configured each director to handle incoming Web and SSL requests to the floating service IP by rewriting them and passing the work on to the realservers in succession. But in order to get traffic back from the realservers, and to perform the reverse rewriting before handing the response back to the client that made the request, you need to alter a few of the networking settings on the directors. This is necessary because of the decision to implement the LVS directors and realservers in a flat network topology (that is, all on the same subnet). We need to perform the following steps to force the Apache response traffic back through the directors rather than letting the realservers answer the clients directly:

echo "0" > /proc/sys/net/ipv4/conf/all/send_redirects
echo "0" > /proc/sys/net/ipv4/conf/default/send_redirects
echo "0" > /proc/sys/net/ipv4/conf/eth0/send_redirects

This was done to prevent the active LVS director from trying to take a TCP/IP shortcut by informing the realserver and floating service IP to talk directly to one another (since they are on the same subnet). Normally redirects are useful, as they improve performance by cutting out unnecessary middlemen in network connections. But in this case, it would have prevented the response traffic from being rewritten as is necessary for transparency to the client. In fact, if redirects were not disabled on the LVS director, the traffic being sent from the realserver directly to the client would appear to the client as an unsolicited network response and would be discarded.

At this point, it is time to set the default route of each of the realservers to point at the service floating IP address to ensure all responses are passed back to the director for packet rewriting before being passed back to the client that originated the request.
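On each realserver that amounts to something like the following (192.168.71.205 is the floating service IP from this article; make the equivalent change in your distribution's persistent network configuration as well):

 # Route all outbound realserver traffic back through the directors' floating IP
 route del default
 route add default gw 192.168.71.205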

Once redirects are disabled on the directors, and the realservers are configured to route all traffic through the floating service IP, you may proceed to test your HA LVS environment. To test the work done thus far, point a Web browser on a remote client to the floating service address of the LVS directors.

For testing in the laboratory, we used a Gecko-based browser (Mozilla), though any browser should suffice. To ensure the deployment was successful, disable caching in the browser, and click the refresh button multiple times. With each refresh, you should see that the Web page displayed is one of the self-identifying pages configured on the realservers. If you are using RR scheduling, you should observe the page cycling through each of realservers in succession.
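If you prefer the command line to a browser, a quick loop from a remote client (assuming curl is installed) shows the same rotation:

 # With RR scheduling you should see each realserver's hostname page in turn
 for i in 1 2 3 4 5 6; do curl -s http://192.168.71.205/; done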

Are you now thinking of ensuring that the LVS configuration starts automatically at boot? Don't do that just yet! There is one more step (Step 5): setting up active monitoring of the realservers, which keeps a dynamic list of which Apache nodes are eligible to service work requests.


Step 5: Installing and configuring mon on the LVS directors

So far, you have established a highly available service IP address and bound it to the pool of realserver instances. But you can never count on any individual Apache server being operational at any given time. With RR scheduling, if any given realserver becomes disabled, or ceases to respond to network traffic in a timely fashion, one sixth of the HTTP requests would fail!

Thus it is necessary to implement monitoring of the realservers on each of the LVS directors in order to dynamically add or remove them from the service pool. Another well-known open source package called mon is well suited for this task.

The mon solution is commonly used for monitoring LVS realnodes. Mon is relatively easy to configure, and is very extensible for people familiar with shell scripting. There are essentially three main steps to get everything working: installation, service monitoring configuration, and alert creation. Use your package management tool to handle the installation of mon. When finished with the installation, you need only to adjust the monitoring configuration, and create some alert scripts. The alert scripts are triggered when the monitors determine that a realserver has gone offline, or come back online.

Note that with Heartbeat Version 2 installations, monitoring of realservers can instead be accomplished by defining each realserver service as a cluster resource, or by using the Heartbeat ldirectord package.

By default, mon comes with several monitor mechanisms ready to be used. We altered the sample configuration file in /etc/mon.cf to make use of the HTTP service monitor.

In the mon configuration file, ensure the header reflects the proper paths. SLES10 is a 64-bit Linux image, but the sample configuration as shipped was for the default (31- or 32-bit) locations. The configuration file sample assumed the alerts and monitors are located in /usr/lib, which was incorrect for our particular installation. The parameters we altered were as follows:

alertdir = /usr/lib64/mon/alert.d
mondir = /usr/lib64/mon/mon.d

As you can see, we simply changed lib to lib64. Such a change may not be necessary for your distribution.

The next change to the configuration file was to specify the list of realservers to monitor. This was done with the following 6 directives:
Listing 6. Specifying realservers to monitor

                
 hostgroup litstat1 192.168.71.220 # realserver 1
 hostgroup litstat2 192.168.71.150
 hostgroup litstat3 192.168.71.121
 hostgroup litstat4 192.168.71.145
 hostgroup litstat5 192.168.71.185
 hostgroup litstat6 192.168.71.186 # realserver 6

 

If you want to add additional realservers, simply add additional entries here.

Once you have all of your definitions in place, you need to tell mon how to watch for failure, and what to do in case of failure. To do this, add the following monitor sections (one for each realserver). When done, you will need to place both the mon configuration file and the alert on each of the LVS heartbeat nodes, enabling each heartbeat cluster node to independently monitor all of the realservers.
Listing 7. The /etc/mon/mon.cf file

                
 #
 # global options
 #
 cfbasedir    = /etc/mon
 alertdir = /usr/lib64/mon/alert.d
 mondir       = /usr/lib64/mon/mon.d
 statedir     = /var/lib/mon
 logdir       = /var/log
 maxprocs     = 20
 histlength   = 100
 historicfile = mon_history.log
 randstart    = 60s
 #
 # authentication types:
 #   getpwnam      standard Unix passwd, NOT for shadow passwords
 #   shadow        Unix shadow passwords (not implemented)
 #   userfile      "mon" user file
 #
 authtype = getpwnam
 #
 # downtime logging, uncomment to enable
 #  if the server is running, don't forget to send a reset command
 #  when you change this
 #
 #dtlogfile = downtime.log
 dtlogging = yes
 #
 # NB:  hostgroup and watch entries are terminated with a blank line (or
 #  end of file).  Don't forget the blank lines between them or you lose.
 #
 #
 # group definitions (hostnames or IP addresses)
 # example:
 #
 # hostgroup servers www mail pop server4 server5
 #
 # For simplicity we monitor each individual server as if it were a "group"
 #  so we add only the hostname and the ip address of an individual node for each.
 hostgroup litstat1 192.168.71.220
 hostgroup litstat2 192.168.71.150
 hostgroup litstat3 192.168.71.121
 hostgroup litstat4 192.168.71.145
 hostgroup litstat5 192.168.71.185
 hostgroup litstat6 192.168.71.186
 #
 # Now we set identical watch definitions on each of our groups. They could be
 #  customized to treat individual servers differently, but we have made the
 #  configurations homogeneous here to match our homogeneous LVS configuration.
 #
  watch litstat1
      service http
         description http check servers
         interval 6s
         monitor http.monitor -p 80 -u /index.html
         allow_empty_group
         period wd {Mon-Sun}
             alert dowem.down.alert -h
                upalert dowem.up.alert -h
             alertevery 600s
                 alertafter 1
  watch litstat2
      service http
         description http check servers
         interval 6s
         monitor http.monitor -p 80 -u /index.html
         allow_empty_group
         period wd {Mon-Sun}
             alert dowem.down.alert -h
             upalert dowem.up.alert -h
             alertevery 600s
                 alertafter 1
  watch litstat3
      service http
         description http check servers
         interval 6s
         monitor http.monitor -p 80 -u /index.html
         allow_empty_group
         period wd {Mon-Sun}
             alert dowem.down.alert -h
             upalert dowem.up.alert -h
             alertevery 600s
                 alertafter 1
  watch litstat4
      service http
         description http check servers
         interval 6s
         monitor http.monitor -p 80 -u /index.html
         allow_empty_group
         period wd {Mon-Sun}
             alert dowem.down.alert -h
             upalert dowem.up.alert -h
             alertevery 600s
                 alertafter 1
  watch litstat5
      service http
         description http check servers
         interval 6s
         monitor http.monitor -p 80 -u /index.html
         allow_empty_group
         period wd {Mon-Sun}
             alert dowem.down.alert -h
             upalert dowem.up.alert -h
             alertevery 600s
                 alertafter 1
  watch litstat6
      service http
         description http check servers
         interval 6s
         monitor http.monitor -p 80 -u /index.html
         allow_empty_group
         period wd {Mon-Sun}
             alert dowem.down.alert -h
             upalert dowem.up.alert -h
             alertevery 600s
                 alertafter 1

 

Listing 7 tells mon to use the http.monitor, which is shipped with mon by default. Additionally, port 80 is specified as the port to use. Listing 7 also provides the specific page to request; you may opt to transmit a more efficient small segment of html as proof of success rather than a complicated default html page for your Web server.

The alert and upalert lines invoke scripts that must be placed in the alertdir specified at the top of the configuration file. The directory is typically something that is the distribution default, such as “/usr/lib64/mon/alert.d”. The alerts are responsible for telling LVS to add or remove Apache servers from the eligibility list (by invoking the ipvsadm command, as we shall see in a moment).

When one of the realservers fails the http test, dowem.down.alert will be executed by mon with several arguments automatically. Likewise, when the monitors determine that a realserver has come back online, the mon process executes dowem.up.alert with numerous arguments automatically. Feel free to alter the names of the alert scripts to suit your own deployment.

Save this file, and create the alerts (using simple bash scripting) in the alertdir. Listing 8 shows a bash script alert that will be invoked by mon when a real server connection is re-established.
Listing 8. Simple alert: we have connectivity

                
 #! /bin/bash

 #   The -h arg is followed by the hostname we are interested in acting on
 #   So we skip ahead to get the -h option since we don't care about the others

 REALSERVER=192.168.71.205

 while [ "$1" != "-h" ] ;
 do
         shift
 done

 ADDHOST=$2

 # For the HTTP service
 /sbin/ipvsadm -a -t $REALSERVER:http -r $ADDHOST:http -m -w 1

 # For the HTTPS service
 /sbin/ipvsadm -a -t $REALSERVER:https -r $ADDHOST:https -m -w 1

 

Listing 9 shows a bash script alert that will be invoked by mon when a real server connection is lost.
Listing 9. Simple alert: we have lost connectivity

                
 #! /bin/bash

 #   The -h arg is followed by the hostname we are interested in acting on
 #   So we skip ahead to get the -h option since we don't care about the others

 REALSERVER=192.168.71.205

 while [ "$1" != "-h" ] ;
 do
         shift
 done

 BADHOST=$2

 # For the HTTP service
 /sbin/ipvsadm -d -t $REALSERVER:http -r $BADHOST

 # For the HTTPS service
 /sbin/ipvsadm -d -t $REALSERVER:https -r $BADHOST

 

Both of those scripts use the ipvsadm command-line tool to dynamically add and remove realservers from the LVS tables. Note that these scripts are far from perfect. Because mon is monitoring only the http port with simple Web requests, the architecture as outlined here is vulnerable to situations where a given realserver answers http requests correctly but not SSL requests; in that case, we would fail to remove the offending realserver from the list of https candidates. This is easily remedied by writing alerts specific to each type of Web request and enabling a second https monitor for each realserver in the mon configuration file. That refinement is left as an exercise for the reader, and a starting point is sketched below.
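
As a starting point for that exercise, here is a sketch of an https-only down alert; it assumes you also add a second watch per realserver driven by an SSL-capable monitor (check what your mon package ships), and its matching up alert would mirror Listing 8 with a single https ipvsadm line.

 #! /bin/bash
 # dowem.https.down.alert -- hypothetical alert that removes a realserver from
 # the HTTPS service only, leaving the HTTP entry for http.monitor to manage.
 REALSERVER=192.168.71.205      # floating (virtual) service address

 # Skip ahead to the -h option; everything before it is ignored.
 while [ $# -gt 0 ] && [ "$1" != "-h" ]; do
         shift
 done
 BADHOST=$2

 /sbin/ipvsadm -d -t $REALSERVER:https -r $BADHOST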

To ensure that monitoring has been activated, stop and start the Apache process on each of the realservers in sequence, and observe how each of the directors reacts. Only when you have confirmed that each director is properly monitoring each realserver should you use the chkconfig command to make the mon process start automatically at boot. The specific command used here was chkconfig mon on, but this may vary based on your distribution. A sketch of this verification loop follows.
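
A minimal sketch of that loop, assuming Apache is managed by an init script named apache2 (it may be httpd on your distribution) and that mon is already running on the directors:

 # On one realserver: take Apache down and wait longer than the 6s check interval.
 /etc/init.d/apache2 stop
 sleep 15

 # On each director: confirm the realserver has been removed from the LVS table.
 /sbin/ipvsadm -L -n

 # Bring Apache back and confirm the realserver is re-added.
 /etc/init.d/apache2 start
 sleep 15
 /sbin/ipvsadm -L -n

 # Once every director reacts correctly, make mon persistent across reboots.
 chkconfig mon on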

With this last piece in place, you have finished the task of constructing a cross-system, highly-available Web server infrastructure. Of course, you might now opt to do more advanced work. For instance, you may have noticed that the mon daemon itself is not monitored (the heartbeat project can monitor mon for you), but with this last step, the basic foundation has been laid.


Troubleshooting

There are many reasons, voluntary or involuntary, why an active node could stop functioning properly in an HA cluster: the node could lose network connectivity to the other nodes, the heartbeat process could be stopped, there could be any of a number of environmental problems, and so on. To deliberately fail the active node, you can issue a halt on that node or set it to standby mode through hb_gui (a clean takedown). If you feel inclined to test the robustness of your environment, you might opt to be a bit more aggressive (yank the plug!). A few command-line ways to induce these failures are shown below.
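
A sketch of those failure injections, in increasing order of rudeness; the interface and init script names are assumptions for a SUSE-style install, so substitute your own:

 # Clean takedown: stop heartbeat gracefully so resources migrate.
 /etc/init.d/heartbeat stop

 # Simulated network failure: drop the interface carrying the heartbeat link.
 /sbin/ifconfig eth1 down

 # Outright node failure: halt the box (or simply pull the plug).
 halt

 # While testing, watch a surviving node react.
 tail -f /var/log/messages | grep -Ei 'heartbeat|crmd|quorum'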

Indicators and failover

The log files a system administrator sees on a Linux HA heartbeat system fall into two types, depending on whether or not a node is the recipient of the floating resource IP address. Log results for cluster members that did not receive the floating resource IP address look like this:
Listing 10. Log results for also-rans

                
 litsha21:~ # cat  /var/log/messages
 Jan 16 12:00:20 litsha21 heartbeat: [3057]: WARN: node litsha23: is dead
 Jan 16 12:00:21 litsha21 cib: [3065]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
 Jan 16 12:00:21 litsha21 cib: [3065]: info: mem_handle_event: instance=13, nodes=3, new=1, lost=0, n_idx=0, new_idx=3, old_idx=6
 Jan 16 12:00:21 litsha21 crmd: [3069]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
 Jan 16 12:00:21 litsha21 crmd: [3069]: info: mem_handle_event: instance=13, nodes=3, new=1, lost=0, n_idx=0, new_idx=3, old_idx=6
 Jan 16 12:00:21 litsha21 crmd: [3069]: info: crmd_ccm_msg_callback:callbacks.c Quorum lost after event=NOT PRIMARY (id=13)
 Jan 16 12:00:21 litsha21 heartbeat: [3057]: info: Link litsha23:eth1 dead.
 Jan 16 12:00:38 litsha21 ccm: [3064]: debug: quorum plugin: majority
 Jan 16 12:00:38 litsha21 ccm: [3064]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200
 Jan 16 12:00:38 litsha21 ccm: [3064]: debug: total_node_count=3, total_quorum_votes=300
 .................. Truncated For Brevity ..................
 Jan 16 12:00:40 litsha21 crmd: [3069]: info: update_dc:utils.c Set DC to litsha21 (1.0.6)
 Jan 16 12:00:41 litsha21 crmd: [3069]: info: do_state_transition:fsa.c litsha21: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
 Jan 16 12:00:41 litsha21 crmd: [3069]: info: do_state_transition:fsa.c All 2 cluster nodes responded to the join offer.
 Jan 16 12:00:41 litsha21 crmd: [3069]: info: update_attrd:join_dc.c Connecting to attrd...
 Jan 16 12:00:41 litsha21 cib: [3065]: info: sync_our_cib:messages.c Syncing CIB to all peers
 Jan 16 12:00:41 litsha21 attrd: [3068]: info: attrd_local_callback:attrd.c Sending full refresh
 .................. Truncated For Brevity ..................
 Jan 16 12:00:43 litsha21 pengine: [3112]: info: unpack_nodes:unpack.c Node litsha21 is in standby-mode
 Jan 16 12:00:43 litsha21 pengine: [3112]: info: determine_online_status:unpack.c Node litsha21 is online
 Jan 16 12:00:43 litsha21 pengine: [3112]: info: determine_online_status:unpack.c Node litsha22 is online
 Jan 16 12:00:43 litsha21 pengine: [3112]: info: IPaddr_1 (heartbeat::ocf:IPaddr): Stopped
 Jan 16 12:00:43 litsha21 pengine: [3112]: notice: StartRsc:native.c  litsha22   Start IPaddr_1
 Jan 16 12:00:43 litsha21 pengine: [3112]: notice: Recurring:native.c litsha22   IPaddr_1_monitor_5000
 Jan 16 12:00:43 litsha21 pengine: [3112]: notice: stage8:stages.c Created transition graph 0.
 .................. Truncated For Brevity ..................
 Jan 16 12:00:46 litsha21 mgmtd: [3070]: debug: update cib finished
 Jan 16 12:00:46 litsha21 crmd: [3069]: info: do_state_transition:fsa.c litsha21: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=do_msg_route ]
 Jan 16 12:00:46 litsha21 cib: [3118]: info: write_cib_contents:io.c Wrote version 0.53.593 of the CIB to disk (digest: 83b00c386e8b67c42d033a4141aaef90)

 

As you can see from Listing 10, a roll call is taken and sufficient members are available for a quorum vote. The vote is taken, and normal operation resumes with no further action needed.

In contrast, log results for cluster members that did receive the floating resource IP address are as follows:
Listing 11. The log file of the resource holder

                
 litsha22:~ # cat  /var/log/messages
 Jan 16 12:00:06 litsha22 syslog-ng[1276]: STATS: dropped 0
 Jan 16 12:01:51 litsha22 heartbeat: [3892]: WARN: node litsha23: is dead
 Jan 16 12:01:51 litsha22 heartbeat: [3892]: info: Link litsha23:eth1 dead.
 Jan 16 12:01:51 litsha22 cib: [3900]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
 Jan 16 12:01:51 litsha22 cib: [3900]: info: mem_handle_event: instance=13, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
 Jan 16 12:01:51 litsha22 crmd: [3904]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
 Jan 16 12:01:51 litsha22 crmd: [3904]: info: mem_handle_event: instance=13, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
 Jan 16 12:01:51 litsha22 crmd: [3904]: info: crmd_ccm_msg_callback:callbacks.c Quorum lost after event=NOT PRIMARY (id=13)
 Jan 16 12:02:09 litsha22 ccm: [3899]: debug: quorum plugin: majority
 Jan 16 12:02:09 litsha22 crmd: [3904]: info: do_election_count_vote:election.c Election check: vote from litsha21
 Jan 16 12:02:09 litsha22 ccm: [3899]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200
 Jan 16 12:02:09 litsha22 ccm: [3899]: debug: total_node_count=3, total_quorum_votes=300
 Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
 Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: no mbr_track info
 Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
 Jan 16 12:02:09 litsha22 cib: [3900]: info: mem_handle_event: instance=14, nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5
 Jan 16 12:02:09 litsha22 cib: [3900]: info: cib_ccm_msg_callback:callbacks.c LOST: litsha23
 Jan 16 12:02:09 litsha22 cib: [3900]: info: cib_ccm_msg_callback:callbacks.c PEER: litsha21
 Jan 16 12:02:09 litsha22 cib: [3900]: info: cib_ccm_msg_callback:callbacks.c PEER: litsha22
 .................. Truncated For Brevity ..................
 Jan 16 12:02:12 litsha22 crmd: [3904]: info: update_dc:utils.c Set DC to litsha21 (1.0.6)
 Jan 16 12:02:12 litsha22 crmd: [3904]: info: do_state_transition:fsa.c litsha22: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
 Jan 16 12:02:12 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3069, call:25): 0.52.585 -> 0.52.586 (ok)
 .................. Truncated For Brevity ..................
 Jan 16 12:02:14 litsha22 IPaddr[3998]: INFO: /sbin/ifconfig eth0:0 192.168.71.205 netmask 255.255.255.0 broadcast 192.168.71.255
 Jan 16 12:02:14 litsha22 IPaddr[3998]: INFO: Sending Gratuitous Arp for 192.168.71.205 on eth0:0 [eth0]
 Jan 16 12:02:14 litsha22 IPaddr[3998]: INFO: /usr/lib64/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.71.205 eth0 192.168.71.205 auto 192.168.71.205 ffffffffffff
 Jan 16 12:02:14 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (3) start_0 on IPaddr_1 complete
 Jan 16 12:02:14 litsha22 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
 Jan 16 12:02:14 litsha22 kernel: klogd 1.4.1, ---------- state change ----------
 Jan 16 12:02:14 litsha22 kernel: NET: Registered protocol family 17
 Jan 16 12:02:15 litsha22 crmd: [3904]: info: do_lrm_rsc_op:lrm.c Performing op monitor on IPaddr_1 (interval=5000ms, key=0:f9d962f0-4ed6-462d-a28d-e27b6532884c)
 Jan 16 12:02:15 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:18): 0.53.591 -> 0.53.592 (ok)
 Jan 16 12:02:15 litsha22 mgmtd: [3905]: debug: update cib finished

 

Listing 11 shows that this node has acquired the floating resource: the ifconfig line shows the eth0:0 device being created dynamically to maintain service.

As in Listing 10, a roll call is taken and sufficient members are available for a quorum vote. Here, the vote is followed by the ifconfig commands that claim the floating resource IP address.

As an additional means of indicating when a failure has occurred, you can log in to any of the cluster members and execute the hb_gui command. Through this method, you can determine by visual inspection which system has the floating resource.
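
If you prefer the command line to the GUI, a quick sketch for locating the floating address (192.168.71.205 is the address used throughout the listings):

 # Run on each cluster member; only the current resource holder shows the alias.
 /sbin/ifconfig eth0:0 | grep 192.168.71.205

 # If your heartbeat build ships the CRM command-line tools, crm_mon prints a
 # one-shot status view (option spelling can vary between releases).
 crm_mon -1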

Lastly, we would be remiss if we did not illustrate a sample log file from a no-quorum situation. If a single node cannot communicate with either of its peers, it has lost quorum (since 2 out of 3 is the majority in a three-member voting party). In this situation, the node understands that it has lost quorum and invokes the no-quorum policy handler. Listing 12 shows an example of the log file from such an event: when quorum is lost, a log entry says so, the cluster node showing this entry disowns the floating resource, and the ifconfig down statement releases it.
Listing 12. Log entry showing loss of quorum

                
 litsha22:~ # cat /var/log/messages
 ....................
 Jan 16 12:06:12 litsha22 ccm: [3899]: debug: quorum plugin: majority
 Jan 16 12:06:12 litsha22 ccm: [3899]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
 Jan 16 12:06:12 litsha22 ccm: [3899]: debug: total_node_count=3, total_quorum_votes=300
 .................. Truncated For Brevity ..................
 Jan 16 12:06:12 litsha22 crmd: [3904]: info: crmd_ccm_msg_callback:callbacks.c Quorum lost after event=INVALID (id=15)
 Jan 16 12:06:12 litsha22 crmd: [3904]: WARN: check_dead_member:ccm.c Our DC node (litsha21) left the cluster
 .................. Truncated For Brevity ..................
 Jan 16 12:06:14 litsha22 IPaddr[5145]: INFO: /sbin/route -n del -host 192.168.71.205
 Jan 16 12:06:15 litsha22 lrmd: [1619]: info: RA output: (IPaddr_1:stop:stderr) SIOCDELRT: No such process
 Jan 16 12:06:15 litsha22 IPaddr[5145]: INFO: /sbin/ifconfig eth0:0 192.168.71.205 down
 Jan 16 12:06:15 litsha22 IPaddr[5145]: INFO: IP Address 192.168.71.205 released
 Jan 16 12:06:15 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (6) stop_0 on IPaddr_1 complete
 Jan 16 12:06:15 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:32): 0.54.599 -> 0.54.600 (ok)
 Jan 16 12:06:15 litsha22 mgmtd: [3905]: debug: update cib finished

 

As you can see from Listing 12, when a node loses quorum it relinquishes any resources it holds, as dictated by the configured no-quorum policy. The choice of no-quorum policy is up to you, and the commands sketched below can help you confirm which policy is in effect.
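
A sketch for confirming the policy and spotting quorum events; the cibadmin option spellings can differ between heartbeat releases, so treat this as a guide rather than gospel:

 # Dump the cluster-wide options, which include the configured no-quorum policy.
 cibadmin -Q -o crm_config

 # Confirm from the logs when and why this node lost quorum.
 grep -i quorum /var/log/messages | tail -n 20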

Fail-back actions and messages

One of the more interesting implications of a properly-configured Linux HA system is that you do not need to take any action to re-instantiate a cluster member. Simply activating the Linux instance is sufficient to let the node rejoin its peers automatically. If you have configured a primary node (that is, one that is favored to gain the floating resource above all others), it will regain the floating resources automatically. Non-favored systems will simply join the eligibility pool and proceed as normal.
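
A minimal sketch of that re-introduction, assuming heartbeat is managed by its usual init script:

 # On the returning node, simply start heartbeat again (or just boot the machine).
 /etc/init.d/heartbeat start

 # On a surviving node, watch the peer rejoin and quorum return
 # (Listing 13 shows what to expect).
 tail -f /var/log/messages | grep -Ei 'status update|quorum'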

Adding another Linux instance back into the pool causes each node to take notice and, if possible, re-establish quorum. If quorum can be re-established, the floating resource is brought up again on one of the nodes.
Listing 13. Quorum is re-established

                
 litsha22:~ # tail -f /var/log/messages
 Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Heartbeat restart on node litsha21
 Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Link litsha21:eth1 up.
 Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Status update for node litsha21: status init
 Jan 16 12:09:02 litsha22 heartbeat: [3892]: info: Status update for node litsha21: status up
 Jan 16 12:09:22 litsha22 heartbeat: [3892]: debug: get_delnodelist: delnodelist=
 Jan 16 12:09:22 litsha22 heartbeat: [3892]: info: Status update for node litsha21: status active
 Jan 16 12:09:22 litsha22 cib: [3900]: info: cib_client_status_callback:callbacks.c Status update: Client litsha21/cib now has status [join]
 Jan 16 12:09:23 litsha22 heartbeat: [3892]: WARN: 1 lost packet(s) for [litsha21] [36:38]
 Jan 16 12:09:23 litsha22 heartbeat: [3892]: info: No pkts missing from litsha21!
 Jan 16 12:09:23 litsha22 crmd: [3904]: notice: crmd_client_status_callback:callbacks.c Status update: Client litsha21/crmd now has status [online]
 ....................
 Jan 16 12:09:31 litsha22 crmd: [3904]: info: crmd_ccm_msg_callback:callbacks.c Quorum (re)attained after event=NEW MEMBERSHIP (id=16)
 Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c NEW MEMBERSHIP: trans=16, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=5
 Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c     CURRENT: litsha22 [nodeid=1, born=13]
 Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c     CURRENT: litsha21 [nodeid=0, born=16]
 Jan 16 12:09:31 litsha22 crmd: [3904]: info: ccm_event_detail:ccm.c     NEW: litsha21 [nodeid=0, born=16]
 Jan 16 12:09:31 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Local-only Change (client:3904, call: 35): 0.54.600 (ok)
 Jan 16 12:09:31 litsha22 mgmtd: [3905]: debug: update cib finished
 ....................
 Jan 16 12:09:34 litsha22 crmd: [3904]: info: update_dc:utils.c Set DC to litsha22 (1.0.6)
 Jan 16 12:09:35 litsha22 cib: [3900]: info: sync_our_cib:messages.c Syncing CIB to litsha21
 Jan 16 12:09:35 litsha22 crmd: [3904]: info: do_state_transition:fsa.c litsha22: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
 Jan 16 12:09:35 litsha22 crmd: [3904]: info: do_state_transition:fsa.c All 2 cluster nodes responded to the join offer.
 Jan 16 12:09:35 litsha22 attrd: [3903]: info: attrd_local_callback:attrd.c Sending full refresh
 Jan 16 12:09:35 litsha22 cib: [3900]: info: sync_our_cib:messages.c Syncing CIB to all peers
 .........................
 Jan 16 12:09:37 litsha22 tengine: [5119]: info: send_rsc_command:actions.c Initiating action 4: IPaddr_1_start_0 on litsha22
 Jan 16 12:09:37 litsha22 tengine: [5119]: info: send_rsc_command:actions.c Initiating action 2: probe_complete on litsha21
 Jan 16 12:09:37 litsha22 crmd: [3904]: info: do_lrm_rsc_op:lrm.c Performing op start on IPaddr_1 (interval=0ms, key=2:c5131d14-a9d9-400c-a4b1-60d8f5fbbcce)
 Jan 16 12:09:37 litsha22 pengine: [5120]: info: process_pe_message:pengine.c Transition 2: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-72.bz2
 Jan 16 12:09:37 litsha22 IPaddr[5196]: INFO: /sbin/ifconfig eth0:0 192.168.71.205 netmask 255.255.255.0 broadcast 192.168.71.255
 Jan 16 12:09:37 litsha22 IPaddr[5196]: INFO: Sending Gratuitous Arp for 192.168.71.205 on eth0:0 [eth0]
 Jan 16 12:09:37 litsha22 IPaddr[5196]: INFO: /usr/lib64/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.71.205 eth0 192.168.71.205 auto 192.168.71.205 ffffffffffff
 Jan 16 12:09:37 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (7) start_0 on IPaddr_1 complete
 Jan 16 12:09:37 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:46): 0.55.607 -> 0.55.608 (ok)
 Jan 16 12:09:37 litsha22 mgmtd: [3905]: debug: update cib finished
 Jan 16 12:09:37 litsha22 tengine: [5119]: info: te_update_diff:callbacks.c Processing diff (cib_update): 0.55.607 -> 0.55.608
 Jan 16 12:09:37 litsha22 tengine: [5119]: info: match_graph_event:events.c Action IPaddr_1_start_0 (4) confirmed
 Jan 16 12:09:37 litsha22 tengine: [5119]: info: send_rsc_command:actions.c Initiating action 5: IPaddr_1_monitor_5000 on litsha22
 Jan 16 12:09:37 litsha22 crmd: [3904]: info: do_lrm_rsc_op:lrm.c Performing op monitor on IPaddr_1 (interval=5000ms, key=2:c5131d14-a9d9-400c-a4b1-60d8f5fbbcce)
 Jan 16 12:09:37 litsha22 cib: [5268]: info: write_cib_contents:io.c Wrote version 0.55.608 of the CIB to disk (digest: 98cb6685c25d14131c49a998dbbd0c35)
 Jan 16 12:09:37 litsha22 crmd: [3904]: info: process_lrm_event:lrm.c LRM operation (8) monitor_5000 on IPaddr_1 complete
 Jan 16 12:09:38 litsha22 cib: [3900]: info: cib_diff_notify:notify.c Update (client: 3904, call:47): 0.55.608 -> 0.55.609 (ok)
 Jan 16 12:09:38 litsha22 mgmtd: [3905]: debug: update cib finished

 

In Listing 13, you see that quorum has been re-established. When quorum is re-established, a vote is performed and litsha22 becomes the active node with the floating resource.


Next steps

High availability is best seen as a series of challenges, and the solution outlined here describes the first step. From here, there are many ways to move forward with your environment: you may choose to install redundant networking, a cluster file system to support the realservers, or more advanced middleware that supports clustering directly.

 


About the authors

Eli M. Dow is a software engineer in the IBM Test and Integration Center for Linux in Poughkeepsie, NY. He holds a B.S. degree in computer science and psychology and a master’s degree in computer science from Clarkson University. He is an alumnus of the Clarkson Open Source Institute. His interests include the GNOME desktop, human-computer interaction, virtualization, and Linux systems programming. He is the coauthor of an IBM Redbook Linux for IBM System z9 and IBM zSeries.

Frank LeFevre is a Senior Software Engineer in the IBM Systems and Technology Group in Poughkeepsie, NY. He has over 28 years of experience with IBM mainframe hardware and operating systems. He is currently the Team Leader for the Linux Virtual Server Platform Evaluation Test team.

[repost ]storm Concepts

original:https://github.com/nathanmarz/storm/wiki/Concepts

This page lists the main concepts of Storm and links to resources where you can find more information. The concepts discussed are:

  1. Topologies
  2. Streams
  3. Spouts
  4. Bolts
  5. Stream groupings
  6. Reliability
  7. Tasks
  8. Workers
  9. Configuration

Topologies

The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.

Resources:

Streams

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream’s tuples. Additionally, corresponding fields across tuples must have the same type. That is, the first field must always have the same type and the second field must always have the same type, but different fields can have different types. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of 1.

Resources:

Spouts

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.

The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.

The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.

Resources:

Bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).

Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.

When you declare a bolt’s input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping(1) subscribes to the default stream on component 1 and is equivalent to declarer.shuffleGrouping(1, DEFAULT_STREAM_ID).

The main method in bolts is the execute method, which takes a new tuple as input. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it’s safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.

It’s perfectly fine to launch new threads in bolts that do processing asynchronously; OutputCollector is thread-safe and can be called at any time.

Resources:

Stream groupings

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt’s tasks.

There are six types of stream groupings in Storm:

  1. Shuffle grouping: Tuples are randomly distributed across the bolt’s tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
  2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the “user-id” field, tuples with the same “user-id” will always go to the same task, but tuples with different “user-id”s may go to different tasks.
  3. All grouping: The stream is replicated across all the bolt’s tasks. Use this grouping with care.
  4. Global grouping: The entire stream goes to a single one of the bolt’s tasks. Specifically, it goes to the task with the lowest id.
  5. None grouping: This grouping specifies that you don’t care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
  6. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).

Resources:

  • TopologyBuilder: use this class to define topologies
  • InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt’s input streams and how those streams should be grouped
  • CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct groupings

Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a “message timeout” associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.

To take advantage of Storm’s reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you’ve finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you’re finished with a tuple using the ack method.

This is all explained in much more detail on Guaranteeing message processing.

Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

Workers

Topologies execute across one or more worker processes. Each worker process executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

Resources:

Configuration

Storm has a variety of configurations for tweaking the behavior of nimbus, supervisors, and running topologies. Some configurations are system configurations and cannot be modified on a topology by topology basis, whereas other configurations can be modified per topology.

Every configuration has a default value defined in defaults.yaml in the Storm codebase. You can override these configurations by defining a storm.yaml in the classpath of Nimbus and the supervisors. Finally, you can define a topology-specific configuration that you submit along with your topology when using StormSubmitter.

The preference order for configuration values is defaults.yaml < storm.yaml < topology-specific configuration. However, the topology-specific configuration can only override configs prefixed with “TOPOLOGY”.
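
As a client-side illustration of how those layers are typically supplied, a minimal sketch follows; the jar, class, and topology names are placeholders, and $STORM_HOME simply stands for wherever the Storm release is unpacked:

 # Cluster-wide overrides live in storm.yaml on the classpath of Nimbus and the
 # supervisors; the conf/ directory of a Storm release is the usual location.
 cat $STORM_HOME/conf/storm.yaml

 # The storm client also reads ~/.storm/storm.yaml for settings such as nimbus.host.
 cat ~/.storm/storm.yaml

 # Topology-specific configuration is attached in code via StormSubmitter and
 # travels with the jar when the topology is submitted.
 storm jar mytopology.jar com.example.MyTopology my-topology-name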

Resources:

[repost]Book: High Performance MySQL

As users come to depend on MySQL, they find that they have to deal with issues of reliability, scalability, and performance–issues that are not well documented but are critical to a smoothly functioning site. This book is an insider’s guide to these little understood topics. Author Jeremy Zawodny has managed large numbers of MySQL servers for mission-critical work at Yahoo!, maintained years of contacts with the MySQL AB team, and presents regularly at conferences. Jeremy and Derek have spent months experimenting, interviewing major users of MySQL, talking to MySQL AB, benchmarking, and writing some of their own tools in order to produce the information in this book. In High Performance MySQL you will learn about MySQL indexing and optimization in depth so you can make better use of these key features. You will learn practical replication, backup, and load-balancing strategies with information that goes beyond available tools to discuss their effects in real-life environments. And you’ll learn the supporting techniques you need to carry out these tasks, including advanced configuration, benchmarking, and investigating logs.

Topics include:
* A review of configuration and setup options
* Storage engines and table types
* Benchmarking
* Indexes
* Query Optimization
* Application Design
* Server Performance
* Replication
* Load-balancing
* Backup and Recovery
* Security

http://www.amazon.com/dp/0596003064?tag=porssioutpos-20&camp=14573&creative=327641&linkCode=as1&creativeASIN=0596003064&adid=1DJWNYB6291D5HRKC7E0&

original:

Book: High Performance MySQL

[repost]Product: Tungsten Replicator

original:

Product: Tungsten Replicator

With Tungsten Replicator Continuent is trying to deliver a better master/slave replication system. Their goal: scalability, reliability with seamless failover, no performance loss.

From their website:
The Tungsten Replicator implements open source database-neutral master/slave replication. Master/slave replication is a highly flexible technology that can solve a wide variety of problems including the following:

* Availability – Failing over to a slave database if your master database dies
* Performance Scaling – Spreading reads across many copies of data
* Cross-Site Clustering – Maintaining active database replicas across WANs
* Change Data Capture – Extracting changes to load data warehouses or update other systems
* Zero Downtime Upgrade – Performing upgrades on a slave server which then becomes the master

The Tungsten Replicator architecture is flexible and designed to support addition of new databases easily. It includes pluggable extractor and applier modules to help transfer data from master to slave.


The Replicator includes a number of specialized features designed to improve its usefulness for particular problems such as availability.

* Replicated changes have transaction IDs and are stored in a transaction history log that is identical for each server. This feature allows masters and slaves to exchange roles easily.
* Smooth procedures for planned and unplanned failover.
* Built-in consistency check tables and events allow users to check consistency between tables without stopping replication or applications.
* Support for statement as well as row replication.
* Hooks to allow data transformations when replicating between different database types.

Tungsten Replicator is not a toy. It is designed to allow commercial construction of robust database clusters.

Related Articles

  • Tungsten ScaleOut Stack – an open source collection of integrated projects for database scale-out making use of commodity hardware.
  • Continuent Intros Tungsten Replicator by Shamila Janakiraman.
[repost]Blog: Adding Simplicity by Dan Pritchett

original:http://highscalability.com/blog/2007/9/17/blog-adding-simplicity-by-dan-pritchett.html

Dan has genuine insight into building software and large scale scalable systems in particular. You’ll always learn something interesting reading his blog.

A Quick Hit of What’s Inside

Inverting the Reliability Stack, In Support of Non-Stop Software, Chaotic Perspectives, Latency Exists, Cope!, A Real eBay Architect Analyzes Part 3, Avoiding Two Phase Commit, Redux

Site: http://www.addsimplicity.com/