Testing RabbitMQ Clustering with a Message Simulator – Part 2

I am pleased to announce the first release of the Message Simulator, a lightweight automation tool for helping harden your cluster. This first version targets RabbitMQ clusters. The Message Simulator’s source code is hosted on GitHub: https://github.com/GetLevvel/message-simulator. Now we can start using the shiny new cluster we built in the previous post, Testing RabbitMQ Clustering Using Docker – Part 1.

When things break on production it helps when you have seen it before and already know how to fix it!

SIMPLIFYING CLUSTER TESTING

Having confidence that your highly available cluster will handle, perform, and scale is a complex task requiring expertise beyond just knowing the core clustering technology. Understanding how your client applications will utilize your cluster, and then deciding how to balance tradeoffs between performance, resilience, and reliability, can require iterations of testing, deployments, debugging, and downtime. When teams set out to battle test a cluster, we get asked questions whose answers help find that balance faster. Usually these questions are about:

  1. What are the best practices for setting up a RabbitMQ cluster for High Availability?
  2. How many cluster nodes do we need?
  3. How can we prevent message loss?
  4. What happens when a node/vm/host/network/cluster/datacenter crashes?
  5. We need to add a new node. How do we scale up the cluster?
  6. What happens when we scale up the cluster?
  7. When will our producing and consuming client applications see bottlenecks?
  8. How do we quickly restore services for an outage event?
  9. What happens when we Federate?
  10. What happens when we cluster across EC2 Availability Zones?

This post will not be long enough to address all of these, but these questions are why we are open sourcing the Message Simulator. If you want to have confidence your cluster will handle your specific messaging requirements, then it needs to be tested like any other component in your production stack. Testing your cluster with your exact messaging needs lets you decide where your cluster stands with regard to bottlenecks, velocities, thresholds, optimizations, and support overhead, and lets you build out your “how do we restore services” runbook before someone gets paged in the middle of the night.

Hosting your own Highly Available RabbitMQ cluster is not complex, but knowing your cluster can handle your exact needs is not something the documentation alone is going to teach.

WHY DO WE USE THE MESSAGE SIMULATOR?

We initially built the Message Simulator to evaluate the performance hit for having a 3-node RabbitMQ cluster configured to auto synchronize after a crash.

Figure 1 – Simulating a Cluster Node Crash

Similar to the Netflix Simian Army, we wanted a way to crash clustered RabbitMQ brokers in creative and extensible ways. This led us to start building a way to reliably test this process, and with today’s release the entire simulation of external events, broker entities, and messages is outlined in one JSON file. Each JSON file is a self-contained Messaging Simulation Model for regression testing your cluster. Once we centralized everything into a file, we could group the files based on the messaging use case they simulate. This led us to organize the simulations by type and purpose. With today’s first release there are:

  1. Load Simulations – These tests are focused on creating a constant and predictable load on your cluster. These tests are the first step in preparing a cluster for production.
  2. High Availability Simulations – These tests are about building confidence in a cluster’s resiliency, durability, persistence, client handling, monitoring tools, and determining your team’s outage handling processes when events outside of normal operation occur.
  3. Stress Simulations – These tests are focused on creating broker entities that will stress the cluster in unexpected ways. For example, the first test creates a single Fanout Exchange that has over 150 Queues bound to it and then forks 10 independent Publisher processes that will help publish messages to the same Fanout Exchange at the same time. The goal is not to exceed your cluster’s ability but to stress the internal cluster’s processing and resource utilization to see how this internal stress affects your monitoring tools and more importantly where bottlenecks will occur for your cluster’s client applications.
  4. Burst Simulations are coming soon. These tests will focus on time-synced message traffic spike simulations. The general concept is around a use case where a cluster needs to support a viral event or an expected traffic spike. As a hypothetical example, consider a new marketing release that causes 50,000+ users to hit services within the first couple of hours. How does the cluster handle a period of consistent load that is periodically exposed to coordinated high volume message spikes? As a side benefit, you can find out how those client consumer applications handle that increased traffic too.
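
To make the Stress Simulation above more concrete, here is a minimal Python sketch that generates the broker entities for one Fanout Exchange with 150 bound Queues. It is an illustration only, not the simulator's own code: the function name, the Stress_1 naming prefix, and the per-entity keys (Name, Type, Exchange, Queue, RoutingKey) are assumptions and may not match the simulator's real JSON schema. It only builds a description in memory and never touches a broker.

```python
def build_fanout_stress_entities(prefix="Stress_1", queue_count=150):
    """Describe one fanout exchange with many queues bound to it.

    The per-entity key names are hypothetical, chosen for illustration.
    """
    exchange_name = f"{prefix}.Ex"
    exchanges = [{"Name": exchange_name, "Type": "fanout"}]
    queues = [{"Name": f"{prefix}.Q{i}"} for i in range(1, queue_count + 1)]
    # A fanout exchange ignores routing keys, so every binding leaves it empty.
    bindings = [
        {"Exchange": exchange_name, "Queue": queue["Name"], "RoutingKey": ""}
        for queue in queues
    ]
    return {
        "Exchanges": exchanges,
        "Queues": queues,
        "Bindings": bindings,
        "Messages": [],
    }


entities = build_fanout_stress_entities()
print(len(entities["Queues"]), "queues bound to", entities["Exchanges"][0]["Name"])
```

Generating the entities from a loop keeps the test size a single parameter, so the same sketch can describe 150 queues today and a larger fanout later.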

GETTING STARTED

The simulator works with any new or existing RabbitMQ cluster and can run on any system that supports Python. The simulator could run outside of your data center or on a VM beside the cluster. So long as there’s a connection, you can start running messaging simulations. Each simulation uses a different messaging pipeline and route map, which means you can run multiple simulations at the same time. If you want to run multiple Load simulations while running a High Availability simulation, all the while letting your producing and consuming client applications use your cluster, you can do that too.

We wanted a tool to beat up a cluster and see what happens to our client applications, and so we made it easy to model your messaging traffic while doing terrible things to it in real time.

For those interested in utilizing a RabbitMQ cluster:

What happens when you simulate a network outage by blocking the cluster’s internal communication port it uses to talk to the other cluster nodes? When does the cluster realize it lost a node? How do you restore it? What kind of reporting tools can detect this?
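
As a rough idea of what blocking the internal communication port can look like, the Python sketch below builds (but does not execute) iptables rules that drop TCP traffic on a cluster port. It is not how the simulator itself implements the outage. It assumes a Linux host with iptables and RabbitMQ's default inter-node port of 25672; your cluster may use a different port, and the helper names are hypothetical.

```python
import shlex

# Assumption: RabbitMQ's default inter-node (Erlang distribution) port.
CLUSTER_PORT = 25672


def port_rule_commands(action, port=CLUSTER_PORT):
    """Return iptables commands that add ("-A") or delete ("-D") DROP rules.

    One rule covers traffic arriving on the port, the other covers traffic
    this node sends to the same port on its peers.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A (append) or -D (delete)")
    return [
        ["iptables", action, chain, "-p", "tcp", "--dport", str(port), "-j", "DROP"]
        for chain in ("INPUT", "OUTPUT")
    ]


# Print the commands instead of running them; they need root on the node.
for command in port_rule_commands("-A") + port_rule_commands("-D"):
    print(shlex.join(command))
```

Printing the delete rules alongside the append rules is deliberate: having the restore step ready before you break anything is the point of the exercise.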

Want to see if your cluster is ready? You can find out with this simulation. (If things go horribly awry you can take a look at our Troubleshooting section for putting the pieces back together.)

RUNNING SIMULATIONS

You can run a simulation with:

$ ./run_message_simulation.py -f Path_to_Simulation_File

Read more about how to run simulations here: How to Run Simulations

BUILDING YOUR OWN SIMULATION

Not all clusters need to support the same type of message traffic (for example, a low-latency, high-response application versus a safety system requiring no message loss). To keep things generic, we built the simulator to take in a Message Simulation modeled in a JSON file. Each JSON Simulation Model must implement the following sections:

{
  "Simulation" : {
      "Name" : "Your_Name_For_This_Simulation",
      "Type" : "Rabbit",
      "Rabbit" : {
      }
  },
  "Consumers" : { },
  "BrokerEntities" : {
      "Exchanges" : [ ],
      "Queues"    : [ ],
      "Bindings"  : [ ],
      "Messages"  : [ ]
  }
}
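
Before handing a new model to the simulator, it can help to check that the required sections shown above are present. The following is a minimal Python sketch of such a check. It is not part of the Message Simulator: the function name is hypothetical, and it only verifies the section names from the skeleton above, not the contents of each section.

```python
import json

REQUIRED_SIMULATION_KEYS = ("Name", "Type")
REQUIRED_BROKER_ENTITY_LISTS = ("Exchanges", "Queues", "Bindings", "Messages")


def find_missing_sections(model):
    """Return the dotted names of required sections missing from a model."""
    missing = []
    simulation = model.get("Simulation")
    if not isinstance(simulation, dict):
        missing.append("Simulation")
    else:
        missing += [
            f"Simulation.{key}"
            for key in REQUIRED_SIMULATION_KEYS
            if key not in simulation
        ]
    if "Consumers" not in model:
        missing.append("Consumers")
    entities = model.get("BrokerEntities")
    if not isinstance(entities, dict):
        missing.append("BrokerEntities")
    else:
        # Each broker entity section is expected to be a JSON list.
        missing += [
            f"BrokerEntities.{key}"
            for key in REQUIRED_BROKER_ENTITY_LISTS
            if not isinstance(entities.get(key), list)
        ]
    return missing


skeleton = json.loads("""
{
  "Simulation": {"Name": "Your_Name_For_This_Simulation", "Type": "Rabbit", "Rabbit": {}},
  "Consumers": {},
  "BrokerEntities": {"Exchanges": [], "Queues": [], "Bindings": [], "Messages": []}
}
""")
print(find_missing_sections(skeleton))
```

An empty list means every required section name was found; anything else names the section to add.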

Read more about Building Your Own Simulation

For Message Simulation specifics please refer to these sections:

  1. Configuring the RabbitMQ Connection
  2. Adding Exchanges to the cluster
  3. Adding Queues to the cluster
  4. Bindings between Exchanges routing messages to Queues
  5. Describing Consumers for the simulation
  6. Processing Messages and Events during the simulation
  7. Defining your own Messages and Custom Control Events
  8. Examining the full Message Simulation JSON API

WHAT IS A CUSTOM EVENT VERSUS A MESSAGE

To turn simulations into regression tests that we could always run again, we adopted the convention that everything is processed in the same sequence on every run. To do this, we made the simulator support sending AMQP messages from the same list as Custom Event messages. Event messages allow the cluster to be modified outside of normal AMQP messaging operation (producing, routing, consuming). Consider High Availability Test 2, which performs these simulation steps in order:

  1. Send 100 AMQP Messages to the Exchange HA_2.Ex with HA_2.A as the Routing Key
  2. Send 100 AMQP Messages to the Exchange HA_2.Ex with HA_2.B as the Routing Key
  3. Stop a Broker targeting rabbit3
  4. Send 500 AMQP Messages to the Exchange HA_2.Ex with HA_2.B as the Routing Key
  5. Send 500 AMQP Messages to the Exchange HA_2.Ex with HA_2.A as the Routing Key
  6. Start a Broker targeting rabbit3
  7. Send 500 AMQP Messages to the Exchange HA_2.Ex with HA_2.A as the Routing Key
  8. Send 500 AMQP Messages to the Exchange HA_2.Ex with HA_2.B as the Routing Key

This test starts by introducing a simple amount of message load, crashes a node, sees if the cluster can still route messages, restores the crashed node, and then checks if messaging still works when the cluster’s third node comes back online. The goal is to make the simulation JSON flexible and generic so we can focus on writing JSON tests instead of modifying the underlying code to test a cluster. The Message Simulator currently supports these Custom Events and Message Types:

  • AMQP
  • Stop Broker
  • Start Broker
  • Start Worker Publisher
  • Add Network Latency Event
  • Remove All Network Latency Events
  • Validate SSH Credentials
  • Validate Docker Credentials
  • Reset All Broker Entities
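
The idea of keeping AMQP messages and Custom Events in one ordered list can be sketched in a few lines of Python. The example below models the eight steps of High Availability Test 2 and walks them in sequence. It is an illustration under assumptions, not the simulator's implementation: the step keys (Type, Exchange, RoutingKey, Count, Target) and the handler functions are hypothetical, and the handlers only describe each step instead of publishing messages or stopping a broker.

```python
# The eight steps of High Availability Test 2, in the order they run.
HA_2_STEPS = [
    {"Type": "AMQP", "Exchange": "HA_2.Ex", "RoutingKey": "HA_2.A", "Count": 100},
    {"Type": "AMQP", "Exchange": "HA_2.Ex", "RoutingKey": "HA_2.B", "Count": 100},
    {"Type": "Stop Broker", "Target": "rabbit3"},
    {"Type": "AMQP", "Exchange": "HA_2.Ex", "RoutingKey": "HA_2.B", "Count": 500},
    {"Type": "AMQP", "Exchange": "HA_2.Ex", "RoutingKey": "HA_2.A", "Count": 500},
    {"Type": "Start Broker", "Target": "rabbit3"},
    {"Type": "AMQP", "Exchange": "HA_2.Ex", "RoutingKey": "HA_2.A", "Count": 500},
    {"Type": "AMQP", "Exchange": "HA_2.Ex", "RoutingKey": "HA_2.B", "Count": 500},
]


def describe_amqp(step):
    return (f"publish {step['Count']} messages to {step['Exchange']} "
            f"with routing key {step['RoutingKey']}")


def describe_broker_event(step):
    return f"{step['Type'].lower()} on {step['Target']}"


# One handler per supported type; a real runner would act on the cluster here.
HANDLERS = {
    "AMQP": describe_amqp,
    "Stop Broker": describe_broker_event,
    "Start Broker": describe_broker_event,
}


def run_in_order(steps, handlers):
    """Process every step strictly in list order and return a log of actions."""
    log = []
    for step in steps:
        handler = handlers.get(step["Type"])
        if handler is None:
            raise ValueError(f"unsupported step type: {step['Type']}")
        log.append(handler(step))
    return log


for number, action in enumerate(run_in_order(HA_2_STEPS, HANDLERS), start=1):
    print(number, action)
```

Because messages and events share one list, the broker stop always lands between the same two batches of messages, which is what makes the run repeatable.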

Troubleshooting RabbitMQ Clusters

Inevitably a simulation will end up breaking your cluster in some unexpected way. Finding out how to restore services before it is mission critical is always a good exercise, and that’s why we added a simple troubleshooting guide for restoring your RabbitMQ cluster to normal operation. While it is not comprehensive for all cases, it is focused on getting your cluster back up quickly so you can test whether the new configuration changes are more stable than the previous version.

Read more about Troubleshooting

Upcoming High Availability Simulations

The goal of testing High Availability is to validate that your cluster can meet your client applications’ messaging demand (hopefully with a large set of 9’s). The Simulator already includes Stress and Load tests, and now we are interested in continuing to build out more extensive High Availability simulations.

We have started a list of High Availability Simulations Coming Soon. Let us know if you would like to see a specific High Availability simulation.

For now the focus will be on:

  • Utilizing different Broker entities combined with ha-policies during external events
  • Tests for demonstrating how messages can get copied but not lost with ha-policies, durability, and persistence enabled
  • Unsynchronized Cluster Slaves trying to join a running cluster during a Simulation while the Master Node crashes
  • Unsynchronized Cluster Slaves trying to join a running cluster during a Simulation and running an explicit synchronization
  • Internal cluster TCP network events at varying flapping rates instead of being 100% unavailable (like ha_3_network_latency_event_during_messaging.json)
  • Forcibly disconnecting producers and consumers from the default RabbitMQ TCP Port
  • Tests for demonstrating message loss without HA
  • Tests filling an HDD using brokers set up in disc or ram mode and persistence and durability enabled
  • 100% CPU and memory utilization tests
  • More tests aimed at helping diagnose network partitioning and split brain events
  • Cluster nodes that leave and join clusters repeatedly
  • Restarting Cluster members when the cluster is set to perform automatic synchronization on startup
  • Full cluster outage restoration during messaging
  • Federation network latency and outage events
  • Large message simulations during an outage event
  • Alternate Exchange tests

As time goes on, we will keep the most up-to-date list here: https://github.com/GetLevvel/message-simulator#ha-tests-coming-soon

Thanks for reading!

Well, that’s it for this post. We are pretty excited to hear your feedback on this automation tool, and hopefully you find it valuable. Let us know if you would like to have specific simulations added to the GitHub repository (https://github.com/GetLevvel/message-simulator). If your organization would like assistance determining your RabbitMQ clustering strategy, please reach out to us at Levvel and we can get you started. If you do not have a cluster to run some of these simulations, the previous post can help get you going with your own Docker RabbitMQ Cluster: Testing RabbitMQ Clustering Using Docker – Part 1.

So what’s coming next?

The next post will continue exploring High Availability simulations for the purposes of integration with a large, distributed framework utilizing a RabbitMQ cluster as a core component.

For the ambitious among you, I included an easter egg inside the Message Simulator’s repository: a ‘How To Guide’ for the framework I will be discussing in the next post. See if you can find it!

- Jay

Jay Johnson

Principal Consultant

IT Professional with 10+ years of experience in architecture, design and implementation of large distributed, real-time systems across a variety of environments. Focused on executing aggressive timelines by leveraging my expertise in technology, process, and best practices.

GitHub Portfolio: https://github.com/jay-johnson