Running a Distributed Docker Swarm on AWS

Recently I have been discussing how to set up and test a Highly Available RabbitMQ cluster inside Docker containers. As a few readers figured out, the next post in that series will integrate a distributed Spring XD environment with the RabbitMQ cluster, all running inside Docker containers. Since Spring XD has been docker-ized (https://hub.docker.com/u/springxd/) since 1.0, we can use Docker Compose to natively link the RabbitMQ and Spring XD containers together for connectivity. I plan on discussing what Spring XD is and what it does well in the upcoming article, but today it is beyond the scope of this post. If you want to learn more, you can read about it here.

BREAKING OUT OF A SINGLE HOST DEPLOYMENT

Today I want to take a step back and address a few questions about hosting a Docker RabbitMQ cluster (and Spring XD) on a single host. Deployment to a single host is not ideal for true HA or where a production SLA is required. The questions that came from this limitation are about supporting extensible, scalable deployments beyond the single-host scenario. Here are some I want to discuss today:

  1. How can I run, deploy, and support the HA RabbitMQ cluster spanning multiple hosts?
  2. How can we deploy the cluster in a consistent manner?
  3. Can we run the cluster inside an AWS VPC?
  4. How do we run the cluster inside our own data center (not on AWS)?

Until Docker 1.9, there was no simple way to answer all of these questions. Docker had no native support for deploying containers across multiple hosts for container redundancy or placement strategies. Now that 1.9 is released, we can re-evaluate these great questions.

Today’s post will cover how to set up a distributed Docker Swarm using the new production-ready Docker 1.9, Docker Swarm 1.0, and Docker Compose 1.5.1. Specifically, I want to share how to build a development Swarm (running on a single host), a distributed Swarm (3 hosts), and a production Swarm environment (9 hosts).

Let’s get started!

THE BASICS - DOCKER SWARM COMPONENTS

Users who are new to Docker Swarm should read over the documentation in case feature drift invalidates some of the following sections. Docker is being actively developed, and that’s a good thing.

COMPONENTS AND TERMINOLOGY

  1. Docker Daemon (daemon) – A process that handles container management on a single host or VM.
  2. Swarm Join (join) – A process that handles registering a single host with a Service Discovery Manager and exposing the host’s Docker Daemon as an available service.
  3. Swarm Node (node) – This is not an official Docker term but a logical association for a host machine in the Swarm that is only responsible for running containers (I think of these as similar to an OpenShift Node for hosting applications). At a minimum, a Swarm Node needs to have the Docker Daemon and the Swarm Join running on it.
  4. Swarm Manager (manager) – The service a user uses for managing containers across the registered Swarm Node(s). This is the endpoint for interfacing with a Swarm environment.
  5. Service Discovery Managers (consul for this post) – Multiple service discovery managers are supported by Docker Swarm. When I started with 1.9, I read this assessment and decided to use consul. These managers track registered services, members, and sessions between replicas of themselves. Swarm Joins and Swarm Managers connect with these instances and the cluster of Service Discovery Managers handles the rest (including outages and adding/removing Swarm Nodes while in operation).
  6. Swarm Cluster Token (token) – A Docker Swarm can be deployed without running your own Service Discovery Managers; however, this means the token will be shared over an encrypted connection with Docker Hub. This is a useful way to get started with a single-host environment, but consider that your token and environment are now dependent on a third party. It is not recommended for production use (in a distributed environment I saw Nodes disappear from membership with this approach). If you want to run a distributed Swarm using a token, then all Swarm Joins and Swarm Managers need to specify token://<your token string> instead of consul://<consul uri>/<swarm name e.g. myswarm>. The advantage is simplicity: you do not need to integrate with a service discovery tool to get going. The disadvantage is that each Swarm Node in the Cluster needs to know the token after a new Swarm Cluster has been created. I found assigning the Cluster token to the EC2 display name as a tag to be a simple way to sync multiple Swarm Nodes, but it also meant the environment had to have an initial starting Node up and running that was responsible for creating the first Swarm Cluster before the other Swarm Nodes could be provisioned. Readers familiar with syncing the RabbitMQ Erlang cookie file to start a cluster will recognize the pattern: write the file to disk after checking the initial EC2 host’s display name on startup (or use some other persistent location like S3). Alternatively, consul can be started in bootstrap single-server mode for hosting a self-contained single-host environment (consul agent -server -bootstrap-expect 1 -data-dir /tmp/consul). For demonstration purposes I will just use Docker Hub.
  7. Docker Compose (compose) – This is a tool for defining and running multi-container Docker applications. This tool takes a simple configuration yml file and deploys the containers like a manifest. As a sample, here’s one I released for deploying the HA RabbitMQ cluster as a Docker Compose configuration yml file. It is the deployment tool for utilizing the new multi-host networking features that allow the Swarm Nodes to have applications running across distributed hosts. Compose also handles placement strategies for ensuring your containers are distributed evenly (or not) across the Swarm Nodes. This allows for container redundancy at the host level which is good for production resiliency.
  8. Overlay Network – This is the new native Docker networking type for deploying containers that are linked only with the other containers on the network. It utilizes VXLAN technology and is really interesting for its ability to separate containers (even on the same node) as well as link containers across multiple Swarm Nodes. The containers deployed with an overlay network can see the other linked containers on the network by using the entries defined in the /etc/hosts file in the container. I highly recommend reading more about overlay networks for building software defined networks.
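As a quick sketch of how an overlay network is used in practice (assuming the Swarm Manager is already exported as DOCKER_HOST; the network and container names here are made up for illustration, not part of this environment):

```shell
# Create an overlay network; it becomes visible to every Swarm Node
docker network create -d overlay demonet

# Start two containers attached to the network; Swarm may place them on different hosts
docker run -itd --name=web --net=demonet busybox
docker run -itd --name=db --net=demonet busybox

# Containers on the same overlay network resolve each other by name via /etc/hosts
docker exec web ping -c 1 db
```

Containers that are not attached to demonet cannot reach these two, even when they run on the same Swarm Node.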

INSTALLATION

Since 1.9 is fairly new, figuring out how to install each component can seem a little overwhelming. Here is how to install each component on Fedora or an AWS AMI.

  1. Docker Daemon

    curl -sSL -O https://get.docker.com/builds/Linux/x86_64/docker-1.9.0
    chmod +x docker-1.9.0
    mv docker-1.9.0 /usr/local/bin/docker
  2. Docker Machine

    curl -L https://github.com/docker/machine/releases/download/v0.5.0/docker-machine_linux-amd64.zip > machine.zip
    unzip machine.zip 
    rm machine.zip 
    mv -f docker-machine* /usr/local/bin
  3. Docker Swarm

    # go get installs into $GOPATH/bin, so GOPATH must point at a real directory
    export GOPATH=/opt/gopath
    go get github.com/docker/swarm
    chmod +x $GOPATH/bin/swarm
    rm -f /usr/local/bin/swarm
    ln -s $GOPATH/bin/swarm /usr/local/bin/swarm
  4. Docker Compose

    curl -L https://github.com/docker/compose/releases/download/1.5.1/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose 
    chmod +x /usr/local/bin/docker-compose
  5. Consul

    wget https://releases.hashicorp.com/consul/0.5.2/consul_0.5.2_linux_amd64.zip -O /tmp/consul.zip
    unzip /tmp/consul.zip -d /tmp
    cp /tmp/consul /usr/local/bin
    cp /tmp/consul /usr/bin
    rm -f /tmp/consul.zip /tmp/consul

I created this GIST for installing all the components on a host at once:

DEVELOPMENT SWARM ENVIRONMENT (SINGLE HOST)

Most of the documents and tutorials I found want to use docker-machine to provision a virtualbox host for the swarm. I do not want any unnecessary components or possible issues to debug on a production environment, so I started by provisioning a single AWS EC2 instance (a t2.small running an Amazon Linux AMI) and then running the installation script to get everything installed. When creating a single host development environment I found it beneficial to consider the following reference diagram:

Docker Swarm Single Host Environment Reference Diagram

Here is how to set up the environment, in order:

  1. Start the Docker Daemon running with:

    nohup /usr/local/bin/docker daemon -H tcp://INTERNAL_IP_ADDRESS:2375 &

  2. Create the Swarm Cluster and store the Token in a file:

    mkdir -p /opt/swarm
    /usr/local/bin/swarm create > /opt/swarm/clustertoken
    echo "Cluster Token: "
    cat /opt/swarm/clustertoken
    chmod 666 /opt/swarm/clustertoken
  3. Start the Swarm Join by running:

    nohup /usr/local/bin/swarm join --addr=INTERNAL_IP_ADDRESS:2375 token://$(cat /opt/swarm/clustertoken) &
  4. Start the Swarm Manage by running:

    nohup /usr/local/bin/swarm manage -H tcp://INTERNAL_IP_ADDRESS:4000 token://$(cat /opt/swarm/clustertoken) &
  5. Point the Docker command line interface at the Swarm Manager:

    export DOCKER_HOST=INTERNAL_IP_ADDRESS:4000
  6. View the Swarm info and ensure the host is registered in the Docker Swarm:

    # docker info
    Containers: 0
    Images: 0
    Role: primary
    Primary: 10.0.0.137:4000
    Strategy: spread
    Filters: health, port, dependency, affinity, constraint
    Nodes: 1
    swarm1.internallevvel.com: 10.0.0.137:2375
    └ Containers: 0
    └ Reserved CPUs: 0 / 1
    └ Reserved Memory: 0 B / 2.054 GiB
    └ Labels: executiondriver=native-0.2, kernelversion=4.1.10-17.31.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2015.09, storagedriver=devicemapper
  7. Start a container that is deployed on the Swarm

    # docker run -itd --name=singletest busybox
    db0af98e5b13bc24801047594f989437d33956bcf27aa70702e805ec4cc16a1b
    #
  8. View the container

    docker ps
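For convenience, steps 1 through 5 above can be collected into one startup script (a sketch only; INTERNAL_IP_ADDRESS is the same placeholder used above, and the sleep calls are my own guard against startup races, not part of the original steps):

```shell
#!/bin/bash
# Development Swarm startup on a single host using Docker Hub token discovery
IP=INTERNAL_IP_ADDRESS

# 1. Docker Daemon
nohup /usr/local/bin/docker daemon -H tcp://$IP:2375 &
sleep 2

# 2. Create the Swarm Cluster and store the token
mkdir -p /opt/swarm
/usr/local/bin/swarm create > /opt/swarm/clustertoken
TOKEN=$(cat /opt/swarm/clustertoken)

# 3. Swarm Join and 4. Swarm Manager
nohup /usr/local/bin/swarm join --addr=$IP:2375 token://$TOKEN &
nohup /usr/local/bin/swarm manage -H tcp://$IP:4000 token://$TOKEN &
sleep 2

# 5. Point the Docker CLI at the Swarm Manager and verify
export DOCKER_HOST=$IP:4000
docker info
```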

DISTRIBUTED SWARM ENVIRONMENT (RUNNING ON MULTIPLE HOSTS)

Now that the single host environment is working, we can break out each component for redundancy in preparation for a production environment. We will be removing the Swarm Cluster Token and using consul from here on. In this configuration, each Swarm Node is an identically configured t2.small replica. Here is a reference diagram for the Swarm I am running:

Docker Swarm Distributed Environment Reference Diagram

I find looking at the running processes a good way to figure out connectivity and ordering on a new platform.

Here are the processes running on Swarm Node 1 (notice the increasing PIDs indicate the ordering for startup):

# ps x | grep consul
3510 ? Sl 1:18 /usr/local/bin/consul agent -server -config-dir=/etc/consul.d/server -data-dir=/opt/consul/data -bind=10.0.0.137
3583 ? Sl 2:28 /usr/local/bin/docker daemon -H 10.0.0.137:2375 --cluster-advertise 10.0.0.137:2375 --cluster-store consul://10.0.0.137:8500/swarmnodes --label=com.docker.network.driver.overlay.bind_interface=eth0
3672 ? Sl 0:01 /usr/local/bin/swarm join --addr=10.0.0.137:2375 consul://10.0.0.137:8500/swarmnodes
3690 ? Sl 0:03 /usr/local/bin/swarm manage -H tcp://10.0.0.137:4000 --replication --advertise 10.0.0.137:4000 consul://10.0.0.137:8500/swarmnodes

Here are the processes running on Swarm Node 2:

# ps x | grep consul
3506 ? Sl 2:38 /usr/local/bin/consul agent -server -config-dir=/etc/consul.d/server -data-dir=/opt/consul/data -bind=10.0.0.54
3566 ? Sl 2:29 /usr/local/bin/docker daemon -H 10.0.0.54:2375 --cluster-advertise 10.0.0.54:2375 --cluster-store consul://10.0.0.54:8500/swarmnodes --label=com.docker.network.driver.overlay.bind_interface=eth0
3585 ? Sl 0:00 /usr/local/bin/swarm join --addr=10.0.0.54:2375 consul://10.0.0.54:8500/swarmnodes
3652 ? Sl 0:02 /usr/local/bin/swarm manage -H tcp://10.0.0.54:4000 --replication --advertise 10.0.0.54:4000 consul://10.0.0.54:8500/swarmnodes

Here are the processes running on Swarm Node 3:

# ps x | grep consul
3507 ? Sl 1:14 /usr/local/bin/consul agent -server -config-dir=/etc/consul.d/server -data-dir=/opt/consul/data -bind=10.0.0.146
3568 ? Sl 2:31 /usr/local/bin/docker daemon -H 10.0.0.146:2375 --cluster-advertise 10.0.0.146:2375 --cluster-store consul://10.0.0.146:8500/swarmnodes --label=com.docker.network.driver.overlay.bind_interface=eth0
3589 ? Sl 0:00 /usr/local/bin/swarm join --addr=10.0.0.146:2375 consul://10.0.0.146:8500/swarmnodes
3653 ? Sl 0:03 /usr/local/bin/swarm manage -H tcp://10.0.0.146:4000 --replication --advertise 10.0.0.146:4000 consul://10.0.0.146:8500/swarmnodes

Here are the ordered commands to start the Swarm. Please run all of them on each Node separately.

  1. Setup a consul configuration file on each host (make sure to create the /etc/consul.d/server and /opt/consul/data directories):

    cat /etc/consul.d/server/config.json
    {
      "datacenter" : "",
      "bootstrap" : false,
      "bootstrap_expect" : 3,
      "server" : true,
      "data_dir" : "/opt/consul/data",
      "log_level" : "INFO",
      "enable_syslog" : false,
      "start_join" : [],
      "retry_join" : [],
      "client_addr" : "0.0.0.0"
    }
  2. Start consul

    nohup /usr/local/bin/consul agent -server -config-dir=/etc/consul.d/server -data-dir=/opt/consul/data -bind=INTERNAL_IP_ADDRESS &
  3. Have consul join the consul cluster

    /usr/local/bin/consul join swarm1.internallevvel.com swarm2.internallevvel.com swarm3.internallevvel.com
  4. Start the Docker Daemon

    nohup /usr/local/bin/docker daemon -H INTERNAL_IP_ADDRESS:2375 --cluster-advertise INTERNAL_IP_ADDRESS:2375 --cluster-store consul://INTERNAL_IP_ADDRESS:8500/ --label=com.docker.network.driver.overlay.bind_interface=eth0 &
  5. Start the Swarm Join

    nohup /usr/local/bin/swarm join --addr=INTERNAL_IP_ADDRESS:2375 consul://INTERNAL_IP_ADDRESS:8500/ &
  6. Start the Swarm Manager

    nohup /usr/local/bin/swarm manage -H tcp://INTERNAL_IP_ADDRESS:4000 --replication --advertise INTERNAL_IP_ADDRESS:4000 consul://INTERNAL_IP_ADDRESS:8500/ &
  7. Point each host at the Swarm Manager by setting the environment variable:

    export DOCKER_HOST=INTERNAL_IP_ADDRESS:4000
  8. Once all the Swarm Nodes are installed and setup, you can confirm the Swarm is ready with:

    # docker info
    Containers: 0
    Images: 0
    Role: primary
    Strategy: spread
    Filters: health, port, dependency, affinity, constraint
    Nodes: 3
    swarm1.internallevvel.com: 10.0.0.137:2375
    └ Containers: 0
    └ Reserved CPUs: 0 / 1
    └ Reserved Memory: 0 B / 2.054 GiB
    └ Labels: com.docker.network.driver.overlay.bind_interface=eth0, executiondriver=native-0.2, kernelversion=4.1.10-17.31.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2015.09, storagedriver=devicemapper
    swarm2.internallevvel.com: 10.0.0.54:2375
    └ Containers: 0
    └ Reserved CPUs: 0 / 1
    └ Reserved Memory: 0 B / 2.054 GiB
    └ Labels: com.docker.network.driver.overlay.bind_interface=eth0, executiondriver=native-0.2, kernelversion=4.1.10-17.31.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2015.09, storagedriver=devicemapper
    swarm3.internallevvel.com: 10.0.0.146:2375
    └ Containers: 0
    └ Reserved CPUs: 0 / 1
    └ Reserved Memory: 0 B / 2.054 GiB
    └ Labels: com.docker.network.driver.overlay.bind_interface=eth0, executiondriver=native-0.2, kernelversion=4.1.10-17.31.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2015.09, storagedriver=devicemapper
    CPUs: 3
    Total Memory: 6.163 GiB
    Name: swarm2.internallevvel.com
  9. Deploy one app to each Swarm Node

    # docker run -itd --name=AppDeployedToNode1 --env="constraint:node==swarm1.internallevvel.com" busybox
    5e5d3e056aee3a5e621ed9775245392f31e2b908922ee6087706bafbd665df08
    # docker run -itd --name=AppDeployedToNode2 --env="constraint:node==swarm2.internallevvel.com" busybox
    e01065a81645a35c7c3d71e7796a6804a1092d1d038f6b3df8fa7c9f72567b01
    # docker run -itd --name=AppDeployedToNode3 --env="constraint:node==swarm3.internallevvel.com" busybox
    88bb54327a910f0fb6ce3a502f820746c48e8b712b34ec72a8ce34c09605ad75
  10. Confirm the apps were deployed and running on the correct host

    # docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    88bb54327a91 busybox "sh" 29 seconds ago Up 28 seconds swarm3.internallevvel.com/AppDeployedToNode3
    e01065a81645 busybox "sh" 36 seconds ago Up 35 seconds swarm2.internallevvel.com/AppDeployedToNode2
    5e5d3e056aee busybox "sh" 43 seconds ago Up 43 seconds swarm1.internallevvel.com/AppDeployedToNode1
  11. Inspect the Swarm and confirm there is a container on each Node

    # docker info
    Containers: 3
    Images: 3
    Role: primary
    Primary: 10.0.0.54:4000
    Strategy: spread
    Filters: health, port, dependency, affinity, constraint
    Nodes: 3
    swarm1.internallevvel.com: 10.0.0.137:2375
    └ Containers: 1
    └ Reserved CPUs: 0 / 1
    └ Reserved Memory: 0 B / 2.054 GiB
    └ Labels: com.docker.network.driver.overlay.bind_interface=eth0, executiondriver=native-0.2, kernelversion=4.1.10-17.31.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2015.09, storagedriver=devicemapper
    swarm2.internallevvel.com: 10.0.0.54:2375
    └ Containers: 1
    └ Reserved CPUs: 0 / 1
    └ Reserved Memory: 0 B / 2.054 GiB
    └ Labels: com.docker.network.driver.overlay.bind_interface=eth0, executiondriver=native-0.2, kernelversion=4.1.10-17.31.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2015.09, storagedriver=devicemapper
    swarm3.internallevvel.com: 10.0.0.146:2375
    └ Containers: 1
    └ Reserved CPUs: 0 / 1
    └ Reserved Memory: 0 B / 2.054 GiB
    └ Labels: com.docker.network.driver.overlay.bind_interface=eth0, executiondriver=native-0.2, kernelversion=4.1.10-17.31.amzn1.x86_64, operatingsystem=Amazon Linux AMI 2015.09, storagedriver=devicemapper
    CPUs: 3
    Total Memory: 6.163 GiB
    Name: swarm2.internallevvel.com

    AT THIS POINT THE SWARM IS ABLE TO DEPLOY CONTAINERS ACROSS THE SWARM NODES. NOW WE CAN CONFIRM THE OVERLAY NETWORK CAN LINK CONTAINERS ACROSS MULTIPLE HOSTS.

  12. Create a docker-compose.yml file for deploying the HA RabbitMQ cluster containers from the previous post (I will be posting a public version on docker hub soon). Here are the contents from mine:

    $ cat cluster/docker-compose.yml
    rabbit1:
      image: jayjohnson/rabbitclusternode
      hostname: cluster_rabbit1_1
      cap_add:
        - ALL
        - NET_ADMIN
        - SYS_ADMIN
      ports:
        - "1883:1883"
        - "5672:5672"
        - "8883:8883"
        - "15672:15672"
    rabbit2:
      image: jayjohnson/rabbitclusternode
      hostname: cluster_rabbit2_1
      cap_add:
        - ALL
        - NET_ADMIN
        - SYS_ADMIN
      environment:
        - CLUSTERED=true
        - CLUSTER_WITH=cluster_rabbit1_1
        - RAM_NODE=true
      ports:
        - "1884:1883"
        - "5673:5672"
        - "8884:8883"
        - "15673:15672"
    rabbit3:
      image: jayjohnson/rabbitclusternode
      hostname: cluster_rabbit3_1
      cap_add:
        - ALL
        - NET_ADMIN
        - SYS_ADMIN
      environment:
        - CLUSTERED=true
        - CLUSTER_WITH=cluster_rabbit1_1
      ports:
        - "1885:1883"
        - "5674:5672"
        - "8885:8883"
        - "15674:15672"
  13. Now use Docker Compose to deploy the containers as a RabbitMQ cluster according to the docker-compose.yml configuration. Make sure to run this in the same directory as the yml file and specify the new overlay networking.

    docker-compose --x-networking --x-network-driver overlay up -d
  14. Confirm the RabbitMQ cluster containers are running and distributed across the Swarm Nodes

    # docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    377cfd780f9d jayjohnson/rabbitclusternode "/bin/sh -c /opt/rabb" 8 seconds ago Up 6 seconds 10.0.0.146:1883->1883/tcp, 10.0.0.146:5672->5672/tcp, 4369/tcp, 10.0.0.146:8883->8883/tcp, 9100-9105/tcp, 10.0.0.146:15672->15672/tcp, 25672/tcp swarm3.internallevvel.com/cluster_rabbit1_1
    45b2111b35de jayjohnson/rabbitclusternode "/bin/sh -c /opt/rabb" 8 seconds ago Up 7 seconds 4369/tcp, 9100-9105/tcp, 25672/tcp, 10.0.0.54:1884->1883/tcp, 10.0.0.54:5673->5672/tcp, 10.0.0.54:8884->8883/tcp, 10.0.0.54:15673->15672/tcp swarm2.internallevvel.com/cluster_rabbit2_1
    52f8fbad2f98 jayjohnson/rabbitclusternode "/bin/sh -c /opt/rabb" 9 seconds ago Up 7 seconds 4369/tcp, 9100-9105/tcp, 25672/tcp, 10.0.0.137:1885->1883/tcp, 10.0.0.137:5674->5672/tcp, 10.0.0.137:8885->8883/tcp, 10.0.0.137:15674->15672/tcp swarm1.internallevvel.com/cluster_rabbit3_1
    88bb54327a91 busybox "sh" 14 minutes ago Up 14 minutes swarm3.internallevvel.com/AppDeployedToNode3
    e01065a81645 busybox "sh" 14 minutes ago Up 14 minutes swarm2.internallevvel.com/AppDeployedToNode2
    5e5d3e056aee busybox "sh" 14 minutes ago Up 14 minutes swarm1.internallevvel.com/AppDeployedToNode1
  15. Try connecting to one of the RabbitMQ brokers

    [root@swarm3 ~]# telnet 0.0.0.0 5672
    Trying 0.0.0.0...
    Connected to 0.0.0.0.
    Escape character is '^]'.
    AMQP Connection closed by foreign host.
    [root@swarm3 ~]#
  16. Login to one of the containers

    [root@swarm3 ~]# docker exec -t -i cluster_rabbit1_1 /bin/bash
    [root@cluster_rabbit1_1 /]#
  17. Confirm the overlay network set the /etc/hosts for connectivity between the RabbitMQ containers running on different Swarm Nodes

    [root@cluster_rabbit1_1 /]# cat /etc/hosts
    10.0.0.4 cluster_rabbit1_1
    127.0.0.1 localhost
    ::1 localhost ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    10.0.0.2 cluster_rabbit3_1
    10.0.0.2 cluster_rabbit3_1.cluster
    10.0.0.3 cluster_rabbit2_1
    10.0.0.3 cluster_rabbit2_1.cluster
    [root@cluster_rabbit1_1 /]#
  18. Check the RabbitMQ cluster status

    [root@cluster_rabbit1_1 /]# rabbitmqctl cluster_status
    Cluster status of node rabbit@cluster_rabbit1_1 ...
    [{nodes,[{disc,[rabbit@cluster_rabbit1_1,rabbit@cluster_rabbit3_1]},
    {ram,[rabbit@cluster_rabbit2_1]}]},
    {running_nodes,[rabbit@cluster_rabbit2_1,rabbit@cluster_rabbit3_1,
    rabbit@cluster_rabbit1_1]},
    {cluster_name,<<"rabbit@cluster_rabbit1_1">>},
    {partitions,[]}]
    [root@cluster_rabbit1_1 /]#
  19. Logout and stop the RabbitMQ cluster with Docker Compose

    [root@cluster_rabbit1_1 /]# exit
    [root@swarm3 ~]#
    [root@swarm3 ~]# docker-compose --x-networking stop
    Stopping cluster_rabbit1_1 ... done
    Stopping cluster_rabbit2_1 ... done
    Stopping cluster_rabbit3_1 ... done
    [root@swarm3 ~]#
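Step 17 above showed the overlay network publishing peer entries into each container's /etc/hosts. To make that name-to-IP mapping concrete, here is a small Python sketch (the overlay_peers helper is hypothetical, not part of Docker or Compose) that extracts the peer container names and their overlay addresses from hosts entries like those:

```python
# Sample /etc/hosts content as written by the overlay network (from step 17)
HOSTS_SAMPLE = """\
10.0.0.4 cluster_rabbit1_1
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.0.0.2 cluster_rabbit3_1
10.0.0.2 cluster_rabbit3_1.cluster
10.0.0.3 cluster_rabbit2_1
10.0.0.3 cluster_rabbit2_1.cluster
"""

def overlay_peers(hosts_text, prefix="cluster_rabbit"):
    """Map container hostnames matching the given prefix to their overlay IPs."""
    peers = {}
    for line in hosts_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        ip, names = parts[0], parts[1:]
        for name in names:
            # Skip the network-qualified aliases like cluster_rabbit2_1.cluster
            if name.startswith(prefix) and "." not in name:
                peers[name] = ip
    return peers

print(overlay_peers(HOSTS_SAMPLE))
# {'cluster_rabbit1_1': '10.0.0.4', 'cluster_rabbit3_1': '10.0.0.2', 'cluster_rabbit2_1': '10.0.0.3'}
```

This is exactly the lookup RabbitMQ relies on when cluster_rabbit2_1 and cluster_rabbit3_1 join cluster_rabbit1_1 across different Swarm Nodes.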

At this point we have used Docker Compose to demonstrate automatic placement of containers across the Swarm Nodes, but what if you want to ensure each Swarm Node gets the appropriate container every time?

If you do not want to deploy across the Swarm using Docker Compose you can create a custom overlay network and then manually deploy containers with the docker run command specifying to use that overlay network. Here is how I am deploying another RabbitMQ cluster across the Swarm:

This script creates the ‘testoverlay’ network and then deploys the same RabbitMQ container from Docker Hub across the Swarm Nodes one at a time. So far, I have not been able to build a Docker Compose configuration file that can handle exact placement, so I created this script to make the deployment consistent every time.
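For readers who cannot see the embedded script, a sketch of the same approach looks like the following (the ‘testoverlay’ name comes from the text and the node constraints mirror step 9 above, but the container names and ordering here are my assumptions, not the original script):

```shell
#!/bin/bash
# Create the shared overlay network once (any Swarm Node can issue this)
docker network create -d overlay testoverlay

# Deploy one RabbitMQ container per Swarm Node with an exact placement constraint
docker run -itd --name=overlay_rabbit1 --net=testoverlay \
    --env="constraint:node==swarm1.internallevvel.com" \
    jayjohnson/rabbitclusternode

docker run -itd --name=overlay_rabbit2 --net=testoverlay \
    --env="constraint:node==swarm2.internallevvel.com" \
    --env="CLUSTERED=true" --env="CLUSTER_WITH=overlay_rabbit1" \
    jayjohnson/rabbitclusternode

docker run -itd --name=overlay_rabbit3 --net=testoverlay \
    --env="constraint:node==swarm3.internallevvel.com" \
    --env="CLUSTERED=true" --env="CLUSTER_WITH=overlay_rabbit1" \
    jayjohnson/rabbitclusternode
```

The constraint filter guarantees each container lands on its named host, trading Compose's hands-off scheduling for repeatable placement.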

PRODUCTION SWARM ENVIRONMENT

Every production environment is going to have nuances that need to be handled carefully. The following reference architecture diagram is an example deployment topology for running a production Swarm environment. Not every environment fits the same mold, and we want to hear your feedback on what would not work in yours. With that consideration, here’s a starting point for building out a Production Docker Swarm Environment:

Docker Swarm Production Environment Reference Diagram

From the diagram you can see all we have to do is change where the Swarm Managers and Service Discovery Managers run. This allows for a consistent Swarm Node build that is only going to host Docker Containers running your applications. Developers or the DevOps driving production publishes will only interface with the Swarm Managers, and for security it makes sense to lock down access to the Container Nodes to only the ports necessary for hosting the applications and the management ports for the Swarm.

In the future, I plan on releasing the repository with the provisioning scripts, installers, setup scripts, and tooling for deploying to a targeted AWS VPC instance within a few minutes. Even with the lack of documentation for troubleshooting, Docker Swarm is still significantly easier to set up than some of the other enterprise on-premise PaaS offerings I have installed and run in production before, and it has a great community of developers supporting it.

With the ability to run a Docker Swarm on your own computer, in your own data center, or on a major cloud provider like AWS, the toolset for managing the container lifecycle from development to production is getting easier with each release. Developers can run their own environments and share them using the overlay network, QA can spin up Swarm environments for testing and shut them down when done, and IT now has the ability to cut costs by dynamically adjusting the running Swarm Nodes based on application traffic demand. All in all, I think this release is a huge success for Docker. I will be opening up some PRs in the hopes of improving the documentation and adding some debugging tips soon.

Lastly, here are some of my final considerations after running Swarm on AWS:

  1. It takes around 5 minutes to provision a new 3-host Docker Swarm environment from scratch.
  2. Managing containers across a Swarm means developers and devops teams only interface with the Swarm Managers.
  3. This release supports adding and removing Swarm Nodes without downtime. I have been able to migrate containers to newly provisioned Swarm Nodes and then shutdown the host once the containers were relocated. This allows for hosts to be updated without breaking a running Swarm.
  4. Docker supports TLS encryption for security when accessing the Swarm Management nodes. I am hosting this Swarm under an internal Route 53 Hosted Zone so there is no access outside of the VPC (you could also host sensitive applications in a Private subnet). I would consider putting the Swarm Managers into a DMZ that has the only access to the production Swarm Nodes and consul cluster.
  5. Make sure there are enough redundant components for all internal Swarm processes and the consul cluster. Consul requires 3 instances at a minimum, but 5 ensures better support for updates and rolling restarts.
  6. Understand your container deployment strategy to prevent over-utilizing Swarm Nodes. There are pros and cons to using Docker Compose with a hands-off approach versus the granular control that docker run provides. Knowing the container application’s resource needs before going to production will help determine which strategy to use.
  7. Consul has a web UI for viewing the consul cluster’s registered services and members. It would be nice to host this as a monitoring application.
  8. Host your own Service Discovery Managers once you get comfortable running in single host mode. I had issues with losing Node membership when I tried to use the Docker Hub token across a Distributed Swarm environment.
  9. Docker Swarm can run inside and outside of AWS.

Let’s recap what we have done in this post. We have:

  1. Demonstrated that multi-host networking is supported with Docker Swarm
  2. Deployed redundant Swarm Nodes on AWS
  3. Deployed and managed containers across the Swarm
  4. Set up redundant Docker Swarm internal processes to improve failure tolerance
  5. Integrated with a clustered Service Discovery Manager (consul)
  6. Used Docker Swarm to deploy a working HA RabbitMQ cluster of containers linked using the new VXLAN overlay network across multiple hosts

Well that is all for now! Thanks for reading and I hope you found this post valuable. There is a lot to talk about in this new Docker Swarm release, and we are excited to hear your feedback on running Docker Swarm. If your organization would like assistance determining your Docker and Docker Swarm strategy, please reach out to us at Levvel and we can get you started.

Until next time,

- Jay

Jay Johnson

Principal Consultant

IT Professional with 10+ years of experience in architecture, design and implementation of large distributed, real-time systems across a variety of environments. Focused on executing aggressive timelines by leveraging my expertise in technology, process, and best practices.

GitHub Portfolio: https://github.com/jay-johnson
