Machine Learning Part Two—Running a Machine Learning Data Store on Redis Labs

Redis Labs

Editor’s note: This is the second post in a two-part series about machine learning. In part one, we discussed how to get started with machine learning: define, benchmark, and deploy.

Managing large, pre-trained predictive models across an organization and ensuring the same version is in production can be a challenge given the rapid pace of change in the AI/machine learning space. This post demonstrates an approach to automating how predictive models are built, stored, and deployed from a Remote Machine Learning Data Store hosted on Redis Labs. It shows how DevOps CI/CD artifact pipelines can be used to build and manage machine learning model artifacts with Jupyter IPython notebooks, accompanying command line versions for automation, and administration tools to help manage artifacts across a team. By applying DevOps practices to your machine learning build workflows, you can more easily manage model deployments across intelligent environments.

The Basics—What Are We Automating?

In general, machine learning workflows share these common steps to create a predictive model:

  1. Define a dataset
  2. Slice the dataset up into train and test sets
  3. Build your machine learning algorithm model
  4. Train the model
  5. Test the model
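The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration only: a one-feature least-squares line stands in for a real algorithm such as xgboost, and the synthetic dataset, the 80/20 split, and the `LinearModel` class are assumptions made for the example, not part of sci-pype.

```python
import random

# 1. Define a dataset: (feature, target) pairs following y = 2x + 1 plus noise.
rng = random.Random(27)
dataset = [(x, 2.0 * x + 1.0 + rng.uniform(-0.1, 0.1)) for x in range(100)]

# 2. Slice the dataset into train and test sets (80/20 after shuffling).
rng.shuffle(dataset)
split = int(len(dataset) * 0.8)
train_set, test_set = dataset[:split], dataset[split:]


# 3. Build the model: a one-feature least-squares line (stand-in for xgboost).
class LinearModel:
    def __init__(self):
        self.slope = 0.0
        self.intercept = 0.0

    # 4. Train the model on the training rows.
    def fit(self, rows):
        n = len(rows)
        mean_x = sum(x for x, _ in rows) / n
        mean_y = sum(y for _, y in rows) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
        var = sum((x - mean_x) ** 2 for x, _ in rows)
        self.slope = cov / var
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        return self.slope * x + self.intercept


model = LinearModel().fit(train_set)

# 5. Test the model: mean absolute error on the held-out rows.
mae = sum(abs(model.predict(x) - y) for x, y in test_set) / len(test_set)
print(f"slope={model.slope:.3f} intercept={model.intercept:.3f} test MAE={mae:.3f}")
```

Each numbered comment maps to one step in the list, which is the part a pipeline API can automate regardless of the algorithm plugged in at step 3.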

We wanted to share how to automate these common steps within a machine learning pipeline under a server API that creates model artifacts on completion. Artifacts are dictionaries containing the models’ analysis, accuracy, predictions, and binary model objects. Once the artifact is created, it can be compressed as a pickle serialized object and uploaded to a configurable S3 location or to another persistent storage location. This post also demonstrates how to design a machine learning API with a pseudo-factory that abstracts how each step works and hides the underlying machine learning model implementation. This approach lets a team focus on improving model predictive accuracy, improving a dataset’s features, sharing models across an organization, evaluating models, and deploying pre-trained models to new environments for automation and for live intelligent service layers that need to make predictions or forecasts in real time.
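To make the artifact idea concrete, here is a minimal sketch of packing a dictionary into a compressed, pickle serialized blob and restoring it. The key names and values in `artifact` are placeholders chosen for illustration, not the layout sci-pype actually uses, and the upload to S3 is left out.

```python
import pickle
import zlib


def pack_artifact(artifact: dict) -> bytes:
    """Pickle-serialize the artifact dictionary, then compress it."""
    return zlib.compress(pickle.dumps(artifact, protocol=pickle.HIGHEST_PROTOCOL))


def unpack_artifact(blob: bytes) -> dict:
    """Reverse pack_artifact. Only unpickle data from a trusted source."""
    return pickle.loads(zlib.decompress(blob))


# Hypothetical artifact layout; key names and values are illustrative only.
artifact = {
    "Analysis": {"TargetColumn": "SepalLength"},
    "Accuracy": 0.95,
    "Predictions": [5.1, 4.9, 4.7],
    "Model": {"slope": 2.0, "intercept": 1.0},  # stand-in for a trained model object
}

blob = pack_artifact(artifact)      # bytes ready to write to a file or upload
restored = unpack_artifact(blob)    # the same dictionary, rebuilt elsewhere
```

Because unpickling can execute arbitrary code, artifacts like this should only be loaded from storage locations your team controls.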

What Does the Workflow Look Like?

Here is the workflow for using a machine learning data store powered by Redis Labs and an S3 artifact backbone:

This workflow is built to help find highly predictive models: it uses an API that can scale out expensive tasks (like building, learning, training, and testing models), and it natively manages machine learning models with Redis Labs caching and an S3 backbone for archiving. Just like DevOps in the enterprise software world, automating build workflows lets your organization focus on what matters: finding the most predictive models, defining quality datasets, and testing newly engineered features.

How Does It Work?

This post is about reducing the time it takes to create quality predictive models. The examples below are from the Jupyter-based docker container repository: https://github.com/jay-johnson/sci-pype. This repo shares the same API with a backend worker, https://github.com/jay-johnson/datanode, for running the heavyweight tasks across a distributed environment. Sci-pype was built to make it easier to analyze all columns in a dataset and determine the best-of predictive models with Jupyter. Datanode was built to distribute time-intensive workloads, like the building and training steps, so you can keep creating larger, more accurate models. The larger your dataset, the larger your models can grow. That is why a decoupled node that processes new tests using a publisher-subscriber pattern can reduce how long your team spends waiting to create models from scratch with a new dataset.
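The decoupling described above can be sketched with the standard library. In this illustration an in-process `queue.Queue` stands in for a real message broker, and a single thread stands in for a datanode worker; it is a simple work queue rather than full publish-subscribe fan-out. The request shape and the `worker` function are assumptions made for the example.

```python
import queue
import threading


def worker(jobs: queue.Queue, results: dict) -> None:
    """Consume build requests until a None sentinel arrives."""
    while True:
        request = jobs.get()
        if request is None:
            jobs.task_done()
            break
        # Placeholder for the expensive build-and-train step.
        results[request["TargetColumn"]] = f"model-for-{request['TargetColumn']}"
        jobs.task_done()


jobs = queue.Queue()
results = {}
thread = threading.Thread(target=worker, args=(jobs, results))
thread.start()

# The publisher (for example a notebook) only enqueues requests and moves on.
for column in ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]:
    jobs.put({"TargetColumn": column})
jobs.put(None)  # tell the worker to stop

jobs.join()    # wait until every request has been processed
thread.join()
```

The publisher never calls the training code directly, which is what allows the heavy work to move to other machines without changing the client.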

In a future post, I will discuss running datanode workers on a container-ready PaaS like OpenShift to significantly increase how fast your team can analyze and deploy large models for making new predictions, and to handle automatic testing with newly rolled datasets. Once the models are built and trained, they can be archived as an artifact and uploaded to S3 for automatic or controlled deployment. This workflow provides a clean handoff for promoting pre-trained models built in a Development environment to other environments like QA and Production. Once the models are deployed into the Redis Labs QA or Production environments, you can start making new predictions with the same API call used to build the models from scratch. Sci-pype will use the pre-trained, cached models, so you do not have to expose your dataset beyond secured environments or worry about how to replicate the intelligent infrastructure in a new environment. All of this is possible because everything in this post can run inside a docker container or outside of one from the command line using these two complementary repositories.

Powering Your Own Remote Machine Learning Data Store

The remainder of this post discusses how to run your own Machine Learning Data Store in a Redis Labs Cloud instance. This demo was built with Redis Labs because they offer a hybrid solution: your organization can run enterprise-grade Redis remotely from a cloud endpoint, or host an on-premise enterprise cluster in your private datacenter, which keeps your models and data secured behind your firewall. Because each supports the same underlying Redis client, both options are natively supported under the sci-pype API for caching, importing, and extracting models and analysis artifacts.
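Because both deployment options speak to the same kind of client, caching code can be written once against the client's `set` and `get` calls. The sketch below is an assumption about how such helpers might look, not the sci-pype implementation: `InMemoryClient` is a hypothetical stand-in that mimics only those two calls so the example runs without a server, and the key name is made up.

```python
import pickle


class InMemoryClient:
    """Stand-in exposing the set/get subset of a Redis client, for local runs."""

    def __init__(self):
        self._data = {}

    def set(self, name, value):
        self._data[name] = value
        return True

    def get(self, name):
        return self._data.get(name)


def cache_model(client, key, model_artifact):
    """Store a pickled artifact under key using the client's set call."""
    client.set(key, pickle.dumps(model_artifact))


def load_model(client, key):
    """Return the cached artifact, or None on a cache miss."""
    blob = client.get(key)
    return None if blob is None else pickle.loads(blob)


# Swap in a real client, such as redis.Redis(host=..., port=...), for a cloud
# or on-premise endpoint; the helper functions stay the same.
client = InMemoryClient()
cache_model(client, "iris_regressor_v1", {"TargetColumn": "SepalLength"})
cached = load_model(client, "iris_regressor_v1")
missing = load_model(client, "no_such_model")
```

Passing the client in as an argument is the design choice that lets the same code target either endpoint.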

This demo is composed of four IPython notebooks included in the repository examples directory, each with an accompanying command line version. To make these samples finish quickly, all of them analyze the same IRIS dataset with an xgboost regression model and use the same Redis Labs Cloud instance. I picked xgboost because it has a strong track record among winning entries in Kaggle machine learning competitions, and because it shows how an organization can automate a machine learning pipeline with a model that is proven and highly customizable.

1. Set up your own Redis Labs Cloud instance

i. Register for your own Redis Cloud account

ii. Create a new Redis Cloud instance

iii. Find the Redis Cloud instance endpoint

iv. The repository examples use a free tier instance named cloudcache deployed on AWS at redis-16005.c8.us-east-1-4.ec2.cloud.redislabs.com:16005

https://github.com/jay-johnson/sci-pype/raw/master/examples/images/v2/Redis-Labs-Cloud-Endpoint-Used-as-a-Remote-Machine-Learning-Data-Store.png

Update the Redis Cloud configuration:

Replace all the Redis Cloud endpoint instances with your own using the command:

sed -i 's|redis-16005.c8.us-east-1-4.ec2.cloud.redislabs.com:16005|your-redislabs-endpoint|g' configs/cloud-redis.json

Start the container using the Redis Labs Cloud compose file:

Run this wrapper script to start the docker composition with your own Redis Cloud instance (which mounts the repository’s ./configs into the container at runtime). This makes it easy to move between different Redis Labs Cloud instances:

https://github.com/jay-johnson/sci-pype/blob/master/rl-start.sh

Navigate to the container’s running Jupyter instance listening on port 8888:

Open a browser to: http://localhost:8888/tree/examples

2. Development—Define and Benchmark

This IPython notebook analyzes the IRIS dataset by creating an xgboost regression model for each column in the dataset. After testing, the models are packaged up along with their predictions and accuracies into a serialized dictionary. Once packaged, they are cached on the Redis Labs Cloud instance. Once cached, they can be exported and archived as a serialized, compressed file stored on S3. After the models are trained and tested, the notebook displays seaborn visualizations to help with model evaluation. A future release will archive these analysis images as a separate artifact with optional S3 upload support.

I like XGBoost because it is a parallelized, highly-tunable machine learning algorithm supporting multiple parameters for optimizing how it learns and trains itself. The sci-pype server API supports multiple algorithm models (gradient boosting machines, random forests, and extra trees), and creating these models with a common request dictionary allows for the underlying implementation of these models to be versioned and updated while your organization controls what is exposed to the clients like Jupyter or a web application. Here are the latest supported XGBoost parameters with the sci-pype API:

**Extending the sklearn XGBoost API parameters**

http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

::

    ml_request = {
        "MLAlgo" : {
            "Train" : {
                "LearningRate"          : 0.1,
                "NumEstimators"         : 1000,
                "Objective"             : "reg:linear",
                "MaxDepth"              : 6,
                "MaxDeltaStep"          : 0,
                "MinChildWeight"        : 1,
                "Gamma"                 : 0,
                "SubSample"             : 0.8,
                "ColSampleByTree"       : 0.8,
                "ColSampleByLevel"      : 1.0,
                "RegAlpha"              : 0,
                "RegLambda"             : 1,
                "BaseScore"             : 0.5,
                "NumThreads"            : -1, # -1 = no limit (use all available threads)
                "ScaledPositionWeight"  : 1,
                "Seed"                  : 27,
                "Debug"                 : False
            }
        }
    }

These parameters are used to create an XGBRegressor object in pycore.py.
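As a rough illustration of how a request dictionary like this could be translated into keyword arguments for the xgboost sklearn wrapper, consider the sketch below. The `PARAM_NAMES` table and `to_regressor_kwargs` helper are hypothetical and are not taken from pycore.py. The target names follow the current XGBRegressor documentation, where `n_jobs` and `random_state` replace the older `nthread` and `seed`; the `Debug` flag is left unmapped because its xgboost equivalent is not stated in this post.

```python
# Hypothetical mapping from request keys to xgboost sklearn-wrapper keyword names.
PARAM_NAMES = {
    "LearningRate": "learning_rate",
    "NumEstimators": "n_estimators",
    "Objective": "objective",
    "MaxDepth": "max_depth",
    "MaxDeltaStep": "max_delta_step",
    "MinChildWeight": "min_child_weight",
    "Gamma": "gamma",
    "SubSample": "subsample",
    "ColSampleByTree": "colsample_bytree",
    "ColSampleByLevel": "colsample_bylevel",
    "RegAlpha": "reg_alpha",
    "RegLambda": "reg_lambda",
    "BaseScore": "base_score",
    "NumThreads": "n_jobs",
    "ScaledPositionWeight": "scale_pos_weight",
    "Seed": "random_state",
}


def to_regressor_kwargs(ml_request: dict) -> dict:
    """Translate the "Train" section of a request into keyword arguments."""
    train = ml_request["MLAlgo"]["Train"]
    # Keys without a known mapping (such as "Debug") are skipped.
    return {PARAM_NAMES[k]: v for k, v in train.items() if k in PARAM_NAMES}


ml_request = {
    "MLAlgo": {
        "Train": {"LearningRate": 0.1, "MaxDepth": 6, "Seed": 27, "Debug": False}
    }
}
kwargs = to_regressor_kwargs(ml_request)
# With xgboost installed: model = xgboost.XGBRegressor(**kwargs)
```

Keeping the translation in one table is what lets the server update or version the underlying model implementation without changing what clients send.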

This IPython notebook has a command line version and virtual environment inside this repository and the datanode repository for sharing the same analysis and tooling for use inside or outside of docker across a distributed, scalable environment.

https://github.com/jay-johnson/sci-pype/blob/master/bins/ml/builders/rl-build-regressor-iris.py

3. Test, QA, Production—Make New Predictions with Pre-trained Models

Make new predictions with the pre-trained models deployed to the Redis Labs Cloud instance. The latest API also supports dataset versioning and working with legacy datasets by providing targeted column lists for making predictions (as long as the dataset still has them).
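The idea of a targeted column list can be sketched as follows. The `select_features` helper is a hypothetical illustration, not a sci-pype function: it pulls only the columns a model was trained on from a row, and fails loudly if the dataset no longer has them. The extra `Source` column is invented to represent a newer dataset version.

```python
def select_features(row: dict, feature_columns: list) -> list:
    """Return the values for feature_columns, in order, or raise if any are missing."""
    missing = [c for c in feature_columns if c not in row]
    if missing:
        raise KeyError(f"dataset is missing required columns: {missing}")
    return [row[c] for c in feature_columns]


# A newer dataset version with an extra column still works for an older model.
row_v2 = {
    "SepalLength": 5.1,
    "SepalWidth": 3.5,
    "PetalLength": 1.4,
    "PetalWidth": 0.2,
    "Source": "lab-b",
}
features = select_features(row_v2, ["SepalWidth", "PetalLength", "PetalWidth"])
```

Checking for missing columns before predicting avoids silently feeding a pre-trained model a differently shaped input.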

This IPython notebook has a command line version and virtual environment inside the datanode repository for making new predictions, and it can run inside or outside of docker across a distributed, scalable environment.

https://github.com/jay-johnson/sci-pype/blob/master/bins/ml/predictors/rl-predict-from-cache-iris-regressor.py

4. Import and Deploy—Administration Tool

This IPython notebook has a command line version and virtual environment inside the datanode repository for importing pre-trained machine learning artifacts from S3 and deploying them to the configured Redis Labs Cloud instance. It is a handy administration tool when you want to deploy specific model versions and benchmark accuracies with different machine learning models. It can run inside or outside of docker across a distributed, scalable environment.

https://github.com/jay-johnson/sci-pype/blob/master/bins/ml/importers/rl_import_iris_regressor.py

5. Extract and Archive—Administration Tool

View in container: http://localhost:8888/notebooks/examples/ML-IRIS-Redis-Labs-Extract-From-Cache.ipynb

View on GitHub: Extract and archive models from the Redis Labs Cloud and upload them to S3 as a serialized, compressed artifact

Similar to the Importer tool, this creates the model and analysis artifact file by building a dictionary that is compressed and pickle serialized before uploading to the configurable S3 bucket.

This IPython notebook has a command line version and virtual environment inside the datanode repository for extracting and archiving pre-trained models from the Redis Labs Cloud instance as an uploaded artifact to S3. It is a handy administration tool when you want to extract the models or share a model test run with another teammate.

https://github.com/jay-johnson/sci-pype/blob/master/bins/ml/extractors/rl_extract_and_upload_iris_regressor.py

6. Stop the Container

To stop the container, run:

https://github.com/jay-johnson/sci-pype/blob/master/rl-stop.sh

Next Steps

One of the advantages of using an API layer for your machine learning pipeline is that you can create aggregated predictions from these types of standalone models and combine them with other models, including neural networks, from well-supported frameworks like Tensorflow, MXNet, or Theano to improve your predictions. Each of these frameworks is already included in the latest sci-pype docker container and local python virtual environment. Additionally, this repository contains a Confluent Kafka broker and python client, which I will be blogging about in a future post related to Amazon’s Echo with Alexa and IoT.

By powering your machine learning data store on Redis Labs, your organization can quickly start treating machine learning models as artifacts that are moved between environments. Being able to deploy a private, on-premise enterprise cluster or run right out of the Redis Cloud makes this hybrid caching backbone a solution that can grow with your organization. If your organization would like help setting up your own machine learning data store on Redis Labs, designing and implementing a scalable AI or machine learning API, building a machine learning artifact pipeline, or building a scalable intelligent cloud to optimize resources, reach out to us at Levvel, and we’d be happy to get you started.

Jay Johnson

Principal Consultant

IT Professional with 10+ years of experience in architecture, design and implementation of large distributed, real-time systems across a variety of environments. Focused on executing aggressive timelines by leveraging my expertise in technology, process, and best practices.

GitHub Portfolio: https://github.com/jay-johnson