Machine Learning Part 1: Define, Benchmark, Deploy

Working with clients gives us a great opportunity to apply new technologies to solve real problems. Combine opportunity with a love for the bleeding edge, and it makes for some engaging dialogs on how to make a business more effective. Today’s article is the first part in a series discussing how to use machine learning.

How do I get started with Machine Learning?

To get started in machine learning you need to know:

  1. What do you want to predict?
  2. How good is your data for making predictions?
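The two questions above can be made concrete with a small sketch. Using hypothetical campaign records, it picks a label to predict and then computes a majority-class baseline, which is a quick sanity check on how predictive the data needs to be before any model is worth building:

```python
from collections import Counter

# Hypothetical historical records: did a campaign email convert?
records = [
    {"opens": 3, "clicks": 1, "converted": 1},
    {"opens": 0, "clicks": 0, "converted": 0},
    {"opens": 5, "clicks": 2, "converted": 1},
    {"opens": 1, "clicks": 0, "converted": 0},
]

# Question 1: what do you want to predict? Here, "converted".
labels = [r["converted"] for r in records]
features = [[r["opens"], r["clicks"]] for r in records]

# Question 2: how good is your data? Start with the majority-class
# baseline -- any model you build must beat this accuracy to add value.
majority_count = Counter(labels).most_common(1)[0][1]
baseline_accuracy = majority_count / len(labels)
print(baseline_accuracy)  # 0.5 here: the classes are balanced
```

Any model that cannot beat this baseline on held-out data is not yet extracting signal from the dataset.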

Data science used to come with a steep learning curve and expensive expertise to build the systems and models that make predictions. Those approaches often left teams trying to support a convolutional neural network (CNN), recurrent neural network (RNN), or other machine learning algorithms on their own. Building, supporting, and re-validating model accuracy with an in-house solution is an expensive venture.

It’s an Exciting Time to Jump Into Machine Learning

With the emergence of winning open source algorithms and frameworks, you no longer need a PhD in statistics to be effective. Popular frameworks like TensorFlow give DevOps engineers a scalable foundation for quickly building an intelligent application stack. A team can use all FOSS tools or Amazon Machine Learning to start figuring out how to make better predictions. Hypothetically, an organization might want to increase campaign conversions, hit better click-through rates, build a recommendation engine, or use historical data to forecast events such as fraud. These are very different end goals, but at the core, the machine learning tools will mostly be the same. This leaves your team time to focus on understanding how to create a quality, unbiased dataset to improve your product, experience, and platform.

Determining how predictive your data is and what features can be built to help improve prediction accuracy is a journey, not a few sprints.

DevOps for Data Science

DevOps adoption has made software teams more effective by streamlining deployments. Teams looking to increase their data science productivity should leverage the same CI/CD tools to build predictive artifacts from a repository commit webhook, just as software teams build binaries. Doing so lets an organization focus on defining the predictive dataset, benchmarking which algorithms and datasets are effective, and deploying the best of those models across any environment.

DevOps data science tooling is an emerging ecosystem, and we can help your teams build a data science artifact pipeline that is set up to continually adjust, test, and refine predictions. A data science pipeline gives an organization time to focus on refining intelligent features to hit better predictive success rates. By building an effective data science pipeline, your data science team can focus on testing new ideas, not deploying intelligent infrastructure.
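The "build predictive artifacts on commit" idea can be sketched as a CI step that trains a model, benchmarks it against a holdout set, and only publishes the artifact when it beats the currently deployed score. All of the names here (`train_model`, the mean-predictor "model", the error gate) are hypothetical placeholders standing in for a real training job:

```python
import os
import pickle
import tempfile

def train_model(data):
    # Placeholder "model": predict the mean of the training labels.
    mean = sum(data) / len(data)
    return {"predict": mean}

def evaluate(model, holdout):
    # Mean absolute error against a holdout set -- lower is better.
    return sum(abs(model["predict"] - y) for y in holdout) / len(holdout)

def ci_build(train_data, holdout, current_best_error, artifact_dir):
    # The CI job a commit webhook would trigger: train, benchmark, gate.
    model = train_model(train_data)
    error = evaluate(model, holdout)
    if error < current_best_error:           # benchmark gate
        path = os.path.join(artifact_dir, "model.pkl")
        with open(path, "wb") as f:
            pickle.dump(model, f)            # the deployable artifact
        return {"published": True, "error": error, "path": path}
    return {"published": False, "error": error, "path": None}

with tempfile.TemporaryDirectory() as d:
    result = ci_build([1.0, 2.0, 3.0], [2.0, 2.5],
                      current_best_error=1.0, artifact_dir=d)
    print(result["published"], round(result["error"], 2))
```

The gate is the important part: a pipeline that only promotes models that beat the current benchmark keeps regressions out of production automatically.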

What Kind of Gains Can We Get from Machine Learning?

A recent study found a general trend that large companies using machine learning are targeting higher sales growth and, more importantly, understand their data better than before. Like most things, an organization will get out what it puts into machine learning. Building and supporting a predictive model is an ongoing pursuit: the more you see how data can be used to predict an event, the more you will want to build a better mousetrap. Tools like eXtreme gradient boosting (XGB), TensorFlow, and MXNet are great starting points for teams looking to dive into the FOSS ecosystem.

Building that first predictor model is no harder than building a compiled binary, but initial predictive accuracy will likely be pretty low. To increase predictive success, an organization can leverage feature engineering or use deep learning to find hidden relationships in the data. Feature engineering is a science and an art I will reserve for a separate article, but for now, think of it as building a component signal for all or part of a prediction. Feature engineering can also work against you by introducing overfitting and bias, which is why it is important to choose tools that enable model evaluation for pruning out biased features. The gist of all this is: if you can identify signals more accurately, you can make better predictions.
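To make "building a component signal" concrete, here is a toy illustration on hypothetical data: click-through rate is a classic engineered feature, often more informative to a model than the raw counts it is derived from:

```python
# Hypothetical raw rows: counts collected from a campaign.
raw_rows = [
    {"impressions": 100, "clicks": 5},
    {"impressions": 40,  "clicks": 4},
    {"impressions": 0,   "clicks": 0},   # guard against divide-by-zero
]

def add_ctr_feature(row):
    # Engineered feature: click-through rate as a component signal.
    ctr = row["clicks"] / row["impressions"] if row["impressions"] else 0.0
    return {**row, "ctr": ctr}

engineered = [add_ctr_feature(r) for r in raw_rows]
print([round(r["ctr"], 2) for r in engineered])  # [0.05, 0.1, 0.0]
```

Even a feature this simple should be validated on held-out data before it ships, which is exactly where the overfitting and bias concerns above come in.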

If your organization has enough data, you can leverage deep learning to help find hidden relationships. If not, you can use algorithms like XGB, whose native gradient boosting techniques reduce error and rank features by importance. I have found XGB to be a great starting point for machine learning. Under the hood, it is a highly tunable algorithm that supports running in parallel. Turn a few dials, and XGB can quickly build large, trained data models for making predictions.
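XGB derives its importance ranking from how much each feature reduces error across the boosted trees. As a stdlib-only stand-in for that idea, this sketch ranks hypothetical features by their absolute correlation with the target; real importances would come from the xgboost library itself:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Columns: feature name -> values; target is the label to predict.
features = {
    "clicks": [1, 0, 2, 0, 3],
    "opens":  [3, 1, 4, 2, 5],
    "noise":  [7, 7, 7, 7, 7],   # constant -> no predictive signal
}
target = [1, 0, 1, 0, 1]

ranked = sorted(features,
                key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
print(ranked)  # most predictive feature first; "noise" ranks last
```

A ranking like this is a cheap first pass at spotting dead features; XGB's gain-based importances do the same job with far more nuance, since they account for feature interactions inside the trees.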

Choosing the right tools, frameworks, and pipelines will enable your organization to start small and scale into larger problems requiring more data and more processing power (even running on GPUs). These tools, combined with cloud-managed services, are making this discussion easier month by month.

Data is the New Gold

The more data you can collect, the more you have to make predictions with. Customizable, competition-winning algorithms like XGB are great starting places, but at the core, all machine learning models need quality predictive data to train and learn on. Predictive data comes in many forms, and if your organization is concerned about having limited data, you can build tunable features that help carve out better success rates while guarding against overfitting and bias.

If your organization needs help with machine learning, building model pipelines, distributed model caching, building a remote data science store, or feature engineering, please reach out to us at Levvel, and we’d be happy to get you started.

Until next time,


Additional Reading

  1. Artificial Intelligence Trends
  2. Distributed Machine Learning Community
  3. eXtreme Gradient Boosting - XGB
  4. Nvidia’s Deep Learning Core Concepts
  5. Kaggle

Jay Johnson

Principal Consultant

IT professional with 10+ years of experience in the architecture, design, and implementation of large distributed, real-time systems across a variety of environments. Focused on executing aggressive timelines by leveraging expertise in technology, process, and best practices.
