What is a Data Lake? A Primer on Big Data Storage

February 7, 2020

Introduction to Data Lakes

Before your big-brained data scientists can wring value out of your reams of data, that data has to be accessible and, on some basic level, coherently arranged. And while data architecture is certainly something any data scientist should at least be familiar with, few would consider it among the tasks that drew them to the profession.

To harness all that brainpower, and keep your data scientists from running elsewhere, you need to keep the data wrangling to a minimum. Enter the data lake, the catch-all buzz phrase we love to bandy about when talking about data that isn’t necessarily ready for prime time but will someday come in handy.

This is not to say that information stored in a data lake cannot be vital to success. But with the advent of limitless storage, the question of whether or not to retain data is less about it being vital now and more about making sure the data are accessible, properly secured, ready for analysis, and consistently defined.

Durable Storage and Data Definition

The primary challenge facing any organization looking to stand up a data lake is where and how to store the data. All major cloud providers offer the basics of a data lake:

  • Virtually limitless object storage
  • High durability and availability
  • Flexible, fine-grained access control

These three features should be considered the baseline requirement when you keep more data for longer, especially data that may contain information sensitive to your business or customers.

But now that these features are commonplace, cloud practitioners want their data automatically categorized and immediately consumable by a wide variety of products. At Levvel, we frequently use AWS Glue to generate schemas for paths in S3 buckets. The result is not only a handy way to stitch together multiple AWS products but also a Hive-compatible metastore that external tools can reference. Google and Azure have related offerings in “Cloud Composer” (managed workflow orchestration) and “Data Catalog” (metadata discovery), respectively.
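As a rough illustration of pointing Glue at an S3 path, the sketch below builds the request for a crawler and, separately, submits it with boto3. The crawler name, role ARN, database name, and bucket path are all hypothetical placeholders, and the submit step assumes boto3 is installed and AWS credentials are configured. It is a minimal sketch, not a description of any particular production setup.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the Glue CreateCrawler API call."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must look like s3://bucket/prefix/")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # Each S3 target is a path the crawler scans to infer a schema.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start its first run (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical names, for illustration only.
request = build_crawler_request(
    name="clickstream-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_data_lake",
    s3_path="s3://example-data-lake/clickstream/",
)
```

Calling `create_and_start_crawler(request)` would register the crawler and run it once; the inferred tables then appear in the named catalog database.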

Data Processing

Data lakes are organized like a file system. Paths determine the data set and may be used to partition data, as well. When coupled with a Hive metastore and a Hadoop cluster, you can execute traditional interactive queries or batch jobs on your data lake. If better performance is needed, a data warehouse such as Amazon Redshift can quickly ingest data directly from S3.
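To make the path-as-partition idea concrete, here is a small sketch of the common Hive-style `key=value` layout. The bucket name, data set name, and partition keys are assumptions chosen for illustration; your own layout may differ.

```python
from urllib.parse import urlparse


def partition_key(dataset, **partitions):
    """Build a Hive-style key prefix, e.g. app_logs/year=2020/month=02/."""
    parts = [dataset.strip("/")]
    parts += [f"{key}={value}" for key, value in partitions.items()]
    return "/".join(parts) + "/"


def parse_partitions(s3_uri):
    """Extract key=value path segments from an S3 URI into a dict."""
    path = urlparse(s3_uri).path
    pairs = (seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    return {key: value for key, value in pairs}


# Hypothetical bucket and data set names.
prefix = partition_key("app_logs", year="2020", month="02", day="07")
uri = f"s3://example-data-lake/{prefix}part-0000.json.gz"

print(prefix)                 # app_logs/year=2020/month=02/day=07/
print(parse_partitions(uri))  # {'year': '2020', 'month': '02', 'day': '07'}
```

Query engines that understand this layout can skip whole prefixes when a query filters on a partition key, which is the main reason to encode partitions in the path.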

Serverless Options

Whatever process is generating this data is probably not thinking too hard about making life easy for data architects or data scientists. Whether the data comes from IoT devices or application logs, it makes sense to send it through a buffer to collect a reasonably sized block, perform some transformations or enhancements, and then write it to its final destination in the data lake. Amazon Kinesis Data Firehose is a common tool for the job: it can buffer 1 to 128 megabytes per delivery to S3 and invoke a serverless function to augment the data along the way. Choosing your preferred compression format is as simple as checking a box. Learn more about serverless architectures with our webinar.
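The serverless augmentation step might look something like the sketch below: an AWS Lambda handler following the Firehose data-transformation contract, where each incoming record carries a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result`, and re-encoded `data`. The enrichment itself (adding a `source` field) and the assumption that records are JSON objects are hypothetical.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation handler: enrich each JSON record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["source"] = "example-app"  # hypothetical enrichment
            # Newline-delimit records so downstream readers can split them.
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, TypeError):
            # Not valid JSON, or not a JSON object: hand it back unchanged
            # and flagged so Firehose can route it to its error output.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Marking bad records as `ProcessingFailed` rather than raising keeps one malformed message from failing the whole batch.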

Unsure if your business can benefit from a data lake? See our checklist to help you make that determination.

Authored By

Ben Hunter

Senior Cloud Consultant

Ben is a data scientist and AWS Certified Solutions Architect and Developer. As an analyst and data scientist, he has worked in the retail, banking, and automotive industries in consulting and practitioner capacities. In his work as a cloud consultant, he has advised Fortune 50 banks, written a Python library for multiple-account management, and created big-data and machine-learning pipelines for nationally recognized media brands. He holds an M.S. in Economics and lives in New York City.