Why AWS S3 Is the Best Fit for a Data Lake

What is a data lake, and why should your organisation consider creating one? At its core, a data lake is a way to monetise and derive economic value from your data: a central repository where you store all of your data, structured or not, and run analytics against it.


For example, Fortnite from Epic Games has been a wildly successful game that scaled incredibly quickly (to over 125 million players). Epic accomplished this through an engagement model in which they monitor interaction with the game in near real time, run analytics on that data, and continually customise the game to offer a better player experience. The result is a real-time feedback loop that makes the game extremely responsive to player behaviour.

Example AWS data-lake architecture for the gaming industry
AWS can serve as the data lake that hosts the gaming platform and performs the analytics required to keep players engaged.


AWS S3 can store the telemetry data from a large number of gamers. A near-real-time pipeline (built around Spark and DynamoDB) analyses the streaming data as it arrives, while the same data also lands in batch pipelines (built around S3, EMR, etc.) that can be used later for more in-depth analyses, machine learning models, and so on. This gives Epic both a real-time engagement engine and far deeper insights over time, allowing them to optimise the game and add responsive new features.
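Here is a minimal sketch of the two write paths, using boto3 in Python. The bucket and table names and the telemetry schema are hypothetical, for illustration only, not Epic's actual setup.

```python
import json
import time

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

# Hypothetical names for this sketch.
RAW_BUCKET = "example-game-telemetry-raw"
REALTIME_TABLE = dynamodb.Table("example-player-session-stats")

def record_telemetry(event: dict) -> None:
    """Write one telemetry event to both the batch and real-time paths."""
    # Batch path: land the raw event in S3, partitioned by date,
    # so EMR/Spark jobs can process it later.
    key = f"telemetry/dt={time.strftime('%Y-%m-%d')}/{event['event_id']}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(event))

    # Real-time path: update a per-player aggregate in DynamoDB
    # for low-latency lookups by the engagement engine.
    REALTIME_TABLE.update_item(
        Key={"player_id": event["player_id"]},
        UpdateExpression="ADD session_events :one",
        ExpressionAttributeValues={":one": 1},
    )
```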

Data-Lake as a Journey…
When you begin to consider what a data lake means for your organisation, you embark on a journey. Whether you're just starting out and have never done any serious analytics, are used to doing basic business intelligence, or are already an inventive practitioner of analytics, there is almost always room for progress and evolution.

One of the fundamental concepts of data-lake architecture is that you must be able to evolve around your data as your skills grow and, more importantly, be able to innovate and experiment with your data in a non-disruptive manner: determine whether a new algorithm or processing tool adds value, and then rapidly scale it into production. This requires your tools and procedures to adapt to the data, rather than the other way around.

Therefore, when we construct data-lake architectures, we want to ensure they can evolve at whatever rate of change and innovation is optimal for your organisation.

Why Amazon Web Services for Big Data Analytics and Data Lakes?
A major component of this is agility. You want to innovate as quickly as possible without being hindered by the infrastructure, tools, or platform you're using to drive that innovation. AWS is therefore primarily focused on providing a platform where you can:

  • As the need arises, implement new features, services, and capabilities in an extremely agile manner.
  • Try things out, fail quickly, and move on to the next idea at very little expense; or, if the experiment is successful, scale it rapidly, since scaling is the second pillar of a data lake.

Agility & Scalability:
AWS's platforms and tools for constructing data lakes are intrinsically scalable to hundreds of petabytes, and even exabytes, of data.

Capabilities:
Because your use cases will be unique, you need a comprehensive set of capabilities you can apply to the data to extract value. As you progress along the data-lake journey, your abilities will grow and you'll want to do new things and gain new insights. You'll therefore need a portfolio that doesn't limit you, so that you can find the proper tool for the task, whether it's an AWS native service or a partner offering.

Cost:
Cost is another factor on which we must stay intently focused. If you're successful, your data volumes will grow beyond your wildest dreams. When you're just starting out, your budget is unlikely to grow as quickly, so costs must stay optimised: the cost of the infrastructure must not balloon as your demand increases.

Migration:
Your environment is unlikely to be AWS-only. You will have legacy equipment, on-premises data sources, or third-party data sources. Therefore, data transfer into and integration with the data lake must be simple.

Faster Insights:
Faster insights give you a competitive advantage: accelerated time to market and an enhanced capacity to deliver new services. When designing a data lake, one of the essential pillars to emphasise is the expedited acquisition of insights.

So how do we define the data-lake at AWS?
Since the earliest days of Hadoop, when a data lake was essentially HDFS, the term has had several meanings.


However, we take a broader perspective, one that encompasses both relational and non-relational data. Earlier data lakes consisted solely of structured and semi-structured data from a number of sources, but as we examine more and more novel use cases, we see an increase in unstructured data types, such as video, radar, and LiDAR data.

It's not just about Hadoop or data warehousing; it's about a wide array of tools that can go into the data and do precisely what you want with it.

Data-Lake on AWS
So, bringing this down further, what does a data lake on AWS look like? S3 is its core foundational component.


1. Data Ingestion
So the first thing you have to do is get your data into S3.


AWS provides a multitude of data ingestion options to facilitate this. Amazon Kinesis is a suite of services for ingesting streaming data, such as log data and streaming video. In addition, Kinesis Data Analytics lets you analyse data as it streams in and act on it before it reaches the data lake.
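A minimal sketch of streaming ingestion with Kinesis via boto3 appears below; the stream name and record shape are assumptions for illustration.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def ingest_log_event(event: dict) -> None:
    """Put one log event onto a Kinesis data stream."""
    kinesis.put_record(
        StreamName="example-telemetry-stream",  # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        # Partition by player so each player's events stay ordered
        # within a single shard.
        PartitionKey=str(event["player_id"]),
    )

ingest_log_event({"player_id": 42, "action": "match_start"})
```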

AWS provides the Database Migration Service (DMS) for bringing relational data from on-premises or cloud-based relational databases into the data lake.

AWS Storage Gateway can be used to integrate and migrate on-premises lab equipment that does not necessarily speak object storage or an analytics interface, but is accustomed to communicating with a file device, to the cloud. Lastly, you may already have an on-premises Hadoop cluster or a data warehouse; you can configure AWS Direct Connect to provide a dedicated network link between your on-premises installations and AWS.

You may have accumulated a great deal of data in on-premises storage devices and wish to transfer it to the data lake, but keeping these two worlds in sync is challenging. AWS built DataSync to facilitate this. It is a high-performance agent that you install on your existing on-premises storage, and it automatically transfers and synchronises data with AWS. It is simple to use, fast, and keeps the on-premises locations where you stage data synchronised with your AWS data lake. A minimal sketch of driving DataSync from code follows.
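This sketch kicks off a DataSync transfer with boto3, assuming the source (on-premises share) and destination (S3) locations have already been created in DataSync; the ARNs are placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# Placeholder ARNs for a pre-created source/destination location pair.
SOURCE_ARN = "arn:aws:datasync:us-east-1:123456789012:location/loc-source"
DEST_ARN = "arn:aws:datasync:us-east-1:123456789012:location/loc-dest"

# Define a task that copies from the on-premises share to S3.
task = datasync.create_task(
    SourceLocationArn=SOURCE_ARN,
    DestinationLocationArn=DEST_ARN,
    Name="example-lab-to-datalake",
)

# Run the task once; in practice you would put it on a schedule.
datasync.start_task_execution(TaskArn=task["TaskArn"])
```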

Data ingestion is essential for making your data actionable, and you must select the appropriate method for each type of data.

2. Catalogue
The second component is central to the construction of a data lake: without a data catalogue, you have only a storage platform, not a true data lake. To get insights from your data, you need to know what you have, what sort of data it is, what metadata is connected with it, and how various data sets relate to one another. This is where AWS Glue comes in: with its rich and adaptable data catalogue, you can quickly crawl data, classify it, catalogue it, and make it queryable.
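For instance, a Glue crawler can scan an S3 prefix and infer table schemas into the catalogue. In this boto3 sketch the crawler name, IAM role, database, and S3 path are all assumptions.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans raw telemetry in S3 and registers the
# inferred table schemas in the Glue Data Catalog.
glue.create_crawler(
    Name="example-telemetry-crawler",
    Role="arn:aws:iam::123456789012:role/example-glue-role",  # placeholder
    DatabaseName="example_datalake",
    Targets={"S3Targets": [{"Path": "s3://example-game-telemetry-raw/telemetry/"}]},
)

# Run it; the discovered tables become queryable from Athena, EMR,
# Redshift Spectrum, and other catalogue-aware tools.
glue.start_crawler(Name="example-telemetry-crawler")
```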

3. User Interface
After analysing the data and deriving insights from it, you must be able to deliver those findings to a wide range of consumers. You can do this directly, using analytic tools that speak SQL natively, or by putting API gateways in front of the lake and establishing a data consumption mechanism similar to a shopping cart. AWS products such as API Gateway, Cognito, and AppSync can help you construct user interfaces on top of your data lake.

4. Security
Managing security and governance is a fundamental component as well. A data lake would not be usable if it were not secure, because a data lake is ultimately about combining several data silos to get more insights. And bringing all data and all users onto a single platform is much easier to secure than a large number of separate silos. AWS provides a vast array of security and management capabilities, which we will explore in further detail, to help you do this in a safe, resilient, and granular manner.
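As a minimal hardening sketch for a data-lake bucket via boto3, you might block public access and enforce default encryption; the bucket name is a placeholder.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-game-telemetry-raw"  # hypothetical bucket

# Block all forms of public access at the bucket level.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Encrypt all new objects at rest with SSE-KMS by default.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)
```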

5. Analytics
In the end, a data lake is all about extracting value from your data, and that comes down to the analytical tools you employ. AWS has a multitude of native tools for querying data in place, such as Athena, Redshift Spectrum, and SageMaker, as well as a multitude of highly performant and scalable partner tools for applications such as Spark or data warehousing.
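Querying in place can be as simple as this Athena sketch via boto3, which assumes the Glue crawler above has already catalogued a telemetry table; the database and results bucket are placeholders.

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT player_id, COUNT(*) AS events
        FROM telemetry
        GROUP BY player_id
        ORDER BY events DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "example_datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Athena runs asynchronously: poll get_query_execution() with this ID,
# then fetch rows with get_query_results().
print(response["QueryExecutionId"])
```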

AWS S3: The Best Place for a Data Lake
  • AWS S3 was designed for 99.999999999% (11 9s) durability and high availability. It is the second-oldest service offered by Amazon (about 13 years old) and operates at vast scale, storing exabytes of data and trillions of objects.
  • Security is one of S3's foundational components, so a large range of security, compliance, and auditing capabilities come naturally to it. As your data lake grows, you may wish to govern it at the level of each individual object, whether to provide extremely granular security and access controls or to implement intelligent data-management strategies that optimise costs (see the lifecycle sketch after this list).
  • You will also need business insights into your data, which are distinct from analytic insights: analysing how your data is used by various consumers, for example in order to charge them accordingly.
  • And lastly, capabilities for ingesting data. Before you can do anything with the data, you must first get it in, and there are more ways to import data into AWS S3 than into virtually any other platform.
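As a sketch of the cost-optimisation idea above, an S3 lifecycle rule (applied via boto3) can tier ageing telemetry into cheaper storage classes; the bucket name, prefix, and transition days are assumptions.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-game-telemetry-raw",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-telemetry",
                "Status": "Enabled",
                "Filter": {"Prefix": "telemetry/"},
                "Transitions": [
                    # Rarely-read data moves to Infrequent Access after 30 days.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # Archival data moves to Glacier after a year.
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```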
Additional References
“Cloud Object Storage – Amazon S3 – Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/s3. Accessed 21 May 2022.

“Amazon Athena - Serverless Interactive Query Service - Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/athena. Accessed 21 May 2022.

“Fast NoSQL Key-Value Database – Amazon DynamoDB – Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/dynamodb. Accessed 21 May 2022.

“Managed Open-Source Elasticsearch and OpenSearch Search and Log Analytics – Amazon OpenSearch Service – Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/elasticsearch-service. Accessed 21 May 2022.

“Amazon Kinesis - Process & Analyze Streaming Data - Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/kinesis. Accessed 21 May 2022.

“AWS Database Migration Service - Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/dms. Accessed 21 May 2022.

“AWS Storage Gateway | Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/storagegateway. Accessed 21 May 2022.

“Amazon QuickSight - Business Intelligence Service - Amazon Web Services.” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/quicksight. Accessed 21 May 2022.

“Amazon Cognito - Simple and Secure User Sign Up & Sign In | Amazon Web Services (AWS).” Amazon Web Services, Inc., aws.amazon.com, aws.amazon.com/cognito. Accessed 21 May 2022.