It’s no surprise that companies, including many of my customers, are hearing about the many benefits of a Data Lake. Amid the recent hype cycles in data management, the Data Lake has proven its worth.
So, a good colleague of mine, William Cairns, and I had a discussion a while back about how important the R.O.A.D. is in any Big Data journey, such as a Data Lake. Now, when I say R.O.A.D. I’m not talking about the asphalt surface you’re driving down; I’m talking about the Resources, Objectives, Architecture, and Data. Four major points that will drive and help you on your successful Data Lake journey.
Objectives
This is a no-brainer, but I see a lot of customers wanting the world. As you’ve probably heard many times before, “You have to crawl before you walk.” With that said, I always tell my clients to focus on two or three use cases. I prefer three, and I recommend to my clients that they be broken down as follows:
- Easy and quick win to show ROI for the business and IT
- A champion business unit use case: one selected for a specific business unit that can champion your Data Lake solution
- A larger use case that could support multiple lines of business
Adding any more than three use cases will cause more problems for you than results, and could potentially cause your Data Lake solution to fail.
Architecture
Big Data is constantly evolving for the better. In this post I won’t address Data Lake architecture best practices, as I will cover that in another post. But architecture is important, and there are many questions you will need to ask yourself as you design your Data Lake: Which Hadoop distribution do you want to use? Should it be on-premises or in the cloud? If you go on-premises, what type of hardware will you need? Are you looking to leverage more open source, or more enterprise Big Data solutions for your Data Lake? What about security? Or do you use something as simple as an AWS or MS Azure Data Lake solution and quickly spin up a cloud instance?
You have many options, and if you are new to Data Lake architecture, your best bet is to use the two or three use cases as your starting point. From there, work with an expert or your partner to guide you through the architectural design. In addition, start small: no need to buy the BMW when you could start off with a Ford. For example, the company I work for, NGDATA, offers a 25-day Data Lake quick start on AWS for one use case: Data Lake Foundation on AWS with Informatica – NGDATA Consulting Offer.
Resources
Know that the Data Lake you create will need to be maintained and owned by the resources you decide to invest in. You have several options:
- Invest in training so your existing resources can maintain, govern, and own the Data Lake solution that has been created
- Hire a Big Data consulting firm to guide and support you on your Data Lake journey while teaching your internal team how to take full ownership, so that at some point your team feels confident taking over
- Hire an internal Big Data resource who will also train your existing resources and maintain your Data Lake
- Outsource your Data Lake management to a Big Data consulting firm
Now, regardless of the option you choose, know that there are costs. So again, ask yourself which option is best for your company and which will deliver the best ROI. Many of my customers like to go with Options #2 and #3. But I want to point out that Option #3 may not be cost-effective depending on your location. For example, I have never been to Anchorage, Alaska, so I could be wrong, but you won’t find as large a pool of resources there as you would in, say, NYC, and because of this that resource can cost more than Option #4. Also, regarding Option #1: if you go with more open source solutions, know that Big Data tools evolve quickly and you may need to train your resources continually. In addition, some employees resist change, which may result in turnover.
The possibilities are endless, but this needs to be addressed as you start your Data Lake journey. In addition, I always recommend that my midsize and larger clients look at enterprise Big Data solutions when architecting their Data Lake. Why, you ask? For two reasons: first, you make fewer investments in training, and second, you have a larger resource pool to choose from in the future. This way you don’t have to worry about teaching your team Scala today and Pig tomorrow, or finding a resource who knows both.
Data
This is the part many people say is easy, but it isn’t always. Now that you have narrowed your scope to two or three use cases, the data should be easy, right? Well, yes and no. I won’t get into the data quality piece here, or how much data should be loaded first. But when you define your use cases, try to home in on the data as well, and revisit your data strategy: this ties in with your architecture and resources. Understanding the data will be important as you show the business the benefits of the Data Lake.
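One practical way to start understanding the data is to profile each candidate source feed before it lands in the lake: row counts, empty cells, and distinct values per column tell you a lot about whether a feed is ready for your use case. Here is a minimal sketch in Python; the function name and the CSV input format are my own illustrative assumptions, not a prescribed tool.

```python
import csv
from collections import Counter


def profile_csv(path):
    """Profile a source feed: row count, empty-cell counts, and
    distinct-value counts per column. Illustrative sketch only --
    real feeds may need encoding, delimiter, and type handling."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = 0
        empties = Counter()  # empty cells seen per column
        distinct = {name: set() for name in reader.fieldnames}
        for row in reader:
            rows += 1
            for col, val in row.items():
                if val is None or val.strip() == "":
                    empties[col] += 1
                else:
                    distinct[col].add(val)
        return {
            "rows": rows,
            "empty_cells": dict(empties),
            "distinct_counts": {c: len(v) for c, v in distinct.items()},
        }
```

Running something like this against each feed for your two or three use cases gives you an early, cheap read on data quality before you invest in ingestion.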
I truly recommend focusing on these four points; hopefully they will help you rethink your approach and successfully implement a Data Lake solution rather than some Data Swamp or Data Dump as you start your journey. If you would like to learn more about Data Lake solutions, I recommend checking out my partner Dan Rezac from Informatica and my colleagues Gil Rosen and William Cairns in: Webinar Replay | Start Your Data Lake Right: Leveraging AWS and Informatica to Drive Customer Engagement.