Is a Data Lake in Our Future?

Data lakes are all the rage in IT right now, but can they really transform the tactics and strategy of the business?

千里之行，始於足下: “A journey of a thousand miles begins with a single step.”

At the end of the last century, business and IT formed a bit of a difficult partnership. We termed it Business Intelligence or Business Analytics and connected it to IT with something called a data warehouse strategy. The business would generate or spot potentially lucrative opportunities and, with the help of IT, determine what data would be needed to support decision-making for that opportunity. The IT team would then determine what systems generated the class(es) of data needed to feed this beast, collect it, clean and normalize it, organize it, and then build a data platform that could report on it, so its correlations and indications could be leveraged by the business in decision-making.

In hindsight, this historic process represented a very myopic, single-use-case effort with very limited downstream flexibility in what data got collected and how it was manipulated or used. It required a very heavy lift to produce the needed result. The business had little data back then, so the method seemed to work well, but only because we activated that partnership in a limited fashion due to its expense and complexity. The result: a few highly tactical, extremely expensive analytics solutions to help drive the business.

Our problem in the 21st century is that the data, in a sense, is now self-spawning. Data is everywhere and covers everything. Data no longer needs a specific business decision to identify it prior to its generation, collection, cleaning, or correlation. Unfortunately, without a historic “proper use-case,” this new self-spawning data is soon lost as “exhaust data” from the business or socially associated entities. Exhaust data is generated by transaction systems whose need to retain certain types of data is limited to the time it takes to actually perform a transaction: for example, mistakenly ordering a drug for a specific patient, but then not completing the order because something changed in the needs of the patient. Like any exhaust, it is simply vented and forgotten. Given the previous set of solution platforms, there is simply too much data being generated at once to do anything meaningful with it, because we cannot identify, structure, collect, organize, and determine its value before it is purged. If we could capture and leverage the proper data along with what we exhausted, it would paint a much clearer picture of our environment and make prediction possible rather than guesswork.
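To make “exhaust data” concrete, here is a minimal sketch in Python. The `vent` helper and the order workflow are entirely hypothetical, invented for illustration; the point is simply that a transaction system appends its would-be-discarded events to a durable log instead of forgetting them:

```python
import json
import time

# Append-only capture file; in practice this would be a queue or an HDFS sink.
EXHAUST_LOG = "exhaust_events.jsonl"

def vent(event_type, payload):
    """Capture an 'exhaust' event instead of discarding it.

    Hypothetical helper: the transaction system normally drops this data
    the moment the transaction ends; here we append it to a log first.
    """
    record = {"ts": time.time(), "type": event_type, "data": payload}
    with open(EXHAUST_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: an order is started for a patient but never completed.
# The traditional system would simply vent it; we log it first.
vent("order_started",   {"patient_id": "p-1001", "drug": "drug-x"})
vent("order_abandoned", {"patient_id": "p-1001", "drug": "drug-x",
                         "reason": "patient needs changed"})
```

Nothing about the mechanism matters here; what matters is that the abandoned order, which the transaction system no longer needs, survives long enough to be captured.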

You don’t know what you don’t know.

With the business and its associates unknowingly venting all this data, unable to determine whether it had value or what problems it could potentially have solved, we are in a classic landfill situation. We are filling a crater with all the things we perceived as having no value at an exact point in time; downstream, when we are more enlightened and have a different perspective, we’ll wish we still had some of that data. Unlike an actual landfill, which can be mined in the future to recover those lost assets, our virtual landfill is gone forever and can never be reclaimed. If only you’d known its value then… Albert Einstein once said: “We can't solve problems by using the same kind of thinking we used when we created them.”

Enter Hadoop. 

Build a landfill? That’s your advice? Yes! As the Chinese philosopher Laozi (founder of Taoism) once said: “A journey of a thousand miles begins with a single step.” Build a Hadoop cluster by whatever means possible RIGHT NOW. Hardware/software vendors do not matter! Build it now! Don’t let one more day pass without it. Don’t wait until your IT group “GETS IT” or your business folks “UNDERSTAND IT” or the CIO “DEMANDS IT”. Put it in place now; turn on the storage and collection of every part of your exhaust data and its associated sources right now, thus stopping the data leakage. You can all sit down and figure out the data’s value and its impact on your business at a later time.
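What does “turn on the collection right now” actually look like? A minimal sketch, assuming a running Hadoop cluster and the standard `hdfs dfs` shell commands; the paths and file names are illustrative, not a prescription:

```python
import subprocess
from datetime import date

# Hypothetical local capture file and date-partitioned raw zone.
local_file = "exhaust_events.jsonl"
raw_dir = f"/datalake/raw/exhaust/dt={date.today().isoformat()}"

# Standard HDFS shell commands: create the partition directory, then
# land the file exactly as captured. No cleaning, no schema, no value
# judgment at storage time; all of that can happen later.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", raw_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, raw_dir], check=True)
```

That is the whole first step: raw files land in the lake untouched, and every question about their structure or value is deferred.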

Does it have to be Hadoop?

Why can’t I just do this with my existing database investments; why Hadoop? Because traditional databases require you to structure the data before you store the very first byte. This means you have to understand the source data prior to storage or query, and we don’t have time for that. In fact, Hadoop is a recognition that in the 21st century you’ll never again have time to understand the data prior to storage. As we mentioned above, there is no opportunity for that: the velocity of data is too fast, and its structure is too varied and complex. Hadoop is an unstructured data store; like GOOGLE, it does not care about structure at storage time. You can provide structure over just the portion you understand, and expand that structure later as you begin to understand more and more of it. You may even change your mind about its original structure completely in the future, and Hadoop will accommodate that. Unlike a traditional data warehouse, which would require you to purge all data not understood at the time because there is no structure to store it in, Hadoop will still have all that data when you finally get your mind wrapped around it. Take the first step of the thousand-mile journey by putting Hadoop in place right now to stop the data loss. The rest will come as a natural outgrowth of Hadoop and its ability to store and organize a terabyte of data for as little as $240, compared to traditional databases that run between $6,000 and $40,000 a terabyte and require understanding the data prior to its storage.
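Here is what that structure-later flexibility (often called schema-on-read) looks like in practice: a minimal PySpark sketch, reusing the hypothetical exhaust log from earlier, that imposes structure at query time over only the portion we currently understand:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw files were stored with no declared structure. Spark infers a
# schema at read time, and we project only the fields we currently
# understand; everything else stays on disk, untouched.
raw = spark.read.json("hdfs:///datalake/raw/exhaust/")

abandoned = (raw
    .filter(F.col("type") == "order_abandoned")
    .select("ts", "data.patient_id", "data.reason"))

abandoned.show()

# Change your mind later? Re-read the same files with a different
# projection; nothing was purged, because nothing was pre-structured.
```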

Transforming the way the business sees and uses data? 

Force the revolution on your organization by simply demanding the collection of all this data starting next month. They’ll arrive at Hadoop as the platform of choice, because only Hadoop can accomplish this goal in short order. Detractors may claim “we are building a data cesspool” or “we’ll never even know what’s in it.” This is common when IT folks cannot separate, in their minds, the collection of data from the structuring of data, because we were all trained to think that way in the past. In the 21st century we no longer have to. Putting in Hadoop and turning on the data captures for everything in the corporation (transactions, logs, metadata, as well as subscription data like sentiment from feeds out on the internet) is painless and need not be complicated by overthinking the situation. Don’t allow its complexity to become the barrier; just store it. Once the collections stabilize, you’ll find even the traditional data warehouse goes to Hadoop for its primary sources rather than attempting a destabilizing conversation with the source systems themselves. The first step is putting a data lake between the consumers of data and the producers of data, which does not require deep thought about structure or value prior to capture.
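And once the lake sits between producers and consumers, the warehouse feed becomes a read from the raw zone rather than a tap on the source systems. A sketch of that hand-off, again in PySpark with the same hypothetical event layout and illustrative paths:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-to-warehouse").getOrCreate()

# The warehouse no longer interrogates fragile source systems directly;
# its curated feed is derived from the lake's raw zone instead.
raw = spark.read.json("hdfs:///datalake/raw/exhaust/")

curated = (raw
    .filter(F.col("type").isin("order_started", "order_abandoned"))
    .withColumn("event_date", F.to_date(F.col("ts").cast("timestamp")))
    .select("event_date", "type", "data.patient_id"))

# A columnar, partitioned extract the traditional warehouse can load.
(curated.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///datalake/curated/orders/"))
```

The source systems are touched exactly once, at capture; every downstream consumer, the warehouse included, draws from the lake.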

What’s next?

In later posts I’ll describe how to accomplish this complete transformation from an Enterprise Architecture perspective, so the effort has easily identifiable and achievable goals and constraints. With Hadoop, the sum of the whole is truly much greater than its individual parts. As such, I’ll present an iterative approach to displacing the traditional warehouse with 21st-century methods, one which exceeds the demands of the business and remains high-level, keeping us out of the weeds and enabling top-down management of the effort rather than the unfortunate bottom-up experimentation all too common today, which generally leads to overthinking, under-delivery, and what I call DarkHadoop.

GOOGLE is no cesspool!
