
By Jnan Dash | September 18, 2015 11:00 AM EDT

Yesterday I attended a session in Palo Alto on the subject of Data Refinery; the speaker was Will Gorman of Pentaho. I had not realized that Pentaho was acquired by Hitachi Data Systems a couple of months ago. The term “data lake” was coined by James Dixon of Pentaho. I wrote a blog on this subject last year. As soon as the term started to appear in the data lexicon, other interesting terms such as “data swamp” appeared.
The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users from dealing with data acquisition and maintenance issues, and guarantees fast access to local, accurate and updated data without incurring development costs (in terms of time and money) typically associated with structured data warehouses. According to IBM, “However appealing this premise, practically speaking, it is our experience, and that of our customers, that ‘raw’ data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, thus increasing the need to thoroughly describe and curate the data in order to make it consumable”. I completely agree.
During the early days of Data Warehousing, the term ETL covered all the data preparation stages: extracting, transforming, and loading the curated data for query and reporting. I used to jokingly call this the “answer to 25 years of sin”. In my understanding, Pentaho’s SDR (Streamlined Data Refinery) is a modern form of ETL that deals with both internal structured data and external unstructured data, including machine-generated data. In Pentaho’s own words, “The big data stakes are higher than ever before. No longer just about quantifying ‘virtual’ assets like sentiment and preference, analytics are starting to inform how we manage physical assets like inventory, machines and energy. This means companies must turn their focus to the traditional ETL processes that result in safe, clean and trustworthy data. However, for the types of ROI use cases we’re talking about today, this traditional IT process needs to be made fast, easy, highly scalable, cloud-friendly and accessible to business. And this has been a stumbling block – until now. Streamlined Data Refinery, a market-disrupting innovation that effectively brings the power of governed data delivery to ‘the people’ unlocks big data’s full operational potential”.
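For readers who have not seen the three ETL stages spelled out, here is a minimal sketch in Python. It is only an illustration of the general extract, transform, and load idea, not Pentaho's SDR or any vendor's implementation. The sample data, the column names, and the `readings` table are all hypothetical, invented for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input (an assumption for illustration): inconsistent casing,
# stray whitespace, and one missing temperature value.
RAW_CSV = """machine_id,temperature_c
 m-01 ,71.5
M-02,
m-03,68.0
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize identifiers and drop rows that have no reading."""
    cleaned = []
    for row in rows:
        reading = (row["temperature_c"] or "").strip()
        if not reading:
            continue  # skip incomplete records
        cleaned.append((row["machine_id"].strip().upper(), float(reading)))
    return cleaned


def load(rows, conn):
    """Load: write the curated rows into a table that can be queried."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (machine_id TEXT, temperature_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT machine_id, temperature_c FROM readings ORDER BY machine_id"
).fetchall()
print(result)
```

Only the two complete records reach the table; the row with the missing reading is dropped during the transform step. That small amount of curation is what separates a usable table from raw data.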
Earlier I wrote about Data Curation and how new companies such as Tamr are addressing the issue. Pentaho’s SDR is another form of data curation. IBM calls it the Data Wrangling process.
As usual, we love to confuse matters with a variety of terms describing the same thing.
Copyright © 2015 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Jnan Dash
Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB and Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc., before which he spent ten years at Oracle Corporation and was the Group Vice President, Systems Architecture and Technology till 2002. He was responsible for setting Oracle's core database and application server product directions and interacted with customers worldwide in translating future needs to product plans. Before that he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.