Transforming Business Practices With Advanced Big Data Tools

Sanjeet Banerji, Sr. Vice President – Technology, Datamatics Global Services | Friday, 17 February 2017, 04:28 IST

No business can afford to ignore or underplay the technology-driven disruptions that are redefining what it means to be a business in a digital world. Today’s information world is characterized by a sudden, sharp and unprecedented rise in data volumes. According to one estimate, the total amount of data up to 2013 was 4.4 zettabytes, and it will grow to 44 zettabytes by 2020. The volume, velocity, variety, and value of data create new sources of potential knowledge and prescience. But they also bring the challenges of storing, retrieving, managing and analyzing the data, and finally leveraging the resulting knowledge to transform your business into an extremely agile and profitable enterprise. The challenge lies in creating a digital business strategy that spells out how the new sources of data can be harvested to define new business paradigms for the digital era.

Welcome to the world of Big Data

Think of Big Data and one instantly thinks of Hadoop, the decade-old open-source platform that allows large data sets to be crunched across a cluster of computers. Hadoop is a seminal idea that has led to several variants and ignited a vibrant ecosystem, with a plethora of supporting big data tools aimed at making life easier for anyone starting on a big data journey.

Open-source support tools such as Flume and Sqoop can help ingest data to create an extremely cheap data store, and Drill can then extract data for small reporting requirements. Open-source analytics tools can get you started on initial analytics and reports, which you can then build on step by step.

What’s encouraging is that the active open-source communities behind the various big data tools are constantly improving them and rolling out new features. If you are new to Big Data and contemplating it to transform your business, let’s look at some of the open-source components that could get you off to a quick start and lay the foundations for an advanced deployment.

Apache Spark offers a cluster computing framework for machine learning applications and interactive analysis. It also overcomes performance limitations of the earlier MapReduce framework, which distributes programs to run on each node of the cluster. Spark is becoming the preferred system for real-time analysis of streaming data, such as that generated by financial institutions or IoT devices.
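
To make the MapReduce model concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is an illustration only and not the Hadoop or Spark API; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical. On a real cluster, the map step would run in parallel on the nodes that hold each block of data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Assumption: each line stands in for a data block held by a different node.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data tools", "big data lake"]))
```

Classic MapReduce writes the intermediate results of such steps to disk between stages, whereas Spark can keep them in memory, which is one source of the performance difference noted above.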

Apache Zeppelin is a web front end for Spark. Its notebook-style approach lets users discover, explore, and visualize data from Spark applications interactively.

Apache YARN takes on the complexity of managing cluster resources and the applications that run on the cluster’s nodes.

Apache Flink, running on top of YARN, provides a framework for bringing together streaming analytics and historical analytics.

Apache Drill supports an SQL-like query language for efficiently extracting or querying data stored in a Hadoop cluster.

Apache Kylin is an extremely high-performing OLAP engine that is useful for managing massive cubes of data. It provides sub-second query latency in Hadoop for more than 10 billion rows of data, with enough concurrency to support thousands of users and the range to scale out into petabyte territory.

Apache Ranger allows one to set up a security policy covering all the data that enters the Hadoop cluster. This lets users centralize security administration and manage security-related tasks.
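
The idea behind an OLAP cube can be sketched in a few lines: totals are pre-aggregated for every combination of dimensions, so that a query becomes a lookup rather than a scan of the raw rows. The code below is a toy, in-memory illustration under that assumption and is not Kylin's API; build_cube, query and the sample sales rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    ordered = sorted(dimensions)
    cube = defaultdict(float)
    for size in range(len(ordered) + 1):
        for dims in combinations(ordered, size):
            for row in rows:
                values = tuple(row[d] for d in dims)
                cube[(dims, values)] += row[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer a sum query with a single dictionary lookup."""
    dims = tuple(sorted(filters))
    values = tuple(filters[d] for d in dims)
    return cube.get((dims, values), 0.0)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sales = [
        {"region": "east", "year": 2016, "amount": 100.0},
        {"region": "east", "year": 2017, "amount": 150.0},
        {"region": "west", "year": 2017, "amount": 200.0},
    ]
    cube = build_cube(sales, ["region", "year"], "amount")
    print(query(cube, region="east"))
    print(query(cube, region="west", year=2017))
```

The trade-off is that the cube costs time and storage to build up front, in exchange for query times that no longer depend on the number of raw rows.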

What makes Hadoop so exciting is its ability to store structured and unstructured data with equal efficiency. All kinds of enterprise data can be pushed in raw form into a Hadoop store to form what is called a data lake. A data lake can be likened to a library that one visits to seek information on specific topics, finding a subject either by search or through the indices created for the library. A data lake can be created as a big warehouse of data without a specific use in mind but with a known intent. The store can be used and augmented at will as new data sources become known or available, without having to migrate the existing data to a new form.
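
Storing data raw and deciding on its structure only when it is read is often called schema-on-read. A minimal sketch of that idea follows, using a toy in-memory store instead of Hadoop; the DataLake class, the two parser functions and the sample records are all hypothetical.

```python
import json


class DataLake:
    """Toy in-memory lake: keeps raw payloads untouched, tagged by source."""

    def __init__(self):
        self._raw = []

    def ingest(self, source, payload):
        # No schema is enforced on write; the raw text is stored as-is.
        self._raw.append((source, payload))

    def read(self, source, parser):
        # The schema is applied only now, by the parser the reader supplies.
        return [parser(payload) for src, payload in self._raw if src == source]


def parse_order(payload):
    """Read a JSON order, keeping only the fields this report needs."""
    record = json.loads(payload)
    return {"order_id": int(record["id"]), "total": float(record["total"])}


def parse_log(payload):
    """Split a plain-text log line into timestamp, level and message."""
    timestamp, level, message = payload.split(" ", 2)
    return {"timestamp": timestamp, "level": level, "message": message}


if __name__ == "__main__":
    lake = DataLake()
    lake.ingest("orders", '{"id": "7", "total": "19.90", "coupon": "NEW"}')
    lake.ingest("logs", "2017-02-17T04:28 WARN disk almost full")
    print(lake.read("orders", parse_order))
    print(lake.read("logs", parse_log))
```

Because the raw payloads are never rewritten, a new source or a new way of reading an old one only requires a new parser, not a migration of what is already stored.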

Our experience shows that the best way to start is to set up an experimental data warehouse for storing data, query-based ad hoc reporting, and data exploration (Experimental Phase). The next step is to expand this to include more data sources and to perform advanced analytics (Expansion Phase). The final step is to consolidate all that has been done so far and optimize its use for business transformation (Consolidation Phase).