Introduction to big data & Apache Hadoop

This blog post series on big data and Hadoop is my attempt to document my learning, and it is aimed at complete beginners.

 

Before defining big data, let us first understand what a dataset is. A dataset can be defined as a collection of information that is stored together. The next question that comes to mind is: who generates this data?

Given the current boom in technology, data is being generated from a plethora of sources: social media, Google searches, Zoom meetings, retail, transport, and so on. Given the diversity and user base of these sources, data is generated in huge volumes measured in petabytes (1024 terabytes), exabytes, zettabytes, and even yottabytes. Legacy systems like mainframes and relational databases (RDBMS) cannot handle such high volumes of data, as it is unmanageable for their infrastructure.
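To put those units in perspective, here is a tiny Python snippet (purely illustrative, using the same 1024-based units mentioned above) that prints how quickly the scale grows:

```python
# Illustrative only: how the storage units mentioned above scale (1024-based).
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for power, name in enumerate(units, start=1):
    print(f"1 {name} = 1024^{power} bytes = {1024 ** power:,} bytes")
```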

So, in simple terms, big data is still just a dataset, a collection of data, but at a scale that is hard to manage. There are two main challenges with big data: storage and processing.

IBM defines big data using 5 Vs: volume, velocity, variety, veracity, and value.

Let's take an example: Amazon is one of the most popular e-commerce players in the industry, and the data it generates is stored in a data warehouse. The amount of data being added from a data source to the warehouse is called the volume of data.

The rate at which this data is added from the data source to the warehouse is called the velocity of data.
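To make the distinction concrete, here is a quick back-of-the-envelope calculation in Python (the numbers are made up purely for illustration):

```python
# Made-up numbers: volume is how much data arrived, velocity is how fast it arrived.
volume_tb = 750          # total data added to the warehouse in a day, in TB
hours = 24               # the time window over which it arrived
velocity = volume_tb / hours
print(f"Volume: {volume_tb} TB, Velocity: {velocity:.1f} TB/hour")
```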

Data also comes in different forms, and these varieties can be defined as follows (a small example follows the list):

  • Structured data, which is organized and can be stored in the form of tables with rows and columns, e.g. payroll records, Excel tables, CSV files
  • Unstructured data, which doesn't have a predefined structure, e.g. social media posts, audio, video, images
  • Semi-structured data, which is partially organized, e.g. JSON, XML
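Here is a small Python sketch (the records are made up) that contrasts structured data, where every row shares the same columns, with semi-structured data, where fields can vary from record to record:

```python
import csv
import io
import json

# Structured data: fixed columns, like a payroll table exported as CSV.
csv_text = "employee_id,name,salary\n101,Asha,55000\n102,Ravi,62000\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"], rows[0]["salary"])          # every row has the same columns

# Semi-structured data: JSON is partially organized; fields can differ per record.
json_text = '{"employee_id": 103, "name": "Meera", "skills": ["Hadoop", "Hive"]}'
record = json.loads(json_text)
print(record["name"], record.get("salary", "not recorded"))  # missing fields are fine
```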

With the abundance of data comes the need to classify it as authentic or fake. Veracity can be defined as the accuracy of data and the process of finding inconsistencies in it. Of all the data in the world, studies suggest that only about 70% is accurate.

Finally, there is value: if a bulk dataset carries no meaning, then it has no value for your organization.

If a dataset has high volume, high velocity, a lot of variety, high veracity, and value, then IBM calls it big data.

Let's take another example:

For a startup, company X, 500 TB of data causes a strain on its IT infrastructure. The same 500 TB is considered normal for a company like Google, which routinely maintains much larger volumes of data. Hence Google doesn't consider 500 TB to be big data.

Basically, there's no specific number that determines whether data is big data or not: whenever the volume of data causes a strain on your infrastructure, we call it big data. To address big data problems, specifically storage and processing, we turn to the open-source framework called Hadoop.

 

Hadoop is based on a master-slave architecture, as shown in the diagram below:

Now, to draw a parallel with the Hadoop world:
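To make the master-slave idea a little more concrete, here is a toy Python sketch. This is not real Hadoop code; the block size, node names, and placement strategy are all made up for illustration. It shows a master splitting a file into blocks and recording which worker holds each block, which is roughly the role the master (the NameNode) plays for its workers (the DataNodes) in Hadoop's storage layer, HDFS:

```python
# Toy sketch only, not real Hadoop: a "master" splits a file into fixed-size blocks
# and remembers which "slave" node stores each block. In HDFS, the master (NameNode)
# keeps this kind of metadata while the slaves (DataNodes) hold the actual blocks.
BLOCK_SIZE = 16                        # bytes here; real HDFS blocks default to 128 MB
slaves = {"node1": [], "node2": [], "node3": []}

data = b"some large file that is too big to keep on a single machine"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

block_map = {}                         # the master's metadata: block id -> node
for index, block in enumerate(blocks):
    node = list(slaves)[index % len(slaves)]   # naive round-robin placement
    slaves[node].append(block)
    block_map[f"block-{index}"] = node

print(block_map)   # e.g. {'block-0': 'node1', 'block-1': 'node2', ...}
```

The real system adds replication, failure handling, and much more, which is exactly what the next post looks at.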

In the next post, we will focus on how a Hadoop cluster functions and handles failures!




