
Introduction to big data & Apache Hadoop

This blog post series on big data and Hadoop is my attempt to document my learning, and it caters to anyone who is a complete beginner.

Before I define big data, let us understand what a dataset is. A dataset can be defined as a collection of information that is stored together. The next question that comes to mind is: who generates this data? Given the current boom in technology, data is being generated from a plethora of sources such as social media, Google searches, Zoom meetings, retail, transport, and so on. Given the diversity and user base of these sources, data is generated in very high volumes: petabytes (1024 terabytes), exabytes, zettabytes, yottabytes, and beyond. Our legacy systems, such as mainframes and relational databases (RDBMS), cannot handle these volumes of data, as it is unmanageable for their infrastructure.

So, in simple terms, big data can be defined as a dataset that is too large for traditional systems to handle. There are two main goals when working with big data: storage and processing.
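
To get a feel for the scale of these units, here is a small Python sketch (purely illustrative, not part of the Hadoop tooling discussed in this series) that converts each unit back to terabytes, assuming the 1024-based convention mentioned above; the function and variable names are my own.

```python
# Illustrative only: relate the data-size units mentioned above to terabytes,
# using the 1024-based convention (1 petabyte = 1024 terabytes, and so on).
UNITS = ["terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]

def to_terabytes(value: float, unit: str) -> float:
    """Convert a size expressed in `unit` into terabytes."""
    exponent = UNITS.index(unit)       # number of 1024x steps above a terabyte
    return value * (1024 ** exponent)

if __name__ == "__main__":
    for unit in UNITS:
        print(f"1 {unit} = {to_terabytes(1, unit):,.0f} terabytes")
```

Running this prints, for example, that a single yottabyte is over a trillion terabytes, which makes it clear why legacy single-machine systems struggle to store or process data at this scale.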