What is Big Data
With growing internet usage and the availability of high-speed data plans at affordable prices, the smartphone industry has grown rapidly in recent years. So let's understand what Big Data really is.
Through countless mobile apps and desktop applications, we have started creating a huge amount of data that is not in the structured form we were used to managing with structured database solutions.
So the problem arises of how to store this data and process it to get meaningful outcomes. This problem is what we call big data. You can think of it as the name of a problem faced by data scientists.
It can be defined as follows:
“A collection of data sets so huge and complex that they are difficult to store and process using available database management tools or traditional data processing applications.”
The main challenges related to big data are
1. Capturing data
2. Curating data
3. Storing data
4. Searching data
5. Sharing data
6. Transferring data
7. Analyzing data
8. Visualizing data
Data scientists grouped these problems into categories and summarized the major challenges of handling such data as the 5 V's. They are described below.
1. VOLUME: The amount of data, which is growing day by day at a very fast pace.
2. VELOCITY: The pace at which different sources generate data every day. The flow of data is massive and keeps growing continuously.
3. VARIETY: Many different sources contribute to big data, and they generate different categories of data: structured, semi-structured, and even unstructured.
4. VALUE: The value that can be generated from this huge amount of data by segregating or filtering out the valuable part for the benefit of a study or a business. This is where data analysis comes into the picture.
Companies such as Google and Amazon lead the way in taking full advantage of data analysis: they analyze customers' online behavior and present products based on customers' interests in order to grow their customer base.
5. VERACITY: The available data contains many inconsistencies, such as uncertainty, doubt, or incompleteness.
So now we know the problem that needs to be addressed. How can we solve it? This is where Hadoop comes in with a solution.
Hadoop as a Big Data Solution
What is Hadoop
“Hadoop is a framework that allows us to store and process large data sets in a distributed and parallel fashion.”
Hadoop offers a solution to the problems related to big data. Instead of going into detail about how it got the name Hadoop, let us see what its main components are.
The main components of Hadoop are HDFS, which resolves the problem of storing huge data for processing, and the MapReduce processing technology. HDFS creates a level of abstraction over the underlying resources, so that we can see the whole of HDFS as a single unit.
What is HDFS
HDFS stands for Hadoop Distributed File System. Hadoop follows a client-server model, like many older client-server solutions such as NIS, LDAP, and others.
You might be familiar with the distributed file systems offered by many operating systems. The major difference is that in earlier DFS designs the master node carried the overhead of processing all the data collected from the other nodes.
In HDFS, processing takes place in parallel across the nodes, although parallel processing itself is not a new concept and was already in use in data processing.
The main components of HDFS are
1. Master Node also called Name Node
In general, it holds metadata about the stored data. It does not actually store the data itself. You might be aware of the VTOC (volume table of contents), which stores all the information about a disk; the Name Node plays a similar role.
In version 1 of Hadoop the size of a data block was 64 MB, and in version 2 it is 128 MB. Data is distributed among the slave nodes in chunks of 128 MB, and for redundancy HDFS by default maintains two additional replicas of each data block.
2. Data Node also referred to as Slave Node
The actual data is stored here, and the processing of data also takes place here. Each data node sends a heartbeat to the master node to keep its status up to date.
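To make the block size and replication figures above concrete, here is a small Python sketch of the arithmetic. It is only an illustration: the function name `block_layout` and the 500 MB example file are assumptions made for this post, not part of Hadoop, and it relies on the last block of a file occupying only the space it actually needs.

```python
import math

BLOCK_SIZE_MB = 128      # block size in Hadoop version 2, as stated in the text
REPLICATION_FACTOR = 3   # the original block plus two additional replicas


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (number of blocks, size of last block in MB, raw storage in MB).

    Hypothetical helper for illustration only; it is not a Hadoop API.
    """
    if file_size_mb <= 0:
        return 0, 0, 0
    # A file is cut into fixed-size blocks; the last one may be smaller.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the data nodes.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_storage_mb


blocks, last_mb, raw_mb = block_layout(500)
print(blocks, last_mb, raw_mb)  # prints: 4 116 1500
```

So a 500 MB file ends up as three full 128 MB blocks plus one 116 MB block, and with the default replication it consumes about 1500 MB of raw disk space across the cluster.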
What is MapReduce
MapReduce is the core component of processing in the Hadoop ecosystem, as it provides the main logic for data processing.
“MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.”
The name MapReduce combines the names of two functions. A MapReduce program contains two functions called Map() and Reduce(). Map() performs actions like filtering, grouping, and sorting, whereas Reduce() aggregates and summarizes the results produced by Map(). The output of Map() is the input for Reduce(); the two are tightly coupled, hence the name MapReduce.
Map() produces its result as key-value pairs (K, V), which act as the input for Reduce().
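The flow from Map() to Reduce() can be sketched with the classic word count example. This is plain single-process Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this illustration, and the grouping step is written as a separate function only to make it visible.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) key-value pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key so that each key reaches a single Reduce() call."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce(): aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the whole pipeline in one process (a real cluster runs it in parallel)."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


result = word_count(["big data is big", "hadoop stores big data"])
print(result)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'stores': 1}
```

On a real cluster each data node would run the map step on its own blocks in parallel, and the framework would take care of moving the key-value pairs to the reducers.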
What is YARN
In version 1, an overhead was identified on the master node's scheduler: once the number of slave nodes grew into the thousands, it became almost impossible for the scheduler to manage them.
To overcome this, Yahoo came up with a solution named YARN, introduced in version 2, to distribute the load of the scheduler. YARN is an abbreviation of Yet Another Resource Negotiator. The components of YARN are
1. Resource Manager
RM (Resource Manager) is a cluster-level resource manager (one per cluster) that runs on the master machine. It manages resources and schedules the applications running on top of YARN.
2. Node Manager
A node-level component (one per node) that runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also manages node health and logs.
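The split of duties between the two components can be pictured with a toy Python model. Everything here is an assumption made for illustration: the class and method names are not YARN's API, memory is the only resource tracked, and the "pick the node with the most free memory" rule is a stand-in policy, not how YARN's real schedulers decide.

```python
class NodeManager:
    """Toy node-level component: tracks the containers running on one machine."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []  # list of (app_id, memory_mb) tuples

    def launch_container(self, app_id, memory_mb):
        # Reserve memory on this node for the new container.
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))


class ResourceManager:
    """Toy cluster-level component: decides which node hosts each container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # Consider only nodes that still have enough free memory.
        candidates = [n for n in self.nodes if n.free_mb >= memory_mb]
        if not candidates:
            return None  # no node in the cluster can host this container
        # Assumed policy for this sketch: choose the node with the most free memory.
        target = max(candidates, key=lambda n: n.free_mb)
        target.launch_container(app_id, memory_mb)
        return target.name


nodes = [NodeManager("node-a", 4096), NodeManager("node-b", 8192)]
rm = ResourceManager(nodes)
print(rm.allocate("app-1", 6000))  # node-b
print(rm.allocate("app-2", 3000))  # node-a
print(rm.allocate("app-3", 4000))  # None
```

The point of the sketch is only the division of labor: one cluster-wide component makes placement decisions, while a per-node component keeps track of what runs locally.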
Because Hadoop is open source, new tools are steadily being incorporated into it to achieve specific goals or business objectives. The system keeps expanding, and the full set of tools that make up a big data solution is known as the Hadoop Ecosystem.
The Hadoop Ecosystem mainly contains the tools below.
1. HDFS
2. YARN
3. MapReduce
4. Spark
5. PIG/HIVE
6. HBase
7. Mahout, Spark MLlib
8. Apache Drill
9. Zookeeper
10. Oozie
11. Flume, Sqoop
12. Solr & Lucene
13. Ambari
To summarize, the Hadoop Ecosystem keeps evolving with many tools built to achieve specific goals, and it opens up various career opportunities for data scientists, data analysts, big data architects, and many more.
It is a forward-looking field and is likely to keep evolving. Many institutes and engineering colleges are adding it to their curriculum to meet industry requirements, as the industry has acknowledged a knowledge gap in this area.
There are many online courses available to explore, and some of them are free.
Review for yourself whether it is really for you, considering your current role and how big data is going to affect that role in your organization. If you feel it is worth learning, start with the free online courses, then take other courses depending on your personal assessment, and go for a certification in the field.
So if you have read this far, I expect you now have a basic idea of what big data is.
If you liked this post and want to learn more, I invite you to join my FB group using the link below. I can assure you of consistency but not frequency, due to my other engagements.
I can assure you that you will learn more and more about existing and future trends in the IT industry by being part of this group. So please do share this post if you think it deserves to be shared.
UNIX LINUX Resource Center