Basics of Big Data
What is Big Data?
With the increased usage of the internet and the availability of high-speed data plans at affordable prices, the smartphone industry has grown exponentially in recent times. So let us properly understand: what is Big Data?
We have started creating huge amounts of data through countless mobile apps and desktop applications. Much of this data is not in the structured form that we used to manage with structured database solutions. So a problem surfaces: how do we store this data and process it to get meaningful outcomes from it? This problem is what "big data" refers to; it can be seen as the name of a challenge faced by data scientists.
It can be understood, or defined, as follows:
"a collection of data sets so huge and complex that they are difficult to store and process using available database management tools or traditional data processing applications."
The main challenges related to big data are:
1. Capturing data 2. Curating data 3. Storing data 4. Searching data
5. Sharing data 6. Transferring data 7. Analyzing data 8. Visualizing data
Data scientists grouped these problems into categories and identified the major challenges in handling such data as the 5 V's, described below.
1. VOLUME: The amount of data, which is growing day by day at a very fast pace.
2. VELOCITY: The pace at which different sources generate data every day. The flow of data is massive and keeps growing continuously.
3. VARIETY: Various sources contribute to big data, and they generate different categories of data: structured, semi-structured, and even unstructured.
4. VALUE: The value that can be generated from these huge amounts of data by segregating or filtering the valuable portion of it for the benefit of study or business. This is where data analysis comes into the picture. Companies like Google and Amazon are leaders in taking full advantage of data analysis to grow their customer base, presenting customers with products based on their interests and analyzing their online behavior.
5. VERACITY: The many inconsistencies present in data, such as uncertainty, doubt, or incompleteness.
So now we know the problem that needs to be addressed. How can we solve it? This is where Hadoop comes in with a solution.
Hadoop as a Big Data Solution
So what is Hadoop?
"Hadoop is a framework which allows us to store and process large data sets in a distributed and parallel fashion."
Hadoop came up with a solution to the problems related to big data. Instead of going into the details of how it got its name, let us see its main components. The main components of Hadoop are HDFS, which solves the problem of storing huge amounts of data for processing, and the MapReduce processing framework. HDFS creates a level of abstraction over the underlying resources, so the whole of HDFS can be seen as a single unit.
What is HDFS?
HDFS stands for Hadoop Distributed File System. Hadoop is a client-server based solution, like many older client-server solutions such as NIS or LDAP.
You might be familiar with the distributed file systems offered by many operating systems. The major difference is that in those earlier DFS implementations, there was overhead on the master node, which had to process all the data collected from the nodes.
In HDFS, by contrast, processing takes place in parallel on the data nodes, even though parallel processing itself is not a new concept in data processing.
The main components of HDFS are:
1. Master Node, also called Name Node
In general, it contains metadata about the stored data; it does not store the data itself. You might be aware of the VTOC (volume table of contents), which stores all the information about a disk, and the Name Node plays a similar role. In version 1 of Hadoop the data block size was 64 MB; in version 2 it is 128 MB. Data is distributed among the slave nodes in chunks of 128 MB, and for redundancy HDFS by default maintains two additional replicas of each data block.
2. Data Node, also referred to as Slave Node
The actual data is stored here, and the processing of data also takes place here. Each data node maintains a heartbeat to update its status with the master node.
MapReduce is the core processing component of the Hadoop ecosystem, as it provides the main logic for data processing.
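As a rough illustration of the block accounting described above, the following sketch (an assumption for illustration, not actual Hadoop code) computes how many 128 MB blocks a file occupies and how many physical copies end up on the data nodes with the default of two additional replicas per block:

```python
import math

BLOCK_SIZE_MB = 128   # HDFS v2 default block size
REPLICATION = 3       # original block plus two additional replicas

def block_layout(file_size_mb):
    """Return (number of blocks, total physical block copies stored)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

# A 300 MB file splits into 3 blocks (128 + 128 + 44 MB);
# with replication, 9 block copies are spread across the data nodes.
print(block_layout(300))  # (3, 9)
```

This also hints at why the Name Node holds only metadata: it tracks which data node holds which block copy, while the blocks themselves live on the slave nodes.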
What is MapReduce?
"MapReduce is a software framework which helps in writing applications that process large data sets using a distributed and parallel algorithm inside the Hadoop environment."
The name MapReduce combines the names of its two functions. A MapReduce program contains two functions, Map() and Reduce(). Map() performs actions like filtering, grouping, and sorting, whereas Reduce() aggregates and summarizes the results produced by Map(). The result generated by Map() is a set of key-value pairs (K, V), which acts as the input for Reduce(). The two are tightly coupled, hence the name MapReduce.
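To make the Map()/Reduce() flow concrete, here is a minimal word-count sketch in plain Python. The function names and the in-memory grouping step are illustrative assumptions; in real Hadoop, these phases run distributed across the cluster, with the framework shuffling key-value pairs between nodes:

```python
from collections import defaultdict

def map_phase(lines):
    """Map(): emit a (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce(): group the pairs by key, then aggregate each group."""
    grouped = defaultdict(list)
    for key, value in pairs:      # the "shuffle" step: group values by key
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data is big", "data is everywhere"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Note how the output of the map phase, a stream of (K, V) pairs, is exactly the input of the reduce phase, matching the description above.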
What is YARN?
In version 1, an overhead was identified in the master node's scheduler: as the number of slave nodes grew into many thousands, it became almost impossible for a single scheduler to manage them.
To overcome this, Yahoo came up with a solution named YARN, introduced in version 2, to distribute the scheduler's load. YARN is an abbreviation of Yet Another Resource Negotiator. The components of YARN are:
1. Resource Manager
The Resource Manager (RM) is the cluster-level resource manager (one per cluster) and runs on the master machine. It manages resources and schedules the applications running on top of YARN.
2. Node Manager
A node-level component (one per node) that runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container, and it manages node health and logs.
Due to the open-source nature of the Hadoop system, various tools keep getting incorporated into it day by day to achieve specific goals or business objectives. The system is expanding rapidly, and the entire set of tools comprising the big data solution makes up the Hadoop Ecosystem.
The Hadoop Ecosystem mainly contains the tools below:
1. HDFS 2. YARN 3. MapReduce 4. Spark 5. Pig/Hive 6. HBase 7. Mahout, Spark MLlib 8. Apache Drill 9. ZooKeeper 10. Oozie 11. Flume, Sqoop 12. Solr & Lucene 13. Ambari
To summarize, the Hadoop Ecosystem keeps evolving with many tools to achieve specific goals, and it opens up various career opportunities such as data scientist, data analyst, big data architect, and many more.
It is futuristic and tends to evolve more and more. Many institutes and engineering colleges are including it in their curricula to meet industry requirements, as the industry has acknowledged a knowledge gap, especially in this area.
There are many online courses available to explore, including some free ones.
Just review from your side whether it is really for you, considering your current role and how big data is going to affect that role in your organization. If you feel it is worth learning, start with the free online courses, then take other courses depending on your personal assessment, and go for certification in the same.
So if you have gone through all of this, I expect you now have a basic idea of what big data is.
If you liked the above post and want to learn more, let me ask you to join my FB group via the link below. I can assure you of consistency, though not frequency, due to my other engagements. I can assure you that you will learn more and more about existing and future trends in the IT industry while being with this group. So please do share it if you think it deserves to be shared.