Tuesday, January 1, 2013

Introduction To Big Data: O'Reilly Course

The main concept of Hadoop is to process large amounts of data distributed across multiple machines.

When we talk about Hadoop, we talk about two main components:
1- The MapReduce framework
2- HDFS, the Hadoop Distributed File System

MapReduce: the programming model that lets you process large amounts of data distributed across multiple machines.

The user writes a Map function and a Reduce function; Hadoop takes care of the rest (distributing the work over multiple machines, handling errors, fault tolerance, ...).

MapReduce example:

Suppose you have a file pets.txt. The file is divided into blocks, and each block is stored on a different machine. The MAP function processes its input and converts it into KEY/VALUE pairs; here the key is the animal name and the value is just 1.
SHUFFLE is the process of moving records with the same key to the same machine (THIS IS DONE AUTOMATICALLY BY HADOOP).
Then the REDUCE function simply does a count, and the output file pet_freq.txt contains the result: the number of times each animal is mentioned in the file. A minimal Java sketch of such a job follows.
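The sketch below assumes the standard org.apache.hadoop.mapreduce API (it is the classic word-count pattern); the class names (PetCount, PetMapper, PetReducer) are made up for illustration, while pets.txt and pet_freq come from the example above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PetCount {

    // MAP: for every animal name in a line, emit (animal, 1).
    public static class PetMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text animal = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    animal.set(token);
                    context.write(animal, ONE);   // key = animal name, value = 1
                }
            }
        }
    }

    // REDUCE: after the shuffle, all the 1s for one animal arrive together; sum them.
    public static class PetReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text animal, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : ones) {
                count += one.get();
            }
            context.write(animal, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pet count");
        job.setJarByClass(PetCount.class);
        job.setMapperClass(PetMapper.class);
        job.setReducerClass(PetReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("pets.txt"));
        FileOutputFormat.setOutputPath(job, new Path("pet_freq"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice that only the map and reduce methods are specific to the problem; the splitting of the input, the shuffle, and the fault handling come from the framework.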

- HDFS is the way our data is distributed across multiple machines: Hadoop stores multiple copies of the data on different machines so we are sure the data is not lost (the default replication factor is 3).
- The recommended block size in Hadoop is 64 or 128 MB.
- It is better to have a small number of large files than a large number of small files.
- Hadoop reads the stored data SEQUENTIALLY, not with random access; a read goes from the beginning to the end of the file, which reduces disk seek time.
- There is no update: if you want to change a value in a file, you write another file (there are solutions for this that introduce another layer on top of Hadoop, such as HBASE).

So the data is broken into blocks, the blocks are stored on NODES (Data Nodes), and the blocks are replicated.

The master node runs what is called the Name Node; the Name Node knows where each block is stored.

- To work with HDFS you can use the HDFS shell (command line), a web user interface (e.g. Datameer), the Java API, or REST.
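As an illustration of the Java API option, here is a minimal sketch that reads a file from HDFS sequentially, assuming the standard org.apache.hadoop.fs classes; the path /user/demo/pets.txt is just an example. The HDFS shell equivalent would be: hadoop fs -cat /user/demo/pets.txt

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS (via the Name Node)
        Path file = new Path("/user/demo/pets.txt");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {   // sequential read, start to end
                System.out.println(line);
            }
        }
    }
}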

Apache Hive
It sits on top of HDFS and is based on the idea of defining tables over HDFS files, so what we do is define the mapping between the tables and the data in HDFS.

There are no physical tables; the tables are just logical definitions (metadata), the data itself stays in HDFS.

Now you can write Hive QL, which is similar to SQL, and Hive will convert the queries into MapReduce jobs.

How do we do the mapping between HIVE and HDFS?

You define the table fields and how they are mapped to the HDFS file; here a regular expression (regex) is used for the mapping.
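A minimal sketch of what such a regex mapping could look like, submitted through the Hive JDBC driver; the table name, columns, regex, paths and connection URL are all made up for illustration, and the SerDe class is the RegexSerDe that ships with Hive (assumed available in your Hive version).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CreateHiveTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // The table is only metadata: the data stays in HDFS under LOCATION,
            // and the regex tells Hive how to split each line into columns.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS access_log (" +
                "  host STRING, ts STRING, request STRING) " +
                "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' " +
                "WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) ([^ ]*) (.*)') " +
                "STORED AS TEXTFILE " +
                "LOCATION '/data/access_log'");

            // A Hive QL query; Hive compiles it into MapReduce job(s).
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT host, count(*) AS hits FROM access_log GROUP BY host")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}

Because the table is EXTERNAL, dropping it only removes the metadata; the data in HDFS is untouched.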

Very important: HIVE is not used like a relational database; it is used for batch and analytics workloads.


PIG 
It runs on top of Hadoop and gives you a procedural language (Pig Latin), so instead of writing SQL queries you write something more like a stored procedure: a step-by-step data flow.

You can start your development by working on a sample file without Hadoop, do the development there, then move to Hadoop.

The nice thing about PIG is that you can write your procedure line by line: write a line, check the result, and if you are satisfied, move on to the next line.

You can always check the execution plan to see the MapReduce jobs that will be generated.

example of PIG

After you load the document (e.g. sales.dat), you specify the schema of the document, as in the sketch below.
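A minimal sketch of such a script driven from Java through the org.apache.pig.PigServer API; sales.dat comes from the example, while the field names and the aggregation are made up for illustration. ExecType.LOCAL runs against a local sample file without a Hadoop cluster, and switching to ExecType.MAPREDUCE moves the same script onto Hadoop.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class SalesReport {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load the file and declare its schema, then build the script
        // line by line, checking intermediate results as you go.
        pig.registerQuery("sales = LOAD 'sales.dat' USING PigStorage(',') "
                + "AS (id:int, product:chararray, amount:double);");
        pig.registerQuery("by_product = GROUP sales BY product;");
        pig.registerQuery("totals = FOREACH by_product "
                + "GENERATE group AS product, SUM(sales.amount) AS total;");

        pig.store("totals", "sales_totals");   // triggers the MapReduce job(s)
    }
}

In the interactive Grunt shell the same script is usually built step by step, using DESCRIBE and DUMP to check each alias and EXPLAIN to see the generated MapReduce plan.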

So use PIG when you want to write a complex query that cannot be expressed in a single Hive QL statement.

Scalding
Here we use Scala, and the Scala code will be converted to MapReduce.
The benefit is that you are using a general-purpose programming language, which makes your life easier.

Basically, Scalding code is compiled down to Cascading, a Java library that expresses data flows and runs them as MapReduce jobs; Scalding is the Scala layer on top of Cascading.


HADOOP Ecosystem

We have HDFS, and on top of it we have YARN.
YARN fixes the multi-tenancy problem of Hadoop (running multiple MapReduce jobs at the same time); in addition, YARN allows Hadoop to scale beyond 4000 machines in a cluster.

TEZ: provides better performance by eliminating some intermediate read/write operations; TEZ transforms MapReduce-style requests into a directed acyclic graph (DAG). TEZ helps Hadoop serve not only batch processing but also interactive, closer-to-real-time workloads.
ZooKeeper: is used to coordinate and manage clusters.
Spark: uses a DAG execution engine of its own (similar in spirit to TEZ) and works on in-memory data structures (RDDs).
Flink: also DAG-based; it converts the program into an optimized execution plan before running it.

Sqoop: is used for integration, basically exporting to and importing from relational databases.

Flume: also used for integration, but with non-relational sources, e.g. the file system, message queues (MQ), ...

Mahout: is used for machine learning (not shown in the ecosystem diagram above).

HBASE: has something called a Bloom filter; it can detect that a key is definitely not present in a store file and return immediately, instead of reading the data from disk.
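A conceptual sketch of the Bloom filter idea (not HBase's actual implementation): k hash functions set k bits per stored key; if any of those bits is 0 for a looked-up key, the key is definitely absent, otherwise it is only probably present.

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;    // number of bits
    private final int hashes;  // number of hash functions (k)

    public SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th hash of the key and map it to a bit position.
    private int bitFor(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(bitFor(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(bitFor(key, i))) {
                return false;   // definitely not there -> answer directly, skip the disk read
            }
        }
        return true;            // maybe there -> go and read the actual data
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.add("row-123");
        System.out.println(filter.mightContain("row-123"));  // true
        System.out.println(filter.mightContain("row-999"));  // almost certainly false
    }
}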

NOSQL
When we say NoSQL we are talking about:
1- Non-relational databases
2- Distributed databases
3- The CAP theorem: Consistency, Availability, Partition tolerance
4- Eventual consistency: all replicas converge after a short delay (often just milliseconds), so a read may briefly return stale data

Of course, a traditional RDBMS is CA (Consistent and Available, but not partition tolerant).

There is a trade-off between scalability and data model complexity in NoSQL. Ordering the data stores in terms of scalability, from most to least scalable:
1- Key/Value
2- Column family
3- Document
4- Graph

and the data model complexity goes in the opposite order (Graph is the richest model, Key/Value the simplest).


STREAMING
When we talk about streaming, we are talking about a different way of generating results.

Normally, we have queries, we run them on stored data, and we get results.
In streaming it is the other way around: the queries are standing, the data flows through them as input, and results are produced continuously.
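A tiny conceptual sketch of that inversion in plain Java (no particular streaming product assumed): the query is registered once as a standing predicate, and every arriving event is pushed through it.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class StandingQuery<T> {
    private final List<Consumer<T>> subscribers = new ArrayList<>();
    private final Predicate<T> query;

    public StandingQuery(Predicate<T> query) {
        this.query = query;               // the query is fixed up front...
    }

    public void subscribe(Consumer<T> subscriber) {
        subscribers.add(subscriber);
    }

    public void onEvent(T event) {        // ...and the data streams past it
        if (query.test(event)) {
            subscribers.forEach(s -> s.accept(event));
        }
    }

    public static void main(String[] args) {
        StandingQuery<Integer> bigOrders = new StandingQuery<>(amount -> amount > 1000);
        bigOrders.subscribe(amount -> System.out.println("ALERT: big order " + amount));

        // Events arriving over time; each one is evaluated immediately.
        for (int amount : new int[]{250, 4200, 90, 1500}) {
            bigOrders.onEvent(amount);
        }
    }
}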


some products:
APACHE STORM
APACHE SPARK
APACHE FLINK

Example of the comparison between STORM and Hadoop: Storm processes each event as it arrives (continuous, low latency), while Hadoop MapReduce processes stored data in batches (high throughput, high latency).

