Friday, November 18, 2016


Git Essentials LiveLessons



install git from
install sublime from

from git bash set your global variables

//set global user name
 git config --global user.name "hassan jamous"
//set global email
 git config --global user.email "''"
//set global color
 git config --global color.ui "auto"
//set global editor to sublime
 git config --global core.editor "'C:\Program Files\Sublime Text 3\sublime_text.exe' -w"

you can view your global configuration 
git config --list


create a git repository
you can create a git repository in any folder, 
cd to the folder and type
git init
git will create a single hidden .git folder in this folder (it covers all the subfolders as well), you can type
ls -a to check

Branch master
when you create a new git repository, you will be working on the master branch.

Untracked, Staging and Commit
when you create a new file in your repository, the file will be untracked.
to track the file you type 
git add FILENAME
or you can use
git add .
to add all the files to the staging area

after you add the file, it is in the staging area.

to commit your changes, which means saving them to the branch, you type
git commit -m "commit message"

to check the commit log
git log

Now if you change a file again, the new change will NOT be in the staging area, you should add it and then commit.

HEAD points to your last commit on the current branch, so when you commit new changes HEAD moves to the new commit

Check the differences
if you changed the file and have not moved it to the staging area yet, you can compare your changes with the last commit by typing

git diff 

if the file is already in the staging area you should type

git diff --staged

after you commit you can compare with the previous version by
git diff HEAD~1
which means compare with the commit one before HEAD, you can also use HEAD~2, HEAD~3 ....
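
To make the HEAD~n notation concrete, here is a minimal Python sketch. It is only a toy model of a linear history, not how git is implemented; the commit names and the helper resolve_ancestor are hypothetical.

```python
# Toy model: each commit id maps to its parent id (None for the first commit).
# The ids below are hypothetical placeholders, not real hashes.
PARENTS = {
    "third": "second",
    "second": "first",
    "first": None,
}

def resolve_ancestor(parents, head, n):
    """Return the commit reached by following n parent links from head (HEAD~n)."""
    commit = head
    for _ in range(n):
        commit = parents[commit]
        if commit is None:
            raise ValueError("history is shorter than the requested HEAD~n")
    return commit

head = "third"
print(resolve_ancestor(PARENTS, head, 0))  # third  (HEAD itself)
print(resolve_ancestor(PARENTS, head, 1))  # second (HEAD~1)
print(resolve_ancestor(PARENTS, head, 2))  # first  (HEAD~2)
```

So git diff HEAD~2 compares your working folder with the commit that this walk ends on.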

you can also compare with a commit id, if you use
git log
you will get something like

commit 5dbed2e0bcd7bdb844d6a6fdfc6519b9f5da7e31
Author: hassan jamous <>
Date:   Wed Nov 16 19:00:50 2016 +1100

    second commit

commit c39fbd23776eb5e569bff21b5bd8d05eacb1facd
Author: hassan jamous <>
Date:   Wed Nov 16 18:42:30 2016 +1100

    first commit

you can use the commit id to compare the differences
git diff c39fbd23776eb5e569bff21b5bd8d05eacb1facd

you can move your HEAD between commits by using git checkout.
for example you can move your HEAD to the previous commit

git checkout HEAD~1

if you type git status here, it will tell you that the HEAD is detached and is now pointing to the previous commit
$ git status
HEAD detached at c39fbd2
nothing to commit, working tree clean

to go back to the last commit type 
git checkout master 

of course you can also check out a commit id
git checkout c39fbd23776eb5e569bff21b5bd8d05eacb1facd

lets say you want an old version of a file, you simply checkout that file from the commit you want

git checkout HEAD~1 readme.txt

notice here that you are not moving the HEAD, you are just telling git that you want the file as it was in HEAD~1

of course you can also use a commit id
git checkout c39fbd23776eb5e569bff21b5bd8d05eacb1facd readme.txt

now if you check git status, you will see that the file is modified and in the staging area,
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   README.txt

now you can commit your new changes.

Deleting a file
when you delete a file, the deletion will not be in the staging area yet, you can confirm the deletion simply by typing
git add FILENAME
(or git rm FILENAME) to add it to the staging area, then
git commit -m 'we deleted the file'
to commit

now if you want to reverse the deletion instead of adding it to the staging area, you should type
git checkout master readme.txt
notice that we returned to the master version of the file

Moving a file out of the staging area (undo the git add)
to move the file out of the staging area type
git reset HEAD readme.txt

undo your changes
if you made a change that is not committed yet and you want to undo it, type
git reset --hard
be careful, this discards ALL uncommitted changes to tracked files (staged or not), not just one file

Adding new folder to git repository
if you create a new empty folder, you will notice that git does not recognise it, there must be a file in the folder for it to be recognised.
that's why people create a .gitkeep file inside the empty folder (the name is just a convention), git will then recognise the folder, and as .gitkeep is a hidden file normal users will not see it.
of course you can see the hidden file from the bash, by running ls -a

ignore files 
to make git ignore files, you should create a .gitignore
file in the root folder of the repository, inside this file you can add the files or patterns that you would like to ignore.

force ignored file to be committed 
to force an ignored file to commit 
git add -f FILENAME


GIT is structured this way

as you can see, you have a local copy, and you have a remote, the remote could be GitHub, GitBucket, GitLab or anything that follows the git structure

you can have multiple remotes, however the primary remote is called origin (this is a convention)

you push to or pull from a remote

adding a remote
1- create a repository on github
2- we will use this repository as the remote
3- we will add the remote, and as it is going to be our primary remote we will call it origin

git remote add origin
as you can see we named this remote origin, you can call it whatever you want, but as it is the primary remote we are following the convention ==> it should be called origin.

now you need to push your repository to GITHUB
git push origin master

which means i want to push my master branch to the remote called origin.

it will ask you for your github username and password

checking which remotes you have
you can use
git remote -v
to get the list of remote repositories that you have, you will get something like
$ git remote -v
origin (fetch)
origin (push)

as you can see, for each remote you have two entries, one to fetch the code and another to push the code.

USE SSH to connect to github
anything you push to github over the http url requires a username and password.
to avoid this, you should use the ssh url rather than the http url to connect to github

to do that
1- go to your home folder and create a .ssh folder (on linux ~/.ssh, on windows under C:\Users\Hassan)
2- cd to that folder
3- type  ssh-keygen
4- you will receive this message Enter file in which to save the key (/c/Users/hassa/.ssh/id_rsa):
5- enter a file name like id_rsa
6- then you will get the following message

Your identification has been saved in id_rsa.
Your public key has been saved in
The key fingerprint is:

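 now, your key is stored in

7- open
8- copy the key
9- go to github, open the settings menu then SSH and GPG keys

10- add a new key, and paste the key.

11- now get the SSH location of the repository from github

and type
git remote add origin SSHLOCATION
(if origin already exists with the http url, use git remote set-url origin SSHLOCATION instead)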

so now lets say you updated a file and committed the changes, these changes will be stored locally, to push these changes to github

git push origin master

now the file will go to github

Pulling changes from GitHub
you can edit files on github directly, lets say you edited a file on github or you want to pull the latest changes

you can type
git pull origin master

when you push your changes, you might get an error saying that the push was rejected because you dont have the latest version.
you should pull the latest changes and then push
git pull origin master
git push origin master

when you pull, an automatic merge might not be possible, in that case you should resolve the conflicts manually

Creating a new branch

to create a new branch you can type
git branch BRANCHNAME

this will create a branch from where you are, so if you are on master the branch will be created from master.

or you can type 
git checkout -b BRANCHNAME

to list the branches you have
git branch -a

to move from one branch to another
git checkout BRANCHNAME

to delete a branch
git branch -d BRANCHNAME

to force delete a branch (in case there is some work on this branch that is not merged yet)
git branch -D BRANCHNAME
use capital D

to merge changes into the master branch, you should first check out that branch
git checkout master

then you can merge
git merge BRANCHNAME

when you have a new task,
1- create a new branch from the master branch
git checkout -b NEWJIRA
2- do the changes that you want and commit them
3- now we should push this branch to remote
git push origin NEWJIRA
4- go to the github website and create a pull request, here you should specify the base branch and the branch that you want to be merged (the base branch will be master, the branch to be merged is NEWJIRA)
5- someone will see the pull request, review it and accept it.
6- after this you can delete the NEWJIRA branch from github.

now you have merged the NEWJIRA branch into the master branch on github, which means you merged on the REMOTE. you need to pull these changes to your local master

git checkout master
git pull origin master

now your local master matches the remote master

when you type
git branch -a
you will get something like
$ git branch -a
* master

as you can see, it lists the branches that you have locally, and the remote branches.
lets say that you went to github and deleted testingBranch, so you are deleting the remote testingBranch

after you do that you should update your local repository so it is in sync with the remote. to sync your view of the remote WITHOUT CHANGING YOUR LOCAL BRANCHES, you should use git fetch

git fetch

this will update the remote-tracking branches, however in order to also remove the ones that were deleted on the remote you should type

git fetch --prune
now if you type
git branch -a

$ git branch -a
* testingBranch


you will notice that the branch is deleted.

basically git pull is git fetch + git merge

you can use 
git log 
to get the log of a branch
however there is too much information there. to see a more compact version

git log --oneline

this will print one line for each commit 

to print the commits of all branches
git log --oneline --all 

to print a graph and to see which branch merged the changes

git log --oneline --all --decorate --graph 

from the log you can see how the branches are related, it will tell you which branch is ahead of another, and which branches are pointing to the same commit

for example, the following image tells us that development, origin/master and master are at the same commit

and this image tells you that origin/master and master are at the same commit
and feature/folder_documentation, origin/development and development are at the same commit, ahead of the master branch

to sync branches we have used git merge so far, what basically happens in git merge is the following

lets say you have this case

now when you merge you do the following
git checkout master
git merge experiment

and this is what will happen

there is another way to integrate the changes, which is rebase

lets say we have the following

rather than going to master and merging, we will do the following
git checkout experiment
git rebase master

now this is what will happen, the output will be

First, rewinding head to replay your work on top of it...
Applying: added staged command

so we replayed experiment as C4' in front of master

now we will type

$ git checkout master
$ git merge experiment

and the result is a fast-forward of the master branch, so master now points to C4' as well.

Lesson 4

Adding a collaborator 
if you want to add someone to your github project so they can push and pull, go to your repository then choose settings -> collaborators, and then add the collaborator

now the collaborator should download the project. to do that use git clone
this will download the project and create the required remotes.

now the collaborator can push and pull

if you have a big project with many contributors, you will not add all of them as collaborators, the best solution for this is to FORK

fork means that there is a project REMOTE and you copy this project to your own remote.
so when you fork it means that you are taking a copy of this project into your account (your remote), and when you push and pull you are basically doing that on your account, not the project account.

you do the update on your account (your remote) and then create a pull request to merge it into the project remote

so, from GITHUB you can press the FORK button, now the project is forked and it is in your account.
you can copy the ssh link of your fork and use
git remote add origin SSHLINK

now do the changes and push them to your remote, then create a pull request to merge them into the project remote.

Now when many people fork the project, your fork will get out of sync with the project.
to handle this you should add a new remote, this remote is the project itself,
so now you have your account remote (which we call ORIGIN) and you should add the project remote (which we call upstream).
git remote add upstream PROJECT_SSH_LINK

what you do is very important:
1- pull from UPSTREAM (i.e. get the latest version from the project remote)
2- push to ORIGIN (i.e. push your changes to your remote)
3- create a pull request to merge from ORIGIN to UPSTREAM

sometimes git rebase stops because of a conflict.
resolve the conflict and git add the resolved files,
then you should do
git rebase --continue
or, to drop the conflicting commit instead,
git rebase --skip

and when you rebase commits that were already pushed you usually have to force push
git push -f origin master


check this url for branching 

Monday, May 23, 2016

OREILLY Learning Apache Maven

Maven has 3 lifecycles: clean, default and site.
each lifecycle has a lot of phases, executing a phase means executing all the previous phases of that lifecycle first.

Maven follows convention over configuration, which means you dont tell maven in some configuration where your java files are, by convention they must be in src/main/java

You can change these conventions but it is not recommended, you may need that if you are working on a legacy application

you can have a Terminal inside eclipse, use TCF Terminal plugin

Inheritance in Maven
all the directories and conventions are defined in the general pom (the Super POM), and every pom.xml inherits from it

Maven profile
you can use Maven profiles to build the project based on your environment, for example one profile for the test environment, another for DEV and another for production.

now when you run maven use -P to pick the profile
mvn -Pproduction package

if you dont want to use -P you can do something else
you can define an environment variable in windows and then use it in pom.xml to activate the profile

now Maven will check the PACKAGE_ENV environment variable and determine which profile to use

Maven Dependency

Maven can handle transitive dependencies, which means if you depend on X.jar and X.jar depends on Y.jar, Maven will fetch Y.jar

You can define remote repositories in Maven.

You can define a scope for your dependencies, for example you dont need junit when you compile the main code, you only need it when you test

Maven can handle conflicts, for example you depend on X.jar and Y.jar, X.jar depends on Z.jar version1 and Y.jar depends on Z.jar version2 . Maven can handle this conflict.
by default it picks the version that is nearest to your project in the dependency tree (the first one declared wins when the depth is the same), not necessarily the latest one, however you can control this behaviour using the <exclusion> tag.

Maven Lifecycles 
there are 3 different lifecycles in Maven: default, clean and site
lifecycles have phases
phases are bound to plugin goals, every goal bound to a phase must pass in order for the phase to pass

default is the most used lifecycle
in default you have these phases (among others):
compile: which means compile everything in src/main/java
test-compile: compile everything in src/test/java
test: run the unit tests
package: create the jar or ear or war
install: take the generated package and put it in the local repository so other projects can use it as a dependency
deploy: take the package and put it in the remote repository, so other teams in the company can use it
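
The rule that executing a phase also executes all the phases before it can be sketched in a few lines of Python. This is only an illustration, assuming the simplified list of phases from these notes (the real default lifecycle has more phases); phases_to_run is a hypothetical helper, not a Maven API.

```python
# Simplified subset of the default lifecycle, in execution order.
DEFAULT_LIFECYCLE = ["compile", "test-compile", "test", "package", "install", "deploy"]

def phases_to_run(target):
    """Return every phase that runs when `target` is requested."""
    index = DEFAULT_LIFECYCLE.index(target)  # raises ValueError for unknown phases
    return DEFAULT_LIFECYCLE[: index + 1]

print(phases_to_run("package"))
# ['compile', 'test-compile', 'test', 'package']
```

So mvn package also compiles and tests, while mvn deploy walks the whole list.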

Sunday, March 13, 2016

Learning Apache Hadoop OREILLY Course

we should know the concept of disk striping. in disk striping, or RAID0, the data is divided into multiple chunks, so lets say you have 4 hard disks, the data will be divided into 4 pieces, so accessing the data will be much faster,

in RAID1, the data is mirrored

in order to ensure that your data is both fast and safe you should combine RAID1 with RAID0.

Hadoop logically does that in the cluster, it stripes and mirrors the data.

- Hadoop is fault tolerant, which means that if a disk is corrupted or a network card is not working, this is fine.
- Hadoop has a master/slave structure.

you should choose a powerful and expensive computer for your master node.
the master node is a single point of failure,
you should have 2 or 3 master nodes in a cluster
you should have redundancy (as it is a SPOF)

You need a lot of RAM, more than 25 as the daemons take a lot of RAM
you should use RAID
you should use hot-swap disk drives
you should have redundant network cards
you should have dual power supplies

the bottom line: the MASTER NODE should never go down.

CPU is not as important as RAM here.

you will have 4 to 4000 slave nodes in a cluster
slave nodes are not a single point of failure.
7200RPM disks are fine
more disks are better, which means 8 * 1 TB disks are much better than 4 * 2 TB
it is better that all slaves have the same disk size.

of course a slave does NOT need to be redundant
you dont need RAID or dual network cards or dual power supplies

you need a lot of RAM

lets say you have 10TB of new data every month
you have slaves with 8TB of disk each
you have a replication factor of 3
you should know that you have something called "intermediate data", which is the data generated between MAP and REDUCE. this data takes about 25% of the disk size (in this case 2TB)

the available space formula is = (RAW - ID) / RF = (8 - 2)/3 = 2 TB

which means each slave effectively holds 2TB, not 8TB, which means you need 5 new slaves every month (as you get 10TB every month).
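
The same sizing arithmetic as a small Python sketch, so it can be repeated with other numbers. The function and parameter names are hypothetical; the 25% intermediate data share is the rule of thumb from these notes, not a fixed Hadoop value.

```python
import math

def usable_tb_per_slave(raw_tb, intermediate_fraction, replication_factor):
    """(RAW - ID) / RF, where ID is the share of the disk kept for intermediate data."""
    intermediate_tb = raw_tb * intermediate_fraction
    return (raw_tb - intermediate_tb) / replication_factor

def slaves_needed(monthly_data_tb, usable_tb):
    """Round up, because you cannot buy a fraction of a machine."""
    return math.ceil(monthly_data_tb / usable_tb)

usable = usable_tb_per_slave(raw_tb=8, intermediate_fraction=0.25, replication_factor=3)
print(usable)                     # 2.0
print(slaves_needed(10, usable))  # 5
```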

it means all the things on top of hadoop,
basically when we say hadoop we mean HDFS and MAPREDUCE.
main things in hadoop are
1- NAME NODE: part of the master
2- Secondary Name Node: part of the master
3- Job Tracker: part of the master
4- Data Node: part of the slave
5- Task Tracker: part of the slave

1- HBase: fast scalable NoSQL database
2- Hive: write sql like queries instead of map reduce
3- Pig: write functional queries instead of map reduce
4- Sqoop: pull and push data to RDBMS, used for integration
5- Flume: pull data into HDFS
6- HUE: web interface for users
7- Cloudera Manager: web interface for managing the cluster for admin
8- Oozie: workflow builder
9- Impala: real time sql queries, 70 faster than MapReduce.
10- Avro: serialize complex objects to save in hadoop
11- Mahout: machine learning in hadoop
12- ZooKeeper
13- Spark
14- YARN
15- Storm

hadoop is used for batch processing of work that can be parallelized, which means problems that are hard to split, like graph based problems, dont fit well with hadoop

the best is Cloudera


when we talk about Hadoop, we are talking about 2 main things
1- storage: which is HDFS, a distributed redundant storage
2- processing: which is MapReduce: a distributed processing system

some terminology to know:
1- a job: all tasks need to run on all data.
2- a task: individual thing, which is either a map or a reduce
3- Slave/Master: these are computers
4- NameNode, DataNode: these are daemons, which means JVM instances

we have MapReduce v1: old and stable
we have MapReduce v2: new things like dynamic allocation and scalability

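Hadoop cluster has 5 daemons:
- Storage Daemons:
NameNode (on Master), Secondary NameNode (on Master), DataNode (on Slave)
- Processing Daemons:
JobTracker (on Master), TaskTracker (on Slave)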

Master Daemons are for orchestration
Slave Daemons are for working

NameNode: handles the storage metadata, it keeps the information in memory for fast access but it also persists it.
Secondary Name Node: it periodically merges the NameNode's edit log into the fsimage (a checkpoint), it is not a failover node
Job Tracker: coordinates processing and scheduling.

NOTE: use different machines for the Name Node and the Secondary Name Node, because if the Name Node machine is lost, the latest checkpoint on the Secondary Name Node machine can be used to build a new Name Node

NOTE: you can install the job tracker on the same machine as the Name Node, and move it to another machine when your project gets bigger.

Data Node: handles the raw data (Read & Write)

Task Tracker: handles individual tasks (Map or Reduce)

the data node and the task tracker always send heartbeats to the master to tell it that they are alive and what they are working on.

Hadoop run modes

1- Local JobRunner: Single computer, single JVM with all daemons, good for debugging
2- Pseudo-distributed: single computer, 5 JVMs (one for each daemon), good for testing
3- Fully distributed: multiple computers, multiple JVMs, this is the real environment.

when you install Hadoop it is recommended to use linux, use RHEL for Master and CentOS for slaves.

use Redhat Kickstart to install hadoop on multiple machines.

Elastic Map Reduce

is a hosted Hadoop solution from Amazon.

it has this structure:

the master instance group: is like the master node
the core instance group: is like the slave node, but it is only responsible for storage (as you can see it uses HDFS)
the task instance group: is like the slave node, but it is only responsible for processing (doing map reduce jobs)

usually we use S3 to write the information and the intermediate data.

the core instance group is static, you cannot add any new machine after you start the cluster, however the task instance group is not static, you can add new machines whenever you want


in this lab he created 5 EC2 instances, one master and 4 slaves
he installed Cloudera Manager
he installed Hadoop from Cloudera Manager
then he uploaded some data to Hadoop from the command line
then he ran a Map/Reduce example
then he checked everything from Cloudera Manager

then he gave an example of downloading the Cloudera Quickstart VM, to run hadoop locally

he used an Ubuntu 12.04 AMI


Hadoop Distributed File System (HDFS)

you can use HDFS without MAP/REDUCE, in that case you only need the NameNode, Secondary NameNode and DataNodes

when you upload a file to HDFS it will be divided into blocks and stored on the slave nodes

every block will be replicated to 3 machines (by default)

you cannot edit or append to the file you upload to HDFS, if you want to change anything you should delete the file and create it again.

the default block size is 64MB, however it is recommended to change it to 128MB
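
As a rough illustration of how a file turns into blocks, here is a small Python sketch. It only does the size arithmetic and assumes the 64MB default and replication factor 3 mentioned above; split_into_blocks is a hypothetical helper, not an HDFS API.

```python
MB = 1024 * 1024

def split_into_blocks(file_size_bytes, block_size_bytes=64 * MB):
    """Return the sizes of the blocks a file of this size would occupy."""
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes

blocks = split_into_blocks(200 * MB)
print([size // MB for size in blocks])  # [64, 64, 64, 8]
print(len(blocks) * 3)                  # 12 block replicas with replication factor 3
```

With a 128MB block size the same 200MB file needs only 2 blocks, which means less metadata for the name node to keep.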

it is a master node
it only has metadata about the files that are stored on the slaves (e.g. the name of the file, permissions, where the blocks are).


the client asks the name node about the file, then the client goes and reads it directly from the slave nodes.

The name node metadata exists in RAM, however it is also persisted.

we have 2 files for the persisted metadata in name node:
1- FSIMAGE: it is a point in time image about the information that exists in HDFS
2- edit log: the changes that happened since we created the FSIMAGE, it stores the delta information

every now and then the FSIMAGE and the edit log will be merged and saved on the hard disk

you have to have multiple hard disks with RAID to ensure that you will not lose the data.
it is also better to write a copy to a remote NFS mount,
and to take a daily or weekly backup.

every 3 seconds the datanode will send a heartbeat to the name node
if 30 seconds pass without a heartbeat, the node is considered out
if 10 minutes pass with no heartbeat, hadoop will start copying the data that should be on that node to other machines.
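
The heartbeat thresholds above can be written as a tiny Python sketch. The numbers are the ones from these notes; the state names and the datanode_state helper are hypothetical labels for illustration, not Hadoop terminology.

```python
HEARTBEAT_INTERVAL_S = 3        # a healthy datanode reports this often
OUT_AFTER_S = 30                # no heartbeat for this long: the node is out
RE_REPLICATE_AFTER_S = 10 * 60  # no heartbeat for this long: copy its blocks elsewhere

def datanode_state(seconds_since_heartbeat):
    """Classify a datanode by how long ago the name node last heard from it."""
    if seconds_since_heartbeat >= RE_REPLICATE_AFTER_S:
        return "re-replicate"
    if seconds_since_heartbeat >= OUT_AFTER_S:
        return "out"
    return "alive"

print(datanode_state(HEARTBEAT_INTERVAL_S))  # alive
print(datanode_state(45))                    # out
print(datanode_state(601))                   # re-replicate
```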

every hour (or after a restart of the name node) all data nodes will send a block report, which is a list of all the blocks that they have.

Hadoop uses checksums to ensure that data is transferred correctly.
every 3 weeks hadoop will do a general checksum check on all blocks.


How writing happens in Hadoop

here is an example

so the client divided the file into 4 blocks,
it asked the name node where to write the first block,
the name node gave it a pipeline, which is: write to datanode A then C then F
the client writes to A, then A writes to C, then C writes to F
F acks C, C acks A, A acks the client, the client acks the name node and requests a pipeline for the next block.

how do we handle a failed node?

lets say DN_A is bad, the client will try with C, and if that fails with F.
as long as the client is able to write to one node, the client can move on to the next block
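
A very simplified Python sketch of the write pipeline and the failed node rule described above. It is an assumption-heavy toy: real HDFS forwards the data node to node and sends the acks backwards, while this sketch only records which nodes of the pipeline stored the block. write_block and the node names are hypothetical.

```python
def write_block(block, pipeline, healthy_nodes, storage):
    """Store `block` on every healthy node of the pipeline, skipping failed ones.

    Returns the nodes that acknowledged the write.
    """
    acked = []
    for node in pipeline:
        if node in healthy_nodes:
            storage.setdefault(node, []).append(block)
            acked.append(node)
    if not acked:
        # the write only fails when no node of the pipeline accepted the block
        raise OSError("no DataNode in the pipeline accepted the block")
    return acked

storage = {}
pipeline = ["A", "C", "F"]
print(write_block("block-1", pipeline, {"A", "C", "F"}, storage))  # ['A', 'C', 'F']
print(write_block("block-2", pipeline, {"C", "F"}, storage))       # ['C', 'F']
```

In the second call DN_A is down, the block still lands on C and F, so the client can move on to the next block.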

general information:
1- a checksum is used for each block
2- the file is only as big as the number of blocks written so far, so lets say your file is 4 blocks and you have written only 2 blocks, at this point your file is only 2 blocks, and HDFS will see your file as 2 blocks.
for that reason it is better to have 2 folders, INCOMING: keep the file here while it is being uploaded, and once you finish uploading the whole file, move it to the READY_TO_PROCESS folder.

How reading is handled

the client asks for a file, and the Name Node gives it a read pipeline for each block

Secondary Name Node

as we mentioned before, we have 2 files in the NameNode, the fsimage which is a point in time file and the edit log which is the delta since the last fsimage

Note: we have 2 files, fsimage and edit log, because the fsimage is a big file and working with a big file on every change would slow down hadoop, thats why we have the edit log, a small file that contains only the delta information, using the edit log means dealing with a small file ==> better performance
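
The relation between the fsimage and the edit log can be sketched in Python as a snapshot plus a list of changes. This is only a conceptual model with a made-up data layout (paths mapped to block counts, two operation types); apply_edits is a hypothetical helper and the real file formats are different.

```python
def apply_edits(fsimage, edit_log):
    """Return a new fsimage with every logged change applied, in order."""
    merged = dict(fsimage)
    for operation, path, metadata in edit_log:
        if operation == "create":
            merged[path] = metadata
        elif operation == "delete":
            merged.pop(path, None)
    return merged

fsimage = {"/data/a.txt": {"blocks": 2}}       # point in time image
edit_log = [                                   # delta since that image
    ("create", "/data/b.txt", {"blocks": 4}),
    ("delete", "/data/a.txt", None),
]

fsimage = apply_edits(fsimage, edit_log)  # checkpoint: merge the delta into the image
edit_log = []                             # start a fresh, small edit log
print(fsimage)  # {'/data/b.txt': {'blocks': 4}}
```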



IF THE SECONDARY NAME NODE IS DOWN nothing will break immediately, the name node will keep writing to the edit log, the edit log will become bigger and bigger and the system will become slower and slower.


new lab, we used hadoop fs -put

when you do the installation with Cloudera Manager, a trash directory will be created for you by default, when you delete something it will be moved to the trash directory.
if the directory is not created, it is recommended to create one.


High Availability Name Node
the name node as a single point of failure is not acceptable,
thats why we have a new solution by cloudera, which was introduced in Hadoop 2 and is called Name Node high availability.

as you can see, the standby name node will take over if the name node is off AND YOU DONT HAVE TO START THE WRITE OR READ OPERATIONS FROM THE BEGINNING.

NOTE IMPORTANT: Clients send all operations to both the NN and the Standby NN, so both of them have a complete picture in memory of what is happening.

With the architecture above, i can handle the failure of the NAME NODE, however the fsimage and edits log are still a Single Point of Failure.
that is why the High Availability structure introduced a new thing called JournalNode.

 the current active name node now writes, synchronously, the fsimage and edits log to a set of journal nodes, and the standby NN reads from these nodes

so that the name node and the standby name node do not misunderstand each other (maybe both think that they are the active name node), they send something called an epoch number with each write to the Journal Nodes
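
The idea behind the epoch number can be sketched in Python as follows. This is a toy model of the fencing rule only (a writer with an older epoch is rejected); the class, its fields and the sample edits are hypothetical and do not mirror the real JournalNode code.

```python
class JournalNode:
    def __init__(self):
        self.highest_epoch = 0
        self.edits = []

    def write(self, epoch, edit):
        """Accept the edit only if the writer's epoch is not older than the newest one seen."""
        if epoch < self.highest_epoch:
            return False  # stale writer: an old active name node is fenced off
        self.highest_epoch = epoch
        self.edits.append(edit)
        return True

journal = JournalNode()
print(journal.write(1, "mkdir /a"))  # True  (first active name node)
print(journal.write(2, "mkdir /b"))  # True  (standby took over with a newer epoch)
print(journal.write(1, "mkdir /c"))  # False (old name node still thinks it is active)
```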


we use a cluster of ZooKeeper nodes to determine who is the active name node (the number of nodes should be odd to avoid split brain).
as you can see we have a ZKFC service on the NameNode and on the Standby Node, they send information to the ZooKeeper cluster about the health of the node,
if the NAME NODE ZKFC notices that the NameNode is down, it will send this information to ZooKeeper, and ZooKeeper will set the standby node as the active name node and the old name node as the standby one

as you can see HA is complicated, extra machines, extra configurations ...
you dont need this most of the time, the secondary name node on a different machine is usually enough.


scale the name node functions by splitting the namespace across multiple machines.

hadoop has authorization, but it doesnt have authentication, for example lets say you are sending a write request to machine1 as user xxx, user xxx is not authorized to write but user yyy is, simply create a user yyy and send the request as yyy, hadoop will not check that you really are yyy.

to do authentication you should use something else, Kerberos.

hadoop uses linux like permissions.



these are the players of map reduce

and here is how the job is done

we also have a new version which is called MapReduce v2, in this version they focus on the scalability of the job tracker and on removing the restriction on the number of map and reduce tasks that can run on each slave machine.

the Map Reduce configuration files are:
1- mapred-site.xml

in this lab he gave an example of how to run a java map reduce job

this is the statement to run a map reduce job, hadoop-examples.jar contains the Map and Reduce java classes.

he went over every line of code, you can check it.

How MapReduce works in detail

so to summarize, the job tracker asks the name node where the blocks are, it assigns some slaves to do the map tasks, then it assigns one or more slaves to do the reduce tasks, and the reduce task trackers WILL COPY THE OUTPUT OF THE MAP TASK TRACKERS TO THEIR LOCAL MACHINES.


Hadoop is rack aware


Advanced MapReduce, Partitioners, Combiners, Comparators, and more

 firstly we should know that the Mappers and Reducers do some kind of sorting

The mapper sorts its output by key, and the reducer, after the shuffle, also sorts by key.

You can define a Comparator to do secondary sorting, which sorts the values in the Reducer, so in the example above we have us:[55,20] and the secondary sorting will sort it to us:[20,55].

we can also define what we call a combiner, which is a pre-reducer, the combiner runs in the Map phase, as you can see in the example above, the first mapper adds up the US values and the output is 55, this is the combiner's job.
With a combiner you may reduce the processing time and the intermediate data.

we also have something called a partitioner

the mapper can split its output into multiple partitions, and later each reducer fetches the partition that it is interested in,
in the example above we partitioned by key, and as you can see each reducer grabs a specific key.
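
To tie the combiner, the partitioner and the secondary sort together, here is a single process Python sketch of the flow. It is not Hadoop code: the input records, the toy partition rule and all function names are assumptions made for illustration, chosen so that the us values come out as [55, 20] before sorting, like in the example above.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit one (key, value) pair per input record.
    return [(country, amount) for country, amount in records]

def combine(pairs):
    # Combiner: pre-reduce on the map side by summing the values per key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())  # mapper output is sorted by key

def partition(key, num_reducers):
    # Partitioner: choose the reducer for a key (a deterministic toy rule).
    return sum(key.encode()) % num_reducers

def run_job(splits, num_reducers):
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one mapper per input split
        for key, value in combine(map_phase(split)):
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    results = {}
    for grouped in reducer_inputs:   # each reducer only sees its own partition
        for key in sorted(grouped):  # reducer input is sorted by key
            values = sorted(grouped[key])  # stands in for the secondary sort
            results[key] = (values, sum(values))
    return results

splits = [
    [("us", 25), ("us", 30), ("uk", 10)],  # first mapper: the combiner turns us into 55
    [("us", 20), ("uk", 5)],               # second mapper
]
print(run_job(splits, num_reducers=3))
# {'us': ([20, 55], 75), 'uk': ([5, 10], 15)}
```

Without the combiner the first mapper would ship two us values instead of one, which is the intermediate data saving mentioned above.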

There is a full example about writing a Partitioner.


for unit testing you have MRUnit, which is an apache project.

he gave a practical example about logging as well

when we do benchmarking we talk about the terasort number, this number gives us an indicator of the performance of the cluster, and whether adding a new machine gave us a gain in performance.

TERASORT is simply a sort, one of the simplest mapreduce jobs hadoop can do. to do a TERASORT test you should use 3 scripts
1- teragen: to generate a dataset
2- terasort: a job that sorts the dataset.
3- teravalidate: used to validate that the dataset got sorted.
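
The three steps can be mimicked in plain Python to show what each script is responsible for. This only borrows the names; it is a sketch with a tiny in-memory dataset and says nothing about how the real Hadoop jobs are implemented or invoked.

```python
import random

def teragen(count, seed=0):
    """Generate a dataset of random keys (the seed keeps runs repeatable)."""
    rng = random.Random(seed)
    return [rng.randrange(1_000_000) for _ in range(count)]

def terasort(dataset):
    """Sort the dataset; on a cluster this is the step you would time."""
    return sorted(dataset)

def teravalidate(dataset):
    """Check that every key is less than or equal to the next one."""
    return all(a <= b for a, b in zip(dataset, dataset[1:]))

data = teragen(1000)
print(teravalidate(terasort(data)))  # True
```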

Hive vs Pig Vs Impala

we know Hive and Pig, we know that they simply convert your requests to MapReduce jobs.
they are in general 10-15% slower than native java mapreduce.

as Hive and Pig convert the requests to MapReduce, they use the job tracker and task trackers

Impala is developed by cloudera, it is designed for real time queries, it uses its own daemons, not the task trackers and job tracker. IMPALA DOESNT USE MAPREDUCE AT ALL.
Impala is not fault tolerant. Basically MapReduce is slow because of the time we need to start the jvms for the map reduce tasks. Impala uses its own daemons.
Impala is on top of Hive, so it uses Hive (actually its query language is a subset of HiveQL)



in HIVE, you can do the installation on each client and start querying.

or, you can have a HIVE server:

we always need a metastore, where we store the mapping between HIVE tables and HDFS data.

NOTE: in HIVEQL there is no update or delete, as HIVE runs on top of Hadoop and as we mentioned before you cannot edit a file in HDFS.

Check the HIVE & PIG LAB.



Data Import and Export

we have 2 types of import and export:
1- Real Time Ingestion and Analysis:
products like Flume, Storm, Kafka, and Kinesis
the idea of these products is that you have multiple agents that push and pull data from each other.

these systems dont care if the end system is Hadoop or NoSQL or a flat file

The products are similar, however Storm, Kafka and Kinesis have more analysis functionality than Flume

2- Database Import Export:
Sqoop (SQL to Hadoop)
it is simply a single process that imports/exports data to/from hadoop.

there is no analysis or filtering or.. just import and export.

you can do something like: at 2:00 pull all the data from hadoop and put it in table xxx.


Flume is used to move massive amounts of data from system A to system B (which is usually HDFS, MongoDB, NoSQL ...)

He talked about the architecture of FLUME and there is a LAB.


some REST call examples


he gave a lab about sqoop

Oozie is used to build a workflow, the workflow is represented in XML format

Tuesday, March 8, 2016

EJB 3.1 cookbook

Chapter 1: Getting Started with EJBs

Creating a simple session EJB

simple example with @Stateless annotation.

very important is that @Stateless takes a parameter

mappedName is used as a JNDI name.

Accessing a session bean using dependency injection

as you can see, we inject the bean using @EJB.

and we used here @WebServlet to define a servlet.

Accessing the session bean using JNDI

as you can see, we use InitialContext() and lookup() to find the bean.

we used this path in the code
which is 

java:global: means search in all beans that are globally accessible
java:app: means search in all the beans that can be seen in the same application
java:module: means search in all the beans that can be seen in the same module

IMPORTANT: the bean can be packaged in application-ejb.jar, or application-war.war. the application-ejb.jar can be packaged inside application.ear.

so the [<app-name>] part is used only when you are packaged inside application.ear
the module-name is the name of the war.war or ejb.jar
the bean-name is the name of the bean.
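a minimal sketch of how the three parts above combine into a lookup name, assuming the portable JNDI syntax from the EJB 3.1 spec (java:global[/<app-name>]/<module-name>/<bean-name>[!<interface>]). the helper class JndiNames and the names passed to it (including the example.* package) are hypothetical, they only illustrate the structure and do not come from the book's code.

```java
/** Hypothetical helper that assembles portable JNDI names (EJB 3.1 syntax). */
final class JndiNames {
    private JndiNames() {}

    // java:global[/<app-name>]/<module-name>/<bean-name>[!<interface>]
    static String global(String appName, String moduleName, String beanName, String iface) {
        StringBuilder sb = new StringBuilder("java:global");
        if (appName != null) {
            // the app-name segment exists only when the module is packaged inside an .ear
            sb.append('/').append(appName);
        }
        sb.append('/').append(moduleName).append('/').append(beanName);
        return withInterface(sb, iface);
    }

    // java:app/<module-name>/<bean-name>[!<interface>]
    static String app(String moduleName, String beanName, String iface) {
        StringBuilder sb = new StringBuilder("java:app/").append(moduleName).append('/').append(beanName);
        return withInterface(sb, iface);
    }

    // java:module/<bean-name>[!<interface>]
    static String module(String beanName, String iface) {
        return withInterface(new StringBuilder("java:module/").append(beanName), iface);
    }

    // the interface suffix is needed when the bean exposes more than one view
    private static String withInterface(StringBuilder sb, String iface) {
        if (iface != null) {
            sb.append('!').append(iface);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(global("application", "application-ejb", "Salutation", null));
        System.out.println(app("application-ejb", "Salutation", "example.SalutationRemoteInterface"));
        System.out.println(module("Salutation", null));
    }
}
```

the resulting string is what you would pass to InitialContext.lookup().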

so if you want to search for beans inside a module :

searching inside the application:

we will see later that the bean could implement a localInterface and/or RemoteInterface

public class Salutation implements SalutationLocalInterface, SalutationRemoteInterface {

in this case your JNDI lookup will be


Creating a simple message-driven bean

as you can see we use @MessageDriven, we implement MessageListener and override onMessage.

mappedName: is the name of the queue that we are going to listen to
and we added some config, which is the acknowledgeMode and the destinationType.

Sending a message to a message-driven bean

as you can see, you need a queueConnectionFactory and a queue.
we create a connection
then we create a session
then we create a producer
then we send the message.

also you can see that we injected the resources (the connection factory and the queue, which you usually create in Glassfish) using @Resource

Accessing an EJB from a web service (JAX-WS)

firstly we will define a singleton bean

simply we use @Singleton

then we will use this bean inside a JAX-WS web service

as you can see we use @WebService and @WebMethod to define a service.

and of course, to inject the bean we use @EJB.

Accessing an EJB from a web service (JAX-RS)

firstly we will define a Stateless bean

then we can define

as you can see we define @Path, @GET, @POST ...
and of course, to inject the bean @EJB is used.

Accessing an EJB from an Applet

you can access a bean from an Applet; check the example if you want.

Accessing an EJB from JSP

in this example we will create a Remote Stateless EJB.

to do that, firstly we should define the remote interface:

then we implement the interface in the bean class

now we will use InitialContext to get the bean in the JSP.

Calling an EJB from JSF

calling an EJB from JSF is similar to JSP, however before EJB 3.1 we had to define what we call a managed bean, which is like a wrapper around the EJB.

in this example we will see how to define a managed bean.

firstly we will define the bean

@Named is similar to @Component in Spring

then we define the managed bean

as you can see the managed bean is just a wrapper for the actual bean

now we can use the bean like this

Accessing an EJB from a Java Application using JNDI

accessing EJB from Java Application can be done easily by using JNDI

Accessing an EJB from a Java Application using an embeddable container

you can access the EJB using what we call an embeddable container. the embeddable EJB container allows EJBs to be executed outside of a Java EE environment.

the code looks like this

Accessing the EJB container

an EJB needs to access the container in order to use its services (security, transactions ...).

accessing the container happens through the EJBContext interface.

as you can see we defined a SessionContext (which extends EJBContext) and annotated it with @Resource

we have 3 contexts:

SessionContext for Session Beans
MessageDrivenContext for MDB
EntityContext for an Entity

EJB 3.1 cookbook Chapter 2: Session Bean 2

a session bean has 3 types: Stateless, Stateful and Singleton

Stateless: no state
Stateful: keeps the state between calls

beans can be accessed locally (no-interface view or local interface) or remotely
when accessed locally, the client should be in the same JVM
parameters are passed by reference for local access, and by value for remote access.
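to make the by-reference / by-value difference concrete, here is a small plain-Java sketch (no EJB container involved). the "remote" call is only simulated by copying the argument through Java serialization, which is an assumption about what a remote invocation does to its arguments; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

class ParameterPassingDemo {
    // local call: the callee gets the caller's reference, so its change is visible to the caller
    static void addLocally(ArrayList<String> names) {
        names.add("added-by-callee");
    }

    // simulated remote call: the callee works on a serialized copy, the caller's list is untouched
    static void addRemotely(ArrayList<String> names) {
        ArrayList<String> copy = copyBySerialization(names);
        copy.add("added-by-callee");
    }

    @SuppressWarnings("unchecked")
    static <T extends Serializable> T copyBySerialization(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not copy argument", e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> names = new ArrayList<>();
        addLocally(names);
        System.out.println(names.size()); // 1: the local callee changed our list
        addRemotely(names);
        System.out.println(names.size()); // still 1: the "remote" callee changed only its copy
    }
}
```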

Creating a stateless session bean

as you can see we use @Stateless, @LocalBean (there is no need to use this one, it is the default value).

The lifecycle of Stateless bean has @PostConstruct and @PreDestroy

Creating a stateful session bean


Creating a singleton bean


Using multiple singleton beans

as you can see we used
@Startup: which means initialize the bean as soon as the application starts up
@DependsOn("BEANNAME"): means that the named bean (PlayerBean in this example) should be initialized before this bean
you can write @DependsOn({"x","y","z"}) to depend on several beans

Using container managed concurrency

by default the container is responsible for handling concurrent requests to a Singleton bean: only one client can access the bean at a time, whether for read or write

you can change the concurrency behaviour by using @ConcurrencyManagement and @Lock

as you can see we say here that the container (which is the default behaviour) will take care of handling the concurrency, and we limit getState() to a read lock and setState() to a write lock

you can also specify the timeout for a lock by using @AccessTimeout(5000) (the value is in milliseconds by default)
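the read/write locking that the container applies can be pictured with a plain-Java analogue. this sketch does not use the EJB API: it assumes that @Lock(READ), @Lock(WRITE) and @AccessTimeout behave roughly like a ReentrantReadWriteLock with a timed tryLock, and the class SharedState is hypothetical. one difference to keep in mind: on timeout the container throws an exception, while this sketch just returns false.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SharedState {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private String state = "initial";

    // comparable to @Lock(LockType.READ): several readers may hold the lock together
    String getState() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    // comparable to @Lock(LockType.WRITE): exclusive access, readers and writers wait
    void setState(String newState) {
        lock.writeLock().lock();
        try {
            state = newState;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // comparable to @AccessTimeout: give up if the lock is not obtained in time
    boolean trySetState(String newState, long timeoutMillis) {
        try {
            if (!lock.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            state = newState;
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        SharedState shared = new SharedState();
        shared.setState("running");
        System.out.println(shared.getState());
    }
}
```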

Using bean managed concurrency

with bean managed concurrency, you should handle everything yourself, for example by using the synchronized keyword.
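a small plain-Java sketch of what "handle it yourself" means: the class guards its own state with synchronized methods. HitCounter is a hypothetical class, not a bean from the book; in a real singleton you would additionally mark the bean with @ConcurrencyManagement(ConcurrencyManagementType.BEAN).

```java
class HitCounter {
    private int hits;

    // synchronized makes the read-modify-write of hits atomic across threads
    synchronized void increment() {
        hits++;
    }

    synchronized int getHits() {
        return hits;
    }

    // starts several threads that all increment the same counter, then returns the total
    static int countWithThreads(int threads, int incrementsPerThread) {
        HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.getHits();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 1000)); // 4000, no lost updates
    }
}
```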

Using session beans with more than one business interface

you can use multiple interfaces with a bean, 

we used @Named just to use it with JSF

Understanding parameter behavior and granularity

we know that local beans run in the same JVM and remote beans in a different JVM

when you have a class with multiple private variables and you need one call to get each variable ==> in case of local beans this is fine as we are doing local calls, however in case of remote beans this is a lot of overhead. (FINE GRAINED APPROACH)

you can also pass the whole object ==> only a single call, which is good in case of remote calls (COARSE GRAINED APPROACH)

we know that when passing objects between JVMs we are passing by value, so it is a good practice to make the object immutable.

lets take an example about fine grained:

we have this interface which represents the Orbit

the implementation for this interface

as you can see, if you want any value from this remote bean you should make a call, so you need to make 6 calls to get all the Orbit information

in the example above we made a call to get the Eccentricity value; if you want to get the Longitude you should do position.getLonituteof...().

a lot of calls.

however if you go with the coarse grained fashion, you can define the remote interface like this

as you can see there is just one method, which returns an object

as you can see it returns an object

and you can do the call like this

as you can see orbitalElements.getPosition() will return the whole object in one call, no need for other remote calls.
now when you do getEccentricity() you are doing a local call
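the cost difference between the two approaches can be sketched in plain Java by counting simulated round trips. everything here is an assumption for illustration: RemoteOrbit stands in for the remote bean, only three of the orbit values are modelled, and the numbers are made up.

```java
import java.util.concurrent.atomic.AtomicInteger;

class GranularityDemo {
    // counts simulated network round trips
    static final AtomicInteger remoteCalls = new AtomicInteger();

    /** Immutable value object, safe to hand over by value. */
    static final class Position {
        private final double eccentricity;
        private final double inclination;
        private final double longitude;

        Position(double eccentricity, double inclination, double longitude) {
            this.eccentricity = eccentricity;
            this.inclination = inclination;
            this.longitude = longitude;
        }

        double getEccentricity() { return eccentricity; }
        double getInclination() { return inclination; }
        double getLongitude() { return longitude; }
    }

    /** Stand-in for a remote bean: every method call costs one round trip. */
    static final class RemoteOrbit {
        private final Position position = new Position(0.0167, 0.0, 102.9);

        double getEccentricity() { remoteCalls.incrementAndGet(); return position.getEccentricity(); }
        double getInclination() { remoteCalls.incrementAndGet(); return position.getInclination(); }
        double getLongitude() { remoteCalls.incrementAndGet(); return position.getLongitude(); }
        Position getPosition() { remoteCalls.incrementAndGet(); return position; }
    }

    // fine grained: one remote call per value
    static int fineGrainedCalls() {
        remoteCalls.set(0);
        RemoteOrbit orbit = new RemoteOrbit();
        orbit.getEccentricity();
        orbit.getInclination();
        orbit.getLongitude();
        return remoteCalls.get();
    }

    // coarse grained: one remote call, then only local getter calls
    static int coarseGrainedCalls() {
        remoteCalls.set(0);
        Position position = new RemoteOrbit().getPosition();
        position.getEccentricity();
        position.getInclination();
        position.getLongitude();
        return remoteCalls.get();
    }

    public static void main(String[] args) {
        System.out.println(fineGrainedCalls());   // 3
        System.out.println(coarseGrainedCalls()); // 1
    }
}
```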

Using an asynchronous method to create a background process

if you want the bean method to run asynchronously so you don't have to wait for the results, you have 2 options:
1- Invoke and Forget: which means you run the method and you don't care about the results
2- Invoke and Return in Future: which means you run the method, and the results will be stored in a Future object which you can access later.

we will see the 2 approaches in this example:

as you can see we use @Asynchronous with printAndForget().
and we return Future<String> in case we want a future object,
as you can see we return new AsyncResult<String>()

to use this bean:

as you can see we do futureResult.get() to get the results, and you should handle the exceptions

NOTE: the Future object is not just used for getting the results, you can also use it to cancel the task, check if it has completed and other things
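the two styles can be tried out in plain Java with an ExecutorService, which hands back the same java.util.concurrent.Future type that an @Asynchronous method returns. this is only an analogue: there is no container and no AsyncResult here, and the method printAndCheckLater and the message values are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class AsyncDemo {
    // invoke and forget: nothing is returned to the caller
    static void printAndForget(ExecutorService pool, String message) {
        pool.execute(() -> System.out.println("printing " + message));
    }

    // invoke and return in future: the caller keeps a handle to the pending result
    static Future<String> printAndCheckLater(ExecutorService pool, String message) {
        Callable<String> task = () -> "printed " + message;
        return pool.submit(task);
    }

    static String runDemo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            printAndForget(pool, "report-1");
            Future<String> futureResult = printAndCheckLater(pool, "report-2");
            // get() blocks until the result is ready; isDone() and cancel() are also available
            return futureResult.get(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException | TimeoutException e) {
            return "failed: " + e;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```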

EJB 3.1 cookbook Chapter 3: Message-Driven Beans 3

we know everything about Message Driven Bean, we will start by example directly

Handling a string-based message

for string based message, simply you can write

and you can read the message

Handling a byte-based message

and reading from queue

Handling a stream-based message

and reading

Handling a map-based message

and the read

Handling an object-based message

and you read that

Using an MDB in a point-to-point application

all the previous examples were point to point, where we have the following architecture

also consider this architecture, sometimes it is better to have a chain of queues

Using MDB in a publish-and-subscribe application

in case of publish-subscribe we should create a Topic.

this is how to receive the message, and as you remember, a durable subscription means that the messages will be kept in the Topic for the subscriber while it is offline.

and here is how we send a message

Specifying which types of message to receive using the message selector

now to read this specific type of messages

Browsing messages in a message queue

you can use a QueueBrowser to browse the queue.

EJB 3.1 cookbook Chapter 4 & Chapter 5 & Chapter 6 : EJB Persistence & JPA Query & Transaction Processing

An entity is a class representing data persisted to a backing store using JPA. The @Entity annotation designates a class as an entity

Entities can also be declared in an orm.xml file

The persistence unit defines mapping between the entity and the data store
The persistence context keeps track of the state and changes made to its entities
The EntityManager manages the entities and their interaction with the data store

Creating an entity

@Entity: for defining an entity
@Id: for setting the primary key
@GeneratedValue: to generate primary key value

Creating an entity facade

the idea is to create an abstract facade with general functions:

now you start extending this facade in your beans

Using the EntityManager

now what you can do is using the bean defined before like this:

Controlling the Object-Relationship Mapping (ORM) process

you define database information and persistence unit in persistence.xml

we have annotations like

Using embeddable classes in entities

you can embed a class within an entity, and both will be stored in the same table.

for example, you can have an Employee class

and you can define the Address class like this

with @Embeddable

the table in the database will be a single table (EMPLOYEE) with the Employee and Address fields.

Validating Fields

private String name;

private String name;

private String name;

@Size(min=12, max=36)
private String name;

private Date dateOfBirth;

@Past // the value should be in the past
private Date dateOfBirth;

@Future // the value should be in the future
private Date dateOfBirth;

private String zipCode;

@AssertTrue// this means resident should be true
private boolean resident;

private int monthsToExpire;

you can also use a Validator class to do the validation

Chapter 5 JPA Query

nothing much here we will just add few examples

1- create and run a query
public List<Patient> findAll() {
    Query query = entityManager.createQuery("SELECT p FROM Patient p");
    // getResultList() on an untyped Query returns a raw List, so this assignment is unchecked
    List<Patient> list = query.getResultList();
    return list;
}

2- control the number of returned entities
Query query = entityManager.createQuery("SELECT p FROM Patient p");
query.setMaxResults(10); // return at most 10 entities (the limit value here is illustrative)
List<Patient> list = query.getResultList();

3- delete query
public int delete(String firstName, String lastName) {
    // note: concatenating values into the query is open to injection; prefer parameters (see 5)
    Query query = entityManager.createQuery("DELETE FROM Patient p "
            + "WHERE p.firstName = '" + firstName + "' AND p.lastName = '"
            + lastName + "'");
    int numberDeleted = query.executeUpdate();
    return numberDeleted;
}

4- update query
public int updateDosage(String type, int dosage) {
    // note: concatenating values into the query is open to injection; prefer parameters (see 5)
    Query query = entityManager.createQuery("UPDATE Medication m " +
            "SET m.dosage = " + dosage + " WHERE m.type = '" + type + "'");
    int numberUpdated = query.executeUpdate();
    return numberUpdated;
}

5- use parameter in query
public List<Patient> findByLastName(String lastName) {
    Query query = em.createQuery("SELECT p FROM Patient p WHERE p.lastName = :lastName");
    // the named parameter :lastName is bound here instead of being concatenated
    query.setParameter("lastName", lastName);
    List<Patient> list = query.getResultList();
    return list;
}

6- using named query
@Entity
@NamedQuery(name = "findByType",
        query = "SELECT m FROM Medication m WHERE m.type = ?1")
public class Medication implements Serializable { ...}

public List<Medication> findByType(String type) {
    Query query = entityManager.createNamedQuery("findByType");
    query.setParameter(1, type); // bind the positional parameter ?1
    return query.getResultList();
}

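7- Using the Criteria API

public void findAllMales(PrintWriter out) {
    CriteriaBuilder criteriaBuilder;
    criteriaBuilder = getEntityManager().getCriteriaBuilder();
    CriteriaQuery<Patient> criteriaQuery =
            criteriaBuilder.createQuery(Patient.class);
    Root<Patient> patientRoot = criteriaQuery.from(Patient.class);
    // the restriction that keeps only male patients is missing from these notes;
    // it would be a criteriaQuery.where(...) built on patientRoot
    List<Patient> patients =
            getEntityManager().createQuery(criteriaQuery).getResultList();
    for (Patient p : patients) {
        out.println("<h5>" + p.getFirstName() + "</h5>");
    }
}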

CMTs can be used with session beans, message-driven beans, and entities. However, BMTs
can only be used with session- and message-driven beans.

CHAPTER 6 Transactions

you have either container managed transaction or bean managed transactions.

by default we have container managed transactions.

Using the SessionSynchronization interface with session beans

if you implement SessionSynchronization interface you can use functions like afterBegin, beforeCompletion, afterCompletion

we have something important which is called TransactionAttributeType which you can set for methods or classes

REQUIRED – Must always be part of a transaction
REQUIRES_NEW – Requires the creation of a new transaction
SUPPORTS – Becomes part of another transaction if present
MANDATORY – Must be used as part of another transaction
NOT_SUPPORTED – May not be used as part of a transaction
NEVER – Similar to NOT_SUPPORTED but will result in an EJBException being thrown

A Message Driven Bean (MDB) only supports the REQUIRED and NOT_SUPPORTED values.

usually the TransactionAttributeType is set on the method level; it defines how the method will behave depending on whether there is a parent transaction or not
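the six attribute values can be read as a decision table: given the attribute and whether the caller already has a transaction, what does the container do? the sketch below encodes that table in plain Java. the enum and method names are hypothetical (they are not the javax.ejb types), and the outcomes follow the descriptions in the list above.

```java
class TxAttributeDemo {
    enum Attr { REQUIRED, REQUIRES_NEW, SUPPORTS, MANDATORY, NOT_SUPPORTED, NEVER }

    enum Outcome {
        JOIN_CALLER_TX,
        START_NEW_TX,
        SUSPEND_CALLER_AND_START_NEW_TX,
        RUN_WITHOUT_TX,
        SUSPEND_CALLER_AND_RUN_WITHOUT_TX,
        THROW_EXCEPTION
    }

    // what happens when a method with the given attribute is called
    static Outcome outcome(Attr attr, boolean callerHasTx) {
        switch (attr) {
            case REQUIRED:
                return callerHasTx ? Outcome.JOIN_CALLER_TX : Outcome.START_NEW_TX;
            case REQUIRES_NEW:
                return callerHasTx ? Outcome.SUSPEND_CALLER_AND_START_NEW_TX : Outcome.START_NEW_TX;
            case SUPPORTS:
                return callerHasTx ? Outcome.JOIN_CALLER_TX : Outcome.RUN_WITHOUT_TX;
            case MANDATORY:
                return callerHasTx ? Outcome.JOIN_CALLER_TX : Outcome.THROW_EXCEPTION;
            case NOT_SUPPORTED:
                return callerHasTx ? Outcome.SUSPEND_CALLER_AND_RUN_WITHOUT_TX : Outcome.RUN_WITHOUT_TX;
            case NEVER:
                return callerHasTx ? Outcome.THROW_EXCEPTION : Outcome.RUN_WITHOUT_TX;
            default:
                throw new AssertionError(attr);
        }
    }

    public static void main(String[] args) {
        for (Attr attr : Attr.values()) {
            System.out.println(attr + ": with caller tx -> " + outcome(attr, true)
                    + ", without -> " + outcome(attr, false));
        }
    }
}
```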

Handling transactions manually


then you should start and commit transactions yourself


Rolling back a transaction

in case of bean managed transactions you can roll back using these methods
UserTransaction.rollback(): which causes an immediate rollback of the transaction
UserTransaction.setRollbackOnly(): which marks the transaction for rollback; the transaction is not interrupted, it continues to the end and can then only be rolled back.

in case of container managed transactions you can only use SessionContext.setRollbackOnly().

Handling errors in a transaction

If an unchecked exception is thrown, a transaction is automatically rolled back. For checked exceptions, the UserTransaction's rollback method or the SessionContext's setRollbackOnly method are used to explicitly force a rollback.

when you define an exception you can set whether it should cause a rollback or not, using @ApplicationException(rollback=true).

Using timeouts with transactions

if you are using container managed transactions you can change the transaction timeout from the container GUI,

for bean managed transactions you can use UserTransaction.setTransactionTimeout(seconds).

EJB 3.1 cookbook Chapter 7 EJB Security

you can use annotation to secure access to methods, or you can use some code, use the code when the annotation cannot do what you want (e.g. access is allowed only in the morning).

When we talk about security we talk about REALM, users, groups and roles.
we define the REALM and under it the users and groups in the JAVAEE server (e.g. glassfish), usually they have a GUI for that.

the roles are defined on the application level, we assign the roles to groups and users.

as you can see we defined the roles and security-constraint in web.xml

mapping roles to groups and users should be done in sun-application.xml, sun-web.xml, or sun-ejb-jar.xml, depending on how the application is deployed

now when you write code

as you can see you declare the roles that the class will handle, then you use @RolesAllowed, @PermitAll and @DenyAll.

sometimes you need a class to run under a higher role: let's say that you have an Employee role and you want to call something which needs a Manager role; the class itself can allow you to do that by using the @RunAs annotation

How to control security dynamically
after the user is authenticated by the Java EE server, the user is represented as a Principal object as part of the context; you can use this object for programmatic access control.
some of the calls that you might need

Principal principal = sessionContext.getCallerPrincipal();
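a sketch of the "access is allowed only in the morning" kind of rule, which annotations alone cannot express. it is plain Java built on java.security.Principal, so there is no container here: the set of manager names is a hypothetical stand-in for a role check (in a real bean you would ask sessionContext.isCallerInRole(...) instead), and the class name is made up.

```java
import java.security.Principal;
import java.time.LocalTime;
import java.util.Set;

class MorningAccessCheck {
    // allows the call only for known managers, and only before noon
    static boolean isAllowed(Principal caller, Set<String> managerNames, LocalTime now) {
        boolean isManager = managerNames.contains(caller.getName());
        boolean isMorning = now.isBefore(LocalTime.NOON);
        return isManager && isMorning;
    }

    public static void main(String[] args) {
        Principal caller = () -> "alice"; // Principal has a single abstract method, getName()
        Set<String> managers = java.util.Collections.singleton("alice");
        System.out.println(isAllowed(caller, managers, LocalTime.of(9, 0)));  // true
        System.out.println(isAllowed(caller, managers, LocalTime.of(15, 0))); // false
    }
}
```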

EJB 3.1 cookbook Chapter 8 Interceptors 

To use interceptors:
1- define your interceptors class

2- specify where you want to use this interceptor

as you can see here the interceptor is on class level

3- you can define multiple interceptors like this:

4- when you define an interceptor on class level, it will be applied to all methods, if you want to exclude some methods, you write:

5- you can also define an interceptor on method level

6- you can define an interceptor for all EJBs.
you do that by adding <interceptor-binding> in ejb-jar.xml

7- as you can see, the interceptor method has a parameter of type InvocationContext, which has useful methods like:

8- you can also annotate interceptor methods with @PostConstruct, @PreDestroy, @PrePassivate and @PostActivate
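the idea behind an interceptor (run some code before and after the business method, then let the call proceed) can be sketched in plain Java with a dynamic proxy. this is an analogue, not the EJB mechanism: method.getName() plays the role of InvocationContext.getMethod(), method.invoke(...) plays the role of InvocationContext.proceed(), and the Greeter interface and the log list are hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class InterceptorDemo {
    interface Greeter {
        String greet(String name);
    }

    // collects what the "interceptor" saw, in call order
    static final List<String> log = new ArrayList<>();

    // wraps the target so that every call on the interface is intercepted
    @SuppressWarnings("unchecked")
    static <T> T intercept(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            log.add("before " + method.getName());
            try {
                return method.invoke(target, args); // continue to the real method
            } catch (InvocationTargetException e) {
                throw e.getCause();                 // rethrow the business exception itself
            } finally {
                log.add("after " + method.getName());
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] { iface }, handler);
    }

    public static void main(String[] args) {
        Greeter plain = name -> "hello " + name;
        Greeter intercepted = intercept(plain, Greeter.class);
        System.out.println(intercepted.greet("bob"));
        System.out.println(log);
    }
}
```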

EJB 3.1 cookbook Chapter 9 Timer Service & Chapter 10 Web Services & Chapter 11 Packaging EJB & Chapter 12 EJB Techniques 

To schedule a method to run at a specific time

you can also create an event programmatically
1- define a timer service resource
@Resource
TimerService timerService;
2- create an action timer:
3- create a timeout function

createSingleActionTimer() will create an event for one time
createIntervalTimer() will create interval events
createCalendarTimer() will create calendar event
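the difference between a single-action timer and an interval timer can be tried out in plain Java with a ScheduledExecutorService. this is only an analogue of the TimerService: the delays are arbitrary, the class name is hypothetical, and there is no direct counterpart here for calendar timers or for persistence.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class TimerDemo {
    // returns true when the one-shot task fired once and the repeating task fired 3 times
    static boolean runTimers() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch single = new CountDownLatch(1);
        CountDownLatch repeated = new CountDownLatch(3);
        Runnable fireOnce = single::countDown;
        Runnable fireRepeatedly = repeated::countDown;
        try {
            // like createSingleActionTimer(): fires one time after the delay
            scheduler.schedule(fireOnce, 50, TimeUnit.MILLISECONDS);
            // like createIntervalTimer(): fires again and again with a fixed period
            scheduler.scheduleAtFixedRate(fireRepeatedly, 0, 20, TimeUnit.MILLISECONDS);
            return single.await(5, TimeUnit.SECONDS) && repeated.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow(); // stop the interval task
        }
    }

    public static void main(String[] args) {
        System.out.println(runTimers());
    }
}
```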

the Timer object in the timeout function has a lot of useful methods

Persistent vs non-persistent timers
a persistent timer survives a server shutdown: events that expire while the server is down are recorded and executed later.
you can define whether a timer is persistent in @Schedule

@Schedule(second="0", minute="*", hour = "*", info="", persistent=true)


Chapter 10 Web Services

to define web services you can use JAX-WS 
@WebService, @WebMethod and @WebParam

in order to define RESTFul services, you can use JAX-RS
@Path, @GET, @Produces("text/html"), @QueryParam, 


Chapter 11 Packaging the EJB

1- *-ejb.jar: this jar file contains your EJBs, the deployment descriptor is ejb-jar.xml, it will be inside META-INF ( if you annotated your classes then there is no need for ejb-jar.xml ).
2- *.war: host your web application, the deployment descriptor is web.xml, it will be inside WEB-INF.
3- *.ear: put jars and wars inside it, application.xml is the deployment descriptor.
4- *.rar: this is to define resource adapters, this is something related to JAVA EE Connector Architecture, for integration; the deployment descriptor is ra.xml inside META-INF

then the chapter talks about class loading, as we know class loading is vendor specific, it is not a standard specification thing.


Chapter 12 EJB Techniques

this chapter talks about general things, like using currency, handling exceptions, using interceptors to handle exceptions and logging ....