Thursday, January 3, 2013

O'Reilly: Learning MongoDB

Introduction to MongoDB

1- Documents are represented as JSON and are saved in the binary BSON format

Some terms in MongoDB are:
1- Document: inside documents we have FIELDS
2- Collection: a list of documents with a similar structure
3- a document may REFERENCE another document in another collection (i.e. creating relationships)
4- Embedded Document: you can embed one document inside another, so no reference between collections is needed
5- Cursor: when you run a query, it returns a cursor; we iterate over this cursor to read the results.

Some MongoDB Features

here we created an index on the field name; the value 1 means ascending order.
as you can see, the created index holds a list of names, each with a direct link to the location of its document

you can have an index on multiple fields

as you can see we are sorting cost descending and name ascending
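The ordering described above (cost descending, then name ascending) can be imitated in plain JavaScript to see what comes out. This is only a sketch: the sample documents are made up, and MongoDB would walk the index instead of sorting in application code.

```javascript
// Hypothetical sample documents; only the field names come from the notes.
const products = [
  { name: "pen", cost: 5 },
  { name: "book", cost: 20 },
  { name: "apple", cost: 5 },
  { name: "lamp", cost: 20 },
];

// Order by cost descending (-1), then by name ascending (1),
// mirroring a compound key of { cost: -1, name: 1 }.
function byCostDescNameAsc(a, b) {
  if (a.cost !== b.cost) return b.cost - a.cost;
  if (a.name < b.name) return -1;
  if (a.name > b.name) return 1;
  return 0;
}

const sortedProducts = [...products].sort(byCostDescNameAsc);
console.log(sortedProducts.map((p) => `${p.cost} ${p.name}`));
// → [ '20 book', '20 lamp', '5 apple', '5 pen' ]
```

The second field only matters when two documents have the same cost, which is why the order of fields in a multi-field index is important.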

OTHER TYPES OF INDEXES: Geospatial, Hashed, Text, Multikey

this is the structure of MongoDB: the data is sharded among servers (the shards), and clients reach it through the router (mongos). The config server holds metadata about the location of the data.

if you want to request, let's say, one element that exists in shard B:

the system will go to shard B directly to fetch the element a:123

if you want to read from multiple shards:

each shard will return some data, and then the data will be returned to client

you have some control over where to read data from; for example, you can choose to always read from the primary, from a secondary, or from the nearest member

as you can see, in the middle we have the "MongoDB Shard Routing Service", called mongos, which talks to the config server to get information about the location of the shards

when we write, space has to be allocated; the allocation is based on what we call "chunks". Chunks are separated based on key space, for example chunk1 holds keys from 1 to 10, chunk2 holds keys from 11 to 20 ...
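The key-range idea can be sketched as a small lookup. The chunk table and the `findChunk` helper below are hypothetical, and the inclusive bounds simply follow the 1 to 10 / 11 to 20 example from the notes.

```javascript
// Hypothetical chunk table; the first two ranges follow the example in the notes.
const chunks = [
  { name: "chunk1", min: 1, max: 10 },
  { name: "chunk2", min: 11, max: 20 },
  { name: "chunk3", min: 21, max: 30 },
];

// Return the name of the chunk whose key range contains the key,
// or null when no chunk covers it.
function findChunk(key) {
  const chunk = chunks.find((c) => key >= c.min && key <= c.max);
  return chunk ? chunk.name : null;
}

console.log(findChunk(7));  // → chunk1
console.log(findChunk(15)); // → chunk2
console.log(findChunk(99)); // → null
```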

all write operations are accepted by the primary. The primary maintains an operation log ("oplog") which contains all the write operations. The secondaries take a copy of this log in order to replicate the primary.
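To make the log-based replication concrete, here is a toy model in plain JavaScript. The `Primary` and `Secondary` classes are invented for illustration and leave out everything a real replica set does (networking, elections, idempotent oplog entries).

```javascript
// A toy primary: every write is applied locally and appended to the oplog.
class Primary {
  constructor() {
    this.data = new Map();
    this.oplog = [];
  }
  write(key, value) {
    this.data.set(key, value);
    this.oplog.push({ op: "set", key, value });
  }
}

// A toy secondary: remembers how far it has read in the primary's oplog
// and replays only the entries it has not applied yet.
class Secondary {
  constructor() {
    this.data = new Map();
    this.applied = 0;
  }
  syncFrom(primary) {
    for (const entry of primary.oplog.slice(this.applied)) {
      this.data.set(entry.key, entry.value);
      this.applied += 1;
    }
  }
}

const primary = new Primary();
const secondary = new Secondary();
primary.write("a", 123);
primary.write("b", 456);
secondary.syncFrom(primary);
primary.write("a", 789);
console.log(secondary.data.get("a")); // → 123 (not synced yet)
secondary.syncFrom(primary);
console.log(secondary.data.get("a")); // → 789
```

The gap between the two reads is the replication lag: a read from a secondary can return an older value until the secondary catches up.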

when you write, we have something called write concern; there are 5 write concern levels

1- Errors Ignored
2- Unacknowledged: no acknowledgment is sent to the app (very fast)
3- Acknowledged: the write will be acknowledged to the app (the default)
4- Journaled: this is similar to a transaction; MongoDB will acknowledge only after writing the information to the JOURNAL LOG (not the operation log). The journal log survives a hard shutdown (durability).
5- Replica Acknowledged: the strongest; the write must also be acknowledged by members of the replica set

MongoDB Aggregation
1- Pipeline aggregation: here you write a sequence of stages where the output of one is the input to the next.
2- Map/Reduce: here you write a map function, a reduce function and an optional finalize function
3- Single-purpose aggregation: a query that does one thing, like count

Create, Read, Update and Delete Operations

1- when you want to search and find something

2- you have the following operators for comparison: $gt, $gte, $in, $lt, $lte, $ne, $nin

3- you have the following logical operators, $and, $not, $or, $nor

4- you have the following to check an element: $exists, $type. 
which means check if the element b exists.

5- for evaluation you have $mod, $regex, $where

6- when you want to search documents which have an array field, you can use the array operators $elemMatch and $size
db.customers.find({result:{ $elemMatch: { $gte: 80, $lt: 85 } } })
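To see what the $elemMatch query above selects, here is a small in-memory imitation in plain JavaScript. The sample customers and the helpers `matchesCondition` and `elemMatch` are hypothetical and handle only a few comparison operators; this is not MongoDB's query engine.

```javascript
// Hypothetical customers; `result` is an array field as in the query above.
const customers = [
  { name: "A", result: [70, 82, 90] },
  { name: "B", result: [60, 95] },
  { name: "C", result: [84] },
];

// Check a single value against a condition document such as { $gte: 80, $lt: 85 }.
// Only a few comparison operators are handled in this sketch.
function matchesCondition(value, condition) {
  return Object.entries(condition).every(([op, operand]) => {
    switch (op) {
      case "$gt": return value > operand;
      case "$gte": return value >= operand;
      case "$lt": return value < operand;
      case "$lte": return value <= operand;
      case "$ne": return value !== operand;
      default: throw new Error(`unsupported operator ${op}`);
    }
  });
}

// $elemMatch: at least one array element must satisfy ALL conditions at once.
function elemMatch(array, condition) {
  return Array.isArray(array) && array.some((v) => matchesCondition(v, condition));
}

const matched = customers.filter((c) => elemMatch(c.result, { $gte: 80, $lt: 85 }));
console.log(matched.map((c) => c.name)); // → [ 'A', 'C' ]
```

Customer B is left out because no single element is both >= 80 and < 85. Without $elemMatch, a condition like { result: { $gte: 80, $lt: 85 } } can be satisfied by different elements of the array (95 is >= 80 and 60 is < 85), so B would match.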

in MongoDB we have the term projection, which means the fields that you want to include in the results
for example
which means show id not name.

we have some projection operators 

1- $: returns the first matching array element

2- $elemMatch
db.customers.find({query}, {arrayField: {$elemMatch: {field: value1, field2: {$gt: value}}}})

3- $slice

4- other operators (query modifiers) like $min, $max, $orderby, $explain

Optimize Database
1- use indexes, to create an index 
1 = ascending, -1= descending 

2- use limit() to limit the returned results.

3- use projection to limit the number of returned fields

4- use explain() to check the query plan and analyze the performance

5- use hint() to force MongoDB to use an index.


Some Examples:

1- find()

as you can see
1- selectDB() to get the DB from the mongo instance
2- selectCollection() to get the collection
3- find() to run the query

2- Using projection

as you can see we define projection as an array, we also use it in find()

3- using limit()

as you can see we use limit()

4- find only one document

you see that we are using findOne()
you can use limit(1)

5- Sort()

we use sort()
as you can see, the sort is by surname; 1 means ascending and -1 means descending

you can sort by multiple fields
$sort = array('surname' => 1 , 'xxx' => -1);

6- grouping using aggregate()

as you can see, you group by country, then you match (like HAVING in MySQL, which means filtering the groups), then you sort.

and then we use aggregate with the three arrays $group, $match, $sort
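The group, match, sort sequence can be mimicked with plain JavaScript functions, where each stage takes the output of the previous one. The sample documents, the `total` field and the "at least two" threshold are assumptions for illustration; the real work would be done by aggregate() on the server.

```javascript
// Hypothetical customer documents; only `country` is taken from the notes.
const people = [
  { name: "Ann", country: "UK" },
  { name: "Bob", country: "US" },
  { name: "Cid", country: "UK" },
  { name: "Dee", country: "FR" },
  { name: "Eve", country: "US" },
  { name: "Fay", country: "US" },
];

// Stage 1 ($group): one output document per country with a count.
function groupByCountry(docs) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc.country, (counts.get(doc.country) || 0) + 1);
  }
  return [...counts].map(([country, total]) => ({ _id: country, total }));
}

// Stage 2 ($match): filter the groups, like HAVING in SQL.
const atLeastTwo = (groups) => groups.filter((g) => g.total >= 2);

// Stage 3 ($sort): order the remaining groups by total, descending.
const byTotalDesc = (groups) => [...groups].sort((a, b) => b.total - a.total);

// Each stage consumes the output of the previous one.
const pipelineResult = [groupByCountry, atLeastTwo, byTotalDesc]
  .reduce((docs, stage) => stage(docs), people);
console.log(pipelineResult);
// → [ { _id: 'US', total: 3 }, { _id: 'UK', total: 2 } ]
```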

Adding Information: Database, Collection, Document
basically you can create multiple databases in a single MongoDB instance; in a database you create collections, and in a collection you add documents

we have save() and insert() for update/create. With insert() you can also specify the write concern 

you also have batchInsert() to insert multiple documents 

as you can see before we were using selectDB() and selectCollection() 

The _id field
each document in MongoDB has an _id field. By default its value is generated for you; the value is a mix of a timestamp, a machine ID and other values
you can override this value, so when you insert a new document you can give _id a value (e.g. 1 2 3 ...); however, you might create a duplicate, which leads to an exception.

don't use this field for your own IDs; create your own ID field


as you can see you have a query that returns some documents, and you can update using $update array, here we are decreasing the balance by 50
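The original screenshot is not available, so the following is a guess at the shape of that update: a negative $inc, since MongoDB has no separate "decrement" operator. The `accounts` data and the `applyInc` helper are hypothetical and only imitate the effect in memory.

```javascript
// Hypothetical account documents; the `balance` field comes from the notes.
const accounts = [
  { _id: 1, owner: "Ann", balance: 200 },
  { _id: 2, owner: "Bob", balance: 80 },
];

// An update document in the style of { $inc: { balance: -50 } }.
const update = { $inc: { balance: -50 } };

// Apply $inc to every document accepted by the `query` predicate.
// Returns the number of modified documents.
function applyInc(docs, query, updateDoc) {
  let modified = 0;
  for (const doc of docs) {
    if (!query(doc)) continue;
    for (const [field, amount] of Object.entries(updateDoc.$inc)) {
      doc[field] = (doc[field] || 0) + amount;
    }
    modified += 1;
  }
  return modified;
}

const modifiedCount = applyInc(accounts, (d) => d.owner === "Ann", update);
console.log(modifiedCount, accounts[0].balance, accounts[1].balance); // → 1 150 80
```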

you can also do an upsert (an option on the update), which means if the document doesn't exist then it is inserted

you can remove documents using remove(); there is an option, justOne, for the case where multiple documents match your criteria and you want to remove only one

Data Modeling 

One-To-One relations
for a document with a large number of fields, split the large document into 2 documents with a one-to-one relation

you will handle the relation yourself, which means you will define 2 collections, and you should manage the ids between these collections. we call this the manual approach

you can also define one collection and embed one document inside another,

there is another way, which we call DBRef: you define 2 collections but you link them through a DBRef, which is made of 3 values: $ref, $id and $db.
so what you are doing is pointing to a document in a different collection in a standard way.

One-To-Many Relations
you can model this using 3 collections

you can also have only 2 collections, Customers and Products, we embed Purchase inside customer

You can also use DBRef, which means you define only Customers & Products, and inside Customers you put an array of purchased products; each entry of this array is a DBRef.

Tree Structure
when we talk about tree, we are talking about something similar to

you can represent this in MongoDB in 5 ways

MongoDB Database Management

MongoDB authentication is off by default; you should enable it in mongodb.conf

we have 5 privileges: read, readWrite, dbAdmin, userAdmin and clusterAdmin (build Cluster)

- to add an overall admin user: when you start MongoDB and there are no users yet, you can access it from the shell without authentication.
go to the DB you want
use admin
then create the user

you can also add user

to login you use

to get a list of users in a db

MongoDB Replication
- we have the primary, which accepts reads & writes
- we have secondaries, which store a backup copy of the data from the primary
- if the primary is down, there will be a vote to elect a new primary from the secondaries.
- we use an Arbiter, which is a MongoDB instance that doesn't hold data; it only votes during the election of a new primary, and is added when we have an even number of data-bearing members.
- when you do a read, you can specify the read mode
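The reason an arbiter helps with elections comes down to counting votes. The sketch below is a simplification that assumes every member has exactly one vote and ignores priorities and other election rules; `majority` and `evenSplitHasWinner` are hypothetical helpers.

```javascript
// Votes needed for a strict majority of `voters` voting members.
function majority(voters) {
  return Math.floor(voters / 2) + 1;
}

// If the voters split into two equal halves, can either half reach a majority?
// Returns null for an odd number of voters, which cannot split evenly.
function evenSplitHasWinner(voters) {
  if (voters % 2 !== 0) return null;
  return voters / 2 >= majority(voters);
}

console.log(majority(2)); // → 2 (two data-bearing members, no arbiter)
console.log(majority(3)); // → 2 (the same two members plus an arbiter)
console.log(evenSplitHasWinner(4)); // → false
```

With two members, losing one leaves a single vote, which is below the majority of 2, so no primary can be elected. Adding an arbiter keeps the majority at 2 while allowing one member to be down.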

if you have a collection with a lot of documents, a single mongo instance cannot handle all of them; that's why we shard the collection across multiple instances.

or, when you are in a production environment, you shard and put each shard in a replica set

the sharding cluster contains the following

as you can see we have a config server which contains all sharding information, and you communicate with multiple Routing Service (mongos) to bring the data.

data is written in Chunks

based on a shard key the chunk will be determined; as you can see above, we are dividing based on a sequential key

another way is through a hash function
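Here is a rough sketch of the hash-based approach. The `toyHash` function is a simple string hash picked for illustration and is NOT the hash MongoDB uses for hashed shard keys; the number of chunks is also made up.

```javascript
// A simple string hash (djb2-style), used only for illustration;
// it is NOT the hash function MongoDB uses for hashed shard keys.
function toyHash(text) {
  let hash = 5381;
  for (const ch of String(text)) {
    hash = (hash * 33 + ch.codePointAt(0)) >>> 0; // keep an unsigned 32-bit integer
  }
  return hash;
}

// Map a shard key value to one of `chunkCount` chunks via its hash.
function chunkForKey(key, chunkCount) {
  return toyHash(key) % chunkCount;
}

// Sequential keys: count how many land in each of 4 hypothetical chunks.
const perChunk = [0, 0, 0, 0];
for (let id = 1; id <= 1000; id += 1) {
  perChunk[chunkForKey(id, 4)] += 1;
}
console.log(perChunk);
```

The point of hashing is that neighbouring keys (1, 2, 3 ...) no longer land next to each other, so sequential inserts are spread over the chunks instead of all hitting the last one.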

the shardkey is chosen from some fields in documents.

based on the shard key MongoDB will determine the chunk where the document will be saved.

your job is to provide a shard key that spreads the data evenly across chunks (i.e. we don't want some chunks to hold more data than others)

that is why your key should be random and highly distributed.

when you choose a shardkey you should consider 3 things

Cardinality: this refers to the number of distinct values of the key. For example, let's say we have an AddressBook document; if you want to use STATE as a shard key, we only have 50 states in the US. This field has low cardinality: all documents with the same state will be stored in the same chunk EVEN IF THE CHUNK SIZE EXCEEDS THE MAX CHUNK SIZE.
so we will have at most 50 chunks, and some chunks will have more documents than others.

on the other hand, if you use ZipCode, this field has many values --> high cardinality
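A quick way to compare candidate shard keys is to count their distinct values and the size of the biggest group sharing one value. The sample address book and both helper functions below are hypothetical.

```javascript
// Hypothetical address book entries.
const addressBook = [
  { name: "Ann", state: "NY", zip: "10001" },
  { name: "Bob", state: "NY", zip: "10002" },
  { name: "Cid", state: "NY", zip: "14201" },
  { name: "Dee", state: "CA", zip: "90001" },
  { name: "Eve", state: "CA", zip: "94105" },
];

// Cardinality of a field = number of distinct values it takes.
function cardinality(docs, field) {
  return new Set(docs.map((d) => d[field])).size;
}

// Size of the largest group of documents sharing one value of the field.
// Documents with the same shard key value cannot be split across chunks.
function largestGroup(docs, field) {
  const counts = new Map();
  for (const d of docs) counts.set(d[field], (counts.get(d[field]) || 0) + 1);
  return Math.max(...counts.values());
}

console.log(cardinality(addressBook, "state"), largestGroup(addressBook, "state")); // → 2 3
console.log(cardinality(addressBook, "zip"), largestGroup(addressBook, "zip"));     // → 5 1
```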

Write Scaling: your key should be random. Let's for example think about a key which is the Year-Month-Day of document creation. This key has high cardinality, as we can generate a lot of values; however, these keys are not highly random: documents that are created on the same day will be stored in the same chunk.
on the other hand, a key with Minutes-Seconds will have high cardinality and be highly random.

Query Isolation: if you want to get results fast, the query should go to one shard. With cardinality and write scaling we were talking about randomness, but
if you want to get documents fast you should use a key that directs the query to only a few shards.

Index and Performance
to increase performance
1- create indexes
2- reduce document size: shorten field names, for example, and use GridFS for large files, i.e. files larger than the 16MB document size limit

Backup Mongo
1- you can simply copy the MongoDB folder as a backup (copy the whole folder, not just a few files)
2- you can use mongodump and mongorestore.
3- you can use mongoexport and mongoimport

Monitoring Mongo
1- use db.stats()
2- db.serverStatus()
3- mongostat
4- mongotop
5- use rockmongo tool 
6- use mongobird tool


MongoDB University course

1- it is a document DB
2- datamodel is json
3- no relation, scale out
4- atomic read & write on single document
5- mongo, the shell used to connect to MongoDB, uses TCP to connect to the database.
6- the driver that you use to connect to MongoDB (like the Java driver) also uses TCP, and the data transferred over the connection is in BSON format
7- JSON has string, number, boolean, array, object and null data types
8- BSON extends the JSON data types and adds some extra information to make scanning a document faster. The scan is linear; however, extra information such as field length lets the scanner jump ahead while scanning.
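To show how stored lengths let a scanner jump ahead, here is a toy encoding in plain JavaScript. It is NOT the real BSON layout; the `encode` and `findField` functions and the byte format are invented only to demonstrate the idea.

```javascript
// Toy encoding, NOT real BSON: each field is stored as
// [name length, ...name char codes, value length, ...value char codes].
function encode(fields) {
  const bytes = [];
  for (const [name, value] of Object.entries(fields)) {
    bytes.push(name.length, ...[...name].map((c) => c.charCodeAt(0)));
    bytes.push(value.length, ...[...value].map((c) => c.charCodeAt(0)));
  }
  return bytes;
}

// Find a field by name. When the name does not match, the stored value length
// lets the scanner jump over the value without reading it byte by byte.
function findField(bytes, wanted) {
  let pos = 0;
  let valueBytesSkipped = 0;
  while (pos < bytes.length) {
    const nameLen = bytes[pos];
    const name = String.fromCharCode(...bytes.slice(pos + 1, pos + 1 + nameLen));
    pos += 1 + nameLen;
    const valueLen = bytes[pos];
    if (name === wanted) {
      const value = String.fromCharCode(...bytes.slice(pos + 1, pos + 1 + valueLen));
      return { value, valueBytesSkipped };
    }
    valueBytesSkipped += valueLen;
    pos += 1 + valueLen; // jump straight to the next field
  }
  return { value: undefined, valueBytesSkipped };
}

const encoded = encode({ bio: "a long text value", city: "Oslo" });
console.log(findField(encoded, "city"));
```

The scan is still linear in the number of fields, but the long `bio` value is never inspected when looking for `city`.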
