Mapr-DB

Object: MapR-DB supports two types of tables - binary tables and json tables. Binary tables are "drop-in" replacement of Apache HBase tables. Json tables are MapR-DB specific. If you are familar with Mongo DB, you find a lot of similarity. Json supports multiple column families just like binary tables. In supports secondary index. In this notebook, we learn how to work with JSON tables in MapR-DB. You can access Json tables using MaprDB driver or the REST api.


Basic MapR-DB commands


In this demo we will create a couple of tables - movies and ratings in MapR-DB, load data into them, query these tables etc. Create a directory to hold movie lens tables.

$ hadoop fs -mkdir /tables/movielens

Launch db-shell

[mapr@maprdemo ~]$ mapr dbshell

====================================================

* MapR-DB Shell *

* NOTE: This is a shell for JSON table operations. *

====================================================

Version: 6.0.1-mapr


MapR-DB Shell


Create two tables - movies and ratings. Note that we are creating these tables under /

maprdb mapr:> create /tables/movielens/movies

maprdb mapr:> create /tables/movielens/ratings


Insert a new document to movies table

maprdb mapr:> insert /tables/movielens/movies --value '{"_id":"movie0000002", "title":"Developers on the Edge", "studio":"Command Line Studios"}'


Another form of insert a new record

maprdb mapr:> insert /tables/movielens/movies --id movie0000003 --value '{"title":"The Golden Master", "studio":"All-Nighter"}'


Find records

maprdb mapr:> find /tables/movielens/movies --fields _id,studio,title

{"_id":"movie0000002","studio":"Command Line Studios","title":"Developers on the Edge"}

{"_id":"movie0000003","studio":"All-Nighter","title":"The Golden Master"}


Update a record. Set a new document field

maprdb mapr:> update /tables/movielens/movies --id movie0000003 --m '{"$set":{"genres":"[romance|crime]"}}'

View the updated record

maprdb mapr:> find /tables/movielens/movies --id movie0000003

{"_id":"movie0000003","genres":"[romance|crime]","studio":"All-Nighter","title":"The Golden Master"}

1 document(s) found.


Delete a record by id

maprdb mapr:> delete /tables/movielens/movies --id movie0000003

maprdb mapr:> find /tables/movielens/movies --fields _id,studio

{"_id":"movie0000002","studio":"Command Line Studios"}


Create tables in MapR-DB for bulk loading

Launch db-shell

[mapr@maprdemo ~]$ mapr dbshell

====================================================

* MapR-DB Shell *

* NOTE: This is a shell for JSON table operations. *

====================================================

Version: 6.0.1-mapr


MapR-DB Shell


Create the following json tables - movies and ratings. Skip is the tables are already there.

maprdb mapr:> create /tables/movielens/movies

Table /tables/movies created.

maprdb mapr:> create /tables/ratings

Table /tables/ratings created.


View existing tables. If you want to isolate your tables, you may create a separate volume for that purpose.

maprdb mapr:> list /tables

/tables/movies

/tables/ratings

2 table(s) found.


Load data

View the movies.json file (file is attached below)

Movies.json: https://raw.githubusercontent.com/abulbasar/data/master/movielens/movies.json

Ratings.json: https://raw.githubusercontent.com/abulbasar/data/master/movielens/ratings.json


[mapr@maprdemo mapr]$ head movies.json

{"movieId":"1","title":"Toy Story (1995)","genres":["Adventure|Animation|Children|Comedy|Fantasy"]}

{"movieId":"2","title":"Jumanji (1995)","genres":["Adventure|Children|Fantasy"]}

{"movieId":"3","title":"Grumpier Old Men (1995)","genres":["Comedy|Romance"]}

{"movieId":"4","title":"Waiting to Exhale (1995)","genres":["Comedy|Drama|Romance"]}

{"movieId":"5","title":"Father of the Bride Part II (1995)","genres":["Comedy"]}

{"movieId":"6","title":"Heat (1995)","genres":["Action|Crime|Thriller"]}

{"movieId":"7","title":"Sabrina (1995)","genres":["Comedy|Romance"]}

{"movieId":"8","title":"Tom and Huck (1995)","genres":["Adventure|Children"]}

{"movieId":"9","title":"Sudden Death (1995)","genres":["Action"]}

{"movieId":"10","title":"GoldenEye (1995)","genres":["Action|Adventure|Thriller"]}

Note: movieId has been transformed into a string data type.


View the ratings.json. Note: the _id is a new composite column that contains concontenated values of movieId and userId .

[mapr@maprdemo mapr]$ head ratings.json

{"userId":1,"movieId":16,"rating":4.0,"timestamp":1217897793,"_id":"16-1"}

{"userId":1,"movieId":24,"rating":1.5,"timestamp":1217895807,"_id":"24-1"}

{"userId":1,"movieId":32,"rating":4.0,"timestamp":1217896246,"_id":"32-1"}

{"userId":1,"movieId":47,"rating":4.0,"timestamp":1217896556,"_id":"47-1"}

{"userId":1,"movieId":50,"rating":4.0,"timestamp":1217896523,"_id":"50-1"}

{"userId":1,"movieId":110,"rating":4.0,"timestamp":1217896150,"_id":"110-1"}

{"userId":1,"movieId":150,"rating":3.0,"timestamp":1217895940,"_id":"150-1"}

{"userId":1,"movieId":161,"rating":4.0,"timestamp":1217897864,"_id":"161-1"}

{"userId":1,"movieId":165,"rating":3.0,"timestamp":1217897135,"_id":"165-1"}

{"userId":1,"movieId":204,"rating":0.5,"timestamp":1217895786,"_id":"204-1"}


Upload movies.json and ratings.json to /user/mapr/movies dierctory and /user/mapr/ratings directory.


Import movies dataset

$ mapr importJSON -idfield "movieId" -src /user/mapr/movies/movies.json -dst /tables/movielens/movies -mapreduce false


Import ratings dataset

$ mapr importJSON -idfield "_id" -src /user/mapr/ratings/ratings.json -dst /tables/movielens/ratings -mapreduce false


Query data using DB Shell

Launch MapR-DB Shell to query the data.

[mapr@maprdemo ~]$ mapr dbshell


Set mapr DB shell, to show json values in non-pretty format.

maprdb mapr:> jsonoptions --pretty true

{"pretty":true, "withTags":true}


Find 2 values from movies table.

maprdb mapr:> find /tables/movies --limit 2

{

"_id" : "1",

"genres" : [ "Adventure|Animation|Children|Comedy|Fantasy" ],

"movieId" : "1",

"title" : "Toy Story (1995)"

}

{

"_id" : "10",

"genres" : [ "Action|Adventure|Thriller" ],

"movieId" : "10",

"title" : "GoldenEye (1995)"

}

2 document(s) found.


Find the movie by Id = 1

maprdb mapr:> findbyid /tables/movies --id 1

{

"_id" : "1",

"genres" : [ "Adventure|Animation|Children|Comedy|Fantasy" ],

"movieId" : "1",

"title" : "Toy Story (1995)"

}

1 document(s) found.


Find all movies that for which movieId begins with 50. Show 2 values.

maprdb mapr:> find /tables/ratings --fromid 50 --limit 2

{

"_id" : "50-1",

"movieId" : 50,

"rating" : 4,

"timestamp" : 1217896523,

"userId" : 1

}

{

"_id" : "50-101",

"movieId" : 50,

"rating" : 5,

"timestamp" : 1278886003,

"userId" : 101

}

2 document(s) found.


Find movies which have got more than 4.5 rating.

maprdb mapr:> find /tables/ratings --where {"$gt":{"rating":4.5}} --offset 4 --fields userId,movieId,rating,timestamp --orderby timestamp --limit 2

{

"movieId" : 292,

"rating" : 5,

"timestamp" : 828564955,

"userId" : 312

}

{

"movieId" : 168,

"rating" : 5,

"timestamp" : 828564955,

"userId" : 312

}

2 document(s) found.


Query MapR-DB tables using Drill

Launch drill. Here I am launching the embedded mode.

[mapr@maprdemo drill-1.13.0]$ pwd

/opt/mapr/drill/drill-1.13.0

[mapr@maprdemo drill-1.13.0]$ bin/drill-embedded

Aug 09, 2018 12:31:45 PM org.glassfish.jersey.server.ApplicationHandler initialize

INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...

apache drill 1.13.0-mapr

"a drill is a terrible thing to waste"

0: jdbc:drill:zk=local>


Show databases in Drill.

0: jdbc:drill:zk=local> show databases;

+---------------------+

| SCHEMA_NAME |

+---------------------+

| cp.default |

| dfs.default |

| dfs.root |

| dfs.tmp |

| hbase |

| hive.default |

| INFORMATION_SCHEMA |

| sys |

+---------------------+

8 rows selected (2.117 seconds)


Query first 2 rows from the maprDB table.

0: jdbc:drill:zk=local> select * from dfs.`/tables/movies/` limit 2;

+------+--------------------------------------------------+----------+-------------------+

| _id | genres | movieId | title |

+------+--------------------------------------------------+----------+-------------------+

| 1 | ["Adventure|Animation|Children|Comedy|Fantasy"] | 1 | Toy Story (1995) |

| 10 | ["Action|Adventure|Thriller"] | 10 | GoldenEye (1995) |

+------+--------------------------------------------------+----------+-------------------+

2 rows selected (0.781 seconds)

0: jdbc:drill:zk=local> select * from dfs.`/tables/ratings` limit 2;

+--------+----------+---------+----------------+---------+

| _id | movieId | rating | timestamp | userId |

+--------+----------+---------+----------------+---------+

| 1-100 | 1.0 | 3.0 | 1.117381586E9 | 100.0 |

| 1-103 | 1.0 | 3.0 | 9.78554606E8 | 103.0 |

+--------+----------+---------+----------------+---------+

2 rows selected (0.297 seconds)


Find top 10 movies with highest avg rating.

0: jdbc:drill:zk=local> select t1.movieId, t1.title, avg(t2.rating) avg_rating from dfs.`/tables/movies` t1 join dfs.`/tables/ratings` t2 on cast(t1.movieId as int) = t2.movieId group by t1.movieId, t1.title having count(1)>100 order by avg_rating desc limit 10;

+----------+-----------------------------------------+---------------------+

| movieId | title | avg_rating |

+----------+-----------------------------------------+---------------------+

| 318 | Shawshank Redemption, The (1994) | 4.454545454545454 |

| 858 | Godfather, The (1972) | 4.392857142857143 |

| 50 | Usual Suspects, The (1995) | 4.328947368421052 |

| 1136 | Monty Python and the Holy Grail (1975) | 4.3019480519480515 |

| 527 | Schindler's List (1993) | 4.296370967741935 |

| 1193 | One Flew Over the Cuckoo's Nest (1975) | 4.2727272727272725 |

| 608 | Fargo (1996) | 4.2711442786069655 |

| 2571 | Matrix, The (1999) | 4.264367816091954 |

| 1221 | Godfather: Part II, The (1974) | 4.260714285714286 |

| 1213 | Goodfellas (1990) | 4.2592592592592595 |

+----------+-----------------------------------------+---------------------+

10 rows selected (5.039 seconds)


Save the SQL query output as parquet file format.

Save the output of query as a table in tmp directory.

0: jdbc:drill:zk=local> create table dfs.tmp.top10movies as select t1.movieId, t1.title, avg(t2.rating) avg_rating from dfs.`/tables/movies` t1 join dfs.`/tables/ratings` t2 on cast(t1.movieId as int) = t2.movieId group by t1.movieId, t1.title having count(1)>100 order by avg_rating desc limit 10;

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

+-----------+----------------------------+

| Fragment | Number of records written |

+-----------+----------------------------+

| 0_0 | 10 |

+-----------+----------------------------+

1 row selected (3.348 seconds)

0: jdbc:drill:zk=local>


You can set the storage pluggin configuration to set the default file format as parquet.





Location of the files

https://github.com/abulbasar/data/upload/master/movielens


Drillbit configuration http://localhost:8047