Mapr-DB

Object: MapR-DB supports two types of tables - binary tables and json tables. Binary tables are "drop-in" replacement of Apache HBase tables. Json tables are MapR-DB specific. If you are familar with Mongo DB, you find a lot of similarity. Json supports multiple column families just like binary tables. In supports secondary index. In this notebook, we learn how to work with JSON tables in MapR-DB. You can access Json tables using MaprDB driver or the REST api.


Basic MapR-DB commands


In this demo we will create a couple of tables - movies and ratings in MapR-DB, load data into them, query these tables etc. Create a directory to hold movie lens tables.

$ hadoop fs -mkdir /tables/movielens

Launch db-shell

[mapr@maprdemo ~]$ mapr dbshell
====================================================
*                  MapR-DB Shell                   *
* NOTE: This is a shell for JSON table operations. *
====================================================
Version: 6.0.1-mapr

MapR-DB Shell


Create two tables - movies and ratings. Note that we are creating these tables under /

maprdb mapr:> create /tables/movielens/movies
maprdb mapr:> create /tables/movielens/ratings


Insert a new document to movies table

maprdb mapr:> insert /tables/movielens/movies --value '{"_id":"movie0000002", "title":"Developers on the Edge", "studio":"Command Line Studios"}'

Another form of insert a new record

maprdb mapr:> insert /tables/movies --id movie0000003 --value '{"title":"The Golden Master", "studio":"All-Nighter"}'


Update a record. Set a new document field

maprdb mapr:> update /tables/movielens/movies --id movie0000003 --m
        '{
    "$set":{"genres":"[romance|crime]"}
        }'


Delete a record by id

maprdb mapr:> delete /tables/movielens/movies --id movie0000003


Create tables in MapR-DB for bulk loading

Launch db-shell

[mapr@maprdemo ~]$ mapr dbshell
====================================================
*                  MapR-DB Shell                   *
* NOTE: This is a shell for JSON table operations. *
====================================================
Version: 6.0.1-mapr

MapR-DB Shell


Create the following json tables - movies and ratings

maprdb mapr:> create /tables/movies
Table /tables/movies created.
maprdb mapr:> create /tables/ratings
Table /tables/ratings created.


View existing tables. If you want to isolate your tables, you may create a separate volume for that purpose.

maprdb mapr:> list /tables
/tables/movies
/tables/ratings
2 table(s) found.


Load data

View the movies.json file (file is attached below)

[mapr@maprdemo mapr]$ head movies.json
{"movieId":"1","title":"Toy Story (1995)","genres":["Adventure|Animation|Children|Comedy|Fantasy"]}
{"movieId":"2","title":"Jumanji (1995)","genres":["Adventure|Children|Fantasy"]}
{"movieId":"3","title":"Grumpier Old Men (1995)","genres":["Comedy|Romance"]}
{"movieId":"4","title":"Waiting to Exhale (1995)","genres":["Comedy|Drama|Romance"]}
{"movieId":"5","title":"Father of the Bride Part II (1995)","genres":["Comedy"]}
{"movieId":"6","title":"Heat (1995)","genres":["Action|Crime|Thriller"]}
{"movieId":"7","title":"Sabrina (1995)","genres":["Comedy|Romance"]}
{"movieId":"8","title":"Tom and Huck (1995)","genres":["Adventure|Children"]}
{"movieId":"9","title":"Sudden Death (1995)","genres":["Action"]}
{"movieId":"10","title":"GoldenEye (1995)","genres":["Action|Adventure|Thriller"]}

Note: movieId has been transformed into a string data type.


View the ratings.json

[mapr@maprdemo mapr]$ head ratings.json
{"userId":1,"movieId":16,"rating":4.0,"timestamp":1217897793,"_id":"16-1"}
{"userId":1,"movieId":24,"rating":1.5,"timestamp":1217895807,"_id":"24-1"}
{"userId":1,"movieId":32,"rating":4.0,"timestamp":1217896246,"_id":"32-1"}
{"userId":1,"movieId":47,"rating":4.0,"timestamp":1217896556,"_id":"47-1"}
{"userId":1,"movieId":50,"rating":4.0,"timestamp":1217896523,"_id":"50-1"}
{"userId":1,"movieId":110,"rating":4.0,"timestamp":1217896150,"_id":"110-1"}
{"userId":1,"movieId":150,"rating":3.0,"timestamp":1217895940,"_id":"150-1"}
{"userId":1,"movieId":161,"rating":4.0,"timestamp":1217897864,"_id":"161-1"}
{"userId":1,"movieId":165,"rating":3.0,"timestamp":1217897135,"_id":"165-1"}
{"userId":1,"movieId":204,"rating":0.5,"timestamp":1217895786,"_id":"204-1"}


Note: the _id is a new composite column that contains concontenated values of movieId and userId .


Import movies dataset

$ mapr importJSON -idfield "_id" -src /user/mapr/ratings.json -dst /tables/ratings -mapreduce false


Import ratings dataset

$ mapr importJSON -idfield "_id" -src /user/mapr/ratings.json -dst /tables/ratings -mapreduce false


Query data using DB Shell

Launch MapR-DB Shell to query the data.

[mapr@maprdemo ~]$ mapr dbshell


Set mapr DB shell, to show json values in non-pretty format.

maprdb mapr:> jsonoptions --pretty true
{"pretty":true, "withTags":true}


Find 2 values from movies table.

maprdb mapr:> find /tables/movies --limit 2
{
  "_id" : "1",
  "genres" : [ "Adventure|Animation|Children|Comedy|Fantasy" ],
  "movieId" : "1",
  "title" : "Toy Story (1995)"
}
{
  "_id" : "10",
  "genres" : [ "Action|Adventure|Thriller" ],
  "movieId" : "10",
  "title" : "GoldenEye (1995)"
}
2 document(s) found.


Find the movie by Id = 1

maprdb mapr:> findbyid /tables/movies --id 1
{
  "_id" : "1",
  "genres" : [ "Adventure|Animation|Children|Comedy|Fantasy" ],
  "movieId" : "1",
  "title" : "Toy Story (1995)"
}
1 document(s) found.


Find all movies that for which movieId begins with 50. Show 2 values.

maprdb mapr:> find /tables/ratings --fromid 50  --limit 2
{
  "_id" : "50-1",
  "movieId" : 50,
  "rating" : 4,
  "timestamp" : 1217896523,
  "userId" : 1
}
{
  "_id" : "50-101",
  "movieId" : 50,
  "rating" : 5,
  "timestamp" : 1278886003,
  "userId" : 101
}
2 document(s) found.



Find movies which have got more than 4.5 rating.

maprdb mapr:> find /tables/ratings --where {"$gt":{"rating":4.5}} --offset 4 --fields userId,movieId,rating,timestamp --orderby timestamp --limit 2
{
  "movieId" : 292,
  "rating" : 5,
  "timestamp" : 828564955,
  "userId" : 312
}
{
  "movieId" : 168,
  "rating" : 5,
  "timestamp" : 828564955,
  "userId" : 312
}
2 document(s) found.


Query MapR-DB tables using Drill

Launch drill. Here I am launching the embedded mode.

[mapr@maprdemo drill-1.13.0]$ pwd
/opt/mapr/drill/drill-1.13.0
[mapr@maprdemo drill-1.13.0]$ bin/drill-embedded
Aug 09, 2018 12:31:45 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.13.0-mapr
"a drill is a terrible thing to waste"
0: jdbc:drill:zk=local>


Show databases in Drill.

0: jdbc:drill:zk=local> show databases;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| hbase               |
| hive.default        |
| INFORMATION_SCHEMA  |
| sys                 |
+---------------------+
8 rows selected (2.117 seconds)


Query first 2 rows from the maprDB table.

0: jdbc:drill:zk=local> select * from dfs.`/tables/movies/` limit 2;
+------+--------------------------------------------------+----------+-------------------+
| _id  |                      genres                      | movieId  |       title       |
+------+--------------------------------------------------+----------+-------------------+
| 1    | ["Adventure|Animation|Children|Comedy|Fantasy"]  | 1        | Toy Story (1995)  |
| 10   | ["Action|Adventure|Thriller"]                    | 10       | GoldenEye (1995)  |
+------+--------------------------------------------------+----------+-------------------+
2 rows selected (0.781 seconds)
0: jdbc:drill:zk=local> select * from dfs.`/tables/ratings` limit 2;
+--------+----------+---------+----------------+---------+
|  _id   | movieId  | rating  |   timestamp    | userId  |
+--------+----------+---------+----------------+---------+
| 1-100  | 1.0      | 3.0     | 1.117381586E9  | 100.0   |
| 1-103  | 1.0      | 3.0     | 9.78554606E8   | 103.0   |
+--------+----------+---------+----------------+---------+
2 rows selected (0.297 seconds)


Find top 10 movies with highest avg rating.

0: jdbc:drill:zk=local> select t1.movieId, t1.title, avg(t2.rating) avg_rating from dfs.`/tables/movies` t1 join dfs.`/tables/ratings` t2 on cast(t1.movieId as int) = t2.movieId group by t1.movieId, t1.title having count(1)>100 order by avg_rating desc limit 10;
+----------+-----------------------------------------+---------------------+
| movieId  |                  title                  |     avg_rating      |
+----------+-----------------------------------------+---------------------+
| 318      | Shawshank Redemption, The (1994)        | 4.454545454545454   |
| 858      | Godfather, The (1972)                   | 4.392857142857143   |
| 50       | Usual Suspects, The (1995)              | 4.328947368421052   |
| 1136     | Monty Python and the Holy Grail (1975)  | 4.3019480519480515  |
| 527      | Schindler's List (1993)                 | 4.296370967741935   |
| 1193     | One Flew Over the Cuckoo's Nest (1975)  | 4.2727272727272725  |
| 608      | Fargo (1996)                            | 4.2711442786069655  |
| 2571     | Matrix, The (1999)                      | 4.264367816091954   |
| 1221     | Godfather: Part II, The (1974)          | 4.260714285714286   |
| 1213     | Goodfellas (1990)                       | 4.2592592592592595  |
+----------+-----------------------------------------+---------------------+
10 rows selected (5.039 seconds)


Save the SQL query output as parquet file format.

Save the output of query as a table in tmp directory.

0: jdbc:drill:zk=local> create table dfs.tmp.top10movies as select t1.movieId, t1.title, avg(t2.rating) avg_rating from dfs.`/tables/movies` t1 join dfs.`/tables/ratings` t2 on cast(t1.movieId as int) = t2.movieId group by t1.movieId, t1.title having count(1)>100 order by avg_rating desc limit 10;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 10                         |
+-----------+----------------------------+
1 row selected (3.348 seconds)
0: jdbc:drill:zk=local>


You can set the storage pluggin configuration to set the default file format as parquet.





Location of the files

https://github.com/abulbasar/data/upload/master/movielens