Mapr-DB
Object: MapR-DB supports two types of tables - binary tables and json tables. Binary tables are "drop-in" replacement of Apache HBase tables. Json tables are MapR-DB specific. If you are familar with Mongo DB, you find a lot of similarity. Json supports multiple column families just like binary tables. In supports secondary index. In this notebook, we learn how to work with JSON tables in MapR-DB. You can access Json tables using MaprDB driver or the REST api.
Basic MapR-DB commands
In this demo we will create a couple of tables - movies and ratings in MapR-DB, load data into them, query these tables etc. Create a directory to hold movie lens tables.
$ hadoop fs -mkdir /tables/movielens
Launch db-shell
[mapr@maprdemo ~]$ mapr dbshell
====================================================
* MapR-DB Shell *
* NOTE: This is a shell for JSON table operations. *
====================================================
Version: 6.0.1-mapr
MapR-DB Shell
Create two tables - movies and ratings. Note that we are creating these tables under /
maprdb mapr:> create /tables/movielens/movies
maprdb mapr:> create /tables/movielens/ratings
Insert a new document to movies table
maprdb mapr:> insert /tables/movielens/movies --value '{"_id":"movie0000002", "title":"Developers on the Edge", "studio":"Command Line Studios"}'
Another form of insert a new record
maprdb mapr:> insert /tables/movielens/movies --id movie0000003 --value '{"title":"The Golden Master", "studio":"All-Nighter"}'
Find records
maprdb mapr:> find /tables/movielens/movies --fields _id,studio,title
{"_id":"movie0000002","studio":"Command Line Studios","title":"Developers on the Edge"}
{"_id":"movie0000003","studio":"All-Nighter","title":"The Golden Master"}
Update a record. Set a new document field
maprdb mapr:> update /tables/movielens/movies --id movie0000003 --m '{"$set":{"genres":"[romance|crime]"}}'
View the updated record
maprdb mapr:> find /tables/movielens/movies --id movie0000003
{"_id":"movie0000003","genres":"[romance|crime]","studio":"All-Nighter","title":"The Golden Master"}
1 document(s) found.
Delete a record by id
maprdb mapr:> delete /tables/movielens/movies --id movie0000003
maprdb mapr:> find /tables/movielens/movies --fields _id,studio
{"_id":"movie0000002","studio":"Command Line Studios"}
Create tables in MapR-DB for bulk loading
Launch db-shell
[mapr@maprdemo ~]$ mapr dbshell
====================================================
* MapR-DB Shell *
* NOTE: This is a shell for JSON table operations. *
====================================================
Version: 6.0.1-mapr
MapR-DB Shell
Create the following json tables - movies and ratings. Skip is the tables are already there.
maprdb mapr:> create /tables/movielens/movies
Table /tables/movies created.
maprdb mapr:> create /tables/ratings
Table /tables/ratings created.
View existing tables. If you want to isolate your tables, you may create a separate volume for that purpose.
maprdb mapr:> list /tables
/tables/movies
/tables/ratings
2 table(s) found.
Load data
View the movies.json file (file is attached below)
Movies.json: https://raw.githubusercontent.com/abulbasar/data/master/movielens/movies.json
Ratings.json: https://raw.githubusercontent.com/abulbasar/data/master/movielens/ratings.json
[mapr@maprdemo mapr]$ head movies.json
{"movieId":"1","title":"Toy Story (1995)","genres":["Adventure|Animation|Children|Comedy|Fantasy"]}
{"movieId":"2","title":"Jumanji (1995)","genres":["Adventure|Children|Fantasy"]}
{"movieId":"3","title":"Grumpier Old Men (1995)","genres":["Comedy|Romance"]}
{"movieId":"4","title":"Waiting to Exhale (1995)","genres":["Comedy|Drama|Romance"]}
{"movieId":"5","title":"Father of the Bride Part II (1995)","genres":["Comedy"]}
{"movieId":"6","title":"Heat (1995)","genres":["Action|Crime|Thriller"]}
{"movieId":"7","title":"Sabrina (1995)","genres":["Comedy|Romance"]}
{"movieId":"8","title":"Tom and Huck (1995)","genres":["Adventure|Children"]}
{"movieId":"9","title":"Sudden Death (1995)","genres":["Action"]}
{"movieId":"10","title":"GoldenEye (1995)","genres":["Action|Adventure|Thriller"]}
Note: movieId has been transformed into a string data type.
View the ratings.json. Note: the _id is a new composite column that contains concontenated values of movieId and userId .
[mapr@maprdemo mapr]$ head ratings.json
{"userId":1,"movieId":16,"rating":4.0,"timestamp":1217897793,"_id":"16-1"}
{"userId":1,"movieId":24,"rating":1.5,"timestamp":1217895807,"_id":"24-1"}
{"userId":1,"movieId":32,"rating":4.0,"timestamp":1217896246,"_id":"32-1"}
{"userId":1,"movieId":47,"rating":4.0,"timestamp":1217896556,"_id":"47-1"}
{"userId":1,"movieId":50,"rating":4.0,"timestamp":1217896523,"_id":"50-1"}
{"userId":1,"movieId":110,"rating":4.0,"timestamp":1217896150,"_id":"110-1"}
{"userId":1,"movieId":150,"rating":3.0,"timestamp":1217895940,"_id":"150-1"}
{"userId":1,"movieId":161,"rating":4.0,"timestamp":1217897864,"_id":"161-1"}
{"userId":1,"movieId":165,"rating":3.0,"timestamp":1217897135,"_id":"165-1"}
{"userId":1,"movieId":204,"rating":0.5,"timestamp":1217895786,"_id":"204-1"}
Upload movies.json and ratings.json to /user/mapr/movies dierctory and /user/mapr/ratings directory.
Import movies dataset
$ mapr importJSON -idfield "movieId" -src /user/mapr/movies/movies.json -dst /tables/movielens/movies -mapreduce false
Import ratings dataset
$ mapr importJSON -idfield "_id" -src /user/mapr/ratings/ratings.json -dst /tables/movielens/ratings -mapreduce false
Query data using DB Shell
Launch MapR-DB Shell to query the data.
[mapr@maprdemo ~]$ mapr dbshell
Set mapr DB shell, to show json values in non-pretty format.
maprdb mapr:> jsonoptions --pretty true
{"pretty":true, "withTags":true}
Find 2 values from movies table.
maprdb mapr:> find /tables/movies --limit 2
{
"_id" : "1",
"genres" : [ "Adventure|Animation|Children|Comedy|Fantasy" ],
"movieId" : "1",
"title" : "Toy Story (1995)"
}
{
"_id" : "10",
"genres" : [ "Action|Adventure|Thriller" ],
"movieId" : "10",
"title" : "GoldenEye (1995)"
}
2 document(s) found.
Find the movie by Id = 1
maprdb mapr:> findbyid /tables/movies --id 1
{
"_id" : "1",
"genres" : [ "Adventure|Animation|Children|Comedy|Fantasy" ],
"movieId" : "1",
"title" : "Toy Story (1995)"
}
1 document(s) found.
Find all movies that for which movieId begins with 50. Show 2 values.
maprdb mapr:> find /tables/ratings --fromid 50 --limit 2
{
"_id" : "50-1",
"movieId" : 50,
"rating" : 4,
"timestamp" : 1217896523,
"userId" : 1
}
{
"_id" : "50-101",
"movieId" : 50,
"rating" : 5,
"timestamp" : 1278886003,
"userId" : 101
}
2 document(s) found.
Find movies which have got more than 4.5 rating.
maprdb mapr:> find /tables/ratings --where {"$gt":{"rating":4.5}} --offset 4 --fields userId,movieId,rating,timestamp --orderby timestamp --limit 2
{
"movieId" : 292,
"rating" : 5,
"timestamp" : 828564955,
"userId" : 312
}
{
"movieId" : 168,
"rating" : 5,
"timestamp" : 828564955,
"userId" : 312
}
2 document(s) found.
Query MapR-DB tables using Drill
Launch drill. Here I am launching the embedded mode.
[mapr@maprdemo drill-1.13.0]$ pwd
/opt/mapr/drill/drill-1.13.0
[mapr@maprdemo drill-1.13.0]$ bin/drill-embedded
Aug 09, 2018 12:31:45 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.13.0-mapr
"a drill is a terrible thing to waste"
0: jdbc:drill:zk=local>
Show databases in Drill.
0: jdbc:drill:zk=local> show databases;
+---------------------+
| SCHEMA_NAME |
+---------------------+
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| hbase |
| hive.default |
| INFORMATION_SCHEMA |
| sys |
+---------------------+
8 rows selected (2.117 seconds)
Query first 2 rows from the maprDB table.
0: jdbc:drill:zk=local> select * from dfs.`/tables/movies/` limit 2;
+------+--------------------------------------------------+----------+-------------------+
| _id | genres | movieId | title |
+------+--------------------------------------------------+----------+-------------------+
| 1 | ["Adventure|Animation|Children|Comedy|Fantasy"] | 1 | Toy Story (1995) |
| 10 | ["Action|Adventure|Thriller"] | 10 | GoldenEye (1995) |
+------+--------------------------------------------------+----------+-------------------+
2 rows selected (0.781 seconds)
0: jdbc:drill:zk=local> select * from dfs.`/tables/ratings` limit 2;
+--------+----------+---------+----------------+---------+
| _id | movieId | rating | timestamp | userId |
+--------+----------+---------+----------------+---------+
| 1-100 | 1.0 | 3.0 | 1.117381586E9 | 100.0 |
| 1-103 | 1.0 | 3.0 | 9.78554606E8 | 103.0 |
+--------+----------+---------+----------------+---------+
2 rows selected (0.297 seconds)
Find top 10 movies with highest avg rating.
0: jdbc:drill:zk=local> select t1.movieId, t1.title, avg(t2.rating) avg_rating from dfs.`/tables/movies` t1 join dfs.`/tables/ratings` t2 on cast(t1.movieId as int) = t2.movieId group by t1.movieId, t1.title having count(1)>100 order by avg_rating desc limit 10;
+----------+-----------------------------------------+---------------------+
| movieId | title | avg_rating |
+----------+-----------------------------------------+---------------------+
| 318 | Shawshank Redemption, The (1994) | 4.454545454545454 |
| 858 | Godfather, The (1972) | 4.392857142857143 |
| 50 | Usual Suspects, The (1995) | 4.328947368421052 |
| 1136 | Monty Python and the Holy Grail (1975) | 4.3019480519480515 |
| 527 | Schindler's List (1993) | 4.296370967741935 |
| 1193 | One Flew Over the Cuckoo's Nest (1975) | 4.2727272727272725 |
| 608 | Fargo (1996) | 4.2711442786069655 |
| 2571 | Matrix, The (1999) | 4.264367816091954 |
| 1221 | Godfather: Part II, The (1974) | 4.260714285714286 |
| 1213 | Goodfellas (1990) | 4.2592592592592595 |
+----------+-----------------------------------------+---------------------+
10 rows selected (5.039 seconds)
Save the SQL query output as parquet file format.
Save the output of query as a table in tmp directory.
0: jdbc:drill:zk=local> create table dfs.tmp.top10movies as select t1.movieId, t1.title, avg(t2.rating) avg_rating from dfs.`/tables/movies` t1 join dfs.`/tables/ratings` t2 on cast(t1.movieId as int) = t2.movieId group by t1.movieId, t1.title having count(1)>100 order by avg_rating desc limit 10;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 10 |
+-----------+----------------------------+
1 row selected (3.348 seconds)
0: jdbc:drill:zk=local>
You can set the storage pluggin configuration to set the default file format as parquet.
Location of the files
https://github.com/abulbasar/data/upload/master/movielens
Drillbit configuration http://localhost:8047