Oozie - Incremental Table Load Workflow
Build a oozie workflow that would incrementally load a table from mysql and create a matching table in hive. Schedule this workflow to run every hour.
Create a .password file containing db user's password. Here the user in mysql db is "root" and corresponding password id "cloudera".
Upload the .password to /sqoop directory in HDFS.
Restrict the access only to yarn user.
Remove the .password from the local system for security reason.
Create a sqoop job for incremental load in append mode. Create this job under yarn user because we will ultimately build a oozie job that runs under yarn user.
View the existing jobs under yarn user.
You can test the sqoop job by running
Once test is successful delete the sqoop job and delete the HDFS target directory.
And re-create the sqoop job import-employee as earlier.
Verify that the sqoop job is visible as yarn user
Open hive terminal and create the following two tables
Create a employee.hql file with the following hive statement. Upload this script to /user/cloudera/scripts directory in HDFS. This script will be used in the Oozie workflow.
Now open hue Oozie workflow editor, http://localhost:8888/oozie/editor/workflow/list/ and create a new workflow, name it "Build Employee Table".
Add the following properties Oozie Workflow:
Open the Oozie Workflow Workspace (it is a folder in HDFS that contains job.properties, workflow.xml files and lib folder. In the lib folder upload the mysql-connector-java-5.1.34-bin.jar file that you can generally find in /usr/share/java directory. Workspace directory (HDFS) for Oozie workflow looks something like this - /user/hue/oozie/workspaces/hue-oozie-1495733114.84.
Now, in the Workflow, add the following 2 actions: See the screenshot.
A. Sqoop job "job --exec import-employee"
B. Hive 2 job "/user/cloudera/scripts/employee.hql"
Save the workflow and run. Once finished successfully, you can open impala or hive and run the queries on employee table.
Insert new records in employee table in mysql and run the workflow again. Verify that hive employee table is showing the exact same view as as employee table in mysql.
Schedule the Job to Run Every 5 mins
In Hue, go to Workflows > Editors > Workflows.
Select the workflow you created and click on Schedule.
Click on Save and Submit. You incremental import job should now run every 5 mins.