PredictionIO for Machine Learning
Install JDK 1.8.
Download the 64-bit Linux RPM (for example, jdk-8u101-linux-x64.rpm) from the Oracle Java download page to your local machine.
Move the .rpm file to the machine where you want to install JDK. You can use scp or similar tools.
Install the RPM using yum
$ sudo yum localinstall <rpm name e.g. jdk-8u101-linux-x64.rpm> -y
Set the JAVA_HOME environment variable. To make the variable permanent, add the export line to /etc/profile.
$ export JAVA_HOME=/usr/java/jdk1.8.0_111
Install sbt following the instructions on the sbt website.
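One approach that worked on CentOS at the time, assuming sbt's official rpm repository (the repo location may have changed since):
$ curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
$ sudo yum install sbt -y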
Download the Spark 1.6.2 binaries, untar them to /usr/lib, and set the SPARK_HOME environment variable to that location. To make the variable permanent, add the export line to /etc/profile.
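One way to fetch and unpack the binaries (the URL assumes the Apache archive layout for the Hadoop 2.6 build):
$ wget https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
$ sudo tar -xzf spark-1.6.2-bin-hadoop2.6.tgz -C /usr/lib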
$ export SPARK_HOME=/usr/lib/spark-1.6.2-bin-hadoop2.6
Download the PredictionIO source archive to Downloads or any temporary folder.
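If you still need the archive, it can be fetched from the Apache archive (URL assumes the standard layout for the 0.10.0-incubating release):
$ wget https://archive.apache.org/dist/incubator/predictionio/0.10.0-incubating/apache-predictionio-0.10.0-incubating.tar.gz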
Untar the downloaded file.
$ tar -xf apache-predictionio-0.10.0-incubating.tar.gz
Delete cached org.apache.predictionio artifacts from the Ivy cache
$ rm -rf ~/.ivy2/cache/org.apache.predictionio
Compile the PredictionIO source to create the binary distribution
$ cd apache-predictionio-0.10.0-incubating
$ ./make-distribution.sh
Untar the binary distribution to /usr/lib and set the PIO_HOME environment variable to that location. To make the variable permanent, add the export line to /etc/profile
$ tar -xf PredictionIO-0.10.0-incubating.tar.gz
$ sudo mv PredictionIO-0.10.0-incubating /usr/lib
$ export PIO_HOME=/usr/lib/PredictionIO-0.10.0-incubating
Install MySQL 5.7. See the MySQL documentation for detailed installation instructions.
Create a user for pio in MySQL and grant it permissions. In this example, the password is "Pass123!".
$ mysql -uroot -p
mysql> CREATE USER 'pio'@'%' IDENTIFIED BY 'Pass123!';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'pio'@'%';
mysql> FLUSH PRIVILEGES;
Create a database for PredictionIO
mysql> CREATE DATABASE pio;
Download the MySQL JDBC connector jar from Maven Central and save it to the $PIO_HOME/lib directory.
$ wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.37/mysql-connector-java-5.1.37.jar -P $PIO_HOME/lib/
Open $PIO_HOME/conf/pio-env.sh and make the changes outlined below.
a. Comment out SPARK_HOME (it is already set in /etc/profile)
b. Replace the existing MYSQL property values to match the database, user, and password created above, as sketched below
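A minimal sketch of the relevant pio-env.sh lines, assuming the MYSQL source block that ships commented out in the default file (also make sure the PIO_STORAGE_REPOSITORIES_*_SOURCE entries point at MYSQL):
PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
PIO_STORAGE_SOURCES_MYSQL_PASSWORD=Pass123!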
Start the Event Server
$ $PIO_HOME/bin/pio eventserver
Open a browser at http://localhost:7070/, or run the following command; you should see {"status":"alive"}
$ curl http://localhost:7070/
Open another shell and verify that the PIO process org.apache.predictionio.tools.console.Console is running.
$ jps -l
Also, verify the status using the pio command
$ $PIO_HOME/bin/pio status
Install the predictionio Python package, version 0.9.8 or above. It works with Python 2.7, which is the default on CentOS.
$ sudo easy_install predictionio
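To confirm the installed version meets the 0.9.8 requirement (a quick check; assumes the package exposes __version__):
$ python -c "import predictionio; print(predictionio.__version__)"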
Build Scala Documentation
Go to the directory that contains the PredictionIO source code and run the following commands
$ sbt clean
$ sbt doc
The above commands produce the following five API doc sets, one for each of the sub-projects under PredictionIO. The core and data docs are the major ones.
common/target/scala-2.10/api
e2/target/scala-2.10/api
data/target/scala-2.10/api
core/target/scala-2.10/api
tools/target/scala-2.10/api
core - Building blocks of a prediction engine
data - Event Store API.
You can view the docs by serving each directory with Python's built-in HTTP server, using the following commands
$ PIO_SOURCE=/home/cloudera/Downloads/apache-predictionio-0.10.0-incubating
$ cd $PIO_SOURCE/e2/target/scala-2.10/api; python -m SimpleHTTPServer 7676 &
$ cd $PIO_SOURCE/data/target/scala-2.10/api; python -m SimpleHTTPServer 7677 &
$ cd $PIO_SOURCE/core/target/scala-2.10/api; python -m SimpleHTTPServer 7678 &
$ cd $PIO_SOURCE/tools/target/scala-2.10/api; python -m SimpleHTTPServer 7679 &
$ cd $PIO_SOURCE/common/target/scala-2.10/api; python -m SimpleHTTPServer 7680 &
You can open five browser tabs at the following URLs
http://localhost:7676
http://localhost:7677
http://localhost:7678
http://localhost:7679
http://localhost:7680
Standard Operating Procedure to work with templates
Download the template from github
Create an application using pio command
Update engine.json with appId
Update build.sbt with project name and dependencies
Add Eclipse Nature to the project
In Eclipse project, replace the references of io.prediction with org.apache.predictionio
Compile project using pio build command
Download training data
Import training data using import script
Train model using pio train command
Deploy model using pio deploy command
Test the model by sending sample requests, using either curl or the Postman extension for Chrome/Firefox browsers
Linear Regression
Template: MLLib-LinearRegression
$ git clone https://github.com/RAditi/PredictionIO-MLLib-LinReg-Template.git MyLinearRegression
$ pio app new <app name, e.g. MyLinearRegression>
Set ACCESS_KEY to the access key for the application. View the access keys by running the following command
$ pio app list
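For example, with the key copied from the listing (placeholder value; substitute your app's actual key):
$ export ACCESS_KEY=<access key for MyLinearRegression>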
Update the appId in engine.json, as sketched below.
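A sketch of the relevant engine.json fragment, assuming the template keeps appId under the datasource params (replace 1 with the appId reported by pio app list):
"datasource": {
  "params": {
    "appId": 1
  }
}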
Update name and library dependencies in build.sbt as below.
name := "MyLinearRegression"
libraryDependencies ++= Seq(
"org.apache.predictionio" %% "apache-predictionio-core" % "0.10.0-incubating" % "provided",
"org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
"org.apache.spark" %% "spark-mllib" % "1.6.2" % "provided")
We need to update the references of io.prediction to org.apache.predictionio. Eclipse is helpful for making this change, rather than editing each file by hand.
$ sbt eclipse
Import the project into Eclipse and update the import statements in the following Scala files, replacing references to io.prediction with org.apache.predictionio. After updating the references, Eclipse should display no errors.
Algorithm.scala
DataSource.scala
Engine.scala
Serving.scala
PreparedData.scala
Build project
$ pio build
Download the training data. Before running the command, make sure you are in the MyLinearRegression directory.
$ curl https://raw.githubusercontent.com/apache/spark/master/data/mllib/ridge-data/lpsa.data --create-dirs -o data/sample_data.txt
Import the training data into the PIO event server. Make sure the event server shows alive status at localhost:7070.
$ curl http://localhost:7070
$ python data/import_eventserver.py --access_key $ACCESS_KEY
There should be 67 events. You can verify this in the storage backend.
Train the model
$ pio train
Deploy the model
$ pio deploy
Test the model
$ curl -H "Content-Type: application/json" -d '{"features" :[-1, -2, -1, -3, 0, 0, -1, 0]}' http://localhost:8000/queries.json
Recommender System
Template: Recommendation
$ git clone https://github.com/apache/incubator-predictionio-template-recommender.git MyRecommender
$ pio app new MyRecommender
Set ACCESS_KEY to the access key of the application.
Update engine.json with the following
"appName": "MyRecommender"
"numIterations": 10
Update build.sbt
Set name to "MyRecommender"
Update the Spark dependency versions to 1.6.2 (as in the build.sbt shown earlier)
Build the project
$ pio build
Import Sample Data
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
"event" : "rate",
"entityType" : "user",
"entityId" : "u0",
"targetEntityType" : "item",
"targetEntityId" : "i0",
"properties" : {
"rating" : 5
},
"eventTime" : "2014-11-02T09:39:45.618-08:00"
}'
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
"event" : "buy",
"entityType" : "user",
"entityId" : "u1",
"targetEntityType" : "item",
"targetEntityId" : "i2",
"eventTime" : "2014-11-10T12:34:56.123-08:00"
}'
Access Data
$ curl -i -X GET "http://localhost:7070/events.json?accessKey=$ACCESS_KEY"
Bulk import sample data
Download sample data
$ curl https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_movielens_data.txt --create-dirs -o data/sample_movielens_data.txt
Import data using the Python SDK
$ python data/import_eventserver.py --access_key $ACCESS_KEY
Train the model
$ pio train
Deploy the model
$ pio deploy
Once deployed, you can view the engine server's web UI at http://localhost:8000
Test the model by sending a sample request.
$ curl -H "Content-Type: application/json" -d '{ "user": "1", "num": 4 }' http://localhost:8000/queries.json
Add Eclipse nature to the project
$ sbt eclipse
Sentiment Analysis using the OpenNLP library
Template: OpenNLP Sentiment Analysis Template
Download source
$ git clone https://github.com/vshwnth2/OpenNLP-SentimentAnalysis-Template.git MySentimentAnalyzer
Create new app
$ pio app new MySentimentAnalyzer
Set the ACCESS_KEY environment variable to the access key.
Import training data.
$ cd MySentimentAnalyzer/data
$ python import_eventserver.py --access_key $ACCESS_KEY
Train the model
$ pio train
Deploy the model
$ pio deploy
Test:
$ curl -H "Content-Type: application/json" -d '{ "sentence":"I like speed and fast motorcycles." }' http://localhost:8000/queries.json
{"sentiment":"Positive"}
Text Classification
Template: https://github.com/apache/incubator-predictionio-template-text-classifier
Get the source
$ git clone https://github.com/apache/incubator-predictionio-template-text-classifier.git MyTextClassifier
Create a new app and set the ACCESS_KEY environment variable.
$ pio app new MyTextClassifier
Import data using the bulk import feature. The --appid value must match your app's id (3 in this example).
$ pio import --appid 3 --input data/stopwords.json
$ pio import --appid 3 --input data/emails.json
Build
$ pio build --verbose
Train
$ pio train
Deploy
$ pio deploy
Test the application
$ curl -H "Content-Type: application/json" -d '{ "text":"I like speed and fast motorcycles." }' http://localhost:8000/queries.json
$ curl -H "Content-Type: application/json" -d '{ "text":"Earn extra cash!" }' http://localhost:8000/queries.json
Evaluate
$ pio eval org.template.textclassification.AccuracyEvaluation org.template.textclassification.EngineParamsList
Event Type
Create a new app. Take note of the app id and access key.
$ pio app new MyTestApp
Set ACCESS_KEY environment variable
$ ACCESS_KEY=<access key of the new app>
Create an event record
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
"event" : "$set",
"entityType" : "user",
"entityId" : "2",
"properties" : {
"a" : 3,
"b" : 4
},
"eventTime" : "2014-09-09T16:17:42.937-08:00"
}'
User 2 will have two properties, a and b. Connect to the DB using SQL Workbench and view the table. The table name follows the convention pio_event_<app id>, for example pio_event_8, where 8 is the application id.
mysql> select * from pio_event_8 order by entitytype, cast(entityid as signed);
Add a new property called "c" (and update "b") on the entity you just created.
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
"event" : "$set",
"entityType" : "user",
"entityId" : "2",
"properties" : {
"b" : 5,
"c" : 6
},
"eventTime" : "2014-09-10T13:12:04.937-08:00"
}'
Launch the pio shell to view the data. Before launching, make sure Spark references the MySQL JDBC connector jar in its spark-defaults.conf file, as sketched below.
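A minimal sketch of the spark-defaults.conf entry, assuming the connector jar downloaded earlier into $PIO_HOME/lib (the path is illustrative):
spark.driver.extraClassPath /usr/lib/PredictionIO-0.10.0-incubating/lib/mysql-connector-java-5.1.37.jar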
$ pio-shell --with-spark
scala> val appName = "MyTestApp"
scala> import org.apache.predictionio.data.store.PEventStore
scala> PEventStore.aggregateProperties(appName=appName, entityType="user")(sc).collect()
You should see an aggregated result for user 2 with all three properties at their latest values.
Remove a property.
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
"event" : "$unset",
"entityType" : "user",
"entityId" : "2",
"properties" : {
"b" : null
},
"eventTime" : "2014-09-11T14:17:42.456-08:00"
}'
Verify the aggregated result in pio-shell.
scala> PEventStore.aggregateProperties(appName=appName, entityType="user")(sc).collect()
Delete the entity user 2.
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
"event" : "$delete",
"entityType" : "user",
"entityId" : "2",
"eventTime" : "2014-09-12T16:13:41.452-08:00"
}'
Verify the result from pio-shell.
scala> PEventStore.aggregateProperties(appName=appName, entityType="user")(sc).collect()
You can apply a time filter to the retrieval request.
scala> import org.joda.time.DateTime
scala> PEventStore.aggregateProperties(appName=appName, entityType="user", untilTime=Some(new DateTime(2014, 9, 11, 0, 0)))(sc).collect()
View the PEventStore source code for more details.