PredictionIO for Machine Learning


Install JDK 1.8.

  • Download the Linux 64-bit RPM (for example, jdk-8u101-linux-x64.rpm) to your local machine
  • Move the .rpm file to the machine where you want to install JDK. You can use scp or similar tools.
  • Install rpm using yum

$ sudo yum localinstall <rpm name e.g. jdk-8u101-linux-x64.rpm> -y

Set the JAVA_HOME environment variable. To make it permanent, add the export to /etc/profile.

$ export JAVA_HOME=/usr/java/jdk1.8.0_101

Install sbt following its official installation instructions.

Download the Spark 1.6.2 binaries, untar them to /usr/lib, and set the SPARK_HOME environment variable to that location. To make it permanent, add the export to /etc/profile.

$ export SPARK_HOME=/usr/lib/spark-1.6.2-bin-hadoop2.6

Download the PredictionIO source to your Downloads folder or any temporary directory.

Untar the downloaded file.

$ tar -xf apache-predictionio-0.10.0-incubating.tar.gz

Delete org.apache.predictionio from the ivy2 cache.

$ rm -rf ~/.ivy2/cache/org.apache.predictionio

Compile the PredictionIO source to create the binary distribution.

$ cd apache-predictionio-0.10.0-incubating

$ ./

Untar the binary to /usr/lib and set the PIO_HOME environment variable to that location. To make it permanent, add the export to /etc/profile.

$ tar -xf PredictionIO-0.10.0-incubating.tar.gz

$ sudo mv PredictionIO-0.10.0-incubating /usr/lib

$ export PIO_HOME=/usr/lib/PredictionIO-0.10.0-incubating

Install MySQL 5.7 (see the MySQL documentation for detailed instructions).

Create a user for pio in MySQL and grant it permissions. The password for the new pio user is "Pass123!".

$ mysql -uroot -p

mysql> CREATE USER 'pio'@'%' IDENTIFIED BY 'Pass123!';

mysql> GRANT ALL PRIVILEGES ON *.* TO 'pio'@'%';


Create a database for PredictionIO (the name is up to you; this guide assumes a database named pio):

mysql> CREATE DATABASE pio;
Download the MySQL JDBC connector jar from Maven and save it to the $PIO_HOME/lib directory.

$ wget -P $PIO_HOME/lib/

Open $PIO_HOME/conf/ and make the changes outlined below.

a. Comment out SPARK_HOME

b. Replace the existing MySQL property values with those for your setup

Start Event server

$ $PIO_HOME/bin/pio eventserver

Open a browser to http://localhost:7070/ or run the following command; you should see {"status":"alive"}

$ curl http://localhost:7070/
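If you want to script this health check, here is a minimal sketch using only the Python standard library (the URL assumes the default event server port used above):

```python
import json

def is_alive(body):
    """Return True when the status response body reports {"status":"alive"}."""
    try:
        return json.loads(body).get("status") == "alive"
    except ValueError:
        return False

def check_event_server(url="http://localhost:7070/"):
    """Fetch the status endpoint and parse it (raises if the server is down)."""
    from urllib.request import urlopen  # Python 3; on the Python 2.7 above, use urllib2
    return is_alive(urlopen(url).read())
```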

Open another shell terminal and verify whether a process for PIO is running.

$ jps -l

Also, verify the status using the pio command

$ $PIO_HOME/bin/pio status

Install the predictionio package for Python. It appears to work with Python 2.7, which is the default on CentOS. The predictionio package version should be 0.9.8 or above.

$ sudo easy_install predictionio
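As an alternative to the SDK, events can be posted straight to the event server's REST endpoint. A sketch using only the standard library (the payload field names and server address follow the curl examples later in this guide):

```python
import json

EVENT_SERVER = "http://localhost:7070"  # default event server address

def build_event(event, entity_type, entity_id, properties=None, **extra):
    """Assemble an event payload in the shape /events.json expects."""
    payload = {"event": event, "entityType": entity_type, "entityId": entity_id}
    if properties is not None:
        payload["properties"] = properties
    payload.update(extra)  # e.g. targetEntityType, targetEntityId, eventTime
    return payload

def send_event(access_key, payload):
    """POST one event; equivalent to the curl commands used below."""
    from urllib.request import Request, urlopen  # Python 3; urllib2 on Python 2.7
    req = Request(
        "%s/events.json?accessKey=%s" % (EVENT_SERVER, access_key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return json.loads(urlopen(req).read())
```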

Build Scala Documentation

Go to the directory that contains the PredictionIO source code and run the following commands

$ sbt clean

$ sbt doc

The above commands produce API docs for each of the subprojects under PredictionIO (common, core, data, e2, tools). The two most important are:

core - Building blocks of a prediction engine

data - Event Store API

You can view the docs by launching a simple Python HTTP server for each subproject with the following commands

$ PIO_SOURCE=/home/cloudera/Downloads/apache-predictionio-0.10.0-incubating

$ cd $PIO_SOURCE/e2/target/scala-2.10/api; python -m SimpleHTTPServer 7676 &

$ cd $PIO_SOURCE/data/target/scala-2.10/api; python -m SimpleHTTPServer 7677 &

$ cd $PIO_SOURCE/core/target/scala-2.10/api; python -m SimpleHTTPServer 7678 &

$ cd $PIO_SOURCE/tools/target/scala-2.10/api; python -m SimpleHTTPServer 7679 &

$ cd $PIO_SOURCE/common/target/scala-2.10/api; python -m SimpleHTTPServer 7680 &

You can then open browser tabs at the following URLs:

  • http://localhost:7676/ (e2)
  • http://localhost:7677/ (data)
  • http://localhost:7678/ (core)
  • http://localhost:7679/ (tools)
  • http://localhost:7680/ (common)
Standard Operating Procedure to work with templates

  1. Download the template from github
  2. Create an application using pio command
  3. Update engine.json with appId
  4. Update build.sbt with project name and dependencies
  5. Add Eclipse Nature to the project
  6. In Eclipse project, replace the references of io.prediction with org.apache.predictionio
  7. Compile project using pio build command
  8. Download training data
  9. Import training data using import script
  10. Train model using pio train command
  11. Deploy model using pio deploy command
  12. Test the model by sending sample messages either using curl or postman plugin of Chrome/Firefox browsers

Linear Regression

Template: MLLib-LinearRegression

$ git clone MyLinearRegression

$ pio app new <app name, e.g. MyLinearRegression>

Set ACCESS_KEY to the access key of the application. You can view the access keys by running:

$ pio app list

Update the appId in engine.json with the id of the application you just created.

Update the name and library dependencies in build.sbt as below.

name := "MyLinearRegression"

libraryDependencies ++= Seq(
  "org.apache.predictionio" %% "apache-predictionio-core" % "0.10.0-incubating" % "provided",
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.2" % "provided")

We need to update the references of io.prediction to org.apache.predictionio. Eclipse makes this change easier than editing each file by hand, so generate the Eclipse project files first:

$ sbt eclipse

Import the project in Eclipse and update the import statements in the following Scala files, replacing references to io.prediction with org.apache.predictionio. After updating the references, Eclipse should display no errors.

  • Algorithm.scala
  • DataSource.scala
  • Engine.scala
  • Serving.scala
  • PreparedData.scala

Build project

$ pio build

Download test data. Before running the command, make sure you are in MyLinearRegression directory.

$ curl --create-dirs -o data/sample_data.txt

Import the test data into the PIO event server. Make sure the event server shows alive status at localhost:7070.

$ curl http://localhost:7070

$ python data/ --access_key $ACCESS_KEY

There should be 67 events. You can verify the count in the storage backend.

Train the model

$ pio train

Deploy the model

$ pio deploy

Test the model

$ curl -H "Content-Type: application/json" -d '{"features" :[-1, -2, -1, -3, 0, 0, -1, 0]}' http://localhost:8000/queries.json
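The same query can be issued from Python; a minimal standard-library sketch (the endpoint and field name mirror the curl command above):

```python
import json

def build_query(features):
    """Serialize a feature vector into the JSON body the engine expects."""
    return json.dumps({"features": features})

def query_engine(features, url="http://localhost:8000/queries.json"):
    """POST a query to the deployed engine and return the parsed response."""
    from urllib.request import Request, urlopen  # Python 3; urllib2 on Python 2.7
    req = Request(url, data=build_query(features).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    return json.loads(urlopen(req).read())
```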

Recommender System

Template: Recommendation

$ git clone MyRecommender

$ pio app new MyRecommender

Set ACCESS_KEY to Access Key of the application.

Update engine.json with the following

  • "appName": "MyRecommender"
  • "numIterations": 10

Update build.sbt

  • Set name to "MyRecommender"
  • Update the spark api version to 1.6.2

Build the project

$ pio build

Import Sample Data

$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "rate",
  "entityType" : "user",
  "entityId" : "u0",
  "targetEntityType" : "item",
  "targetEntityId" : "i0",
  "properties" : {
    "rating" : 5
  },
  "eventTime" : "2014-11-02T09:39:45.618-08:00"
}'

$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "buy",
  "entityType" : "user",
  "entityId" : "u1",
  "targetEntityType" : "item",
  "targetEntityId" : "i2",
  "eventTime" : "2014-11-10T12:34:56.123-08:00"
}'


Access Data

$ curl -i -X GET "http://localhost:7070/events.json?accessKey=$ACCESS_KEY"
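Reading events back can likewise be scripted; a standard-library sketch mirroring the GET above:

```python
import json

def events_url(access_key, base="http://localhost:7070"):
    """Build the URL used by the curl GET above."""
    return "%s/events.json?accessKey=%s" % (base, access_key)

def list_events(access_key):
    """Fetch and parse the stored events (returns a list of event dicts)."""
    from urllib.request import urlopen  # Python 3; urllib2 on Python 2.7
    return json.loads(urlopen(events_url(access_key)).read())
```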

Bulk import sample data

Download sample data

$ curl --create-dirs -o data/sample_movielens_data.txt

Import data using the python SDK

$ python data/ --access_key $ACCESS_KEY

Train the model

$ pio train

Deploy the model

$ pio deploy

Once deployed, you can view the model's web UI at http://localhost:8000.

Test the model by sending a sample request.

$ curl -H "Content-Type: application/json" -d '{ "user": "1", "num": 4 }' http://localhost:8000/queries.json

Add eclipse nature to the project

$ sbt eclipse

Sentiment Analysis using Stanford NLP library

Template: OpenNLP Sentiment Analysis Template

Download source

$ git clone MySentimentAnalyzer

Create new app

$ pio app new MySentimentAnalyzer

Set the ACCESS_KEY environment variable to the access key.

Import training data.

$ cd MySentimentAnalyzer/data

$ python --access_key $ACCESS_KEY

Train the model

$ pio train

Deploy the model

$ pio deploy

Test the model by sending a sample request.

$ curl -H "Content-Type: application/json" -d '{ "sentence":"I like speed and fast motorcycles." }' http://localhost:8000/queries.json


Text Classification


Get the source

$ git clone MyTextClassifier

Create a new app and set the ACCESS_KEY environment variable.

$ pio app new MyTextClassifier

Import the data using the bulk import feature. Replace 3 with your application's id.

$ pio import --appid 3 --input data/stopwords.json
$ pio import --appid 3 --input data/emails.json


Build the project

$ pio build --verbose

Train the model

$ pio train

Deploy the model

$ pio deploy

Test the application

$ curl -H "Content-Type: application/json" -d '{ "text":"I like speed and fast motorcycles." }' http://localhost:8000/queries.json
$ curl -H "Content-Type: application/json" -d '{ "text":"Earn extra cash!" }' http://localhost:8000/queries.json


Run the evaluation.

$ pio eval org.template.textclassification.AccuracyEvaluation org.template.textclassification.EngineParamsList

Event Type

Create new app. Take a note of the app Id and access key.

$ pio app new MyTestApp

Set ACCESS_KEY environment variable

$ ACCESS_KEY=<access key of the new app>

Create an event record

$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "$set",
  "entityType" : "user",
  "entityId" : "2",
  "properties" : {
    "a" : 3,
    "b" : 4
  },
  "eventTime" : "2014-09-09T16:17:42.937-08:00"
}'


User 2 will have two properties, a and b. Connect to the DB using SQL Workbench and view the table. Table names follow the convention pio_event_<app id>, for example pio_event_8, where 8 is the application id.

mysql> select * from pio_event_8 order by entitytype, cast(entityid as signed);

Add new property called "c" to the entity you just created.

$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "$set",
  "entityType" : "user",
  "entityId" : "2",
  "properties" : {
    "b" : 5,
    "c" : 6
  },
  "eventTime" : "2014-09-10T13:12:04.937-08:00"
}'


Launch the pio shell to view the data. Before launching, make sure Spark references the MySQL JDBC connector package in its spark-defaults.conf file.

$ pio-shell --with-spark

scala> val appName = "MyTestApp"

scala> import

scala> PEventStore.aggregateProperties(appName=appName, entityType="user")(sc).collect()

You should see an aggregated result for user 2 with all three properties at their latest values.

Remove a property.

$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "$unset",
  "entityType" : "user",
  "entityId" : "2",
  "properties" : {
    "b" : null
  },
  "eventTime" : "2014-09-11T14:17:42.456-08:00"
}'


Verify the aggregated result in pio-shell.

scala> PEventStore.aggregateProperties(appName=appName, entityType="user")(sc).collect()

Delete the entity user 2.

$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "$delete",
  "entityType" : "user",
  "entityId" : "2",
  "eventTime" : "2014-09-12T16:13:41.452-08:00"
}'


Verify the result from pio-shell.

scala> PEventStore.aggregateProperties(appName=appName, entityType="user")(sc).collect()

You can apply a time filter to the retrieval request.

scala> import org.joda.time.DateTime

scala> PEventStore.aggregateProperties(appName=appName, entityType="user", untilTime=Some(new DateTime(2014, 9, 11, 0, 0)))(sc).collect()

View the source code of PEventStore for more details.
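The way aggregateProperties resolves $set, $unset, and $delete events can be illustrated with a small simulation (a conceptual sketch in plain Python, not PredictionIO code; the event shapes follow the curl examples above):

```python
def aggregate_properties(events, until_time=None):
    """Fold special events in time order into an entity's current properties.

    Each event is a dict with "event", "eventTime" (ISO-8601 string, so
    lexicographic order matches time order within one zone), and optionally
    "properties". Returns None if the entity was deleted.
    """
    state, deleted = {}, False
    for ev in sorted(events, key=lambda e: e["eventTime"]):
        if until_time is not None and ev["eventTime"] >= until_time:
            break  # mimic the untilTime filter
        if ev["event"] == "$set":
            state.update(ev.get("properties", {}))
            deleted = False
        elif ev["event"] == "$unset":
            for key in ev.get("properties", {}):
                state.pop(key, None)
        elif ev["event"] == "$delete":
            state, deleted = {}, True
    return None if deleted else state

# The event sequence from this section:
events = [
    {"event": "$set",   "properties": {"a": 3, "b": 4}, "eventTime": "2014-09-09"},
    {"event": "$set",   "properties": {"b": 5, "c": 6}, "eventTime": "2014-09-10"},
    {"event": "$unset", "properties": {"b": None},      "eventTime": "2014-09-11"},
    {"event": "$delete",                                "eventTime": "2014-09-12"},
]
```

Calling aggregate_properties(events, until_time="2014-09-11") reproduces the time-filtered query above: the $unset and $delete are ignored, leaving a, b, and c at their latest values.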