Apache Solr Basics
What is Apache Solr
Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is the second-most popular enterprise search engine after Elasticsearch. (Wikipedia)
Ports Used
Solr's web interface and REST API listen on port 8983 by default (the admin UI and query URLs later in this guide use this port). The port and other service settings can be altered in the configuration file /etc/default/solr.
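Typical settings in /etc/default/solr look like the following. The values are examples for a single-node setup, consistent with the addresses used later in this guide, and should be treated as assumptions; check your own file.
SOLR_PORT=8983
SOLR_ADMIN_PORT=8984
SOLR_ZK_ENSEMBLE=localhost:2181/solr
SOLR_HDFS_HOME=hdfs://localhost:8020/solr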
File Locations
/usr/lib/solr - binary files
/etc/default/solr - default configurations
/usr/share/doc - examples
/etc/solr/conf - custom configurations
/var/lib/solr - stores collection configurations
hdfs://localhost:8020/solr - data and index for collections
Initialize Solr
Management of Solr services can be done using REST API commands or using solrctl, a command-line wrapper around the REST API. You need to have solrctl installed on at least one server in the Solr cluster.
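Because solrctl is a thin wrapper, the same operations can be issued directly against the REST API. For example, assuming Solr listens on the default port 8983, the Collections API can list existing collections (roughly equivalent to solrctl collection --list; availability of the LIST action depends on the Solr version):
$ curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"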
Stop Solr services
$ sudo service solr-server stop
Create the Solr namespace in Zookeeper (--force clears any existing state).
$ solrctl init --force
Start and verify Solr service is running
$ sudo service solr-server start
$ sudo jps -lmv | grep -i solr
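To confirm the service is also answering HTTP requests, a quick status call against the core admin API can be used (default port assumed):
$ curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"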
Create a collection in SolrCloud
A collection holds the indexes along with the configuration files used for indexing, chiefly solrconfig.xml and schema.xml. The files for a collection are maintained in an instance directory. First you create an instance directory, then add it to Zookeeper. Once the instance directory is in Zookeeper, create the collection in SolrCloud.
Create an instance directory. The instance directory contains solrconfig.xml and schema.xml among other supporting files.
$ solrctl instancedir --generate $HOME/solr_configs
Upload the instance directory to Zookeeper. Here collection1 is the name under which the configuration is stored in Zookeeper (under /solr/configs).
$ solrctl instancedir --create collection1 $HOME/solr_configs
Any subsequent change to the local instance directory must be pushed to Zookeeper using the solrctl instancedir --update command, as shown below.
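For example, after editing files under $HOME/solr_configs, push the change and reload the collection (the same commands appear in the rebuild steps later in this guide):
$ solrctl instancedir --update collection1 $HOME/solr_configs
$ solrctl collection --reload collection1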
List the instance directories
$ solrctl instancedir --list
Verify the instance directories in Zookeeper
$ /usr/lib/zookeeper/bin/zkCli.sh -server localhost
Then, at the zkCli prompt:
ls /solr/configs
Create a collection with one shard (-s 1) and list the collections.
$ solrctl collection --create collection1 -s 1
$ solrctl collection --list
Validate the Solr REST API
$ cd /usr/share/doc/solr-doc*/example/exampledocs
$ java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
Now, open http://localhost:8983/solr/#/collection1_shard1_replica1/query and start querying for various query strings and facets.
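The same queries can be issued over HTTP with curl, using the select handler of the collection created above (default port assumed):
$ curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json"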
Index Sample Twitter Data Using MapReduce
Create the skeleton of the instance dir.
$ solrctl instancedir --generate $HOME/solr_configs3
Copy the schema specific to the Twitter data
$ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml $HOME/solr_configs3/conf
Upload the instance dir to Zookeeper
$ solrctl instancedir --create collection3 $HOME/solr_configs3
Create a new collection
$ solrctl collection --create collection3 -s 1
Create a data directory in HDFS, into which we will upload a sample Twitter dataset.
$ hadoop fs -mkdir indir
Upload the sample Twitter data into the HDFS directory. Note that the data is in Avro format.
$ hadoop fs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro indir
$ hadoop fs -ls indir
Create an empty output dir
$ hadoop fs -rm -r -skipTrash outdir
$ hadoop fs -mkdir outdir
$ hadoop fs -ls outdir
Delete existing index data, if there is any.
$ solrctl collection --deletedocs collection3
Run the indexing job. The indexing job depends on the Morphlines library (part of the Kite SDK) to parse and transform the input records; see the Kite SDK documentation for details.
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --help
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://localhost:8020/user/cloudera/outdir --verbose --go-live --zk-host localhost:2181/solr --collection collection3 hdfs://localhost:8020/user/cloudera/indir
Now, open Solr and run queries.
http://localhost:8983/solr/#/collection3_shard1_replica1/query
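To check how many tweets were indexed, run a quick count query (default port assumed; numFound in the response is the total number of documents):
$ curl "http://localhost:8983/solr/collection3/select?q=*:*&rows=0&wt=json"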
Clean up
$ solrctl collection --deletedocs collection3
$ solrctl collection --delete collection3
$ solrctl instancedir --delete collection3
$ hadoop fs -rm -r indir
Index HDFS Data using LucidWorks
The LucidWorks hadoop-solr project (https://github.com/LucidWorks/hadoop-solr) includes tools to build a Hadoop job jar that can index documents from HDFS into Solr.
Supported versions
Solr 5.x
Hadoop 2.x
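A rough sketch of how the resulting job jar is typically run, based on the project's README; the mapper class, flags, input path, collection name, and Zookeeper address below are assumptions to verify against the version you build:
$ hadoop jar solr-hadoop-job-*.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c collection1 -i /user/cloudera/csvdata/*.csv -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk localhost:2181/solr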
Query Types
More on query patterns: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
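A few common patterns from that syntax, written as values for the q parameter. The field names are taken from the SFPD schema defined in the next section, so treat them as assumptions until that schema is in place:
Category:ASSAULT - match a single field value
Category:ASSAULT AND DayOfWeek:Friday - combine clauses with boolean operators
text:"grand theft" - phrase query
PdDistrict:SOUTH* - wildcard query
Date:[2003-01-01 TO 2003-12-31] - range query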
Indexing SFPD Crime Data
Get the SFPD crime data from https://data.sfgov.org/Public-Safety/Map-Crime-Incidents-from-1-Jan-2003/gxxq-x39z/data. The file is in CSV format; convert it to Avro before running the indexing job below (the conversion process is not covered here).
Create the skeleton of the instance dir.
$ solrctl instancedir --generate $HOME/solr_configs4
Copy the Twitter schema as a starting point for the SFPD fields
$ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml $HOME/solr_configs4/conf
Replace the document fields with the following
<field name="IncidntNum" type="long" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Category" type="string" indexed="true" stored="true" />
<field name="text" type="string" indexed="true" stored="true" />
<field name="DayOfWeek" type="string" indexed="true" stored="true" />
<field name="Date" type="string" indexed="true" stored="true" />
<field name="Time" type="string" indexed="true" stored="true" />
<field name="PdDistrict" type="string" indexed="true" stored="true" />
<field name="Resolution" type="string" indexed="true" stored="true" />
<field name="Address" type="string" indexed="true" stored="true" />
<field name="X" type="string" indexed="true" stored="true" />
<field name="Y" type="string" indexed="true" stored="true" />
<field name="Location" type="string" indexed="true" stored="true" />
Copy the following morphline file to the local filesystem
$ cp /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf ~/SfpdMorphLine.conf
Edit SfpdMorphLine.conf and replace the extractAvroPaths command's field mapping with the following
{
  extractAvroPaths {
    flatten : false
    paths : {
      IncidntNum : /IncidntNum
      Category : /Category
      text : /Descript
      DayOfWeek : /DayOfWeek
      Date : /Date
      Time : /Time
      PdDistrict : /PdDistrict
      Resolution : /Resolution
      Address : /Address
      X : /X
      Y : /Y
      Location : /Location
    }
  }
}
Upload the instance dir to Zookeeper
$ solrctl instancedir --create collection4 $HOME/solr_configs4
Create a new collection
$ solrctl collection --create collection4 -s 1
Create an empty output dir
$ hadoop fs -rm -r -skipTrash outdir
$ hadoop fs -mkdir outdir
$ hadoop fs -ls outdir
Delete existing index data, if there is any.
$ solrctl collection --deletedocs collection4
Run the indexing job. As before, the indexing job depends on the Morphlines library (part of the Kite SDK) to parse and transform the input records.
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file ~/SfpdMorphLine.conf --output-dir hdfs://localhost:8020/user/cloudera/outdir --verbose --go-live --zk-host localhost:2181/solr --collection collection4 hdfs://localhost:8020/user/cloudera/sfpd.avro
Now, open Solr and run queries.
http://localhost:8983/solr/#/collection4_shard1_replica1/query
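A faceted query over the crime categories can also be run directly over HTTP, using the collection created above (default port assumed):
$ curl "http://localhost:8983/solr/collection4/select?q=*:*&rows=0&facet=true&facet.field=Category&wt=json"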
If you make any change to the schema, run the following commands for the new config to take effect.
$ hadoop fs -rm -r -skipTrash outdir
$ hadoop fs -mkdir outdir
$ solrctl instancedir --update collection4 $HOME/solr_configs4
$ solrctl collection --reload collection4
$ solrctl collection --deletedocs collection4
Rebuild the index using the above hadoop jar command.