Apache Solr Basics
What is Apache Solr
Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is the second-most popular enterprise search engine after Elasticsearch. (Wikipedia)
Ports Used
Solr's web interface and REST API listen on port 8983 by default (the admin UI and query URLs later in this guide use this port). The port and other service settings can be altered in the configuration file /etc/default/solr.
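Typical settings in /etc/default/solr look like the following. The values are examples for a single-node setup, consistent with the addresses used later in this guide, and should be treated as assumptions; check your own file.
SOLR_PORT=8983
SOLR_ADMIN_PORT=8984
SOLR_ZK_ENSEMBLE=localhost:2181/solr
SOLR_HDFS_HOME=hdfs://localhost:8020/solr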
File Locations
/usr/lib/solr - binary files
/etc/default/solr - default configurations
/usr/share/doc - examples
/etc/solr/conf - custom configurations
/var/lib/solr - stores collection configurations
hdfs://localhost:8020/solr - data and index for collections
Initialize Solr
Management of Solr services can be done using REST API commands or using solrctl, a command-line wrapper around the REST API. You need to have solrctl installed on at least one server in the Solr cluster.
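Because solrctl is a thin wrapper, the same operations can be issued directly against the REST API. For example, assuming Solr listens on the default port 8983, the Collections API can list existing collections (roughly equivalent to solrctl collection --list; availability of the LIST action depends on the Solr version):
$ curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"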
Stop Solr services
$ sudo service solr-server stop
Create the Solr namespace in Zookeeper (--force clears any existing state).
$ solrctl init --force
Start and verify Solr service is running
$ sudo service solr-server start
$ sudo jps -lmv | grep -i solr
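To confirm the service is also answering HTTP requests, a quick status call against the core admin API can be used (default port assumed):
$ curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"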
Create a collection in SolrCloud
A collection holds the indexes along with the configuration files used for indexing, chiefly solrconfig.xml and schema.xml. The files for a collection are maintained in an instance directory. First you create an instance directory, then add it to Zookeeper. Once the instance directory is in Zookeeper, create the collection in SolrCloud.
Create an instance directory. The instance directory contains solrconfig.xml and schema.xml among other supporting files.
$ solrctl instancedir --generate $HOME/solr_configs
Upload the instance directory to Zookeeper. Here collection1 is the name under which the configuration is stored in Zookeeper (under /solr/configs).
$ solrctl instancedir --create collection1 $HOME/solr_configs
Any subsequent change to the local instance directory must be pushed to Zookeeper using the solrctl instancedir --update command, as shown below.
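For example, after editing files under $HOME/solr_configs, push the change and reload the collection (the same commands appear in the rebuild steps later in this guide):
$ solrctl instancedir --update collection1 $HOME/solr_configs
$ solrctl collection --reload collection1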
List the instance directories
$ solrctl instancedir --list
Verify the instance directories in Zookeeper
$ /usr/lib/zookeeper/bin/zkCli.sh -server localhost
Then, at the zkCli prompt:
ls /solr/configs
Create a collection with one shard (-s 1) and list the collections.
$ solrctl collection --create collection1 -s 1
$ solrctl collection --list
Validate the Solr REST API
$ cd /usr/share/doc/solr-doc*/example/exampledocs
$ java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
Now, open http://localhost:8983/solr/#/collection1_shard1_replica1/query and start querying for various query strings and facets.
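The same queries can be issued over HTTP with curl, using the select handler of the collection created above (default port assumed):
$ curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json"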
Index Sample Twitter Data Using MapReduce
Create the skeleton of the instance dir.
$ solrctl instancedir --generate $HOME/solr_configs3
Copy the schema specific to the Twitter data
$ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml $HOME/solr_configs3/conf
Upload the instance dir to Zookeeper
$ solrctl instancedir --create collection3 $HOME/solr_configs3
Create a new collection
$ solrctl collection --create collection3 -s 1
Create a data directory in HDFS, into which we will upload a sample Twitter dataset.
$ hadoop fs -mkdir indir
Upload the sample Twitter data into the HDFS directory. Note that the data is in Avro format.
$ hadoop fs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro indir
$ hadoop fs -ls indir
Create an empty output dir
$ hadoop fs -rm -r -skipTrash outdir
$ hadoop fs -mkdir outdir
$ hadoop fs -ls outdir
Delete existing index data, if there is any.
$ solrctl collection --deletedocs collection3
Run the indexing job. The indexing job depends on the Morphlines library (part of the Kite SDK) to parse and transform the input records; see the Kite SDK documentation for details.
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --help
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf --output-dir hdfs://localhost:8020/user/cloudera/outdir --verbose --go-live --zk-host localhost:2181/solr --collection collection3 hdfs://localhost:8020/user/cloudera/indir
Now, open Solr and run queries.
http://localhost:8983/solr/#/collection3_shard1_replica1/query
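To check how many tweets were indexed, run a quick count query (default port assumed; numFound in the response is the total number of documents):
$ curl "http://localhost:8983/solr/collection3/select?q=*:*&rows=0&wt=json"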
Clean up
$ solrctl collection --deletedocs collection3
$ solrctl collection --delete collection3
$ solrctl instancedir --delete collection3
$ hadoop fs -rm -r indir
Index HDFS Data using LucidWorks
The LucidWorks hadoop-solr project (https://github.com/LucidWorks/hadoop-solr) includes tools to build a Hadoop job jar that can index documents from HDFS into Solr.
Supported versions
Solr 5.x
Hadoop 2.x
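A rough sketch of how the resulting job jar is typically run, based on the project's README; the mapper class, flags, input path, collection name, and Zookeeper address below are assumptions to verify against the version you build:
$ hadoop jar solr-hadoop-job-*.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c collection1 -i /user/cloudera/csvdata/*.csv -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk localhost:2181/solr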
Query Types
More on query patterns: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
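A few common patterns from that syntax, written as values for the q parameter. The field names are taken from the SFPD schema defined in the next section, so treat them as assumptions until that schema is in place:
Category:ASSAULT - match a single field value
Category:ASSAULT AND DayOfWeek:Friday - combine clauses with boolean operators
text:"grand theft" - phrase query
PdDistrict:SOUTH* - wildcard query
Date:[2003-01-01 TO 2003-12-31] - range query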
Indexing SFPD Crime Data
Get the SFPD crime data from https://data.sfgov.org/Public-Safety/Map-Crime-Incidents-from-1-Jan-2003/gxxq-x39z/data. The file is in CSV format; convert it to Avro before running the indexing job below (the conversion process is not covered here).
Create the skeleton of the instance dir.
$ solrctl instancedir --generate $HOME/solr_configs4
Copy the Twitter schema as a starting point for the SFPD fields
$ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml $HOME/solr_configs4/conf
Replace the document fields with the following
<field name="IncidntNum" type="long" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Category" type="string" indexed="true" stored="true" />
<field name="text" type="string" indexed="true" stored="true" />
<field name="DayOfWeek" type="string" indexed="true" stored="true" />
<field name="Date" type="string" indexed="true" stored="true" />
<field name="Time" type="string" indexed="true" stored="true" />
<field name="PdDistrict" type="string" indexed="true" stored="true" />
<field name="Resolution" type="string" indexed="true" stored="true" />
<field name="Address" type="string" indexed="true" stored="true" />
<field name="X" type="string" indexed="true" stored="true" />
<field name="Y" type="string" indexed="true" stored="true" />
<field name="Location" type="string" indexed="true" stored="true" />
Copy the following morphline file to the local filesystem
$ cp /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf ~/SfpdMorphLine.conf
Edit SfpdMorphLine.conf and replace the extractAvroPaths command's field mapping with the following
{
  extractAvroPaths {
    flatten : false
    paths : {
      IncidntNum : /IncidntNum
      Category : /Category
      text : /Descript
      DayOfWeek : /DayOfWeek
      Date : /Date
      Time : /Time
      PdDistrict : /PdDistrict
      Resolution : /Resolution
      Address : /Address
      X : /X
      Y : /Y
      Location : /Location
    }
  }
}
Upload the instance dir to Zookeeper
$ solrctl instancedir --create collection4 $HOME/solr_configs4
Create a new collection
$ solrctl collection --create collection4 -s 1
Create an empty output dir
$ hadoop fs -rm -r -skipTrash outdir
$ hadoop fs -mkdir outdir
$ hadoop fs -ls outdir
Delete existing index data, if there is any.
$ solrctl collection --deletedocs collection4
Run the indexing job. As before, the indexing job depends on the Morphlines library (part of the Kite SDK) to parse and transform the input records.
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file ~/SfpdMorphLine.conf --output-dir hdfs://localhost:8020/user/cloudera/outdir --verbose --go-live --zk-host localhost:2181/solr --collection collection4 hdfs://localhost:8020/user/cloudera/sfpd.avro
Now, open Solr and run queries.
http://localhost:8983/solr/#/collection4_shard1_replica1/query
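A faceted query over the crime categories can also be run directly over HTTP, using the collection created above (default port assumed):
$ curl "http://localhost:8983/solr/collection4/select?q=*:*&rows=0&facet=true&facet.field=Category&wt=json"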
If you make any change to the schema, run the following commands for the new config to take effect.
$ hadoop fs -rm -r -skipTrash outdir
$ hadoop fs -mkdir outdir
$ solrctl instancedir --update collection4 $HOME/solr_configs4
$ solrctl collection --reload collection4
$ solrctl collection --deletedocs collection4
Rebuild the index using the above hadoop jar command.