Apache Solr

What is Apache Solr?

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.

Related/Adjacent projects:

  • Apache Lucene is a high-performance, full-featured text search engine library written in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  • Apache Nutch is a well-matured, production-ready web crawler.
  • Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Environment:

The Cloudera QuickStart VM, downloaded from Cloudera's website.

Dataset:

I used the San Francisco Police Department incidents dataset, which is available at https://data.sfgov.org. Unzip it and upload it to HDFS using Hue or the HDFS command-line tool.
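The unzip-and-upload step might look like the recipe below. The zip name is hypothetical (derived from the CSV name that appears later in this walkthrough), so adjust it to the actual download; the hadoop commands need the quickstart VM's HDFS running, which is why the block only prints them.

```shell
# Recipe for the unzip-and-upload step. The zip name is a guess
# (derived from the CSV name); HDFS must be up for the hadoop commands,
# so they are printed here rather than executed.
cat <<'EOF'
unzip Map__Crime_Incidents_-_from_1_Jan_2003.csv.zip
hadoop fs -mkdir -p sfpd
hadoop fs -put Map__Crime_Incidents_-_from_1_Jan_2003.csv sfpd/
EOF
```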

Log in to the quickstart VM, open a terminal, and follow the steps below.

$ hadoop fs -ls sfpd/
Found 1 items
-rw-r--r-- 1 cloudera cloudera 382725387 2018-02-16 23:33 sfpd/Map__Crime_Incidents_-_from_1_Jan_2003.csv

View a few lines from the file.

$ hadoop fs -text sfpd/* | head
IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location
050436712,ASSAULT,BATTERY,Wednesday,04/20/2005 12:00:00 AM,04:00,MISSION,NONE,18TH ST / CASTRO ST,-122.435002864271,37.7608878061245,"(37.7608878061245, -122.435002864271)"
080049078,LARCENY/THEFT,GRAND THEFT FROM A BUILDING,Sunday,01/13/2008 12:00:00 AM,18:00,PARK,NONE,1100 Block of CLAYTON ST,-122.446837820235,37.7622550270122,"(37.7622550270122, -122.446837820235)"
130366639,ASSAULT,AGGRAVATED ASSAULT WITH A KNIFE,Sunday,05/05/2013 12:00:00 AM,04:10,INGLESIDE,"ARREST, BOOKED",0 Block of SGTJOHNVYOUNG LN,-122.444707063455,37.7249307267936,"(37.7249307267936, -122.444707063455)"
030810835,DRIVING UNDER THE INFLUENCE,DRIVING WHILE UNDER THE INFLUENCE OF ALCOHOL,Tuesday,07/08/2003 12:00:00 AM,01:00,SOUTHERN,"ARREST, BOOKED",MASON ST / TURK ST,-122.408953598286,37.7832878735491,"(37.7832878735491, -122.408953598286)"
130839567,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Friday,10/04/2013 12:00:00 AM,20:53,TENDERLOIN,"ARREST, BOOKED",TURK ST / LEAVENWORTH ST,-122.414056291891,37.7827931071006,"(37.7827931071006, -122.414056291891)"
070838580,BURGLARY,"BURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRY",Tuesday,08/14/2007 12:00:00 AM,07:00,NORTHERN,NONE,3100 Block of FRANKLIN ST,-122.426730544229,37.8034674969672,"(37.8034674969672, -122.426730544229)"
080233102,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Tuesday,03/04/2008 12:00:00 AM,14:23,INGLESIDE,"ARREST, CITED",MISSION ST / PERSIA AV,-122.43597721703,37.7231288306727,"(37.7231288306727, -122.43597721703)"
060711805,OTHER OFFENSES,"DRIVERS LICENSE, SUSPENDED OR REVOKED",Wednesday,07/05/2006 12:00:00 AM,15:50,INGLESIDE,"ARREST, CITED",2300 Block of SAN JOSE AV,-122.447241159611,37.7201577971255,"(37.7201577971255, -122.447241159611)"
040062593,LARCENY/THEFT,GRAND THEFT FROM A BUILDING,Wednesday,12/10/2003 12:00:00 AM,09:30,INGLESIDE,NONE,0 Block of MOFFITT ST,-122.432787775164,37.7371566745272,"(37.7371566745272, -122.432787775164)"

The data is in CSV format and has 12 columns.

We can consider the following data types for each column to get started.

Note: Solr's out-of-the-box field types are documented at https://lucene.apache.org/solr/guide/6_6/field-types-included-with-solr.html

$ hadoop fs -text sfpd/Map__Crime_Incidents_-_from_1_Jan_2003.csv | wc -l
1888568

Total number of records: 1888568, which includes the header.

The Location column contains a comma within its quoted value. Taking that into account, let's validate whether all rows have 12 + 1 comma-delimited parts.

$ hadoop fs -text sfpd/* | awk -F "," 'NF !=13 {print $0}' | head
IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location
130366639,ASSAULT,AGGRAVATED ASSAULT WITH A KNIFE,Sunday,05/05/2013 12:00:00 AM,04:10,INGLESIDE,"ARREST, BOOKED",0 Block of SGTJOHNVYOUNG LN,-122.444707063455,37.7249307267936,"(37.7249307267936, -122.444707063455)"
030810835,DRIVING UNDER THE INFLUENCE,DRIVING WHILE UNDER THE INFLUENCE OF ALCOHOL,Tuesday,07/08/2003 12:00:00 AM,01:00,SOUTHERN,"ARREST, BOOKED",MASON ST / TURK ST,-122.408953598286,37.7832878735491,"(37.7832878735491, -122.408953598286)"
130839567,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Friday,10/04/2013 12:00:00 AM,20:53,TENDERLOIN,"ARREST, BOOKED",TURK ST / LEAVENWORTH ST,-122.414056291891,37.7827931071006,"(37.7827931071006, -122.414056291891)"
070838580,BURGLARY,"BURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRY",Tuesday,08/14/2007 12:00:00 AM,07:00,NORTHERN,NONE,3100 Block of FRANKLIN ST,-122.426730544229,37.8034674969672,"(37.8034674969672, -122.426730544229)"
080233102,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Tuesday,03/04/2008 12:00:00 AM,14:23,INGLESIDE,"ARREST, CITED",MISSION ST / PERSIA AV,-122.43597721703,37.7231288306727,"(37.7231288306727, -122.43597721703)"
060711805,OTHER OFFENSES,"DRIVERS LICENSE, SUSPENDED OR REVOKED",Wednesday,07/05/2006 12:00:00 AM,15:50,INGLESIDE,"ARREST, CITED",2300 Block of SAN JOSE AV,-122.447241159611,37.7201577971255,"(37.7201577971255, -122.447241159611)"
110051822,NON-CRIMINAL,"STAY AWAY OR COURT ORDER, NON-DV RELATED",Monday,01/17/2011 12:00:00 AM,15:35,INGLESIDE,NONE,600 Block of CAMPBELL AV,-122.408761072232,37.7159000951041,"(37.7159000951041, -122.408761072232)"
140196921,"SEX OFFENSES, FORCIBLE",ASSAULT TO RAPE WITH BODILY FORCE,Monday,02/17/2014 12:00:00 AM,14:30,INGLESIDE,COMPLAINANT REFUSES TO PROSECUTE,600 Block of LONDON ST,-122.43792838007,37.7193276406568,"(37.7193276406568, -122.43792838007)"
140902790,OTHER OFFENSES,CONSPIRACY,Saturday,10/25/2014 12:00:00 AM,00:01,MISSION,"ARREST, BOOKED",MISSION ST / 20TH ST,-122.419052694349,37.7586324051562,"(37.7586324051562, -122.419052694349)"

In the records above, other fields (Resolution, Category, Description) also contain commas inside quoted values, which is why these rows have more than 13 comma-delimited parts. Anyway, it is just an observation.
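The same validation can be made quote-aware with standard tools: strip the quoted sections first, then count the comma-separated parts. A minimal sketch, demonstrated on one of the rows above (the HDFS variant is in the comment):

```shell
# Count fields while discounting commas inside double-quoted values.
# Against HDFS the full check would be:
#   hadoop fs -text sfpd/* | sed 's/"[^"]*"//g' | awk -F, 'NF != 12'
line='130366639,ASSAULT,AGGRAVATED ASSAULT WITH A KNIFE,Sunday,05/05/2013 12:00:00 AM,04:10,INGLESIDE,"ARREST, BOOKED",0 Block of SGTJOHNVYOUNG LN,-122.444707063455,37.7249307267936,"(37.7249307267936, -122.444707063455)"'
echo "$line" | sed 's/"[^"]*"//g' | awk -F, '{print NF}'
# prints 12: the row is well-formed once quoted commas are discounted
```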

Let's create a Solr instance directory template using the solrctl command.

[cloudera@quickstart ~]$ solrctl instancedir --generate sfpd
[cloudera@quickstart ~]$ cd sfpd 
[cloudera@quickstart sfpd]$ ls -l
total 4
drwxr-xr-x 6 cloudera cloudera 4096 Feb 16 08:17 conf

Update the conf/schema.xml file.

Delete the existing field, dynamic field, and copy field definitions: these are lines 265-287 (23 lines), 190-231 (42 lines), and 109-171 (63 lines). Delete from the bottom first, in the order listed, so that the line numbers remain applicable.
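The three deletions can also be done in a single pass, since sed matches every range against the original line numbering. A sketch (verify the ranges against your own generated schema.xml first):

```shell
# All three ranges in one sed invocation; each range refers to the
# ORIGINAL line numbers, so their order in the expression does not matter:
#   sed -i.bak '109,171d;190,231d;265,287d' conf/schema.xml
# Sanity check on a 300-line stand-in: 300 - 63 - 42 - 23 = 172 lines remain.
seq 1 300 | sed '109,171d;190,231d;265,287d' | wc -l
# prints 172
```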

Add the following lines within the <fields> element.

<field name="IncidentNum" type="string" indexed="true" stored="true" />
<field name="Category" type="string" indexed="true" stored="true" multiValued="true" />
<field name="Description" type="string" indexed="true" stored="true" multiValued="true" />
<field name="DayOfWeek" type="string" indexed="true" stored="true" />
<field name="Date" type="tdate" indexed="true" stored="true" />
<field name="Time" type="string" indexed="true" stored="true" />
<field name="District" type="string" indexed="true" stored="true" />
<field name="Resolution" type="string" indexed="true" stored="true" multiValued="true" />
<field name="Address" type="string" indexed="true" stored="true" />
<field name="Location" type="location_rpt" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />

Set the uniqueKey to IncidentNum.

<uniqueKey>IncidentNum</uniqueKey>

Add the copyField tags.

<copyField source="IncidentNum" dest="text"/>
<copyField source="Category" dest="text"/>
<copyField source="Description" dest="text"/>
<copyField source="DayOfWeek" dest="text"/>
<copyField source="Time" dest="text"/>
<copyField source="Resolution" dest="text"/>
<copyField source="Address" dest="text"/>

Check the position of the above tags in the template. If necessary, keep a copy of the original schema.xml for reference.

Create the ZooKeeper entry for the collection.

$ solrctl --zk localhost:2181/solr instancedir --create sfpd ./sfpd

Create the collection in SolrCloud.

$ solrctl --zk localhost:2181/solr collection --create sfpd

Create a morphline configuration file. You can start from an existing conf file; one is attached at the bottom of this page.
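In case the attachment is unavailable, below is a minimal sketch of what such a morphlines.conf might look like. It is reconstructed from the tester output that follows, not the original file: the CSV columns are mapped straight onto the Solr field names, and the Date conversion from America/Los_Angeles local time to UTC matches the timestamps the tester prints. The sanitizeUnknownSolrFields step (which would drop the X and Y columns absent from schema.xml) is an assumption.

```
# Sketch only: reconstructed, not the attached original.
SOLR_LOCATOR : {
  collection : sfpd
  zkHost : "localhost:2181/solr"
}

morphlines : [
  {
    id : sfpd
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          # CSV columns renamed directly to the Solr field names
          columns : [IncidentNum,Category,Description,DayOfWeek,Date,Time,District,Resolution,Address,X,Y,Location]
          quoteChar : "\""
          ignoreFirstLine : true
          charset : UTF-8
        }
      }
      {
        convertTimestamp {
          field : Date
          inputFormats : ["MM/dd/yyyy hh:mm:ss a"]
          inputTimezone : America/Los_Angeles
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }
      # Drop fields not declared in schema.xml (X and Y), then load into Solr
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```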

$ morphtester sfpd/conf/morphlines.conf sfpd.sample.csv 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Displaying Record #0
-----------------------------------------------
   'Address'      => '[18TH ST / CASTRO ST]'
   'Category'     => '[ASSAULT]'
   'Date'         => '[2005-04-20T07:00:00.000Z]'
   'DayOfWeek'    => '[Wednesday]'
   'Description'  => '[BATTERY]'
   'District'     => '[MISSION]'
   'IncidentNum'  => '[050436712]'
   'Location'     => '[(37.7608878061245, -122.435002864271)]'
   'Resolution'   => '[NONE]'
   'Time'         => '[04:00]'
   'X'            => '[-122.435002864271]'
   'Y'            => '[37.7608878061245]'
Displaying Record #1
-----------------------------------------------
   'Address'      => '[1100 Block of CLAYTON ST]'
   'Category'     => '[LARCENY/THEFT]'
   'Date'         => '[2008-01-13T08:00:00.000Z]'
   'DayOfWeek'    => '[Sunday]'
   'Description'  => '[GRAND THEFT FROM A BUILDING]'
   'District'     => '[PARK]'
   'IncidentNum'  => '[080049078]'
   'Location'     => '[(37.7622550270122, -122.446837820235)]'
   'Resolution'   => '[NONE]'
   'Time'         => '[18:00]'
   'X'            => '[-122.446837820235]'
   'Y'            => '[37.7622550270122]'
Displaying Record #2
-----------------------------------------------
   'Address'      => '[0 Block of SGTJOHNVYOUNG LN]'
   'Category'     => '[ASSAULT]'
   'Date'         => '[2013-05-05T07:00:00.000Z]'
   'DayOfWeek'    => '[Sunday]'
   'Description'  => '[AGGRAVATED ASSAULT WITH A KNIFE]'
   'District'     => '[INGLESIDE]'
   'IncidentNum'  => '[130366639]'
   'Location'     => '[(37.7249307267936, -122.444707063455)]'
   'Resolution'   => '[ARREST, BOOKED]'
   'Time'         => '[04:10]'
   'X'            => '[-122.444707063455]'
   'Y'            => '[37.7249307267936]'

In case you want to make further changes and update the configs:

Update the instance directory in ZooKeeper and reload the collection.

$ solrctl --zk localhost:2181/solr instancedir --update sfpd ./sfpd
$ solrctl --zk localhost:2181/solr collection --reload sfpd

Note: to avoid passing the --zk parameter every time, you can set the following environment variable.

SOLR_ZK_ENSEMBLE=localhost:2181/solr

Run the indexer job. If you run it multiple times, duplicate records could be created, so before re-running, delete the collection and create it again.

$ cp /usr/share/doc/search-*/examples/solr-nrt/log4j.properties .
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--zk localhost:2181/solr \
--collection sfpd \
--morphline-file sfpd/conf/morphlines.conf \
--go-live \
--output hdfs://localhost:8020/tmp/sfpd_out \
--log4j log4j.properties \
hdfs://localhost:8020/user/cloudera/sfpd/

After the load is complete, check the Overview tab in the Solr Admin UI to see the number of docs.

In the file, there are 1888568 records (including the header). There must be some duplicates in the IncidntNum field. Let's check, using Spark, how many distinct values there are in the IncidntNum column.

As we can see, there are 1493323 distinct IncidntNum values, which is close to the 1493332 docs that the Solr Admin UI is showing.
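The distinct count can also be obtained without Spark, because IncidntNum is the first column and the embedded commas only occur in later, quoted fields. A sketch, demonstrated on an inline sample (the HDFS variant is in the comment):

```shell
# Distinct count of the first CSV column. Against HDFS this would be:
#   hadoop fs -text sfpd/* | tail -n +2 | cut -d, -f1 | sort -u | wc -l
# Demonstrated on an inline sample with one duplicated incident number:
printf '%s\n' \
  'IncidntNum,Category' \
  '050436712,ASSAULT' \
  '050436712,BATTERY' \
  '080049078,LARCENY/THEFT' \
| tail -n +2 | cut -d, -f1 | sort -u | wc -l
# prints 2
```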

Check the size of the index and compare with original data size.

[cloudera@quickstart ~]$ hadoop fs -du -s -h /solr/sfpd
256.1 M 256.1 M /solr/sfpd
[cloudera@quickstart ~]$ hadoop fs -du -s -h sfpd
365.0 M 365.0 M sfpd
[cloudera@quickstart ~]$

What is precisionStep?

The precisionStep is a count: after how many bits of the indexed value a new term starts. The original value is always indexed in full precision. A precision step of 4 for a 32-bit value (integer) means terms with these bit counts: all 32 bits, the left 28, left 24, left 20, left 16, left 12, left 8, and left 4 bits of the value (8 terms per value in total). A precision step of 26 would index 2 terms: all 32 bits, plus a single term with the leftmost 6 bits.
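The 8 terms for precisionStep=4 on a 32-bit value can be sketched with shell arithmetic; each term zeroes 4 more low-order bits than the previous one (the sample value 0xDEADBEEF is arbitrary):

```shell
# precisionStep=4 on a 32-bit value: 8 terms, each keeping 4 fewer
# low-order bits. 0xDEADBEEF is just an arbitrary sample value.
v=$(( 0xDEADBEEF ))
for shift in 0 4 8 12 16 20 24 28; do
  printf 'keep %2d bits -> 0x%08X\n' $(( 32 - shift )) $(( (v >> shift) << shift ))
done
```

The first line keeps all 32 bits (the full-precision term, 0xDEADBEEF); the last keeps only the left 4 bits (0xD0000000).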

What is _version_ field?

The _version_ field is an internal field used by the partial update procedure, the update log, and SolrCloud. It is used only internally by those processes; simply declaring the _version_ field in your schema.xml is sufficient.

Lucene Query Syntax

{!lucene df=title q.op=$myop} "phrase query slop"~2 w?ldcard* fuzzzy~0.7 -(updatedAt:[* TO NOW/DAY-2YEAR] +boostMe^5)

Supported operators: AND, OR, &&, ||, NOT, +, -

Faceting

  • Field-specific parameter (works for highlighting too): f.myfieldname.facet.mincount=1
  • Field value faceting: facet=on, facet.field=myfieldname, facet.sort=count (count, index), facet.limit=100, facet.offset=0, facet.mincount=0, facet.missing=off, facet.prefix, facet.method (enum, fc, or fcs)
  • Range faceting: facet=on, facet.range=myfieldname, facet.range.start, facet.range.end, facet.range.gap (for example, +1DAY), facet.range.hardend=off, facet.range.other=off, facet.range.include=lower (lower, upper, edge, outer, or all)
  • Facet queries: facet=on, facet.query
  • Facet pivots: facet.pivot=field1,field2,field3
  • Facet keys: facet.field={!key=Type}r_type
  • Filter exclusion: fq={!tag=r_type}r_type:Album&facet.field={!ex=r_type}r_type
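As a concrete example, a field-value facet over Category on the sfpd collection could be requested as below (hypothetical; host and port assume the quickstart defaults, and the block echoes the command so the parameters stay visible):

```shell
# Build a facet request for the sfpd collection (quickstart defaults
# assumed for host/port). Run the echoed curl against a live Solr.
url='http://localhost:8983/solr/sfpd/select'
params='q=*:*&rows=0&wt=json&facet=on&facet.field=Category&facet.mincount=1&facet.limit=10'
echo "curl '${url}?${params}'"
```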

Highlighting

Components: hl=off, hl.fl, hl.requireFieldMatch=off, hl.usePhraseHighlighter=off (on is recommended), hl.highlightMultiTerm=off, hl.snippets=1, hl.fragsize=100, and hl.mergeContiguous=off.

Spell Check

Search Components:

  • spellcheck=off
  • spellcheck.dictionary=default
  • spellcheck.q (alternative to q)
  • spellcheck.count=1
  • spellcheck.onlyMorePopular=off
  • spellcheck.extendedResults=off
  • spellcheck.collate=off
  • spellcheck.maxCollations=1
  • spellcheck.maxCollationTries=0
  • spellcheck.maxCollationEvaluations=10000
  • spellcheck.collateExtendedResults=off