Solr: Indexing Using Spark

Goal: clean up messy data using Spark and load the cleaned data into Solr for search.

Environment:

Cloudera QuickStart VM 5.12

Solr 4.10

Set the environment variable that tells solrctl where to find the ZooKeeper ensemble.

$ export SOLR_ZK_ENSEMBLE=localhost:2181/solr

Generate a template instance directory named winereview in the current directory.

$ solrctl instancedir --generate winereview

Edit winereview/conf/schema.xml so that it defines the fields, unique key, and copy fields below.

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <field name="country" type="text_general" indexed="true" stored="true"/>
  <field name="description" type="text_general" indexed="true" stored="true"/>
  <field name="designation" type="text_general" indexed="true" stored="true"/>
  <field name="points" type="tint" indexed="true" stored="true"/>
  <field name="price" type="tdouble" indexed="true" stored="true"/>
  <field name="province" type="text_general" indexed="true" stored="true"/>
  <field name="region_1" type="text_general" indexed="true" stored="true"/>
  <field name="region_2" type="text_general" indexed="true" stored="true"/>
  <field name="taster_name" type="text_general" indexed="true" stored="true"/>
  <field name="taster_twitter_handle" type="text_general" indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="variety" type="text_general" indexed="true" stored="true"/>
  <field name="winery" type="text_general" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>

<uniqueKey>id</uniqueKey>

<copyField source="country" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="designation" dest="text"/>
<copyField source="province" dest="text"/>
<copyField source="region_1" dest="text"/>
<copyField source="region_2" dest="text"/>
<copyField source="taster_name" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="variety" dest="text"/>
<copyField source="winery" dest="text"/>
<copyField source="points" dest="text"/>
<copyField source="price" dest="text"/>
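
Because points is declared as tint and price as tdouble, values that are not valid numbers have to be fixed or dropped before indexing. The following is a minimal sketch, in plain Java with no Spark or SolrJ dependency, of the kind of per-record cleaning a Spark job could apply. The class WineRecordCleaner, its method names, and the sample values are hypothetical and are not taken from the linked repository; it assumes points and price arrive as strings that may be empty, padded with spaces, or non-numeric.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the linked repository.
class WineRecordCleaner {

    // Parses a value for the "points" field (tint). Returns null if unusable.
    static Integer parsePoints(String raw) {
        if (raw == null) return null;
        String s = raw.trim();
        if (s.isEmpty()) return null;
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Parses a value for the "price" field (tdouble), tolerating a leading "$".
    static Double parsePrice(String raw) {
        if (raw == null) return null;
        String s = raw.trim();
        if (s.startsWith("$")) s = s.substring(1).trim();
        if (s.isEmpty()) return null;
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Trims a text value; empty strings become null so the field is omitted.
    static String cleanText(String raw) {
        if (raw == null) return null;
        String s = raw.trim();
        return s.isEmpty() ? null : s;
    }

    // Builds a map keyed by schema field names, leaving out missing values.
    // Only a few of the schema's fields are shown to keep the sketch short.
    static Map<String, Object> toDocument(String id, String country, String title,
                                          String points, String price) {
        Map<String, Object> doc = new LinkedHashMap<>();
        putIfPresent(doc, "id", cleanText(id));
        putIfPresent(doc, "country", cleanText(country));
        putIfPresent(doc, "title", cleanText(title));
        putIfPresent(doc, "points", parsePoints(points));
        putIfPresent(doc, "price", parsePrice(price));
        return doc;
    }

    private static void putIfPresent(Map<String, Object> doc, String field, Object value) {
        if (value != null) doc.put(field, value);
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the call shape.
        System.out.println(toDocument("1", " US ", "Sample Pinot Noir", "87", "$65.0"));
    }
}
```

Dropping a missing numeric value, rather than sending an empty string, is one possible choice here: Solr would reject an empty string for a tint or tdouble field, while an absent optional field is accepted.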

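Upload the collection configuration to ZooKeeper. Note that --update is intended for an instance directory that already exists in ZooKeeper; for a first upload, use --create instead.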

$ solrctl instancedir --update winereview winereview

Create the collection in SolrCloud.

$ solrctl collection --create winereview
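
Once the collection exists, one quick way to check that the schema was picked up is to send a single test document to the collection's update handler. A rough sketch using only the JDK follows. The class SolrSmokeTest, the host and port (localhost:8983), and the sample field values are assumptions for illustration; the linked repository may index differently, for example through SolrJ.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical smoke test for illustration only; not part of the linked repository.
class SolrSmokeTest {

    // Quotes a string for JSON, escaping quotes, backslashes, and \n, \r, \t.
    // Other control characters are passed through unchanged in this sketch.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Serializes one flat document as a JSON array containing a single object.
    // Numbers are written bare (finite values assumed); everything else is quoted.
    static String toJson(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("[{");
        boolean first = true;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append(quote(e.getKey())).append(':');
            Object v = e.getValue();
            sb.append(v instanceof Number ? v.toString() : quote(String.valueOf(v)));
        }
        return sb.append("}]").toString();
    }

    // POSTs the JSON body to the given update URL and returns the HTTP status code.
    static int post(String updateUrl, String json) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        // Made-up test document using field names from the schema.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "test-1");
        doc.put("title", "Sample wine");
        doc.put("points", 90);
        doc.put("price", 20.0);
        // Assumed endpoint: Solr on localhost:8983, collection "winereview".
        String url = "http://localhost:8983/solr/winereview/update?commit=true";
        System.out.println("HTTP " + post(url, toJson(doc)));
    }
}
```

If the request succeeds, the test document should be searchable in the winereview collection; remember to delete it before loading the real data.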

The code example is available here:

https://github.com/abulbasar/SparkSolrJavaExamples