Solr: Indexing using Spark
Goal: clean up messy data using Spark and load the cleaned data into Solr for search.
Environment:
Cloudera Quickstart VM 5.12.
Solr 4.10
Set the environment variable for the ZooKeeper ensemble.
$ export SOLR_ZK_ENSEMBLE=localhost:2181/solr
Generate a template instance directory for the collection.
$ solrctl instancedir --generate winereview
Update the schema (winereview/conf/schema.xml in the generated directory) as below.
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="country" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="designation" type="text_general" indexed="true" stored="true"/>
<field name="points" type="tint" indexed="true" stored="true"/>
<field name="price" type="tdouble" indexed="true" stored="true"/>
<field name="province" type="text_general" indexed="true" stored="true"/>
<field name="region_1" type="text_general" indexed="true" stored="true"/>
<field name="region_2" type="text_general" indexed="true" stored="true"/>
<field name="taster_name" type="text_general" indexed="true" stored="true"/>
<field name="taster_twitter_handle" type="text_general" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="variety" type="text_general" indexed="true" stored="true"/>
<field name="winery" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="country" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="designation" dest="text"/>
<copyField source="province" dest="text"/>
<copyField source="region_1" dest="text"/>
<copyField source="region_2" dest="text"/>
<copyField source="taster_name" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="variety" dest="text"/>
<copyField source="winery" dest="text"/>
<copyField source="points" dest="text"/>
<copyField source="price" dest="text"/>
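The schema above fixes the type of every field, so each cleaned record has to match it before it is sent to Solr: "id" is required, "points" is an integer (tint) and "price" is a double (tdouble). The following is a minimal sketch of that mapping in plain Python. The function names and the cleaning rules (trim whitespace, drop empty or unparsable values) are assumptions made for illustration, not the cleaning logic used later in this post.

```python
from typing import Any, Callable, Dict, Optional

# Field names mirror the schema above; the cleaning rules are assumptions.
TEXT_FIELDS = [
    "country", "description", "designation", "province", "region_1",
    "region_2", "taster_name", "taster_twitter_handle", "title",
    "variety", "winery",
]


def _clean_text(value: Any) -> str:
    """Return the value as a trimmed string, or '' when it is missing."""
    return "" if value is None else str(value).strip()


def _to_number(value: Any, cast: Callable[[str], Any]) -> Optional[Any]:
    """Return cast(value), or None when the value is missing or malformed."""
    try:
        return cast(_clean_text(value))
    except ValueError:
        return None


def to_solr_doc(row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build a document matching the winereview schema, or None without an id."""
    doc_id = _clean_text(row.get("id"))
    if not doc_id:
        return None  # "id" is required="true" in the schema
    doc: Dict[str, Any] = {"id": doc_id}
    for name in TEXT_FIELDS:
        text = _clean_text(row.get(name))
        if text:
            doc[name] = text
    points = _to_number(row.get("points"), int)   # tint in the schema
    price = _to_number(row.get("price"), float)   # tdouble in the schema
    if points is not None:
        doc["points"] = points
    if price is not None:
        doc["price"] = price
    return doc
```

The "text" field is left out on purpose: Solr fills it from the copyField rules at index time, so the client never sends it.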
Upload the collection configuration to ZooKeeper. If the instance directory has never been uploaded before, use --create in place of --update.
$ solrctl instancedir --update winereview winereview
Create the collection in SolrCloud.
$ solrctl collection --create winereview
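Once the collection exists, it can be smoke-tested by indexing one document and searching for it through the catch-all "text" field. The sketch below uses only the Python standard library. The base URL (localhost, port 8983) is an assumption about the Quickstart VM, and the sample document is made up for illustration. Nothing is sent to Solr until smoke_test() is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: Solr is reachable on its default port on the Quickstart VM.
SOLR_URL = "http://localhost:8983/solr/winereview"


def build_update_request(docs):
    """Prepare a JSON update request that also commits the documents."""
    return Request(
        SOLR_URL + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def build_select_url(query, rows=5):
    """Prepare a query URL for the /select handler with a JSON response."""
    return SOLR_URL + "/select?" + urlencode({"q": query, "wt": "json", "rows": rows})


def smoke_test():
    """Index one made-up review and return how many documents match it."""
    sample = [{"id": "smoke-1", "title": "Sample Red 2015", "points": 90, "price": 25.0}]
    urlopen(build_update_request(sample)).close()
    with urlopen(build_select_url("text:sample")) as resp:
        return json.load(resp)["response"]["numFound"]
```

Searching on "text" rather than "title" also checks that the copyField rules from the schema took effect.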
Find the code example below.