Solr: Index PDF, Word, etc. (Tika)
Goal: Index PDF documents to make them searchable using Solr.
The following technique can be used to index PDF documents in bulk, in one or more batches. Later, we will see how to use a request handler to index a document on demand.
Download some PDFs and upload them to HDFS. Below is an example.
$ wget http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA6-1049EEP.pdf -O sample1.pdf
$ hadoop fs -mkdir pdfs
$ hadoop fs -put sample1.pdf pdfs
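If there are more than a handful of PDFs, a small shell loop stages them all in one step. This is a minimal sketch; local-pdfs is a hypothetical local directory of already-downloaded PDFs, not something created above.
# local-pdfs is an assumed local directory containing the PDFs to index
$ for f in local-pdfs/*.pdf; do hadoop fs -put "$f" pdfs/; done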
Using the solrctl command, generate an instance directory (collection configuration template). The name of the collection is docs.
$ export NAME=docs
$ export SOLR_ZK_ENSEMBLE=localhost:2181/solr
Since SOLR_ZK_ENSEMBLE is set as an environment variable, we can omit the --zk argument in the solrctl commands below.
$ solrctl instancedir --generate $NAME
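The generated directory contains the configuration files edited in the next steps; a quick listing confirms they are in place (schema.xml and solrconfig.xml should appear among the generated files).
$ ls $NAME/conf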
In the $NAME/conf/schema.xml file, remove the existing fields if required and add the following.
<field name="content" type="text_general" indexed="true" stored="true" />
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
Create a morphlines configuration file, $NAME/conf/morphlines.conf, with the content below.
solrLocator : {
  collection : docs
  zkHost : "127.0.0.1:2181/solr"
  batchSize : 100
}

morphlines : [
  {
    id : morphlinepdfs
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      { detectMimeType { includeDefaultMimeTypes : true } }
      {
        solrCell {
          solrLocator : ${solrLocator}
          captureAttr : true
          lowernames : true
          capture : [id, title, author, content, content_type, subject, description, keywords, category, resourcename, url, last_modified, links]
          parsers : [ { parser : org.apache.tika.parser.pdf.PDFParser } ]
        }
      }
      { generateUUID { field : id } }
      { sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }
      { loadSolr { solrLocator : ${solrLocator} } }
    ]
  }
]
Upload the configuration to ZooKeeper.
$ solrctl instancedir --create $NAME $NAME
Create a Solr collection.
$ solrctl collection --create $NAME
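To confirm that the instance directory was uploaded and the collection was created, solrctl can list both:
$ solrctl instancedir --list
$ solrctl collection --list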
Run the MapReduce indexer job.
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--zk-host $SOLR_ZK_ENSEMBLE \
--collection $NAME \
--morphline-file $NAME/conf/morphlines.conf \
--go-live \
--output-dir hdfs://localhost:8020/tmp/${NAME}_out \
--verbose \
hdfs://localhost:8020/user/cloudera/pdfs
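As an aside, for debugging the morphline it can help to run the tool with --dry-run before a full run; this executes the morphline locally and prints the resulting documents to stdout instead of loading them into Solr. A sketch: the same command as above, with --go-live swapped for --dry-run.
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--zk-host $SOLR_ZK_ENSEMBLE \
--collection $NAME \
--morphline-file $NAME/conf/morphlines.conf \
--dry-run \
--output-dir hdfs://localhost:8020/tmp/${NAME}_out \
hdfs://localhost:8020/user/cloudera/pdfs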
Open the Solr UI; you should find the document.
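If you prefer the command line to the UI, a match-all query against the collection should also return the indexed document (host and port are the defaults assumed throughout this post):
$ curl 'http://localhost:8983/solr/docs/select?q=*:*&wt=json&rows=1'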
Using a Custom Extraction Request Handler
This method can be used to index a PDF or Word document on demand.
Add the following dynamic field to the schema - $NAME/conf/schema.xml.
<dynamicField name="ignored_*" type="text_general" indexed="true" stored="true"/>
Add the following request handler to $NAME/conf/solrconfig.xml.
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
Also, near the top of the $NAME/conf/solrconfig.xml file, add the following library locations. If the jars are not present, download them and place them at these paths.
<lib path="/usr/lib/solr/solr-cell.jar" />
<lib path="/usr/lib/solr/tika-core-1.9.jar" />
<lib path="/usr/lib/solr/apache-xml-xerces.jar" />
Update the instance directory and reload the collection.
$ solrctl instancedir --update $NAME $NAME
$ solrctl collection --reload $NAME
Using curl, send one document for indexing.
$ curl 'http://localhost:8983/solr/docs/update/extract?literal.id=doc10&commit=true' -F "myfile=@sample.pdf"
Open the Solr UI; you should be able to find the document.
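If the document does not show up, the extraction handler can return Tika's output without indexing anything, which helps separate parsing problems from schema problems. A minimal sketch against the same endpoint:
$ curl 'http://localhost:8983/solr/docs/update/extract?extractOnly=true' -F "myfile=@sample.pdf"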