DSE Search
Goal
Using DSE Cassandra and Solr, create a operational database that is searchable.
Note: this feature is not available in Apache Cassandra.
Start DSE Cassandra with Solr activated. You find the instructions to start in datastax site. https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/operations/startStop/startDseStandalone.html
Start the spark node with Spark Analytics, DSE Graph, and DSE Search node
$ bin/dse cassandra -k -g -s
For this demo, we will create table for movies and load data into the table and search.
Create a keyspace, table
cassandra@cqlsh> create KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cassandra@cqlsh> use demo;
cassandra@cqlsh:demo> create table movies (movieId int primary key, title text, genres text);
Load data into the table
cassandra@cqlsh:demo> COPY movies (movieId, title, genres) from '/data/movie_lens/ml-20m/movies.csv' WITH HEADER = true;
Using 7 child processes
Starting copy of demo.movies with columns [movieid, title, genres].
Processed: 27278 rows; Rate: 15285 rows/s; Avg. rate: 21541 rows/s
27278 rows imported from 1 files in 1.266 seconds (0 skipped).
Create search index on the table.
cqlsh:data> CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies;
See the full documentation on the search index option.
https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/cql_commands/cqlCreateSearchIndex.html
cassandra@cqlsh:demo> select * from demo.movies where solr_query = 'title:*' limit 4;
movieid | genres | solr_query | title
---------+---------------------------------------------+------------+--------------------------
1 | Adventure|Animation|Children|Comedy|Fantasy | null | Toy Story (1995)
2 | Adventure|Children|Fantasy | null | Jumanji (1995)
3 | Comedy|Romance | null | Grumpier Old Men (1995)
4 | Comedy|Drama|Romance | null | Waiting to Exhale (1995)
(4 rows)
DSE search by default is case sensitive. For example, look at following two queries.
cassandra@cqlsh:demo> select * from demo.movies where solr_query = 'title:juma*' limit 4;
movieid | genres | solr_query | title
---------+--------+------------+-------
(0 rows)
Above did not return any result, but the below one returns a result.
cassandra@cqlsh:demo> select * from demo.movies where solr_query = 'title:Juma*' limit 4;
movieid | genres | solr_query | title
---------+----------------------------+------------+----------------
2 | Adventure|Children|Fantasy | null | Jumanji (1995)
(1 rows)
To make dse search case insensitive you have to define customer filter (Solr filter) and subsequently change the field type .
Using dsetool, check the status of the indexing job.
$ dsetool core_indexing_status demo.movies
[demo.movies]: FINISHED
View solr schema that is created by default by DSE.
cassandra@cqlsh:demo> DESC ACTIVE SEARCH INDEX SCHEMA ON demo.movies;
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
<fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
</types>
<fields>
<field indexed="true" multiValued="false" name="title" type="StrField"/>
<field docValues="true" indexed="true" multiValued="false" name="movieid" type="TrieIntField"/>
<field indexed="true" multiValued="false" name="genres" type="StrField"/>
</fields>
<uniqueKey>movieid</uniqueKey>
</schema>
Let us change the field type for title column from StrField to TextField.
Drop the index from the title column.
cassandra@cqlsh:demo> ALTER SEARCH INDEX SCHEMA ON demo.movies DROP field title;
Verify that the column has been column in pending schema.
cassandra@cqlsh:demo> DESC PENDING SEARCH INDEX SCHEMA ON demo.movies;
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
<fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
</types>
<fields>
<field docValues="true" indexed="true" multiValued="false" name="movieid" type="TrieIntField"/>
<field indexed="true" multiValued="false" name="genres" type="StrField"/>
</fields>
<uniqueKey>movieid</uniqueKey>
</schema>
Add a new search field type of type TextField. [Not sure, why DSE does not include this data type by default in the valid data types]
cassandra@cqlsh:demo> ALTER SEARCH INDEX SCHEMA ON demo.movies ADD types.fieldType[@class='org.apache.solr.schema.TextField', @name='TextField'];
Add index on the title column with type as TextField
cassandra@cqlsh:demo> ALTER SEARCH INDEX SCHEMA ON demo.movies ADD fields.field[@name='title', @type='TextField', @multiValued='false' @indexed='true'];
Docs on solr field classes:
https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/solrFieldClasses.html
Verify the schema before activating it.
cassandra@cqlsh:demo> DESC PENDING SEARCH INDEX SCHEMA ON demo.movies;
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
<fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
<fieldType class="org.apache.solr.schema.TextField" name="TextField"/>
</types>
<fields>
<field docValues="true" indexed="true" multiValued="false" name="movieid" type="TrieIntField"/>
<field indexed="true" multiValued="false" name="genres" type="StrField"/>
<field indexed="true" name="title" type="TextField"/>
</fields>
<uniqueKey>movieid</uniqueKey>
</schema>
Activate the configuration and rebuild the index.
cassandra@cqlsh:demo> RELOAD SEARCH INDEX ON demo.movies;
cassandra@cqlsh:demo> REBUILD SEARCH INDEX ON demo.movies;
Try rest api using CURL.
$ curl "http://einext02:8983/solr/demo.movies/select?q=title:(Grump*+OR+Old*)&wt=json&rows=10&fl=title,movieid&indent=true"
{
"responseHeader":{
"status":0,
"QTime":8},
"response":{"numFound":22,"start":0,"docs":[
{
"movieid":27357,
"title":"Old Men in New Cars (Gamle mænd i nye biler) (2002)"},
{
"movieid":7582,
"title":"Old Acquaintance (1943)"},
{
"movieid":25786,
"title":"Old Dark House, The (1932)"},
{
"movieid":27773,
"title":"Old Boy (2003)"},
{
"movieid":113374,
"title":"Old Lady and the Pigeons, The (La vieille dame et les pigeons) (1997)"},
{
"movieid":117885,
"title":"Grumpy Cat's Worst Christmas Ever (2014)"},
{
"movieid":50259,
"title":"Old Joy (2006)"},
{
"movieid":3,
"title":"Grumpier Old Men (1995)"},
{
"movieid":107314,
"title":"Oldboy (2013)"},
{
"movieid":129849,
"title":"Old Man Drinking a Glass of Beer (1898)"}]
}}
The above query does the following
search for movies, for which title starts with Old or Grump
limiting the number of rows to ten
the response returns title and movieid
Output is in json format
Indent the output for easy reading
Here is blog for solr query parameters.
https://blog.einext.com/apache-solr/solr-query-parameters
Note: If you are running REST api remotely you have to enable remote access.
In cassandra.yaml, set the following two configuration to the IP address of the machine to make Solr Admin UI remotely accessible.
native_transport_address=...
native_transport_broadcast_address=...
Pros and cons of REST api
Pros - easy to use, portable
Cons - load balancing has to be planned