DSE Search

Goal

Using DSE Cassandra and Solr, create an operational database that is searchable.

Note: this feature is not available in Apache Cassandra.

Start DSE Cassandra with Solr activated. The instructions for starting a node are on the DataStax site: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/operations/startStop/startDseStandalone.html

Start the node with Spark Analytics, DSE Graph, and DSE Search enabled (the -k, -g, and -s flags, respectively).

$ bin/dse cassandra -k -g -s

For this demo, we will create a table of movies, load data into it, and search it.

Create a keyspace and a table

cassandra@cqlsh> create KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

cassandra@cqlsh> use demo;

cassandra@cqlsh:demo> create table movies (movieId int primary key, title text, genres text);

Load data into the table

cassandra@cqlsh:demo> COPY movies (movieId, title, genres) from '/data/movie_lens/ml-20m/movies.csv' WITH HEADER = true;

Using 7 child processes

Starting copy of demo.movies with columns [movieid, title, genres].

Processed: 27278 rows; Rate: 15285 rows/s; Avg. rate: 21541 rows/s

27278 rows imported from 1 files in 1.266 seconds (0 skipped).
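
Before running COPY on a large file, it can help to sanity-check the CSV first. The Python sketch below is only an illustration: the function name check_movies_csv and the inline sample are assumptions, and it checks nothing more than that the header matches the columns named in the COPY command and that every movieId parses as an integer.

```python
import csv
import io

EXPECTED_COLUMNS = ["movieId", "title", "genres"]  # column order used in the COPY command


def check_movies_csv(stream):
    """Return the number of data rows, raising ValueError on a malformed file."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {header}")
    count = 0
    for row_no, row in enumerate(reader, start=1):
        if len(row) != len(EXPECTED_COLUMNS):
            raise ValueError(f"row {row_no}: expected 3 fields, got {len(row)}")
        int(row[0])  # movieId must be an integer; raises ValueError otherwise
        count += 1
    return count


# Hypothetical one-row sample; a real run would pass
# open('/data/movie_lens/ml-20m/movies.csv', newline='') instead.
sample = io.StringIO("movieId,title,genres\n3,Grumpier Old Men (1995),Comedy|Romance\n")
print(check_movies_csv(sample))
```

The csv module is used instead of splitting on commas because titles in the file may themselves contain commas inside quotes.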

Create a search index on the table.

cqlsh:data> CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies;

See the full documentation for the search index options.

https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/cql_commands/cqlCreateSearchIndex.html

cassandra@cqlsh:demo> select * from demo.movies where solr_query = 'title:*' limit 4;

movieid | genres | solr_query | title

---------+---------------------------------------------+------------+--------------------------

3 | Comedy|Romance | null | Grumpier Old Men (1995)

(4 rows)
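
Besides a plain query string, solr_query also accepts a JSON document, which makes it easier to pass options such as a sort order. The Python sketch below only builds the CQL text; the helper name solr_query_cql is hypothetical, and sending the statement to the cluster (from cqlsh or a driver) is left out.

```python
import json


def solr_query_cql(table, q, sort=None, limit=None):
    """Build a SELECT statement that uses the JSON form of solr_query."""
    query = {"q": q}
    if sort:
        query["sort"] = sort
    # CQL string literals use single quotes, so double any that appear in the JSON.
    literal = json.dumps(query).replace("'", "''")
    cql = f"SELECT * FROM {table} WHERE solr_query = '{literal}'"
    if limit is not None:
        cql += f" LIMIT {int(limit)}"
    return cql + ";"


print(solr_query_cql("demo.movies", "title:*", sort="movieid asc", limit=4))
```

The printed statement can be pasted into cqlsh as is.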

By default, DSE Search is case sensitive. For example, compare the following two queries.

cassandra@cqlsh:demo> select * from demo.movies where solr_query = 'title:juma*' limit 4;

movieid | genres | solr_query | title

---------+--------+------------+-------

(0 rows)

The query above returns no rows, but the one below returns a result.

cassandra@cqlsh:demo> select * from demo.movies where solr_query = 'title:Juma*' limit 4;

movieid | genres | solr_query | title

---------+----------------------------+------------+----------------

(1 rows)

To make DSE Search case insensitive, you have to define a custom filter (a Solr filter) and then change the field type.
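
Why the field type matters can be illustrated with a toy model in Python. This is not how Solr is implemented: both functions are hypothetical stand-ins, one treating the whole value as a single case-sensitive term (as a string field does) and one splitting the value into lowercased tokens (as a text field with a lowercase filter would).

```python
import re


def strfield_prefix_match(value, prefix):
    """Toy model of a string field: the whole value is one case-sensitive term."""
    return value.startswith(prefix)


def textfield_prefix_match(value, prefix):
    """Toy model of a text field with a lowercase filter: tokenize, lowercase, match."""
    tokens = [t.lower() for t in re.findall(r"\w+", value)]
    return any(t.startswith(prefix.lower()) for t in tokens)


title = "Grumpier Old Men (1995)"
print(strfield_prefix_match(title, "grump"))   # False: case differs
print(strfield_prefix_match(title, "Grump"))   # True
print(textfield_prefix_match(title, "grump"))  # True: both sides are lowercased
print(textfield_prefix_match(title, "old"))    # True: matches the second token
```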

Using dsetool, check the status of the indexing job.

$ dsetool core_indexing_status demo.movies

[demo.movies]: FINISHED
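
When the status check is scripted, the output line has to be parsed. The Python sketch below assumes the output looks like the single line shown above; the function name parse_indexing_status is hypothetical, and other states may be printed in a different shape that this pattern does not cover.

```python
import re

# Matches lines of the form "[keyspace.table]: STATUS".
STATUS_LINE = re.compile(r"^\[(?P<core>[^\]]+)\]:\s*(?P<status>\w+)")


def parse_indexing_status(output):
    """Return (core, status) from the first matching line, or None if none match."""
    for line in output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            return match.group("core"), match.group("status")
    return None


print(parse_indexing_status("[demo.movies]: FINISHED"))
```

A caller could run dsetool through subprocess, pass its standard output to this function, and repeat until the status is FINISHED.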

View the Solr schema that DSE creates by default.

cassandra@cqlsh:demo> DESC ACTIVE SEARCH INDEX SCHEMA ON demo.movies;

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<types>

</types>

</fields>

<uniqueKey>movieid</uniqueKey>

</schema>

Let us change the field type of the title column from StrField to TextField.

Drop the title field from the search index schema.

cassandra@cqlsh:demo> ALTER SEARCH INDEX SCHEMA ON demo.movies DROP field title;

Verify that the field has been dropped in the pending schema.

cassandra@cqlsh:demo> DESC PENDING SEARCH INDEX SCHEMA ON demo.movies;

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<types>

</types>

</fields>

<uniqueKey>movieid</uniqueKey>

</schema>

Add a new search field type of class TextField. (It is not clear why DSE does not include this type in the default schema.)

cassandra@cqlsh:demo> ALTER SEARCH INDEX SCHEMA ON demo.movies ADD types.fieldType[@class='org.apache.solr.schema.TextField', @name='TextField'];

Add an index on the title column with type TextField.

cassandra@cqlsh:demo> ALTER SEARCH INDEX SCHEMA ON demo.movies ADD fields.field[@name='title', @type='TextField', @multiValued='false', @indexed='true'];

Documentation on Solr field classes:

https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/solrFieldClasses.html

Verify the schema before activating it.

cassandra@cqlsh:demo> DESC PENDING SEARCH INDEX SCHEMA ON demo.movies;

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<types>

</types>

</fields>

<uniqueKey>movieid</uniqueKey>

</schema>

Activate the configuration and rebuild the index.

cassandra@cqlsh:demo> RELOAD SEARCH INDEX ON demo.movies;

cassandra@cqlsh:demo> REBUILD SEARCH INDEX ON demo.movies;

Try the REST API using curl.

$ curl "http://einext02:8983/solr/demo.movies/select?q=title:(Grump*+OR+Old*)&wt=json&rows=10&fl=title,movieid&indent=true"

{

"responseHeader":{

"status":0,

"QTime":8},

"response":{"numFound":22,"start":0,"docs":[

{

"movieid":27357,

"title":"Old Men in New Cars (Gamle mænd i nye biler) (2002)"},

{

"movieid":7582,

"title":"Old Acquaintance (1943)"},

{

"movieid":25786,

"title":"Old Dark House, The (1932)"},

{

"movieid":27773,

"title":"Old Boy (2003)"},

{

"movieid":113374,

"title":"Old Lady and the Pigeons, The (La vieille dame et les pigeons) (1997)"},

{

"movieid":117885,

"title":"Grumpy Cat's Worst Christmas Ever (2014)"},

{

"movieid":50259,

"title":"Old Joy (2006)"},

{

"movieid":3,

"title":"Grumpier Old Men (1995)"},

{

"movieid":107314,

"title":"Oldboy (2013)"},

{

"movieid":129849,

"title":"Old Man Drinking a Glass of Beer (1898)"}]

}}
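
A client normally consumes this JSON programmatically. The Python sketch below works on a trimmed copy of the response above (two of the ten documents); the function name summarize is hypothetical, and only the fields visible in the response are used.

```python
import json

# Trimmed copy of the response shown above (two of the ten docs).
raw = '''
{"responseHeader": {"status": 0, "QTime": 8},
 "response": {"numFound": 22, "start": 0, "docs": [
   {"movieid": 7582, "title": "Old Acquaintance (1943)"},
   {"movieid": 3, "title": "Grumpier Old Men (1995)"}]}}
'''


def summarize(response_text):
    """Return (numFound, [(movieid, title), ...]) from a Solr JSON response."""
    body = json.loads(response_text)
    if body["responseHeader"]["status"] != 0:
        raise RuntimeError("Solr reported a non-zero status")
    result = body["response"]
    return result["numFound"], [(d["movieid"], d["title"]) for d in result["docs"]]


total, movies = summarize(raw)
print(total, movies)
```

Note that numFound is the total number of matches (22), while docs holds only the rows actually returned.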

The above query does the following:

Searches for movies whose title starts with Old or Grump
Limits the number of rows to ten
Returns only the title and movieid fields
Returns the output in JSON format
Indents the output for easy reading
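
The same request URL can be assembled in code, which takes care of URL-encoding the query expression. This is a minimal Python sketch using only the standard library; the function name solr_select_url is hypothetical, and the host, port, and core name are the ones from the curl command above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def solr_select_url(host, core, q, fields, rows=10, port=8983):
    """Build the /select URL for a search core with URL-encoded parameters."""
    params = {
        "q": q,                  # query expression
        "wt": "json",            # response format
        "rows": rows,            # maximum number of rows returned
        "fl": ",".join(fields),  # fields to include in each document
        "indent": "true",        # pretty-print the response
    }
    return f"http://{host}:{port}/solr/{core}/select?{urlencode(params)}"


url = solr_select_url("einext02", "demo.movies", "title:(Grump* OR Old*)", ["title", "movieid"])
print(url)
# Decoding the query string gives back the original expression.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The resulting URL can be passed to curl or to any HTTP client.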

Here is a blog post on Solr query parameters.

https://blog.einext.com/apache-solr/solr-query-parameters

Note: If you call the REST API from a remote machine, you have to enable remote access.

In cassandra.yaml, set the following two settings to the IP address of the machine to make the Solr Admin UI remotely accessible.

native_transport_address: ...

native_transport_broadcast_address: ...

Pros and cons of the REST API

Pros - easy to use, portable

Cons - load balancing has to be planned
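
One simple way to plan for load balancing is to rotate requests across nodes on the client side. The Python sketch below shows that idea only; the class name RoundRobinNodes and the hosts einext03 and einext04 are hypothetical, and it does no health checking or failover, which a real deployment would need (or would delegate to a dedicated load balancer).

```python
from itertools import cycle


class RoundRobinNodes:
    """Hand out Solr base URLs in rotation so requests spread across nodes."""

    def __init__(self, hosts, port=8983):
        if not hosts:
            raise ValueError("at least one host is required")
        self._urls = cycle([f"http://{h}:{port}/solr" for h in hosts])

    def next_base_url(self):
        return next(self._urls)


# einext02 is the host used above; einext03 and einext04 are hypothetical extra nodes.
nodes = RoundRobinNodes(["einext02", "einext03", "einext04"])
for _ in range(4):
    print(nodes.next_base_url() + "/demo.movies/select")
```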