Sunday, February 8, 2015

Working with Cassandra

Introduction


Cassandra is a NoSQL database that combines the best of two projects: BigTable (Google) and Dynamo (Amazon). It is reported to perform better than many other NoSQL solutions (see the Benchmark link in the References). It provides a key-value data model in which rows are sorted in one order only (by key under an order-preserving partitioner, or by the key's token under the random partitioner) and columns are sorted by column name according to the configured Comparator. The benefit of using columns is that you can read them in either order, ascending or descending. There is no practical limit on the number of columns for most uses: Cassandra can support millions of columns in one row (the documented maximum is 2 billion).
It is less structured than an RDBMS, where you first need to define a schema and then insert records. Unlike an RDBMS, column names are not specified in advance, and each row can have a different number of columns. Columns are therefore sparse in Cassandra.

Data Model

The key points of the Cassandra data model are briefly discussed below.

Keyspace - equivalent to a database in an RDBMS
ColumnFamily - equivalent to a table in an RDBMS
Column - the atomic unit. Each column has a name against which a value is inserted along with a timestamp (the timestamp identifies the most recent value and is used in the read repair process)
Keys - each row is identified by a key, which should be unique
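
To make the model above more concrete, here is a minimal in-memory sketch of a single column family as a sorted map of rows, each holding its own sorted map of columns. It is only an illustration and not the Cassandra client API: the class ColumnFamilySketch and its methods are hypothetical names, and real Cassandra resolves timestamps across replicas rather than inside one map.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Illustrative in-memory model only; this is NOT the Cassandra client API.
class ColumnFamilySketch {
    // A column value together with the timestamp of the write that produced it.
    static final class Column {
        final String value;
        final long timestamp;

        Column(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Rows are kept sorted by key, in one fixed order.
    private final TreeMap<String, TreeMap<String, Column>> rows = new TreeMap<>();
    // Column names inside every row are sorted by this comparator.
    private final Comparator<String> columnComparator;

    ColumnFamilySketch(Comparator<String> columnComparator) {
        this.columnComparator = columnComparator;
    }

    // Insert a column; a write with an older timestamp never replaces a newer one.
    void insert(String rowKey, String columnName, String value, long timestamp) {
        TreeMap<String, Column> columns = rows.computeIfAbsent(
                rowKey, k -> new TreeMap<String, Column>(columnComparator));
        Column existing = columns.get(columnName);
        if (existing == null || timestamp >= existing.timestamp) {
            columns.put(columnName, new Column(value, timestamp));
        }
    }

    // Latest value of a column, or null if the row or column does not exist.
    String get(String rowKey, String columnName) {
        TreeMap<String, Column> columns = rows.get(rowKey);
        if (columns == null) {
            return null;
        }
        Column column = columns.get(columnName);
        return column == null ? null : column.value;
    }

    // Column names of one row, in comparator order or reversed.
    List<String> columnNames(String rowKey, boolean reversed) {
        TreeMap<String, Column> columns = rows.get(rowKey);
        if (columns == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(reversed ? columns.descendingKeySet() : columns.navigableKeySet());
    }

    // Row keys in their single sorted order.
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        ColumnFamilySketch users = new ColumnFamilySketch(Comparator.naturalOrder());
        users.insert("alice", "email", "alice@example.com", 1L);
        users.insert("alice", "city", "Lahore", 1L);
        users.insert("bob", "email", "bob@example.com", 1L); // sparse: bob has no "city" column
        users.insert("alice", "email", "alice@new.example.com", 2L); // newer timestamp wins

        System.out.println(users.columnNames("alice", false)); // [city, email]
        System.out.println(users.columnNames("alice", true));  // [email, city]
        System.out.println(users.get("alice", "email"));       // alice@new.example.com
    }
}
```

Each row owns its own column map, which is why two rows in the same column family can carry completely different sets of columns.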

Design philosophy

Unlike in an RDBMS, data model design in Cassandra is query driven: you first analyse your queries and then design your model around them.

For Text Indexing
Cassandra has no rich support for text search of the kind many RDBMSs offer through the LIKE keyword. One project built atop Cassandra provides such text indexing and searching support. It is called Lucandra. It is based on Apache Lucene and uses Cassandra as its data store. I have tried it with Cassandra 0.6.x. From Cassandra 0.7 onwards, however, the project is known as Solandra, which is based on Solr, Lucene and Cassandra 0.7.x+. At the time this page was originally written, Solandra only supported Cassandra 0.7, which was then the latest release. Cassandra has since moved on quite a lot and its latest release is 2.1.2 (I will write about the changes in this version in a later post). Solandra comes bundled with Cassandra and launches the Solandra server and Cassandra within one JVM.

Experience with Solandra

There was a scenario in which I wanted to use my existing Cassandra 0.7 instance with Solandra for text indexing. As Solandra comes with its own Cassandra and launches it, there is no provided mechanism to prevent Solandra from launching its own Cassandra server. However, if we look into the code, we can stop Solandra from launching Cassandra. To do this, comment out the call to CassandraUtils.startup() in the SolandraInitializer class.

Solandra with Cassandra

One possible solution that worked for me is the following.

  • Run Solandra as a standalone server (not from Tomcat using solandra.war)
  • Add your Cassandra-related schema from code. In Cassandra 0.7, you can create or remove keyspaces and column families (the entire schema) at runtime.
  • As Solandra requires a schema for storing and indexing the incoming data, you need to write a schema and upload it to the Solandra server.

To upload it to Solandra, you can use the 'curl' utility with a command like this.

SCHEMA_URL=http://localhost:8983/solandra/schema/myschema
SCHEMA=~/myschema.xml

curl "$SCHEMA_URL" --data-binary @"$SCHEMA" -H 'Content-type:text/xml; charset=utf-8'

echo "Posted $SCHEMA to $SCHEMA_URL"

The name you used at the end of 'http://localhost:8983/solandra/schema', which is 'myschema' here, is then used for reading from and writing to Solandra.

You can also upload the schema from Java code using java.net.HttpURLConnection.
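
As a rough sketch of the Java route, the following mirrors the curl command above: it POSTs the schema bytes with the same Content-Type header and returns the HTTP status code. The class name SchemaUploader and the method postSchema are hypothetical names chosen for illustration, and the sketch assumes the server accepts a plain POST exactly as it does from curl.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper; it only mirrors what the curl command sends.
class SchemaUploader {
    // POSTs the schema bytes as text/xml and returns the HTTP status code.
    static int postSchema(String schemaUrl, byte[] schemaXml) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(schemaUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // we are going to send a request body
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            conn.setFixedLengthStreamingMode(schemaXml.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(schemaXml);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SchemaUploader <schema-url> <schema-file>");
            return;
        }
        byte[] schemaXml = Files.readAllBytes(Paths.get(args[1]));
        System.out.println("HTTP " + postSchema(args[0], schemaXml));
    }
}
```

Invoked with http://localhost:8983/solandra/schema/myschema and the path to myschema.xml as arguments, this should send the same request as the shell script.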


References
Cassandra: http://cassandra.apache.org/
Benchmark: http://www.datastax.com/resources/whitepapers/benchmarking-top-nosql-databases
Apache Lucene: http://lucene.apache.org/
Apache Solr: http://lucene.apache.org/solr/


