Schabby's Blog
OpenGL, Java, Cassandra and other stuff that totally makes the world go round

This post addresses Java developers who want to get their feet wet with Cassandra. It is the first in a series of three posts in which I describe Cassandra's data model as seen from the angle of a typical Java developer. By contributing a javaish view on the data model, I hope to extend the set of existing data model descriptions.

The second post in this series will briefly describe how to install and configure Cassandra. The third post will provide several hands-on examples for Cassandra with Java.

I have been toying around with Cassandra for quite some time now. Of all the NoSQL databases I have seen (and there are quite a few already, as Michael pointed out to me earlier), Cassandra seems the most promising one to me, for reasons that are definitely worth discussing but are beyond the scope of this post.

Data Model

Cassandra's data model has been described more than once. In contrast to those descriptions, I will try to follow a more javaish view, which I find easiest and most powerful to work with. I therefore start by describing Cassandra's data model as nested hash maps.

The way in which data gets stored in key/value based databases like Cassandra strongly resembles the use of ordinary hash maps. To recall, a hash map stores data under a (unique) key. The same key is later used to retrieve the data from the hash map. For example, in order to map string keys to byte arrays you would write in Java

Map<String, byte[]> map = new HashMap<String, byte[]>();

This principle stays the same with Cassandra. However, in Cassandra you do not have a single hash map but up to three layers of nested hash maps! What does that mean? Imagine you don't store your values in a single byte array for each key, but again in a hash map, like

Map<String, Map<byte[], byte[]>> map = new HashMap<String, Map<byte[], byte[]>>();

This way you would partition the data you want to store into key/value pairs that are first put into the data hash map. The data hash map then gets inserted into the higher-order hash map under a given key string. Similarly, to retrieve a value, you would provide the key string, get the data hash map and extract from it the value you are interested in.

Let us further assume that we don't want to store the key/value pairs as two individual values, but coupled in a class called "Column" so that our data model would look like this:
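To make the two-step lookup concrete, here is a minimal sketch. It uses String keys on the inner level as well, purely to keep it short, and the class and method names (NestedMapSketch, put, get) are my own inventions for illustration, not anything from Cassandra.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class NestedMapSketch {
    // outer key -> (inner key -> value)
    static final Map<String, Map<String, byte[]>> store = new HashMap<>();

    // Insert a value, creating the inner "data hash map" on first use.
    static void put(String outerKey, String innerKey, byte[] value) {
        store.computeIfAbsent(outerKey, k -> new HashMap<>()).put(innerKey, value);
    }

    // Fetch the data hash map first, then extract the value from it.
    static byte[] get(String outerKey, String innerKey) {
        Map<String, byte[]> data = store.get(outerKey);
        return data == null ? null : data.get(innerKey);
    }

    public static void main(String[] args) {
        put("k1", "color", "blue".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(get("k1", "color"), StandardCharsets.UTF_8));
    }
}
```

The null check in get covers the case where nothing was ever stored under the outer key.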

Map<String, Map<byte[], Column>> map = new HashMap<String, Map<byte[], Column>>();

Where Column is defined as:

class Column {
    byte[] name;
    byte[] value;
    long timestamp;
 
    public Column(byte[] name, byte[] value) {
         this.name = name;
         this.value = value;
         this.timestamp = System.currentTimeMillis();
    }
}

This is already pretty close to what is called a Column Family in Cassandra. Try not to read too much into the name "Column". Also ignore the timestamp, which Cassandra uses to avoid data inconsistency and which need not bother us here.

Before we go on, let us look at a concrete example of how you would work with this kind of data structure. Let us assume we want to store the profile data of a single user for some imaginary social networking website.

/* data model to store user profiles */
Map<String, Map<byte[], Column>> user = new HashMap<String, Map<byte[], Column>>();

/* create a user 'schabby' */
user.put("schabby", new HashMap<byte[], Column>());

/* fill in some profile data for user 'schabby' */
Column age = new Column("age".getBytes(), new byte[]{ 27 });
user.get("schabby").put(age.name, age);

Column realName = new Column("real name".getBytes(), "Johannes Schaback".getBytes());
user.get("schabby").put(realName.name, realName);

Column nationality = new Column("nationality".getBytes(), "German".getBytes());
user.get("schabby").put(nationality.name, nationality);

Again, do not get confused by the use of byte arrays where normal strings would make more sense. This is to resemble the Cassandra data model as closely as possible. You will later realize that it's actually quite nifty to keep the inner hash map byte based, at the price of manually converting everything to byte arrays.
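One caveat if you try these snippets in real Java: arrays inherit equals and hashCode from Object, so a HashMap keyed by byte[] only finds an entry when it is handed the very same array instance. The following sketch shows the effect and one possible workaround, wrapping the names in java.nio.ByteBuffer, which compares by content. The class and helper names are made up for this illustration and have nothing to do with the Cassandra API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ByteKeySketch {
    // Wrap a string's bytes so that equals/hashCode work on the content.
    static ByteBuffer key(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Looks up with a fresh array of equal content: the entry is not found.
    static String rawLookup() {
        Map<byte[], String> raw = new HashMap<>();
        raw.put("age".getBytes(StandardCharsets.UTF_8), "27");
        return raw.get("age".getBytes(StandardCharsets.UTF_8));
    }

    // Same lookup with ByteBuffer keys: the entry is found.
    static String wrappedLookup() {
        Map<ByteBuffer, String> wrapped = new HashMap<>();
        wrapped.put(key("age"), "27");
        return wrapped.get(key("age"));
    }

    public static void main(String[] args) {
        System.out.println(rawLookup());     // prints null
        System.out.println(wrappedLookup()); // prints 27
    }
}
```

For the purpose of this post you can simply read byte[] keys as if they were compared by content.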

If we want to retrieve values from our data structure, we would need to do as follows:

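/* note: a plain Java HashMap compares byte[] keys by identity, so outside of this
   illustration the lookups need the same array instance or a content-based key */
byte age = user.get("schabby").get("age".getBytes()).value[0];
String realName = new String(user.get("schabby").get("real name".getBytes()).value);
String nationality = new String(user.get("schabby").get("nationality".getBytes()).value);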

And this is it. There is not much more conceptual stuff to understand in order to use Cassandra. So we are now ready to map this structure onto Cassandra terminology.

Column Family

Cassandra structures its data model in keyspaces, Column Families (CF), Columns and SuperColumns.

A keyspace is a namespace to group Column Families and can be compared to a schema or single database in the SQL world. A keyspace contains one or more Column Families.

A Column Family can be seen as a multidimensional hash map like the one in our example above. In the SQL analogy, you may see a Column Family as a single table that belongs to a schema, but this comparison will not take you far. It is really more a dynamically growing and shrinking hash map than a table with fixed columns. Still, in Cassandra's terminology you speak of rows when you refer to the hash map that you get for a key string.

Rows are accessed by string keys and each row - which can be seen as a "data hash map" - has several columns. Each column within a row is a bundled pair of a byte array key (a.k.a. name) and its byte array data field (a.k.a. value), very similar to our example.

Depending on your configuration, you can let Cassandra apply a sorting scheme to impose an order on the columns in a row. This enables you to query ranges of columns. For example, imagine a telephone book from which you want to retrieve all names starting with "Smi". In Java terms, this could be compared to using SortedMap instead of Map. But we stuck to Map for simplicity here.
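Projected onto the hash map picture, the terminology nests roughly as follows. This is only a sketch: the Column Family name "Users" is made up, and I use String names and values instead of byte arrays for brevity.

```java
import java.util.HashMap;
import java.util.Map;

class TerminologySketch {
    // keyspace: column family name -> row key -> column name -> column value
    static Map<String, Map<String, Map<String, String>>> build() {
        // one row: the "data hash map" holding the columns of a single key
        Map<String, String> row = new HashMap<>();
        row.put("nationality", "German");

        // one Column Family: row key -> row
        Map<String, Map<String, String>> users = new HashMap<>();
        users.put("schabby", row);

        // the keyspace groups Column Families by name
        Map<String, Map<String, Map<String, String>>> keyspace = new HashMap<>();
        keyspace.put("Users", users);
        return keyspace;
    }

    public static void main(String[] args) {
        System.out.println(build().get("Users").get("schabby").get("nationality"));
    }
}
```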
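The telephone book example can be sketched with a plain TreeMap. This only illustrates the idea of a range query over sorted names in Java and is not how you would query Cassandra itself; the class name, the method name and the sample entries are made up.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class RangeQuerySketch {
    // Returns a view of all entries whose name starts with the given prefix.
    // Assumes names do not contain the character '\uffff' right after the prefix.
    static SortedMap<String, String> withPrefix(SortedMap<String, String> book, String prefix) {
        return book.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    static SortedMap<String, String> sampleBook() {
        SortedMap<String, String> book = new TreeMap<>();
        book.put("Miller", "555-0101");
        book.put("Smart", "555-0102");
        book.put("Smith", "555-0103");
        book.put("Smithers", "555-0104");
        book.put("Snow", "555-0105");
        return book;
    }

    public static void main(String[] args) {
        // prints the names Smith and Smithers
        System.out.println(withPrefix(sampleBook(), "Smi").keySet());
    }
}
```

Because the map is kept sorted, the range is cut out without scanning every entry, which is exactly what an unordered hash map cannot offer.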

SuperColumns

The cool thing about Cassandra is its support for an additional hash map layer. This additional layer is added to the Column layer and enables you to store and access your data as a hash map in a hash map in a hash map, or in other words, as a three-dimensional hash map. This additional hash map is called a SuperColumn (SC).

In our Java-like example, a Column Family with SuperColumns looks like

Map<String, Map<byte[], SuperColumn>> superColumn 
     = new HashMap<String, Map<byte[], SuperColumn>>();

where SuperColumn is again a hash map over columns like

class SuperColumn extends HashMap<byte[], Column>
{
}

Again, I want to point out that the actual SuperColumn definition in Cassandra is different and that this explanatory definition is not too accurate, but nicely serves the illustration purpose.

Similar to normal Columns, the values within a SuperColumn are also stored in an order that depends on your configuration, which enables you to cut slices out of your SuperColumns.

To continue our social networking site example, let us look at how SuperColumns are used to store the friends and relations of the user 'schabby'.
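In Java terms, such a slice can be sketched with a sorted map over byte array names. The sketch below assumes Java 9 or later for Arrays.compareUnsigned, orders names byte by byte, and uses made-up names (SliceSketch, slice, the friend entries). It illustrates the concept only and does not reflect the actual Cassandra API or its configurable comparators.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class SliceSketch {
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Returns the values of all columns whose name lies in [from, to].
    static List<String> slice(String from, String to) {
        // Order byte[] names by content, treating bytes as unsigned.
        Comparator<byte[]> byContent = Arrays::compareUnsigned;
        NavigableMap<byte[], String> friends = new TreeMap<>(byContent);
        friends.put(bytes("friend_3"), "Susan");
        friends.put(bytes("friend_1"), "Merry");
        friends.put(bytes("friend_2"), "Robert");
        return new ArrayList<>(friends.subMap(bytes(from), true, bytes(to), true).values());
    }

    public static void main(String[] args) {
        System.out.println(slice("friend_1", "friend_2")); // prints [Merry, Robert]
    }
}
```

A nice side effect of the explicit comparator is that lookups compare names by content, so a freshly created byte array finds its entry.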

/* create ColumnFamily with SuperColumns */
Map<String, Map<byte[], SuperColumn>> columnFamily = new HashMap<String, Map<byte[], SuperColumn>>();

/* prepare a SuperColumn for 'schabby' */
columnFamily.put("schabby", new HashMap<byte[], SuperColumn>());

/* create SC to store friend info */
SuperColumn friends = new SuperColumn();

/* fill in some friends */
Column friend1 = new Column("friend_1".getBytes(), "Merry".getBytes());
friends.put(friend1.name, friend1);

Column friend2 = new Column("friend_2".getBytes(), "Robert".getBytes());
friends.put(friend2.name, friend2);

Column friend3 = new Column("friend_3".getBytes(), "Susan".getBytes());
friends.put(friend3.name, friend3);

/* finally store SC in Column Family */
columnFamily.get("schabby").put("friends".getBytes(), friends);

We are free to create another SuperColumn in the same Column Family to store other list-like data for 'schabby', for example his inbox.

/* ... continued example */
 
SuperColumn inbox = new SuperColumn();
 
/* add two mails to inbox */
Column mail1 = new Column("Hi Schabby".getBytes(), "I hope you are well! Cheers, Nick".getBytes());
inbox.put(mail1.name, mail1);

Column mail2 = new Column("Welcome".getBytes(), "some message body".getBytes());
inbox.put(mail2.name, mail2);

columnFamily.get("schabby").put("inbox".getBytes(), inbox);

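Retrieving the mails from the inbox is straightforward:

/* continued example */

/* conceptually: look up the SuperColumn stored under the name "inbox" */
SuperColumn storedInbox = columnFamily.get("schabby").get("inbox".getBytes());

for (byte[] subject : storedInbox.keySet())
{
   /* the map holds Column objects, so unwrap the value bytes */
   String body = new String(storedInbox.get(subject).value);
   // do something with subject/body
}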

And this is it. I hope this improved your understanding of Cassandra's data model. It's not that difficult all in all, especially once you start using it.

Please leave some comments for corrections and feedback.


29 Replies

  1. Michael says:

    Great post! Looking forward to part 2!

  2. The one thing I would add is that the columns and supercolumns are all sorted by name -- in java terms, SortedMaps -- so you can also ask for "slices" of columns as well as accessing by name. This allows treating them as lists, as well as dictionaries.

  3. schabby says:

    Hi Jonathan, oh yes, thanks! I will add that!

  4. Dravid says:

    is there a way in cassandra to grab all keys and iterate over them to get their individual values. similar to

    a. select * that we do in sqls or
    b. users.keySet();

    Very very nice post .. nice addition to wtf post. Clarified things for me

  5. Aatish says:

    Really great post!

    I am diligently following your posts and Cassandra overall.
    I have left a comment for you on post 2. Please reply back.

    Also, looking forward for your post 3.

    Thanks

  6. schabby says:

    Hi! Sorry for answering so late. The notification mails ended up in my spam folder. Sorry!

    As for your question: With the current state of development, you can only do range queries if you use an order-preserving partitioner (not the random partitioner, which is the default). If that's the case, you can check out the thrift method get_key_range.

    Otherwise you need to keep track of your keys yourself, for example in a meta-CF. However, I hope that this feature will be implemented soon.

    Johannes

  7. Mehar Chaitanya says:

    Hi, I am stuck on how to insert records into a keyspace in Cassandra. Can you compare it with MySQL, like a keyspace being a schema in MySQL?

    I am from an SQL background and unable to understand this Cassandra Column Family.

    How can I insert data into a Column Family like below?

    UserList = {
    John: {
    username: "john",
    email: "john@blah.com",
    },
    Smith: {
    username: "ieure",
    email: "ieure@example.com",
    age: "66",
    }
    }

    How can i do this ?

  8. ElangovanS says:

    Excellent article... explained very lay(java)man terms. appreciate it!

  9. Mike says:

    Thanks for a wonderful post, I've been looking for such information. I will join your RSS feed now.

  10. Nick says:

    Great Article.

  11. Sagar says:

    Great post for a Java user starting with Cassandra, like me :) Thanks!

  12. Jonathan says:

    Thanks for the great post, waiting for more detail in part 2.
    I have been searching for a Java client for Cassandra for a long time...

  13. Prakash says:

    I am a beginner with Cassandra, please help me understand the following.

    In your article you mentioned

    Map<String, Map>
    where :

    String - is the Key
    Map - is the collection of columns

    So a super column should be

    Map<String,Map<superbyte[],<Map>>>

    where:

    String - is the key
    superbyte[] - is the super column name
    Map - Is the collection of columns under supercolumn

    Please correct me if I am wrong.

    If my understanding is right, the SuperColumn class in your article should be

    class SuperColumn
    {
    byte[] SuperColumnName;
    Map Columns;
    SuperColumn(byte[] p_SuperName,Map p_Columns)
    {

    this.SuperColumnName = p_SuperName;
    this.Columns = p_Columns;

    }
    }

    Please advise....

  14. Very nice post. It is what I missed in order to see if Cassandra is what I am looking for. Straightforward, simple put, without noise, and on the developer's language. Thank you

  15. TS75 says:

    The best article ever! Do you know when will you come up with the third post providing several hands-on examples for Cassandra with Java? I am really looking forward to the third one as I may have to use it in my project soon.

  16. Ekrem SABAN says:

    Hello!

    Nice posting! I run into problems with the .toBytes() method that seems not to exist under Java 1.6. *-/ But using the .getBytes() method goes also. But making the content as visible as a String value was something that I couldn't manage.

    I tried a hash map tutorial, but couldn't get around my problems. At the end, I replaced all byte[] objects to String and removed the getBytes() calls. Now, I could see the contents of the hash map. :-)

  17. Cass Bud says:

    Hey Bud,
    Very Impressive Comp Science Jeek Blog which brings clarity.
    Interested in writing Client/Query Lang for Cassandra ?

    Another Comp Science Pal

  18. schabby says:

    Hi there,

    thanks for your kind comment! As for your inquiry I am afraid I have to pass, although I definitely would love to contribute to a decent query lang. Something close to JSON would make most sense probably. I am too much involved in my job atm so that I would not have the time to spend enough time on the matter like it deserves. But thanks for considering me though!

    Cheers

    Johannes

  19. donneo says:

    Thanks for this great post, buddy!

  20. Agito says:

    Thnx, its yet best explanation of cassandra data model I found :)

  21. Karthic says:

    Nice post.. Was very helpful.

  22. Soumendu says:

    Thanks a ton for sharing such a lovely article. So easy to read and understand. Looking forward to the next one... Thanks once again!

  23. Sunil Sodah says:

    Excellent post. Very helpful.

  24. alonso says:

    very nice post. Thx!!!

  25. kitty says:

    Very helpful post. Thanks a lot :)

  26. Suku says:

    Nice Post...

    Looking forward for more such tips.. Thanks :)

  27. Sergey says:

    great explanation for modeling OLAP storages in DFS.

  28. Qiu Ping says:

    Your article is quite helpful for me on understanding the basic concepts of cassandra. thanks.
