12/08/2018, 14:50

Cassandra NoSQL Database

1. Apache Cassandra

Apache Cassandra is an open-source, distributed, highly available, and decentralized storage system (database) for managing very large structured datasets on clusters of up to many thousands of nodes deployed across multiple data centers around the world. It provides highly available service with no single point of failure. It is a popular column-family NoSQL database based on a peer-to-peer architecture.

Cassandra can manage huge volumes of structured, semi-structured, and unstructured data in a large distributed cluster across multiple data centers. It provides linear scalability, high performance, and fault tolerance, and it supports a very flexible data model.

Cassandra’s architecture is responsible for its ability to scale, perform, and offer continuous uptime. Rather than using a legacy master-slave architecture or a manual, difficult-to-maintain sharded architecture, Cassandra has a masterless “ring” design that is elegant, easy to set up, and easy to maintain.

Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.

In Cassandra, all nodes play an identical role. There is no concept of a master node; all nodes communicate with each other as equals. Cassandra’s built-for-scale architecture means that it can handle large amounts of data and thousands of concurrent users or operations per second, even across multiple data centers. It also means that, unlike master-slave or sharded systems, Cassandra can offer true continuous availability and uptime: new nodes are simply added to an existing cluster without having to take it down.

Relational Model | Cassandra Model
-----------------+--------------------
Database         | Keyspace
Table            | Column Family (CF)
Primary key      | Row Key
Column name      | Column name/Key
Column value     | Column value

2. Additional Data Types in Cassandra

  • 2.1 Counter Data Type

    CQL defines built-in data types for columns, and the counter type is unique among them. A counter is a special kind of column used to store a number that incrementally counts the occurrences of a particular event or process. For example, you might use a counter column to count the number of times a page is viewed. A counter column maintains a distributed counter that concurrent transactions can add to or subtract from.

    Counter columns come with the following restrictions:

      • We cannot index, delete, or re-add a counter column.
      • In a table that has a counter column, all non-counter columns must be part of the primary key, so the table contains nothing other than counters and the primary key.
      • A counter column should not itself serve as the primary key.
      • To load data into a counter column, or to increase or decrease its value, we use the UPDATE command. Cassandra rejects USING TIMESTAMP or USING TTL in an UPDATE of a counter column.
      • For surrogate keys, use the timeuuid type instead of trying to generate sequential numbers with a counter.

    For example:

    1. Create a keyspace. For example, create a keyspace for use in a single data center with a replication factor of 3. Use the default data center name from the output of nodetool status, for example datacenter1.

      CREATE KEYSPACE counterks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
            'datacenter1' : 3 };

    2. Create a table for the counter column.

      CREATE TABLE counterks.page_view_counts
            (counter_value counter,
            url_name varchar,
            page_name varchar,
            PRIMARY KEY (url_name, page_name)
          );
      
    3. Load data into the counter column.

       UPDATE counterks.page_view_counts
       SET counter_value = counter_value + 1
       WHERE url_name = 'www.datastax.com' AND page_name = 'home';
      
    4. Take a look at the counter value.

      SELECT * FROM counterks.page_view_counts;
      
      Output:
      url_name         | page_name  | counter_value
      -----------------+------------+----------------------
      www.datastax.com |   home     |             1
      
    5. Increase the value of the counter.

        UPDATE counterks.page_view_counts
        SET counter_value = counter_value + 2
        WHERE url_name = 'www.datastax.com' AND page_name = 'home';
      
    6. Take a look at the counter value again.

      SELECT * FROM counterks.page_view_counts;

      Output:
      url_name         | page_name   | counter_value
      -----------------+-------------+--------------------
      www.datastax.com |   home      |             3
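
The counter walkthrough above can be mimicked with a small in-memory model. This is only a sketch of the semantics, not how Cassandra stores counters: the class `CounterTable` and its methods are hypothetical names invented for illustration. The point it shows is that a counter accepts only relative updates, so the two UPDATE statements add up to 3.

```python
from collections import defaultdict


class CounterTable:
    """Toy stand-in for a table whose only non-key column is a counter."""

    def __init__(self):
        # Rows are keyed by the full primary key (url_name, page_name).
        # A counter that was never updated reads as 0.
        self._rows = defaultdict(int)

    def update(self, url_name, page_name, delta):
        # Mirrors "SET counter_value = counter_value + delta":
        # callers pass a relative change, never an absolute value.
        self._rows[(url_name, page_name)] += delta

    def select(self, url_name, page_name):
        return self._rows[(url_name, page_name)]


table = CounterTable()
table.update("www.datastax.com", "home", 1)   # step 3 of the walkthrough
table.update("www.datastax.com", "home", 2)   # step 5 of the walkthrough
print(table.select("www.datastax.com", "home"))  # 3
```

Because each update is a relative change, the two increments give the same total whichever order they are applied in.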
      
  • 2.2 Static column

    A column can be defined as static to denote a column whose value is shared by all rows in a partition. This only makes sense in a table with multi-row partitions, that is, a table that has clustering columns.
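
To make the "shared by all rows in a partition" idea concrete, here is a minimal in-memory sketch. The class `Partition` and its fields are hypothetical and only model the visible behaviour: the static value is kept once per partition, and every row in that partition reads the same value.

```python
class Partition:
    """Toy model of one multi-row partition that has a single static column."""

    def __init__(self):
        self.static_value = None  # stored once for the whole partition
        self.rows = {}            # clustering key -> regular column value

    def write(self, clustering_key, value, static_value=None):
        self.rows[clustering_key] = value
        if static_value is not None:
            # Writing the static column through any row changes it for all rows.
            self.static_value = static_value

    def read(self):
        # Every row is returned together with the one shared static value.
        return [(ck, v, self.static_value) for ck, v in sorted(self.rows.items())]


p = Partition()
p.write(1, "first", static_value="shared-v1")
p.write(2, "second")
p.write(3, "third", static_value="shared-v2")
for row in p.read():
    print(row)
```

After the third write, all three rows report "shared-v2", including the row that was written before the static value changed.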

  • 2.3 Collection Types

    Cassandra Query Language also provides collection data types:

      • List: a collection of one or more ordered elements, written as [ literal, literal, literal ]
      • Map: a collection of key-value pairs, written as a JSON-style object of literals: { literal : literal, literal : literal ... }
      • Set: a collection of one or more unique elements, written as { literal, literal, literal }

    Collections have a few general properties:

      • They are meant to be a dynamic part of a table.
      • Their update syntax is very different from the insert syntax.
      • Reading a collection reads the whole collection.

    CQL Set: A set is sorted by the CQL type comparator.

    Declaration syntax:
                set_exam set<text>
                Here,  set_exam is the collection (column) name
                       set<text> is a collection of strings

    INSERT INTO collection_examples (id, set_exam) VALUES (1, {'1-one', '2-two'});

    Adding an element to the set:

    UPDATE collection_examples SET set_exam = set_exam + {'0-zero'} WHERE id = 1;

    After the new element is added, it sorts to the beginning of the set.

    Removing an element from the set:

    UPDATE collection_examples SET set_exam = set_exam - {'1-one'} WHERE id = 1;
    
  • CQL List: A list is ordered by insertion.

    Declaration syntax:
                list_exam list<text>
                Here,  list_exam is the collection (column) name
                       list<text> is a collection of strings

    INSERT INTO collection_examples (id, list_exam) VALUES (1, ['1-one', '2-two']);

    Adding an element to the end of a list:

    UPDATE collection_examples SET list_exam = list_exam + ['3-three'] WHERE id = 1;

    Adding an element to the beginning of a list:

    UPDATE collection_examples SET list_exam = ['0-zero'] + list_exam WHERE id = 1;

    Deleting an element from the list:

    UPDATE collection_examples SET list_exam = list_exam - ['3-three'] WHERE id = 1;
    
  • CQL Map: Keys are sorted using the CQL type comparator.

    Declaration syntax:
                map_exam map<int, text>
                Here,  map_exam is the collection (column) name
                       map<int, text> is a collection of key-value pairs with int keys and string values

    INSERT INTO collection_examples (id, map_exam) VALUES (1, {1: 'one', 2: 'two'});

    Adding an element to the map:

    UPDATE collection_examples SET map_exam[3] = 'three' WHERE id = 1;

    Updating an existing element in the map:

    UPDATE collection_examples SET map_exam[3] = 'tres' WHERE id = 1;

    Deleting an element from the map:

    DELETE map_exam[3] FROM collection_examples WHERE id = 1;
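
The set, list, and map operations above can be traced with plain Python containers. This is an illustrative sketch of the visible results only, using Python built-ins as stand-ins for the CQL types. The variable names reuse the column names from the examples, and `sorted()` is used to imitate the sorted order in which sets and map keys are returned.

```python
# set<text>: no duplicates, returned in sorted order.
set_exam = {"1-one", "2-two"}
set_exam |= {"0-zero"}                 # set_exam = set_exam + {'0-zero'}
set_after_add = sorted(set_exam)       # the new element sorts to the front
set_exam -= {"1-one"}                  # set_exam = set_exam - {'1-one'}
set_after_remove = sorted(set_exam)

# list<text>: keeps insertion order, duplicates allowed.
list_exam = ["1-one", "2-two"]
list_exam = list_exam + ["3-three"]    # append to the end
list_exam = ["0-zero"] + list_exam     # prepend to the beginning
list_after_add = list(list_exam)
# list - [...] removes every occurrence of the listed values.
list_exam = [x for x in list_exam if x not in ["3-three"]]

# map<int, text>: keys are returned in sorted order.
map_exam = {1: "one", 2: "two"}
map_exam[3] = "three"                  # add an element
map_exam[3] = "tres"                   # overwrite the same key
map_after_update = sorted(map_exam.items())
del map_exam[3]                        # DELETE map_exam[3] ...
map_after_delete = sorted(map_exam.items())

print(set_after_add, set_after_remove)
print(list_after_add, list_exam)
print(map_after_update, map_after_delete)
```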
    

3. Keyspace – analogous to a schema.

  • Has various storage attributes.
  • The keyspace determines the RF (replication factor).

A keyspace (or schema) is a top-level namespace:

  • All data objects (e.g., tables) must belong to some keyspace
  • It defines how data is replicated on nodes
  • One keyspace per application is a good idea

4. Cassandra Primary key

A primary key may be a single-column key or a composite key. In a composite primary key, the first part of the key is called the partition key and the remaining part is the clustering key. The partition key determines which node stores the data. The additional columns determine per-partition clustering. Clustering is a storage engine process that sorts data within the partition. The combination of a partition key and a clustering key uniquely identifies a row in a table and is called a primary key.

  • The Partition Key is responsible for data distribution across your nodes.
  • The Clustering Key is responsible for data sorting within the partition.
  • The Primary Key is equivalent to the Partition Key in a single-field-key table.
  • The Composite/Compound Key is simply a multi-column key.

It is worth mentioning that both the partition key and the clustering key can be made up of more than one column. We can demonstrate this with an example:

 create table stackoverflow (
                 k_part_one text,
                 k_part_two int,
                 k_clust_one text,
                 k_clust_two int,
                 k_clust_three uuid,
                 data text,
                 PRIMARY KEY((k_part_one,k_part_two), k_clust_one, k_clust_two, k_clust_three)      
 );

Here, k_part_one and k_part_two together form the partition key, and k_clust_one, k_clust_two, and k_clust_three are the clustering keys. All primary key columns listed after the partition key are called clustering columns. Where the partition key is important for data locality, the clustering columns specify the order in which the data is arranged inside the partition. The clustering columns are read left to right.

  • The partition key determines on which node the partition resides
  • Data is ordered in clustering column order within the partition
  • Partitions are not ordered. Rows within partition are ordered by clustering key
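
A minimal sketch of these three points, using the column layout of the stackoverflow table above. The sample values are invented for illustration, and plain strings stand in for the uuid column. Rows are grouped by the two-column partition key and then sorted inside each partition by the clustering columns, compared left to right.

```python
from collections import defaultdict

# (k_part_one, k_part_two, k_clust_one, k_clust_two, k_clust_three, data)
rows = [
    ("a", 1, "y", 2, "u3", "row-1"),
    ("a", 1, "x", 9, "u1", "row-2"),
    ("b", 1, "x", 1, "u2", "row-3"),
    ("a", 1, "x", 3, "u4", "row-4"),
]

# Group rows by the composite partition key (k_part_one, k_part_two).
partitions = defaultdict(list)
for r in rows:
    partitions[r[:2]].append(r)

# Inside each partition, order rows by the clustering columns,
# compared left to right: k_clust_one, then k_clust_two, then k_clust_three.
for key in partitions:
    partitions[key].sort(key=lambda r: r[2:5])

ordered_a1 = [r[5] for r in partitions[("a", 1)]]
ordered_b1 = [r[5] for r in partitions[("b", 1)]]
print(ordered_a1)  # ['row-4', 'row-2', 'row-1']
print(ordered_b1)  # ['row-3']
```

Note that the dictionary of partitions itself carries no meaningful order, which mirrors the statement that partitions are not ordered while rows within a partition are.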

Controlling the order of the clustering columns: Since the clustering columns specify the order within a single partition, it is helpful to be able to control the direction of the sort. We could accomplish this at run time by adding an ORDER BY clause to our SELECT, like this:

SELECT * FROM user_videos
WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89
ORDER BY added_date DESC;

What if we want to control the sort order as a default of the data model? We can specify that at table creation time using the CLUSTERING ORDER BY clause:

CREATE TABLE user_videos (
    userid uuid,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (userid, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);

Now when we insert data into user_videos, the data will be pre-sorted by added_date in descending order. While the partition key component of a primary key is always mandatory, the clustering key component is optional.

Single-row partitions: A table with no clustering key can only have single-row partitions because its primary key is equivalent to its partition key and there is a one-to-one mapping between partitions and rows.

Multi-row partitions: A table with a clustering key can have multi-row partitions because different rows in the same partition have different clustering keys. Rows in a multi-row partition are always ordered by clustering key values in ascending (default) or descending order. Each partition can contain multiple rows.
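
The effect of a mixed clustering order such as (added_date DESC, videoid ASC) can be sketched for a single partition as follows. The sample rows are invented, and short strings stand in for the uuid values. The sketch relies on Python's sort being stable: sorting first by the secondary column and then by the primary column yields the combined order.

```python
from datetime import date

# Rows of one user_videos partition as (added_date, videoid) pairs.
videos = [
    (date(2018, 1, 5), "vid-b"),
    (date(2018, 3, 1), "vid-a"),
    (date(2018, 1, 5), "vid-a"),
]

# Secondary clustering column first: videoid ascending.
by_videoid = sorted(videos, key=lambda v: v[1])

# Then the primary clustering column: added_date descending.
# The sort is stable, so rows with equal dates keep their videoid order.
clustered = sorted(by_videoid, key=lambda v: v[0], reverse=True)

for added_date, videoid in clustered:
    print(added_date.isoformat(), videoid)
```

The newest video comes first, and the two videos added on the same date are ordered by videoid ascending.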

When data is inserted into the cluster, the first step is to apply a hash function to the partition key. The output is used to determine which node (and replicas) will get the data. The default partitioner in Apache Cassandra uses Murmur3 hashing, which takes an arbitrary input and creates a consistent token value. That token value falls inside the range of tokens owned by a single node. In simpler terms, a partition key will always belong to one node, and that partition’s data will always be found on that node.

Why is that important? If there were no absolute location for a partition’s data, finding it would require searching every node in the cluster. In a small cluster this may complete quickly, but in a much larger cluster it would be painfully slow.
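
The lookup described above can be sketched as a small token ring. Everything here is a simplified assumption for illustration: the names `token_for` and `Ring` are hypothetical, MD5 from the standard library stands in for Murmur3, each node holds a single token, and replicas are ignored. What the sketch shows is that the owner of a key is found by hashing and a binary search, without asking every node.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    # Stand-in hash (assumption): MD5 is used only because it ships with Python.
    # The first 8 bytes are read as a signed 64-bit integer, like a Murmur3 token.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    """Toy ring: each node owns the token range that ends at its own token."""

    def __init__(self, node_tokens):
        pairs = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in pairs]
        self._nodes = [node for _, node in pairs]

    def owner_of_token(self, token: int) -> str:
        # First node whose token is >= the key's token; wrap around past the end.
        i = bisect.bisect_left(self._tokens, token)
        return self._nodes[i % len(self._nodes)]

    def owner(self, partition_key: str) -> str:
        return self.owner_of_token(token_for(partition_key))


ring = Ring({"node1": -(2 ** 62), "node2": 0, "node3": 2 ** 62})
print(ring.owner_of_token(-5))           # node2
print(ring.owner_of_token(2 ** 62 + 1))  # node1 (wraps around the ring)
print(ring.owner("www.datastax.com") == ring.owner("www.datastax.com"))  # True
```

The same key always hashes to the same token, so it always resolves to the same node, which is the property the text relies on.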

5. Cassandra data modeling concept

Data modeling is a process that involves identifying the entities (items to be stored) and the relationships between entities. To create your data model, identify the patterns used to access data and the types of queries to be performed. These two ideas inform the organization and structure of the data, and the design and creation of the database's tables. Indexing the data can either improve or degrade query performance, so understanding indexing is an important step in the data modeling process.

Data modeling in Cassandra uses a query-driven approach, in which specific queries are the key to organizing the data. Cassandra's database design is based on the requirement for fast reads and writes, so the better the schema design, the faster data is written and retrieved.

In contrast, with relational databases the data is first normalized based on the tables and relationships designed, and the queries are written afterwards. Data modeling in relational databases is table-driven, and any relationships between tables are expressed as table joins in queries.
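
One common consequence of the query-driven approach is that the same data is written to one table per query instead of being joined at read time. The sketch below is a hypothetical in-memory illustration: the two queries, the table names `videos_by_user` and `videos_by_id`, and the sample data are all assumptions, not part of the original text.

```python
from collections import defaultdict

# One "table" per query, each keyed by what that query filters on.
videos_by_user = defaultdict(list)  # Q1: all videos of a user (key: userid)
videos_by_id = {}                   # Q2: one video by its id (key: videoid)


def add_video(userid, videoid, name):
    # The same fact is written once per query table, so reads need no join.
    videos_by_user[userid].append((videoid, name))
    videos_by_id[videoid] = (userid, name)


add_video("u1", "v1", "Intro")
add_video("u1", "v2", "Ring design")
add_video("u2", "v3", "Counters")

print(videos_by_user["u1"])  # [('v1', 'Intro'), ('v2', 'Ring design')]
print(videos_by_id["v3"])    # ('u2', 'Counters')
```

Each query is answered by a single lookup on the key it filters by, at the cost of writing the data more than once.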
