
Intro to Cassandra Data Model

 

Non-relational, sparse model designed for high scale distributed storage


Data Model – Example Column Families


Data Model – Super & Composite column


Don’t think of a Relational Table

Instead, think of a nested, sorted map data structure

SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>

Why?

• Physical model is more similar to a sorted map than to a relational table.

How?

  • Map gives efficient key lookup & sorted nature gives efficient scans
  • Unbounded no. of column keys
  • Key can itself hold value

Each column also has an associated timestamp; ignore it during modeling.

Refinement - Think of outer map as unsorted

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

Why?

• Row keys are sorted in natural order only if the Order-Preserving Partitioner (OPP) is used, and OPP is not recommended!
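This refined shape can be sketched in Python (a toy in-memory sketch, not a Cassandra client; all names are illustrative):

```python
# Toy sketch of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>.
# The outer dict is unsorted (as under RandomPartitioner); within a row,
# column keys are kept sorted, which is what makes slices/scans cheap.
from bisect import insort

class Row:
    def __init__(self):
        self._keys = []      # column keys, kept in sorted order
        self._cols = {}      # column key -> column value

    def insert(self, key, value=b''):
        if key not in self._cols:
            insort(self._keys, key)       # maintain sort order on insert
        self._cols[key] = value           # existing key -> upsert

    def slice(self, start, end):
        # Ordered range scan over column keys -- efficient because sorted.
        return [(k, self._cols[k]) for k in self._keys if start <= k <= end]

store = {}                   # outer, unsorted map: row key -> Row
store['user:42'] = row = Row()
row.insert('city', b'Austin')
row.insert('age', b'29')
row.insert('name', b'Alice')

# Columns come back sorted by column key, regardless of insertion order.
print([k for k, _ in row.slice('a', 'z')])   # -> ['age', 'city', 'name']
```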


How about a super column?

Map<RowKey, SortedMap<SuperColumnKey,SortedMap<ColumnKey, ColumnValue>>>

Super columns are the past! Instead, use composite column names:
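The two shapes can be compared in a small Python sketch (illustrative names; tuples stand in for composite column names, since both sort component by component):

```python
# A super column is a two-level map inside a row:
# SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>
super_style = {
    'address': {'city': b'Austin', 'zip': b'78701'},
    'contact': {'email': b'a@b.c'},
}

# A composite column flattens the same hierarchy into one level,
# using a tuple as the column name; tuples compare component by
# component, so related columns still end up physically adjacent.
composite_style = {
    (super_key, col_key): value
    for super_key, cols in super_style.items()
    for col_key, value in cols.items()
}

print(sorted(composite_style))
# -> [('address', 'city'), ('address', 'zip'), ('contact', 'email')]
```

Unlike super columns, nothing stops a composite name from having three or more components.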

CQL should not influence modeling

  1. Best is to forget CQL during modeling exercise
  2. It’s a relational-style interface on top of non-relational model.
  3. Don’t think in relational or CQL-way while designing model!

Storing value in column name is perfectly ok

Leaving ‘column value’ empty (Valueless Column) is also ok

  • 64KB is max for column key. Don't store long text fields, such as item descriptions!
  • 2 GB is max for column value. But limit the size to only a few MBs as there is no streaming.

Use wide row for ordering, grouping and filtering

  • Since column names are stored sorted, wide rows enable ordering of data and hence efficient filtering.
  • Group data queried together in a wide row to read back efficiently, in one query.
  • Wide rows are heavily used with composite columns to build custom indexes.

[Figure: storing time series in a wide row]
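As a sketch of the time-series case (the sensor name and day-bucketing scheme are illustrative assumptions):

```python
# One wide row per sensor per day; the column key is the reading's
# timestamp, so columns are stored in time order and a whole day can be
# read back, ordered, with a single row lookup.
readings = {}    # row key -> {column key (timestamp) -> value}

def record(sensor_id, day, ts, value):
    row_key = f"{sensor_id}|{day}"       # group one sensor-day together
    readings.setdefault(row_key, {})[ts] = value

record('sensor1', '2013-06-01', '09:00:01', b'21.5')
record('sensor1', '2013-06-01', '09:00:02', b'21.7')

print(sorted(readings['sensor1|2013-06-01']))
# -> ['09:00:01', '09:00:02']
```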

But, don’t go too wide!

  • Because a row is never split across nodes

Traffic
All of the traffic related to one row is handled by only one node/shard (by a single set of replicas, to be more precise).

Size

Data for a single row must fit on disk within a single node in the cluster.

Choose proper row key – It’s your “shard key”

Or you’ll end up with hot spots, even with Random Partitioner

Example:

Bad row key: “ddmmyyhh”
Better row key: “ddmmyyhh|eventtype”
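A sketch of why the second key is better (the event types are illustrative): under the bad key, every event in an hour shares one row, and a row is served by a single replica set.

```python
# With 'ddmmyyhh' alone, every event in a given hour lands in the same
# row -- and one row lives on one replica set, so that node runs hot.
def bad_row_key(event):
    return event['hour']

# Appending the event type splits the hour across several rows, which
# the partitioner then spreads across the cluster.
def better_row_key(event):
    return f"{event['hour']}|{event['type']}"

events = [{'hour': '01061310', 'type': t} for t in ('click', 'view', 'buy')]
print(len({bad_row_key(e) for e in events}))      # -> 1 (one hot row)
print(len({better_row_key(e) for e in events}))   # -> 3 (spread out)
```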

Make sure column key and row key are unique

Otherwise, data could get accidentally overwritten

No unique constraint enforcement, of course.

CQL INSERT and UPDATE are semantically the same – UPSERT

Example:

Timestamp alone as a column name can cause collisions

Use TimeUUID to avoid collisions.
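In Python terms this maps to version-1 (time-based) UUIDs, which keep a time component for ordering while guaranteeing uniqueness:

```python
import time
import uuid

# Two events in the same millisecond collide if the raw timestamp is
# the column key -- the second write silently overwrites the first.
ts_a = int(time.time() * 1000)
ts_b = ts_a                      # same millisecond, same column key

# A version-1 UUID embeds the timestamp but also a clock sequence and
# node id, so column keys stay unique even within one clock tick.
u1 = uuid.uuid1()
u2 = uuid.uuid1()
print(u1 != u2)                  # -> True
```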


Define correct comparator & validator

Don’t just use the default BytesType comparator and validator

An inappropriate comparator can store data (column names) in the wrong order.

Costly or impossible to do column slice/scans later.

Can’t change comparator once defined, without data migration.

Validator helps to validate column value & row key. Can change later.

Validator - Data type for a column value or row key.
Comparator - Data type for column keys; defines the sort order of column keys.
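The effect of the comparator is easy to see with numeric column keys: under BytesType everything sorts lexicographically as raw bytes, so numbers come back in the wrong order (a Python analogy, not Cassandra internals):

```python
keys = [2, 10, 1]

# BytesType-like: compare the raw byte representation lexicographically.
bytes_order = sorted(str(k).encode() for k in keys)
print(bytes_order)   # -> [b'1', b'10', b'2']  (10 sorts before 2!)

# LongType-like: compare as integers.
long_order = sorted(keys)
print(long_order)    # -> [1, 2, 10]
```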

 

Favor composite column over super column

Composite column supports features of super columns & more

Concerns with Super column:

  • Sub-columns are not indexed. Reading one sub-column deserializes all sub-columns.
  • Built-in secondary index does not work with sub-columns.
  • Cannot encode more than two layers of hierarchy.

Order of sub-columns in composite column matters

Order defines grouping

<state|city>

Ordered by State first and then by City. Cities will be grouped by state physically on the disk.

<city|state>

The other way around, which you don’t want.

Order affects your queries


It’s like compound index!

Assume,
CF with composite column name as <subcolumn1 | subcolumn2 | subcolumn3>

Not all sub-columns need to be present, but you can’t skip any either.

A query on ‘subcolumn1|subcolumn2’ is fine, but not on ‘subcolumn2’ alone.

Sub-columns passed after the sliced (scanned) sub-column are ignored.

Query on ‘subcolumn1|slice of subcolumn2|subcolumn3’ will ignore subcolumn3.

Correct order of sub-columns ultimately depends on your query patterns.
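These slice rules can be sketched with sorted tuples (the state/city/user values are illustrative):

```python
# Composite column names behave like a compound index: a slice must fix
# a prefix of the components, in order; it cannot skip the first and
# filter on the second alone, because non-prefix matches are scattered
# through the sort order.
columns = sorted([
    ('TX', 'Austin', 'u1'), ('TX', 'Austin', 'u2'),
    ('TX', 'Dallas', 'u3'), ('CA', 'San Jose', 'u4'),
])

def slice_prefix(cols, prefix):
    # Contiguous scan: keep columns whose leading components match.
    return [c for c in cols if c[:len(prefix)] == prefix]

# Fine: fix subcolumn1, then subcolumn2.
print(slice_prefix(columns, ('TX', 'Austin')))
# -> [('TX', 'Austin', 'u1'), ('TX', 'Austin', 'u2')]

# Filtering only on subcolumn2 ('Austin') has no contiguous range to
# scan -- that query needs a different column family or ordering.
```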

Model column families around query patterns

But start your design with entities and relationships, if you can

  • Not easy to tune or introduce new query patterns later by simply creating indexes or building complex queries using join, group by, etc.
  • Think about how you can organize data into the nested sorted map to satisfy your query requirements: fast lookup, ordering, grouping, filtering, aggregation, etc.

Identify the most frequent query patterns and isolate the less frequent.

Identify which queries are sensitive to latency and which are not.

De-normalize and duplicate for read performance

But don’t de-normalize if you don’t need to.

It’s all about finding the right balance.

Normalization in Relational world:

  • Pros: less data duplication, fewer data modification anomalies, conceptually cleaner, easier to maintain, and so on.
  • Cons: queries may perform slowly if many tables are joined, etc.

The same holds true in Cassandra, but the cons are magnified.

“Likes” relationship between User & Item


Option 1: Exact replica of relational model


There is no easy way to query:
- Items that a particular user has liked

- Users who liked a particular item

Option 2: Normalized entities with custom indexes


  • Normalized entities, except the user and item id mapping is stored twice
  • What if we want to get titles in addition to item ids, and usernames in addition to user ids?

– How many queries to get all usernames who liked a given item with a like-count of 100?

Option 3: Normalized entities with de-normalization into custom indexes


  • ‘Title’ and ‘Username’ are de-normalized now.
  • What if we want:
    - Given an item id, get all item data along with the usernames of users who liked the item.
    - Given a user id, get all user data along with the item titles liked by that user.

How many queries in the current model? Can it increase further if a user becomes active or an item becomes hot?

Option 4: Partially de-normalized entities


  • Looks messy. Just to save one query?
  • If User and Item are highly shared across domains, I would prefer option 3 at a constant cost of one additional query.

Best Option for this use case – Option 3

