To keep things simple, let us assume that the cluster holds a single table (column family).

Assumptions and Storage Formula

We will make the following assumptions for this example:

• 100 columns per row.
• 10,000 write operations per second and 500 read operations per second (roughly 95 percent writes and 5 percent reads).
• For the sake of simplicity, we will also assume that the read and write rates are uniform throughout the day.
• The combined size of a column name and its column value is 30 bytes.
• We are storing time series data, and hence the columns are all set to expire. We set the expiration to 1 month because we do not need data older than a month in our table.
• The average size of a primary key is 10 bytes.
• The replication factor is 3.
• The compaction strategy is size-tiered (50 percent storage overhead).

Let us calculate the storage requirements for one month for this example:

Storage Requirements
``Storage requirement = (((Number_of_columns * (column_name_size + column_value_size + 23)) + 23) * Number_of_rows + Number_of_rows * (32 + primary_key_size)) * Replication_Factor * (1 + Compaction_Overhead)/1024/1024/1024/1024 TB``

In general, this formula can be applied to estimate storage requirements at the column family level.
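The formula translates directly into a few lines of code. The following is a minimal Python sketch rather than a definitive sizing tool; the function name `storage_requirement_tb` and its default arguments are illustrative choices, and the inputs in the usage lines are the assumptions of this example.

```python
TB = 1024 ** 4  # the formula divides by 1024 four times to get terabytes


def storage_requirement_tb(number_of_columns, column_name_size, column_value_size,
                           number_of_rows, primary_key_size, replication_factor,
                           compaction_overhead, column_overhead=23,
                           row_overhead=23, key_overhead=32):
    """Evaluate the storage formula above; sizes are in bytes, result in TB."""
    # Bytes for the columns of one row, plus the per-row overhead.
    row_bytes = number_of_columns * (
        column_name_size + column_value_size + column_overhead) + row_overhead
    # Bytes for the primary key of one row, plus the per-key overhead.
    key_bytes = key_overhead + primary_key_size
    raw_bytes = row_bytes * number_of_rows + number_of_rows * key_bytes
    return raw_bytes * replication_factor * (1 + compaction_overhead) / TB


# Inputs taken from the assumptions of this example.
rows_per_month = 10_000 * 86_400 * 30  # writes per second * seconds per day * days
estimate_tb = storage_requirement_tb(
    number_of_columns=100, column_name_size=20, column_value_size=10,
    number_of_rows=rows_per_month, primary_key_size=10,
    replication_factor=3, compaction_overhead=0.5)
print(f"Estimated storage for one month: {estimate_tb:.1f} TB")
```

Keeping the overheads as keyword arguments makes it easy to rerun the estimate for a table whose columns do not expire, by passing a different `column_overhead`.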

Note also these important points:

• The storage overhead per column is 23 bytes for a column that expires, as with time series.
• The storage overhead per row is 23 bytes.
• The storage overhead per primary key is 32 bytes.
• For a regular column, the storage overhead is 15 bytes.
• Treating each write as one new row, the total rows written per month are: 10,000 writes per second * 86,400 seconds per day * 30 days = 25,920,000,000
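The overheads listed above can be combined into a per-row byte count, which makes the cost of expiring columns visible. The following is a minimal Python sketch using this example's assumptions (100 columns, 30 bytes per column name and value, a 10-byte primary key, one new row per write); the helper `bytes_per_row` and the constant names are illustrative and not part of Cassandra.

```python
SECONDS_PER_DAY = 86_400

# Overheads quoted in the notes above, in bytes.
EXPIRING_COLUMN_OVERHEAD = 23
REGULAR_COLUMN_OVERHEAD = 15
ROW_OVERHEAD = 23
KEY_OVERHEAD = 32


def bytes_per_row(columns, name_plus_value_size, key_size, column_overhead):
    """On-disk bytes for one row before replication and compaction overhead."""
    column_bytes = columns * (name_plus_value_size + column_overhead)
    return column_bytes + ROW_OVERHEAD + KEY_OVERHEAD + key_size


rows_per_month = 10_000 * SECONDS_PER_DAY * 30
expiring = bytes_per_row(100, 30, 10, EXPIRING_COLUMN_OVERHEAD)
regular = bytes_per_row(100, 30, 10, REGULAR_COLUMN_OVERHEAD)

print(rows_per_month)     # 25920000000
print(expiring, regular)  # 5365 4565
```

With 100 columns per row, the extra 8 bytes per expiring column add 800 bytes to every row, which is why the per-column overhead dominates the estimate.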

Here are our variables:

Number_of_columns = 100
column_name_size = 20
column_value_size = 10
Number_of_rows = 25920000000
primary_key_size = 10
Replication_Factor = 3