To keep things simple, let us assume that the cluster holds a single table (column family).
Assumptions and Storage Formula
We will make the following assumptions for this example:
- 100 columns per row.
- 10,000 write operations per second and 500 read operations per second (95 percent writes and 5 percent reads).
- For the sake of simplicity, we will also assume that the number of reads and writes per second is uniform throughout the day.
- The combined size of each column name and its value is 30 bytes (20 bytes for the name and 10 bytes for the value).
- We are storing time series data, and hence the columns are all set to expire. We set the expiration to 1 month because we do not need data older than a month in our table.
- The average size of a primary key is 10 bytes.
- The replication factor is 3.
- The compaction strategy is size-tiered (50 percent storage overhead).
Let us calculate the storage requirements for one month for this example:
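One way to write the calculation is sketched below. It is a reconstruction from the per-column, per-row, and per-key overheads and the variables listed further down, so treat it as an assumption rather than a canonical formula. The constants 23, 23, and 32 are the overheads in bytes for an expiring column, a row, and a primary key respectively.

```latex
\[
\text{Storage} =
\Big[
  \big( \text{Number\_of\_columns} \times
        (\text{column\_name\_size} + \text{column\_value\_size} + 23) + 23 \big)
  \times \text{Number\_of\_rows}
  \;+\;
  \text{Number\_of\_rows} \times (32 + \text{Primary\_key\_size})
\Big]
\times \text{Replication\_Factor}
\times (1 + \text{Compaction\_Overhead})
\]
```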
This formula can in general be applied to calculate your storage requirements at a column family level.
Note also these important points:
- For a column that expires, as in our time series data, the storage overhead is 23 bytes per column.
- The storage overhead per row is 23 bytes.
- The storage overhead per key for the primary key is 32 bytes.
- For a regular column, the storage overhead is 15 bytes.
- The total number of rows written per month is 10,000 * 86,400 * 30 = 25,920,000,000 (writes per second * seconds per day * days per month).
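The row count in the last point can be checked with a few lines of Python. This is a minimal sketch that assumes every write creates a new row and that a month has 30 days; the function name `rows_retained` is hypothetical and used only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def rows_retained(writes_per_second: int, retention_days: int) -> int:
    """Rows held in the table once the retention window is full.

    Assumes a uniform write rate and one new row per write.
    """
    return writes_per_second * SECONDS_PER_DAY * retention_days


number_of_rows = rows_retained(writes_per_second=10_000, retention_days=30)
print(f"{number_of_rows:,}")  # 25,920,000,000
```

Because the columns expire after a month, this is also roughly the number of live rows the table holds in steady state.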
Here are our variables:
Number_of_columns = 100
column_name_size = 20
column_value_size = 10
Number_of_rows = 25920000000
Primary_key_size = 10
Replication_Factor = 3
Compaction_Overhead = 0.5 (50 percent)
Applying the preceding formula to the values from our example gives a storage requirement of 569 TB. We therefore need to provision at least 569 TB of disk space for this cluster.
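The arithmetic can be sketched in Python as follows. The way the terms are combined here is an assumption reconstructed from the overheads and variables listed above, and the function name `storage_bytes` is hypothetical. With these inputs the result is about 569 when expressed in binary terabytes (2^40 bytes), which matches the figure quoted; in decimal terabytes (10^12 bytes) the same byte count is closer to 626.

```python
def storage_bytes(number_of_columns, column_name_size, column_value_size,
                  number_of_rows, primary_key_size, replication_factor,
                  compaction_overhead,
                  column_overhead=23,  # bytes per expiring column (15 for a regular column)
                  row_overhead=23,     # bytes per row
                  key_overhead=32):    # bytes per primary key
    """Estimate on-disk bytes for one table (assumed formula, see lead-in)."""
    # Bytes for the columns of a single row, plus the per-row overhead.
    row_data = (number_of_columns
                * (column_name_size + column_value_size + column_overhead)
                + row_overhead)
    # Bytes for the primary key of a single row, plus the per-key overhead.
    key_data = key_overhead + primary_key_size
    # Size of one copy of the data, before replication and compaction headroom.
    single_copy = number_of_rows * (row_data + key_data)
    return single_copy * replication_factor * (1 + compaction_overhead)


total = storage_bytes(
    number_of_columns=100,
    column_name_size=20,
    column_value_size=10,
    number_of_rows=25_920_000_000,
    primary_key_size=10,
    replication_factor=3,
    compaction_overhead=0.5,
)
print(f"{total / 2**40:.1f} TiB")   # binary terabytes
print(f"{total / 10**12:.1f} TB")   # decimal terabytes
```

Keeping the overheads as keyword arguments makes it easy to rerun the estimate for a table with regular (non-expiring) columns by passing `column_overhead=15`.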