Data Density with Marc Selwan | Ep. 116 Distributed Data Show





Summary: Marc Selwan, Product Manager for the core database at DataStax, joins the show to discuss why we recently updated our recommended maximum data density for DataStax Enterprise nodes from 1 TB to 2 TB, the engineering behind the new guidance, and how it can save users money.

Highlights:

0:00 - Intro

0:27 - We learn about Marc's background, which spans everything from gaming to network operations, presales engineering to product management

3:03 - In 2013, Marc got involved with the Cassandra community and ended up joining DataStax. There are many ways to contribute to an open source community besides being a committer.

5:30 - Historically, DataStax's recommended guidance for Cassandra and DataStax Enterprise deployments has been to limit stored data size (aka node density) to about 1 TB of data per node.

8:25 - The DataStax Vanguard team has noted a tradeoff between latency and data density.

10:40 - Bootstrapping, decommissioning, and compaction are operational tasks for Cassandra and DSE that use increasing amounts of resources as data density grows, since more SSTables have to be read.

12:05 - The more data, the more operational workload. To improve performance, we need to decrease the amount of work.

13:15 - The 1 TB per node recommendation can be a significant factor in the cost of running clusters.

15:08 - In August 2019, DataStax released updated data density guidance for DSE 6, which has been out for over a year. 2 TB is now reasonable guidance for core workloads.

19:03 - The timing of the announcement was conservative - we took our time updating the guidance until we had thoroughly proven the new storage engine.

20:38 - The 2 TB guidance was tested with SizeTieredCompactionStrategy and TimeWindowCompactionStrategy. Testing with LeveledCompactionStrategy is still in progress, but its resulting guidance may not end up much higher than 1 TB.
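The cost angle discussed above comes down to simple arithmetic: doubling the per-node density guidance roughly halves the number of nodes needed to store the same dataset. A quick sketch (the dataset size, replication factor, and node counts below are hypothetical illustrations, not figures from the episode):

```python
import math

def nodes_needed(dataset_tb: float, replication_factor: int, density_tb: float) -> int:
    """Minimum node count to hold the fully replicated dataset at a given per-node density."""
    total_tb = dataset_tb * replication_factor  # total data stored across the cluster
    return math.ceil(total_tb / density_tb)

# Hypothetical cluster: 100 TB of unique data, replication factor 3.
old = nodes_needed(100, 3, 1.0)  # 1 TB/node guidance
new = nodes_needed(100, 3, 2.0)  # 2 TB/node guidance
print(old, new)  # the 2 TB guidance halves the node count
```

The same arithmetic also explains the flexibility point later in the episode: you can take the savings as fewer nodes, or keep the node count and store more data.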
22:50 - The response from customers has been very positive

24:10 - Testing in your own environment should include throughput tests up to maximum capacity, stress testing at 90% capacity, and latency testing against specific p99 SLA targets. The DSE Metrics Collector is useful; you can find some default dashboards at: https://github.com/datastax/dse-metric-reporter-dashboards

27:18 - Make sure to test bootstrapping and decommissioning - these streaming operations are CPU- and I/O-intensive and might impact your SLAs

29:54 - Zero-copy streaming, coming in Cassandra 4.0, will greatly improve streaming performance and may allow the density guidance to be increased again.

31:31 - Improvements that increase density improve the user experience of Cassandra and DSE as a whole and give you more flexibility to optimize for speed or cost.

33:05 - Wrapping up
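The p99 SLA check recommended at 24:10 is easy to script once you have latency samples from a load test. A minimal sketch using only the standard library (the sample values and SLA threshold are hypothetical; in practice the samples would come from your stress run or the DSE Metrics Collector):

```python
import statistics

def p99(samples: list[float]) -> float:
    """99th-percentile latency using the stdlib's inclusive quantile method."""
    return statistics.quantiles(samples, n=100, method="inclusive")[98]

# Hypothetical latency samples (ms) from a stress run at 90% capacity.
latencies_ms = [3.1, 2.8, 4.0, 3.5, 12.7, 3.3, 2.9, 5.1, 3.0, 45.2]
sla_ms = 50.0  # hypothetical p99 SLA target

observed = p99(latencies_ms)
print(f"p99 = {observed:.1f} ms, SLA met: {observed <= sla_ms}")
```

Running the same check before and after raising a node's density makes the latency/density tradeoff from 8:25 concrete for your own workload.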