Partition-based Compaction with Matija Gobec | Ep. 121 Distributed Data Show





Summary: Matija Gobec shares with Patrick McFadin why he started working on a new compaction strategy for Apache Cassandra and how the Cassandra community can collaborate more effectively to introduce new capabilities such as partition-based compaction.

Highlights:
0:00 - Patrick welcomes Matija Gobec to the show.
1:30 - Matija introduces the concept of compaction in Cassandra and some of the challenges with existing compaction strategies.
2:53 - Existing strategies include size-tiered compaction (the default) and leveled compaction. Many companies don't understand the value they can get out of leveled compaction, especially for read-heavy workloads.
4:54 - Leveled compaction works well on AWS I2 instances, which are backed by SSDs that can handle the I/O associated with promoting data between levels and compacting SSTables at each level.
6:35 - Matija is working on a new partition-based compaction strategy (PBCS). This strategy combines elements of the leveled and size-tiered compaction strategies. The first tier (tier 0) consists of SSTables that are flushed directly from memory. The second tier (tier 1) is partitioned by token ranges. Data is compacted by partition.
7:45 - The goal is to serve reads for a partition from a single file. You can specify the desired number of partitions per SSTable, which helps to control the file sizes.
10:48 - In tier 1, every partition exists in only one SSTable. This could be a good fit for time series data models that don't fit the time window compaction strategy, for example if data arrives out of order.
13:14 - The primary goal of this strategy is to guarantee that reads are served from a single SSTable. It also ended up addressing some of the weaknesses of leveled compaction, such as write amplification.
14:03 - Matija is collaborating with Joey Lynch from Netflix, who has been working on something similar, with the goal of releasing an open source implementation by the end of 2019. Getting some of the large Cassandra users to test this out will really help.
15:50 - Patrick and Matija look forward to the development of a Cassandra Improvement Process that would facilitate collaboration and help efforts like Matija's gain acceptance more quickly.
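
The two-tier layout described between 6:35 and 10:48 can be pictured with a small toy model. The Python sketch below is only an illustration and is not Matija's implementation or any Cassandra API: the class name `ToyPartitionedStore`, the `partitions_per_sstable` parameter, and the use of plain dictionaries as stand-ins for SSTables are all assumptions made for this example. It also rewrites all of tier 1 on every compaction, a simplification that a real strategy would presumably avoid.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartitionedStore:
    """Toy two-tier layout; every name here is an illustrative assumption."""
    partitions_per_sstable: int = 2            # hypothetical tuning knob
    tier0: list = field(default_factory=list)  # flushed tables, token ranges may overlap
    tier1: list = field(default_factory=list)  # tables with disjoint, sorted token ranges

    def flush(self, memtable):
        """Simulate a memtable flush: the table lands in tier 0 unchanged."""
        self.tier0.append(dict(memtable))

    def compact(self):
        """Merge tier 0 into tier 1, then re-split by partition count."""
        merged = {}
        for table in self.tier1 + self.tier0:
            for token, (ts, value) in table.items():
                # The newest write timestamp wins for each partition.
                if token not in merged or ts > merged[token][0]:
                    merged[token] = (ts, value)
        tokens = sorted(merged)
        n = self.partitions_per_sstable
        # Consecutive runs of sorted tokens form disjoint token ranges.
        self.tier1 = [
            {t: merged[t] for t in tokens[i:i + n]}
            for i in range(0, len(tokens), n)
        ]
        self.tier0 = []

    def tables_for_read(self, token):
        """Return the tables a read of `token` would have to consult."""
        hits = [t for t in self.tier0 if token in t]
        hits += [t for t in self.tier1 if token in t]
        return hits


# Each value is a (write_timestamp, payload) pair keyed by partition token.
store = ToyPartitionedStore(partitions_per_sstable=2)
store.flush({10: (1, "a"), 20: (1, "b")})
store.flush({10: (2, "a2"), 30: (1, "c")})
print(len(store.tables_for_read(10)))  # 2: partition 10 sits in both flushed tables
store.compact()
print(len(store.tables_for_read(10)))  # 1: a single tier-1 table now holds partition 10
```

In this toy model, a read touches several tables while data still sits in tier 0, and exactly one table once the partition has been compacted into tier 1, which is the single-file read property the episode describes.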