Avoiding Common Cassandra Issues with Carlos Rolo | Ep. 119 Distributed Data Show




Distributed Data Show

Summary: Patrick McFadin connects with Carlos Rolo from Pythian at ApacheCon NA to recap the talk Carlos gave on some of the most common issues he sees in production Cassandra clusters and how to avoid them. You can listen to the full talk at https://feathercast.apache.org/2019/09/12/day-to-day-with-cassandra-the-weirdest-and-complex-situations-we-found-carlos-rolo/

0:00 - Patrick welcomes Carlos back to the show to recap his talk at ApacheCon about some of the worst cases he's seen with Cassandra clusters. New users frequently struggle with the transition to a distributed database and don't understand the implications of the decisions they're making.
2:46 - The biggest problem is choosing a bad data model. People forget about eventual consistency and are slow to learn where to use batches and lightweight transactions.
4:55 - Learning to select a partition key is important. If your query includes many partition keys in an IN clause, you are scanning the cluster. IN clauses are OK within the same partition.
8:47 - The solution is to do query-based data modeling. Often this is a challenge when the application is already in production. One quick workaround is to use the asynchronous operations in the drivers.
11:29 - As a DBA in the relational world, there were tricks that you could use to optimize queries. This is not the case with Cassandra: performance is largely determined by the original table design, which changes the dynamic between DBAs and developers.
13:32 - Batches can be effective if used correctly. They can be used for consistency across partitions, at a performance cost. Don't try to put too much data (by size) in a batch, and don't increase the warning thresholds - or ignore the warnings!
16:05 - Carlos has seen a number of problems caused by bad hardware choices - it's important to size your hardware correctly. Most choices in public cloud environments are OK, and you can jump up to larger instance sizes when needed.
19:14 - A final set of problems occurs when people use experimental features in Cassandra without really understanding them. Use the mailing list, community forums and documentation to find the answers.
22:25 - Patrick encourages all our listeners to share their knowledge and experiences of running Cassandra with the rest of the community. We all benefit from the talks you give and the blogs you write.
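
The workaround mentioned at 8:47 (replacing one query that lists many partition keys in an IN clause with one single-partition query per key, issued concurrently) can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the in-memory `FAKE_TABLE`, the `fetch_partition` helper and the `max_in_flight` parameter are all hypothetical stand-ins so the example runs without a cluster. In a real application each lookup would presumably go through the driver's own asynchronous call (for example `Session.execute_async` in the DataStax Python driver) rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a table partitioned by user_id.
# In a real application each lookup would be a single-partition query
# (WHERE user_id = ?) sent through the driver's asynchronous API.
FAKE_TABLE = {
    "alice": [("alice", "login"), ("alice", "purchase")],
    "bob": [("bob", "login")],
    "carol": [],
}


def fetch_partition(user_id):
    """Read one partition (hypothetical helper; replace with a driver call)."""
    return FAKE_TABLE.get(user_id, [])


def fetch_many(user_ids, max_in_flight=8):
    """Issue one single-partition read per key, concurrently.

    This takes the place of a single `WHERE user_id IN (...)` query,
    which would make one coordinator gather rows from many partitions.
    """
    user_ids = list(user_ids)  # allow generators; we iterate twice below
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() yields results in the same order as user_ids.
        results = pool.map(fetch_partition, user_ids)
        return dict(zip(user_ids, results))


if __name__ == "__main__":
    rows = fetch_many(["alice", "bob", "carol"])
    for user_id, events in rows.items():
        print(user_id, len(events))
```

Each read here touches exactly one partition, so the work is spread across the replicas that own each key instead of being funnelled through a single coordinator. As the notes say, this is a quick workaround; the longer-term fix is still query-based data modeling.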