Distributed Data Show

Summary: The Distributed Data Show is your weekly source for the latest news and technical expertise to help you succeed in building large-scale distributed systems. Brought to you by the DataStax Developer Advocate team, we go in-depth with DataStax engineers and special guests from the broader data community. New episodes each Tuesday.


Podcasts:

 How to keep your DBA happy | Ep. 129 Distributed Data Show | File Type: audio/mpeg | Duration: 00:22:35

Patrick and Jeff share why DBAs are the biggest cheerleaders for our Cassandra Developer Workshops and some of the top things DBAs wish developers knew about Cassandra. Highlights:
0:00 - Jeff and Patrick talk about our upcoming developer workshops and how we get into the code as quickly as possible:
https://www.datastax.com/events/cassandra-developer-workshop-new-york
https://www.datastax.com/events/cassandra-developer-workshop-london
https://www.datastax.com/events/cassandra-developer-workshop-paris
https://www.datastax.com/events/cassandra-application-launch-pad-developers-austin-tx
5:00 - We still provide training for DBAs as well; look for some upcoming public events.
6:40 - Cassandra's lightweight transactions (LWT) are often misunderstood. Use them to avoid race conditions, not as a general-purpose tool.
10:36 - Things DBAs wish developers knew: inserting nulls creates tombstones and results in more work for the database.
13:45 - Batches in Cassandra are not atomic (although some wars have been fought over that term), and they're not for bulk loading. Use them to keep writes to multiple tables in sync, but be careful about batch size. (See the sketch after these notes.)
20:17 - We'd love to see you at upcoming developer workshops in North America and Europe. Bring a friend!
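As a concrete illustration of that batch guidance, here is a minimal sketch using the DataStax Java driver with hypothetical tables (users_by_id and users_by_email): a small logged batch keeps two denormalized tables in sync, and only columns with actual values are written, since binding null would create tombstones.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BatchStatement;
import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class SyncTablesBatch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Logged batch: both writes eventually apply, but this is NOT an
            // isolated/atomic transaction, and it is not a bulk-load tool.
            // Only name the columns you have values for; inserting an
            // explicit null would write a tombstone.
            BatchStatement batch = BatchStatement.builder(DefaultBatchType.LOGGED)
                .addStatement(SimpleStatement.newInstance(
                    "INSERT INTO my_keyspace.users_by_id (id, email) VALUES (?, ?)",
                    "u1", "a@example.com"))
                .addStatement(SimpleStatement.newInstance(
                    "INSERT INTO my_keyspace.users_by_email (email, id) VALUES (?, ?)",
                    "a@example.com", "u1"))
                .build();
            session.execute(batch);
        }
    }
}
```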

 First Look at the DSE Kubernetes Operator with Christopher Bradford | Ep. 128 Distributed Data Show | File Type: audio/mpeg | Duration: 00:27:05

Recently, KubeCon 2019 was held in San Diego, where we released the DataStax Kubernetes Operator. With KubeCon wrapped up, I caught up with Christopher Bradford, Product Manager for the operator, about this release. Highlights:
00:30 - 01:12 The not-so-secret name of the DataStax Enterprise Operator for Kubernetes, or as I call it, the DSE K8s Operator.
01:13 - 02:28 Discussing KubeCon, which reached 12k attendees this year!
02:29 - 04:19 What is the DataStax mission in creating a K8s Operator? How to squelch the complexity of Kubernetes!
04:20 - 06:06 Delving into the idea behind custom resources, the architectural setup around "Cassandra Racks and Datacenters", and how that maps to Kubernetes and the Operator managing these within a Kubernetes cluster.
06:07 - 07:59 Where do the container images live, who maintains them, and how can you get them for use with Kubernetes?
08:00 - 11:59 What about support, business impacts, and what are the plans for the human operators around this? How can a company get support for this product?
12:00 - 15:28 What about the other Kubernetes Operators for Apache Cassandra? We discuss some of the options (see the links below under the references) and the differences between those operators and the DSE Operator.
15:29 - 23:14 It's important to discuss the current state of stateful workloads, elaborating on how we can now have pets, and not just cattle, for specific systems like this in Kubernetes!
23:15 - 25:13 Last-minute thoughts on KubeCon & Kubernetes.
25:14 - [end] Take it for a test spin in beta! Materials, references, etc. that were mentioned are posted below.
K8s Operators for DataStax Enterprise:
https://github.com/datastax/labs/tree/master/dse-k8s-operator
https://www.datastax.com/blog/2019/11/simplifying-datastax-enterprise-deployments-kubernetes-containerized-workflows
https://www.datastax.com/dev/kubernetes
K8s Operators for Apache Cassandra:
https://github.com/jetstack/navigator
https://github.com/vgkowski/cassandra-operator
https://github.com/Orange-OpenSource/cassandra-k8s-operator
https://github.com/instaclustr/cassandra-operator

 Microservices Reboot with Cedrick Lunven | Ep. 127 Distributed Data Show | File Type: audio/mpeg | Duration: 00:20:37

We previously discussed microservices almost two years ago (https://www.youtube.com/watch?v=z1Kanef1QTk); now we're revisiting the topic with Cedrick and Aleks to ask: how have microservices evolved since then? Related links and a webinar, including a demo, can be found on our website.
Slides: https://drive.google.com/drive/u/1/folders/1sjgR2dnB4Hivcd00TLpPBi9KHpracamB
Recordings:
https://www.datastax.com/resources/webinar/scaler-vos-microservices-avec-apache-cassandratm (FR, more advanced)
https://www.datastax.com/resources/webinar/speed-dating-apache-cassandratm (EN, introduction)

 A Frontend Developer's Guide to the Galaxy with Cristina and David | Ep. 126 Distributed Data Show | File Type: audio/mpeg | Duration: 00:17:23

In this episode, Cristina and David talk about Apollo, what it is, and how it can benefit frontend developers by bypassing the backend altogether. Highlights:
1:36 - What is Apollo?
3:15 - How does Apollo work on the operational side?
5:40 - How does Apollo approach security?
8:00 - Any other features that could benefit frontend devs?
9:09 - The user experience of Apollo
9:34 - Orders of magnitude: setting up Apollo
11:12 - A UI can make or break a product
12:00 - Easy & obvious is our motto
12:08 - Cristina's FED seal of approval
12:40 - Most programmers are in the mid-level to junior range
12:25 - Apollo's 'getting started' docs are awesome
15:15 - How can this help FEDs with their web apps?

 Porting an App to DataStax Apollo with Jeff Carpenter | Ep. 125 Distributed Data Show | File Type: audio/mpeg | Duration: 00:25:16

David Gilardi turns the tables on Jeff Carpenter, asking him what it took to port a Java microservice from Apache Cassandra to Apollo, DataStax's Apache Cassandra as a service, which is in a free public beta.
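For context on what the porting exercise involves, here is a minimal connection sketch, assuming Java driver 4.x and placeholder bundle path, credentials, and keyspace: the main code change when moving from a self-managed cluster to Apollo is swapping contact points for the secure connect bundle.

```java
import java.nio.file.Paths;
import com.datastax.oss.driver.api.core.CqlSession;

public class ApolloConnect {
    public static void main(String[] args) {
        // The secure connect bundle is downloaded from the Apollo console;
        // the path, credentials, and keyspace below are placeholders.
        try (CqlSession session = CqlSession.builder()
                .withCloudSecureConnectBundle(Paths.get("/path/to/secure-connect-db.zip"))
                .withAuthCredentials("db_user", "db_password")
                .withKeyspace("my_keyspace")
                .build()) {
            // Quick sanity check that the session is live
            System.out.println(session
                .execute("SELECT release_version FROM system.local")
                .one().getString("release_version"));
        }
    }
}
```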

 TinkerPop 4.0 Preview with Marko Rodriguez | Ep. 124 Distributed Data Show | File Type: audio/mpeg | Duration: 00:18:33

Graph database veteran Marko Rodriguez joins the show to discuss how he got involved with graph databases and the Apache TinkerPop project, what's coming in TinkerPop 4.0, his work in distributed computing research and stream ring theory, and the origin of that lovable Gremlin character. Highlights:
0:00 - Jeff welcomes Marko to the show to discuss a decade of work in the graph database community, starting at the Los Alamos National Laboratory when he needed a better way to store data. Graph databases were actually Marko's first introduction to databases, and he had to learn RDBMS later!
2:29 - Marko has been working with Apache TinkerPop and Gremlin for about 10 years. The Gremlin character came out of some comic books that Ketrina Yim created to teach kids about graph database concepts.
4:05 - The world of graph technology has changed a lot over the past 10 years. For the first 5 years the community was very small, and there weren't a lot of customers. Over the past 5 years graph has become mainstream and there is a lot more discussion.
5:41 - There were multiple factors in getting this technology to critical mass, including having multiple vendors supporting the project and the project being a part of the Apache Software Foundation.
6:40 - What's coming in the TinkerPop 4.0 release: first, this release is a complete rewrite of the Gremlin virtual machine. This is a distributed virtual machine that works with RxJava, Apache Spark, and Apache Flink. Users can write in multiple languages, including Gremlin, which compile down to a standard bytecode.
9:27 - There are three architectural components: the storage engine (the primary structure of vertices and edges and secondary structures including indexes, sort order, etc.), the processing engine (how data is read and manipulated, whether on a single machine or distributed), and the query language (which compiles to a bytecode that is executed by the processing engine; see the traversal sketch after these notes).
11:40 - The TinkerPop 4.0 release introduces the concept of the MMADT: the multi-model abstract data type can be used to represent any data model, including tabular, graph, document, and key-value. This enables an extensible architecture for distributed computing.
13:43 - The decision for the 4.0 release will be how much of the vision of MMADT to incorporate at this time vs. moving that work to a new project. A draft version of the MMADT specification is available.
15:25 - During his sabbatical, Marko began working on stream ring theory to create a unified processing model for distributed computing. There is a common algebra that various distributed computing platforms like Spark, Flink, Akka, etc. can fit into.
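To make the "query language compiles to bytecode" point concrete, here is a minimal Gremlin traversal written in Java against TinkerPop's bundled "modern" toy graph (TinkerPop 3.x APIs; the 4.0 work described above rebuilds the virtual machine that executes traversals like this one).

```java
import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class GremlinExample {
    public static void main(String[] args) {
        // The "modern" toy graph ships with TinkerPop for experimentation
        Graph graph = TinkerFactory.createModern();
        GraphTraversalSource g = graph.traversal();

        // Who does marko know? The traversal below is translated to Gremlin
        // bytecode before the processing engine executes it.
        List<Object> names = g.V().has("person", "name", "marko")
                              .out("knows")
                              .values("name")
                              .toList();
        System.out.println(names); // [vadas, josh]
    }
}
```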

 Cassandra at Orange with Jean-Armel Luce | Ep. 123 Distributed Data Show | File Type: audio/mpeg | Duration: 00:20:30

Jeff Carpenter talks with Jean-Armel Luce about the usage of Cassandra at French telecommunications provider Orange, all the way from their first work with Cassandra in 2012 to their open-source Kubernetes operator for Cassandra, which is expected to be production-ready in 2020. Highlights:
0:00 - Jeff welcomes Jean-Armel to the show. Orange has been using Cassandra since 2012. They started using it on their customer dataset, which was multiple TB with thousands of requests per second, and were able to achieve higher throughput.
3:03 - Orange is also using Cassandra for storing security and infrastructure information. They are working on providing a Cassandra database-as-a-service offering within Orange to better support teams that need dedicated clusters.
5:27 - The overall results have been much higher availability and scalability, for example adding capacity without downtime and having active-active configurations in multiple data centers.
7:25 - The biggest challenge came from an application that was writing millions of tombstones into a Cassandra 2.1 cluster, which they discovered after three days. They had to execute a large compaction but were able to recover without downtime.
9:52 - Orange has built web services on top of Cassandra which provide RESTful key-value APIs. They've also developed their own graph database on top of Cassandra.
12:47 - Orange is running Cassandra in multiple data centers across France, and is investigating running Cassandra in public clouds.
13:45 - Orange has built its own automation tools based on Ansible, and is working on a Cassandra Kubernetes operator which it hopes to take to production in 2020. The project is open sourced at: https://github.com/Orange-OpenSource/cassandra-k8s-operator
15:15 - They began working on the operator about two years ago. The operator currently supports the ability to scale up and down, schedule repairs, and modify configuration parameters.
16:27 - The remaining work includes building a multi-region operator.
17:27 - They are also working to incorporate a service mesh to help with dissemination of encryption keys for Cassandra and other applications. They had some initial challenges when using Istio.
20:04 - Wrapping up

 Instagram's Cassandra Layer with Michaël Figuière | Ep. 122 Distributed Data Show | File Type: audio/mpeg | Duration: 00:19:33

Instagram engineer Michaël Figuière talks with Jeff Carpenter about the Cassandra abstraction layer his team maintains, how it helps development teams move faster, and when companies should consider creating their own abstractions. Highlights:
0:00 - Welcoming Michaël to the show and recapping past conversations on Cassandra usage at Instagram, including the RocksDB storage engine we discussed with Dikang Gu and the geographic replication approach that Andrew Whang presented at the 2019 DataStax Accelerate conference.
1:50 - Michaël joined DataStax in 2012 to lead the driver team, with a particular focus on Java. Since then he's worked for some large Cassandra users, including Instagram.
3:54 - Michaël encourages users to provide feedback to the DataStax drivers team. It's a great thing when companies in the Cassandra community help each other out!
5:45 - Michaël is now at Instagram, one of the biggest users of Cassandra, with over 1 billion monthly active users spread across the globe.
7:09 - Instagram has built an abstraction layer in front of Cassandra that limits which CQL features are exposed. This stateless API is based on a simplified version of the original Cassandra Thrift API, since Thrift is the main RPC interface style in use at Instagram. (A hypothetical sketch follows these notes.)
9:22 - Having an abstraction layer provides an insertion point for doing double writes to main and shadow clusters for testing or hardware migration purposes.
11:38 - The abstraction layer helps teams move faster by making it simpler to experiment with new client-side configurations for load balancing, traffic optimization, and monitoring.
14:01 - One thing they haven't tried to abstract away is the Cassandra data modeling best practice of creating a denormalized table per query. Michaël's team provides a UI to allow internal customers to define their tables, including desired quality of service and anticipated growth.
15:56 - Instagram uses a homegrown graph database built on MySQL as a significant part of their data infrastructure alongside Cassandra.
16:57 - Challenges in building the Cassandra abstraction layer included the migration from the Thrift API to CQL and optimizing load balancing given Instagram's unique networking requirements.
18:25 - Companies may want to investigate creating their own Cassandra abstraction layers when their usage reaches a certain scale.
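Instagram's layer is internal, so what follows is a purely hypothetical sketch (all names invented here) of the shape such an API might take: a narrow, stateless get/put facade lets the owning team swap drivers, tune load balancing, or double-write to a shadow cluster without touching callers.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletionStage;

/**
 * Hypothetical sketch of a narrow storage API like the one described in the
 * episode: callers never see CQL, so the team behind the facade is free to
 * change drivers, client-side configuration, or even double-write to a
 * shadow cluster for testing and migrations.
 */
public interface KeyValueStore {
    CompletionStage<Void> put(String table, String key, Map<String, String> columns);
    CompletionStage<Optional<Map<String, String>>> get(String table, String key);
    CompletionStage<Void> delete(String table, String key);
}
```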

 Partition-based Compaction with Matija Gobec | Ep. 121 Distributed Data Show | File Type: audio/mpeg | Duration: 00:17:04

Matija Gobec shares with Patrick McFadin why he started working on a new compaction strategy for Apache Cassandra and how the Cassandra community can collaborate more effectively to introduce new capabilities such as partition-based compaction. Highlights:
0:00 - Patrick welcomes Matija Gobec to the show.
1:30 - Matija introduces the concept of compaction in Cassandra and some of the challenges with existing compaction strategies.
2:53 - Existing strategies include size-tiered compaction (the default) and leveled compaction. Many companies don't understand the value they can get out of leveled compaction, especially for read-heavy workloads.
4:54 - Leveled compaction works well on AWS I2 instances, which are backed by SSDs that can handle the I/O associated with promoting between levels and compacting SSTables at each level.
6:35 - Matija is working on a new partition-based compaction strategy (PBCS). This strategy leverages a mix of the leveled and size-tiered compaction strategies. The first tier (tier 0) consists of SSTables that are flushed directly from memory. The second tier (tier 1) is partitioned by token ranges. Data is compacted by partition.
7:45 - The goal is to serve reads from a partition from a single file. You can specify the desired number of partitions per SSTable, which helps to control the file sizes.
10:48 - In tier 1, every partition exists in only one SSTable. This could be a good fit for time series data models that don't fit the time window compaction strategy, for example if you have data that is arriving out of order.
13:14 - The primary goal of this strategy is to get the guarantee of serving reads from a single SSTable. It also ended up addressing some of the weaknesses of leveled compaction, such as write amplification. (See the configuration sketch after these notes.)
14:03 - Matija is collaborating with Joey Lynch from Netflix, who has been working on something similar, with the goal of releasing an open source implementation by the end of 2019. Getting some of the large Cassandra users to test this out will really help.
15:50 - Patrick and Matija look forward to the development of a Cassandra Improvement Process that would facilitate collaboration and help efforts like Matija's gain acceptance more quickly.
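PBCS is not yet released, but for context, compaction strategy is already a per-table knob in Cassandra. Here is a minimal sketch (placeholder keyspace and table) of switching a read-heavy table to leveled compaction, whose few-SSTables-per-read behavior PBCS aims to tighten to exactly one SSTable.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class CompactionConfig {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Compaction is chosen per table. LeveledCompactionStrategy trades
            // extra write amplification for reads that touch few SSTables;
            // sstable_size_in_mb controls the size of files at each level.
            session.execute(
                "ALTER TABLE my_keyspace.readings "
                + "WITH compaction = {'class': 'LeveledCompactionStrategy', "
                + "'sstable_size_in_mb': '160'}");
        }
    }
}
```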

 Transactions on Cassandra with Hiroyuki Yamada | Ep. 120 Distributed Data Show | File Type: audio/mpeg | Duration: 00:25:35

Jeff Carpenter talks with Hiroyuki Yamada of Scalar to learn how his team built Scalar DB, which provides a transaction capability on top of Apache Cassandra. To find out more, visit https://github.com/scalar-labs/scalardb Highlights:
0:00 - Jeff Carpenter welcomes Hiroyuki Yamada to the show. Hiroyuki gave a very interesting talk at ApacheCon on building transactions on top of Cassandra.
1:40 - Hiroyuki shares his background in distributed systems and analytics.
2:10 - He got involved with Cassandra while looking for a highly scalable database.
2:53 - Cassandra historically is pretty conservative with respect to providing consistency guarantees; what was the idea behind building ACID transactions?
5:34 - Scalar has two products: Scalar DB provides ACID-compliant transactions, and Scalar DL uses Scalar DB to provide a ledger-style database.
6:57 - Scalar DB is architected so that it can be implemented on top of multiple underlying databases; Cassandra is the first implementation.
8:00 - Scalar DB is a Java library which provides an API based on the concept of records. You'll add metadata columns to your underlying Cassandra table; there is also a coordinator table used internally to track transaction status.
9:14 - The API is a Java interface with get/put semantics and simple CRUD operations. Transactions consist of a start statement, multiple put statements, then a commit statement.
10:30 - The implementation uses optimistic concurrency control; conflicts are detected at commit time. The algorithm is based on the Cherry Garcia protocol.
12:33 - The implementation uses Cassandra lightweight transactions (LWT) as a linearizability primitive. Scalar DB is the first implementation of the algorithm in the Cherry Garcia paper. (See the sketch of the LWT primitive after these notes.)
14:07 - Testing has included destructive testing with Jepsen over the past two years. They have passed the Jepsen tests on a 100-node cluster and have demonstrated linear scalability.
15:23 - The Scalar DB performance benchmark demonstrated 100,000 transactions/sec on a 100-node cluster of I3 4XL machines, or roughly 1,000 transactions per second per node.
16:58 - The latency on a highly concurrent workload is typically around 100ms per transaction.
17:30 - Transactions work across multiple data centers, but latency will be higher.
18:46 - The performance tests used a payment processing scenario which involved lots of reads, writes, and in-place updates. They also tested a second scenario with an append-only ledger model.
20:00 - This work resulted in a recommended improvement to Cassandra's commit log handling.
21:36 - Next steps include improving the tooling around Scalar DB, including a backup tool for Cassandra that takes a transactionally consistent backup, and improvements to Scalar DL.
23:05 - Wrapping up: having Jepsen available may help decrease the time to maturity for new distributed database technologies.
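Scalar DB's own API lives in its GitHub repo; shown here instead is a minimal sketch of the underlying primitive the episode mentions: a Cassandra lightweight transaction issued through the Java driver (placeholder keyspace and coordinator-style table), whose conditional apply/reject result is the linearizable building block a commit protocol can rely on.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class LwtPrimitive {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Conditional write: Paxos under the hood gives a linearizable
            // compare-and-set within this partition. Placeholder table.
            ResultSet rs = session.execute(SimpleStatement.newInstance(
                "UPDATE my_keyspace.tx_status SET status = 'COMMITTED' "
                + "WHERE tx_id = ? IF status = 'PREPARED'", "tx-42"));
            // wasApplied() reports whether the condition held
            System.out.println(rs.wasApplied() ? "committed" : "conflict detected");
        }
    }
}
```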

 Avoiding Common Cassandra Issues with Carlos Rolo | Ep. 119 Distributed Data Show | File Type: audio/mpeg | Duration: 00:25:07

Patrick McFadin connects with Carlos Rolo from Pythian at ApacheCon NA to recap the talk Carlos gave on some of the most common issues he sees in production Cassandra clusters and how to avoid them. You can listen to the full talk at https://feathercast.apache.org/2019/09/12/day-to-day-with-cassandra-the-weirdest-and-complex-situations-we-found-carlos-rolo/ Highlights:
0:00 - Patrick welcomes Carlos back to the show to recap his talk at ApacheCon about some of the worst cases he's seen with Cassandra clusters. New users frequently struggle with the transition to using a distributed database and don't understand the implications of decisions they're making.
2:46 - The biggest problem is choosing a bad data model. People forget about eventual consistency and are slow to learn where to use batches and lightweight transactions.
4:55 - Learning to select a partition key is important. If your query includes many partition keys in an IN clause, you are scanning the cluster. IN clauses are OK within the same partition.
8:47 - The solution is to do query-based data modeling. Often this is a challenge when the application is already in production. One quick workaround is to use the asynchronous operations in the drivers. (See the sketch after these notes.)
11:29 - As a DBA in the relational world, there were tricks you could use to optimize queries. This is not the case with Cassandra: performance is largely determined by the original table design, and that changes the dynamic between DBAs and developers.
13:32 - Batches can be effective if used correctly. They can be used for consistency across partitions, at a performance cost. Don't try to put too much data (by size) in a batch, and don't increase the warning thresholds or ignore the warnings!
16:05 - Carlos has seen a number of problems caused by bad hardware choices; it's important to size your hardware correctly. Most choices in public cloud environments are OK, and you can jump up to larger instance sizes when needed.
19:14 - A final set of problems occurs when people use experimental features in Cassandra without really understanding them. Use the mailing list, community forums, and documentation to find answers.
22:25 - Patrick encourages all our listeners to share their knowledge and experiences with running Cassandra with the rest of the community. We all benefit from the talks you give and blogs you write.
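A minimal sketch (placeholder table and keys, Java driver 4.x) of the workaround mentioned at 8:47: instead of one IN clause spanning many partitions, issue one targeted query per partition key asynchronously and let the driver route each to a replica.

```java
import java.util.List;
import java.util.concurrent.CompletionStage;
import java.util.stream.Collectors;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class AsyncFanOut {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            List<String> userIds = List.of("a1", "b2", "c3");
            // One single-partition query per key, issued concurrently,
            // instead of SELECT ... WHERE user_id IN (...) across partitions.
            List<CompletionStage<AsyncResultSet>> futures = userIds.stream()
                .map(id -> session.executeAsync(SimpleStatement.newInstance(
                    "SELECT * FROM my_keyspace.users WHERE user_id = ?", id)))
                .collect(Collectors.toList());
            // Wait for all results before the session closes
            futures.forEach(f -> f.toCompletableFuture().join());
        }
    }
}
```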

 DynamoDB Cassandra Proxy with Sebastian Estevez and Jake Luciani | Ep. 118 Distributed Data Show | File Type: audio/mpeg | Duration: 00:27:38

Bonus episode this week! David talks with Seb and Jake about the DynamoDB Cassandra Proxy, learns its history and what it's all about, and gets all the important details developers need to understand how it works. Highlights:
00:28 - Introductions with Seb and Jake
00:47 - What is the Cassandra DynamoDB Proxy?
1:53 - What's the backstory here?
3:44 - Give Dynamo superpowers it doesn't already have
4:10 - DataStax culture fosters autonomy for coding up ideas
5:52 - What is the use case for the Cassandra DynamoDB proxy?
9:10 - What pushed this app from just an idea to something more complete?
11:11 - It's easier to switch when you don't have to run a bunch of code
11:57 - Where does this play in the DataStax ecosystem?
14:33 - Real-time replication is a bit more complicated
16:35 - Why not build this directly into Cassandra core?
21:55 - Is there a cost for data serialization/deserialization?
22:37 - Talking about pluggable persistence
24:10 - Common themes around developers and community
25:44 - Get the proxy at https://github.com/datastax/dynamo-cassandra-proxy/

 ApacheCon 2019 Highlights with Patrick McFadin and Jeff Carpenter | Ep. 117 Distributed Data Show | File Type: audio/mpeg | Duration: 00:37:11

Patrick McFadin and Jeff Carpenter recap their favorite talks and hallway conversations from ApacheCon North America 2019, including DataStax announcements from the keynote by DataStax CTO Jonathan Ellis. Highlights:
0:00 - Enough talk, let's fight!
1:53 - Next Generation Cassandra Conference (NGCC): the conference within a conference. Thanks to the Apache Software Foundation for making space.
3:42 - NGCC was focused on Cassandra 4.0, including the release of the first alpha and the testing that will be required to get to a full release.
5:33 - One presentation was about transient replicas, a new experimental feature in 4.0.
6:55 - Joey Lynch and Vinay Chella from Netflix spoke about the sidecar project.
7:58 - Ben Bromhead from Instaclustr talked about Cassandra community health, including the good (a recent upturn in contributions) and the bad (average JIRA age). There is some low-hanging fruit that would be a great way for new contributors to start.
10:38 - Scott Andreas from Apple proposed a Cassandra improvement process and some other ideas to improve project management for Cassandra.
13:30 - Cassandra MVP Matija Gobec talked about a new partition-based compaction strategy and started talking with Netflix about a similar effort.
14:40 - Yuki from Scalar proposed an approach for building ACID transactions on top of Cassandra using the creatively named "Cherry Garcia" algorithm and Cassandra's lightweight transactions.
18:25 - Patrick is working with the project leadership to start bi-monthly user and developer meetings.
20:14 - Recapping Jonathan Ellis' keynote: DataStax's commitment to open source in terms of code, content, and community.
21:17 - Code: there will no longer be separate DataStax drivers for open source and DataStax Enterprise; there will soon be a single driver for each supported language, including many performance improvements that were previously only available when using DSE.
24:03 - (more Code) Also coming: prebuilt Python wheels and Spring Boot integration.
25:04 - (more Code!) DataStax Insights, the new cloud-based monitoring platform, will be offering a free tier for monitoring OSS Cassandra clusters.
27:08 - Content and community: reminding the community about our free content and community events, including DataStax Academy, Cassandra Days, and... this show!
28:34 - In his talk, Patrick announced a new DynamoDB / Cassandra proxy available on the DataStax Labs website.
31:30 - Other highlights included James Gosling's talk on the fight to keep Java open, and the renewed camaraderie in the Cassandra community.
33:35 - Interesting hallway conversations with open source innovators shopping for a home for their project, and a team using Python for their MVP microservice implementations.
36:13 - Look for some deep-dive interviews from ApacheCon over the next few episodes.

 Data Density with Marc Selwan | Ep. 116 Distributed Data Show | File Type: audio/mpeg | Duration: 00:34:47

Marc Selwan, Product Manager for the core database at DataStax, joins the show to discuss why we recently updated our recommended maximum data density for DataStax Enterprise nodes from 1 TB to 2 TB, the engineering behind this new guidance, and how this can save users money. Highlights:
0:00 - Intro
0:27 - We learn about Marc's background, which includes everything from gaming to network operations, presales engineering to product management.
3:03 - In 2013, Marc got involved with the Cassandra community and ended up joining DataStax. There are many ways to contribute to an open source community besides being a committer.
5:30 - Historically, the DataStax guidance for Cassandra and DataStax Enterprise deployments has been to limit stored data size (aka node density) to about 1 TB of data per node.
8:25 - The DataStax Vanguard team has noted a tradeoff between latency and data density.
10:40 - Bootstrapping, decommissioning, and compaction are operational tasks for Cassandra and DSE that use increasing amounts of resources as data density increases, since more SSTables have to be read.
12:05 - The more data, the more operational workload. To improve performance, we need to decrease the amount of work.
13:15 - The current recommendation of 1 TB per node can be a factor in the cost of running clusters.
15:08 - In August 2019, DataStax released updated data density guidance for DSE 6, which has been out for over a year. 2 TB is now a reasonable guidance for core workloads.
19:03 - The timing of the announcement was conservative; we took our time updating the guidance until we had proven the new storage engine thoroughly.
20:38 - The 2 TB guidance was tested with SizeTieredCompactionStrategy and TimeWindowCompactionStrategy. Testing with LeveledCompactionStrategy is still in progress, but the result may not be much higher than 1 TB.
22:50 - The response from customers has been very positive.
24:10 - Testing in your own environment should include throughput tests up to maximum capacity, stress testing at 90% capacity, and latency testing against specific p99 SLA targets. The DSE Metrics Collector is useful; you can find some default dashboards at: https://github.com/datastax/dse-metric-reporter-dashboards
27:18 - Make sure to test bootstrapping and decommissioning; these streaming operations are CPU- and I/O-intensive and might impact your SLAs.
29:54 - Zero-copy streaming, coming in Cassandra 4.0, will greatly improve the performance of streaming and may allow the density guidance to be increased again.
31:31 - Improvements that increase density improve the user experience of Cassandra and DSE as a whole and give you more flexibility to optimize for speed or cost.
33:05 - Wrapping up

 Cassandra Day Tokyo Reflections | Ep. 115 Distributed Data Show | File Type: audio/mpeg | Duration: 00:22:37

Patrick and Jeff cover some of the universal questions that come up on "day 2" for Cassandra users around batches, lightweight transactions, secondary indexes, and materialized views. They also challenge some of the biases around using Cassandra for use cases such as banking. Highlights:
0:23 - Recapping a great day hosted by Yahoo! Japan
1:20 - Introducing some of the universal questions we hear no matter where we are in the world
4:11 - Many of the questions we hear most frequently have to do with skills vs. knowledge: knowing when to apply a technique depending on your use case.
5:55 - The big four we get asked about: when is it OK to use batches, lightweight transactions, secondary indexes, and materialized views?
6:35 - Batches in Cassandra are not the same as in Oracle, where they refer to bulk loading. Batches are a useful way to group writes to multiple denormalized tables; the key thing to think about is the total volume of data being written.
9:34 - Materialized views are an acceptable way to manage index tables. If you have a lot of writes, write amplification can be a problem. Repairing materialized views can also be a challenge in Cassandra 3.x; there are some reliability improvements in 4.0.
11:43 - Cassandra has historically provided a lot of flexibility and features which in some cases can be misused. Make sure to test out your approach at scale as much as possible.
13:22 - Secondary indexes are for convenience, not for speed, as they would be in a relational database. Inequality searches on a single partition are an instance where these indexes will scale well. (See the sketch after these notes.)
16:08 - Lightweight transactions in Cassandra are not ACID transactions. They exist to address a particular set of race conditions in distributed systems: uniqueness checks on creation, and check-and-set.
19:04 - Banking is often cited as a use case for which Cassandra is not well-suited, but people often forget that these systems are historically built using a ledger data model and reconciliation.
21:15 - Hey listeners: send us your ideas for future episode topics, including use cases where you wonder whether Cassandra will be a good fit.
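A minimal sketch (placeholder schema) of the 13:22 point: a secondary index is fine when the query also pins a single partition, because the index lookup stays local to that partition's replicas rather than fanning out across the cluster.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class ScopedSecondaryIndex {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Placeholder schema: readings partitioned by sensor_id
            session.execute(
                "CREATE INDEX IF NOT EXISTS ON my_keyspace.readings (status)");
            // The partition key restricts the query to one partition, so the
            // secondary index is consulted only on that partition's replicas.
            session.execute(
                "SELECT * FROM my_keyspace.readings "
                + "WHERE sensor_id = 'sensor-1' AND status = 'ALERT'");
        }
    }
}
```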
