Distributed Data Show

Summary: The Distributed Data Podcast is your weekly source for the latest news and technical expertise to help you succeed in building large-scale distributed systems. Brought to you by the Developer Advocate team, we go in-depth with DataStax engineers and special guests from the broader data community. New episodes each Tuesday.

Podcasts:

 Distributed Data Show Episode 39: Cassandra on Kubernetes with Aaron Ploetz | File Type: audio/mpeg | Duration: 00:09:15

We talk with Aaron Ploetz of Target about how they implemented Cassandra on Kubernetes, and celebrate Aaron passing Jonathan Ellis to reach the top of the Cassandra tag on Stack Overflow. Highlights!
0:15 - David welcomes Cassandra MVP Aaron Ploetz to the show and celebrates Aaron passing Jonathan Ellis to reach the top of the Cassandra tag on Stack Overflow
1:12 - Aaron explains how Target is moving toward on-demand deployments of Cassandra using Kubernetes, starting on Target’s private cloud running OpenStack
2:42 - Aaron describes testing this solution, including use of cassandra-stress. In one interesting case, a cluster that was destroyed inadvertently was trivially rebuilt by Kubernetes pointing to the same backing store, with no perceived outage to the client.
4:17 - These methods apply to deployment of other distributed data stores such as Redis
5:30 - The performance test results observed by Aaron’s team for Cassandra on Kubernetes were competitive with non-Kubernetes Cassandra deployments in their external cloud provider
6:25 - Most of the work in running Cassandra on Kubernetes consists of building the initial container images
7:27 - Aaron talks about his work on the book Seven NoSQL Databases in a Week for Packt Publishing

 Distributed Data Show Episode 38: Spark 3.0 and Beyond with Holden Karau | File Type: audio/mpeg | Duration: 00:12:13

David Gilardi talks with Holden Karau of Google to mine many wonderful nuggets on the future of Spark and find out what might happen if she had a magic wand of awesomeness. Highlights!
0:15 - Welcoming Holden back to the show
0:30 - So what exactly is going to be in Spark 3? Significant updates to the SQL and Machine Learning (ML) APIs. There are missing pieces in the ML API, and adding them will cause breaking changes to existing models. One example is support for online model serving.
2:25 - The DataSet API does not yet fully cover all needed cases, causing developers to jump back to RDD APIs, so some API changes will be needed there. There will be continued performance improvements in query planning in minor releases.
3:13 - Python changes could include changes to handle Vectorized UDFs in the RDD APIs
4:35 - Why it’s so hard to pin down when Spark 3 will appear: breaking API changes have to be worth it. We need to wait until the payoff in capability is worth the breakage. An example would be making ML APIs typesafe.
6:57 - What Holden would change in Spark, given a magic wand: a shared memory buffer between languages using Apache Arrow
9:46 - Wrapping up: the most exciting change likely to be in Spark 3 is online model serving

 Distributed Data Show Episode 37: Cassandra at Instagram with Dikang Gu | File Type: audio/mpeg | Duration: 00:23:07

We talk with Dikang Gu about Instagram’s experience migrating from Apache Cassandra 2.2 to 3.0 and using RocksDB as a pluggable storage engine for Cassandra.

 Distributed Data Show Episode 36: Graph Invariants Galore with Denise Gosnell | File Type: audio/mpeg | Duration: 00:16:07

We talk with Dr. Denise Gosnell, graph consultant at DataStax, about creative applications of graph technology, TOTAL DOMINATION, and some surprising use cases. Highlights!
0:15 - David welcomes Denise back to the show
0:40 - Denise recaps how to know if you have a graph problem
1:06 - Introducing some interesting graph invariant techniques that we can apply to a variety of problems
1:55 - The chromatic number is the minimum number of colors required to color a graph so that adjacent vertices have different colors. This can be applied to scheduling problems where vertices are tasks to be scheduled and edges represent conflicts, for example, medical operations that require the same doctor or the same surgery room. We can use these conflict graphs to determine tasks that can occur simultaneously.
5:59 - Augmenting paths - this is a single-source shortest path problem. Beginning with a bipartite graph (a graph whose vertices split into two sets, with no edges inside either set) and some preference data, we can use Dijkstra’s algorithm to identify shortest paths between the sets. An example of this is creating an academic class schedule that assigns students to classes, optimized for their preferences.
8:58 - Denise and David have a fun aside on finding your passion
9:54 - Total domination - a favorite area of research for Denise and her advisor, Dr. Teresa Haynes. Pick the smallest set of vertices such that every vertex in the graph is in the set or touches something in the set - for example, the dominating set of a star graph is the center. This is useful for problems such as determining where to place alarms in a building.
13:53 - You didn’t realize this, but you’ve been training to solve graph problems your whole life
14:39 - Sneak previewing future graph episodes with Denise about dealing with supernodes: modeling around them, and traversing through them
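
To make the scheduling idea from the chromatic-number highlight concrete, here is a minimal Python sketch of a conflict graph colored greedily. It is an illustration only, not something taken from the episode: the task names and the `greedy_coloring` helper are hypothetical, and a greedy pass only gives an upper bound on the chromatic number rather than the true minimum.

```python
def greedy_coloring(conflicts):
    """Give each task the lowest-numbered slot not used by a conflicting task.

    `conflicts` maps a task to the set of tasks it cannot run alongside.
    The result is a valid coloring, but not necessarily one that uses the
    minimum number of colors (the chromatic number).
    """
    slots = {}
    # Heuristic: place the most-constrained tasks first; ties break by name.
    for task in sorted(conflicts, key=lambda t: (-len(conflicts[t]), t)):
        taken = {slots[n] for n in conflicts[task] if n in slots}
        slot = 0
        while slot in taken:
            slot += 1
        slots[task] = slot
    return slots


# Hypothetical conflict graph: op_a shares a doctor with op_b and a room with op_c.
conflicts = {
    "op_a": {"op_b", "op_c"},
    "op_b": {"op_a"},
    "op_c": {"op_a"},
    "op_d": set(),
}
slots = greedy_coloring(conflicts)
```

Tasks that end up with the same slot number have no edge between them, so they can be scheduled at the same time.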

 Distributed Data Show Episode 35: Apache Cassandra vs. the Cloud Databases with Jonathan Ellis | File Type: audio/mpeg | Duration: 00:41:21

DataStax CTO Jonathan Ellis compares the tradeoffs, strengths, and weaknesses of Apache Cassandra vs. Amazon’s DynamoDB, Microsoft’s Azure Cosmos DB, and Google’s Cloud Spanner.

 Distributed Data Show Episode 34: Spark 2.3 with Holden Karau | File Type: audio/mpeg | Duration: 00:21:02

Patrick McFadin catches up with Holden Karau of Google to learn about new features of Spark 2.3, including Vectorized UDFs, Microbatch improvements, and Kubernetes support. Along the way, they explore whether API stability is an indicator that it’s time to make a career move.

 Distributed Data Show Episode 33: Big Data and Blockchain | File Type: audio/mpeg | Duration: 00:18:14

DuyHai Doan leads a discussion with Patrick McFadin and Jeff Carpenter on whether blockchain is a database, how blockchains and distributed databases complement each other, and what we can learn about distributed systems by looking at blockchain technology. Highlights:
1:00 - DuyHai gives a quick introduction to blockchain
3:01 - Patrick explains why blockchain != Bitcoin
3:49 - We discuss whether a blockchain can be considered a type of database
5:38 - Blockchain limitations: limited transaction throughput and the difficulty of removing errors once they are in the chain
6:54 - Scalability and efficiency challenges of blockchains based on proof of work
8:44 - Techniques for overcoming scalability challenges of the blockchain
11:34 - How blockchains and distributed databases can be used together
12:39 - If some data moves out of the blockchain, where does it go? How about a log-structured merge tree-style database like Cassandra?
14:23 - On using blockchains for integrity/authenticity
14:47 - What we can learn from blockchain technology in the distributed database world: peer-to-peer protocols and hashing techniques
16:51 - Blockchain and distributed databases will intersect in large scale transactional systems in healthcare, retail and other industries

 Distributed Data Show Episode 32: Search with Nick Panahi | File Type: audio/mpeg | Duration: 00:23:25

We talk with Nick Panahi about DSE Search and the direction of the product moving forward.

 Distributed Data Show Episode 31: Microservices and Data - Best Practices | File Type: audio/mpeg | Duration: 00:29:10

The evangelist team shares some of its internal discussions and debates about the intersection of microservice architecture and the data tier.

 Distributed Data Show Episode 30: Orchestration with Kathryn Erickson | File Type: audio/mpeg | Duration: 00:17:56

We talk with Kat Erickson about the popularity of orchestration frameworks like Kubernetes and the pros and cons of using orchestration for deployment of distributed databases.

 Distributed Data Show Episode 28: 2018 Predictions | File Type: audio/mpeg | Duration: 00:23:13

The DataStax Evangelist team offers their predictions on the big technology trends to look for in 2018, including microservices and service meshes, containers and orchestration, and the emergence of higher level managed services for machine learning and AI applications.

 Distributed Data Show Episode 27: 2017 In Review | File Type: audio/mpeg | Duration: 00:16:34

The DataStax Evangelist team talks about the big tech trends of 2017 for large scale distributed systems including containers, orchestration, and graph databases.

 Distributed Data Show Episode 26: Partitioning Techniques with DuyHai Doan and Patrick McFadin | File Type: audio/mpeg | Duration: 00:16:30

DuyHai Doan and Patrick McFadin explain the two primary ways of distributing data used in computer science, hash-based partitioning and range-based partitioning, and the implications of each for operations (hint: rebalancing!).
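
As a rough illustration of the two approaches named above, the following Python sketch places string keys on nodes both ways. Everything here is an assumption made for the example: the function names, the simple modulo scheme, and the split points are hypothetical and do not describe how the episode's hosts or any particular database implement partitioning.

```python
import bisect
import hashlib


def hash_partition(key, num_nodes):
    """Hash-based: spread keys evenly, at the cost of losing key order."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Simple modulo placement (an assumption for this sketch): changing
    # num_nodes remaps most keys, which is why rebalancing is costly.
    return int.from_bytes(digest[:8], "big") % num_nodes


def range_partition(key, split_points):
    """Range-based: keep adjacent keys together on the same node.

    `split_points` must be sorted; node i holds keys below split_points[i],
    and the last node holds everything from the final split point upward.
    """
    return bisect.bisect_right(split_points, key)


# Hypothetical split points giving three ranges: [..h), [h..p), [p..]
split_points = ["h", "p"]
range_nodes = [range_partition(k, split_points) for k in ("apple", "kiwi", "zebra")]
hash_node = hash_partition("apple", 3)
```

With range partitioning, a scan over neighboring keys touches few nodes but a popular range can overload one of them; with hash partitioning, load spreads out but ordered scans have to visit every node.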

 Distributed Data Show Episode 25: Adding a Graph-Based Recommender to KillrVideo with David Gilardi | File Type: audio/mpeg | Duration: 00:23:09

Patrick McFadin talks with David Gilardi about the new recommendation engine recently added to the KillrVideo reference application using DSE Graph and Java DSLs (Domain Specific Languages).

 Distributed Data Show Episode 24: Pre-Aggregation in an Eventually Consistent World with DuyHai Doan | File Type: audio/mpeg | Duration: 00:20:58

Luke Tillman and DuyHai Doan talk about why pre-aggregation is difficult on an eventually consistent database like Apache Cassandra, and debate whether the storage engine in Cassandra should be made pluggable.
