TinkerPop 4.0 Preview with Marko Rodriguez | Ep. 124 Distributed Data Show





Summary: Graph database veteran Marko Rodriguez joins the show to discuss how he got involved with graph databases and the Apache TinkerPop project, what's coming in TinkerPop 4.0, his work in distributed computing research and stream ring theory, and the origin of that lovable Gremlin character.

Highlights:
0:00 - Jeff welcomes Marko to the show to discuss a decade of work in the graph database community, starting at the Los Alamos National Laboratory, where he needed a better way to store data. Graph databases were actually Marko's first introduction to databases, and he had to learn RDBMS later!
2:29 - Marko has been working with Apache TinkerPop and Gremlin for about 10 years. The Gremlin character came out of some comic books that Ketrina Yim created to teach kids about graph database concepts.
4:05 - The world of graph technology has changed a lot over the past 10 years. For the first 5 years the community was very small, and there weren't a lot of customers. Over the past 5 years graph has become mainstream, and there is a lot more discussion.
5:41 - Multiple factors helped this technology reach critical mass, including having multiple vendors supporting the project and the project being a part of the Apache Software Foundation.
6:40 - What's coming in the TinkerPop 4.0 release: first, this release is a complete rewrite of the Gremlin virtual machine. This is a distributed virtual machine that works with RxJava, Apache Spark, and Apache Flink. Users can write in multiple languages, including Gremlin, which compile down to a standard bytecode.
9:27 - There are three architectural components: the storage engine (the primary structure of vertices and edges and secondary structure including indexes, sort order, etc.), the processing engine (how data is read and manipulated, whether on a single machine or distributed), and the query language (which compiles to a bytecode that is executed by the processing engine).
11:40 - The TinkerPop 4.0 release introduces the concept of MMADT: the multi-model abstract data type, which can be used to represent any data model, including tabular, graph, document, and key-value. This enables an extensible architecture for distributed computing.
13:43 - The decision for the 4.0 release will be how much of the MMADT vision to incorporate at this time vs. moving that work to a new project. A draft version of the MMADT specification is available.
15:25 - During his sabbatical, Marko began working on stream ring theory to create a unified processing model for distributed computing. There is a common algebra that various distributed computing platforms like Spark, Flink, Akka, etc. can fit into.
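
The three architectural components mentioned at 9:27 (storage engine, processing engine, and a query language that compiles to bytecode) can be illustrated with a toy sketch. This is not TinkerPop code: every name here (`ToyGraphStore`, `ToyTraversal`, `execute`, the instruction names, and the sample data) is a hypothetical stand-in chosen for illustration, and the real Gremlin virtual machine and bytecode format may look quite different.

```python
# Toy illustration only: none of these names come from TinkerPop.
from dataclasses import dataclass, field


@dataclass
class ToyGraphStore:
    """Storage engine: primary structure (vertices, edges) plus one index."""
    vertices: dict = field(default_factory=dict)    # id -> properties
    out_edges: dict = field(default_factory=dict)   # id -> [(label, target id)]
    name_index: dict = field(default_factory=dict)  # name -> id (secondary structure)

    def add_vertex(self, vid, name):
        self.vertices[vid] = {"name": name}
        self.name_index[name] = vid

    def add_edge(self, src, label, dst):
        self.out_edges.setdefault(src, []).append((label, dst))


class ToyTraversal:
    """Query language: a fluent builder that only records bytecode."""

    def __init__(self):
        self.bytecode = []

    def V(self, name):
        self.bytecode.append(("V", name))
        return self

    def out(self, label):
        self.bytecode.append(("out", label))
        return self

    def values(self, key):
        self.bytecode.append(("values", key))
        return self


def execute(bytecode, store):
    """Processing engine: a single-machine interpreter for the bytecode."""
    current = []
    for op, arg in bytecode:
        if op == "V":
            # Start from a vertex looked up through the secondary index.
            vid = store.name_index.get(arg)
            current = [] if vid is None else [vid]
        elif op == "out":
            # Follow outgoing edges that carry the requested label.
            current = [dst for vid in current
                       for (label, dst) in store.out_edges.get(vid, [])
                       if label == arg]
        elif op == "values":
            # Replace vertex ids with one of their property values.
            current = [store.vertices[vid][arg] for vid in current]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return current


store = ToyGraphStore()
store.add_vertex(1, "alice")
store.add_vertex(2, "bob")
store.add_vertex(3, "carol")
store.add_edge(1, "knows", 2)
store.add_edge(1, "knows", 3)

query = ToyTraversal().V("alice").out("knows").values("name")
result = execute(query.bytecode, store)
print(query.bytecode)  # [('V', 'alice'), ('out', 'knows'), ('values', 'name')]
print(result)          # ['bob', 'carol']
```

In this sketch the query builder never touches the data. It only produces a list of instructions, so a different executor (for example, a distributed one) could in principle run the same bytecode against a different storage engine without changing the query.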