O'Reilly Data Show Podcast

Summary: The O'Reilly Data Show Podcast explores the opportunities and techniques driving big data, data science, and AI.

Podcasts:

 Tools for machine learning development | File Type: audio/mpeg | Duration: 00:39:24

In this week's episode of the Data Show, we're featuring an interview that Data Show host Ben Lorica gave on the Software Engineering Daily Podcast, where he was interviewed by Jeff Meyerson. Their conversation centered on data engineering, data architecture and infrastructure, and machine learning (ML).

 Enabling end-to-end machine learning pipelines in real-world applications | File Type: audio/mpeg | Duration: 00:42:53

In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.

We had a great conversation spanning many topics, including:

- AI Fairness 360 (AIF360), a set of fairness metrics for data sets and machine learning models.
- Adversarial Robustness Toolbox (ART), a Python library for adversarial attacks and defenses.
- Model Asset eXchange (MAX), a curated and standardized collection of free and open source deep learning models.
- Tools for model development, governance, and operations, including MLflow, Seldon Core, and Fabric for deep learning.
- Reinforcement learning in the enterprise, and the emergence of relevant open source tools like Ray.

Related resources:

- “Modern Deep Learning: Tools and Techniques”, a new tutorial at the Artificial Intelligence conference in San Jose
- Harish Doddi on “Simplifying machine learning lifecycle management”
- Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”
- “Managing risk in machine learning”: considerations for a world where ML models are becoming mission critical
- “The evolution and expanding utility of Ray”
- “Local Interpretable Model-Agnostic Explanations (LIME): An Introduction”
- Forough Poursabzi-Sangdeh on why “It’s time for data scientists to collaborate with researchers in other disciplines”
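To make "fairness metrics for data sets" a little more concrete, here is a minimal sketch of two widely used group fairness metrics, statistical parity difference and disparate impact, written in plain Python. It deliberately does not use the AIF360 API; the function names, group labels, and toy data are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def favorable_rate(labels: Sequence[int], groups: Sequence[str], group: str) -> float:
    """Fraction of favorable outcomes (label == 1) within one group."""
    selected = [y for y, g in zip(labels, groups) if g == group]
    if not selected:
        raise ValueError(f"no rows for group {group!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(labels, groups, unprivileged, privileged) -> float:
    """Favorable rate of the unprivileged group minus that of the privileged group."""
    return (favorable_rate(labels, groups, unprivileged)
            - favorable_rate(labels, groups, privileged))


def disparate_impact(labels, groups, unprivileged, privileged) -> float:
    """Ratio of favorable rates (unprivileged / privileged); 1.0 means parity."""
    return (favorable_rate(labels, groups, unprivileged)
            / favorable_rate(labels, groups, privileged))


# Hypothetical toy data set: 1 = favorable outcome, 0 = unfavorable outcome.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(labels, groups, unprivileged="b", privileged="a")
di = disparate_impact(labels, groups, unprivileged="b", privileged="a")
print(f"statistical parity difference: {spd:.2f}")
print(f"disparate impact: {di:.2f}")
```

A statistical parity difference near 0 and a disparate impact near 1 indicate that both groups receive favorable outcomes at similar rates; a library such as AIF360 provides many more metrics than this sketch covers.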

 Bringing scalable real-time analytics to the enterprise | File Type: audio/mpeg | Duration: 00:37:12

In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing executive focused on enterprise software and data products. Their new startup is focused on a few trends I’ve recently been thinking about, including the re-emergence of real-time analytics, and the hunger for simpler data architectures and tools. Borthakur exemplifies the need for companies to continually evaluate new technologies: while he was the founding engineer for HDFS, these days he mostly works with object stores like S3.

We had a great conversation spanning many topics, including:

- RocksDB, an open source, embeddable key-value store that originated at Facebook and is used in several other open source projects.
- Time-series databases.
- The importance of having solutions for real-time analytics, particularly now with the renewed interest in IoT applications and rollout of 5G technologies.
- Use cases for Rockset’s technologies and, more generally, applications of real-time analytics.
- The Aggregator Leaf Tailer architecture as an alternative to the Lambda architecture.
- Building data infrastructure in the cloud.

Figure: The Aggregator Leaf Tailer (“CQRS for the data world”), a data architecture favored by web-scale companies. Source: Dhruba Borthakur, used with permission.

Related resources:

- “Serverless Streaming Architectures & Algorithms for the Enterprise”, a new tutorial on September 24th at Strata Data NYC
- “Becoming a machine learning company means investing in foundational technologies”
- Haoyuan Li: “In the age of AI, fundamental value resides in data”
- Harish Doddi: “Simplifying machine learning lifecycle management”
- Eric Jonas: “A Berkeley view on serverless computing”
- “Specialized tools for machine learning development and model governance are becoming essential”
- Avner Braverman: “What data scientists and data engineers can do with current generation serverless technologies”

 Applications of data science and machine learning in financial services | File Type: audio/mpeg | Duration: 00:42:32

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:

- Potential applications of data science in financial services.
- The current state of data science in financial services in both the U.S. and China.
- His experience recruiting, training, and managing data science teams in both the U.S. and China.

Here are some highlights from our conversation:

Opportunities in financial services

There’s a customer acquisition piece and then there’s a customer retention piece. For customer acquisition, we can see that new technologies can really add value by looking at all sorts of data sources that can help a financial service company identify who they want to target to provide those services. So, it’s a great place where data science can help find the product market fit, not just at one instance like identifying who you want to target, but also in a continuous form where you can evolve a product and then continuously find the audience that would best fit the product and continue to analyze the audience so you can design the next generation product. …

Once you have a specific cohort of users who you want to target, there’s a need to be able to precisely convert them, which means understanding the stage of the customer’s thought process and understanding how to form the narrative to convince the user or the customer that a particular piece of technology or particular piece of service is the current service they need. …

On the customer serving or retention side, for financial services we commonly talk about building hundred-year businesses, right? They have to be profitable businesses, and for financial service to be profitable, there are operational considerations: quantifying risk requires a lot of data science; preventing fraud is really important, and there is garnering the long-term trust with the customer so they stay with you, which means having the work ethic to be able to take care of customer’s data and able to serve the customer better with automated services whenever and wherever the customer is. It’s all those opportunities where I see we can help serve the customer by having the right services presented to them and being able to serve them in the long term.

Opportunities in China

A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance; basically, the major areas for the financial industry. For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies for FinTech, and I think the wave started way back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest data from Alipay is about 608 million users, and these are monthly active users we’re talking about. This is about two times the U.S. population actively using Alipay on a monthly basis, which is a crazy number if you consider all the data that can generate and all the things you can see people buying to be able to understand how to serve the users better. If you look at WeChat, they’re boasting one billion users, monthly active users, early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.

Related resources:

- Kai-Fu Lee outlines the factors that enabled China’s rapid ascension in AI
- Gary Kazantsev on how “Data science makes an impact on Wall Street”
- Juan Huerta on “Upcoming challenges and opportunities for data technologies in consumer

 Real-time entity resolution made accessible | File Type: audio/mpeg | Duration: 00:27:09

In this episode of the Data Show, I spoke with Jeff Jonas, CEO, founder and chief scientist of Senzing, a startup focused on making real-time entity resolution technologies broadly accessible. He was previously a fellow and chief scientist of context computing at IBM. Entity resolution (ER) refers to techniques and tools for identifying and linking manifestations of the same entity/object/individual. Ironically, ER itself has many different names (e.g., record linkage, duplicate detection, object consolidation/reconciliation, etc.).

ER is an essential first step in many domains, including marketing (cleaning up databases), law enforcement (background checks and counterterrorism), and financial services and investing. Knowing exactly who your customers are is an important task for security, fraud detection, marketing, and personalization. The proliferation of data sources and services has made ER very challenging in the internet age. In addition, many applications now increasingly require near real-time entity resolution.

We had a great conversation spanning many topics including:

- Why ER is interesting and challenging
- How ER technologies have evolved over the years
- How Senzing is working to democratize ER by making real-time AI technologies accessible to developers
- Some early use cases for Senzing’s technologies
- Some items on their research agenda

Here are a few highlights from our conversation:

Entity resolution through the years

In the early ’90s, I worked on a much more advanced version of entity resolution for the casinos in Las Vegas and created software called NORA, non-obvious relationship awareness. Its purpose was to help casinos better understand who they were doing business with. We would ingest data from the loyalty club, everybody making hotel reservations, people showing up without reservations, everybody applying for jobs, people terminated, vendors, and 18 different lists of different kinds of bad people, some of them card counters (which aren’t that bad), some cheaters. And they wanted to figure out across all these identities when somebody was the same, and then when people were related. Some people were using 32 different names and a bunch of different social security numbers. …

Ultimately, IBM bought my company and this technology became what is known now at IBM as “identity insight.” Identity insight is a real-time entity resolution engine that gets used to solve many kinds of problems. MoneyGram implemented it and their fraud complaints dropped 72%. They saved a few hundred million just in their first few years. …

But while at IBM, I had a grand vision about a new type of entity resolution engine that would have been unlike anything that’s ever existed. It’s almost like a Swiss Army knife for ER.

Recent developments

The Senzing entity resolution engine works really well on two records from a domain that you’ve never even seen before. Say you’ve never done entity resolution on restaurants from Singapore. The first two records you feed it, it’s really, really already smart. And then as you feed it more data, it gets smarter and smarter. …

So, there are two things that we’ve intertwined. One is common sense. One type of common sense is the names: Dick, Dickie, Richie, Rick, Ricardo are all part of the same name family. Why should it have to study millions and millions of records to learn that again? …

Next to common sense, there’s real-time learning. In real-time learning, we do a few things. You might have somebody named Bob, but who now goes by a nickname or an alias of Andy. Eventually, you might come to learn that. So, now you know you have to learn over time that Bob also has this nickname, and Bob lived at three addresses, and this is his credit card number, and now he’s got four phone numbers. So you want to learn those over time. …

These systems we’re creating, our entity resolution systems, which really resol
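The two ideas in these highlights, built-in "common sense" about name families and learning new attributes about an entity over time, can be illustrated with a small toy sketch. This is not Senzing's algorithm or API: the matching rule, the `Entity` class, the second name family, and the sample records are all hypothetical, and the single-pass greedy approach never merges two existing entities.

```python
# Toy entity resolution sketch (hypothetical; not Senzing's method or API).

NAME_FAMILIES = [
    {"richard", "dick", "dickie", "richie", "rick", "ricardo"},  # family mentioned in the interview
    {"robert", "bob", "bobby"},  # assumed extra family, added for the example
]


def name_family(name):
    """Return the name family a first name belongs to (itself if unknown)."""
    key = name.strip().lower()
    for family in NAME_FAMILIES:
        if key in family:
            return frozenset(family)
    return frozenset({key})


class Entity:
    """Accumulates everything learned so far about one resolved entity."""

    def __init__(self):
        self.names = set()
        self.phones = set()
        self.addresses = set()

    def matches(self, record):
        same_family = any(
            name_family(n) == name_family(record["name"]) for n in self.names
        )
        shared_phone = record["phone"] in self.phones
        shared_address = record["address"] in self.addresses
        # Toy rule: a shared phone is enough on its own; otherwise require
        # the same name family plus a shared address.
        return shared_phone or (same_family and shared_address)

    def absorb(self, record):
        """Learn the record's attributes, e.g. a new alias or phone number."""
        self.names.add(record["name"].strip().lower())
        self.phones.add(record["phone"])
        self.addresses.add(record["address"])


def resolve(records):
    """Assign each incoming record to the first matching entity, in order."""
    entities = []
    for record in records:
        target = next((e for e in entities if e.matches(record)), None)
        if target is None:
            target = Entity()
            entities.append(target)
        target.absorb(record)
    return entities


records = [
    {"name": "Bob", "phone": "555-0100", "address": "1 Main St"},
    {"name": "Robert", "phone": "555-0199", "address": "1 Main St"},
    {"name": "Andy", "phone": "555-0199", "address": "9 Oak Ave"},
    {"name": "Rick", "phone": "555-0142", "address": "1 Main St"},
]
entities = resolve(records)
for entity in entities:
    print(sorted(entity.names), sorted(entity.phones), sorted(entity.addresses))
```

In this sketch, "Robert" joins "Bob" through the name family and shared address, and "Andy" is then learned as an alias because the record shares a phone number the entity picked up from the second record. "Rick" shares only an address, so he stays a separate entity.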

 Why companies are in need of data lineage solutions | File Type: audio/mpeg | Duration: 00:34:29

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data, specifically labeled data for training models. There are also many other considerations (including security, privacy, and reliability/safety) that are encouraging companies to invest in a suite of data technologies.

In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up. There are several San Francisco Bay Area companies that have embarked on building data lineage systems, including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It’s something that is born out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I’m describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey and exactly what constituted that data to come into being into your data warehouse or any other storage appliance you use, that would be really useful. …

Think about data lineage as helping issues about quality of data, understanding if something is corrupted. On the security side, think of GDPR … which was one of the hot topics I heard about at the Strata Data Conference in London in 2018.

Why companies are suddenly building data lineage solutions

A data lineage system becomes necessary as time progresses. It becomes easier for maintainability. You need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you’re working with. If you’re working with 10 databases, you need to know what’s going on in them. If I have to give you a vision of a data lineage system, think of it as a final graph or view of some data set, and it shows you a graph of what it’s linked to. Then it gives you some metadata information so you can drill down. Let’s say you have corrupted data, let’s say you want to debug something. All these cases tie into the actual use cases for which we want to build it.

Related resources:

- “Deep automation in machine learning”
- Vitaly Gordon on “Building tools for enterprise data science”
- “Managing risk in machine learning”
- Haoyuan Li explains why “In the age of AI, fundamental value resides in data”
- “What machine learning means for software development”
- Joe Hellerstein on how “Metadata services can lead to performance and organizational improvements”
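The "graph of what it’s linked to" plus drill-down metadata described above can be sketched as a small in-memory lineage graph. This is only an illustration under stated assumptions, not the system built at Stitch Fix: the `LineageGraph` class, its method names, and the data set and job names are all hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Minimal lineage store: each data set points to the inputs it was built from."""

    def __init__(self):
        self._parents = defaultdict(set)  # data set -> direct upstream data sets
        self._metadata = {}               # data set -> details about the producing job

    def record(self, output, inputs, **metadata):
        """Register that `output` was produced from `inputs`, with job metadata."""
        self._parents[output].update(inputs)
        self._metadata[output] = metadata

    def upstream(self, dataset):
        """Return every data set that `dataset` depends on, directly or indirectly."""
        seen = set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen

    def metadata(self, dataset):
        """Drill down into what is known about how `dataset` was produced."""
        return self._metadata.get(dataset, {})


# Hypothetical pipeline: raw orders are cleaned, then joined with profiles.
graph = LineageGraph()
graph.record("clean_orders", ["raw_orders"], job="dedupe_orders", owner="data-eng")
graph.record(
    "recommendations",
    ["clean_orders", "client_profiles"],
    job="train_recs",
    owner="algorithms",
)

# Debugging a corrupted data set: list everything it was derived from.
print(sorted(graph.upstream("recommendations")))
print(graph.metadata("clean_orders"))
```

Walking the graph upstream from a corrupted data set narrows down where the problem could have been introduced, and the same structure could serve as an audit trail for compliance questions.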

 What data scientists and data engineers can do with current generation serverless technologies | File Type: audio/mpeg | Duration: 00:36:32

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out (e.g., things like PyWren that target scientists), I expect these numbers to rise.

We had a great conversation spanning many topics, including:

- A short history of cloud computing.
- The fundamental differences between serverless and conventional cloud computing.
- The reasons serverless, specifically AWS Lambda, took off so quickly.
- What data scientists and data engineers can do with the current generation of serverless offerings.
- What is missing from serverless today and what users should expect in the near future.

Related resources:

- “The evolution and expanding utility of Ray”
- Results of a new survey: “Evolving Data Infrastructure: Tools and Best Practices for Advanced Analytics and AI”
- Eric Jonas on “Building accessible tools for large-scale computation and machine learning”
- “7 data trends on our radar”
- “Handling real-time data operations in the enterprise”
- “Progress for big data in Kubernetes”

 It’s time for data scientists to collaborate with researchers in other disciplines | File Type: audio/mpeg | Duration: 00:36:08

In this episode of the Data Show, I spoke with Forough Poursabzi-Sangdeh, a postdoctoral researcher at Microsoft Research New York City. Poursabzi works in the interdisciplinary area of interpretable and interactive machine learning. As models and algorithms become more widespread, many important considerations are becoming active research areas: fairness and bias, safety and reliability, security and privacy, and Poursabzi’s area of focus: explainability and interpretability.

We had a great conversation spanning many topics, including:

- Current best practices and state-of-the-art methods used to explain or interpret deep learning models or, more generally, machine learning models.
- The limitations of current model interpretability methods.
- The lack of clear/standard metrics for comparing different approaches used for model interpretability.
- Many current AI and machine learning applications augment humans, and, thus, Poursabzi believes it’s important for data scientists to work closely with researchers in other disciplines.
- The importance of using human subjects in model interpretability studies.

Related resources:

- “Local Interpretable Model-Agnostic Explanations (LIME): An Introduction”
- “Interpreting predictive models with Skater: Unboxing model opacity”
- Jacob Ward on “How social science research can inform the design of AI systems”
- Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”
- “Managing risk in machine learning”: considerations for a world where ML models are becoming mission critical
- Francesca Lazzeri and Jaya Mathew on “Lessons learned while helping enterprises adopt machine learning”
- Jerry Overton on “Teaching and implementing data science and AI in the enterprise”
