O'Reilly Data Show Podcast show

O'Reilly Data Show Podcast

Summary: The O'Reilly Data Show Podcast explores the opportunities and techniques driving big data, data science, and AI.

Podcasts:

 Simplifying machine learning lifecycle management | File Type: audio/mpeg | Duration: 00:37:25

In this episode of the Data Show, I spoke with Harish Doddi, co-founder and CEO of Datatron, a startup focused on helping companies deploy and manage machine learning models. As companies move from machine learning prototypes to products and services, tools and best practices for productionizing and managing models are just starting to emerge. Today’s data science and data engineering teams work with a variety of machine learning libraries, data ingestion, and data storage technologies. Risk and compliance considerations mean that the ability to reproduce machine learning workflows is essential to meet audits in certain application domains. And as data science and data engineering teams continue to expand, tools need to enable and facilitate collaboration. As someone who specializes in helping teams turn machine learning prototypes into production-ready services, I wanted to hear what Doddi has learned while working with organizations that aspire to “become machine learning companies.”

Here are some highlights from our conversation:

A central platform for building, deploying, and managing machine learning models

In one of the companies where I worked, we had built infrastructure related to Spark. We were a heavy Spark shop, so we built everything around Spark and other components. But later, when that organization grew, a lot of people came in from a TensorFlow background. That suddenly created some frustration in the team, because everybody wanted to move to TensorFlow, but we had invested a lot of time, effort, and energy in building the infrastructure for Spark. … We suddenly had hidden technical debt that needed to be addressed. … Let’s say right now you have two models running in production, and you know that in the next two or three years you are going to deploy 20 to 30 models. You need to start thinking about this ahead of time. … That’s why, these days, I’ve observed that organizations are creating centralized teams. The centralized team is responsible for maintaining flexible machine learning infrastructure that can be used to deploy, operate, and monitor many models simultaneously.

Feature store: Create, manage, and share canonical features

When I talk to companies these days, everybody knows that their data scientists are duplicating work because they don’t have a centralized feature store. Everybody I talk to really wants to build or even buy a feature store, depending on what is easiest for them. … The number of data scientists within most companies is increasing, and one of the pain points I’ve observed is that when a new data scientist joins an organization, there is a long ramp-up period: a new data scientist needs to figure out what the data sets are, what the features are, and so on. If an organization has a feature store, that ramp-up period can be much shorter.

Related resources:

- “Lessons learned turning machine learning models into real products and services”
- “What are machine learning engineers?”: examining a new role focused on creating data products and making data science work in production
- “MLflow: A Platform for Managing the Machine Learning Lifecycle”
- “Managing risk in machine learning models”: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain
- “We need to build machine learning tools to augment machine learning engineers”
- “When models go rogue”: David Talby on hard-earned lessons about using machine learning in production
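
To make the feature-store idea from the conversation above concrete, here is a minimal, illustrative sketch of the core abstraction: a shared registry that maps canonical feature names to the code that computes them, so every team derives a feature the same way. This is not Datatron's product; the class, feature names, and columns are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import numpy as np
import pandas as pd


@dataclass
class FeatureStore:
    """Toy registry mapping canonical feature names to the code that computes them."""
    _features: Dict[str, Callable[[pd.DataFrame], pd.Series]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[pd.DataFrame], pd.Series]) -> None:
        # One definition per feature, shared by every data scientist.
        self._features[name] = fn

    def materialize(self, df: pd.DataFrame, names: List[str]) -> pd.DataFrame:
        # Compute the requested features from raw data, identically everywhere.
        return pd.DataFrame({name: self._features[name](df) for name in names})


store = FeatureStore()
store.register("log_order_total", lambda df: np.log1p(df["order_total"]))

raw = pd.DataFrame({"order_total": [10.0, 250.0]})
print(store.materialize(raw, ["log_order_total"]))
```

A new hire can then discover existing features by browsing the registry instead of reverse-engineering each team's pipelines, which is the ramp-up problem Doddi describes.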

 How privacy-preserving techniques can lead to more robust machine learning models | File Type: audio/mpeg | Duration: 00:36:43

In this episode of the Data Show, I spoke with Chang Liu, applied research scientist at Georgian Partners. In a previous post, I highlighted early tools for privacy-preserving analytics, both for improving decision-making (business intelligence and analytics) and for enabling automation (machine learning). One of the tools I mentioned is an open source project for SQL-based analysis that adheres to state-of-the-art differential privacy (a formal guarantee that provides robust privacy assurances). Since business intelligence typically relies on SQL databases, this open source project is something many companies can already benefit from today. What about machine learning? While I didn’t have space to point this out in my previous post, differential privacy has been an area of interest to many machine learning researchers. Most practicing data scientists aren’t aware of the research results, and popular data science tools haven’t incorporated differential privacy in meaningful ways (if at all). But things will change over the coming months. For example, Liu wants to make ideas from differential privacy accessible to industrial data scientists, and she is part of a team building tools to make this happen.

Here are some highlights from our conversation:

Differential privacy and machine learning

In the literature, there are actually multiple ways differential privacy is used in machine learning. We can inject noise directly at the input data level, or while we’re training a model. We can also inject noise into the gradient: at every iteration, as we’re computing the gradients, we can inject some noise. Or we can inject noise during aggregation: if we’re using ensembles, we can inject noise there. And we can also inject noise at the output level, so after we’ve trained the model and we have our vector of weights, we can inject noise directly into the weights.

A mechanism for building robust models

There is a chance that differential privacy methods can actually make your model more general. Essentially, when models memorize their training data, it could be due to overfitting. So injecting all of this noise may move the resulting model further away from overfitting, and you get a more general model.

Related resources:

- “How to build analytic products in an age when data privacy has become critical”
- “Managing risk in machine learning models”: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain
- “Data regulations and privacy discussions are still in the early stages”: Aurélie Pols on GDPR, ethics, and ePrivacy
- “Data collection and data markets in the age of privacy and machine learning”
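
Of the injection points Liu lists, the output level is the simplest to sketch. Below is a minimal, illustrative example of output perturbation using the Laplace mechanism; the weight vector, sensitivity bound, and epsilon are hypothetical, and correctly bounding the sensitivity of the training procedure is the hard part in practice.

```python
import numpy as np


def output_perturbation(weights: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    """Add Laplace noise to trained weights (noise injected at the output level).

    For epsilon-differential privacy, the Laplace scale is sensitivity / epsilon,
    where sensitivity bounds how much one training example can move the weights.
    """
    rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return weights + rng.laplace(loc=0.0, scale=scale, size=weights.shape)


# Hypothetical trained weight vector; sensitivity and epsilon are illustrative.
w = np.array([0.8, -1.2, 0.3])
w_private = output_perturbation(w, sensitivity=0.1, epsilon=1.0)
print(w_private)
```

The same Laplace (or Gaussian) noise could instead be added to per-iteration gradients or to ensemble aggregates, which correspond to the other injection points Liu describes.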

 Specialized hardware for deep learning will unleash innovation | File Type: audio/mpeg | Duration: 00:41:18

In this episode of the Data Show, I spoke with Andrew Feldman, founder and CEO of Cerebras Systems, a startup in the blossoming area of specialized hardware for machine learning. Since the release of AlexNet in 2012, we have seen an explosion in activity in machine learning, particularly in deep learning. A lot of the work to date happened primarily on general purpose hardware (CPU, GPU). But now that we’re six years into the resurgence in interest in machine learning and AI, these new workloads have attracted technologists and entrepreneurs who are building specialized hardware for both model training and inference, in the data center or on edge devices. In fact, companies with enough volume have already begun building specialized processors for machine learning. But you have to either use specific cloud computing platforms or work at specific companies to have access to such hardware. A new wave of startups (including Cerebras) will make specialized hardware affordable and broadly available. Over the next 12-24 months, architects and engineers will need to revisit their infrastructure and decide between general purpose and specialized hardware, and between cloud and on-premise gear. In light of the training duration and cost they face using current (general purpose) hardware, some experiments might be hard to justify. Upcoming specialized hardware will enable data scientists to try out ideas that they previously would have hesitated to pursue. This will surely lead to more research papers and interesting products as data scientists are able to run many more experiments (on even bigger models) and iterate faster. Since Feldman is the founder of one of the most anticipated hardware startups in the deep learning space, I wanted to get his views on the challenges and opportunities faced by engineers and entrepreneurs building hardware for machine learning workloads.

Here are some highlights from our conversation:

A renaissance for computer architecture

OpenAI put out some very interesting analysis recently showing that since 2012, the compute used for the largest AI training runs has increased by 300,000x. … What’s available to us to attack the vast discrepancy between compute demand and what we have today? Two things. The first is exploring interesting compute architectures; I think this ushers in a golden age for compute architectures. And number two, it’s building dedicated hardware and saying: ‘We’re prepared to make trade-offs to accelerate AI compute by not trying to be good at other things. Not trying to be good at graphics, not trying to be a good web server. We will attack this vast demand for compute by building dedicated hardware for artificial intelligence work.’ Historically, new and interesting architectures dedicated to a particular type of work have been a very productive and valuable trade-off. That’s the opportunity that many of these hardware or chip companies have seen.

Communication-intensive workloads

When you stay on a chip, you can communicate fairly quickly. The problem is that our work in artificial intelligence often spans more than one traditional chip, and the performance penalty for leaving the chip is very, very high. On-chip, you stay in silicon; off-chip, you have to wrap your communication in some sort of protocol, send it, and connect it over lanes on a printed circuit board, or maybe through a PCI switch, an Ethernet switch, or an InfiniBand switch. This adds two, three, four orders of magnitude of latency. Some of the problems that hardware vendors interested in data center training and data center inference are working on are how to accelerate the communication between cores, across tens of thousands or even hundreds of thousands of cores spanning many chips. Some are inventing new techniques for special switches and modifying PCIe to do that. Others have more fundamental approaches.
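
To put the 300,000x figure in perspective, a quick back-of-the-envelope calculation shows why Feldman sees so much room for dedicated hardware: demand has been doubling every few months, far outpacing general-purpose chip improvements. The six-year window below is an assumption based on the 2012 starting point he cites.

```python
import math

# Feldman cites OpenAI's estimate that compute for the largest AI training
# runs grew roughly 300,000x since 2012. Assume a ~6-year window.
growth = 300_000
years = 6.0  # assumption: AlexNet (2012) through the 2018 analysis

doublings = math.log2(growth)                 # about 18.2 doublings
months_per_doubling = years * 12 / doublings  # about 4 months per doubling
print(f"{doublings:.1f} doublings, one every {months_per_doubling:.1f} months")
```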

 Data regulations and privacy discussions are still in the early stages | File Type: audio/mpeg | Duration: 00:33:19

In this episode of the Data Show, I spoke with Aurélie Pols of Mind Your Privacy, one of my go-to resources when it comes to data privacy and data ethics. This interview took place at Strata Data London, a couple of days before the EU General Data Protection Regulation (GDPR) took effect. I wanted her perspective on this landmark regulation, as well as her take on trends in data privacy and growing interest in ethics among data professionals.

Here are some highlights from our conversation:

GDPR is just the starting point

GDPR is not an end point. It’s a starting point for a journey where a balance between companies, society, and users of data needs to be redefined. Because when I look at my children, at how they use technology, at how smart my house or my car or my fridge might become, I know that in the long run this idea of giving consent to my fridge to share data is not totally viable. What are we going to build for the next generations? … I’ve been teaching privacy and ethics in the big data and analytics graduate program at the IE Business School in Madrid, one of the top business schools in the world, and I see the evolution as well. Five years ago, they looked at me like, ‘What is she talking about?’ Three years ago, some of the people in the room started to understand. … Last year it was, ‘We get it.’

Privacy by design

It’s defined as data protection by design and by default. The easy part is the default settings. When you create systems, there is a question I ask 20 times a week: ‘Great. I love your system. What data do you collect by default, and what do you pass on by default?’ Then you start turning things off, and then we’ll see who takes on the responsibility to turn things on again. That’s the default part. Privacy by design was pushed by Ann Cavoukian from Ottawa in Canada more than 10 years ago. These principles are finding their way into legislation, not only in GDPR; for example, Hong Kong is starting to talk about this, and Japan as well. One of these principles is positive-sum, not zero-sum. It’s not ‘I win and you lose.’ It’s ‘we work together and we both win.’ That’s a very good principle. There are interesting challenges within privacy by design in translating these seven principles into technical requirements, and I think there are opportunities as well. It talks about traceability, visibility, transparency. Which comes back again to: we’re sitting on so much data; how much of it do we want to surface, and are data subjects or citizens ready to understand what we have, and able to make decisions based on that? … Hopefully this generation of more ethically minded engineers and data scientists will start thinking in that way as well.

Related resources:

- “The data subject first?”: Aurélie Pols draws a broad philosophical picture of the data ecosystem and then homes in on the right to data portability
- “How to build analytic products in an age when data privacy has become critical”
- “Managing risk in machine learning models”: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain
- “Building tools for the AI applications of tomorrow”
- “Toward the Jet Age of machine learning”
- “The real value of data requires a holistic view of the end-to-end data pipeline”: Ashok Srivastava on the emergence of machine learning and AI for enterprise applications
- “Bringing AI into the enterprise”: Kris Hammond on business applications of AI technologies and educating future AI specialists
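
Pols' two default questions ("what do you collect by default, and what do you pass on by default?") translate directly into system configuration. Here is a minimal sketch of what privacy-by-default settings might look like; the setting names and values are hypothetical, purely to illustrate the principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryConfig:
    """Illustrative privacy-by-default settings: collect and share nothing
    unless someone deliberately turns it on."""
    collect_usage_stats: bool = False  # off by default
    share_with_partners: bool = False  # off by default
    retention_days: int = 30           # keep only what is needed, and briefly


# Enabling collection is an explicit, reviewable act rather than the default,
# which answers Pols' question about who takes responsibility for turning
# things on again.
config = TelemetryConfig(collect_usage_stats=True)
print(config)
```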

 Managing risk in machine learning models | File Type: audio/mpeg | Duration: 00:32:34

In this episode of the Data Show, I spoke with Andrew Burt, chief privacy officer at Immuta, and Steven Touw, co-founder and CTO of Immuta. Burt recently co-authored a white paper on managing risk in machine learning models, and I wanted to sit down with them to discuss some of the proposals they put forward to organizations that are deploying machine learning. Some high-profile examples of models gone awry have raised awareness among companies of the need for better risk management tools and processes. There is now a growing interest in ethics among data scientists, specifically in tools for monitoring bias in machine learning models. In a previous post, I listed some of the key considerations organizations should keep in mind as they move models to production, but the report co-authored by Burt goes far beyond that and recommends lines of defense, including a description of the key roles that are needed.

Here are some highlights from our conversation:

Privacy and compliance meet data science

Andrew Burt: I would say the big takeaway from our paper is that lawyers, compliance, and privacy folks live in one world, and data scientists live in another, with competing objectives. That can no longer be the case. They need to talk to each other; they need a shared process and some shared terminology so that everybody can communicate. One of the recommendations we make, taken from some of the model risk management frameworks, is to create what we call lines of defense: basically, different lines of reviewers who conduct periodic reviews, from the creation and testing phase to validation to an auditing phase. The members of those lines of review need to be teams with multiple kinds of expertise. There need to be data owners, the people responsible for the data being piped into the models; compliance personnel, who are thinking about legal and ethical obligations; data scientists; and subject domain experts. … We also dive into how you should be thinking about de-risking and monitoring your input data, how you should be thinking about monitoring and de-risking your output data, and using output data for models. … And then, I think really importantly, there is this idea of thinking about what it means for a model to fail, and having a concrete plan for what that means, how to correct it if it fails, and how to pull it from production if you need to.

Explainability and GDPR

Steven Touw: I gave a talk, “How Does the GDPR Impact Machine Learning?”, at Strata Data London. A lot of people are concerned about language in GDPR that states you must be able to explain how a model came to its conclusion. I think people are overreacting to this a little bit, and we need to inject some common sense along these lines: at the end of the day, you can explain what data went in, and you can explain the logic of what you’re trying to solve and why; you don’t have to explain every neuron in the neural net and how it was correlated with every other piece. I think the GDPR is actually doing a good thing. It’s enabling consumers to understand how decisions are being made about them, but they don’t have to understand everything in the weeds. Because the whole point of machine learning is that it can do things we can’t as humans. That’s why we use it, and there are cases where it makes sense to trust the model rather than humans to get things done and, potentially and hopefully, done more accurately.

Related resources:

- “How to build analytic products in an age when data privacy has become critical”
- “We need to build machine learning tools to augment machine learning engineers”
- “The ethics of artificial intelligence”
- “Interpreting predictive models with Skater: Unboxing model opacity”
- “Toward the Jet Age of machine learning”
- “Building tools for the AI applications of tomorrow”
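
Burt's point about monitoring input data and having a concrete failure plan can be made concrete with a small sketch. The following is a generic illustration, not a method from the white paper: it flags when live feature statistics drift too far from the training baseline, which is one plausible trigger for the review-and-rollback process he describes. The baseline values and threshold are hypothetical.

```python
import numpy as np


def input_drift_alert(live_batch: np.ndarray,
                      train_mean: np.ndarray,
                      train_std: np.ndarray,
                      threshold: float = 3.0) -> bool:
    """Return True when any feature's live mean drifts more than
    `threshold` training standard deviations from the training mean."""
    live_mean = live_batch.mean(axis=0)
    z = np.abs(live_mean - train_mean) / train_std
    return bool((z > threshold).any())


# Hypothetical baseline statistics captured at training time.
train_mean = np.array([0.0, 5.0])
train_std = np.array([1.0, 2.0])
live = np.array([[9.5, 5.1], [10.2, 4.8]])  # first feature has drifted badly

if input_drift_alert(live, train_mean, train_std):
    print("Drift detected: escalate to reviewers or fall back to the previous model")
```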

 The real value of data requires a holistic view of the end-to-end data pipeline | File Type: audio/mpeg | Duration: 00:31:05

In this episode of the Data Show, I spoke with Ashok Srivastava, senior vice president and chief data officer at Intuit. He has a strong science and engineering background, combined with years of applying machine learning and data science in industry. Prior to joining Intuit, he led the teams responsible for data and artificial intelligence products at Verizon. I wanted his perspective on a range of issues, including the role of the chief data officer, ethics in machine learning, and the emergence of AI technologies for enterprise products and applications.

Here are some highlights from our conversation:

Chief data officer

A chief data officer, in my opinion, is a person who thinks about the end-to-end process of obtaining data, data governance, and transforming that data for a useful purpose. His or her purview is relatively large. I view my purview at Intuit to be exactly that: thinking about the entire data pipeline, proper stewardship, proper governance principles, and proper application of data. As the public learns more about the opportunities that can come from data, there’s a lot of excitement about the potential value that can be unlocked from it from the consumer standpoint, and many businesses and scientific organizations are excited about the same thing. I think the CDO plays a role as a catalyst in making those things happen with the right principles applied. If you look back into history a little bit, you’ll find the need for the chief data officer started to come into play when people saw a huge amount of data coming in at high speeds with high variety and variability, but then also the opportunity to marry that data with real algorithms that can have a transformational property to them. While it’s true that CIOs, CTOs, and people in lines of business can and should think about this, it’s a complex enough process that I think it merits having a person and an organization think about that end-to-end pipeline.

Ethics

We’re right now in the process of launching a unified training program in data science that includes ethics as well as many other technical topics. I should say that I joined Intuit only about six months ago; they already had training programs happening worldwide in the area of data science, acquainting people with the principles necessary to use data properly as well as the technical aspects of doing it. I really feel ethics is a critical area for those of us who work in the field to think about, and to be advocates of the proper use of data and of privacy information and security, in order to make sure the data we’re stewards of is used in the best possible way for the end consumer.

Describing AI

You can think about two overlapping circles. One circle is really an AI circle; the other is a machine learning circle. Many people think that the intersection is the totality of it, but in fact, it isn’t. … I’m finding that AI needs to be bounded a little bit. I often say that it’s a reasonable technology with unreasonable expectations associated with it. People, for whatever reason, have decided that deep learning is going to solve many problems. There’s a lot of evidence to support that, but frankly, there’s also a lot of evidence to support the fact that much more work has to be done before these things become “general purpose AI solutions.” That’s where a lot of exciting innovation is going to happen in the coming years.

Related resources:

- Understanding the Chief Data Officer: Lessons and advice from the role’s pioneers
- “Building tools for the AI applications of tomorrow”
- “Toward the Jet Age of machine learning”
- “How to build analytic products in an age when data privacy has become critical”
- “How big data and AI will reshape the automotive industry”: Evangelos Simoudis on next-generation mobility services
- “Bringing AI into the enterprise”: Kris Hammond on business applications of AI technologies and educating future AI specialists

 The evolution of data science, data engineering, and AI | File Type: audio/mpeg | Duration: 00:30:14

This episode of the Data Show marks our 100th episode. This podcast grew out of video interviews conducted at O’Reilly’s 2014 Foo Camp. We had a collection of friends who were key members of the data science and big data communities on hand, and we decided to record short conversations with them. We originally conceived of those initial conversations as the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born. To mark this milestone, my colleague Paco Nathan, co-chair of JupyterCon, turned the tables on me and asked me questions about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses and companies. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people, machine learning is AI. In the AI Conference series, I’m trying to convince people that a true AI system will involve many components, machine learning being one of them. Many of the guests I have seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential pieces of their infrastructure. While that might still be the case, there are more alternatives these days; a lot of people are going to object stores in the cloud. Another example is that people used to maintain specialized systems. There’s still some of that, but people are trying to see if they can combine some of these systems, or come up with systems that can handle more than one workload. For example, the notion in Spark of having a unified system that is able to do batch and streaming caught on during the span of this podcast.

Related resources:

- An easy-to-scan episode list of the Data Show
- “What is data science?”
- “What are machine learning engineers?”
- “What is Artificial Intelligence?”
- “Building tools for the AI applications of tomorrow”
- “Data engineering: A quick and simple definition”
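
The unified batch-and-streaming point is visible in Spark's own API: the same DataFrame operations can run over a static source or a stream, with only the source and sink changing. A minimal sketch (the events/ directory and event_type column are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-batch-streaming").getOrCreate()

# Batch: count events by type over a static directory of JSON files.
batch_df = spark.read.json("events/")
batch_df.groupBy("event_type").count().show()

# Streaming: the same logical query over files as they arrive in the
# directory; the groupBy/count logic is unchanged.
stream_df = spark.readStream.schema(batch_df.schema).json("events/")
query = (stream_df.groupBy("event_type").count()
         .writeStream.outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```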

 Companies in China are moving quickly to embrace AI technologies | File Type: audio/mpeg | Duration: 00:28:52

In this episode of the Data Show, I spoke with Jason Dai, CTO of Big Data Technologies at Intel, and one of my co-chairs for the AI Conference in Beijing. I wanted to check in on the status of BigDL, specifically how companies have been using this deep learning library on top of Apache Spark, and discuss some newly added features. It turns out there are quite a number of companies already using BigDL in production, and we talked about some of the popular use cases he’s encountered. We recorded this podcast while we were at the AI Conference in Beijing, so I wanted to get Dai’s thoughts on the adoption of AI technologies among Chinese companies and local and state government agencies.

Here are some highlights from our conversation:

BigDL: One year later

BigDL was first open-sourced on December 30, 2016, so it has been about one year and four months. We have gotten a lot of positive feedback from the open source community, and we have added a lot of new optimizations and functionality to BigDL. It roughly falls into four classes. We made large optimizations, especially for the big data environment, which is essentially very large-scale Intel server clusters: we use a lot of hardware acceleration and Math Kernel libraries to improve BigDL’s performance on a single node, and at the same time we leverage the Spark architecture so that we can efficiently scale out and perform very large-scale distributed training or inference. In the second part of the year, we provided very rich support for existing deep learning tools: we can directly load and save models to and from TensorFlow, Caffe, Keras, Torch, and other libraries. People can take existing models, load them into BigDL, and run them on Spark.

End-to-end machine learning pipelines

[Figure: End-to-end object detection and image feature extraction pipeline on top of Spark and BigDL. Image source: BigDL white paper, used with permission.]

An industrial deep learning or machine learning application is actually a very complex, end-to-end big data analytics pipeline. You start with data ingestion, data processing, and ETL, and after that you transform your data: for instance, image augmentation, text tokenization, word embedding, and so on. Then you extract features or perform various feature transformations and extractions. Once you have the features extracted and transformed, you can begin model training. But even the model training itself can be an iterative process, a pipeline, if you want to introduce various hyperparameter tunings. … There’s a reason we took an integrated approach when we built BigDL on top of Apache Spark: you are going back and forth between data ingestion, data processing, model training, and inference, and having an integrated software and hardware infrastructure benefits the user a lot.

AI in China

People and companies in China have a very high level of awareness, and a high level of hope that AI can be used to solve real problems. In China, you have the advantage that people move fast to apply these new AI technologies. Companies here have access to large amounts of data, and there are many use cases across industries and the public sector. There are a lot of ways to apply new technology and AI to real-world applications and see their impact. In general, people and companies in China move very fast, and they are very good at experimenting and trying new approaches. People here think they can always iterate and refine their approaches, and we see a lot of new technology getting put into practice very quickly.

Related resources:

- “What happens when AI experts from Silicon Valley and China meet”
- BigDL: Jason Dai on the launch of a new deep learning library for Apache Spark
- “Why AI and machine learning researchers are beginning to embrace PyTorch”
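
Dai's point about keeping ingestion, transformation, feature extraction, and training in one integrated pipeline can be illustrated with Spark's ML Pipeline API. This sketch uses plain Spark MLlib rather than BigDL's own API, and the data set and stages are hypothetical; it shows only the shape of the end-to-end pipeline he describes.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("e2e-pipeline-sketch").getOrCreate()

# Stand-in for the ingestion/ETL stages: a tiny, hypothetical labeled data set.
train = spark.createDataFrame(
    [("spark makes pipelines easy", 1.0), ("this record is labeled noise", 0.0)],
    ["text", "label"],
)

# Transformation -> feature extraction -> model training, chained as one pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(train)          # every stage runs inside the same Spark job
model.transform(train).select("text", "prediction").show()
```

Because all stages share one Spark runtime, data never leaves the cluster between steps, which is the benefit of the integrated approach Dai describes.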
