O'Reilly Data Show Podcast

Summary: The O'Reilly Data Show Podcast explores the opportunities and techniques driving big data, data science, and AI.

Podcasts:

 Machine learning needs machine teaching | File Type: audio/mpeg | Duration: 00:45:12

In this episode of the Data Show, I spoke with Mark Hammond, founder and CEO of Bonsai, a startup at the forefront of developing AI systems in industrial settings. While many articles have been written about developments in computer vision, speech recognition, and autonomous vehicles, I’m particularly excited about near-term applications of AI to manufacturing, robotics, and industrial automation. In a recent post, I outlined practical applications of reinforcement learning (RL)—a type of machine learning now being used in AI systems. In particular, I described how companies like Bonsai are applying RL to manufacturing and industrial automation. As researchers explore new approaches for solving RL problems, I expect many of the first applications to be in industrial automation.

Here are some highlights from our conversation:

Machine learning and machine teaching

Everyone is so focused on making better and faster learning algorithms; what do we do when we have it? Let’s just suppose that you now have an algorithm that can learn as well as or better than humans. How do we use that, how do we apply that in a predictable, scalable, repeatable way toward the objectives that we want to apply it toward? … I thought about that for a while, and it’s one of those things where the answer is obvious in hindsight, but until you sit down and really chew on it, it doesn’t jump out at you. And it’s that, by design, if you’re building a learning system—if you want to program it—you have to teach it. Machine teaching and machine learning are necessary complements to one another; you need both. And for the most part, what comprises machine teaching these days consists of giant labeled data sets. … You need machine teaching and machine learning.

It dawned on me that this was the core abstraction that was going to make it possible for us to start applying all of this stuff more broadly across all the myriad use cases that we see in the real world, without having to turn all of the people who are looking to use it into experts in machine learning and data science. It’s what enabled me to realize what Bonsai’s mission is: to enable your subject matter experts (a chemical engineer or a mechanical engineer, someone who is very, very well versed in whatever their domain is but not necessarily in machine learning or data science) to take that expertise and use it as the foundation for describing what to teach, and then automating the underlying pieces for how you can actually effectively learn that.

Common use cases for reinforcement learning

In a lot of cases, the system you are working with is undergoing some form of tuning. We see a lot of this. It might be that you have an HVAC system and you are looking to optimize across the lifetime of the equipment, the comfort of the user, and the energy consumption of the device, which is its own complex optimization problem. You do this by virtue of tuning a bunch of parameters or controlling a bunch of different capabilities on the device. … So, in many cases, a lot of what our users are applying the system toward is not necessarily strict real-time control. Certainly, we do that as well, and we put out a paper recently showing dexterous robotic control and manipulation using these kinds of techniques. You can absolutely do that, and we have customers who do that. A very common thing that we see across a lot of the real-world applications is more a tuning characteristic.
Related resources:

“Practical applications of reinforcement learning in industry”: an overview of commercial and industrial applications of reinforcement learning
“Introducing RLlib – A composable and scalable reinforcement learning library”: this new software makes the task of training RL models much more accessible
Deep reinforcement learning in the enterprise—Bridging the gap from games to industry: a 2017 Artificial Intelligence Conference presentation by Mark Hammond
Ray
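To make the tuning use case concrete, here is a minimal, self-contained sketch of tabular Q-learning applied to a toy HVAC setpoint problem. The environment dynamics, reward weights, and discretization below are invented for illustration only; Bonsai's platform and real industrial RL systems work quite differently.

```python
# Toy RL sketch: tune a thermostat setpoint to balance comfort and energy.
# Everything here (dynamics, reward, discretization) is hypothetical.
import random

SETPOINTS = [18, 20, 22, 24, 26]       # candidate setpoints (deg C)
TEMPS = list(range(15, 31))            # discretized indoor temperatures

def step(temp, setpoint):
    """Room drifts one degree toward the setpoint; reward penalizes
    discomfort (distance from 22 C) and energy use (temperature moved)."""
    new_temp = temp + max(-1, min(1, setpoint - temp))
    new_temp = max(15, min(30, new_temp))
    reward = -(abs(new_temp - 22) + 0.5 * abs(new_temp - temp))
    return new_temp, reward

q = {(t, a): 0.0 for t in TEMPS for a in SETPOINTS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for _episode in range(2000):
    temp = random.choice(TEMPS)
    for _hour in range(24):            # one "day" of hourly decisions
        if random.random() < epsilon:
            action = random.choice(SETPOINTS)
        else:
            action = max(SETPOINTS, key=lambda a: q[(temp, a)])
        new_temp, reward = step(temp, action)
        best_next = max(q[(new_temp, a)] for a in SETPOINTS)
        q[(temp, action)] += alpha * (reward + gamma * best_next - q[(temp, action)])
        temp = new_temp

# The greedy policy maps each observed temperature to a setpoint.
policy = {t: max(SETPOINTS, key=lambda a: q[(t, a)]) for t in TEMPS}
print(policy)
```

The learned policy is simply a mapping from an observed state to a tuning decision, which is the "tuning characteristic" Hammond describes, as opposed to strict real-time control.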

 How machine learning can be used to write more secure computer programs | File Type: audio/mpeg | Duration: 00:28:12

In this episode of the Data Show, I spoke with Fabian Yamaguchi, chief scientist at ShiftLeft. His 2015 Ph.D. dissertation sketched out how the combination of static analysis, graph mining, and machine learning can be used to develop tools that augment security analysts. In a recent post, I argued for machine learning tools to augment teams responsible for deploying and managing models in production (machine learning engineers). These are part of a general trend of using machine learning to develop and manage the software systems of tomorrow. Yamaguchi’s work is a first step in this direction: using machine learning to reduce the number of security vulnerabilities in complex software products.

Here are some highlights from our conversation:

Machine learning to find code vulnerabilities

I was not trying to build something that would just automatically take the code and give you all of the vulnerabilities. Instead, I was looking at the typical kinds of tasks that I would encounter myself when doing these security audits, and I would ask myself, how can I automate these subtasks? As an example, when you find a vulnerability in code, the question that often arises is whether there are similar vulnerabilities still in that same program. That’s one of those subtasks you can automate well, because what you’re actually doing is saying: ‘Hey, here’s an example of what a bug looks like. Can you scan the rest of the code? Can you use machine learning to actually determine other locations in the code that implement the same bug?’ … In machine learning, you never have enough data. In this case, this is actually an unsupervised learning approach. You’re taking all of the functions that you can get and you extract the dominant programming patterns in there. … It’s a bit like what you would do to find similar text documents, but it’s used for code.

From source code to graph analytics

By transforming software code into a graph, you can extract different properties from that code by analyzing the graph. … Let’s take a smaller function that might have one IF block. One of the graph structures that’s first generated is called an abstract syntax tree. That’s a tree that you get by just parsing the code. … For each IF and for each variable, for each statement, there’s going to be a node. For each operator, like if there’s an assignment, there’s also going to be a node, and they are all connected by edges. You soon run into a lot of nodes and edges. If you take something like, let’s say, the Linux kernel, you’ll have several hundreds of thousands of nodes. … You can do a lot by essentially solving reachability problems in these graphs.

Related resources:

“How machine learning will accelerate data management systems”
Artificial intelligence in the software engineering workflow: a 2017 AI Conference keynote by Peter Norvig
“Responsible deployment of machine learning”: why we need to build machine learning tools to augment our machine learning engineers
“Architecting and building end-to-end streaming applications”
Data is only as valuable as the decisions it enables
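To give a feel for the "code as a graph" idea Yamaguchi describes, the toy sketch below parses a small snippet into an abstract syntax tree and answers a simple reachability question over it. His actual work builds much richer code property graphs over C/C++ code bases; the snippet, node labels, and query here are hypothetical and use only Python's built-in ast module plus networkx.

```python
# Toy illustration: turn code into a graph, then ask a reachability question.
import ast
import networkx as nx

SOURCE = """
def copy_buf(src, n):
    if n > 0:
        dst = src[:n]
        return dst
    return None
"""

tree = ast.parse(SOURCE)
graph = nx.DiGraph()
graph.add_node(id(tree), label=type(tree).__name__)

def add_edges(node):
    """Mirror the AST into a directed graph, one node per syntax element."""
    for child in ast.iter_child_nodes(node):
        graph.add_node(id(child), label=type(child).__name__)
        graph.add_edge(id(node), id(child))
        add_edges(child)

add_edges(tree)
print("AST nodes:", graph.number_of_nodes(), "edges:", graph.number_of_edges())

# Reachability: is any Return node reachable from the function definition?
func = next(n for n, d in graph.nodes(data=True) if d["label"] == "FunctionDef")
returns = [n for n, d in graph.nodes(data=True) if d["label"] == "Return"]
print("Return reachable from FunctionDef:",
      any(nx.has_path(graph, func, r) for r in returns))
```

On something the size of the Linux kernel the same idea yields hundreds of thousands of nodes, which is where graph analytics and pattern mining start to pay off.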

 Bringing AI into the enterprise | File Type: audio/mpeg | Duration: 00:44:13

In this episode of the Data Show, I spoke with Kristian Hammond, chief scientist of Narrative Science and professor of EECS at Northwestern University. He has been at the forefront of helping companies understand the power, limitations, and disruptive potential of AI technologies and tools. In a previous post on machine learning, I listed types of use cases (a taxonomy) for machine learning that could just as well apply to enterprise applications of AI. But how do you identify good use cases to begin with? A good place to start for most companies is by looking for AI technologies that can help automate routine tasks, particularly low-skill tasks that occupy the time of high-skilled workers. An initial list of candidate tasks can be gathered by applying the following series of simple questions:

Is the task data-driven?
Do you have the data to support the automation of the task?
Do you really need the scale that automation can provide?

We discussed other factors companies should consider when thinking through their AI strategies, education and training programs for AI specialists, and the importance of ethics and fairness in AI and data science.

Here are some highlights from our conversation:

It begins with finding use cases

I’ve been interacting more and more with companies that are thinking about AI solutions; they often won’t have gotten to the place where they can talk about what they want to do. It’s an odd thing, because there’s so much data out there and there’s so much hunger to derive something from that data. The starting point is often bringing an organization back down to, “So what do you want and need to do? What kind of decision-making do you want to support? What kinds of predictions would you like to be able to make?”

Identifying which tasks can be automated

Sometimes, you see a decision being made and, from an organizational point of view, everyone agrees that this decision is really strongly data driven. But it’s not strongly data driven. It’s data driven based upon the historical information that two or three people are using. It looks like they’re looking at data and then making a decision, but, in fact, what they’re doing is looking at data, remembering one of 2,000 past examples in their heads, and coming out with a decision. … There are sets of tasks in almost any organization that nobody likes to have anything to do with. In the legal profession, there are tasks around things like discovery, where you actually need to be able to look through a corpus of documents, but you also need to have some idea of the semantic relationships between words. This is totally learnable using existing technologies. … It’s not as though tasks that can be automated don’t exist. They do, and, in fact, they not only exist, but they’re easily doable with current technologies. It’s a matter of understanding where to draw the line. It’s sometimes easy for organizations to look at the problem and sort of hallucinate that there is not a different kind of reasoning going on in the heads of the people who are solving the problem. … You have to be willing to look at that and say, “Oh, I’m not going to replace the smartest person in the company, but, you know, I will free up the time of some of our smartest people by taking these tasks on and having the machine do them.”

Related resources:

Here and now – Bringing AI into the enterprise: Kris Hammond’s tutorial at the 2017 AI conference in San Francisco
Vertical AI – Solving full stack industry problems using subject-matter expertise, unique data, and AI to deliver a product’s core value proposition: Bradford Cross at the 2017 AI conference in San Francisco
Demystifying the AI hype: Kathryn Hume at the 2017 AI conference in NYC
“6 practical guidelines for implementing conversational AI“: Susan

 How machine learning will accelerate data management systems | File Type: audio/mpeg | Duration: 00:34:46

In this episode of the Data Show, I spoke with Tim Kraska, associate professor of computer science at MIT. To take advantage of big data, we need scalable, fast, and efficient data management systems. Database administrators and users often find themselves tasked with building index structures (“indexes” in database parlance), which are needed to speed up data access. Some common examples include:

B-trees—used for range requests (e.g., assemble all sales orders within a certain time frame)
Hash maps—used for key-based lookups
Bloom filters—used to check whether an element or piece of data is present in a set

Index structures take up space in a database, so you need to be selective about what to index, and they do not take advantage of the underlying data distributions. I’ve worked in settings where an administrator or expert user carefully implements a strategy for building indexes for a data warehouse based on important and common queries. Indexes are really models, or mappings—for instance, a Bloom filter can be thought of as a classification problem. In a recent paper, Kraska and his collaborators approach indexing as a learning problem. As a result, they are able to build indexes that take into account the underlying data distributions, are smaller in size (thus allowing for a more liberal indexing strategy), and execute faster. Software and hardware for computation are getting cheaper and better, so using machine learning to create index structures may indeed become routine. This ties in with a larger trend of using machine learning to improve software systems and even software development. In the future, database administrators will have machine learning tools at their disposal that allow them to manage larger and more complex systems, and these tools will free them to focus on complex tasks that are harder to automate.

Here are some highlights from our conversation:

Why use machine learning to learn index structures

I think it used to be the case that if you knew the key distribution, you could leverage that, but you needed to build a very specialized system for it. Then, if the data distribution changed, you needed to adjust the whole system. At the same time, any learning mechanism was, in the past, way too expensive. Things have changed a little bit because compute is becoming much cheaper. Suddenly, using machine learning to train this mapping actually pays off. On one hand, B-tree structures are composed of a whole bunch of “IF statements,” and in the past, multiplications were very expensive. Now multiplications are getting cheaper and cheaper. Scaling “IF statements” is hard, but scaling math operations is at least relatively easier. In essence, we can trade these “IF statements” for multiplications, and that’s why suddenly learning the data distribution pays off. … For B-trees, for example, we saw speed-ups of up to roughly 2x. However, the indexes were up to two orders of magnitude smaller.

Why B-trees are models. Image from Tim Kraska, used with permission.

The future of data management systems

If this machine learning approach really works out, I think this might change the way database systems are built. … Maybe the database administrator (DBA) of the future becomes a machine learning expert. … My hope is that the system can figure out what model to use, but if you want the best performance and you know your data very well, I can see the DBA / machine learning expert choosing a certain type of model to tune the index. … There was this tweet essentially saying that machine learning will change how we build core algorithms and data structures. I think this is currently still the better analogy.

Related resources:

Tupleware—redefining modern data analytics: a Strata Data 2014 presentation by Tim Kraska
Artificial intelligence in the software engineering workflow: a 2017 AI Conference keynote by Peter Norvig
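As a rough illustration of the learned index idea, the sketch below fits a single linear model that maps keys to positions in a sorted array, then corrects the prediction with a bounded binary search. This is a toy with one model and synthetic data, not the recursive model index from Kraska et al.'s paper.

```python
# Toy "learned index": predict a key's position, then search only within
# the model's known worst-case error bound.
import bisect
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.lognormal(mean=3.0, sigma=1.0, size=100_000))
positions = np.arange(len(keys))

# "Train" the index: a one-variable linear model from key to position.
# (A real learned index uses a hierarchy of models to handle skewed data.)
slope, intercept = np.polyfit(keys, positions, deg=1)
pred = slope * keys + intercept
max_err = int(np.ceil(np.max(np.abs(pred - positions))))  # worst-case error

def learned_lookup(key):
    """Predict a position, then binary-search only inside the error window."""
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
    return i if i < len(keys) and keys[i] == key else None

probe = keys[12_345]
print(learned_lookup(probe) == 12_345)   # True: found at the correct position
```

Because the error bound is known after training, lookups stay correct; how tight that bound is (and therefore how small and fast the index is) depends on how well the model captures the key distribution, which is exactly why learning the distribution pays off.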

 Machine learning at Spotify: You are what you stream | File Type: audio/mpeg | Duration: 00:21:23

In this episode of the Data Show, I spoke with Christine Hung, head of data solutions at Spotify. Prior to joining Spotify, she led data teams at the NY Times and at Apple (iTunes). Having led teams at three different companies, I wanted to hear her thoughts on digital transformation, and I wanted to know how she approaches the challenge of building, managing, and nurturing data teams. I also wanted to learn more about what goes into building a recommender system for a popular consumer service like Spotify. Engagement should clearly be the most important metric, but there are other considerations, such as introducing users to new or “long tail” content.

Here are some highlights from our conversation:

Recommenders at Spotify

For us, engagement always comes first. At Spotify, we have a couple hundred people who are just focused on user engagement, and this is the group that creates personalized playlists, like Discover Weekly or your Daily Mix, for you. We know our users love discovery and see Spotify as a very important platform for them to discover something new, but there are also times when people just want to have some music played in the background that fits the mood. But again, we don’t have a specific agenda in terms of what we should push for. We want to give you what you want so that you are happy, which is why we invested so much in understanding people through music. If we believe you might like some “long tail” content, we will recommend it to you because it makes you happy, but we can also do the same for top 100 tracks if we believe you will enjoy them.

Music is like a mirror

Music is like a mirror, and it tells people a lot about who you are and what you care about, whether you like it or not. We love to say “you are what you stream,” and that is so true. As you can imagine, we invest a lot in our machine learning capabilities to predict people’s preference and context, and of course, all the data we use to train the model is anonymized. We take in large amounts of anonymized training data to develop these models, and we test them out with different use cases, analyze results, and use the learning to improve those models. Just to give you my personal example to illustrate how it works: you can learn a lot about me just by me telling you what I stream. You will see that I use my running playlist only during the weekend in early mornings, and I have a lot of children’s songs streamed at my house between 5 p.m. and 7 p.m. I also have a lot of tango and salsa playlists that I created and followed. So what does that tell you? It tells you that I am probably a weekend runner, which means I have some kind of affinity for fitness; it tells you that I am probably a mother and play songs for my child after I get home from work; it also tells you that I somehow like tango and salsa, so I am probably a dancer, too. As you can see, we are investing a lot into understanding people’s context and preferences so we can start capturing different moments of their lives. And, of course, the more we understand your context, your preferences, and what you are looking for, the better we can customize your playlists for you.

Related resources:

Music, the window into your soul: Christine Hung’s keynote at Strata Data NYC 2017
“Transforming organizations through analytics centers of excellence”: Carme Artigas on helping enterprises transform themselves with big data tools and technologies
“A framework for building and evaluating data products”: Grace Huang on lessons learned in the course of machine learning product launches
“How companies can navigate the age of machine learning”: to become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models
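As a toy illustration of the "you are what you stream" idea, the sketch below derives simple context features from a made-up listening log with pandas. The column names, playlists, and timestamps are entirely hypothetical; Spotify's real pipeline operates on anonymized data at a vastly larger scale.

```python
# Hypothetical sketch: infer listening context from a tiny stream log.
import pandas as pd

streams = pd.DataFrame({
    "played_at": pd.to_datetime([
        "2017-11-04 06:30", "2017-11-05 06:45",   # weekend early mornings
        "2017-11-06 17:30", "2017-11-07 18:10",   # weekday evenings
    ]),
    "playlist": ["Running", "Running", "Kids Songs", "Kids Songs"],
})

streams["hour"] = streams["played_at"].dt.hour
streams["is_weekend"] = streams["played_at"].dt.dayofweek >= 5

# Summarize when each playlist is used: average hour and weekend share.
profile = streams.groupby("playlist").agg(
    avg_hour=("hour", "mean"),
    weekend_share=("is_weekend", "mean"),
)
print(profile)
# A downstream recommender could use features like these to choose between,
# say, a weekend workout mix and an early-evening family mix.
```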

 The current state of Apache Kafka | File Type: audio/mpeg | Duration: 00:37:16

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “the age of machine learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics—including machine learning—remains an area of focus for most companies. It turns out “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. By making it easier to create and productionize data refinement pipelines on both batch and streaming data sources, analysts and data scientists can focus on analytics that can unlock value from data. On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.

Here are some highlights from our conversation:

The first engineering project that made use of Apache Kafka

If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and make them Hadoop specific? And what happens when a third and a fourth and a fifth system is put into place? So, the initial motivating use case was: ‘we are putting this Hadoop thing into place. That’s the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka’s very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka. So from there, Kafka adoption became pretty viral. Now, years down the line, all of LinkedIn runs on Kafka. It’s essentially the central nervous system for the whole company.

Microservices and Kafka

My own opinion of microservices is that it lets you add more money and turn it into software at a more constant rate, by allowing engineers to focus on various parts of the application and by essentially decoupling a big monolith so that a lot of development on real applications can happen in parallel. … The upside is that it lets you move fast. It adds a certain amount of agility to an engineering organization. But it comes with its own set of challenges. And these were not very obvious back then. How are all these microservices deployed? How are they monitored? And, most importantly, how do they communicate with each other? The communication bit is where Kafka comes in. When you break a monolith, you break state. And you distribute that state across different machines that run all those different applications. So now the problem is, ‘well, how do these microservices share that state? How do they talk to each other?’ Frequently, the expectation is that things happen in real time. The context of microservices where streams or Kafka comes in is in the communication model for those microservices. I should just say that there isn’t a one-size-fits-all when it comes to communication patterns for microservices.

Related resources:

Kafka: The Definitive Guide
“Architecting and building end-to-end streaming applications“: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications
“Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series“: Ira Cohen on developing machine learning tools for a broad range of real-time applications
“Building Apache Kafka from scratch“: Jay Kreps on data integration, event data, and the Internet of Things
I Heart Logs
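To ground the microservices discussion, here is a minimal sketch of two services sharing state through a Kafka topic using the confluent-kafka Python client. It assumes a broker running at localhost:9092 and an 'orders' topic; the service names and payloads are invented for illustration.

```python
# Two hypothetical "microservices" communicating through Kafka rather than
# calling each other directly. Requires a running broker at localhost:9092.
import json
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"

# Service A: publishes an event describing a state change.
producer = Producer({"bootstrap.servers": BROKER})
event = {"order_id": 42, "status": "created"}
producer.produce("orders", key=str(event["order_id"]), value=json.dumps(event))
producer.flush()

# Service B: consumes the same stream and updates its own state.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "billing-service",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    order = json.loads(msg.value())
    print("billing-service saw order", order["order_id"], order["status"])
consumer.close()
```

The point of the pattern is that the producer and consumer never call each other: the log is the shared state, so new services can subscribe to the same stream without changing the producer.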

 Building a natural language processing library for Apache Spark | File Type: audio/mpeg | Duration: 00:33:49

When I first discovered and started using Apache Spark, a majority of the use cases I used it for involved unstructured text. The absence of libraries meant rolling my own NLP utilities, and, in many cases, implementing a machine learning library (this was pre-deep learning, and MLlib was much smaller). I’d always wondered why no one bothered to create an NLP library for Spark when many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries.

In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago, I mentioned the need for an NLP library within Spark to Talby; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception received by BigDL and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will be a standard tool among Spark users. Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.

Here are some highlights from our conversation:

The state of NLP in Spark

Here are your two choices today. Either you want to leverage all of the performance and optimization that Spark gives you, which means you want to stay basically within the JVM, and you want to use a Java-based library. In which case, you have options that include OpenNLP, which is open source, or Stanford NLP, which requires licensing in order to use in a commercial product. These are older and more academically oriented libraries, so they have limitations in performance and in what they do. Another option is to look at something like spaCy—a Python-based library that really has raised the bar in terms of usability and the trade-offs between analytical accuracy and performance. But then your challenge is that you have your text in Spark, but to call the spaCy pipeline, you basically have to move the data from the JVM to a Python process, do some processing there, and send it back, which in practice means you take a huge performance hit, because most of the processing you do is really moving strings between operating system processes. … So, really, what we were looking for is a solution that works on text directly, within a data frame. A tool that takes into account everything Spark gives you in terms of caching, distributed computation, and the other optimizations. This enables users to basically run an NLP and machine learning pipeline directly on their text.

Enter Spark NLP

Spark NLP. Image by David Talby, used with permission.

The core purpose of an NLP library is the ability to take text and then apply a set of annotations on the text. So, the basic annotations we ship in this initial version of Spark NLP include things like a tokenizer, a lemmatizer, sentence boundary detection, and paragraph boundary detection. Then on top of that, we include things like sentiment analysis, a spell checker so we can auto-suggest corrections, and a dependency parser so we can know not just that we have a noun and a verb, but also that this verb talks about the specific noun, which is often semantically interesting. We also include named entity recognition algorithms.

Deploying and monitoring machine learning models in production

I think what’s happening is that people expect basic model development to be very similar to software development. When we started doing software development, we started it wrong. We assumed software engineering was a lot like civil engineering or mechanical engineering. It took a good 30 years until we said no, this is actual
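To show the DataFrame-native pipeline pattern that Spark NLP builds on, here is a small sketch using only Spark ML's built-in Tokenizer and HashingTF, so it runs without extra packages. Spark NLP's annotators (tokenizer, lemmatizer, sentiment, NER, and so on) plug into this same Pipeline API; its exact class names are deliberately not guessed at for a specific release.

```python
# Sketch of a DataFrame-native text pipeline: the text never leaves the JVM,
# and each stage adds annotation columns to the same DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.appName("nlp-pipeline-sketch").getOrCreate()

docs = spark.createDataFrame(
    [(1, "Spark keeps the text inside the JVM"),
     (2, "No round trip to a separate Python process")],
    ["id", "text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tf = HashingTF(inputCol="tokens", outputCol="features", numFeatures=1 << 10)

pipeline = Pipeline(stages=[tokenizer, tf])
model = pipeline.fit(docs)
model.transform(docs).select("id", "tokens", "features").show(truncate=False)

spark.stop()
```

The performance argument in the interview is visible here: every stage operates on the distributed DataFrame in place, instead of serializing strings out to an external process and back.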

 Machine intelligence for content distribution, logistics, smarter cities, and more | File Type: audio/mpeg | Duration: 00:36:05

In this episode of the Data Show, I spoke with Rhea Liu, analyst at China Tech Insights, a new research firm that is part of Tencent’s Online Media Group. If there’s one place where AI and machine learning are discussed even more than in the San Francisco Bay Area, it would be China. Each time I go to China, there are new applications that weren’t widely available just the year before. This year, it was impossible to miss bike sharing, mobile payments seemed to be accepted everywhere, and people kept pointing out nascent applications of computer vision (facial recognition) to identity management and retail (unmanned stores). I wanted to consult local market researchers to help make sense of some of the things I’ve been observing from afar. Liu and her colleagues have put out a series of interesting reports highlighting some of these important trends. They also have an annual report—Trends & Predictions for China’s Tech Industry in 2018—that Liu will discuss in her keynote and talk at Strata Data Singapore in December.

Here are some highlights from our conversation:

Machine learning and content distribution

Media consumption takes up a large proportion of people’s everyday life here in China. Before, people learned their news from news portals and from editorial teams who served as the gatekeepers. People now trust machine learning algorithms with editorial and agenda setting. Apps like Toutiao have become very popular. It’s been quite a surprise to most news portals and media professionals here in China. People are trying to find a balance between the traditional ways of content creation and the new ways of content distribution by aggregators fully powered by machines. Toutiao’s news recommendation engine is purely a black box to most people. … But users are spending more and more time on these types of platforms, and machine-generated news feeds have become a big thing. … So, it’s now becoming a content war again. After these algorithms improve the efficiency of content distribution, the battle may come down to what content you have.

Bike sharing

Bike sharing is a new model adapted to Chinese society. … In between every subway station, there are still several miles to go, where people still need to walk or maybe take a taxi. Bike sharing is being used to replace these other kinds of approaches. There are two primary players: one is Mobike and the other is Ofo, and they actually started with different models. Ofo started a year or two earlier, on a university campus. … It provided this kind of public bike rental system to users on campus. This was the preliminary prototype of the model. Mobike started in a city. These bike sharing companies have GPS systems on the bikes, and the bikes have digital electronic locks that can be unlocked with an app on your phone. These technologies, combined, help them collect data as well as better manage all the bikes they distribute across a city.

Smart cities

It’s still a maybe, but it’s very likely we are going to include things about smart cities in our 2018 reports. … This includes AR applications to help build better cities for urban planning. … Urban planning is a very complicated thing, and what we were missing was data; we could be a little bit left behind because of the lack of data. But now people have different types of data. For example, I know the ride sharing company Didi is collaborating with several city governments to help them do urban planning: using data to better understand traffic, how to manage traffic light systems in the city, and also the bus system. City governments at all levels are now collaborating with all these tech companies to explore applications of their data to improve the cities we have in China. … This is going to be a very important opportunity for the tech companies here in China, especially in terms of their data applications.
