Professor Ruslan Salakhutdinov received his PhD in machine learning (computer science) from the University of Toronto in 2009. That made Russ Salakhutdinov one of the small coterie of researchers around Geoff Hinton, the so-called Godfather of Artificial Intelligence, when deep learning emerged as a frontier technology in machine learning.
After spending two post-doctoral years at the Massachusetts Institute of Technology Artificial Intelligence Lab, he joined the University of Toronto as an Assistant Professor in the Department of Computer Science and Department of Statistics. In February of 2016, he joined the Machine Learning Department at Carnegie Mellon University.
Professor Salakhutdinov explains to Brand Master Talks how Hinton's determination to put mathematical rigour and scientific proof at the heart of the research propelled an obscure branch of computer science towards a staggering success that shows no sign of slowing down.
Jonathan Elliott: How did you start work with Geoff Hinton on neural networks and what became Deep Learning?
Ruslan Salakhutdinov: It was interesting how my journey with Geoff began. I completed my master's with him before going to work in fintech for a while. Then, in 2005, I bumped into him on the University of Toronto campus one morning. He excitedly told me about the models he was working on, called Deep Belief Networks, which were a way of training multi-layer systems efficiently. This was before all the craziness with Deep Learning got going.
He said, "Hey, Russ, come to my office. I'll show you. This is, like, very cool – these ideas that just came up". And he more or less grabbed me from the street, and basically I went to his office, and he was showing me, like, how you can train these multi-layer systems.
He had a unique way of explaining things, not through equations but by showing me the concepts. And he convinced me that that's the future. Although I didn't fully grasp everything at the time, I was convinced. And that's when I basically said, "Okay, I have to go and do my PhD with Geoff". That was in 2005 before deep learning was anywhere.
When I started at Toronto, I remember the first year we were looking at these models, training these multi-layer systems, and then basically showing that they can learn interesting things. Our work culminated in a paper published in Science Magazine in 2006, showcasing that you can actually train these models efficiently. And they were performing much better than existing models.
And it's important to remember we didn't have GPUs. Everything was run on a small machine with a single CPU; the fact that we could train these systems on simple machines was actually pretty amazing at that time.
Geoff also had a few other students who later went on to work at big tech companies or universities. My time doing my PhD with Geoff remains one of the best periods of my life.
JE: What do you think influenced the wider adoption of Deep Learning? What did it take for it to take off the way it did?
RS: I believe part of it was due to the results, but there were two main factors. One is the machine learning community itself, which around 2005 was predominantly focused on statistical machine learning. We had to come up with algorithms, prove their effectiveness, define objective functions, and things of that sort.
At that time, neural networks did exist, but many people thought they were too complex and lacked a scientific basis. What really sparked interest was when Hinton and his collaborators demonstrated in the early stages that these models could be trained effectively using a proper objective like a variational bound.
Rather than simply presenting something that worked and saying, "Hey, we cooked something up and it works!", they provided a mathematical proof behind it.
We studied various models like Boltzmann machines and deep Boltzmann machines, and we had actually designed objective functions for them. This scientific approach excited people, who realised that there was more to it than just engineering effort; there is actually science behind it.
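To give a flavour of the kind of "proper objective" Russ is describing: the greedy layer-wise training of a Deep Belief Network can be justified by a variational lower bound on the data log-likelihood, roughly of the following form (a simplified, generic statement, not a quotation from the original papers):

\log p(\mathbf{v}) \;\ge\; \mathbb{E}_{Q(\mathbf{h}\mid\mathbf{v})}\!\left[\log p(\mathbf{v}\mid\mathbf{h}) + \log p(\mathbf{h})\right] + \mathcal{H}\!\left(Q(\mathbf{h}\mid\mathbf{v})\right)

Here \mathbf{v} denotes the visible units, \mathbf{h} the hidden units of a layer, Q(\mathbf{h}\mid\mathbf{v}) an approximate posterior, and \mathcal{H} its entropy. Adding and training a new layer so that this bound does not decrease is what made the procedure a principled objective rather than an engineering trick.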
This understanding began to spread in the early stages, but even by 2011, when I had completed my PhD, taken a postdoc at MIT, and was back on the job market, deep learning was not yet widely accepted as the future of machine learning. I structured my job pitch not as "Hey, here's deep learning, and this is how it's going to work" but more like "Here are statistical machine learning systems, and these are proper objectives" and things of that sort. Even at that time, there wasn't a widespread view that deep learning could be part of the future. It was interesting but niche.
Then, in 2012, when people started training these models at a larger scale and demonstrating their value, this is when a wider acceptance occurred within the computer vision and speech recognition communities.
JE: How did the key applications of deep learning evolve and grow from image recognition and speech recognition? How did these emerge and grow into what Deep Learning has become today?
RS: I would say the first major breakthrough occurred in 2012, when Geoff, Ilya Sutskever, and Alex Krizhevsky showed that you could build deep convolutional neural networks that delivered a significant reduction in error rates.
Many labs and researchers were focused on the ImageNet dataset, which consists of one million images categorised into 1000 classes and is used as a benchmark for computer vision.
In Geoff's lab, they built what they called "AlexNet" – named after and led by Alex Krizhevsky, who was a PhD student at that time – and they achieved remarkable results. Their model reduced error rates from around 30% to 16%, essentially cutting them in half. This was a turning point; a lot of people didn't believe the results at first because they were so good, and they said, "Whoa, that has to be a mistake, they probably looked at the test set. That cannot be."
Well, we thought that's a good sign because a lot of people started replicating these results and, sure enough, they confirmed that these deep learning models were indeed highly effective in computer vision tasks.
Prior to the advent of deep learning in computer vision, researchers primarily relied on something called deformable parts models using support vector machines, which provided some interpretability: you could say, "Well, if you're detecting a human, here's the detection of the hands, here's the detection of the torso, here's the detection of the head, and we can visualise what the model is doing." People felt that was the way it was going to go, but deep networks outperformed it.
It was kind of hilarious at the time, because Alex Krizhevsky gave a talk at one of the workshops, and his talk was like, "Yeah, I had a couple of GPUs in my bedroom and I built this model and here are the results, right?" And there were people there from massive labs at major universities and major tech firms, all trying to build systems to improve recognition, and here's this 20-year-old guy… you know…
This breakthrough led to widespread adoption of deep learning in computer vision, and the same thing happened in speech recognition as well. Once big companies like Google, Meta, and Apple started looking at this tech, they realised that with more research, more resources, and more data, significantly better systems could be developed.
JE: How did this breakthrough develop into the commercial applications we see today?
RS: Yeah, I believe the way it evolved is that GPT models, large language models like ChatGPT, were a big breakthrough. As long as you train these models with a lot of data, they have proven to work remarkably well. Nobody expected them to work so well.
In fact, many people felt that large language models were still missing pieces, in terms of algorithms, architectures, and reasoning, needed to reach our current capabilities. Even I felt we were missing key pieces. But it turns out that scaling these models up to internet-scale data, plus tuning and fine-tuning, with a few caveats of course, makes them work remarkably well, better than anything we had before.
In terms of the next steps and commercial applications, I see the field moving towards more complex tasks. Right now, it's very much input and output: you ask ChatGPT a question and you get an answer. One direction is towards AI agents, where systems will be able to solve tasks on behalf of the user. This could range from setting appointments to finding the cheapest tickets to London. We're not there yet, but you can think of it as a personal assistant.
At CMU (Carnegie Mellon University), we're looking at a system where we can build task-oriented models. I tell it to go and do something, and if it can't, it comes back and says: "I'm halfway done, but I need some help". That's one area that is going to be evolving in the next few years.
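To make that loop concrete, here is a minimal sketch of what such a task-oriented agent might look like; the function names, the stubbed planner, and the "needs help" signal are hypothetical illustrations, not the actual CMU system:

```python
# Minimal sketch of a task-oriented agent loop (hypothetical interfaces,
# not the actual CMU system): plan a step, execute it, and report back to
# the user when the task is done or the agent gets stuck.

from dataclasses import dataclass

@dataclass
class StepResult:
    done: bool          # task finished?
    needs_help: bool    # agent is stuck and must ask the user
    message: str        # progress report or question for the user

def plan_next_step(task: str, history: list) -> str:
    """Stand-in for an LLM call that proposes the next action."""
    return f"search for: {task}" if not history else "finalise"

def execute(action: str) -> StepResult:
    """Stand-in for a tool call (browser, calendar API, and so on)."""
    return StepResult(done=(action == "finalise"), needs_help=False,
                      message=f"completed '{action}'")

def run_agent(task: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):
        result = execute(plan_next_step(task, history))
        history.append(result.message)
        if result.needs_help:
            return "I'm partway done, but I need some help: " + result.message
        if result.done:
            return "Done: " + "; ".join(history)
    return "I'm halfway done, but I ran out of steps."

print(run_agent("find the cheapest tickets to London"))
```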
From that, you could have personal tutors, finance advisors, medical advisors, and more. These systems would not just provide answers to questions but take action on the user's behalf. Imagine having an AI assistant that can communicate with service providers, fix technical issues, or complete tasks like registering your child for activities.
Overall, this evolution in AI systems will provide users with more powerful tools and personalised assistance, making everyday tasks more efficient and convenient for everyone.
JE: What about the limitations of processing power and the ability to handle the sheer quantity of data? I mean, do you see challenges of that kind coming up?
RS: Yes, absolutely. These systems, these large language models, are obviously very computationally demanding.
If you look at GPT-4V, for example, these models are quite massive in terms of compute. But I do see companies trying to build lightweight models: models that can do fast inference, models that are small rather than big.
And then there is a technique called distillation, where you take your large model and try to distil it into a smaller, more efficient model.
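As a rough illustration of the idea, and only a generic sketch rather than any company's actual recipe, a distillation loss typically trains the small "student" model to match the softened predictions of the large "teacher" while still fitting the true labels:

```python
# Generic knowledge-distillation loss (illustrative sketch, not any specific
# production pipeline): the student mimics the teacher's softened output
# distribution and also fits the ground-truth labels.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soften both distributions with a temperature, then measure the KL
    # divergence between teacher and student.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy against the true labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random tensors standing in for real model outputs:
student_logits = torch.randn(8, 10)           # small on-device model
teacher_logits = torch.randn(8, 10)           # frozen large model
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```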
For example, companies like Apple are working in that space because they want to make sure that your assistant, Siri or whatever you're running, runs locally on your phone. It cannot drain a lot of your battery, so it has to be low in demand. Of course, these models are probably not going to be as powerful as the large models, but I suspect there are going to be a lot of small models specialised in specific areas.
They are going to be more efficient, more specialised, and cost is going to go down. And of course, hardware is going to be improving, but I do see a lot of effort from big tech, in particular, trying to take these large models, distil them into smaller models, or build models that will be more specialised for certain tasks.
JE: Can you see Deep Learning architectures being blended and hybridised with other systems, like generative models or hybrids that enable the capture of different aspects of data?
RS: There is a lot of debate in the community about whether we should build deep learning systems all the way through, or whether these will be hybrid systems where deep learning processes the data and then hands over to a different model that incorporates reasoning. The jury is still out on this topic.
The big advantage of deep learning is that you can train these systems end-to-end, which means you don't have to produce something and then feed it into another model; this helps prevent errors from propagating. However, there are issues with explainability when trying to understand why a model makes mistakes, especially in safety-critical applications. There may need to be guardrails or additional measures around deep learning models. There are also the problems with hallucinations that we're seeing right now. Explicitly defining guardrails for uncertain situations can be challenging because deep learning models are inherently stochastic systems. It's very hard to put the rules in and say, "If this happens, you shouldn't do x".
Despite these challenges, the scalability of deep learning models sets them apart from traditional machine learning systems. They can handle large volumes of data and discover complex patterns that are difficult for traditional systems to learn. While expert systems based on predefined rules were once considered the standard, the flexibility and scalability of deep learning models have made them more popular for handling massive amounts of data.