
Key takeaways from the Deep Learning Specialization

In recent years, Deep Learning has forever changed the landscape of innumerable areas. From real-time object detection to medical analysis, from self-driving cars to speech recognition, and even to a machine beating the world’s best Go player, all of these have been made possible by the unprecedented growth in deep learning. While the field itself isn’t new - the inception of the artificial neuron dates back to 1943 - today’s advances are made possible by immense technological progress and the vast amounts of data constantly generated by users worldwide.

Knowing I’m living through a major turn in software development, I was absolutely stoked to dive into the field and learn more. As Andrej Karpathy puts it, we are now moving towards Software 2.0.

I had been eyeing Andrew Ng’s Deep Learning specialization, offered by Andrew Ng’s initiative deeplearning.ai on Andrew Ng’s platform Coursera (I sometimes wonder how he manages his time), for a long while before I finally decided to get into it. Here are the key takeaways from the 5 courses.

Course 1: Neural Networks and Deep Learning

This course provides a basic introduction to Neural Networks.

-Code implementation. This was the most important part of the course for me. I mean, do you even know backpropagation if you can’t code it from scratch? Coding the concepts discussed brings clarity of thought, and we might even encounter edge cases we hadn’t considered before.
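
For instance, here is what “coding it from scratch” boils down to for a one-hidden-layer network - a minimal NumPy sketch of my own (sigmoid activations, binary cross-entropy loss, the course’s features-by-examples layout), not the course’s assignment code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1            # hidden pre-activation, shape (n_h, m)
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2           # output pre-activation, shape (1, m)
    A2 = sigmoid(Z2)
    return A1, A2

def backward(X, Y, W2, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y                          # gradient of the loss w.r.t. Z2
    dW2 = (dZ2 @ A1.T) / m
    db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)    # chain rule through the sigmoid
    dW1 = (dZ1 @ X.T) / m
    db1 = dZ1.mean(axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```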

-The need for deep networks. The Universal Approximation Theorem states that a neural network with just one hidden layer can practically learn anything! So the question of why we need more layers at all is natural. The answer is that a deep neural network needs far fewer (sometimes exponentially fewer) nodes to represent the same function, so it’s simply more practical and efficient. That said, a deep NN should only be preferred when the problem merits it.

-Importance of vectorization. Death to loops! All hail vectors! Ng demonstrates how vectorization can lead to incredible speed-ups, which is critical when working with large models.
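
A quick, unscientific demonstration in the spirit of the course - the same dot product as an explicit Python loop versus a single vectorized NumPy call:

```python
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Explicit Python loop
start = time.time()
total = 0.0
for i in range(len(a)):
    total += a[i] * b[i]
loop_time = time.time() - start

# Vectorized
start = time.time()
total_vec = np.dot(a, b)
vec_time = time.time() - start

print(f"loop: {loop_time:.3f}s, vectorized: {vec_time:.5f}s")
# The vectorized version is typically orders of magnitude faster.
```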

Course 2: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

This course feels more hands-on, as we start to explore the tools for working with NNs.

-DL is systematic trial-and-error. Andrew Ng wastes no time in bluntly stating that applied DL is a highly iterative trial-and-error process. But despair not! For it’s not random, but highly systematic. Here is Ng’s “basic recipe for Machine Learning”: if the model doesn’t fit the training set well (high bias), try a bigger network, train longer or try a different architecture; if it doesn’t generalize to the dev set (high variance), get more data, add regularization or, again, try a different architecture; repeat until both are under control.
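
Restated as toy control flow - my own paraphrase of the course slide, with made-up thresholds, not code from the specialization:

```python
def basic_recipe(train_error, dev_error):
    """Ng's 'basic recipe' as a toy decision function (thresholds are arbitrary)."""
    if train_error > 0.05:                 # high bias: doesn't even fit the training set
        return "bigger network / train longer / different architecture"
    if dev_error - train_error > 0.02:     # high variance: doesn't generalize
        return "more data / regularization / different architecture"
    return "done"

print(basic_recipe(train_error=0.10, dev_error=0.11))  # -> attack bias first
```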

-Optimization algorithms. This week was a real treat. The basic gradient descent algorithm is discussed, and is then followed by successively better refinements of it. I was fascinated by the idea of momentum - a seemingly simple idea that is simultaneously so profound! Discussing the improvements also gave me a much clearer picture of how gradient descent actually works.
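
Here is the momentum update in the form the course presents it - a velocity term that is an exponentially weighted average of past gradients (a minimal sketch with a toy 1-D example):

```python
def momentum_step(w, grad, v, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad   # velocity: running average of gradients
    w = w - lr * v                     # step along the smoothed direction
    return w, v

# Toy usage on f(w) = w^2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
print(w)   # approaches the minimum at 0
```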

-The problem of local minima doesn’t exist. Andrew debunks the myth about getting trapped in a local minimum. Deep networks have enormous numbers of parameters, so the loss surface lives in a very high-dimensional space. At such high dimensionality, it’s statistically very unlikely that every single dimension curves upward at the same point, which is what a true local minimum requires. The more probable scenario is getting stuck at a saddle point or on a plateau, and optimization algorithms like Adam are known to handle these well.
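
For completeness, here is the Adam update the course builds up to - a compact sketch of the standard formulation, not the assignment code:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum-like first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop-like second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```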

Course 3: Structuring Machine Learning Projects

There’s a lot of practical and useful advice in this course.

-Orthogonalize your approach! The idea of orthogonalization is to be able to attack one issue without also causing effects in other areas. Years of research have given DL tools such as regularisation that tackle a single aspect of a model without side effects (reducing variance doesn’t simultaneously increase bias), embracing the philosophy of orthogonalization. Ng strongly urges us to follow this methodology – which means if you have been using early stopping till now, I’m afraid it’s time to let go, since it affects bias and variance at the same time.

-Single number evaluation metric. Having a single number stating the performance of a model makes comparing several models immensely easy. If the objective involves multiple metrics, like precision and recall, it’s imperative to combine them into a single number. This might seem like a natural idea, but it can be easily overlooked in practice.
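
The classic example from the course is combining precision and recall into an F1 score, their harmonic mean:

```python
def f1_score(precision, recall):
    # Harmonic mean: one number to rank models by.
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.95, 0.90))  # ~0.924
print(f1_score(0.98, 0.85))  # ~0.910 -> the first model wins
```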

-Compare against Human Level Performance (HLP). When assessing the single number evaluation metric, we should not compare it to an ideal 0% error. The dataset very likely contains noise, making 0% error infeasible. Rather, the comparison must be with HLP, used as a proxy for the irreducible (Bayes) error. Avoidable bias is the gap between training error and HLP - the bias that, upon elimination, brings the model on par with HLP. Another interesting problem occurs when the model surpasses HLP, a scenario that arises especially with structured data. Improving performance becomes tricky at that point, since human intuition and human-labelled data no longer offer a reliable guide.
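
A tiny worked example of the bookkeeping this enables (the numbers are made up):

```python
human_error = 0.01    # HLP, used as a proxy for Bayes error
train_error = 0.05
dev_error   = 0.06

avoidable_bias = train_error - human_error   # 0.04
variance       = dev_error - train_error     # 0.01

# 0.04 > 0.01, so bias-reduction tactics (bigger model, train longer)
# should take priority over variance-reduction tactics here.
print(avoidable_bias, variance)
```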

Course 4: Convolutional Neural Networks

Enter CNNs. (Fair notice - “convolution” is actually a misnomer here. The operation performed is really cross-correlation, but by the time ML researchers realised it, an unspoken agreement never to mention it came into existence.) This course is a rather heavy one, with plenty of new ideas and concepts.
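
The naming quibble is easy to check: true convolution flips the kernel first, while what a CNN layer computes does not. A quick SciPy illustration:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

image  = np.arange(9.0).reshape(3, 3)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

print(correlate2d(image, kernel, mode="valid"))           # what a CNN layer computes
print(convolve2d(image, kernel, mode="valid"))            # true convolution (kernel flipped)
print(correlate2d(image, np.flip(kernel), mode="valid"))  # matches the line above
```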

-Case studies in image classification. Don’t reinvent the wheel - use established CNNs relevant to the objective as the starting point. Ng walks us through several famous models - AlexNet, VGG, Inception and ResNet-50. The former two introduce certain core design philosophies of CNNs and how to choose hyperparameters. The latter two teach us to think differently and beyond, and to not be limited by those design philosophies. Apart from letting the network itself “decide” which filter sizes matter by computing several in parallel, one of the cool things about Inception is how it uses softmax outputs at multiple depths to mitigate the problem of vanishing gradients. ResNet-50 uses a different method for this - the ingenious skip connections.
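
The skip-connection idea is small enough to sketch directly - the block’s output is added to its input, giving gradients a direct path backwards. This is a toy NumPy version, not the actual ResNet-50 layers:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    out = relu(x @ W1)        # "main path": two toy dense layers
    out = out @ W2
    return relu(out + x)      # "shortcut": add the input back before activating

x  = np.random.randn(1, 4)
W1 = np.random.randn(4, 4) * 0.1
W2 = np.random.randn(4, 4) * 0.1
print(residual_block(x, W1, W2).shape)   # (1, 4): output shape matches the input
```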

-Object detection. I went in expecting object detection to similarly introduce completely new architectures, and was surprised when that wasn’t the case. With only modifications to the output layer, an image classifier can also perform object localisation! To note: an image classifier can tell whether an image contains a car; an object localiser classifies and draws a bounding box around said car; whereas an object detector must detect all instances of cars in the image. Ng discusses the simple sliding window algorithm for object detection. One of its problems is the huge redundancy in computation, and Ng demonstrates how a convolutional implementation avoids repeating computations and speeds things up. This is followed by a discussion of the landmark YOLO algorithm. Yes, YOLO. It stands for “You Only Look Once”. This non-trivial algorithm scans the image just once to output all the bounding boxes, making it suitable for real-time use cases.
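
For the localisation part, the change really is just the shape of the target the network is trained to predict - here is my reconstruction of the label layout used in the course (objectness, box, then one-hot class); the helper function itself is purely illustrative:

```python
import numpy as np

# y = [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_n]
def make_label(present, box=None, class_id=None, n_classes=3):
    y = np.zeros(5 + n_classes)
    y[0] = 1.0 if present else 0.0     # p_c: is there an object at all?
    if present:
        y[1:5] = box                   # (b_x, b_y) centre and (b_h, b_w) size
        y[5 + class_id] = 1.0          # one-hot class, e.g. "car"
    return y

print(make_label(True, box=[0.5, 0.5, 0.3, 0.4], class_id=1))
```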

-The unique problems in face recognition. The problem of face recognition is unique: there’s a highly limited sample size per class (often a single image per person), and the set of classes itself keeps changing, making the entire dataset volatile. The key idea developed here is to learn a similarity (or distance) function between two images, rather than the traditional mapping from image to classes. This single idea does away with all of those problems in one clean stroke.
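
In code, the verification rule becomes a comparison of embeddings rather than a classification - a minimal sketch, where the embeddings are assumed to come from a trained encoder network (here they are just random unit vectors):

```python
import numpy as np

def same_person(embedding_a, embedding_b, threshold=1.0):
    d = np.sum((embedding_a - embedding_b) ** 2)   # squared L2 distance
    return d < threshold                           # small distance -> same identity

# Adding a new person requires no retraining: just store their embedding.
a = np.random.randn(128); a /= np.linalg.norm(a)
b = np.random.randn(128); b /= np.linalg.norm(b)
print(same_person(a, a))   # True
print(same_person(a, b))   # almost certainly False for unrelated random vectors
```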

Course 5: Sequence Models

Meet Recurrent Neural Networks (RNNs). RNNs are unreasonably effective. Unlike CNNs, they have a remarkably varied application domain – music generation, speech recognition, sentiment analysis, DNA sequence analysis, and so on – anywhere sequence data is involved. After teaching the basics, such as the fancily named “backpropagation through time” and the different types of RNN architectures, the course focuses mostly on Natural Language Processing.
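
The basic building block is small enough to write down - a single forward step of a vanilla RNN cell in (roughly) the course’s notation, as a NumPy sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # new hidden state
    y_t = softmax(Wya @ a_t + by)                  # output at this time step
    return a_t, y_t
```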

-Developing good word embeddings. At the heart of NLP is a good language model - one that can accurately predict which word is most likely to fill a blank in a sentence. A simple approach is to count word statistics and compute the probability of a sentence, but such a model knows nothing about the relationships between words. This is where word embeddings come in: they capture those relationships well – for example, that “royal” is closer to “king” than to “rat”. Learning good word embeddings takes a counter-intuitive approach. Ng walks us through several problem statements – Word2Vec, GloVe – which are auxiliary supervised tasks we don’t actually care about solving. The side effect of training on them is that the weights learned by the model provide good word embeddings, even if the model itself doesn’t perform well on the stated task.
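
Once every word has a vector, “closeness” is usually measured with cosine similarity - a toy illustration with made-up 3-dimensional vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up toy embeddings; real ones are learned and have 50-300 dimensions.
king  = np.array([0.90, 0.80, 0.10])
royal = np.array([0.85, 0.75, 0.20])
rat   = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(royal, king))  # high: related concepts
print(cosine_similarity(royal, rat))   # low: unrelated
```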

-Learning long range dependencies. Owing to vanishing gradients, it’s difficult for a model to learn long range dependencies within a sentence, so the need for some form of memory arises naturally. Ng explains in detail the two inventions in this regard - the Gated Recurrent Unit (GRU) and the Long Short-Term Memory (LSTM) cell - both of which approach the problem with memory cells and gates.
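
A sketch of one GRU step in (roughly) the course’s notation - an update gate decides how much of the old memory to keep versus overwrite, which is what lets information survive across long ranges:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, Wu, Wr, Wc, bu, br, bc):
    concat  = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)       # update gate: how much to rewrite
    gamma_r = sigmoid(Wr @ concat + br)       # relevance gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev   # gated memory update
    return c_t
```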

-The best is subjective. Towards the end of the course, machine translation is discussed. Unlike in the earlier language-modelling tasks, the focus is no longer on finding the single best-fitting next word, but the best sentence as a whole. In fact, greedily picking the best-fitting word at each step in the hope of obtaining the right translation can actually yield a poorer one. The course then discusses in detail how the best phrase can be sought, using beam search together with the attention model.
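
A bare-bones beam search sketch: keep the beam_width best partial translations at every step instead of committing greedily to one word. The next_word_probs callable below is a hypothetical stand-in for the decoder network, returning a word-to-probability dictionary for a given prefix.

```python
import math

def beam_search(next_word_probs, beam_width=3, max_len=10, eos="<eos>"):
    beams = [([], 0.0)]                        # (words so far, log-probability)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == eos:     # finished hypotheses carry over
                candidates.append((words, score))
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((words + [word], score + math.log(p)))
        # keep only the top `beam_width` hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]
```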

Overall thoughts

-No sensationalization. Andrew Ng is straightforward, even blunt, about Deep Learning. He immediately debunks any claimed similarity to the human nervous system, saying such analogies are loose at best.

-Consistent notation. Ng develops the notation early in the specialization, and it is maintained and built upon throughout the remaining courses. This keeps the interpretation clear whenever new concepts are introduced.

-Delves deep into the field. This is one of the few online course series that doesn’t start and end at the novice level, but actually delves into the field and teaches many concepts in detail. Many online courses provide only a very basic introduction that leaves the student aware of a few things without really understanding much at the end of it. This specialization is not like that, and I love it for that.

-Assignments all relate to current real-world applications. The assignments are made extremely interesting owing to this fact! One of them is even on music generation - how cool is that?

Concluding remarks

The specialization is wonderfully put together, and in many ways serves as a real break into the field. While not intensely mathematical, Andrew Ng’s explanations help overcome any hurdles along the way. Overall, I wholeheartedly recommend it!