In this tutorial, we start tackling two intertwined topics in deep learning: the training loop and autograd. These open the door to a cluster of related concepts: automatic differentiation, gradient descent, gradient tapes, backpropagation, and more. All of these ideas are connected, yet there are plenty of differences between them. No pun intended.
Learning goals | key takeaways from this page: what backpropagation, the parameter update step, and gradient descent are; how the different flavors of gradient descent (batch, mini-batch, and stochastic) differ; and how all of these concepts relate to one another.
The gradient is a vector pointing in the direction of steepest ascent; gradient descent therefore steps in the opposite direction, the direction of steepest descent. Gradient and derivative are not the same thing: the derivative of a single-variable function is a scalar (a real number), while the gradient of a multi-variable function is a vector of partial derivatives, one per input.
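To make the scalar-versus-vector distinction concrete, here is a minimal sketch that approximates a gradient numerically with central finite differences. The function `f(x, y) = x**2 + 3*y` and the step size `h` are illustrative choices, not from this tutorial.

```python
# Approximate the gradient (vector of partial derivatives) of a
# two-variable function using central finite differences.

def f(x, y):
    return x**2 + 3 * y

def numerical_gradient(func, x, y, h=1e-6):
    """Return (df/dx, df/dy) at the point (x, y)."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)  # partial w.r.t. x
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)  # partial w.r.t. y
    return (df_dx, df_dy)

grad = numerical_gradient(f, 2.0, 1.0)
print(grad)  # analytically the gradient is (2x, 3) = (4.0, 3.0)
```

Each component of the result is itself an ordinary derivative (a scalar); stacking them into a tuple is what makes the gradient a vector.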
Training is an iterative process that may need to be repeated many times. Check the ML workflow chart to see where training fits in the overall workflow.
Gradient descent and its variants differ mainly in how much data is used per parameter update: batch gradient descent computes the gradient over the entire training set before each update, stochastic gradient descent (SGD) updates after every single example, and mini-batch gradient descent, the common middle ground in practice, updates after a small batch of examples.
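The update rule shared by all of these variants can be sketched on a toy one-parameter loss. The loss `f(w) = (w - 3)**2`, the learning rate, and the iteration count below are illustrative assumptions.

```python
# Minimal sketch of the gradient descent update rule on a toy loss
# f(w) = (w - 3)**2, whose analytic gradient is 2 * (w - 3).

def grad_f(w):
    return 2 * (w - 3)           # gradient of (w - 3)**2

w = 0.0                          # initial parameter value
lr = 0.1                         # learning rate (step size)
for step in range(100):
    w -= lr * grad_f(w)          # step against the gradient

print(w)  # converges toward the minimum at w = 3
```

Batch, mini-batch, and stochastic variants all reuse this exact update; they differ only in which examples contribute to the gradient estimate at each step.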
The training dataset is used to fit the model. The validation (cross-validation) dataset is used to evaluate the model during development, tune hyperparameters, and select between candidate models. The third dataset, aka the test dataset, is held out for the final, unbiased evaluation: it measures the model's performance on unseen data (mimicking real-world data) and therefore its ability to generalize.
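A three-way split can be sketched in a few lines of plain Python. The 70/15/15 ratios and the stand-in dataset below are illustrative assumptions, not prescribed ratios.

```python
# Minimal sketch of a train / validation / test split.
import random

data = list(range(100))          # stand-in for a dataset of 100 examples
random.seed(0)
random.shuffle(data)             # shuffle before splitting

n_train = int(0.70 * len(data))
n_val = int(0.15 * len(data))

train_set = data[:n_train]                    # used to fit the model
val_set = data[n_train:n_train + n_val]       # used for tuning / model selection
test_set = data[n_train + n_val:]             # held out for final evaluation

print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

The key property is that the three sets are disjoint, so the test set truly represents data the model has never seen.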
Training in PyTorch requires us to write a more detailed, custom training loop. In scikit-learn and in TensorFlow, training can be as simple as calling the high-level API .fit(): although each model has a different architecture under the hood, the training procedure has been abstracted behind .fit(). Here's a fully annotated note on the PyTorch training loop [important, high quality].
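The canonical steps of such a loop can be sketched as follows. The linear model, MSE loss, SGD optimizer, synthetic data, and hyperparameters here are illustrative assumptions, not the contents of the annotated note.

```python
# Minimal sketch of a PyTorch training loop on synthetic linear data.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 3)                        # synthetic inputs
y = X @ torch.tensor([[1.0], [-2.0], [0.5]])  # synthetic targets

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):
    optimizer.zero_grad()        # 1. clear gradients from the previous step
    pred = model(X)              # 2. forward pass
    loss = loss_fn(pred, y)      # 3. compute the loss
    loss.backward()              # 4. backpropagation: compute gradients
    optimizer.step()             # 5. update parameters (gradient descent)

print(loss.item())  # loss should be close to zero after training
```

These five steps (zero gradients, forward, loss, backward, update) are exactly what .fit() hides from you in the higher-level libraries.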