Exponential Learning Rate Scheduler
Motivation
When training deep neural networks, the learning rate is arguably one of the most important hyperparameters. In many scenarios, altering the learning rate during training not only helps stabilize the training but also helps find better local minima. There are various ways to schedule the learning rate during training, and deep learning frameworks such as PyTorch or TensorFlow provide basic infrastructure that supports this functionality. One of the approaches is based on exponential decay.
This short post aims to provide a simple guide on how to derive the necessary parameters for a learning rate scheduler based on exponential decay, given that we know our base learning rate, the target learning rate we want to reach, the total number of epochs, and the number of warm-up epochs (during which the learning rate remains untouched).
Note: To be honest, I had to re-derive this formula multiple times. Thus, I decided to save the process as part of a blog post for future reference, at least for me, if not for anyone else.
In this post, we will use the TensorFlow deep learning framework. Nevertheless, the reasoning and methodology are very general and can be applied to any scenario that involves finding the parameters of an exponential decay function.
More specifically, we will implement the scheduling function for the LearningRateScheduler callback class. Its instance can be constructed as
tf.keras.callbacks.LearningRateScheduler(
    schedule, verbose=0
)
The arguments are
argument | description |
---|---|
schedule | a function that takes an epoch index (integer (int), indexed from $0$) and the current learning rate (float) as inputs and returns a new learning rate as output (float). |
verbose | int. $0$ - quiet, $1$ - update messages. |
At the beginning of every epoch, this callback gets the updated learning rate value from the schedule function. Please refer to the dedicated documentation section for further details.
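To make the expected signature concrete, here is a minimal sketch of a schedule function; the halving rule is just an illustrative placeholder (and example_schedule a hypothetical name), not the exponential decay derived below.

def example_schedule(epoch, lr):
    # Hypothetical placeholder: keep the current learning rate for the first
    # 10 epochs, then halve it at every subsequent epoch.
    if epoch < 10:
        return lr
    return lr * 0.5

lr_callback = tf.keras.callbacks.LearningRateScheduler(example_schedule, verbose=1)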
Formula Derivation
Let $B$ be the base learning rate, $T$ the target learning rate, $N$ the total number of epochs, and $W$ the number of warm-up epochs. The aim is to find a parameter $\lambda$, i.e., the decay rate, such that our learning rate scheduler returns $B$ for all the warm-up epochs as well as for the first “real” epoch, and after the remaining $N - \left( W + 1 \right)$ epochs it reaches the value of $T$.
Generally speaking, we want to find a function that takes two parameters, the current epoch index $i$ (indexed from $0$) and the current learning rate $r$, and returns a new learning rate $\tilde{r}$. So,
\[f \left( i, r \right) = \tilde{r}.\]

Considering the aforementioned requirements, the learning rate is multiplied by $e^{\lambda}$ in each of the $N - \left( W + 1 \right)$ decay epochs, so we require $B \cdot e^{\lambda \left( N - \left( W + 1 \right) \right)} = T$. Solving for the decay rate $\lambda$ yields

\[\lambda = \frac{\log \left( \frac{T}{B} \right)}{N - \left( W + 1 \right)}.\]
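As a quick sanity check, plugging in the values of the second configuration from the verification table below ($B = 10^{-2}$, $T = 10^{-4}$, $N = 5$, $W = 1$) gives

\[\lambda = \frac{\log \left( \frac{10^{-4}}{10^{-2}} \right)}{5 - \left( 1 + 1 \right)} = \frac{\log \left( 10^{-2} \right)}{3} \approx -1.535, \qquad e^{\lambda} \approx 0.2154.\]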
Thus, the sought function $f \left( \cdot \right)$ can be defined as

\[f \left( i, r \right) = \begin{cases} B \qquad & \text{if } i \leq W,\\ r \cdot e^{\lambda} \qquad & \text{otherwise}. \end{cases}\]

Implementation
The only required import is:
import tensorflow as tf
The derived formula can be transformed into a Python implementation as follows
def make_lr_scheduler(base_lr, target_lr, n_epochs, n_warmup_epochs):
    # Number of epochs during which the learning rate actually decays.
    n_update_epochs = n_epochs - n_warmup_epochs - 1
    # Decay rate: lambda = log(T / B) / (N - (W + 1)).
    decay_rate = tf.math.log(target_lr / base_lr) / n_update_epochs

    def _scheduler(epoch, lr):
        # Keep the base learning rate during the warm-up epochs
        # and the first "real" epoch.
        if epoch <= n_warmup_epochs:
            return base_lr
        else:
            # Afterwards, multiply the current learning rate by e^lambda.
            return lr * tf.math.exp(decay_rate)

    return _scheduler
The above function creates another function, which is then passed to the model.fit(...) method as a callback. Concretely, let model be a TensorFlow model instance. When calling its fit() method, one of the parameters is callbacks, a list of callbacks to be called during the training.
Assume we have the following variables in our configuration
BASE_LR = ...
TARGET_LR = ...
N_EPOCHS = ...
N_WARMUP_EPOCHS = ...
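For a concrete (purely illustrative) choice of these values, one could mirror the second configuration from the verification table below:

BASE_LR = 1e-2
TARGET_LR = 1e-4
N_EPOCHS = 5
N_WARMUP_EPOCHS = 1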
Then, the schedule function can be created, wrapped in the LearningRateScheduler callback, and utilized during the training as

lr_schedule = make_lr_scheduler(
    base_lr=BASE_LR,
    target_lr=TARGET_LR,
    n_epochs=N_EPOCHS,
    n_warmup_epochs=N_WARMUP_EPOCHS
)
lr_scheduler_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)

model.fit(train_dataset, epochs=N_EPOCHS, callbacks=[lr_scheduler_callback])
Implementation Verification
Here is a table that shows how the learning rate progresses with respect to different parameters.
$B$ | $T$ | $N$ | $W$ | $i = 0$ | $i = 1$ | $i = 2$ | $i = 3$ | $i = 4$ |
---|---|---|---|---|---|---|---|---|
$10^{-1}$ | $10^{-6}$ | $5$ | $0$ | $0.100000$ | $0.005623$ | $0.000316$ | $0.000018$ | $0.000001$ |
$10^{-2}$ | $10^{-4}$ | $5$ | $1$ | $0.010000$ | $0.010000$ | $0.002154$ | $0.000464$ | $0.000100$ |
$10^{-2}$ | $10^{-5}$ | $5$ | $2$ | $0.010000$ | $0.010000$ | $0.010000$ | $0.000316$ | $0.000010$ |
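The table can be reproduced by calling the schedule function in a loop, mimicking how Keras invokes it once at the beginning of every epoch. The snippet below is only a quick sanity check for the second configuration ($B = 10^{-2}$, $T = 10^{-4}$, $N = 5$, $W = 1$); it does not involve any model training.

# Reproduce the second row of the table above.
scheduler = make_lr_scheduler(base_lr=1e-2, target_lr=1e-4, n_epochs=5, n_warmup_epochs=1)
lr = 1e-2
for epoch in range(5):
    # Keras passes the current learning rate and applies the returned value.
    lr = scheduler(epoch, lr)
    print(f"i = {epoch}: {float(lr):.6f}")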
Conclusion
In this post, we covered how to derive the decay rate parameter of an exponential decay function. As a real-world use case, we showed a direct application to learning rate scheduling within the TensorFlow deep learning framework.