Welcome
Hello! I thought it’d be interesting to benchmark an optimizer on:
- Constant LR vs. OneCycle LR Policy
- Between training sessions: making a new optimizer vs. preserving state by re-using the same optimizer
I’m curious to see whether maintaining an optimizer’s state between training sessions impacts a model’s performance. In PyTorch, optimizers hold both a `state` and `param_groups`.

`state` refers to the set of variables that are changed periodically by stepping with the optimizer, such as momentum’s accumulated gradients or per-parameter learning-rate modifiers. `param_groups` refers to the set of hyperparameters that are set upon optimizer initialization or changed through iterative use of an `lr_scheduler`, such as the `lr`, `betas`, `eps`, or `weight_decay`.
All of the figures and code snippets below can be found in the GitHub Repo. (Figures use TensorBoard; all code lives in Jupyter Notebooks.)
Benchmark Layout: Model, Datasets, and Task
Model: ResNet50 (not pretrained, architecture from the torchvision model zoo)
Task: Image Classification
Datasets:
- Imagenette
- Imagewoof
These are subsets of Imagenet. See this repo for more information. In a nutshell, these datasets are great for running quick experiments: Imagenette contains 10 easy classes from Imagenet, while Imagewoof contains 10 harder classes from Imagenet.
Imagenette classes: tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute
Imagewoof classes: 10 different dog breeds, hence the name Imagewoof
Very Important Note:
‘Imagenette’ is pronounced just like ‘Imagenet’, except with a corny inauthentic French accent. If you’ve seen Peter Sellers in The Pink Panther, then think something like that. It’s important to ham up the accent as much as possible, otherwise people might not be sure whether you’re referring to “Imagenette” or “Imagenet”.
A Quick Optimizer Review
What’s in an optimizer?
import torch
import torchvision
model = torchvision.models.resnet50(pretrained=False)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
opt
Output:
Adam (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.999)
eps: 1e-08
lr: 0.001
weight_decay: 0
)
PyTorch optimizers contain a `state_dict`:
opt.state_dict().keys()
Output:
dict_keys(['state', 'param_groups'])
An optimizer’s `state_dict` contains two keys: `state` and `param_groups`.
Currently, our `state` is empty since we haven’t trained/stepped with our optimizer:
opt.state_dict()['state']
Output:
{}
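As a quick illustration (a minimal sketch, assuming a random dummy batch purely to populate the optimizer), `state` only fills up once we actually step. For Adam, each stepped parameter then gets its own entry holding `step`, `exp_avg`, and `exp_avg_sq`:
# Dummy forward/backward/step just to populate the optimizer state (illustration only)
x = torch.randn(2, 3, 224, 224)          # hypothetical random batch
loss = model(x).sum()                    # any scalar loss will do here
loss.backward()
opt.step()

len(opt.state)                           # one entry per parameter that received a gradient
next(iter(opt.state.values())).keys()    # for Adam: dict_keys(['step', 'exp_avg', 'exp_avg_sq'])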
Meanwhile, our `param_groups` hold the following hyperparameters:
opt.state_dict()['param_groups']
Output:
[{'lr': 0.001,
'betas': (0.9, 0.999),
'eps': 1e-08,
'weight_decay': 0,
'amsgrad': False,
'params': [140328111909616,
140328113037216,
140328116557456,
...]}
Note that since we’re using the `state_dict`, we can only view this information, not edit/modify it. `state_dicts` are effectively immutable, unless you manually craft your own and call `opt.load_state_dict(some_dict)`.
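For example (a small sketch), editing the exported dict has no effect on the live optimizer until it is loaded back:
sd = opt.state_dict()
sd['param_groups'][0]['lr'] = 5e-2   # only edits the exported copy
opt.param_groups[0]['lr']            # still 0.001: the live optimizer is unchanged
# Calling opt.load_state_dict(sd) would be the way to push the edited dict back in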
A more common way to change a hyperparameter in a param_group is to go through `opt.param_groups` directly rather than playing with `state_dicts`:
opt.param_groups[0].keys()
Output:
dict_keys(['params', 'lr', 'betas', 'eps', 'weight_decay', 'amsgrad'])
opt.param_groups[0]['lr']
Output:
0.001
opt.param_groups[0]['lr'] = 5e-2
opt.param_groups[0]['lr']
Output:
0.05
opt.state_dict()['param_groups']
Output:
[{'lr': 0.05,
'betas': (0.9, 0.999),
'eps': 1e-08,
'weight_decay': 0,
'amsgrad': False,
'params': [140328111909616,
140328113037216,
140328116557456,
...]}
This means if you want to change one of the hyperparameters of your optimizer, you have two options (a quick check of what happens to `state` in each case follows below):
- Change the hyperparameter using `param_groups`, which will preserve `state`:
opt.param_groups[0]['lr'] = 5e-2
opt.param_groups[0]['lr']
- Make a fresh new `opt`:
opt = torch.optim.Adam(model.parameters(), lr=5e-2)
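Here is that check (a minimal sketch, assuming the optimizer has already been stepped at least once so its `state` is populated):
len(opt.state)                                   # > 0 once the optimizer has stepped

# Option 1: edit param_groups in place -> state is untouched
opt.param_groups[0]['lr'] = 5e-2
len(opt.state)                                   # unchanged, still > 0

# Option 2: build a fresh optimizer -> state starts from scratch
opt = torch.optim.Adam(model.parameters(), lr=5e-2)
len(opt.state)                                   # 0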
A Quick LR Scheduler Review
Rather than manually changing the `lr` during training, we can use PyTorch’s `lr_schedulers`. An example of creating a OneCycleLR schedule is below:
model = torchvision.models.resnet50(pretrained=False)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-3, epochs=5, steps_per_epoch=100)
opt
Output:
Adam (
Parameter Group 0
amsgrad: False
base_momentum: 0.85
betas: (0.95, 0.999)
eps: 1e-08
initial_lr: 4e-05
lr: 3.9999999999999996e-05
max_lr: 0.001
max_momentum: 0.95
min_lr: 4e-09
weight_decay: 0
)
Note the difference between printing the base `opt` in the previous section and the `lr_scheduler`-wrapped `opt` here: the wrapped `opt` contains more hyperparameter settings. These additional settings are used to adjust the base `opt`’s original hyperparameters as we step with the `lr_scheduler`. The OneCycle LR scheduler needs to know the total number of steps beforehand in order to adjust the `lr` appropriately between `min_lr` and `max_lr`.
To use the OneCycle `lr_scheduler`, we need to step with our `lr_scheduler` every time we step with our `opt` in our training loop. An example of how the `lr` changes with respect to the number of steps can be seen below.
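A minimal sketch of that loop (assuming a hypothetical `train_loader` DataLoader and `loss_fn` criterion), recording the `lr` at every step so its warm-up/anneal shape can be plotted:
lrs = []
for epoch in range(5):                       # must match the scheduler's epochs=5
    for xb, yb in train_loader:              # stand-in DataLoader, 100 batches/epoch assumed here
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)        # loss_fn is a stand-in criterion
        loss.backward()
        opt.step()
        lr_scheduler.step()                  # step the scheduler right after the optimizer
        lrs.append(opt.param_groups[0]['lr'])  # track the lr as it rises to max_lr then anneals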
To start another training session with the same optimizer, we can:
# Manually change LR
opt.param_groups[0]['initial_lr'] = 5e-2
opt
Output:
Adam (
Parameter Group 0
amsgrad: False
base_momentum: 0.85
betas: (0.95, 0.999)
eps: 1e-08
initial_lr: 0.05
lr: 3.9999999999999996e-05
max_lr: 0.001
max_momentum: 0.95
min_lr: 4e-09
weight_decay: 0
)
# Re-wrap the opt
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-1, epochs=5, steps_per_epoch=100)
opt
Output:
Adam (
Parameter Group 0
amsgrad: False
base_momentum: 0.85
betas: (0.95, 0.999)
eps: 1e-08
initial_lr: 0.05
lr: 0.05
max_lr: 0.1
max_momentum: 0.95
min_lr: 4e-07
weight_decay: 0
)
This means if you want to make an `lr_scheduler` with specific hyperparameter settings, you have two options:
- Wrap an existing `opt`, but first modify its `param_groups`, which will preserve `state`:
opt.param_groups[0]['initial_lr'] = 5e-2
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-1, epochs=5, steps_per_epoch=100)
- Wrap a fresh new `opt`:
model = torchvision.models.resnet50(pretrained=False)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-3, epochs=5, steps_per_epoch=100)
The Experiment
Question: How impactful is the `state` in an optimizer? Is it OK to throw it away between training sessions?
Experiment: Train a model for multiple training sessions, and benchmark the loss and classification accuracy with respect to the method of creating/modifying optimizers in between each training session.
An example training pipeline:
- Make a model
- Make an opt/lr_scheduler for the first time.
- Train for 10 epochs
- opt/lr_scheduler method
- Train for 10 epochs
- opt/lr_scheduler method
- Train for 5 epochs
Where each opt/lr_scheduler method refers to one of the following techniques for creating/modifying optimizers:
- Make a new optimizer from initialization, `torch.optim.Adam()`
- Modify an existing optimizer to preserve `state` by modifying `param_groups`
- Wrap an `lr_scheduler` around a new optimizer, `torch.optim.Adam()`
- Wrap an `lr_scheduler` around an existing optimizer to preserve `state` by modifying `param_groups`
For this experiment, I used the `Adam` optimizer and the `OneCycleLR` LR scheduler.
Total models trained: 2 datasets × 4 methods = 8 models.
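As a rough sketch, one full pipeline for the “preserve state + OneCycle” variant might look like the following (`train()` and `train_loader` are hypothetical stand-ins for the actual loop and DataLoader in the notebooks):
model = torchvision.models.resnet50(pretrained=False)

# Session 1: fresh opt wrapped with a OneCycle schedule
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-3, epochs=10, steps_per_epoch=len(train_loader))
train(model, opt, lr_scheduler, epochs=10)

# Session 2: keep the same opt (state preserved), modify param_groups, re-wrap a new schedule
opt.param_groups[0]['initial_lr'] = 1e-4
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-3, epochs=10, steps_per_epoch=len(train_loader))
train(model, opt, lr_scheduler, epochs=10)

# Session 3: same idea for the final 5 epochs
opt.param_groups[0]['initial_lr'] = 1e-4
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-3, epochs=5, steps_per_epoch=len(train_loader))
train(model, opt, lr_scheduler, epochs=5)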
Results - Imagenette
Below is a set of figures corresponding to Training Loss, Training Accuracy, Validation Loss, Validation Accuracy for the Imagenette Dataset.
[Figures: Train | Validation loss and accuracy curves, with legend; see the GitHub repo]
Results - Imagewoof
Below is a set of figures corresponding to Training Loss, Training Accuracy, Validation Loss, Validation Accuracy for the Imagewoof Dataset.
[Figures: Train | Validation loss and accuracy curves, with legend; see the GitHub repo]
Short Conclusion
- OneCycle LR > Constant LR
- Making a new optimizer vs. preserving `state` by re-using the same optimizer: both achieve very similar performance. In other words, discarding an optimizer’s `state` didn’t really hurt the model’s performance, with or without an LR scheduler. Maybe the state is re-learned quickly.
Note: These conclusions are based on the Adam optimizer and the OneCycle LR scheduler. I haven’t experimented with other optimizers to see whether dropping their `state` is more impactful.
Edit Note: I’m not proposing to always throw optimizers away; I still believe the general guideline is to use the same optimizer and keep its history. Kindly share resources if anyone has found results showing the importance of using the same optimizer :)
Thanks for reading :)