Training Guide
Run experiments, track results, and save outputs with Lighter.
This guide covers the full training workflow from config to results.
Basic Commands
Lighter provides four main commands:
# Train and validate
lighter fit config.yaml
# Validate only (requires checkpoint)
lighter validate config.yaml
# Test only (requires checkpoint)
lighter test config.yaml
# Run inference
lighter predict config.yaml
All commands use the same config structure.
The Fit Command
Train your model with automatic validation:
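lighter fit config.yaml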
What Happens
- Loads config from YAML
- Instantiates trainer, model, and data
- Runs training loop with validation
- Saves checkpoints automatically
- Logs metrics to configured logger
Output Structure
outputs/
└── YYYY-MM-DD/
└── HH-MM-SS/
├── config.yaml # Copy of config used
├── checkpoints/
│ ├── last.ckpt # Latest checkpoint
│ └── epoch=09-step=1000.ckpt
└── logs/ # Tensorboard/CSV logs
Resuming Training
Resume from latest checkpoint:
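For example, pointing --ckpt_path at the run's saved checkpoint (substitute your run's timestamped directory):
lighter fit config.yaml --ckpt_path outputs/YYYY-MM-DD/HH-MM-SS/checkpoints/last.ckpt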
Overriding from CLI
Change any config value without editing files:
Single Override
# Change learning rate
lighter fit config.yaml model::optimizer::lr=0.01
# Train longer
lighter fit config.yaml trainer::max_epochs=100
# Use more GPUs
lighter fit config.yaml trainer::devices=4
Multiple Overrides
lighter fit config.yaml \
model::optimizer::lr=0.01 \
trainer::max_epochs=100 \
data::train_dataloader::batch_size=64 \
trainer::devices=4
Nested Overrides
# Override nested values
lighter fit config.yaml \
model::network::num_classes=100 \
model::optimizer::weight_decay=0.0001
Complex Overrides
# Add callbacks from CLI
lighter fit config.yaml \
'trainer::callbacks=[{_target_: pytorch_lightning.callbacks.EarlyStopping, monitor: val/loss}]'
Merging Configs
Combine multiple YAML files:
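lighter fit base.yaml,experiment.yaml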
Example: Base + Experiment
base.yaml:
trainer:
max_epochs: 100
accelerator: auto
devices: 1
model:
_target_: models.MyModel
network:
_target_: torchvision.models.resnet18
num_classes: 10
data:
_target_: lighter.LighterDataModule
train_dataloader:
batch_size: 32
experiment.yaml:
# Override specific values
trainer:
max_epochs: 200 # Override
devices: 4 # Add
model:
optimizer:
lr: 0.01 # Add optimizer config
Result: Merged config with max_epochs=200, devices=4, new optimizer.
Merge Operators
Control how configs merge:
Replace with =:
# experiment.yaml
trainer:
=callbacks: # Replace entire list
- _target_: pytorch_lightning.callbacks.EarlyStopping
monitor: val/loss
Delete with ~:
# experiment.yaml
trainer:
~callbacks: null # Remove callbacks entirely
data:
~test_dataloader: null # Remove test dataloader
Checkpointing
Automatic Checkpointing
Lightning saves last.ckpt automatically. For more control:
trainer:
callbacks:
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
dirpath: checkpoints
filename: 'epoch{epoch:02d}-loss{val/loss:.4f}'
monitor: val/loss
mode: min
save_top_k: 3 # Keep best 3
save_last: true # Keep last checkpoint
every_n_epochs: 1 # Save every epoch
Save Based on Metric
# Save best validation accuracy
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
monitor: val/acc
mode: max
save_top_k: 1
filename: 'best-acc{val/acc:.4f}'
Multiple Checkpointers
Save different metrics:
trainer:
callbacks:
# Best accuracy
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
monitor: val/acc
mode: max
save_top_k: 1
filename: 'best-acc'
# Best loss
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
monitor: val/loss
mode: min
save_top_k: 1
filename: 'best-loss'
# Regular saves
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
every_n_epochs: 10
filename: 'epoch{epoch:02d}'
Loading Checkpoints
For validation/testing:
lighter validate config.yaml --ckpt_path checkpoints/best.ckpt
lighter test config.yaml --ckpt_path checkpoints/best.ckpt
For inference:
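lighter predict config.yaml --ckpt_path checkpoints/best.ckpt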
To resume training:
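lighter fit config.yaml --ckpt_path checkpoints/last.ckpt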
Logging
TensorBoard (Default)
trainer:
logger:
_target_: pytorch_lightning.loggers.TensorBoardLogger
save_dir: logs
name: my_experiment
View logs:
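tensorboard --logdir logs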
CSV Logger
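A minimal config, mirroring the TensorBoard example (CSVLogger is Lightning's built-in CSV logger):
trainer:
  logger:
    _target_: pytorch_lightning.loggers.CSVLogger
    save_dir: logs
    name: my_experiment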
Results saved to logs/my_experiment/version_0/metrics.csv.
Weights & Biases
trainer:
logger:
_target_: pytorch_lightning.loggers.WandbLogger
project: my_project
name: experiment_1
save_dir: logs
Multiple Loggers
Use all at once:
trainer:
logger:
- _target_: pytorch_lightning.loggers.TensorBoardLogger
save_dir: logs
- _target_: pytorch_lightning.loggers.CSVLogger
save_dir: logs
- _target_: pytorch_lightning.loggers.WandbLogger
project: my_project
No Logging
Disable logging:
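Set the trainer's logger to false:
trainer:
  logger: false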
Saving Predictions
Use Writers to save predictions to files.
CSV Writer
Save predictions to CSV:
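A sketch of the callback config; the class name CSVWriter and its arguments are assumptions here, so check lighter.callbacks for the exact writer name and options:
trainer:
  callbacks:
    - _target_: lighter.callbacks.CSVWriter  # name assumed
      path: predictions.csv                  # argument assumed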
Your predict_step should return a dict:
def predict_step(self, batch, batch_idx):
x, y = batch
pred = self(x)
return {
"prediction": pred.argmax(dim=1),
"probability": pred.max(dim=1).values,
"target": y,
}
Output: predictions.csv with columns for each key.
File Writer
Save predictions to individual files:
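For example, with the FileWriter callback (same usage as in the inference section below):
trainer:
  callbacks:
    - _target_: lighter.callbacks.FileWriter
      write_interval: batch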
Return dict with data and filenames:
def predict_step(self, batch, batch_idx, dataloader_idx=0):
images, paths = batch
predictions = self(images)
# Save each prediction
results = []
for i, (pred, path) in enumerate(zip(predictions, paths)):
results.append({
"prediction": pred.cpu().numpy(),
"$id": f"pred_{batch_idx}_{i}", # Unique filename
})
return results
Saves: predictions/pred_0_0.npz, pred_0_1.npz, etc.
Custom Writer
Create your own:
import pickle

from lighter.callbacks import BaseWriter
class CustomWriter(BaseWriter):
def write(self, data):
"""Save data however you want."""
# data is what you returned from predict_step
output_path = self.output_dir / f"{data['$id']}.pkl"
with open(output_path, 'wb') as f:
pickle.dump(data, f)
Use in config:
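Assuming CustomWriter is importable from your project (the module path here is illustrative):
trainer:
  callbacks:
    - _target_: my_project.writers.CustomWriter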
Debugging
Fast Dev Run
Run 1 batch of train/val/test to catch bugs:
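lighter fit config.yaml trainer::fast_dev_run=true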
Or specify number of batches:
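lighter fit config.yaml trainer::fast_dev_run=7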
Overfit on Small Batch
Test if model can overfit (sanity check):
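lighter fit config.yaml trainer::overfit_batches=10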
Trains on the same 10 batches repeatedly.
Limit Batches
Run partial epoch:
# Train on 10% of data
lighter fit config.yaml \
trainer::limit_train_batches=0.1 \
trainer::limit_val_batches=0.1
Or specific number:
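# Train on 100 batches per epoch
lighter fit config.yaml trainer::limit_train_batches=100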
Profiler
Profile your code:
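lighter fit config.yaml trainer::profiler=simple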
Options:
- simple - Basic profiling
- advanced - Detailed profiling
- pytorch - PyTorch profiler
Results saved to logs directory.
Find Learning Rate
Automatically find optimal LR:
trainer:
_target_: pytorch_lightning.Trainer
callbacks:
- _target_: pytorch_lightning.callbacks.LearningRateFinder
min_lr: 1e-6
max_lr: 1.0
Or run tuner:
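A sketch using Lightning's Tuner API directly, outside the Lighter CLI; model and datamodule below are placeholders for objects you build yourself:
import pytorch_lightning as pl
from pytorch_lightning.tuner import Tuner

trainer = pl.Trainer()
tuner = Tuner(trainer)
# model / datamodule: your LightningModule and LightningDataModule instances
lr_finder = tuner.lr_find(model, datamodule=datamodule)
print(lr_finder.suggestion())  # suggested starting learning rate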
Multi-GPU Training
Single Machine, Multiple GPUs
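lighter fit config.yaml trainer::devices=4 trainer::strategy=ddp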
Or use all available GPUs:
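# -1 selects all available GPUs
lighter fit config.yaml trainer::devices=-1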
Strategy Options
Pick a strategy with a trainer::strategy override (commands below):
- DDP (recommended)
- DDP Spawn
- DeepSpeed
- FSDP (fully sharded)
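These are the standard Lightning strategy strings:
# DDP (recommended)
lighter fit config.yaml trainer::strategy=ddp
# DDP Spawn
lighter fit config.yaml trainer::strategy=ddp_spawn
# DeepSpeed (requires the deepspeed package)
lighter fit config.yaml trainer::strategy=deepspeed
# FSDP (fully sharded)
lighter fit config.yaml trainer::strategy=fsdp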
Batch Size Adjustment
Scale batch size with GPUs:
vars:
num_gpus: 4
per_gpu_batch: 32
data:
train_dataloader:
batch_size: "$%vars::per_gpu_batch * %vars::num_gpus"
Or keep per-GPU batch size:
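With DDP, each process loads its own batches, so a fixed batch_size is already per GPU and the effective batch size scales with trainer::devices on its own:
data:
  train_dataloader:
    batch_size: 32  # per GPU; effective batch = 32 × num_gpus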
Mixed Precision Training
Use 16-bit precision for faster training:
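# 16-mixed is the Lightning 2.x spelling; older releases use precision=16
lighter fit config.yaml trainer::precision=16-mixed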
Or BFloat16:
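lighter fit config.yaml trainer::precision=bf16-mixed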
Automatic mixed precision (AMP) is handled by Lightning.
Gradient Accumulation
Simulate larger batch sizes:
Effective batch size = batch_size × accumulate_grad_batches.
Example:
# Effective batch size = 32 × 4 = 128
data:
train_dataloader:
batch_size: 32
trainer:
accumulate_grad_batches: 4
Early Stopping
Stop training when metric stops improving:
trainer:
callbacks:
- _target_: pytorch_lightning.callbacks.EarlyStopping
monitor: val/loss
patience: 10
mode: min
verbose: true
Parameters:
- monitor: Metric to track
- patience: Epochs to wait before stopping
- mode: min or max
- min_delta: Minimum change to qualify as improvement
Progress Bars
Default Progress Bar
Shows by default. Disable with:
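lighter fit config.yaml trainer::enable_progress_bar=false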
Custom Progress Bar
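For example, Lightning's RichProgressBar:
trainer:
  callbacks:
    - _target_: pytorch_lightning.callbacks.RichProgressBar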
Or:
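trainer:
  callbacks:
    - _target_: pytorch_lightning.callbacks.TQDMProgressBar
      refresh_rate: 10  # update every 10 batches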
Validation
Validate Only
Run validation on a checkpoint:
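lighter validate config.yaml --ckpt_path checkpoints/best.ckpt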
Validation Frequency
Validate every N epochs:
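lighter fit config.yaml trainer::check_val_every_n_epoch=5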
Or every N steps:
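lighter fit config.yaml trainer::val_check_interval=1000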
Or at a fraction of each epoch:
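# Validate four times per epoch
lighter fit config.yaml trainer::val_check_interval=0.25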
Skip Validation
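One way, using a standard Lightning flag, is to run zero validation batches:
lighter fit config.yaml trainer::limit_val_batches=0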
Testing
Run final test after training:
# Fit then test automatically
lighter fit config.yaml
# Test separately
lighter test config.yaml --ckpt_path checkpoints/best.ckpt
Test During Fit
Not recommended, but possible by loading a checkpoint at the end of fit.
Prediction/Inference
Run inference on data:
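lighter predict config.yaml --ckpt_path checkpoints/best.ckpt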
Requires:
- predict_step in your module
- predict_dataloader in your data config
- Optional: Writer callback to save results
Example config:
data:
predict_dataloader:
_target_: torch.utils.data.DataLoader
batch_size: 32
dataset:
_target_: my_project.data.PredictionDataset
root: ./inference_data
trainer:
callbacks:
- _target_: lighter.callbacks.FileWriter
write_interval: batch
Example predict_step:
def predict_step(self, batch, batch_idx):
images = batch
predictions = self(images)
return {
"predictions": predictions.cpu(),
"batch_idx": batch_idx,
}
Experiment Organization
Recommended Structure
my_project/
├── __lighter__.py
├── models.py
├── data.py
├── configs/
│ ├── base.yaml # Baseline config
│ ├── resnet50.yaml # Architecture variants
│ ├── augmented.yaml # Augmentation experiments
│ └── ablation/
│ ├── no_dropout.yaml
│ └── no_batchnorm.yaml
└── outputs/ # Generated by Lighter
└── YYYY-MM-DD/
└── HH-MM-SS/
Config Naming
Use descriptive names:
configs/
├── baseline-resnet18.yaml
├── baseline-resnet50.yaml
├── lr0.01-batch128.yaml
├── augment-strong.yaml
└── finetune-imagenet.yaml
Version Control
Track configs in git:
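git add configs/
git commit -m "Add resnet50 baseline config"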
Compare experiments:
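diff configs/baseline-resnet18.yaml configs/baseline-resnet50.yaml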
Common Workflows
Workflow 1: Hyperparameter Search
Create configs for different hyperparameters:
# Try different learning rates
lighter fit base.yaml model::optimizer::lr=0.001
lighter fit base.yaml model::optimizer::lr=0.01
lighter fit base.yaml model::optimizer::lr=0.1
# Try different architectures
lighter fit base.yaml model::network::_target_=torchvision.models.resnet18
lighter fit base.yaml model::network::_target_=torchvision.models.resnet50
lighter fit base.yaml model::network::_target_=torchvision.models.efficientnet_b0
Workflow 2: Resume Failed Training
Training crashed? Resume:
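# Substitute the crashed run's timestamped output directory
lighter fit config.yaml --ckpt_path outputs/YYYY-MM-DD/HH-MM-SS/checkpoints/last.ckpt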
Workflow 3: Incremental Training
Train, then finetune:
# Initial training
lighter fit pretrain.yaml
# Finetune with lower LR
lighter fit finetune.yaml \
--ckpt_path outputs/.../checkpoints/last.ckpt \
model::optimizer::lr=0.0001
Workflow 4: Cross-Validation
Run multiple folds:
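A sketch assuming the fold index is exposed at data::train_dataloader::dataset::fold, as in the config below (cv.yaml is an illustrative name):
for fold in 0 1 2 3 4; do
    lighter fit cv.yaml data::train_dataloader::dataset::fold=$fold
done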
Config:
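A sketch matching the CVDataset below (module path and root are illustrative):
data:
  train_dataloader:
    dataset:
      _target_: my_project.data.CVDataset
      root: ./data
      fold: 0
      num_folds: 5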
# data.py
class CVDataset(Dataset):
def __init__(self, root, fold, num_folds=5):
# Split data by fold
...
Output Management
Change Output Directory
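One knob that exists is Lightning's default_root_dir, which controls where the trainer writes logs and checkpoints; whether Lighter's timestamped outputs/ tree follows it may depend on your version:
trainer:
  default_root_dir: ./my_outputs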
Or CLI:
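lighter fit config.yaml trainer::default_root_dir=./my_outputs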
Disable Checkpoints
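lighter fit config.yaml trainer::enable_checkpointing=false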
Save Frequency
Save less often:
trainer:
callbacks:
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
every_n_epochs: 10 # Save every 10 epochs
Or based on steps:
trainer:
callbacks:
- _target_: pytorch_lightning.callbacks.ModelCheckpoint
every_n_train_steps: 1000
Troubleshooting
Out of Memory
Solutions (corresponding overrides below):
- Reduce batch size
- Use gradient accumulation
- Use mixed precision
- Reduce model size
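The corresponding overrides (values are illustrative; adjust keys to your config):
# Reduce batch size
lighter fit config.yaml data::train_dataloader::batch_size=16
# Trade compute for memory with gradient accumulation
lighter fit config.yaml trainer::accumulate_grad_batches=4
# Mixed precision (16-mixed in Lightning 2.x)
lighter fit config.yaml trainer::precision=16-mixed
# Swap in a smaller backbone
lighter fit config.yaml model::network::_target_=torchvision.models.resnet18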
Training Too Slow
Solutions (corresponding overrides below):
- Use more dataloader workers
- Pin memory
- Use multiple GPUs
- Mixed precision
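Corresponding overrides (values are illustrative):
# More dataloader workers
lighter fit config.yaml data::train_dataloader::num_workers=8
# Pin host memory for faster GPU transfer
lighter fit config.yaml data::train_dataloader::pin_memory=true
# Multiple GPUs with DDP
lighter fit config.yaml trainer::devices=4 trainer::strategy=ddp
# Mixed precision
lighter fit config.yaml trainer::precision=16-mixed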
Model Not Learning
Debug steps (commands below):
- Overfit on a small batch
- Check the learning rate
- Visualize your data
- Profile
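Visualizing data is a manual step (e.g., inspect one batch from your dataloader in a notebook); the other steps map to overrides:
# Sanity check: can the model overfit 10 batches?
lighter fit config.yaml trainer::overfit_batches=10
# Try a different learning rate
lighter fit config.yaml model::optimizer::lr=0.001
# Profile to find bottlenecks
lighter fit config.yaml trainer::profiler=simple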
Next Steps
- Best Practices - Production patterns
- Example Projects - Complete working examples
- CLI Reference - Full command documentation
Quick Reference
# Basic commands
lighter fit config.yaml
lighter validate config.yaml
lighter test config.yaml
lighter predict config.yaml
# Override from CLI
lighter fit config.yaml key::path=value
# Merge configs
lighter fit base.yaml,experiment.yaml
# Resume training
lighter fit config.yaml --ckpt_path path/to/last.ckpt
# Multi-GPU
lighter fit config.yaml trainer::devices=4 trainer::strategy=ddp
# Debug
lighter fit config.yaml trainer::fast_dev_run=true
lighter fit config.yaml trainer::overfit_batches=10