Troubleshooting

Common errors and solutions when using Lighter.

Configuration Errors

ModuleNotFoundError: No module named 'project'

Cause: Missing __init__.py files or incorrect project path

Solution:

# In your config.yaml
project: ./my_project  # Ensure path is correct

# Ensure all module directories have __init__.py:
my_project/
├── __init__.py       # Required!
├── models/
│   ├── __init__.py   # Required!
│   └── my_model.py
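
With the path set and every directory made a package, code inside it can be referenced from the config under the project namespace. A minimal sketch, assuming Lighter's usual _target_ instantiation key; my_model.MyModel is a placeholder for your own class:

system:
  model:
    _target_: project.models.my_model.MyModel  # 'project.' resolves to ./my_project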

Config Reference Errors

Wrong: "$@system#model#parameters()" - Using # for attributes Correct: "$@system#model.parameters()" - Use . for Python attributes

Wrong: Circular references

model:
  lr: "@system#optimizer#lr"  # Circular!
optimizer:
  lr: "@system#model.lr"      # Circular!

Correct: Use a vars section

vars:
  lr: 0.001
model:
  lr: "%vars#lr"
optimizer:
  lr: "%vars#lr"

YAML Syntax Errors

Common mistakes:

  - Missing colons after keys
  - Inconsistent indentation (use spaces, not tabs)
  - Missing quotes around values with special characters (see the example below)
  - Missing values (like the roi_size example in inferers)
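
Quoting trips people up most often with Lighter's reference and expression strings: a value starting with @ or % is not valid unquoted YAML, and an expression containing ': ' is parsed as a new key. A minimal sketch; the scheduler key is only illustrative:

model:
  lr: "%vars#lr"                             # quote: '%' cannot start an unquoted value
scheduler:
  lr_lambda: "$lambda epoch: 0.95 ** epoch"  # quote: contains ': ' and spaces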

Training Issues

CUDA Out of Memory

Solutions:

# Reduce batch size
lighter fit config.yaml --system#dataloaders#train#batch_size=8

# Enable gradient accumulation
lighter fit config.yaml --trainer#accumulate_grad_batches=4

# Use mixed precision
lighter fit config.yaml --trainer#precision="16-mixed"
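
To make these fixes permanent rather than per-run overrides, set the same keys in the config; the paths mirror the CLI flags above:

trainer:
  accumulate_grad_batches: 4
  precision: "16-mixed"
system:
  dataloaders:
    train:
      batch_size: 8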

For distributed strategies, see PyTorch Lightning docs.

Loss is NaN

Check:

  1. Learning rate too high → reduce by 10x
  2. Missing data normalization → add transforms
  3. Wrong loss function for the task → verify the criterion
  4. Gradient explosion → add gradient clipping in the Trainer config (see the sketch below)
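
Gradient clipping is a built-in PyTorch Lightning Trainer argument; a minimal sketch of enabling it in the config (the value 1.0 is just an example):

trainer:
  gradient_clip_val: 1.0  # clip the gradient norm before each optimizer step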

Slow Training

Optimize:

system:
  dataloaders:
    train:
      num_workers: 8      # Increase for faster data loading
      pin_memory: true    # For GPU training
      persistent_workers: true  # Reduce worker startup overhead

For profiling and optimization, see PyTorch Lightning performance docs.
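
Before tuning further, it helps to see where time actually goes. Lightning's built-in simple profiler can be switched on from the trainer section; a sketch:

trainer:
  profiler: "simple"  # prints time spent per training hook at the end of the run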

Debugging Strategies

Quick Testing

# Test with 2 batches only
lighter fit config.yaml --trainer#fast_dev_run=2
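
For a quick run that still exercises more than a couple of batches, Lightning can also cap how much data each epoch sees; a sketch using the limit_*_batches arguments (the values are examples):

trainer:
  limit_train_batches: 0.1  # use only 10% of training batches per epoch
  limit_val_batches: 0.1    # use only 10% of validation batches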

Debug Config Values

# Print values during config resolution
optimizer:
  lr: "$print('LR:', 0.001) or 0.001"

Check Adapter Outputs

Temporarily add print transforms in adapters:

adapters:
  train:
    criterion:
      pred_transforms:
        - "$lambda x: print('Pred shape:', x.shape) or x"

Getting Help

  1. Search this documentation
  2. Check the FAQ
  3. Review the PyTorch Lightning docs for Trainer issues
  4. Join the Discord
  5. Open a GitHub issue