Saving and loading weights and optimizer state
During training, you'll need to save checkpoints for two main purposes: sampling (to test your model) and resuming training (to continue from where you left off). The TrainingClient provides three methods to handle these cases:
- save_weights_for_sampler(): saves a copy of the model weights that can be used for sampling.
- save_state(): saves the weights and the optimizer state, so you can fully resume training from this checkpoint.
- load_state(): loads the weights and the optimizer state from a checkpoint saved with save_state(), restoring the run exactly where it left off.
Note that save_weights_for_sampler() is faster and requires less storage space than save_state().
Both save_* functions require a name parameter: a string that you can set to identify the checkpoint within the current training run. For example, you can name your checkpoints "0000", "0001", "step_1000", etc.
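In a training loop, a common pattern is to derive the name from the step counter so checkpoints sort in order. The following is a minimal sketch; the save interval, loop bounds, and omitted training step are illustrative, not part of the API.
```python
import tinker

service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-1B", rank=32
)

# Zero-padded, step-based names ("step_0000", "step_0100", ...) sort
# correctly and record when each checkpoint was taken.
save_every = 100
for step in range(300):
    # ... run your training step here (omitted) ...
    if step % save_every == 0:
        training_client.save_weights_for_sampler(name=f"step_{step:04d}")
```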
The return value contains a path field, a fully-qualified path that will look something like tinker://<model_id>/<name>. This path is persistent and can be loaded later by a new ServiceClient or TrainingClient.
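Since the returned path is just a string, one simple pattern is to record it somewhere durable so a later process can pick it up. A minimal sketch, assuming a training_client created as in the examples below; the local file name is arbitrary:
```python
import pathlib

# Save full state and record the returned tinker:// path locally.
result = training_client.save_state(name="0001").result()
pathlib.Path("last_checkpoint.txt").write_text(result.path)

# ...later, possibly from a new process with a fresh TrainingClient...
resume_path = pathlib.Path("last_checkpoint.txt").read_text()
training_client.load_state(resume_path)
```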
Example: Saving for sampling
```python
# Setup
import tinker

service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-1B", rank=32
)

# Save a checkpoint that you can use for sampling
sampling_path = training_client.save_weights_for_sampler(name="0000").result().path

# Create a sampling client with that checkpoint
sampling_client = service_client.create_sampling_client(model_path=sampling_path)

# Shortcut: Combine these steps with:
sampling_client = training_client.save_weights_and_get_sampling_client(name="0000")
```
Example: Saving to resume training
Use save_state() and load_state() when you need to pause and continue training with full optimizer state preserved:
```python
# Save a checkpoint that you can resume from
resume_path = training_client.save_state(name="0010").result().path

# Load that checkpoint
training_client.load_state(resume_path)
```
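Because the tinker:// path is persistent, the same pattern works from a brand-new process, for example after an interruption. A minimal sketch, assuming the base model and LoRA rank match the original run and that the path returned by save_state() above was recorded somewhere:
```python
import tinker

service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-1B", rank=32
)

# The tinker://<model_id>/<name> string returned by save_state() in the
# original run.
resume_path = "tinker://<model_id>/0010"
training_client.load_state(resume_path)
# Training now continues with the optimizer state (momentum, learning rate
# schedule, etc.) exactly as it was at the checkpoint.
```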
When to use save_state() and load_state():
- Multi-step training pipelines (e.g. supervised learning followed by reinforcement learning), as sketched after this list
- Adjusting hyperparameters or data mid-run
- Recovery from interruptions or failures
- Any scenario where you need to preserve exact optimizer state (momentum, learning rate schedules, etc.)
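For instance, a two-phase pipeline can checkpoint full state at the end of the first phase and resume from it at the start of the second. A minimal sketch; the phase functions are hypothetical stand-ins for your own training loops:
```python
import tinker

def run_supervised_phase(client):
    """Hypothetical placeholder for a supervised fine-tuning loop."""

def run_rl_phase(client):
    """Hypothetical placeholder for a reinforcement learning loop."""

service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-1B", rank=32
)

# Phase 1: supervised fine-tuning, then checkpoint weights + optimizer state.
run_supervised_phase(training_client)
sl_path = training_client.save_state(name="sl_final").result().path

# Phase 2: reinforcement learning, resumed from the supervised checkpoint.
training_client.load_state(sl_path)
run_rl_phase(training_client)
```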