Under the Hood
This page explains some implementation details of Tinker, which are important for understanding how to speed up your code.
Clock Cycles
In Tinker, after you call ServiceClient.create_lora_training_client, your training job gets assigned to a pool of machines working together -- a worker pool -- which performs forward-backward operations repeatedly in lock-step.
Each of these steps of the worker pool is called a clock cycle.
In each clock cycle, we perform a forward-backward and an optimizer-step operation, each of which may involve multiple LoRA models being trained by this pool.
You can think of this pool as a single large training run that is time-shared between multiple different LoRA models, often from different users.
With multi-tenancy -- sharing the same worker pool between multiple models -- we can run the training system efficiently even if users are training with small batch sizes, or if they have other delays in their training loops that would otherwise leave the worker pool idle. Small batch sizes can often give better sample efficiency, so this setup lets us achieve both high compute efficiency and high sample efficiency.
The downside is that it can sometimes lead to worse latency: even if you're training with a small batch, you'll see the same step time as with a large batch. (Still, note that we'll only charge you for the compute you use.) Also, if your training loop is implemented naively, you might have to wait multiple clock cycles per batch, because you might miss a clock cycle between operations.
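For reference, here's a minimal sketch of the client-creation step that assigns your job to a worker pool. The base_model value and the rank argument are illustrative placeholders; consult the API reference for the exact parameters:

import tinker

# Creating the training client is what assigns your job to a worker pool
service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-1B",  # illustrative model name
    rank=32,  # illustrative LoRA rank
)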
Overlapping forward_backward and optim_step Requests
As mentioned in the Async and Futures section, you should submit your forward_backward and optim_step requests together before waiting for either of them. This way, they'll end up on the same clock cycle. If you write the code naively, you'll end up using three clock cycles per training step. Here's a recap of the example from the Async and Futures section:
❌ Naive implementation (uses 3 clock cycles):
# Submit forward_backward, gets queued for clock cycle N
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)
# Wait for it to complete, and for client to receive the result
# Due to communication latency, this happens a little after cycle N+1 started
fwd_bwd_result = await fwd_bwd_future
# Submit optim_step, gets queued for clock cycle N+2
optim_future = await client.optim_step_async(adam_params)
# Wait for it to complete, and for client to receive the result
# This happens a little after cycle N+2 finishes
optim_result = await optim_future
# Total: forward_backward on cycle N, optim_step on cycle N+2
# This takes 3 clock cycles (plus the time we waited before cycle N started)

✓ Better implementation (uses 1 clock cycle):
# Submit both requests immediately. They'll both be slotted into the same clock cycle N
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)
optim_future = await client.optim_step_async(adam_params)
# Now wait for results - both operations happen on cycle N
fwd_bwd_result = await fwd_bwd_future
optim_result = await optim_future
# Total: both operations on cycle N
# This takes 1 clock cycle

Pipelining to Maximize Clock Cycle Efficiency
To maximize efficiency and avoid missing clock cycles, you should pipeline your training loop: submit the next batch before waiting for the current batch to complete. This ensures there's always a request queued when a new clock cycle starts.
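In outline, a pipelined loop might look like the following sketch, which reuses the async pattern from the example above; get_batch and num_steps are illustrative placeholders, not part of the Tinker API:

# Prime the pipeline: submit the first step's requests up front
fwd_bwd_future = await client.forward_backward_async(get_batch(0), loss_fn)
optim_future = await client.optim_step_async(adam_params)

for step in range(1, num_steps):
    # Submit the next step's requests *before* waiting on the previous step,
    # so a request is already queued when a new clock cycle starts
    next_fwd_bwd_future = await client.forward_backward_async(get_batch(step), loss_fn)
    next_optim_future = await client.optim_step_async(adam_params)

    # Now wait for the previous step's results
    fwd_bwd_result = await fwd_bwd_future
    optim_result = await optim_future

    fwd_bwd_future, optim_future = next_fwd_bwd_future, next_optim_future

# Drain the last in-flight step
fwd_bwd_result = await fwd_bwd_future
optim_result = await optim_future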
We've created a demonstration script that shows the difference between pipelined and non-pipelined training:
View the clock cycles demonstration script →
The script includes two versions:
- Non-pipelined: Submits a batch, waits for it to complete, then submits the next. This approach typically wastes clock cycles because there's a gap between when one batch finishes and the next is submitted, often using 2 clock cycles per training step.
- Pipelined: Submits the next batch before waiting for the previous batch to complete. This approach often uses exactly 1 clock cycle per step, achieving maximum efficiency. It might sometimes take more than 1 clock cycle per step if the server is heavily loaded, or due to subtleties of our current implementation. (For example, if there are no other users, we might start the clock cycle after receiving the first forward_backward but before receiving the optim_step. Then we'll do optim_step on the next cycle. This causes an extra clock cycle but doesn't cause a slowdown.)
Running the script will show you the performance comparison, including total time and clock cycles used. The pipelined version typically saves both time and clock cycles.