tinker_cookbook.supervised.StreamingSupervisedDatasetFromHFDataset
class tinker_cookbook.supervised.StreamingSupervisedDatasetFromHFDataset(SupervisedDataset)
A supervised dataset that streams rows from a HuggingFace dataset instead of loading the whole dataset into memory.
Parameters:
- hf_dataset – The streaming HuggingFace dataset.
- batch_size – Number of rows per batch.
- length – Total number of rows in the dataset. Must be supplied explicitly because streaming datasets do not expose a length.
- map_fn – Function mapping a single row to a Datum. Mutually exclusive with flatmap_fn.
- flatmap_fn – Function mapping a single row to multiple Datums. Mutually exclusive with map_fn.
- buffer_size – Shuffle buffer size. Default: 10000.
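The difference between map_fn and flatmap_fn can be illustrated with a small standalone sketch. It does not use the library: `FakeDatum` (a plain string) is a hypothetical stand-in for a Datum, and `rows_to_datums` is an illustrative helper, not part of tinker_cookbook. It only shows the one-to-one versus one-to-many contract and the mutual-exclusion check described above.

```python
from typing import Callable, Iterable, Optional

Row = dict
FakeDatum = str  # hypothetical stand-in for a Datum, for illustration only


def rows_to_datums(
    rows: Iterable[Row],
    map_fn: Optional[Callable[[Row], FakeDatum]] = None,
    flatmap_fn: Optional[Callable[[Row], list]] = None,
) -> list:
    """Illustrative helper (not a library function) applying one of the callbacks."""
    # Exactly one callback must be given: the two are mutually exclusive.
    if (map_fn is None) == (flatmap_fn is None):
        raise ValueError("Pass exactly one of map_fn or flatmap_fn")
    if map_fn is not None:
        # map_fn: each row yields exactly one datum.
        return [map_fn(row) for row in rows]
    # flatmap_fn: each row yields a list of datums, flattened into one list.
    return [datum for row in rows for datum in flatmap_fn(row)]


rows = [{"text": "a b"}, {"text": "c"}]
one_per_row = rows_to_datums(rows, map_fn=lambda r: r["text"])
many_per_row = rows_to_datums(rows, flatmap_fn=lambda r: r["text"].split())
```

With flatmap_fn, the number of datums produced can differ from the number of rows consumed, which is the main reason to choose it over map_fn.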
get_batch(index)
Return a batch of Datum objects at the given index.
Parameters:
- index (int) – Zero-based batch index (must be strictly greater than the previous call's index).
Returns: list[tinker.Datum]: Training datums for this batch.
Raises:
- DataValidationError: If index would require backward seeking.
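The forward-only constraint follows from reading a one-pass stream. The toy class below is a sketch of that idea under stated assumptions, not the library's implementation: `ForwardOnlyBatches` is a hypothetical name, `ValueError` stands in for DataValidationError, and skipping ahead by discarding intermediate batches is an assumption about how a gap between indices might be handled.

```python
from itertools import islice


class ForwardOnlyBatches:
    """Toy stand-in that serves fixed-size batches from a one-pass iterator."""

    def __init__(self, rows, batch_size):
        self._rows = iter(rows)
        self._batch_size = batch_size
        self._last_index = -1  # no batch has been served yet

    def _next_batch(self):
        # Consume the next batch_size rows from the underlying stream.
        return list(islice(self._rows, self._batch_size))

    def get_batch(self, index):
        # A one-pass stream cannot rewind, so earlier or repeated indices fail.
        if index <= self._last_index:
            raise ValueError(
                f"index {index} would require seeking backward "
                f"(last served index: {self._last_index})"
            )
        # Assumption: a forward gap is handled by discarding skipped batches.
        for _ in range(index - self._last_index - 1):
            self._next_batch()
        self._last_index = index
        return self._next_batch()


stream = ForwardOnlyBatches(range(10), batch_size=2)
first = stream.get_batch(0)
third = stream.get_batch(2)  # batch 1 is consumed and discarded
try:
    stream.get_batch(1)
    rewind_failed = False
except ValueError:
    rewind_failed = True
```

In practice this means a training loop should request batch indices in increasing order within an epoch.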
set_epoch(seed)
Reset the stream for a new epoch.
Parameters:
- seed (int) – Epoch seed forwarded to the underlying iterable dataset. Default: 0.
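A multi-epoch loop would typically call set_epoch once per epoch and then request batches in increasing index order. The sketch below shows that call pattern against a stub so it runs without the library. `iterate_epochs` and `RecordingDataset` are hypothetical names, and two points are assumptions rather than documented behavior: that the epoch number is a suitable seed, and that batch indices restart at 0 after set_epoch.

```python
def iterate_epochs(dataset, num_epochs, batches_per_epoch):
    """Yield (epoch, index, batch) using the set_epoch / get_batch interface."""
    for epoch in range(num_epochs):
        # Assumption: the epoch number is used as the seed for that epoch.
        dataset.set_epoch(seed=epoch)
        # Assumption: indices restart at 0 once the stream has been reset.
        for index in range(batches_per_epoch):
            yield epoch, index, dataset.get_batch(index)


class RecordingDataset:
    """Hypothetical stub that records calls instead of streaming real data."""

    def __init__(self):
        self.calls = []

    def set_epoch(self, seed=0):
        self.calls.append(("set_epoch", seed))

    def get_batch(self, index):
        self.calls.append(("get_batch", index))
        return [f"datum-{index}"]


stub = RecordingDataset()
steps = list(iterate_epochs(stub, num_epochs=2, batches_per_epoch=2))
```

With a real dataset, `batches_per_epoch` would have to be derived by the caller, for example from length and batch_size; how partial final batches are treated is not stated here.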