
tinker_cookbook.supervised.StreamingSupervisedDatasetFromHFDataset

class tinker_cookbook.supervised.StreamingSupervisedDatasetFromHFDataset(SupervisedDataset)

A supervised dataset that streams rows from a HuggingFace dataset instead of loading the whole dataset into memory, which reduces memory usage.

Parameters:

  • hf_dataset – The streaming HuggingFace dataset.
  • batch_size – Number of rows per batch.
  • length – Total number of rows in the dataset. The caller must supply this because streaming datasets do not expose a length.
  • map_fn – Function mapping a single row to a Datum. Mutually exclusive with flatmap_fn.
  • flatmap_fn – Function mapping a single row to multiple Datums. Mutually exclusive with map_fn.
  • buffer_size – Shuffle buffer size. Default: 10000.
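The difference between map_fn and flatmap_fn can be illustrated with a small standalone sketch. It does not use the library: convert_rows is a hypothetical helper, plain strings stand in for tinker.Datum, and the rule that exactly one callback must be given is an assumption drawn from the "mutually exclusive" wording above.

```python
from typing import Callable, Iterable, Iterator, Optional

Row = dict
Item = str  # stand-in for tinker.Datum in this sketch


def convert_rows(
    rows: Iterable[Row],
    map_fn: Optional[Callable[[Row], Item]] = None,
    flatmap_fn: Optional[Callable[[Row], list[Item]]] = None,
) -> Iterator[Item]:
    """Hypothetical helper: apply exactly one of the two callbacks to each row."""
    # Assumption: supplying both callbacks, or neither, is an error.
    if (map_fn is None) == (flatmap_fn is None):
        raise ValueError("Provide exactly one of map_fn or flatmap_fn")
    for row in rows:
        if map_fn is not None:
            yield map_fn(row)  # one row -> exactly one item
        else:
            yield from flatmap_fn(row)  # one row -> zero or more items


rows = [{"text": "a b"}, {"text": "c"}]
mapped = list(convert_rows(rows, map_fn=lambda r: r["text"]))
flat = list(convert_rows(rows, flatmap_fn=lambda r: r["text"].split()))
```

With map_fn the output has one item per row; with flatmap_fn a single row can expand into several items, so the number of items per batch may differ from the number of rows.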

get_batch(index)

Return a batch of Datum objects at the given index.

Parameters:

  • index (int) – Zero-based batch index (must be strictly greater than the previous call's index).

Returns: list[tinker.Datum]: Training datums for this batch.

Raises:

  • DataValidationError: If index would require backward seeking.
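The forward-only constraint follows from reading a one-pass stream. The toy class below is a sketch of that idea, not the library's implementation: ForwardOnlyBatches is a hypothetical name, ValueError stands in for DataValidationError, and skipping ahead over unread batches is an assumption based on the "strictly greater" wording.

```python
from itertools import islice
from typing import Iterable, Iterator


class ForwardOnlyBatches:
    """Toy stand-in: serves fixed-size batches from a one-pass iterator."""

    def __init__(self, rows: Iterable[int], batch_size: int):
        self._it: Iterator[int] = iter(rows)
        self._batch_size = batch_size
        self._last_index = -1  # no batch has been served yet

    def get_batch(self, index: int) -> list[int]:
        if index <= self._last_index:
            # A one-pass stream cannot rewind to an earlier batch.
            raise ValueError(f"cannot seek backward to batch {index}")
        # Assumption: batches between the last served and the requested
        # index are consumed and discarded.
        skip = (index - self._last_index - 1) * self._batch_size
        for _ in islice(self._it, skip):
            pass
        self._last_index = index
        return list(islice(self._it, self._batch_size))


batches = ForwardOnlyBatches(range(10), batch_size=2)
first = batches.get_batch(0)
third = batches.get_batch(2)  # batch 1 is skipped and cannot be read later
```

Requesting batch 1 after this point raises, because the rows for that batch have already been consumed from the stream.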

set_epoch(seed)

Reset the stream for a new epoch.

Parameters:

  • seed (int) – Epoch seed forwarded to the underlying iterable dataset. Default: 0.
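A multi-epoch loop would typically call set_epoch once per epoch and then request batches in increasing order. The sketch below shows that pattern on a toy class. ToyEpochDataset is hypothetical and keeps all rows in memory, which the real streaming class is meant to avoid. It also assumes that set_epoch makes batch index 0 valid again and that the seed determines the shuffle order.

```python
import random
from typing import Sequence


class ToyEpochDataset:
    """Toy stand-in showing a per-epoch reset; not the library class."""

    def __init__(self, rows: Sequence[int], batch_size: int):
        self._rows = list(rows)
        self._batch_size = batch_size
        self.set_epoch(0)

    def __len__(self) -> int:
        # Number of full batches per epoch.
        return len(self._rows) // self._batch_size

    def set_epoch(self, seed: int = 0) -> None:
        # Reshuffle deterministically from the seed and rewind the cursor.
        order = list(self._rows)
        random.Random(seed).shuffle(order)
        self._order = order
        self._next_index = 0

    def get_batch(self, index: int) -> list[int]:
        if index < self._next_index:
            raise ValueError(f"cannot seek backward to batch {index}")
        start = index * self._batch_size
        self._next_index = index + 1
        return self._order[start:start + self._batch_size]


dataset = ToyEpochDataset(range(8), batch_size=2)
epochs = []
for epoch in range(2):
    dataset.set_epoch(seed=epoch)  # batch index 0 becomes valid again
    epochs.append([dataset.get_batch(i) for i in range(len(dataset))])
```

Using the epoch number as the seed gives each epoch its own reproducible order in this sketch; whether the library shuffles the same way depends on the underlying iterable dataset.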