lightbench.metrics

class lightbench.metrics.llm_judge.LLMJudge(judge_model_name='gpt-4o-mini')

Bases: object

get_score(prompt, response, max_attempts=2)
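From the signature alone, get_score plausibly calls the judge model and retries up to max_attempts times on transient failures; that retry pattern can be sketched in plain Python (a stand-in, not the lightbench implementation — get_score_with_retries, judge_fn, and flaky_judge are hypothetical names):

```python
# Hypothetical stand-in for a get_score-style retry loop: call the
# judge up to max_attempts times, return the first successful score,
# or None if every attempt fails.
def get_score_with_retries(judge_fn, prompt, response, max_attempts=2):
    for _ in range(max_attempts):
        try:
            return judge_fn(prompt, response)
        except Exception:
            continue
    return None

# A toy judge that fails on its first call, then succeeds.
calls = {"n": 0}
def flaky_judge(prompt, response):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient judge error")
    return 4.0

score = get_score_with_retries(flaky_judge, "Q?", "A.", max_attempts=2)
# score == 4.0 after one retry
```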

Module for measuring time to first token (TTFT), GPU memory usage, and GPU power usage.

This module provides the following classes:
  • GenerationMetrics: Monitors TTFT, GPU VRAM, and power usage.

  • VRAM_TORCH: Measures GPU VRAM usage using PyTorch CUDA utilities.

  • PowerUsage: Measures and tracks GPU power usage via NVML.

class lightbench.metrics.metrics.GenerationMetrics(tokenizer, sample_every: int = 1, device: str = 'cuda', use_nvml: bool = False, DEBUG: bool = False)

Bases: BaseStreamer

Collects several generation-time metrics in a single place.

It measures:
  • TTFT (Time-To-First-Token)

  • Average VRAM usage (either via NVML or PyTorch utilities)

  • Average GPU power consumption (via NVML)

Sampling happens during token streaming; set sample_every to control how often measurements are taken (default: sample_every = 1).

Parameters:
  • tokenizer (transformers.PreTrainedTokenizerBase) – Tokenizer used by your model; required to build a TextIteratorStreamer.

  • sample_every (int, default = 1) – Frequency (in tokens) at which VRAM and power are sampled. Must be > 0.

  • device (str, default = "cuda") – Device string; NVIDIA CUDA (``cuda``) and Apple Metal (``mps``) are supported.

  • use_nvml (bool, default = False) – If True, NVML is used to measure VRAM. If NVML is unavailable or use_nvml=False, the PyTorch memory utilities are used instead.

  • DEBUG (bool, default = False) – Emit verbose messages when something goes wrong.

property avg_power: float
property avg_vram: float
end()

Called by ``.generate()`` to signal the end of generation.

put(value)

Called by ``.generate()`` to push new tokens.

reset()
set_start_time()
property ttft: float | None
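The streamer protocol above (set_start_time() before generation, put() once per token, end() when done) combined with periodic sampling can be sketched in plain Python. This is a minimal stand-in, not the lightbench implementation; MiniMetrics and its sample_fn hook are hypothetical names:

```python
import time

class MiniMetrics:
    """Minimal stand-in for the streamer pattern: set_start_time()
    before generation, put() once per token, end() when done."""

    def __init__(self, sample_fn, sample_every=1):
        self.sample_fn = sample_fn        # probe called every `sample_every` tokens
        self.sample_every = sample_every
        self.samples = []
        self._start = None
        self._ttft = None
        self._n_tokens = 0

    def set_start_time(self):
        self._start = time.perf_counter()

    def put(self, value):
        self._n_tokens += 1
        if self._ttft is None and self._start is not None:
            # First token seen: record time-to-first-token.
            self._ttft = time.perf_counter() - self._start
        if self._n_tokens % self.sample_every == 0:
            self.samples.append(self.sample_fn())

    def end(self):
        pass

    @property
    def ttft(self):
        return self._ttft

m = MiniMetrics(sample_fn=lambda: 42.0, sample_every=2)
m.set_start_time()
for tok in range(5):   # pretend .generate() streams 5 tokens
    m.put(tok)
m.end()
# m.ttft holds the first-token latency; 5 tokens at sample_every=2
# yields 2 samples (after tokens 2 and 4)
```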
class lightbench.metrics.metrics.PowerUsage(gpu_index=0, DEBUG: bool = False)

Bases: object

Class to measure and track GPU power usage using NVML.

DEBUG

Flag to enable debug output.

Type:

bool

handle

NVML handle for the specified GPU.

power_samples

List to store power usage measurements in watts.

Type:

list

get_average()

Calculate and return the average GPU power usage from the recorded samples.

Returns:

Average power usage in watts. Returns 0.0 if no samples exist.

Return type:

float

kill()

Shutdown NVML to clean up resources.

measure_power()

Measure the current GPU power usage and record the sample.

Returns:

Current GPU power usage in watts. Returns 0.0 if measurement is unsupported or fails.

Return type:

float
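The sample-and-average behaviour documented above, including the 0.0 fallbacks for a failed measurement and for an empty sample list, can be sketched without NVML. This is a pure-Python stand-in; PowerTracker and its read_watts probe are hypothetical names:

```python
class PowerTracker:
    """Stand-in for the PowerUsage sampling pattern: record watt
    readings from a probe, then average them on demand."""

    def __init__(self, read_watts):
        self.read_watts = read_watts   # probe returning current watts
        self.power_samples = []

    def measure_power(self):
        try:
            watts = self.read_watts()
        except Exception:
            return 0.0                 # unsupported / failed measurement
        self.power_samples.append(watts)
        return watts

    def get_average(self):
        if not self.power_samples:
            return 0.0                 # documented empty-list fallback
        return sum(self.power_samples) / len(self.power_samples)

readings = iter([100.0, 120.0, 110.0])
tracker = PowerTracker(read_watts=lambda: next(readings))
for _ in range(3):
    tracker.measure_power()
avg = tracker.get_average()            # (100 + 120 + 110) / 3 = 110.0
```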

class lightbench.metrics.metrics.VRAM_NVML

Bases: object

Class to monitor GPU VRAM usage using NVIDIA’s NVML.

Deprecated since version 0.1.0: The VRAM_NVML class is deprecated. Use VRAM_TORCH instead.

device_handle

NVML handle for the first GPU device.

_max_memory

Tracks the maximum memory used (in bytes).

measure_vram()

Measure and return the peak VRAM usage in gigabytes (GB).

Returns:

Maximum VRAM usage (in GB) observed so far.

Return type:

float

reset()

Reset the maximum memory usage by reading the current used memory.

class lightbench.metrics.metrics.VRAM_TORCH(device: str, DEBUG: bool = False)

Bases: object

Class to measure GPU VRAM usage using PyTorch’s utilities.

DEBUG

Flag to enable debug output.

Type:

bool

device

The device to monitor (‘cuda’ or ‘mps’).

Type:

torch.device

device: str = 'cuda'
measure_vram() → float

Measure the memory usage in gigabytes.

Returns:

Memory usage (in GB), either peak or current depending on backend.

Return type:

float

reset()

Reset memory usage statistics based on the device type.
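The reset-then-peak pattern shared by the VRAM classes (reset() re-baselines at the current reading, measure_vram() reports the peak observed since, converted from bytes to gigabytes) can be sketched without a GPU. This is a pure-Python stand-in; PeakTracker and its read_used_bytes probe are hypothetical names:

```python
class PeakTracker:
    """Stand-in for the VRAM peak-tracking pattern: reset() re-baselines
    at the current reading; measure_vram() returns the peak seen since,
    converted from bytes to gigabytes."""

    def __init__(self, read_used_bytes):
        self.read_used_bytes = read_used_bytes  # probe: current used bytes
        self._max = read_used_bytes()

    def reset(self):
        self._max = self.read_used_bytes()

    def measure_vram(self):
        self._max = max(self._max, self.read_used_bytes())
        return self._max / 1024**3              # bytes -> GB

usage = {"bytes": 2 * 1024**3}
t = PeakTracker(lambda: usage["bytes"])
usage["bytes"] = 3 * 1024**3                    # allocation grows
peak_gb = t.measure_vram()                      # 3.0
usage["bytes"] = 1 * 1024**3                    # memory freed
peak_gb = t.measure_vram()                      # still 3.0: peak, not current
```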