Operating spot VMs

Overview

Spot VMs let you run on ECI's idle GPU capacity at a discount versus on-demand. The discount varies with supply and demand, and a VM may be reclaimed when capacity tightens, so checkpointing and reclamation detection are mandatory.

Creating a spot VM

Under Compute > Virtual Machines, click Create VM.
On the Basic Info step, set the Pricing type to Spot. (Selecting spot automatically disables the Always-On and Disaster Recovery options.)
In the instance-type list, pick a GPU type whose availability is Currently available.
Fill in the remaining settings and create the VM.

Checking availability

Spot uses idle GPU capacity, so availability fluctuates frequently.

Option 1: From the VM creation screen

Once you select the spot pricing type, availability is shown inline next to each instance type.

Option 2: Infrastructure > Resource Status > Spot menu

Availability	Description
Currently available (`{n}`)	Capacity is available; you can create and start VMs
No capacity currently available	Capacity is exhausted; creating or starting a VM will fail

Reclamation

When capacity runs short, ECI force-reclaims spot VMs.

Reclamation process

Once reclamation is decided, the metadata API exposes the scheduled termination time.
After a 1–2 minute grace period, the VM is force-reclaimed.

Detecting reclamation: the metadata API

From inside the VM, the command below tells you whether reclamation is scheduled.

curl -s --unix-socket /run/eci-guest-agent.sock \
  'http://localhost/vm/metadata?key=spot_termination_time'

No reclamation scheduled: HTTP 404 with the body 404 page not found
Reclamation scheduled: returns the termination time (e.g. "2026-04-08T05:00:00+00:00")
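
Once you have the termination time, it helps to know how many seconds remain before the VM is reclaimed. The following is a minimal sketch, assuming the response body is an ISO 8601 timestamp like the example above. The function name `seconds_until_reclamation` is hypothetical, and stripping surrounding quotes is a precaution because the exact body format is not confirmed here.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reclamation(body: str, now: Optional[datetime] = None) -> float:
    """Return the seconds left before the scheduled termination time.

    `body` is the response text from the metadata API. Surrounding whitespace
    and double quotes are stripped because the exact body format is an assumption.
    """
    text = body.strip().strip('"')
    termination = datetime.fromisoformat(text)  # handles offsets such as +00:00
    if termination.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        termination = termination.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (termination - now).total_seconds()


# Example with a fixed "now" so the result is reproducible.
fixed_now = datetime(2026, 4, 8, 4, 58, 30, tzinfo=timezone.utc)
remaining = seconds_until_reclamation('"2026-04-08T05:00:00+00:00"', now=fixed_now)
print(remaining)  # 90.0
```

A negative result means the scheduled time has already passed, so any remaining work should be treated as unsafe to start.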

Reclamation watcher script (run in background)

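#!/bin/bash
# spot_watcher.sh: poll the metadata API and checkpoint once reclamation is scheduled

SOCKET=/run/eci-guest-agent.sock
URL='http://localhost/vm/metadata?key=spot_termination_time'  # quoted so the shell does not glob "?"

while true; do
  # The body goes to /tmp/spot_response; only the HTTP status code is captured
  HTTP_CODE=$(curl -s -o /tmp/spot_response -w "%{http_code}" \
    --unix-socket "$SOCKET" "$URL")

  # 200 means a termination time is set; 404 means nothing is scheduled
  if [ "$HTTP_CODE" = "200" ]; then
    echo "[$(date)] Reclamation scheduled: $(cat /tmp/spot_response)"
    /path/to/save_checkpoint.sh  # call your checkpoint script
    break
  fi
  sleep 5
done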

chmod +x spot_watcher.sh
nohup ./spot_watcher.sh &
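
If your training job is a Python process, you may prefer to poll from inside it instead of running a separate shell script. The sketch below uses only the standard library and assumes the same socket path and query shown above. `UnixHTTPConnection`, `fetch_termination`, `poll_once`, and `start_watcher` are hypothetical names introduced for illustration, not part of any ECI SDK.

```python
import http.client
import socket
import threading
from typing import Callable, Optional, Tuple

SOCKET_PATH = "/run/eci-guest-agent.sock"
METADATA_PATH = "/vm/metadata?key=spot_termination_time"


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 2.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def fetch_termination() -> Tuple[int, str]:
    """Query the guest agent and return (HTTP status, response body)."""
    conn = UnixHTTPConnection(SOCKET_PATH)
    try:
        conn.request("GET", METADATA_PATH)
        resp = conn.getresponse()
        return resp.status, resp.read().decode("utf-8", "replace").strip()
    finally:
        conn.close()


def poll_once(fetch: Callable[[], Tuple[int, str]] = fetch_termination) -> Optional[str]:
    """Return the termination time if reclamation is scheduled, else None."""
    try:
        status, body = fetch()
    except (OSError, http.client.HTTPException):
        return None  # agent unreachable: treat as "nothing scheduled" and retry later
    return body if status == 200 else None


def start_watcher(reclaim_event: threading.Event, interval: float = 5.0,
                  fetch: Callable[[], Tuple[int, str]] = fetch_termination) -> threading.Thread:
    """Poll in a daemon thread and set `reclaim_event` once reclamation is scheduled."""
    def loop() -> None:
        while not reclaim_event.is_set():
            if poll_once(fetch) is not None:
                reclaim_event.set()
                return
            reclaim_event.wait(interval)  # sleeps, but wakes early if the event is set

    thread = threading.Thread(target=loop, name="spot-watcher", daemon=True)
    thread.start()
    return thread
```

A training loop would create a `threading.Event`, pass it to `start_watcher`, and check `is_set()` between steps. The `fetch` parameter exists so the polling logic can be exercised without a real guest agent.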

Saving checkpoints

Data inside the VM is lost on reclamation

Anything in VM memory or local ephemeral storage disappears when reclamation happens. Always save checkpoints to block storage on a regular schedule.

PyTorch checkpoint example

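import os
import torch

CHECKPOINT_PATH = "/data/checkpoints/checkpoint.pt"  # must live on block storage

def save_checkpoint(model, optimizer, epoch, step, loss):
    os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
    tmp_path = CHECKPOINT_PATH + ".tmp"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, tmp_path)
    # Rename into place so a reclamation mid-write never leaves a truncated checkpoint
    os.replace(tmp_path, CHECKPOINT_PATH)
    print(f"[epoch {epoch}, step {step}] checkpoint saved")

def load_checkpoint(model, optimizer):
    if os.path.exists(CHECKPOINT_PATH):
        # Load onto the CPU first; load_state_dict copies tensors to the right device
        ckpt = torch.load(CHECKPOINT_PATH, map_location="cpu")
        model.load_state_dict(ckpt['model_state_dict'])
        optimizer.load_state_dict(ckpt['optimizer_state_dict'])
        return ckpt['epoch'], ckpt['step'] + 1  # resume after the last saved step
    return 0, 0  # start from scratch

# Training loop (model, optimizer, dataloader, total_epochs, and train_step are defined elsewhere)
start_epoch, start_step = load_checkpoint(model, optimizer)
for epoch in range(start_epoch, total_epochs):
    for step, batch in enumerate(dataloader):
        # Only in the resumed epoch: skip batches already processed before reclamation.
        # This assumes the dataloader yields batches in the same order after a restart.
        if epoch == start_epoch and step < start_step:
            continue
        loss = train_step(model, batch)  # train_step is assumed to return the loss
        if step % 100 == 0:
            save_checkpoint(model, optimizer, epoch, step, loss)  # save every 100 steps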

Saving only on reclamation detection can be too late

The grace period (1–2 minutes) may not be enough to finish writing. Combine reclamation detection with regular (every N steps) checkpointing.
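
One way to combine the two triggers is a small policy object that saves every N steps and also as soon as reclamation is detected. It also records how long each save takes, so you can tell whether a final save fits in the grace period. This is a sketch under stated assumptions: `CheckpointPolicy` is a hypothetical helper, `reclaim_event` is assumed to be set by whatever mechanism polls the metadata API, and the 60-second default is the lower bound of the 1–2 minute grace period.

```python
import threading
import time
from typing import Callable


class CheckpointPolicy:
    """Decide when to checkpoint: every N steps, or immediately on reclamation."""

    def __init__(self, every_n_steps: int, reclaim_event: threading.Event):
        self.every_n_steps = every_n_steps
        self.reclaim_event = reclaim_event
        self.last_save_seconds = 0.0

    def should_save(self, step: int) -> bool:
        return self.reclaim_event.is_set() or step % self.every_n_steps == 0

    def timed_save(self, save: Callable[[], None]) -> float:
        """Run `save` and remember how long it took."""
        start = time.monotonic()
        save()
        self.last_save_seconds = time.monotonic() - start
        return self.last_save_seconds

    def fits_in_grace_period(self, grace_seconds: float = 60.0) -> bool:
        """True if the most recent save finished within the grace period."""
        return self.last_save_seconds < grace_seconds


# Simulated training loop: reclamation is "detected" at step 230.
reclaim_event = threading.Event()
policy = CheckpointPolicy(every_n_steps=100, reclaim_event=reclaim_event)
saved_steps = []
for step in range(1, 251):
    if step == 230:
        reclaim_event.set()  # stands in for the watcher detecting reclamation
    if policy.should_save(step):
        policy.timed_save(lambda: saved_steps.append(step))
        if reclaim_event.is_set():
            break  # final checkpoint written; stop before the VM is reclaimed
print(saved_steps)  # [100, 200, 230]
```

If `fits_in_grace_period()` returns False during normal training, a save triggered by reclamation is unlikely to finish. In that case, reduce the checkpoint interval so less work is lost.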

Limitations

Spot pricing applies only to GPU instance types; CPU-only instances have no spot option
Attached block storage, public IPs, and networking are billed at standard rates
Always-On cannot be used
Disaster Recovery (DR) cannot be used
Cannot be joined to a virtual cluster

FAQ

My VM shut down out of nowhere.

The VM was most likely reclaimed because spot capacity ran short. Set up metadata-API polling and save checkpoints regularly so you can resume after restarting.

The Run button is disabled.

There's no spot capacity right now (you'll see "Spot capacity is currently insufficient to run"). Check availability under Infrastructure > Resource Status > Spot and try again later.

The spot price changed.

The spot price is applied at VM start time, so a restart may pick up a different price.

Next steps

Terraform spot VM guide: automating spot VMs as IaC
Pricing model: spot vs on-demand vs reserved
