Docs - Autoscaler


Warning: The Autoscaler is currently in Beta and is subject to changes, quirks, and downtime.

Vast.ai provides an Autoscaling system to manage serverless workers for AI inference and other GPU computing tasks. Autoscaling automates the provisioning of GPU workers to match the time-varying computational needs of dynamic workloads. The Autoscaler manages groups of instances, called autoscaling groups (autogroups), to serve a horizontally scalable application, scaling up or down according to customizable usage metrics.
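The scale-up/scale-down decision described above can be sketched as follows. This is an illustrative model only, not the Autoscaler's actual implementation; the metric, capacity figures, and headroom factor are assumptions for the example:

```python
import math

# Illustrative sketch of an autoscaling decision (not Vast.ai's actual code):
# choose a target worker count from a usage metric, keeping some headroom
# so the group scales up before workers saturate.

def target_workers(current_load: float, capacity_per_worker: float,
                   min_workers: int = 1, max_workers: int = 100,
                   headroom: float = 1.25) -> int:
    """Pick a worker count so total capacity covers load plus headroom."""
    needed = math.ceil((current_load * headroom) / capacity_per_worker)
    # Clamp to the group's configured bounds.
    return max(min_workers, min(max_workers, needed))

# e.g. 900 requests/s of load, each worker handling 100 requests/s:
# ceil(900 * 1.25 / 100) = 12 workers
```

In practice the Autoscaler drives this loop from your own application performance metrics, so "load" and "capacity" can be whatever your workload reports.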

Key Features/Uses:

  • Efficient Inference: Dynamic load-balanced routing for cost-effective inference at scale, leveraging Vast's global fleet of powerful, inexpensive GPUs.

  • Autoscaling: Dynamic scaling based on customizable (bring-your-own) application performance metrics for maximum efficiency at scale.

  • Containers/Templates: Use any container or template. Autoscaling provides the management and load-balancing layers, but Autoscaling workers are just regular GPU instances.

  • Fast Cold-Start: To minimize cold-start times, the Autoscaler maintains a reserve pool of storage workers that can spin up in seconds (plus app/image-specific model load times, which vary).

  • Metrics/Debugging: Autoscaling workers are regular GPU instances and thus support all the same features: metrics, logs, Jupyter/SSH access, etc.

  • Autogroups: Define custom worker types via CLI search filters and create commands, with multiple worker types (autogroups) per endpoint.

  • Automatic Performance Exploration: Automate machine benchmarking and testing specific to your application to find the machines with the best perf/price metrics.