Docs - Autoscaler

Debugging

Warning
The Autoscaler is currently in Beta and may experience changes and downtime.

Worker Errors

The Vast PyWorker framework detects some errors automatically; others may cause the instance to time out. When an error is detected, the autoscaler server destroys or reboots the instance. To debug an issue manually, check the instance logs, available via the logs button on the instance page in the GUI. All errors encountered while running the backend code or the PyWorker code are logged there. If further investigation is needed, additional logs are available in the /home/workspace/vast-pyworker directory on the instance.

Log Files

Log files are categorized based on functionality:

  • infer.log: Logs for the backend inference server.
  • auth.log: Logs for authorization, backend wrapper function calls, and autoscaler server performance updates.
  • watch.log: Logs from monitoring the backend inference server, including the loaded and error messages reported to the autoscaler server.
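
As a starting point for manual debugging, the three log files can be scanned for likely problem lines. The following is a minimal sketch, not part of the PyWorker framework: the directory and file names come from the text above, while the keyword list, the `tail` limit, and the function name `find_problem_lines` are assumptions made for illustration. The actual log line format may differ, so adjust the keywords to what the logs contain.

```python
from pathlib import Path

# Directory and file names are taken from the documentation above.
LOG_DIR = Path("/home/workspace/vast-pyworker")
LOG_FILES = ("infer.log", "auth.log", "watch.log")

# Assumption: problem lines contain one of these words.
# The real log format is not specified here and may differ.
KEYWORDS = ("error", "exception", "traceback")


def find_problem_lines(log_dir=LOG_DIR, tail=200):
    """Return {file name: [matching lines]} from the last `tail` lines of each log."""
    matches = {}
    for name in LOG_FILES:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip logs that do not exist on this instance
        lines = path.read_text(errors="replace").splitlines()[-tail:]
        hits = [line for line in lines if any(k in line.lower() for k in KEYWORDS)]
        if hits:
            matches[name] = hits
    return matches


if __name__ == "__main__":
    for name, hits in find_problem_lines().items():
        print(f"== {name} ==")
        for line in hits:
            print(line)
```

Run on the instance, this prints the matching lines grouped by log file, which shows whether a failure originated in the backend (infer.log), in authorization or wrapper calls (auth.log), or in monitoring (watch.log).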

Managing Load

Increasing Load

To handle high load on your instances:

  • Set test_workers high: Create more instances initially for autogroups with anticipated high load.
  • Adjust cold_workers: Set this high enough that workers are not destroyed while the initial load is still low.
  • Increase cold_mult: A higher value makes the autoscaler predict higher future load from the current high load, so it creates instances sooner. Lower it again once enough instances have been created.
  • Check max_workers: Ensure this parameter is set high enough to create the necessary number of workers.
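
The checklist above can be turned into a quick sanity check before applying a high-load configuration. This is a sketch under stated assumptions: the four parameter names come from the list above, but the dictionary layout, the example values, and the helper `check_autogroup_params` are hypothetical, and the check assumes that max_workers acts as an upper bound on the number of workers. It does not call any Vast API.

```python
def check_autogroup_params(params):
    """Return warnings for settings that max_workers would prevent from taking effect.

    Assumption: max_workers caps the total number of workers, so a
    test_workers or cold_workers value above it cannot be reached.
    """
    warnings = []
    max_workers = params["max_workers"]
    for key in ("test_workers", "cold_workers"):
        if params[key] > max_workers:
            warnings.append(f"{key}={params[key]} exceeds max_workers={max_workers}")
    return warnings


# Illustrative values only; choose numbers that match your expected load.
high_load = {
    "test_workers": 10,   # more instances created initially
    "cold_workers": 5,    # workers kept around during low initial load
    "cold_mult": 3.0,     # predict higher future load from current load
    "max_workers": 8,     # too low for the test_workers value above
}

for warning in check_autogroup_params(high_load):
    print(warning)
```

With these example values the check reports that test_workers is larger than max_workers, which is the situation the last bullet warns about.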

Decreasing Load

To manage decreasing load:

  • Reduce cold_workers: Stop instances quickly when the load decreases to avoid unnecessary costs. The autoscaler will typically handle this automatically, but manual adjustment can help if needed.