The Vast PyWorker framework automatically detects some errors, while others may cause the instance to time out. When an error is detected, the autoscaler server will destroy or reboot the instance. To manually debug an issue, check the instance logs, available via the logs button on the instance page in the GUI. All PyWorker issues are logged there. If further investigation is needed, you can SSH into the instance and find the model backend log location by running:
echo "$MODEL_LOG"
To find the PyWorker log location, run:
echo "${WORKSPACE_DIR:-/workspace}/pyworker.log"
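Once both locations are known, a small helper can print the most recent lines of each log. This is a minimal sketch rather than part of PyWorker: show_tail is a hypothetical helper name, the 50-line window is arbitrary, and it assumes MODEL_LOG and WORKSPACE_DIR are set in the SSH session the same way the commands above rely on them.

```shell
# Hypothetical helper (not part of PyWorker): print the last 50 lines
# of a log file, or a short note if the path is empty or missing.
show_tail() {
  if [ -n "$1" ] && [ -f "$1" ]; then
    echo "== $1 =="
    tail -n 50 "$1"
  else
    echo "log not found: ${1:-<unset>}"
  fi
}

# Model backend log; MODEL_LOG is assumed to be set on the instance.
show_tail "${MODEL_LOG:-}"

# PyWorker log, using the same /workspace fallback as the command above.
show_tail "${WORKSPACE_DIR:-/workspace}/pyworker.log"
```

If either path is reported as not found, check that the variables are exported in the current shell before looking further.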
To handle high load on your instances:
Set test_workers high: Create more instances initially for autogroups with anticipated high load.
cold_workers: Keep enough workers around to prevent them from being destroyed during low initial load.
cold_mult: Quickly create instances by predicting higher future load based on current high load. Adjust back down once enough instances are created.
max_workers: Ensure this parameter is set high enough to create the necessary number of workers.

To manage decreasing load:
cold_workers: Stop instances quickly when the load decreases to avoid unnecessary costs. The autoscaler will typically handle this automatically, but manual adjustment can help if needed.
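To build intuition for how these parameters interact, the toy model below treats cold_workers as a floor, max_workers as a ceiling, and cold_mult as a multiplier on the number of workers currently needed. This is an illustrative assumption only, not the autoscaler's actual algorithm: target_workers is a hypothetical function, the "needed" argument is invented for the example, and cold_mult is restricted to an integer to keep the shell arithmetic simple.

```shell
# Toy model only: NOT the autoscaler's real algorithm.
# Usage: target_workers NEEDED COLD_MULT COLD_WORKERS MAX_WORKERS
target_workers() {
  needed=$1
  cold_mult=$2
  cold_workers=$3
  max_workers=$4

  # Over-provision relative to the current need (integer multiplier assumed).
  target=$(( needed * cold_mult ))

  # Never drop below the pool of workers kept around.
  if [ "$target" -lt "$cold_workers" ]; then
    target=$cold_workers
  fi

  # Never exceed the configured ceiling.
  if [ "$target" -gt "$max_workers" ]; then
    target=$max_workers
  fi

  echo "$target"
}

target_workers 4 2 5 20    # prints 8: multiplier applies
target_workers 1 2 5 20    # prints 5: floor from cold_workers
target_workers 15 2 5 20   # prints 20: capped by max_workers
```

In this simplified picture, the last call shows why a max_workers value that is too low prevents the necessary workers from being created, and the second call shows how lowering cold_workers lets the pool shrink when load drops.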