Warning: The Autoscaler is currently in Beta, and is subject to changes, 'quirks', and downtime.
This is a guide to the architecture of the tgi, sdauto, and ooba backends in the vast-pyworker repository. Its aim is to allow users to modify one of these backends to support additional endpoints, or to create a new backend with a similar architecture. This architecture isn't necessary for all vast-pyworker backends; it is suited to images where the underlying inference code is accessible through an HTTP server running in its own process, such as text-generation-inference. It also assumes that the inference server communicates with the other components of the backend server through its logs, so your own backend might be simpler if the inference code can communicate more directly with the code that tracks performance and talks to the autoscaler server, as would be the case if all of these pieces ran in the same process. For an example of a backend that doesn't use this architecture, see the helloautoscaler backend guide.
Each backend uses a launch script that starts the vast-pyworker code, installs the dependencies required by the backend code, and starts the inference server with the required arguments. The vast-pyworker code itself is started by running the start_server.sh script, which ensures that all dependencies required by vast-pyworker are installed and all necessary environment variables are set, launches all component processes of the backend server, and verifies that they are running correctly.
Different backends expect different environment variables to be defined for use in their launch scripts. For example, sdauto uses several variables (such as HF_MODEL_REPO) to let the user download a model from a Hugging Face repository and authenticate the download.
Below you can find a list of the required environment variables for each of the pre-defined backends:
Each backend has a dedicated directory that contains all of the code that is specific to it, like this one for tgi. Each backend directory has the following files:
start_server.sh will launch server.py, which is responsible for setting up the Flask server that receives client requests and forwards them to the backend server. server.py is written to be very general, and you don't need to modify this file at all.
The functionality for your backend should be written in backend.py, which defines the endpoints that you want the backend server to handle, along with the appropriate authentication and metrics tracking for each endpoint. You must write a handler function for each Flask endpoint, which then calls the appropriate function in your backend class. Each of these handlers must be declared in the “flask_dict” dictionary, with the endpoint route as its key, as is done here for text-generation-inference. The custom backend class that you write will be very minimal, and should inherit from the GenericBackend class defined in the top-level backend.py. The GenericBackend class provides the “format_request” method, which separates the authentication parts of the client request from the model request parameters, and the “check_signature” method, which takes the elements of the auth_dict returned by “format_request” and ensures they are signed correctly. It also provides the “generate” method, a convenience function that lets you easily call the backend server with the client’s request, handle the output, and track metrics. tgi/backend.py’s generate_handler is a good example of all of these components in use.
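To make the shape of backend.py concrete, here is a minimal sketch of the pattern. The “flask_dict” convention and the method names (format_request, check_signature, generate) come from the description above, but the GenericBackend stub, the auth field names, and the handler wiring are illustrative assumptions, not the real vast-pyworker code.

```python
class GenericBackend:
    """Stub standing in for the real class in the top-level backend.py."""

    def format_request(self, request_json):
        # Separate authentication fields from model request parameters.
        # The field names here are hypothetical, for illustration only.
        auth_keys = ("signature", "cost", "endpoint", "reqnum", "url")
        auth_dict = {k: request_json.pop(k, None) for k in auth_keys}
        return auth_dict, request_json

    def check_signature(self, auth_dict):
        # The real method verifies that the request is signed correctly;
        # this stub only checks that a signature is present.
        return auth_dict.get("signature") is not None


class MyBackend(GenericBackend):
    def handle_generate(self, request_json):
        auth_dict, model_request = self.format_request(request_json)
        if not self.check_signature(auth_dict):
            return {"error": "unauthorized"}, 401
        # The real code would call self.generate(...) here to forward
        # model_request to the inference server and track metrics.
        return {"ok": True, "params": model_request}, 200


backend = MyBackend()

def generate_handler(request_json):
    # A Flask route handler would pull request_json from the request body.
    return backend.handle_generate(request_json)

# Each endpoint route maps to its handler function.
flask_dict = {"/generate": generate_handler}
```

The key idea is that server.py stays generic: it only needs to iterate over flask_dict to register routes, while all backend-specific logic lives in the handlers and the backend class.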
To have the server handle metrics, you will need to create a metrics class in metrics.py that is instantiated by your backend class. Your metrics class can inherit from the GenericMetrics class defined in the top-level metrics.py. This class can be fairly minimal: it just needs functions for handling when requests start, finish, and error, as well as for controlling which metrics the vast-pyworker server sends to the autoscaler server in its update messages. An example of this class for stable-diffusion-webui can be found here.
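A minimal metrics class along these lines might look like the sketch below. The start/finish/error hooks and the report sent to the autoscaler are described above, but the method names, counters, and the GenericMetrics stub are hypothetical, for illustration only.

```python
import time


class GenericMetrics:
    """Stub standing in for the class in the top-level metrics.py."""
    pass


class MyMetrics(GenericMetrics):
    def __init__(self):
        self.requests_working = 0    # requests currently in flight
        self.requests_finished = 0
        self.errors = 0
        self.total_elapsed = 0.0     # cumulative request latency (seconds)
        self._start_times = {}

    def start_req(self, req_id):
        self.requests_working += 1
        self._start_times[req_id] = time.time()

    def finish_req(self, req_id):
        self.requests_working -= 1
        self.requests_finished += 1
        self.total_elapsed += time.time() - self._start_times.pop(req_id)

    def error_req(self, req_id):
        self.requests_working -= 1
        self.errors += 1
        self._start_times.pop(req_id, None)

    def report(self):
        # What the vast-pyworker server would include in its periodic
        # update messages to the autoscaler server.
        return {
            "requests_working": self.requests_working,
            "requests_finished": self.requests_finished,
            "errors": self.errors,
        }
```

Your backend class would call these hooks from its endpoint handlers, so each request is counted exactly once as either finished or errored.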
Lastly, you will need to write a custom LogWatch class for your backend in logwatch.py, which can inherit from GenericLogWatch defined in the top-level logwatch.py. Depending on how much your backend server outputs in its logs, your class can be extremely minimal or fairly extensive. All a minimal example needs to do is search for a pattern in the logs that indicates that the backend server is fully loaded and ready to serve requests, so that the vast-pyworker server can tell the autoscaler server that this server is ready. A minimal example for stable-diffusion-webui can be found here.
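A minimal LogWatch of this kind only needs to scan each log line for a readiness pattern. In the sketch below, the process_line hook, the GenericLogWatch stub, and the ready pattern itself are illustrative assumptions; match the pattern to whatever your inference server actually prints when it finishes loading.

```python
class GenericLogWatch:
    """Stub standing in for the class in the top-level logwatch.py."""

    def __init__(self):
        self.model_loaded = False

    def send_model_loaded(self):
        # The real method tells the autoscaler server that this
        # vast-pyworker server is ready to receive client requests.
        self.model_loaded = True


class MyLogWatch(GenericLogWatch):
    # Hypothetical pattern; use the line your backend server logs
    # once it is fully loaded and ready to serve requests.
    READY_PATTERN = "Model loaded"

    def process_line(self, line):
        if not self.model_loaded and self.READY_PATTERN in line:
            self.send_model_loaded()
```

Everything beyond this (parsing per-request timings, error lines, and so on) is optional and depends on how much your backend server writes to its logs.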
More extensive LogWatch classes, such as the one found in tgi/logwatch.py, can run a performance test once the backend server is ready. This gives the autoscaler server a more comprehensive picture of the server's maximum performance capabilities, and makes it possible to detect a server whose performance is particularly bad and replace it with another before any client requests are sent to it. We have already created a performance test for LLMs (designed specifically with text-generation-inference in mind), which is found in test_model.py. You can see an example of this test in use here. Note that this test is optional.
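The core idea behind such a test can be sketched as follows: time a fixed-size generation request and derive a tokens-per-second estimate the autoscaler can compare across workers. This is not the real test_model.py interface; the function names and the injected request callable are hypothetical.

```python
import time


def measure_throughput(send_request, max_new_tokens):
    """Time one fixed-size generation and return approximate tokens/sec.

    send_request is a callable (e.g. an HTTP POST to the inference
    server's generate endpoint) that blocks until generation completes.
    """
    start = time.time()
    send_request(max_new_tokens)
    elapsed = time.time() - start
    return max_new_tokens / elapsed


def below_threshold(tokens_per_sec, minimum):
    """A worker this slow could be reported for replacement."""
    return tokens_per_sec < minimum
```

Running this once, right after the LogWatch sees the ready pattern, lets the vast-pyworker server report a performance estimate alongside its readiness message.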