Vast.ai is a cloud computing, matchmaking and aggregation service focused on lowering the price of compute-intensive workloads. Our software allows anyone to easily become a host by renting out their hardware. Our web search interface allows users to quickly find the best deals for compute according to their specific requirements.
Hosts download and run our management software, list their machines, configure prices and set any default jobs. Clients then find suitable machines using our flexible search interface, rent their desired machines, and finally run commands or start SSH sessions with a few clicks.
Vast.ai provides a simple interface to rent powerful machines at the best possible prices, reducing GPU cloud computing costs by ~3x to 5x.
Consumer computers and consumer GPUs in particular are considerably more cost effective than equivalent enteprise hardware. We are helping the millions of underutilized consumer GPUs around the world enter the cloud computing market for the first time.
DLPerf (Deep Learning Performance) - is our own scoring function. It is an approximate estimate of performance for typical deep learning tasks. Currently DLPerf predicts performance well in terms of iters/second for a few common tasks such as training resnet50 CNNs. For example on these tasks, a V100 instance with a DLPerf score of 21 is roughly ~2x faster than a 1080Ti with a DLPerf of 10.
It turns out that many tasks have similar performance characteristics, but naturally if your task is very unusual in its compute requirements the DLPerf score may not be very predictive. A single score can never be accurate for predicting performance across a wide variety of tasks; the best we can do is approximate performance on many tasks with a weighted combination. Although far from perfect, DLPerf is more useful for predicting performance than TFLops for most tasks.
In the near future we intend to improve DLPerf by incorporating search criteria into the score dynamically, and later - by using deep learning (of course!). For example, if you select the Pytorch image, the DLPerf scores will automatically adjust to predict Pytorch benchmark performance, a fp16/fp32 checkbox can provide information for even more informative scores, and so on.
We currently offer two rental types: On Demand (High Priority) and Interruptible (Low Priority). On Demand instances have a fixed price set by the host and run for as long as the client wants. Interruptible instances use a bidding system: clients set a bid price for their instance; the current highest bid is the instance that runs, the others are paused.
They are similar but have a few key differences. AWS spot instances and GCE interruptible instances both can be interrupted by on demand instances, but they do not use a direct bidding system. In addition GCE interruptible instances can only run for 24 hours. Vast.ai interruptible instances use a direct bidding system but are otherwise not limited.
If another user places a higher bid or creates an on demand rental for the same resources then your instance will be stopped. Stopping an instance kills the running processes, so when you are using interruptible instances it's important to save your work to disk. Also we highly recommend having your script periodically save your outputs to cloud storage as well, because once your instance is interrupted it could be a long wait until it resumes.
If you use the "custom command" option then your command will run automatically when the instance starts up. However if you are using an ssh instance, there is no default startup command. You can put startup commands in "/root/onstart.sh". This startup script will be found and run automatically on container startup.
Every instance offer on the Create page has a Max Duration. When you accept an offer and create an instance, this Max Duration becomes the instance lifetime and begins ticking down. When the lifetime expires, the instance is automatically stopped. The host can extend the contract which will add more lifetime to your instance, or they may not - it's up to them. Assume your instance will be lost once the lifetime expires; copy out any important data before then.
The environment variable VAST_CONTAINERLABEL is defined in the container. Ex:
root@C.38250:~$ echo $VAST_CONTAINERLABEL C.38250
It's complicated; it depends on many factors (hardware performance, price, reliablity, etc).
You can estimate your hardware's earning potential by comparing to similar hardware already rented on Vast.ai. On the create console page select "Include Unavailable Offers" and the nvidia/opencl image to see most instance types including those fully rented.
Hosts can run low priority jobs on their own machines, so there is always a fallback when high priority jobs are not available.
There are two prices to consider: the max price and the min price. The max price is what on demand rentals pay, and as a host you can set that price on the Host/Machines page with the Set Prices button. As a host you can set a min bid price for your machine by creating an idle job at that price on the Host/Create Job page. If you don't want to setup a true mining idle job, you can just use "ubuntu" as the image and "bash" as the command. See the Host Setup page for more info on idle jobs.
They are using a bid. The price that hosts set on the Host/Machines page is not the rental price. It is the maximum rental price. On demand instances pay the max price, but interruptible instances use a bidding system. You can control the min bid price by setting up an idle job. Alternatively, you can use the CLI to set a per machine min bid (reserve) price.
Removing gpus is currently not supported. If you really need to remove a gpu, you will need to unlist the machine and wait for 0 rentals. Then, when it is safe, you can recreate the machine. You can do this by deleting the file: /var/lib/vastai_kaalia/machine_id
The demand for DL compute has grown stably and significantly in the last few years; this growth is expected to continue for the forseeable future by most market analysts, and Nvidia's stock has skyrocketed accordingly. Demand for general GPU compute is less volatile than demand for cryptocurrency hashing. The stability of any particular host's earnings naturally depends on their hardware relative to the rest of the evolving market.
The slowdown in Moore's Law implies that hardware will last longer in the future. Amazon is still running Tesla K80's profitably now almost 4 years after their release, and the Kepler architecture they use is now about 6 years old.
Initially we are supporting Ubuntu Linux, more specifically Ubuntu 16.04 LTS. We expect that deep learning is the most important initial use case and currently the deep learning software ecosystem runs on Ubuntu. If you are a Windows or Mac user, don't worry, Ubuntu is easy and quick to install. If you are a current Windows user, it is also simple to setup Ubuntu in a dual-boot mode. Our software automatically helps you install the required dependencies on top of Ubuntu 16.04.
Technically if our software detects recent/decent Nvidia GPUs (GTX 10XX series) we will probably allow you to join, but naturally that doesn't guarantee any revenue. What truly matters is your hardware's actual performance on real customer workloads, which can be estimated from benchmarks.
We expect many initial customers to be interested in Deep Learning, which is GPU-intensive but also requires some IO and CPU performance per GPU to feed them with data. Multi-GPU systems are preferable for faster training through parallelization but also require more total system performance in proportion, and parallel training can require more pcie bandwidth per gpu in particular. Rendering and most other workloads have similiar requirements.
It depends heavily on the model and libraries used; it's constantly evolving; it's complicated. We suggest looking into the various deep learning workstations offered today for some examples, and see this in-depth discussion on hackernews . GPU workstations built for deep learning are similar to those built for rendering or other compute intensive tasks.
A reasonable rule of thumb is to expect the GPUs to be only about 30% to 50% of your machine's cost. Most current mining rigs are built for a much lower system cost, where the non-GPU parts are less than 25% of the total. We do not expect these builds to be highly profitable for anything other than mining. Spending a bit more on CPU, RAM, disk, etc will pay for itself several times over.
Interconnect in particular is one of the main limiters on scaling up DL, but current mainstream training algorithms do not yet utilize this precious resource efficiently. New upcoming techniques such as gradient compression can allow training large models on pcie 1x many-gpu rigs, but they are far from being a drop-in easy to use option for most researchers.
Guests are contained to an isolated operating system image using Linux containers. Containers provide the right combination of performance, security, and reliability for our use case. The guest only has access to devices and resources explicitly granted to them by their contract.
We do not by default prevent a guest from finding your router or NAT's external facing IP address by visiting some third party website, as this would require a full proxy network and all the associated bandwidth charges. It is essential that guests be able to download large datasets affordably. For many users a properly configured NAT/firewall should already provide protection enough against any consequences of a revealed IP address. For those who want additional peace of mind, we suggest using your own VPN service as they specialize in exactly this need and can proxy large volumes of traffic cheaply.
Cheaters lose. Modifying or tampering with our software, or the underlying OS or machine in order to defraud customers is still fraud. We can detect cheating by testing and comparing actual compute results, which are essentially impossible to fake. Hosts and machines with anomalous performance characteristics are subject to more extensive auditing.
Hosts with a history of good service and ratings are incentivized to maintain their good reputation just like any other cloud provider, but most peer hosts can not provide high levels of physical security. Protecting data privacy against curious hosts is quite difficult, but in the future we intend to implement hardened encrypted hosting environments as an option for additional data security. In the meantime, simple obfuscation methods may provide enough protection. Hosts can have many different clients and a difficult time identifying and finding interesting data, let alone any particular client's data.
Balances are updated about once every few seconds. Auto-billing periodically creates charges to pay off the current outstanding balance. Auto-billing is optional; alternatively you can use one time payments to purchase credits manually. Host billing runs on a regular weekly schedule every Friday. Host invoices are then paid out up to four days later, depending on bank transfer times.
For users in the United States, we support payout to a bank account (ACH) via Stripe. International users can receive payout through paypal. In the future we intend to add additional payout options. Due to various transaction fees, there is a minimum payout of $10 (or equivalent in other currencies).
No, not at this time.
Hosts receive 75% of the revenue earned from successful jobs, with 25% kept by Vast.ai.
Hosts are expected to provide reliable machines. We track data on disconnects, outages, and other errors; this data is then be used to estimate a host machine's future reliability. These reliability score estimates are displayed on the listing cards and also used as a factor in the default 'auto' ranking criteria.
There is no ssh password, we use ssh key authentication. If ssh asks for a password, typically this means there is something wrong with the ssh key that you entered or your ssh client is misconfigured. On Ubuntu or Mac, first you need to generate an rsa ssh key using the command:
ssh-keygen -t rsa
Then get the contents of the key with:
Then copy the entire output to your clipboard, then paste that into the "Change SSH Key" text box under console/account. The key text includes the opening "ssh-rsa" part and the ending "user@something" part. If you don't copy the entire thing, it won't work.
example SSH key text:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDdxWwxwN5Lz7ubkMrxM5FCHhVzOnZuLt5FHi7J9pFXCJHfr96w+ccBOBo2rtCCTTRDLnJjIsMLgBcC3+jGyUhpUNMFRVIJ7MeqdEHgHFvAZV/uBkb7RjbyyFcb4MMSYNggUZkOUNoNgEa3aqtBSzt33bnuGqqszs9bfDCaPFtr9Wo0b8p4IYil/gfOYBkuSVwkqrBCWrg53/+T2rAk/02mWNHXyBktJAu1q7qTWcyO68JTDd0sa+4apSu+CsJMBJs3FcDDRAl3bcpiKwRbCkQ+N6sol4xDV3zQRebUc98CJPh04Gnc01W02lmdqGLlXG5U/rV9/JM7CawKiIz7aaqv bob@velocity
We recommend first uploading your files to a cloud data store that provides raw http(s) access and then downloading your data from their to each instance using something like wget. This method can provide much higher bandwidth than uploading from a personal machine, especially when running a number of instances.
If you launched a Jupyter notebook instance, you can use it's upload feature, but this has a file size limit.
If you launched an ssh instance, you can copy files using scp. We recommend only using scp for outbound transfers from a host machine or for small inbound transfers to a host machine (less than 1 GB). For larger inbound transfers, downloading from a cloud data store using wget or curl will have much higher performance. The relevant scp command syntax is:
scp -P PORT LOCAL_FILE root@IPADDR:/REMOTEDIR
The PORT and IPADDR fields must mach those from the ssh command. The "Connect" button on the instance will give you these fields in the form:
ssh -p PORT root@IPADDR -L 8080:localhost:8080
For example, if Connect gives you this:
ssh -p 7417 firstname.lastname@example.org -L 8080:localhost:8080
You could use scp to upload a local file called "myfile.tar.gz" to a remote folder called "mydir" like so:
scp -P 7417 myfile.tar.gz email@example.com:/mydir
This seems to be due to bugs in the urllib3 and or requests libraries used by many python packages. We recommend using wget to download large files - it is quite robust and recovers from errors gracefully.
First you need to get the document ID. In My Drive file browser, right click on your file and then "Get Shareable Link". This will copy a link with the document ID to your clipboard. Paste the link to a text file, you should get something like this:
The document ID in this case is 0CwHCjDrwcJcGSHVYTkJqOVlfZ25RMk5CZENXNHFwOFdSSUJZ. Now you can use the following wget command:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O FILENAME && rm -rf /tmp/cookies.txt
Replace the 2 occurences of FILEID with your document ID, and replace FILENAME with the desired output filename.
First you need to get the raw https link. Using the chrome browser, on the Kaggle website go to the relevant dataset or competition page and start downloading the file you want. Then cancel the download and press ctrl+j to bring up the chrome Downloads page. At the top is the most recent download with a name and a link under it. Right click on the URL link and use "copy link address". Then you can use wget with that URL as follows:
wget 'URL' --no-check-certificate -O FILENAME
Notice the URL needs to be wrapped in ' ' single quotes.
When you stop an instance, the gpu(s) it was using may get reassigned. When you later then try to restart the instance, it tries to get those gpu(s) back - that is the "scheduling" phase. If another high priority job is currently using any of the same gpu(s), your instance will be stuck in "scheduling" phase until the conflicting jobs are done. We know this is not ideal, and we are working on ways to migrate containers across gpus and machines, but until then we recommend not stopping an instance unless you are ok with the risk of waiting a while to restart it.