Vast.ai is a cloud computing, matchmaking and aggregation service focused on lowering the price of compute-intensive workloads. Our software allows anyone to easily become a host by renting out their hardware. Our web search interface allows users to quickly find the best deals for compute according to their specific requirements.
Hosts download and run our management software, list their machines, configure prices and set any default jobs. Clients then find suitable machines using our flexible search interface, rent their desired machines, and finally run commands or start SSH sessions with a few clicks.
Vast.ai provides a simple interface to rent powerful machines at the best possible prices, reducing GPU cloud computing costs by ~3x to 5x.
Consumer computers, and consumer GPUs in particular, are considerably more cost-effective than equivalent enterprise hardware. We are helping the millions of underutilized consumer GPUs around the world enter the cloud computing market for the first time.
DLPerf (Deep Learning Performance) is our own scoring function. It is an approximate estimate of performance for typical deep learning tasks. Currently DLPerf predicts performance well in terms of iters/second for a few common tasks such as training ResNet-50 CNNs. For example, on these tasks a V100 instance with a DLPerf score of 21 is roughly 2x faster than a 1080Ti with a DLPerf of 10.
It turns out that many tasks have similar performance characteristics, but naturally if your task is very unusual in its compute requirements the DLPerf score may not be very predictive. A single score can never be accurate for predicting performance across a wide variety of tasks; the best we can do is approximate performance on many tasks with a weighted combination. Although far from perfect, DLPerf is more useful than TFLOPS for predicting performance on most tasks.
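As a purely illustrative sketch (the task weights and iters/second figures below are made up and are not the actual DLPerf formula), a weighted-combination score could be computed like this:

```shell
# Hypothetical weighted-combination score: NOT the real DLPerf formula.
# Benchmark figures (iters/sec) and weights are illustrative only.
awk 'BEGIN {
  resnet50 = 400   # iters/sec training a ResNet-50 CNN (made-up number)
  rnn      = 120   # iters/sec on an RNN task (made-up number)
  score = 0.7 * resnet50 + 0.3 * rnn   # weights reflect task popularity
  printf "combined score: %.1f\n", score
}'
```

In practice the weights would be tuned against measured benchmark data across many tasks.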
In the near future we intend to improve DLPerf by incorporating search criteria into the score dynamically, and later by using deep learning (of course!). For example, if you select the PyTorch image, the DLPerf scores will automatically adjust to predict PyTorch benchmark performance, an fp16/fp32 checkbox can provide information for even more informative scores, and so on.
It's complicated; it depends on the market competitiveness of one's hardware.
Our software can recommend defaults, but hosts ultimately set their own prices. We can roughly estimate revenue by comparing to the best prices offered by current cloud providers. For example, an Nvidia 1080Ti has similar performance to an Nvidia P6000 but costs about 6x less. A competitive price for P6000 rental as of April 2018 is $0.90/hour ($21/day) from Paperspace. This roughly suggests the upper end of 1080Ti earnings for high priority (on-demand) rental. For low priority (interruptible/spot) rental, the lower bound for one 1080Ti is just current cryptocurrency earnings: around $0.06/hour ($1.5/day) according to whattomine.com.
We expect actual earnings will fall somewhere between these bounds, depending on many factors (user rating, hardware benchmark performance, reliability, etc.). For a more accurate assessment of your specific hardware's potential, list your machine and see how it compares in our search interface. Hosts can run low priority jobs on their own machines, so there is always a fallback when high priority jobs are not available.
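Using the example hourly rates from above (figures which will drift with the market), the daily bounds are simple arithmetic:

```shell
# Back-of-envelope daily revenue bounds for one 1080Ti, using the
# example hourly rates above (these figures change with the market)
awk 'BEGIN {
  upper = 0.90 * 24   # on-demand ceiling: competitive P6000-class rate
  lower = 0.06 * 24   # interruptible floor: approximate mining payout
  printf "between $%.2f and $%.2f per day\n", lower, upper
}'
```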
A high priority or on-demand job is one where the client can end the contract at any time but the host cannot: the host is expected to provide full performance for the entire, unknown duration (conceivably up to weeks in some cases). A low priority (aka spot or interruptible) job is one where the host can end the contract after a small notification grace period (usually a few minutes, to give the client enough time to save their working state to storage).
Hosts set their ask price for high priority jobs and clients then pick which offers to accept. Low priority jobs reverse this arrangement: clients set their bid price and hosts then (automatically) pick the most profitable jobs to run.
Low priority jobs naturally pay hosts less, but they are always better than the alternative. Hosts can mine in the background by setting up a low priority job on their own hardware. Mining is a simple low priority/interruptible job: it can be stopped and restarted fairly quickly. Client low priority jobs will naturally need to bid above the current mining payout to win computing time.
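A minimal sketch of the host-side decision, with hypothetical rates (this is an illustration of the logic, not our actual scheduler):

```shell
# Sketch of the host-side choice for low priority work: run the best
# client bid only if it beats the mining fallback. Rates are hypothetical.
mining_rate=0.06   # $/hr the host could earn mining
best_bid=0.10      # $/hr offered by the highest client bid

if awk -v b="$best_bid" -v m="$mining_rate" 'BEGIN { exit !(b > m) }'; then
  echo "run client job at \$$best_bid/hr"
else
  echo "fall back to mining at \$$mining_rate/hr"
fi
```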
The demand for DL compute has grown stably and significantly in the last few years; this growth is expected to continue for the foreseeable future by most market analysts, and Nvidia's stock has skyrocketed accordingly. Demand for general GPU compute is less volatile than demand for cryptocurrency hashing. The stability of any particular host's earnings naturally depends on their hardware relative to the rest of the evolving market.
The slowdown in Moore's Law implies that hardware will last longer in the future. Amazon is still running Tesla K80s profitably now, almost 4 years after their release, and the Kepler architecture they use is now about 6 years old.
Initially we are supporting Ubuntu Linux, more specifically Ubuntu 16.04 LTS. We expect that deep learning is the most important initial use case, and currently the deep learning software ecosystem runs on Ubuntu. If you are a Windows or Mac user, don't worry: Ubuntu is easy and quick to install. If you are a current Windows user, it is also simple to set up Ubuntu in a dual-boot mode. Our software automatically helps you install the required dependencies on top of Ubuntu 16.04.
Technically if our software detects recent/decent Nvidia GPUs (GTX 10XX series) we will probably allow you to join, but naturally that doesn't guarantee any revenue. What truly matters is your hardware's actual performance on real customer workloads, which can be estimated from benchmarks.
We expect many initial customers to be interested in Deep Learning, which is GPU-intensive but also requires some IO and CPU performance per GPU to keep it fed with data. Multi-GPU systems are preferable for faster training through parallelization but also require proportionally more total system performance, and parallel training can require more PCIe bandwidth per GPU in particular. Rendering and most other workloads have similar requirements.
It depends heavily on the model and libraries used; it's constantly evolving; it's complicated. We suggest looking into the various deep learning workstations offered today for some examples, and see this in-depth discussion on Hacker News. GPU workstations built for deep learning are similar to those built for rendering or other compute-intensive tasks.
A reasonable rule of thumb is to expect the GPUs to be only about 30% to 50% of your machine's cost. Most current mining rigs are built for a much lower system cost, where the non-GPU parts are less than 25% of the total. We do not expect these builds to be highly profitable for anything other than mining. Spending a bit more on CPU, RAM, disk, etc. will pay for itself several times over.
Interconnect in particular is one of the main limiters on scaling up DL, but current mainstream training algorithms do not yet utilize this precious resource efficiently. New upcoming techniques such as gradient compression can allow training large models on PCIe x1 many-GPU rigs, but they are far from being a drop-in, easy-to-use option for most researchers.
Guests are contained to an isolated operating system image using Linux containers. Containers provide the right combination of performance, security, and reliability for our use case. The guest only has access to devices and resources explicitly granted to them by their contract. Guests are limited to an isolated virtual subnetwork consisting only of their own containers.
We do not by default prevent a guest from discovering your router or NAT's external-facing IP address by visiting a third-party website, as preventing this would require a full proxy network and all the associated bandwidth charges. It is essential that guests be able to download large datasets affordably. For many users a properly configured NAT/firewall should already provide enough protection against any consequences of a revealed IP address. For those who want additional peace of mind, we suggest using your own VPN service, as VPN providers specialize in exactly this need and can proxy large volumes of traffic cheaply.
Cheaters lose. Modifying or tampering with our software, or the underlying OS or machine in order to defraud customers is still fraud. We can detect cheating by testing and comparing actual compute results, which are essentially impossible to fake. Hosts and machines with anomalous performance characteristics are subject to more extensive auditing.
Hosts with a history of good service and ratings are incentivized to maintain their good reputation just like any other cloud provider, but most peer hosts cannot provide high levels of physical security. There are not yet many practical methods that can guarantee data privacy in the cloud setting, but if and when such techniques exist they could be used with our service. In the meantime, simple obfuscation methods may provide enough protection. Hosts will have many different clients and a difficult time identifying and finding any particular client's data.
Balances are updated about once per second. Client credit cards are billed once per week on Friday, and host payouts are also sent out once per week on Friday.
For users in the United States, we support payout to a bank account (ACH) via Stripe. International users can receive payout through PayPal. In the future we intend to add additional payout options. Due to various transaction fees, there is a minimum payout of $10 (or equivalent in other currencies).
No, not at this time.
In the Beta period, hosts keep 100% of the revenue. After Beta the current plan is that hosts will receive 80% of the revenue earned from successful jobs, with 20% kept by Vast.ai to cover costs. The final revenue payout structure will be determined and announced later.
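Under that planned split, the arithmetic for a hypothetical $100 of weekly job revenue would look like this:

```shell
# Planned post-Beta 80/20 split applied to a hypothetical $100 of revenue
awk 'BEGIN {
  revenue = 100.00
  printf "host receives $%.2f, Vast.ai keeps $%.2f\n", 0.80 * revenue, 0.20 * revenue
}'
```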
Hosts are expected to provide reliable machines. We will track data on disconnects, outages, and other errors; this data will then be used to estimate a host machine's future reliability. In the future we plan on using this reliability estimate as part of the ranking criteria.
There is no SSH password; we use SSH key authentication. If SSH asks for a password, it typically means something is wrong with the SSH key you entered or your SSH client is misconfigured. On Ubuntu or Mac, first you need to generate an RSA SSH key using the command:
ssh-keygen -t rsa
Then get the contents of the public key with:
cat ~/.ssh/id_rsa.pub
Then copy the entire output to your clipboard and paste it into the "Change SSH Key" text box under console/account. The key text includes the opening "ssh-rsa" part and the ending "user@something" part. If you don't copy the entire thing, it won't work.
example SSH key text:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDdxWwxwN5Lz7ubkMrxM5FCHhVzOnZuLt5FHi7J9pFXCJHfr96w+ccBOBo2rtCCTTRDLnJjIsMLgBcC3+jGyUhpUNMFRVIJ7MeqdEHgHFvAZV/uBkb7RjbyyFcb4MMSYNggUZkOUNoNgEa3aqtBSzt33bnuGqqszs9bfDCaPFtr9Wo0b8p4IYil/gfOYBkuSVwkqrBCWrg53/+T2rAk/02mWNHXyBktJAu1q7qTWcyO68JTDd0sa+4apSu+CsJMBJs3FcDDRAl3bcpiKwRbCkQ+N6sol4xDV3zQRebUc98CJPh04Gnc01W02lmdqGLlXG5U/rV9/JM7CawKiIz7aaqv bob@velocity
If you launched a Jupyter notebook instance, you can use its upload feature, but this has a file size limit.
If you launched an ssh instance, you can copy files using scp. The relevant scp command syntax is:
scp -P PORT LOCAL_FILE root@IPADDR:/REMOTEDIR
The PORT and IPADDR fields must match those from the ssh command. The "Connect" button on the instance will give you these fields in the form:
ssh -p PORT root@IPADDR -L 8080:localhost:8080
For example, if Connect gives you this:
ssh -p 7417 root@203.0.113.5 -L 8080:localhost:8080
you could use scp to upload a local file called "myfile.tar.gz" to a remote folder called "mydir" like so:
scp -P 7417 myfile.tar.gz root@203.0.113.5:/mydir
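To copy files in the other direction (downloading results from the instance to your local machine), swap the source and destination arguments; substitute your own PORT and IPADDR values from Connect (the remote path below is just a placeholder):

```shell
# Download a remote file to the current local directory
scp -P PORT root@IPADDR:/mydir/results.tar.gz .

# Add -r to copy a whole directory recursively
scp -r -P PORT root@IPADDR:/mydir .
```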
When you stop an instance, the GPU(s) it was using may get reassigned. When you later try to restart the instance, it attempts to get those GPU(s) back; that is the "scheduling" phase. If another high priority job is currently using any of the same GPU(s), your instance will be stuck in the "scheduling" phase until the conflicting jobs are done. We know this is not ideal, and we are working on ways to migrate containers across GPUs and machines, but until then we recommend not stopping an instance unless you are OK with the risk of waiting a while to restart it.