cloudscale GPU Servers – for LLM, AI, etc.
Everyone is talking about "AI" and the hopes this technology raises of improving the most varied areas of life. No doubt you already have your own ideas of how intelligent tools could make things better. Many building blocks are freely available on the internet, and the new cloudscale GPU servers now provide you with the computing power required to go full throttle with the model of your choice.
The new cloudscale GPU flavors
Effective immediately, you can also use virtual servers with GPUs at cloudscale by choosing one of our GPU flavors when launching a new server. Just as with the existing Flex and Plus flavors, you can choose between various CPU and RAM configurations. In addition, your server will be allocated one to four physical GPUs, depending on the flavor. The GPU flavors also include a local scratch disk, on which you will find more information below.
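If you prefer working with our API rather than the control panel, a GPU server is launched like any other server, simply by specifying a GPU flavor. Here is a minimal sketch in Python using the requests library; the flavor and image slugs are placeholders for illustration, so check the control panel or the API documentation for the actual names.

```python
import os
import requests

# Your read/write API token from the cloudscale control panel.
API_TOKEN = os.environ["CLOUDSCALE_API_TOKEN"]

payload = {
    "name": "llm-worker-1",
    "flavor": "gpu-16-64",    # placeholder GPU flavor slug
    "image": "ubuntu-24.04",  # placeholder image slug
    "ssh_keys": ["ssh-ed25519 AAAA... you@example.com"],
}

response = requests.post(
    "https://api.cloudscale.ch/v1/servers",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
server = response.json()
print(server["uuid"], server["status"])
```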
The new GPU flavors are aimed at maximum performance, which is why they are based on the tried-and-tested Plus flavors: the selected number of CPU cores is exclusively available to your virtual server, and you can use them to full capacity 24/7. The same applies to the GPUs: one or more NVIDIA L40S GPUs supply massive processing power for your workloads, and each GPU is passed through to your virtual server "as a whole" as a PCI device.
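Since each GPU appears in the guest as a regular PCI device, you can verify the passthrough directly from your software stack. A small sketch, assuming the NVIDIA driver and PyTorch are already installed on the server:

```python
import torch

# Each passed-through L40S should show up as an ordinary CUDA device.
assert torch.cuda.is_available(), "No GPU visible - check the NVIDIA driver"

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB memory")
```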
A new element: the scratch disk
At cloudscale, your servers' virtual hard drives have been stored in our Ceph-based storage clusters from the outset. This means that they are always immediately available, irrespective of the physical machine your virtual server is currently running on, and that these volumes (with the exception of the root volume) can be moved between virtual servers. This comes at the cost of a certain degree of latency: read and write operations run via network connections and – despite 100 Gbps links – are in transit considerably longer than with locally installed NVMe disks.
In everyday situations, most requests tend to affect a small section of the data, which can be kept in a cache if required. LLMs and similar workloads may behave differently here, which is why our GPU servers come with a local scratch disk. This storage is located on NVMe disks directly in the physical machine the virtual server is running on, thus providing minimal latency. As protection against failure, data are also stored in duplicate in a RAID 1 array.
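You can get a feel for the latency difference with a simple probe that times small synchronous writes. A rough sketch in Python; the mount points are examples and depend on how your disks are set up:

```python
import os
import time

def fsync_latency_ms(directory: str, runs: int = 100) -> float:
    """Average time for a small synchronous write, in milliseconds."""
    probe = os.path.join(directory, "latency_probe.tmp")
    start = time.perf_counter()
    for _ in range(runs):
        with open(probe, "wb") as f:
            f.write(b"x" * 4096)   # one 4 KiB block
            f.flush()
            os.fsync(f.fileno())   # force the write to disk, not just the page cache
    os.remove(probe)
    return (time.perf_counter() - start) / runs * 1000

# Example mount points - adjust to your setup:
for mount in ("/mnt/scratch", "/"):
    print(f"{mount}: {fsync_latency_ms(mount):.2f} ms per 4 KiB write")
```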
Operating this setup involves a few particularities. When a GPU server is moved to another physical machine (which, due to the GPUs, is not possible as a "live migration" but only while the server is switched off), the content of the scratch disk must be transferred as well, which takes a certain amount of time. Moving your GPU server may be triggered by scaling, for example, or become necessary when maintenance work is due on our part.
In the event of (hardware) problems, GPU servers are restarted on a different physical machine, subject to availability. You should assume, however, that you will be given a new, empty scratch disk in the process. For this reason, only use the scratch disk for data whose complete loss can be tolerated at any time, and make sure you regularly copy any interim results to a separate storage location.
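A simple pattern for this is to write checkpoints to the scratch disk for speed and to copy them to a regular volume at intervals. A minimal sketch, assuming the hypothetical mount points /mnt/scratch (scratch disk) and /mnt/data (a persistent volume):

```python
import shutil
from pathlib import Path

SCRATCH = Path("/mnt/scratch/checkpoints")    # fast, but may be lost at any time
PERSISTENT = Path("/mnt/data/checkpoints")    # survives a move to another machine

def sync_checkpoints() -> None:
    """Copy every checkpoint not yet present on the persistent volume."""
    PERSISTENT.mkdir(parents=True, exist_ok=True)
    for ckpt in SCRATCH.glob("*.pt"):
        target = PERSISTENT / ckpt.name
        if not target.exists():
            shutil.copy2(ckpt, target)

# Call sync_checkpoints() after every epoch or on a timer, so that losing
# the scratch disk costs you at most the latest interval of work.
```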
Development insights
Our GPU servers have been available to selected customers since late February, and feedback has been extremely positive. In parallel with gathering initial practical experience, we implemented various improvements, in part also in OpenStack, the open-source project our setup is based on. We will, of course, contribute our enhancements back "upstream" to the projects in question, insofar as this is possible and sensible.
One of these improvements is the possibility of enlarging the scratch disk at a later point in time: up to 1'600 GB are available to you locally, in addition to the usual volumes in our storage clusters. We have also deactivated data compression when moving the scratch disk between physical machines, as our internal 100 Gbps network means we can do without this overhead. And for the SSH connection that is opened for the migration, we ensured that the ciphers used benefit from the CPUs' hardware AES support.
Your turn
When creating a new virtual server in our cloud control panel, you will find the GPU flavors in the "Dedicated GPUs" tab. Use the "please contact support" link once to provide us with the key data of your planned use, attaching a signed copy of the "Addendum for GPU servers". After a manual check, we will enable the GPU flavors for the project you specify.
If you do not yet have a specific use case but would like to talk to your own chatbot, Lukas has made it easy for you to get started. In our engineering blog, he shows you step by step how to install Ollama and DeepSeek-R1 70B at cloudscale and make them accessible via the web. A useful tip: our NVIDIA L40S GPUs have 48 GB of memory each. To ensure that performance does not collapse, choose as many GPUs as needed for your selected model to fit completely into GPU memory; a rough sizing sketch follows below.
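For a rough sizing estimate, work out the model's memory footprint from its parameter count and quantization, plus some headroom for the KV cache and runtime buffers. The following is a back-of-the-envelope sketch; the 1.2 overhead factor is an assumption, not a measured figure:

```python
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                gpu_mem_gb: float = 48, overhead: float = 1.2) -> int:
    """Estimate how many GPUs a model needs to fit entirely in GPU memory."""
    model_gb = params_billion * bytes_per_param * overhead
    return math.ceil(model_gb / gpu_mem_gb)

# A 70B model at 4-bit quantization (~0.5 bytes per parameter):
print(gpus_needed(70, 0.5))  # 70 * 0.5 * 1.2 = 42 GB  -> 1 GPU
# At 8-bit: 84 GB -> 2 GPUs; at fp16: 168 GB -> 4 GPUs
print(gpus_needed(70, 1.0), gpus_needed(70, 2.0))
```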
Our new GPU servers with up-to-date NVIDIA L40S GPUs and a local scratch disk provide maximum performance for your LLM and AI workloads. After a one-off activation, you can start, scale and delete GPU servers at any time via the control panel or API using the self-service model. It goes without saying that, as usual at cloudscale, you benefit from to-the-second billing without fixed costs and from a data location in Switzerland. Note, however, that the offer is currently limited and available on a first come, first served basis.
Still here for you personally,
Your cloudscale team