V100MAXXING

This serves as a companion to the original V100MAXX rentry, as I feel there's some additional pieces of wisdom that rentry fails to take note of. But do read the original first, as my rentry doesn't emphasize the whole "64GiB of VRAM for $2.1K" schtick enough, and my focus is on training pet projects rather than text-gen gooning. Instead I aim to cover a lot of details that seem rather obvious in hindsight but are still worth noting as suggestions.

Hardware
- Systems
- GPUs
- Power Supply
- Storage
Assembly
- Powering
Booting / Software Setup
- Software
- Troubleshooting
Misc. Considerations
- Cooling
- Scaling
- Software
Benchmarks
Afterthoughts
Changelogs

Hardware

Systems

The hurdles to jump through to make use of these cheap V100s is to utilize hardware that boast SXM2 sockets. If you're interested in utilizing your existing hardware but also want to make use of these cheap GPUs, consider the PCIe section.

I'll primarily focus on the two Gigabyte models that sports an NVLink/SXM2 daughterboard. Both model listings on eBay should have everything specific to the systems (heatsinks, fans, shrouds, screws). You will need to provide your own CPUs, GPUs, RAM DIMMs, storage, and PSU.

T180-G20

For about ~$985, the T180-G20 is an older, Broadwell era system that, at a glance, seems equivalent to the T181-G20. However, documentation for the T180-G20 is either nonexistent or extinct, as the only trace of documentation is some press release from Gigabyte with a link that you need to use the Wayback Machine for. Because of this, there's no guarantee for various firmwares or manuals out there. This detail is a huge reason why I'm making this rentry.

At least you get the QVL you can consult. I went with:

Intel Xeon E5-2630 V4: for about $7 a pop on eBay, or $10 on Amazon you can get next-day or so. You can consult the QVL for other SKUs you can go with for specific core count / frequencies / RAM frequencies.
DDR4-2133 RDIMMs: for about $15 a pop. Eight DIMMs is preferred to fully utilize the quad channel memory per CPU, but you can go for four if you're fiending for savings, or the full 16 to populate all slots. I went with Hynix DIMMs, but I had at least one "die" on me only for it to start working again after months of leaving it out of the system, so go with who you trust more.

You could get away with any other DIMMs instead of what's on the QVL. I have yet to try my presumed-to-be-dead gay Trident Z DIMMS, as I naively threw them all in with my normal DIMMs and the BMC complained about failing memory training (and I stopped caring to keep them in since the lid won't close with them in).

I would highly suggest getting 128GB or more RAM up front if you're looking to do any other form of GPU workloads, as I found it's quite easy to trigger OOM killers and lock the system up.

T181-G20

For about ~$1200, the T181-G20 is a slightly newer, Skylake era system that still maintains decent amount of documentation and available firmware.

Consult the QVL for a ballpark on what CPUs/RAM DIMMs to pair it with, but to reference the original V100MAXX rentry:

Intel Xeon Gold 6132: for about $35 a pop. Two is preferred, but shouldn't be required.
DDR4-2133 RDIMMs: for about $15 a pop. Twelve DIMMs is preferred to fully utilize the six-channel memory per CPU, but you can go for six if you're fiending for savings, or the full 24 to populate all slots.

Other

Another option is a Dell POWEREDGE C4130 which should have less headaches, but is a rarity to find the SXM2 variant (eBay will usually have the PCIe variant, or the NVLink/SXM2 add-in board and cables). Be very sure you're getting the SMX2 variant.

SuperMicro might have some solutions as well, but cursory G**gle searches points to a ServeTheHome thread mucking around with a mystical AOM-SXMV add-in board. I've yet to see any useful listings anywhere about acquiring one, add-in board or whole system.

To expand upon the option of SXM2 add-in boards, you might be able to make use of SXM2 add-in boards and use them within existing systems. The T180-G20's SXM2 board seems to serve as a PLX switch to the motherboard and is connected to it through two PCIe x16 connections. Power comes from the power distribution daughterboard on the back. I imagine it shouldn't be too herculean of a task, but if you do go this route, be sure to verify the voltages powering the board.

PCIe

If you're interested in simply:

taking advantage of cheap V100s
want to make use of existing PCIe based hardware (and shy away from datacenter-grade hardware)
do not care for NVLink all that much, or do not need four V100s (as the Gigabyte systems only offer four)

then you also have the option of going an alternative route: utilizing SXM2-to-PCIe adapter boards. They're a bit tricky to source, as:

eBay will typically have them for at least ~~$1k (defeating the purpose of buying SXM2 V100s for a bargain) and if you're lucky $500~~ >$300. Be very sure to check if it's a bare conversion board or it also includes a heatsink + fan.
- You might also be able to buy a V100 or A100 with a conversion board directly off eBay.
This Fiverr (R*ddit) listing offers SXM2-to-PCIe adapter boards for $150, at the "risk" of having to go through Fiverr. The linked R*ddit has his posts for credibility.
Other corners of the internet may offer these boards directly from China, such as Taobao (which seems like the Fiverr listing but more direct), but you'll need to already be proficient in ordering directly from China.

These boards should simply require a heatsink + fan, and two normal PCIe power connectors off your existing ATX motherboard.

However, as mentioned before, you won't have the benefit of NVLink. The adapter boards only have one SXM2 connector per card, and even if there were two, NVLink of PCIe isn't really a thing as PCIe cards make use of bridges as an interconnect. This shouldn't be a huge boon overall, but one of the purposes of V100MAXX is for NVLink, after all.

GPUs

While the namesake is to V100MAXX, I'll be thorough and list your available options:

V100 SXM2 16GB: for ~~$180~~ ~~~$195~~ $130 a pop. The original rentry mentioned these going for $150, but historical sales never shown them being sold for this much. They were $180 when I built my system, but they're climbing in price.
P100 SXM2 16GB: for ~$45 a pop. If you only care about VRAM and for some reason do not want a P40, sure, go ahead.
V100 SXM2 32GB: for ~~~$990~~ $700 a pop? These are expected to be dumped into the used market by next year, and should have the price for them to dump significantly. The primary benefit is being able to go from a max of 64GB of VRAM per system to 128GB, and this should serve as a good upgrade path for next year. Strictly for raw compute, these shouldn't make much of a difference in comparison to the 16GB models.
A100 SXM2 32GB: for $4K (some will pop up for ~$1.2k). The benefit of these are that these are allegedly SXM2 cards, and should serve as an upgrade to Ampere. However, a cursory search show that these might not be recognizable at best aand can brick boards at worse if not used in a SXM2-to-PCIe adapter, so be warned. This probably also means that you could salvage them from a scrapyard of autonomous vehicles with them, but this is extreme bargain hunting.
- Homebrewed Windows drivers seems to be required, but not on Linux?
- These allegedly have cut down SM counts compared to their non-Drive counterpart.

Power Supply

The original V100MAXX rentry doesn't thoroughly cover this, as with the Gigabyte systems, you'll have to source your own manner of powering them, as the intended hardware is virtually not available on the market (especially used). For what I went with:

1200W HP server PSUs: for ~$20 or so. I went with two for redundancy and spreading the load between them, as the theoretical total draw is a hair around 1500W.
- You might be able to, instead, opt for two 750W PSUs instead. Your mileage WILL vary as I have not actually checked my total power draw.
PSU breakout board: for $15 each. As much as I do not want to fund cryptoshit, you kind of have to since these breakout boards are geared for cryptomining. Pair it with 14 AWG wire (Amazon has a spool pair) and the smallest flathead screwdriver you can find. The voltmeter is kind of useless, but the power button is nice.

Additionally, you are free to solder your own connections onto the PSU with bigger gauge wire. There's a dozen of diagrams and forum posts documenting the pinout, so I leave this as an exercise for the reader (as I did not go this route to verify). You can also go for a Dell server PSU and its compatible breakout board, if that's more your flavor.

The Gigabyte systems have three bus-bar connectors, each connector with 6 AWG wires to carry a maximum 40A to the power distribution daughterboard. A rough equivalence with 14 AWG wire means you can sort-of comfortably go with three 12V+ and three ground wires off the breakout board, but I went with six 12V+ and ground wires for comfort (but ended up using four each, due to things being quite cramped). Amazon also has a wire stripper + crimper and a box of ring terminals as well that should have enough connectors for 24 wires at 14 AWG.

It is imperative you do NOT have your system exceed the power draw your PSUs can supply. Training off of just one 1200W power supply caused one V100 to stop responding. Additionally, for 120V Americans, you should definitely try to have your system be the only load through its 15A breaker, as it's quite easy to trip it in the middle of a hot day even with two other computers on idle at a very conservative 150W (suggestion) limit per GPU AND the CPUs at a lower power state.

Storage

It should mostly go without having a section for storage, as it's very straightforward, but for the sake of thoroughness:

The T180-G20's default storage options are mainly from the hotswappable 2.5" spinning rust bays. However, it does provide two additional PCIe slots at the front, allowing you to slot in an M.2 to PCIe adapter card (or two). I will caution, however, that there's little airflow at the very front for cooling M.2 SSDs that do run very hot (for example, PCIe 4/5 ones that boast a heatsink). You should be fine, as there should be some airflow being pulled through from the CPU fans, but you should still keep this in mind.

You don't have to go with M.2 SSDs. If you're a weirdo like me sitting on Sun Flash Accelators, or some other fancy decomissioned PCIe SSD, you can go with that. However, do note that if you do specifically go with a Sun F80 or adjacent, they will show up as separate drives that need to be aggregated (RAID0 or whatever fancy vdev striping under ZFS).

The T181-G20 might boast hotswappable "NVMe" 2.5" drive bays. I say might because the manual specifically mentions 4 x 2.5" hot-swappable NVMe HDD/SSD bay (optional). I don't know if this means that some SKUs have U.2 connectors, or that they do but they can be used for either SATA drives or NVMe ones. If you happen to have some U.2 drives on hand, sure go ahead, but don't go shelling out the shekels for them because they're still a bad value proposition in comparison to M.2 drives and any plain old PCIe adapter card.

Again for sake of thoroughness, the manual mentions Intel Optane is supported, but even in 2019, Optane was dead in the water. If you're a freak that uses it in lieu of something like ZFS ARC (and even then I have never needed to bother with tuning for it), sure, go ahead.

Drive nuances for other systems, such as the Dell one, are left as an exercise to the reader.

I focus on M.2 drives since they're a good value proposition factoring in great read speeds, and great read speeds means loading large models are done quick. Both of your Xeons should have plenty of PCIe lanes that one M.2 SSD won't kill it, especially when the PLX chip will handle the nitty gritty of juggling PCIe lanes.

Assembly

Cramming the hardware is relatively straightforward, but for sake of clarity:

for the T180-G20 at least, remove the guiding fan shrouds for one of the CPUs and the GPUs.
the CPUs require a bit of a maneuver to undo the latch bars, aligning the triangle, and CBT as you apply a concerning amount of pressure latching it back on. Conventional wisdom I've heard had you keep the socket protector on and have it pop off by itself as you latch down, but my gut said to pop it off beforehand. Heatsinks are equal agony, as you need to carefully align each screw thread, barely try and screw it enough to catch, then jam down on the screw head for the others to make sure they catch too. The supplied ones with the Gigabyte systems should have the cross-pattern screw order printed on them, but a standard cross-pattern works.
the GPUs are similar: remove the socket protectors, align them down, and it should be obvious when it slots in. The cross-pattern screw order should be printed on the lid of your system (the T180-G20 did at least, the T181-G20 has the privilege of having a manual that you can consult if it is not printed on the lid). The heatsinks are nowhere near as agonizing, and the cross-pattern screw order should also be printed on them.
populate the DIMMs with the blue slots first, then the black slots.
put the guiding fan shrouds back in place. The sole CPU shroud should easily slot in place, but the GPU shroud requires a bit of coercing on the right edge as there's some wires that are getting in the way.
there should be a cover in the middle (T180-G20) or left-hand side (T181-G20) for a PCIe "riser" cable. If you're going NVMe, you can slot in your PCIe-to-M.2 card here, or any other PCIe cards. For the T180-G20, it's a bit of a pain to get this back in, as you need it flush enough to put the lid back on for the entire blade.
you can put your spinning rust in the hotswappable caddies if you wish.

Powering

Powering it, however, requires a bit of a dedicated section, even if it seems straightforward in hindsight. On the back side, there should be two protusions / enclosures, and the three bus-bar connectors. Unscrew:

the two screws for each enclosure on the top to remove the cover.
each of the bus-bar connectors (each one should have four screws).
all six of the wire ends for each bus-bar connector (do NOT lose these screws).

After cutting, stripping, crimping, and connecting enough wires to the PSU's breakout board, take your positive wires and overlap the ring connector, and meticulously screw them onto the contact pad labeled positive. Repeat for the negative. Repeat for additional PSUs (one PSU is fine, but you can go up to three, if room allows). I'll reiterate that space is cramped for the wire to loop back around from the backside, but you might have a better experience. You're free to put the cover for the protrusions / enclosures back on if you've fed the wires through the holes on the back side, and not through the top.

Soldering, again, is left as an exercise to the reader, as I did not go this route, but connecting the wires from the PSU to the board should follow the same endeavor.

One thing to note when using multiple / redundant PSUs with breakout boards: due to the nature of the power distribution board, having only one PSU powered will power all other breakout boards, regardless if the power switch is engaged for the respective breakout boards. This is expected, as "ground"/"negative" is actually 12V return. The PSUs not plugged to an outlet will still remain powered off, as evident with its fan not being active.

If you want to be safe, you may instead connect your wires from the PSU onto a power distribution block, and then a single, large enough wire to the board for each 12V+ and ground. These are used for a myriad of applications, from building wiring to automotive use (since the electrical systems in cars are 12V after all) under the hood or in audio systems. It's preferable to find one that takes in multiple 14 AWG wire (what connects to the breakout board) and outputs 6 AWG wire (the wire that comes out of the stock bus-bar connectors).

Be very sure to check the wires are secure properly, and do not pull themselves out from slight tension over time.

Agony.

Booting / Software Setup

Once you close the lid and connect a keyboard/mouse, your Linux install USB, and VGA connected to a monitor, power on the PSU (with the button on the breakoutboard, or jumping it if you soldered yourself). You'll be greeted with a message initializing the BMC (an SoC for system management), followed by the chipset. The code on the bottom right should cycle between things, but as long as it keeps cycling, you're good. The fans should quiet down after POSTing, and from there it should be straightforward to setup.

The original V100MAXX rentry offers some suggestions on what hypervisors to use. I personally went with Manjaro (partly because I've already spent a good portion of my early 2010s installing Arch manually, partly because my previous training rig used it), but your personal preference distro works fine too.

You do not need to use a hypervisor, as there's inference backends dedicated for serving as inference servers, but if you're ever interested in utilizing additional cores to host other services, and if you're already familiar with managing a hypervisor, then you're more than free to. I'm personally interested in eventually setting up another SmartOS machine and setting up bhyve VMs with the PCIe passthrough, but that's lacking in documentation and I really do not need a hypervisor for a dedicated GPU machine.

As this rentry is about thorougly documenting what I've done, below are the two paths I've went with.

Boot with Open Source Drivers:
- run installation as normal, and reboot off your drive
- after you boot into KDE, set up SSH with systemctl enable sshd, grab the IP via ip addr, then you're free to SSH into the server (or don't, if that's your style).
- uninstall the free drivers (if you booted the Manjaro installation with free drivers) with sudo mhwd -r pci video-linux. This might be optional, but I ended up doing this and rebooting for safety.
- install the nonfree drivers (booting the Manjaro installation with propreitary drivers) wtih sudo mhwd -i pci video-nvidia.
- reboot
- install PyTorch with sudo pacman -S python-pytorch-cuda, as it should have all the required dependencies to ensure nvidia-smi works.
- reboot
- ensure everything works with nvidia-smi (if it doesn't then GLHF. I had to squabble with constant reboots and install/removes).
- setup a Python venv with python3 -m venv ~/venv (or whever you want it), then install pytorch with pip3 install torch torchvision torchaudio (yes you can setup your venv to use system packages but 9/10 it's icky).
- activate the venv with source ~/venv/bin/activate.
- ???
- Profit
Boot with Proprietary Drivers:
- run installation as normal
- That's it! nvidia-smi should already work after installing.

Either options may end up with having a blank screen (most likely when you only have proprietary NVIDIA drivers installed, as this happened only when I installed with proprietary drivers loaded on a new NVMe SSD). If that happens:

Boot back into your install USB

Run:

⎗

1
2
3

sudo mount /dev/nvme0n1p2 /mnt # or wherever your root partition is
sudo mount /dev/nvme0n1p1 /mnt/boot/efi # or wherever your boot partition is, might not even be needed
manjaro-chroot /mnt

Enable sshd with systemctl enable sshd
Note your IP address with ip addr
Do your pacman -Syu here for safety
maybe:
- If you care about having a working DE, install the open source drivers with mhwd -i pci video-linux as well.
- If you don't care, disable it entirely with systemctl disable sddm. I personally don't, as I'll just suck it up and painstakingly do all my file transfers through ssh. (chrome-remote-desktop is unironically the most convenient way to remote into it, but it's out of date and required CBT to get it to work again on my old GPU machine. Java KVM would be a great alternative, but I'd have to set up a Docker container with an old JRE just to use it, because fuck me for getting a T180-G20).
Reboot back into the Linux install, and just do everything else through SSH.

You can also check the topography with nvidia-smi topo -m to ensure that NVLink is working (it should without any additional tweaks). PyTorch should allegedly utilize NVLink with NCCL / DDP, which most PyTorch loaders/trainers/whatever should already initialize with, however I have not had a single instance where it reported to show transfers over NVLink.

Setting up on Debian-based distros should be an easy exercise left to the reader.

Software

For inferencing, the usual suspects for LLMs should be good enough for use, as they should already be aware of multiple GPU setups. llama.cpp compiled with CUDA should be able to utilize your GPUs, while PyTorch based backends like vLLM and ooba's webUI should allow specifying GPU count.

For training, DeepSpeed works out of the box for any trainers as long as you invoke it with the deepspeed command; all the distributed bits should be handled. Vanilla PyTorch trainers will need to wrap the model under DDP, and launch using torchrun. I'll assume HuggingFace's accelerate and PyTorch-Lightning also handles the details with their respective launchers.

However, I will caveat some things due to Volta being an older arch (a lot of optimizations seemed to have came about Ampere-onward):

the original V100MAXX rentry notes that vLLM utilizies a flavor of flash-attention-2 that works under Volta through Triton, but from a cursory glance, it seems that it actually cannot do so without further modifications (which in theory the typical flash-attention-2 implementations should not "work" anyways due to Volta's memory layouts and other arch details). llama.cpp seems to have it recently implemented, some posts mention it just uses FP16 tensor cores, but your mileage may vary.
- xformers is still a useful alternative, but requires manual wrapping if you're using HuggingFace's LlamaModel under transformers.
you're pretty much relegated to float32 and float16 for native, non-quantized performance. bfloat16 isn't implemented until Ampere, and float8 isn't until Hopper / Ada. float16 is still much prefered over float32, but training under float16 requires some bandaids like loss scaling and the like, which deepspeed / any trainer wrapper should handle automagically (at least from my own training use cases).
- I imagine anyone that is already interested in training off of these should already be well acquainted with the do's and don't of training. Training seems to stabilize after warmup, but it always gives me the ick needing bandaids for float16 when bfloat16 Just Works™.
not directly related to Volta, but NVLink doesn't seem to be thoroughly utilized during training under deepspeed. Running nvidia-smi nvlink -gt r shows no for transmitting/receiving, but the links are shown to be active with nvidia-smi nvlink -s. I'm sure I'm neglecting something, but it's something to keep an eye out for if you're hoping to actually utilize NVLink.

Once you have a working venv set up, I recommend installing nvidia_pstate. The nvidia-utils package (or whatever it's called for your flavor of package manager) is required to be installed. nvidia-pstate -ps # will set all cards to the specified pstate. However:

Any value that is not 16 seems to try to throttle all cards to <100W. I get a 2x reduction in training throughput this way, regardless of value.
16 seems to restore the normal pstate.
nvidia-smi will always report P0 for all cards, regardless of what state was set.
However, it does not seem to affect idle...

Some literature out there mentions use of ECC memory will incur a 20% penalty, but other literature mentions that HBM2 does not actually incur a penalty (and doesn't seem so while training). In any case, it can be enabled/disabled with:

⎗

1	sudo nvidia-smi -e 0 # or 1 to enable

Troubleshooting

Here's some issues I ran into that might serve useful, or things to consider. I figured to be thorough with any quirks I've ran into for documentation's sake.

>on chipset initialization, code B7 is stuck on the bottom right

You have bad RAM, replace it. I had this right after accessing BMC system management and caused the main system to freeze. I thankfully bought more RAM sticks than I needed thinking there was no difference between the T180-G20 and T181-G20, so it was a quick fix.

>the case fans are stuck at max speed even after POSTing

You most likely didn't socket a GPU in all the way. Originally I assumed it had to do with some case-open alarm, but this was proven false when I did some shifty things while it was running that required opening the lid during operation, and nothing happened.

> not all 4 GPUs are showing up in nvidia-smi

This is a very stupid thing to do, but you can unplug all the fans for the GPU side of the chassis, let it idle in Linux for a few minutes, then see which heatsink isn't hot (the disconnected one will be warm, but all the connected ones will be hot).

The smarter approach is to ensure the heatsinks are properly mounted, as the GPU I didn't have socketed properly didn't have the heatsink properly mounted. I was able to lift up a side a noticeable amount, but unsocketing the GPU itself, then remounting the heatsink, then resocketing the GPU, fixed it. You might end up with a bad GPU, but it's more likely not socketed right.

> I'm not getting internet!

Ensure you have the ethernet cable plugged into the right jack and not the left. The left is dedicated for the BMC management SoC itself (the IP that shows before POSTing). This does mean that if you want to use the BMC management features, you need two ethernet cables connected to your switch (or router).

> nvidia-persistenced doesn't work!

It seems simply running sudo nvidia-smi -pm ENABLED works. Subsequent sudo nvidia-smi -pl 150 calls will stop bitching about persistence mode not being enabled, and the power limit persists between python3 sessions.

However, this doesn't fully replace nvidia-persistenced being able to store power limits between reboots. If you're that forgetful to call it on boot (or have a script to do so), you can look into setting up a ~~PoetteringD~~ systemd init service.

> pwmconfig doesn't work! These fans are too loud!

To be investigated.

I don't think there's an actual remedy, as reading around for the BMC firmware seems that if you flash the wrong one you're stuck with max fan speeds, so I assume the fans are also goverened by the BMC moreso than the main board itself (this also lends credance to how the fans will slow down after the BMC finishes initializing).

During training, it seems any GPU that exceeds 60C will ramp the fans up another level.

> My python3 sessions keep stalling, and when I run nvidia-smi, I keep getting Unable to determine the device handle for GPU[IDENTIFIER]: Unknown Error!

Your system is trying to draw more power than the PSU(s) can supply. I ran into this issue when I turned off my second PSU since its exhaust fan is stuck at high for who knows why, and kept running into one specific GPU being power starved and crashing training (due to it not responding for synchronization points).

Even power limiting with nvidia-smi -pl 150 doesn't guarantee ALL cards are constrained to it. It's just a suggestion for the card to run at and will throttle itself if it briefly exceeds the limit.

Misc. Considerations

Something you can also do with nvidia-smi is power limit to 200W (or less) if you're genuinely concerned about drawing too much power (or having the chassis fans too loud) with sudo nvidia-smi -pl 200. Make sure persistence mode is enabled with sudo nvidia-smi -pm ENABLED beforehand so this power limit actually sticks between sessions. However, it doesn't fully replace nvidia-persistenced with maintaining power limits between reboots.

I found that actually capping each card to 150W keeps the fans a hair above the speed they run at when the system is idle for a negligble throughput hit during training.

The BMC is a useful feature you can make use of if you have it in another room and you can't be assed to get up and power it on/off. With an ethernet cable plugged into its jack (on the T180-G20 it's the left one next to the USB), you'll get the BMC IP Address printed before POSTing. You can access it from any web browser with the credentials admin:password, and you can muck around in it. The T180-G20 ships with a Java KVM (that I can't be assed to get working), though, while I believe the T181-G20 ships with it using HTML5 (as advertised in its manual). If you really care to use the BMC, I would recommend just getting the T181-G20 instead as you're not stuck with firmware from 2017.

You should also have a dedicated room for this, as the fans are loud. Server racks housing these are a rarity, as the blade is made for Meta's OpenComputeProject (OCP v1) Open Rack standard, where I think the gimmick is that anyone has the specs and blueprints to make their own racks, but being 21" wide, the tradeoff of "more density" is lost when I absolutely cannot find a rack to fit this. Matters are a bit worse when there's different versions of the standard, where v2 and v3 each have different power distribution methods.

The method of using a HP server PSU and a breakout board, and wiring itself is a bit sketchy desu. Cursory skims on G**gle leads to some R*dd*t posts showing burnt connectors on the breakbout board, but it could be written off as not enough wires to distribute the load. The PSU also can be noticeably loud over the server even when there's not even a load on the PSU itself.

Due to a lack of an easily-accessible rack, you may consider building your own noise suppressing enclosure. I was suggested HVAC duct boards outfitted with an intake and exhaust fan with enough space inside to line it with sound deadening material, but you might be able to get away with just any enclosure lined with sound deadening material. Remember, these are meant to go inside racks in a datacenter, not your room.

Some cursory G**gle searches in passing also mention the option of handing it off to a colocation datacenter. I think this is outside the scope of home-labbing your own machine for LLMs, but it's always a consideration if you find housing it yourself too detrimental. Others mention running an ethernet line to their garage or cuckshed.

Cooling

If you're very much concerned about cooling it and want it quiet, you can look into the following, as you're not restricted to containing everything within a 1OU blade:

Water Cooling

An option is to water cool the entire system. Waterblocks for SXM2 cards (P100/V100) seems to come around $180 a pop (there is also a primochill listing but it also seems to have the same storefront with the 1 IN STOCK CLICK NOW HURRY BUY FUD), and you can find waterblocks for the CPUs easily (T180-G20s are LGA2011-3 square ILM, T181-G20s are LGA3647 narrow ILMs). Pair them with adequate plumbing, a reservoir, a pump, and a radiator + fans, and you're good to go and free to remove the case fans. Like all watercooled system though, some airflow in the case is preferable to cool other SMDs on the boards.

Air Cooling

If water cooling is too much, you can look into sourcing large enough heatsinks for SXM2 cards that you can fasten a large enough fan onto, and normal coolers for the CPUs (T180-G20s are LGA2011-3 square ILM, T181-G20s are LGA3647 narrow ILMs). You'll lose the "benefit" it it being in a 1OU blade and have to keep the lid off (or cut), but you'll get the peace of mind of adequate cooling and peace of quiet of removing a shit ton of the tiny fans in the chassis.

Scaling

The original rentry also mentions scalability as a potential "upgrade path" with going this route. The gist is you can buy another machine and connect it with a capable NIC over a network cable. I find this idea quite a bit too outside the realm of the average user, as you really need to know what you're doing when it comes to even housing one of these, much less two or more blades. Additionally, I feel multi-node systems are for those that really know what they are doing, as this goes even beyond any of my use cases at the moment.

If it does interest you, eBay has some Mellanox NICs between cheap and expensive.

Some things to note for the uninitiated:

Single port cards can only connect between two machines (or requires two cards per machine to expand beyond two).
Dual port InfiniBand-based cards apparently cannot daisy chain, but this is only from a cursory search.
InfiniBand-based cards require a unique cable, akin to "fiber optic" NICs. Prices for these cables seem to have gone down a lot since the last time I recall looking into it.
These require additional drivers to make use of them. This GitHub gist seems to be a good starting point. Other cursory searches suggest that the CUDA toolkit has everything else included after installing your NIC's driver.

I also imagine that, on the PyTorch side, additional configuration is required for trainers, but frameworks/loaders like vLLM should have the necessary flags for inferencing for intranode communication.

Software

If you're worried that you're drawing unneeded power with idling GPUs if you're not utilizing them all, you can easily disable unneeded GPUs (where ## is the GPU ID found under lspci | grep NVIDIA or nvidia-smi) with:

⎗

1 2	sudo nvidia-smi -i 0000:##:00.0 -pm 0 sudo nvidia-smi drain -p 0000:##:00.0 -m 1 # -m 0 to enable

NUMA

Because this server uses two CPUs, it is imperative to keep NUMA topology in mind. It's quite easy to lose the die roll and suffer performance penalties simply from lack of NUMA awareness. I am not an expert on which is the best, catch-all numactl invocation, but for anything that would require multiple processes (for example, parallelizing GPU workloads by naively spawning a worker process for each GPU), it is imperative to keep each process within its own NUMA node by prefixing numactl -N# -m#, where # is the node number found under numactl -H. For example:

numactl -N0 -m0 python3 [...]
numactl -N1 -m1 python3 [...]

For training, I would assume that PyTorch's torchrun (or whatever other framework you're using) would be NUMA aware, but thumbing over DeepSpeed's documentation seems it requires passing --bind_cores_to_rank.

For inferencing, I do not think it matters much as this problem only manifests when you have more than one process dispatching GPU work. It shouldn't be beneficial to invoke llama.cpp with --numa flags

Benchmarks

To be expanded upon with better methodologies.

Command R+ (exl2 4.0bpw): 12t/s
- ExLlamav2_HF: 8t/s
- Context length: 24576 (like a pube hair away from OOMing)
LLaMA3 70B Instruct (exl2 4.0bpw): 17t/s
- ExLlamav2_HF: 14t/s
- Context length: 32768 (13GiB free on last GPU)
Mixtral 8x7B (exl2 5.0bpw): 39t/s
- ExLlamav2_HF: 33t/s
- Context length: 32768 (32GiB free, last two GPUs unused)
Mixtral 8x7B (GGUF Q5_K_M): ~34t/s
- CPU: 2t/s
- Context length: 32768 (16GiB free)
Mistral 7B (exl2 5.0bpw): ~52t/s
- ExLlamav2_HF: 43t/s
- Context length: 32768 (54GiB free)
Mistral 7B (GGUF Q5_K_M): 46t/s
- CPU: 4t/s
- Context length: 32768 (51GiB free)
Mistral Large (exl2 3.5bpw): ?t/s
- ExLlamav2_HF: ~8/ts
- Context length: ?
- Numbers need to be actually found, since despite it loading with 32K context, it'll OOM for ~8K contexts.
- GGUFs are unusable for me desu.
QwQ-32B Preview (GGUF Q6_K_L): ~20t/s:
- Context length: 32768 (~44GiB free?)
- -fa even though I still don't believe this should work.
- Some speedup when using draft/speculative decoding with Qwen2.5-Coder-0.5B-Instruct (Q8_0 because smaller quants seem silly), but it only seems to be beneficial when it's outputting code and not doing CoT.
- I would be using exl2 but for once I feel a GGUF is faster.

Notes:

Text generation is not my specialty, but I'm sure I know what I'm doing.
My rudimentary tests are simply a few words as text completion under the Notebook tab and letting it run its course.
Iteration rates are implied to have a margin of error of ±1t/s.
Models that can load entirely on one card does not seem to offer any throughput improvements/hits when manually splitting to all GPUs.
Tests are ran without any power limits. I imagine even 150W per card will have a very negligible throughput rate, as each GPU isn't even pinned to the max 300W.
I'm focusing more on covering general parameter size of "typical" base models. Extrapolate for whatever monstrosity of a merge or flavor of finetune accordingly.
I'm focusing on reasonable quants for a given model (except Command R+. A smaller quant for more context would be favorable). Targetting 5bpw gives the best tradeoff between perplexity hit and speed.
nvidia-smi nvlink -gt r still does not report any transfers over NVLink, so I'm still skeptical if it's actually utilized or I'm just doing something wrong.
ExLlama2 specifics:
- Flags: autosplit, no_flash_attn, (Command R+ required cache_4bit).
- Uses ExLlamav2 model loader for a more apples to apples comparison. ExLlamav2_HF is the commonly used model loader, but HF samplers incur a throughput penalty.
- Because V100s are Volta cards, flash attention is not available. I don't trust xformers to automatically load, or transformer's Llama model to utilize mem_efficient through LlamaSdpaAttention.
llama.cpp specifics:
- Flags: tensorcores, flash_attn
- Uses llama.cpp model loader. llamacpp_HF takes some extra steps and I'm too lazy to go through with it just to show a throughput hit.
- I believe flash_attn is supported, but I don't trust it to be true as it relies on CUDA code, and if hardware agnostic flash attention exists, why isn't it already backported to PyTorch-backends?
- row_split throws fails an assertion and crashes Ooba.
- tensorcores may or may not be silently ignored, as I did not manually compile llama-cpp-python.
- For shits and grins, CPU only throughputs are included. Note I followed nothing extra from the CPUMAXX rentry just to ensure the best performance, since I accidentally stumbled upon them when setting GPU layers to 0.

Afterthoughts

> Is it worth the headaches?

Definitely.

My primary niggle after a month seems to mostly be from the audible two-tones of fan noise while I'm in the adjacent bathroom, but I can live.

However, I would very much push for you to spend an extra $200 or so for a T181-G20 unless you know exactly what you're doing. Having actual documentation and available firmware updates is a godsend. I've yet to actually run into anything catastrophic, but I know I will eventually with my luck of killing DDIMs on even my previous GPU slave machine.

> Is it worth the money?

Maybe.

If you do not already have a dedicated GPU slave machine, definitely. The entire system is cheaper than solely a group of GPUs (be it 3090s, 4080s, or 7900XTXs).

If you already have an existing GPU slave machine (like I did), then I would probably just consider stacking more GPUs instead. The only upgrade path with V100MAXXING is with the 32GiB SXM2 cards for next year when they're dumped into the used market (and it's just a VRAM increase), or crossing your fingers on A100 SXM2s being dumped in any reasonable time frame (for better creature comforts Ampere offers).

I'm also a bit hard pressed to suggest this unless you really know how to get your money's worth out of this, be it actually making use of text-gen LLMs or training your own models under PyTorch. I'm only really pleased with it for having a huge speedup over either my 4070Ti or my 7900XTX when it comes to my experiments (being able to get results after a day or two is orgasmic in comparison to maybe waiting 4 or 6 days to see results on my previous setup).

> Is it worth the power bill?

Probably not.

Each night I'm letting my machine crank through batches to train, I dread thinking about how much each night is actually costing power-wise. Savvy power users might already have a "free nights" plan they can abuse, but I don't.

The bigger strain comes from the V100s idling at ~45W, as they do not automatically set their pstates to anything besides P0. nvidia-pstates seems to have some affect, but a PoetteringD service is required to set them on boot.

I don't even want to think about how much power each CPU and all the fans are actually drawing. If it were a more modern system like a Zen 4 with Ampere/Ada cards, I would feel more comfortable, but the T180-G20 is Broadwell and Volta (you might get more comfort with a T181-G20 on Skylake).

Changelogs

This is mainly to aide anyone that happens to see that this rentry updates, but are unsure of what got added.

2024.05.02:

V100 SXM2 32GB link corrected
Added a note for PCIe adapter boards
Added a section for scaling
Added suggestion section for better air/water cooling
Rephrased RAM explanation

2024.05.03:

Rephrased the first paragraph a bit
Added a note for the T180-G20 that the lack of documentation is a primary reason for this rentry
Added a note about multiple / redundant PSUs and breakout board behavior
Added a suggestion for using power distribution blocks for connecting the PSUs to the server
Added a remark in the considerations about the BMC being the one governing chassis fan speeds
Added a citation for the A100 SXM2, and noted it's 32GB.

2024.05.04:

Added some notes under Booting / Software Setup.
Added a section for notes under Software to detail some observations and suggest notes when training.
Added a brief initial benchmark note comparing my previous training setups.

2024.05.15:

Heavily revised remarks under Booting / Software Setup, as I installed onto a new NVMe SSD and ran into a different set of issues.

2024.05.20:

Replaced nvidia-persistenced mentions with nvidia-smi -pm ENABLED, as that makes the V100s somewhat happy (despite it working fine on goysumer NVIDIA cards).
Removed irrelevant remarks under Benchmarks, because it's a bigly disparate apples to oranges comparison (my expectations are all over the place).
Added a section for Storage. Again, it should go without needing to mention it, but I feel it might help to explicitly say "go with an M.2 SSD + a M.2 to PCIe adapter".
Added a note in the beginning that I'm focused more on using this for training pet projects rather than gooning to text-gen.
Revisited my afterthoughts about it being worth the headache.

2024.05.24:

Added a note about PSU requirements (as one 1200W will NOT suffice).
Added an update remark about nvidia-smi -pm ENABLED not working as a full nvidia-persistenced replacement.

2024.05.31:

Added initial throughput rates for common models under Benchmarks
Added a rudimentary table of contents.

2024.06.01:

Rewrote the Afterthoughts section to be more coherent, and to better reflect my first month of owning this.
Added note about nvidia-pstates seemingly actually working as a way to further throttle cards, despite it not offering enough control.

2024.06.03:

Adjusted V100 16GB's price as they've raised in price.

2024.07.29:

Adjusted note about NVIDIA DRIVE A100s.
Rudimentary numbers for Mistral Large (exl2).

2024.12.08:

Added horror.
Updated some pricing for SXM2 cards.
Added a note about SXM2 => PCIe cards becoming more available on eBay.
Added QwQ-32B-Preview under Benchmarks

2025.02.18:

Added remarks about NUMA nodes under NUMA.
Added a note about disabling GPUs under Misc. Considerations/Software.
Added remarks about memory under Systems.

2025.03.06:

Added remark about enabling/disabling ECC mode.

V100MAXXING

Table of Contents

Hardware

Systems

T180-G20

T181-G20

Other

PCIe

GPUs

Power Supply

Storage

Assembly

Powering

Booting / Software Setup

Software

Troubleshooting

Misc. Considerations

Cooling

Water Cooling

Air Cooling

Scaling

Software

NUMA

Benchmarks

Afterthoughts

Changelogs

Warning