Thoughts on distributed large language model computing networks and Distributed Hash Tables for powering chatbot applications.
Note that while I consider myself a proficient software engineer, I am almost completely new to machine learning, and should be considered a layman for the time being. Please take what I say, especially when referring to technical details, with a grain of salt. I request input from more experienced anons.

Problems

The consolidation of processing power in the hands of a few technology corporations causes multiple side effects that hamper the activities of chatbot enthusiasts. Much of the groundwork has been laid for broader growth and participation in recent AI trends (OPT, LLaMA, and BLOOM, to name a few), but the hardware most users would need to take advantage of it is tragically out of reach. So, inevitably, the most effective strategy became reliance on cloud services and APIs, whose inflated costs and heavy-handed "safety" restrictions should not be news to the seasoned observer.
However, this same group collectively owns many lower-cost, lower-power devices whose computing resources sit idle for most of the day. Chatbot enthusiasts are highly dedicated users who already take part in many collaborative projects to help others join in the fun, and volunteer projects that attempt to compete with centralized alternatives could see success.
The closest existing attempt to cluster volunteer resources for LLM inference is KoboldAI Horde. While a good step forward, it is lacking because the software naively parallelizes its queue across single workers: each one must still have the memory and capacity to handle the full model singlehandedly. Potential volunteers who could form efficient arrays together simply cannot join the network, and models above a certain size remain completely infeasible to serve. I believe this design flaw is behind the network's low capacity, slow response times, and limited scale.
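To illustrate why the full-model requirement shuts out small workers, here is a rough back-of-the-envelope memory estimate. The fp16 assumption and the 44-way split are illustrative numbers, not measurements of any real deployment:

```python
def min_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM floor just to hold the weights in fp16, ignoring
    activations, KV cache, and framework overhead."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# A 176B-parameter model needs roughly 328 GiB for weights alone,
# so no single consumer GPU can host it...
whole_model = min_vram_gb(176)

# ...but cut into 44 partitions, each worker only needs ~7.5 GiB,
# within reach of an ordinary 8 GB consumer card.
per_worker = min_vram_gb(176) / 44
```

Under these assumptions, the difference between "impossible" and "any gaming PC can help" is purely a matter of how the work is divided.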

Solutions

In the world of enterprise and research-focused development, there exist solutions closer to what we are looking for. Model parallelism splits the model itself into n parts and runs operations through a sequence of workers, each of which only needs to process data for its own partition. To minimize the latency introduced by network communication and coordination, metrics such as network speed and overall throughput can be used to dynamically choose the workers most likely to finish a given task soonest. Petals (for inference and finetuning) and Hivemind (for training) are, in my opinion, the most promising examples for this specific use case. Using any number of their techniques within a greater network that can signal requests for training capacity, add new models on the fly, transfer chatlogs for finetuning, and provide leaderboards and "game" rewards for contribution, an extremely robust system is within reach.
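The worker-selection idea above can be sketched in a few lines. This is not how Petals' actual scheduler works; it is a minimal greedy picker, and the `Worker` fields and cost model (round-trip time plus per-token compute) are assumptions of mine:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    rtt_ms: float        # measured network round-trip to this peer
    ms_per_token: float  # measured compute time for its model partition

def plan_route(stages: list[list[Worker]]) -> list[Worker]:
    """For each model partition, greedily pick the candidate worker
    with the lowest estimated time to finish one inference step."""
    return [min(candidates, key=lambda w: w.rtt_ms + w.ms_per_token)
            for candidates in stages]
```

A real scheduler would also account for load, reliability, and the hop cost between consecutive workers, but even this naive version captures the point: routes are chosen dynamically from live measurements, not assigned statically.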

Playing with Petals

Petals
Colab Screencap
Chatbot Screencap

The live Petals inference network serves a >175B model with a response time under 5 seconds on a fresh prompt after connecting to the swarm. Only eight peers were backing it at the time of writing. The minimum recommended GPU memory to join the swarm is only 8GB.
Yes, there are many other factors to consider, and none of my tests were intended as scientific evaluations of the network, but still: decently impressive for a model that previously needed multiple A100s ($10,000 each) to even touch...

It is important to state that "naive parallelization" could still be used on top of model parallelization to build multiple efficiently composed stacks. You can make as many or as few splits in the model as needed.
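To make the "stacks on top of splits" idea concrete, here is a toy capacity calculation: given a pool of heterogeneous workers, how many complete replicas of a split model can the pool form? The equal-partition assumption is a simplification (real layer splits are uneven), and all the numbers are made up:

```python
def count_replicas(worker_vram_gb: list[float], n_splits: int,
                   model_gb: float) -> int:
    """How many complete model replicas a worker pool can form when
    the model is cut into n_splits equal partitions and each worker
    serves exactly one partition."""
    part_gb = model_gb / n_splits
    capable = sum(1 for v in worker_vram_gb if v >= part_gb)
    return capable // n_splits

# Six volunteers, a 16 GB model: nobody can host it alone, but a
# 2-way split yields two full replicas serving requests in parallel.
pool = [8.0, 8.0, 8.0, 8.0, 6.0, 12.0]
```

More splits let smaller cards join at the cost of more network hops; fewer splits mean fewer hops but a higher bar to entry. That trade-off is a tuning knob, not a fixed property of the network.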

Considerations

Of course, systems like this can't support expectations of privacy very well. Some element of randomization when picking worker routes may be needed to prevent a "bad" node from snooping on a session. Even then, a determined attacker could simply register more nodes... you could then try to prioritize workers that have been on the network longer, but you see where I'm going with this. It's kind of a difficult problem to solve.
To prevent dishonest leechers from abusing the network by piping their product's requests down to the cluster, workers should probably keep track of how active any given user actually is and how much they contribute, and use that to heavily deprioritize peers that aren't actively contributing or communicating with the network. Think of BitTorrent's "tit-for-tat."
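The randomized, seniority-weighted route picking described above might look like this sketch. The uptime weighting is the "prioritize older workers" idea from the text, and as noted there, it is not a real defense against a patient Sybil attacker; every knob here is illustrative:

```python
import random

def pick_worker(workers: list[tuple[str, float]],
                rng: random.Random = random) -> str:
    """Pick one worker at random, weighted by how long it has been on
    the network (uptime in hours). Randomization keeps any single node
    from reliably landing in one user's route; uptime weighting makes
    freshly registered snooping nodes less likely to be chosen."""
    names, uptimes = zip(*workers)
    return rng.choices(names, weights=uptimes, k=1)[0]
```

A long-lived peer gets picked far more often than a fresh one, but the fresh one is never fully excluded, so the attacker's cheapest move (spinning up new nodes) buys them very little per node.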
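A tit-for-tat ledger in this spirit could be as simple as tracking work served versus work consumed per peer. The units ("tokens") and the smoothing constant are made-up illustrations, not a proposed protocol:

```python
from collections import defaultdict

class Ledger:
    """Per-peer ratio of work contributed vs. work consumed,
    in the spirit of BitTorrent's tit-for-tat."""
    def __init__(self) -> None:
        self.served = defaultdict(float)    # tokens this peer computed for others
        self.consumed = defaultdict(float)  # tokens others computed for this peer

    def priority(self, peer: str) -> float:
        # +1 smoothing so brand-new peers start at a neutral priority
        # of 1.0 instead of being instantly starved.
        return (self.served[peer] + 1) / (self.consumed[peer] + 1)
```

Schedulers would then serve high-ratio peers first, so a product quietly piping thousands of requests into the cluster while contributing nothing drifts to the back of every queue on its own.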

Other

There was a part here about using a DHT to store character data and prompts, but it didn't flow well with the rest of my points. Just take it as a standing point: it would probably be a nice bonus to permanently archive our cards someplace, and a DHT can do that stupidly well, though it wouldn't be necessary for anything else I talked about and, to be honest, probably isn't a day-one concern in the slightest.
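For what it's worth, the card-archiving idea maps naturally onto content addressing: store each card under the hash of its contents, so the key doubles as an integrity check and identical cards automatically dedupe. A dict stands in for the real distributed store here; the class and card fields are invented for illustration:

```python
import hashlib
import json

class MockDHT:
    """Stand-in for a real DHT (e.g. a Kademlia network); a plain
    dict plays the role of the distributed key/value store."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: str) -> bytes:
        return self._store[key]

def publish_card(dht: MockDHT, card: dict) -> str:
    """Store a character card under the SHA-256 of its canonical
    JSON encoding and return that key for sharing."""
    blob = json.dumps(card, sort_keys=True).encode()
    key = hashlib.sha256(blob).hexdigest()
    dht.put(key, blob)
    return key
```

Anyone holding the key can fetch the card from any node and verify it wasn't tampered with by re-hashing, which is exactly the property you want from a permanent volunteer-run archive.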

Contact

...just in case.
biscuitsyeah AT protonmail D0T com

Pub: 28 Mar 2023 03:25 UTC