Magnum 32b v2 - Public Proxy

Status: Offline. Public link (OAI endpoint. Experimental): Private link (Kobold Classic endpoint): Discord only.

The proxy along with the generated tokens will be reset if abuse or further abuse is detected.

Responses may be delayed up to one minute or more. This is being hosted by a single person :wah:

What is this

A public proxy hosting Magnum 32B v2 made by the good folks over at https://anthra.site/ & https://sillytilly.org/

While you all may be mourning the loss of 3.5 Sonnet proxies dying and having to resort to 4o again, do not fret - You can use Magnum 32B v2 ( Which can be hosted on your own computer) or through this API

Info

Max context: 30720
Output: 512-2048

How to setup (OAI)

  1. Go to the "Api Connections" tab.
  2. Open the list of available APIs and select Chat Completion, then select OpenAI as your completion source.
  3. Paste PUBLIC proxy endpoint on "Proxy Server URL" with your token.
  4. Enable 'Show "External" models (provided by API)'
  5. Select the model called koboldcpp/magnum-32b-v2.
  6. Use a clean preset (no JB needed) and follow instructions of Kobold setup (only 4 and 5).

How to setup (Kobold)

  1. Go to the "Api Connections" tab.
  2. Open the list of available APIs and select KoboldAI Classic.
  3. Paste URL in the box of the page (URL at the top of the page) and connect.
  4. Go to "Advanced Formatting" (Tab with the symbol "A" in SillyTavern).
  5. Use the preset called ChatML in "Context Template" and "Instruct Mode", enabling the latter.
  6. For the preset, use either Universal Light or Universal Creative.
  7. That's all! Go and chat with your characters, shit's better then GPT 4o.

Made by Anthracite & SillyTilly

Alt Tag Alt Tag

Hosted by SmileyTatsu SmileyTatsu

Proxy Status

Last updated: 8/9/2024, 4:11:51 AM (Terminated)

{
    "last_process": 1.1524838209152222,
    "last_eval": 71.02339172363281,
    "last_token_count": 171,
    "last_seed": 176790,
    "total_gens": 101,
    "stop_reason": 0,
    "total_img_gens": 0,
    "queue": 1,
    "idle": 0,
    "hordeexitcounter": -1,
    "uptime": 7528.458552122116,
    "idletime": 10.208022356033325
}

Local/Cloud Hosting

Section dedicated to explaining how to host your own models! By request of an anon.

I'm not a professional on local models, but I think I can help other anons host a model without having so much trouble.

Selecting the model

There are many models to choose from. The ones I can recommend are the Magnum models, which have been among the few I have tried and I can say they have decently good quality. But it all depends on whether you want to host on your PC or in the cloud and how much you would like to pay for the second option. Some model recommendations to get you started:

Magnum 12B v2 - 12 billion parameter model, small enough to be used on 12GB VRAM GPUs with the following configuration:

1
2
3
4
Quant model: Q4_K_M
Context Size: 16384
VRAM used: 11.07
https://huggingface.co/bartowski/magnum-12b-v2-GGUF

Magnum 32B v2 - 32 billion parameter model, it has the size to be used on 24GB VRAM GPUs with the following configuration:

1
2
3
4
Quant model: Q4_K_M
Context Size: 16384
VRAM used: 23.65
https://huggingface.co/bartowski/magnum-32b-v2-GGUF

Magnum 72B v1 - 72 billion parameter model, this has the same quality as its 32B v2 version. You would just use more resources but if you want to try.... It has the size to be used on 48GB VRAM GPUs with the following configuration:

1
2
3
4
Quant model: Q4_K_M
Context Size: 14336
VRAM used: 47.23
https://huggingface.co/bartowski/magnum-72b-v1-GGUF

These are only 3 configurations for those who (normally) would host on their own PC, 12, 24 and 48 GB VRAM if you are a millionaire. I recommend substituting the 72B model in favor of the 32B if you have 48GB VRAM to use. Why? Because you can use the 32k context that the model offers, a higher quant and still live. This is the configuration that the proxy uses to host Magnum 32b v2:

1
2
3
4
Quant model: Q6_K
Context Size: 32768
VRAM used: 35.49
https://huggingface.co/bartowski/magnum-32b-v2-GGUF

I will briefly explain what quants are soon. You can calculate how much memory you would need for X model with Y context and a certain quant using this calculator: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator

Local or Cloud

This decision depends on your PC and your minimal tastes. Do you have a PC powerful enough to run local models? You are on the way to host it on your PC. Do you feel that the model you can host does not satisfy you enough? Then you should go to the cloud.

Local

Advantages

  • If you already have the equipment, you don't need to depend on paying third parties.
  • You don't have to wait for the models to be installed each time (or pay for storage).
  • No dependence on third parties.
  • Completely anonymous.

Disadvantages

  • Requires heavy hardware.
  • Can be slower if your GPU is at the limit.
  • Sometimes will require you not to do other things to avoid overloading your computer.
  • Can be tedious to configure for some.

Cloud

Advantages

  • You don't need to pay hundreds of dollars for a GPU.
  • Usually the most affordable and simple options include templates to configure the model, making it a matter of a few clicks.
  • Cheaper in the short/medium term compared to buying a GPU.
  • Accessibility to switch GPUs whenever you want.
  • Pay for what you use. 0.3%
  • Only 0.3 USD per hour to use an A40 GPU to host Magnum 70B v1 at 20k context. Or host (the top) Magnum 32B at 32k context with extra VRAM for whatever you want.

Disadvantages

  • If you don't pay for storage, you should install the model you want to use each time (usually it doesn't take long because of the speed, but it depends on your provider).
  • It depends on whether the provider has a certain GPU available.

Taking the above into account you must decide what you will do. I personally have the equipment to host 12B models with a 16k context, however, I am currently paying for cloud hosting to host 32B for this proxy. To be honest if you are not sure is try and see, try local and if you don't like the quality throw some dollars at runpod.

Hosting on local

I am just going to list the steps you must do to host the model.

  1. Look for the model you like and can host on your computer (we will handle the GGUF versions because they are easier to explain and have support for CPU splitting, which I will not explain here but at the end of this Rentry I will list others where it is explained).
  2. Download the quant you need (I recommend at least a quant Q4. Anything below that decreases the quality too much. I don't recommend a Q8, it's better to use a Q6 thanks to saving space and minimal quality loss).
  3. Save it in a folder where you have all your models (to make it easier, save it in a folder called KoboldCPP and create a subfolder called models. KoboldCPP/models/your_model.gguf).
  4. Download KoboldCPP. Follow the instructions on https://github.com/LostRuins/koboldcpp (if you are on Windows save the .exe in the KoboldCPP folder created previously so we can execute commands with it. If you are a Linux user you should know what to do).

Now it depends on how you want to execute it. If you want to execute the .exe (mostly Windows users) or use CMD.
.exe method:

  1. Execute koboldcpp.exe and wait for it to start. It should open a CMD and a UI, ignore the CMD for now.
  2. Normally, you don't need to change much unless you want to experiment. If you don't feel like it, don't touch anything (and hope it works), if you really want to know what is happening, read the official wiki https://github.com/LostRuins/koboldcpp/wiki.
  3. If you are not touching anything OR you are using CuBLAS, enable Flash Attention on Quick Launch, Increase/Decrease the Context Size to the size you want and choose the model. If you want to share the model to other devices, enable Remote Tunnel. Then just click launch.
  4. After clicking launch, the UI will close. Now you need to open the previously open CMD and you will see a lot of things appearing, this is just the model loading and showing the full configuration, usually you don't care unless you are a tech. In the end, you will have something like this, these are the links for each connection (last one is for the web UI):
    1
    2
    3
    4
    5
    6
    7
    8
    Cloudflared file exists, reusing it...
    Attempting to start tunnel thread...
    Starting Cloudflare Tunnel for Windows, please wait...
    Your remote Kobold API can be found at https://ce-sbjct-wesley-publishers.trycloudflare.com/api
    Your remote OpenAI Compatible API can be found at https://ce-sbjct-wesley-publishers.trycloudflare.com/v1
    ======
    
    Your remote tunnel is ready, please connect to https://ce-sbjct-wesley-publishers.trycloudflare.com
    

If you are running this for the first time, it may install Cloudflare for you.

  1. As a last part, just follow the #How to Setup (Kobold) section.

This section is for magnum models (at least for 72B v1 and 32B v2), other models may require a different Context Template and Instruct Mode settings for them to work (should be listed on main model page). Also playing with the preset settings is recommended.

Hosting on cloud

I am just going to list the steps you must do to host the model.

  1. Look for the model you like and can host on your computer (we will handle the GGUF versions because they are easier to explain and have support for CPU splitting, which I will not explain here but at the end of this Rentry I will list others where it is explained).
  2. Copy the link for the download of the quant you want to use. Right click on the download icon and copy the link. After that, save the link somewhere for later.

Now, I will explain the runpod method since this one is the easiest for me.

  1. Go to https://www.runpod.io/ (or use my cute referral link bleh https://runpod.io?ref=urj78syx) and create an account. Then, put some credits in there (I don't know the minimum honestly, 10 should be more than enough to try for like 30 hours (remember that you can turn it off, so this can last for a full month if you only use 1 hour per day lol) on the settings I will show, more if you use a cheaper GPU).
  2. Go to the pods section and click on "Deploy a Pod". You will be shown a list of GPUs, if you are doing this alone you can just choose A40, cheap 48GB VRAM GPU (cheaper than some 24GB VRAM options and more worth imo).
  3. Now you need to go and search for the "KoboldCpp - Official Template - Text and Image" template. Should be the only one appearing when searching for "KoboldCPP".
  4. Click on "Edit Template".
    • Edit Container Disk (Temporary) for the size of your model + 10GB (I just like to give some free space in case of anything that could happen). Usually 100GB is overkill and will only take more of your cute money.
    • Click on "Environment Variables" and remove KCPP_IMGMODEL and KCPP_WHISPERMODEL, these are just so you can do image gen or use the whisper model, unless you know what you are doing you should not have them since this will slow down the startup.
    • Edit the variable called KCPP_MODEL and replace the default value for your GGUF model link (saved previously).
    • You may need to edit the variable called KCPP_ARGS, the default value is --usecublas mmq --gpulayers 999 --contextsize 4096 --multiuser 20 --flashattention --ignoremissing. I recommend only changing the contextsize to the correct one, multiuser is the max queue that Kobold will handle before throwing errors. For more flags or knowing what a certain one does, read the wiki https://github.com/LostRuins/koboldcpp/wiki.
  5. The last thing you will need to change is Instance Pricing. Unless you are planning to run a 24/7 service or something like that, you can't care less about the "On Demand" option, it's just more expensive. Change it to "Spot" which basically means that if the GPU is needed elsewhere, they will take it away from you. But considering you only want it for a while and also the A40's are not in demand, you'll be fine and pay less.
  6. Just click deploy after that, you can see how much you are going to spend per hour on the bottom part.
  7. After clicking deploy, it will wait some seconds before redirecting you to "My Spots", open the spot that is starting, click logs, switch to "Container Logs" and click on the arrow on top (Tail Logs), this will show you the process of startup, after that, you will have something like this, these are the links for each connection (last one is for the web UI):
    1
    2
    3
    4
    5
    6
    7
    8
    Cloudflared file exists, reusing it...
    Attempting to start tunnel thread...
    Starting Cloudflare Tunnel for Windows, please wait...
    Your remote Kobold API can be found at https://ce-sbjct-wesley-publishers.trycloudflare.com/api
    Your remote OpenAI Compatible API can be found at https://ce-sbjct-wesley-publishers.trycloudflare.com/v1
    ======
    
    Your remote tunnel is ready, please connect to https://ce-sbjct-wesley-publishers.trycloudflare.com
    

If you are running this for the first time, it may install Cloudflare for you.

  1. As a last part, just follow the #How to Setup (Kobold) section.

This section is for magnum models (at least for 72B v1 and 32B v2), other models may require a different Context Template and Instruct Mode settings for them to work (should be listed on main model page). Also playing with the preset settings is recommended.

More questions?

Contact me via Discord or ask on the /lmg/ thread (to be honest, this one is the better option lol, they should know a lot more than me).
Discord username: SmileyTatsu.

  1. Search for the /lmg/ or Local Models thread, they have a lot of cool links about locally hosting LLMs
  2. VRAM calculator: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
  3. Search for models: https://huggingface.co/models
  4. Quantify models: https://huggingface.co/spaces/ggml-org/gguf-my-repo
  5. Explanation of models: https://rentry.org/lmg-spoonfeed-guide#4-models

If you would like to support me in continuing to invest time in hosting and administrating these proxies, donations are always welcome >~<

Patreon

Edit
Pub: 04 Aug 2024 02:29 UTC
Edit: 09 Aug 2024 04:14 UTC
Views: 5245