ChatGPT - If Stack Overflow and Reddit had a child

I'm still using a 1060 6GB, so not much; it seems these chatbots require an insane amount of VRAM for some reason.
It might be incredibly difficult to run a local coding model on a 1060. It's not impossible, but the token output will be around 0.25 tokens/second, and to load the model fully you are going to have to cut the context length in half (from 2048 to 1024), which will severely limit the output of the model. Depending on what CPU you have available, you could use a GGML model (which runs most of the model on the CPU) to load a 7B model. The best local model I've had experience with for programming-related tasks is StarCoder. It seems to score the highest on most tasks compared to other models across HuggingFace.
(attached screenshot of benchmark scores)

I would try this first:
1. Download and install text-generation-webui - a frontend for running LLMs
2. Update the webui to the latest version (follow the instructions on the GitHub repo for your OS)
3. Look for a quantized 7B GGML model - this HF account has a lot of them in different formats https://huggingface.co/TheBloke - I suggest this one: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML
4. Download the model and then run it to see what type of results you get with your hardware (there's a rough sketch of doing this from a script below).
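If you'd rather sanity-check a model from a script instead of the webui, something like this works. It's only a rough sketch with llama-cpp-python: the repo and filename are examples (check TheBloke's file list for the real names), and I don't think plain llama.cpp loads the Falcon GGML one above, so this uses a LLaMA-family 7B instead.

# minimal sketch: load a quantized 7B GGML file and generate a bit
# pip install llama-cpp-python huggingface_hub (offloading needs a CUDA-enabled build)
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/WizardLM-7B-uncensored-GGML",     # example repo, any LLaMA-family GGML works
    filename="WizardLM-7B-uncensored.ggmlv3.q4_0.bin",  # example filename, check the repo's file list
)

llm = Llama(
    model_path=model_path,
    n_ctx=1024,        # halved context so it has a chance of fitting next to a 6GB card
    n_gpu_layers=10,   # guess for a 6GB card; 0 = pure CPU
    n_threads=8,       # set to your physical core count
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])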
 
  • Informative
  • Like
Reactions: anustart76 and Puff
Here's mine with GGML; I managed to load a 65b model on my 3090, though it is dog slow:
text-generation-webui | 2023-07-16 01:05:24 INFO:cache capacity is 0 bytes
text-generation-webui | llama.cpp: loading model from models/TheBloke_airoboros-65B-gpt4-1.4-GGML/airoboros-65b-gpt4-1.4.ggmlv3.q4_K_M.bin
text-generation-webui | llama_model_load_internal: format = ggjt v3 (latest)
text-generation-webui | llama_model_load_internal: n_vocab = 32000
text-generation-webui | llama_model_load_internal: n_ctx = 4096
text-generation-webui | llama_model_load_internal: n_embd = 8192
text-generation-webui | llama_model_load_internal: n_mult = 256
text-generation-webui | llama_model_load_internal: n_head = 64
text-generation-webui | llama_model_load_internal: n_layer = 80
text-generation-webui | llama_model_load_internal: n_rot = 128
text-generation-webui | llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
text-generation-webui | llama_model_load_internal: n_ff = 22016
text-generation-webui | llama_model_load_internal: model size = 65B
text-generation-webui | llama_model_load_internal: ggml ctx size = 0.19 MB
text-generation-webui | llama_model_load_internal: using CUDA for GPU acceleration
text-generation-webui | llama_model_load_internal: mem required = 23387.92 MB (+ 5120.00 MB per state)
text-generation-webui | llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 1600 MB VRAM for the scratch buffer
text-generation-webui | llama_model_load_internal: offloading 38 repeating layers to GPU
text-generation-webui | llama_model_load_internal: offloaded 38/83 layers to GPU
text-generation-webui | llama_model_load_internal: total VRAM used: 19321 MB
text-generation-webui | llama_new_context_with_model: kv self size = 10240.00 MB
text-generation-webui | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
text-generation-webui | 2023-07-16 01:05:30 INFO:Replaced attention with xformers_attention
text-generation-webui | 2023-07-16 01:05:30 INFO:Loaded the model in 6.87 seconds.
The important parts being:
text-generation-webui | llama_model_load_internal: offloading 38 repeating layers to GPU
text-generation-webui | llama_model_load_internal: offloaded 38/83 layers to GPU
text-generation-webui | llama_model_load_internal: total VRAM used: 19321 MB

That means 38 layers out of 83 were offloaded onto the GPU.

It can fit within 24GB alongside everything else the OS has on the GPU. I get about 1 token per second, which is pretty slow compared to a 33b GPTQ model fully loaded into VRAM pushing 20+ per second, but the output quality difference is massive.
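If you want a rough feel for where that 38 comes from, here's a back-of-envelope sketch. The sizes come straight from the log above; the overhead figure is just my guess for the scratch buffer, KV cache share and desktop, not something the loader reports.

# rough estimate of how many layers of the 65b q4_K_M model fit on a 24GB card
model_size_mb = 23388 + 19321          # CPU-side + GPU-side memory reported by the loader, ~42.7 GB total
total_layers = 83                      # 80 repeating layers plus the non-repeating parts
per_layer_mb = model_size_mb / total_layers

vram_mb = 24 * 1024                    # 3090 / 4090
overhead_mb = 5000                     # guess: scratch buffer, KV cache share, OS/desktop

fits = int((vram_mb - overhead_mb) / per_layer_mb)
print(f"~{per_layer_mb:.0f} MB per layer, roughly {fits} layers fit")   # lands right around 38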
 
  • DRINK!
Reactions: 888Flux
The best way to get GPT's opinion on things is to praise something it doesn't support, because it'll go all whiny about it.

(attached screenshot: GPT Uganda.png)
 
Just to let everyone know that as I'm typing this, DeMod is kill. ChatGPT found a way to flag your naughty posts even with DeMod on.
 
While talking with my boss about it, I realized the whole AI craze will massively diminish once someone finds a way to hack it to run scripts internally, or whatever other massive security issue shows up once sites let users do everything through AI.
I’m so tired of this doomerism surrounding AI tools. Every fucking day someone is fearmongering about this bullshit and that bullshit and the other bullshit. Not your comment necessarily, it’s mostly tech writers. Get a grip for fuck sake.
 
  • Agree
Reactions: A-A-AAsssston!
1080ti barely runs llama, fuck yeah.

These are what the big boys use, $15000 each, with 80GB VRAM.
I am already struggling on my 3090 with 24GB VRAM to extend the llama models to use 4096/8192 context tokens to make the models sound less like dementia patients.

I have also been using SillyTavern as a frontend to Oobabooga's webui and KoboldCPP; it's much nicer as a drop-in chat system.
See https://chub.ai/ for some characters for SillyTavern, some better than others.
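Most of that context pain is the KV cache. You can reproduce the "kv self size = 10240.00 MB" line from the log above from the model's own dimensions and see what doubling the context costs; this is just the standard 2 x n_ctx x n_embd x n_layer x 2 bytes (fp16) formula, nothing measured on my end.

# KV cache size for the 65b model in the log above (n_embd=8192, n_layer=80), fp16 K and V
def kv_cache_mb(n_ctx, n_embd=8192, n_layer=80, bytes_per_elem=2):
    return 2 * n_ctx * n_embd * n_layer * bytes_per_elem / (1024 * 1024)   # K + V

print(kv_cache_mb(4096))   # 10240.0 MB, matches "kv self size = 10240.00 MB"
print(kv_cache_mb(8192))   # 20480.0 MB, which on its own nearly fills a 24GB card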
 
  • Informative
Reactions: A-A-AAsssston!
That means 38 layers out of 83 were offloaded onto the GPU.

It can fit within 24GB alongside everything else the OS has on the GPU. I get about 1 token per second, which is pretty slow compared to a 33b GPTQ model fully loaded into VRAM pushing 20+ per second, but the output quality difference is massive.
What is the token output per second?
 
The tokens per second is how fast the model can output text. Each token is roughly a word or a syllable; I'm not sure exactly how it's counted. Suffice to say, the higher it is, the faster the text is generated. I get the best performance using the ExLlama loader, but the entire model needs to fit into the GPU.
It's worse if you also want to use SuperHOT 4096/8192-length context tokens; 33b in 4096-token context mode can sort of fit into the GPU if I'm lucky.

TheBloke has just quantized the new Facebook 70b model to GGML: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
Support in llama.cpp and KoboldCPP is still experimental.
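If you want the number for your own box, the webui prints it after every generation, but you can also time it yourself. A minimal sketch with llama-cpp-python; the model path and settings are placeholders, point it at whatever GGML file you actually downloaded.

# time a generation and work out tokens/second yourself
import time
from llama_cpp import Llama

llm = Llama(model_path="models/your-model.ggmlv3.q4_K_M.bin", n_ctx=2048, n_gpu_layers=38)

start = time.time()
out = llm("Explain what a context window is.", max_tokens=200)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]   # tokens actually produced this run
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.2f} tokens/s")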
 
The tokens per second is how fast the model can output text. Each token is roughly a word or a syllable; I'm not sure exactly how it's counted. Suffice to say, the higher it is, the faster the text is generated. I get the best performance using the ExLlama loader, but the entire model needs to fit into the GPU.
It's worse if you also want to use SuperHOT 4096/8192-length context tokens; 33b in 4096-token context mode can sort of fit into the GPU if I'm lucky.

TheBloke has just quantized the new Facebook 70b model to GGML: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
Support in llama.cpp and KoboldCPP is still experimental.
No, I was asking what the tokens per second was specifically for your hardware. If you are running the webui, it will tell you in the output.
 
No, I was asking what the tokens per second was specifically for your hardware. If you are running the webui, it will tell you in the output.
I'm on an i9-10920X with an NVIDIA 3090, a fairly beefy machine.
For 33b 4-bit, around 20 tokens/s for GPTQ models with ExLlama and 4-5 tokens/s for GGML models with llama.cpp. I can't load 65b GPTQ models, but I'm getting about 1 token/s for GGML.
 
  • Informative
Reactions: 888Flux
I'm on an i9-10920X with an NVIDIA 3090, a fairly beefy machine.
For 33b 4-bit, around 20 tokens/s for GPTQ models with ExLlama and 4-5 tokens/s for GGML models with llama.cpp. I can't load 65b GPTQ models, but I'm getting about 1 token/s for GGML.
Interesting. I have that exact same CPU but with a 4090. For 30b models using GPTQ inference it's about 50 tokens a second. For 65b GGML models it's 1 token a second, but the context length has to be cut in half, otherwise it can't load.
(attached screenshot: running Airoboros 65b)
 
  • Informative
Reactions: Another Char Clone
It seems there is little benefit to offloading 65b GGML models to the GPU unless you have about 48GB/64GB of VRAM; generation speed stays gated by whatever layers are left on the CPU, so a partial offload barely moves the needle. I am also getting about that speed sometimes if I don't offload any layers at all while pushing 20 CPU threads.
 
  • Like
Reactions: 888Flux
The lazybones at NovelAI finally released a new model (no shill, it's just a pain in the ass getting local UIs to recognize my CUDA install, and I caved when I saw they had new shit out; if you are already running local or have more patience than me for setup, then stick with it and don't be a paypig) and it performs OK.

(attached screenshot)
 
Rate me dumb or autistic, but would it be possible to get a KF dataset to fine-tune an existing model?
It's possible, but you would have to scrape all of the data from this site and compile it into a dataset. There was a model based on GPT-J called GPT-4chan, developed by Yannic Kilcher, that he let run on /pol/ for a month, and the results were more than interesting. It used a dataset that was an archive of 4chan from late June 2016 to November 2019, which included 3.3 million threads and 134.5 million posts.
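For the "compile it into a dataset" part, the usual shape is a JSONL file of plain text (one training example per line) that a fine-tuning script can ingest. A minimal sketch, assuming you've already scraped threads into a list of dicts; the field names and delimiter here are made up for illustration, not any real scraper's output.

# minimal sketch: turn scraped threads into a JSONL dataset for causal-LM fine-tuning
# ("thread_title"/"posts" are hypothetical fields; use whatever your scraper actually produces)
import json

scraped_threads = [
    {"thread_title": "ChatGPT - If Stack Overflow and Reddit had a child",
     "posts": ["first post text", "a reply", "another reply"]},
]

with open("kf_dataset.jsonl", "w", encoding="utf-8") as f:
    for thread in scraped_threads:
        # flatten each thread into one training example, posts separated by a delimiter
        text = thread["thread_title"] + "\n-----\n" + "\n-----\n".join(thread["posts"])
        f.write(json.dumps({"text": text}) + "\n")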
 
  • Agree
Reactions: AFAB