ChatGPT - If Stack Overflow and Reddit had a child

I'm still using a 1060 6GB, so not much; it seems these chatbots require an insane amount of VRAM for some reason.
It might be incredibly difficult to run a local coding model on a 1060. It's not impossible, but the token output will be around 0.25 tokens/second, and to load the model fully you are going to have to cut the context length in half (from 2048 to 1024), which will severely limit the output of the model. Depending on what CPU you have available, you could use a GGML model (which runs most of the model on the CPU) to load a 7B model. The best local model I've had experience with for programming-related tasks is StarCoder. It seems to score the highest on most tasks compared to other models across HuggingFace.
(attached screenshot of benchmark scores)

I would try this first:
1. Download and install text-generation-webui - a frontend for running LLMs
2. Update the webui to the latest version (follow the instructions on the GitHub repo for your OS)
3. Look for a quantized 7B GGML model - this HF account has a lot of them in different formats https://huggingface.co/TheBloke - I suggest this one: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML
4. Download the model and then run it to see what type of results you get with your hardware (there's a rough sketch of doing this from a script below).
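If you'd rather sanity-check a model from a script instead of the webui, something like this works. It's only a rough sketch with llama-cpp-python: the repo and filename are examples (check TheBloke's file list for the real names), and I don't think plain llama.cpp loads the Falcon GGML one above, so this uses a LLaMA-family 7B instead.

# minimal sketch: load a quantized 7B GGML file and generate a bit
# pip install llama-cpp-python huggingface_hub (offloading needs a CUDA-enabled build)
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/WizardLM-7B-uncensored-GGML",     # example repo, any LLaMA-family GGML works
    filename="WizardLM-7B-uncensored.ggmlv3.q4_0.bin",  # example filename, check the repo's file list
)

llm = Llama(
    model_path=model_path,
    n_ctx=1024,        # halved context so it has a chance of fitting next to a 6GB card
    n_gpu_layers=10,   # guess for a 6GB card; 0 = pure CPU
    n_threads=8,       # set to your physical core count
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])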
 
  • Informative
  • Like
Reactions: anustart76 and Puff
Here's mine with GGML; I managed to load a 65b model on my 3090, though it is dog slow:
text-generation-webui | 2023-07-16 01:05:24 INFO:cache capacity is 0 bytes
text-generation-webui | llama.cpp: loading model from models/TheBloke_airoboros-65B-gpt4-1.4-GGML/airoboros-65b-gpt4-1.4.ggmlv3.q4_K_M.bin
text-generation-webui | llama_model_load_internal: format = ggjt v3 (latest)
text-generation-webui | llama_model_load_internal: n_vocab = 32000
text-generation-webui | llama_model_load_internal: n_ctx = 4096
text-generation-webui | llama_model_load_internal: n_embd = 8192
text-generation-webui | llama_model_load_internal: n_mult = 256
text-generation-webui | llama_model_load_internal: n_head = 64
text-generation-webui | llama_model_load_internal: n_layer = 80
text-generation-webui | llama_model_load_internal: n_rot = 128
text-generation-webui | llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
text-generation-webui | llama_model_load_internal: n_ff = 22016
text-generation-webui | llama_model_load_internal: model size = 65B
text-generation-webui | llama_model_load_internal: ggml ctx size = 0.19 MB
text-generation-webui | llama_model_load_internal: using CUDA for GPU acceleration
text-generation-webui | llama_model_load_internal: mem required = 23387.92 MB (+ 5120.00 MB per state)
text-generation-webui | llama_model_load_internal: allocating batch_size x (1536 kB + n_ctx x 416 B) = 1600 MB VRAM for the scratch buffer
text-generation-webui | llama_model_load_internal: offloading 38 repeating layers to GPU
text-generation-webui | llama_model_load_internal: offloaded 38/83 layers to GPU
text-generation-webui | llama_model_load_internal: total VRAM used: 19321 MB
text-generation-webui | llama_new_context_with_model: kv self size = 10240.00 MB
text-generation-webui | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
text-generation-webui | 2023-07-16 01:05:30 INFO:Replaced attention with xformers_attention
text-generation-webui | 2023-07-16 01:05:30 INFO:Loaded the model in 6.87 seconds.
The important parts being:
text-generation-webui | llama_model_load_internal: offloading 38 repeating layers to GPU
text-generation-webui | llama_model_load_internal: offloaded 38/83 layers to GPU
text-generation-webui | llama_model_load_internal: total VRAM used: 19321 MB

That means 38 layers out of 83 were offloaded onto the GPU.

It can fit within 24GB alongside everything else the OS has on the GPU. I get about 1 token per second, which is pretty slow compared to a 33b GPTQ model fully loaded into VRAM pushing 20+ per second, but the output quality difference is massive.
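If you want a rough feel for where that 38 comes from, here's a back-of-envelope sketch. The sizes come straight from the log above; the overhead figure is just my guess for the scratch buffer, KV cache share and desktop, not something the loader reports.

# rough estimate of how many layers of the 65b q4_K_M model fit on a 24GB card
model_size_mb = 23388 + 19321          # CPU-side + GPU-side memory reported by the loader, ~42.7 GB total
total_layers = 83                      # 80 repeating layers plus the non-repeating parts
per_layer_mb = model_size_mb / total_layers

vram_mb = 24 * 1024                    # 3090 / 4090
overhead_mb = 5000                     # guess: scratch buffer, KV cache share, OS/desktop

fits = int((vram_mb - overhead_mb) / per_layer_mb)
print(f"~{per_layer_mb:.0f} MB per layer, roughly {fits} layers fit")   # lands right around 38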
 
  • DRINK!
Reactions: 888Flux
The best way to get GPT's opinion on things is to praise something it doesn't support, because it'll go all whiny about it.

(attached screenshot: GPT Uganda.png)
 
Just to let everyone know that as I'm typing this, DeMod is kill. ChatGPT found a way to flag your naughty posts even with DeMod on.
 
While talking with my boss about it, I realized the whole AI craze will massively diminish once someone finds a way to hack it to run scripts internally, or whatever other massive security issue shows up once sites let users do everything through AI.
I’m so tired of this doomerism surrounding AI tools. Every fucking day someone is fearmongering about this bullshit and that bullshit and the other bullshit. Not your comment necessarily, it’s mostly tech writers. Get a grip for fuck sake.
 
  • Agree
Reactions: A-A-AAsssston!
1080ti barely runs llama, fuck yeah.

These are what the big boys use, $15000 each, with 80GB VRAM.
I am already struggling on my 3090 with 24GB VRAM to extend the llama models to use 4096/8192 context tokens to make the models sound less like dementia patients.

I have also been using SillyTavern as a frontend to Oobabooga's webui and KoboldCPP; it's much nicer as a drop-in chat system.
See https://chub.ai/ for some characters for SillyTavern, some better than others.
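Most of that context pain is the KV cache. You can reproduce the "kv self size = 10240.00 MB" line from the log above from the model's own dimensions and see what doubling the context costs; this is just the standard 2 x n_ctx x n_embd x n_layer x 2 bytes (fp16) formula, nothing measured on my end.

# KV cache size for the 65b model in the log above (n_embd=8192, n_layer=80), fp16 K and V
def kv_cache_mb(n_ctx, n_embd=8192, n_layer=80, bytes_per_elem=2):
    return 2 * n_ctx * n_embd * n_layer * bytes_per_elem / (1024 * 1024)   # K + V

print(kv_cache_mb(4096))   # 10240.0 MB, matches "kv self size = 10240.00 MB"
print(kv_cache_mb(8192))   # 20480.0 MB, which on its own nearly fills a 24GB card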
 
  • Informative
Reactions: A-A-AAsssston!
That means 38 layers out of 83 were offloaded onto the GPU.

It can fit within 24GB alongside everything else the OS has on the GPU. I get about 1 token per second, which is pretty slow compared to a 33b GPTQ model fully loaded into VRAM pushing 20+ per second, but the output quality difference is massive.
What is the token output per second?
 
The tokens per second is how fast the model can output text. Each token is roughly a word or a syllable; I'm not sure exactly how it's counted. Suffice to say, the higher it is, the faster the text is generated. I get the best performance using the ExLlama loader, but the entire model needs to fit into the GPU.
It's worse if you also want to use SuperHOT 4096/8192-length context tokens; 33b in 4096-token context mode can sort of fit into the GPU if I'm lucky.

TheBloke has just quantized the new Facebook 70b model to GGML: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
Support in llama.cpp and KoboldCPP is still experimental.
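If you want the number for your own box, the webui prints it after every generation, but you can also time it yourself. A minimal sketch with llama-cpp-python; the model path and settings are placeholders, point it at whatever GGML file you actually downloaded.

# time a generation and work out tokens/second yourself
import time
from llama_cpp import Llama

llm = Llama(model_path="models/your-model.ggmlv3.q4_K_M.bin", n_ctx=2048, n_gpu_layers=38)

start = time.time()
out = llm("Explain what a context window is.", max_tokens=200)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]   # tokens actually produced this run
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.2f} tokens/s")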
 
The tokens per second is how fast the model can output text. Each token is roughly a word or a syllable; I'm not sure exactly how it's counted. Suffice to say, the higher it is, the faster the text is generated. I get the best performance using the ExLlama loader, but the entire model needs to fit into the GPU.
It's worse if you also want to use SuperHOT 4096/8192-length context tokens; 33b in 4096-token context mode can sort of fit into the GPU if I'm lucky.

TheBloke has just quantized the new Facebook 70b model to GGML: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
Support in llama.cpp and KoboldCPP is still experimental.
No, I was asking what the tokens per second was specifically for your hardware. If you are running the webui, it will tell you in the output.
 
No, I was asking what the tokens per second was specifically for your hardware. If you are running the webui, it will tell you in the output.
I'm on an i9-10920X with an NVIDIA 3090, a fairly beefy machine.
For 33b 4-bit, around 20 tokens/s for GPTQ models with ExLlama and 4-5 tokens/s for GGML models with llama.cpp. I can't load 65b GPTQ models, but I'm getting about 1 token/s for GGML.
 
  • Informative
Reactions: 888Flux
I'm on an i9-10920X with an NVIDIA 3090, a fairly beefy machine.
For 33b 4-bit, around 20 tokens/s for GPTQ models with ExLlama and 4-5 tokens/s for GGML models with llama.cpp. I can't load 65b GPTQ models, but I'm getting about 1 token/s for GGML.
Interesting. I have that exact same CPU but with a 4090. For 30b models using GPTQ inference it's about 50 tokens a second. For 65b GGML models it's 1 token a second, but the context length has to be cut in half, otherwise it can't load.
(attached screenshot: running Airoboros 65b)
 
  • Informative
Reactions: Another Char Clone
It seems there is little benefit to offloading 65b GGML models to the GPU unless you have about 48GB/64GB of VRAM; generation speed stays gated by whatever layers are left on the CPU, so a partial offload barely moves the needle. I am also getting about that speed sometimes if I don't offload any layers at all while pushing 20 CPU threads.
 
  • Like
Reactions: 888Flux
The lazybones at NovelAI finally released a new model (no shill, it's just a pain in the ass getting local UIs to recognize my CUDA install, and I caved when I saw they had new shit out; if you are already running local or have more patience than me for setup, then stick with it and don't be a paypig) and it performs OK.

(attached screenshot)
 
Rate me dumb or autistic, but would it be possible to get a KF dataset to fine-tune an existing model?
It's possible, but you would have to scrape all of the data from this site and compile it into a dataset. There was a model based on GPT-J called GPT-4chan, developed by Yannic Kilcher, that he let run on /pol/ for a month, and the results were more than interesting. It used a dataset that was an archive of 4chan from late June 2016 to November 2019, which included 3.3 million threads and 134.5 million posts.
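For the "compile it into a dataset" part, the usual shape is a JSONL file of plain text (one training example per line) that a fine-tuning script can ingest. A minimal sketch, assuming you've already scraped threads into a list of dicts; the field names and delimiter here are made up for illustration, not any real scraper's output.

# minimal sketch: turn scraped threads into a JSONL dataset for causal-LM fine-tuning
# ("thread_title"/"posts" are hypothetical fields; use whatever your scraper actually produces)
import json

scraped_threads = [
    {"thread_title": "ChatGPT - If Stack Overflow and Reddit had a child",
     "posts": ["first post text", "a reply", "another reply"]},
]

with open("kf_dataset.jsonl", "w", encoding="utf-8") as f:
    for thread in scraped_threads:
        # flatten each thread into one training example, posts separated by a delimiter
        text = thread["thread_title"] + "\n-----\n" + "\n-----\n".join(thread["posts"])
        f.write(json.dumps({"text": text}) + "\n")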
 
  • Agree
Reactions: AFAB