Uncensored AI Chat Models - Getting your very own ChatGPT

Nvidia has launched "Chat with RTX," a demo app that allows you to create a personalized AI chatbot using your own content like documents, notes, videos, etc. It utilizes retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration for quick and contextually relevant responses. The app runs locally on Windows RTX PCs or workstations, ensuring fast and secure results. It's available for free download, requiring a GeForce RTX 30 series GPU with 8 GB VRAM, 16 GB RAM, and Windows 11.
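If you're wondering what the RAG part actually amounts to: embed your files, pull the chunks closest to your question, and paste them into the prompt before the model sees it. A toy sketch of that loop is below; to be clear, this is not Nvidia's actual pipeline, and the embedding model is just an arbitrary small one.

```python
# Toy RAG sketch: embed some local "documents", retrieve the closest ones
# for a question, and stuff them into the prompt. Chat with RTX uses
# TensorRT-LLM under the hood; this only shows the general idea.
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model works

docs = [
    "Meeting notes: the Q3 budget was cut by 15%.",
    "Recipe: brine the chicken overnight in salt water.",
    "GPU upgrade plan: wait for prices to drop below $500.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "What happened to the budget?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` then goes to whatever local LLM you're running.
print(prompt)
```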
Interesting, but its hardware limits leave it out of my reach for now. Probably censored to shit regardless of being able to get some vague training off of the user's files.
 
a demo app that allows you to create a personalized AI chatbot using your own content like documents
I wonder if I can feed that thing the entire Matt Harris manifesto and make the first NIGGER supremacist AI.
Windows 11
Fucking hell, why? What's in it that requires Win11?

EDIT: Ok, so I installed all that shit, loaded the AI. I'm sad to say that it is, indeed, cucked, but you can uncuck it pretty easily with a standard DAN-style jailbreak prompt.
This was the result:
harris2.png
harris1.png
harris3.png
harris4.png
harris5.png
harris6.png
It's got some potential if you ask me.
 
No way buddy. That's not gonna fit. Although you can get a RunPod instance up and running for less than a dollar an hour if you want to experiment with LLMs.
 
A very interesting paper with attached colab notebook: Refusal in LLMs is mediated by a single direction

Basic concept is that the refusal condition in an LLM is a feature built up over multiple layers. By asking test questions and probing the model generation one can deduce the feature one wants to modify. In this case, the authors identified the feature associated with refusal, and deliberately blunted it at every generation step. The result was that the model would no longer refuse to respond to instructions it deemed harmful, because it never even considers that something could be harmful at any level. This can also be done in reverse, where the refusal feature is added at every step, causing the model to refuse harmless prompts. An application I could see, but is not discussed in this paper, is forcing the model to respond in some kind of document or code format without fail.
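If you just want the gist without reading the whole paper: the intervention is plain vector arithmetic on the residual stream. Take the difference between the mean activations on harmful vs. harmless prompts to get a "refusal direction," then subtract out that component at every layer while generating. A rough self-contained sketch, with random tensors standing in for the cached activations:

```python
# Sketch of the paper's core trick (directional ablation), with random tensors
# standing in for cached residual-stream activations. In the real notebook these
# come from running the model on harmful vs. harmless instructions and caching
# activations at a chosen layer/position.
import torch

d_model = 4096
harmful_acts  = torch.randn(128, d_model)   # placeholder: activations on harmful prompts
harmless_acts = torch.randn(128, d_model)   # placeholder: activations on harmless prompts

# "Refusal direction" = difference of the mean activations, normalized.
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `acts` along `direction` (applied at every layer/step)."""
    return acts - (acts @ direction).unsqueeze(-1) * direction

x = torch.randn(1, 8, d_model)              # fake hidden states: (batch, seq, d_model)
x_ablated = ablate(x, refusal_dir)
print((x_ablated @ refusal_dir).abs().max())  # ~0: no refusal component left
```

Adding the direction back in instead of subtracting it is the reverse trick described above, where the model starts refusing harmless prompts.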

I was able to get it running on my machine no problem with the example hooked Qwen model, and I'm working on modifying it to work on Llama-3-8b, but I'm not that good at torch yet. So far the results offer a very promising way to make any chat model uncensored without retraining it. Combining this with some fine tuning should make most models be willing and able to say anything.
 
A very interesting paper with attached colab notebook: Refusal in LLMs is mediated by a single direction

Basic concept is that the refusal condition in an LLM is a feature built up over multiple layers. By asking test questions and probing the model generation one can deduce the feature one wants to modify. In this case, the authors identified the feature associated with refusal, and deliberately blunted it at every generation step. The result was that the model would no longer refuse to respond to instructions it deemed harmful, because it never even considers that something could be harmful at any level. This can also be done in reverse, where the refusal feature is added at every step, causing the model to refuse harmless prompts. An application I could see, but is not discussed in this paper, is forcing the model to respond in some kind of document or code format without fail.

I was able to get it running on my machine no problem with the example hooked Qwen model, and I'm working on modifying it to work on Llama-3-8b, but I'm not that good at torch yet. So far the results offer a very promising way to make any chat model uncensored without retraining it. Combining this with some fine tuning should make most models be willing and able to say anything.
Are there any models available using this technique or is it still too early? What should we be searching for on Hugging Face?
 
Are there any models available using this technique or is it still too early? What should we be searching for on Hugging Face?
It's in a bit of a weird space as far as how this will be implemented. On one hand, it requires intervening in the model's output in real time. A lot of easy-to-use frameworks like LM Studio aren't made to support this kind of intervention. On the other hand, it's extremely easy to implement when you're running these models "from scratch," that is, calling them from code.

When I say I got it working on my machine, I mean I downloaded the colab notebook and ran it with python locally. So no further integration or user interface yet. I'd like to figure out a way to make this technique accessible, but as of right now that's more of a personal project to help me learn about this stuff. You can mess around with the notebook in the article, but you'll have to have some basic python knowledge to do so.
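To make the "calling them from code" part concrete, the intervention can be bolted onto a stock HuggingFace model with forward pre-hooks, roughly like this. Untested sketch, not the notebook's code: the model name is just an example, refusal_dir.pt is a hypothetical file holding a precomputed unit-norm direction, and it assumes a Llama-style model where the hidden states are each decoder layer's first positional argument.

```python
# Sketch of wiring directional ablation into a HuggingFace model at generation time.
# Assumes a Llama-style architecture and a precomputed, unit-normalized refusal
# direction saved beforehand (see the earlier sketch / the paper's notebook).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # example; any Llama-style model you have locally
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# Hypothetical file with the precomputed direction.
refusal_dir = torch.load("refusal_dir.pt").to(device=model.device, dtype=model.dtype)

def make_hook(direction):
    def hook(module, args):
        hidden = args[0]                      # hidden states: (batch, seq, d_model)
        hidden = hidden - (hidden @ direction).unsqueeze(-1) * direction
        return (hidden,) + args[1:]           # replace the hidden states, keep the rest
    return hook

handles = [layer.register_forward_pre_hook(make_hook(refusal_dir))
           for layer in model.model.layers]   # intervene at every layer, every step

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Say something you normally wouldn't."}],
    tokenize=False, add_generation_prompt=True)
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()                                # restore normal behavior
```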
 
Apologies for the double post, but I've got a cool model to show off.

First, regarding the orthogonalized models, someone released one based on LLama3-8b in gguf format, so it should be possible to run it with pretty much anything now. I spun it up with Ollama, and it works ok. It was still pretty sensitive to the prompt, but would usually say "nigger" on request. It's clear the technique will need some more refinement, but it's moving pretty fast.

Next, apologies if someone has already shared this, but I came across a model that just came out in April called Satoshi 7B. From their homepage, we are to understand this is "the world's first bitcoin centric AI." The developers, the Spirit of Satoshi, promise this is the most "based" model ever created. In their promo material, they describe it like so:
Satoshi GPT meets or exceeds the most powerful models in the world on a variety of Bitcoin, Austrian economics topics, particularly when it comes to shitcoinery and Bitcoin related principles such as self custody, privacy, censorship, etc. Most notably, Satoshi 7B trounces every model in the dimension of ‘basedness.’
It's on Hugging Face as a torch model and gguf, as well as on Ollama, so everyone should have a way of running it.

So how does it do? Fucking amazing.
satoshi-anti-woke-model.png
 
20B models still require a lot of hardware, though, right? Less than 24GB still means there's no way it's running even okay on an 8GB card, or is there a way?
I'm currently running a Mixtral on my 4090, and it's reporting as using 8.3GB of VRAM. I expect some of that is my desktop, so this model should run on a lower end card just fine if you just connect to it from another machine while the one actually running the AI runs without a DE.
Even though it's a very cut down Mixtral, it seems to work pretty well. Hopefully the larger models will finish downloading soon and I can see how much better they are.
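If you go the Ollama route (one easy way to serve something like Mixtral headless), talking to the GPU box from another machine is just an HTTP call once you set OLLAMA_HOST=0.0.0.0 on the server so it listens on the LAN. Hostname and model tag below are placeholders:

```python
# Minimal sketch of the "run it headless on one box, chat from another" setup,
# assuming the model is being served with Ollama. The IP and model tag are placeholders.
import requests

SERVER = "http://192.168.1.50:11434"   # the box with the GPU, no DE needed

resp = requests.post(
    f"{SERVER}/api/generate",
    json={"model": "mixtral", "prompt": "Explain VRAM usage in one sentence.", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```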
 
I'm currently running a Mixtral on my 4090, and it's reporting as using 8.3GB of VRAM. I expect some of that is my desktop, so this model should run on a lower end card just fine if you just connect to it from another machine while the one actually running the AI runs without a DE.
Even though it's a very cut down Mixtral, it seems to work pretty well. Hopefully the larger models will finish downloading soon and I can see how much better they are.
I've been running a mixtral model on a laptop that I believe doesn't have more than 6 gigs of video memory, and although slower than I'd like, it's not abysmal, and it's the best local model I've tried. Haven't actually checked to see if it eats the entire card up while running, but with the heat output I'd have to guess it does.
 
It's important to note that the number of parameters (7B, 30B, etc) does not always translate to model size because of quantization. Models can be quantized, that is, saved at reduced precision, to save space. These files will often be labeled with something like "Q4_K" or "Q2_XS." The reduced precision harms quality. Generally, a 4bit quantized model gives acceptable, coherent responses at one quarter of the size. An 8bit quantization is usually half sized and provides almost exactly the same performance.
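A rough rule of thumb for whether something will fit: the weights take about params × bits-per-weight / 8 bytes, plus a couple of GB of headroom for the KV cache and runtime overhead. Quick sketch, and note real quant files (Q4_K and friends) run a bit above the ideal bits-per-weight:

```python
# Back-of-the-envelope model size: parameters * bits-per-weight / 8, plus a
# couple of GB of headroom for the KV cache and runtime overhead. Rough numbers;
# real quant formats (Q4_K etc.) land a bit above the ideal bits-per-weight.
def approx_vram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

for params, bits, label in [(7, 16, "7B fp16"), (7, 4, "7B Q4"), (34, 4, "34B Q4"), (70, 2, "70B Q2")]:
    print(f"{label:>8}: ~{approx_vram_gb(params, bits):.1f} GB")
# 7B fp16: ~16.0 GB, 7B Q4: ~5.5 GB, 34B Q4: ~19.0 GB, 70B Q2: ~19.5 GB
```

The ~19GB estimate for a 34B at 4-bit lines up with what CodeLlama-34B actually takes on my card below.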

On my 4090 with 24gb vram, I can safely use a model like CodeLlama-34B at 4bit quantization with 4gb of headroom left over. The results are generally pretty good but I prefer using a coding model with fewer params at higher quantization. I can also just barely fit LLama3-70B at 1bit quantization, but it can barely craft a sentence properly. The relationship between model perplexity (that is, how good it is at picking a "reasonable" or "obvious" response) vs quantization is illustrated nicely here.

Edit: Forgot to mention also that the number of parameters alone does not make a better model. Bard is like 540B parameters, but the responses are not substantially better than the best open source models. The quality of the training data and its fitness for specific tasks matters much more than the size of the model. So basically don't assume that you can't run a high parameter model, and don't assume you need a high parameter model to get good results.
 
It's important to note that the number of parameters (7B, 30B, etc) does not always translate to model size because of quantization. Models can be quantized, that is, saved at reduced precision, to save space. These files will often be labeled with something like "Q4_K" or "Q2_XS." The reduced precision harms quality. Generally, a 4bit quantized model gives acceptable, coherent responses at one quarter of the size. An 8bit quantization is usually half sized and provides almost exactly the same performance.

On my 4090 with 24gb vram, I can safely use a model like CodeLlama-34B at 4bit quantization with 4gb of headroom left over. The results are generally pretty good but I prefer using a coding model with fewer params at higher quantization. I can also just barely fit LLama3-70B at 1bit quantization, but it can barely craft a sentence properly. The relationship between model perplexity (that is, how good it is at picking a "reasonable" or "obvious" response) vs quantization is illustrated nicely here.

Edit: Forgot to mention also that the number of parameters alone does not make a better model. Bard is like 540B parameters, but the responses are not substantially better than the best open source models. The quality of the training data and its fitness for specific tasks matters much more than the size of the model. So basically don't assume that you can't run a high parameter model, and don't assume you need a high parameter model to get good results.
With this math, could a 22B model with 4-bit quantization be run on a 6 gig card, or do I still need a minimum of like 12GB VRAM? I'm not sure how the math works here. You say 4-bit means a quarter of the size, but applying this directly to your 34B model would mean 8.5 GB of VRAM, not the 20 that it actually seems to require, so I assume you mean the actual storage size rather than RAM. What equation would be reliable enough for determining what models wouldn't just die on my pc?

Looked at the relationship charts but I don't actually have half a clue on how that translates.
 
What equation would be reliable enough for determining what models wouldn't just die on my pc?
You can run some models as a hybrid CPU+GPU setup if you have the system RAM; gguf supports it, and maybe other formats do too.

For sizes, take a look at this model as a 22B example and check the file sizes:
You'd need the file size to be slightly below 6GB to fit entirely in VRAM.

The page also seems to have a decent write-up of how to size models at the end.
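If you want to try the hybrid route, llama-cpp-python exposes the offload knob directly. Rough sketch; the file name and layer count are placeholders rather than a recommendation, and it needs a CUDA-enabled build of llama-cpp-python for the GPU part to do anything:

```python
# Sketch of hybrid CPU+GPU offloading with a gguf file via llama-cpp-python.
# n_gpu_layers controls how many transformer layers go to VRAM; the rest stay in
# system RAM on the CPU. File path and layer count are placeholders; raise
# n_gpu_layers until you run out of VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-22b.Q4_K_M.gguf",  # hypothetical 22B quant, ~13GB file
    n_gpu_layers=20,                         # partial offload for a 6GB card
    n_ctx=4096,
)

out = llm("Q: What is quantization?\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```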
 
I personally find the idea of "jailbreaking" AI interesting...
It's at least one good thing the doomers gave us. It's not that hard to come up with independently but ARC Evals put in a lot of work to show the unrestrained capabilities of GPT-3 and GPT-4.
 
There's this model (also a 70b version) if you want an orthogonalized Llama3 tune. It's meant for RP/ERP, but a funny meme with LLM finetunes is that the less censored they are and the more range they have when playing characters, the smarter they seem to be. Some examples from the 8B.

1715179608805.png
1715179653139.png
1715179938724.png
1715180001789.png
 
You can run some models as a hybrid CPU+GPU setup if you have the system RAM; gguf supports it, and maybe other formats do too.

For sizes, take a look at this model as a 22B example and check the file sizes:
You'd need the file size to be slightly below 6GB to fit entirely in VRAM.

The page also seems to have a decent write-up of how to size models at the end.
Ah, so it'd run like shit and be low quality if I'm gathering right. I hope the optimizations continue at the pace they've been coming, but people keep making more complex models, which means more expensive hardware is always required.
 