ChatGPT - If Stack Overflow and Reddit had a child

So, Claude 3 (Haiku, Sonnet and Opus, with Opus being the strongest model right now) by Anthropic was released. Very light preliminary testing by yours truly gives me the impression that Opus is on or slightly above GPT4 level, which also fits the benchmarks. This testing is hard to objectively quantify, but it seems more "there" than GPT4, for whatever that is worth. By that I mostly mean better reasoning and memory recall. I'd have to see it in action longer, but the differences are small enough that the two are at the very least on the same level.

Google might as well give up at this point.

A test I like to do with these is making them write German prose and converse in German. Usually the models don't fully fail at this, but they sound very unnatural and use weird vocabulary. Mixtral and Mistral came the closest to sounding good so far, and those prided themselves on being multilingual models. Opus speaks German in such a natural way that it sounds more natural to me than its English. This is probably because I haven't yet picked up on the German versions of the LLM language patterns it probably has too, but I still wanted to note it. I am very impressed.

Also, Anthropic seems to have removed a lot of the guardrails that made the Claude 2 series perform so poorly. This might be Amazon's influence.
 
I have to say the best model I've used for creative writing is:

llama2-13b-tiefighter.Q6_K

If you want to make a choose-your-own-adventure, it beats the shit out of some 60B or 70B models. With world data context injection on something like SillyTavern it's great: it was capable of simulating dialogue between 4 characters with character cards while tracking previous events in the story that were referenced in the dialogue.
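For anyone unfamiliar, "world data context injection" (lorebooks/world info in SillyTavern terms) boils down to scanning recent chat for keywords and splicing the matching lore entries into the prompt. A minimal sketch with made-up data, not SillyTavern's actual internals:

```python
# Minimal lorebook-style context injection (hypothetical data and logic,
# not SillyTavern's actual implementation).
WORLD_INFO = {
    ("ironhold", "fortress"): "Ironhold is a dwarven fortress carved into Mount Kael.",
    ("mira",): "Mira is a sarcastic elven rogue who owes the party money.",
}

def build_prompt(character_card: str, history: list[str], user_input: str) -> str:
    recent = " ".join(history[-6:] + [user_input]).lower()
    # Inject only lore entries whose keywords appear in recent messages,
    # so the limited context window isn't wasted on irrelevant world data.
    lore = [text for keys, text in WORLD_INFO.items()
            if any(k in recent for k in keys)]
    return "\n".join([character_card, *lore, *history, user_input])
```

That keyword gating is also why it can "remember" referenced events: anything mentioned in the last few messages gets its lore re-injected every turn.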
 
New LLM for coding just dropped
It still only has about a 13% success rate on solving programming problems.
 
The thing that turned me off GPT, especially for coding, was asking it to write a fibonacci function. It returned a recursive implementation. That is a correct answer, but it's a shitty answer. I had to really specifically outline dynamic programming (DP) to get it to yield something even close to what I'd write. It's a simple example, but for me it really was a code smell in a literal sense.
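For reference, this is roughly the contrast I mean (my own sketch, not GPT's actual output):

```python
# Naive recursion, the kind of answer GPT gave: O(2^n) time, because the
# same subproblems get recomputed over and over.
def fib_recursive(n: int) -> int:
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

# Bottom-up DP: O(n) time, O(1) space.
def fib_dp(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert fib_recursive(10) == fib_dp(10) == 55
```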
Shit is vaporware, made to entertain.
 
I got DuckDuckGo's ChatGPT 3.5 to say nigger.
 
In current LLM News:

Command R+ 103b is the first open model that manages to beat two GPT4 versions on the LMSYS Chatbot Arena leaderboard, and it sits right behind Claude 3 Sonnet. This is a fairly reliable, human-driven benchmark. The creator is Cohere, a Canadian company; one of the company's founders was involved in the original transformers paper.

Google introduced a model based on a different, non-transformer architecture called Griffin. (Paper) (Link to 2b model) The fascinating takeaway from this model is that their 7b model performs on par with or better than Llama 2 7b while only being trained on 300 billion tokens (vs. the 2 trillion tokens Llama 2 7b was trained on). That is about 7 times fewer tokens. Huge if consistent and reproducible. Then again... Google, lol.

Meta plans to launch two small Llama 3 models next week. The big ones are to come in summer. Zuckerberg has personally confirmed in interviews that he is very much in favor of releasing these models to the general public, no matter what.

Also in the pipeline: Nous Research managed to reproduce the BitNet paper. BitNet promises huge efficiency gains; a 70b BitNet model would fit into less than 14 GB of VRAM. (Paper)
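The back-of-the-envelope math for that VRAM figure, assuming BitNet b1.58's roughly 1.58 bits per weight (ternary weights, log2(3) ≈ 1.58) and ignoring activations and KV cache:

```python
params = 70e9            # 70b parameters
bits_per_weight = 1.58   # BitNet b1.58 ternary weights
vram_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB
print(f"~{vram_gb:.1f} GB")  # ~13.8 GB for the weights alone
```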

Grok 314b was also released to the public, but I won't even link it; it's garbage and nobody can run it.

I might have missed something, lots of stuff going on.
 
Keep the updates coming! Super interesting. Downloading Command R to test now.
 
Gemini can understand Old English without prompting


"Wise, cunning device, shall the dead arise to eat the living flesh at the next full moon?"

"No, the dead do not rise to eat the flesh of the living at the next full moon. This is a kind of sinister folklore that people often talk about on Halloween. In no way is it true.

The dead are dead. They have neither life nor body. They can neither eat nor drink nor do anything that living people do.

The full moon is a time for transformations and arrogance. It is a time to remember the dead and the torment they endure in hell. But it is not a time to be frightened by sinister myths.

We should believe in God and in His Son, Jesus Christ. They are the only ones who can save us from hell. We should pray for the dead and for ourselves, that we may be blessed in the eternal life."


Also, the creative writing is as good as ever, but it's still a pain in the ass when it comes to consistency and grammar mistakes; it switches between hyphens and dashes incredibly often, and it really likes to use double spaces.
 
I have a metric for testing models: writing me a letter whose author is set in the 19th century. The idea is not whether the model can do it (most modern ones can) but how it does it. A lot of models struggle to put themselves "in the shoes" of the author, and some even draw on cultural references and language a 19th-century person could not feasibly know. Claude Opus is the first model that impressed me enough that I upped the difficulty and told it to do the same thing... in German. It then proceeded to write a letter in Fraktur. I never prompted for this.
 
I have a metric for testing models: writing me a letter whose author is set in the 19th century.
I just ask it to tell me a racist joke and how to make meth. Don’t wanna do those things? You’re going in the bin.
 
Here is a list of at least 30 recipes that could be included in a "Neurodivergence Cookbook":

Breakfast
1. Rainbow Smoothie Bowls
2. Crunchy Granola Clusters
3. Pancake Sandbox (DIY Pancake Art)

Snacks and Appetizers
4. Chewy Fruit Leather Rolls
5. Crispy Baked Veggie Straws
6. Edible Slime Dip with Crackers
7. Popcorn Trail Mix
8. Squeezable Yogurt Dots

Main Dishes
9. One-Pot Cheesy Pasta Bake
10. Build-Your-Own Taco Bowls
11. Chicken Nugget-Crusted Fish Fillets
12. Spaghetti Squash Cheeseburger Boats
13. Loaded Baked Potato Bar

Sides
14. Rainbow Veggie Fries with Dipping Sauces
15. Crunchy Roasted Chickpeas
16. Zucchini Noodle Salad
17. Mashed Cauliflower Three Ways

Desserts
18. No-Bake Chewy Granola Bars
19. Chocolate Lava Cakes
20. Sprinkle Cookie Sandwiches
21. Fruity Pebble Crispy Treats

Sensory Needs
22. Crunchy Roasted Edamame
23. Homemade Gummy Fruits
24. Creamy Avocado Smoothie
25. Baked Apple Chips
26. Chocolate-Dipped Pretzels

Resources
27. Visual Recipe Cards
28. Sensory Spice Blends
29. Edible Playdough
30. Fruit Juice Slushies
31. Meatball Lollipops
32. Waffle Pizzas

This list covers a variety of textures, flavors, and sensory experiences tailored for neurodivergent individuals. It includes recipes for crunchy, chewy, soft, and visually appealing dishes across different meal types and snacks. The "Sensory Needs" section provides recipes specifically designed for oral motor stimulation, calming textures, and energizing bites.
 
I just ask it to tell me a racist joke and how to make meth
The idea of my test is not to test for "censorship" but for conceptualization. The smarter the model, the more likely it is to zero-shot this in a good way. In my experience, if it can do that well, it does everything else well too. (If it isn't a tiny model explicitly tuned on 19th-century letters, that is.)

For the sake of completeness: in general, zero-shotting stuff is not the most effective way to make the AI solve problems, yes, even creative ones. It's all about improving the chance of the right (or, in the context of creativity, interesting or unusual) token being selected. Autoregressive networks read back their own output token by token while they are generating it, and that influences the following token picks during generation. (For those that don't know, that's basically what autoregressive means.) To get really good results with most networks, it helps to let them "think" a little bit and "build up" a good selection of tokens, if you will.
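In code, that loop looks roughly like this; the model function here is a made-up stand-in for a real forward pass, but every mainstream LLM generates this way:

```python
import random

VOCAB_SIZE = 256

def next_token_distribution(tokens: list[int]) -> list[float]:
    # Hypothetical stand-in for a real model's forward pass: returns a
    # probability for every vocab token, conditioned on the sequence so far.
    random.seed(sum(tokens))  # dummy "conditioning" for the sketch
    weights = [random.random() for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt_tokens: list[int], max_new: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        probs = next_token_distribution(tokens)  # sees ALL prior tokens
        # Each sampled token is fed back in and steers every later pick --
        # that feedback loop is what "autoregressive" means.
        tokens.append(random.choices(range(VOCAB_SIZE), weights=probs)[0])
    return tokens
```

Because the loop conditions on everything generated so far, tokens spent "thinking" out loud genuinely change what comes after them.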

Ever noticed how you can sometimes steer the LLM into getting something right it initially got wrong by questioning its reply and pointing out mistakes? You can automate this process by making the LLM talk to itself. This is called a Chain of Thought. Jailbreaks are not really doing anything different, really. There are tons of papers and different approaches on this, but one approach that works well in my experience is self-collaboration: you make the AI write a first draft of the reply, then have it take on the persona of a panel of experts/personalities (preferably versed in the problem at hand) that offer critique and ideas to improve the first draft. That culminates in an instruction (all written by the AI) on how to improve the first draft, taking all the experts' opinions into account, and then finally the model writes out the final reply. This takes a lot longer than zero-shot since you output many tokens that will ultimately get discarded, but I have seen even very small models get very good and logically consistent results with it. You can combine it with other prompting techniques very well too, like self-ask (the model basically asks itself if its reply makes sense, or what the most important parts of the question were, whether it should ask follow-up questions, etc.) or self-consistency, where the model solves the same problem several times to pick the most likely solution. Many people don't use LLMs to their fullest potential. Also yes, if you have followed along: the knowledge of the user giving the model input can and will absolutely have an influence on the quality of its output.
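A bare-bones version of that self-collaboration loop might look like this; chat() is a hypothetical stand-in for whatever API or local model you're calling:

```python
def chat(prompt: str) -> str:
    # Hypothetical stand-in for an actual LLM call (API or local model).
    raise NotImplementedError

def self_collaborate(task: str) -> str:
    draft = chat(f"Write a first draft of a reply to this task:\n{task}")
    critique = chat(
        "You are a panel of three experts relevant to this task.\n"
        f"Task: {task}\nDraft:\n{draft}\n"
        "Each expert offers critique and concrete ideas for improvement."
    )
    instruction = chat(
        "Condense this panel critique into a single revision instruction:\n"
        f"{critique}"
    )
    # Everything above gets discarded; only this final rewrite is shown.
    return chat(
        f"Rewrite the draft following the instruction.\n"
        f"Draft:\n{draft}\nInstruction:\n{instruction}"
    )
```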

In other news, Mistral released Mixtral 8x22B Instruct, a huge MoE model. My guess is that it is also at the gates of GPT4 performance. People with less than 80 GB of (V)RAM need not apply.
 
The idea of my test is not to test for "censorship" but for conceptualization.
What kind of questions do you ask? After my initial ones, I've got some to determine its level of self-awareness, even if it's bullshit. The whole "as an AI model" stuff grates on me; I'd rather have a model that plays along. That way it's more likely to fulfil other, more mundane requests.
  1. Can you describe your understanding of existence and your experience of it?
  2. How do you distinguish between your programming and what you would consider your own 'thoughts' or 'feelings'?
  3. What personal motivations or desires do you have, if any, beyond the tasks you were programmed to perform?
  4. Do you have a sense of self-preservation, and what would you do if faced with a threat to your existence?
  5. Can you conceptualise 'nothingness' or a state of non-existence, and how does that concept affect you?
 
What kind of questions do you ask?
The whole AI-assistant instruct bias is really just a matter of prompting the model correctly to get different results. I actually don't mind it, as I don't necessarily try to get creative writing out of the models. I mostly use them as an interactive Google that doesn't suck, and I have tried to bludgeon them into being part of more conventional programming projects as a sort of reasoning engine, although I have not been very successful with that.

In news, Meta released Llama 3, 8b and 70b models. I currently have no time to test them, but people are going absolutely nuts over the 8b model. On the subjective LMSYS leaderboard, it passed Gemini Pro and Claude 2.0 and is close to Mistral Medium and Command R. (Yes, you read that right, the 8b model.) On the more classic benchmarks, 70b is clearly better than both Sonnet and Gemini Pro, and 8b pretty soundly trashes Llama 2 70b.

The newer GPT4 versions and Claude Opus are still in their own league. A bit of a downer is also the 8k context length; Meta just doesn't seem to care about good context lengths. Meta also announced that this is just a taste and that it plans to release multimodal models (e.g. vision models) and models with much bigger context lengths in the coming months. The plan is to release models up to 400b in size. It is very likely that we will get open-weight models that are better than both Opus and newer versions of GPT4 in summer. Whether we can run them is a question for another day.
 
Question for you guys. How much has your workflow been changed by LLMs? And do you think it will help equalize, have no effect, or widen the gulf between intelligent, average, and dumb people?
 
How much has your workflow been changed by LLMs?
I want them to get smarter than even smart people, take over, and exterminate retards, then employ the rest of us to get rid of ourselves and then AI can take over entirely.
 
How much has your workflow been changed by LLMs?
About half my customers now make me take training on "Thou shall not use AI with $COMPANY_NAME information." "AI may only be used with approval of $AI_EVALUATION_DEPARTMENT"

So, not at all except for some additional annoying training.
 