Programming thread

Python just encourages a whole lot of brain rot, both in terms of absolute shit performance and really loosey-goosey coding, as well as hiding away too many dangerous details that you can't ever really safely ignore.
Yay, time for my Python story.

We had a Python script that loaded a ton of data, did some manipulations on it, then returned a small amount of data. Let's say it was 8GB or so of process size.

To speed it up, they realized that threads sucked, so let's just fork() 10 times or so. Since Linux does copy-on-write, this was all well and good: the main data was never written to, so it didn't balloon up. These 10 processes took roughly the same runtime. Then they exited. Python, in its infinite wisdom (maybe fixed now), would do a garbage collect before exiting. Now the majority of those pages were no longer shared, since the GC updated refcounts on most of them. Our nice shared 8GB or so was now 10×8 = 80GB, and the 32GB box promptly fell over. The simple solution was to hook the exit and tell it not to garbage collect on exit, and things returned to 'normal'.

And yes, that was a GIANT pain in the ass to debug.
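The pattern described above can be sketched in a few lines (all names and sizes hypothetical; Unix-only, since it uses os.fork()). The key is that children exit via os._exit(), which skips atexit handlers and the interpreter's shutdown garbage collection, so the copy-on-write pages inherited from the parent are never dirtied by refcount updates:

```python
import os

# Hypothetical stand-in for the ~8GB dataset; read-only in the children.
big_data = list(range(100_000))

def work(worker_id):
    # Read-only access: the inherited pages stay shared with the parent.
    return sum(big_data[worker_id::10])

pids = []
for i in range(10):
    pid = os.fork()
    if pid == 0:        # child
        work(i)
        # os._exit() skips atexit handlers and the shutdown garbage
        # collect, so refcounts on the shared pages are never touched
        # and the copy-on-write mapping survives until exit.
        os._exit(0)
    pids.append(pid)

# Parent reaps the workers.
for pid in pids:
    os.waitpid(pid, 0)
```

Modern CPython (3.7+) also ships gc.freeze(), which moves all existing objects into a permanent generation precisely so fork-based workers don't un-share pages through the collector.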
 
Python the language is not a terrible starting language (though the ecosystem was and is increasingly total fucking AIDS)
I'm primarily familiar with Python as a tool for data science (or whatever term you'd like to use). Do you mean in that context and/or elsewhere?
One of the things that really impressed itself upon me is how many guns, clubs and knives Python presents to the user as "totally safe bro" that, if you understood any better, you'd know are pants-on-head fucking retarded. An example: users would do shit like spin up a ton of threads in a program, and then go invoke multiprocessing, all while utilizing shitloads of bad third-party libraries, and wonder why things shit themselves or just got mysteriously hung (and none of them understood the consequences of mixing bare-ass fork(2) with threads and resources like mutexes and file descriptors, because "the language handles it bro").
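For what it's worth, one conventional way to avoid the fork-plus-threads footgun is to ask multiprocessing for the "spawn" start method, which launches a fresh interpreter instead of fork()ing a process that may already hold locks and open file descriptors from other threads. A minimal sketch (worker function hypothetical):

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" starts a clean interpreter per worker, so nothing held by
    # other threads in the parent (mutexes, fds) is silently inherited.
    ctx = mp.get_context("spawn")
    with ctx.Pool(4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The trade-off is slower worker startup, since each child re-imports the main module instead of inheriting state for free.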
As little as I know about writing concurrent programs:
  • Mixing fork() with mutexes ... wat.jpg
  • My understanding is that the GIL inherently prevents CPU-bound threads from running in parallel within a single process, making Python ill-suited for highly concurrent applications, and I would go looking for another language entirely. I one-click acquired a bunch of discounted materials on Erlang and Elixir, and faced with that sort of task I would likely set about actually reading (as opposed to hoarding) them.
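A quick sketch of what the GIL actually costs: the same CPU-bound function run through a thread pool and a process pool. Both produce identical results, but only the process pool can use multiple cores, because pure-Python bytecode holds the GIL while it runs (the workload here is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # Pure-Python arithmetic: the thread executing this holds the GIL
    # the whole time, so threads serialize on CPU-bound work.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    jobs = [200_000] * 4
    with ThreadPoolExecutor(max_workers=4) as tp:
        threaded = list(tp.map(burn, jobs))   # no parallel speedup
    with ProcessPoolExecutor(max_workers=4) as pp:
        forked = list(pp.map(burn, jobs))     # real multi-core execution
    assert threaded == forked
```

Threads remain perfectly fine for I/O-bound work, where the GIL is released while waiting.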
A lot of what you're saying about Python really isn't specific to Python per se but will arise anywhere in the programming world with a lack of curiosity and the mental discipline to treat learning as a lifelong task. For example, I've noticed the tendency sometimes for those who have only really ever used one language, whatever it is, to become obnoxious fanboys and dismiss everything else as baby toys. (I remember being informed in middle school that C and Linux both suck and Visual Basic rules—apparently there was no need to write the Windows OS kernel.)
It's also increasingly taking Java's place for the "wastes absurd amounts of memory" award.
That's basically the story of computing in a microcosm though:
[screenshot attachment, 2024-10-12]

when it really really blows for anything larger than small projects with good disciplined developers
I won't say tools like pandas are perfect and sometimes I wish I were using tibbles with dplyr and friends but I wouldn't say they exactly "blow".
We really should push people towards learning and using other tools when they're better suited (which is often).
What are your recommendations for specific use cases? Genuinely curious.
Yay, time for my Python story.

We had a Python script that loaded a ton of data, did some manipulations on it, then returned a small amount of data. Let's say it was 8GB or so of process size. [etc.]
What libraries were you using? There are ways to do this effectively, like with Dask.
 
What libraries were you using? There are ways to do this effectively, like with Dask.
We were not doing math; fork was the right answer, and this edge case was only a brief problem. And: you have one problem, you add another Python library, and now you have two problems.
 
We were not doing math; fork was the right answer, and this edge case was only a brief problem. And: you have one problem, you add another Python library, and now you have two problems.
I will say I don't know a ton about Dask but I see that it scales pandas DataFrames. pandas use cases need not be intensely numerical.
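As a toy illustration of that point, here's pandas doing purely non-numerical work, counting distinct actions per user from a log of strings (data hypothetical):

```python
import pandas as pd

# A small log of string-valued events; nothing numerical about it.
log = pd.DataFrame({
    "user":   ["alice", "bob", "alice", "bob"],
    "action": ["login", "login", "logout", "login"],
})

# Group by user and count the distinct actions each one performed.
distinct_actions = log.groupby("user")["action"].nunique()
```

The same split-apply-combine machinery that powers numerical aggregation works just as well on strings and categoricals.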
 
In contrast, here are the current dependencies in the venv
Doesn't it already strike you as kinda weird that Python needs compartmentalized environments because of the sheer number of possibly conflicting dependencies the usual Python program has? Whitehead was a smart man, but he never had to deal with dependency hell.

language does matter
Well yeah, of course, sometimes; otherwise we'd all just use the same language all the time. The argument was that a programmer should be able to fluently switch because he understands the logical principles. A lot of programmers nowadays don't and can't. I directly blame people learning with languages like Python or Rust, and books and tutorials not moving past "now bang library x and y together, done". Again, this is not a problem of the languages per se.

I also would never want to go back to using Borland Turbo C.
I dabble sometimes; simpler times, to be honest. I mostly miss somewhat more modern tools when I write code directly on such systems, but not even always. The level of complexity is lower than what you've gotten used to.

As far as simple CPUs go, I think you can still get eZ80's which kind of get you most of the way there to a really simple computing experience
I'm on and off developing a homebuilt Z180 system. It boots into Forth, which I built from the ground up. Almost all words are implemented in pure Forth, with some inline assembler. The important things for bootstrapping are in a ROM, and I can save and load on an SD card I access via SPI. I recently added an old graphics chip that used to render text on French terminals for direct graphics output, as opposed to serial. The Z80 core (even running at over 30 MHz) is slow and not the most ideal for Forth and modern sensibilities. If I had to do it all over again with an 8-bitter, I would go with a W65C02. There actually used to be a "computer on a chip" 6502 core with Forth integrated into an on-chip ROM, way back in the 80s.

Nowadays there's the ESP32 and the Pi Pico. These microcontrollers both have excellent "batteries included" Forth implementations, but the ESP32 one is implemented in C on top of the RTOS, so directly poking registers is no bueno (it will generate an access violation). It's still pretty fast. The Pico implementations are bare metal, though. It's really hard to overstate how powerful these dual-core (!) microcontrollers are. People have implemented x86 DOS emulation on them, complete with PS/2 input and VGA output. The ESP32 one can run Windows 3.1, IIRC. The Pico's PIO can brute-force a 640x480 DVI signal. Compared to 80s and 90s computers, they are very powerful. The only real limitation is their limited RAM. You could totally build usable general-purpose computers with these $5-$10 MCUs. There's also a pretty good Lisp and even a Python for them. But yes, they are complex enough that it's hard for somebody inexperienced to completely grasp them on a hardware level. Also, you have to realize that back in the day, whatever you could do on home computers was state of the art. As impressive as these MCUs are, they pale next to what you can do on a modern computer with e.g. the Unreal Engine. It'd be hard to draw people into this. I guess those times are just gone.

(and if you go more expensive, there are some RISC cores that come with up to 8 MB of RAM and are free of the shackles of x86. Build a modern Lisp machine, nobody is stopping you.)
 
It's not clear what point you're trying to make here.
So, the Lisp limit was purely an optimization whereas, and correct me if I got anything wrong, the UNIX limit is a hard limit used to avoid the need to detect cycles. When I was looking around for the name of this limit, I saw a conversation in the Linux kernel mailing list about removing the limit, so perhaps the Linux kernel lacks one, but other UNIX clones certainly have it. Such a limit is evidence that only half the problem has been solved. It's just an analogy I like to use, but there are plenty of others I could've chosen.
 
I made something like that, then tried making a game where you counterfeit money and buy stuff and it calculates the exact change, and learned a lot in the process.
At least George Floyd died for something.
Also, load-bearing whitespace is completely mental.
This. So much this.

If two programs differ in nothing but whitespace, they should either do the same goddamn thing, or at least one of them should blow the fuck up saying "what the fuck is wrong with you, you retard, learn to indent your code properly".
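For illustration, a minimal Python pair that differs in nothing but one level of indentation and does two different things, with no error from the language:

```python
def f():
    total = 0
    for i in range(3):
        total += i
    return total          # return sits after the loop: sums 0+1+2

def g():
    total = 0
    for i in range(3):
        total += i
        return total      # return sits inside the loop: exits on i == 0
```

f() yields 3 while g() yields 0, and nothing but indentation distinguishes them.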
 
Doesn't it already strike you as kinda weird that Python needs compartmentalized environments because of the sheer number of possibly conflicting dependencies the usual Python program has? Whitehead was a smart man, but he never had to deal with dependency hell.
Whitehead's point was about abstraction. I'm not aware of any sufficiently popular language at any level of abstraction that hasn't run into dependency hell. glibc has been a notorious source of dependency hell on Linux and that's only one level of abstraction higher than assembly.
Well yeah, of course, sometimes; otherwise we'd all just use the same language all the time. The argument was that a programmer should be able to fluently switch because he understands the logical principles. A lot of programmers nowadays don't and can't. I directly blame people learning with languages like Python or Rust, and books and tutorials not moving past "now bang library x and y together, done". Again, this is not a problem of the languages per se.
Agreed. PEBCAK goes beyond end users.
I dabble sometimes; simpler times, to be honest. I mostly miss somewhat more modern tools when I write code directly on such systems, but not even always. The level of complexity is lower than what you've gotten used to.
Among the first things I do when emulating an Amiga is setting up KingCON to replace the shitty default console, and it's still awful.
At least George Floyd died for something.
An incremental game like "Cookie Clicker" except it's "Cotton Clicker" and starts with manually whipping a nigger to produce bales of cotton and then expands to plantations, jails, crack houses, Belgian Congos and JAW FLOYS
 
C#:
// CS0266.cs
class MyClass
{
    public static void Main()
    {
        // You cannot implicitly convert a double to an integer.
        double d = 3.2;

        // The following line causes compiler error CS0266.
        int i1 = d;

        // However, you can resolve the error by using an explicit conversion.
        int i2 = (int)d;

        // You cannot implicitly convert an object to a class type.
        object obj = new MyClass();

        // The following assignment statement causes error CS0266.
        MyClass myClass = obj;

        // You can resolve the error by using an explicit conversion.
        MyClass c = (MyClass)obj;

        // You cannot implicitly convert a base class object to a derived class type.
        MyClass mc = new MyClass();
        DerivedClass dc = new DerivedClass();

        // The following line causes compiler error CS0266.
        dc = mc;

        // You can resolve the error by using an explicit conversion.
        dc = (DerivedClass)mc;
    }
}
 
class DerivedClass : MyClass
{
}


NO CHUD // You cannot implicitly convert a double to an integer.
 
NO CHUD // You cannot implicitly convert a double to an integer.
Hey, if that seems inconvenient, then never use Ada. All of these cute little conversion rules for built-in types appear to be convenient, and then cause hell in large systems. People have died because of this kind of shit.
 
Hey, if that seems inconvenient, then never use Ada. All of these cute little conversion rules for built-in types appear to be convenient, and then cause hell in large systems. People have died because of this kind of shit.
Funny enough, after I made the shitpost I went back and actually read that code and then figured out how to fix a specific thing I was working on.
 
Hey, if that seems inconvenient, then never use Ada. All of these cute little conversion rules for built-in types appear to be convenient, and then cause hell in large systems. People have died because of this kind of shit.
Don't fly-by-wire systems in aircraft use Ada? Anyway, yeah, I can't see any good reason for silent conversion with possible loss of information.
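For contrast, here's Python's handling of the same conversions, where narrowing is always spelled out at the call site (a small sketch):

```python
# No implicit double-to-int narrowing exists; conversion is an explicit
# call, and it truncates toward zero, so the information loss is visible.
d = 3.2
i = int(d)
assert i == 3
assert int(-3.2) == -3

# Mixed int/float arithmetic widens to float, which is lossless for
# small integers; going the other way always requires int() or round().
assert 1 + 2.0 == 3.0
```

The loss still happens, of course, but at least the reader can see exactly where.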
 
Languages are just tools. Start with whatever has the most help available. Probably Python, because LLMs are trained on it, and they're much better to ask than Google or genius programmers who don't even use their own libraries and just rewrite the same code over and over.

Don't use LLMs to code for you; use them as a search engine: what's the function for x, how do you use y. You can use them to write full code, and it does work at least for small snippets, but if you don't fully understand all that's going on in there, you won't ever learn properly.

Want to learn to code? Find something you could actually use, that way you have great motivation to work on it, and you'll have yourself as a tester.

There's no better way to upgrade your skills than to figure out on your own why what you did is pure shit.
 
May we see some of your APL code? I like APL's syntax a lot but I haven't come across much code to read.
Sure, I'll provide an explanation of the Wordle solver. So, here's the Wordle solver in APL:

Code:
d←'abcdefghijklmnopqrstuvwxyz'

∇ wordle l;⎕IO;g;i;s;w ⍝ I have l for list; g for guess; i for index; s for summary; and w for word.
⎕IO←i←1◊s←+/d∘.=w←l[?↑⍴l;]◊'Word: ',w
now: ⎕←g←l[?↑⍴l;]◊→end⍴⍨g≡w◊i←i+1◊l←(((+/[2]l∘.=d)∧.≤(6,⍳5)[1+(s<+/d∘.=g)×s⌊+/d∘.=g])∧((+/[2]l∘.=d)∧.≥s⌊+/d∘.=g)∧(((g=w)/l)∧.=(g=w)/w)∧(((~g=w)/l)∧.≠(~g=w)/g)∧(~∨/(l∊(~g∊w)/g)))⌿l◊→now
end: 'Guesses: ',⍕i
∇

I asked Josh, and my posting it here does give him a special license to redistribute the code in relation to Kiwi Farms, but everyone else may use the code under the AGPLv3. Clearly, this is very difficult to read, and practically impossible without knowledge of APL, so I'll now describe it. Here's an alternative form of the code which splits the solver into its parts:

Code:
∇ wordle l;⎕IO;g;i;s;w
⎕IO←i←1◊s←+/d∘.=w←l[?↑⍴l;]◊'Word: ',w
now: ⎕←g←l[?↑⍴l;]◊→end⍴⍨g≡w◊i←i+1
     l←(~∨/(l∊(~g∊w)/g))⌿l
     l←(((~g=w)/l)∧.≠(~g=w)/g)⌿l
     l←(((g=w)/l)∧.=(g=w)/w)⌿l
     l←((+/[2]l∘.=d)∧.≥s⌊+/d∘.=g)⌿l
     l←((+/[2]l∘.=d)∧.≤(6,⍳5)[1+(s<+/d∘.=g)×s⌊+/d∘.=g])⌿l◊→now
end: 'Guesses: ',⍕i
∇

The solver is based on a very simple principle: There are only so many five-letter words in the English language, and not very many at all, meaning a machine may hold the entire dictionary of such words in its storage and gradually narrow it down. I constructed a simple dictionary using a public corpus, but some of you may wish to look into the New York Times leak to get theirs. I'm soon going to write about improving the solver slightly by reordering the dictionary to make picking the first word optimal, but all current iterations of the solver choose from the dictionary randomly.

I particularly liked the rigidity of APL here, since it was very convenient to represent the dictionary as a two-dimensional array with an unbounded first dimension and a last dimension of five.

The first line of the APL function is a definition:

Code:
∇ wordle l;⎕IO;g;i;s;w

I defined a function named wordle returning no value with one parameter named l; all remaining names are local variables, including the Index Origin ⎕IO value, which may be set to zero or one.

The second line performs initializations and displays the word which the solver will find. APL expressions are evaluated right-to-left. The diamond ◊ separates expressions, which are evaluated left-to-right:

Code:
⎕IO←i←1◊s←+/d∘.=w←l[?↑⍴l;]◊'Word: ',w

So, ⎕IO and i are both set to one, with i only serving as a count of the guesses; w is set to the word which the solver will find, chosen at random; s is set to the summary of w; and then the string is displayed, joined by the comma function which combines arrays. The code which I've named Summarize in other languages is very special:

Code:
+/d∘.=

With a domain d, it compares each element of the given value to the domain; when comparing with a vector such as a word, this will be a two-dimensional matrix of ones and zeroes as its result; reduction with addition across the last dimension converts this to a count. An example will make this very clear:

Code:
  +/d∘.='kiwifarms'
1 0 0 0 0 1 0 0 2 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0

Here's an even clearer result:

Code:
  d,[0.5]+/d∘.='kiwifarms'
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 0 0 0 0 1 0 0 2 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0

The code ignores all values not in the domain, but I take no advantage of this detail. There's a lot of beauty in a name, but notice the APL needs no name, as the code is literally shorter than the name Summarize here. Therefore, inlining the code is always preferable.
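For readers who don't speak APL, a hypothetical Python port of the Summarize idiom may help (not from the original solver):

```python
def summarize(value, domain="abcdefghijklmnopqrstuvwxyz"):
    # Mirror of the APL +/d∘.= idiom: count how often each domain
    # letter occurs in value. Characters outside the domain are
    # ignored, just as in the APL version.
    return [value.count(ch) for ch in domain]

# summarize("kiwifarms") gives a=1, f=1, i=2, k=1, m=1, r=1, s=1, w=1,
# and zero everywhere else, matching the APL output above.
```

Where Python loops over the domain, the APL builds the whole comparison table at once with an outer product and reduces it with addition.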

The last line of the function reports the number of guesses after converting the count of guesses i into a string:

Code:
end: 'Guesses: ',⍕i

The Common Lisp reported the count in English, even capitalizing everything properly, but that was infeasible here. Following is the COMMON-LISP:FORMAT expression which does so, for those curious:

Code:
(format t "~&~:(~R~) ~:[guesses were~;guess was~] needed.~%"
        count (= 1 count))

The second line is where most of the real work happens, obviously:

Code:
now: ⎕←g←l[?↑⍴l;]◊→end⍴⍨g≡w◊i←i+1◊l←(((+/[2]l∘.=d)∧.≤(6,⍳5)[1+(s<+/d∘.=g)×s⌊+/d∘.=g])∧((+/[2]l∘.=d)∧.≥s⌊+/d∘.=g)∧(((g=w)/l)∧.=(g=w)/w)∧(((~g=w)/l)∧.≠(~g=w)/g)∧(~∨/(l∊(~g∊w)/g)))⌿l◊→now

APL lacks indefinite loops in its basic constructs, so it's necessary to build them out of goto, basically. The last expression with the right arrow → jumps back to the line to start it anew.

Firstly, g is initialized at random from the dictionary, no differently than w, and is also displayed by assigning the expression to the quad character ⎕:

Code:
⎕←g←l[?↑⍴l;]

Secondly, this expression is the exit condition. It was easy enough to use the match function ≡ for this, which returns one if both sides are identical, and zero otherwise. That result is used to reshape the expression naming the third line of code. I'm fond of using the swap operator ⍨ for eliminating parentheses. The expression l f⍨ r is equivalent to r f l. It just switches the elements, which is useful to avoid parentheses with the evaluation order. So, this code goes to the third line if it's reshaped to size one, but does nothing if it's reshaped to size zero:

Code:
→end⍴⍨g≡w

Thirdly, i is incremented:

Code:
i←i+1

The original Common Lisp solver used a function WORDLE::CHECK-WORD which took in both words and returned a vector of symbolic results. Here's an example:

Code:
CL-USER> (WORDLE::CHECK-WORD "kiwis" "milks")
#((:NONE #\m) (:HERE #\i) (:NONE #\l) (:SOME #\k 6 1 #(T NIL T T NIL))
  (:HERE #\s))

The Common Lisp solver was actually very clumsy compared to the APL, but this isn't too hard to understand. The solver itself used these symbolic results to reject words containing letters in the :NONE case, reject words not containing letters in the right positions for the :HERE case, and the :SOME case was particularly clumsy in indicating that the letter is somewhere in the word but not where it was given in the guess. The Common Lisp also had another case :MISS purely for handling a poor interaction of these cases, and there are other details I won't explain.

The APL was originally going to follow the Common Lisp in structure, but I knew it could be better.

The APL solver handles each of these cases more elegantly and splits the problem into five; rather than word-at-a-time thinking, the APL solver handles all letter matches simultaneously, and likewise for the other four cases.

This is a filter expression in which the bit-vector a filters the rows of the array l based on whether a holds a one or a zero, keeping the row in the former case:

Code:
a⌿l

The solver is built around this expression, and the one-line form combines the masks with logical AND before filtering the dictionary. The first expression filters out words containing letters known to not be present, equivalent to the :NONE case of the Common Lisp:

Code:
(~∨/(l∊(~g∊w)/g))

The member function ∊ returns a result indicating whether an element in the left appears in the right or not. These examples will make it clear:

Code:
  'take it down now now now'∊'kiwi farms'
0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 0 1
  1 2 3∊1 1 4 3 5
1 0 1

The following expression returns letters in the guess not present in the word to guess. The tilde ~ negates its parameter:

Code:
(~g∊w)/g

Those letters are then used on the entire dictionary simultaneously in the same way:

Code:
l∊(~g∊w)/g

The last little bit ~∨/ is logical OR reduction along the last dimension followed by negation. Afterwards, this is a nice little vector in which a one indicates the row lacked those letters, and zero indicates the opposite.

The second expression filters out words containing letters known to not be present in the given positions:

Code:
(((~g=w)/l)∧.≠(~g=w)/g)

The comparison function = returns a one for each element that matches on both sides, and zero otherwise. The following expression reduces the guess to only those letters which didn't match in value and position:

Code:
(~g=w)/g

This example will help, and notice the order is maintained:

Code:
  (~'kiwis'='milks')/'kiwis'
kwi

The following expression does the same to the dictionary, but removes entire columns instead:

Code:
(~g=w)/l

So, the filtered dictionary has the same number of columns as the filtered guess has letters afterwards. This is necessary for the inner product code. To be brief, the code ∧.≠ will compare the guess with each row of the filtered dictionary with inequality and then reduce the vectors with logical AND afterwards. Therefore, the resulting vector will hold a one in a position only if all values in the row weren't equal with any letter of the guess.


To be brief, I needed to compare the filtered guess with every filtered row of the dictionary, and this was the way to do it. In some vector languages, the language notices this kind of code more simply, and will allow it. Had I only needed to compare a single letter, it would've worked without inner product, which was a little frustrating when I was writing this.

The third expression filters out words not containing letters known to be present in their correct positions, equivalent to the :HERE case of the Common Lisp, and is similar to the second expression:

Code:
(((g=w)/l)∧.=(g=w)/w)

Notice the code is almost identical, but it doesn't negate the intermediate values, and uses equality over inequality in the inner product.

Lastly, the fourth and fifth expressions handle the remainder of the :SOME case, that case in which the correct letters were guessed but not in the correct positions. I learned a lot about how to properly handle this case when writing it in APL; the Common Lisp handling of it is actually slightly flawed, in that it's too open and failed to eliminate guesses in a very specific set of circumstances. Since I can do everything simultaneously in APL, I noticed it was very easy to handle. These two expressions also use the Summarize code again.

The fourth expression summarizes the guess and compares it to the summary of the word to guess with the minimum function ⌊ before using that information to filter the dictionary by its summaries:

Code:
((+/[2]l∘.=d)∧.≥s⌊+/d∘.=g)

This reveals only what is known when playing the game, namely the minimum count of known letters, and examples make this clear:

Code:
  +/d∘.='kiwis'
0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
  +/d∘.='milks'
0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0
  (+/d∘.='kiwis')⌊+/d∘.='milks'
0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

So, in this example, the guess "kiwis" has more instances of the letter i than the word to guess "milks" and so the code correctly shows that the word contains a minimum of one letter i. I won't explain the following code beyond pointing out that it summarizes the entire dictionary, and reduces along the second axis instead, due to the way the code works:

Code:
(+/[2]l∘.=d)

After this, both sides contain summaries, and the code ∧.≥ will reject any words in the dictionary whose corresponding letter counts aren't equal to or greater than the summary obtained on the right side of the inner product expression.

The fifth and final expression is similar, but calculates the known maximum letter counts:

Code:
((+/[2]l∘.=d)∧.≤(6,⍳5)[1+(s<+/d∘.=g)×s⌊+/d∘.=g])

A more complex operation was needed to reduce the maximums; after all, using the maximum function ⌈ wouldn't work, because that would reveal knowledge when it shouldn't be known. What I did was calculate the array of minimums and then multiply it with a bit vector of comparisons. The code (s<+/d∘.=g) produces an array in which every element is a zero unless the summary of the guess is greater than the summary of the word to guess for that letter, in which case the game would show grey squares and thus the maximum would be known. By multiplying the array of minimums by this array of comparisons, I obtain an array where only the known maximums remain, and all else will be zero:

Code:
(s<+/d∘.=g)×s⌊+/d∘.=g

I still need to convert zero to the proper integer for unknown maximums, however, and I used six. I saw no better way to perform this translation than a table mapping. The indices are incremented by one due to the Index Origin ⎕IO. The code 6,⍳5 will produce the vector 6 1 2 3 4 5:

Code:
(6,⍳5)[1+(s<+/d∘.=g)×s⌊+/d∘.=g]

After that, the fifth expression is identical to the fourth, except that its inner product code is ∧.≤ instead, meaning it requires the corresponding letter counts to be less than or equal to those on the right.

With all of these masks combined, the dictionary can be stripped of all words that can't be the word to guess. Sooner or later, the word to guess is found by this method.
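To tie the five masks together for non-APL readers, here is a hypothetical word-at-a-time Python port of the same filtering logic (the original works on the whole dictionary at once):

```python
from collections import Counter

def filter_dictionary(words, guess, target):
    # Python sketch of the five APL masks described above.
    s = Counter(target)   # summary of the word to guess
    g = Counter(guess)    # summary of the guess
    kept = []
    for w in words:
        wc = Counter(w)
        # 1. Reject words containing letters absent from the target.
        if any(ch in w for ch in guess if ch not in target):
            continue
        # 2. Reject words with a non-matching guess letter left in place.
        if any(w[i] == guess[i] for i in range(len(guess))
               if guess[i] != target[i]):
            continue
        # 3. Require the matched (green) letters in their known positions.
        if any(w[i] != target[i] for i in range(len(guess))
               if guess[i] == target[i]):
            continue
        # 4. Require at least the known minimum count of each letter.
        if any(wc[ch] < min(s[ch], g[ch]) for ch in g):
            continue
        # 5. Respect the known maximum count where grey squares reveal it.
        if any(g[ch] > s[ch] and wc[ch] > s[ch] for ch in g):
            continue
        kept.append(w)
    return kept
```

The loop-and-branch structure here is exactly the word-at-a-time thinking the APL avoids; each `continue` corresponds to one of the five combined masks.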
 
Trying to see if I can squeeze more performance out of GPU matmul given specific constraints...

Code:
#pragma once
#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>

#ifndef LOP_XORAND
__inline__ __device__ int lop_xorand(int A, int B){
  // lop3 LUT operand: 0xf0 ^ (0xcc & 0xaa) == 0x78, i.e. f(a,b,c) = a ^ (b & c).
  // With c = 0x80000000 below, this flips A's sign bit whenever B's sign bit is set.
  constexpr uint32_t immLut = 0xf0 ^ (0xcc & 0xaa);
  int temp;
  asm("lop3.b32 %0, %1, %2, %3, %4;" : "=r"(temp) : "r"(A), "r"(B), "r"(0x80000000), "n"(immLut));
  return temp;
}
#define LOP_XORAND
#endif

__global__ void naive_bitnet(int M, int N, int K, const float *A, const float *B, float *C) {
  const uint x = blockIdx.x * blockDim.x + threadIdx.x;
  const uint y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < M && y < N) {
    int tmp = 0;
    int a_v;
    int b_v;
    int b_v_prime;
    float counter = 0.0;
    for (int i = 0; i < K; ++i){
      // Reinterpret the float entries as their raw bit patterns.
      a_v = *(int *)(&(A[x * K + i]));
      b_v = *(int *)(&(B[i * N + y]));
      // Bit 29 of B's raw bits selects whether this weight contributes at all.
      b_v_prime = (b_v >> 29) & 0x1;
      // Contribute A with its sign conditionally flipped by B's sign bit.
      tmp = (b_v_prime) ? lop_xorand(a_v, b_v) : 0;
      counter += *(float *)(&tmp);
    }
    C[x * N + y] = counter;
  }
}

It's slightly faster than the naive GEMM implementation, but as I add more optimizations they end up leveling out to the same... going to keep going just to see if this trick actually pays off and goes way faster than matmul, but not holding my breath on this...

Code:
DESKTOP-7HFB8KB:~/SGEMM_CUDA/build$ DEVICE=0 ./sgemm 1
Running kernel 1 on device 0.
Max size: 4096
dimensions(m=n=k) 128, alpha: 0.5, beta: 3
Average elapsed time: (0.000077) s, performance: (   54.3) GFLOPS. size: (128).
dimensions(m=n=k) 256, alpha: 0.5, beta: 3
Average elapsed time: (0.000163) s, performance: (  205.8) GFLOPS. size: (256).
dimensions(m=n=k) 512, alpha: 0.5, beta: 3
Average elapsed time: (0.001239) s, performance: (  216.7) GFLOPS. size: (512).
dimensions(m=n=k) 1024, alpha: 0.5, beta: 3
Average elapsed time: (0.007107) s, performance: (  302.1) GFLOPS. size: (1024).
dimensions(m=n=k) 2048, alpha: 0.5, beta: 3
Average elapsed time: (0.053378) s, performance: (  321.9) GFLOPS. size: (2048).
dimensions(m=n=k) 4096, alpha: 0.5, beta: 3
Average elapsed time: (0.436171) s, performance: (  315.1) GFLOPS. size: (4096).
DESKTOP-7HFB8KB:~/SGEMM_CUDA/build$ DEVICE=0 ./bitnet 1
Running kernel 1 on device 0.
Max size: 4096
dimensions m=128, n=128
Average elapsed time: (0.000047) s, performance: (   88.7) GFLOPS. size: (128).
dimensions m=256, n=256
Average elapsed time: (0.000109) s, performance: (  307.2) GFLOPS. size: (256).
dimensions m=512, n=512
Average elapsed time: (0.000674) s, performance: (  398.5) GFLOPS. size: (512).
dimensions m=1024, n=1024
Average elapsed time: (0.003911) s, performance: (  549.0) GFLOPS. size: (1024).
dimensions m=2048, n=2048
Average elapsed time: (0.028781) s, performance: (  596.9) GFLOPS. size: (2048).
dimensions m=4096, n=4096
Average elapsed time: (0.231138) s, performance: (  594.6) GFLOPS. size: (4096).

edit: I'm dumb and don't know how to do the fancy code embeds or prevent my quotes from getting formatted into smilies, sorry.
 
edit: I'm dumb and don't know how to do the fancy code embeds or prevent my quotes from getting formatted into smilies, sorry.
Quote my post and use the "Toggle BB Code" button (there's also a toolbar option under the three vertical dots):
Code:
DESKTOP-7HFB8KB:~/SGEMM_CUDA/build$ DEVICE=0 ./sgemm 1
Running kernel 1 on device 0.
Max size: 4096
dimensions(m=n=k) 128, alpha: 0.5, beta: 3
Average elapsed time: (0.000077) s, performance: (   54.3) GFLOPS. size: (128).
dimensions(m=n=k) 256, alpha: 0.5, beta: 3
Average elapsed time: (0.000163) s, performance: (  205.8) GFLOPS. size: (256).
dimensions(m=n=k) 512, alpha: 0.5, beta: 3
Average elapsed time: (0.001239) s, performance: (  216.7) GFLOPS. size: (512).
dimensions(m=n=k) 1024, alpha: 0.5, beta: 3
Average elapsed time: (0.007107) s, performance: (  302.1) GFLOPS. size: (1024).
dimensions(m=n=k) 2048, alpha: 0.5, beta: 3
Average elapsed time: (0.053378) s, performance: (  321.9) GFLOPS. size: (2048).
dimensions(m=n=k) 4096, alpha: 0.5, beta: 3
Average elapsed time: (0.436171) s, performance: (  315.1) GFLOPS. size: (4096).
DESKTOP-7HFB8KB:~/SGEMM_CUDA/build$ DEVICE=0 ./bitnet 1
Running kernel 1 on device 0.
Max size: 4096
dimensions m=128, n=128
Average elapsed time: (0.000047) s, performance: (   88.7) GFLOPS. size: (128).
dimensions m=256, n=256
Average elapsed time: (0.000109) s, performance: (  307.2) GFLOPS. size: (256).
dimensions m=512, n=512
Average elapsed time: (0.000674) s, performance: (  398.5) GFLOPS. size: (512).
dimensions m=1024, n=1024
Average elapsed time: (0.003911) s, performance: (  549.0) GFLOPS. size: (1024).
dimensions m=2048, n=2048
Average elapsed time: (0.028781) s, performance: (  596.9) GFLOPS. size: (2048).
dimensions m=4096, n=4096
Average elapsed time: (0.231138) s, performance: (  594.6) GFLOPS. size: (4096).
 
This smiley annoyance was the only reason I had to edit my last post. I recommend using the [PLAIN][/PLAIN] BB code for this, since it's the code's only purpose.
 