I was trying to come up with an algorithm for game physics in 6502 asm. I had the idea of using self-modifying code in some parts to save some cycles and memory as well. Then it occurred to me that the 65C816 used in the SNES can widen its registers to 16 bits, so it might be enough to just do 16-bit fixed-point math in the accumulator and shift the result 8 bits right.
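To make the 16-bit idea concrete, here is a minimal sketch of it modeled in C rather than asm, assuming an 8.8 layout (high byte is whole pixels, low byte is the fraction). The name `step_16bit` and the layout are made up for illustration, not taken from any real codebase.

```c
#include <stdint.h>

/* Hypothetical helper: add an 8.8 fixed-point speed to an 8.8 position
   and return the whole-pixel part, i.e. the result shifted right by 8. */
static uint8_t step_16bit(uint16_t *pos_8_8, uint16_t speed_8_8) {
    /* Wraps at 16 bits, like a 16-bit accumulator would. */
    *pos_8_8 = (uint16_t)(*pos_8_8 + speed_8_8);
    return (uint8_t)(*pos_8_8 >> 8);
}
```

For example, starting at 0x0A80 (10.5) and adding 0x0180 (1.5) leaves 0x0C00 in the position, and the function returns 12.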
Well, it turns out the 16-bit part has been implemented in an awkward way. You see, there aren't special opcodes for 16-bit operations. For example, on the 68K you have moveq and move.b/w/l depending on whether you're dealing with bytes, words or long words. But on the 65C816 every accumulator operation is either 8-bit or 16-bit at any given time, depending on the M flag in the status register (the X flag does the same job for the index registers). Those flags only work in native mode; emulation mode is always 8-bit. There is no lda.w or sta.w, which means that in 16-bit mode you can't avoid burning one extra cycle and eating one extra byte with immediate instructions. Yeah, you could flip the width back and forth with SEP and REP, but each of those costs 3 cycles and 2 bytes, and guess what! Say goodbye to the upper byte of X and Y when you drop the index registers to 8 bits! Now, it does sound dramatic, but it still appears to take fewer cycles than the same thing would on the 68k.
BUT, what I then realized is that doing 16-bit math is unnecessary for this scenario anyway, as the shifting part burns a lot of cycles: no less than 16 cycles for shifting by 8 bits (eight LSR A at 2 cycles each), and it also eats up 8 bytes because you can only shift one bit at a time. Now, the Python crowd is probably gonna be shocked and confused, but you are better off using the 8-bit mode and 8-bit math, which saves some bytes and cycles. Now, you still want 8 bits of fractional precision, which is fair enough. However, you can just keep the fraction in the accumulator, do an ADC immediate, and use BCS to reach an INC on some zero-page address if carry is set. Turns out it's actually faster. Oh, and the value in the accumulator just wraps around after it goes over $FF; the carry flag is what tells you it happened.
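Here is a rough C model of that 8-bit trick, assuming the fraction lives in one byte (playing the accumulator) and the whole-pixel position in another (playing the zero-page address). `Mover` and `step_8bit` are hypothetical names, and the asm in the comment is just one plausible way to write the sequence.

```c
#include <stdint.h>

/* Hypothetical state: frac plays the accumulator, pos plays the
   zero-page byte that holds the whole-pixel position. */
typedef struct {
    uint8_t frac;
    uint8_t pos;
} Mover;

/* Models one plausible sequence (an assumption, not the original code):
     CLC
     ADC #speed   ; add the fractional speed
     BCC skip     ; no carry, no whole pixel crossed
     INC pos      ; carry set, move one pixel
   skip:                                                        */
static void step_8bit(Mover *m, uint8_t speed) {
    uint16_t sum = (uint16_t)(m->frac + speed);
    m->frac = (uint8_t)sum;   /* wraps around past $FF */
    if (sum > 0xFF) {         /* this is the carry flag */
        m->pos++;
    }
}
```

With frac = $80 and pos = 10, adding a speed of $80 wraps frac to $00 and bumps pos to 11. Note that a single fraction byte like this can only express speeds below one pixel per step.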
Now, the Python crowd is probably going to complain that using immediate values means I can only have one fixed speed. But this piece of code is supposed to live in RAM, so should the speed change, I can just overwrite the immediate operand with the new value. I don't even need to write a macro that would generate 256 ADC immediate combinations. Besides, ADC immediate takes 2 cycles where ADC from zero page takes 3, so it's a third cheaper. And you'd have to store the new speed value somewhere anyway, so you might as well store it in the instruction itself. In the end it's almost twice as fast as taking advantage of the 16-bit functionality. Now, I have yet to look at how many cycles such a thing would burn on the 68K, but I have a feeling that it would burn more of them. The PC Engine's CPU would especially benefit from this, since it's a 6502 on steroids running at roughly the same clock speed as the 68K in the Mega Drive (about 7.16 MHz versus about 7.67 MHz).
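The operand patching can be sketched in C by treating the routine as a byte array in RAM. The byte layout, the zero-page addresses and the names `routine`, `set_speed` and `run_routine` are all assumptions for illustration. The opcodes are the standard 6502 encodings, but `run_routine` only models the effect of the routine and is not an emulator.

```c
#include <stdint.h>

/* Hypothetical zero-page layout: $00 = fraction, $01 = pixel position. */
enum { ZP_FRAC = 0x00, ZP_POS = 0x01 };

/* The routine as it might sit in RAM (assumed layout). */
static uint8_t routine[] = {
    0xA5, ZP_FRAC,  /* LDA frac                           */
    0x18,           /* CLC                                */
    0x69, 0x00,     /* ADC #speed, operand gets patched   */
    0x85, ZP_FRAC,  /* STA frac                           */
    0x90, 0x02,     /* BCC +2, skip the INC when no carry */
    0xE6, ZP_POS,   /* INC pos                            */
    0x60            /* RTS                                */
};

/* Offset of the ADC immediate operand inside the routine. */
enum { SPEED_OPERAND = 4 };

/* Changing the speed is a single store into the code itself. */
static void set_speed(uint8_t speed) {
    routine[SPEED_OPERAND] = speed;
}

/* Models what the routine does, reading the speed from the patched byte. */
static void run_routine(uint8_t zp[256]) {
    uint16_t sum = (uint16_t)(zp[ZP_FRAC] + routine[SPEED_OPERAND]);
    zp[ZP_FRAC] = (uint8_t)sum;
    if (sum > 0xFF) {
        zp[ZP_POS]++;
    }
}
```

Calling `set_speed(0x80)` and then running the routine twice from frac = $00, pos = 10 leaves frac at $00 and pos at 11. The point of the design is that the speed never needs its own variable: the operand byte is the variable.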
Here's a screenshot of some sort of mockup I made in Notepad.
Edit: I messed up, turns out I forgot to add a sta instruction. And of course, if you're storing a single byte you have to switch the accumulator back to 8 bits first (SEP #$20). That bumps the total cycle count to over 30 cycles.