Make the most of compiled C loops on the 68000

https://dciabrin.net/posts/make-the-most-of-compiled-c-loops-on-the-68000/make-the-most-of-compiled-c-loops-on-the-68000/

69•floitsch•4mo ago

Comments

dmitrygr•4mo ago

Significant further gains are possible by simply unrolling the loop 8 or 16 times, which lowers the DBF overhead per word written.

p_l•4mo ago

The step with declaring hw registers in assembly reminds me that converting an integer value to a pointer is, IIRC, at best implementation-defined and at worst UB, and that playing around with volatile does not save you from an overzealous optimizer.

Arguably every hardware register should be declared that way, as a symbol.
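
To make the two styles concrete, here is a minimal sketch. The names `REG_VRAMRW_CAST`, `REG_VRAMRW` and `put_word` are assumptions for illustration, and the address is simply the one that appears in the listing quoted elsewhere in this thread (3932162 is 0x3C0002). The host-side definition is only a stand-in so the sketch links off-target; on real hardware the symbol would be given its address outside C, for example in an assembly file or a linker script.

```c
#include <stdint.h>

/* Style 1: cast an integer literal to a pointer.
   The integer-to-pointer conversion is implementation-defined.
   Never dereferenced in this host sketch. */
#define REG_VRAMRW_CAST (*(volatile uint16_t *)0x3C0002)

/* Style 2: an ordinary external object. C only sees a symbol;
   its address is supplied by assembly or the linker script. */
extern volatile uint16_t REG_VRAMRW;

#ifndef __m68k__
/* Host stand-in (assumption) for the symbol that the target
   build would define outside C. */
volatile uint16_t REG_VRAMRW;
#endif

/* Write one word through the symbol-style register. */
static void put_word(uint16_t value)
{
    REG_VRAMRW = value;
}
```

With the second style the compiler sees a normal object with a link-time constant address, so no cast appears anywhere in the C source.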

pjmlp•4mo ago

That was a common feature on Borland and Microsoft compilers for MS-DOS.

p_l•4mo ago

I mean, at its most basic, it's a feature from even the earliest compilers for C.

Not sure if Borland or MS shipped big fat symbol tables for all hardware registers of an IBM PC though?

pjmlp•4mo ago

With Assembly.

robinsonb5•4mo ago

Interestingly, gcc-amigaos-gcc 6.5 uses dbra without having to jump through any of those contortions, as long as the optimisation level is set to at least -O1:

  _clear_screen:
        move.w #28672,3932160
        move.w #1,3932164
        move.l #3932162,a0
        move.w #-13570,d1
        move.w #1279,d0
  .L2:
        move.w d1,(a0)
        dbra d0,.L2
        rts
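
For readers who want to map that listing back to C, here is a rough sketch of a function that could plausibly produce it. It is not the article's source: the struct `vram_regs_t` and its field names are hypothetical, chosen so the three fields land on the three addresses in the listing (0x3C0000, 0x3C0002, 0x3C0004), and the loop is written as a count-down `do`/`while`, which mirrors what `dbra` does. Whether a given compiler version actually emits `dbra` for it is not guaranteed.

```c
#include <stdint.h>

/* Hypothetical register block: address, data port, modulo. */
typedef struct {
    volatile uint16_t addr; /* 0x3C0000 in the listing */
    volatile uint16_t rw;   /* 0x3C0002 in the listing */
    volatile uint16_t mod;  /* 0x3C0004 in the listing */
} vram_regs_t;

static void clear_screen(vram_regs_t *regs)
{
    regs->addr = 0x7000;        /* move.w #28672,3932160 */
    regs->mod = 1;              /* move.w #1,3932164 */

    uint16_t i = 40 * 32 - 1;   /* 1279, the value loaded in d0 */
    do {
        /* 0xCAFE, printed as -13570 in the listing */
        regs->rw = (0xc << 12) | 0xafe;
    } while (i-- != 0);         /* 1280 writes in total */
}
```

On the target this would be called as `clear_screen((vram_regs_t *)0x3C0000)`; on a host it can be pointed at an ordinary struct for testing.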

dlundqvist•4mo ago

I tried this with the old SAS/C Amiga compiler. It loaded each address into A0 and then moved the value into (A0) with the next instruction, so the setup part was a bit less efficient. It also refused to use "dbra" no matter what I tried.

odipar•4mo ago

I was once into 68k so I may be rusty, but shouldn't it be move.w d1,(a0)+ (increment the target address after each step)?

dlundqvist•4mo ago

The hardware increments an internal pointer after each access. The fixed address held in a0 is just the window onto wherever that internal pointer currently points.
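
A small host-side model may help show why a fixed destination address is enough. This is only a sketch of the behaviour described above, not of the real video hardware: `vram_model_t`, `port_write` and the memory size `MODEL_WORDS` are all made up for illustration.

```c
#include <stdint.h>

enum { MODEL_WORDS = 0x8000 }; /* arbitrary size for the model */

typedef struct {
    uint16_t addr;             /* internal pointer */
    uint16_t mod;              /* step added after each access */
    uint16_t mem[MODEL_WORDS]; /* stand-in for video memory */
} vram_model_t;

/* Every write goes through the same "port"; the model, like the
   hardware described above, advances its own pointer afterwards. */
static void port_write(vram_model_t *v, uint16_t value)
{
    v->mem[v->addr % MODEL_WORDS] = value;
    v->addr = (uint16_t)(v->addr + v->mod);
}
```

Calling `port_write` 1280 times with `addr` starting at 0x7000 and `mod` set to 1 fills 1280 consecutive words, even though the caller never changes the address it writes to.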

chris_j•4mo ago

One thing that I heard from folks who do development for retro Atari platforms is that the 68k support in GCC has been getting worse as time has gone on, and it's very difficult to get the maintainers to accept patches to improve it, since 68k is not exactly widely used at this point.

Specifically, I heard that the 68k backend keeps getting worse, whilst the front-end keeps getting better. So choosing a GCC version is a case of examining the tradeoffs between getting better AST-level optimisations from a newer version, or more optimised assembly language output from an earlier version.

I imagine GCC 6.5 probably has a backend that makes better use of the 68k chip than the GCC 11.4 that ngdevkit uses (such as knowing when to use dbra) but is probably worse in other ways due to an older and less capable frontend.

kstenerud•4mo ago

SNK were the gods of the 68000. I still remember back in the day getting a bug report on my 68000 emulator:

When playing King of Fighters, the time counter would go down to 0 and then wrap around to 99, effectively preventing the round from ending.

Eventually I tracked it down to the behavior of SBCD (Subtract Binary Coded Decimal): Internally, the chip actually does update the overflow flag reliably (it's marked as undefined in the docs). SNK was checking the V flag and ending the round when it got set.

https://github.com/kstenerud/Musashi/blob/master/m68k_in.c#L...

SBCD was an old throwback instruction that was hardly used anymore, and the register variant took 6 cycles to complete (vs 4 for binary subtraction).

HOWEVER... For displaying the timer counter on-screen, they saved a ton of cycles with this scheme because extracting the digits from a BCD value is a simple shift by 4 bits (6 cycles) rather than a VERY expensive divide (140 cycles).
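
The trade-off can be sketched in C. This is an illustration of the idea only: the helper names are hypothetical, and `bcd_dec` models a plain two-digit BCD decrement with a borrow indicator, not the 68000's SBCD instruction or its flags.

```c
#include <stdint.h>

/* Digits of a packed-BCD byte: a shift and a mask. */
static uint8_t bcd_tens(uint8_t bcd) { return (uint8_t)(bcd >> 4); }
static uint8_t bcd_ones(uint8_t bcd) { return (uint8_t)(bcd & 0x0F); }

/* Digits of a binary byte (0..99): a divide and a remainder. */
static uint8_t bin_tens(uint8_t n) { return (uint8_t)(n / 10); }
static uint8_t bin_ones(uint8_t n) { return (uint8_t)(n % 10); }

/* Decrement a packed-BCD byte in the range 0x00..0x99.
   *borrow is set to 1 when the value wraps from 00 to 99. */
static uint8_t bcd_dec(uint8_t bcd, int *borrow)
{
    *borrow = 0;
    if ((bcd & 0x0F) != 0) {
        return (uint8_t)(bcd - 1);        /* e.g. 0x42 -> 0x41 */
    }
    if (bcd != 0) {
        return (uint8_t)(bcd - 0x10 + 9); /* e.g. 0x10 -> 0x09 */
    }
    *borrow = 1;
    return 0x99;                          /* 00 wraps to 99 */
}
```

The wrap from 00 to 99 in `bcd_dec` is the same wrap-around the bug report describes; the game needed some signal, here the borrow, to notice it.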

kevin_thibedeau•4mo ago

You don't need division to convert to decimal, though it will still be slower than using BCD operations.

MobiusHorizons•4mo ago

Oh? How do you do it? Some kind of lookup table?

kevin_thibedeau•4mo ago

https://en.wikipedia.org/wiki/Double_dabble

A software implementation with masks and shifts will beat traditional CISC dividers.
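
As a minimal sketch of double dabble with masks and shifts, here is an 8-bit version in C. The function name `double_dabble8` is an assumption, and the code makes no claim about how it would compare in cycles on a 68000.

```c
#include <stdint.h>

/* Convert an 8-bit binary value to three packed-BCD digits
   (hundreds, tens, ones) using shift-and-add-3. */
static uint16_t double_dabble8(uint8_t bin)
{
    /* Bits 0-7 hold the remaining binary input,
       bits 8-19 hold the BCD digits built so far. */
    uint32_t scratch = bin;

    for (int i = 0; i < 8; i++) {
        /* Any BCD digit of 5 or more gets 3 added before the
           shift, so that doubling it carries into the next digit. */
        if (((scratch >> 8) & 0xF) >= 5)  scratch += 3u << 8;
        if (((scratch >> 12) & 0xF) >= 5) scratch += 3u << 12;
        if (((scratch >> 16) & 0xF) >= 5) scratch += 3u << 16;
        scratch <<= 1;
    }
    return (uint16_t)((scratch >> 8) & 0xFFF);
}
```

For example, `double_dabble8(99)` yields 0x99 and `double_dabble8(255)` yields 0x255, with no division involved.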

kstenerud•4mo ago

Technically no, but they were also always fighting against the ROM size, trying to keep costs down. Every byte helped.

veltas•4mo ago

> Note how gcc is smart enough to detect that the expression ((0xc<<12) | 0xafe) is constant, so it can skip shifts and bitwise assembly operations and just emit the resulting immediate value at line 14. The same goes for the loop condition, gcc emits constant 1280 at line 10 in place of the multiplication 40x32. A classic compiler optimization called constant folding, but nice nonetheless.

For any C compiler this is a requirement rather than an optimisation, and has been from early on: C allows constant expressions, not just literal constants, wherever a constant is needed, such as the size of a statically allocated array. Folding elsewhere is not guaranteed, but you will see the constant evaluated at compile time even at -O0, because the compiler already needs the machinery for the required cases, and it is often harder to avoid folding than to always fold.
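
A few places where C itself demands compile-time evaluation, as a small sketch. The identifiers are hypothetical; the expressions are the ones quoted from the article. `_Static_assert` needs C11 or later.

```c
#include <stdint.h>

/* Enumerator values must be integer constant expressions. */
enum {
    SCREEN_WORDS = 40 * 32,
    FILL_WORD    = (0xc << 12) | 0xafe
};

/* The size of a static array must be a constant expression too,
   so the compiler has to evaluate 40 * 32 itself. */
static uint16_t shadow[40 * 32];

/* Checked at compile time, at any optimisation level. */
_Static_assert(SCREEN_WORDS == 1280, "40 * 32 should be 1280");
_Static_assert(FILL_WORD == 0xCAFE, "fill word should be 0xCAFE");

/* Number of elements in the array above. */
static unsigned shadow_len(void)
{
    return (unsigned)(sizeof shadow / sizeof shadow[0]);
}
```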

jcmeyrignac•4mo ago

You can optimize further by unrolling the loop. For example:

  .L2:
        move.w d1,(a0)
        move.w d1,(a0)
        move.w d1,(a0)
        move.w d1,(a0)
        dbra d0,.L2
        rts
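
In C, the same idea might look like the sketch below. The function names are hypothetical, and it assumes the word count is a non-zero multiple of four. Note that with four stores per iteration the loop counter has to shrink accordingly: for 1280 words it would start at 1280 / 4 - 1 = 319 rather than 1279.

```c
#include <assert.h>
#include <stdint.h>

/* Loop iterations needed when each iteration stores 4 words. */
static uint16_t unrolled_iterations(uint16_t words)
{
    return (uint16_t)(words / 4);
}

/* Write `value` to the same port `words` times, 4 stores per
   iteration. `port` is volatile so no store is optimised away. */
static void fill_port_unrolled4(volatile uint16_t *port,
                                uint16_t value, uint16_t words)
{
    assert(words != 0 && words % 4 == 0);

    uint16_t i = (uint16_t)(unrolled_iterations(words) - 1);
    do {
        *port = value;
        *port = value;
        *port = value;
        *port = value;
    } while (i-- != 0); /* count-down shape, as with dbra */
}
```

Whether the compiler keeps the four stores back to back and closes the loop with `dbra` depends on the compiler and its version, as discussed elsewhere in this thread.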

allenrb•4mo ago

But what about the effect on cache… oh, wait!

;-)

commandlinefan•4mo ago

This is cool, but at that point, why write it in C at all? Why not just hand-roll some assembler? It's targeted at a specific platform anyway.
