Reimplementing an OS kernel as an ISA

Continuing the discussion from Ai age verification:

Kernel vs. OS

The kernel-level functionality of an operating system has been a mainstay for me since the early 90s. My first computer had a firmware kernel in ROM and therefore didn’t have, or need, an operating system. My Commodore 64’s KERNAL ROM was all it needed to access all of its peripherals. Each device was a “smart” device with a processor of its own.

Most modern-day C64 hobby groups have far better peripherals than the C64 itself had, even when it was new… especially when it was new! Most of the original software still works! I say “most” because every now and then you come across a program that does something clever that the current-gen flash-memory cartridges don’t support.

Chief among them is spawning tiny sub-threads on the 6502 CPU of the 1541 disk drive’s own controller board. The drive controller’s processor is almost identical to the C64’s 6510: the clock speed is the same, and the only real difference is the 6510’s on-chip I/O port, mapped to the first two zero-page addresses and used for bank switching. Those registers aren’t needed on the disk-drive controller because the total address space of all its peripheral controllers, ROM and RAM is no more than 64 KB.

The Software Support International product, the RAMBOard for the 1541, actually added 32 KB of additional RAM to the controller board’s address space! Of course it was intended to be used with the Maverick 5.0 copy-protection-defeating copier, but it could have been more! Spawning a data-processing thread on the disk drive’s CPU would have made it a parallel accelerator and memory-expansion peripheral in one! That 32 KB RAM expansion was only $20 USD in 1998, but it could have been a hobby coder’s second wind for the platform! Commodore’s DOS ran entirely on the drive board and even had memory-write, memory-read and memory-execute commands as a feature!

My earlier example of spawning a “child thread” on a second CPU in readily-available hardware, by repurposing a cheap RAM expansion for the C64’s floppy-drive controller, is a microcosm of something that could be truly innovative!

Bare-Metal Coding

In the past I’ve tried to envision the implications of using a bytecode as a distribution medium. It doesn’t stop there. If that bytecode were a superset of the functionality of all the peripherals used by the machine, embracing not only WASM but also SPIR-V (the instruction bytecode of the GPU’s shader architecture) in a common entity, the ISA would no longer be specific to the CPU itself!

In 1998 I attended my first Amiga Dev-Con, the Amiga being no longer Commodore’s asset but now Gateway 2000 Inc.’s. The Amiga was the first computer with a semi-modern GPU: its Blitter core could copy sections of memory around the multimedia chipset, and its Copper core adjusted registers in sync with the video raster position. A dual-processor graphics pipeline had not previously been used in 3D graphics, but the Amiga used one in 2D.

I was disappointed as I went home after that conference, not because I walked home without the $1,000,000 USD for my innovations, but because I didn’t get a steady job there. Money, for me at age 24, was just gravy; the ability to make a difference in the industry was the main course. I actually turned down the $1,000,000 because of the strings attached and the unwillingness of the executives to budge. My counter-offer: drop the non-compete clause or make it $10,000,000. If they wouldn’t budge, I would take my ball and go home with it. If they wouldn’t innovate, as puppets of Bill Gates, I would. I took the bus home to Iowa from Missouri almost penniless, but at least I signed neither the non-compete nor the non-disclosure agreement.

What they got that day changed the nature of GPUs forever: the dual-shader pipeline. Putting the floating-point-heavy compute on an offloaded vertex shader so that the pixel shader could have more bandwidth to do “fancy stuff”. In that case it was processing texture fragments by copying them directly to the graphics port, using a second chip with a second thread to save bandwidth, earning the pixel pipeline the title “fragment shader” in OpenGL. All it rendered were fetch instructions and pointers into the framebuffer’s data stream instead of the data itself. It was just like the Blitter and Copper, but now in 3D and with floating point!

Applying Custom Silicon

Back then there wasn’t much stomach for making custom silicon, no matter how much it would improve capabilities. Masks and dies were mostly done by hand because there wasn’t much software support for VHDL or Verilog beyond doing the gate layout in those languages.

Being a generalist and an innovator at the same time was hard on my side as well. I knew far more than I realized at the time. I could diagnose problems and come up with solutions as far outside the box as you could get, all without even seeing source code or gate layouts. But if I tried to actually build anything myself, I’d get the “blank page” disorder of not knowing where to start! The reason I could think outside the box so well was that I had only a 2-year tech-school degree in hardware and was entirely self-taught in software. To make matters worse, I had no experience with organizing projects or management. The innovations I passed along that day have fueled the whole tech sector for the past 2+ decades! The entire tech sector!

Dataflow Architecture

One technology that I didn’t know about, but that the guys there told me about, was graph instruction sets. Graph instruction sets are so internally parallel that they combine the techniques of hyper-threading and superscalar execution in one elegant core design. My recent studies reveal that certain abstractions needed to implement parallelism in traditional von Neumann CPU architectures are completely unnecessary in a graph ISA, to the point that they are just artifacts in an instruction cracker that expands them to simpler operations with no backend support needed.

Floating Point

Scientific notation, known in the tech sector as floating point, is considered essential today: it’s used for everything from the vector units in CPUs, to the vertex pipeline of the dual-pipeline GPU mentioned earlier, to the NPUs required by the sinking Titanic of Win11, and ultimately by science itself. Yet it is just an unnecessary abstraction in a graph ISA!

In my prototype processor, the loads and stores of the FPU instructions map to micro-ops that are essentially vectorized scatter-gather ops, converting to and from structures containing just mantissas, exponents and sign bits. Soft-floats have never been so parallel!
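To make the gather/scatter idea concrete, here is a minimal software sketch of that decomposition: IEEE-754 doubles split into parallel sign/exponent/mantissa arrays (a structure-of-arrays layout) and reassembled afterwards. The helper names are my own invention, not the actual micro-ops.

```python
import struct

def split_floats(xs):
    """Gather step: decompose IEEE-754 doubles into parallel
    sign/exponent/mantissa arrays (structure-of-arrays layout)."""
    signs, exps, mants = [], [], []
    for x in xs:
        bits = struct.unpack("<Q", struct.pack("<d", x))[0]
        signs.append(bits >> 63)                 # 1 sign bit
        exps.append((bits >> 52) & 0x7FF)        # 11-bit biased exponent
        mants.append(bits & ((1 << 52) - 1))     # 52-bit fraction
    return signs, exps, mants

def join_floats(signs, exps, mants):
    """Scatter step: reassemble doubles from the three component arrays."""
    out = []
    for s, e, m in zip(signs, exps, mants):
        bits = (s << 63) | (e << 52) | m
        out.append(struct.unpack("<d", struct.pack("<Q", bits))[0])
    return out

s, e, m = split_floats([1.5, -2.0, 0.0])
print(join_floats(s, e, m))  # round-trips exactly
```

Once the components live in separate arrays, exponent alignment and mantissa arithmetic become independent lanes, which is what makes the soft-float formulation parallel-friendly.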

Condition Codes

In a traditional CPU, condition codes are bundled into a single register. On many CPUs, that condition-code register becomes an implicit extra parameter passed in and out of every instruction. In a graph processor, the calling overhead would be terrible if that were done, and it is usually unnecessary anyway! RISC-V suffers when emulating the ARM and x86 ISAs because you keep having to add compare ops to materialize flags that native RISC-V code would never need. In RISC-V the comparison is explicit in every case; in the others mentioned, it is implicit.
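A sketch of that emulation cost, under my own illustrative framing: a flag-setting 8-bit ADD, as a flags-register ISA produces in hardware for free. On an ISA without condition codes, each flag below becomes an explicit extra operation.

```python
def add_with_flags(a, b, bits=8):
    """Emulate a flag-setting 8-bit ADD. Each flag computed here is
    'free' hardware on x86/ARM but an extra explicit op on a flagless
    ISA -- the materialization overhead described above."""
    mask = (1 << bits) - 1
    full = (a & mask) + (b & mask)
    r = full & mask
    flags = {
        "Z": r == 0,                    # zero
        "C": full > mask,               # carry out of the top bit
        "N": bool(r >> (bits - 1)),     # negative (top bit set)
        # signed overflow: both operands disagree in sign with the result
        "V": bool((((a ^ r) & (b ^ r)) >> (bits - 1)) & 1),
    }
    return r, flags

r, f = add_with_flags(0x7F, 0x01)
print(hex(r), f)  # 0x7F + 1 wraps the signed range: N and V set
```

Four flags means up to four extra instructions per emulated ADD, which is exactly why a graph ISA that computes only the predicates actually consumed comes out ahead.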

In my core, the unsigned byte has a secret identity: the 8-lane packed-boolean vector! Even the experimental MIT graph processors of the late 80s, which the guys at Amiga Inc. (the Gateway 2000 subsidiary back in 1998) told me about, probably supported it without realizing how important it was! Imagine casting as many as 8 booleans, in vector form, to an integer and indexing into a 256-entry jump table in just one op!
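The packed-boolean dispatch can be sketched in a few lines: pack up to 8 booleans into one byte, then use that byte to index a 256-entry table in a single step. The lane ordering and handler names are illustrative choices of mine; the idea leaves the ordering undefined.

```python
def pack_bools(bools):
    """Pack up to 8 booleans into one unsigned byte (lane i -> bit i).
    The lane order is an arbitrary illustrative choice."""
    byte = 0
    for i, b in enumerate(bools):
        byte |= (1 if b else 0) << i
    return byte

# A 256-entry jump table: one handler per combination of the 8 flags.
# Handlers are hypothetical stand-ins that just report their index.
table = [lambda idx=i: f"handler_{idx}" for i in range(256)]

conds = [True, False, True, False, False, False, False, False]
print(table[pack_bools(conds)]())  # one indexed dispatch for 8 conditions
```

In hardware this collapses an 8-way cascade of conditional branches into a single table-indexed jump, which is the one-op dispatch described above.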

The MIT-derived Monsoon chips failed not because the chips were bad, but because the compiler technology of the day was lacking! They probably didn’t even optimize for the boolean-vector case I mentioned! They might even have had an op for a branch without realizing it’s just the degenerate case of a one-lane vector of booleans! How redundant! And by leaving the ordering of the condition codes in the boolean vector undefined, the optimizations couldn’t be easier!

Big Vectors in RAM

My CPU doesn’t have (or need) scalar ops, because a single-lane vector represents them. Armv9’s SVE2 supports a vector size of up to 2048 bits; that’s 256 bytes. Where mine surpasses theirs is that the activation records on the stack and the heap pointers are pegged in the caches so often that they don’t actually need to be addressable registers at all! All of them are basically offsets into the structure they live in! My ISA doesn’t even need more than a few accumulators per thread! Plus, mine has more space per micro-op, so the vectors can potentially be bigger (though they aren’t required to be).

Instruction Cracker as Single Truth

On every modern CISC/RISC hybrid CPU, the native instructions are small while the micro-ops they crack into are more numerous. The instruction cracker is one of the best sources of code-cache bandwidth: fewer, denser instructions in the code cache effectively compress the code and save a lot of bandwidth!
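A toy sketch of that two-tier scheme, with opcode names and expansions invented for illustration: compact architectural opcodes live in the code cache, and the “cracker” expands each one into micro-ops only at decode time.

```python
# Hypothetical cracker table: each dense architectural opcode expands
# into a sequence of simple micro-ops at decode time.
CRACK = {
    "INC_MEM": ["load t0, [addr]", "add t0, t0, 1", "store [addr], t0"],
    "PUSH":    ["sub sp, sp, 8", "store [sp], src"],
    "ADD_REG": ["add dst, a, b"],
}

def decode(stream):
    """Expand a list of architectural opcodes into the micro-op stream."""
    uops = []
    for op in stream:
        uops.extend(CRACK[op])
    return uops

program = ["PUSH", "INC_MEM", "ADD_REG"]
uops = decode(program)
# 3 cached instructions expand into 6 micro-ops: the code cache holds
# the compact top tier, the backend sees the expanded bottom tier.
print(len(program), "->", len(uops))
```

The compression ratio of the top tier over the bottom tier is exactly the code-cache bandwidth saving described above.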

There was a guy at the Amiga Dev-Con who worked as a design engineer at IBM and wanted to sell PowerPC CPUs to the execs there. The current model was what Apple called the G3, and the G4 was soon to come. When I pointed out that those big instructions had terrible code density (his phrase) and that the code cache couldn’t operate efficiently with the existing PowerPC ISA, he had to admit he hadn’t thought about it. I’m pleased that he did the right thing with the PowerPC 970, the G5: replacing the PowerPC instruction decoder, which couldn’t crack to more than 4 micro-ops per RISC opcode in an out-of-order CPU die, with that of the modern-day x86_64 instruction set! The Athlon64 was a venture of cooperation between IBM and AMD that was just that: a PowerPC 970 whose instruction cracker was replaced with a backward-compatible x86 CISC one!

The performance benefit of the AMD64 instruction set, despite its being generationally convoluted from having no clean breaks, is a monumental achievement of processor advancement, because it demonstrated the importance of a 2-tier ISA: opcodes on the top tier and micro-ops as the basis for implementing them. With that knowledge in hand, it can be pushed to even greater heights with graph micro-ops instead of high-clocked, superscalar RISC ALUs.

Graph parallelism is so free-form that many short pipelines can outrun fewer, longer ones. Tile-based approaches can also work, like the ones in the GPU of the second- and third-generation Raspberry Pi. There is nothing more free-form, in my mind, than the ability to shove 8-bit micro-ops into the backend by the dozens or more! Modern AMD64 CPUs keep micro-coded ops in patchable microcode storage, so some bugs can be fixed by updates. Only some of them, however.

Patent Office, Here I come!

With all the knowledge I need to do it again, I’m going to make an appointment with a patent attorney. This time I’m not passing it along; I’m taking names and kicking *ss! The entire kernel architecture of a nano-kernel can be encoded into the instruction cracker in the fabric of an FPGA, and making the entire instruction cracker user-defined is the means of doing exactly that! WebAssembly, SPIR-V, the WASI standard runtimes and the WASIX extension for POSIX support are all fair game. If I can get the soft core big enough, the chip can be one core to rule them all, but this time made in “Gondor” instead of “Mordor”! Make Gondor Great Again!

See ya!

I don’t understand where, in this case, the two novelties lie that a patent must have. But go ahead, I wish you success! :slight_smile:

And I would have been very surprised if WASM weren’t involved :wink:


I was being deliberately vague, but the gist is that if the kernel functions are directly part of the instructions in the cracker, they aren’t software. Nor is the BIOS equivalent, for that matter.

Ah, I see. That makes sense for a patent. I wish you success!


Is there anything in that post pertaining to Haiku development - the topic of this forum - besides the fact that Haiku does have a kernel?


Quite Interesting stuff indeed!

If the functionality of a “module” (the WebAssembly equivalent of a shared object) is incorporated, most of Haiku’s kits and libraries become “runtime library dependencies” rather than an actual operating system. In response to the thread about age-verification requirements for operating systems, this is a potential legal loophole.

All we have to do is provide a listbox with a selection of age brackets in FirstBootPrompt, not revolutionize the entire computer industry and then make dubious arguments that the code is somehow implemented in hardware and therefore doesn’t count as an operating system despite doing exactly the same thing.

Your loophole is more complicated and constraining than just doing what the law asks.


Please do not make it for users outside of California and Brazil.


Already wrong :confused:

EDIT: also to the post above: using some “virtual” machine code, uhh, doesn’t actually make it any better? This was already done with hardware natively implementing the JVM bytecode.

The performance regressions of the JavaSparc architecture were only in the backend. (Then again, backend issues are such a PITA.)

This was already done with the Xerox Alto implementing Smalltalk bytecode in microcode loaded to microcode RAM back in the 1970s, and it probably wasn’t the first machine to do so.

A whole bunch of ARM cores could do this for some Java bytecode too; it was called Jazelle, IIRC.

I’m reminded of Transmeta CPUs as well, which were VLIW and would load a VM at boot that told them how to execute x86 instructions. But they also demoed running Java bytecode directly, I believe.

Looks like another AI slop.

AI slop is based on a pattern-matching algorithm. This is no common pattern, but it has been tried before in various iterations. The most unusual but potentially successful one was the MIT Monsoon processor with a non-von Neumann graph instruction set. If only modern compiler technology had been around then… WOW!

I believe in this case it is genuine human generated slop. I don’t know if that’s worse or better.

Worse. Because then there’s no excuse.

Of course the devil’s in the details but I haven’t met my IP attorney yet. Once the patents are filed, I’ll be at greater liberty to discuss.