# My dream computer (DRAFT)

## Meta stuff

For multiple years now I have had a vision for how a perfect operating system should be designed. The specifics have shifted slightly over the years but have generally stayed the same. However, I have never actually gotten to implement the design I envisioned, because of one problem or another that was hard to solve at the time. I have still learned many things from my failed attempts and considered more radical ideas too. As part of my continuing effort to implement my ideas some day, I want to write down my most important thoughts on this topic over the next few days, because I realized that a problem of such complexity is hard to solve just within my head. These notes should be helpful for me but may also interest others, and so are published here.

During my time failing to implement the thing I found myself questioning lower-level things a lot, reevaluated those too, and came to the conclusion that they should be redesigned as well. For my overview I will therefore start from the bottom.

In the rest of this article I am writing what I think is true or close to the truth, even for topics where I know less. I expect some things to be wrong. In that case, please tell me.

## Content

My goal is to design a _Computer system_ (you should know what that means to you). The heart of the physical part of this machine is a _Processor_; the remaining parts store data or interface with the outside world or the user, and are of less interest to me. Software is also important and I will write more about it in the next posts.

Many design principles for performing the actual computations have been found in the past: you could use a mechanical contraption, the biochemical reactions of DNA and proteins, or anything really. Electric circuits, however, are the most useful and widely deployed today, and since I do not have the expertise, I am not reconsidering anything at this level.

With electronic semiconductor logic gates you can arrange circuits on a silicon die that perform most arithmetic and logic operations. The arrangement of those gates on such a die is immutable, however, meaning the circuit cannot be reprogrammed at that level. It is therefore necessary to build a circuit that is general-purpose and can be programmed with data ("software").

This problem has been approached from multiple sides before. In an FPGA, relatively small clusters of gates can be arranged into circuits by switching the gates connecting them (that's a simplification). [Write how "traditional" CPUs work] In the end there are many ways to achieve the semi-Turing-completeness required for a computer. Their performance does differ, though. From observing optimization in current CPUs it seems that **useful computation per space per time** matters most. The minimum time interval for computations (the CPU clock) primarily depends on the size of the die (which is constrained by the manufacturing technology, which I consider fixed) and the distance between components on the die. A major part of optimizing the _computational density_ of a processor is keeping all parts as busy with computation as possible.
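To make that metric a little more concrete, here is a tiny back-of-the-envelope sketch in Rust. The unit counts, die area and cycle time are completely made up and only illustrate the ratio I mean by computational density; they are not measurements of any real processor.

```rust
/// Toy "computational density" figure: useful operations per unit of die
/// area per unit of time. All numbers below are invented for illustration.
struct Design {
    name: &'static str,
    useful_ops_per_cycle: f64, // functional units doing useful work each cycle
    die_area_mm2: f64,
    cycle_time_ns: f64,
}

impl Design {
    fn density(&self) -> f64 {
        self.useful_ops_per_cycle / (self.die_area_mm2 * self.cycle_time_ns)
    }
}

fn main() {
    let designs = [
        // A naive scalar CPU: only one of its units is busy per cycle.
        Design { name: "naive scalar", useful_ops_per_cycle: 1.0, die_area_mm2: 10.0, cycle_time_ns: 0.5 },
        // The same die and clock, but every cycle keeps four units busy.
        Design { name: "fully utilized", useful_ops_per_cycle: 4.0, die_area_mm2: 10.0, cycle_time_ns: 0.5 },
    ];
    for d in &designs {
        println!("{:>15}: {:.2} ops / (mm^2 * ns)", d.name, d.density());
    }
}
```

Both hypothetical designs occupy the same area and run at the same clock; the second simply keeps more of its units doing useful work, which is exactly the knob I want to turn.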
I see a tradeoff between having small reconfigurable parts (like FPGAs), where the overhead of reconfiguration is large (as in: many "useless" gates just for programming), and big parts, where you lack the flexibility to optimize the software. In an unoptimized basic CPU, for example, only one part would be active at a time (the ADD instruction only performs an addition while other distinct components like the multiplier, logic unit or memory are idle). Modern CPU designs avoid this to some extent with pipelining, speculative execution and simultaneous multithreading, which however introduce new complexity and also overhead for the coordination. Also, program execution timing varies at runtime, which introduces additional problems. The concept of _Very Long Instruction Word_ architectures appeals to me. It describes architectures where an instruction specifies the configuration of most (usually all) components of the processor at once.

Next I am describing my design: Firstly, the processor is made up of a number of _modules_ that have _inputs_, _outputs_ and _parameters_. These modules' inputs are connected through data multiplexers to allow arbitrary cross-module data flow. Parameters are bits that configure how a module operates within one clock cycle of the processor. They also include, for each input of the module, the index of the output to obtain this cycle's data from. The processor operates in clock cycles where each cycle a new instruction is loaded and executed. This module system is very flexible; here are a few examples of useful modules for a processor (a small simulator sketch of this scheme follows below):

- Add/Sub unit: Two inputs and one output. A flag to switch to subtract mode.
- Literal loader: One output set to a value given by 32 or so parameter bits. Essentially providing a way to introduce new constants from program code.
- Memory loader: One input (address) and an output that takes on the value of the RAM at the provided address.
- Flow controller: Two inputs, a condition and a program counter value, that set the PC when the condition is met and otherwise increment the PC as normal.
- Memory storer: (left as an exercise to the reader)
- Logic unit
- Mul/Div unit
- Floating-point unit

To balance the load between modules, the busiest units can be contained multiple times within the processor, allowing for example two additions per cycle.

Additionally, since typical RAM is rather slow, it is also possible to introduce a _latency_ for modules like the memory loader that need main memory, with internal uninterrupted pipelining for memory fetches such that, for example, memory fetches arrive with a 4-cycle delay and jumps happen 4 cycles after being issued. This always assumes the worst-case memory latency and cannot utilise a cache. You could however reimplement a cache manually by adding a secondary "fast" memory that is local but smaller, into which you load data with software.

Still, current superscalar CPUs can be smarter about module load balancing and memory caching because of runtime knowledge that a compiler does not have. In general, writing a _good_ compiler for this architecture (assuming a traditional language) is relatively hard.
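To make the module idea more tangible, here is a rough single-cycle simulator sketch in Rust. It is not a finished design: the concrete module set, the parameter encoding and the one-output-per-module simplification are assumptions I made up purely for illustration.

```rust
/// Sketch of the module/VLIW idea: every cycle, one very long instruction
/// configures all modules at once. Each module input names the output (by
/// global index) whose value from the *previous* cycle it reads.

/// Per-cycle configuration ("parameters") of a single module.
#[derive(Clone, Copy)]
enum ModuleConfig {
    /// Add/Sub unit: two input selectors and a subtract flag.
    AddSub { a: usize, b: usize, subtract: bool },
    /// Literal loader: one output set from parameter bits.
    Literal { value: u32 },
    /// Module does nothing this cycle; its output is left unchanged.
    Idle,
}

/// One "very long instruction word": a configuration for every module.
type Instruction = Vec<ModuleConfig>;

struct Processor {
    /// Latched output of every module (one output per module in this sketch).
    outputs: Vec<u32>,
}

impl Processor {
    fn new(module_count: usize) -> Self {
        Processor { outputs: vec![0; module_count] }
    }

    /// Execute one clock cycle: every module reads outputs from the previous
    /// cycle (selected by its parameters) and produces a new output.
    fn step(&mut self, instr: &Instruction) {
        let prev = self.outputs.clone(); // all inputs see last cycle's values
        for (i, cfg) in instr.iter().enumerate() {
            self.outputs[i] = match *cfg {
                ModuleConfig::AddSub { a, b, subtract } => {
                    if subtract { prev[a].wrapping_sub(prev[b]) }
                    else { prev[a].wrapping_add(prev[b]) }
                }
                ModuleConfig::Literal { value } => value,
                ModuleConfig::Idle => prev[i],
            };
        }
    }
}

fn main() {
    // Module layout (made up): 0 = literal loader, 1 = literal loader, 2 = add/sub.
    let mut cpu = Processor::new(3);

    // Cycle 1: load two constants; the add/sub unit has nothing useful to do yet.
    cpu.step(&vec![
        ModuleConfig::Literal { value: 7 },
        ModuleConfig::Literal { value: 35 },
        ModuleConfig::Idle,
    ]);
    // Cycle 2: the add/sub unit reads the two literal outputs from cycle 1.
    cpu.step(&vec![
        ModuleConfig::Idle,
        ModuleConfig::Idle,
        ModuleConfig::AddSub { a: 0, b: 1, subtract: false },
    ]);

    assert_eq!(cpu.outputs[2], 42);
    println!("add/sub output: {}", cpu.outputs[2]);
}
```

A memory loader with a fixed 4-cycle latency would slot in as another variant holding a small delay queue, and the flow controller as a variant that selects the next instruction index; I left those out to keep the sketch short.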
## End

That's it for today. The writing of this article has been motivated by me "participating" in [Macoy Madson's Simple Useful System Challenge](https://macoy.me/blog/programming/SimpleUsefulSystem). As usual I am always happy to receive feedback about what I am writing here. [Send mail!](mailto:metamuffin@disroot.org)

The next article will be more about my ideas on software. The overview is:

- Compiler/Language design
  - Memory safety
  - Type systems
  - Representing hardware execution/performance characteristics
  - Memory management
  - Data flow oriented programming
- Software design
  - Permission checks at compile-time
  - Cooperative multithreading asserted by compiler
  - No driver-program boundary
  - Global Safety