# My dream computer (Part 1: Processor architecture)

## Preamble

For multiple years now I have had a vision for how a perfect operating system
should be designed. The specifics have shifted slightly over the years but have
generally stayed the same. However, I have never actually gotten around to
implementing the design I envisioned, because of one problem or another that
was hard to solve at the time. I have still learned many things from my failed
attempts and considered more radical ideas too. In my continuing effort to
implement my ideas some day, I want to write down my most important thoughts on
this topic over the next few days, because I realized that a problem of such
great complexity is hard to solve just within my head. These notes should be
helpful for me but may also interest others, and so they are published here.

During my time of failing to implement the thing I found myself questioning
lower-level things a lot, so I reevaluated those too and came to the conclusion
that they should be redesigned as well. For my overview I will therefore start
from the bottom.

In the rest of this article I am also just writing what I think is true or close
to the truth even for topics that I know less about. So view the text below as
pure speculation. I expect some things to be wrong. In that case, please tell
me.

## Content

My goal is to design a _computer system_ (you should know what that means to
you). The heart of the physical part of this machine is a _processor_; the
remaining parts store data or interface with the outside world or the user and
are of less interest to me. Software is also important, and I will write more
about it in the next posts.

Many design principles for performing the actual computations have been found
in the past: you could use a mechanical contraption, the biochemical reactions
of DNA and proteins, or really anything else. Electric circuits, however, are
the most useful and widely deployed today, and since I do not have the
expertise, I am not reconsidering the design at this level.

With electronic semiconductor logic gates you can arrange circuits on a silicon
die that perform most arithmetic and logic operations. The arrangement of those
gates on such a die is immutable, however, meaning the circuit cannot be
reprogrammed at that level. It is therefore necessary to build a circuit that
is general-purpose and can be programmed with data ("software").

This problem has been approached from multiple sides before. In an FPGA,
relatively small clusters of gates can be arranged into circuits by switching
the gates that connect them (that's a simplification). A "traditional" CPU
instead provides a fixed set of larger circuits (adders, multipliers,
registers, a memory interface, ...) and selects what they do each cycle
according to an instruction stream read from memory. In the end there are many
ways to achieve the semi-turing-completeness required for a computer. Their
performance differs, though. From observing optimizations in current CPUs it
seems that the **useful computation per space per time** matters most. The
minimum time interval for computations (the CPU clock period) primarily depends
on the size of the die (which is constrained by the manufacturing technology,
which I consider fixed) and the distance between components on the die. A major
part of optimizing the _computational density_ of a processor is keeping all
parts as busy with computation as possible.

I see a tradeoff between having small reconfigurable parts (like FPGAs), where
the overhead of reconfiguration is large (as in: many "useless" gates just for
programming), and big parts, where you lack the flexibility to optimize the
software. In an unoptimized basic CPU, for example, only one part would be
active at a time (the ADD instruction only performs an addition while other
distinct components like the multiplier, logic unit or memory are idle). Modern
CPU designs avoid this to some extent with pipelining, speculative execution
and simultaneous multithreading, which however introduce new complexity and
also overhead for this coordination. Also, program execution timing varies at
runtime, which introduces additional problems. The concept of _Very Long
Instruction Word_ (VLIW) architectures appeals to me. It describes
architectures where a single instruction specifies the configuration of most
(usually all) components of the processor at once.

[Skipping a few steps because it's already 4 am...] Next I am describing my
design: The processor is made up of a number of _modules_ that have _inputs_,
_outputs_ and _parameters_. The modules' inputs are connected through data
multiplexers to allow arbitrary cross-module data flow. The parameters of a
module are bits that configure how it operates within one clock cycle of the
processor. These parameters also include, for each input of the module, the
index of the output it obtains its data from this cycle. The processor operates
in clock cycles; each cycle a new instruction is loaded and executed. This
module system is very flexible. Here are a few examples of useful modules for a
processor (a rough code sketch of the whole scheme follows after the list):

- Add/Sub unit: Two inputs and one output. A parameter flag switches it to
  subtract mode.
- Literal loader: One output set to a value given by 32 or so parameter bits,
  essentially providing a way to introduce new constants from program code.
- Memory loader: One input (an address) and one output that takes on the value
  of the RAM at the provided address.
- Flow controller: Two inputs, a condition and a program counter value; it sets
  the PC to that value when the condition is met and otherwise increments the
  PC as normal.
- Memory storer: (left as an exercise to the reader)
- Logic unit
- Mul/Div unit
- Floating-point unit

To balance the load between modules, the busiest units can be included multiple
times in the processor, allowing, for example, two additions per cycle.
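
To make this more concrete, here is a minimal sketch of how such an instruction
word and its module parameters could be modelled in Rust. The names, bit widths
and the exact set of modules are assumptions I made up for illustration, not a
fixed specification.

```rust
/// Index into the global list of module outputs; the multiplexers route the
/// selected output to a module input for this cycle.
type OutSel = u8;

/// Parameters for one Add/Sub unit.
struct AddSubParams {
    a: OutSel,      // which output feeds input A this cycle
    b: OutSel,      // which output feeds input B this cycle
    subtract: bool, // parameter flag: subtract instead of add
}

/// Parameters for the literal loader: its single output is the constant.
struct LiteralParams {
    value: u32,
}

/// Parameters for the memory loader: one address input.
struct MemLoadParams {
    address: OutSel,
}

/// Parameters for the flow controller: conditional jump.
struct FlowParams {
    condition: OutSel,
    target_pc: OutSel,
}

/// One very long instruction word: it configures *every* module for one cycle.
/// A duplicated module (two Add/Sub units here) simply appears twice.
struct Instruction {
    add_sub0: AddSubParams,
    add_sub1: AddSubParams,
    literal: LiteralParams,
    mem_load: MemLoadParams,
    flow: FlowParams,
    // ... logic unit, mul/div unit, FPU, memory storer, ...
}
```

Note that every field is present in every instruction, so a module that is not
needed in a given cycle still receives (ignored) parameters; that is the flip
side of specifying the whole processor configuration in a single instruction.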

Additionally, since typical RAM is rather slow, it is also possible to give
some modules that need main memory, like the memory loader, a fixed _latency_:
memory accesses are internally pipelined without interruption, so that, for
example, fetch results arrive with a four-cycle delay and jumps take effect
four cycles after being issued. This always assumes the worst-case memory
latency and cannot make use of a cache. You could however reimplement a cache
manually by adding a secondary "fast" memory that is local but smaller and
loading data into it with software.
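
As a sketch of what such a fixed latency could look like in a simulator of this
processor, the memory loader below always answers exactly `LATENCY` cycles
after a request, independent of the address. The latency value and the overall
structure are assumptions for illustration only.

```rust
use std::collections::VecDeque;

/// Fixed number of cycles between issuing a fetch and its result appearing on
/// the memory loader's output (the assumed worst-case main-memory latency).
const LATENCY: usize = 4;

/// A memory loader whose internal pipeline is never interrupted: one request
/// enters and one result leaves on every cycle.
struct MemLoader {
    ram: Vec<u32>,
    /// Values already read from RAM but not yet visible on the output;
    /// the queue always holds exactly LATENCY entries.
    in_flight: VecDeque<u32>,
}

impl MemLoader {
    fn new(ram: Vec<u32>) -> Self {
        Self { ram, in_flight: VecDeque::from(vec![0; LATENCY]) }
    }

    /// Advance one clock cycle: accept this cycle's address and return the
    /// value that was requested LATENCY cycles ago.
    fn cycle(&mut self, address: u32) -> u32 {
        let value = *self.ram.get(address as usize).unwrap_or(&0);
        self.in_flight.push_back(value);
        self.in_flight.pop_front().unwrap()
    }
}

fn main() {
    let mut loader = MemLoader::new((0..16).map(|i| i * 10).collect());
    for cycle in 0..8u32 {
        // The address requested in cycle N shows up on the output in cycle N+4.
        let out = loader.cycle(cycle);
        println!("cycle {cycle}: output = {out}");
    }
}
```

A compiler (or programmer) for this design then has to schedule every fetch
four cycles before the instruction that consumes its result, which is the price
for not having a hardware cache.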

Still, current superscalar CPUs can be smarter about module load balancing and
memory caching because of their runtime knowledge, which a compiler does not
have. In general, writing a _good_ compiler for this architecture (assuming a
traditional language) is relatively hard.

I am already visualizing code for such a processor in my head as a table with a
time axis and a module axis. For some later post or development I might
generate such diagrams from code to think about. (It will probably look super
cool.)
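
As a tiny, purely hypothetical example of what such a table could look like
(assuming, for illustration, a four-cycle memory latency and that a module's
output stays valid until it is overwritten), here is "increment the value at
address 0x100":

| Cycle | Literal loader | Memory loader   | Add/Sub unit      | Memory storer       |
| ----- | -------------- | --------------- | ----------------- | ------------------- |
| 0     | 0x100          | fetch @ Literal | –                 | –                   |
| 1     | 1              | –               | –                 | –                   |
| 2–3   | –              | –               | –                 | –                   |
| 4     | –              | –               | MemLoad + Literal | –                   |
| 5     | 0x100          | –               | –                 | store Add @ Literal |

Most cells are idle here, which is exactly the kind of waste that scheduling
more useful work into the empty slots would remove.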

## End

That's it for today. The writing of this article has been motivated by me
"participating" in
[Macoy Madson's Simple Useful System Challenge](https://macoy.me/blog/programming/SimpleUsefulSystem).
As usual I am always happy to receive feedback about what I am writing here.
[Send your thoughts!](https://metamuffin.org/contact) (any method is fine)

The next article will be more about my ideas for software. The overview is:

- Compiler/Language design
  - Memory safety
  - Type systems
  - Representing hardware execution/performance characteristics
  - Memory management
  - Data flow oriented programming
- Software design
  - Permission checks at compile-time
  - Cooperative multithreading asserted by compiler
  - No driver-program boundary
  - Global Safety