Skip to content

For as long as I remember, I had played with LEGOs. Those simple blocks would transform into complex objects whose final shape only existed in the supple thoughts of creative imagination. The kind of LEGO blocks we had while growing up were simple: 2x4, 2x2, 1x8. Anything unusual was rare and precious.

Do you remember some of these old blocks (the fence, for example)?

For my friends and me, playing with LEGOs from an early age made us think in terms of hierarchical constructions, building blocks, and breaking up the work into smaller steps to build something more complex – all done at the mental level first. We did not know that, of course, but the required way of thinking made it so. It is no wonder that many of us growing up with it ended up doing engineering.

Engineering is in many ways just like building things from LEGO blocks. Like a metaphorical "standing on the shoulders of giants" is for the science world, we combine, mix and match proven existing pieces, carefully introducing new variables only one at a time and testing them with great care before proceeding. We start with some grand ideas. Sometimes we are not sure if they are even doable, yet we make repeated attempts to reach them, using different building blocks if needed. The joy is in doing.

In this installment of our calculator building project, I will describe the framework I had set up to have several different kinds of practical applications built from a single set of Verilog source code. It uses technologies that span both software and hardware development. When I started toying with it, I was not sure I could get it all to work together in that way, but in the end, it worked out well.

The initial Verilog code around which I built this framework does not do much. Still, it does implement some valuable functions: the "lcd" module implements a complete 16x2 LCD panel driver, which displays "Hello World" after initialization. The 3 LEDs are assigned to blink, show the pushbutton state and the internal reset state. The 50MHz input clock controls the timing. Pressing the button restarts the sequence. This process is swift, but you might still see a quick flicker on the screen. The states of individual signals are visible in simulation, and there are several ways set up to simulate the design. Each verification approach may uncover a unique set of issues.

Altera Cyclone II dev board

While the end goal is to make a physical, electronic device (with a keypad and a display), in the process, we need to test and simulate pieces of that design at several levels and in stages. We will use ModelSim and Verilator software to simulate it; we will debug it and run a prototype using Qt in desktop and web browser environments. Most importantly, the same design will synthesize and flash into a physical test board (with the attached peripherals like LCD and push buttons). The selected Altera EP2C5 prototype board will be a stepping stone to our own created and manufactured final board.


While we could have used 74-series 5V TTL chips to build this project as it would be considered more of a "retro" style, I wanted to use an off-the-shelf FPGA device instead since it lets me design more complexity at the gate level. (I also did not want to end up with a wired mess of breadboard interconnects.)

I opted for the Altera EP2C5 due to its relatively low cost (around $5 on eBay), ease to solder at home (being packaged in a somewhat manageable TQFP with 144 pins), and I also already had an inexpensive test board for prototyping.

My Cyclone II devices arrived in a nice perforated strip package

Altera (which was acquired by Intel in 2015 for $16.7 billion in cash) designs and manufactures programmable logic devices (PLD, FPGA, …) and licenses a synthesis tool that supports those devices, called Quartus.

If you will, the retro part is definitely Quartus software: when you look at its GUI, it does take you back to the 80s! However, once you get familiar with it and its quirks, the software is not bad, and it does a lot.

To support this FPGA chip generation, you will need an older Quartus version, 13.0.1 (Service Pack 1). The "web" or "lite" editions of Quartus are free. It supports a smaller subset of mostly older FPGA Cyclone and CPLD MAX devices.


Verilator is a free and open-source compiler for Verilog (and System Verilog variations) that generates modern C++ code which can be compiled and executed as an application. It creates a cycle-accurate behavioral software model.

You need to write a C++ wrapper with a main() function, which essentially drives the top module's clock, reset, and other input pins, and then you can read the state of the model's wires and nets while running the simulation.

The primary loop in Verilator that runs the simulation:

Such a "verilated" code also supports coverage analysis and signal tracing. These traces are written to an external file and can be viewed by the GtkWave utility. Be aware that the trace files can quickly get very large!

Verilated simulation traces shown in GtkWave

If you develop hardware in Verilog (or System Verilog), Verilator can also be used as a good linter (a static code analyzer). It will find errors in your code much faster than any other compilation or synthesis tool. Use it with these options (with your Verilog sources listed at the end):


Qt is a software framework for creating cross-platform applications using C++. The way it does it is by abstracting everything platform-related with its own set of well-designed classes. So, while you still use C++, standard libraries, etc., you use Qt's classes to access resources like network, keyboard, and so on.

Qt has more than 2000 classes. Kidding you not: All Classes of Qt 5.15

I have been using Qt professionally and personally for more than ten years, and I've used (and am familiar with) only a fraction of the available classes. However, the documentation is abundant and of good quality; there are plenty of excellent examples and related discussions, making it easy to pick up.

Qt is developed and maintained by "The Qt Company" (originally "Trolltech") based in Finland. Their business model includes commercial licensing at various levels and provides a free, open-source license for non-commercial use.

The beauty of Qt is that once you create an application, you can recompile it to run on Linux, Windows, Mac, and Android, unmodified.

QtCreator is an IDE that ships with Qt. Alternatively, if you want, you can compile a Qt app from the command line or import the project into Visual Studio (using a Qt plugin for Visual Studio) if you prefer that IDE. I frequently do that since the debugging capabilities of Visual Studio are unmatched.

We will be wrapping our "verilated" C++ sources to create a GUI application.

Another benefit of using Qt is that (starting with version 5.15) it supports compiling applications for WebAssembly.


WebAssembly (or "Wasm") is a relatively new technology. It is not JavaScript, although it runs in every modern JS virtual machine (in a browser). Wasm was developed with an intent to JIT (Just-In-Time) compile to any modern CPU and provide native performance. It is "client-side," meaning your browser loads the Wasm object code and compiles it locally before running it. You will notice a rapid spike in CPU usage after loading a web page containing Wasm code and before the code runs.

Currently, the primary tool to compile your C++ source to Wasm is called "Emscripten." Some other languages are also supported; for example, Microsoft's Blazor compiles C# into Wasm.

Wasm has two standard formats: a compact binary object file and the corresponding, human-readable listing. The binary object still needs a JavaScript wrapper to load it and to provide the sandboxed environment in which it can execute. Hence, the final code bundle includes both the JavaScript wrapper and an HTML page it is embedded in. The Wasm API requires a level of support provided by the sandbox to interface with the "outer world" safely. As a developer, you would need to write that glue logic.

Fortunately, Qt takes care of that wrapper for you. You compile your C++ code just as you would a typical Qt application, and it should work. Be aware that not all Qt classes were ported to Wasm yet, but the list is growing. Also, since the Wasm runs in a protected sandbox, some functions are modified: for example, standard file dialogs to load and save files are implemented as browser file upload and download operations. As Qt is constantly evolving, I am sure they will iron out those kinks as much as possible, given the inherent limitations of the environment.

Since the WebAssembly code is running within a protected browser VM and it cannot access the host file system, and because our async_rom module needs to read the initial ROM content from a file ("lcd_cmd.txt"), we need to resort to a trick. To provide that file inside a browser's sandboxed VM, we store it as a resource in the Qt app, and then we copy that resource back to the file on the host when the app starts:


The "quartus" folder contains project files for Altera (Intel) Quartus software. To support Cyclone II devices, you need to download Quartus version 13.0. SP1. The project was set up to synthesize on that FPGA device and to download/flash our test board.

Cyclone II EP2C5 Mini Dev Board - Land Boards Wiki (

I performed some minor modifications to that board: I have removed a couple of resistors and capacitors to free up a row of GPIOs for use. Those SMD components were not used for the core FPGA operation. They were added to support external reset and a different device SKU using the same board. On that board's website, the components are mentioned as "Only needed for EPC28".

Components that were removed


ModelSim is an HDL simulator made by Mentor Graphics. Founded in 1981 and based in Oregon, it was acquired by Siemens in 2017. The company develops and licenses a range of EDA (electronic design automation) tools, of which ModelSim and QuestaSim are just two. While QuestaSim is their newer and more expensive flagship product, ModelSim is an older and now mostly free for students and hobbyists.

You can download the free versions of ModelSim alongside the Quartus tool from the Intel website. Both are sufficient for smaller projects using older FPGA chips.

Our project's "modelsim" folder contains a ready-to-run configuration. After making sure that the ModelSim executable is in your path, run "modelsim.bat" from a command prompt, and the script will do the rest: create a project, include relevant Verilog files, compile, and run the simulation. At the end of the simulation, the code will encounter a "$finish" statement (located in the "" file) and stop. If you do not want the source file to open at that point, change ModelSim setting by Tool->Preference->By Name->OpenOnFinish = 0.

ModelSim simulation run showing the "Hello" time slice (click on the image to enlarge)

The results

This development was done on a Windows 10 machine. Verilator and GtkWave were run within a WSL (Windows Subsystem for Linux). The MobaXTerm application provided the X-11 server on Windows. If you try to replicate this setup and encounter problems opening an X-window from within WSL, be sure to set Linux' DISPLAY environment variable in this way:

Here is "Hello, World" displayed running on an Altera FPGA board:

"Hello World" on EP2C5 board

The same Verilog code, verilated and compiled as a Windows application:

"Hello World" as an application

Compiled to Qt's WebAssembly target, the same application runs in a browser:

"Hello World" as a WebAssembly application

I have copied the code here where you can run it from your browser:

Calculator Framework Webasm

This framework's beauty is that the basic Verilog source code runs un-modified in a variety of build and test environments, producing identical results. Another good thing about this project is that all the tools for it are free.

These technologies are like LEGO pieces; we picked them from a pile of available technologies, connected them with some glue logic, and, after much experimentation, we have built something quite versatile.

The source is on GitHub here:

I am leaving it in that basic form that can serve as an initial template for a similar project.

Additional Links:


In the last post, we have verified and quantified the precision of the basic four functions (addition, subtraction, multiplication, and division), and now we can use them as a steppingstone. We can assume they will be available to us so our experimentation at this stage could simply use built-in C++ functions.

In this article, which is the third in a series, we will continue with the proof-of-concept research. We will tackle more complex, higher-order functions and investigate methods to compute them. At this point, the focus will be more on the algorithms than on the low-level implementation (that is, we will not be bothered trying to finesse the code to use a particular hardware register format like BCD; we will simply use a generic “double”).

The source code implementing the algorithms of these operations is checked in to Github here:

I have implemented only major functions and not those that could be derived. As you will see, there were only a few major functions that we really needed. The code also checks for the symmetry of the pairs of functions. The expectation is that running the inverse function on a result should produce the original number, for example, arctan(tan(1.23)) = 1.23.

The theoretical field of numerical methods is wide and mature, and a decent study could take a lifetime (or a Ph.D. thesis), so the algorithms selected for this project do not necessarily represent the best ones available. Most of the decimal CORDIC algorithms were derived from Jacques Laporte's web pages. Jacques passed in 2015 but his excellent work on reverse engineering algorithms of HP calculators lives on.

Square Root

Wikipedia: Methods of computing square roots

HP calculators’ algorithm: W. Egbert, 1977 , Volume, Issue May-1977 (

  • It uses digit-by-digit calculation, never having to go back and correct a digit, hence deterministic time that depends on the number of digits and not on a result’s convergence rate.
  • It has additional improvements to speed up the calculation.
  • It looks suspiciously like this Japanese method: F. Jarvis, spec4.dvi (

Possibly improved algorithm, but with no accompanied source code: C. Bond, sqroot.pdf (

ZX Spectrum uses Chebyshev polynomials: Logan, O’Hara, The Complete Spectrum ROM Disassembly

  • It uses discrete coefficients to approximate sin(x), atan(x), ln(x) and exp(x)
  • Based on that, it derives sqrt(x) = (x)^(0.5), where x^y = exp(y * ln(x))

Therefore, a square root could also be implemented using an exponential function.

We will be using the Babylonian method: it converges quickly, most numbers converging in 10-15 iterations. It does have two divisions per iteration, though, but its simplicity makes up for it. Source code is here.

This method starts with an initial guess and squares it, and then it refines that guess depending on the result: if the square of a guess is larger than the original value (“overestimate”), it decrements it, and if it is less (“underestimate”), it increments it. The amount of that correction, and a way to calculate it, directly affects the quality of the algorithm and the speed of convergence. The process is repeated until we reach the desired accuracy.

The convergence formula that is used is: next_guess = (last_guess + initial_value / last_guess) / 2
We stop when last_guess and next_guess do not differ any longer.

As an example, here is a table of convergence of sqrt(5.71):


Trigonometric functions

HP35 algorithm: TRIGONOMETRY ( and Cordic (English) (

ZX Spectrum uses Chebyshev polynomials: The Complete Spectrum ROM Disassembly

  • It uses discrete coefficients to approximate sin(x), atan(x), ln(x) and exp(x)
  • sin(x) uses 6 constants; exp(x) uses 8; ln(x) and atan(x) use 12 constants
  • Based on those, it derives others:
    cos(x), tan(x), asin(x), acos(x), **(x) and sqr(x)
    cos(x) = sin(PI*W/2), where -1<=W<=1, W is “reduced argument”
    tan(x) = sin(x) / cos(x), overflow is cos(x)==0
    asin(x): first, tan(Y/2) = sin(Y)/(1 + cos(Y)); then, use atan(Y/2)
    acos(x) = PI/2 – asin(x)

Some practical problems with Taylor (or Chebyshev) polynomials are their relatively slow convergence, lots of terms (multiplications and divisions), and thus lots of math for not much precision:

Also, since polynomial solutions only approximate desired functions, their precision wobbles alongside the curve (hence there is a search for a more perfect polynomial that has a good “minimax” property, minimizing the maximal error – a less extreme error wobble across the range); they may be more precise at some points (say, at and around x=0) but are off further you get away from that point.

To reach 12-14 digits of accuracy, we would need to compute many terms (20 or more) which is very computationally intensive.

Here is where the CORDIC algorithm shines. It is not only simple to understand, but with correct optimizations, it is also very fast. In its most optimized form, it uses only basic shift and add operations while producing the results that are exact to the known precision. CORDIC is not an approximation to a function but a true solution to it.

CORDIC can also be used to compute many other functions, and not only trigonometric.

How CORDIC works?

Suppose we need to find a tangent of 32°. If we graph it, there is a point on the unit circle at P(x,y) for that θ=32°.

(Source: desmos )

If we somehow find the coordinates of that point P in the cartesian system, we know that the tangent of θ is simply Py/Px. Knowing the tangent, we could also derive sin θ and cos θ using the equivalence formulas.

CORDIC stands for COordinate Rotation DIgital Computer and is based on rotation formulas:
Px = x cos θ - y sin θ
Py = x sin θ + y cos θ

These two formulas tell us how to rotate a point along the unit circle. The first optimization is to simplify those two rotation formulas so that they use only tan θ instead of both sin θ and cos θ:
Px = x - y tan θ
Py = y + x tan θ

For the curious, the exact math of this rewrite can be found here: An Introduction to the CORDIC Algorithm - Technical Articles ( A take-home point is that the above optimization is almost correct: it will not rotate a point on the unit circle - the point itself will drift out - but the final coordinates will have the same ratio of its x and y parts. It turns out that we care only about that ratio and not the values. The article above mentions a pre-computed scaling factor that could be used to bring the P back onto the unit circle. In that case, computing sin and cos are a bit simpler and more direct.

Let us create a second point, Q, that lies on the x-axis, at (x=1, y=0). We will incrementally rotate it counterclockwise by a certain predefined set of small angles and stop when the total angle of rotation, call it β, reaches the angle we were given as a task to calculate its tangent (θ). The value of tan θ will be the y/x ratio of our rotated point Q. We want β to be as close as possible to θ.

Rotation operations are additive: you can apply many smaller rotations in sequence with the result being the sum of those individual rotations.

A clever trick of CORDIC now comes into play.

We start by making the first rotation of Q by a small angle for which we know the tangent. We could start with 1°, for which tan(1°)=0.01745. Following the rotation formulas, we now need to multiply both Qx and Qy components with 0.01745. Multiplication is an expensive operation that may also contribute to compounding errors when done many times.

Let us look at the problem in this way: we picked 1° since that angle looks nice and round to us. Instead, if we have picked an angle whose tangent was a value that we could multiply with very easily, like 0.1 or 0.01, then we would only need to shift right Qx and Qy. Multiplication with 0.1 in the decimal system (and hence in BCD) amounts to a simple shift by one digit.

That is what CORDIC does: it picks a set of angles for which tangents are 1, 0.1, 0.01, 0.001… Those values correspond to angles of 45, 5.71059, 0.57293... The angles are stored in a table so they can be added to the rotating point Q as its rotation angle approaches the final desired angle.

The precision of the result depends on the number of digits stored in the table, and by the algorithmic decision of how close we want to approach the desired angle (the number of iterations).

If we work out our example of 32°, the sequence of computed values would look like this:

QxQyNew xNew yTotal βtan θθ

Our result (y/x) compared to the actual tan(32°) differs by only 8x10^-5. Notice that we skipped one table value (for tan θ = 0.001) because the corresponding angle increment of 0.0572, when added to the current running angle β, would have exceeded the target angle of 32°.

Range reduction

Standard trigonometric functions are symmetrical and cyclic, which means you should get identical results for any positive angles that are away by a multiple of 360°. Hence, tan(20°) equals tan(380°), but also tan(1.234e10°) equals tan(-80°) which is not as obvious due to the exponential format.

Before starting to compute any trigonometric function, we need to reduce the input angle; we also need to handle large values (with non-zero or large exponents) paying attention to a possible loss of precision that may creep in with that reduction.

Our code implements the heuristic of subtracting 2*PI from the input value until the value is within the (0, 2*PI) range. For larger values (with non-zero exponents), we subtract 2*PI * 10^exp, reducing the exponent as the value gets smaller. Even for large exponents, the convergence to the base range turns out to be quite fast.

Function Dependency (trigonometric)

The rest of the trigonometric functions can be derived from this equivalence table. (Source: Wikipedia)

Trigonometric function identity table (Creative Commons)

In practice, you do not need to compute both sin θ and cos θ as each one is simply 90° shifted from the other. Having only tan θ suffices to get all other functions.

Inverse trigonometric functions

HP35 algorithm is described here: INVERSE TRIGO (
These functions can also be computed using Chebyshev polynomials.

If you recall how we calculated tan θ by creating a new point Q at (1,0) and rotating it counterclockwise until it reached the angle θ, inverse trigonometric CORDIC arctan(x) does the opposite: we are given the ratio y/x of a point P located at the unit circle and we need to find the angle θ that this point makes with the x-axis.

To calculate arctan(x), we create a point P at coordinates (1, x), where x is the input value, and rotate it clockwise until its y coordinate becomes zero (or as close to zero as necessary for the required precision). As we rotate it, we add those small angles. Once y is zero, the total angle of rotation represents arctan(x).

If you look at the code, the algorithm is not immediately obvious due to the shortcuts deployed to simplify BCD arithmetic. The coordinate P has been rotated by a set of angles whose tangents are values easy to multiply with (1, 0.1, 0.01…), and the total rotation is stored in array digits[K], with an index corresponding to that table. The last loop simply sums them up using their positional weights.

Logarithmic and Exponential functions

This is the second major group of functions found on all but the most basic calculators. Just like trigonometric, each function has its inverse counterpart. A convenient property of this group is its rich equivalence table – not unlike the trig functions where we only needed tan θ and its inverse, arctan(x), it turns out that we only need to implement a logarithm and its inverse, the exponential.

For both functions, we again use a variation of a decimal CORDIC.

Logarithm (natural)

As a notation, ln(x) is equivalent to loge(x)
HP35 uses the “Pseudo Division and Pseudo Multiplication” process: LOGARITHM ( which is suitable for BCD architectures.
ZX Spectrum uses Chebyshev polynomials: The Complete Spectrum ROM Disassembly
It uses discrete coefficients to approximate sin(x), atan(x), ln(x) and exp(x)

As with trigonometric functions, ln(x) and eX need to use the same constants and accuracy to assure symmetry.

We are using HP’s method: it is very fast and simple to implement on a BCD architecture. The source code is here.

Although at the first sight it does not appear so, the logarithm code is very similar to the equivalent CORDIC code for a tangent. Instead of tangent values, its hard-coded lookup tables contain log values.

The basic idea implemented in the log algorithm is that each individual digit of the input value “n” contributes to the result by a certain quantum. An intermediate array digits[M] holds this contribution count (quotients from pseudo-division). The final loop sums up these contributions weighted by this quantum (actual log values stored in a table).

For the more curious, Jacques Laporte explained the algorithm in much more detail: LOGARITHM (

Logarithm, base-10

log10(x) = ln(x) / ln(10), where ln(10) is a constant 2.30258509299404568402
Alternatively, 10x = e^(x * ln(10))

We will use these formulas to derive the logarithm and exponential of base-10.

Exponential (natural)

Wikipedia: Exponential function
A natural exponential function is defined as y = ex, where e=2.718281828459 (Euler's number)
This operation can be found written in several different formats: ex, exp(x), or EXP(x), or even e^x.
eX is the inverse of the ln(x), natural logarithmic function, written as loge x, ln(x), or LN(x).

Due to computing imprecision around x=0, many calculators (as well as C++ standard library) implement a function called expm1(x) which is y= ex - 1 instead.
ln(x) and eX need to use the same constants and accuracy to assure symmetry.

HP35 uses the ln(x) algorithm, run in reverse, so to speak: EXP(X) (

We will follow the HP heuristic and use the same.

Exponential, base-10

Written as yx or y^x, it is often calculated using this formula: yx = e^(x * ln(y))

HP41 is using this formula, and so will we.

Function Dependency (exponential)

Expressing a wide set of functions using a combination the basic ex and ln(x):  

sqrt(x)x^(0.5)e^(y * ln(x))*
10X10^(x)e^(x * <const>) 
log10(x)  ln(x) / <const>
yX e^(x * ln(y))*

All hyperbolic functions can also be derived from the same set of basic exponential functions. They operate on the hyperbola rather than the circle (as their trigonometric cousins do), so although they share similar names, they are a different kind of beasts, having more in common to exponential “e” rather than the circular “PI”.

Below is the list of hyperbolic and inverse hyperbolic functions with equations on how to derive them. (Source Wikipedia)

We can get all 18 functions “implemented” for the price of 2 … Isn’t that wonderful?

This should wrap up a brief look at the theory and algorithms of what we are building. It’s time to roll up our sleeves and start with the actual design.

Interesting Reads

This is the second article in a series on making our very own calculator.


When I was in 13th grade, I almost flunked math. Test after test, I was getting poor grades and it seemed that I just could not get those answers straight. I was messing up a lot. As the semester progressed, things were getting worse. Luckily, my late dad, having been a teacher himself in the past, understood the problem, and hired an old lady, a former professor, to tutor me.

She was from the old school: strict, authoritative, and always with a proverbial ruler in her hands to smack me when I would not follow her directions. (Yes, she really did it a few times, but gently, I should say. Those were different times, a different country).

After a couple of months, my grades dramatically improved. In fact, I remember the day after we got one test back, our teacher said in front of the whole class just how impressive my improvement was and that I was reaching the top of the class. It felt surreal.

She did not teach me math or how to remember the formulas.

What she did teach me was the method of doing the math.

She made me slow down and do my work, pedantically, on the paper. “Don’t rush! Slow down! Step by step! Verify it on the side!”, she would keep repeating. I was trying to calculate too many steps in my head since I was getting them right. Almost … most of the time, at least.

This brings me to the importance of step-by-step testing and verification, and this article is about prototyping and the verification of basic algorithms.

Algorithm Verification

We need to investigate and pick algorithms that we will use and verify them against a known good, “golden”, model. That platform also needs to be easy to use and debug. It should be quick to prototype various models so we can research different ideas before committing to them.

Normally, for such tasks, Python comes to mind, but for this project, I have selected C++ and MS Visual Studio’s basic console C++ project to construct such a framework. In my opinion, Visual Studio is the best freely available debugging framework. I also checked in a Makefile to compile it under Linux. Using both, I have also found certain discrepancies between the two compilers which I will describe.

The goal of this step is to verify the basic operations of addition, subtraction, multiplication, and division and the parsing of the input buffer. We want to investigate various ideas on how to do that effectively on BCD values, with mantissa and exponent processed separately. The code should be flexible and not tied to a particular mantissa width so that we could change it and see the impacts of such changes on the overall precision.

This code, a functional-level simulator, does not have to be too fancy, or fast. In fact, quite the opposite: the goal is to start mimicking some kind of low-level implementation that could be done in hardware so that the eventual translation to it may be simpler. Still, nothing is set in stone yet – as the verification progresses, it should become more obvious which parts would belong to what implementation level.

I have checked this project in on github here: Calculator Project: Proof of Concept (

I have rewritten that code many times and spent many evening hours on it as I was trying and debugging various approaches and algorithms.

Gradually, some certainties about the internal architecture started to emerge almost as all by itself. As you take certain directions, many related variables seem to naturally fall in place to form an architecturally consistent, and arguably perfect, model (ex. the way scratch registers extend calculation precision by adding extra digits). Those are the easy ones to decide, they are obvious and represent a sound engineering conclusion. The harder ones require giving up some obvious benefits to satisfy other variables (ex. selecting the mantissa or exponent widths). They are an ever-present nemesis of these engineering tradeoffs.

Somewhere in that spectrum is a decision to use BCD as the internal number specification.

Why BCD?

While the binary floating point (or IEEE 754 standard) is overwhelmingly used in general and scientific computation, fixed point BCD calculations are still very relevant in banking, finance, and commercial calculations due to their precision, accuracy, and the immunity from implicit conversion errors. In many ways, doing arithmetic with BCD digits resembles calculating values as taught in school, on a paper, digit by digit – there are no conversions to other number systems. Also, at any point in the process, digits are readily visible to a debugger which makes the algorithms easier to debug.

Hence, mantissas will be stored in BCD format.


Exponents, however, will be stored as 8-bit binary (integer) numbers. That format simplifies the kinds of operations that need to be done with them – exponents are never addressed as individual digits and are always used as integer numbers, so keeping them in a multi-digit BCD format would make their handling unnecessarily more complicated. Moreover, we will add a bias value of 128 to it. That will “move” their range to positive integers and simplify their comparisons.

Most calculators use a two-digit exponent, from -99 to +99, so we will limit ours to that range, too. As they get internally converted to 8 binary bits, they will “lose” values in the ranges [0-28] and [228-255] and that is ok. (Maybe we could allow intermediate results to extend into those ranges.) Since exponents are always whole numbers (integers), we are not losing any precision when we convert them from a format of two-digit input buffer to integers, and back when we need to display their values. The process of such a 2-digit conversion is quite simple either way.

Input parsing

This is the first practical operation that we need to code and verify.

The deployed heuristic should be able to process any reasonably well-formatted number which a user might have typed in and produce a mantissa and exponent pair. Since the mantissa is stored in BCD format, a separate mantissa sign bit should also be extracted and stored.

We will allow two basic formats: with and without the exponent value. The test code contains a list of values that we consider valid and runs a verification algorithm on each. Note that the input parsing process does not check for incorrectly formatted numbers – it will be the job of the input editor routine to deliver us properly formatted buffers.

Number format without the exponent
Number format with the exponent

The input parser will have to normalize all input numbers. In this implementation, a normalized BCD register contains left-aligned digits, with an implied decimal point after the first digit. That decimal point is shifted by the value of the exponent.

Two numbers and their internal, normalized, representation

All internal operations will be performed on such normalized values stored in registers.

Having specified the internal number representation, we can now tackle the four basic algorithms.


This implementation uses a “nibble-by-nibble” serial algorithm.

Usually, the subtraction is performed by negating the subtrahend and then using the addition. Most known BCD heuristics use 9’s or 10’s complement (subtract each BCD digit from 9 or 10, add one, and then add them). That is also similar to the way processors do subtraction of binary values.

After coding in several variations of popular algorithms, I have decided to create a slightly different one, one that is more suitable to the architecture that keeps the sign bits separately from their mantissas and does not store negative numbers as complements. Our implementation of subtraction does not negate the subtrahend. There are many ways to skin a cat, as they say.

Addition and subtraction operations are complementary and are used interchangeably depending on the signs of the addend and augend, or minuend and subtrahend (the former naming a base value, the latter naming a value to be added or subtracted).

First, we align the mantissas so that their digits represent equal decimal weights. The normalized form had them all left aligned while the exponent specified the position of their decimal points. We must shift the smaller of the two to the right by the difference of their exponents so that we can add ones to ones and tens to tens.  However, this could also result in a lesser number being shifted out of existence (ex. 1e20 – 1); we recognize such cases early and return the larger value unchanged.

Since the result may need a new significant digit (as in “9+9=18”), we need to have a nibble pre-allocated for it.

Our code splits up discrete addition and subtraction cases by this heuristic:

Another twist to subtraction is that we make sure to always subtract a lesser value from a greater one, avoiding the last (MSB digit) “borrow”. When the terms are swapped, the result is the same but negative (like in “10-3=7” but “3-10=-7”)

The implemented heuristic, although slightly different from the traditional ways to solve the problem, is well suited to this architectural implementation making the code not only readable and simple but more importantly, exact.


There are many methods of multiplying two BCD numbers. They range from traditional iterative partial-sum methods to intermediate conversion to binary value and back. There are also different approaches to implementing a multiplier in ASIC and FPGA. However, using an FPGA embedded multiplier would not be fun, would it?

The multiplication algorithm is using 2 extra scratch registers: one to store the intermediate product and the other one for a running total of the sums of products. In the outer loop, we multiply a multiplicand with each digit of a multiplier, while in the inner loop we multiply each pair of individual digits. Each result of these iterations is then being summed with the running total.

At this time, we assume that the hardware will be able to multiply a single (4-bit) BCD nibble with another one and return a 2-nibble wide result. This may be an oversimplification and may likely change as we start designing the hardware.

The sign of a result is an XOR of the signs of the input terms. The exponent is the sum of the exponents of those terms. Simple.


We use a classic “Shift and Subtract” method to divide two numbers. This method mimics long division. Since both, the dividend and divisor, are already normalized, we do not need to align their decimal digits.

To obtain each digit of a quotient, we subtract the divisor from the dividend as many times as we can (for as long as the dividend is greater or equal to a divisor) and count. To get the next digit of a quotient, we shift the dividend one place (one digit) to the left (thus multiplying it by 10, decimal). Note that many classic “Division by Shift and Subtract” algorithms shift the divisor one bit to the right instead, a process that eventually “eats up” the bottom digits of the divisor, losing the precision.

Once we have found all the quotient digits, the dividend value (which is now numerically less than the divisor) represents the remainder. If we wanted to, this convenience leaves us with an option to implement a modulus or a remainder function in the future.

Not unlike multiplication, the sign of a result is simply an XOR of the signs of the input terms while the exponent is a difference of the term’s exponents.

With all four operations, we also need to normalize the result. In the end, the total number of digits might exceed the width of our scratch register causing a loss of accuracy.

Proof of Concept

To verify our calculations, the checked-in code carries along parallel computation using a standard C++ type “double”. This data type is convenient since, when printed, it provides up to 15 decimal digits of precision. Those are the “control” or “golden” values we compare our results with.

I could have used one of many “Arbitrary Precision Math” packages instead-but taking the advantage of a C++ built-in type suffices since we don’t need our calculator to exceed its precision. We also don’t need to tie-in another, potentially large, package with our sources.

Creating a test bench, we add a set of tests to each function we implement. The tests run a function using statically defined values that include several common and corner case numbers (ex. 0, 1, 0.00000001, 0.99999999).

These values are used in combination with each other and with varying mantissa and exponent signs.

Additional test sections use purely random values. Here we use C++11 “std::minstd_rand” to generate the same pseudo-random sequence across different, compliant, C++ platforms. That way, our VS and gcc runs will use the same random sequence and we can simply compare their outputs as text files.

Although such test coverage is not exhaustive, it seems to be sufficient.

The validator code compares the output of our algorithms with the calculated control value and prints “OK” if the values match exactly, “NEAR” if the values differ by only a magnitude of the last digit (a rounding error), or “FAIL” otherwise.

Since we currently do not perform any rounding, there are many results marked as “NEAR”, as expected.

It turns out that VS and gcc libraries round numbers differently, so there is a slight mismatch in the control values between those two platforms.

For example, on VS:
123456789012345 / 0.1 = 1.2345678901234e+15 vs. +1.2345678901235e+15  NEAR (5e-14)

while, on Linux gcc:
123456789012345 / 0.1 = 1.2345678901234e+15 vs. +1.2345678901234e+15  OK

(The last number in each line is the control value calculated using a standard VS or gcc library).

I also cross-checked by hand many calculations using SpeedCrunch software, which in this case shows 1.23456789012345e15; and as with many/all other values, the difference is clearly in the rounding digit.

During the development of that code, algorithms, and tests, in the subconscious pursuit of a “perfect match”, I had to keep reminding myself that the goal of this validator was not to match a particular C++ library perfectly but to provide a well-defined and consistent accuracy of the result. At one point I have coded in the rounding function and had matched VS control values perfectly. However, the same values would mismatch when compiled with gcc (by the said rounding errors), so I decided to keep the code simpler and not do the rounding, for now, implementing only the truncation policy.

There are two major defines in the validator code: MAX_MANT specifies the number of mantissa digits for input values and the result, and MAX_SCRATCH specifies the number of mantissa digits used in the internal calculation. The latter I call, “Arithmetic Scratch Register” since it should likely map to an equivalent register in our final design.

You can change those two defines and see the impact they make on the precision of the results.

Here is the somewhat abbreviated result of the verification: Verification.txt

After getting basic algorithms to work and to match the control values, I could verify successive changes to the code by quickly re-running the tests and spot functional regressions. Such verification setup made the development process faster and less error-prone.

Finally, I iteratively rewrote the code, simplifying it into conceptually basic steps (no fancy C++ constructs!) so that we could write some kind of microcode that mimics it. The microcode will be for a simple processor we will design, and the design will greatly be influenced by the kind of operations and data structures (registers) we use in these basic algorithms.

Just like doing the basic math in 13th grade properly: we are doing it step by step, not rushing it and we are verifying it on the side. This time, though, without a proverbial ruler looming in the air…

Interesting Reads


The idea for this project came about during a week of freezing winter arctic event here in Austin, Texas, with a failed power grid as well as government, while keeping close to a gas fireplace, the only source of heat and light, for a couple of days. With a weak internet over a phone data line, I could only do some preliminary searches and mainly work out various details on a writing pad, growing a feeling that the project may be personally exciting and practically doable.


The idea for this project came about during a week of freezing winter arctic event here in Austin, Texas, with a failed power grid as well as government, while keeping close to a gas fireplace, the only source of heat and light, for a couple of days. With a weak internet over a phone data line, I could only do some preliminary searches and mainly work out various details on a writing pad, growing a feeling that the project may be personally exciting and practically doable.

I will try to post my progress over the coming weeks.

How do calculators work?

You press a few buttons, a function, and you get your result back. But what is the technology and the algorithms at work behind its computation? For sure we can read about that, but to fully internalize all the tradeoffs and corner cases, we need to embark on designing one. To create one from scratch, and by that, I mean it - not use emulators and pre-existing ROMs, or even a powerful microcontroller, will provide us with that innate knowledge. We should start with a blank drawing pad onto which we would sketch our design from the very first gate and build on top of it.

Let's dive in.

Architectural Decisions

We will have to make certain implementation decisions, many of which are exclusive, each one with its own tradeoff.

Internal number representation

The number format internally used for computation:

  1. BCD (binary-coded decimal), where each nibble stores one decimal digit of mantissa and exponent
    • More cumbersome operations on individual digits.
  2. Floating point representation
    • Needs conversion to and from decimal values.
    • Suffers from conversion errors.

The drawbacks of BCD are more complex routines and slightly lower compactness; the advantage is the perfect representation of decimal magnitudes, as opposed to pure binary, that cannot represent exactly some of them; also that there are very refined algorithms for both normal arithmetic and transcendental functions.

Digits of precision

How many digits of precision do we want to carry? Here we need to be very careful not to show more digits than we can compute exactly (with the exception of the last, rounding, digit). A calculator that displays incorrect values is much worse than one that displays only a few, but correct, digits.

  1. BCD may need internally wider registers to compute trailing, rounding digit, so it uses more digits for computation that it shows in the result (ex. HP41:10 digits, 13 internally; HP71: 12 digits, 15 internally)
  2. Binary (floating point) may be off when rounding certain values that can't be represented exactly in a decimal system. This notation may follow a standard like IEEE 754 or implement it’s own partitioning of mantissa and exponent values

A major decision to be made is whether to use one of many rounding modes or simply truncate the result. Each rounding mode has its share of advantages and drawbacks, or it’s suitable for a particular use (financial, scientific, etc.)

Calculation notation

  1. RPN (Reverse Polish Notation)
    • Does not have parenthesis, uses stack, computes intermediate values immediately.
    • Does not have "=" key but have "push to stack" key (often called "ENTER")
    • Quicker, less keystrokes.
  2. Algebraic (or infix notation)
    • Uses parenthesis and knows about operator precedence.
    • In practice, requires more cognitive effort.
    • Have "(", ")" and "=" keys

Programmable vs. fixed function

  1. Programmable calculators have larger non-volatile storage where they store user input sequence (a program) and provide additional functions for execution control (value testing, looping,...)
  2. More sophisticated machines are more akin to full-blown computers and have a high-level language interpreter (like BASIC)

Gates vs Microcode

How much is done in hardware (in gates) vs. in microcode (ROM). Here, by gates, I mean strictly specialized digital logic dedicated to computing arithmetic functions.

  1. Everything is done in gates (no ROM) (early, or very simple, calculators).
  2. Very minimal basic operations in gates; most functions is in code (ex. using HP's Saturn CPU).
  3. Most is done in gates, only the most complex functions in code.
  4. Everything is done in code, minimal necessary functionality in gates (ex. some Z80-based calculators).


What algorithms are used to compute higher level functions?

  1. Taylor series
  3. Chebyshev polynomials
  4. Other series and algorithms


Keypad, or keyboard, is the primary interface to the user.

Arguably the best keypads are those made by HP for their line of calculators. They use double molded plastic and a proprietary spring system.

On the other end of the keyboard technology are super-cheap membrane keypads, such are those used in budget home computers by Sinclair and many $1 calculators from China.

A typical predefined keypad you could buy from Alibaba

A small step up is the membrane sheets pushed by extruded rubber/molded keyboard overlays. They provide more tactile feedback. They are very common since they are inexpensive and provide a good balance of price vs how they feel. Your TV remote controls, thermostats, and most of the appliances around would have those.

A step up: Extruded silicone rubber keypad

Occasionally, a frequently used button, after being pressed harder and harder, would stop working. That happened to our thermostat: two buttons setting up the temperature up and down stopped responding to our increasingly frustrating attempts to set the temperature. The remedy was to open the box and scratch the back of a rubber contact, a small graphite pad called “the contact pill”, with a soft black drawing pencil. Since the pencil leaves graphite marks, those contacts are now fully conducive and will continue working.

Since we cannot make plastic molds easily (perhaps they could be 3d printed, but I don’t have a 3d printer), the membrane-style keypad looks to be the most promising alternative for our calculator. We need to have lettering on the actual keys (numbers “0” through “9” and major functions like “+” and “-“), and also text for secondary functions accessed by some kind of a shift key, and those are normally expected to be shown next to the primary functions (above or below). We could print those on a regular piece of paper (printing using a laser color printer for nicely colored labels if we want to get fancy), and then laminate that sheet. The lamination would add extra stiffness to the paper.

There are two kinds of actuators we could use underneath.

The first one could be traditional PCB mounted push buttons. Since those have a certain thickness, we would make another PCB sheet, acting as a spacer, with round holes where the buttons are, providing that clearance. If the clearance is not sufficient, we could raise that board up with some spacers.

As you press a spot on a laminated sheet, it would press a switch; the intermediate board would keep it all flat and level.

Another option would be to use small, tactile dome switches. Their use is rather uncommon among hobbyists. Those switches are made of tiny pieces of domed conductive material (sometimes gold plated) which bend as you press to make contact in the middle. They make a quite satisfying tactile “click”. The domes should not be soldered to the board since that would stiffen their elasticity and would make them prone to failure, so they are either physically held by some template or they are glued on top (frequently they are sold with an adhesive sheet).

Tactile dome switch
Tactile dome switches sold with adhesive backing: just cut and stick to your PCB

Surprisingly, although these cost only a fraction of a cent to manufacture, US electronics shops like Mouser sell them for 50 cents each! If you shop from Alibaba (China), you can get a bag of 500 for $10. If you are making a large order, you should not pay more than 0.4 cents per button! I am sure there are quality differences, but we are not building a nuclear reactor control panel, either. It’s ok if they are not perfect.

A variation of those dome switches has a hole in the middle, where you can PCB mount an LED behind it and, for example, have it lit as you press a switch, or show a latched state.

Although I’ve never used those dome switches before, I just ordered a bag from China and am planning to give it a try with this project. I hope they will work as visualized.

There is another kind of “switches”, or better, a lack of them. With the advent of microcontrollers and their special purpose IO blocks, many now offer touch detection which does not need anything else besides access to an open trace. The controller registers even the slightest capacitive touch of a finger. Since we are not using one of those MCUs, we won’t pursue that idea here.

Modes of Operation

When we are already making our own calculator, being engineers, we should plan to support several different number bases. Besides decimal, we should also have hex and binary, and we may as well throw in octal. We could, of course, architect it to support any arbitrary base, but I don’t see any practical use for something like that.

We want to follow conventions and call these modes DEC (decimal), HEX, OCT and BIN.

Any number currently on the screen can be converted amongst these bases. Non-whole decimal numbers will be truncated to integers.

Consequently, we need a few additional keys: besides the standard “0”…”9” we also need “A”, “B”, “C”, “D”, “E”, and “F”. We also need to assign 4 mode functions to keys.

We should also add a couple of bitwise operations to go with these modes. At the very least, basic AND, OR, XOR, NOR, and SHIFT (left or right) operations come to mind.


The heart of our calculator will be a small micro-sequencer implemented in FPGA. We will try to implement IO as discrete blocks to make our life easier.

For example, keypad scanning could be a separate and independent module that takes care of debouncing and simply queues key codes to an internal FIFO.

Similarly, the LCD driver could also be a separate module that initializes the display and implements some simple protocol to control writing the characters.

The exact details should be more obvious once we start writing a concrete architectural document.


In this first article, I tried to provide an overview of various options we can select from, trying not to decide anything in particular. Other major things are still being left out (like the power source) because we are still wrapping our heads around the complete project. As I was writing, some things did start to crystalize in terms of dependency (“if you decide to implement it this way, then you can’t do that”) and feasibility. Still, anything is on the table.

Interesting Reads

This set of articles explains how HP calculators calculate various functions:


In this blog I will show you how to interface an Atari-style joystick to the Altera DE1 FPGA board running a Spectrum implementation, how to change the ROM to enable you to input some game-cheat pokes and a few games I eventually completed using this setup.
...continue reading "ZX Spectrum ROM mods"


In the last article I presented a different way of architecturally modelling a Zilog Z80 processor. It is time to do something really useful with it and what could be better than reliving the past for a moment? Let's recreate an old computer and load in and play some games!
...continue reading "ZX Spectrum on FPGA using A-Z80 CPU"


In the first post I described the sequencer, a circuit that provided discrete timing signals to space operations apart. In the second post I mentioned the Timing matrix that was run by these signals and orchestrated a dance of control signals in time.

This article is about making it all alive and kicking within an FPGA solution.
...continue reading "A Z80 : The Soul"