A Calculator (3): Practical Numerical Methods

Post 2 answered the feasibility question: yes, tan, ln, exp, and sqrt can all be computed from add, subtract, and multiply alone, with some tricks. Post 3 is about doing it correctly, to the full 16 decimal digits, with a verified reference implementation that the hardware will eventually be tested against.

When I started this project in 2021, I needed C++ code that would implement the algorithms using BCD primitives and verify the results. That became the Methods sub-project. It worked, but some subtle mantissa-width bugs crept in that made the test results unreliable in the last digit or two. Rather than patch it, I rewrote the whole thing from scratch in 2025 as the Proto sub-project: a cleaner architecture, a proper reference, and a hardware test vector generator that the FPGA microcode now validates against.

The Internal Number Format

This scientific calculator’s registers hold:

  • 16-digit BCD mantissa, normalized so the first digit is always 1–9 (except for the number zero, where all 16 digits are zeroes)
  • sign bit for the mantissa
  • 2-digit BCD exponent, giving a range of 10⁻⁹⁹ to 10⁹⁹
  • sign bit for the exponent

That structure is analogous to what the HP-35 and HP-41 used internally (those machines used 10-digit mantissas). The format is a natural fit for a decimal calculator: what the user types is what gets stored, with no binary representation error. Type 1 ÷ 3 × 3 and you get exactly 1, not 0.999999999. Every intermediate result stays in the decimal domain the user expects.
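As a concrete sketch, the register can be written down as a C++ struct. This is illustrative only: the field names and the unpacked one-digit-per-byte layout are my assumptions, not the project's actual storage format.

```cpp
#include <cstdint>

// Illustrative register layout matching the description above.
struct BcdRegister {
    uint8_t mantissa[16]; // BCD digits, MSD first; mantissa[0] is 1-9 unless the value is zero
    bool    mantSign;     // true = negative mantissa
    uint8_t exponent[2];  // two BCD digits, magnitude 00-99
    bool    expSign;      // true = negative exponent, giving 10^-99 .. 10^99
};
```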

Methods: The Initial Implementation

In the Methods project, I implemented eight functions in C++: add/sub, multiply, divide, ln, exp, tan, and atan. Each implementation included both the BCD algorithm and a test harness that compared results against C++ double. The MAX_MANT define was intended to be 16 throughout, but I got it wrong and ended up with a mixture of 14- and 16-digit widths – this was the very first prototyping code, and I was juggling too many buffer sizes at once. The bugs this caused were quiet: results looked right but were off in the last few decimal places, and I could not tell real errors apart from rounding artefacts.

The comparison mechanism was straightforward: run the BCD algorithm on a test input, convert the result to double, subtract the C++ native result for the same input, and check whether the difference is within some tolerance. All eight functions converged correctly across the tested range. The trouble was that with the mantissa inconsistency I couldn’t tell whether a failing test was a real bug or just noise from the width mismatch.
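A minimal sketch of that check, assuming a relative-tolerance comparison (the function name and the scaling choice are mine):

```cpp
#include <cmath>

// Methods-era style check: the BCD result, converted to double, is
// compared against the native C++ result for the same input.
bool withinTolerance(double bcdAsDouble, double nativeResult, double tol) {
    double diff = std::fabs(bcdAsDouble - nativeResult);
    // Scale by the reference magnitude so the check behaves the same
    // across the exponent range (the floor of 1.0 guards the zero case).
    return diff <= tol * std::fmax(std::fabs(nativeResult), 1.0);
}
```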

Fast forward a couple of years, and that inconsistency drove the search for a better reference and, eventually, a complete rewrite.

Searching for the Right Reference

A plain double cannot fully verify 16-digit BCD arithmetic: it guarantees only 15 significant decimal digits, one short of what we need. And once operations are chained, the precision erodes further.

The solution turned out to be in the compiler. GCC on x86-64 implements long double as an 80-bit extended-precision type using the x87 FPU: the full 64-bit mantissa that Intel has shipped in every x86 processor since the 8087 in 1980. That gives 18–19 significant decimal digits, two to three digits of headroom over a 16-digit BCD result. When the BCD result and the long double reference disagree at digit 16, the BCD is wrong.
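The relevant limits are visible directly in <cfloat>; this fragment just names them (the constexpr names are mine):

```cpp
#include <cfloat>

// DBL_DIG and LDBL_DIG are the guaranteed round-trip decimal digits.
// On x86-64 Linux/GCC: DBL_DIG = 15, LDBL_DIG = 18 (x87 80-bit format).
// On MSVC, long double is a plain double, so both would be 15.
constexpr int kDoubleDigits    = DBL_DIG;
constexpr int kReferenceDigits = LDBL_DIG;
constexpr int kHeadroom        = kReferenceDigits - 16; // beyond the 16-digit BCD mantissa
```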

The catch is that this only works with GCC on Linux. On Windows, MSVC maps long double to a plain 64-bit double, giving no extra precision at all. Proto compiles and runs on Windows too, but full precision requires Linux or WSL. Since WSL is already part of my Windows dev setup, that worked out well.

Before settling on long double, I evaluated several alternatives:

  • GNU MP / MPFR is the gold standard for arbitrary precision arithmetic in C, but the API is verbose and the build setup quite heavy for my project.
  • Boost.Multiprecision wraps MPFR behind a cleaner C++ interface but still requires the underlying library.
  • Henrik Vestermark’s arbitrary precision C++ package was the most promising – header-only, portable, no dependencies – but I hit corner-case bugs on Linux GCC that produced wrong results in specific input ranges. That turned an otherwise great package into one more questionable variable, so I decided not to use it. (ref)
  • Python’s decimal module was useful for spot-checking specific values.
  • Wolfram Alpha confirmed a handful of critical results manually.

None of these alternatives added enough benefit over what GCC already provided for free.

Proto: The Golden Reference

The Proto project is the clean rewrite: consistent 16-digit BCD mantissas throughout, verification against GCC’s 80-bit long double on Linux, a proper tolerance framework with three classes (Strict, Standard, Relaxed), and a hardware test vector generator. With 18–19 digits of reference precision versus 16 digits of BCD, there is no ambiguity: a disagreement at any digit is a BCD bug.
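One way such a tolerance check can be expressed is as "digits of agreement" between the BCD result and the reference; a sketch (my names, not Proto's API, and it assumes a nonzero reference):

```cpp
#include <cmath>

// Approximate number of leading decimal digits on which a BCD result
// and the long double reference agree. A tolerance class would then
// just set a minimum digit count (e.g. 16 for Strict).
int digitsOfAgreement(long double bcd, long double ref) {
    if (bcd == ref) return 18;                         // cap at reference precision
    long double rel = std::fabs(bcd - ref) / std::fabs(ref);
    return (int)std::floor(-std::log10(rel));          // ~shared leading digits
}
```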

Proto operates in three modes, selected by command-line flags:

  • Dev mode (default): runs all test vectors and prints only failures (NEAR and MISS lines). A clean run is silent. The -c flag enables ANSI color output that highlights exactly which digits disagree. This also looks cool 😉
  • HW vectors mode (-t): generates test vectors for hardware verification, one per line, in a fixed-column format that the Verilog test harness can consume directly. Every algorithm (all 25 operations) can have its vectors generated this way.
  • Interactive RPN calculator (-?): a command-line four-level RPN stack using full BCD arithmetic. Type 3 enter 4 * and you get 12. This is useful for spot-checking results and for comparing against hardware. This mode was totally an afterthought: after I had already coded most of the functions, why not add an input line and let it execute RPN expressions?
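The four-level stack itself is simple; a sketch with doubles standing in for the BCD registers (names are illustrative, not Proto's actual code):

```cpp
// Classic HP-style four-level RPN stack: T, Z, Y, X.
struct RpnStack {
    double t = 0, z = 0, y = 0, x = 0;
    void enter(double v) { t = z; z = y; y = x; x = v; } // lift stack, T is lost
    void binary(char op) {           // consume Y and X, drop the stack
        double r = 0;
        switch (op) {
            case '+': r = y + x; break;
            case '-': r = y - x; break;
            case '*': r = y * x; break;
            case '/': r = y / x; break;
        }
        x = r; y = z; z = t;         // T replicates downward on drop
    }
};
```

Typing 3 enter 4 * then maps to `s.enter(3); s.enter(4); s.binary('*');`, leaving 12 in X.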

Testing as First-Class Work

The Intel Pentium FDIV bug of 1994 is the poster child of incomplete testing. Five missing entries in a 1,066-entry lookup table caused the processor to return wrong results for certain division inputs. The bug shipped in millions of processors before Thomas Nicely, a mathematics professor, noticed discrepancies while checking prime number computations. The technical cause was tiny; the consequences cost Intel $475 million.

Proto exists to catch the BCD equivalent: a rounding error at the 16th digit, a CORDIC iteration off by one, an edge case in the guard digit logic.

Since the BCD arithmetic engine and the FPGA microcode are verified using the same test vectors, generated by the same implementation, any divergence between them should be caught immediately.

The approach is straightforward: Proto runs its BCD reference implementation over a large set of generated inputs and prints the results as test vectors. The FPGA test harness feeds those vectors to the microcode and compares the outputs. A discrepancy means that either the microcode or the BCD algorithm has a bug, and either way it shows up quickly. Post 4 covers the complete test vector pipeline in detail.

Guard Digit, Sticky Bit, and Rounding

Every arithmetic operation can generate more digits than the mantissa can hold. The solution is to track the guard digit and a sticky bit (recording whether any nonzero digits were discarded further out on the right) during each operation, then apply round-half-to-even to decide the final 16th digit. Post 9 covers the full mechanism, including how the sticky bit handles the subtraction cancellation problem.
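The rounding decision itself is small. Here is a sketch of the round-half-to-even rule, assuming the guard digit and sticky bit have already been collected (the function name is mine; Post 9 has the real mechanism):

```cpp
// Decide whether the 16-digit mantissa rounds up, given the guard
// digit (the 17th digit) and the sticky bit (set if any nonzero digit
// was discarded beyond the guard).
bool roundsUp(int lastDigit, int guardDigit, bool sticky) {
    if (guardDigit > 5) return true;   // clearly above the halfway point
    if (guardDigit < 5) return false;  // clearly below the halfway point
    if (sticky) return true;           // guard 5 + sticky: strictly above halfway
    return (lastDigit % 2) != 0;       // exact tie: round so the last digit is even
}
```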

The precision depends on both the rounding strategy and the algorithm used to compute each function. For the transcendentals, that algorithm is CORDIC. In this implementation, the four basic operations (+, −, ×, ÷) achieve full 16-digit precision (0.5 ULP), while the transcendentals lose one to three digits depending on the algorithm’s internal iteration count.

A side note: There is a classic trick for probing a calculator’s internal precision: type 9, then apply sin, cos, tan, arctan, arccos, arcsin in sequence (in degrees mode). Mathematically, arcsin(arccos(arctan(tan(cos(sin(9°)))))) should return exactly 9: you are just applying six functions and their inverses. In practice, every calculator returns something different. The HP-41C returns 9.000417403. The HP-42S returns 8.99999864267. A TI with a Mostek chip returns 8.99999614252. The deviation from 9 is a fingerprint of the internal algorithm family and the number of guard digits.

This test was coined by Mike Sebastian, who built a database of results from hundreds of calculators to trace which chip families were shared between manufacturers (link).
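For comparison, here is the same forensic chain run in GCC's 80-bit long double (a sketch; mathematically the chain is the identity, so any deviation from 9 is accumulated rounding error):

```cpp
#include <cmath>

// Mike Sebastian's forensic chain: 9 -> sin, cos, tan, atan, acos,
// asin, all interpreted in degrees, as on a calculator in DEG mode.
long double forensicNine() {
    const long double DEG = std::acos(-1.0L) / 180.0L; // degrees -> radians
    long double v = 9.0L;
    v = std::sin(v * DEG);
    v = std::cos(v * DEG);
    v = std::tan(v * DEG);
    v = std::atan(v) / DEG;   // inverses return to degrees
    v = std::acos(v) / DEG;
    v = std::asin(v) / DEG;
    return v;                 // exactly 9 in exact arithmetic
}
```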

CORDIC: One Algorithm to Rule Them All

Once the basic four operations are working, the transcendentals can be implemented. The approach used here is CORDIC, the same family of algorithms the HP-35 used in 1972. The reason it keeps appearing in BCD calculator design is that it requires only additions, subtractions, and digit shifts, with no multiplication in the inner loop.

The trigonometric case is the clearest to explain. To find tan(32°), start with a point Q at (1, 0) on the unit circle and rotate it counterclockwise until the total rotation angle equals 32°. When you get there, the tangent is simply Qy/Qx.

The rotation formula is:

Qx_new = Qx - Qy × tan(θ)
Qy_new = Qy + Qx × tan(θ)

The trick is in choosing which angles to rotate by. If you pick angles whose tangents are powers of ten: 1, 0.1, 0.01, 0.001… then multiplying by the tangent is a decimal digit shift, not a multiplication. Those angles happen to be 45°, 5.711°, 0.5729°, and so on, and their values are precomputed and stored in a table.

The algorithm then works iteratively: at each step, check whether adding the next table angle would overshoot the target. If not, apply the rotation. If yes, skip it and move to the next smaller angle. After K steps (K is the table size), the accumulated angle is as close to the target as K digits of precision allow.

The Methods project used K=8 CORDIC table entries for all functions. Raw CORDIC converges at about one digit per iteration, so 8 iterations gives roughly 8 digits of raw precision. Proto kept K=8 for trig but recovers the remaining precision through a residual division after the CORDIC loop and Taylor series refinement for small angles, reaching about 14 digits for tan and atan. For ln and exp, Proto extended the table to 15 iterations, reaching 15 digits.

To make this concrete, here is a full trace of computing tan(32°). The algorithm runs in two phases.

Phase 1 – Digit extraction (pseudo-division). The input angle (32° = 0.5585 rad after range reduction) is decomposed into a sum of precomputed atan values. At each step j, the algorithm subtracts atan(10⁻ʲ) from the remaining angle as many times as possible without going negative. The subtraction count becomes the extracted digit for that step:

Step j   atan(10⁻ʲ)            Digit   Remaining angle
  0      0.7854   (= 45°)        0     5.585×10⁻¹
  1      0.09967  (= 5.71°)      5     6.016×10⁻²
  2      0.010000                6     1.641×10⁻⁴
  3      0.0010000               0     1.641×10⁻⁴
  4      0.00010000              1     6.410×10⁻⁵
  5      0.000010000             6     4.098×10⁻⁶
  6      0.0000010000            4     9.806×10⁻⁸
  7      0.00000010000           0     9.806×10⁻⁸

At step 0, atan(1) = 0.7854 rad (45°) is larger than our angle 0.5585, so the digit is 0 – we cannot subtract even once. At step 1, atan(0.1) = 0.09967 rad (5.71°) fits 5 times: 5 × 0.09967 = 0.4984, leaving 0.0602. At step 2, atan(0.01) fits 6 times, consuming almost all of the remainder and leaving just 1.641 × 10⁻⁴. Each step peels off one more decimal digit of the angle. After 8 steps, the remaining angle is negligible (9.8 × 10⁻⁸) and the extracted digits are [0, 5, 6, 0, 1, 6, 4, 0]. This is the intermediate result.

Phase 2 – CORDIC rotation (pseudo-multiplication). Now the algorithm reconstructs the tangent. Starting from a unit vector (x, y) = (1, 0), it applies rotations using the extracted digit counts, working from the finest scale (j=7) up to the coarsest (j=0). Each individual rotation at scale j is:

y_new = y + x × 10⁻ʲ
x_new = x − y × 10⁻ʲ

Multiplying by 10⁻ʲ is a decimal digit shift, so each rotation uses only addition and subtraction – no multiplication in the inner loop. The digit extracted in Phase 1 tells how many times to apply this rotation at each scale. For instance, at j=1, the digit is 5, so the rotation is applied 5 times in succession at scale 10⁻¹. After 8 rounds of rotations, the vector (x, y) has been rotated by a total angle equal to the original input.

Final step – division. The tangent is the ratio of the final y and x components:

y = 0.5434297703056166
x = 0.8696694255289085
tan(32°) = y / x

This division is the only multiplication-class operation in the entire computation; the CORDIC core itself used nothing but shifts and additions to get here.

BCD result: 0.6248693519093394
Expected:   0.6248693519093275
Error:      1.2 × 10⁻¹⁴
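The two phases of the trace can be condensed into a short C++ sketch, with long double standing in for the BCD registers (the function name and digit array are mine; the real implementation runs on BCD and adds the residual refinement described earlier):

```cpp
#include <cmath>

// Decimal CORDIC tan() as in the trace: pseudo-division extracts the
// digits, pseudo-multiplication rebuilds the rotation. K = 8 table
// entries, so the sketch stops at the ~10^-8 residual angle.
long double cordicTanDeg(long double deg, int digits[8]) {
    const int K = 8;
    long double atanTab[K], scale[K];
    for (int j = 0; j < K; ++j) {
        scale[j]   = std::pow(10.0L, -j);  // a decimal digit shift in BCD
        atanTab[j] = std::atan(scale[j]);  // precomputed constant table
    }
    long double angle = deg * std::acos(-1.0L) / 180.0L; // range-reduced input

    // Phase 1: subtract each table angle as often as possible; the
    // subtraction count is the extracted digit for that scale.
    for (int j = 0; j < K; ++j) {
        digits[j] = 0;
        while (angle >= atanTab[j]) { angle -= atanTab[j]; ++digits[j]; }
    }

    // Phase 2: apply the rotations from the finest scale up, using
    // nothing but shift-and-add (multiplying by 10^-j is a shift).
    long double x = 1.0L, y = 0.0L;
    for (int j = K - 1; j >= 0; --j)
        for (int d = 0; d < digits[j]; ++d) {
            long double xn = x - y * scale[j];
            y += x * scale[j];
            x = xn;
        }
    return y / x;  // the single division at the very end
}
```

For 32°, this extracts the digits [0, 5, 6, 0, 1, 6, 4, 0] from the trace and lands within about 10⁻⁷ of tan(32°); the remaining digits come from the residual division and Taylor refinement described earlier.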

Eight CORDIC iterations plus Taylor series refinement for the small residual yield about 14 digits of agreement. For ln and exp, 15 iterations give 15 digits – exactly the design target. The measured precision of each core operation is listed in this table:

Operation    Precision         Algorithm
+, −, ×, ÷   16 digits         Guard digit + banker’s rounding
√x           ~15 digits        Newton-Raphson + ULP correction
ln           ~14.5 digits      CORDIC, 15 iterations
e^x          ~13–14 digits     CORDIC (inverse of ln)
atan         ~14.8 digits      CORDIC, K=8 + residual division
tan          ~14–14.5 digits   CORDIC + range reduction
sin, cos     ~13.5–14 digits   Half-angle formula via tan

Walther’s 1971 paper showed that sin/cos/atan, exp/ln/sqrt, and multiply/divide are all one algorithm, with a single mode parameter selecting circular, hyperbolic, or linear coordinates. HP’s chips exploited this directly: one hardware loop, three function families.

The hyperbolic mode works the same way but with a different set of precomputed constants (derived from hyperbolic tangent values instead of circular tangent values) and slightly different convergence properties: certain iterations must be repeated to guarantee convergence. The linear mode is simpler still: it degenerates to straightforward shift-and-add multiplication and division. The elegance is that the loop structure is mostly identical across all three modes; only the constant table and the sign logic change.
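The linear mode is easy to show because nothing is hidden in it; a sketch of decimal shift-and-add multiplication in that style (binary double instead of BCD, and the function name is mine):

```cpp
#include <cmath>

// Linear-mode CORDIC degenerates to digit-by-digit multiplication:
// peel b apart one decimal digit at a time and accumulate shifted
// copies of a. Only adds, subtracts, and decimal shifts are used.
double linearCordicMul(double a, double b, int K = 16) {
    double acc = 0.0;
    for (int j = 0; j < K; ++j) {
        double step = std::pow(10.0, -j);  // a digit shift in BCD
        while (b >= step) {                // count the digit at scale j
            b   -= step;
            acc += a * step;               // add a shifted copy of a
        }
    }
    return acc;  // ~= a * (original b), to roughly K digits
}
```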

Most of the decimal CORDIC algorithms trace back to work of Jacques Laporte, a French engineer who spent years reverse-engineering the algorithms of HP calculators. He figured out how the HP-35 computed ln, exp, tan, and atan from published timing data and behavioral testing. He died in 2015, but his archived website remains the most detailed public reference on HP BCD calculator algorithms.

The payoff for this design: eight core functions, implemented in assembly, give rise to the calculator’s entire function table. The core set is:

Function     Algorithm                                                       Source File
+, −, ×, ÷   Shift-and-add/subtract with guard digit and banker’s rounding   addsub.asm, mul.asm, div.asm
tan, atan    Circular CORDIC (pseudo-division / pseudo-multiplication)       tan.asm, atan.asm
ln, e^x      Meggitt’s digit-by-digit method (hyperbolic CORDIC)             log.asm, exp.asm
√x           Newton-Raphson iteration with square-based ULP correction       sqrt.asm

Almost everything else is derived from these primitives, either through thin assembly wrappers or through the scripting engine that composes basic operations into compound functions:

Function       Derived from
sin(x)         2t / (1 + t²), where t = tan(x/2)
cos(x)         sin(x + 90°)
asin(x)        atan(1 / √(1/x² − 1))
acos(x)        90° − asin(x)
log₁₀(x)       ln(x) / ln(10)
10^x           e^(x × ln(10))
y^x            e^(x × ln(y))
ˣ√y            e^((1/x) × ln(y))
1/x            1 ÷ x
x²             x × x
%              (x / 100) × y
%CHG           100 × (x − y) / y
D→R            d × (π / 180)
R→D            r × (180 / π)
→HMS, HMS→     Decimal hours ↔ H.MMSS format conversion
n!             Iterative multiplication
nPr, nCr       Iterative product and factorial
GCD            Euclidean algorithm
LCM            |a × b| / GCD(a, b)
sinh(x)        (e^x − e^(−x)) / 2
cosh(x)        (e^x + e^(−x)) / 2
tanh(x)        (e^(2x) − 1) / (e^(2x) + 1)
asinh(x)       ln(|x| + √(x² + 1))
acosh(x)       ln(x + √(x² − 1))
atanh(x)       ln((1 + |x|) / (1 − |x|)) / 2
x̄, Sx, σx      Running mean and standard deviation (Welford’s algorithm)
P→R            x = r·cos(θ), y = r·sin(θ)
R→P            r = √(x² + y²), θ = atan2(y, x)
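As an example of how thin these derivations are, sin(x) via the half-angle identity costs one tan, one multiply, one add, and one divide (sketch in double; the calculator does the same with its BCD primitives):

```cpp
#include <cmath>

// sin(x) from the core tan primitive: with t = tan(x/2),
// sin(x) = 2t / (1 + t^2).
double sinFromTan(double xRad) {
    double t = std::tan(xRad / 2.0);
    return 2.0 * t / (1.0 + t * t);
}
```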

That is 36 operations for the price of 8. Isn’t that wonderful? 🙂

What Comes Next

With a working, verified BCD arithmetic engine in C++ – 16 digits, guard digit, sticky bit, banker’s rounding, tested against 80-bit long double, hardware test vectors generated – the next step is designing a CPU to run it on. The microcode cannot be written until there is an instruction set to write it in and FPGA hardware to run and verify it on. Post 4 covers the development framework that ties these tools together. Post 5 is the first hardware prototype. Post 6 is the CPU design.
