Learning · Step 4
Fixed-point arithmetic
A value sitting in a Q-format is the easy part. The moment you compute with it the format moves: every add and every multiply produces a result whose Q and word width differ from the inputs. Miss that and the result silently overflows, or you quietly throw away the low bits. Two rules cover almost all of it.
Add. Two fixed-point numbers can only be added once
their binary points line up — they need the same Q. Aligned, the sum
keeps exactly those fractional bits, but the integer side can carry:
Qm.n + Qm.n → Q(m+1).n. Add two 16-bit values and the
safe result is 17 bits. Sum K of them and you need
⌈log₂K⌉ extra integer guard bits up top.
Multiply. Multiplying multiplies the ranges and
adds the resolutions: Qa × Qb → Q(a+b) — the
fractional bits add. An N-bit by N-bit
product needs 2N bits, so 16×16 → 32. One
catch for signed values: the top two bits of the product are both
sign. Q15 × Q15 lands in Q30 of a 32-bit
word — one bit short of Q31 — which is exactly why DSP
hardware offers a fractional multiply that shifts the result
left by one.
That wide result rarely gets to stay wide. Sooner or later you store it back into a native word and the low bits have to go. How they go — plain truncation or rounding — decides whether the quantization error averages to zero or builds up as a steady DC offset. The explorer below shows the format arithmetic and that final squeeze together.
Block diagrams
An adder and a multiplier as datapath blocks — inputs on the left, result on the right. Watch the format labels: the output never carries the same format as the inputs.
Try it
Pick an operation and the operand formats — the result's bit field updates live. Then choose where to store it and how to round, and watch the error and its bias.
Store the result back into a word
What to notice
- In Multiply, raise
QAandQB— the result Q is exactly their sum and the word width doubles. The fraction segment grows; the integer segment shrinks to match. - Switch to Add with
QA ≠ QB— the explorer aligns to the finer grid first. The coarser operand is shifted up, and the result is one bit wider than the inputs to hold the carry. - Drag Target Q down — each step drops one more low bit (the red drop segment grows). That is the precision you are paying with.
- Flip Truncate → Round at the same target — the stored value barely moves, but the average error goes from −0.5 LSB to ≈ 0. Truncate is one shift; round is an add then a shift.
- Enter
1.5at a high target Q — the value can't fit and the result saturates. Clamping beats wrapping every time.
Output truncation — best practice
- Round, don't truncate, on anything that feeds more arithmetic. Truncation's −0.5 LSB bias is a DC offset; through a feedback path it accumulates and can even sustain IIR limit cycles.
- Truncate once, at the end. Keep the full double-width product through a chain of multiply-accumulates and collapse to the storage word only at the output. Rounding after every step throws away precision and adds a fresh dose of noise each time.
- Give the accumulator guard bits. Summing
Kproducts can carry⌈log₂K⌉bits past the product width. Size the accumulator for the worst-case sum, not the typical one. - Saturate the integer side. When the result's integer part won't fit, clamp to the maximum — never let it wrap. A wrap flips the sign and reads as a loud click; saturation just flattens the peak. Round first, then saturate.
- Use convergent rounding (round-half-to-even) when even round-to-nearest's tiny tie bias matters — long integrators, energy meters, anything that sums millions of samples. For ordinary signal paths, plain round-to-nearest is enough.
Step exam
Answer all 3 questions correctly to complete this step.
-
Multiplying a Qa value by a Qb value gives a result in which format?
-
Adding two N-bit fixed-point numbers needs how many bits to stay safe from overflow?
-
An N-bit by N-bit signed multiply produces a result of what width?