Arithmetic Operations

These sections help you understand what data type and scaling choices result in overflows or a loss of precision.

Modulo Arithmetic

Binary math is based on modulo arithmetic. Modulo arithmetic uses only a finite set of numbers, wrapping the results of any calculations that fall outside the given set back into the set.

For example, the common everyday clock uses modulo 12 arithmetic. Numbers in this system can only be 1 through 12. Therefore, in the clock system, 9 plus 9 equals 6. This can be more easily visualized as a number circle:

Diagram of a clock face demonstrating modulo 12 arithmetic.

Similarly, binary math can only use the numbers 0 and 1, and any arithmetic results that fall outside this range are wrapped around the circle to either 0 or 1.
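The wrapping behavior described above can be sketched in a few lines of Python. This is only an illustration; the helper names `wrap` and `clock_add` are hypothetical and are not part of any fixed-point library.

```python
def wrap(value, modulus):
    """Wrap an integer into the range 0 .. modulus - 1."""
    return value % modulus


def clock_add(a, b):
    """Add two clock hours, keeping the result in the range 1 .. 12."""
    return (a + b - 1) % 12 + 1


print(clock_add(9, 9))    # 6, as in the clock example
print(wrap(1 + 1, 2))     # 0: a single binary digit wraps 2 back to 0
print(wrap(13 + 7, 16))   # 4: a 4-bit unsigned word wraps modulo 2**4
```

The same idea extends to any word length: an N-bit unsigned word behaves like arithmetic modulo 2^N.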

Two's Complement

Two's complement is a way to interpret a binary number. In two's complement, nonnegative numbers always start with a 0 and negative numbers always start with a 1. If the leading bit of a two's complement number is 0, the value is obtained by calculating the standard binary value of the number. If the leading bit of a two's complement number is 1, the value is obtained by giving the leftmost bit a negative weight, and then calculating the binary value of the number. For example,

01 = (0 + 2^0) = 1
11 = ((-2^1) + (2^0)) = (-2 + 1) = -1
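As a minimal sketch of this interpretation rule, the following Python function weights the leftmost bit negatively and every other bit positively. The function name `twos_complement_value` is an illustrative assumption.

```python
def twos_complement_value(bits):
    """Return the integer value of a two's complement bit string."""
    value = 0
    for i, bit in enumerate(bits):
        weight = 2 ** (len(bits) - 1 - i)
        if i == 0:
            weight = -weight  # the leftmost bit carries a negative weight
        value += int(bit) * weight
    return value


print(twos_complement_value("01"))     # 1
print(twos_complement_value("11"))     # -1
print(twos_complement_value("11010"))  # -6
```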

To compute the negative of a binary number using two's complement,

  1. Take the one's complement, or flip the bits.

  2. Add 2^(-FL), the value of the least significant bit, using binary math, where FL is the fraction length.

  3. Discard any bits carried beyond the original word length.

For example, consider taking the negative of 11010 (-6). First, take the one's complement of the number:

11010 → 00101

Next, add a 1, wrapping all numbers to 0 or 1:

  00101
+     1
-------
  00110   (6)
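The three negation steps can be sketched in Python as follows. The sketch works on the stored bit pattern, so adding 1 to the pattern corresponds to adding 2^(-FL) for any fraction length. The function name `negate` is a hypothetical helper.

```python
def negate(bits):
    """Negate a two's complement bit string of fixed word length."""
    n = len(bits)
    mask = (1 << n) - 1
    ones_complement = int(bits, 2) ^ mask  # step 1: flip the bits
    total = ones_complement + 1            # step 2: add one least significant bit
    total &= mask                          # step 3: discard carries past the word length
    return format(total, "0{}b".format(n))


print(negate("11010"))  # 00110, that is, -(-6) = 6
print(negate("00110"))  # 11010
```

Note that the most negative pattern (for example, 10000) negates to itself, because its positive counterpart does not fit in the same word length.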

Addition and Subtraction

The addition of fixed-point numbers requires that the binary points of the addends be aligned. The addition is then performed using binary arithmetic so that no number other than 0 or 1 is used.

For example, consider the addition of 010010.1 (18.5) with 0110.110 (6.75):

  010010.1     (18.5)
+   0110.110   (6.75)
------------
  011001.010   (25.25)
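The alignment step can be illustrated with stored integers. In this sketch, each operand is a pair of a stored integer and a fraction length, and the operand with fewer fractional bits is shifted left so both share the same binary point. The function name `align_and_add` is an assumption made for this example, and the sketch ignores word-length limits.

```python
def align_and_add(a_int, a_fl, b_int, b_fl):
    """Add two fixed-point values given as (stored integer, fraction length)."""
    fl = max(a_fl, b_fl)
    a_aligned = a_int << (fl - a_fl)  # pad with zeros on the right
    b_aligned = b_int << (fl - b_fl)
    return a_aligned + b_aligned, fl


a = int("0100101", 2)  # 010010.1 with fraction length 1 (18.5)
b = int("0110110", 2)  # 0110.110 with fraction length 3 (6.75)
total, fl = align_and_add(a, 1, b, 3)
print(format(total, "09b"), total / 2**fl)  # 011001010 25.25
```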

Fixed-point subtraction is equivalent to adding while using the two's complement value for any negative values. In subtraction, the addends must be sign-extended to match each other's length. For example, consider subtracting 0110.110 (6.75) from 010010.1 (18.5):

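  010010.100   (18.5)
- 000110.110   (6.75)
------------
  001011.110   (11.75)

Equivalently, add the two's complement of the subtrahend, 111001.010 (-6.75), to 010010.100 and discard the carry out of the leftmost bit.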
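Subtraction as addition of a two's complement value can be sketched as follows, assuming a 9-bit word with three fractional bits so that both operands of the example fit. The names `to_stored`, `to_real`, and `subtract` are hypothetical helpers.

```python
WORD_LENGTH = 9      # six integer bits (including the sign) and three fractional bits
FRACTION_LENGTH = 3
MASK = (1 << WORD_LENGTH) - 1


def to_stored(value):
    """Encode a real value as a 9-bit two's complement bit pattern."""
    return int(value * 2**FRACTION_LENGTH) & MASK


def to_real(stored):
    """Decode a 9-bit two's complement bit pattern to a real value."""
    if stored & (1 << (WORD_LENGTH - 1)):
        stored -= 1 << WORD_LENGTH
    return stored / 2**FRACTION_LENGTH


def subtract(a, b):
    """Compute a - b by adding the two's complement of b."""
    neg_b = ((b ^ MASK) + 1) & MASK  # flip the bits, add one, discard carries
    return (a + neg_b) & MASK        # add, discarding the carry out


a = to_stored(18.5)
b = to_stored(6.75)
diff = subtract(a, b)
print(format(diff, "09b"), to_real(diff))  # 001011110 11.75
```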

In the Fixed-Point Designer™ software, the CastBeforeSum property of the fimath object has a default value of 1 (true). This casts addends to the sum data type before addition. Therefore, no further shifting is necessary during the addition to line up the binary points.

If the CastBeforeSum property has a value of 0 (false), the addends are added with full precision maintained. After the addition, the sum is then quantized.
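The order of casting and adding can change the result. The following Python sketch contrasts the two orders on stored integers. It is not the Fixed-Point Designer implementation: it assumes rounding toward negative infinity, ignores overflow, and all function names are hypothetical.

```python
def quantize(stored, fl_from, fl_to):
    """Change the fraction length of a stored integer, rounding toward negative infinity."""
    if fl_to >= fl_from:
        return stored << (fl_to - fl_from)
    return stored >> (fl_from - fl_to)


def add_cast_before_sum(a, a_fl, b, b_fl, sum_fl):
    """Cast each addend to the sum fraction length, then add."""
    return quantize(a, a_fl, sum_fl) + quantize(b, b_fl, sum_fl)


def add_then_cast(a, a_fl, b, b_fl, sum_fl):
    """Add with full precision, then quantize the sum."""
    fl = max(a_fl, b_fl)
    full = quantize(a, a_fl, fl) + quantize(b, b_fl, fl)
    return quantize(full, fl, sum_fl)


# 1.25 + 1.25 with two fractional bits, sum type with one fractional bit
before = add_cast_before_sum(5, 2, 5, 2, 1)
after = add_then_cast(5, 2, 5, 2, 1)
print(before / 2, after / 2)  # 2.0 2.5
```

In this example each addend loses its lowest bit when cast first, so the two orders differ by one least significant bit of the sum type.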

Multiplication

The multiplication of two's complement fixed-point numbers is directly analogous to regular decimal multiplication, with the exception that the intermediate results must be sign-extended so that their left sides align before you add them together.

For example, multiplying 10.11 (-1.25) by 011 (3) gives the full-precision product 11100.01 (-3.75).
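A shift-and-add sketch in Python shows where the sign extension of the partial products comes in. The multiplicand is sign-extended to the full product width before any partial product is formed, and the sign bit of the multiplier contributes a negated partial product. The helper names `from_bits` and `multiply` are assumptions made for this illustration.

```python
def from_bits(bits):
    """Return the integer value of a two's complement bit string."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value


def multiply(a_bits, b_bits):
    """Multiply two two's complement bit strings by shift-and-add."""
    wl = len(a_bits) + len(b_bits)  # full-precision product word length
    mask = (1 << wl) - 1
    a = from_bits(a_bits) & mask    # sign-extend the multiplicand to the product width
    acc = 0
    for i, bit in enumerate(reversed(b_bits)):
        if bit == "1":
            partial = (a << i) & mask
            if i == len(b_bits) - 1:
                partial = (-partial) & mask  # the multiplier's sign bit has negative weight
            acc = (acc + partial) & mask
    return format(acc, "0{}b".format(wl))


product = multiply("1011", "011")          # 10.11 (-1.25) times 011 (3)
print(product, from_bits(product) / 2**2)  # 1110001 -3.75
```

The fraction length of the product is the sum of the operand fraction lengths (here 2 + 0), so the bit pattern 1110001 reads as 11100.01.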

Multiplication Data Types

The following diagrams show how the Fixed-Point Designer software determines data types used for fixed-point multiplication. The diagrams illustrate the differences between the data types used for real-real, complex-real, and complex-complex multiplication.

Real-Real Multiplication.  This diagram shows the data types used by the Fixed-Point Designer software in the multiplication of two real numbers. The output of this operation is returned in the product data type, which is governed by the ProductMode property of the fimath object.

Input a data type and input c data type are fed into a multiplier. The multiplier outputs product data type ac.

Real-Complex Multiplication.  This diagram shows the data types used by the Fixed-Point Designer software in the multiplication of a real and a complex fixed-point number. Real-complex and complex-real multiplication are equivalent. The software returns the output of this operation in the product data type, which is governed by the ProductMode property of the fimath object:

Real input a is fed into two multipliers. Complex input c+di is first split into real and imaginary components c and d. c is multiplied with a in the first multiplier; d is multiplied with a in the second multiplier. The products ac and ad are recombined into complex output ac+adi.

Complex-Complex Multiplication.  This diagram shows the multiplication of two complex fixed-point numbers. The software returns the output of this operation in the sum data type, which is governed by the SumMode property of the fimath object. The intermediate product data type is determined by the ProductMode property of the fimath object.

Complex inputs a+bi and c+di are split into real and imaginary components. These inputs are multiplied by four multipliers, which output the product data type for each multiplication operation. These are then cast to the sum or product data type (Sum data type if CastBeforeSum is true, Product data type if CastBeforeSum is false). The outputs of the cast are fed into a subtractor and an adder which output the sum data type. The real and imaginary parts are combined to produce the final output (ac-bd)+(ad+bc)i.

When the CastBeforeSum property of the fimath object is true, the casts to the sum data type are present after the multipliers in the preceding diagram. In C code, this is equivalent to

acc = ac;
acc -= bd;

for the subtractor, and

acc = ad;
acc += bc;

for the adder, where acc is the accumulator. When the CastBeforeSum property is false, the casts are not present and the data remains in the product data type before the subtraction and addition operations.
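The accumulator pattern can be mirrored in a short Python sketch. Python integers have unlimited precision, so no casts appear here; the sketch only shows the data flow through the four multipliers, the subtractor, and the adder. The function name `complex_multiply` is hypothetical.

```python
def complex_multiply(a, b, c, d):
    """Return the real and imaginary parts of (a + bi) * (c + di)."""
    ac, bd, ad, bc = a * c, b * d, a * d, b * c  # four multipliers

    acc = ac   # subtractor path
    acc -= bd
    real = acc

    acc = ad   # adder path
    acc += bc
    imag = acc

    return real, imag


print(complex_multiply(5, 6, 10, 2))  # (38, 70)
```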

Multiplication with the fimath Object

These examples show how the ProductMode and SumMode properties of the fimath object impact multiplication for real and complex data.

In the following examples, let

F = fimath('ProductMode','FullPrecision',...
'SumMode','FullPrecision');

T1 = numerictype('WordLength',24,'FractionLength',20);
T2 = numerictype('WordLength',16,'FractionLength',10);

P = fipref;
P.FimathDisplay = 'none';

Real*Real.  Multiply two real numbers x and y. Notice that the word length and fraction length of the result z are equal to the sum of the word lengths and fraction lengths, respectively, of the multiplicands. This is because the ProductMode and SumMode properties of the fimath object are set to FullPrecision.

x = fi(5,T1,F)
y = fi(10,T2,F)
z = x*y
x = 

     5

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 24
        FractionLength: 20

y = 

    10

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 16
        FractionLength: 10

z = 

    50

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 40
        FractionLength: 30

Real*Complex.  Multiply real number x with complex number y. Notice that the word length and fraction length of the result z are equal to the sum of the word lengths and fraction lengths, respectively, of the multiplicands. This is because the ProductMode and SumMode properties of the fimath object are set to FullPrecision.

x = fi(5,T1,F)
y = fi(10+2i,T2,F)
z = x*y
x = 

     5

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 24
        FractionLength: 20

y = 

  10.0000 + 2.0000i

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 16
        FractionLength: 10

z = 

  50.0000 +10.0000i

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 40
        FractionLength: 30

Complex*Complex.  Multiply complex number x with complex number y. Complex-complex multiplication involves an addition as well as multiplication. As a result, the word length of the full-precision result has one more bit than the sum of the word lengths of the multiplicands.

x = fi(5+6i,T1,F)
y = fi(10+2i,T2,F)
z = x*y
x = 

   5.0000 + 6.0000i

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 24
        FractionLength: 20

y = 

  10.0000 + 2.0000i

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 16
        FractionLength: 10

z = 

  38.0000 +70.0000i

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 41
        FractionLength: 30
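The word lengths and fraction lengths in the three results above follow a simple pattern, sketched here in Python. The function only reproduces the full-precision sizes shown in these examples; it is a hypothetical helper, not a Fixed-Point Designer function, and it does not cover other ProductMode or SumMode settings.

```python
def full_precision_product_type(wl_x, fl_x, wl_y, fl_y, both_complex=False):
    """Return (word length, fraction length) of a full-precision product."""
    wl = wl_x + wl_y
    fl = fl_x + fl_y
    if both_complex:
        wl += 1  # the sum of two products needs one extra bit
    return wl, fl


print(full_precision_product_type(24, 20, 16, 10))                     # (40, 30)
print(full_precision_product_type(24, 20, 16, 10, both_complex=True))  # (41, 30)
```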

Casts

The fimath object allows you to specify the data type and scaling of intermediate sums and products with the SumMode and ProductMode properties. It is important to keep in mind the ramifications of each cast when you set the SumMode and ProductMode properties. Depending upon the data types you select, overflow and/or rounding might occur. These examples demonstrate cases where overflow and rounding can occur. For more examples of casting, see Cast fi Objects.

Casting from a Shorter Data Type to a Longer Data Type

Consider the cast of a nonzero number, represented by a 4-bit data type with two fractional bits, to an 8-bit data type with seven fractional bits.

Diagram representing the cast of a 4-bit number with two fractional bits to an 8-bit type with seven fractional bits. The source bits must be shifted up to match the binary point position of the destination data type. The left-most bit from the source data type "falls off" the high end with the shift up. Overflow might occur. The result will saturate or wrap. The five right-most bits of the destination data type are padded with 0's or 1's.

As the diagram shows, the source bits are shifted up so that the binary point matches the destination binary point position. The highest source bit does not fit, so overflow might occur and the result can saturate or wrap. The empty bits at the low end of the destination data type are padded with either 0's or 1's.

  • If overflow does not occur, the empty bits are padded with 0's.

  • If wrapping occurs, the empty bits are padded with 0's.

  • If saturation occurs,

    • The empty bits of a positive number are padded with 1's.

    • The empty bits of a negative number are padded with 0's.

Even with a cast from a shorter data type to a longer data type, overflow might still occur. This can happen when the integer length of the source data type (in this case two) is longer than the integer length of the destination data type (in this case one). Similarly, rounding might be necessary even when casting from a shorter data type to a longer data type if the destination data type has fewer fractional bits than the source.
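The padding rules listed above can be checked with a small Python sketch of this cast, from a 4-bit type with two fractional bits to an 8-bit type with seven fractional bits. The function `cast_up` and its `overflow` argument are assumptions made for illustration and do not correspond to a Fixed-Point Designer API.

```python
def cast_up(stored, src_fl, dst_wl, dst_fl, overflow="saturate"):
    """Shift a signed stored integer up to a longer fraction length."""
    shifted = stored << (dst_fl - src_fl)
    lo = -(1 << (dst_wl - 1))
    hi = (1 << (dst_wl - 1)) - 1
    if lo <= shifted <= hi:
        return shifted                    # no overflow: low bits are zeros
    if overflow == "saturate":
        return hi if shifted > hi else lo
    wrapped = shifted & ((1 << dst_wl) - 1)  # wrap: keep only the low dst_wl bits
    if wrapped > hi:
        wrapped -= 1 << dst_wl
    return wrapped


def bits8(value):
    """Show a signed value as an 8-bit two's complement pattern."""
    return format(value & 0xFF, "08b")


print(bits8(cast_up(3, 2, 8, 7)))               # 01100000: 0.75 fits
print(bits8(cast_up(6, 2, 8, 7, "saturate")))   # 01111111: positive saturation pads with 1's
print(bits8(cast_up(-6, 2, 8, 7, "saturate")))  # 10000000: negative saturation pads with 0's
print(bits8(cast_up(6, 2, 8, 7, "wrap")))       # 11000000: wrapping pads with 0's
```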

Casting from a Longer Data Type to a Shorter Data Type

Consider the cast of a nonzero number, represented by an 8-bit data type with seven fractional bits, to a 4-bit data type with two fractional bits.

Diagram representing the cast of an 8-bit data type with seven fractional bits to a 4-bit data type with two fractional bits. The source bits must be shifted down to match the binary point position of the destination data type. There is no value for the left-most bit from the source, so the result must be sign-extended to fill the destination data type. The five right-most bits from the source do not fit into the destination data type. The result is rounded.

As the diagram shows, the source bits are shifted down so that the binary point matches the destination binary point position. There is no value for the highest bit from the source, so sign extension is used to fill the integer portion of the destination data type. Sign extension is the addition of bits that have the value of the most significant bit to the high end of a two's complement number. Sign extension does not change the value of the binary number. In this example, the bottom five bits of the source do not fit into the fraction length of the destination. Therefore, precision can be lost as the result is rounded.

In this case, even though the cast is from a longer data type to a shorter data type, all the integer bits are maintained. Conversely, full precision can be maintained even if you cast to a shorter data type, as long as the fraction length of the destination data type is the same length or longer than the fraction length of the source data type. In that case, however, bits are lost from the high end of the result and overflow can occur.

The worst case occurs when both the integer length and the fraction length of the destination data type are shorter than those of the source data type and scaling. In that case, both overflow and a loss of precision can occur.
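The cast from an 8-bit type with seven fractional bits to a 4-bit type with two fractional bits can be sketched in Python as an arithmetic right shift. The sketch assumes rounding toward negative infinity, which is only one of several possible rounding methods, and the helper names are hypothetical.

```python
def cast_down(stored, src_fl, dst_fl):
    """Drop low-order fractional bits, rounding toward negative infinity."""
    # Python's >> on a negative integer is an arithmetic shift, so the
    # sign is preserved, which corresponds to sign extension.
    return stored >> (src_fl - dst_fl)


def bits4(value):
    """Show a signed value as a 4-bit two's complement pattern."""
    return format(value & 0xF, "04b")


pos = cast_down(104, 7, 2)  # 0.8125 stored as 01101000
neg = cast_down(-40, 7, 2)  # -0.3125 stored as 11011000
print(bits4(pos), pos / 2**2)  # 0011 0.75
print(bits4(neg), neg / 2**2)  # 1110 -0.5
```

Every 8-bit stored value shifts into the range -4 to 3, which fits in the 4-bit destination, so this particular cast loses precision but cannot overflow.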
