$NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic.  As many as four
formats are supported:  single precision, double precision, extended double
precision, and quadruple precision.  All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat.  It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard.  Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code.  The
SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
has been made to accommodate compilers that are not ISO-conformant.  In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic.  If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions.  When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser.  This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704.  Funding was partially
provided by the National Science Foundation under grant MIP-9311980.  The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types:  `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision).  The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used.  The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types:  `float32' and `float64'.  Because
ISO/ANSI C guarantees at least one built-in integer type of at least 32
bits, the `float32' type is identified with an appropriate integer type.
The `float64' type is defined as a structure of two 32-bit integers, with
the machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types.  (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)

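As a rough illustration of the memory-layout claim above, the following
sketch copies a native `float' into a 32-bit integer and back without any
value conversion.  It assumes the host `float' is IEC/IEEE single precision.
The names `float32_bits', `float_to_bits', and `bits_to_float' are
hypothetical stand-ins and are not part of `softfloat.h'.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 32-bit integer type holding a float32. */
typedef uint32_t float32_bits;

/* Reinterpret a native float as its 32-bit pattern (no value conversion). */
static float32_bits float_to_bits(float f)
{
    float32_bits b;
    assert(sizeof f == sizeof b);   /* assumption: float is 32 bits wide */
    memcpy(&b, &f, sizeof b);
    return b;
}

/* Reinterpret a 32-bit pattern as a native float. */
static float bits_to_float(float32_bits b)
{
    float f;
    assert(sizeof f == sizeof b);
    memcpy(&f, &b, sizeof f);
    return f;
}
```

Under that assumption, `float_to_bits(1.0f)' yields the pattern 0x3F800000,
which is the single-precision encoding of 1.0.
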
SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format.  (The floating-
   point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding.  The rounding mode is selected
by the global variable `float_rounding_mode'.  This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'.  The rounding mode is initialized
to nearest/even.

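To make the four modes concrete, here is a minimal sketch that discards the
low bits of a sign-magnitude value under each mode.  It is not SoftFloat
code:  the enumeration `sketch_rounding' and the function `round_magnitude'
are hypothetical, and the enumerator values are unrelated to the constants
defined in `softfloat.h'.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mode names mirroring the four IEC/IEEE rounding modes. */
enum sketch_rounding {
    SKETCH_NEAREST_EVEN,
    SKETCH_TO_ZERO,
    SKETCH_DOWN,    /* toward minus infinity */
    SKETCH_UP       /* toward plus infinity */
};

/*
 * Drop the low `shift' bits (1..63) of the magnitude `mag' and return the
 * rounded magnitude.  `negative' is nonzero when the value being rounded
 * is below zero, which matters only for the two directed modes.
 */
static uint64_t round_magnitude(uint64_t mag, unsigned shift, int negative,
                                enum sketch_rounding mode)
{
    assert(shift >= 1 && shift <= 63);
    uint64_t kept = mag >> shift;
    uint64_t rem  = mag & ((UINT64_C(1) << shift) - 1);
    uint64_t half = UINT64_C(1) << (shift - 1);

    if (rem == 0) return kept;              /* already exact */
    switch (mode) {
    case SKETCH_NEAREST_EVEN:
        /* Round up past the halfway point, or on a tie when kept is odd. */
        if (rem > half || (rem == half && (kept & 1))) kept++;
        break;
    case SKETCH_TO_ZERO:
        break;                              /* truncate the magnitude */
    case SKETCH_DOWN:
        if (negative) kept++;               /* larger magnitude is lower */
        break;
    case SKETCH_UP:
        if (!negative) kept++;
        break;
    }
    return kept;
}
```

For example, with two bits dropped, the magnitude 10 (binary 10.10, that is
2.5) rounds to 2 under nearest/even, while 14 (3.5) rounds to 4.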

-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'.  The operations affected are:

   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format.  Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively.  When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified.  Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.

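The effect on the significand can be sketched as follows.  This is only an
illustration of reduced-precision rounding, not SoftFloat's internal
routine:  the function `round_significand' is hypothetical, it handles
round-to-nearest-even only, and it assumes a 64-bit significand with the
leading bit in bit 63.  Single precision keeps 24 significant bits and
double precision keeps 53.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round `sig' to `keep' significant bits (24, 53, or 64 in this sketch)
 * and zero the bits beyond the rounding point.  *carry is set to 1 when
 * rounding overflows the significand; the caller would then increment the
 * exponent.
 */
static uint64_t round_significand(uint64_t sig, unsigned keep, int *carry)
{
    assert(keep >= 1 && keep <= 64);
    *carry = 0;
    if (keep == 64) return sig;             /* full precision: no change */

    unsigned drop = 64 - keep;
    uint64_t unit = UINT64_C(1) << drop;    /* weight of the last kept bit */
    uint64_t mask = unit - 1;
    uint64_t rem  = sig & mask;
    uint64_t half = unit >> 1;
    uint64_t out  = sig & ~mask;            /* dropped bits become zero */

    if (rem > half || (rem == half && (out & unit))) {
        out += unit;
        if (out == 0) {                     /* wrapped past bit 63 */
            *carry = 1;
            out = UINT64_C(1) << 63;
        }
    }
    return out;
}
```

In every case the bits below the rounding point of the returned value are
zero, matching the behavior described above.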

-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented.  Each flag is stored as a unique bit in the global variable
`float_exception_flags'.  The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'.  The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name.  To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).

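The flag handling described above can be sketched in isolation.  Everything
named `sketch_...' below is a hypothetical stand-in:  the real variable and
masks come from `softfloat.h', the bit values chosen here are illustrative
only, and `sketch_raise' merely sets flags, whereas `float_raise' may also
trap.

```c
/* Illustrative bit masks; the actual values are defined by softfloat.h. */
enum {
    sketch_flag_inexact   = 1,
    sketch_flag_underflow = 2,
    sketch_flag_overflow  = 4,
    sketch_flag_divbyzero = 8,
    sketch_flag_invalid   = 16
};

/* Stand-in for the global flags variable; all 0 means no exceptions. */
static int sketch_exception_flags = 0;

/* Set every flag in `mask' (flags-only stand-in for float_raise). */
static void sketch_raise(int mask)
{
    sketch_exception_flags |= mask;
}

/* Return 1 if any flag in `mask' is currently set, 0 otherwise. */
static int sketch_test(int mask)
{
    return (sketch_exception_flags & mask) != 0;
}

/* Clear the flags in `mask' and leave all other flags untouched. */
static void sketch_clear(int mask)
{
    sketch_exception_flags &= ~mask;
}
```

A typical pattern is to clear the flags of interest, perform a computation,
and then test which flags were raised.
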
In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding.  The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.  The other option is provided for compatibility
with some systems.  Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.  The complete set of conversion functions is:

   int32_to_float32      int64_to_float32
   int32_to_float64      int64_to_float64
   int32_to_floatx80     int64_to_floatx80
   int32_to_float128     int64_to_float128

   float32_to_int32      float32_to_int64
   float64_to_int32      float64_to_int64
   floatx80_to_int32     floatx80_to_int64
   float128_to_int32     float128_to_int64

   float32_to_float64    float32_to_floatx80   float32_to_float128
   float64_to_float32    float64_to_floatx80   float64_to_float128
   floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
   float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result.  Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding.  Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

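The exactness claim for integer conversions can be checked with the host's
native types, assuming they are IEC/IEEE single and double precision.  The
helper names below are hypothetical and the sketch does not call SoftFloat.

```c
#include <stdint.h>

/* Return 1 if `v' is unchanged by a round trip through double. */
static int exact_in_double(int32_t v)
{
    return (int32_t)(double)v == v;
}

/*
 * Return 1 if `v' is unchanged by conversion to float.  The comparison is
 * done in double so that a float rounded up to 2^31 is never converted
 * back to a 32-bit integer, which would be out of range.
 */
static int exact_in_float(int32_t v)
{
    float f = (float)v;
    return (double)f == (double)v;
}
```

Every 32-bit integer fits in the 53-bit significand of double precision, so
`exact_in_double' always returns 1.  Single precision has only 24
significand bits, so 16777217 (2^24 + 1) is the smallest positive integer
for which `exact_in_float' returns 0.
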
Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
positive integer is returned.  Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.  Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

   float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
   float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
   floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
   float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.

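The NaN and overflow rules for conversions to integer, combined with
rounding toward zero, can be sketched on a native `double'.  The function
`to_int32_toward_zero' is hypothetical and is not a SoftFloat routine; it
reports the invalid condition through an output parameter rather than
through `float_exception_flags', and it does not model the inexact flag.

```c
#include <stdint.h>

/*
 * Convert `x' to a 32-bit integer, truncating toward zero.  *invalid is
 * set to 1 when `x' is a NaN or the truncated value does not fit.
 */
static int32_t to_int32_toward_zero(double x, int *invalid)
{
    *invalid = 0;
    if (x != x) {                   /* only a NaN compares unequal to itself */
        *invalid = 1;
        return INT32_MAX;           /* NaN: largest positive integer */
    }
    if (x >= 2147483648.0) {        /* truncates to more than INT32_MAX */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {       /* truncates to less than INT32_MIN */
        *invalid = 1;
        return INT32_MIN;           /* same sign as the operand */
    }
    return (int32_t)x;              /* in range: C truncates toward zero */
}
```

Note that -2147483648.5 is still in range, because truncation toward zero
yields exactly the most negative 32-bit integer.
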
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

   float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
   float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
   float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard.  The remainder functions are:

   float32_rem
   float64_rem
   floatx80_rem
   float128_rem

Each remainder function takes two operands.  The operands and result are all
of the same type.  Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.  The
remainder functions are always exact and so require no rounding.

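The definition x - n*y can be illustrated with integer operands, where
every step is exact.  This is only a sketch of the rule for choosing n, not
the algorithm SoftFloat uses on floating-point values, and the function
name `ieee_style_rem' is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return x - n*y, where n is the integer nearest to x/y and ties go to the
 * even n.  Requires y != 0 and operands whose magnitudes can be negated
 * without overflow.
 */
static int64_t ieee_style_rem(int64_t x, int64_t y)
{
    assert(y != 0);
    int64_t q  = x / y;                 /* quotient truncated toward zero */
    int64_t r  = x % y;                 /* sign of x, magnitude below |y| */
    int64_t ar = r < 0 ? -r : r;
    int64_t ay = y < 0 ? -y : y;

    /*
     * Step n one further from zero when that multiple of y is closer to
     * x, or equally close while the truncated quotient is odd.
     */
    if (ar > ay - ar || (ar == ay - ar && q % 2 != 0)) {
        r = (r < 0) ? r + ay : r - ay;
    }
    return r;
}
```

For x = 5 and y = 2 the quotient 2.5 is a tie, so n = 2 and the result
is 1.  For x = 7 and y = 2 the tie at 3.5 resolves to n = 4 and the result
is -1, which shows that the remainder can be negative even when both
operands are positive.
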
Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.  This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard.  The functions are:

   float32_round_to_int
   float64_round_to_int
   floatx80_round_to_int
   float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type.  (Note that the result is not an integer type.)  The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

   float32_eq    float32_le    float32_lt
   float64_eq    float64_le    float64_lt
   floatx80_eq   floatx80_le   floatx80_lt
   float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_.  The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided.  The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.

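The three derived comparisons can be written as thin wrappers.  In the
sketch below, `sketch_eq', `sketch_le', and `sketch_lt' are hypothetical
stand-ins built on the native `float' type so that the example is
self-contained; with SoftFloat, the corresponding `float32_eq',
`float32_le', and `float32_lt' functions would take their place.

```c
/* Stand-ins for the three provided comparisons, using native float. */
static int sketch_eq(float a, float b) { return a == b; }
static int sketch_le(float a, float b) { return a <= b; }
static int sketch_lt(float a, float b) { return a <  b; }

/* Not-equal is the logical complement of equal. */
static int sketch_ne(float a, float b) { return !sketch_eq(a, b); }

/* Greater-than-or-equal is less-than-or-equal with operands reversed. */
static int sketch_ge(float a, float b) { return sketch_le(b, a); }

/* Greater-than is less-than with operands reversed. */
static int sketch_gt(float a, float b) { return sketch_lt(b, a); }
```

With a NaN operand, `sketch_ne' returns 1 while `sketch_ge' and `sketch_gt'
return 0, as is usual for unordered operands.
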
The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs.  For completeness, SoftFloat provides the following
additional functions:

   float32_eq_signaling    float32_le_quiet    float32_lt_quiet
   float64_eq_signaling    float64_le_quiet    float64_lt_quiet
   floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
   float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input.  Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

   float32_is_signaling_nan
   float64_is_signaling_nan
   floatx80_is_signaling_nan
   float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.

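For single precision, such a test can be sketched on the raw 32-bit
pattern.  The sketch assumes the common convention in which a NaN whose
most significant fraction bit is clear is signaling; which NaNs count as
signaling depends on the target, so this is an assumption rather than a
statement about every SoftFloat configuration.  The function name
`sketch_is_signaling_nan32' is hypothetical.

```c
#include <stdint.h>

/*
 * Return 1 if the single-precision bit pattern `a' is a signaling NaN
 * under the assumed convention, 0 otherwise.
 */
static int sketch_is_signaling_nan32(uint32_t a)
{
    /* Bits 30..22 are the exponent followed by the top fraction bit. */
    int exp_ones_quiet_clear = ((a >> 22) & 0x1FF) == 0x1FE;
    /* The remaining 22 fraction bits must not all be zero (not infinity). */
    int low_fraction_nonzero = (a & UINT32_C(0x003FFFFF)) != 0;
    return exp_ones_quiet_clear && low_fraction_nonzero;
}
```

Under this convention the pattern 0x7FA00000 is a signaling NaN, while
0x7FC00000 (a quiet NaN) and 0x7F800000 (infinity) are not.
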
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise.  No
result is returned.  In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.
