Half-precision floating-point in Java

Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers, or convert them to and from double precision?

Either of these approaches would be suitable:

- Keep the numbers in half-precision format and compute using integer arithmetic and bit-twiddling (as MicroFloat does for single and double precision).
- Perform all computation in single or double precision, converting to/from half precision for transmission (in which case what I need is well-tested conversion functions).

Edit: the conversion needs to be 100% accurate. There are lots of NaNs, infinities and subnormals in the input files.

Related question, but for JavaScript: Decompressing Half Precision Floats in Javascript

Related: here is Python code that converts a Python float to the IEEE 754-2008 (binary16) format. It supports infinities, subnormals and plus/minus zeros, but all NaNs turn into a single example NaN, and I am not sure I understand the rounding behaviour.

Source author: finnw | 2011-05-28

51

You can use Float.intBitsToFloat() and Float.floatToIntBits() to convert to and from primitive float values. If you can live with truncated precision (as opposed to rounding), the conversion should be possible to implement with just a few bit shifts.
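
As a rough illustration of the "few bit shifts" idea, here is a minimal sketch of a truncating float-to-half conversion. The class and method names are made up for this example. It flushes everything below the half-precision normal range to zero and collapses all NaNs into a single quiet NaN, so it does not meet the 100% accuracy requirement from the question.

```java
// Hypothetical sketch: float -> half by truncation (no rounding, no subnormals).
final class TruncatingHalf {
    static int fromFloatTruncated(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = (bits >>> 16) & 0x8000;        // sign moves from bit 31 to bit 15
        int fexp = (bits >>> 23) & 0xff;          // biased 8-bit float exponent
        int mant = (bits >>> 13) & 0x3ff;         // top 10 mantissa bits, the rest is cut off
        if (fexp == 0xff)                         // Inf stays Inf, every NaN becomes 0x7e00
            return sign | 0x7c00 | ((bits & 0x7fffff) != 0 ? 0x200 : 0);
        int hexp = fexp - 127 + 15;               // re-bias for the 5-bit half exponent
        if (hexp <= 0)  return sign;              // below the normal range: flush to +/-0
        if (hexp >= 31) return sign | 0x7c00;     // too large: overflow to +/-Inf
        return sign | hexp << 10 | mant;
    }

    public static void main(String[] args) {
        System.out.printf("%04x%n", fromFloatTruncated(1.0f));
        System.out.printf("%04x%n", fromFloatTruncated(1.2f));
    }
}
```

For 1.2f this sketch yields 0x3ccc, whereas a rounding conversion yields 0x3ccd, which is the kind of difference truncation introduces.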

I have now put a bit more effort into it, and it turned out to be not quite as simple as I had expected at the beginning. This version is now tested and verified in every aspect I could think of, and I am very confident that it produces the exact results for all possible input values. It supports exact rounding and subnormal conversion in both directions.
```
//ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            //10 bits mantissa
    int exp =  hbits & 0x7c00;            //5 bits exponent
    if( exp == 0x7c00 )                   //NaN/Inf
        exp = 0x3fc00;                    //-> NaN/Inf
    else if( exp != 0 )                   //normalized value
    {
        exp += 0x1c000;                   //exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  //smooth transition
            return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff );
    }
    else if( mant != 0 )                  //&& exp==0 -> subnormal
    {
        exp = 0x1c400;                    //make it normal
        do {
            mant <<= 1;                   //mantissa * 2
            exp -= 0x400;                 //decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); //while not normal
        mant &= 0x3ff;                    //discard subnormal bit
    }                                     //else +/-0 -> +/-0
    return Float.intBitsToFloat(          //combine all parts
        ( hbits & 0x8000 ) << 16          //sign  << ( 31 - 15 )
        | ( exp | mant ) << 13 );         //value << ( 23 - 10 )
}
```
```
//returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;          //sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; //rounded value

    if( val >= 0x47800000 )               //might be or become NaN/Inf
    {                                     //avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                 //is or must become NaN/Inf
            if( val < 0x7f800000 )        //was value but too large
                return sign | 0x7c00;     //make it +/-Inf
            return sign | 0x7c00 |        //remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; //keep NaN (and Inf) bits
        }
        return sign | 0x7bff;             //unrounded not quite Inf
    }
    if( val >= 0x38800000 )               //remains normalized value
        return sign | val - 0x38000000 >>> 13; //exp - 127 + 15
    if( val < 0x33000000 )                //too small for subnormal
        return sign;                      //becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;  //tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) //add subnormal bit
         + ( 0x800000 >>> val - 102 )     //round depending on cut off
      >>> 126 - val );   //div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}
```
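The round-trip behaviour can be spot-checked with a small harness. The sketch below copies the two functions above unchanged (comments removed) and runs every 16-bit pattern through toFloat() and back. The class name and countMismatches() are assumed helper names, not part of the answer. NaN inputs are only required to come back as some NaN, because Float.floatToIntBits() collapses all NaNs into one canonical value.

```java
// Sketch: exhaustive half -> float -> half round trip over all 65536 bit patterns.
final class HalfRoundTrip {
    static float toFloat(int hbits) {
        int mant = hbits & 0x03ff;
        int exp = hbits & 0x7c00;
        if (exp == 0x7c00) {
            exp = 0x3fc00;
        } else if (exp != 0) {
            exp += 0x1c000;
            if (mant == 0 && exp > 0x1c400)
                return Float.intBitsToFloat((hbits & 0x8000) << 16 | exp << 13 | 0x3ff);
        } else if (mant != 0) {
            exp = 0x1c400;
            do {
                mant <<= 1;
                exp -= 0x400;
            } while ((mant & 0x400) == 0);
            mant &= 0x3ff;
        }
        return Float.intBitsToFloat((hbits & 0x8000) << 16 | (exp | mant) << 13);
    }

    static int fromFloat(float fval) {
        int fbits = Float.floatToIntBits(fval);
        int sign = fbits >>> 16 & 0x8000;
        int val = (fbits & 0x7fffffff) + 0x1000;
        if (val >= 0x47800000) {
            if ((fbits & 0x7fffffff) >= 0x47800000) {
                if (val < 0x7f800000) return sign | 0x7c00;
                return sign | 0x7c00 | (fbits & 0x007fffff) >>> 13;
            }
            return sign | 0x7bff;
        }
        if (val >= 0x38800000) return sign | val - 0x38000000 >>> 13;
        if (val < 0x33000000) return sign;
        val = (fbits & 0x7fffffff) >>> 23;
        return sign | ((fbits & 0x7fffff | 0x800000) + (0x800000 >>> val - 102) >>> 126 - val);
    }

    // Counts half patterns that do not survive half -> float -> half.
    static int countMismatches() {
        int mismatches = 0;
        for (int h = 0; h <= 0xffff; h++) {
            float f = toFloat(h);
            int back = fromFloat(f);
            boolean nanIn = (h & 0x7c00) == 0x7c00 && (h & 0x03ff) != 0;
            if (nanIn) {
                boolean nanOut = (back & 0x7c00) == 0x7c00 && (back & 0x03ff) != 0;
                if (!Float.isNaN(f) || !nanOut) mismatches++;
            } else if (back != h) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```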
I implemented two small extensions compared to the book, because the general precision of 16-bit floats is rather low. This can make the inherent anomalies of floating-point formats visually perceptible, whereas with larger floating-point types they usually go unnoticed thanks to the ample precision.

The first one is these two lines in the toFloat() function:
```
if( mant == 0 && exp > 0x1c400 )  //smooth transition
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff );
```
Floating-point numbers in the normal range of a type adapt the exponent, and thus the precision, to the magnitude of the value. But this is not a smooth adaptation; it happens in steps: switching to the next higher exponent halves the precision. The precision then stays the same for all mantissa values until the next jump to the next higher exponent. The extension code above makes these transitions smoother by returning a value that lies in the middle of the 32-bit float range covered by this particular half-float value. Every normal half-float value maps to exactly 8192 32-bit float values, and the returned value is supposed to be exactly in the middle of them. But at a transition of the half-float exponent, the lower 4096 values have twice the precision of the upper 4096 values and thus cover a number space only half as large as the one on the other side. All of these 8192 32-bit float values map to the same half-float value, so converting a half float to 32 bits and back yields the same half-float value regardless of which of the 8192 intermediate 32-bit values was chosen. The extension results in something like a smoother half step by a factor of sqrt(2) at the transition, as shown in the right-hand diagram below, whereas the left-hand diagram visualizes the sharp step by a factor of two without anti-aliasing. You can safely remove these two lines from the code to get the standard behaviour.
```
covered number space on either side of the returned value:
       6.0E-8             #######                  ##########
       4.5E-8             |                       #
       3.0E-8     #########               ########
```
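To make the 8192-value bucket concrete, the following sketch computes the lowest and highest 32-bit patterns that round to a given positive normal half value and the midpoint of that interval in value space. It uses only the normal-range rounding path of fromFloat() above; the class and method names are assumptions made for this illustration.

```java
// Sketch: the bucket of 32-bit floats that round to one positive normal half value.
final class HalfBucket {
    // Normal-range rounding path of fromFloat() (no Inf/NaN/subnormal handling).
    static int toHalfNormal(int floatBits) {
        return ((floatBits & 0x7fffffff) + 0x1000 - 0x38000000) >>> 13;
    }

    // Lowest 32-bit pattern that rounds to the half value h.
    static int lowestBits(int h) {
        return (h << 13) + 0x38000000 - 0x1000;
    }

    // Highest 32-bit pattern that rounds to h (8192 patterns in total).
    static int highestBits(int h) {
        return lowestBits(h) + 0x1fff;
    }

    // Midpoint of the covered interval, measured in value space.
    static float centre(int h) {
        double lo = Float.intBitsToFloat(lowestBits(h));
        double hi = Float.intBitsToFloat(highestBits(h));
        return (float) ((lo + hi) / 2);
    }

    public static void main(String[] args) {
        int one = 0x3c00; // half-precision 1.0, a value at an exponent transition
        System.out.println("lowest  = " + Float.intBitsToFloat(lowestBits(one)));
        System.out.println("highest = " + Float.intBitsToFloat(highestBits(one)));
        System.out.printf("centre bits = %08x%n", Float.floatToIntBits(centre(one)));
    }
}
```

For half-precision 1.0 the midpoint lands within one unit in the last place of 0x3f8003ff, the pattern the two extension lines return, which is why the decoded value sits slightly above 1.0.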
The second extension is in the fromFloat() function:
```
    {                                     //avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
...
        return sign | 0x7bff;             //unrounded not quite Inf
    }
```
This extension slightly extends the number range of the half-float format by saving some 32-bit values from being promoted to Infinity. The affected values are those that would have been smaller than Infinity without rounding and would become Infinity only because of the rounding. You can safely remove the lines shown above if you don't want this extension.
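
The effect is easiest to see just above the largest finite half value, 65504 (0x7bff). The sketch below copies fromFloat() from above and feeds it a few values around that boundary. The class name is an assumption. Under plain round-to-nearest-even, 65520 would already round to Infinity.

```java
// Sketch: behaviour of fromFloat() around the largest finite half value.
final class HalfOverflow {
    static int fromFloat(float fval) {
        int fbits = Float.floatToIntBits(fval);
        int sign = fbits >>> 16 & 0x8000;
        int val = (fbits & 0x7fffffff) + 0x1000;
        if (val >= 0x47800000) {
            if ((fbits & 0x7fffffff) >= 0x47800000) {
                if (val < 0x7f800000) return sign | 0x7c00;
                return sign | 0x7c00 | (fbits & 0x007fffff) >>> 13;
            }
            return sign | 0x7bff;             // the extension: stay finite
        }
        if (val >= 0x38800000) return sign | val - 0x38000000 >>> 13;
        if (val < 0x33000000) return sign;
        val = (fbits & 0x7fffffff) >>> 23;
        return sign | ((fbits & 0x7fffff | 0x800000) + (0x800000 >>> val - 102) >>> 126 - val);
    }

    public static void main(String[] args) {
        float[] samples = { 65504f, 65520f, Float.intBitsToFloat(0x477fffff), 65536f };
        for (float f : samples)
            System.out.printf("%s -> %04x%n", f, fromFloat(f));
    }
}
```

With the extension in place, every float below 65536 stays at 0x7bff and only values from 65536 upward become 0x7c00.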

I tried to optimize the path for normal values in the fromFloat() function as much as possible, which made it a bit less readable because of the precomputed and unshifted constants. I did not put as much effort into toFloat(), since it would not exceed the performance of a lookup table anyway. So if speed really matters, you could use the toFloat() function only to fill a static lookup table with 0x10000 elements and then use this table for the actual conversion. This is about 3 times faster with a current x64 server VM and about 5 times faster with the x86 client VM.
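
A minimal sketch of the lookup-table idea follows. The class name and the split into decode() and toFloat() are assumptions for this example, and decode() is the toFloat() logic from above without the two "smooth transition" lines, so it returns the exact values.

```java
// Sketch: decode each of the 0x10000 half patterns once, then convert by table lookup.
final class HalfLookup {
    private static final float[] TABLE = new float[0x10000];

    static {
        for (int h = 0; h < TABLE.length; h++)
            TABLE[h] = decode(h);
    }

    // Fast path: a single array access per conversion.
    static float toFloat(short half) {
        return TABLE[half & 0xffff];          // mask undoes the sign extension of short
    }

    // Slow path, used only to fill the table.
    private static float decode(int hbits) {
        int mant = hbits & 0x03ff;
        int exp = hbits & 0x7c00;
        if (exp == 0x7c00) {
            exp = 0x3fc00;                    // NaN/Inf
        } else if (exp != 0) {
            exp += 0x1c000;                   // normal: re-bias the exponent
        } else if (mant != 0) {               // subnormal: normalize
            exp = 0x1c400;
            do {
                mant <<= 1;
                exp -= 0x400;
            } while ((mant & 0x400) == 0);
            mant &= 0x3ff;
        }
        return Float.intBitsToFloat((hbits & 0x8000) << 16 | (exp | mant) << 13);
    }

    public static void main(String[] args) {
        System.out.println(toFloat((short) 0x3c00));
        System.out.println(toFloat((short) 0xc000));
    }
}
```

The table occupies 256 KiB (65536 floats), which is the price for the speed-up mentioned above.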

I hereby put the code into the public domain.
- Round-to-nearest in fromFloat (as opposed to truncation) is not too hard to add. The decision to round up or down depends on the mantissa bits that are discarded: 0???????????? -> round down, 100000000000 -> round to even, otherwise round up. EDIT: it IS hard to add, I forgot the special cases NaN and Inf. Probably not worth it.
- I made a few corrections after some tests. Yes, subnormals are not yet handled correctly from half to float. In the other direction they should be fine and get converted to 0.
- Subnormals should now be handled correctly in both directions.
- Certain weird NaN values can cause the fromFloat code to fail, because the rounding of val overflows and the result is converted to zero. You can fix this without losing speed by subtracting 0x1000 from every constant that you compare with or subtract from val, but I am not sure it is worth it. Still, nice solution!
- I see what you mean, but these NaN values are not returned by Float.floatToIntBits, which normalizes all NaNs to 0x7fc00000. The rounded val can therefore never become negative. It might be faster to use floatToRawIntBits (which does not do NaN normalization) and then deal with the overflowing NaNs, e.g. by adding || val < 0 to the first branch. This would also make it possible to preserve some of the extra NaN bits. I remember that I had planned to do this, but I could not find sufficient documentation on how to deal with these bits and so left it with normalized NaNs.
- Someone needs half-precision floats in Spire: github.com/non/spire/issues/501 . Would you mind if we used this code?
- I would not mind. You can use the code in any way you like.
- I am copying buttonius's comment, misposted as an answer: "The code by x4u encodes the value 1 correctly as 0x3c00 (ref: en.wikipedia.org/wiki/Half-precision_floating-point_format). But the decoder with the smoothness enhancements decodes that into 1.000122. The Wikipedia entry says that integer values 0..2048 can be represented exactly. Not nice... Removing the "| 0x3ff" from the toFloat code ensures that toFloat(fromFloat(k)) == k for integer k in the range -2048..2048, presumably at the cost of a bit less smoothness." Benoit made the same observation in a now-deleted answer.
- Why not take and return short? I realize short is a bit of a second-class citizen, but short[] and ShortBuffer are the natural and fast containers for halves.
- Can you give the values for the usual constants such as MAX_VALUE and POSITIVE_INFINITY?
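
The last two comments ask for short-typed values and the usual constants. The binary16 bit patterns are fixed by the format, so they can be written down directly. The following is a sketch with assumed names modelled on java.lang.Float, not part of the answer's code.

```java
// Sketch: half-precision counterparts of the java.lang.Float constants, stored as short.
final class HalfConstants {
    static final short POSITIVE_INFINITY = (short) 0x7c00;
    static final short NEGATIVE_INFINITY = (short) 0xfc00;
    static final short NaN               = (short) 0x7e00; // one quiet NaN among many
    static final short MAX_VALUE         = (short) 0x7bff; // 65504
    static final short MIN_NORMAL        = (short) 0x0400; // 2^-14
    static final short MIN_VALUE         = (short) 0x0001; // 2^-24, smallest subnormal

    // NaN: exponent field all ones and a non-zero mantissa.
    static boolean isNaN(short h) {
        return (h & 0x7fff) > 0x7c00;
    }

    // Infinity: exponent field all ones and a zero mantissa.
    static boolean isInfinite(short h) {
        return (h & 0x7fff) == 0x7c00;
    }

    public static void main(String[] args) {
        System.out.printf("MAX_VALUE = 0x%04x%n", MAX_VALUE & 0xffff);
        System.out.printf("NEGATIVE_INFINITY = 0x%04x%n", NEGATIVE_INFINITY & 0xffff);
    }
}
```

Masking with 0xffff before passing such a short to a function like toFloat(int) is not strictly needed there, since toFloat() ignores the upper 16 bits, but it avoids surprises when printing.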
Source author: x4u
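
The NaN overflow discussed in the comments above comes down to two facts that are easy to check: Float.floatToIntBits() returns 0x7fc00000 for every NaN, and adding 0x1000 to a raw NaN pattern of 0x7ffff000 or above wraps around to a negative int. The sketch below illustrates both; the class and method names are assumptions, and isLargeOrNaN() is only the guard suggested in the comments, not tested production code.

```java
// Sketch: why canonical NaNs are safe in fromFloat() and raw NaN payloads are not.
final class NanOverflow {
    // The rounding step of fromFloat(): add half of the 13-bit cut-off range.
    static int rounded(int fbits) {
        return (fbits & 0x7fffffff) + 0x1000;
    }

    // First branch test of fromFloat() with the suggested "|| val < 0" guard.
    static boolean isLargeOrNaN(int val) {
        return val >= 0x47800000 || val < 0;
    }

    public static void main(String[] args) {
        float weird = Float.intBitsToFloat(0x7fffffff);  // NaN with all payload bits set
        int canonical = Float.floatToIntBits(weird);     // always 0x7fc00000
        int raw = Float.floatToRawIntBits(weird);        // payload may survive, platform permitting
        System.out.printf("canonical=%08x raw=%08x%n", canonical, raw);
        System.out.println("canonical overflows: " + (rounded(canonical) < 0));
        System.out.println("0x7fffffff overflows: " + (rounded(0x7fffffff) < 0));
    }
}
```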
1

The code by x4u encodes the value 1 correctly as 0x3c00 (ref: https://en.wikipedia.org/wiki/Half-precision_floating-point_format). But the decoder with the smoothness enhancements decodes that into 1.000122. The Wikipedia entry says that integer values 0..2048 can be represented exactly. Not nice...
Removing the "| 0x3ff" from the toFloat code ensures that toFloat(fromFloat(k)) == k for integer k in the range -2048..2048, presumably at the cost of a bit less smoothness.
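
This observation can be checked mechanically. The sketch below copies x4u's two functions and adds a hypothetical boolean parameter, smooth, that switches the "| 0x3ff" branch on or off; it then counts the integers in -2048..2048 that do not survive the round trip. The class name and countInexact() are likewise assumptions.

```java
// Sketch: does toFloat(fromFloat(k)) == k hold for integers, with and without "| 0x3ff"?
final class HalfIntegers {
    // x4u's toFloat(); 'smooth' is a switch added only for this comparison.
    static float toFloat(int hbits, boolean smooth) {
        int mant = hbits & 0x03ff;
        int exp = hbits & 0x7c00;
        if (exp == 0x7c00) {
            exp = 0x3fc00;
        } else if (exp != 0) {
            exp += 0x1c000;
            if (smooth && mant == 0 && exp > 0x1c400)
                return Float.intBitsToFloat((hbits & 0x8000) << 16 | exp << 13 | 0x3ff);
        } else if (mant != 0) {
            exp = 0x1c400;
            do {
                mant <<= 1;
                exp -= 0x400;
            } while ((mant & 0x400) == 0);
            mant &= 0x3ff;
        }
        return Float.intBitsToFloat((hbits & 0x8000) << 16 | (exp | mant) << 13);
    }

    static int fromFloat(float fval) {
        int fbits = Float.floatToIntBits(fval);
        int sign = fbits >>> 16 & 0x8000;
        int val = (fbits & 0x7fffffff) + 0x1000;
        if (val >= 0x47800000) {
            if ((fbits & 0x7fffffff) >= 0x47800000) {
                if (val < 0x7f800000) return sign | 0x7c00;
                return sign | 0x7c00 | (fbits & 0x007fffff) >>> 13;
            }
            return sign | 0x7bff;
        }
        if (val >= 0x38800000) return sign | val - 0x38000000 >>> 13;
        if (val < 0x33000000) return sign;
        val = (fbits & 0x7fffffff) >>> 23;
        return sign | ((fbits & 0x7fffff | 0x800000) + (0x800000 >>> val - 102) >>> 126 - val);
    }

    // Number of integers k in -2048..2048 with toFloat(fromFloat(k)) != k.
    static int countInexact(boolean smooth) {
        int inexact = 0;
        for (int k = -2048; k <= 2048; k++)
            if (toFloat(fromFloat(k), smooth) != k) inexact++;
        return inexact;
    }

    public static void main(String[] args) {
        System.out.println("with smoothing:    " + countInexact(true));
        System.out.println("without smoothing: " + countInexact(false));
    }
}
```

Only values with a zero mantissa (powers of two) take the smoothing branch, so those are the integers that come back inexact when it is enabled.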
- I have observed the same. Could you elaborate on what you mean by losing "smoothness" with this change?
- My interpretation is that "smoothness" means that the discretization steps are more evenly distributed over the domain and/or that the rounding errors are smaller or more "neutral". For me, whatever benefit "smoothness" offers was not worth the loss of accuracy when converting (small) integer values.
- Thanks a lot for the explanation. I too prefer accuracy, and removing the | 0x3ff did the job nicely!
- This should be a comment!
Source author: buttonius

Before I saw the solution posted here, I had whipped up something simple:

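    public static float toFloat(int nHalf)
    {
        int S = (nHalf >>> 15) & 0x1;   // sign
        int E = (nHalf >>> 10) & 0x1F;  // biased exponent
        int T = (nHalf       ) & 0x3FF; // trailing significand

        if (E == 0x1F) {
            E = 0xFF;                   // it's 2^w-1; all 1's, so keep it all 1's for the 32-bit float
        } else if (E != 0) {
            E = E - 15 + 127;           // adjust the exponent from the 16-bit bias to the 32-bit bias
        } else if (T != 0) {            // subnormal: normalize it for the 32-bit float
            E = 1 - 15 + 127;
            do {
                T <<= 1;
                E--;
            } while ((T & 0x400) == 0);
            T &= 0x3FF;                 // drop the now-explicit leading bit
        }                               // else +/-0: E and T both stay 0

        // sign S is now bit 31
        // exp E is from bit 30 to bit 23
        // scale T by 13 binary digits (it grew from 10 to 23 bits)
        return Float.intBitsToFloat(S << 31 | E << 23 | T << 13);
    }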

I do like the approach in the other posted solution, though. For reference:

    //notes from the IEEE-754 specification:

    //left to right bits of a binary floating point number:
    //size        bit ids       name  description
    //----------  ------------  ----  ---------------------------
    //1 bit                       S   sign
    //w bits      E[0]..E[w-1]    E   biased exponent
    //t=p-1 bits  d[1]..d[p-1]    T   trailing significand field

    //The range of the encoding’s biased exponent E shall include:
    //― every integer between 1 and 2^w − 2, inclusive, to encode normal numbers
    //― the reserved value 0 to encode ±0 and subnormal numbers
    //― the reserved value 2^w − 1 to encode +/-infinity and NaN

    //The representation r of the floating-point datum, and value v of the floating-point datum
    //represented, are inferred from the constituent fields as follows:
    //a) If E == 2^w−1 and T != 0, then r is qNaN or sNaN and v is NaN regardless of S
    //b) If E == 2^w−1 and T == 0, then r=v=(−1)^S * (+infinity)
    //c) If 1 <= E <= 2^w−2, then r is (S, (E−bias), (1 + 2^(1−p) * T))
    //   the value of the corresponding floating-point number is
    //       v = (−1)^S * 2^(E−bias) * (1 + 2^(1−p) * T)
    //   thus normal numbers have an implicit leading significand bit of 1
    //d) If E == 0 and T != 0, then r is (S, emin, (0 + 2^(1−p) * T))
    //   the value of the corresponding floating-point number is
    //       v = (−1)^S * 2^emin * (0 + 2^(1−p) * T)
    //   thus subnormal numbers have an implicit leading significand bit of 0
    //e) If E == 0 and T ==0, then r is (S, emin, 0) and v = (−1)^S * (+0)

    //parameter                                      bin16  bin32
    //--------------------------------------------   -----  -----
    //k, storage width in bits                         16     32
    //p, precision in bits                             11     24
    //emax, maximum exponent e                         15    127
    //bias, E-e                                        15    127
    //sign bit                                          1      1
    //w, exponent field width in bits                   5      8
    //t, trailing significand field width in bits      10     23
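
Cases a) to e) above translate almost literally into code. The sketch below evaluates them for binary16 (w = 5, p = 11, bias = 15, emin = -14) and returns the result as a double, which can hold every half-precision value exactly. The class and method names are assumptions for this illustration.

```java
// Sketch: decode a binary16 pattern by following cases a) to e) of the notes above.
final class HalfFields {
    static double toDouble(int half) {
        int s = (half >>> 15) & 0x1;     // S, sign
        int e = (half >>> 10) & 0x1f;    // E, biased exponent (w = 5 bits)
        int t = half & 0x3ff;            // T, trailing significand (t = 10 bits)
        double sign = (s == 0) ? 1.0 : -1.0;

        if (e == 0x1f)                   // a) NaN, b) +/-infinity
            return (t != 0) ? Double.NaN : sign * Double.POSITIVE_INFINITY;
        if (e == 0)                      // d) subnormal, e) +/-0: 2^emin * 2^(1-p) * T
            return sign * Math.scalb((double) t, -14 - 10);
        // c) normal: 2^(E-bias) * (1 + 2^(1-p) * T)
        return sign * Math.scalb(1.0 + t / 1024.0, e - 15);
    }

    public static void main(String[] args) {
        System.out.println(toDouble(0x3c00));
        System.out.println(toDouble(0x7bff));
        System.out.println(toDouble(0x0001));
    }
}
```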

Source author: cpurdy

I created a Java class called HalfPrecisionFloat that uses x4u's solution. The class provides convenience methods and error checking. It goes further and has methods for returning a Double and a Float from the 2-byte half-precision value.

Hopefully this helps someone.

==>

import java.nio.ByteBuffer;

/**
 * Accepts various forms of a floating point half-precision (2 byte) number 
 * and contains methods to convert to a
 * full-precision floating point number Float and Double instance.
 * <p>
 * This implementation was inspired by x4u who is a user contributing 
 * to stackoverflow.com.
 * (https://stackoverflow.com/users/237321/x4u).
 *
 * @author dougestep
 */
public class HalfPrecisionFloat {
    private short halfPrecision;
    private Float fullPrecision;

    /**
     * Creates an instance of the class from the supplied 
     * byte array.  The byte array must be exactly two bytes in length.
     *
     * @param bytes the two-byte byte array.
     */
    public HalfPrecisionFloat(byte[] bytes) {
        if (bytes.length != 2) {
            throw new IllegalArgumentException("The supplied byte array " +
              "must be exactly two bytes in length");
        }

        final ByteBuffer buffer = ByteBuffer.wrap(bytes);
        this.halfPrecision = buffer.getShort();
    }

    /**
     * Creates an instance of this class from the supplied short number.
     *
     * @param number the number defined as a short.
     */
    public HalfPrecisionFloat(final short number) {
        this.halfPrecision = number;
        this.fullPrecision = toFullPrecision();
    }

    /**
     * Creates an instance of this class from the supplied 
     * full-precision floating point number.
     *
     * @param number the float number.
     */
    public HalfPrecisionFloat(final float number) {
        if (number > Short.MAX_VALUE) {
            throw new IllegalArgumentException("The supplied float is too "
              + "large for a two byte representation");
        }
        if (number < Short.MIN_VALUE) {
            throw new IllegalArgumentException("The supplied float is too "
              + "small for a two byte representation");
        }

        final int val = fromFullPrecision(number);
        this.halfPrecision = (short) val;
        this.fullPrecision = number;
    }

    /**
     * Returns the half-precision float as a number defined as a short.
     *
     * @return the short.
     */
    public short getHalfPrecisionAsShort() {
        return halfPrecision;
    }

    /**
     * Returns a full-precision floating point number from the 
     * half-precision value assigned on this instance.
     *
     * @return the full-precision floating point number.
     */
    public float getFullFloat() {
        if (fullPrecision == null) {
            fullPrecision = toFullPrecision();
        }
        return fullPrecision;
    }

    /**
     * Returns a full-precision double floating point number from the 
     * half-precision value assigned on this instance.
     *
     * @return the full-precision double floating point number.
     */
    public double getFullDouble() {
        return getFullFloat(); // the float widens to double without boxing
    }

    /**
     * Returns the full-precision float number from the half-precision 
     * value assigned on this instance.
     *
     * @return the full-precision floating point number.
     */
    private float toFullPrecision() {
        int mantisa = halfPrecision & 0x03ff;
        int exponent = halfPrecision & 0x7c00;

        if (exponent == 0x7c00) {
            exponent = 0x3fc00;
        } else if (exponent != 0) {
            exponent += 0x1c000;
            if (mantisa == 0 && exponent > 0x1c400) {
                return Float.intBitsToFloat(
                  (halfPrecision & 0x8000) << 16 | exponent << 13 | 0x3ff);
            }
        } else if (mantisa != 0) {
            exponent = 0x1c400;
            do {
                mantisa <<= 1;
                exponent -= 0x400;
            } while ((mantisa & 0x400) == 0);
            mantisa &= 0x3ff;
        }

        return Float.intBitsToFloat(
         (halfPrecision & 0x8000) << 16 | (exponent | mantisa) << 13);
    }

    /**
     * Returns the integer representation of the supplied 
     * full-precision floating point number.
     *
     * @param number the full-precision floating point number.
     * @return the integer representation.
     */
    private int fromFullPrecision(final float number) {
        int fbits = Float.floatToIntBits(number);
        int sign = fbits >>> 16 & 0x8000;

        int val = (fbits & 0x7fffffff) + 0x1000;

        if (val >= 0x47800000) {
            if ((fbits & 0x7fffffff) >= 0x47800000) {
                if (val < 0x7f800000) {
                    return sign | 0x7c00;
                }
                return sign | 0x7c00 | (fbits & 0x007fffff) >>> 13;
            }
            return sign | 0x7bff;
        }
        if (val >= 0x38800000) {
            return sign | val - 0x38000000 >>> 13;
        }
        if (val < 0x33000000) {
            return sign;
        }
        val = (fbits & 0x7fffffff) >>> 23;
        return sign | ((fbits & 0x7fffff | 0x800000) 
         + (0x800000 >>> val - 102) >>> 126 - val);
    }
}
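
One detail of the byte-array constructor above is worth illustrating: ByteBuffer.wrap(bytes).getShort() reads the two bytes in big-endian order, which is ByteBuffer's default. If the input stores half-precision values in little-endian order, the byte order has to be set explicitly. The following sketch shows both cases; the class and the readHalf() helper are assumptions and not part of HalfPrecisionFloat.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: reading a 2-byte half-precision value with an explicit byte order.
final class HalfBytes {
    static short readHalf(byte[] bytes, ByteOrder order) {
        if (bytes.length != 2)
            throw new IllegalArgumentException("expected exactly two bytes");
        return ByteBuffer.wrap(bytes).order(order).getShort();
    }

    public static void main(String[] args) {
        byte[] bigEndian = { 0x3c, 0x00 };     // 0x3c00 is half-precision 1.0
        byte[] littleEndian = { 0x00, 0x3c };  // same value, bytes swapped
        System.out.printf("%04x%n", readHalf(bigEndian, ByteOrder.BIG_ENDIAN) & 0xffff);
        System.out.printf("%04x%n", readHalf(littleEndian, ByteOrder.LITTLE_ENDIAN) & 0xffff);
    }
}
```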

And here are the unit tests:

import org.junit.Assert;
import org.junit.Test;

import java.nio.ByteBuffer;

public class TestHalfPrecision {

  private byte[] simulateBytes(final float fullPrecision) {
    HalfPrecisionFloat halfFloat = new HalfPrecisionFloat(fullPrecision);
    short halfShort = halfFloat.getHalfPrecisionAsShort();

    ByteBuffer buffer = ByteBuffer.allocate(2);
    buffer.putShort(halfShort);
    return buffer.array();
  }

  @Test
  public void testHalfPrecisionToFloatApproach() {
    final float startingValue = 1.2f;
    final float closestValue = 1.2001953f;
    final short shortRepresentation = (short) 15565;

    byte[] bytes = simulateBytes(startingValue);
    HalfPrecisionFloat halfFloat = new HalfPrecisionFloat(bytes);
    final float retFloat = halfFloat.getFullFloat();
    Assert.assertEquals(new Float(closestValue), new Float(retFloat));

    HalfPrecisionFloat otherWay = new HalfPrecisionFloat(retFloat);
    final short shrtValue = otherWay.getHalfPrecisionAsShort();
    Assert.assertEquals(new Short(shortRepresentation), new Short(shrtValue));

    HalfPrecisionFloat backAgain = new HalfPrecisionFloat(shrtValue);
    final float backFlt = backAgain.getFullFloat();
    Assert.assertEquals(new Float(closestValue), new Float(backFlt));

    HalfPrecisionFloat dbl = new HalfPrecisionFloat(startingValue);
    final double retDbl = dbl.getFullDouble();
    Assert.assertEquals(new Double(startingValue), new Double(retDbl));
  }

  @Test(expected = IllegalArgumentException.class)
  public void testInvalidByteArray() {
    ByteBuffer buffer = ByteBuffer.allocate(4);
    buffer.putFloat(Float.MAX_VALUE);
    byte[] bytes = buffer.array();

    new HalfPrecisionFloat(bytes);
  }

  @Test(expected = IllegalArgumentException.class)
  public void testInvalidMaxFloat() {
    new HalfPrecisionFloat(Float.MAX_VALUE);
  }

  @Test(expected = IllegalArgumentException.class)
  public void testInvalidMinFloat() {
    new HalfPrecisionFloat(-35000);
  }

  @Test
  public void testCreateWithShort() {
    HalfPrecisionFloat sut = new HalfPrecisionFloat(Short.MAX_VALUE);
    Assert.assertEquals(Short.MAX_VALUE, sut.getHalfPrecisionAsShort());
  }
}

Source author: dgestep
