How do I implement the softmax derivative independently of any loss function?

For a neural network library, I implemented some activation functions and loss functions along with their derivatives. They can be combined arbitrarily, and the derivative at the output layer is simply the product of the loss derivative and the activation derivative.

However, I failed to implement the derivative of the softmax activation function independently of any loss function. Because of the normalization, i.e. the denominator in the equation, changing a single input activation changes all output activations, not just one.
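
To make this coupling concrete, here is a small sketch (the input values are arbitrary and chosen only for illustration). Nudging a single input changes every output, which is why an element-wise derivative cannot describe softmax.

```python
import numpy as np


def softmax(x):
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()


x = np.array([1.0, 2.0, 3.0])  # arbitrary example input
bumped = x.copy()
bumped[0] += 0.1  # change only the first input

# Difference between the two output vectors: the first output rises,
# all others fall, and the changes cancel because both outputs sum to 1.
change = softmax(bumped) - softmax(x)
print(change)
```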

Here is my softmax implementation, where the derivative fails the gradient check by about 1%. How can I implement the softmax derivative so that it can be combined with any loss function?

import numpy as np


class Softmax:

    def compute(self, incoming):
        exps = np.exp(incoming)
        return exps / exps.sum()

    def delta(self, incoming, outgoing):
        exps = np.exp(incoming)
        others = exps.sum() - exps
        return 1 / (2 + exps / others + others / exps)


activation = Softmax()
cost = SquaredError()

outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming) * cost.delta(outgoing)

Source author: danijar | 2015-11-05

Mathematically, the derivative of the softmax output σ_j with respect to the logit z_i (e.g. W_i·X) is

$$\frac{\partial \sigma_j}{\partial z_i} = \sigma_j \left(\delta_{ij} - \sigma_i\right)$$

where δ_ij is the Kronecker delta.
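
A quick numerical check of this Jacobian, whose entry [i, j] is σ_j(δ_ij − σ_i), can be sketched as follows. The helper names `softmax_jacobian` and `numeric_jacobian` are made up for this illustration; the check compares the closed form against central finite differences.

```python
import numpy as np


def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()


def softmax_jacobian(z):
    # Entry [i, j] is d sigma_j / d z_i = sigma_j * (delta_ij - sigma_i).
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)


def numeric_jacobian(z, eps=1e-6):
    # Central differences: row i holds the change of all outputs w.r.t. z_i.
    n = z.size
    jac = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        jac[i] = (softmax(z + step) - softmax(z - step)) / (2 * eps)
    return jac


z = np.array([1.0, 2.0])
analytic = softmax_jacobian(z)
numeric = numeric_jacobian(z)
max_error = np.abs(analytic - numeric).max()
print(max_error)
```

Because the matrix is symmetric, it does not matter for softmax whether rows index inputs or outputs.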

If you implement it iteratively:

def softmax_grad(s):
    # input s is the softmax value of the original input x. Its shape is (n,),
    # e.g. s ≈ np.array([0.27, 0.73]) for x = np.array([0, 1])

    # make the matrix whose size is n^2.
    jacobian_m = np.diag(s)

    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else: 
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m

Test:

In [95]: x
Out[95]: array([1, 2])

In [96]: softmax(x)
Out[96]: array([ 0.26894142,  0.73105858])

In [97]: softmax_grad(softmax(x))
Out[97]: 
array([[ 0.19661193, -0.19661193],
       [-0.19661193,  0.19661193]])

If you implement it in a vectorized version:

soft_max = softmax(x)    

# reshape softmax to 2d so np.dot gives matrix multiplication

def softmax_grad(softmax):
    s = softmax.reshape(-1,1)
    return np.diagflat(s) - np.dot(s, s.T)

softmax_grad(soft_max)

#array([[ 0.19661193, -0.19661193],
#       [-0.19661193,  0.19661193]])

- For jacobian_m[i][j] = s[i] * (1-s[i]) I get the error TypeError: 'numpy.float64' object does not support item assignment. How would you fix this for a numpy matrix input?
- Why not use np.outer instead of reshaping? def softmax_grad(s): return np.diagflat(s) - np.outer(s, s)
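
Following up on the two comments above: the TypeError appears when `s` is 2-D, because `np.diag` then extracts the diagonal as a 1-D array instead of building a matrix, so `jacobian_m[i]` is a scalar. A minimal sketch that accepts shape (n,), (1, n) or (n, 1) is shown below; the name `softmax_grad_outer` is made up for this illustration.

```python
import numpy as np


def softmax_grad_outer(s):
    # np.diagflat and np.outer both flatten their input, so s may have
    # shape (n,), (1, n) or (n, 1); the result is always (n, n).
    return np.diagflat(s) - np.outer(s, s)


s = np.array([[0.26894142, 0.73105858]])  # softmax of [1, 2], shape (1, 2)
jac = softmax_grad_outer(s)
print(jac)
```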

Source author: Aerin


It should look like this (x is the input to the softmax layer, y is its output, and dy is the delta coming from the loss above it):
```
    dx = y * dy
    s = dx.sum(axis=dx.ndim - 1, keepdims=True)
    dx -= y * s

    return dx
```
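
As a sketch, the fragment above can be wrapped into a complete function and compared against an explicit Jacobian product. The names `softmax_backward` and `reference` and the random test data are assumptions made for this illustration, not part of the original answer.

```python
import numpy as np


def softmax(x):
    # Softmax along the last axis; works for a vector or a batch of rows.
    exps = np.exp(x - x.max(axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)


def softmax_backward(y, dy):
    # y: softmax output, dy: gradient of the loss w.r.t. y (same shape).
    dx = y * dy
    # keepdims=True keeps the summed axis with length 1, so s has shape
    # (batch, 1) and broadcasts against y in the next line.
    s = dx.sum(axis=-1, keepdims=True)
    return dx - y * s


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))   # batch of 3 rows, 4 classes each
dy = rng.normal(size=(3, 4))  # stand-in for the loss gradient
y = softmax(x)
dx = softmax_backward(y, dy)

# Reference: multiply dy by the explicit Jacobian, one row at a time.
reference = np.stack([
    (np.diag(row) - np.outer(row, row)) @ g for row, g in zip(y, dy)
])
print(np.abs(dx - reference).max())
```

The shortcut needs only O(n) work per row, whereas building the full Jacobian costs O(n²).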
But the way you compute the error should be:
```
    yact = activation.compute(x)
    ycost = cost.compute(yact)
    dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue)) 
```
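
To show how such a `delta` combines with a loss, here is a self-contained sketch with a gradient check. The `SquaredError` class and both method signatures are assumptions for illustration; the library's real signatures (such as the three-argument `cost.delta` above) may differ.

```python
import numpy as np


class Softmax:
    def compute(self, incoming):
        exps = np.exp(incoming - incoming.max())
        return exps / exps.sum()

    def delta(self, incoming, above):
        # above: gradient of the loss w.r.t. the softmax output.
        # Returns the gradient of the loss w.r.t. incoming.
        outgoing = self.compute(incoming)
        weighted = outgoing * above
        return weighted - outgoing * weighted.sum()


class SquaredError:
    # Hypothetical loss, assumed to be 0.5 * sum((prediction - target)^2).
    def compute(self, prediction, target):
        return 0.5 * np.sum((prediction - target) ** 2)

    def delta(self, prediction, target):
        return prediction - target


activation, cost = Softmax(), SquaredError()
incoming = np.array([0.5, -1.0, 2.0])
target = np.array([0.0, 0.0, 1.0])

outgoing = activation.compute(incoming)
analytic = activation.delta(incoming, cost.delta(outgoing, target))

# Gradient check with central differences on the full loss.
eps = 1e-6
numeric = np.zeros_like(incoming)
for i in range(incoming.size):
    step = np.zeros_like(incoming)
    step[i] = eps
    plus = cost.compute(activation.compute(incoming + step), target)
    minus = cost.compute(activation.compute(incoming - step), target)
    numeric[i] = (plus - minus) / (2 * eps)

relative_error = np.abs(analytic - numeric).max() / np.abs(numeric).max()
print(relative_error)
```

The key design point is that `delta` receives the upstream gradient as an argument rather than returning a vector to be multiplied element-wise afterwards.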
Explanation: The delta function is part of the backpropagation algorithm, so its job is to multiply the vector dy (in my code; outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1] and then multiply it from the left by a vector dy, after a bit of algebra you will find that you get something that corresponds to my Python code.

[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network
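
For readers who want the algebra spelled out, here is a sketch of the derivation, writing y = softmax(x) and dy for the gradient arriving from the loss:

```latex
\begin{aligned}
\frac{\partial y_j}{\partial x_i} &= y_j\,(\delta_{ij} - y_i) \\
dx_i &= \sum_j \frac{\partial y_j}{\partial x_i}\, dy_j
      = \sum_j y_j\,(\delta_{ij} - y_i)\, dy_j \\
     &= y_i\, dy_i \;-\; y_i \sum_j y_j\, dy_j
\end{aligned}
```

The first term corresponds to `y * dy` and the sum to `s`, so the last line is exactly `dx - y * s`.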
- Thanks for your answer. What are you referring to by res?
- I meant dx (I was manually refactoring the code for this answer and missed this occurrence =)). I fixed it in the answer.
- Your solution works well for me. Gradient checks pass. Out of curiosity, could you briefly explain how you came up with the formula? I would really like to understand it.
- Could you please write your answer in a non-Python format?! I can't figure out what "keepdims=true" means in your code!
- If you sum a 3x4 matrix along its last axis, by default you get a vector of length 3. keepdims=True simply asks the sum function to output a 3x1 matrix instead. This is important for the multiplication to broadcast the right way. The original code was numpy+Python, so I think a Python answer is appropriate. Let me know if there is more confusion.
- I edited the answer. I hope it helps.
Source author: ticcky

Here is a vectorized C++ version using intrinsics (22 times (!) faster than the non-SSE version):

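#include <cassert>      //assert
#include <cstddef>      //size_t
#include <immintrin.h>  //__m256 and the _mm256_* intrinsics

//How many floats fit into __m256 "group".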
//Used by vectors and matrices, to ensure their dimensions are appropriate for 
//intrinsics.
//Otherwise, consecutive rows of matrices will not be 32-byte aligned, and 
//operations on them will be incorrect.
#define F_MULTIPLE_OF_M256 8


//check to quickly see if your rows are divisible by m256.
//you can 'undefine' to save performance, after everything was verified to be correct.
#define ASSERT_THE_M256_MULTIPLES
#ifdef ASSERT_THE_M256_MULTIPLES
    #define assert_is_m256_multiple(x)  assert( ((x)%F_MULTIPLE_OF_M256) == 0)
#else
    #define assert_is_m256_multiple(q) 
#endif


//usually used at the end of our Reduce functions,
//where the final __m256 mSum needs to be collapsed into 1 scalar.
static inline float slow_hAdd_ps(__m256 x){
    const float *sumStart = reinterpret_cast<const float*>(&x);
    float sum = 0.0f;

    for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){
        sum += sumStart[i];
    }
    return sum;
}



f_vec SoftmaxGrad_fromResult(const float *softmaxResult,  size_t size,  
                             const float *gradFromAbove){//<--gradient vector, flowing into us from the above layer
assert_is_m256_multiple(size);
//allocate vector, where to store output:
f_vec grad_v(size, true);//true: skip filling with zeros, to save performance.

const __m256* end   = (const __m256*)(softmaxResult + size);


for(size_t i=0; i<size; ++i){//<--for every row
    //go through this i'th row:
    __m256 sum =  _mm256_set1_ps(0.0f);

    const __m256 neg_sft_i  =  _mm256_set1_ps( -softmaxResult[i] );
    const __m256 *s  =  (const __m256*)softmaxResult;
    const __m256 *gAbove  =   (__m256*)gradFromAbove;

    for (s;  s<end; ){
        __m256 mul =  _mm256_mul_ps(*s, neg_sft_i);  // sftmaxResult_j  *  (-sftmaxResult_i)
        mul =  _mm256_mul_ps( mul, *gAbove );

        sum =  _mm256_add_ps( sum,  mul );//adding to the total sum of this row.
        ++s;
        ++gAbove;
    }
    grad_v[i]  =  slow_hAdd_ps( sum );//collapse the sum into 1 scalar (true sum of this row).
}//end for every row

//reset back to start and add a vector, to account for Kronecker delta:
__m256 *g =  (__m256*)grad_v._contents;
__m256 *s =  (__m256*)softmaxResult;
__m256 *gAbove =  (__m256*)gradFromAbove;

for(s; s<end; ){
    __m256 mul = _mm256_mul_ps(*s, *gAbove);  // sftmaxResult_i  *  gradFromAbove_i
    *g = _mm256_add_ps( *g, mul );
    ++s; 
    ++g;
    ++gAbove;//advance in step with s and g
}

return grad_v;

}

If for some reason someone wants a simple (non-SSE) version, here it is:

f_vec Mathf::SoftmaxGrad_fromResult_slow(const float *softmaxResult,  size_t size,  
                                         float *gradFromAbove){//<--gradient vector, flowing into us from the above layer
    assert_is_m256_multiple(size);
    f_vec grad_v(size, 0.0f);//allocate a vector, initialized with zeros

    //every pre-softmax element in a layer contributed to the softmax of every other element
    //(it went into the denominator). So gradient will be distributed from every post-softmax element to every pre-elem.
    for(size_t pre=0; pre<size; ++pre){//<--for every row
        for(size_t post=0; post<size; ++post){

            float grad;
            if (pre == post){//if 'pre' is same as 'post', thus doesn't matter which one is in which []:
                grad = softmaxResult[pre]  *  (1-softmaxResult[post]); 
            }
            else {//notice minus:
                grad =  -softmaxResult[pre]  *  softmaxResult[post]; 
            }

            grad_v[pre] += grad*gradFromAbove[post];//<--add
        }//end for every post-softmax element

    }//end for every pre-softmax element

    return grad_v;
}

Source author: Kari
