Matrix multiplication in CUDA

I have read through several websites and even used NVIDIA's code as a guide, but I am still getting the wrong answer. In main, the user is asked for the size; the program then displays A and B, followed by the resulting matrix C. However, say I run a 2x2 matrix for both A and B. This is my sample output:

Matrix A
0.000000 8.000000
2.000000 2.000000


Matrix B
3.000000 1.000000
5.000000 7.000000


Matrix C (Results)
0.000000 9.000000
7.000000 4.000000

But that is incorrect. It should be:

40.000 56.000
16.000 16.000
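
The expected values above can be reproduced with a plain CPU loop, which is also a convenient way to check the GPU result for larger sizes. The following is only a minimal sketch and not part of the NVIDIA sample: `matmul_reference` is a hypothetical helper name, and row-major storage is assumed, as in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): C = A * B on the CPU.
   A is hA x wA, B is wA x wB, C is hA x wB, all stored row-major. */
void matmul_reference(float *C, const float *A, const float *B,
                      size_t hA, size_t wA, size_t wB)
{
    assert(C != NULL && A != NULL && B != NULL);
    for (size_t i = 0; i < hA; ++i) {
        for (size_t j = 0; j < wB; ++j) {
            float sum = 0.0f;
            /* Inner product of row i of A with column j of B. */
            for (size_t k = 0; k < wA; ++k)
                sum += A[i * wA + k] * B[k * wB + j];
            C[i * wB + j] = sum;
        }
    }
}
```

For the 2x2 inputs shown, the first row works out to 0*3 + 8*5 = 40 and 0*1 + 8*7 = 56, and the second row to 2*3 + 2*5 = 16 and 2*1 + 2*7 = 16.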

I changed the values from decimals to whole numbers so that the result would be easier to check, and I found that it is incorrect. I do not understand why it would be incorrect, especially since I took it straight from the code sample.

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

//Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16



//CUDA Kernel
__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    //Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    //Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Index of the first sub-matrix of A processed
    //by the block
    int aBegin = wA * BLOCK_SIZE * by;

    //Index of the last sub-matrix of A processed
    //by the block
    int aEnd   = aBegin + wA - 1;

    //Step size used to iterate through the
    //sub-matrices of A
    int aStep  = BLOCK_SIZE;

    //Index of the first sub-matrix of B processed
    //by the block
    int bBegin = BLOCK_SIZE * bx;

    //Step size used to iterate through the
    //sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    //Running sum for the element of C computed by this thread
    float Csub = 0;

    //Loop over all the sub-matrices of A and B
    //required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        //Declaration of the shared memory array As
        //used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        //Declaration of the shared memory array Bs
        //used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        //Load the matrices from global memory
        //to shared memory; each thread loads
        //one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        //Synchronize to make sure the matrices
        //are loaded
        __syncthreads();

        //Multiply the two matrices together;
        //each thread computes one element
        //of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        //Synchronize to make sure that the preceding
        //computation is done before loading two new
        //sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    //Write the block sub-matrix to device memory;
    //each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

#endif //#ifndef _MATRIXMUL_KERNEL_H_

Host code:

    //perform the calculation
    //setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);

    //  execute the kernel
    matrixMul<<< grid, threads >>>(deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize);
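
A side observation about the grid setup above, offered as a sketch rather than a confirmed diagnosis: assuming `colSize` and `rowSize` are integers, `c.colSize / threads.x` truncates, so a 2x2 matrix with 16x16 blocks gives 2 / 16 = 0 blocks per dimension. CUDA rejects a launch whose grid has a zero dimension, so in that case the kernel would not run at all and C would simply hold whatever was already in memory. The two helper names below are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>

/* Blocks per dimension as the host code above computes it: integer
   division truncates, so any size below block_size yields 0 blocks. */
unsigned blocks_truncated(unsigned size, unsigned block_size)
{
    assert(block_size > 0);
    return size / block_size;
}

/* Hypothetical alternative: round up so that a partial tile still
   gets a block of its own. */
unsigned blocks_rounded_up(unsigned size, unsigned block_size)
{
    assert(block_size > 0);
    return (size + block_size - 1) / block_size;
}
```

Rounding up on its own is not a complete fix: the kernel as posted loads and stores full 16x16 tiles without bounds checks, so the data would still need to be padded to a multiple of BLOCK_SIZE, or the kernel would need guards on its loads and stores.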

Thanks for your help,
Dan

The code you are using implicitly requires that the matrix dimensions are round multiples of the block size (16x16 in this case). 2x2 matrices will not work. Try running it with 16x16 inputs and confirm the result.
Thanks, that solved my problem. Does it only work for 16x16 because that is the block and tile size?
Yes. The inner product calculation processes one tile width at a time without checking for out-of-bounds memory access. That is where the error occurs.
Could you post your comment as an answer (so that this question can be removed from the unanswered list)?

Author: Dan | 2012-01-11

c cuda matrix-multiplication

3

The code you are using implicitly requires that the matrix dimensions are round multiples of the block size (16x16 in this case). The inner product calculation processes one tile width at a time without checking for out-of-bounds memory access. For this reason, 2x2 matrices will not work.

If you try running the kernel with a 16x16 input (for example, by zero-padding your 2x2 case to 16x16), you should be able to confirm the result.
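
The zero-padding step could look roughly like the host-side sketch below. `round_up` and `pad_matrix` are hypothetical helper names, not part of the NVIDIA sample, and the code assumes row-major storage and that an all-zero-bits float is 0.0f (true for IEEE 754).

```c
#include <assert.h>
#include <stdlib.h>

/* Round n up to the next multiple of block (block must be > 0). */
size_t round_up(size_t n, size_t block)
{
    assert(block > 0);
    return (n + block - 1) / block * block;
}

/* Hypothetical helper: copy a rows x cols row-major matrix into a new
   zero-filled buffer whose dimensions are multiples of block.
   Returns NULL on allocation failure; the caller frees the result. */
float *pad_matrix(const float *src, size_t rows, size_t cols, size_t block,
                  size_t *padded_rows, size_t *padded_cols)
{
    size_t pr = round_up(rows, block);
    size_t pc = round_up(cols, block);
    float *dst = calloc(pr * pc, sizeof *dst);  /* zero-initialized */
    if (dst == NULL)
        return NULL;
    /* Copy the original data into the top-left corner; note that the
       row stride of dst is the padded width pc, not cols. */
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[i * pc + j] = src[i * cols + j];
    *padded_rows = pr;
    *padded_cols = pc;
    return dst;
}
```

With padded inputs, the padded widths are what would be passed as wA and wB, and the real result sits in the top-left corner of the padded C. The extra zeros add nothing to any inner product, so those entries match the unpadded product.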

Author: talonmies
