

The reason the CUDA architecture provides so many memory types is to increase memory access speed, so that the data transfer rate can match the data processing rate. The following example shows why matching these two speeds is so important in GPU computation. One of the most important measures of a processor's computational ability is its floating-point operation rate (flops). The peak computational capability of our Tesla card is 1 teraflop (1,000 gigaflops), but because of the limited memory access speed, we can achieve less than 4 percent of that peak. The NVIDIA Tesla C2075 companion processor supports 144 gigabytes per second (GB/s) of global memory bandwidth. With this in mind, we can proceed to our example. Assume that in order to perform one floating-point operation, the runtime needs to transfer one single-precision floating-point datum from global memory to the computational kernel. With 4 bytes in each single-precision datum, we can load no more than 36 (144/4) giga single-precision values per second. Since the computational kernel cannot compute on more floating-point data than global memory has loaded, it will execute no more than 36 gigaflops, less than 4 percent of the card's peak.

Matrix Multiplication with Global Memory source file:

As we know, threads can be organized into multi-dimensional blocks, and blocks can in turn be organized into a multi-dimensional grid. This feature of the CUDA architecture enables us to create a two-dimensional or even three-dimensional thread hierarchy, so that solving two- or three-dimensional problems becomes easier and more efficient. In this example, we will perform square matrix multiplication. The two input matrices, M and N, are of size Width x Width; the output matrix P has the same size. If you have studied linear algebra, you will know that the product of two square matrices is a square matrix of the same size. For example, to calculate entry (A,B) of the output matrix, we need row A of one input matrix and column B of the other. We first take the leftmost element of row A and multiply it by the top element of column B; next, we take the second element of row A and multiply it by the second element of column B. We continue this for all the elements of row A and column B and then take the sum of the products. That sum is the value of entry (A,B) in the output matrix.

As you can see, this kind of operation is highly parallel, which makes it a perfect fit for CUDA. We exploit the parallelism by assigning each entry of the output matrix a thread of its own. Each thread fetches the data it needs, performs the entire calculation, and writes the result back to the output matrix.
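The thread-per-entry scheme above can be sketched as a kernel that uses only global memory. This is a minimal illustration, not the article's actual source file; the kernel name, row-major layout, and parameter names are assumptions.

```cuda
// Naive square matrix multiplication, P = M * N, using only global memory.
// Each thread computes exactly one entry of the output matrix P.
__global__ void matrixMulKernel(const float *M, const float *N, float *P, int Width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row A of the output entry
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column B of the output entry

    if (row < Width && col < Width) {
        float sum = 0.0f;
        // Multiply row `row` of M by column `col` of N element by element,
        // accumulating the sum of products.
        for (int k = 0; k < Width; ++k)
            sum += M[row * Width + k] * N[k * Width + col];
        P[row * Width + col] = sum;  // write the result back to global memory
    }
}
```

Note that each thread reads 2 x Width values from global memory to produce a single output value, which is exactly why this version is limited by the 144 GB/s bandwidth rather than by the card's arithmetic peak.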
Dim3 grid calculation

Starting from this example, we will look at how to solve a problem in a two-dimensional domain using a two-dimensional grid and block.

Question: I have a sequential smoothing algorithm:

void triangularSmooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter, NSTimer &timer, dim3 grid_size, dim3 block_size)

dim3 gridSize((width * height) / 1024, (width * height) / 1024, 1)
smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height, filter)

The problem is that the resulting smooth image looks as if the pixels are all in the wrong order (mixed up). Is this caused by the dimensions of the grid and block? I have tried a lot of other possible dimensions. I am using a GTX480: compute capability 2.x, maximum dimensionality of a grid of thread blocks 3, maximum x-, y-, or z-dimension of a grid of thread blocks 65535, maximum number of threads per block 1024.

Answer: First, the dimensions are totally invalid: the grid is sized from the total pixel count in both x and y, so it does not map threads onto the image. The following should work in this case:

dim3 blockSize(16, 16, 1)
dim3 gridSize((width + blockSize.x - 1) / blockSize.x, (height + blockSize.y - 1) / blockSize.y, 1)
smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height)

After this correction, running cuda-memcheck yielded results similar to:

Invalid __global__ read of size 4

This shows that a value within the kernel code is out of bounds (most probably an array index). Checking the various variables led to determining that filter was empty. Lastly, if filter is to be passed to the kernel, it should be copied from CPU to GPU using something like cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice). Alternatively, if the filter is not needed anywhere else (as is the case here), it can be declared within the kernel instead.
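Putting the answer's fixes together, the host-side setup might look like the following sketch. The `smooth` kernel body, the `filterGpu` allocation, and the 5 x 5 (25-element) filter size are assumptions for illustration, not code from the original question.

```cuda
// Host-side sketch: a 16x16 block, a grid that covers the whole image via
// ceiling division, and the filter copied to the GPU before the launch.
float *filterGpu;
cudaMalloc(&filterGpu, 25 * sizeof(float));                 // assumed 5x5 filter
cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

dim3 blockSize(16, 16, 1);                                  // 256 threads per block (<= 1024)
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,     // ceil(width / 16)
              (height + blockSize.y - 1) / blockSize.y,     // ceil(height / 16)
              1);
smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height, filterGpu);
cudaDeviceSynchronize();                                    // wait for the kernel and surface errors
```

Because the grid is rounded up, some threads fall past the right and bottom edges of the image, so the kernel must guard its reads and writes with a bounds check such as if (x < width && y < height). For example, a 500 x 400 image launches a 32 x 25 grid of blocks.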
