Benchmarking A\b with GPU Coder

This example shows how to benchmark solving a linear system by generating GPU code. The MATLAB® code to solve for x in A*x = b is simple: most frequently, we use matrix left division, also known as mldivide or the backslash operator (\), to calculate x (that is, x = A\b).
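As a minimal illustration before benchmarking (this snippet is not part of the example's files), the following sketch solves a small system with the backslash operator and checks the residual. The matrix is made diagonally dominant in the same way as the benchmark's getData function, so it is well conditioned:

```matlab
% Solve a small linear system A*x = b and verify the result.
n = 8;
A = rand(n) + 100*eye(n);   % diagonally dominant, so well conditioned
b = rand(n, 1);
x = A\b;                    % matrix left division (mldivide)
relres = norm(A*x - b)/norm(b);   % relative residual, near machine epsilon
```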

Prerequisites

  • CUDA®-enabled NVIDIA® GPU with compute capability 3.5 or higher.

  • NVIDIA CUDA toolkit.

  • Environment variables for the compilers and libraries. For more information, see Environment Variables.

Create a New Folder and Copy Relevant Files

The following line of code creates a folder in your current working folder (pwd), and copies all the relevant files into this folder. If you do not want to perform this operation or if you cannot generate files in this folder, change your current working folder.

gpucoderdemo_setup('gpucoderdemo_backslash_bench');

Verify the GPU Environment

Use the coder.checkGpuInstall function and verify that the compilers and libraries needed for running this example are set up correctly.

envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

Determine the Maximum Data Size

It is important to choose an appropriate matrix size for the computations. We can do this by specifying the amount of system memory in GB available to the CPU and the GPU. The default value is based only on the amount of memory available on the GPU; you can specify a value that is appropriate for your system.

g = gpuDevice;
% Budget a quarter of the available GPU memory (converted to GB),
% leaving headroom for solver workspace and temporaries.
maxMemory = 0.25*g.AvailableMemory/1024^3;

The Benchmarking Function

We want to benchmark matrix left division (\), including the cost of transferring data between the CPU and GPU, to get a clear view of the total application time when using GPU Coder™. However, we do not want to include the time needed to create the data. We therefore separate the data generation from the function that solves the linear system, and generate code for and measure only that operation.

type getData.m
function [A, b] = getData(n, clz)

%   Copyright 2017 The MathWorks, Inc.

    fprintf('Creating a matrix of size %d-by-%d.\n', n, n);
    A = rand(n, n, clz) + 100*eye(n, n, clz);
    b = rand(n, 1, clz);
end

The Backslash Function

The backslash function encapsulates the (\) operation for which we want to generate code.

type backslash.m
function [x] = backslash(A,b)
%#codegen

%   Copyright 2017 The MathWorks, Inc.

    coder.gpu.kernelfun();
    x = A\b;
end

Generating the GPU Code

We create a function to generate the GPU MEX function based on the particular input data size.

type genGpuCode.m
function [] = genGpuCode(A, b)

%   Copyright 2017 The MathWorks, Inc.

    cfg = coder.gpuConfig('mex');
    evalc('codegen -config cfg -args {A,b} backslash');
end

Choosing Problem Size

As with many parallel algorithms, the performance of solving a linear system in parallel depends greatly on the matrix size. We compare the performance of the algorithm for different matrix sizes.

% Declare the matrix sizes to be a multiple of 1024.
sizeLimit = inf;
if ispc
    sizeLimit = double(intmax('int32'));
end
% Largest n such that an n-by-n matrix fits in the memory budget
% (4 bytes per single-precision element, 8 bytes per double).
maxSizeSingle = min(floor(sqrt(maxMemory*1024^3/4)),floor(sqrt(sizeLimit/4)));
maxSizeDouble = min(floor(sqrt(maxMemory*1024^3/8)),floor(sqrt(sizeLimit/8)));
step = 1024;
% Enlarge the step if needed so that only a handful of sizes are tested.
if maxSizeDouble/step >= 10
    step = step*floor(maxSizeDouble/(5*step));
end
sizeSingle = 1024:step:maxSizeSingle;
sizeDouble = 1024:step:maxSizeDouble;
numReps = 5;

Comparing Performance: Speedup

We use the total elapsed time as our measure of performance because that allows us to compare the performance of the algorithm for different matrix sizes. Given a matrix size, the benchmarking function creates the matrix A and the right-hand side b once, and then solves A\b a few times to get an accurate measure of the time it takes.

type benchFcnMat.m
function time = benchFcnMat(A, b, reps)

%   Copyright 2017 The MathWorks, Inc.

    time = inf;
    % Solve the linear system a few times and keep the best (fastest) run
    for itr = 1:reps
        tic;
        matX = backslash(A, b);
        tcurr = toc;
        time = min(tcurr, time);
    end
end

% We need to create a different function for GPU code execution that
% invokes the generated GPU MEX function.
type benchFcnGpu.m
function time = benchFcnGpu(A, b, reps)

%   Copyright 2017 The MathWorks, Inc.

    time = inf;
    % Warm-up call: the first execution includes MEX loading and GPU
    % initialization overhead, so it is excluded from the timing.
    gpuX = backslash_mex(A, b);
    for itr = 1:reps
        tic;
        gpuX = backslash_mex(A, b);
        tcurr = toc;
        time = min(tcurr, time);
    end
end

Executing the Benchmarks

With the setup complete, executing the benchmarks is straightforward. However, the computations can take a long time to complete, so we print some intermediate status information as we complete the benchmarking for each matrix size. We also encapsulate the loop over all the matrix sizes in a function, to benchmark both single- and double-precision computations.

Note that actual execution times vary across different hardware configurations. This benchmarking was done using MATLAB R2018a on a machine with an 8-core, 2.6 GHz Intel® Xeon® CPU and an NVIDIA Titan X GPU.
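As an aside (not used in this example's measurements), MATLAB's timeit function offers an alternative to the manual tic/toc best-of-N loop, handling warm-up and repetition automatically. A sketch, assuming A, b, and the generated backslash_mex from the previous steps are in the workspace:

```matlab
% Alternative timing with timeit, which repeats the call and discards
% first-call overhead automatically.
tCPU = timeit(@() backslash(A, b));
tGPU = timeit(@() backslash_mex(A, b));
```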

type executeBenchmarks.m
function [timeCPU, timeGPU] = executeBenchmarks(clz, sizes, reps)

%   Copyright 2017 The MathWorks, Inc.

    fprintf(['Starting benchmarks with %d different %s-precision ' ...
         'matrices of sizes\nranging from %d-by-%d to %d-by-%d.\n'], ...
            length(sizes), clz, sizes(1), sizes(1), sizes(end), ...
            sizes(end));
    timeGPU = zeros(size(sizes));
    timeCPU = zeros(size(sizes));   
    for i = 1:length(sizes)
        n = sizes(i);
        fprintf('Size : %d\n', n);
        [A, b] = getData(n, clz);
        genGpuCode(A, b);
        timeCPU(i) = benchFcnMat(A, b, reps);
        fprintf('Time on CPU: %f sec\n', timeCPU(i));
        timeGPU(i) = benchFcnGpu(A, b, reps);
        fprintf('Time on GPU: %f sec\n', timeGPU(i));
        fprintf('\n');
    end
end

We then execute the benchmarks in single and double precision.

[cpu, gpu] = executeBenchmarks('single', sizeSingle, numReps);
results.sizeSingle = sizeSingle;
results.timeSingleCPU = cpu;
results.timeSingleGPU = gpu;
[cpu, gpu] = executeBenchmarks('double', sizeDouble, numReps);
results.sizeDouble = sizeDouble;
results.timeDoubleCPU = cpu;
results.timeDoubleGPU = gpu;
Starting benchmarks with 9 different single-precision matrices of sizes
ranging from 1024-by-1024 to 25600-by-25600.
Size : 1024
Creating a matrix of size 1024-by-1024.
Time on CPU: 0.008953 sec
Time on GPU: 0.012897 sec

Size : 4096
Creating a matrix of size 4096-by-4096.
Time on CPU: 0.241337 sec
Time on GPU: 0.056992 sec

Size : 7168
Creating a matrix of size 7168-by-7168.
Time on CPU: 0.979883 sec
Time on GPU: 0.126366 sec

Size : 10240
Creating a matrix of size 10240-by-10240.
Time on CPU: 2.456934 sec
Time on GPU: 0.251279 sec

Size : 13312
Creating a matrix of size 13312-by-13312.
Time on CPU: 4.969921 sec
Time on GPU: 0.440791 sec

Size : 16384
Creating a matrix of size 16384-by-16384.
Time on CPU: 9.099973 sec
Time on GPU: 0.699288 sec

Size : 19456
Creating a matrix of size 19456-by-19456.
Time on CPU: 15.161504 sec
Time on GPU: 1.043330 sec

Size : 22528
Creating a matrix of size 22528-by-22528.
Time on CPU: 20.989874 sec
Time on GPU: 1.494572 sec

Size : 25600
Creating a matrix of size 25600-by-25600.
Time on CPU: 23.794767 sec
Time on GPU: 2.062284 sec

Starting benchmarks with 7 different double-precision matrices of sizes
ranging from 1024-by-1024 to 19456-by-19456.
Size : 1024
Creating a matrix of size 1024-by-1024.
Time on CPU: 0.013808 sec
Time on GPU: 0.019805 sec

Size : 4096
Creating a matrix of size 4096-by-4096.
Time on CPU: 0.459535 sec
Time on GPU: 0.169425 sec

Size : 7168
Creating a matrix of size 7168-by-7168.
Time on CPU: 1.899516 sec
Time on GPU: 0.726797 sec

Size : 10240
Creating a matrix of size 10240-by-10240.
Time on CPU: 4.896854 sec
Time on GPU: 1.970779 sec

Size : 13312
Creating a matrix of size 13312-by-13312.
Time on CPU: 10.263277 sec
Time on GPU: 4.184999 sec

Size : 16384
Creating a matrix of size 16384-by-16384.
Time on CPU: 18.258044 sec
Time on GPU: 7.642277 sec

Size : 19456
Creating a matrix of size 19456-by-19456.
Time on CPU: 21.981840 sec
Time on GPU: 12.661797 sec

Plotting the Performance

We can now plot the results, and compare the performance on the CPU and the GPU, both for single and double precision.

First, we look at the performance of the backslash operator in single precision.

fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeSingle, results.timeSingleGPU, '-x', ...
     results.sizeSingle, results.timeSingleCPU, '-o')
grid on;
legend('GPU', 'CPU', 'Location', 'NorthWest');
title(ax, 'Single-precision performance')
ylabel(ax, 'Time (s)');
xlabel(ax, 'Matrix size');
drawnow;

Now, we look at the performance of the backslash operator in double precision.

fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeDouble, results.timeDoubleGPU, '-x', ...
     results.sizeDouble, results.timeDoubleCPU, '-o')
legend('GPU', 'CPU', 'Location', 'NorthWest');
grid on;
title(ax, 'Double-precision performance')
ylabel(ax, 'Time (s)');
xlabel(ax, 'Matrix size');
drawnow;

Finally, we look at the speedup of the backslash operator when comparing the GPU to the CPU.

speedupDouble = results.timeDoubleCPU./results.timeDoubleGPU;
speedupSingle = results.timeSingleCPU./results.timeSingleGPU;
fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeSingle, speedupSingle, '-v', ...
     results.sizeDouble, speedupDouble, '-*')
grid on;
legend('Single-precision', 'Double-precision', 'Location', 'SouthEast');
title(ax, 'Speedup of computations on GPU compared to CPU');
ylabel(ax, 'Speedup');
xlabel(ax, 'Matrix size');
drawnow;

Cleanup

Remove the temporary files and return to the original folder.

cleanup