Efficient training of LSTM network with GPU

Yuto Ozaki, 10 April 2016
Commented: Joss Knight, 20 April 2016
Hi all,
I recently set up a GPU-equipped machine and am currently refactoring my LSTM code to take advantage of it. However, my implementation shows no speedup; in fact, the CPU version is faster than the GPU version. The code below benchmarks the basic LSTM forward pass for comparison. Could anyone give advice on how to exploit the potential of the GPU for LSTM? I tried pagefun, arrayfun, and bsxfun, but none of them seemed to improve the speed.
This one is for GPU.
function LSTM_gpu2()
    vis = 700; hid = 500;   % input (visible) and hidden layer sizes
    T = 80; epochs = 10;    % sequence length and number of epochs
    sigmoid = @(x) 1./(1+exp(-x));
    % Input sequence, hidden states, and cell states
    x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
    % Input-to-hidden weights
    W_z = rand(hid,vis,'gpuArray'); W_i = rand(hid,vis,'gpuArray');
    W_f = rand(hid,vis,'gpuArray'); W_o = rand(hid,vis,'gpuArray');
    % Hidden-to-hidden (recurrent) weights
    R_z = rand(hid,hid,'gpuArray'); R_i = rand(hid,hid,'gpuArray');
    R_f = rand(hid,hid,'gpuArray'); R_o = rand(hid,hid,'gpuArray');
    % Peephole weights (diagonal matrices)
    P_i = diag(rand(hid,1,'gpuArray')); P_f = diag(rand(hid,1,'gpuArray'));
    P_o = diag(rand(hid,1,'gpuArray'));
    % Biases
    b_z = rand(hid,1,'gpuArray'); b_i = rand(hid,1,'gpuArray');
    b_f = rand(hid,1,'gpuArray'); b_o = rand(hid,1,'gpuArray');
    % Gate activations for every time step
    I = zeros(hid,T,'gpuArray'); F = zeros(hid,T,'gpuArray');
    O = zeros(hid,T,'gpuArray'); G = zeros(hid,T,'gpuArray');
    % Move the state arrays onto the GPU
    x = gpuArray(x); h = gpuArray(h); c = gpuArray(c);
    tic;
    for i = 1:epochs
        for t = 1:T
            G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);                     % block input
            I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);   % input gate
            F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);   % forget gate
            c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);                       % cell state
            O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o); % output gate
            h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);                                % hidden output
        end
        % backprop
        % update
    end
    wait(gpuDevice);  % ensure all queued GPU work has finished before timing
    toc;
end
And this one is for CPU.
function LSTM_cpu()
    vis = 700; hid = 500;   % input (visible) and hidden layer sizes
    T = 80; epochs = 10;    % sequence length and number of epochs
    sigmoid = @(x) 1./(1+exp(-x));
    % Input sequence, hidden states, and cell states
    x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
    % Input-to-hidden weights
    W_z = rand(hid,vis); W_i = rand(hid,vis);
    W_f = rand(hid,vis); W_o = rand(hid,vis);
    % Hidden-to-hidden (recurrent) weights
    R_z = rand(hid,hid); R_i = rand(hid,hid);
    R_f = rand(hid,hid); R_o = rand(hid,hid);
    % Peephole weights (diagonal matrices)
    P_i = diag(rand(hid,1)); P_f = diag(rand(hid,1));
    P_o = diag(rand(hid,1));
    % Biases
    b_z = rand(hid,1); b_i = rand(hid,1);
    b_f = rand(hid,1); b_o = rand(hid,1);
    % Gate activations for every time step
    I = zeros(hid,T); F = zeros(hid,T);
    O = zeros(hid,T); G = zeros(hid,T);
    tic;
    for i = 1:epochs
        for t = 1:T
            G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);                     % block input
            I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);   % input gate
            F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);   % forget gate
            c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);                       % cell state
            O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o); % output gate
            h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);                                % hidden output
        end
        % backprop
        % update
    end
    toc;
end
OS: Windows 10,
GPU: NVIDIA Quadro M5000,
CPU: Intel i7-5820K,
MATLAB: R2016a
Thank you,
Yuto Ozaki
1 comment
Yuto Ozaki, 10 April 2016 (edited 10 April 2016)
Additional question:
Some papers [1][2] use an affine-transform notation for a more compact formulation, but they do not use peephole connections. In fact, Chainer's LSTM model does not implement peephole connections, and TensorFlow provides LSTM models both with and without them. For computational efficiency, would omitting peepholes be current best practice? Without peepholes, all four affine transforms can be done in a single multiply, which I think leads to more GPU-friendly code.
[1] Kelvin Xu et al.: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (2015)
[2] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals: "Recurrent Neural Network Regularization" (2014)
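As a sketch of what I mean (illustrative only; dimensions as in the code above): without peepholes, the four gate pre-activations can share one stacked weight matrix, so each time step needs only one input-side and one recurrent matrix multiply:

```matlab
% Compact single-step LSTM without peepholes: stacked weights for z,i,f,o.
% W is [4*hid x vis], R is [4*hid x hid], b is [4*hid x 1].
vis = 700; hid = 500;
sigmoid = @(a) 1./(1+exp(-a));
W = rand(4*hid,vis); R = rand(4*hid,hid); b = rand(4*hid,1);
xt = rand(vis,1); ht = zeros(hid,1); ct = zeros(hid,1);
pre = W*xt + R*ht + b;              % all four pre-activations at once
z = tanh(pre(1:hid));               % block input
i = sigmoid(pre(hid+1:2*hid));      % input gate
f = sigmoid(pre(2*hid+1:3*hid));    % forget gate
o = sigmoid(pre(3*hid+1:4*hid));    % output gate
ct = z.*i + ct.*f;                  % new cell state
ht = tanh(ct).*o;                   % new hidden state
```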

Accepted Answer

Joss Knight, 15 April 2016
To get good performance out of the GPU, you need to give it a lot of data to process. Your best bet is to vectorize your code to remove the inner loop. Your sigmoid and tanh activation functions, for instance, are element-wise operators and so should vectorize trivially, while your matrix multiplies can be executed in batch using pagefun.
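For example (a sketch under the assumption that the whole input sequence is known up front): the input-side products W*x do not depend on the recurrent state, so they can be batched over all time steps with pagefun before the sequential loop, leaving only the R*h terms inside it:

```matlab
% Batch the input-side matrix multiplies over all T time steps with pagefun.
vis = 700; hid = 500; T = 80;
W_z = rand(hid,vis,'gpuArray');
x = gpuArray(rand(vis,1,T));
Zx = pagefun(@mtimes, W_z, x);   % hid x 1 x T: W_z*x(:,:,t) for every t
% Inside the time loop, use the precomputed slice instead of W_z*x(:,:,t):
%   G(:,t) = tanh(Zx(:,:,t) + R_z*h(:,:,t) + b_z);
```

The same precomputation applies to W_i, W_f, and W_o, cutting four matrix multiplies per time step out of the sequential loop.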
Alternatively, have you considered using the new Deep Learning features in the Neural Network Toolbox in MATLAB R2016a, or the free 3rd party deep learning solution MatConvNet?
2 comments
Yuto Ozaki, 16 April 2016
Joss,
Thank you for your reply. I just tried training larger samples with mini-batches, and it yielded about a 35% speedup on the GPU. However, I think removing the inner loop would be challenging, since an RNN takes its input from the previous state, which makes the sequential for-loop essential to the algorithm.
I have checked the Neural Network Toolbox, but it does not seem to implement RNNs. My main interest is music information retrieval, so time-series models such as RNNs and their variants are my main focus.
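For reference, the mini-batch change I tried amounts to widening the state vectors into matrices whose columns are batch samples (a sketch; batchSize is illustrative):

```matlab
% Mini-batching: each hidden/cell state becomes hid x batchSize, so every
% matrix multiply processes the whole batch at once (more GPU-friendly).
vis = 700; hid = 500; batchSize = 64;
sigmoid = @(a) 1./(1+exp(-a));
W_z = rand(hid,vis,'gpuArray'); R_z = rand(hid,hid,'gpuArray');
b_z = rand(hid,1,'gpuArray');
xt = rand(vis,batchSize,'gpuArray');   % one time step, whole batch
ht = zeros(hid,batchSize,'gpuArray');
% bsxfun broadcasts the bias across the batch columns (needed in R2016a)
G = tanh(bsxfun(@plus, W_z*xt + R_z*ht, b_z));
```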
Joss Knight, 20 April 2016
Support for RNNs is considered high priority by the development team. Meanwhile, take a look at MatConvNet.


More Answers (0)
