Efficient training of LSTM network with GPU

Yuto Ozaki, 10 April 2016
Commented: Joss Knight, 20 April 2016
Hi all,
I recently set up a GPU-equipped machine and am currently refactoring my LSTM code to take advantage of it. However, my implementation shows no speedup; in fact, the CPU version is faster than the GPU version. The code below benchmarks the basic LSTM forward pass for comparison. Could anyone give advice on how to exploit the potential of the GPU for LSTM? I tried pagefun, arrayfun, and bsxfun, but none of them seemed to improve the speed.
This one is for GPU.
function LSTM_gpu2()
    vis = 700; hid = 500;   % input (visible) and hidden layer sizes
    T = 80; epochs = 10;    % sequence length and number of epochs
    sigmoid = @(x) 1./(1+exp(-x));
    % Input sequence, hidden states, and cell states
    x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
    % Input-to-hidden weights
    W_z = rand(hid,vis,'gpuArray'); W_i = rand(hid,vis,'gpuArray');
    W_f = rand(hid,vis,'gpuArray'); W_o = rand(hid,vis,'gpuArray');
    % Hidden-to-hidden (recurrent) weights
    R_z = rand(hid,hid,'gpuArray'); R_i = rand(hid,hid,'gpuArray');
    R_f = rand(hid,hid,'gpuArray'); R_o = rand(hid,hid,'gpuArray');
    % Peephole weights (diagonal matrices)
    P_i = diag(rand(hid,1,'gpuArray')); P_f = diag(rand(hid,1,'gpuArray'));
    P_o = diag(rand(hid,1,'gpuArray'));
    % Biases
    b_z = rand(hid,1,'gpuArray'); b_i = rand(hid,1,'gpuArray');
    b_f = rand(hid,1,'gpuArray'); b_o = rand(hid,1,'gpuArray');
    % Gate activations for every time step
    I = zeros(hid,T,'gpuArray'); F = zeros(hid,T,'gpuArray');
    O = zeros(hid,T,'gpuArray'); G = zeros(hid,T,'gpuArray');
    % Move the state arrays onto the GPU
    x = gpuArray(x); h = gpuArray(h); c = gpuArray(c);
    tic;
    for i = 1:epochs
        for t = 1:T
            G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);                     % block input
            I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);   % input gate
            F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);   % forget gate
            c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);                       % cell state
            O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o); % output gate
            h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);                                % hidden output
        end
        % backprop
        % update
    end
    wait(gpuDevice);  % ensure all queued GPU work has finished before timing
    toc;
end
And this one is for CPU.
function LSTM_cpu()
    vis = 700; hid = 500;   % input (visible) and hidden layer sizes
    T = 80; epochs = 10;    % sequence length and number of epochs
    sigmoid = @(x) 1./(1+exp(-x));
    % Input sequence, hidden states, and cell states
    x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
    % Input-to-hidden weights
    W_z = rand(hid,vis); W_i = rand(hid,vis);
    W_f = rand(hid,vis); W_o = rand(hid,vis);
    % Hidden-to-hidden (recurrent) weights
    R_z = rand(hid,hid); R_i = rand(hid,hid);
    R_f = rand(hid,hid); R_o = rand(hid,hid);
    % Peephole weights (diagonal matrices)
    P_i = diag(rand(hid,1)); P_f = diag(rand(hid,1));
    P_o = diag(rand(hid,1));
    % Biases
    b_z = rand(hid,1); b_i = rand(hid,1);
    b_f = rand(hid,1); b_o = rand(hid,1);
    % Gate activations for every time step
    I = zeros(hid,T); F = zeros(hid,T);
    O = zeros(hid,T); G = zeros(hid,T);
    tic;
    for i = 1:epochs
        for t = 1:T
            G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);                     % block input
            I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);   % input gate
            F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);   % forget gate
            c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);                       % cell state
            O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o); % output gate
            h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);                                % hidden output
        end
        % backprop
        % update
    end
    toc;
end
OS: Windows 10,
GPU: NVIDIA Quadro M5000,
CPU: Intel i7-5820K,
MATLAB: R2016a
Thank you,
Yuto Ozaki
1 comment
Yuto Ozaki, 10 April 2016 (edited 10 April 2016)
Additional question:
Some papers [1][2] use an affine-transform notation for a more compact formulation, but they do not use peephole connections. In fact, Chainer's LSTM model does not implement peephole connections, and TensorFlow provides LSTM models both with and without them. For computational efficiency, would omitting peepholes be current best practice? Without peepholes, all four affine transforms can be done in a single multiply, which I think leads to more GPU-friendly code.
[1] Kelvin Xu et al.: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (2015)
[2] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals: "Recurrent Neural Network Regularization" (2014)
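As a sketch of what I mean (illustrative only; dimensions as in the code above): without peepholes, the four gate pre-activations can share one stacked weight matrix, so each time step needs only one input-side and one recurrent matrix multiply:

```matlab
% Compact single-step LSTM without peepholes: stacked weights for z,i,f,o.
% W is [4*hid x vis], R is [4*hid x hid], b is [4*hid x 1].
vis = 700; hid = 500;
sigmoid = @(a) 1./(1+exp(-a));
W = rand(4*hid,vis); R = rand(4*hid,hid); b = rand(4*hid,1);
xt = rand(vis,1); ht = zeros(hid,1); ct = zeros(hid,1);
pre = W*xt + R*ht + b;              % all four pre-activations at once
z = tanh(pre(1:hid));               % block input
i = sigmoid(pre(hid+1:2*hid));      % input gate
f = sigmoid(pre(2*hid+1:3*hid));    % forget gate
o = sigmoid(pre(3*hid+1:4*hid));    % output gate
ct = z.*i + ct.*f;                  % new cell state
ht = tanh(ct).*o;                   % new hidden state
```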

Accepted Answer

Joss Knight, 15 April 2016
To get good performance out of the GPU, you need to give it a lot of data to process. Your best bet is to vectorize your code to remove the inner loop. Your sigmoid and tanh activation functions, for instance, are element-wise operators and so should vectorize trivially, while your matrix multiplies can be executed in batch using pagefun.
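For example (a sketch under the assumption that the whole input sequence is known up front): the input-side products W*x do not depend on the recurrent state, so they can be batched over all time steps with pagefun before the sequential loop, leaving only the R*h terms inside it:

```matlab
% Batch the input-side matrix multiplies over all T time steps with pagefun.
vis = 700; hid = 500; T = 80;
W_z = rand(hid,vis,'gpuArray');
x = gpuArray(rand(vis,1,T));
Zx = pagefun(@mtimes, W_z, x);   % hid x 1 x T: W_z*x(:,:,t) for every t
% Inside the time loop, use the precomputed slice instead of W_z*x(:,:,t):
%   G(:,t) = tanh(Zx(:,:,t) + R_z*h(:,:,t) + b_z);
```

The same precomputation applies to W_i, W_f, and W_o, cutting four matrix multiplies per time step out of the sequential loop.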
Alternatively, have you considered using the new Deep Learning features in the Neural Network Toolbox in MATLAB R2016a, or the free 3rd party deep learning solution MatConvNet?
2 comments
Yuto Ozaki, 16 April 2016
Joss,
Thank you for your reply. I just tried training larger samples with mini-batches, and it yielded about a 35% speedup on the GPU. However, I think removing the inner loop would be challenging, since an RNN takes its input from the previous state, which makes the sequential for-loop essential to the algorithm.
I have checked the Neural Network Toolbox, but it does not seem to implement RNNs. My main interest is music information retrieval, so time-series models such as RNNs and their variants are my main focus.
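For reference, the mini-batch change I tried amounts to widening the state vectors into matrices whose columns are batch samples (a sketch; batchSize is illustrative):

```matlab
% Mini-batching: each hidden/cell state becomes hid x batchSize, so every
% matrix multiply processes the whole batch at once (more GPU-friendly).
vis = 700; hid = 500; batchSize = 64;
sigmoid = @(a) 1./(1+exp(-a));
W_z = rand(hid,vis,'gpuArray'); R_z = rand(hid,hid,'gpuArray');
b_z = rand(hid,1,'gpuArray');
xt = rand(vis,batchSize,'gpuArray');   % one time step, whole batch
ht = zeros(hid,batchSize,'gpuArray');
% bsxfun broadcasts the bias across the batch columns (needed in R2016a)
G = tanh(bsxfun(@plus, W_z*xt + R_z*ht, b_z));
```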
Joss Knight, 20 April 2016
Support for RNNs is considered high priority by the development team. Meanwhile, take a look at MatConvNet.


More Answers (0)
