This particular problem is a good example of something that is solved better with a custom kernel, but many real problems aren't like that. The MATLAB version of the game of life has to launch multiple kernels to do all the indexing and then create the mask, while the custom kernel can do all this communication between the different cells in a single kernel using shared memory.
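To make the contrast concrete, here is a minimal sketch of a vectorized game of life step in plain MATLAB. It is not the code from the question: the function name lifeStep, the wrap-around boundaries and the 5-by-5 test grid are all assumptions made for illustration. Each indexing expression and each arithmetic or logical step is a separate array operation, which is the pattern described above as needing multiple kernel launches when the grid is a gpuArray.

```matlab
% Script part: advance a "blinker" pattern by one generation.
grid = false(5);
grid(3, 2:4) = true;        % horizontal bar of three live cells
next = lifeStep(grid);      % pass gpuArray(grid) instead to run on a GPU
disp(next)                  % the bar is now vertical (rows 2 to 4 of column 3)

% Hypothetical helper: one generation, with wrap-around edges (an assumption).
function grid = lifeStep(grid)
    n = size(grid, 1);
    m = size(grid, 2);
    up    = [n, 1:n-1];     % row index of the neighbour above, wrapping
    down  = [2:n, 1];       % row index of the neighbour below, wrapping
    left  = [m, 1:m-1];     % column index of the neighbour to the left
    right = [2:m, 1];       % column index of the neighbour to the right

    % Eight separate indexing operations, then seven additions.
    neighbours = grid(up, left)   + grid(up, :)   + grid(up, right) + ...
                 grid(:, left)                    + grid(:, right)  + ...
                 grid(down, left) + grid(down, :) + grid(down, right);

    % The mask: a live cell survives with 2 neighbours; any cell with 3 is live.
    grid = (grid & (neighbours == 2)) | (neighbours == 3);
end
```

A hand-written kernel could instead have each thread read its eight neighbours from a shared-memory tile and write the result in a single pass.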
In real-life code you most often see two types of algorithm: those where most of the cost lies in single MATLAB function calls to computationally intensive operations like fft or svd, and those that consist of long sequences of element-wise operations. In the first case, as you correctly surmise, MATLAB does a very good job of providing just about the best implementation you can write for that function, albeit perhaps limited by its need to be completely general rather than specific to your problem. In the second case, MATLAB does a good job of creating kernels on the fly that merge all the element-wise operations, or there's always arrayfun to fine-tune that yourself.
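As a rough sketch of the second case, the snippet below evaluates the same element-wise expression twice on gpuArray data: once as ordinary array code, and once through arrayfun, which compiles the function for element-by-element execution on the GPU. The function dampedWave and the input sizes are made up for illustration, and the script assumes Parallel Computing Toolbox and a supported GPU are available.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU.
x = gpuArray(linspace(0, 1, 1e6));
y = gpuArray(linspace(1, 2, 1e6));

zArray = dampedWave(x, y);              % ordinary element-wise array code
zFused = arrayfun(@dampedWave, x, y);   % same function applied per element

% Bring the largest discrepancy back to the host for comparison.
maxDiff = gather(max(abs(zArray - zFused)));
fprintf('Largest difference between the two results: %g\n', maxDiff);

% Hypothetical element-wise function; works on whole arrays and on scalars.
function z = dampedWave(x, y)
    z = exp(-x .* y) .* sin(2 * pi * x) + y .^ 2;
end
```

Timing both versions with gputimeit on your own expression is the simplest way to see whether the explicit arrayfun call buys you anything.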
But nothing can beat writing your own CPU and GPU code specifically honed to your problem, as long as you trust yourself to write the best possible algorithm. If you do, it's worth it. If not, you may have better luck working with MATLAB's strengths: for instance, vectorizing your code and using the profiler to find and optimize bottlenecks.
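For the profiler route, a minimal sketch might look like the following. The loop function sumSquaresLoop is a made-up stand-in for whatever your real bottleneck is, and it is compared with a vectorized one-liner. The script reads the results programmatically with profile('info'); calling profile viewer instead opens the usual interactive report.

```matlab
x = rand(1, 1e6);

profile on
sLoop = sumSquaresLoop(x);   % loop version (hypothetical bottleneck)
sVec  = sum(x .^ 2);         % vectorized version of the same computation
profile off

% List up to five functions with the largest total time.
info = profile('info');
[~, order] = sort([info.FunctionTable.TotalTime], 'descend');
for k = order(1:min(5, numel(order)))
    fprintf('%-40s %8.4f s\n', ...
        info.FunctionTable(k).FunctionName, info.FunctionTable(k).TotalTime);
end

% Hypothetical helper: sum of squares written as an explicit loop.
function s = sumSquaresLoop(x)
    s = 0;
    for k = 1:numel(x)
        s = s + x(k)^2;
    end
end
```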
There has also been talk of providing a MATLAB function that generalizes stencil-type operations like convolutions. If we do this in a future version, we may be able to offer a better solution to a large set of problems than directing users to mexcuda.