LookUp Table for 16-Bit Mat Efficient way?

The source code of LUT is in this file:

OpenCV can perform lookup-table transforms efficiently in several ways. One is the Intel IPP library (class IppLUTParallelBody_LUTCN, for 3- or 4-channel images). If you have Intel IPP, you can copy the code of this class, use ippiLUTPalette_16u_C3R instead of ippiLUTPalette_8u_C3R, and fix the initialization accordingly.

Another possible way is the OpenCL library (for GPU), which is invoked from ocl_LUT (sorry, I have no experience with it, so I can't give any advice).

Otherwise it uses the LUTParallelBody/IppLUTParallelBody_LUTCN classes (corresponding to single- and multi-channel images). These classes use the LUT8u_ template function. No rocket science here: it just iterates over the image and substitutes the values. So you can simply copy and paste IppLUTParallelBody and use a slightly different function inside the loop. The ParallelLoopBody base class uses a library like OpenMP or Intel TBB to run the loop in multiple threads. I suppose you don't have to modify anything in it to make it work with the new function.
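To make the last option more concrete, here is a rough, library-free sketch of what a 16-bit version of the inner loop and the per-row-range body might look like. The names lut16u and Lut16Body are made up for illustration and are not part of OpenCV. The sketch assumes a continuous buffer (no padding between rows) and a table with 65536 entries per LUT channel, so that every 16-bit value is a valid index. The body is called directly here; in OpenCV the same row range would arrive through the operator() of a ParallelLoopBody subclass run by cv::parallel_for_.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-bit analogue of the 8-bit element loop: dst[i] = lut[src[i]].
// len   = number of pixels to process
// cn    = number of image channels
// lutcn = number of LUT channels: 1 means one table shared by all channels,
//         cn means per-channel tables stored interleaved
//         (the entry for value v of channel k is lut[v * cn + k]).
void lut16u(const std::uint16_t* src, const std::uint16_t* lut,
            std::uint16_t* dst, std::size_t len, int cn, int lutcn)
{
    const std::size_t ucn = static_cast<std::size_t>(cn);
    const std::size_t total = len * ucn;
    if (lutcn == 1) {
        for (std::size_t i = 0; i < total; ++i)
            dst[i] = lut[src[i]];
    } else {
        for (std::size_t i = 0; i < total; i += ucn)
            for (std::size_t k = 0; k < ucn; ++k)
                dst[i + k] = lut[static_cast<std::size_t>(src[i + k]) * ucn + k];
    }
}

// Hypothetical stand-in for a loop body that handles rows [rowStart, rowEnd).
// Assumption: src and dst are continuous, i.e. each row holds cols * cn values
// with no padding between rows.
struct Lut16Body {
    const std::uint16_t* src;
    std::uint16_t* dst;
    const std::uint16_t* lut;
    int cols;
    int cn;
    int lutcn;

    void operator()(int rowStart, int rowEnd) const
    {
        const std::size_t rowElems =
            static_cast<std::size_t>(cols) * static_cast<std::size_t>(cn);
        for (int y = rowStart; y < rowEnd; ++y) {
            const std::size_t offset = static_cast<std::size_t>(y) * rowElems;
            lut16u(src + offset, lut, dst + offset,
                   static_cast<std::size_t>(cols), cn, lutcn);
        }
    }
};
```

Since each row range writes only to its own part of dst, different ranges can be processed by different threads without synchronization. With a real CV_16U Mat the row pointers would come from src.ptr<ushort>(y) rather than from manual offsets, which also handles non-continuous matrices.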

