Friday, October 2, 2009

qcms — now faster

Thanks to some optimization work by Steve Snyder, qcms is even faster than before.

What follows is a chart with the new performance numbers:

The benchmark is the same as last time but run on a slightly slower computer using OS X v10.6 instead of 10.5. As the chart shows, the new qcms code is more than twice as fast as the previous code. In addition to the performance improvement, the new code includes a version that only uses SSE1 instructions. This will be especially helpful for those with older computers where the time spent doing color correction isn't as negligible as it is on faster computers.

When running this benchmark again, I noticed that the performance of lcms had drastically improved since the last time I had run the benchmark. Why was lcms so much faster on 10.6? What had changed? The default architecture target: in OS X 10.6, the compiler builds 64 bit binaries by default. Still, it was surprising that compiling for 64 bit could nearly double the performance.

The large difference, it turns out, can largely be attributed to the MAT3evalW function [1]. This function multiplies a 1×3 matrix with a 3×3 one using nine 32×32→64 multiplications. GCC can usually optimize these multiplications by using the 32×32→64 multiply instructions; however, that wasn't happening in 32 bit mode. Instead of the expected 9 multiplies, we get 18 multiplies and a bunch of housekeeping work, likely caused by the 64 bit additions and additional register pressure. In 64 bit mode, however, we get the code that you'd expect. This only takes 38 instructions versus the 169 instructions the 32 bit build uses. With a difference like that in the inner loop, it's easy to see why the 64 bit build is so much faster.
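To make the shape of that computation concrete, here is a minimal sketch of a 1×3 by 3×3 fixed-point multiply built from nine 32×32→64 products. This is not the lcms source: the 16.16 format, the `fixed16_16` type, and the names `widening_mul` and `vec3_mul_mat3` are assumptions chosen for illustration. Whether each product compiles down to a single widening multiply instruction depends on the compiler and the target.

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type, for illustration only. */
typedef int32_t fixed16_16;

#define FIXED_ONE ((fixed16_16)1 << 16)

/* One 32x32->64 product: both operands are 32 bits wide, and the
 * casts ask the compiler for the full 64-bit result. */
static inline int64_t widening_mul(fixed16_16 a, fixed16_16 b)
{
    return (int64_t)a * (int64_t)b;
}

/* out = v * m, where v is a 1x3 row vector and m is a 3x3 matrix.
 * Each output element takes three widening multiplies and two
 * 64-bit additions, so the whole function takes nine multiplies. */
void vec3_mul_mat3(fixed16_16 out[3],
                   const fixed16_16 v[3],
                   const fixed16_16 m[3][3])
{
    for (int col = 0; col < 3; col++) {
        int64_t acc = widening_mul(v[0], m[0][col])
                    + widening_mul(v[1], m[1][col])
                    + widening_mul(v[2], m[2][col]);
        /* Drop the 16 extra fractional bits. Right-shifting a negative
         * value is implementation-defined in C; common compilers use
         * an arithmetic shift. */
        out[col] = (fixed16_16)(acc >> 16);
    }
}
```

The 64-bit accumulator is where a 32 bit target has to do extra work: every addition becomes an add plus an add-with-carry, and each running sum occupies two registers.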

1. MAT3evalW has a handwritten assembly version that should be faster than the one that GCC generates. Unfortunately, it is MSVC only.

6 comments:

Zack Weinberg said...

What does MAT3evalW correspond to in QCMS?

Jeff Muizelaar said...

There's no direct correspondence in qcms, but code for doing the matrix multiplication exists in all of the qcms_transform_data_*() specializations.

Caspy7 said...

Curious, what version of Firefox will this most likely find itself in? 3.7?

Jeff Muizelaar said...

@Mark: Certainly in 3.7 and, depending on how things go, we may be able to get it in for 3.6.

Dave said...

Very nice optimizations here. I'm interested in comparing it against what we're using (KodakCMS), however it doesn't look like these optimizations are included in the Git repository found here:

http://cgit.freedesktop.org/~jrmuizel/qcms/

Any chance of seeing these optimizations there? If not, where is qcms being maintained that I can find the latest code?

Thanks for sharing some great technology.

Jeff Muizelaar said...

@Dave: The newer optimizations are in the git repository now, though there might be some build issues.