@aras-p aras-p commented Sep 8, 2025

(Note: there's an alternative PR at #2126 -- much simpler, it simply makes tables be initialized on first use)

B44 compression is probably very rarely used, and even when it is, the tables are only ever used on channels marked as "linear" (EXR_PERCEPTUALLY_LINEAR), which is not the default. So this PR replaces the tables with just math, which removes about a megabyte of source code and makes OpenEXRCore-4_0.dll smaller by 254 kilobytes (868->614KB).
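A rough sanity check on the binary size numbers, as a sketch only: it assumes each lookup table stores one 16-bit result for every possible half bit pattern, which is not stated above. The names `HALF_BIT_PATTERNS`, `sketch_table_bytes` and `sketch_tables_kib` are hypothetical and do not come from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a half is 16 bits wide, so a table indexed by its raw
   bits needs 2^16 entries, each holding a 16-bit result. */
enum { HALF_BIT_PATTERNS = 1 << 16 };
static_assert(sizeof(uint16_t) == 2, "table entries are 16-bit values");

/* Bytes taken by one such table. */
static size_t sketch_table_bytes(void)
{
    return (size_t)HALF_BIT_PATTERNS * sizeof(uint16_t);
}

/* Size in KB (1024 bytes) of n such tables. */
static size_t sketch_tables_kib(size_t n)
{
    return n * sketch_table_bytes() / 1024;
}
```

Under that assumption one table is 128 KB and two tables are 256 KB, which is in the same ballpark as the 254 KB reduction reported for the DLL.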

Just naïvely replacing the tables with math makes things a bit slower, so to compensate this PR adds SIMD code paths (AVX2+F16C, SSE2, ARM NEON) next to the regular C code. The math done in all of them is identical, and the actual exp / log implementations follow the same range reduction / polynomial approximation as the Highway SIMD library (I tried several others, including the ones from OpenColorIO and OpenImageIO, but they did not produce bit-exact results compared to the previous tables).
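For readers unfamiliar with the "range reduction / polynomial approximation" idea, here is a minimal scalar sketch of it for exp. This is not the code from this PR and not Highway's implementation: the function name `sketch_expf` is hypothetical, the polynomial uses plain Taylor coefficients rather than fitted ones, and it makes no attempt at bit-exactness with the old tables.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: exp(x) = 2^k * exp(r), where k is the integer
   nearest to x / ln(2) and r = x - k * ln(2) is small. */
static float sketch_expf(float x)
{
    /* Assumption of this sketch: 2^k must stay a normal float. */
    assert(x > -87.0f && x < 88.0f);

    const float inv_ln2 = 1.44269504f;
    /* ln(2) split into a high and a low part, so that k * ln2_hi is
       exact for small k and r loses less precision. */
    const float ln2_hi = 0.693359375f;
    const float ln2_lo = -2.12194440e-4f;

    /* Range reduction: round x / ln(2) to the nearest integer. */
    float   t = x * inv_ln2;
    int32_t k = (int32_t)(t + (t >= 0.0f ? 0.5f : -0.5f));
    float   r = x - (float)k * ln2_hi;
    r = r - (float)k * ln2_lo;

    /* Polynomial for exp(r) on the reduced range, in Horner form:
       1 + r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720. */
    float p = 1.0f / 720.0f;
    p = p * r + 1.0f / 120.0f;
    p = p * r + 1.0f / 24.0f;
    p = p * r + 1.0f / 6.0f;
    p = p * r + 0.5f;
    p = p * r + 1.0f;
    p = p * r + 1.0f;

    /* Reconstruct: build 2^k directly in the float exponent bits. */
    uint32_t bits = (uint32_t)(k + 127) << 23;
    float    scale;
    memcpy(&scale, &bits, sizeof scale);
    return p * scale;
}
```

Every step here is branch-free arithmetic on independent values, which is why this kind of code maps well to SIMD lanes. A log implementation follows the same pattern in reverse: split off the exponent bits, then approximate the log of the remaining mantissa with a polynomial.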

Timings for exrmetrics on one B44 image

Testing this https://aras-p.info/files/exr_files/Blender281rgb16_lin.exr image (3840x2160, RGB FP16, channels marked as "linear"), using exrmetrics -m -z b44 --time write,reread --passes 50 --csv, times printed in milliseconds:

| Case | PC write time | PC reread time | Mac write time | Mac reread time |
|---|--:|--:|--:|--:|
| Lookup tables (main branch) | 20.7 | 14.1 | 3.6 | 4.3 |
| This PR, best SIMD (AVX2 or NEON) | 20.3 | 14.5 | 6.1 | 4.8 |
| This PR, SSE2 only | 22.7 | 17.6 | - | - |
| This PR, no SIMD | 29.7 | 21.0 | 7.8 | 8.3 |

("PC" is Ryzen 5950X, Windows / MSVC 2022 v17.14.12. "Mac" is MacBookPro M4 Max, Xcode 16.1)

This measures the duration of the whole compression/decompression. Without SIMD there is quite some slowdown, but with the best SIMD paths there's barely any overhead from using math instead of a lookup table. The exception is the Mac "write" case (3.6 -> 6.1 ms), but then on this Mac the times are crazy low to begin with; I guess due to the extremely large memory bandwidth that it has.

Timings for just the math/lookup comparison

Excluding everything else going on in B44 compression, here are times in milliseconds (one thread) to run to_linear and from_linear on 160 million numbers:

| Case | PC to_linear | PC from_linear | Mac to_linear | Mac from_linear |
|---|--:|--:|--:|--:|
| Lookup tables (main branch) | 64 | 65 | 43 | 41 |
| This PR, best SIMD (AVX2 or NEON) | 173 | 86 | 117 | 107 |
| This PR, SSE2 only | 472 | 377 | - | - |
| This PR, no SIMD | 1041 | 568 | 475 | 264 |

This shows that, as expected, even with the best SIMD path the "do the math" approach is roughly 1.3x to 2.7x slower than doing a table lookup, and roughly 6x to 16x slower if not using SIMD. However, again see above: B44 compression seems to be limited more by memory bandwidth, so doing this extra math does not slow things down much, if at all.

And again, all of this only affects B44/B44A compression, and only when image channels are marked as "linear" (which is not the default setting for ImfChannel). I did not actually find any B44+Linear images anywhere, so I had to make my own using code.

B44 compression is probably very rarely used, and even when it is,
the table is only ever used on channels marked as "linear"
(EXR_PERCEPTUALLY_LINEAR), which is not the default.

This does make B44 compression with "linear" channels slower:
compressing 4K resolution image on Ryzen 5950X / Windows / VS2022,
exrmetrics write time goes 0.081 -> 0.210. The math is always
done on 16 value chunks and should be possible to SIMD it, but
not sure if worth it.

This does cut down half a megabyte of source code, and makes
OpenEXRCore-4_0.dll be smaller by 126 kilobytes (868->742KB)

Signed-off-by: Aras Pranckevicius <aras@nesnausk.org>
B44 compression is probably very rarely used, and even when it is,
the table is only ever used on channels marked as "linear"
(EXR_PERCEPTUALLY_LINEAR), which is not the default.

This does make B44 decompression with "linear" channels slower:
decompressing 4K resolution image on Ryzen 5950X / Windows / VS2022,
exrmetrics re-read time goes 0.038 -> 0.157. The math is always
done on 16 value chunks and should be possible to SIMD it, but
not sure if worth it.

This does cut down half a megabyte of source code, and makes
OpenEXRCore-4_0.dll be smaller by 128 kilobytes (742->614KB)

Signed-off-by: Aras Pranckevicius <aras@nesnausk.org>
Signed-off-by: Aras Pranckevicius <aras@nesnausk.org>
…r_16

AVX2+F16C, SSE2, NEON

Signed-off-by: Aras Pranckevicius <aras@nesnausk.org>
@meshula meshula left a comment

I prefer the math approach overall, however I'm marking both of these as approved from my point of view.
