Fujifilm .RAF Unpack Slow LibRaw 0.22

Hi there.

As part of initial integration testing of LibRaw 0.22 with my application, I came across an issue: the new version unpacks (before demosaicing) Fujifilm GFX100 II images almost 10x slower than version 0.21.5. I've benchmarked other camera formats and have not seen such a difference. Reviewing the release notes, I see the GFX100 II is now officially supported. I notice a very subtle difference between the 0.22 and 0.21.5 outputs, but nothing that would explain such a huge difference in performance.

I'll triple-check my compilation settings and eventually go through the LibRaw repo and diff the relevant Fujifilm source files, but the question at this stage is: is there a known reason for such a disparity in performance between versions, specifically for GFX100 II images? Did the RAF file decoder change radically?

Any help would be appreciated.

Regards,

Sean.

Attachment: Libraw 022_Decode RAF.png (226.09 KB)

Quick test (only unpack) with LibRaw compiled via make -f Makefile.dist (so no openmp), three files with different encoding:

0.22.0:

$ time ./bin/unprocessed_raw ~/tt/GFX100II/GFX100_II_1*RAF
Processing file /home/lexa/tt/GFX100II/GFX100_II_14bits_uncompressed_4x3.RAF
Image size: 11662x8746
Raw size: 11808x8754
Margins: top=2, left=0
Unpacked....
Stored to file /home/lexa/tt/GFX100II/GFX100_II_14bits_uncompressed_4x3.RAF.pgm
Processing file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossless_4x3.RAF
Image size: 11662x8746
Raw size: 11808x8754
Margins: top=2, left=0
Unpacked....
Stored to file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossless_4x3.RAF.pgm
Processing file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossy_4x3.RAF
Image size: 11662x8746
Raw size: 11808x8754
Margins: top=2, left=0
Unpacked....
Stored to file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossy_4x3.RAF.pgm

real    0m22,756s
user    0m21,958s
sys     0m0,783s

0.21.5:

$ time ./bin/unprocessed_raw ~/tt/GFX100II/GFX100_II_1*RAF
Processing file /home/lexa/tt/GFX100II/GFX100_II_14bits_uncompressed_4x3.RAF
Image size: 11662x8752
Raw size: 11808x8754
Margins: top=2, left=0
Unpacked....
Stored to file /home/lexa/tt/GFX100II/GFX100_II_14bits_uncompressed_4x3.RAF.pgm
Processing file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossless_4x3.RAF
Image size: 11662x8752
Raw size: 11808x8754
Margins: top=2, left=0
Unpacked....
Stored to file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossless_4x3.RAF.pgm
Processing file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossy_4x3.RAF
Image size: 11662x8752
Raw size: 11808x8754
Margins: top=2, left=0
Unpacked....
Stored to file /home/lexa/tt/GFX100II/GFX100_II_16bits_lossy_4x3.RAF.pgm

real    0m22,798s
user    0m21,929s
sys     0m0,854s

CPU: Intel(R) Atom(TM) CPU C3758 @ 2.20GHz (2200.21-MHz K8-class CPU)
Storage: fast SSD (nvme)

So, please provide the test files you use for benchmarking so we can study this in more depth.

-- Alex Tutubalin @LibRaw LLC

Hi Alex.

Thanks for looking into this.

Mea culpa! Your response prompted me to look into this again, so I checked my LibRaw 0.22 compile flags and found that the LibRaw VS project did not have the /openmp flag set for Debug mode. This accounted for the big performance difference (on a 16-core machine).
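A quick way to confirm whether a given build configuration actually has OpenMP switched on is to test the standard _OPENMP macro, which compilers define only when OpenMP support is enabled. The sketch below is not part of LibRaw; openmp_enabled, openmp_max_threads and report_openmp are hypothetical helper names, and the result only reflects the flags used for the translation unit it is compiled in, so it is meaningful only if it is built with the same settings as the library.

```cpp
#include <cstdio>

#ifdef _OPENMP
#include <omp.h>
#endif

// True only if this translation unit was compiled with OpenMP enabled
// (for example /openmp with MSVC, -fopenmp with GCC or Clang).
bool openmp_enabled()
{
#ifdef _OPENMP
  return true;
#else
  return false;
#endif
}

// Upper bound on the threads a parallel region may use; 1 without OpenMP.
int openmp_max_threads()
{
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}

// Hypothetical helper: print the OpenMP status once at startup.
void report_openmp()
{
  std::printf("OpenMP: %s, max threads: %d\n",
              openmp_enabled() ? "enabled" : "disabled",
              openmp_max_threads());
}
```

Calling report_openmp() once in both the Debug and Release configurations would make a missing flag visible before any benchmarking is done.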

Very sorry to have you chase a ghost.

I'll be more thorough before posting next time.

Regards,

Sean.

For the files you provided, with OpenMP enabled, the difference is negligible:

0.22
 
$ time ./bin/unprocessed_raw ~/tt/GFX/*RAF
Processing file /home/lexa/tt/GFX/DSCF0029.RAF
[skip]
Processing file /home/lexa/tt/GFX/DSCF0307.RAF
[skip]
Processing file /home/lexa/tt/GFX/DSCF0614.RAF
[skip]
 
real    0m4,604s
user    0m32,124s
sys     0m0,811s
 
0.21
 
$ time ./bin/unprocessed_raw ~/tt/GFX/*RAF
Processing file /home/lexa/tt/GFX/DSCF0029.RAF
[skip]
Processing file /home/lexa/tt/GFX/DSCF0307.RAF
[skip]
Processing file /home/lexa/tt/GFX/DSCF0614.RAF
[skip]
 
real    0m4,687s
user    0m32,431s
sys     0m0,758s

Also, the diff in src/decoders/fuji_compressed.cpp is very small (it only changes error handling in the OpenMP case):

diff --git a/src/decoders/fuji_compressed.cpp b/src/decoders/fuji_compressed.cpp
index acea0825..40d92d78 100644
--- a/src/decoders/fuji_compressed.cpp
+++ b/src/decoders/fuji_compressed.cpp
@@ -229,9 +229,9 @@ static inline void fuji_fill_buffer(fuji_compressed_block *info)
 {
   if (info->cur_pos >= info->cur_buf_size)
   {
+    bool needthrow = false;
     info->cur_pos = 0;
     info->cur_buf_offset += info->cur_buf_size;
-    bool needthrow = false;
 #ifdef LIBRAW_USE_OPENMP
 #pragma omp critical
 #endif
@@ -1155,14 +1155,16 @@ void LibRaw::fuji_decode_loop(fuji_compressed_params *common_info, int count, IN
   const int lineStep = (libraw_internal_data.unpacker_data.fuji_total_lines + 0xF) & ~0xF;
 #ifdef LIBRAW_USE_OPENMP
   unsigned errcnt = 0;
-#pragma omp parallel for private(cur_block)
+#pragma omp parallel for private(cur_block) shared(errcnt)
 #endif
   for (cur_block = 0; cur_block < count; cur_block++)
   {
-    try{
+    try
+    {
       fuji_decode_strip(common_info, cur_block, raw_block_offsets[cur_block], block_sizes[cur_block],
                         q_bases ? q_bases + cur_block * lineStep : 0);
-    }  catch (...)
+    }
+    catch (...)
     {
 #ifdef LIBRAW_USE_OPENMP
 #pragma omp atomic

In fact, the only real difference is that the errcnt variable is now explicitly declared OpenMP-shared (it is updated atomically when an error is caught, so the performance difference should be negligible).
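To illustrate why that change should not matter for speed, here is a minimal generic sketch of the same pattern: a parallel loop whose body catches exceptions and bumps a shared error counter atomically. It is not LibRaw code; count_failed_tasks is a hypothetical name and the sketch keys off the standard _OPENMP macro instead of LIBRAW_USE_OPENMP. A variable declared outside the parallel region is shared by default, so the explicit shared(errcnt) clause mainly documents intent, and the atomic update is only executed on the error path.

```cpp
#include <functional>
#include <stdexcept>

// Runs task(i) for i in [0, count) and returns how many calls threw.
// Exceptions must not escape an OpenMP parallel region, so each one is
// caught inside the loop body and only counted.
// The callable must be safe to invoke from several threads at once.
unsigned count_failed_tasks(int count, const std::function<void(int)> &task)
{
  unsigned errcnt = 0;
#ifdef _OPENMP
#pragma omp parallel for shared(errcnt)
#endif
  for (int i = 0; i < count; i++)
  {
    try
    {
      task(i);
    }
    catch (...)
    {
      // Reached only when a task fails, so the atomic costs nothing
      // on the normal path.
#ifdef _OPENMP
#pragma omp atomic
#endif
      errcnt++;
    }
  }
  return errcnt;
}
```

Without OpenMP the pragmas are skipped and the loop runs serially with the same result.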

I can only recommend performing detailed profiling of both versions and comparing where exactly you're experiencing performance degradation at the individual operator level.
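Before reaching for a full profiler, a coarse wall-clock timer around the unpack step in each build can at least confirm where the time goes. The following is a minimal sketch, assuming C++11 or later; wall_ms is a hypothetical helper and not a LibRaw function.

```cpp
#include <chrono>
#include <utility>

// Runs fn once and returns the elapsed wall-clock time in milliseconds.
// steady_clock is used because it is monotonic and unaffected by
// system clock adjustments.
template <class Fn>
double wall_ms(Fn &&fn)
{
  const auto t0 = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

In an application this would wrap the call being compared, for example the unpack() call on an already opened file, with the same input file used for both library versions.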

Since I don't see any performance differences on our end, there's nothing to look for there.

-- Alex Tutubalin @LibRaw LLC

Followup:
compiled with clang 19.1.7
Compilation flags: -O3 -fopenmp

-- Alex Tutubalin @LibRaw LLC

Great to know that the problem has been resolved.

-- Alex Tutubalin @LibRaw LLC