ImageMagick signatures are different when using LibRaw 0.20.2 vs 0.21.1

I'm trying to do duplicate detection in my photo library and I was experimenting with using ImageMagick's identify -verbose tool to get a signature (SHA256 hash) of the pixel data.

The problem: the signatures from my Mac, which uses the Homebrew distribution of LibRaw (the latest, 0.21.1), did not match the ones from my Ubuntu 22.04 machine (which gets LibRaw 0.20.2). Once I built 0.21.1 on the Linux machine, the signatures were identical as expected, so I don't think it's an ImageMagick issue.

I'm posting here because I want to understand if I was wrong in expecting LibRaw to produce stable output in terms of the raw pixel data; i.e., the same file will yield the same pixel data regardless of which version of LibRaw you use. (If I'm using the wrong term when I say "raw pixel data", please educate me. :) )

If the answer is "sorry, the pixels you get are subject to change according to how our raw processing logic evolves" that's fine -- but I would be curious to know if there's a way to get something that is entirely stable out of LibRaw so I can compute a hash from that instead. To be clear, I don't need to do any image processing -- I just want to know if the image data is identical even if metadata got changed, like the capture time or something.

Sorry, we do not know how the identify -verbose tool works. Is it possible that it checksums the rendered image rather than the RAW data?

-- Alex Tutubalin @LibRaw LLC

RAW data vs. rendered image

Ahh thanks, I think you just helped me understand something.

I haven't checked, but I'm almost certain you're right about it using the rendered image. My understanding is that ImageMagick delegates all the decoding to libraw, libjpeg, libtiff, libpng, etc., and in my case it doesn't necessarily know it's dealing with a RAW image by the time it creates the signature.

So let's say I wanted to write my own signature program using LibRaw that only operates on the image data, leaving the metadata completely out of it. After a quick look at the API, my best guess is that I'd want to hash the contents of libraw_rawdata_t. Does that sound right?

Yes, one of the *image pointers in libraw_rawdata_t will be non-NULL after LibRaw::unpack(). It will contain imgdata.sizes.raw_height rows of imgdata.sizes.raw_width items each, with a row pitch of imgdata.sizes.raw_pitch bytes.

-- Alex Tutubalin @LibRaw LLC
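
To illustrate the rows/items/pitch layout described in this reply, here is a minimal sketch of hashing a pitched buffer row by row, so that any padding bytes between the end of a row's payload and the start of the next row are left out of the hash. It deliberately does not call LibRaw. The names hash_rows and Fnv1a64 are hypothetical, and FNV-1a is only a small non-cryptographic stand-in for something like SHA-256 or XXH128.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// FNV-1a 64-bit: a simple, non-cryptographic stand-in for a real hash.
struct Fnv1a64 {
    std::uint64_t state = 0xcbf29ce484222325ULL; // FNV offset basis
    void update(const std::uint8_t *data, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) {
            state ^= data[i];
            state *= 0x100000001b3ULL; // FNV prime
        }
    }
};

// Hash `rows` rows of `row_bytes` payload bytes each, where consecutive rows
// start `pitch` bytes apart. Bytes between row_bytes and pitch are skipped.
std::uint64_t hash_rows(const std::uint8_t *base, std::size_t rows,
                        std::size_t row_bytes, std::size_t pitch) {
    assert(row_bytes <= pitch); // payload must fit inside one row stride
    Fnv1a64 h;
    for (std::size_t r = 0; r < rows; ++r)
        h.update(base + r * pitch, row_bytes);
    return h.state;
}
```

With LibRaw, the arguments would presumably be the non-NULL image pointer cast to bytes, raw_height, raw_width times the item size, and raw_pitch. The item size depends on which pointer is set (for example, 2 bytes per item if raw_image is the one in use), so treat that mapping as an assumption to verify against the LibRaw headers.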

Thank you very much. This is all new to me but I'm eager to get into it.

One last question - I'm completely new to working with RAW processing, but I've been a coder for 25 years. Can you recommend any conceptual documentation or reference material that will help me understand RAW processing better? There are lots of search results, but if you have a recommended resource I would love to know about it. (Apologies if it's on page 1 of your documentation and I just missed it.)

Thanks again!

My solution

I stripped down the unprocessed_raw sample and achieved what I want (I think) by piping the output through a hash utility like sha256sum or xxh128sum.

Posting here in case it helps someone later:

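#include <stdio.h>
#include <stdint.h>
#include "libraw/libraw.h"

// Writes the unpacked raw buffer to stdout so it can be piped to a hash utility.
int main(int ac, char *av[])
{
	if (ac < 2)
	{
		fprintf(stderr, "Usage: %s raw_file\n", av[0]);
		return 1;
	}
	LibRaw RawProcessor;
	int ret;
	if ((ret = RawProcessor.open_file(av[1])) != LIBRAW_SUCCESS)
	{
		fprintf(stderr, "Cannot open %s: %s\n", av[1], libraw_strerror(ret));
		return 1;
	}
	if ((ret = RawProcessor.unpack()) != LIBRAW_SUCCESS)
	{
		fprintf(stderr, "Cannot unpack %s: %s\n", av[1], libraw_strerror(ret));
		return 1;
	}
	// raw_pitch is the row stride in bytes, so height * pitch covers the whole buffer.
	size_t bytes = (size_t)RawProcessor.imgdata.sizes.raw_height * RawProcessor.imgdata.sizes.raw_pitch;
	const void *buf = RawProcessor.imgdata.rawdata.raw_alloc;
	// Guard against a missing buffer and against a short write (e.g. a closed pipe).
	if (!buf || fwrite(buf, sizeof(uint8_t), bytes, stdout) != bytes)
	{
		fprintf(stderr, "Cannot write raw data for %s\n", av[1]);
		return 1;
	}
	fflush(stdout);
	return 0;
}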

Compile with g++ rawbytes.cpp -o rawbytes -Ofast -lraw -lm.
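
Since the goal at the top of the thread is duplicate detection, here is a minimal sketch of the step that would come after hashing: grouping files whose digests match. It assumes the (path, digest) pairs have already been collected, for example by running each file through rawbytes and a hash utility. The function name find_duplicates and the sample file names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Group file paths by digest and keep only the groups with two or more members.
// Each input pair is (path, digest); groups come back ordered by digest string.
std::vector<std::vector<std::string>>
find_duplicates(const std::vector<std::pair<std::string, std::string>> &path_and_digest) {
    std::map<std::string, std::vector<std::string>> by_digest;
    for (const auto &entry : path_and_digest) {
        assert(!entry.second.empty()); // an empty digest suggests a failed hash run
        by_digest[entry.second].push_back(entry.first);
    }
    std::vector<std::vector<std::string>> groups;
    for (auto &kv : by_digest)
        if (kv.second.size() > 1)
            groups.push_back(std::move(kv.second));
    return groups;
}
```

For example, feeding it {"a.cr2", "d1"}, {"b.cr2", "d2"}, {"c.cr2", "d1"} yields one group containing a.cr2 and c.cr2. Matching digests only say that the unpacked buffers are byte-identical under the LibRaw version used, so hashes from different LibRaw versions should probably not be mixed.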