Using ImageMagick.NET to compare images

Using ImageMagick.NET to compare images - c#

I need to do a "fuzzy" image comparison in c# - I have used ImageMagick.NET for stuff in the past and know it's good for the job.
There is a compare command in Image Magick: http://www.imagemagick.org/script/compare.php
And there is a Compare(Image reference) method in ImageMagick.NET however it seems that it's be hugely simplified so there is no way of getting at the verbose output.
I need to be able to get at that so I can match the images using a threshold. Am I missing something - is there a way to get this stuff into ImageMagick.NET if there isn't already? (I'm no C++ dev by a long shot) or am I barking up the wrong tree?

Pardon me if I don't get your question, but won't IsImagesEqual or SimilarityImage work?
IsImagesEqual returns "The normalized maximum quantization error for any single pixel in the image. This distance measure is normalized to a range between 0 and 1. It is independent of the range of red, green, and blue values in your image.
A small normalized mean square error, accessed as image->normalized_mean_error, suggests the images are very similar in spatial layout and color."
The corresponding method in the .NET bindings is Image.Compare which takes an image and returns a bool. However, if the result is false - the mean error (according to the metric above) is set on the current instance's meanErrorPerPixel, normalizedMaxError, and normalizedMeanError.
Aren't these three metrics enough to give you the result of your "fuzzy" compare?

Related

Best way to draw 'paths' to an array?

I have a question about drawing lines/paths on my own.
I use a combination of C#/WPF/Cudafy for UI and some calculations (e.g. the paths). Now I have a Byte[] array that should be filled with 'colors'/values (array-length = 4 * width * height of the result image).
I got some startpoints for the lines and one endpoint (somewhere between the startpoints). First I calculated some paths from those startpoints to the endpoints and then want to 'draw' them to the array (that will be used to construct a WriteableBitmap). The point coordinates are present in a 'reduced environment' though (since calculation of the paths needed to run a Dijkstra algorithm).
My paths are now defined by Tuples holding the point-coordinates (reduced size) and a 'linewidth'.
Since some paths may 'overlap' I thought I will do the following steps to ensure a nice looking of the result:
Merge the paths:
For that I will take one path and just keep it. Then I take the second and check if the path-points are somewhere near a path already added (like a near-neighbor search). I want to do this because in the end, I want to widen the line-width where paths overlap (3rd Tuple value).
When finished, I want to 'interpolate' the paths:
I don't really know how I should do that, since every path has a point at every (reduced-size) pixel.
One possibility would be to clear out all those path-coortinates of the paths that 'lie on a line' (and are not really necessary) and then do something like a Bezier - Interpolation. But all these steps seem to be overkill to me.
Don't you think there might be a better way to do this? If so, please share your thoughts :)
Thank's for any help!
Here's a link to an image of how it looks right now: CPVL Application

vbPixel position to real pixel position

I am importing data that was exported from a vb6 application into a new app made in c#.net.
The pixel coordinates in the data are in vbPixels. Is there a way to convert them into the real pixel coordinate? The bitmap is 800x500 and the pixels are like x=2265 y=1620.

Use these functions from .NET:
ToPixelsX - Used for coordinate conversion.
ToPixelsY - Used for coordinate conversion.
And read this to understand what is going on. Twips have a very special definition that is dependent on resolution.
In my previous answer I assumed that you knew that the give coordinate was the size of the image.

in VB6 you can use
Screen.TwipsPerPixelx
and
Screen.TwipsPerPixelY
These are almost always 15, but the user could change some settings which might result in other values (i am not sure which settings though :))

Finding matches between high quality and low quality, pixelated images - is it possible ? How?

I have a problem. My company has given me an awfully boring task. We have two databases of dialog boxes. One of these databases contains images of horrific quality, the other very high quality.
Unfortunately, the dialogs of horrific quality contain important mappings to other info.
I have been tasked with, manually, going through all the bad images and matching them to good images.
Would it be possible to automate this process to any degree? Here is an example of two dialog boxes (randomly pulled from Google images) :
So I am currently trying to write a program in C# to pull these photos from the database, cycle through them, find the ones with common shapes, and return theird IDs. What are my best options here ?

I really see no reason to use any external libraries for this, I've done this sort of thing many times and the following algorithm works quite well. I'll assume that if you're comparing two images that they have the same dimensions, but you can just resize one if they don't.
badness := 0.0
For x, y over the entire image:
r, g, b := color at x,y in image 1
R, G, B := color at x,y in image 2
badness += (r-R)*(r-R) + (g-G)*(g-G) + (b-B)*(b-B)
badness /= (image width) * (image height)
Now you've got a normalized badness value between two images, the lower the badness, the more likely that the images match. This is simple and effective, there are a variety of things that make it work better or faster in certain cases but you probably don't need anything like that. You don't even really need to normalize the badness, but this way you can just come up with a single threshold for it if you want to look at several possible matches manually.
Since this question has gotten some more attention I've decided to add a way to speed this up in cases where you are processing many images many times. I used this approach when I had several tens of thousands of images that I needed to compare, and I was sure that a typical pair of images would be wildly different. I also knew that all of my images would be exactly the same dimensions. In a situation in which you are comparing dialog boxes your typical images may be mostly grey-ish, and some of your images may require resizing (although maybe that just indicates a mis-match), in which case this approach may not gain you as much.
The idea is to form a quad-tree where each node represents the average RGB values of the region that node represents. So an 4x4 image would have a root node with RGB values equal to the average RGB value of the image, its children would have RGB values representing the average RGB value of their respective 2x2 regions, and their children would represent individual pixels. (In practice it is a good idea to not go deeper than a region of about 16x16, at that point you should just start comparing individual pixels.)
Before you start comparing images you will also need to decide on a badness threshold. You won't calculate badnesses above this threshold with any reliable accuracy, so this is basically the threshold at which you are willing to label an image as 'not a match'.
Now when you compare image A to image B, first compare the root nodes of their quad-tree representations. Calculate the badness just as you would for a single pixel image, and if the badness exceeds your threshold then return immediately and report the badness at this level. Because you are using normalized badnesses, and since badnesses are calculated using squared differences, the badness at any particular level will be equal to or less than the badness at lower levels, so if it exceeds the threshold at any points you know it will also exceed the threshold at the level of individual pixels.
If the threshold test passes on an nxn image, just drop to the next level down and compare it like it was a 2nx2n image. Once you get low enough just compare the individual pixels. Depending on your corpus of images this may allow you to skip lots of comparisons.

I would personally go for an image hashing algorithm.
The goal of image hashing is to transform image content into a feature sequence, in order to obtain a condensed representation.
This feature sequence (i.e. a vector of bits) must be short enough for fast matching and preserve distinguishable features for similarity measurement to be feasible.
There are several algorithms that are freely available through open source communities.
A simple example can be found in this article, where Dr. Neal Krawetz shows how the Average Hash algorithm works:
Reduce size. The fastest way to remove high frequencies and detail is to shrink the image. In this case, shrink it to 8x8 so that there are 64 total pixels. Don't bother keeping the aspect ratio, just crush it down to fit an 8x8 square. This way, the hash will match any variation of the image, regardless of scale or aspect ratio.
Reduce color. The tiny 8x8 picture is converted to a grayscale. This changes the hash from 64 pixels (64 red, 64 green, and 64 blue) to 64 total colors.
Average the colors. Compute the mean value of the 64 colors.
Compute the bits. This is the fun part. Each bit is simply set based on whether the color value is above or below the mean.
Construct the hash. Set the 64 bits into a 64-bit integer. The order does not matter, just as long as you are consistent. (I set the bits from left to right, top to bottom using big-endian.)
David Oftedal wrote a C# command-line application which can classify and compare images using the Average Hash algorithm.
(I tested his implementation with your sample images and I got a 98.4% similarity).
The main benefit of this solution is that you read each image only once, create the hashes and classify them based upon their similiarity (using, for example, the Hamming distance).
In this way you decouple the feature extraction phase from the classification phase, and you can easily switch to another hashing algorithm if you find it's not enough accurate.
Edit
You can find a simple example here (It includes a test set of 40 images and it gets a 40/40 score).

Here's a topic discussing image similarity with algorithms, already implemented in OpenCV library. You should have no problem importing low-level functions in your C# application.

The Commercial TinEye API is a really good option.
I've done image matching programs in the past and Image Processing technology these days is amazing, its advanced so much.
ps here's where those two random pics you pulled from google came from: http://www.tineye.com/search/1ec9ebbf1b5b3b81cb52a7e8dbf42cb63126b4ea/

Since this is a one-off job, I'd make do with a script (choose your favorite language; I'd probably pick Perl) and ImageMagick. You could use C# to accomplish the same as the script, although with more code. Just call the command line utilities and parse the resulting output.
The script to check a pair for similarity would be about 10 lines as follows:
First retrieve the sizes with identify and check aspect ratios nearly the same. If not, no match. If so, then scale the larger image to the size of the smaller with convert. You should experiment a bit in advance with filter options to find the one that produces the most similarity in known-equivalent images. Nine of them are available.
Then use the compare function to produce a similarity metric. Compare is smart enough to deal with translation and cropping. Experiment to find a similarity threshold that doesn't provide too many false positives.

I would do something like this :
If you already know how the blurred images have been blurred, apply the same function to the high quality images before comparison.
Then compare the images using least-squares as suggested above.
The lowest value should give you a match. Ideally, you would get 0 if both images are identical
to speed things up, you could perform most comparison on downsampled images then refine on a selected subsample of the images
If you don't know, try various probable functions (JPEG compression, downsampling, ...) and repeat

You could try Content-Based Image Retrieval (CBIR).
To put it bluntly:
For every image in the database, generate a fingerprint using a
Fourier Transform
Load the source image, make a fingerprint of the
image
Calculate the Euclidean Distance between the source and all
the images in the database
Sort the results

I think a hybrid approach to this would be best to solve your particular batch matching problem
Apply the Image Hashing algorithm suggested by #Paolo Morreti, to all images
For each image in one set, find the subset of images with a hash closer that a set distance
For this reduced search space you can now apply expensive matching methods as suggested by #Running Wild or #Raskolnikov ... the best one wins.

IMHO, best solution is to blur both images and later use some similarity measure (correlation/ mutual information etc) to get top K (K=5 may be?) choices.

If you extract the contours from the image, you can use ShapeContext to get a very good matching of images.
ShapeContext is build for this exact things (comparing images based on mutual shapes)
ShapeContext implementation links:
Original publication
A goot ppt on the subject
CodeProject page about ShapeContext
*You might need to try a few "contour extraction" techniques like thresholds or fourier transform, or take a look at this CodeProject page about contour extraction
Good Luck.

If you calculate just pixel difference of images, it will work only if images of the same size or you know exactly how to scale it in horizontal and vertical direction, also you will not have any shift or rotation invariance.
So I recomend to use pixel difference metric only if you have simplest form of problem(images are the same in all characteristics but quality is different, and by the way why quality is different? jpeg artifacts or just rescale?), otherwise i recommend to use normalized cross-correlation, it's more stable metric.
You can do it with FFTW or with OpenCV.

If bad quality is just result of lower resolution then:
rescale high quality image to low quality image resolution (or rescale both to equal low resolution)
compare each pixel color to find closest match
So for example rescaling all of images to 32x32 and comparing that set by pixels should give you quite reasonable results and its still easy to do. Although rescaling method can make difference here.

You could try a block-matching algorithm, although I'm not sure its exact effectiveness against your specific problem - http://scien.stanford.edu/pages/labsite/2001/ee368/projects2001/dropbox/project17/block.html - http://www.aforgenet.com/framework/docs/html/05d0ab7d-a1ae-7ea5-9f7b-a966c7824669.htm
Even if this does not work, you should still check out the Aforge.net library. There are several tools there (including block matching from above) that could help you in this process - http://www.aforgenet.com/

I really like Running Wild's algorithm and I think it can be even more effective if you could make the two images more similar, for example by decreasing the quality of the better one.

Running Wild's answer is very close. What you are doing here is calculating the Peak Signal to Noise Ratio for each image, or PSNR. In your case you really only need the Mean Squared Error, but the squaring component of it helps a great deal in calculating difference between images.
PSNR Reference
Your code should look like:
sum = 0.0
for(imageHeight){
for(imageWidth){
errorR = firstImage(r,x,y) - secondImage(r,x,y)
errorG = firstImage(g,x,y) - secondImage(g,x,y)
errorB = firstImage(b,x,y) - secondImage(b,x,y)
totalError = square(errorR) + square(errorG) + square(errorB)
}
sum += totalError
}
meanSquaredError = (sum / (imageHeight * imageWidth)) / 3

I asume the images from the two databases show the same dialog and that the images should be close to identical but of different quality? Then matching images will have same (or very close to same) aspect ratio.
If the low quality images were produced from the high quality images (or equivalent image), then you should use the same image processing procedure as a preprocessing step on the high quality image and match with the low quality image database. Then pixel by pixel comparison or histogram matching should work well.
Image matching can use a lot of resources if you have many images. Maybe a multipass approach is a good idea? For example:
Pass 1: use simple mesures like aspect ratio to groupe images (width and height fields in db?) (computationally cheap)
Pass 2: match or groupe by histogram for 1st-color-channel (or all channels) (relatively computationally cheap)
I will also recommend OpenCV. You can use it with c,c++ and Python (and soon Java).

Just thinking out loud:
If you use two images that should be compared as layers and combine these (subtract one from the other) you get a new image (some drawing programs can be scripted to do batch conversion, or you could use the GPU by writing a tiny DirectX or OpenGL program)
Next you would have to get the brightness of the resulting image; the darker it is, the better the match.

Have you tried contour/thresholding techniques in combination with a walking average window (for RGB values ) ?

Bilinear interpolation - DirectX vs. GDI+

I have a C# app for which I've written GDI+ code that uses Bitmap/TextureBrush rendering to present 2D images, which can have various image processing functions applied. This code is a new path in an application that mimics existing DX9 code, and they share a common library to perform all vector and matrix (e.g. ViewToWorld/WorldToView) operations. My test bed consists of DX9 output images that I compare against the output of the new GDI+ code.
A simple test case that renders to a viewport that matches the Bitmap dimensions (i.e. no zoom or pan) does match pixel-perfect (no binary diff) - but as soon as the image is zoomed up (magnified), I get very minor differences in 5-10% of the pixels. The magnitude of the difference is 1 (occasionally 2)/256. I suspect this is due to interpolation differences.
Question: For a DX9 ortho projection (and identity world space), with a camera perpendicular and centered on a textured quad, is it reasonable to expect DirectX.Direct3D.TextureFilter.Linear to generate identical output to a GDI+ TextureBrush filled rectangle/polygon when using the System.Drawing.Drawing2D.InterpolationMode.Bilinear setting?
For this (magnification) case, the DX9 code is using this (MinFilter,MipFilter set similarly):
Device.SetSamplerState(0, SamplerStageStates.MagFilter, (int)TextureFilter.Linear);
and the GDI+ path is using:
g.InterpolationMode = InterpolationMode.Bilinear;
I thought that "Bilinear Interpolation" was a fairly specific filter definition, but then I noticed that there is another option in GDI+ for "HighQualityBilinear" (which I've tried, with no difference - which makes sense given the description of "added prefiltering for shrinking")
Followup Question: Is it reasonable to expect pixel-perfect output matching between DirectX and GDI+ (assuming all external coordinates passed in are equal)? If not, why not?
Clarification: The images I'm using are opaque grayscale (R=G=B, A=1) using Format32bppPArgb.
Finally, there are a number of other APIs I could be using (Direct2D, WPF, GDI, etc.) - and this question generally applies to comparing the output of "equivalent" bilinear interpolated output images across any two of these. Thanks!

DirectX runs mostly in the GPU and DX9 may be running shaders. GDI+ runs on completely different algorithms. I don't think it is reasonable to expect the two to come up with exactly pixel-matching outputs.
I'd expect DX9 to have better quality than GDI+, which is a step improvement over the old GDI but not much. GDI+ is long understood to have trouble with anti-aliasing lines and also with preserving quality in image scaling (which seems to be your problem). In order to have something similar in quality than latest-generation GPU texture processing, you'll need to move to WPF graphics. This gives quality similar to DX.
WPF also uses the GPU (if available) and falls back to software rendering (if no GPU), therefore the output between GPU and software rendering are reasonably close.
EDIT: Although this has been picked as the answer, it is only an early attempt to explain and doesn't really address the true reason. The reader is referred to discussions laid out in the comments to the question and to the answers instead.

Why do you make the assumption that they use the same formula?
Even if they do use the same formula and you accept that the implementation is different would you expect the output to be the same?
At the end of the day the code is designed to work with perception not be mathematically precise. Although you can get this with CUDA if you want.
Rather than being suprised that you get different results i would be very suprised if you got pixel perfect matches.
the way they represent colour is different ... I know for a fact nvidia uses a float(maybe double) to represent colour wheras GDI uses int i believe.
http://en.wikipedia.org/wiki/GPGPU
In DX9 shader 2.0 appears which is when the implementation of colour switched from int to 24 and 32 bit floats.
try comparing ati/amd rendering to nvidia rendering and you can clearly see that colour is very different.
I first noticed this in quake 2 ... the difference between the 2 cards was staggering - of course that is due to a great many number of things, least of which is their bilinier interp implementation.
EDIT: the info about how the specification was made happeend after i answered. Anyway i think the datatypes used to store it iwll be different no matter how you specify it. Moreover the implementation of float is likley to be different. I may be wrong but im pretty sure that c# implements float differently to the C compiler that nvidia uses. (and that assumes that GDI+ doesnt just convert the float into the equivalent int ....)
Even if i am wrong about that I would enerally hold it to be exceptional to expect 2 different implementations of an algorithm to be identical. they are optomised for speed as a result the difference in optomisation will directly translate to a difference in image quality as this speed will come from a different approach to cutting corners/approximation.

There are two possibilities for round-off differences. The first is obvious, when the RGB value is calculated as a fraction of the values on either side. The second is more subtle, when calculating the ratio to use when determining the fraction between the two points.
I'd actually be very surprised if two different implementations of the algorithm were pixel-for-pixel identical, there's so many places for a +/-1 difference to occur. Without getting the exact implementation details of both methods it's impossible to be more precise than that.

Removing Duplicate Images [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
We have a collection of photo images sizing a few hundred gigs. A large number of the photos are visually duplicates, but with differing filesizes, resolution, compression etc.
Is it possible to use any specific image processing methods to search out and remove these duplicate images?

I recently wanted to accomplish this task for a PHP image gallery. I wanted to be able to generate a "fuzzy" fingerprint for an uploaded image, and check a database for any images that had the same fingerprint, indicating they were similar, and then compare them more closely to determine how similar.
I accomplished it by resizing the uploaded image to 150 pixels wide, reducing it to greyscale, rounding the value of each colour off to the nearest multiple of 16 (giving 17 possible shades of grey between 0 and 255), normalise them and store them in an array, thereby creating a "fuzzy" colour histogram, then creating an md5sum of the histogram which I could then search for in my database. This was extremely effective in narrowing down images which were very visually similar to the uploaded file.
Then to compare the uploaded file against each "similar" image in the database, I took both images, resized them to 16x16, and analysed them pixel by pixel and took the RGB value of each pixel away from the value of the corresponding pixel in the other image, adding all the values together and dividing by the number of pixels giving me an average colour deviation. Anything less than specific value was determined to be a duplicate.
The whole thing is written in PHP using the GD module, and a comparison against thousands of images takes only a few hundred milliseconds per uploaded file.
My code, and methodology is here: http://www.catpa.ws/php-duplicate-image-finder/

Try PerceptualDiff for comparing 2 images with the same dimensions. Allows threshholds such as considering images with only X number of pixels different to be visually indistinguishable.
If visual duplicates may have different dimensions due to scaling, or different filetypes,
you may want to make a standard format for comparisons. For example, I might use ImageMagick
to scale all images to 100x100 and save them as PNG files.

A very simple approach is the following:
Convert the image to greyscale in memory, so every pixel is only a number between 0 (black) and 255 (white).
Scale the image to a fixed size. Finding the right size is important, you should play around with different sizes. E.g. you could scale each image to 64x64 pixels, but you may get better or worse results with either smaller or bigger pictures.
Once you've done this for all images (yes, that will take a while), always load two images in memory and subtract them from each other. That is subtract the value of pixel (0,0) in image A ob the value of pixel (0,0) in image B, now do the same for (0,1) in both and so on. The resulting value might be positive or negative, you should always store the absolute value (so 5 results in 5, -8 however results in 8).
Now you have a third image being the "difference image" (delta image) of image A and B. If they were identical, the delta image is all black (all values will subtract to zero). The "less black" it is, the less identical the images are. You need to find a good threshold, since even if the images are in fact identical (to your eyes), by scaling, altering brightness and so on, the delta image will not be totally black, it will however have only very dark greytones. So you need a threshold that says "If average error (delta image brightness) is below a certain value, there is still a good chance they might be identical, however if it is above that value, they are most likely not. Finding the right threshold is as hard as finding the right scaling size. You will always have false positives (images deemed to be identical, though they are not at all) and false negatives (images deemed to be not identical, although they are).
This algorithm is ultra slow. Actually only creating the greyscale images takes tons of time. Then you need to compare each GS image to each other one, again, tons of time. Also storing all the GS images takes a lot of disk space. So this algorithm is very bad, but the results are not that bad, even though its that simple. While the results are not amazing, they are better than I had initially thought.
The only way to get even better results is to use advanced image processing and here it starts getting really complicated. It involves a lot of math (a real lot of it); there are good applications (dupe finders) for many systems that have these implemented, so unless you must program it yourself, you are probably better off using one of these solutions. I read a lot papers on this topic but I'm afraid most of this goes beyond my horizon. Even the algorithms I might be able to implement according to these papers are beyond it; that means I understand what needs to be done, but I have no idea why it works or how it actually works, it's just magic ;-)

I actually wrote an application that does this very thing.
I started with a previous application that used a basic Levenshtein Distance algorithm to compute image similarity, but that method is undesirable for a number of reasons. Without a doubt, the fastest algorithm you're going to find for determining image similarity is either mean squared error or mean absolute error (both have a running time of O(n), where n is the number of pixels in the image, and it'd also be trivial to thread an implementation of either algorithm in a number of different ways). Mecki's post is actually just a Mean Absolute Error implementation, which my application can perform (code is also available for your browsing pleasure, should you so desire).
In any event, in our application, we first down-sample images (e.g. everything is scaled to, say, 32*32 pixels), then convert to gray scale, and then run the resulting images through our comparison algorithms. We're also working on some more advanced pre-processing algorithms to further normalize images, but...not quite there yet.
There are definitely better algorithms than MSE/MAE (in fact, the problems with these two algorithms as applied to visual information has been well documented), like SSIM, but it comes at a cost. Other people attempt to compare other visual qualities in the image, such as luminance, contrast, color histograms, etc., but it's all pricey compared to simply measuring the error signal.
My application might work, depending on how many images are in those folders. It's multi-threaded (I've seen it fully load eight processor cores performing comparisons), but I've never tested against an image database larger than a few hundred images. A few hundred gigs of images sounds prohibitively large. (simply reading them from disk, downsampling, converting to gray scale and storing in memory--assuming you have enough memory to hold everything, which you probably don't--could take a couple hours).

This is still a research area, I believe. If you have some time in your hands, some relevant keywords are:
Image copy detection
Content based image retrieval
Image indexing
Image duplicate removal
Basically, each image is processed (indexed) to produce an "image signature". Similar images have similar signatures. If your images are just rescaled then probably their signature are nearly identical, so they cluster well. Some popular signatures are the MPEG-7 descriptors. To cluster, I think K-Means or any of its variants may be enough.
However, you probably need to deal with millions of images, this may be a problem.
Here is a link to the main Wikipedia entry:
http://en.wikipedia.org/wiki/CBIR
Hope this helps.

Image similarity is probably a sub-field of image processing/AI.
Be prepared to implement algorithms/formulae from papers if you're looking for an excellent (i.e. performant and scalable) solution.
If you want something quick n dirty, search google for Image Similarity
Here's a C# image similarity app that might do what you want.
Basically, all algorithms extract and compare features. How they define "feature" depends on the math model they're based on.

A quick hack at this is to write a program that will calculate the value of the average pixel in each image, in greyscale, sort by this value, and then compare them visually. Very similar images should occur near each other in the sorted order.

You will need a command line tool to deal with so much data.
Comparing every possible pair of images will not scale to such a large set of images.
You need to sort the entire set of images according to some metric so that further
comparisons are only needed on neighbouring images.
An example of a simple metric is the average value of all of the pixels in an image, expressed
as a single greyscale value. This should work only if the duplicates have not had any visual alterations.
Using a lossy file format can also result in visual alterations.

Thinking outside the box, you may be able to use image metadata to narrow down your dataset.
For example, your images may have fields showing the date and time the image was taken, down to the nearest second.
Duplicates are likely to have identical values.
A tool such as exiv2 could be used to dump out this data to a more convenient and sortable text format (with a little knowledge of batch/shell scripting).
Even fields such as the camera manufacturer and model could be used to reduce a set of 1,000,000 images
to say 100 sets of 10,000 images, a significant improvement.

The gqview program has an option for finding duplicates, so you might try looking there. However, it's not foolproof, so it'd only be suitable as a heuristic to present duplicates to a human, for manual confirmation.

The most important part is to make the files comparable.
A generic solution might be to scale all images to a certain fixed size and greyscale. Then save the resulting images in a separate directory with same name for later reference. It would then be possible to sort by filesize and visually compare neighboring entries.
The resulting pictures might be quantified in certain ways to programatically detect similarities (averaging of blocks, lines etc.).

I would imagine the most scaleable method would be to store an fingerprint with each image. Then when a new image is added, it's a simple case of SELECT id FROM photos where id='uploaded_image_id' to check for duplicates (or fingerprinting all the images, then doing a query for duplicate
Obviously a simple file-hash wouldn't work as the actual content differs..
Acoustic fingerprinting/this paper may be a good start on the concept, as there are many implementations of this. Here is a paper on image fingerprinting.
That said, you may be able to get away with something simpler. Something as basic as resizing the image to equal width or height, subtracting image_a from image_b, and summing the difference. If the total difference is below a threshold, the image is a duplicate.
The problem with this is you need to compare every image to every other. The time required will exponentially increase..

If you can come up with a way of comparing images that obeys the triangle inequality (eg, if d(a,b) is the difference between images a and b, then d(a,b) < d(a,c) + d(b,c) for all a,b,c), then a BK-Tree would be an effective way of indexing the images such that you can find matches in O(log n) time instead of O(n) time for each image.
If your matches are restricted to the same image after varying amounts of compression/resizing/etc, then converting to some canonical size/color balance/etc and simply summing the squares-of-differences of each pixel may be a good metric, and this obeys the triangle inequality, so you could use a BK-tree for efficient access.

If you have a little bit of money to spend, and maybe once you run a first pass to determine which images are maybe matches, you could write a test for Amazon's Mechanical Turk.
https://www.mturk.com/mturk/welcome
Essentially, you'd be creating a small widget that AMT would show to real human users who would then basically just have to answer the question "Are these two images the same?". Or you could show them a grid of say 5x5 images and ask them "Which of these images match?". You'd then collect the data.
Another approach would be to use the principles of Human Computation which have been most famously espoused by Luis Von Ahn (http://www.cs.cmu.edu/~biglou/) with reCaptcha, which uses Captcha answers to determine the unreadable words that have been run through Optical character Recognition, thus helping to digitize books. You could make a captcha that asked users to help refine the images.

It sounds like a procedural problem rather than a programming problem. Who uploads the photos? You or the customers? If you are uploading the photo, standardize the dimensions to a fixed scale and file format. That way comparisons will be easier. However, as it stands, unless you have days - or even weeks of free time - I suggest that you instead manually remove the duplicates images by either yourself or your team by visually comparing the images.
Perhaps you should group the images by location since it is a tourist images.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.