Advanced C# pattern search in long string (100-25000 char)

Let me start with this: I can't zip it or anything similar.
What I'm trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color from my palette. The g is a separator between the color and the count. The 12 means 12 pixels in a row use that color. The h is a separator that ends the block.)
The problem I'm having is that the blocks aren't all the same length. They range from 0g1h to 2546g115h. Basically I want to create a palette of common patterns to hopefully save space. Say 12g345h19g12h190g11h occurs at least three times; then I could save space by putting something like a=12g345h19g12h190g11h in the palette array and just putting 'a' in the string. I could even ignore the block boundaries entirely; as you can see in the linked file, g640h shows up a ton of times.
I could be wrong, but I'm pretty sure this could work. If you have a better idea of how I could save space without losing data, I'm more than open to it.
Here is a great example since you can visually see the pattern: http://pastebin.com/5dbhxZQK. I chose this file because I knew it would have massive redundancy; most aren't this simple.

You could use a dictionary (probably Dictionary<string, int>) and just count how many times each pattern occurs, then go back and rewrite the string with the appropriate replacements.
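A minimal sketch of that counting step, assuming the blocks are delimited by 'h' and that runs of three consecutive blocks are the patterns worth counting (both assumptions are mine, not from the question):

using System;
using System.Collections.Generic;
using System.Linq;

class PatternCounter
{
    static void Main()
    {
        // Hypothetical sample of the encoded pixel data described above.
        string data = "12g345h19g12h190g11h0g640h12g345h19g12h190g11h0g640h12g345h19g12h190g11h";

        // Split into individual blocks; each block ends with 'h'.
        var blocks = data.Split(new[] { 'h' }, StringSplitOptions.RemoveEmptyEntries)
                         .Select(b => b + "h")
                         .ToList();

        // Count every run of three consecutive blocks.
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + 3 <= blocks.Count; i++)
        {
            string pattern = string.Concat(blocks.Skip(i).Take(3));
            counts[pattern] = counts.TryGetValue(pattern, out int n) ? n + 1 : 1;
        }

        // Patterns seen at least three times are candidates for the palette array.
        foreach (var kv in counts.Where(p => p.Value >= 3).OrderByDescending(p => p.Value))
            Console.WriteLine($"{kv.Value}x  {kv.Key}");
    }
}

From there, replacing each chosen pattern with a single palette character is a straightforward string.Replace pass, as long as the palette characters can never appear in the raw data.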
However, I would recommend that you read up a little on compression algorithms: what you are implementing appears to be a Run Length Encoding (RLE) scheme, and you are then trying to compress again on top of that. Consider looking at how sliding-window compression works (which is what GZIP does) as an alternative to your RLE, or look at Huffman coding as a mechanism to reduce the amount of space needed for the codewords you are creating. (In simple terms, Huffman coding assigns shorter symbols to more frequent patterns and longer symbols to less frequent patterns in an 'optimal' way.)
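To make the Huffman idea concrete, here is a rough sketch that builds codes for a handful of block patterns; the frequencies are invented, and a real implementation would use a priority queue instead of re-sorting a list:

using System;
using System.Collections.Generic;
using System.Linq;

class HuffmanSketch
{
    class Node
    {
        public string Symbol;      // null for internal nodes
        public int Frequency;
        public Node Left, Right;
    }

    // Builds a Huffman code table from symbol frequencies.
    static Dictionary<string, string> BuildCodes(Dictionary<string, int> frequencies)
    {
        var nodes = frequencies.Select(kv => new Node { Symbol = kv.Key, Frequency = kv.Value }).ToList();

        // Repeatedly merge the two least frequent nodes until one tree remains.
        while (nodes.Count > 1)
        {
            nodes = nodes.OrderBy(n => n.Frequency).ToList();
            var merged = new Node
            {
                Frequency = nodes[0].Frequency + nodes[1].Frequency,
                Left = nodes[0],
                Right = nodes[1]
            };
            nodes.RemoveRange(0, 2);
            nodes.Add(merged);
        }

        var codes = new Dictionary<string, string>();
        Assign(nodes[0], "", codes);
        return codes;
    }

    static void Assign(Node node, string prefix, Dictionary<string, string> codes)
    {
        if (node.Symbol != null) { codes[node.Symbol] = prefix.Length > 0 ? prefix : "0"; return; }
        Assign(node.Left, prefix + "0", codes);
        Assign(node.Right, prefix + "1", codes);
    }

    static void Main()
    {
        // Hypothetical block frequencies for the kind of data described above.
        var freq = new Dictionary<string, int>
        {
            ["0g640h"] = 120, ["12g345h"] = 30, ["19g12h"] = 25, ["190g11h"] = 5
        };
        foreach (var kv in BuildCodes(freq))
            Console.WriteLine($"{kv.Key} -> {kv.Value}");
    }
}

Running this, the most frequent block ends up with the shortest bit code, which is exactly the effect you want from a palette.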
This is a fun problem space to play in! Good Luck!

Related

Shannon-Fano coding algorithm - strange behaviour on larger sets

I am writing a Shannon-Fano algorithm and I am struggling to find the mistake in my program. It works for the examples I managed to find on the internet.
This is my example with 10 characters, where it assigns longer codes to characters with lower probabilities:
On the left are the byte values, in the middle the probabilities, and on the right the generated codes. Why are the codes for 65 and 226 longer than those for 0, 3 and 32? Can anybody see a bug in the code?
EDIT: code hidden, because this question was about a school assignment
This is probably not a bug in your code but rather illustrates an inherent weakness in Shannon-Fano codes compared to, say, Huffman compression.
As you know, the Shannon-Fano technique is to sort the list of code frequencies in descending order and then assign a binary symbol (zero or one) to each half of the frequency range. This process is repeated in a recursive fashion as long as there is more than one element in a segment.
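For concreteness, a rough sketch of that recursive split (with made-up symbol frequencies, not the asker's data):

using System;
using System.Collections.Generic;
using System.Linq;

class ShannonFanoSketch
{
    // Recursively splits the descending-sorted symbol list into two parts whose
    // frequency totals are as equal as possible, appending '0' to the left part
    // and '1' to the right part.
    static void Assign(List<(char Symbol, int Freq)> symbols, string prefix,
                       Dictionary<char, string> codes)
    {
        if (symbols.Count == 1)
        {
            codes[symbols[0].Symbol] = prefix.Length > 0 ? prefix : "0";
            return;
        }

        int total = symbols.Sum(s => s.Freq);
        int best = 0, bestDiff = int.MaxValue, left = 0;
        for (int i = 0; i < symbols.Count - 1; i++)
        {
            left += symbols[i].Freq;
            int diff = Math.Abs(total - 2 * left);   // imbalance of this split point
            if (diff < bestDiff) { bestDiff = diff; best = i; }
        }

        Assign(symbols.Take(best + 1).ToList(), prefix + "0", codes);
        Assign(symbols.Skip(best + 1).ToList(), prefix + "1", codes);
    }

    static void Main()
    {
        // Made-up frequencies, already sorted in descending order as the algorithm requires.
        var symbols = new List<(char Symbol, int Freq)>
        {
            ('a', 15), ('b', 7), ('c', 6), ('d', 6), ('e', 5)
        };
        var codes = new Dictionary<char, string>();
        Assign(symbols, "", codes);
        foreach (var kv in codes)
            Console.WriteLine($"{kv.Key}: {kv.Value}");   // a:00 b:01 c:10 d:110 e:111
    }
}

Depending on how ties in the split are broken, symbols with nearly equal frequencies can swap code lengths, which is exactly the kind of "strange" result described in the question.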
This has a weakness, though: while it is true that the more frequent symbols, taken as a group, will have shorter encodings on average than the less frequent symbols, it is not necessarily the case that every individual frequent symbol gets a shorter encoding than a less frequent one.
For more information, see a question I posted a while back over on Computer Science about this very issue.

Compress a short but repeating string

I'm working on a web app that needs to take a list of files on a query string (specifically a GET and not a POST), something like:
http://site.com/app?things=/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
I want to shorten that string:
http://site.com/app?things=somekindofencoding
The string isn't terribly long, varies from 20-150 chars. Something that short isn't really suitable for GZip, but it does have an awful lot of repetition so compression should be possible.
I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL. It doesn't need to be secure.
Is there an existing way to do this? I'm working in C#/.Net but would be happy to adapt an algorithm from some other language/stack.
If you can express the data in BNF, you could construct a parser for the data. Instead of sending the data you could send the AST, where each node would be identified as one character (or several if you have a lot of different nodes). In your example
we could have
files : file files
      |
file  : path id
path  : /items/thing
      | /files/item
      | /stuff/things/item
you could then represent a list of files as path[id1,id2,...,idn], using 0, 1, 2 for the paths. With the input being:
/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
/files/item1,/files/item46,/files/item7
you'd then end up with ?things=2[123,456,789]1[1,46,7]
where /stuff/things/item is represented by 2 and /files/item by 1, and each number within [...] is an id. So 2[123] would expand to /stuff/things/item123.
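For illustration, a small decoding sketch under the static mapping above (the path table and the regex are my own, inferred from the example):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class ThingsDecoder
{
    // Static path table assumed from the example above (index -> path prefix).
    static readonly Dictionary<string, string> Paths = new Dictionary<string, string>
    {
        ["1"] = "/files/item",
        ["2"] = "/stuff/things/item"
    };

    // Expands e.g. "2[123,456,789]1[1,46,7]" back into full paths.
    static IEnumerable<string> Decode(string encoded)
    {
        foreach (Match m in Regex.Matches(encoded, @"(\d+)\[([^\]]*)\]"))
        {
            string prefix = Paths[m.Groups[1].Value];
            foreach (string id in m.Groups[2].Value.Split(','))
                yield return prefix + id;
        }
    }

    static void Main()
    {
        foreach (string path in Decode("2[123,456,789]1[1,46,7]"))
            Console.WriteLine(path);   // /stuff/things/item123 ... /files/item7
    }
}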
EDIT: The approach does not have to be static. If you have to discover the repeated items dynamically, you can use the same approach and pass along the map between identifier and token. In that case the above example would be
?things=2[123,456,789]1[1,46,7]&tokens=2=/stuff/things/,1=/files/item
which, if the grammar is this simple, would of course do better as
?things=/stuff/things/[123,456,789]/files/item[1,46,7]
Compressing the repeated part down to less than the unique data is possible with such a short string, but it will most likely have to be based on constraining the possible values; otherwise you risk actually increasing the size when "compressing".
You can try zlib using raw deflate (no zlib or gzip headers and trailers). It will generally provide some compression even on short strings that are composed of printable characters, and it does look for and take advantage of repeated strings. I haven't tried it, but you could also see if smaz works for your data.
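For what it's worth, .NET's DeflateStream reads and writes the raw deflate format (no zlib or gzip header), so a sketch could look like this; the URL-safe Base64 step is my addition, and its 4/3 expansion can eat much of the gain on very short inputs:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class UrlCompressor
{
    // Compress with raw deflate, then Base64-encode (made URL-safe) for a query string.
    public static string Compress(string value)
    {
        byte[] input = Encoding.UTF8.GetBytes(value);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
                deflate.Write(input, 0, input.Length);
            return Convert.ToBase64String(output.ToArray())
                          .Replace('+', '-').Replace('/', '_').TrimEnd('=');
        }
    }

    public static string Decompress(string encoded)
    {
        string base64 = encoded.Replace('-', '+').Replace('_', '/');
        base64 = base64.PadRight(base64.Length + (4 - base64.Length % 4) % 4, '=');
        using (var input = new MemoryStream(Convert.FromBase64String(base64)))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.UTF8))
            return reader.ReadToEnd();
    }
}

Whether this actually shortens a 20-150 character URL depends on how repetitive the paths are, so benchmark it against your real data as suggested below.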
I would recommend obtaining a large set of real-life example URLs to use for benchmark testing of possible compression approaches.

Improve pre-processing for OCR/Image Recognition

Currently I have a huge interest in image processing and optical character recognition. After some basic recognition and some filters, I decided to start on something more difficult.
I'm trying to read the value out of these captchas:
http://img851.imageshack.us/img851/9579/57859946.png
I have written some filters for pre-processing:
- Replace color (to white)
- Remove blue lines
- Remove the two lines that go through the text
- Threshold image (255)
Which outputs an image like this:
http://img232.imageshack.us/img232/2325/00i3q45j1zt.png
As you can see there are holes in some letters. I first thought maybe it's better to leave the lines through the letters, but that made it worse. I'm using the Tesseract OCR engine, and I trained it using the Elephant font (the font the captcha uses). I also tried other OCR engines like GOCR, but they made everything worse. With Tesseract I now have a recognition rate of 20%. I'm coding in C# (.NET 4.0).
The captcha is generated by a software package named PHPCaptcha.
Now my question is:
Is there any algorithm or trick to fill up the holes in the letters? And is there any other way to get a better recognition rate?
I'm excited to hear from you guys :)
Greetings,
Part 0 - Preface
i) Beforehand, you may want to read my OCR-related answer here, which may give you some tricks for using Tesseract.
ii) I assume you could just turn everything into black and white (in your case, processing in colors doesn't give you an edge)
Part 1 - Preprocessing
To fill the holes after you've removed the blue lines, you can always dilate, or perform 'dilate-then-erode' operations. Here, dilation means you grow every foreground pixel into its 8 neighbours (making a bigger pixel). Once you've dilated the pixels, see whether the characters can be recognized, or whether they are 'over-filled' (dilated too much). If the characters cannot be recognized or are dilated too much, you can then apply an erosion operation. Of course there are more advanced synthesis algorithms, but I think you are better off starting with a simpler image processing operation first.
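A rough sketch of that dilation step on a binary image held as a bool[,] (not tied to any particular imaging library); erosion is the mirror image, keeping a pixel only when all eight of its neighbours are set:

static class Morphology
{
    // 8-neighbour dilation on a binary image (true = foreground ink).
    public static bool[,] Dilate(bool[,] src)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        var dst = new bool[h, w];

        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                bool any = false;
                // A pixel becomes foreground if it or any of its 8 neighbours is foreground.
                for (int dy = -1; dy <= 1 && !any; dy++)
                    for (int dx = -1; dx <= 1 && !any; dx++)
                    {
                        int ny = y + dy, nx = x + dx;
                        if (ny >= 0 && ny < h && nx >= 0 && nx < w && src[ny, nx])
                            any = true;
                    }
                dst[y, x] = any;
            }

        return dst;
    }
}

A 'close' operation (dilate once, then erode once) usually fills small gaps like the ones in the sample image without thickening the strokes too much.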
Part 2 - OCR/Tesseract
If you feed the whole image into Tesseract, it will perform line analysis and so on. Since characters in a captcha don't behave like normal text, doing line analysis or recognizing them as a group may somewhat deteriorate the recognition rate. So my suggestion is to recognize character by character first.

Finding string segments in a string

I have a list of segments (15,000+), and I want to find the occurrences of these segments in a given string. A segment can be a single word or multiple words, and I cannot assume a space as the delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[The string above may not make any sense, but I am using it for illustration purposes.]
segment list
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically I am trying to do query reduction.
I want to achieve this in less than O(list length + string length) time.
As my list has more than 15,000 segments, it would be time consuming to search the entire list against the string.
The segments are prepared manually and placed in a txt file.
Regards
~Paul
You basically want a string search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
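A compact sketch of such an automaton in C#; the case-insensitive matching and the hard-coded pattern list are my own choices for illustration:

using System;
using System.Collections.Generic;

class AhoCorasick
{
    class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // Build the trie of patterns (lower-cased for case-insensitive matching).
        foreach (string p in patterns)
        {
            Node node = _root;
            foreach (char c in p.ToLowerInvariant())
            {
                if (!node.Next.TryGetValue(c, out Node child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Output.Add(p);
        }

        // Build failure links with a breadth-first pass.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (var kv in node.Next)
            {
                char c = kv.Key;
                Node child = kv.Value;
                Node fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(c)) fail = fail.Fail;
                child.Fail = fail == null ? _root : fail.Next[c];
                child.Output.AddRange(child.Fail.Output);
                queue.Enqueue(child);
            }
        }
    }

    // Reports every (position, pattern) match in a single pass over the text.
    public IEnumerable<(int Index, string Pattern)> Search(string text)
    {
        Node node = _root;
        string lower = text.ToLowerInvariant();
        for (int i = 0; i < lower.Length; i++)
        {
            char c = lower[i];
            while (node != _root && !node.Next.ContainsKey(c)) node = node.Fail;
            if (node.Next.TryGetValue(c, out Node next)) node = next;
            foreach (string match in node.Output)
                yield return (i - match.Length + 1, match);
        }
    }

    static void Main()
    {
        var matcher = new AhoCorasick(new[]
        {
            "Microsoft word", "Microsoft excel", "Professional Programmer",
            "Google", "Facebook", "Download codec from internet"
        });
        string text = "How can I download codec from internet for facebook, Professional programmer support";
        foreach (var (index, pattern) in matcher.Search(text))
            Console.WriteLine($"{index}: {pattern}");
    }
}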
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking is how to write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The deal with using regexes in this situation is that your execution time will grow linearly as a function of the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after adding each one. If they don't, then you need to continue (disregard the token like a compiler disregards comments).
Hope this helps.

Removing Duplicate Images [closed]

We have a collection of photo images sizing a few hundred gigs. A large number of the photos are visually duplicates, but with differing filesizes, resolution, compression etc.
Is it possible to use any specific image processing methods to search out and remove these duplicate images?
I recently wanted to accomplish this task for a PHP image gallery. I wanted to be able to generate a "fuzzy" fingerprint for an uploaded image, and check a database for any images that had the same fingerprint, indicating they were similar, and then compare them more closely to determine how similar.
I accomplished it by resizing the uploaded image to 150 pixels wide, reducing it to greyscale, and rounding the value of each colour to the nearest multiple of 16 (giving 17 possible shades of grey between 0 and 255). I then normalised those values and stored them in an array, thereby creating a "fuzzy" colour histogram, and created an md5sum of the histogram which I could then search for in my database. This was extremely effective in narrowing down images which were very visually similar to the uploaded file.
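A rough C# translation of that fingerprint idea (the original is PHP/GD), assuming System.Drawing is available; the 150-pixel width, the rounding to multiples of 16 and the MD5 hash follow the description above, everything else is guesswork:

using System;
using System.Drawing;
using System.Security.Cryptography;
using System.Text;

static class FuzzyFingerprint
{
    // Builds a "fuzzy" fingerprint: resize to 150px wide, convert to greyscale,
    // quantise to multiples of 16, build a normalised histogram, then MD5 it.
    public static string Compute(string path)
    {
        using (var original = new Bitmap(path))
        {
            int width = 150;
            int height = Math.Max(1, original.Height * width / original.Width);
            using (var small = new Bitmap(original, width, height))
            {
                var histogram = new int[17];   // buckets for 0, 16, 32, ..., 256
                for (int y = 0; y < small.Height; y++)
                    for (int x = 0; x < small.Width; x++)
                    {
                        Color c = small.GetPixel(x, y);
                        int grey = (c.R + c.G + c.B) / 3;
                        histogram[(int)Math.Round(grey / 16.0)]++;
                    }

                // Normalise so the fingerprint is independent of pixel count, then hash.
                var sb = new StringBuilder();
                double total = small.Width * small.Height;
                foreach (int count in histogram)
                    sb.Append((count / total).ToString("F4")).Append(';');

                using (var md5 = MD5.Create())
                {
                    byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
                    return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                }
            }
        }
    }
}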
Then, to compare the uploaded file against each "similar" image in the database, I took both images, resized them to 16x16, and analysed them pixel by pixel, taking the RGB value of each pixel away from the value of the corresponding pixel in the other image, adding all the values together and dividing by the number of pixels, which gave me an average colour deviation. Anything less than a specific value was determined to be a duplicate.
The whole thing is written in PHP using the GD module, and a comparison against thousands of images takes only a few hundred milliseconds per uploaded file.
My code, and methodology is here: http://www.catpa.ws/php-duplicate-image-finder/
Try PerceptualDiff for comparing two images with the same dimensions. It allows thresholds such as considering images with only X number of differing pixels to be visually indistinguishable.
If visual duplicates may have different dimensions due to scaling, or different filetypes, you may want to make a standard format for comparisons. For example, I might use ImageMagick to scale all images to 100x100 and save them as PNG files.
A very simple approach is the following:
Convert the image to greyscale in memory, so every pixel is only a number between 0 (black) and 255 (white).
Scale the image to a fixed size. Finding the right size is important, you should play around with different sizes. E.g. you could scale each image to 64x64 pixels, but you may get better or worse results with either smaller or bigger pictures.
Once you've done this for all images (yes, that will take a while), always load two images into memory and subtract them from each other. That is, subtract the value of pixel (0,0) in image B from the value of pixel (0,0) in image A, then do the same for (0,1) in both, and so on. The resulting value might be positive or negative; you should always store the absolute value (so 5 results in 5, while -8 results in 8).
Now you have a third image, the "difference image" (delta image) of images A and B. If they were identical, the delta image is all black (all values subtract to zero). The "less black" it is, the less identical the images are. You need to find a good threshold, since even if the images are in fact identical (to your eyes), scaling, altered brightness and so on will keep the delta image from being totally black; it will, however, have only very dark greytones. So you need a threshold that says "if the average error (delta image brightness) is below a certain value, there is still a good chance they are identical; if it is above that value, they are most likely not." Finding the right threshold is as hard as finding the right scaling size. You will always have false positives (images deemed to be identical, though they are not at all) and false negatives (images deemed to be not identical, although they are).
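The comparison step above, sketched on two greyscale images that have already been scaled to the same fixed size and flattened into byte arrays (loading and scaling are left to whatever imaging library you use):

using System;

static class DeltaCompare
{
    // Average absolute difference between two greyscale images of identical size,
    // each stored as a flat byte array (0 = black, 255 = white).
    // 0 means identical; the larger the value, the less alike the images are.
    public static double AverageError(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Scale both images to the same size first.");

        long sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += Math.Abs(a[i] - b[i]);   // one pixel of the "delta image"

        return (double)sum / a.Length;
    }

    // The threshold is something you have to find by experiment, as described above.
    public static bool ProbablyDuplicate(byte[] a, byte[] b, double threshold = 10.0)
        => AverageError(a, b) < threshold;
}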
This algorithm is ultra slow. Actually, only creating the greyscale images takes tons of time. Then you need to compare each greyscale image to every other one, which again takes tons of time. Storing all the greyscale images also takes a lot of disk space. So this algorithm is very bad, but the results are not that bad considering how simple it is; they are better than I had initially thought.
The only way to get even better results is to use advanced image processing, and here it starts getting really complicated. It involves a lot of math (a real lot of it); there are good applications (dupe finders) for many systems that have these techniques implemented, so unless you have to program it yourself, you are probably better off using one of those solutions. I have read a lot of papers on this topic, but I'm afraid most of it goes beyond my horizon. Even the algorithms I might be able to implement according to these papers are beyond it; that means I understand what needs to be done, but I have no idea why or how it actually works. It's just magic ;-)
I actually wrote an application that does this very thing.
I started with a previous application that used a basic Levenshtein Distance algorithm to compute image similarity, but that method is undesirable for a number of reasons. Without a doubt, the fastest algorithm you're going to find for determining image similarity is either mean squared error or mean absolute error (both have a running time of O(n), where n is the number of pixels in the image, and it'd also be trivial to thread an implementation of either algorithm in a number of different ways). Mecki's post is actually just a Mean Absolute Error implementation, which my application can perform (code is also available for your browsing pleasure, should you so desire).
In any event, in our application, we first down-sample images (e.g. everything is scaled to, say, 32*32 pixels), then convert to gray scale, and then run the resulting images through our comparison algorithms. We're also working on some more advanced pre-processing algorithms to further normalize images, but...not quite there yet.
There are definitely better algorithms than MSE/MAE (in fact, the problems with these two algorithms as applied to visual information have been well documented), like SSIM, but it comes at a cost. Other people attempt to compare other visual qualities in the image, such as luminance, contrast, color histograms, etc., but it's all pricey compared to simply measuring the error signal.
My application might work, depending on how many images are in those folders. It's multi-threaded (I've seen it fully load eight processor cores performing comparisons), but I've never tested it against an image database larger than a few hundred images. A few hundred gigs of images sounds prohibitively large. (Simply reading them from disk, downsampling, converting to grayscale and storing them in memory, assuming you have enough memory to hold everything, which you probably don't, could take a couple of hours.)
This is still a research area, I believe. If you have some time on your hands, some relevant keywords are:
Image copy detection
Content based image retrieval
Image indexing
Image duplicate removal
Basically, each image is processed (indexed) to produce an "image signature". Similar images have similar signatures. If your images are just rescaled, then their signatures are probably nearly identical, so they cluster well. Some popular signatures are the MPEG-7 descriptors. To cluster, I think K-Means or any of its variants may be enough.
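A bare-bones K-Means sketch over such signature vectors; the signatures themselves (MPEG-7 descriptors or anything else) are assumed to come from elsewhere:

using System;
using System.Linq;

static class KMeansSketch
{
    // Clusters signature vectors into k groups; returns the cluster index of each vector.
    public static int[] Cluster(double[][] signatures, int k, int iterations = 50)
    {
        var rng = new Random(42);
        // Start with k randomly chosen signatures as centroids.
        double[][] centroids = signatures.OrderBy(_ => rng.Next()).Take(k)
                                         .Select(s => (double[])s.Clone()).ToArray();
        var assignment = new int[signatures.Length];

        for (int iter = 0; iter < iterations; iter++)
        {
            // Assign each signature to its nearest centroid.
            for (int i = 0; i < signatures.Length; i++)
                assignment[i] = Enumerable.Range(0, k)
                    .OrderBy(c => Distance(signatures[i], centroids[c])).First();

            // Move each centroid to the mean of its members.
            for (int c = 0; c < k; c++)
            {
                var members = signatures.Where((_, i) => assignment[i] == c).ToArray();
                if (members.Length == 0) continue;   // leave empty clusters where they are
                for (int d = 0; d < centroids[c].Length; d++)
                    centroids[c][d] = members.Average(m => m[d]);
            }
        }
        return assignment;
    }

    static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }
}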
However, you probably need to deal with millions of images, and this may be a problem.
Here is a link to the main Wikipedia entry:
http://en.wikipedia.org/wiki/CBIR
Hope this helps.
Image similarity is probably a sub-field of image processing/AI.
Be prepared to implement algorithms/formulae from papers if you're looking for an excellent (i.e. performant and scalable) solution.
If you want something quick 'n' dirty, search Google for Image Similarity.
Here's a C# image similarity app that might do what you want.
Basically, all algorithms extract and compare features. How they define "feature" depends on the math model they're based on.
A quick hack at this is to write a program that will calculate the value of the average pixel in each image, in greyscale, sort by this value, and then compare them visually. Very similar images should occur near each other in the sorted order.
You will need a command line tool to deal with so much data.
Comparing every possible pair of images will not scale to such a large set of images. You need to sort the entire set of images according to some metric so that further comparisons are only needed on neighbouring images.
An example of a simple metric is the average value of all of the pixels in an image, expressed as a single greyscale value. This should work only if the duplicates have not had any visual alterations. Using a lossy file format can also result in visual alterations.
Thinking outside the box, you may be able to use image metadata to narrow down your dataset. For example, your images may have fields showing the date and time the image was taken, down to the nearest second. Duplicates are likely to have identical values. A tool such as exiv2 could be used to dump out this data to a more convenient and sortable text format (with a little knowledge of batch/shell scripting). Even fields such as the camera manufacturer and model could be used to reduce a set of 1,000,000 images to say 100 sets of 10,000 images, a significant improvement.
The gqview program has an option for finding duplicates, so you might try looking there. However, it's not foolproof, so it'd only be suitable as a heuristic to present duplicates to a human, for manual confirmation.
The most important part is to make the files comparable.
A generic solution might be to scale all images to a certain fixed size and greyscale, then save the resulting images in a separate directory with the same name for later reference. It would then be possible to sort by file size and visually compare neighbouring entries.
The resulting pictures might be quantified in certain ways to programmatically detect similarities (averaging of blocks, lines, etc.).
I would imagine the most scalable method would be to store a fingerprint with each image. Then when a new image is added, it's a simple case of SELECT id FROM photos WHERE fingerprint='fingerprint_of_uploaded_image' to check for duplicates (or fingerprinting all the images and then doing a query for duplicates).
Obviously a simple file hash wouldn't work, as the actual content differs.
Acoustic fingerprinting/this paper may be a good start on the concept, as there are many implementations of this. Here is a paper on image fingerprinting.
That said, you may be able to get away with something simpler. Something as basic as resizing the image to equal width or height, subtracting image_a from image_b, and summing the difference. If the total difference is below a threshold, the image is a duplicate.
The problem with this is that you need to compare every image to every other one, so the time required grows quadratically with the number of images.
If you can come up with a way of comparing images that obeys the triangle inequality (e.g., if d(a,b) is the difference between images a and b, then d(a,b) <= d(a,c) + d(b,c) for all a, b, c), then a BK-tree would be an effective way of indexing the images such that you can find matches in O(log n) time instead of O(n) time for each image.
If your matches are restricted to the same image after varying amounts of compression/resizing/etc., then converting to some canonical size/color balance/etc. and taking the square root of the sum of squared differences of each pixel (the Euclidean distance) may be a good metric. This obeys the triangle inequality, so you could use a BK-tree for efficient access.
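A minimal BK-tree sketch; the distance delegate is a placeholder for whichever integer-valued metric you settle on (for instance, the per-pixel error from the earlier answers rounded to an integer):

using System;
using System.Collections.Generic;

class BkTree<T>
{
    readonly Func<T, T, int> _distance;
    Node _root;

    class Node
    {
        public T Item;
        public Dictionary<int, Node> Children = new Dictionary<int, Node>();
    }

    public BkTree(Func<T, T, int> distance) => _distance = distance;

    public void Add(T item)
    {
        if (_root == null) { _root = new Node { Item = item }; return; }

        Node node = _root;
        while (true)
        {
            int d = _distance(item, node.Item);
            if (!node.Children.TryGetValue(d, out Node child))
            {
                node.Children[d] = new Node { Item = item };
                return;
            }
            node = child;
        }
    }

    // Finds every stored item within 'radius' of 'query'; the triangle inequality
    // lets us skip whole subtrees whose edge distance lies outside [d - radius, d + radius].
    public IEnumerable<T> Search(T query, int radius)
    {
        if (_root == null) yield break;
        var stack = new Stack<Node>();
        stack.Push(_root);
        while (stack.Count > 0)
        {
            Node node = stack.Pop();
            int d = _distance(query, node.Item);
            if (d <= radius) yield return node.Item;

            foreach (var kv in node.Children)
                if (kv.Key >= d - radius && kv.Key <= d + radius)
                    stack.Push(kv.Value);
        }
    }
}

Indexing all images once and then calling Search(newImage, radius) for each incoming image keeps the number of full distance computations well below the all-pairs count.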
If you have a little bit of money to spend, then once you've run a first pass to determine which images are possible matches, you could write a test for Amazon's Mechanical Turk.
https://www.mturk.com/mturk/welcome
Essentially, you'd be creating a small widget that AMT would show to real human users who would then basically just have to answer the question "Are these two images the same?". Or you could show them a grid of say 5x5 images and ask them "Which of these images match?". You'd then collect the data.
Another approach would be to use the principles of Human Computation which have been most famously espoused by Luis Von Ahn (http://www.cs.cmu.edu/~biglou/) with reCaptcha, which uses Captcha answers to determine the unreadable words that have been run through Optical character Recognition, thus helping to digitize books. You could make a captcha that asked users to help refine the images.
It sounds like a procedural problem rather than a programming problem. Who uploads the photos, you or the customers? If you are uploading the photos, standardize the dimensions to a fixed scale and file format; that way comparisons will be easier. However, as it stands, unless you have days or even weeks of free time, I suggest that you instead remove the duplicate images manually, either yourself or with your team, by visually comparing the images.
Perhaps you should group the images by location, since they are tourist images.
