Shannon-Fano coding algorithm - strange behaviour on larger sets - C#

I am writing a Shannon-Fano algorithm, and I am struggling to find a mistake in my program. It works for the examples I managed to find on the internet, for example:
This is my example with 10 characters, where the algorithm should assign longer codes to characters with lower probabilities:
On the left side are the byte values, in the middle are the probabilities and on the right are the generated codes. Why are the codes for 65 and 226 longer than those for 0, 3 and 32? Can anybody see a bug in the code?
EDIT: code hidden, because this question was about a school assignment

This is probably not a bug in your code but rather illustrates an inherent weakness in Shannon-Fano codes compared to, say, Huffman compression.
As you know, the Shannon-Fano technique is to sort the list of code frequencies in descending order and then assign a binary symbol (zero or one) to each half of the frequency range. This process is repeated in a recursive fashion as long as there is more than one element in a segment.
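To make the split concrete, here is a minimal sketch of that recursive procedure in C#; the SymbolInfo type and the AssignCodes name are invented for illustration, and the list is assumed to be pre-sorted by descending frequency before the first call:

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of the recursive Shannon-Fano split described above.
// SymbolInfo and AssignCodes are invented names; the list is assumed to be
// pre-sorted by descending frequency.
class ShannonFanoSketch
{
    public class SymbolInfo
    {
        public byte Value;
        public double Frequency;
        public string Code = "";
    }

    public static void AssignCodes(List<SymbolInfo> symbols, int start, int end)
    {
        if (end - start < 1) return;                    // single symbol: done

        double total = 0;
        for (int i = start; i <= end; i++) total += symbols[i].Frequency;

        // Find the split point that divides the total frequency most evenly.
        int split = start;
        double running = 0, bestDiff = double.MaxValue;
        for (int i = start; i < end; i++)
        {
            running += symbols[i].Frequency;
            double diff = Math.Abs((total - running) - running);
            if (diff < bestDiff) { bestDiff = diff; split = i; }
        }

        // Append 0 to the upper half, 1 to the lower half, then recurse.
        for (int i = start; i <= split; i++) symbols[i].Code += "0";
        for (int i = split + 1; i <= end; i++) symbols[i].Code += "1";

        AssignCodes(symbols, start, split);
        AssignCodes(symbols, split + 1, end);
    }
}
```

Note that the split balances the total frequency of the two halves, not the frequency of individual symbols, which is exactly where the behaviour described next comes from.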
This has a weakness, though: While it's true that the more frequent symbols, when grouped together, will have a shorter encoding on average than the less frequent symbols, it is not necessarily the case for each and every symbol that it gets a shorter encoding assigned to it.
For more information, see a question I posted a while back over on Computer Science about this very issue.

Related

Hungarian Algorithm for non square matrix

I'm trying to implement the Hungarian algorithm. Everything is fine except for when the matrix isn't square. All the methods I've found say that I should make it square by adding dummy rows/columns and filling the dummy row/column with the maximum number in the matrix. My question is: won't this affect the final result? Shouldn't the dummy row/column be filled with at least max+1?
The dummy values should all be zero. The point is that it doesn't matter which one you choose, you're going to ignore those choices in the end because they weren't in the original data. By making them zero (at the start), your algorithm won't have to work as hard to find a value you're not going to use.
The main idea of the Hungarian algorithm is built on the fact that the optimal assignment of jobs remains the same if a number is added to or subtracted from all entries of any row or column of the matrix. Therefore, it does not matter whether you use max, max+1 or 0 as the dummy value. It can be any number, but 0 is the better choice (as Yay295 said, the algorithm has less work to do if the entries are already 0).
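For illustration, a small sketch of just the padding step discussed above; the PadToSquare name and the int[,] cost-matrix representation are assumptions, not part of the original answers:

```csharp
using System;

// Sketch of only the padding step: extend a rectangular cost matrix to a
// square one with dummy entries that default to zero.
class HungarianPaddingSketch
{
    public static int[,] PadToSquare(int[,] cost)
    {
        int rows = cost.GetLength(0);
        int cols = cost.GetLength(1);
        int n = Math.Max(rows, cols);

        var padded = new int[n, n];      // new (dummy) entries are already 0
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                padded[r, c] = cost[r, c];

        return padded;
    }
}
```

Assignments that end up in a dummy row or column are simply discarded afterwards, since they do not correspond to anything in the original data.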

Dijkstra algorithm expanded with extra limit variable

I am having trouble implementing this into my current path finding algorithm.
Currently I have Dijkstra written and it works like it should, but I need to go a step further and add a limit (range). I can explain better with an image:
Let's say I have a range of 80. I want to go from A to E. My current algorithm works as it should, so it results in A->B->E.
However, I need to travel only on edges whose weight does not exceed the range (80), which means that A->B->E is no longer an option; the route would be A->C->D->B->E (considering that the range/limit resets at every stop).
So far, I have implemented a bool named Possible which returns, for a single leg of the path (e.g. A->B), whether it is possible given my limit/range.
My main problem is that I do not know where/how to start. My only idea was to see where Possible is false (A->B on the total route A->B->E) and run the algorithm from A to E again while excluding the B stop/vertex.
Is this a good approach? As far as I understand it, it would roughly double my running time.
I see two ways of doing this:
Create a new graph G' that contains only edges with weight < 80 and look for the shortest path there. Building the reduced graph takes O(V+E) time and O(V+E) additional memory.
Change Dijkstra's algorithm to ignore edges with weight > 80: simply skip such edges when relaxing the neighbouring vertices. The complexity and memory usage stay the same in this case; a sketch of this variant follows below.
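Here is a minimal sketch of that second option, assuming an adjacency-list representation with (to, weight) tuples; the method name, the limit parameter and the use of the .NET 6 PriorityQueue are illustrative choices, not part of the original answer:

```csharp
using System;
using System.Collections.Generic;

// Standard Dijkstra that simply skips edges heavier than the limit during
// relaxation. Requires .NET 6+ for PriorityQueue<TElement, TPriority>.
class LimitedDijkstraSketch
{
    public static int[] ShortestPaths(List<(int to, int weight)>[] adj, int source, int limit)
    {
        int n = adj.Length;
        var dist = new int[n];
        Array.Fill(dist, int.MaxValue);
        dist[source] = 0;

        var queue = new PriorityQueue<int, int>();
        queue.Enqueue(source, 0);

        while (queue.TryDequeue(out int u, out int d))
        {
            if (d > dist[u]) continue;                // stale queue entry

            foreach (var (v, w) in adj[u])
            {
                if (w > limit) continue;              // skip edges above the range

                if (dist[u] + w < dist[v])
                {
                    dist[v] = dist[u] + w;
                    queue.Enqueue(v, dist[v]);
                }
            }
        }
        return dist;
    }
}
```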
Create a temporary version of your graph, and set all weights above the threshold to infinity. Then run the ordinary Dijkstra algorithm on it.
Complexity will increase or not, depending on your version of the algorithm:
if you have O(V^2) then it will increase to O(E + V^2)
if you have the O(ElogV) version then it will increase to O(E + ElogV)
if you have the O(E + VlogV) version it will remain the same
As noted by ArsenMkrt, you can also remove these edges, which makes even more sense but makes the complexity a bit worse. Modifying the algorithm to just skip those edges seems to be the best option though, as he suggested in his answer.

Parse 2D array to rectangles

I'm looking for a way to convert a 2D array to the fewest possible rectangles like in this example:
X
12345678
--------
1|00000000
2|00011100
3|00111000
Y 4|00111000
5|00111000
6|00000000
to the corner coordinates of the rectangles:
following the (x1,y1);(x2,y2) template
rectangle #1 (4,2);(6,2)
rectangle #2 (3,3);(5,5)
There has been a similar question here before, but unfortunately the link provided in its answer is broken and I cannot check it anymore.
I'd like to do this in C# but any kind of help is appreciated.
(It doesn't even have to be the fewest possible rectangles, but the fewer the better :) )
Thanks in advance!
I think that you are trying to cover a set of points in the 2D plane with the minimum required number of rectangles. An answer to Find k rectangles so that they cover the maximum number of points said that this was an NP-complete problem and linked to here (which works for me). A Google search finds http://2011.cccg.ca/PDFschedule/papers/paper102.pdf.
These papers agree that rectangle covering is NP-complete but do not actually prove it, and the references for this seem to be unusually elusive - https://cstheory.stackexchange.com/questions/3957/prove-that-the-problem-of-rectilinear-picture-compression-is-np-complete
What I take from these documents is this:
It is unlikely that there is an affordable way of getting the absolutely best answer for large problems. You might therefore have to either spend a lot of time getting exact answers for problems that are in some sense small, by exhausting over all possible alternatives or perhaps by using something like branch and bound, or settle for affordable methods - like greedy search, beam search, or limited discrepancy search - which are not guaranteed to give you the absolutely best answer.
In this case there seem to be more restricted versions of this problem which are not NP-complete. You might possibly read a paper and find that there is some detail of your problem that means one of these methods applies to you. One example is "An Algorithm for Constructing Regions with Rectangles: Independence and Minimum Generating Sets for Collections of Intervals" by Franzblau and Kleitman - I found this in the ACM Digital Library, though, so I don't know if it is generally accessible. It works for a restricted set of polygons.
This may help you get started. If you convert the binary data to numbers, you get this:
0
28
56
56
56
0
So wherever there are consecutive equal numbers, there is a rectangle.
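As a rough sketch of that idea (it only covers grids like the example, where each row contains at most one run of 1s; the method name and the string-based input are illustrative assumptions):

```csharp
using System;

// Each row string is read as a binary number; a run of consecutive equal
// non-zero values marks one rectangle. Coordinates are printed 1-based to
// match the (x1,y1);(x2,y2) template in the question.
class RowRectangleSketch
{
    public static void PrintRectangles(string[] rows)
    {
        var values = new int[rows.Length];
        for (int i = 0; i < rows.Length; i++)
            values[i] = Convert.ToInt32(rows[i], 2);

        int start = 0;
        for (int i = 1; i <= rows.Length; i++)
        {
            if (i == rows.Length || values[i] != values[start])
            {
                if (values[start] != 0)
                {
                    // Column range = position of the run of 1s in the row.
                    int x1 = rows[start].IndexOf('1') + 1;
                    int x2 = rows[start].LastIndexOf('1') + 1;
                    Console.WriteLine($"({x1},{start + 1});({x2},{i})");
                }
                start = i;
            }
        }
    }
}
```

On the example grid this prints (4,2);(6,2) and (3,3);(5,5), matching the two rectangles above.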

Advanced C# pattern search in long string (100-25000 char)

Let me start with this: I can't zip it or anything similar.
What I'm trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color from my palette. The g is a space to divide the numbers. The 12 means 12 pixels in a row use that color. The h is to divide the numbers again.)
The problem I'm having is that the blocks aren't all the same length. They range from 0g1h to 2546g115h. Basically I want to create a palette of common patterns to hopefully save space. Say 12g345h19g12h190g11h occurs at least three times; then I could save space if I had something like a=12g345h19g12h190g11h in the palette array and just put 'a' in the string. Or it could even ignore the block boundaries - as you can see in the attached file, you get g640h a ton of times.
I could be wrong, but I'm pretty sure this could work. If you have a better idea how I could save space and not lose data, I'm more than open to the ideas.
Here is a great example since you can visually see the pattern: http://pastebin.com/5dbhxZQK. I chose this file because I knew it would have massive redundancy; most aren't this simple.
You could use a dictionary (probably Dictionary<string, int>) and just count how many times each pattern occurs, then go back and rewrite the string with the appropriate replacements.
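A rough sketch of that counting step, assuming the data can be split on the 'h' terminator; the CountBlocks name, the splitting rule and the threshold of three are illustrative assumptions, not something from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Split the data into its "<colour>g<length>h" blocks and count how often
// each block occurs, so the most frequent ones can later be replaced by
// short palette symbols.
class BlockCountSketch
{
    public static Dictionary<string, int> CountBlocks(string data)
    {
        var counts = new Dictionary<string, int>();
        foreach (var part in data.Split(new[] { 'h' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string block = part + "h";               // restore the terminator
            counts[block] = counts.TryGetValue(block, out int c) ? c + 1 : 1;
        }
        return counts;
    }

    // Blocks that occur at least three times are candidates for the palette.
    public static IEnumerable<string> PaletteCandidates(string data) =>
        CountBlocks(data).Where(kv => kv.Value >= 3).Select(kv => kv.Key);
}
```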
However, I would recommend that you read up a little about compression algorithms; what you are implementing appears to be a Run Length Encoding (RLE) scheme, and you are then trying to compress again on top of that. Consider looking at how sliding-window compression works (which is what GZIP does) as an alternative to your RLE, or look at Huffman encoding as a mechanism to reduce the amount of space needed for the codewords you are creating (in simple terms, Huffman encoding uses shorter symbols for more frequent patterns and longer symbols for less frequent patterns, in an 'optimal' way).
This is a fun problem space to play in! Good Luck!

All valid combinations of points, in the most (speed) effective way

I know there are quite some questions out there on generating combinations of elements, but I think this one has a certain twist to be worth a new question:
For a pet project of mine I have to pre-compute a lot of state to improve the runtime behavior of the application later. One of the steps I struggle with is this:
Given N tuples of two integers (let's call them points from here on, although they aren't points in my use case; they are roughly X/Y related, though), I need to compute all valid combinations for a given rule.
The rule might be something like
"Every point included excludes every other point with the same X coordinate"
"Every point included excludes every other point with an odd X coordinate"
I hope and expect that this fact leads to an improvement in the selection process, but my math skills are just being resurrected as I type and I'm unable to come up with an elegant algorithm.
The set of points (N) starts small, but outgrows 64 soon (for the "use long as bitmask" solutions)
I'm doing this in C#, but solutions in any language are fine as long as they explain the underlying idea.
Thanks.
Update in response to Vlad's answer:
Maybe my idea to generalize the question was a bad one. My rules above were invented on the fly and just placeholders. One realistic rule would look like this:
"Every point included excludes every other point in the triagle above the chosen point"
By that rule and by choosing (2,1) I'd exclude
(2,2) - directly above
(1,3) (2,3) (3,3) - next line
and so on
So the rules are fixed, not general. They are unfortunately more complex than the X/Y samples I initially gave.
How about "the x coordinate of every point included is the exact sum of some subset of the y coordinates of the other included points". If you can come up with a fast algorithm for that simply-stated constraint problem then you will become very famous indeed.
My point being that the problem as stated is so vague as to admit NP-complete or NP-hard problems. Constraint optimization problems are incredibly hard; if you cannot put extremely tight bounds on the problem then it very rapidly becomes not analyzable by machines in polynomial time.
For some special rule types your task seems to be simple. For example, for your example rule #1 you need to choose a subset of all possible values of X, and then for each value from the subset assign an arbitrary Y.
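As an illustration of that special case (not a general solution), here is a sketch that enumerates all valid combinations under rule #1 by grouping the points by X and then, for every group, either skipping it or picking exactly one of its points; all names and the tuple representation are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Enumerates every set of points in which no two points share an X coordinate.
class DistinctXCombinationsSketch
{
    public static IEnumerable<List<(int X, int Y)>> Combinations(IEnumerable<(int X, int Y)> points)
    {
        var groups = points.GroupBy(p => p.X).Select(g => g.ToList()).ToList();
        return Expand(groups, 0, new List<(int X, int Y)>());
    }

    static IEnumerable<List<(int X, int Y)>> Expand(
        List<List<(int X, int Y)>> groups, int index, List<(int X, int Y)> current)
    {
        if (index == groups.Count)
        {
            yield return new List<(int X, int Y)>(current);   // snapshot of this combination
            yield break;
        }

        // Option 1: take no point with this X value.
        foreach (var combo in Expand(groups, index + 1, current))
            yield return combo;

        // Option 2: take exactly one point with this X value.
        foreach (var p in groups[index])
        {
            current.Add(p);
            foreach (var combo in Expand(groups, index + 1, current))
                yield return combo;
            current.RemoveAt(current.Count - 1);
        }
    }
}
```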
For generic rules I doubt that it's possible to build an efficient algorithm without any AI.
My understanding of the problem is: given a method bool property( Point x ) const, find all points in the set for which property() is true. Is that reasonable?
The brute-force approach is to run all the points through property(), and store the ones which return true. The time complexity of this would be O( N ) where (a) N is the total number of points, and (b) the property() method is O( 1 ). I guess you are looking for improvements from O( N ). Is that right?
For certain kinds of properties, it is possible to improve on O( N ), provided a suitable data structure is used to store the points and suitable pre-computation (e.g. sorting) is done. However, this may not be true for an arbitrary property.
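For what it's worth, the brute-force version described above is essentially a one-liner; the tuple-based point representation and the property delegate are stand-ins for whatever the real types are:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// O(N) brute force: run every point through the predicate and keep the
// ones for which it returns true.
class BruteForceFilterSketch
{
    public static List<(int X, int Y)> Filter(
        IEnumerable<(int X, int Y)> points, Func<(int X, int Y), bool> property)
        => points.Where(property).ToList();
}
```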
