I have a search method in my custom A* algorithm. It uses a collection to keep track of what the search is doing.
For a set path I know I am doing the following with the collection:
Contains 860x (lookup)
Remove 91x
Add 270x
The order of sorting does not really matter unless I can find a way to specifically order it. It is possible to generate a unique ID for each node based on its X and Y values, making a dictionary lookup possible.
Is there any way to calculate, based on my method, what the best collection to use in this specific case would be?
Thanks in advance,
Smiley
The general consensus says:
If you don't run into performance issues, leave it alone.
If you do, but you can get away with it, leave it alone.
If you do, but you can't (or you just love your code to be tight), benchmark it, and you'll find out.
(Clarification: I didn't use the "premature optimization is the root of all evil" reference, because I do think there is a place for optimization. Here's a good article about the subject.)
From what you're saying, I doubt it'll make much difference unless you're running on a device with next to no resources; for the above numbers, I doubt you'll see any change.
Edit:
As per the chat room continuation, I would suggest looking into Hashtable and Dictionary; to be more specific, a SortedDictionary :).
For an interesting read about Hashtable vs. Dictionary in C#, you can look at this question and at this one.
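If you want to try the dictionary route, here is a minimal sketch of the idea (the Node type, the packing scheme and the class name are placeholders, not taken from your code): it packs X and Y into a single key so Contains, Add and Remove are all roughly O(1).

using System.Collections.Generic;

// Hypothetical node type; only the coordinates matter for the key.
class Node
{
    public int X;
    public int Y;
}

class NodeSet
{
    // Packing X and Y into one long gives a collision-free key,
    // assuming each coordinate fits in 32 bits.
    private readonly Dictionary<long, Node> nodes = new Dictionary<long, Node>();

    private static long Key(int x, int y)
    {
        return ((long)x << 32) | (uint)y;
    }

    public bool Contains(int x, int y) { return nodes.ContainsKey(Key(x, y)); }
    public void Add(Node n)            { nodes[Key(n.X, n.Y)] = n; }
    public bool Remove(int x, int y)   { return nodes.Remove(Key(x, y)); }
}

If you do end up needing ordering (for example, to pop the lowest-cost node), a SortedDictionary keyed on cost, or a separate priority queue, is the usual companion structure.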
Good luck, and feel free to post your results for others to learn.
I want to restart with data structures (and AI; I want to clear up all my misconceptions too. ;P)
For now I want to know how I would put the given pictorial info into an algorithm using C# structures. Image processing is not required here; I just need to feed the data in.
Please feel free to ask me to modify this question if it is not clear. :|
Say Arad is a city in Romania from which I have to go to another city, Bucharest.
The map also has info on how far all connecting cities are from any given city.
How would I use this info in a program to start with any searching or sorting algorithm?
Any pointers will be helpful. Tell me if this can be done using something other than a struct, something like a node perhaps; I don't know.
Please consider that I want to learn things, so I'm using C# for ease of use, not for its built-in searching and sorting functions. I might use those later to confirm my results.
The way you typically solve this problem is to create a node class and an edge class. Each node has a set of edges that have "lengths", and each edge connects two nodes. You then write a shortest-path algorithm that determines the least-total-length set of edges that connects two nodes.
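As a rough sketch of that shape (the names and fields here are illustrative, not taken from the linked articles):

using System.Collections.Generic;

class Node
{
    public string Name;                         // e.g. "Arad"
    public List<Edge> Edges = new List<Edge>(); // roads leaving this city
}

class Edge
{
    public Node From;
    public Node To;
    public double Length;                       // e.g. distance in km
}

// Building a tiny piece of the map by hand:
// var arad  = new Node { Name = "Arad" };
// var sibiu = new Node { Name = "Sibiu" };
// arad.Edges.Add(new Edge { From = arad, To = sibiu, Length = 140 });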
For a brief tutorial on how to do that, see my series of articles on the A* algorithm:
http://blogs.msdn.com/b/ericlippert/archive/tags/astar/
Although it's not exactly what you're looking for, Eric Lippert's series on graph colouring is an excellent step-by-step example of designing data structures and implementing algorithms (efficiently!) in C#. It has helped me a lot; I highly recommend reading it. Once you've worked your way through that, you will know much more about C# and you will understand some of the specific design tradeoffs that you may encounter based on your specific problem, including what data structure to use for a particular problem.
If you just want to look at raw algorithms, the shortest path problem has many algorithms defined for it over the years. I would recommend implementing the common Dijkstra's algorithm first. The Wikipedia article has pseudocode; with what you get out of Eric Lippert's series, you should be in good shape to develop an implementation of this. If you still want more step-by-step guidance, try a search for "Dijkstra's algorithm in C#".
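If it helps to see roughly what that pseudocode becomes in C#, here is an unoptimized sketch using a plain adjacency dictionary (city name to list of neighbour/distance pairs). It does a linear scan instead of using a priority queue, which is fine for a handful of Romanian cities but not for large graphs; treat it as an illustration, not a reference implementation.

using System.Collections.Generic;

static class Dijkstra
{
    // graph[city] = list of (neighbour, distance) pairs.
    // Assumes every city that appears as a neighbour is also a key in graph.
    public static Dictionary<string, double> ShortestDistances(
        Dictionary<string, List<KeyValuePair<string, double>>> graph,
        string source)
    {
        var dist = new Dictionary<string, double>();
        var visited = new HashSet<string>();
        foreach (var city in graph.Keys)
            dist[city] = double.PositiveInfinity;
        dist[source] = 0;

        while (visited.Count < graph.Count)
        {
            // Pick the unvisited city with the smallest tentative distance.
            string current = null;
            foreach (var city in graph.Keys)
                if (!visited.Contains(city) &&
                    (current == null || dist[city] < dist[current]))
                    current = city;

            if (double.IsPositiveInfinity(dist[current]))
                break; // everything left is unreachable

            visited.Add(current);

            // Relax all edges leaving the current city.
            foreach (var edge in graph[current])
                if (dist[current] + edge.Value < dist[edge.Key])
                    dist[edge.Key] = dist[current] + edge.Value;
        }
        return dist;
    }
}

// Usage sketch:
// var graph = new Dictionary<string, List<KeyValuePair<string, double>>>();
// graph["Arad"]  = new List<KeyValuePair<string, double>> { new KeyValuePair<string, double>("Sibiu", 140) };
// graph["Sibiu"] = new List<KeyValuePair<string, double>> { new KeyValuePair<string, double>("Arad", 140) };
// var d = Dijkstra.ShortestDistances(graph, "Arad");   // d["Sibiu"] == 140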
Hope that helps!
Forgive me if this is a silly question... but I think back to my Comp. Sci. classes and I distinctly remember learning/being quizzed on several sorting algorithms and the corresponding 'Big O' notation.
Outside of the classroom though, I've never actually written code to sort.
When I get results from a database, I use 'Order By'. Otherwise, I use a collection class that implements a sort. I have implemented IComparable to allow sorting; but I've never gone beyond that.
Was sorting always just an academic pursuit for those of us who don't implement languages/frameworks? Or is it just that modern languages running on modern hardware make it a trivial detail to worry about?
Finally, when I call .Sort on a List(Of String), for example, what sort algorithm is being used under the hood?
While you might rarely need to implement a sorting algorithm yourself, understanding the different algorithms and their complexity might help you in solving more complex problems.
Finally, when I call .Sort on a List(Of String), for example, what sort algorithm is being used under the hood?
Quick Sort
I've never once implemented my own sorting algorithm since I took my CS classes in college, and if I were ever even contemplating writing my own, I'd want my head examined first.
List<T> uses Quicksort per the MSDN documentation:
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
You probably won't implement your own sorting algorithm if you are using high-level languages...
What you learnt in the classroom was merely there to teach you the existence and importance of big O (omicron) notation.
It was there to make you aware that optimization is always a goal in programming, and that when you code something you must always think of how it will execute.
It teaches you that loops inside loops and recursion can lead to big performance problems if not analyzed/optimized well before coding starts.
It is guidance for checking your design beforehand and being able to approximate the execution speed.
It is important for a programmer to know how these algorithms work. One reason is that, under certain conditions, certain algorithms are better, although sorting is rarely the bottleneck.
In some frameworks, the .Sort function uses various methods, depending on the situation.
Modern languages running on modern hardware make it a trivial detail to worry about, unless a profiler shows that sorting is the bottleneck of your code.
According to this, List.Sort uses Array.Sort, which uses QuickSort.
IMO, it's become a bit of an academic exercise. You need to understand algorithmic complexity, and sorting is a good example for working through it because you can easily see the results and calculate the different complexities. In real life, though, there's almost certainly a library call that sorts your range faster than you would be able to do if you try to roll your own.
I don't know what the .NET libraries use for their default implementation, but I'd guess it's Quicksort or Shellsort. I'd be interested to find out if it's something else.
I've occasionally had to write my own sort methods, but only when I was writing for a relatively immature and underpowered platform (like .NET 1.1 on Windows CE, or embedded Java, or some such).
On anything resembling a modern computer, the best sorting algorithm is the ORDER BY clause in T-SQL.
Implementing your own sort is the kind of thing that you do to gain insight in how algorithms work, what the tradeoffs are, which tried-and-true approaches that solve a wide array of problems efficiently are known, etc.
As Darin Dimitrov's answer states, library sort routines need to have a very competitive average-case performance, so quicksort is typically chosen.
Was sorting always just an academic pursuit for those of us who don't implement languages/frameworks? Or is it just that modern languages running on modern hardware make it a trivial detail to worry about?
Even if you're not implementing your own routine, you may know how your data is likely to be arranged, and may want to choose a suitable library algorithm. For example, see this discussion on nearly sorted data.
I think there are times when you need to have a custom sorting method.
What if you wanted to sort by the make of cars, but not alphabetically?
For example, you have a database with the makes: Honda, Lexus, Toyota, Acura, Nissan, Infiniti.
If you use a plain alphabetical sort, you get the order: Acura, Honda, Infiniti, Lexus, Nissan, Toyota.
What if you wanted to sort them based on a car company's standard and luxury class together? Lexus, Toyota, Honda, Acura, Nissan, Infiniti.
I think you would need a custom sort method if that's the case.
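One way to express that kind of ordering is a rank table plus a custom comparer; a sketch (the class name and the rank table below just mirror the example order above):

using System;
using System.Collections.Generic;

class MakeComparer : IComparer<string>
{
    // Desired order; any make not listed here sorts to the end.
    private static readonly Dictionary<string, int> Rank =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            { "Lexus", 0 }, { "Toyota", 1 },
            { "Honda", 2 }, { "Acura", 3 },
            { "Nissan", 4 }, { "Infiniti", 5 }
        };

    private static int RankOf(string make)
    {
        int r;
        return Rank.TryGetValue(make, out r) ? r : int.MaxValue;
    }

    public int Compare(string x, string y)
    {
        return RankOf(x).CompareTo(RankOf(y));
    }
}

// var makes = new List<string> { "Honda", "Lexus", "Toyota", "Acura", "Nissan", "Infiniti" };
// makes.Sort(new MakeComparer());  // Lexus, Toyota, Honda, Acura, Nissan, Infiniti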
I'm writing a bot that will analyse posts and reply with vaguely related strings from a database. I'm not aiming for coherence, just for a vague similarity that could pass as someone ignorant of the topic (but knowledgeable enough to try to reply). What are some methods that would help me choose the right reply?
One thing I've come up with is to create a vocabulary list, check which elements of the list are in the post, and pick a reply from the database based on these results. This crude method has been successful about 10% of the time (based on 100 replies to random posts). I might expand the list with more words, but this method has its limits. Any better ones?
(P.S. The database is sizeable: about 500,000 replies.)
First of all, I think the best you can hope for will be about a 50% answer rate, unless you're prepared to write a lot of code.
If you're willing to get your hands dirty with some statistics, check out term frequency–inverse document frequency. Basically, you will use the frequency of uncommon words to determine what keywords are critical to the document, and use this as the input into the tf-idf algorithm to pull out other replies with those same keywords.
You can then combine this further with whitelisting and blacklisting techniques to ignore common words and prioritize certain keywords. You can then keep tuning those lists to enhance the algorithm as you see it work.
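As a very rough sketch of the idea in C# (assuming posts and replies are already tokenized into lower-case words; the method and names here are made up for illustration, and this only does the idf-weighting part, not a full tf-idf vector model):

using System;
using System.Collections.Generic;
using System.Linq;

static class ReplyScorer
{
    // Scores each candidate reply against the post: each shared word is
    // weighted by its inverse document frequency, log(N / df), so rare
    // words count for much more than common ones.
    public static int BestReplyIndex(string[] postWords, List<string[]> replies)
    {
        int n = replies.Count;

        // Document frequency: in how many replies does each word appear?
        var df = new Dictionary<string, int>();
        foreach (string[] reply in replies)
            foreach (string word in reply.Distinct())
            {
                int count;
                df.TryGetValue(word, out count);
                df[word] = count + 1;
            }

        var postSet = new HashSet<string>(postWords);
        int bestIndex = -1;
        double bestScore = 0;

        for (int i = 0; i < n; i++)
        {
            double score = 0;
            foreach (string word in replies[i].Distinct())
                if (postSet.Contains(word))
                    score += Math.Log((double)n / df[word]);

            if (score > bestScore)
            {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex; // -1 means nothing overlapped at all
    }
}

From there, the whitelist/blacklist tuning mentioned above is just a matter of filtering postWords before scoring.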
There are also simpler string metrics you can use to test basic similarity. Take a look at this list of string metrics.
You might want to look into vector-space mapping and resemblance. The "vaguely related" problem could be handled by resemblance statistical analysis most likely.
Check out this novel use of resemblance:
http://www.cromwell-intl.com/security/attack-study/
There is a PHP function called similar_text(), used like similar_text($str1, $str2, $percent_similar); which puts a similarity percentage into $percent_similar. It works fairly well, but I didn't come up with anything similar in C#. If you could get hold of the source for the PHP function, you might try to translate it. I think there may be a Java version as well.
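For what it's worth, the idea behind similar_text (find the longest common substring, then recurse on the pieces to the left and to the right of it) is not hard to sketch in C#; this is an untested approximation of that idea, not a faithful port:

using System;

static class SimilarText
{
    // Returns the number of matching characters, similar in spirit to PHP's
    // similar_text(): longest common substring, then recurse on both sides.
    public static int Similar(string a, string b)
    {
        if (a.Length == 0 || b.Length == 0)
            return 0;

        int bestA = 0, bestB = 0, bestLen = 0;

        // Find the longest common substring (naive scan; fine for short strings).
        for (int i = 0; i < a.Length; i++)
            for (int j = 0; j < b.Length; j++)
            {
                int len = 0;
                while (i + len < a.Length && j + len < b.Length &&
                       a[i + len] == b[j + len])
                    len++;
                if (len > bestLen)
                {
                    bestLen = len;
                    bestA = i;
                    bestB = j;
                }
            }

        if (bestLen == 0)
            return 0;

        return bestLen
             + Similar(a.Substring(0, bestA), b.Substring(0, bestB))
             + Similar(a.Substring(bestA + bestLen), b.Substring(bestB + bestLen));
    }

    // Percentage, like the by-reference percent argument of similar_text().
    public static double Percent(string a, string b)
    {
        if (a.Length + b.Length == 0) return 0;
        return Similar(a, b) * 2.0 * 100.0 / (a.Length + b.Length);
    }
}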
I need an algorithm that can compare two text files, highlight their differences and (even better!) compute their difference in a meaningful way (i.e. two similar files should have a higher similarity score than two dissimilar files, with the word "similar" defined in the usual sense). It sounds easy to implement, but it's not.
The implementation can be in C# or Python.
Thanks.
I can recommend taking a look at Neil Fraser's code and articles:
google-diff-match-patch
Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.
Neil Fraser: Diff Strategies - for theory and implementation notes
In Python, there is difflib, as others have also suggested.
difflib offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:
import difflib

def text_compare(text1, text2, isjunk=None):
    return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
Look at difflib. (Python)
That will calculate the diffs in various formats. You could then use the size of the context diff as a measure of how different two documents are?
My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers "middle-snake" method with the Hirschberg linear space refinement.
The Myers algorithm is described in:
E. Myers, "An O(ND) Difference Algorithm and Its Variations," Algorithmica 1, 2 (1986), 251-266.
The GNU diff utility uses the Myers algorithm.
The "similarity score" you speak of is called the "edit distance" in the literature which is the number of inserts or deletes necessary to transform one sequence into the other.
Note that a number of people have cited the Levenshtein distance algorithm but that is, albeit easy to implement, not the optimal solution as it is inefficient (requires the use of a possibly huge n*m matrix) and does not provide the "edit script" which is the sequence of edits that could be used to transform one sequence into the other and vice versa.
For a good Myers / Hirschberg implementation look at:
http://www.ioplex.com/~miallen/libmba/dl/src/diff.c
The particular library it is contained within is no longer maintained, but to my knowledge the diff.c module itself is still correct.
Mike
Bazaar contains an alternative difference algorithm, called patience diff (there's more info in the comments on that page) which is claimed to be better than the traditional diff algorithm. The file 'patiencediff.py' in the bazaar distribution is a simple command line front end.
If you need a finer granularity than lines, you can use the Levenshtein distance. The Levenshtein distance is a straightforward measure of how similar two texts are.
You can also use it to extract the edit operations and produce a very fine-grained diff, similar to that on the edit history pages of SO.
Be warned, though, that the Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib, as Douglas Leder suggested, is most likely going to be faster.
Cf. also this answer.
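If you do want to try it, a plain two-row dynamic-programming version is short (sketched here in C#, since the question allows either language); note that it only returns the distance, not the edit script:

using System;

static class Levenshtein
{
    // Classic dynamic-programming edit distance, keeping only two rows
    // so memory stays proportional to one string length instead of n*m.
    public static int Distance(string s, string t)
    {
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];

        for (int j = 0; j <= t.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,    // insertion
                             previous[j] + 1),      // deletion
                    previous[j - 1] + cost);        // substitution
            }
            // Swap rows for the next iteration.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[t.Length];
    }
}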
There are a number of distance metrics; as paradoja mentioned, there is the Levenshtein distance, but there are also NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first; it implements a bunch of algorithms.
As stated, use difflib. Once you have the diffed output, you may find the Levenshtein distance of the differing strings to give a "value" of how different they are.
You could use the solution to the Longest Common Subsequence (LCS) problem. See also the discussion about possible ways to optimize this solution.
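For reference, the classic dynamic-programming formulation of the LCS length is only a few lines (a sketch in C#; a similarity score could then be something like 2 * LCS / (len1 + len2)):

using System;

static class Lcs
{
    // Length of the longest common subsequence of a and b.
    public static int Length(string a, string b)
    {
        int[,] dp = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                dp[i, j] = a[i - 1] == b[j - 1]
                    ? dp[i - 1, j - 1] + 1
                    : Math.Max(dp[i - 1, j], dp[i, j - 1]);
        return dp[a.Length, b.Length];
    }
}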
One method I've employed for a different functionality, to calculate how much data was new in a modified file, could perhaps work for you as well.
I have a diff/patch implementation in C# that allows me to take two files, presumably the old and new version of the same file, and calculate the "difference", but not in the usual sense of the word. Basically I calculate a set of operations that I can perform on the old version to update it to have the same contents as the new version.
To use this for the functionality initially described, to see how much data was new, I simply ran through the operations: every operation that copied from the old file verbatim got a 0-factor, and every operation that inserted new text (distributed as part of the patch, since it didn't occur in the old file) got a 1-factor. All characters were given this factor, which gave me basically a long list of 0's and 1's.
All I then had to do was to tally up the 0's and 1's. In your case, with my implementation, a low number of 1's compared to 0's would mean the files are very similar.
This implementation would also handle cases where the modified file had inserted copies from the old file out of order, or even duplicates (i.e. you copy a part from the start of the file and paste it near the bottom), since they would both be copies of the same original part of the old file.
I experimented with weighting copies, so that the first copy counted as 0 and subsequent copies of the same characters got progressively higher factors, in order to give a copy/paste operation some "new factor", but I never finished it as the project was scrapped.
If you're interested, my diff/patch code is available from my Subversion repository.
Take a look at the Fuzzy module. It has fast algorithms (written in C) for Soundex, NYSIIS and double metaphone.
A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528