Telling the difference between two large pieces of text - c#

What would be the best way to compare big paragraphs of text in order to identify the differences? For example, if string A and string B are the same except for a few missing words, how would I highlight those words?
Originally I thought of breaking the text down into word arrays and comparing the elements. However, this breaks down when a word is deleted or inserted, because every element after that point no longer lines up.

Use a diff algorithm.

I saw this a few months back when I was working on a small project, but it might set you on the right track.
http://www.codeproject.com/KB/recipes/DiffAlgorithmCS.aspx

You want to look into Longest Common Subsequence algorithms. Most languages have a library which will do the dirty work for you, and here is one for C#. Searching for "C# diff" or "VB.Net diff" will help you find additional libraries that suit your needs.
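To make the Longest Common Subsequence idea concrete, here is a minimal word-level sketch. It is written in Java rather than C# (the loops and arrays translate almost line for line), and the class name `WordDiff` and the `"- "` / `"+ "` output markers are my own choices for illustration, not taken from any of the linked libraries.

```java
import java.util.ArrayList;
import java.util.List;

class WordDiff {
    // Returns one entry per word: "  word" = unchanged, "- word" = only in oldText,
    // "+ word" = only in newText. Words are split on whitespace (an assumption).
    static List<String> diff(String oldText, String newText) {
        String[] a = oldText.trim().split("\\s+");
        String[] b = newText.trim().split("\\s+");

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table from the start, emitting matches, deletions and insertions.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add("  " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a[i++]);
            } else {
                out.add("+ " + b[j++]);
            }
        }
        while (i < a.length) out.add("- " + a[i++]);
        while (j < b.length) out.add("+ " + b[j++]);
        return out;
    }

    public static void main(String[] args) {
        diff("the quick brown fox jumps", "the brown fox quickly jumps")
                .forEach(System.out::println);
    }
}
```

The table needs memory proportional to the product of the two word counts, so for very large texts a dedicated diff library is likely the better choice.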

Usually text difference is measured in terms of edit distance, which is essentially the number of character additions, deletions or changes necessary to transform one text into the other.
A common implementation of this algorithm uses dynamic programming.
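For reference, a minimal dynamic-programming sketch of edit distance (the Levenshtein variant, where insertions, deletions and substitutions each cost 1). It is in Java, not C#, and the class and method names are placeholders of mine.

```java
class EditDistance {
    // Levenshtein distance using two rolling rows instead of a full table.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;   // "" -> t[0..j) needs j insertions

        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;                                     // s[0..i) -> "" needs i deletions
            for (int j = 1; j <= t.length(); j++) {
                int substitution = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

This only gives the number of edits. To highlight the differences you would keep the full table and trace back through it, much as a diff algorithm does.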

Here is an implementation of a Merge Engine that compares two HTML files and shows the highlighted differences: http://www.codeproject.com/KB/string/htmltextcompare.aspx

If it's a one-shot deal, save them both in MS Word and use the document compare function.

Related

How to find minimum replacement strings or regex to convert string to another string

OK, the title may not be accurate, but it is the best I could come up with.
My question is this:
Example 1
see , saw
I can convert see to saw by replacing ee with aw:
string srA = "see";
string srB = "saw";
srA = srA.Replace("ee", "aw"); // srA is now "saw", the same as srB
Or, let's say:
show , shown
add n to the original string
Now what I want is to generate such procedures for any pair of compared strings, with as little code as possible.
I'm looking for ideas on how to do this. Can I generate regexes automatically to apply the conversion?
Check DiffPlex and see if it is what you need. If you want to create a custom algorithm instead of using a 3rd party library, just go through the code; it's open source.
You might also want to check this work for optimizations, but it might get complicated.
Then there's also Diff.NET.
Also this blog post is part of a series in implementing a diff tool.
If you're simply interested in learning more about the subject, your googling efforts should be directed to the Levenshtein algorithm.
I can only assume what your end goal is, and the time you're willing to invest in this, but I believe the first library should be enough for most needs.
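For the simple cases in the question (see/saw, show/shown) a full diff library may be more than is needed. A rough sketch, assuming a single contiguous change is acceptable: strip the common prefix and common suffix, and whatever remains in the middle is the replacement. This is Java rather than C#, and `SingleEdit` and `Edit` are made-up names for illustration. When the two strings differ in several separate places it still returns one (larger) replacement, which is where a real diff does better.

```java
class SingleEdit {
    // One contiguous edit: replace `removed` at index `start` with `inserted`.
    static final class Edit {
        final int start;
        final String removed;
        final String inserted;

        Edit(int start, String removed, String inserted) {
            this.start = start;
            this.removed = removed;
            this.inserted = inserted;
        }

        String applyTo(String source) {
            return source.substring(0, start) + inserted + source.substring(start + removed.length());
        }
    }

    static Edit between(String from, String to) {
        int max = Math.min(from.length(), to.length());

        int prefix = 0;                       // length of the shared prefix
        while (prefix < max && from.charAt(prefix) == to.charAt(prefix)) prefix++;

        int suffix = 0;                       // length of the shared suffix (not overlapping the prefix)
        while (suffix < max - prefix
                && from.charAt(from.length() - 1 - suffix) == to.charAt(to.length() - 1 - suffix)) {
            suffix++;
        }

        return new Edit(prefix,
                from.substring(prefix, from.length() - suffix),
                to.substring(prefix, to.length() - suffix));
    }

    public static void main(String[] args) {
        Edit e = between("see", "saw");
        System.out.println("replace \"" + e.removed + "\" with \"" + e.inserted + "\" at index " + e.start);
    }
}
```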

To find out the number of occurrences of words in a file

I came across this question in an interview:
We have to find out the number of occurrences of two given words in a text file with <=n words between them.
Example1:
text:`this is first string this is second string`
Keywords:`this, string`
n= 4
output= 2
"this is first string" is the first occurrence; the number of words between this and string is 2 (is, first), which is less than 4.
"this is second string" is the remaining string; the number of words between this and string is 2 (is, second), which is less than 4.
Therefore the answer is 2.
I have thought that I will use
Dictionary<string, List<int>>.
My idea was that I use the dictionary and get the list of places where the particular word is repeated and then iterate through both the lists, increment the count if a condition is met and then display the count.
Is my thinking process correct? Please provide any suggestions to improve my solution.
Thanks,
Not an answer per-se (as quite honestly, I don't understand the question :P), but to add some general interview advice to the other answers:
In interviews the interviewer is always looking for the thought process and that you are a critical, logical thinker. Not necessarily that you have excellent coding recall and can compile code in your brain.
In addition interviews are a stressful process. By slowing down and talking out loud as you work things out you not only look like a better communicator and logical thinker (even if getting the question wrong), you also give yourself time to think.
Use a pen and paper, speak as you think, start off from the top and work through it. I've got jobs even if I didn't know the answers to tech questions by demonstrating that I can at least try to work things out ;-)
In short, it's not just down to technical prowess.
I think it depends on whether the call is made only once or multiple times per string. If it's something like
int getOccurences(String str, String reference, int min_size) { ... }
then you don't really need the dictionary, not even a list. You can just iterate through the string to find occurrences of words and then check the number of separators between them.
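A minimal single-pass sketch of that idea, in Java. The question is ambiguous, so this assumes one particular reading: the first keyword must come before the second, each occurrence is paired at most once, and the two keywords are different. `ProximityCount` and its parameter names are invented for the example.

```java
class ProximityCount {
    // Counts occurrences of `first` followed by `second` with at most
    // `maxBetween` words in between, scanning the text once.
    static int count(String text, String first, String second, int maxBetween) {
        String[] words = text.trim().split("\\s+");
        int lastFirst = -1;   // index of the most recent unpaired `first`, or -1
        int count = 0;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(first)) {
                lastFirst = i;
            } else if (words[i].equals(second) && lastFirst >= 0) {
                if (i - lastFirst - 1 <= maxBetween) count++;
                lastFirst = -1;   // this `first` has been used (or is too far away)
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("this is first string this is second string", "this", "string", 4));
    }
}
```

With the example from the question this prints 2.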
If on the other hand the problem is for arbitrary search/indexing, IMHO you do need a dictionary. I'd go for a dictionary where the key is the word and the value is a list of indexes where it occurs.
HTH
If you need to do that repeatedly for different pairs of words in the same text, then a word dictionary with a list of indexes is a good solution. However, if you were only looking for one pair, then two lists of indexes for those two words would be sufficient.
The lists allow you to separate the word detection operation from the counting logic.
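A sketch of the dictionary-of-indexes variant, in Java (`Map<String, List<Integer>>` playing the role of `Dictionary<string, List<int>>`). It assumes the same reading of the question as a single left-to-right scan would: the first keyword precedes the second, each occurrence is paired at most once, and the keywords differ. The names `WordIndex`, `build` and `countPairs` are mine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordIndex {
    // Word detection: map every word to the (ascending) positions where it occurs.
    static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            index.computeIfAbsent(words[i], k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    // Counting logic: works on the two position lists only.
    static int countPairs(Map<String, List<Integer>> index, String first, String second, int maxBetween) {
        List<Integer> starts = index.getOrDefault(first, Collections.emptyList());
        List<Integer> ends = index.getOrDefault(second, Collections.emptyList());
        int count = 0;
        int i = 0;
        for (int end : ends) {
            int last = -1;   // closest `first` since the previous `second`
            while (i < starts.size() && starts.get(i) < end) last = starts.get(i++);
            if (last >= 0 && end - last - 1 <= maxBetween) count++;
        }
        return count;
    }
}
```

The index is built once, after which any pair of keywords can be counted without rescanning the text.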

Approximate string matching

I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the suffix at the end of the company name and abbreviated (short) names.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
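In case the links above go stale, here is a compact sketch of the usual textbook formulation of Jaro-Winkler, in Java (a C# port is mechanical). It uses the common defaults of a prefix scale of 0.1 and a prefix cap of 4 characters. Some implementations only apply the prefix bonus above a threshold score; that refinement is left out here, so treat this as an illustration and not a drop-in replacement for a tested library.

```java
class JaroWinkler {
    static double jaro(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) return 1.0;
        // Characters count as matching only if they are within this many positions.
        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sMatched = new boolean[s.length()];
        boolean[] tMatched = new boolean[t.length()];
        int matches = 0;
        for (int i = 0; i < s.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(t.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Matched characters that appear in a different order; two of them make one transposition.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[j]) j++;
            if (s.charAt(i) != t.charAt(j)) outOfOrder++;
            j++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / s.length() + m / t.length() + (m - transpositions) / m) / 3.0;
    }

    static double jaroWinkler(String s, String t) {
        double jaro = jaro(s, t);
        int limit = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;   // shared prefix length, capped at 4
        while (prefix < limit && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}
```

Scores run from 0 (nothing in common) to 1 (identical). The comparison is case-sensitive, so lower-case both names first if case should not matter.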
If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute
distance (normalize("foo corp."),
normalize("FOO CORPORATION") )
you should get the result to be 0 rather than 14 (which is what you would get if you computed levenshtein edit-distance).
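A small sketch of what such a normalize step might look like, in Java (easy to port to C#). The abbreviation table is a hypothetical starting point that you would extend with the forms seen in your own data, and the regex only keeps ASCII letters and digits, which is an assumption that will not suit names with accented characters. A plain Levenshtein function is included so the example is self-contained.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CompanyNames {
    // Hypothetical abbreviation table; extend it for your own data.
    private static final Map<String, String> LONG_FORMS = new HashMap<>();
    static {
        LONG_FORMS.put("ltd", "limited");
        LONG_FORMS.put("inc", "incorporated");
        LONG_FORMS.put("corp", "corporation");
    }

    // Lower-case, drop punctuation, collapse whitespace, expand known abbreviations.
    static String normalize(String name) {
        String cleaned = name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", "");
        StringBuilder out = new StringBuilder();
        for (String word : cleaned.trim().split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(LONG_FORMS.getOrDefault(word, word));
        }
        return out.toString();
    }

    // Standard Levenshtein edit distance (two rolling rows).
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost, Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("foo corp.", "FOO CORPORATION"));
        System.out.println(distance(normalize("foo corp."), normalize("FOO CORPORATION")));
    }
}
```

Because punctuation is dropped, "W.E.S. Engineering" and "WES Engineering" also normalize to the same string.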
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alpha-numeric characters gives you a match, and is the easiest to do as you can pre-compute the data on each side, then do a straight equals match which will be a lot faster than cross multiplying and calculating the edit distance.
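A sketch of the precompute-then-equals idea, in Java. Lower-casing is an extra step I added on top of stripping non-alphanumeric characters, and `NameKeyIndex` is a made-up name. Note that this only matches names that differ in punctuation, spacing or case: "companyA" on its own gets a different key from "companyA pty ltd", so a missing suffix still needs separate handling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NameKeyIndex {
    // Comparison key: lower-cased, with everything except ASCII letters and digits removed.
    static String key(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Precompute the keys once so that each later lookup is a single hash probe.
    static Map<String, List<String>> index(List<String> names) {
        Map<String, List<String>> byKey = new HashMap<>();
        for (String name : names) {
            byKey.computeIfAbsent(key(name), k -> new ArrayList<>()).add(name);
        }
        return byKey;
    }
}
```

Looking up `index(names).get(key(candidate))` then returns every stored name that shares the candidate's key, or null if there is none.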
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with name matching requirements similar to the ones you describe.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, then there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple search will give you all the details.
You can implement all of them in C#
The irony is that they work when you match two given input strings, which is fine in theory and for demonstrating how fuzzy or approximate string matching works.
However, a grossly understated point is how to use them in production settings. Not everyone I know who went looking for an approximate string matching algorithm knew how to apply it in a production environment.
In the linked answer I talked about Lucene, which is specific to Java, but there is also Lucene for .NET.
https://lucenenet.apache.org/

Algorithm for text classification

I have millions of short (up to 30 words) documents which I need to split into several known categories. It's possible that a document matches several of the categories (seldom, but possible). It's also possible that a document doesn't match any of the categories (also seldom). I also have millions of documents which have already been categorized. I don't need to do it fast. I need to be sure that the algorithm categorizes correctly (as far as possible).
What algorithm should I use? Is there an implementation of it in C#?
Thank you for your help!
Take a look at term frequency and inverse document frequency (tf-idf) to find the important words in each document, and at cosine similarity to assign documents to categories based on how similar they are.
EDIT:
Found an example here
Interesting articles :
A self-organizing semantic map for information retrieval
WEBSOM - self-organizing maps of document collections
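To show how term frequency, inverse document frequency and cosine similarity fit together, here is a rough Java sketch (the question asks for C#, but the structure carries over directly). It uses raw term counts and a smoothed idf of log((1 + N) / (1 + df)) + 1, which is one common weighting among several. The tokenizer is a naive split on non-word characters, and all names are my own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TfIdf {
    // Naive tokenizer: lower-case, split on anything that is not a letter, digit or underscore.
    static List<String> tokens(String doc) {
        List<String> out = new ArrayList<>();
        for (String t : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // In how many documents does each term appear?
    static Map<String, Integer> documentFrequency(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : new HashSet<>(tokens(doc))) df.merge(term, 1, Integer::sum);
        }
        return df;
    }

    // Sparse tf-idf vector: term -> count * smoothed idf.
    static Map<String, Double> vector(String doc, Map<String, Integer> df, int totalDocs) {
        Map<String, Double> v = new HashMap<>();
        for (String term : tokens(doc)) v.merge(term, 1.0, Double::sum);
        v.replaceAll((term, tf) ->
                tf * (Math.log((1.0 + totalDocs) / (1.0 + df.getOrDefault(term, 0))) + 1.0));
        return v;
    }

    // Cosine of the angle between two sparse vectors; 0 when either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```

One simple way to classify with this is to compute the vector of a new document and assign it the category of the most similar already-categorized documents. With millions of documents you would want an inverted index instead of comparing against every vector.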
The major issue here, IMHO, is the length of the documents. I think I would call it phrase classification, and there is work going on in this area because of Twitter. You could bring in additional text by performing a web search on the 30 words and then analyzing the top matches. There is a paper about this but I can't find it right now. Then I would try a feature vector approach (tf-idf as in Jimmy's answer) and a multiclass SVM for classification.
Perhaps a decision tree combined with a NN?
You can use an SVM algorithm to classify text in C# with the libsvm.net library.

How can I generate pseudo-random "readable" strings in Java?

Generating a truly random string of a given length is a fairly straightforward (and already well-covered) task.
However, I'd like to generate a "pseudo" random string with the additional constraint that it be relatively easily readable (to a native-English reader).
I think another way to say this is that the generated string should consist of "recognizable syllables." For example, "akdjfwv" is a random string, but it's not recognizable at all. "flamyom", however, is very "recognizable" (even though it's nonsense).
Obviously, one could make a long list of "recognizable syllables," and then randomly select them.
But, is there a better way to do something like programmatically generate a "recognizable syllable," or generate a "syllable" and then test it to see if it's "recognizable"?
I can think of several ways to go about this implementation, but if someone has already implemented it (preferably in Java or C#), I'd rather re-use their work.
Any ideas?
You could try implementing a Markov chain and give it a suitable passage to process. There is a Java implementation that may work for you.
This is a sample from interpolating between Genesis in English and Genesis in Spanish (N = 1):
In bersaran thelely and avin inder tht teathe m lovig weay waw thod mofin he t thte h fupiteg s o t llissed od ma. lllar t land fingujod maid af de wand tetodamoiz fosu Andesp. ersunen thenas lowhejod whipanirede tifinas Gofuavithila d gió Y Diche fua Dios co l, liens ly Y crerdíquen ticuesereregos hielase agúnd veumarbas iarasens laragún co eruerá laciéluelamagúneren Dien a He.
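For anyone who wants to try this without the linked implementation, a character-level Markov chain fits in a few lines. The sketch below is my own minimal version, not the code behind the sample above: it records which characters follow each run of `order` characters in a training text and then samples from those successors. The class name and the restart-from-the-beginning handling of dead ends are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class MarkovText {
    private final Map<String, List<Character>> successors = new HashMap<>();
    private final int order;
    private final String seed;
    private final Random random;

    MarkovText(String training, int order, Random random) {
        if (order < 0 || training.length() <= order) {
            throw new IllegalArgumentException("training text must be longer than the order");
        }
        this.order = order;
        this.random = random;
        this.seed = training.substring(0, order);
        // Record every character that follows each `order`-character state.
        for (int i = 0; i + order < training.length(); i++) {
            successors.computeIfAbsent(training.substring(i, i + order), k -> new ArrayList<>())
                      .add(training.charAt(i + order));
        }
    }

    String generate(int length) {
        StringBuilder out = new StringBuilder(seed);
        while (out.length() < length) {
            List<Character> options = successors.get(out.substring(out.length() - order));
            if (options == null) {
                out.append(seed);   // dead end (state seen only at the very end): restart
            } else {
                out.append(options.get(random.nextInt(options.size())));
            }
        }
        return out.substring(0, length);
    }
}
```

Higher orders (2 or 3) tend to produce more word-like output, at the cost of copying longer stretches of the training text.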
I think this should do what you want:
Java Password Generator
It has the source code and a permissive license so you can adapt the source code to what you are looking for.
You need to generate random syllables. The simplest way to do it is to use syllables that are consonant-vowel, or consonant-vowel-consonant. From a list of consonants and vowels, pick randomly to build syllables, then join the syllables together to make a string.
Keep in mind your list of consonants shouldn't be letters that are consonants, but phonemes, so "th", "st", "sl", etc, could be entries in the consonant list.
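A minimal sketch of this approach in Java. The phoneme lists are short placeholders that you would extend, and `SyllableWords` is a name I made up. Each syllable is consonant-vowel, with a trailing consonant added half of the time.

```java
import java.util.Random;

class SyllableWords {
    // Placeholder lists: entries are phonemes, so multi-letter clusters are allowed.
    static final String[] CONSONANTS =
            {"b", "d", "f", "g", "k", "l", "m", "n", "p", "r", "s", "t", "th", "st", "sl"};
    static final String[] VOWELS = {"a", "e", "i", "o", "u"};

    private static String pick(String[] items, Random rnd) {
        return items[rnd.nextInt(items.length)];
    }

    // One consonant-vowel syllable, optionally closed by another consonant.
    static String syllable(Random rnd) {
        StringBuilder s = new StringBuilder();
        s.append(pick(CONSONANTS, rnd)).append(pick(VOWELS, rnd));
        if (rnd.nextBoolean()) s.append(pick(CONSONANTS, rnd));
        return s.toString();
    }

    static String word(int syllables, Random rnd) {
        StringBuilder w = new StringBuilder();
        for (int i = 0; i < syllables; i++) w.append(syllable(rnd));
        return w.toString();
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int i = 0; i < 5; i++) System.out.println(word(3, rnd));
    }
}
```

Use `java.security.SecureRandom` in place of `Random` if the strings are meant to be hard to guess, for example as passwords.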
You really should check out SCIgen. It generates entire semi-nonsense scientific papers: http://pdos.csail.mit.edu/scigen/
And the source is available: it's released under GPL, and is currently available via anonymous CVS.
I'm not sure exactly what you need this for, but graphic-layout folks in the print industry have long used Lorem Ipsum generators to create text that looks enough like text that your brain processes it as such, without it actually being readable words. More info here
I don't know if there's a web service to which you could subscribe, but there are several sites which will just generate Lorem Ipsum strings for you, so you may be able to use those.
There is a good section on this in Programming Pearls. It's online, but I'd highly recommend buying the book; it is one of the best programming books around, in my opinion.
Lots of Lorem Ipsum generators out there.
All gets back to why you want this. If you just want "pronounceable gibberish", I'd think the easiest thing to do would be to generate alternating consonants and vowels. That would be a tiny subset of all pronounceable gibberish, but what's the goal? To give a little broader range you could create a table of consonant phonemes and vowel phonemes, with the consonant list including not just individual letters like "b" and "d" but also "th", "br", and so on, and the vowel list could include "oo" and "ea", etc. One more step would be to generate syllables instead of letters, with a syllable containing either vowel, consonant-vowel, or consonant-vowel-consonant. That is, loop through creating syllables, then within syllables pick one of the three patterns. You probably want to forbid two vowel-only syllables in a row. (I'm trying to think of an example of that in English. It probably happens, but the only examples I can think of are borrowed from other languages, like "stoa".)
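The three-pattern scheme described above (vowel, consonant-vowel, consonant-vowel-consonant, with no two vowel-only syllables in a row) could be sketched in Java like this. The phoneme tables are tiny illustrative samples and the names are mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class Gibberish {
    // Small sample tables of phonemes; extend as needed.
    static final List<String> CONSONANTS =
            Arrays.asList("b", "d", "m", "n", "r", "s", "t", "th", "br");
    static final List<String> VOWELS =
            Arrays.asList("a", "e", "i", "o", "u", "oo", "ea");

    private static String pick(List<String> items, Random rnd) {
        return items.get(rnd.nextInt(items.size()));
    }

    static List<String> syllables(int count, Random rnd) {
        List<String> out = new ArrayList<>();
        boolean previousVowelOnly = false;
        for (int i = 0; i < count; i++) {
            int pattern = rnd.nextInt(3);                 // 0 = V, 1 = CV, 2 = CVC
            if (pattern == 0 && previousVowelOnly) {
                pattern = 1 + rnd.nextInt(2);             // forbid two vowel-only syllables in a row
            }
            StringBuilder s = new StringBuilder();
            if (pattern > 0) s.append(pick(CONSONANTS, rnd));
            s.append(pick(VOWELS, rnd));
            if (pattern == 2) s.append(pick(CONSONANTS, rnd));
            previousVowelOnly = (pattern == 0);
            out.add(s.toString());
        }
        return out;
    }

    static String word(int count, Random rnd) {
        return String.join("", syllables(count, rnd));
    }
}
```

Returning the syllables as a list keeps the pattern rule easy to check; `word` simply joins them.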
I created a Java package Pronounceable String Generator for generating pronounceable random strings quickly.
Just create an object PronounceableStringGenerator and invoke the method generate:
PronounceableStringGenerator mg = new PronounceableStringGenerator();
System.out.println(mg.generate(8));//8 is the length of the generated string
System.out.println(mg.generate(10));
System.out.println(mg.generate(6));
