I have two strings and would like to display the difference between them. For example, if I have the strings "I am from Mars" and "I am from Venus", the output could be "I am from Venus" with the changed word ("Venus") highlighted. (This is typically used to show what changed in an audit log, etc.)
Is there a simple algorithm for this? I am using C# but I guess a generic algorithm could be adapted from any programming language.
Or is there a framework class/third-party library that will do this sort of thing?
Check this out: http://en.wikipedia.org/wiki/Diff#Algorithm
Also: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
There is also an implementation described here: http://www.codeproject.com/KB/recipes/DiffAlgorithmCS.aspx
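As a rough illustration of the longest-common-subsequence idea behind those links, here is a minimal word-level sketch. The class name WordDiff and the [-removed-] / [+added+] markers are made up for this example; a real diff library will be faster and handle whitespace and punctuation better.

```csharp
using System;
using System.Collections.Generic;

public static class WordDiff
{
    // Marks words only in 'before' as [-word-] and words only in 'after' as [+word+].
    public static string Compare(string before, string after)
    {
        string[] a = before.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        string[] b = after.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..]
        int[,] lcs = new int[a.Length + 1, b.Length + 1];
        for (int i = a.Length - 1; i >= 0; i--)
            for (int j = b.Length - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table from the start, emitting common, removed and added words.
        var parts = new List<string>();
        int x = 0, y = 0;
        while (x < a.Length && y < b.Length)
        {
            if (a[x] == b[y]) { parts.Add(a[x]); x++; y++; }
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) { parts.Add("[-" + a[x] + "-]"); x++; }
            else { parts.Add("[+" + b[y] + "+]"); y++; }
        }
        while (x < a.Length) parts.Add("[-" + a[x++] + "-]");
        while (y < b.Length) parts.Add("[+" + b[y++] + "+]");
        return string.Join(" ", parts);
    }
}
```

With the strings from the question, WordDiff.Compare("I am from Mars", "I am from Venus") gives "I am from [-Mars-] [+Venus+]".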
OK, the title may not be quite right, but it is the best I could come up with.
My question is this:
Example 1
see, saw
I can convert see to saw as follows:
replace ee with aw
string srA = "see";
string srB = "saw";
srA = srA.Replace("ee", "aw"); // "see" becomes "saw"
Or let's say
show, shown
add n to the original string
What I want is to generate such procedures for any pair of compared strings, with a minimum amount of code.
I'm looking for ideas on how to do this. Can I generate regexes automatically to apply the conversion?
c# 6
Check DiffPlex and see if it is what you need. If you want to create a custom algorithm instead of using a 3rd party library, just go through the code; it's open source.
You might also want to check this work for optimizations, but it might get complicated.
Then there's also Diff.NET.
Also this blog post is part of a series in implementing a diff tool.
If you're simply interested in learning more about the subject, your googling efforts should be directed to the Levenshtein algorithm.
I can only assume what your end goal is, and the time you're willing to invest in this, but I believe the first library should be enough for most needs.
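For reference, the Levenshtein distance mentioned above can be computed with a small dynamic-programming routine. This is only a sketch (the EditDistance class name is mine, not from any of the libraries listed), and it returns the number of edits rather than the edits themselves.

```csharp
using System;

public static class EditDistance
{
    // Number of single-character insertions, deletions or substitutions
    // needed to turn 'source' into 'target'.
    public static int Levenshtein(string source, string target)
    {
        // previous[j] holds the distance between the source prefix handled so far
        // and the first j characters of target.
        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];
        for (int j = 0; j <= target.Length; j++) previous[j] = j;

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = previous; previous = current; current = tmp;
        }
        return previous[target.Length];
    }
}
```

For the examples in the question, EditDistance.Levenshtein("see", "saw") is 2 (two substitutions) and EditDistance.Levenshtein("show", "shown") is 1 (one insertion).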
I encountered a scenario where I have to send an array of integers as a parameter from a SpecFlow feature file. I could have used tables, but I don't want to, as I would have to send the values as row[] or col[]. Suppose I pass the parameter as a string,
eg: Given set the value as '470,471,472,472'
and then receive it and split it in the step definition file. How is StepArgumentTransformation different from this approach? Is there any other benefit to using a step argument transformation? I understand we can convert XML, dates, or any object. Why would we have to use StepArgumentTransformation?
I hope I understood the question correctly.
SpecFlow supports some automatic transformations out of the box, such as converting to Date, Double, int and so on, and it does these by default because there is no ambiguity about them. You can easily convert a string to a double or a Date as you know the locale being used.
Why isn't converting to arrays supported? I suppose it could be, but there is some ambiguity. What should the list separator be? a comma? What about locales that use that as a separator between the whole and fractional part of a number?
So providing a default implementation of something which converted a list to int[] or IEnumerable<int> could be possible, but it's likely to get some people asking why it doesn't work for them when they have used ☃ as a list separator.
It's better to leave the things with ambiguity to individuals to implement, rather than guess at the best implementation.
The StepArgumentTransformation you want is very simple to write and could be included in an external step assembly if you wanted to share it amongst many projects.
So to answer your many questions:
It's not really any different; it just encapsulates the conversion in a single place, which is good practice and is itself a benefit.
Yes, you can convert any object.
You don't have to use StepArgumentTransformation, and many people don't, but IMHO they make your life much easier.
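To give an idea of how small such a transformation is, here is a sketch of the parsing part as a plain helper. IntListParser is a hypothetical name, and the comma separator and invariant culture are assumptions you would adjust to your own feature files. The SpecFlow wiring is shown only as a comment so that the helper compiles without the SpecFlow package.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class IntListParser
{
    // Splits a comma-separated list such as "470,471,472,472" into integers.
    // Assumes ',' is the list separator and that items are plain integers.
    public static int[] Parse(string list)
    {
        return list
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(item => int.Parse(item.Trim(), CultureInfo.InvariantCulture))
            .ToArray();
    }
}

// In a SpecFlow project the helper could be exposed roughly like this
// (Transforms is a hypothetical class name):
//
// [Binding]
// public class Transforms
// {
//     [StepArgumentTransformation]
//     public int[] ToIntArray(string list)
//     {
//         return IntListParser.Parse(list);
//     }
// }
```

A step definition could then declare an int[] parameter directly instead of taking a string and splitting it in every step.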
Does C# offer a way to translate strings on-the-fly or something similar?
I'm now working on some legacy code, which has some parts like this:
section.AddParagraph(String.Format("Premise: {0}", currentReport.Tenant.Code));
section.AddParagraph(String.Format("Description: {0}", currentReport.Tenant.Name));
section.AddParagraph();
section.AddParagraph(String.Format("Issued: #{0:D5}", currentReport.Id));
section.AddParagraph(String.Format("Date: {0}", currentReport.Timestamp.ToString(
"dd MMM yyyy", CultureInfo.InvariantCulture)));
section.AddParagraph(String.Format("Time: {0:HH:mm}", currentReport.Timestamp));
So, I want to implement the translation of these strings on-the-fly based on some substitution table (for example, as Qt does).
Is this possible, perhaps using something C# already has, or using some post-processing (maybe with PostSharp)?
Does some generic internationalization approach for applications built with C# (from scratch) exist?
Does some generic internationalization approach for applications built with C# (from scratch) exist?
Yes, using resource files. And here's another article on MSDN.
In the C# project I currently work on, we wrote a helper function that works like this:
section.AddParagraph(I18n.Translate("Premise: {0}", currentReport.Tenant.Code));
section.AddParagraph(I18n.Translate("That's all"));
At build time, a script searches for all I18n.Translate invocations, as well as all UI controls, and populates a table with all English phrases. This table gets translated.
At runtime, the English text is looked up in a dictionary and replaced with the translated text.
Something similar happens to our WinForms dialog resources: they are constructed in English and then translated using the same dictionary.
The biggest strength of this scheme is also its biggest weakness: if you use the same string in two places, it gets translated the same way. This shortens the file you send to the translator, which helps to reduce cost. If you ever need to force a different translation of the same English word, you need to work around that. In the time we have had the system (4ish years or so), we have never needed to. There are also other benefits: you read the English UI text inline with the source (so it is not hiding behind an identifier you need to name), and if you delete code, it's automatically removed from the translated resources as well.
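The runtime lookup described above could look roughly like the following. This is not our actual helper, just a minimal sketch: the Load method, the in-memory dictionary, and the fallback to the English text when no translation exists are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class I18n
{
    // English phrase -> translated phrase. In a real system this table would be
    // filled from the translated file produced after the build-time scan.
    private static readonly Dictionary<string, string> Translations =
        new Dictionary<string, string>();

    // Replaces the current table (hypothetical loading entry point).
    public static void Load(IDictionary<string, string> table)
    {
        Translations.Clear();
        foreach (var pair in table) Translations[pair.Key] = pair.Value;
    }

    // Looks up the English format string and formats the translated version.
    // Falls back to the English text when no translation is known.
    public static string Translate(string english, params object[] args)
    {
        string format;
        if (!Translations.TryGetValue(english, out format)) format = english;
        return args.Length == 0 ? format : string.Format(format, args);
    }
}
```

For example, after loading a table that maps "Date: {0}" to "Datum: {0}", the call I18n.Translate("Date: {0}", "01 Jan 2020") returns "Datum: 01 Jan 2020", while an unknown phrase such as "That's all" comes back unchanged.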
What would be the best way to compare big paragraphs of text in order to find the differences between them? For example, if string A and string B are the same except for a few missing words, how would I highlight these?
Originally I thought of breaking it down into word arrays, and comparing the elements. However this breaks down when a word is deleted or inserted.
Use a diff algorithm.
I saw this a few months back when I was working on a small project, but it might set you on the right track.
http://www.codeproject.com/KB/recipes/DiffAlgorithmCS.aspx
You want to look into Longest Common Subsequence algorithms. Most languages have a library which will do the dirty work for you, and here is one for C#. Searching for "C# diff" or "VB.Net diff" will help you find additional libraries that suit your needs.
Usually text difference is measured in terms of edit distance, which is essentially the number of character additions, deletions or changes necessary to transform one text into the other.
A common implementation of this algorithm uses dynamic programming.
Here is an implementation of a Merge Engine that compares 2 HTML files and shows the highlighted differences: http://www.codeproject.com/KB/string/htmltextcompare.aspx
If it's a one-shot deal, save them both in MS Word and use the document compare function.
For the project that I'm currently on, I have to deliver specially formatted strings to a 3rd party service for processing. And so I'm building up the strings like so:
string someString = string.Format("{0}{1}{2}: Some message. Some percentage: {3}%", token1, token2, token3, number);
Rather than hardcode the string, I was thinking of moving it into the project resources:
string someString = string.Format(Properties.Resources.SomeString, token1, token2, token3, number);
The second option is, in my opinion, not as readable as the first one, i.e. the person reading the code would have to pull up the string resources to work out what the final result should look like.
How do I get around this? Is the hardcoded format string a necessary evil in this case?
I do think this is a necessary evil, one I've used frequently. Something smelly that I do, is:
// "{0}{1}{2}: Some message. Some percentage: {3}%"
string someString = string.Format(Properties.Resources.SomeString
,token1, token2, token3, number);
..at least until the code is stable enough that I might be embarrassed having that seen by others.
There are several reasons that you would want to do this, but the only great reason is if you are going to localize your application into another language.
If you are using resource strings there are a couple of things to keep in mind.
Include format strings whenever possible in the set of resource strings you want localized. This will allow the translator to reorder the position of the formatted items to make them fit better in the context of the translated text.
Avoid having strings in your format tokens that are in your language. It is better to use these for numbers. For instance, the message:
"The value you specified must be between {0} and {1}"
is great if {0} and {1} are numbers like 5 and 10. If you are formatting in strings like "five" and "ten" this is going to make localization difficult.
You can get around the readability problem you are talking about by simply naming your resources well.
string someString = string.Format(Properties.Resources.IntegerRangeError, minValue, maxValue );
Evaluate whether you are generating user-visible strings at the right abstraction level in your code. In general I tend to group all the user-visible strings in the code as close to the user interface as possible. If some low-level file I/O code needs to report errors, it should do so with exceptions, which you handle in your application and provide consistent error messages for. This will also consolidate all of your strings that require localization instead of having them peppered throughout your code.
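To illustrate the point about letting a translator reorder the formatted items, here is a small sketch. The constants stand in for entries that would normally live in a .resx file; RangeMessages and the second, reordered wording are made up for the example.

```csharp
using System;
using System.Globalization;

public static class RangeMessages
{
    // Stand-ins for resource strings; in a real project these would come from
    // Properties.Resources and differ per language.
    public const string IntegerRangeError =
        "The value you specified must be between {0} and {1}";

    // A translation is free to use the placeholders in a different order.
    public const string IntegerRangeErrorReordered =
        "Upper limit {1}, lower limit {0}: the value must lie between them";

    // The calling code stays the same whichever format string is loaded.
    public static string Format(string resourceFormat, int minValue, int maxValue)
    {
        return string.Format(CultureInfo.CurrentCulture, resourceFormat, minValue, maxValue);
    }
}
```

Calling RangeMessages.Format with either constant and the values 5 and 10 produces a correct sentence, which is why keeping the whole format string in the resources works better than concatenating translated fragments in code.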
One thing you can do to help add hard coded strings or even speed up adding strings to a resource file is to use CodeRush Xpress which you can download for free here: http://www.devexpress.com/Products/Visual_Studio_Add-in/CodeRushX/
Once you write your string you can access the CodeRush menu and extract to a resource file in a single step. Very nice.
Resharper has similar functionality.
I don't see why including the format string in the program is a bad thing. Unlike traditional undocumented magic numbers, it is quite obvious what it does at first glance. Of course, if you are using the format string in multiple places it should definitely be stored in an appropriate read-only variable to avoid redundancy.
I agree that keeping it in the resources is unnecessary indirection here. A possible exception would be if your program needs to be localized, and you are localizing through resource files.
Yes, you can.
Now let's see how:
String.Format(Resource_en.PhoneNumberForEmployeeAlreadyExist,letterForm.EmployeeName[i])
This gives me a dynamic message every time.
By the way, I'm using ResXManager.