How to display word differences using C#?

I would like to show the differences between two blocks of text. Rather than comparing lines of text or individual characters, I would like to just compare words separated by specified characters ('\n', ' ', '\t' for example). My main reasoning for this is that the block of text that I'll be comparing generally doesn't have many line breaks in it and letter comparisons can be hard to follow.
I've come across the following O(ND) logic in C# for comparing lines and characters, but I'm sort of at a loss for how to modify it to compare words.
In addition, I would like to keep track of the separators between words and make sure they're included with the diff. So if space is replaced by a hard return, I would like that to come up as a diff.
I'm using Asp.net to display the entire block of text including the deleted original text and added new text (both will be highlighted to show that they were deleted/added). A solution that works with those technologies would be appreciated.
Any advice on how to accomplish this would be appreciated.
Thanks!

Microsoft has released a diff project on CodePlex that allows you to do word, character, and line diffs. It is licensed under Microsoft Public License (Ms-PL).
https://github.com/mmanela/diffplex

Other than a few general optimizations, if you need to include the separators in the comparison, you are essentially doing a character-by-character comparison with breaks. You could use the O(ND) code you linked, but you would have to change it so much that you might as well write your own.
The main problem with difference comparison is finding the continuation (if I delete a single word, but leave the rest the same).
If you want to use their code, start with the example and do not write out the deleted characters; if there are replaced characters in the same place, do not output that result. You then need to compute the longest continuous run of "changed" words, highlight that string, and output it.
Sorry, that's not much of an answer, but for this problem the answer is basically writing and tuning the function.

Well, String.Split with '\n', ' ' and '\t' as the split characters will give you an array of the words in your block of text.
You could then compare the two arrays for differences. A simple 1:1 comparison would tell you if any word had been changed. Comparing:
hello world how are you
and:
hello there how are you
would tell you that world had changed to there.
What it wouldn't tell you is whether words had been inserted or removed, and you would still need to parse the text blocks character by character to see if any of the separator characters had been changed.

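// Requires: using System.Linq;
string string1 = "hello world how are you";
string string2 = "hello there how are you";
var first = string1.Split(' ');
var second = string2.Split(' ');
// Treat the longer array as the primary one (the second array wins on a tie).
var primary = first.Length > second.Length ? first : second;
var secondary = primary == second ? first : second;
// Except is a set difference: it ignores word positions and duplicate words.
var difference = primary.Except(secondary).ToArray();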
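None of the snippets above detects inserted or removed words, or changed separators. A minimal sketch of one way to do that follows: it keeps every separator as its own token and walks a longest-common-subsequence table. This is a plain O(N*M) dynamic-programming diff, not the O(ND) algorithm from the question, and the names WordDiff, Tokenize and Diff are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordDiff
{
    // Splits text into words and single-character separators, keeping both.
    // The capture group makes Regex.Split return the separators as tokens.
    public static string[] Tokenize(string text) =>
        Regex.Split(text, @"([ \t\n])").Where(t => t.Length > 0).ToArray();

    // Returns every token prefixed with "=" (unchanged), "-" (deleted) or "+" (added).
    public static List<string> Diff(string oldText, string newText)
    {
        string[] a = Tokenize(oldText), b = Tokenize(newText);

        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..].
        var lcs = new int[a.Length + 1, b.Length + 1];
        for (int i = a.Length - 1; i >= 0; i--)
            for (int j = b.Length - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk both token lists, emitting deletions before additions on a tie.
        var result = new List<string>();
        int x = 0, y = 0;
        while (x < a.Length && y < b.Length)
        {
            if (a[x] == b[y]) { result.Add("=" + a[x]); x++; y++; }
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) { result.Add("-" + a[x]); x++; }
            else { result.Add("+" + b[y]); y++; }
        }
        while (x < a.Length) result.Add("-" + a[x++]);
        while (y < b.Length) result.Add("+" + b[y++]);
        return result;
    }
}
```

For display in ASP.NET, each "-" token could be wrapped in a deleted-text element and each "+" token in an added-text element, after HTML-encoding the token itself.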

Related

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
You can collect text along the string (.+ style), followed by a lookahead that checks for a repetition of what has been captured up to that point, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
q(It just wasn't able just wasn't able no matter how hard it tried.),
q(This has no repetitions.),
q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (@lines) {
if (/$re_rep/) {
# Other conditions/filtering on $1 (the capture) ?
say $1
}
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions would then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with one word, "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for which we need to know more about the data.
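Since the question asks for a .NET regex, here is a rough C# sketch of the same idea. It reuses the pattern above but turns the lookahead into a consumed backreference so the second copy can be dropped by a replacement. The class and method names are invented for this example, and the same caveats about false positives apply.

```csharp
using System.Text.RegularExpressions;

static class DuplicateRemover
{
    // At least two words and then some, immediately followed by the same text again.
    static readonly Regex Repeated = new Regex(@"(\w+\W+\w+.+)\1");

    // Keeps the first copy ($1) and drops the repetition that follows it.
    public static string Collapse(string line) => Repeated.Replace(line, "$1");
}
```

In a find-and-replace tool that supports .NET regex, the equivalent would presumably be the same pattern with $1 as the replacement.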

Remove excessive whitespace in user input field

In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, HTML-encoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (I will likely change this to reinserting paragraphs as p tags instead of br, but for now I'm using br.)
Now actually inserting real html breaks opens me up to a subtle vulnerability: the enter key. The regex.replace code is there to strip out a malicious user just standing on the enter key and filling the page with crap.
This is a fix for big crap floods of just white but still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks all down the page.
My question is for a method of determining that this is abusive and failing it on validation. I'm scared that there might not be a simple procedural method to do it and instead will need heuristic techniques or bayesian filters. Hopefully, someone has an easier, better way.
EDIT: perhaps I wasn't clear in the problem description, the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a
A random suggestion, inspired by slashdot.org's comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you'll have to do some experimentation to find a useful cut-off) reject it.
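A minimal sketch of that compression check, assuming UTF-8 input. The FloodCheck name is hypothetical, and any cut-off value has to come from experimenting with real comments.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class FloodCheck
{
    // Returns compressed size divided by original size; values near 0 mean highly repetitive input.
    public static double CompressionRatio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        if (raw.Length == 0) return 1.0;

        using (var buffer = new MemoryStream())
        {
            // leaveOpen keeps the MemoryStream usable after the DeflateStream is flushed and disposed.
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return (double)buffer.Length / raw.Length;
        }
    }
}
```

Very short inputs compress poorly regardless of content, so a check like this only makes sense above some minimum length.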
I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also you should perform this logic when you are outputting to the user, not when saving in the database. The only validation I do on the database is make sure it's properly escaped (other than normal business rules that is).
EDIT: To fix the actual problem however, you can use Regex to replace multiple newlines with a single newline beforehand.
subject = Regex.Replace(subject, @"(\r\n|\r|\n)+", "\n", RegexOptions.Singleline);
I'm not sure if you would need RegexOptions.Singleline.
It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string copying them to a StringBuilder, filtering as you go.
Any that pass a char.IsWhiteSpace() test are not copied. (If one of these is a newline, then insert a <br/> and don't allow any more <br/>'s to be added until you have hit a non-whitespace character.)
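One possible reading of that loop, as a sketch: each run of whitespace collapses to a single space, or to a single <br/> if the run contains a line break. Keeping one space between words is an assumption on my part, and the WhitespaceFilter name is invented. The input is expected to be HTML-encoded already.

```csharp
using System.Text;

static class WhitespaceFilter
{
    public static string Collapse(string encodedInput)
    {
        var sb = new StringBuilder();
        bool pendingSpace = false, pendingBreak = false;

        foreach (char c in encodedInput)
        {
            if (char.IsWhiteSpace(c))
            {
                // Remember what kind of whitespace was seen, but copy nothing yet.
                if (c == '\n' || c == '\r') pendingBreak = true;
                else pendingSpace = true;
                continue;
            }

            // Emit at most one separator per whitespace run, and none at the very start.
            if (sb.Length > 0)
            {
                if (pendingBreak) sb.Append("<br/>");
                else if (pendingSpace) sb.Append(' ');
            }
            pendingSpace = pendingBreak = false;
            sb.Append(c);
        }
        return sb.ToString(); // trailing whitespace is dropped
    }
}
```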
edit
If you want to stop the user entering any old crap, give up now. You will never find a way filtering that a user can't find a way around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probably there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are appropriate.)
This is not the most efficient way of handling this, nor the smartest (disclaimer), but if your text is not too big it doesn't matter much, short of any smarter algorithm. (Note: it's hard to detect something like char\nchar\nchar\n..., though you could set a limit on the line length.)
You could just Split on whitespace characters (add any you can think of, except \n), Join with a single space, then Split on \n to get lines and Join those with <br />. While joining the lines you can test for something like line.Length > 2.
To make this faster you can iterate char by char with a more efficient algorithm, using IndexOf etc.
Again, this is not the most efficient or perfect way of handling this, but it would give you something fast.
EDIT: to filter 'same lines' - you could use e.g. DistinctUntilChanged - that's from the Ix - Interactive extensions (see NuGet Ix-experimental I think) which should filter 'same lines' consecutive + you could add line test for those.
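If pulling in Ix is not an option, the consecutive-duplicate filter is small enough to write by hand. A sketch, using a made-up extension method name:

```csharp
using System.Collections.Generic;

static class SequenceExtensions
{
    // Yields each line unless it equals the line immediately before it.
    public static IEnumerable<string> CollapseConsecutiveDuplicates(this IEnumerable<string> lines)
    {
        string previous = null;
        bool first = true;
        foreach (string line in lines)
        {
            if (first || line != previous) yield return line;
            previous = line;
            first = false;
        }
    }
}
```

Comparing the number of lines before and after this filter would be one cheap way to flag input that is mostly the same line repeated.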
Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class Program {
static void Main() {
// Arbitrary cutoff used to join short strings.
const int Cutoff = 6;
string input =
"\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
"unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
"\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.
StringBuilder temp = new StringBuilder();
List<string> result = new List<string>();
var items = input.Split(
new[] { '\r', '\n' },
StringSplitOptions.RemoveEmptyEntries)
.Select(i => new { i.Length, Value = i });
foreach (var item in items) {
if (item.Length > Cutoff) {
if (temp.Length > 0) {
result.Add(temp.ToString());
temp.Clear();
}
result.Add(item.Value);
continue;
}
if (temp.Length > 0) { temp.Append(" "); }
temp.Append(item.Value);
}
if (temp.Length > 0) {
result.Add(temp.ToString());
}
Console.WriteLine(String.Join("<br />", result));
}
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution but unfortunately what you're asking for isn't very straightforward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;
class Program {
static void Main() {
string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
"\r\nunsanatized\r\nbreaks\r\n\r\n";
input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);
string output = Regex.Replace(
input,
"\\\n+",
"<br />",
RegexOptions.Multiline);
Console.WriteLine(output);
}
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks

Parsing a CSV File with C#, ignoring thousand separators

Working on a program that takes a CSV file and splits on each ",". The issue I have is there are thousand separators in some of the numbers. In the CSV file, the numbers render correctly. When viewed as a text document, they are shown like below:
Dog,Cat,100,100,Fish
In a CSV file, there are four cells, with the values "Dog", "Cat", "100,100", "Fish". When I split on the "," to an array of strings, it contains 5 elements, when what I want is 4. Anyone know a way to work around this?
Thanks
There are two common mistakes made when reading CSV data: using a split() function and using regular expressions. Both approaches are wrong, in that they are prone to corner cases such as yours and slower than they could be.
Instead, use a dedicated parser such as Microsoft.VisualBasic.TextFieldParser, CodeProject's FastCSV or Linq2csv, or my own implementation here on Stack Overflow.
Typically, CSV files would wrap these elements in quotes, causing your line to be displayed as:
Dog,Cat,"100,100",Fish
This would parse correctly (if using a reasonable method, ie: the TextFieldParser class or a 3rd party library), and avoid this issue.
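As a sketch of what that looks like with TextFieldParser (the CsvDemo wrapper is only for illustration), reading one properly quoted line:

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class CsvDemo
{
    // Parses a single comma-delimited line, honouring double-quoted fields.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

With the quoted line above this yields four fields, the third being 100,100. On the unquoted line from the question it would still yield five, which is why the fix belongs on the side that generates the file.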
I would consider your file as an error case - and would try to correct the issue on the generation side.
That being said, if that is not possible, you will need to have more information about the data structure in the file to correct this. For example, in this case, you know you should have 4 elements - if you find five, you may need to merge back together the 3rd and 4th, since those two represent the only number within the line.
This is not possible in a general case, however - for example, take the following:
100,100,100
If that is 2 numbers, should it be 100100, 100, or should it be 100, 100100? There is no way to determine this without more information.
You might want to have a look at the free open-source project FileHelpers. If you MUST use your own code, here is a primer on the CSV "standard" format.
Well, you could always split on ("\",\"") and then trim the first and last element.
But I would look into regular expressions that match elements within "".
Don't just split on the , split on ", ".
Better still, use a CSV library from google or codeplex etc
Reading a CSV file in .NET?
You may be able to use Regex.Replace to get rid of specifically the third comma as per below before parsing?
Replaces up to a specified number of occurrences of a pattern specified in the Regex constructor with a replacement string, starting at a specified character position in the input string. A MatchEvaluator delegate is called at each match to evaluate the replacement.
[C#] public string Replace(string, MatchEvaluator, int, int);
I ran into a similar issue with fields containing line feeds. I'm not convinced this is elegant, but for mine I basically chopped the file into lines, then if a line didn't start with a text delimiter, I appended it to the line above.
You could try something like this: step through each field; if the field has an end text delimiter, move to the next; if not, grab the next field, append it, and rinse and repeat until you do have an end delimiter (this allows for 1,000,000,000 etc.).
(I'm caffeine deprived and hungry. I did write some code, but it was so ugly I didn't even post it.)
Do you know that it will always contain exactly four columns? If so, this quick-and-dirty LINQ code would work:
string[] elements = line.Split(',');
string element1 = elements.ElementAt(0);
string element2 = elements.ElementAt(1);
// Exclude the first two elements and the last element.
var element3parts = elements.Skip(2).Take(elements.Count() - 3);
int element3 = Convert.ToInt32(string.Join("",element3parts));
string element4 = elements.Last();
Not elegant, but it works.

Splitting a string in C#

Let's say I have this string:
"param1,r:1234,p:myparameters=1,2,3"
...and I would like to split it into:
param1
r:1234
p:myparameters=1,2,3
I've used the split function and of course it splits it at every comma. Is there a way to do this using regex or will I have to write my own split function?
Personally, I would try something like this:
,(?=[^,]+:.*?)
Basically, use a positive look-ahead to find a comma followed by a "key-value" pair (defined here as a key, a colon, and more information [data], including other commas). This should disqualify the commas between the numbers, too.
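A quick sketch of how that pattern could be used with Regex.Split (the ParamSplitter name is invented for the example). Because the look-ahead consumes nothing and contains no capture group, only the commas themselves are removed:

```csharp
using System.Text.RegularExpressions;

static class ParamSplitter
{
    // Splits only on commas that are followed by "something without commas, then a colon".
    public static string[] Split(string input) =>
        Regex.Split(input, @",(?=[^,]+:.*?)");
}
```

This relies on every item after the first having a key: prefix, which holds for the sample string but may not hold for other data.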
You can use ; to separate the values, which makes them easy to work with.
Since , is used both as the separator and inside values, it is difficult to split on.
You have
string str = "param1,r:1234,p:myparameters=1,2,3";
Recommended to use
string str = "param1;r:1234;p:myparameters=1,2,3";
which can be split as
var strArray = str.Split(';');
// strArray[0] contains param1
// strArray[1] contains r:1234
// strArray[2] contains p:myparameters=1,2,3
I'm not sure how you would write a split that knew which commas to split on there, honestly.
Unless it's a fixed number each time, in which case just use the String.Split overload that takes an int specifying the maximum number of substrings to return.
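For the fixed-count case, a minimal sketch (assuming there are always exactly three items, and using a made-up wrapper class):

```csharp
static class FixedCountSplit
{
    // Stops splitting after the third substring, so the last item keeps its commas.
    public static string[] SplitThree(string input) =>
        input.Split(new[] { ',' }, 3);
}
```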
If you're going to have comma-delimited data that's not always a fixed number of items and it could have literal commas in the data itself, they really should be quoted. If you can control the input in any way, you should encourage that, and use an actual CSV parser instead of String.Split
That depends. You can't parse it with regex (or anything else) unless you can identify a consistent rule separating one group from another. Based on your sample, I can't clearly identify such a rule (though I have some guesses). How does the system know that p:myparameters=1,2,3 is a single item? For example, if there were another item after it, what would be the difference between that and the 1,2,3? Figure that out and you'll be pretty close to a solution.
If you're able to change the format of the input string, why not decide on a consistent delimiter between your groups? ; would be a good choice. Use an input like param1;r:1234;p:myparameters=1,2,3 and there will be no ambiguity where the groups are, plus you can just split on ; and you won't need regex.
The simplest approach would be changing your delimiter from "," to something like "|". Then you can split on "|" no problem. However if you can't change the delimiting character then maybe you could encode the sections in a fashion similar to CSV.
CSV files have the same issue... the standard there is to put double quotes "" around columns.
For example, your string would be "param1","r:1234","p:myparameters=1,2,3".
Then you could use the Microsoft.VisualBasic.FileIO.TextFieldParser to split/parse. You can include this in C# even though it's in the VisualBasic namespace.
TextFieldParser
Do you mean this: string[] str = System.Text.RegularExpressions.Regex.Split("param1,r:1234,p:myparameters=1,2,3", @"\,");

Delete Lines From Beginning of Multiline Textbox in C#

Is there a graceful way in C# to delete multiple lines of text from the beginning of a multiline textbox? I am using Microsoft Visual C# 2008 Express Edition.
EDIT - Additional Details
The multiline textbox in my application is disabled (i.e. it is only editable by the application itself), and every line is terminated with a "\r\n".
This is an incomplete question. So assuming you are using either TextBox or RichTextBox, you can use the Lines property found in TextBoxBase.
// get all the lines out as an array
string[] lines = this.textBox.Lines;
You can then work with this array and set it back.
this.textBox.Lines= newLinesArray;
This might not be the most elegant way, but it will remove the first line.
EDIT: you don't need select, just using skip will be fine
//number of lines to remove from the beginning
int numOfLines = 30;
var lines = this.textBox1.Lines;
var newLines = lines.Skip(numOfLines);
this.textBox1.Lines = newLines.ToArray();
This solution works for me in WPF:
while (LogTextBox.LineCount > Constants.LogMaximumLines)
{
LogTextBox.Text = LogTextBox.Text.Remove(0, LogTextBox.GetLineLength(0));
}
You can replace LogTextBox with the name of your text box, and Constants.LogMaximumLines with the maximum number of lines you would like your text box to have.
Unfortunately, no, there is no "elegant" way to delete lines from the text of a multiline TextBox, regardless of whether you are using ASP.NET, WinForms, or WPF/Silverlight. In every case, you build a string that does not contain the lines you don't want and set the Text property.
WinForms will help you a little bit by pre-splitting the Text value into lines, using the Lines property, but it's not very helpful because it's a string array, and it's not exactly easy to delete an element of an array.
Generally, this algorithm will work for all possible versions of the TextBox class:
var lines = (from item in myTextBox.Text.Split('\n') select item.Trim());
lines = lines.Skip(numLinesToSkip);
myTextBox.Text = string.Join(Environment.NewLine, lines.ToArray());
Note: I'm using Environment.NewLine specifically for the case of Silverlight on a Unix platform. For all other cases, you're perfectly fine using "\r\n" in the string.Join call.
Also, I do not consider this an elegant solution, even though it's only 3 lines. What it does is the following:
splits the single string into an array of strings
iterates over that array and builds a second array that does not include the lines skipped
joins the array back into a single string.
I do not consider it elegant because it essentially builds two separate arrays, then builds a string from the second array. A more elegant solution would not do this.
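One way to avoid the intermediate arrays, sketched under the assumption that every line ends with "\n" or "\r\n" (as the question states): find where the first kept line starts and take a single substring. The TextTrimmer name is hypothetical.

```csharp
static class TextTrimmer
{
    // Returns text with the first `count` lines removed, without splitting into an array.
    public static string SkipLines(string text, int count)
    {
        int index = 0;
        for (int i = 0; i < count; i++)
        {
            int next = text.IndexOf('\n', index);
            if (next < 0) return string.Empty; // fewer lines than requested
            index = next + 1;                  // start of the following line
        }
        return text.Substring(index);
    }
}
```

Usage would be along the lines of myTextBox.Text = TextTrimmer.SkipLines(myTextBox.Text, numLinesToSkip);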
One thing to keep in mind is that the Lines collection of the TextBox does not accurately reflect what the user sees as lines. The Lines collection basically works off of carriage returns, whereas the user could see lines wrapping from one line to the next without a carriage return. This may or may not be the behavior you want.
For example, the user would see the below as three lines, but the Lines collection will show 2 (since there are only 2 carriage returns):
This is line number
one.
This is line 2.
Also, if the form, and the text control are resizable the visible lines in the text will change as the control grows or shrinks.
I wrote a blog post several years ago on how to determine the number of lines in the textbox as the user sees them and how to get the line at a given index: http://ryanfarley.com/blog/archive/2004/04/07/511.aspx. Perhaps this post will help.
if (txtLog.Lines.Length > maxNumberLines)
{
txtLog.Lines = txtLog.Lines.Skip(txtLog.Lines.Length - maxNumberLines).ToArray();
}
