Comparing strings that contain formatting in C#

Comparing strings that contain formatting in C# - c#

I'm working on a function that given some settings - such as line spacing, the output (in string form) is modified. In order to test such scenarios, I'm using string literals, as shown below for the expected result.
The method, using a string builder, (AppendLine) generates the said output. One issue I have run into is that of comparing such strings. In the example below, both are equal in terms of what they represent. The result is the area which I care about, however when comparing two strings, one literal, one not, equality naturally fails. This is because one of the strings emits line spacing, while the other only demonstrates the formatting it contains.
What would be the best way of solving this equality problem? I do care about formatting such as new lines from the result of the method, this is crucially important.
Code:
string expected = #"Test\n\n\nEnd Test.";
string result = "Test\n\n\nEnd Test";
Console.WriteLine(expected);
Console.WriteLine(result);
Output:
Test\n\n\nEnd Test.
Test
End Test

The # prefix tells the compiler to take the string exactly as it is written. So, it doesn't format the \n characters to carriage returns and line feeds.
Since you don't have the same prefix for the string assigned to your result variable, the compiler formats it. If you would like to continue to use the # prefix, just do the following:
string expected = #"Test
End Test";
You'll have to input the carriage returns and line feed within the string as invisible characters.

You're using the term "literal" incorrectly. "Literal" simply means an actual value that exists in code. In other words, values exist in code either as variables (for the sake of simplicity I'm including constants in this group) and literals. Variables are an abstract notion of a value, whereas literals are a value.
All this is to say that both of your strings are string literals, as they're hard-coded into your application. The # prefix simply states that the compiler is to include escape characters (indeed, anything other than a double-quote) in the string, rather than evaluating the escape sequences when compiling the string literal into the assembly.
First of all, whatever your function returns (either a string that contains standard escape sequences for newlines rather than newlines themselves, or a string that actually contains newlines) is what your test variable should contain. Make your tests as close to the actual output as possible, as the more work you do to massage the values into a comparable form the more code paths you have to test. If you're looking to be able to compare a string with formatting escape sequences embedded into it to a string where those sequences have been evaluated (essentially comparing the two strings in your example), then I would say this:
Be sure that this is really want you want to do.
You'll have to duplicate the functionality of the C# compiler in interpreting these values and turning your "format string" into a "formatted string".
For doing #2, a RegEx processor is probably going to be the simplest option. See this page for a list of C# string escape sequences.

I feel somewhat enlightened, yet annoyed at what I discovered.
This is my first project using MSTest, and after a failing test I was selecting View Test Details to see how and why my test failed. The formatting for string output in this details display is very poor, for example you get:
Assert.AreEqual failed. Expected:<TestTest End>. Actual:<TestTest End>.
This is for formatted text - the strange thing is if you have /r (line feeds) instead of line breaks (/n) the formatting is actually somewhat correct.
It turns out to view the correct output you need to run the tests in debug mode. In other words, when you have a failing test, run the test in debug and the exception will be caught and displayed as follows:
Assert.AreEqual failed. Expected:<Test
Test End>. Actual:<Test
Test End>.
The above obviously containing the correct formatting.
In the end it turns out my initial method of storing the expectations (with formatting) in strings was correct, yet my unfamiliarity of MSTest made me question my means as it appeared to be valid input, yet was simply being displayed back to myself in what appeared a valid output.

Use a regex to strip white space before you do your compare?

Related

How to arrange the matched output in c#?

i'm matching words to create simple lexical analyzer.
here is my example code and output
example code:
public class
{
public static void main (String args[])
{
System.out.println("Hello");
}
}
output:
public = identifier
void = identifier
main = identifier
class = identifier
as you all can see my output is not arranged as the input comes. void and main comes after class but in output the class comes at the end. i want to print result as the input is matched.
c# code:
private void button1_Click(object sender, EventArgs e)
{
if (richTextBox1.Text.Contains("public"))
richTextBox2.AppendText("public = identifier\n");
if (richTextBox1.Text.Contains("void"))
richTextBox2.AppendText("void = identifier\n");
if (richTextBox1.Text.Contains("class"))
richTextBox2.AppendText("class = identifier\n");
if (richTextBox1.Text.Contains("main"))
richTextBox2.AppendText("main = identifier\n");
}

Your code is asking the following qustions:
Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".
The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void, class and main are keywords, not identifiers.
Splitting on whitespace?
So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split() - that will cut up the input at whitespace boundaries and give you an array of strings. Still, there's a lot of whitespace entries in there.
input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.
But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string with whitespace in it it would get split into multiple tokens. Apparently, just splitting on whitespace is not sufficient.
Tokenizing
At this point, you would start writing a loop that goes over every character in the input, checking if it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as a token of their own. Whitespace can be skipped.
You'll also have to take things like string literals and comments into account: anything in-between two double-quotes should not be tokenized, but be treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).
Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.
Numbers can optionally start with a minus sign, can contain a dot (or start with a dot), a scientific notation part (e..), which can also be negative, and there are type suffixes...
In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.
Lexing
It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.
Parsing
You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.
For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.
So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.
Conclusion
This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).
You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsofts websites.

CodeCaster is right. You are not on the right path.
I have an lexical analyzer made by me some time ago as a project.
I know, I know I'm not supposed to put things on a plate here, but the analyzer is for c++ so you'll have to change a few things.
Take a look at the source code and please try to understand how it works at least: C++ Lexical Analyzer

In the strictest sense, the reason for the described behaviour is that in the evaluating code, the search for void comes before the search for class. However, the approach in total seems far too simple for a lexical analysis, as it simply checks for substrings. I totally second the comments above; depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.

.Net regex not replacing correctly

Background
I am trying to do some regex matching and replacing, but for some reason the replacement isn't correct in .NET.
Regex pattern - "^.*?/rebate/?$"
Input string - "/my-tax/rebate"
Replacement string - "/new-path/rebate"
Basically, if the word 'rebate' is seen in a string, the input string needs to be replaced entirely by the replacement string.
Problem
If I create a regex with the pattern and execute
patternMatch.Pattern.Replace("/my-tax/rebate", "/new-path/rebate")
I get /my-tax/new-path/rebate, which isn't correct.
But, if I execute -
new Regex(#"^.*?/rebate/?$").Replace("/my-tax/rebate", "/new-path/rebate"),
the result is correct - /new-path/rebate
Why is that?
patternMatch is an object with two properties - one Pattern (which is the Regex Pattern) and another one is TargetPath (which is the replacement string). In this example, I am only using the pattern property.
patternMatch.Pattern on debugging is
Here are the results during run time-

You are simply wrongly using the function. I'm not sure how you are getting /my-tax/new-path/rebate since it is giving me an error on ideone.com (Maybe you have a regex named Pattern?).
Anyway, you shouldn't have any issues with using the function like this:
patternMatch.Replace("/my-tax/rebate", "/new-path/rebate");
ideone demo

A number of points in your question are incorrect. The regex is replacing correctly.
Per #XiaoguangQiao's comment, what is patternMatch.Pattern.Replace? Your example...
var patternMatch = new Regex("^.*?/rebate/?$");
patternMatch.Pattern.Replace("/my-tax/rebate", "/new-path/rebate");
...errors with the message...
'System.Text.RegularExpressions.Regex' does not contain a definition for 'Pattern' and no extension method 'Pattern' accepting a first argument of type 'System.Text.RegularExpressions.Regex' could be found
...when I throw it into a quick LINQPad 4 query (set to C# Statement(s)).
pattern is a private string field of System.Text.RegularExpressions.Regex; and patternMatch.Replace("/my-tax/rebate", "/new-path/rebate") - which I expect is what you meant - yields the correct result ("/new-path/rebate") rather than the incorrect result you said you get ("/my-tax/new-path/rebate").
Otherwise your pattern(s) (i.e. with and without the extra / that #rene pointed out) is fine for the input ("/my-tax/rebate") and replacement ("/new-path/rebate") you initially outline - insofar as they match and yield the result you want. You can check this outside your code in quick fiddles with the extra / and without the extra /.

Use String.Replace Method.
str.replace("rebate","new-path/rebate")
http://msdn.microsoft.com/en-us/library/fk49wtc1%28v=vs.110%29.aspx

Regular Expression to match a quoted string embedded in another quoted string

I have a data source that is comma-delimited, and quote-qualified. A CSV. However, the data source provider sometimes does some wonky things. I've compensated for all but one of them (we read in the file line-by-line, then write it back out after cleansing), and I'm looking to solve the last remaining problem when my regex-fu is pretty weak.
Matching a Quoted String inside of another Quoted String
So here is our example string...
"foobar", 356, "Lieu-dit "chez Métral", Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I am looking to match the substring "chez Métral", in order to replace it with the substring chez Métral. Ideally, in as few lines of code as possible. The final goal is to write the line back out (or return it as a method return value) with the replacement already done.
So our example string would end up as...
"foobar", 356, "Lieu-dit chez Métral, Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I know I could define a pattern such as (?<quotedstring>\"\w+[^,]+\") to match quoted strings, but my regex-fu is weak (database developer, almost never use C#), so I'm not sure how to match another quoted string within the named group quotedstring.
FYI: For those noticing the large integer that is formatted with commas but not quote-qualified, that's already handled. As is the random use of row-delimiters (sometimes CR, sometimes LF). As other problems...

Replace with this regex
(?<!,\s*|^)"([^",]*)"
now replace it with $1
try it here
escaping " with "" it would become
(?<!,\s*|^)""([^"",]*)""

Parsing a CSV File with C#, ignoring thousand separators

Working on a program that takes a CSV file and splits on each ",". The issue I have is there are thousand separators in some of the numbers. In the CSV file, the numbers render correctly. When viewed as a text document, they are shown like below:
Dog,Cat,100,100,Fish
In a CSV file, there are four cells, with the values "Dog", "Cat", "100,000", "Fish". When I split on the "," to an array of strings, it contains 5 elements, when what I want is 4. Anyone know a way to work around this?
Thanks

There are two common mistakes made when reading csv code: using a split() function and using regular expressions. Both approaches are wrong, in that they are prone to corner cases such as yours and slower than they could be.
Instead, use a dedicated parser such as Microsoft.VisualBasic.TextFieldParser, CodeProject's FastCSV or Linq2csv, or my own implemention here on Stack Overflow.

Typically, CSV files would wrap these elements in quotes, causing your line to be displayed as:
Dog,Cat,"100,100",Fish
This would parse correctly (if using a reasonable method, ie: the TextFieldParser class or a 3rd party library), and avoid this issue.
I would consider your file as an error case - and would try to correct the issue on the generation side.
That being said, if that is not possible, you will need to have more information about the data structure in the file to correct this. For example, in this case, you know you should have 4 elements - if you find five, you may need to merge back together the 3rd and 4th, since those two represent the only number within the line.
This is not possible in a general case, however - for example, take the following:
100,100,100
If that is 2 numbers, should it be 100100, 100, or should it be 100, 100100? There is no way to determine this without more information.

you might want to have a look at the free opensource project FileHelpers. If you MUST use your own code, here is a primer on the CSV "standard" format

well you could always split on ("\",\"") and then trim the first and last element.
But I would look into regular expressions that match elements with in "".

Don't just split on the , split on ", ".
Better still, use a CSV library from google or codeplex etc
Reading a CSV file in .NET?

You may be able to use Regex.Replace to get rid of specifically the third comma as per below before parsing?
Replaces up to a specified number of occurrences of a pattern specified in the Regex constructor with a replacement string, starting at a specified character position in the input string. A MatchEvaluator delegate is called at each match to evaluate the replacement.
[C#] public string Replace(string, MatchEvaluator, int, int);

I ran into a similar issue with fields with line feeds in. Im not convinced this is elegant, but... For mine I basically chopped mine into lines, then if the line didnt start with a text delimeter, I appended it to the line above.
You could try something like this : Step through each field, if the field has an end text delimeter, move to the next, if not, grab the next field, appaend it, rince and repeat till you do have an end delimeter (allows for 1,000,000,000 etc) ..
(Im caffeine deprived, and hungry, I did write some code but it was so ugly, I didnt even post it)

Do you know that it will always contain exactly four columns? If so, this quick-and-dirty LINQ code would work:
string[] elements = line.Split(',');
string element1 = elements.ElementAt(0);
string element2 = elements.ElementAt(1);
// Exclude the first two elements and the last element.
var element3parts = elements.Skip(2).Take(elements.Count() - 3);
int element3 = Convert.ToInt32(string.Join("",element3parts));
string element4 = elements.Last();
Not elegant, but it works.

How to split a user-generated string which may contain the delimitter?

I'd like to String.Split() the following string using a comma as the delimitter:
John,Smith,123 Main Street,212-555-1212
The above content is entered by a user. If they enter a comma in their address, the resulting string would cause problems to String.Split() since you now have 5 fields instead of 4:
John,Smith,123 Main Street, Apt 101,212-555-1212
I can use String.Replace() on all user input to replace commas with something else, and then use String.Replace() again to convert things back to commas:
value = value.Replace(",", "*");
However, this can still be fooled if a user happens to use the placeholder delimitter "*" in their input. Then you'd end up with extra commas and no asterisks in the result.
I see solutions online for dealing with escaped delimitters, but I haven't found a solution for this seemingly common situation. What am I missing?
EDIT: This is called delimitter collision.

This is a common scenario — you have some arbitrary string values that you would like to compose into a structure, which is itself a string, but without allowing the values to interfere with the delimiters in structure around them.
You have several options:
Input restriction: If it is acceptable for your scenario, the simplest solution is to restrict the use of delimiters in the values. In your specific case, this means disallow commas.
Encoding: If input restriction is not appropriate, the next easiest option would be to encode the entire input value. Choose an encoding that does not have delimiters in its range of possible outputs (e.g. Base64 does not feature commas in its encoded output)
Escaping delimiters: A slightly more complex option is to come up with a convention for escaping delimiters. If you're working with something mainstream like CSV it is likely that the problem of escaping is already solved, and there's a standard library that you can use. If not, then it will take some thought to come up with a complete escaping system, and implement it.
If you have the flexibility to not use CSV for your data representation this would open up a host of other options. (e.g. Consider the way in which parameterised SQL queries sidestep the complexity of input escaping by storing the parameter values separately from the query string.)

This may not be an option for you but would is it not be easier to use a very uncommon character, say a pipe |, as your delimiter and not allow this character to be entered in the first instance?

If this is CSV, the address should be surrounded by quotes. CSV parsers are widely available that take this into account when parsing the text.
John,Smith,"123 Main Street, Apt. 6",212-555-1212

One foolproof solution would be to convert the user input to base64 and then delimit with a comma. It will mean that you will have to convert back after parsing.

You could try putting quotes, or some other begin and end delimiters, around each of the user inputs, and ignore any special character between a set of quotes.
This really comes down to a situation of cleansing user inputs. You should only allow desired characters in the user input and reject/strip invalid inputs from the user. This way you could use your asterisk delimiter.
The best solution is to define valid characters, and reject non valid characters somehow, then use the nonvalid character (which will not appear in the input since they are "banned") as you delimiters

Dont allow the user to enter that character which you are using as a Delimiter. I personally feel this is best way.

Funny solution (works if the address is the only field with coma):
Split the string by coma. First two pieces will be name and last name; the last piece is the telephone - take those away. Combine the rest by coma back - that would be address ;)

In a sense, the user is already "escaping" the comma with the space afterward.
So, try this:
string[] values = RegEx.Split(value, ",(?![ ])");
The user can still break this if they don't put a space, and there is a more foolproof method (using the standard CSV method of quoting values that contain commas), but this will do the trick for the use case you've presented.
One more solution: provide an "Address 2" field, which is where things like apartment numbers would traditionally go. User can still break it if they are lazy, though what they'll actually break the fields after address2.

Politely remind your users that properly-formed street addresses in the United States and Canada should NEVER contain any punctuation whatsoever, perhaps?
The process of automatically converting corrupted data into useful data is non-trivial without heuristic logic. You could try to outsource the parsing by calling a third-party address-formatting library to apply the USPS formatting rules.
Even USPS requires the user to perform much of the work, by having components of the address entered into distinct fields on their address "canonicalizer" page (http://zip4.usps.com/zip4/welcome.jsp).

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.