How to compare two strings with conditions? - c#

I have a paragraph which contains author name like:
Gopi, K.P., and Vijay, S. (1997) Computer Controlled Systems: Theory
and Design, Third edition, Mc Graw-Hill, ND Cliffs, IND.
and another paragraph like this:
It will cause numerical difficulty (Gopi and Vijay, 1997). What’s
more, when the process constraints are activated, the significant
deterioration of closed-loop control performance will be clearly
witnessed as kind of nonlinearity is dominating the control system
(Tenny, Rawlings, and Wright, 2004).
So how can I match these two paragraphs by the author names (Gopi and Vijay) together with the year of publication?
Note: In the reference section, the formatting of the author names and the year is always the same.

A "compare" (between those strings) provides three possible results:
the first string is "greater" than the second
the first string is "less" than the second
the two strings are "identical"
The meaning of "greater", "less" and "identical" depends on the comparison function.
You probably don't want a "compare". What does "the second paragraph is less than the first" even mean?
You are probably interested in finding out where the reference to "Computer Controlled Systems" is used in the text. (Something which should be trivial to do if the paper was referenced properly...)
If this is what you actually need, then it's time to figure out how would you, as a human, handle this task.
My first approach would be to take the reference string
string str = "Gopi, K.P., and Vijay, S. (1997) Computer Controlled Systems";
and see what is actually relevant in it
string[] substrings = str.Split(new char[] { ' ', ',', '(', ')' });
A paragraph referencing this "Computer Controlled Systems" source would be likely to contain "Gopi and Vijay, 1997" somewhere in it.
string toFind = substrings[0] + " and " + substrings[5] + ", " + substrings[9];
Then, I'd open the text in my favorite text viewer and search for "Gopi and Vijay, 1997".
string text = "It will cause numerical difficulty (Gopi and Vijay, 1997). What’s more, when the process constraints are activated, the significant deterioration of closed-loop control performance will be clearly witnessed as kind of nonlinearity is dominating the control system (Tenny, Rawlings, and Wright, 2004).";
int pos = text.IndexOf(toFind);
And then I'd store both the position of the match and a bit of context somewhere.
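int start = Math.Max(pos - 50, 0); // up to 50 characters of context before the match
int end = Math.Min(pos + toFind.Length + 50, text.Length); // up to 50 characters after it
string match = "[...]" + text.Substring(start, end - start) + "[...]";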
Then, I would start looking at a regex because I would realize that there may be other combinations of "Gopi", "Vijay", "1997" and punctuation marks that may be used in the text.
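The regex step could look roughly like the sketch below. CitationFinder and FindCitations are hypothetical names, and the pattern assumes that in-text citations list the surnames in the same order as the reference, separated by commas, "and" or "&", and followed by the year. It is a starting point under those assumptions, not a complete citation parser.

```csharp
using System;
using System.Text.RegularExpressions;

static class CitationFinder
{
    // Hypothetical helper: finds in-text citations such as
    // "Gopi and Vijay, 1997", "Gopi & Vijay 1997" or "Gopi, Vijay (1997".
    public static MatchCollection FindCitations(string text, string[] surnames, string year)
    {
        // Between two surnames: whitespace, commas or ampersands, optionally followed by "and".
        string separator = @"[\s,&]+(?:and\s+)?";
        string names = string.Join(separator, Array.ConvertAll(surnames, s => Regex.Escape(s)));

        // Between the last surname and the year: whitespace, commas or an opening parenthesis.
        string pattern = @"\b" + names + @"[\s,(]*" + Regex.Escape(year) + @"\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase);
    }
}
```

Calling FindCitations(text, new[] { "Gopi", "Vijay" }, "1997") on the paragraph above should return a single match whose Index can be used for the context extraction shown earlier.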

Related

Regex ignore a pattern

I am trying to figure out a viable way to go about parsing this CSV file. Currently I am using FileHelpers, which is great, but it seems to be having issues with this CSV file.
Each record in the CSV file is contained in quotes and delimited by a comma.
The records have commas within them, and 1 record out of the 90,000 records I'm dealing with has a single " that mucks up the ReadLine.
The record looks like this "24" Blah ",
So I'm looking to write a regex to use in the BeforeReadRecord handler that will go through and replace all instances of " with a space.
I'm new to regex, and I'm not finding any way to exclude three cases.
Case one: each line starts with a "
Case two: each line ends with a "
Case three: each field is separated by ","
I am trying to figure out how I could exclude those three cases and be left to replace just the straggler " characters.
So far I've been failing miserably and am not even sure there is a way to accomplish this. Perhaps someone knows of a better CSV parser that handles this one odd case as well?
EDIT: Well, here's what I ended up with. It just changes any outlier " to ', which is fine since the data that contains quotes is needed for any queries. It takes a little time to process, but it was the quickest solution so far (about 7 seconds for 92,000 records), and there doesn't seem to be any way around checking every line. My previous solution was a nasty nested if that seemed to take 30 seconds or so over the course of processing the records. I'm still looking for any pitfalls I may be falling into and for ways to make it faster. It accounts for all scenarios except where someone decides to put a random ", at the end of a field... I'm hoping I don't run into a record like this, but it wouldn't surprise me.
in its own method{
engine.BeforeReadRecord += (sender, args) =>
args.RecordLine = checkQuote(args.RecordLine);
var records = engine.ReadFile(reportFilePath);
}
private static string checkQuote(string checkString)
{
if (checkString.Substring(0, 1) == @"""")
{
string removeQuote = @"""" + checkString.Replace(@"""", "'").Replace(@"','", @""",""").Remove(checkString.Length-1,1).Remove(0,1) + @"""";
return removeQuote;
}
else
return checkString; }
File format readers typically don't handle malformed input well. Why should they? If you give a CSV reader bad data, I would expect it to barf. I've rarely had good luck with computer software that makes assumptions about what I meant.
Do you really need a regular expression? If you define a straggler as the last quote character when the number is odd, then it's trivial to remove the last one: just count them and if the number is odd, remove the last one.
For example:
var quoteCount = inputString.Count(c => c == '\"');
if ((quoteCount % 2) == 1)
{
inputString = inputString.Remove(inputString.LastIndexOf('\"'), 1); // remove only that one character, not the rest of the line
}
Done and done.
You could also do it in a single pass with a loop, but that's probably overkill. I strongly suspect that sanitizing the input is not a major bottleneck in your program.
For more complex patterns (i.e. you're looking for "," or for a quote at the start and end), you just write a simple state machine. It's probably a dozen lines of code.
I realize that you might be able to do this with regular expressions. I find regex great for finding stuff and doing simple replacements. For more complicated rules like "replace quote with space unless the quote is at the beginning or end of line or next to a comma", I find it hard to come up with a good expression. For example, what about this case:
"first name","last name","","phone"
You have to take that blank field (i.e. "") into account. You also have to take into account spaces between fields (i.e. "first" , "last" , ""), and a whole host of other things. I'm reasonably sure that regex can do it. My experience has been that I can usually write the simple state machine and prove that it's correct faster than I can puzzle out the required regex. And it's certain that I'll more easily understand the state machine six months later.
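As a rough illustration of the state machine idea, here is a minimal sketch. QuoteCleaner and ReplaceStrayQuotes are hypothetical names, and the code assumes every field is quoted and that fields are separated by exactly "," with no spaces around the comma. Like the asker's version, it cannot tell a stray "," inside a field from a real separator.

```csharp
using System;
using System.Text;

static class QuoteCleaner
{
    // Hypothetical helper: keeps the quotes that open or close a field and
    // replaces every other quote with the given replacement character.
    public static string ReplaceStrayQuotes(string line, char replacement = '\'')
    {
        var sb = new StringBuilder(line.Length);
        bool inField = false; // the only state: are we inside a quoted field?

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (!inField)
            {
                // Outside a field: a quote opens the next field, anything else passes through.
                if (c == '"') inField = true;
                sb.Append(c);
            }
            else if (c == '"')
            {
                // Inside a field: a quote closes it only at the end of the line
                // or when it is followed by ," (separator plus next opening quote).
                bool atEnd = i == line.Length - 1;
                bool beforeSeparator = i + 2 < line.Length && line[i + 1] == ',' && line[i + 2] == '"';
                if (atEnd || beforeSeparator)
                {
                    inField = false;
                    sb.Append(c);
                }
                else
                {
                    sb.Append(replacement); // stray quote
                }
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With FileHelpers this could be wired in the same way as checkQuote in the question, by assigning the result to args.RecordLine in the BeforeReadRecord handler. Empty fields such as "" pass through unchanged because their closing quote is followed by ,".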

Confused about C# character special meanings

I'm a newbie in C# programming, and I don't understand how character escaping works. I tried many of them, but only '\t' works like it should.
An example for my code:
textBox1.Text = "asd";
textBox1.Text = textBox1.Text + '\n';
textBox1.Text = textBox1.Text + "asd";
MultiLine is enabled, but my output is simply "asdasd" without breaking the line.
I hope some of you know what the answer is.
You need "\r\n" for a line break in Windows controls, because that's the normal line break combination for Windows. (Line breaks are the bane of programmers everywhere. Different operating systems have different character sequences that they use as their "typical" line breaks.)
"\r\n" is two characters: \r for carriage return (U+000D) and \n for line feed (U+000A). You don't need to do it in three statements though:
textBox1.Text = "First line\r\nSecond line";
Now, I've deliberately gone with \r\n there instead of Environment.NewLine, on the grounds that if you're working with System.Windows.Forms, those will be Windows-oriented controls. It's unclear to me what an implementation of Windows Forms will do on Mac or Linux, where the normal line break is different. (My guess is that a pragmatic implementation will break on any of \r, \n or \r\n, just like TextReader does, for simplicity.)
Sometimes - such as if you're writing a text file for consumption on the same machine - it's good to use Environment.NewLine.
Sometimes - such as when you're implementing a network protocol such as HTTP which mandates a specific line break format - it's good to be explicit about it.
Sometimes - such as in this case - it's just not clear. However, it's always worth thinking about.
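To make the first two cases concrete, here is a small sketch. The class and method names are made up for illustration. The first method spells out \r\n because HTTP requires it, while the second uses Environment.NewLine because the text is meant for the local machine.

```csharp
using System;

static class LineBreakExamples
{
    // Protocol text: HTTP mandates CRLF, whatever platform the code runs on.
    public static string BuildRequestHead(string host)
    {
        return "GET / HTTP/1.1\r\n" +
               "Host: " + host + "\r\n" +
               "\r\n";
    }

    // Local text: Environment.NewLine follows the platform the code runs on.
    public static string BuildLocalNote()
    {
        return "First line" + Environment.NewLine + "Second line";
    }
}
```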
For a complete list of escape sequences in C#, you can either look at the C# language specification (always fun!) or look at my Strings article which contains that information and more.
The most legible way to insert a newline in C# is:
textBox1.Text = "asd";
textBox1.Text = textBox1.Text + Environment.NewLine;
textBox1.Text = textBox1.Text + "asd";
This code snippet would set textBox1 text to "asd", then add a newline to it, then on the second line, it would add "asd" again.
The special characters you are confused about are artifacts from when computers used teletype printers for all of their output. Teletype printers predated digital displays by several years.
\n tells the printer (output) to move the print head down one row (new line).
\r tells the printer (output) to move the print head back to the start of the row (carriage return).
Combined, this resulted in the print head being positioned at the start of the next row, ready for output onto clean paper.
Further information is easy to find on the internet, for example in Wikipedia's Newline article.

0x202A in filename: Why?

I recently needed to do a isnull in SQL on a varbinary image.
So far so (ab)normal.
I very quickly wrote a C# program to read in the file no_image.png from my desktop, and output the bytes as hex string.
That program started like this:
byte[] ba = System.IO.File.ReadAllBytes(@"‪D:\UserName\Desktop\no_image.png");
Console.WriteLine(ba.Length);
// From here, change ba to hex string
And as I had used ReadAllBytes countless times before, I figured it was no big deal.
To my surprise, I got a "NotSupported" exception on ReadAllBytes.
I found that the problem was that when I right click on the file, go to tab "Security", and copy-paste the object-name (start marking at the right and move inaccurately to the left), this happens.
And it happens only on Windows 8.1 (and perhaps 8), but not on Windows 7.
When I output the string in question:
public static string ToHexString(string input)
{
string strRetVal = null;
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (char c in input)
{
sb.Append(((int)c).ToString("X2"));
}
strRetVal = sb.ToString();
sb.Length = 0;
sb = null;
return strRetVal;
} // End Function ToHexString
string str = ToHexString(@"‪D:\UserName\Desktop\cookie.png");
string strRight = " (" + ToHexString(@"D:\UserName\Desktop\cookie.png") + ")"; // Correct value, for comparison
string msg = str + Environment.NewLine + " " + strRight;
Console.WriteLine(msg);
I get this:
202A443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67
(443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67)
First thing: when I look up 20 2A in ASCII, it's [space] + *.
Since I see neither a space nor a star, I googled 20 2A, and the first thing I got was paragraph 202a of the German penal code:
http://dejure.org/gesetze/StGB/202a.html
But I suppose that is just an unfortunate coincidence, and it is actually the Unicode control character 'LEFT-TO-RIGHT EMBEDDING' (U+202A):
http://www.fileformat.info/info/unicode/char/202a/index.htm
Is that a bug, or is that a feature ?
My guess is, it's a buggy feature.
The issue is that the string does not begin with a letter D at all - it just looks like it does.
It appears that the string is hard-coded in your source file.
If that's the case, then you have pasted the string from the security dialog. Unbeknownst to you, the string you pasted begins with the LRE character (U+202A, LEFT-TO-RIGHT EMBEDDING). This is an invisible character which takes no space, but tells the renderer to treat the text that follows as left-to-right, regardless of the surrounding direction.
You just need to delete the character.
To do this, position the cursor AFTER the D in the string. Use the Backspace or Delete to Left key <x] to delete the D. Use the key again to delete the invisible LRE character. One more time to delete the ". Now retype the " and the D.
A similar problem could occur wherever the string came from - e.g. from user input, command line, script file etc.
Note: The security dialog shows the filename beginning with the LRE character to ensure that characters are displayed in left-to-right order, which is necessary to ensure that the hierarchy is correctly understood when using RTL characters. E.g. a filename c:\folder\path\to\file in Arabic might be c:\folder\مسار/إلى/ملف. The "gotcha" is that the Arabic parts read in the other direction, so the word "path", which according to Google Translate is مسار, is the rightmost word, making it appear as if it were the last element of the path, when in fact it is the element immediately after "c:\folder\".
Because security object paths have a hierarchy which is in conflict with the RTL text layout rules, the security dialog always displays RTL text in LTR mode. That means that the Arabic words will be mangled (letters in wrong order) on the security tab. (Imagine it as if it said "elif ot htap"). So the meaning is just about discernible, but from the point of view of security, the security semantics are preserved.
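When the path comes from user input, the command line or a script file rather than from a literal you can retype, one option is to strip such characters in code. The sketch below is only an illustration with hypothetical names (PathSanitizer, StripFormatChars). It removes every character in the Unicode category Format (Cf), which includes U+202A. That category also contains characters such as the zero-width joiner that a legitimate filename might use, so treat this as an assumption to check against your own data.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class PathSanitizer
{
    // Hypothetical helper: drops all characters in Unicode category Cf (Format),
    // e.g. U+202A LEFT-TO-RIGHT EMBEDDING, and keeps everything else.
    public static string StripFormatChars(string input)
    {
        return new string(input
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format)
            .ToArray());
    }
}
```

For example, StripFormatChars("\u202AD:\\UserName\\Desktop\\no_image.png") yields a string that starts with a real 'D', which File.ReadAllBytes can then handle as usual.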
Filenames that contain RLO/LRO overrides are commonly created by malware. Eg. “exe” read backwards spells “malware”. You probably have an infected host, or the origin of the .png is infected.
This question bothered me a lot, how would it be possible that a deterministic function would give 2 different results for identical input? After some testing, it turns out that the answer is simple.
If you look through it in your debugger, you will see that the 'D' char in your @"‪D:\UserName\Desktop\cookie.png" (first use of the hex function) is NOT the same char as in @"D:\UserName\Desktop\cookie.png" (second use).
You must have used some other 'D'-like character, probably by an unwanted keyboard shortcut or by messing with your Visual Studio character encoding.
It looks exactly the same, but in reality it's not even a single char (try watching the c variable in your ToHexString function).
If you change to the normal 'D' in your first example, it will work fine.

How to display word differences using c#?

I would like to show the differences between two blocks of text. Rather than comparing lines of text or individual characters, I would like to just compare words separated by specified characters ('\n', ' ', '\t' for example). My main reasoning for this is that the block of text that I'll be comparing generally doesn't have many line breaks in it and letter comparisons can be hard to follow.
I've come across the following O(ND) logic in C# for comparing lines and characters, but I'm sort of at a loss for how to modify it to compare words.
In addition, I would like to keep track of the separators between words and make sure they're included with the diff. So if space is replaced by a hard return, I would like that to come up as a diff.
I'm using Asp.net to display the entire block of text including the deleted original text and added new text (both will be highlighted to show that they were deleted/added). A solution that works with those technologies would be appreciated.
Any advice on how to accomplish this is appreciated.
Thanks!
Microsoft has released a diff project on CodePlex that allows you to do word, character, and line diffs. It is licensed under Microsoft Public License (Ms-PL).
https://github.com/mmanela/diffplex
Other than a few general optimizations, if you need to include the separators in the comparison you are essentially doing a character-by-character comparison with breaks. Though you could use the O(ND) code you linked, you would end up making about as many changes to it as you would make writing your own.
The main problem with difference comparison is finding the continuation (if I delete a single word, but leave the rest the same).
If you want to use their code, start with the example and do not write the deleted characters; if there are replaced characters in the same place, do not output that result. You then need to compute the longest continuous run of "changed" words, highlight that string, and output it.
Sorry, that's not much of an answer, but for this problem the answer is basically writing and tuning the function.
Well String.Split with '\n', ' ' and '\t' as the split characters will return you an array of words in your block of text.
You could then compare each array for differences. A simple 1:1 comparison would tell you if any word had been changed. Comparing:
hello world how are you
and:
hello there how are you
would tell you that world changed to there.
What it wouldn't tell you was if words had been inserted or removed and you would still need to parse the text blocks character by character to see if any of the separator characters had been changed.
string string1 = "hello world how are you";
string string2 = "hello there how are you";
var first = string1.Split(' ');
var second = string2.Split(' ');
var primary = first.Length > second.Length ? first : second;
var secondary = primary == second ? first : second;
var difference = primary.Except(secondary).ToArray();
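To also catch inserted or removed words and changed separators, one option is to keep the separators as tokens and run a longest-common-subsequence (LCS) comparison over the token arrays. The sketch below is a plain LCS table, not the O(ND) algorithm mentioned in the question, and WordDiff, Tokenize and Diff are hypothetical names. It uses O(N*M) memory, so it only suits moderately sized blocks of text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WordDiff
{
    // Splits text into words and separators. The capture group makes
    // Regex.Split return the separators as tokens too.
    public static string[] Tokenize(string text)
    {
        return Array.FindAll(Regex.Split(text, @"([ \t\n])"), t => t.Length > 0);
    }

    // Returns every token prefixed with "=" (unchanged), "-" (deleted) or "+" (added).
    public static List<string> Diff(string[] a, string[] b)
    {
        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..].
        var lcs = new int[a.Length + 1, b.Length + 1];
        for (int i = a.Length - 1; i >= 0; i--)
            for (int j = b.Length - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk both arrays, following the table to decide what to emit.
        var result = new List<string>();
        int x = 0, y = 0;
        while (x < a.Length && y < b.Length)
        {
            if (a[x] == b[y]) { result.Add("=" + a[x]); x++; y++; }
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) { result.Add("-" + a[x]); x++; }
            else { result.Add("+" + b[y]); y++; }
        }
        while (x < a.Length) result.Add("-" + a[x++]);
        while (y < b.Length) result.Add("+" + b[y++]);
        return result;
    }
}
```

Calling Diff(Tokenize("hello world how"), Tokenize("hello there\nhow")) marks world and the space after it as deleted and there and the line break as added. In ASP.NET, the "-" and "+" tokens could then be wrapped in differently styled spans.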

How to add line break in C# behind page

I have written code in C# which is exceeding page width, so I want it to be broken into next line according to my formatting. I tried to search a lot to get that character for line break but was not able to find out.
In VB.NET I use '_' for line break, same way what is used in C#?
I am trying to break a string.
In C# there's no line-continuation character like there is in VB.NET. The end of a logical 'line' of code is denoted by a ';'. If you wish to break a line of code over multiple lines, just hit the carriage return (or, for programmatically generated code, insert Environment.NewLine or '\r\n').
Edit: In response to your comment: If you wish to break a string over multiple lines (i.e. programmatically), you should insert Environment.NewLine. This takes the environment into account when creating the line ending. For instance, many environments, including Unix/Linux, use only a line feed (\n), but Windows uses both carriage return and line feed (\r\n). So to break a string you would use:
string output = "Hello this is my string\r\nthat I want broken over multiple lines.";
Of course, this would only be good for Windows, so before I get flamed for incorrect practice you should actually do this:
string output = string.Format("Hello this is my string{0}that I want broken over multiple lines.", Environment.NewLine);
Or if you want to break over multiple lines in your IDE, you would do:
string output = "My string"
+ "is split over"
+ "multiple lines";
Option A: concatenate several string literals into one:
string myText = "Looking up into the night sky is looking into infinity" +
" - distance is incomprehensible and therefore meaningless.";
Option B: use a single multiline string literal:
string myText = @"Looking up into the night sky is looking into infinity
- distance is incomprehensible and therefore meaningless.";
With option B, the newline character(s) will be part of the string saved into variable myText. This might, or might not, be what you want.
result = "Minimum MarketData"+ Environment.NewLine
+ "Refresh interval is 1";
Use the @ symbol before starting the string,
like
string s = @"this is a really
long string
and this is
the rest of it";
If I am understanding this correctly, you should be able to break the string into substrings to accomplish this.
i.e.:
string s = "this is a really long string" +
"and this is the rest of it";
C# doesn't have an explicit line break character. Your statements end with a semicolon, so you can span your statements over many lines. These are both the same:
public string GenerateString()
{
return "abc" + "def";
}
public string GenerateString()
{
return
"abc" +
"def";
}
All you need to do is add \n or, when writing to files, \r\n.
Examples:
Say you wanted to write duck (line break) cow; this is how you would do it:
Console.WriteLine("duck\n cow");
Edit: I think I didn't understand the question. You can use
@"duck
cow".Replace("\r\n", "")
as a line break in code. The verbatim literal contains \r\n when the source file uses Windows line endings, and the Replace call removes it again.
C# code can be split between lines on pretty much any syntactic construct without a need for a '_' style construct.
For example
foo.
Bar(
42
, "again");
dt = abj.getDataTable(
"select bookrecord.userid,usermaster.userName, "
+" book.bookname,bookrecord.fromdate, "
+" bookrecord.todate,bookrecord.bookstatus "
+" from book,bookrecord,usermaster "
+" where bookrecord.bookid='"+ bookId +"' "
+" and usermaster.userId=bookrecord.userid "
+" and book.bookid='"+ bookId +"'");
guys.. use resources for long strings in code behind!!
also.. you don't need an _ for codeline breaks in C#. In VB the codelines end with a newline character (or a ':'), and using the _ tells the parser it has not reached the end of the line yet. The codeline in C# ends with a ';' so you can use newlines to style/format your code.
Strings are immutable, but using
public string GenerateString()
{
return
"abc" +
"def";
}
will not slow your performance: both operands are string literals, so the compiler folds the concatenation into a single constant at compile time. Concatenating non-constant strings at runtime is a different matter, and that is bad news if you reuse the method/property/whatever a lot...
Storing your string literals in resources is a good idea...
public string GenerateString()
{
return Resources.MyString;
}
That way it is localisable and the code is tidy (although performance is pretty terrible).
