Find a specific string within another string without getting similar results

In a console application I receive the RTF (Rich Text Format) code of a file as input. The source is a database, and the data is gathered via a query.
My goal is to check whether the input, as a string, contains the control word \par (end of paragraph in RTF).
I tried string.IndexOf and string.Contains, but both give me wrong results because they also match control words such as "\pard".
Given a string like:
{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Times New Roman;}
{\f1\fnil\fcharset0 MS Sans Serif;}
\deflang1033\pard\plain\tx0\f2\lang1033\fs20\cf1 Payment}
How can I build my condition so that it returns false, since the string does not contain \par? Alternatively, how could I write a regex so that exactly the keyword "\par" (4 characters long) and nothing else will match? Thanks.
EDIT: The language used is C# and I am developing the console application with VS 2010.

You don't tell us the language you are using, but generally you need a word boundary, something like this:
\\par\b
to ensure that no word character follows.
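Since the question's edit says the language is C#, the pattern above could be applied roughly as follows. This is a minimal sketch: the class and method names (RtfSearch, ContainsPar) are made up for illustration, and it assumes the input never contains an escaped backslash directly followed by "par" (the RTF sequence \\par), which this simple check would also match.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class RtfSearch
{
    // A literal backslash, then "par", then a word boundary.
    // "\pard" is rejected because 'd' is a word character right after "par".
    private static readonly Regex ParControlWord = new Regex(@"\\par\b");

    public static bool ContainsPar(string rtf)
    {
        return ParControlWord.IsMatch(rtf);
    }
}
```

With the RTF fragment from the question, ContainsPar should return false, while a string such as @"first line\par second line" should return true.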

Related

Regular Expression for Digits and Special Characters - C#

I use Html-Agility-Pack to extract information from some websites. In the process I get data in the form of string and I use that data in my program.
Sometimes the data I get includes multiple details in the single string. As the name of this Movie "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)" rather than the first one.
Above is the one example from many. In order to correct the issue I tried to use string.Distinct() method but it would remove all the duplicate characters in the string as in above example it would return "Dog Eats (2012)". Now it solved my initial problem by removing the 2nd (2012) but created a new one by changing the actual title.
I thought my problem could be solved with Regex but I have no idea as to how I can use it here. As far as I know if I use Regex it would tell me that there are duplicate items in the string according to the defined Regex code.
But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)".
Now the actual title is "Meme 2013" with year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any method to actually remove the duplicate substring.
The duplicate year always comes at the end of the string. So what regex should I use to determine that the string actually has two years in it, like (2012) (2012)?
If I can correctly identify the string contains duplicate maybe I can use string.LastIndexOf() to try and remove the duplicate part. If there is any better way to do it please let me know.
Thanks.

The right regex is "( \(\d{4}\))\1+".
string pattern = @"( \(\d{4}\))\1+";
s = new Regex(pattern).Replace(s, "$1");
Example here : https://repl.it/Evcy/2
Explanation:
Capture one " (dddd)" block, and remove all following identical ones.
( \(\d{4}\)) does the capture, \1+ finds any non empty sequence of that captured block
Finally, replace the initial block and its copies by the initial block alone.

This regex allows any amount of whitespace between the copies, even none, as in (2013)(2013):
@"(\(\d{4}\))(?:\s*\1)+"
I have a demo of it here
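Put together, the whitespace-tolerant pattern could be wrapped like this. It is only a sketch: TitleCleaner and CollapseRepeatedYear are illustrative names, and the code assumes the year always appears as four digits in parentheses, as in the question's examples.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TitleCleaner
{
    // Group 1 captures one "(dddd)" block; the non-capturing group then
    // consumes one or more identical copies, with optional whitespace between.
    private static readonly Regex RepeatedYear =
        new Regex(@"(\(\d{4}\))(?:\s*\1)+");

    public static string CollapseRepeatedYear(string title)
    {
        // Replace the block and all its copies with the single captured block.
        return RepeatedYear.Replace(title, "$1");
    }
}
```

Because the backreference \1 only matches an identical year, a title such as "Film (2012) (2013)" is left unchanged.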

Why SQL Server stored procedure doesn't recognize text from Visual Studio?

OK, I will try to explain with images. This is my SQL Server and my query:
As you can see, I am getting the result. But then I start my app in VS2013, set a breakpoint where I call my stored procedure, and copy the text from VS:
And paste the name into the query:
But I don't get a result! The names are ABSOLUTELY THE SAME!
This query doesn't work:
SELECT TOP 1 [Employee].[EmployeeID]
FROM [Employee]
WHERE [Employee].[FullName] = 'Brad Oelmann'

I agree the initial suspect is a "special character" that shows up as whitespace when pasted into SSMS.
It has happened to me when filtering client data with T-SQL.
To replace special characters, there is a good starting point here:
.NET replace non-printable ASCII with string representation of hex code
In that case, they're looking for "control characters" in particular and doing a fancy replacement, but the idea of finding the special characters with a RegEx is the same.
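To check whether such a character is present before replacing anything, one option is to make every character outside printable ASCII visible. The following is a rough sketch in C#; the names HiddenCharacters and Reveal are invented for this example, and treating the range from space to '~' as "printable" is an assumption that may be too strict for names with accented letters.

```csharp
using System.Text;

// Hypothetical diagnostic helper, not a library API.
public static class HiddenCharacters
{
    // Returns the input with every character outside ' '..'~'
    // rewritten as a visible \uXXXX escape.
    public static string Reveal(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            if (c >= ' ' && c <= '~')
            {
                sb.Append(c);
            }
            else
            {
                sb.Append("\\u").Append(((int)c).ToString("X4"));
            }
        }
        return sb.ToString();
    }
}
```

If the name copied from the debugger contained, say, a non-breaking space instead of a normal space, Reveal would show it as \u00A0 between the first and last name.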
You can look at all kinds of special sets of characters here:
http://msdn.microsoft.com/en-us/library/20bw873z(v=vs.110).aspx
But it might be easier to define what you do want if you are doing something specific like a name.
For example, you can replace anything that isn't an English letter (for one example) with a space:
str = System.Text.RegularExpressions.Regex.Replace( _
str, _
"[^a-zA-Z]", _
" ")

It's really stupid, but I found a simple solution. Since my DB table contains only ~50 records, I retyped all the names and now it works. So the problem was not in VS but on the SQL Server side.
If somebody has a similar problem, first of all try to update the data in your table somehow. You can try to select all the data, copy-paste it into Notepad and put it back into SQL Server.

How to produce a soft return using C#.net

I know this is kind of an easy question, but I can't seem to find the answer anywhere. Does anyone know how to create a soft return inside a block of text using C#.NET?
I need to write a soft return to a text/XML file that will be generated using C#.NET. You can verify that the answer is correct in Notepad++ by enabling "View > Show Symbol > Show End of Line"; you will then see a symbol like this:
Thanks in advance :)

Not sure what you mean by a soft return. A quick Google search says it's a non-stored line break, typically due to word wrapping. In that case you wouldn't actually put it in a string; it would only be relevant when the string is rendered for display.
To put a carriage return and/or line feed in the string you would use:
string s = "line one\r\nline two";
And for further reference, here are the other escape codes that you can use.
Link (MSDN Blogs)
In response to your edit
The LF that you see can be represented with \n in a string. Obviously you have a specific line ending sequence that you need to represent. If you were to use Environment.NewLine, you would get different results on different platforms.

var message = $"Tom{Convert.ToChar(10)}Harry";
Results in:
Tom
Harry
With just a line feed between.
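The same idea can be written with the \n escape directly. A small sketch, where LineEndings and JoinWithLf are names made up for this example:

```csharp
// Hypothetical helper for illustration only.
public static class LineEndings
{
    // Joins the given lines with a bare line feed (LF, character 10),
    // independent of the platform's Environment.NewLine.
    public static string JoinWithLf(params string[] lines)
    {
        return string.Join("\n", lines);
    }
}
```

Writing the result with System.IO.File.WriteAllText should keep the line feeds as they are, so Notepad++ would be expected to show LF rather than CR LF at the end of each line.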

As already mentioned, you can use Environment.NewLine, but I am not sure whether that is what you want or whether you are actually trying to append the character with code 141 to your string, as mentioned in the comments.
You can append a character by its code to your string like this.
var myString = new StringBuilder("Foo");
myString.Append((char)141);

Replacing specific Unicode characters in strings read from Excel

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.
I retrieve the cell contents into a string thus:
string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();
Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):
s = Regex.Replace(s, "\u0094", "\u0022");
But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:
bool a = s.Contains("\u0094");
but it returns false. However:
bool b = s.Contains("”");
returns true.
My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.

Yes, strings in .NET are UTF-16.
You're doing it right; perhaps your hex math is incorrect.
The character you tested for isn't "\u0094" (not sure that's what you meant). The following worked for me:
((int)"”"[0]).ToString("X") returns "201D"
"”" == "\u201D" returns true
"\u0094" == "" (right hand side is the empty string) returns false
Many UTF-16 code units show up as an empty string in the Text Visualizer, either because they are undisplayable characters or because they are one half of a surrogate pair (some characters have to be written as "\UXXXXXXXX", while others fit in the four-digit form "\uXXXX"). My knowledge of this domain is very limited.
References - Jon Skeet's articles on:
Strings
Unicode
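For context, 0x94 is the byte that Windows-1252 uses for the right curly quote, whereas a .NET string stores that character as the code point U+201D, which is why the test for "\u0094" finds nothing. Based on that, the replacement could be sketched as below. QuoteNormalizer and ToAscii are hypothetical names, and both the list of characters and the ASCII substitutes chosen for them are assumptions to adapt to your data.

```csharp
// Hypothetical helper for illustration only.
public static class QuoteNormalizer
{
    // Replaces a few characters that ISO 8859-1 does not define
    // with plain ASCII stand-ins (the choice of stand-ins is an assumption).
    public static string ToAscii(string s)
    {
        return s
            .Replace('\u2018', '\'')  // left single curly quote
            .Replace('\u2019', '\'')  // right single curly quote
            .Replace('\u201C', '"')   // left double curly quote
            .Replace('\u201D', '"')   // right double curly quote
            .Replace('\u2013', '-')   // en dash
            .Replace('\u2014', '-');  // em dash
    }
}
```

For plain one-to-one character replacement like this, string.Replace is enough and no regex is needed.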

You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accommodate those characters.
That way you don't have to convert the whole database, and you are future-proof, because the columns will be Unicode.

Comparing strings that contain formatting in C#

I'm working on a function that modifies its output (a string) according to some settings, such as line spacing. In order to test such scenarios, I'm using string literals for the expected result, as shown below.
The method, using a string builder, (AppendLine) generates the said output. One issue I have run into is that of comparing such strings. In the example below, both are equal in terms of what they represent. The result is the area which I care about, however when comparing two strings, one literal, one not, equality naturally fails. This is because one of the strings emits line spacing, while the other only demonstrates the formatting it contains.
What would be the best way of solving this equality problem? I do care about formatting such as new lines from the result of the method, this is crucially important.
Code:
string expected = @"Test\n\n\nEnd Test.";
string result = "Test\n\n\nEnd Test";
Console.WriteLine(expected);
Console.WriteLine(result);
Output:
Test\n\n\nEnd Test.
Test


End Test

The @ prefix tells the compiler to take the string exactly as it is written, so it doesn't convert the \n escape sequences into line feed characters.
Since the string assigned to your result variable doesn't have that prefix, the compiler processes its escape sequences. If you would like to continue to use the @ prefix, just do the following:
string expected = @"Test


End Test";
You'll have to enter the line breaks within the string as actual (invisible) characters. Note that these line breaks are whatever your source file uses (often \r\n on Windows), so they will not necessarily be equal to "\n".

You're using the term "literal" incorrectly. "Literal" simply means an actual value that exists in code. In other words, values exist in code either as variables (for the sake of simplicity I'm including constants in this group) or as literals. Variables are an abstract notion of a value, whereas literals are a value.
All this is to say that both of your strings are string literals, as they're hard-coded into your application. The @ prefix simply states that the compiler is to include escape characters (indeed, anything other than a double-quote) in the string as written, rather than evaluating the escape sequences when compiling the string literal into the assembly.
First of all, whatever your function returns (either a string that contains standard escape sequences for newlines rather than newlines themselves, or a string that actually contains newlines) is what your test variable should contain. Make your tests as close to the actual output as possible, as the more work you do to massage the values into a comparable form the more code paths you have to test. If you're looking to be able to compare a string with formatting escape sequences embedded into it to a string where those sequences have been evaluated (essentially comparing the two strings in your example), then I would say this:
1. Be sure that this is really what you want to do.
2. You'll have to duplicate the functionality of the C# compiler in interpreting these values and turning your "format string" into a "formatted string".
For doing #2, a RegEx processor is probably going to be the simplest option. See this page for a list of C# string escape sequences.
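A regex-based conversion along those lines might look like the sketch below. It deliberately handles only a handful of escape sequences (\n, \r, \t, \\ and \"), not the full set the C# compiler understands; EscapeSketch and Unescape are illustrative names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper covering only a small subset of C# escape sequences.
public static class EscapeSketch
{
    public static string Unescape(string text)
    {
        // Match a backslash followed by one of: n, r, t, backslash, double quote.
        return Regex.Replace(text, @"\\([nrt\\""])", m =>
        {
            switch (m.Groups[1].Value)
            {
                case "n": return "\n";
                case "r": return "\r";
                case "t": return "\t";
                default: return m.Groups[1].Value; // \\ becomes \, \" becomes "
            }
        });
    }
}
```

With this, EscapeSketch.Unescape(@"Test\n\n\nEnd Test") should compare equal to "Test\n\n\nEnd Test", which is the comparison from the question once the trailing period is made consistent.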

I feel somewhat enlightened, yet annoyed at what I discovered.
This is my first project using MSTest, and after a failing test I was selecting View Test Details to see how and why my test failed. The formatting for string output in this details display is very poor, for example you get:
Assert.AreEqual failed. Expected:<TestTest End>. Actual:<TestTest End>.
This is for formatted text. The strange thing is that if you have \r (carriage returns) instead of \n (line feeds), the formatting is actually somewhat correct.
It turns out to view the correct output you need to run the tests in debug mode. In other words, when you have a failing test, run the test in debug and the exception will be caught and displayed as follows:
Assert.AreEqual failed. Expected:<Test
Test End>. Actual:<Test
Test End>.
The above obviously containing the correct formatting.
In the end it turns out my initial method of storing the expectations (with formatting) in strings was correct. My unfamiliarity with MSTest made me question my approach, because the input appeared to be valid but the details view was displaying it back to me without its line breaks.

Use a regex to strip white space before you do your compare?
