C# Regex.Replace not matching and removing double quotes

I googled for an answer and found some questions here on Stack Exchange asking similar questions, but they didn't help me. For example, I found "C# regex - not matching my string", but the answers given are way too complicated for me to understand. I don't know or understand regex. All I want to do is strip a double quote from a string.
To put my question simply, I have a string "\"123.456\"" and I need to remove the "\"",
so I made my expression "[^\w\\"]" and after calling
string myString = Regex.Replace("\"123.456\"", "[^\\w\\\"]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
myString is "\"123.456\"". I just need to know what my expression should be. I won't be able to understand any lengthy discussions or lectures on learning regex.
I got my example directly from Microsoft at http://msdn.microsoft.com/en-us/library/844skk0h(v=vs.110).aspx so basically all I did was replace the ".#-" with "\"".
UPDATE
Apparently trying to ask a simple question only attracts trolls. I didn't want to get too complicated because I didn't want all you hard working busy people to spend too much time answering the wrong question. I was trying to be nice.
We have a situation where we need to parse input files from several clients, and going forward, both the number of clients and the number of files from each client will increase.
We found that in several of our clients' transmitted files many fields will have various extra characters. We don't know how or why those characters are in there and our clients aren't telling. (if you want to know why they aren't telling, please move along, these aren't the questions you are looking for)
So, we have many files from many clients each with many rows with many fields of data and we need to strip out "bad" characters.
I took Microsoft's method and changed it a bit to be more dynamic.
private string CleanInput(string strIn, string chars)
{
// Replace invalid characters with empty strings.
try
{
string regexString = string.Format(@"[^\w\{0}]", chars);
return Regex.Replace(strIn, regexString, "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException)
{
return string.Empty;
}
}
The goal here is to be able to strip out any characters that don't belong, dynamically. But we can't just hard-code those characters, because not all fields will have any of these characters and, more importantly, some fields will have some bad characters along with other characters which are not considered bad for that field but may be considered bad for other fields.
With me so far?
So, in trying to get my work done by Friday (yes, tomorrow), I decided to start slowly with only a couple of known bad characters from 3 input files. So far, those characters are single quote, dash, double quote, dollar sign, comma. But not all the fields in my 3 files need these characters stripped, so I intend to call the CleanInput method only on those fields that need it, and only for the characters that we need stripped.
OK, so while I was testing, I discovered that on one field, where we want to strip the comma, single quote, double quote and dollar sign, it was not removing the double quotes (and apparently the backslashes too). So I debugged this issue by first passing in only the comma - that worked. Then I tried passing in only the single quote - that worked. Then I passed in the dollar sign - that worked. Then I passed in the escaped double quote - and that didn't work; the double quotes are still in the string. So I simplified my test in a new console project, hard-coded the string, and called my method just to make sure nothing else could be interfering with it.
I hope and pray no one spends hours of their precious time trying to reconfigure my input files or attempting to teach me the end all be all of regex programming. I have to get this done by tomorrow. Please, I only want to know how to strip the double quote (and apparently the backslashes too) from the given string.

Rather than getting regex involved, perhaps you can just use Replace?
var myString = "\\\"123.456\\\"";
var myCleanString = myString.Replace(@"\""", "");

You are matching on a negated character class (the [^] bit). This matches any character not listed in the square brackets and replaces it, so your pattern strips everything except word characters and double quotes, which is why the quotes survive. You want to replace anything that is in the class, which you can do by placing the characters you wish to remove inside the square brackets and dropping the negation (^):
private static string CleanInput(string strIn, string chars)
{
// Replace invalid characters with empty strings.
try
{
string regexString = string.Format(@"[{0}]", chars);
return Regex.Replace(strIn, regexString, "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException)
{
return string.Empty;
}
}
You would use the negative version if you knew what you wanted to include rather than exclude. For example if you knew you only wanted numbers and the period character you could do:
string myString = Regex.Replace("\"123.456\"", "[^\\d.]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
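One caveat with building the class from caller-supplied characters: a few characters (backslash, ], [, ^ and -) have a special meaning inside square brackets and need a backslash in front of them. The following is only a minimal sketch of that idea. The names InputCleaner and StripChars are made up for illustration, and it assumes that escaping those five characters is enough for the inputs involved.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class InputCleaner
{
    // Hypothetical helper (not from the question): removes every character
    // listed in 'chars' from 'input'.
    public static string StripChars(string input, string chars)
    {
        // Nothing to clean, or nothing to strip: return the input unchanged.
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(chars))
            return input;

        var charClass = new StringBuilder("[");
        foreach (char c in chars)
        {
            // Backslash-escape characters that are special inside [...].
            if (c == '\\' || c == ']' || c == '[' || c == '^' || c == '-')
                charClass.Append('\\');
            charClass.Append(c);
        }
        charClass.Append(']');

        try
        {
            return Regex.Replace(input, charClass.ToString(), "",
                RegexOptions.None, TimeSpan.FromSeconds(1.5));
        }
        catch (RegexMatchTimeoutException)
        {
            // Same fallback as CleanInput: return an empty string on timeout.
            return string.Empty;
        }
    }
}
```

With this sketch, InputCleaner.StripChars("\"$1,234.56\"", "\"$,") should come back as 1234.56.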

Related

Failure To Get Specific Text From Regex Group

My example works fine with greedy matching when I capture the whole string and a group (in Groups[1] only) enclosed by a single pair of single quotes.
But when the string contains multiple pairs of single quotes, Groups[1] only captures the text enclosed by the last pair, not the text between the first and last single quotes.
string val1 = "Content:abc'23'asad";
string val2 = "Content:'Scale['#13212']'ta";
Match match1 = Regex.Match(val1, @".*'(.*)'.*");
Match match2 = Regex.Match(val2, @".*'(.*)'.*");
if (match1.Success)
{
string value1 = match1.Value;
string GroupValue1 = match1.Groups[1].Value;
Console.WriteLine(value1);
Console.WriteLine(GroupValue1);
string value2 = match2.Value;
string GroupValue2 = match2.Groups[1].Value;
Console.WriteLine(value2);
Console.WriteLine(GroupValue2);
Console.ReadLine();
// Using greedy matching for val1 I get exactly the values I want:
// value1--->Content:abc'23'asad
// GroupValue1--->23
// BUT using greedy matching for val2 I get the string enclosed by the last pair of single quotes:
// value2--->Content:'Scale['#13212']'ta
// GroupValue2---> ]
// But I want GroupValue2--->Scale['#13212']
}

The problem with your existing regex is that you are using too many greedy quantifiers. That first one is going to grab everything it can until it runs into the second-to-last apostrophe in the string. That's why your end result for the second example is just the stuff within the last pair of quotes.
There are a few ways to approach this. The simplest way is to use Slai's suggestion - just a pattern to grab anything and everything between the outermost apostrophes available:
'(.*)'
A more explicitly defined approach would be to slightly tweak the pattern you are currently using. Just change the first greedy quantifier into a lazy one:
.*?'(.*)'.*
Alternatively, you could change the dot in that first and last section to instead match every character other than an apostrophe:
[^']*'(.*)'[^']*
Which one you end up using depends on what you're personally going after. One thing of note, though, is that according to Regex101, the first option involves the fewest steps, so it will be the most efficient method. However, it also dumps the rest of the string, but I don't know if that matters to you.
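To compare the options side by side, here is a small sketch that runs the original pattern and the three alternatives against the second sample string from the question. QuoteCapture and Group1 are names invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteCapture
{
    // Returns group 1 of the first match, or null when the pattern does not match.
    public static string Group1(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Demo()
    {
        string val2 = "Content:'Scale['#13212']'ta";

        // Original pattern: the leading greedy .* leaves only the last pair.
        Console.WriteLine(Group1(val2, @".*'(.*)'.*"));        // ]

        // The three alternatives all capture between the outermost apostrophes.
        Console.WriteLine(Group1(val2, @"'(.*)'"));            // Scale['#13212']
        Console.WriteLine(Group1(val2, @".*?'(.*)'.*"));       // Scale['#13212']
        Console.WriteLine(Group1(val2, @"[^']*'(.*)'[^']*"));  // Scale['#13212']
    }
}
```

The difference between the alternatives shows up in match.Value rather than in the group: the first keeps only the quoted part, the other two still cover the whole string.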

First off, use named match capture groups such as (?<Data> ... ); then you can access that group by its name in C#, such as match1.Groups["Data"].Value.
Secondly, try not to use *, which means zero or more. Is there really going to be no data? For a majority of cases the answer is no, there is data.
Use +, one or more, instead.
IMHO * screws up more patterns because it is allowed to match zero characters; when it does that, it can skip ungodly amounts of data. When you know there is data, use +.
It is better to match on what is known than on what is unknown, so we will create a pattern for what is known. In that light, use the negated set [^ ] to capture text, such as [^']+, which says capture everything that is not a ', one or more times.
Pattern
Content:\x27?[^\x27?]+\x27(?<Data>[^\x27]+?)\x27
The results on your two sets of data are 23 and #13212 and placed into match capture group[1] and group["Data"].
Note \x27 is the hex escape of the single quote '. \x22 is for the double quote ", which I bet is what you are really running into.
I use the hex escapes when dealing with quotes so not to have to mess with the C# compiler thinking they are quotes while parsing.
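As a rough sketch of how the named group is read back in C#, using the pattern above with \x27 throughout (NamedGroupDemo and ExtractData are hypothetical names used only here):

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // \x27 is the single quote; the verbatim string passes the escape
    // through to the regex engine untouched.
    private const string Pattern =
        @"Content:\x27?[^\x27?]+\x27(?<Data>[^\x27]+?)\x27";

    // Returns the text captured by the Data group, or null when nothing matches.
    public static string ExtractData(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups["Data"].Value : null;
    }
}
```

For the two sample strings this gives 23 and #13212. Note that #13212 is the innermost quoted text, not the Scale['#13212'] value the question asked for.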

Prevent Regex from devouring optional part of the match

I've searched extensively but I can't find a simple answer to this and my Regex experience is limited. I'd appreciate a simple solution that is explained, please.
I have a very large string and I need to substitute certain words in it as follows:
Example: wherever you find the string "LINK-ABC" make it "LINK_ABC".
I wrote my Regex Match and Replace strings:
@"LINK-ABC", @"LINK_ABC" and it worked.
But there were a couple of things I had not recognized.
There COULD be words in the file like this:
LINK-ABC-DEF LINK-ABC-GHI-JKL ... and so on.
So I get "LINK_ABC-DEF" etc. (which is NOT what I want; this should have remained intact...)
Once I realized the problem it seemed that what I REALLY wanted was to recognize ONLY the word being matched and leave any cases where it was in combination with something else, unchanged. It seemed to me that if I checked for a space or period on the Match word, that should do it, so...
@"LINK-ABC[ |\\.]",@"LINK_ABC"
... and now I have stumbled.
Sample string:
link-xxx link-aaa-sss link-xxx-bbb link-xxx link-xxx.
Match/Replace string:
link-xxx[ |\\.],link_xxx
Result string:
link_xxxlink-aaa-sss link-xxx-bbb link_xxxlink_xxx
The replacements are correct, BUT the trailing space or period has been "devoured" and so the result string is wrong.
Is there a way that I can match so that if it matches on space, the replacement will have a space and if it matches on a period, the replacement will have a period? I s'pose I could do 2 separate matches but I'd like to increase my understanding of Regex and do it more elegantly if it is possible.

You should be able to achieve the behavior you want with "capture groups":
var matchstring = @"link-xxx([ \.]|$)";
var fixstr = @"link_xxx$1";
The parentheses around the last part of the matchstring will retain whatever matched inside them, and the $1 in the fixstr will substitute whatever was captured by that group.
I've also modified your punctuation section a little bit, presuming you want to replace a match if it happens to be the last word in the input (by adding the |$). A | inside a character class [] is a literal | character, so I removed it assuming you don't actually expect that in your input.
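Here is a short sketch of that replacement applied to the sample string from the question. LinkRenamer and its method names are invented for illustration. The second method is an alternative that is not part of the answer above: a lookahead, which checks for the trailing character without consuming it, so no $1 is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRenamer
{
    // Capture-group version: the space, period, or end of input is captured
    // and written back through $1.
    public static string Rename(string input)
    {
        return Regex.Replace(input, @"link-xxx([ \.]|$)", "link_xxx$1");
    }

    // Alternative sketch using a lookahead: the trailing character is only
    // inspected, never consumed, so nothing has to be put back.
    public static string RenameWithLookahead(string input)
    {
        return Regex.Replace(input, @"link-xxx(?=[ .]|$)", "link_xxx");
    }
}
```

Both versions turn "link-xxx link-aaa-sss link-xxx-bbb link-xxx link-xxx." into "link_xxx link-aaa-sss link-xxx-bbb link_xxx link_xxx.".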

C# special characters

I need to verify that a string doesn't contain any special characters like #,%...™ etc. Basically it's a Name/surname (and some similar) strings, however, sticking to [a-zA-Z] wouldn't do as symbols like ščřž... are allowed.
At the moment I'd go with somewhat like
bool NonSpecial(string text){
return !Regex.Match(text, "[" + Regex.Escape("!@#$%^&......") + "]").Success;
}
but that just seems to be too complicated and clumsy.
Is there any simpler and/or more elegant way?
Update:
So after reading all the replies I decided to go with
private bool IsName( string text ) {
return Regex.Match( text, @"^[\p{L}\p{Nd}'\.\- ]+$" ).Success && !Regex.Match( text, @"['\-\.]{2}" ).Success && !Regex.Match( text, "  " ).Success;
}
Basically the name can contain letters, numbers, ', ., -, and spaces; any of the "'.-" characters must be separated by at least 1 other allowed character, and there cannot be 2 spaces in a row.
Hope that's correct.

Have you tried text.All(Char.IsLetter)?
PS http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

You can use the Unicode category for letters:
Regex.Match(text, @"\p{L}+");
See Supported Unicode Categories.

This problem is worse than you imagine.
There are literally thousands of allowable characters that can legitimately be part of a name, spread over hundreds of ranges in the various unicode alphabets.
There are also literally tens of thousands of characters that will never be part of a name. Think of all the emoji and ascii art characters. These are also spread over hundreds of separate ranges of unicode characters.
Sifting the wheat from the chaff via manual code, even regular expressions, just isn't going to work well.
Thankfully, this work has been done for you. Look at the char.IsLetter() method.
You may also want to have an exception for the various allowed separator characters and accents that are not letters, but can be part of a name: hyphens, apostrophes, and periods are legitimate, and all have more than one allowed Unicode encoding. Unfortunately, I don't have a quick solution for you here. This may have to be a best-effort approach, looking at just some of the more common ones.
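A best-effort version of that idea might look like the sketch below. NameCheck, LooksLikeName and the separator list are assumptions made for illustration only. The list covers just the plain ASCII space, apostrophe, hyphen and period, so typographic variants and combining accent marks would still be rejected.

```csharp
using System;
using System.Linq;

public static class NameCheck
{
    // Assumed separator set; extend it for the data you actually see.
    private const string AllowedSeparators = " '-.";

    // True when every character is a Unicode letter or an allowed separator.
    public static bool LooksLikeName(string text)
    {
        return !string.IsNullOrEmpty(text)
            && text.All(c => char.IsLetter(c) || AllowedSeparators.IndexOf(c) >= 0);
    }
}
```

This only filters characters. It does not enforce rules such as "no two separators in a row".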

Try using LINQ with a lambda as well; it's pretty straightforward.
This will return true if the text contains any character that is not a letter:
bool result = text.Any(x => !char.IsLetter(x));

Match up everything before STRING or STRING

I've searched for hours and already tried tons of different patterns - there's a simple thing I want to achieve with regex, but somehow it just won't do as I want:
Possible Strings
String1
This is some text \0"§%lfsdrlsrblabla\0\0\0}dfglpdfgl
String2
This is some text
String3
This is some text \0
Desired Match/Result
This is some text
I simply want to match everything up to, but not including, the \0, resulting in only 1 match (everything before the \0).
Important for my case is that it will match every time, even when the \0 is not present.
Thanks for your help!

You can try with this pattern:
@"^(?:[^\\]+|\\(?!0))+"
In other words: all characters except backslashes or backslashes not followed by 0

I like
@"^((?!\\0).)*"
Because it's very easy to adapt to any arbitrary string. The basic trick is the negative lookahead, which asserts that the string starting at this point doesn't match the regular expression inside. We follow this with a wildcard, so together they mean "any character that is not the start of my delimiter string". If your delimiter should change, this is an easy update - just
@"^((?!--STRING--).)*"
As long as you properly escape that string. Heck, with this pattern, you're merely a regex_escape function away from handling any delimiter string.
Bonus: using * instead of + will return a blank string as a valid match when your string starts with your delimiter.
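In .NET the escaping step can be done with Regex.Escape. Below is a minimal sketch of the lookahead pattern built from an arbitrary delimiter. DelimiterCut and TextBefore are hypothetical names, and it assumes the \0 in the question is a literal backslash followed by a zero, as both patterns above treat it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterCut
{
    // Returns everything before the first occurrence of 'delimiter',
    // or the whole input when the delimiter never appears.
    public static string TextBefore(string input, string delimiter)
    {
        // Regex.Escape makes the delimiter safe to embed in the lookahead.
        string pattern = "^((?!" + Regex.Escape(delimiter) + ").)*";
        return Regex.Match(input, pattern).Value;
    }
}
```

Because . does not match a newline by default, the match also stops at the first line break; passing RegexOptions.Singleline would change that if multi-line input matters.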

Using a Regex to find specifically formatted fields

In an application I'm developing, someone thought it would be an OK idea to include commas in the values of a CSV file. So what I'm trying to do is select these values and then strip the commas out of them. But the Regex I've built won't match anything.
The Regex pattern is: .*,\"<money>(.+,.+)\",?.*
And the sort of values I'm trying to match would be the 2700, 2650 and 2600 in "2,700","2,650","2,600".

Commas are allowed in CSVs and should not cause a problem if you use a text qualifier (usually double quote ").
Here are details: http://en.wikipedia.org/wiki/Comma-separated_values
On to the regex:
This code works for your sample data (but only allows one comma, basically thousands-separated numbers up to 999,999):
string ResultString = null;
try {
ResultString = Regex.Replace(myString, "([0-9]{1,3})(?:(,)?([0-9]{3})?)", "$1$3");
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
It takes this:
Test 12,205 26,000 Test. And the sort
of values I'm trying to match would be
the 2700, 2650 and 2600 in
"2,700","2,650","2,600"
and produces this:
Test 12205 26000 Test. And the sort
of values I'm trying to match would be
the 2700 2650 and 2600 in
"2700","2650","2600"

Thanks for the answer, Ryan. I should've kept my temper in check a bit more. Though still, commas in a CSV that don't separate values?
After asking around the office I got pointed towards a free Regex designer application that I used to build the pattern I needed. The application is Rad Software Regular Expression Designer
Oh and for the answer purposes the pattern I came out with is:
\"(?<money>[0-9,.$]*)\"
Edit: Final Answer
Woah. Completely forgot about this question. I played around with the regex even more and came out with one that would match everything I needed. The regex is:
\"([0-9.$]+(,[0-9.]+)+)\"
That regex has been able to match any decimal string within double quotes that I've thrown at it.
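For completeness, here is one way that final pattern could be used to strip the commas, sketched with a match evaluator. CsvMoney and StripThousands are names I made up, and the sketch assumes the values are always wrapped in double quotes as in the sample.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CsvMoney
{
    // Removes the commas inside quoted numeric values such as "2,700",
    // leaving the commas that separate fields untouched.
    public static string StripThousands(string line)
    {
        return Regex.Replace(
            line,
            "\"([0-9.$]+(,[0-9.]+)+)\"",
            m => "\"" + m.Groups[1].Value.Replace(",", "") + "\"");
    }
}
```

On the sample data, "2,700","2,650","2,600" becomes "2700","2650","2600".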
