Remove white spaces unless within quotes, ignoring escaped quotes

I have a JSON string in which I would like to remove all white spaces that are not within quotes. I searched online and I already found a solution, which is the following:
aidstring = Regex.Replace(aidstring, "\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)", "");
However, I am now dealing with a string that contains escaped quotes:
"boolean": "k near/3 \"funds private\""
and the above regular expression solution turns it into:
"boolean":"k near/3 \"fundsprivate\""
This happens because escaped quotes are treated as normal quotes.
Could anyone post a regex in which escaped quotes are ignored?

I'd suggest using
aidstring = Regex.Replace(aidstring, @"(""[^""\\]*(?:\\.[^""\\]*)*"")|\s+", "$1");
The regex matches every C-style quoted string into capture group 1, and the $1 replacement puts those strings back into the result unchanged. Whitespace matched by \s+ leaves group 1 empty, so it is replaced with nothing and removed.
Regex explanation:
Alternative 1:
("[^"\\]*(?:\\.[^"\\]*)*"):
" - a literal "
[^"\\]* - zero or more characters other than \ or "
(?:\\.[^"\\]*)* - zero or more sequences of...
\\. - \ and any character but a newline
[^"\\]* - zero or more characters other than \ or "
" - a literal "
Alternative 2:
\s+ - 1 or more whitespace (in .NET, any Unicode whitespace)
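
To make the explanation above concrete, here is a minimal sketch of the suggested pattern wrapped in a helper. The class and method names (JsonWhitespace, StripOutsideQuotes) are made up for illustration, and the sketch assumes the input is well formed, i.e. every quoted string is closed.

```csharp
using System.Text.RegularExpressions;

public static class JsonWhitespace
{
    // Alternative 1 captures a complete quoted string, escaped characters included.
    // Alternative 2 matches a run of whitespace outside any quoted string.
    private static readonly Regex Pattern =
        new Regex(@"(""[^""\\]*(?:\\.[^""\\]*)*"")|\s+");

    public static string StripOutsideQuotes(string input)
    {
        // "$1" puts a captured string back unchanged. For a whitespace match,
        // group 1 is empty, so the whitespace is replaced with nothing.
        return Pattern.Replace(input, "$1");
    }
}
```

Called on the sample from the question, "boolean": "k near/3 \"funds private\"", this should return "boolean":"k near/3 \"funds private\"" with the space inside the escaped quotes preserved.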

Just a thought... This doesn't immediately look legit because there are obvious possible flaws, but if you think about it, the scenarios where it would fail have nearly zero chance of happening:
aidstring = Regex.Replace(aidstring, @"""\s*:\s*""", "\":\"");
Long story short, look for the spaces you WANT to replace, instead of looking for all of the spaces you Don't Want to replace:
"boolean" : "k near/3 \"funds private\""
^^^^^^^^^
The only time it would fail is if the actual value-content of the json object itself contained a quote, a colon and another quote in a row (with optional whitespace around the colon)... let me know how often that happens. :)
But Skeet is most-right. Use a Json Parser to clean it up.
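
For the parser route, a minimal sketch could look like the following. It assumes System.Text.Json is available (it ships with .NET Core 3.0 and later) and that the input is a complete JSON document rather than the fragment shown in the question. JsonCompactor and Compact are hypothetical names.

```csharp
using System.Text.Json;

public static class JsonCompactor
{
    // Parses the text as JSON and serializes it again without indentation,
    // which drops the insignificant whitespace and keeps string values intact.
    public static string Compact(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        return JsonSerializer.Serialize(doc.RootElement);
    }
}
```

One caveat: the serializer may re-encode some characters inside strings (for example, writing an escaped quote as \u0022). That is equivalent JSON, but the output is not guaranteed to be character-for-character identical to the input with the whitespace removed.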

Related

C# - Removing single word in string after certain character

I have a string from which I would like to remove any word that follows a "\", whether in the middle or at the end, such as:
testing a\determiner checking test one\pronoun
desired result:
testing a checking test one
I have tried a simple regex that removes anything between the backslash and whitespace, but it gives the following result:
string input = @"testing a\determiner checking test one\pronoun";
Regex regex = new Regex(@"\\.*\s");
string output = regex.Replace(input, " ");
Result:
testing a one\pronoun
It looks like this regex matches from the first backslash until the last whitespace in the string. I cannot seem to figure out how to match from the backslash to the next whitespace. Also, I am not guaranteed a whitespace at the end, so I would need to handle that. I could continue processing the string and remove any text after the backslash, but I was hoping I could handle both cases in one step.
Any advice would be appreciated.

Change .*, which matches any characters, to \w*, which only matches word characters.
Regex regex = new Regex(@"\\\w*");
string output = regex.Replace(input, "");

".*" matches zero or more characters of any kind. Consider using "\w+" instead, which matches one or more "word" characters (not including whitespace).
Using "+" instead of "*" would allow a backslash followed by a non-"word" character to remain unmatched. For example, no matches would be found in the sentence "Sometimes I experience \ an uncontrollable compulsion \ to intersperse backslash \ characters throughout my sentences!"

With your current pattern, .* is "greedy": it takes as much of the string as possible, so the match runs to the last whitespace in the string rather than the first. Adding a ? right after the * makes the quantifier lazy, so the match is as small as possible and stops at the first whitespace.
Next, you want to end not just at whitespace, but at either whitespace or the end of the string. The $ symbol matches the end of the string, and | means or. Group those together using parentheses and the group tells the parser to stop at either whitespace or the end of the string. Your code will look like this:
string input = @"testing a\determiner checking test one\pronoun";
Regex regex = new Regex(@"\\.*?(\s|$)");
string output = regex.Replace(input, " ");

Try this regex (\\[^\s]*)
(\\[^\s]*)
1st Capturing group (\\[^\s]*)
\\ matches the character \ literally
[^\s]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s match any white space character [\r\n\t\f ].
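
To compare the suggestions above side by side, here is a small sketch. BackslashTags and its method names are invented for illustration; the patterns themselves are the ones from the answers.

```csharp
using System.Text.RegularExpressions;

public static class BackslashTags
{
    // Removes a backslash plus the run of non-whitespace characters after it.
    public static string Remove(string input)
    {
        return Regex.Replace(input, @"\\[^\s]*", "");
    }

    // Lazy variant: removes from a backslash up to and including the next
    // whitespace (or the end of the string) and puts a single space back.
    public static string RemoveLazy(string input)
    {
        return Regex.Replace(input, @"\\.*?(\s|$)", " ");
    }
}
```

One difference worth noting: on the sample input, Remove leaves the surrounding spaces alone, while RemoveLazy also replaces the match at the end of the string with a space, so its result ends with a trailing space that you may want to Trim().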

C# Troubles reading xml value [duplicate]

I've noticed that C# adds additional slashes (\) to paths. Consider the path C:\Test. When I inspect the string with this path in the text visualiser, the actual string is C:\\Test.
Why is this? It confuses me, as sometimes I may want to split the path up (using string.Split()), but have to wonder which string to use (one or two slashes).

The \\ is used because \ is an escape character, so two are needed to represent a single \.
The first \ is treated as the escape character and the second \ is taken as the actual value. Otherwise, the character after the first \ would be parsed as part of an escape sequence.
Here is a list of available escape characters:
\' - single quote, needed for character literals
\" - double quote, needed for string literals
\\ - backslash
\0 – Null
\a - Alert
\b - Backspace
\f - Form feed
\n - New line
\r - Carriage return
\t - Horizontal tab
\v - Vertical tab
\u - Unicode escape sequence for character
\U - Unicode escape sequence for surrogate pairs.
\x - Unicode escape sequence similar to "\u" except with variable length.
EDIT: To answer your question regarding Split, it should be no issue. Use Split as you would normally. The \\ will be treated as only the one character of \.

.NET is not adding anything to your string here. What you're seeing is an effect of how the debugger chooses to display strings. C# string literals can be written in two forms:
Verbatim strings: prefixed with an @ sign, which removes the need to escape \ characters
Normal strings: standard C-style strings, where \ characters must themselves be escaped as \\
The debugger displays a string as a normal string literal rather than a verbatim one. It's only a matter of display; it doesn't affect the string's underlying value.

Debugger visualizers display strings in the form in which they would appear in C# code. Since \ is used to escape characters in non-verbatim C# strings, \\ is the correct escaped form.
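
A quick way to convince yourself that the two literal forms produce the same string is a sketch like this (PathEscapes is a made-up name):

```csharp
public static class PathEscapes
{
    // Both literals produce the same seven characters: C : \ T e s t
    public static readonly string Regular = "C:\\Test";   // backslash escaped
    public static readonly string Verbatim = @"C:\Test";  // verbatim literal

    public static string[] SplitOnBackslash(string path)
    {
        // '\\' is the char literal for one backslash character.
        return path.Split('\\');
    }
}
```

Here PathEscapes.Regular == PathEscapes.Verbatim is true, both have Length 7, and SplitOnBackslash returns the two parts "C:" and "Test" for either one.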

Okay, so the answers above are not wholly correct. As such I am adding my findings for the next person who reads this post.
You cannot split a string using any of the chars in the table above if you are reading said string(s) from an external source.
i.e,
string[] splitStrings = File.ReadAllText([path]).Split((char)7);
will not split by those chars. However internally created strings work fine.
i.e.,
string[] splitStrings = "hello\agoodbye".Split((char)7);
This may not hold true for other methods of reading text from a file. I am unsure as I have not tested with other methods. With that in mind, it is probably best not to use those chars for delimiting strings!
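
A likely explanation for the behaviour described above is that escape sequences are only interpreted by the compiler inside string literals. The sketch below assumes the file contained the two characters \ and a (as typed in a text editor) rather than the actual character with code 7, and mimics that content with a verbatim string. BellSplit is a hypothetical name.

```csharp
public static class BellSplit
{
    // Mimics text read from a file that contains the two characters '\' and 'a'.
    public const string FromFile = @"hello\agoodbye";

    // A normal literal in source code: the compiler turns \a into the single
    // character U+0007.
    public const string FromLiteral = "hello\agoodbye";

    public static int CountParts(string text)
    {
        return text.Split((char)7).Length;
    }
}
```

Under that assumption, CountParts(BellSplit.FromFile) is 1 because there is no character 7 to split on, while CountParts(BellSplit.FromLiteral) is 2. A file that really contains the byte 0x07 should split as expected.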

What does `\?` mean in a regular expression?

May I know what \? means in a regular expression? For example, what is its significance in this expression.
I have used this for validating 7 digit telephone no
Any help is highly appreciated.

"\?" means "?" itself. "\" - is escape character. "?" is quantifier and "\" is used to escape it.

I have used this for validating 7 digit telephone no
"[[:number:]]\{3\}[ -]\?[[:number:]]\{4\}"
Looking at your example, it seems that you are talking about BRE. There, the \ (escaping) gives ? its special meaning: zero or one occurrence of the preceding [ -].
If it is ERE/PCRE, the \ takes that special meaning away from ?; that is, \? means a literal question mark: ?

The properly-escaped "?" will match that exact character, the "?", as it appears in the text.
For instance, if you do
Regex re = new Regex(@"\d{3}-\?\d{4}");
, you will be able to get a positive match for 123-?1234.
If you want to get a positive match for 1231234 OR 123-1234, you can use the special character "?" without escape, like this:
Regex re = new Regex(@"\d{3}-?\d{4}");
P.S. for C# .NET, I find the best regex-testing place online is MyRegexTester. If you use it for C#, don't forget to check the appropriate "C# .NET" checkbox.
P.P.S. as per the comment, putting "\s*" into the regex will match any length white space (spaces and tabs included), "\ ?" will match an optional space, and "[ ]" will match exactly one space (no less).

"\?" escapes "?" that have a special meaning in the regex (0 or 1 match) so "\?" escapes it and identifies the literal "?"
Your regex looks strange to me: all the special characters appear to be escaped (including "{"), and as far as I know it isn't valid.
I think you want to write
"\d{3}[ -]?\d{4}"
if you want to match anything that contains the pattern, or
"^\d{3}[ -]?\d{4}$"
if you want to match only strings that are exactly the pattern.
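
As a rough illustration of the difference between ? and \? in .NET, here is a small sketch using the patterns discussed in these answers. PhonePatterns and its method names are invented for the example.

```csharp
using System.Text.RegularExpressions;

public static class PhonePatterns
{
    // Unescaped ?: the space-or-hyphen separator is optional.
    public static bool IsSevenDigitNumber(string s)
    {
        return Regex.IsMatch(s, @"^\d{3}[ -]?\d{4}$");
    }

    // Escaped \?: a literal question mark is required after the hyphen.
    public static bool HasLiteralQuestionMark(string s)
    {
        return Regex.IsMatch(s, @"^\d{3}-\?\d{4}$");
    }
}
```

IsSevenDigitNumber accepts 1234567, 123-4567 and 123 4567 but rejects 123?4567, while HasLiteralQuestionMark accepts 123-?1234 and rejects 123-1234.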

Trying to understand this regex

I have this regex
^(\\w|@|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain the double backslashes thing to me, please. @_@
Additional info: I got this regex in C# using Regex.IsMatch to check if my username string match the regex. It's for an asp website.

My guess is that it's simply escaping the \ since backslash is the escape character in c#.
string pattern = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";
Can be rewritten using a verbatim string as
string pattern = @"^(\w|@|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, at-sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w@ \[\].-])+$";
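
If you want to check the equivalence yourself, a sketch along these lines may help. UsernamePatterns is a made-up name, and the sample usernames in the note below are arbitrary; the patterns use the at-sign described above.

```csharp
using System.Text.RegularExpressions;

public static class UsernamePatterns
{
    // The same regex written as a normal literal and as a verbatim literal.
    public const string Escaped = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";
    public const string Verbatim = @"^(\w|@|\-| |\[|\]|\.)+$";

    // The character-class form of the same rule.
    public const string CharClass = @"^([\w@ \[\].-])+$";

    public static bool IsValid(string pattern, string username)
    {
        return Regex.IsMatch(username, pattern);
    }
}
```

Escaped == Verbatim is true, since both literals produce the same characters. A name such as john.doe@site [x]-1 passes both the Verbatim and CharClass patterns, while bad!name fails both.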

Double backslashes stand for a single backslash. The backslash is doubled to escape the backslash itself, because backslashes introduce escape sequences in C# string literals, e.g. \n stands for a new line.
With the double backslashes sorted out, it becomes ^(\w|@|\-| |\[|\]|\.)+$
Breaking down this regex: | means OR, so \w|@|\-| |\[|\]|\. means \w or @ or \- or space or \[ or \] or \.. That is, any word character (letter, digit or underscore), @, -, space, [, ] and . characters. Note that here the backslash is a regex escape, used to escape the -, [, ] and . characters, which can have special meanings in a regex context.
And + means the previous token (i.e. the group (\w|@|\-| |\[|\]|\.)) repeated one or more times.
So the entire thing means one or more of any combination of word characters, @, -, space, [, ] and . characters.

There are online tools to analyze regexes. One such is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
BeginOfLine
Repeat
CapturingGroup
GroupNumber:1
OR: match either of the followings
WordCharacter
@
-
[
]
.
one or more times
EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.

^ means the beginning of the line.
The parentheses are used for grouping.
\w is a word character.
| means OR.
@ matches the @ character.
\- matches the hyphen character.
\[ and \] match the square brackets.
\. matches a period.
+ means one or more.
$ means the end of the line.
So the regex matches a string that contains only word characters, @ signs, hyphens, spaces, square brackets or dots.

Here's what it means:
^(\\w|@|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is shown as \w, but this is being escaped.
@ - Indicates an @ symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe this is technically needed.
(a space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double slashes aren't necessary if you just prefix the string like this:
@"\w" is the same as "\\w"

what does \ do on non escape characters?

I asked another question poorly, so I'll ask something else.
According to http://www.c-point.com/javascript_tutorial/special_characters.htm there are a few escape characters such as \n and \b. However, / is not one of them. What happens in this case (\/)? Is the \ ignored?
I have a string in JavaScript, 'http:\/\/www.site.com\/user'. Note that this is a literal with ', so with " it would look like \\/. Anyway, I would like to escape this string, hence the question about what happens with non-'special' escape characters.
Another question: if I had name:\t me (or "name:\\t me"), is there a function to escape it so there is a tab? I am using C# and these strings come from a JSON file.

According to Mozilla:
For characters not listed [...] a preceding backslash is ignored, but this usage is deprecated and should be avoided.
https://developer.mozilla.org/en/JavaScript/Guide/Values%2c_Variables%2c_and_Literals#section_19
The \/ sequence is not listed but there're at least two common usages:
<1> It's required to escape literal slashes in regular expressions that use the /foo/ syntax:
var re = /^http:\/\//;
<2> It's required to avoid invalid HTML when you embed JavaScript code inside HTML:
<script type="text/javascript"><!--
alert('</p>')
//--></script>
... triggers: end tag for element "P" which is not open
<script type="text/javascript"><!--
alert('<\/p>')
//--></script>
... doesn't.

If a backslash is found before a character which is not meaningful as an escape sequence, it will be ignored, i.e. "\/" and "/" are the same string in Javascript.
The / character is the regular expression delimiter, so it only has to be escaped in a regex context:
/[a-z]/[0-9]/ // Invalid.
/[a-z]\/[0-9]/ // Matches a lowercase letter, followed by a slash,
// followed by a digit.
Finally, if you want to collapse a backslash followed by a character into the corresponding escape sequence, you'll have to replace the whole expression:
string expr = "name:\\t me"; // Backslash followed by `t`.
expr = expr.Replace("\\t", "\t"); // Tab character.

\ is evaluated as \ if \ + next character is not an escape sequence.
examples:
\t -> escape sequence t -> tab
\\t -> escape \ and t -> \t
\\ -> escape sequence \ -> \
\c -> \c (not an escape sequence)
\a -> escape sequence a -> ???
Note that there are escape sequences also on completely weird symbols, so be careful. IMHO there is no good standard between languages and operating systems.
And actually, it's even more non-standard: in basic C '\y' -> y + warning, not \y. So this is very language dependent, be careful. (disregard my comment below).
br,
Juha
edit: What language are you using? Java and C have slightly different behavior.
C and Java seem to have the same escapes and Python has different ones:
http://en.csharp-online.net/CSharp_FAQ:_What_are_the_CSharp_character_escape_sequences
http://www.cerritos.edu/jwilson/cis_182/language_resources/java_escape_sequences.htm
http://www.java2s.com/Code/Python/String/EscapeCodesbtnar.htm

In C# you can use the backslash character to tell the compiler what you really want. After compiling though, these escape characters do not exist.
If you use string myString = "\t"; the string will actually contain a TAB character, not just represent one. You can test this by checking myString.Length which is 1.
If you want to send the characters "backslash" and "t" to your JSON client, however, you'll have to tell the compiler to keep its hands off the backslash by escaping the backslash:
string myString = "\\t"; will result in a string of two characters, the "backslash" and the "t".
Things get messy if you have to cross multiple layers of escaping and unescaping, try to debug through these layers to see what's really happening under the hood.
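
The length check described above can be sketched like this. TabEscapes and ExpandTabs are hypothetical names, and the Replace call only handles the \t sequence, not escape sequences in general.

```csharp
public static class TabEscapes
{
    public static readonly string RealTab = "\t";      // one character: U+0009
    public static readonly string BackslashT = "\\t";  // two characters: '\' and 't'

    // Turns the two-character sequence backslash + t into a real tab character.
    public static string ExpandTabs(string text)
    {
        return text.Replace("\\t", "\t");
    }
}
```

TabEscapes.RealTab.Length is 1 and TabEscapes.BackslashT.Length is 2; ExpandTabs("name:\\t me") gives back a string with a real tab between the colon and the space.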
