Regex to match spaces within quotes only - c#

I need to match any space within double quotes ONLY - not outside. I've tried a few things, but none of them work.
[^"]\s[^"] - matches spaces outside of quotes
[^"] [^"] - see above
\s+ - see above
For example I want to match "hello world" but not "helloworld" and not hello world (without quotes). I will specifically be using this regex inside of Visual Studio via the Find feature.

With the .NET and PCRE regex engines, you can use the \G anchor, which matches at the position where the previous match ended (or at the start of the string), to build a pattern that returns contiguous occurrences from the start of the string:
((?:\G(?!\A)|\A[^"]*")[^\s"]*)\s([^\s"]*"[^"]*")?
example for a replacement with #: demo
pattern details:
( # capture group 1
(?: # two possible beginnings
\G(?!\A) # contiguous to a previous match
| # OR
\A[^"]*" # start of the string and reach the first quote
) # at this point you are sure to be inside quotes
[^\s"]* # all that isn't a white-space or a quote
)
\s # the white-space
([^\s"]*"[^"]*")? # optional capture group 2: useful for the last quoted white-space
# since it reaches an eventual next quoted part.
Note: with the .NET regex engine you can also use a lookbehind to test whether the number of quotes before a space is odd or even, but that approach is inefficient. The same goes for a lookahead that counts the remaining quotes up to the end of the string, and that approach can also give wrong results if the quotes aren't balanced.
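A minimal C# sketch of the replacement described above, written as a top-level program. The sample input and the variable names are assumptions made for illustration; the pattern is the one from the answer, with each quote doubled because it sits in a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above; "" inside a verbatim string is one literal quote.
var quotedSpace = new Regex(@"((?:\G(?!\A)|\A[^""]*"")[^\s""]*)\s([^\s""]*""[^""]*"")?");

// Hypothetical sample input with two quoted parts and spaces outside the quotes.
string input = "say \"hello big world\" and \"a b\" end";

// $1 and $2 put back the text captured around the space; '#' replaces the space itself.
string marked = quotedSpace.Replace(input, "$1#$2");

Console.WriteLine(marked); // say "hello#big#world" and "a#b" end
```

Only the spaces inside the quoted parts are replaced. Group 2 is what carries the match from the last space of one quoted part across to the opening quote of the next one.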

Related

Regular expression to match base path

I'm trying to come out with a regular expression to match a certain base path. The rule should be to match the base path itself plus a "/" or a "." and the rest of the path.
for example, given /api/ping, the following should match
/api/ping.json
/api/ping
/api/ping/xxx/sss.json
/api/ping.xml
and this should NOT match
/api/pingpong
/api/ping_pong
/api/ping-pong
I tried with the following regexp:
/api/ping[[\.|\/].*]?
But it doesn't seem to catch the /api/ping case
Here is a link to the Regex Storm tester
--
update: thanks to the answers, now I have this version that reflects better my reasoning:
\/api\/ping(?:$|[.\/?]\S*)
The expression either ends after ping (that's the $ part) or continues with a ., / or ? followed by any non-space characters
here's the regex
You can use this regex, which uses a lookahead with alternation to ensure the base path is followed by either a . or a / or the end of the line $
\/api\/ping(?=\.|\/|$)\S*
Explanation:
\/api\/ping - Matches /api/ping text literally
(?=\.|\/|$) - Look ahead ensuring what follows is either a literal dot . or a slash / or end of line $
\S* - Optionally follows whatever non-space character follows the path
Demo
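As a quick check, here is a small C# sketch that runs the lookahead pattern against the paths listed in the question. The variable names are illustrative only, and the slashes are left unescaped since .NET patterns have no delimiters.

```csharp
using System;
using System.Text.RegularExpressions;

// Lookahead pattern from the answer; '/' needs no escaping in a .NET pattern.
var basePath = new Regex(@"/api/ping(?=\.|/|$)\S*");

// Sample paths taken from the question: the first four should match, the last three should not.
string[] paths =
{
    "/api/ping.json", "/api/ping", "/api/ping/xxx/sss.json", "/api/ping.xml",
    "/api/pingpong", "/api/ping_pong", "/api/ping-pong",
};

bool[] results = Array.ConvertAll(paths, p => basePath.IsMatch(p));

for (int i = 0; i < paths.Length; i++)
{
    Console.WriteLine($"{paths[i]} -> {results[i]}");
}
```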
In your regex, /api/ping[[\.|\/].*]?, the character class [] is not used correctly. Inside a character class you don't need to escape a dot, and | is not alternation there: it is matched as a literal character. Character classes also can't be nested this way, so the outer brackets don't group anything. I guess you wanted to write something like this:
\/api\/ping([.\/].*)?$
Demo with your corrected regex
Note that anything placed inside [] matches exactly one character from the set, so [.\/] allows either a dot . or a slash /. Escaping / as \/ is harmless but not required in .NET, which doesn't use regex delimiters.
Your pattern uses a character class that matches any one of the listed characters; it could also be written as [[./|].
It does not match /api/ping because the character class has to match at least once, as it is not optional.
You could use an alternation that matches /api/ping followed either by the end of the string, or by zero or more path segments (a forward slash followed by one or more characters that are neither a forward slash nor whitespace) and then a dot and the extension.
/api/ping(?:(?:/[^/\s]+)*\.\S+|$)
That will match
/api/ping Match literally
(?: Non capturing group
(?:/[^/\s]+)* Repeat a grouping structure 0+ times matching / then 1+ times not / or a whitespace character
\.\S+ Match a dot and 1+ times a non whitespace character
| Or
$ Assert the end of the string
) Close non capturing group
See the regex demo | C# demo
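The alternation-based pattern is stricter than a plain lookahead: anything after /api/ping must end in a dot plus an extension. A short C# sketch of that behavior follows; the extension-less path /api/ping/xxx is a hypothetical input added for contrast and is not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above: optional path segments, then a required ".extension", or end of string.
var withExtension = new Regex(@"/api/ping(?:(?:/[^/\s]+)*\.\S+|$)");

bool bare = withExtension.IsMatch("/api/ping");                // True: the $ branch
bool nested = withExtension.IsMatch("/api/ping/xxx/sss.json"); // True: two segments, then .json
bool noExt = withExtension.IsMatch("/api/ping/xxx");           // False: no dot and extension
bool pong = withExtension.IsMatch("/api/pingpong");            // False: neither branch applies

Console.WriteLine($"{bare} {nested} {noExt} {pong}");
```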

find terminate word using regular expression

I want to find if word terminate with 's or 'm or 're using regular expression in c#.
if (Regex.IsMatch(word, "/$'s|$'re|$'m/"))
textbox1.text=word;
The /$'s|$'re|$'m/ .NET regex matches 3 alternatives:
/$'s - / at the end of a string after which 's should follow (this will never match as there can be no text after the end of a string)
$'re - end of string and then 're must follow (again, will never match)
$'m/ - end of string with 'm/ to follow (again, will never match).
In a .NET regex, regex delimiters are not used, thus the first and last / are treated as literal chars that the engine tries to match.
The $ anchor signals the end of a string, and putting anything after it makes the pattern match no string (well, unless you have a trailing \n after it, but that is an edge case that rarely causes any trouble). Just FYI: to match the very end of string in a .NET regex, use \z.
What you attempted to write was
Regex.IsMatch(word, "'(?:s|re|m)$")
Or, if you put single character alternatives into a single character class:
Regex.IsMatch(word, "'(?:re|[sm])$")
See the regex demo.
Details
' - a single quote
(?: - start of a non-capturing group:
re - the re substring
| - or
[sm] - a character class matching s or m
) - end of the non-capturing group
$ - end of string.
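A brief C# sketch of the corrected check, assuming a handful of made-up sample words. The last two lines illustrate the difference between $ and \z when the input ends with a newline.

```csharp
using System;
using System.Text.RegularExpressions;

// Corrected pattern from the answer: an apostrophe, then re, s or m, at the end of the string.
var contraction = new Regex(@"'(?:re|[sm])$");

// Hypothetical sample words.
string[] words = { "it's", "we're", "I'm", "same", "rock'n" };
bool[] hits = Array.ConvertAll(words, w => contraction.IsMatch(w));

for (int i = 0; i < words.Length; i++)
{
    Console.WriteLine($"{words[i]}: {hits[i]}");
}

// $ also matches just before a final newline, while \z only matches at the very end.
bool dollarEnd = Regex.IsMatch("it's\n", @"'s$");  // true
bool strictEnd = Regex.IsMatch("it's\n", @"'s\z"); // false
```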

Get each item within a capturing group

If you have a string like this:
[hello world] this is [the best .Home] is nice place.
How do you extract each word (separated by spaces) within the brackets [] only?
Right now I have this working https://regex101.com/r/Tgokeq/2
Which returns:
hello world
the best .Home
But I want:
hello
world
the
best
.Home
PS: I know I could just do string split in a foreach but I don't want that I want it in the regex itself, just like this which gets every word, except I want words within the brackets [ ] only.
https://regex101.com/r/eweRWj/2
Use this Pattern ([^\[\] ]+)(?=[^\[\]]*\]) Demo
( # Capturing Group (1)
[^\[\] ] # Character not in [\[\] ] Character Class
+ # (one or more)(greedy)
) # End of Capturing Group (1)
(?= # Look-Ahead
[^\[\]] # Character not in [\[\]] Character Class
* # (zero or more)(greedy)
\] # "]"
) # End of Look-Ahead
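A small C# sketch that applies this lookahead pattern to the string from the question and collects group 1 of every match. The variable names are illustrative. Note that the lookahead only checks that a ] follows with no other bracket in between, so the sketch assumes the brackets are balanced.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Pattern from the answer: a run of non-bracket, non-space characters that is followed by a closing bracket.
var wordInBrackets = new Regex(@"([^\[\] ]+)(?=[^\[\]]*\])");

string text = "[hello world] this is [the best .Home] is nice place.";

var words = new List<string>();
foreach (Match m in wordInBrackets.Matches(text))
{
    words.Add(m.Groups[1].Value);
}

Console.WriteLine(string.Join(", ", words)); // hello, world, the, best, .Home
```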
This pattern may not seem as elegant, since the regex alone does not match individual words separately; the full solution relies on the .NET regex library to get at the individual words. However, it avoids the excessive backtracking of alpha bravo's solution. How much that matters will largely depend on how many lines you search and on whether you are matching large chunks of text or only individual lines at a time.
This approach also lets you identify exactly how many bracket pairs and which words were captured in each pair. A simple pattern-only solution will just get you the matched words without context.
The pattern:
\[\s*((?<word>[^[\]\s]+)\s*)+]
Then some brief code demonstrating how to get captured words via the .Net regex object model:
using System.Text.RegularExpressions;
...
Regex rx = new Regex(@"\[\s*((?<word>[^[\]\s]+)\s*)+]");
MatchCollection matches = rx.Matches(searchText);
foreach (Match m in matches) {
    // Each repetition of the named group is kept as a separate Capture.
    foreach (Capture c in m.Groups["word"].Captures) {
        System.Console.WriteLine(c.Value);
    }
}
Breakdown of pattern:
\[ # Opening bracket
\s* # Optional white space
( # Group for word delimited by space
(?<word> # Named capture group
[^[\]\s] # Negative character class: no brackets, no white space
+ # one or more greedy
) # End named capture group
\s* # Match white space after word
) # End of word+space grouping
+ # Match multiple occurrences of word+space
] # Literal closing bracket (no need to escape outside character class)
The above will match line feeds between the brackets. If you don't want that then use
\[\ *((?<word>[^[\]\s]+)\ *)+]

Trying to understand this regex

I have this regex
^(\\w|@|\\-| |\\[|\\]|\\.)+$
I'm trying to understand what it does exactly but I can't seem to get any result...
I just can't understand the double backslashes everywhere... Isn't double backslash supposed to be used to get a single backslash?
This regex is to validate that a username doesn't use weird characters and stuff.
If someone could explain the double-backslash thing to me, please. @_@
Additional info: I got this regex in C# using Regex.IsMatch to check if my username string match the regex. It's for an asp website.
My guess is that it's simply escaping the \ since backslash is the escape character in c#.
string pattern = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";
Can be rewritten using a verbatim string as
string pattern = @"^(\w|@|\-| |\[|\]|\.)+$";
Now it's a bit easier to understand what's going on. It will match any word character, at-sign, hyphen, space, square bracket or period, repeated one or more times. The ^ and $ match the beginning and end of the string, respectively, so only those characters are allowed.
Therefore this pattern is equivalent to:
string pattern = @"^([\w@ \[\].-])+$";
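To make the equivalence concrete, here is a minimal C# sketch that compares the escaped and verbatim spellings (with the at-sign the answer describes) and runs the alternation and character-class forms on a few made-up usernames. The sample names are assumptions, and the character-class form is written without the redundant capture group.

```csharp
using System;
using System.Text.RegularExpressions;

// The same pattern spelled three ways.
string escaped = "^(\\w|@|\\-| |\\[|\\]|\\.)+$";  // regular string: each backslash doubled
string verbatim = @"^(\w|@|\-| |\[|\]|\.)+$";     // verbatim string: backslashes as-is
string charClass = @"^[\w@ \[\].-]+$";            // single character class, hyphen last

bool sameText = escaped == verbatim;
Console.WriteLine(sameText); // True: both literals produce the same characters

// Hypothetical usernames.
string[] names = { "john.doe@example", "user [admin]-1", "bad!name", "" };
bool[] byAlternation = Array.ConvertAll(names, n => Regex.IsMatch(n, verbatim));
bool[] byClass = Array.ConvertAll(names, n => Regex.IsMatch(n, charClass));

for (int i = 0; i < names.Length; i++)
{
    Console.WriteLine($"'{names[i]}': alternation={byAlternation[i]}, class={byClass[i]}");
}
```

The first two names pass and the last two fail under both forms; the empty string is rejected because + requires at least one character.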
Double backslashes are supposed to be single backslashes. A double backslash escapes the backslash itself, as the backslash introduces other escape sequences in a C# string, e.g. \n stands for a new line
With the double backslashes sorted out, it becomes ^(\w|@|\-| |\[|\]|\.)+$
Break down this regex: | means OR, so \w|@|\-| |\[|\]|\. means \w or @ or \- or space or \[ or \] or \. That is, any word character (letter, digit or underscore), @, -, space, [, ] and . characters. Note that the backslash here is the regex escape, used to escape -, [, ] and . as they all have special meanings in regex context
And, + means the previous token (i.e. \w|@|\-| |\[|\]|\.) repeated one or more times
So, the entire thing means one or more of any combination of word characters, @, -, space, [, ] and . characters.
There are online tools to analyze regexes. One such is at http://www.myezapp.com/apps/dev/regexp/show.ws
where it reports
Sequence: match all of the followings in order
BeginOfLine
Repeat
CapturingGroup
GroupNumber:1
OR: match either of the followings
WordCharacter
@
-
[
]
.
one or more times
EndOfLine
As others have noted, the double backslashes just escape a backslash so you can embed the regex in a string. For example, "\\w" will be interpreted as "\w" by the parser.
^ means beginning of the line.
the parentheses are used for grouping
\w is a word character
| means OR
@ matches the @ character
\- matches the hyphen character
\[ and \] match the square brackets
\. matches a period
+ means one or more
$ means the end of the line.
So the regex is used to match a string which contains only word characters, @, hyphens, spaces, square brackets or dots.
Here's what it means:
^(\\w|@|\\-| |\\[|\\]|\\.)+$
^ - Means the regex starts at the beginning of the string. The match shouldn't start in the middle of the string.
Here's the individual things in the parentheses:
\\w - Indicates a "word" character. Normally, this is shown as \w, but this is being escaped.
@ - Indicates an @ symbol is allowed
\\- - Indicates a - is allowed. This is escaped since the dash can have other meanings in regex. Since it's not in a character class, I don't believe this is technically needed.
(a literal space) - A space is allowed
\\[ and \\] - [ and ] are allowed.
\\. - A period is a valid character. Escaped because periods have special meanings in regex.
Now all of those characters have | as delimiters in the parentheses - this means OR. So any of those characters are valid.
The + at the end means one or more characters as described in parentheses are valid. The $ means the end of the regex must match the end of the string.
Note that the double backslashes aren't necessary if you prefix the string with @, like this:
@"\w" is the same as "\\w"

Regular expression for not splitting string if inside single or double quotes

I have a regular expression with the following pattern in C#
Regex param = new Regex(@"^-|^/|=|:");
Basically, it's for command line parsing.
If I pass the below cmd line args it splits C: as well.
/Data:SomeData /File:"C:\Somelocation"
How do I make it to not apply to characters inside double or single quotes ?
You can do this in two steps:
Use the first regex
Regex args = new Regex("[/-](?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
to split the string into the different arguments. Then use the regex
Regex param = new Regex("[=:](?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
to split each of the arguments into parameter/value pairs.
Explanation:
[=:] # Split on this regex...
(?= # ...only if the following matches afterwards:
(?: # The following group...
[^"]*" # any number of non-quote character, then one quote
[^"]*" # repeat, to ensure even number of quotes
)* # ...repeated any number of times, including zero,
[^"]* # followed by any number of non-quotes
$ # until the end of the string.
) # End of lookahead.
Basically, it looks ahead in the string if there is an even number of quotes ahead. If there is, we're outside of a string. However, this (somewhat manageable) regex only handles double quotes, and only if there are no escaped quotes inside those.
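The two-step split can be sketched in C# as follows, using the command line from the question. The shared lookahead constant, the variable names, and the trimming and empty-entry filtering are assumptions added for the example, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Lookahead from the answer: an even number of double quotes remains up to the end of the string.
const string outsideQuotes = "(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
var argSplitter = new Regex("[/-]" + outsideQuotes);
var paramSplitter = new Regex("[=:]" + outsideQuotes);

string commandLine = "/Data:SomeData /File:\"C:\\Somelocation\"";

// Step 1: split into arguments. Step 2: split each argument once into name and value.
var pairs = argSplitter.Split(commandLine)
    .Select(arg => arg.Trim())
    .Where(arg => arg.Length > 0)               // drop the empty piece before the first '/'
    .Select(arg => paramSplitter.Split(arg, 2)) // at most two parts: name, value
    .ToList();

foreach (string[] pair in pairs)
{
    Console.WriteLine(string.Join(" => ", pair));
}
// Data => SomeData
// File => "C:\Somelocation"
```

The colon in C: is left alone because an odd number of quotes follows it. One caveat of this sketch: an unquoted - or / inside a value would also be treated as an argument separator.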
The following regex handles single and double quotes, including escaped quotes, correctly. But I guess you'll agree that if anybody ever finds this in production code, I'm guaranteed a feature article on The Daily WTF:
Regex param = new Regex(
@"[=:]
(?= # Assert even number of (relevant) single quotes, looking ahead:
(?:
(?:\\.|""(?:\\.|[^""\\])*""|[^\\'""])*
'
(?:\\.|""(?:\\.|[^""'\\])*""|[^\\'])*
'
)*
(?:\\.|""(?:\\.|[^""\\])*""|[^\\'])*
$
)
(?= # Assert even number of (relevant) double quotes, looking ahead:
(?:
(?:\\.|'(?:\\.|[^'\\])*'|[^\\'""])*
""
(?:\\.|'(?:\\.|[^'""\\])*'|[^\\""])*
""
)*
(?:\\.|'(?:\\.|[^'\\])*'|[^\\""])*
$
)",
RegexOptions.IgnorePatternWhitespace);
Further explanation of this monster here.
You should read "Mastering Regular Expressions" to understand why there's no general solution to your question. Regexes cannot handle that to an arbitrary depth. As soon as you start to escape the escape character or to escape the escaping of the escape character or ... you're lost. Your use case needs a parser and not a regex.
