Pulling data out of quotes? - c#

I'm looking for a regex that can pull out quoted sections in a string, both single and double quotes.
For example:
"This is 'an example', \"of an input string\""
Matches:
an example
of an input string
I wrote up this:
[\"|'][A-Za-z0-9\\W]+[\"|']
It works but does anyone see any flaws with it?
EDIT: The main issue I see is that it can't handle nested quotes.

How does it handle single quotes inside of double quotes (or vice versa)?
"This is 'an example', \"of 'quotes within quotes'\""
should match
an example
of 'quotes within quotes'
Use a backreference if you need to support this.
(\"|')[A-Za-z0-9\\W]+?\1
EDIT: Fixed to use a reluctant quantifier.

Like that?
"([\"'])(.*?)\1"
Your desired match would be in sub group 2, and the kind of quote in group one.
The flaws in your regex are 1) the greedy "+" and 2) that [A-Za-z0-9] does not match very much; many characters fall outside that range.
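For illustration, here is a minimal sketch of the backreference approach in C#. The class and method names (QuoteExtractor, Extract) are made up for this example; the pattern is the one from the answers above, written as a verbatim string so the double quote is escaped as "".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteExtractor
{
    // Group 1 captures the opening quote; \1 requires the same character to close it.
    // Group 2 lazily captures everything in between.
    private static readonly Regex Quoted = new Regex(@"([""'])(.*?)\1");

    // Hypothetical helper: returns the text found inside each quoted section.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in Quoted.Matches(input))
        {
            results.Add(m.Groups[2].Value);
        }
        return results;
    }

    public static void Main()
    {
        string input = "This is 'an example', \"of 'quotes within quotes'\"";
        foreach (string s in Extract(input))
        {
            Console.WriteLine(s);
        }
        // Prints:
        // an example
        // of 'quotes within quotes'
    }
}
```

Because the closing quote must equal the opening one, single quotes inside a double-quoted section are kept as part of the match. Escaped quotes inside a section are not handled by this sketch.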

It works but doesn't match other characters in quotes (e.g., non-alphanumeric, like binary or foreign language chars). How about this:
[\"']([^\"']*)[\"']
My C# regex is a little rusty so go easy on me if that's not exactly right :)

@"(""|')(.*?)\1"

You might already have one of these, but, in case not, here's a free, open source tool I use all the time to test my regular expressions. I typically have the general idea of what the expression should look like, but need to fiddle around with some of the particulars.
http://renschler.net/RegexBuilder/

Related

How to Match a Comma Separated List and End with a Different Character

One project I am currently working on involves writing a parser in C#.
I chose to use Regex to extract the parts of each line. Only one problem... I have very little Regex experience.
My current issue is that I can't get argument lists to work. More specifically, I can't match comma separated lists. After two hours of being stuck, I've turned to SO.
My closest regex so far is:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+\s*)*\)
Obviously, the actual code part is not matched. Only the listed types are wanted.
I removed any and all comma detection code, as it all broke.
I want to make it match void FunctionName(int a, string b) or the equivalent with other spacing.
How can I make this happen?
Please suggest edits before voting to close, I'm bad at Stack Overflowing.
Try it like this:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+(?(?=\s*,\s*\w)\s*,\s*|\s*))*\)
Explanation:
the crucial part here is the if-else regex a la (?(?=regex)then|else):
(?(?=\s*,\s*\w)\s*,\s*|\s*)
which means: if a type-parameter pair is followed by a comma and another word character, consume the comma and the whitespace around it; otherwise consume only trailing whitespace.
However, I feel that using regex could turn out to be the wrong choice for your task. There are some lightweight parser frameworks out there, e.g. Sprache.
You're actually very close:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)
The only difference is the ,? close to the end of the regex, which means an optional comma and will match the comma between variables.
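To show how the individual parameters could be read back out, here is a rough sketch using the ,? variant above. SignatureParser and Parameters are names invented for this example. It relies on the fact that .NET keeps every iteration of a repeated group in that group's Captures collection.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SignatureParser
{
    // The pattern from the answer above, unchanged.
    private static readonly Regex Signature = new Regex(
        @"(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)");

    // Hypothetical helper: returns the "type name" pairs of the first
    // signature found in the line, or an empty list if nothing matches.
    public static List<string> Parameters(string line)
    {
        var result = new List<string>();
        Match m = Signature.Match(line);
        if (!m.Success)
        {
            return result;
        }
        // Group 3 is repeated, so each parameter is one entry in Captures.
        foreach (Capture c in m.Groups[3].Captures)
        {
            // Strip the trailing comma and whitespace the group also consumed.
            result.Add(c.Value.TrimEnd(',', ' ', '\t'));
        }
        return result;
    }

    public static void Main()
    {
        foreach (string p in Parameters("void FunctionName(int a, string b)"))
        {
            Console.WriteLine(p);
        }
        // Prints:
        // int a
        // string b
    }
}
```

Note that the optional comma makes this pattern permissive: it also accepts a trailing comma before the closing parenthesis. The if-else variant in the other answer is stricter in that respect.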

Regular Expression Pattern Matching

Hi, I need to do something like this.
Actually **ctu** is a good university but **ctu's** is not. There are many **,ctus,** present.
What I want to do is, I want to replace ctu in the string like this.
Actually **<s>ctu<e>** is a good university but **<s>ctu's<e>** is not. There are many **,<s>ctus<e>,** present.
But with the following pattern
**\\bctu*(?:['\\\\|""\\\\]*)\\w+\\b**
I'm getting the output as:
A**<s>ctu<e>**ally **<s>ctu<e>** is a good university but **<s>ctu's<e>** is not. There are many **,ctus,** present.
I don't want to replace ctu inside words such as Actually, and I also need to replace " ,ctus, " with " ,<s>ctus<e>, ".
How do I achieve this using regex? I need this in C#.
Thanks in advance.
The following regex matches all the cases listed in your example:
@"(\bctu(?:'\w+)?\w*\b)"
Then just replace the match with @"<s>$1<e>", where $1 refers to the group captured above (.NET replacement strings use $1 rather than \1).
Are you looking for @"\bctu\b" ("ctu" with word boundaries on both sides, so it matches ctu but not Actually, ctu's, or ,ctus,) for the first search pattern and ",ctus," (exactly the string ,ctus,, regardless of where it might fall in a word) as the second search pattern? To search for both of these at once, you could use @"(\bctu\b|,ctus,)".
As a slight aside, in C# you can write regex literals much more easily by using the @"" notation (verbatim strings) instead of "". E.g. to get regex to understand a word boundary, it must see \b, which can be represented as @"\b" or "\\b", and a literal \ is "\\\\" or @"\\". The first is easier to read, especially in more complex cases.
If this doesn't answer your question, please give a clear example of expected input/output.
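As a quick sketch of the first suggestion, the replacement could be done with Regex.Replace. CtuTagger and Tag are made-up names for this example. In a .NET replacement string the captured group is written $1.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CtuTagger
{
    // Hypothetical helper: wraps ctu, ctu's, ctus and similar in <s>...<e>,
    // but leaves "ctu" alone when it sits inside a longer word.
    public static string Tag(string input)
    {
        return Regex.Replace(input, @"(\bctu(?:'\w+)?\w*\b)", "<s>$1<e>");
    }

    public static void Main()
    {
        string input = "Actually ctu is a good university but ctu's is not. There are many ,ctus, present.";
        Console.WriteLine(Tag(input));
        // Prints:
        // Actually <s>ctu<e> is a good university but <s>ctu's<e> is not. There are many ,<s>ctus<e>, present.
    }
}
```

The leading \b is what keeps the ctu inside Actually from matching, since there is no word boundary between A and c.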

C# Regex Escape Sequences

Is there a complete list of regex escape sequences somewhere? I found this, but it was missing \\ and \e for starters. Thus far I have come up with this regex pattern that hopefully matches all the escape sequences:
@"\\([bBdDfnreasStvwWnAZG\\]|x[A-Z0-9]{2}|u[A-Z0-9]{4}|\d{1,3}|k<\w+>)"
Alternatively, if you only want to escape a string correctly, you could just depend on Regex.Escape() which will do the necessary escaping for you.
Hint: There is also a Regex.Unescape()
This MSDN page (Regular Expression Language Elements) is a good starting place, with this subpage specifically about escape sequences.
Don't forget the zillions of possible unicode categories: \p{Lu}, \P{Sm} etc.
There are too many of these for you to match individually, but I suppose you could use something along the lines of \\[pP]\{[A-Za-z0-9 \-_]+?\} (untested).
And there's also the simpler stuff that's missing from your list: \., \+, \*, \? etc etc.
If you're simply trying to unescape an existing regex then you could try Regex.Unescape. It's not perfect, but it's probably better than anything you or I could knock up in a short space of time.
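For reference, a small sketch of how Regex.Escape and Regex.Unescape behave. EscapeDemo and MatchesLiterally are names invented for this example; the two Regex methods are the real ones mentioned above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: true if text is exactly the given literal,
    // with every regex metacharacter in the literal neutralised first.
    public static bool MatchesLiterally(string text, string literal)
    {
        return Regex.IsMatch(text, "^" + Regex.Escape(literal) + "$");
    }

    public static void Main()
    {
        string literal = "a.b*c";
        string escaped = Regex.Escape(literal);

        Console.WriteLine(escaped);                               // a\.b\*c
        Console.WriteLine(Regex.Unescape(escaped) == literal);    // True

        // Unescaped, "a.b*c" is a pattern that also matches "aXbc".
        Console.WriteLine(Regex.IsMatch("aXbc", "^a.b*c$"));      // True
        Console.WriteLine(MatchesLiterally("aXbc", literal));     // False
        Console.WriteLine(MatchesLiterally("a.b*c", literal));    // True
    }
}
```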

How to Check if a String is a "string" or a RegEx?

How can I check whether a string in a textbox is a plain string or a RegEx?
I'm searching through a text file line by line.
Either by .Contains(Textbox.Text); or by Regex(Textbox.Text) Match(currentLine)
(I know, syntax isn't working like this, it's just for presentation)
Now my Program is supposed to autodetect if Textbox.Text is in form of a RegEx or if it is a normal String.
Any suggestions? Should I write my own little RegEx to detect whether the textbox contains a RegEx?
Edit:
I failed to add that my strings can be very simple, like Foo or 0005. I'm trying the suggested solutions right away!
You can't detect regular expressions with a regular expression, as regular expressions themselves are not a regular language.
However, the easiest thing you could probably do is try to compile a regex from your textbox contents: if it succeeds, you know it's a valid regex; if it fails, you know it's not.
But this would classify ordinary strings like "foo" as regular expressions too. Depending on what you need to do, this may or may not be a problem. If it's a search string, then the results are identical for this case. In the case of "foo.bar" they would differ, though, since it's a valid regex but matches different things than the string itself.
My advice, also stated in another comment, would be that you simply always enable regex search, since splitting the code paths here gains you nothing apart from a dubious performance benefit (which is unlikely to matter even if there is much of a benefit at all).
Many strings could be a regex, every regex could actually be a string.
Consider that the string "thin." could be either a plain string ('.' is a dot) or a regex ('.' is any character).
I would just add a checkbox where the user indicates whether they are entering a regex, as is usual in many applications.
One possible solution depending on your definition of string and regex would be to check if the string contains any regex typical characters.
You could do something like this:
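// requires: using System.Text.RegularExpressions;
string s = "I'm not a Regex";
// Caveat: Regex.Escape also escapes white space and '#', so a string
// that contains spaces (like this one) will not compare equal here.
if (s == Regex.Escape(s))
{
    // no regex metacharacters: treat it as a plain string
}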
Try and use it in a regex and see if an exception is thrown.
This approach only checks if it is a valid regex, not whether it was intended to be one.
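A minimal sketch of the "try to compile it" check, assuming you only want to know whether the input is a syntactically valid pattern. PatternClassifier and IsValidRegex are names made up for this example. The Regex constructor throws an ArgumentException (or a subclass of it) when the pattern cannot be parsed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternClassifier
{
    // Hypothetical helper: true if the text can be compiled as a .NET regex.
    // This says nothing about whether the user meant it as one.
    public static bool IsValidRegex(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown for null input and for patterns that fail to parse.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidRegex("Foo"));      // True: plain text is a valid pattern too
        Console.WriteLine(IsValidRegex("foo.bar"));  // True
        Console.WriteLine(IsValidRegex("foo("));     // False: unbalanced parenthesis
    }
}
```

As the answers above point out, simple inputs like Foo or 0005 pass this check, so it can only rule out broken patterns, not detect intent.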
Another approach could be to check whether it is surrounded by slashes (i.e. '/foo/'). Surrounding regexes with slashes is common practice (although you must remove the slashes before feeding the pattern into the regex library).

Difficulty with Simple Regex (match prefix/suffix)

I'm trying to develop a regex that will be used in a C# program.
My initial regex was:
(?<=\()\w+(?=\))
Which successfully matches "(foo)" - matching but excluding from output the open and close parens, to produce simply "foo".
However, if I modify the regex to:
\[(?<=\()\w+(?=\))\]
and I try to match against "[(foo)]" it fails to match. This is surprising. I'm simply prepending and appending the literal open and close brackets around my previous expression. I'm stumped. I use Expresso to develop and test my expressions.
Thanks in advance for your kind help.
Rob Cecil
Your look-behinds are the problem. Here's how the string is being processed:
We see [ in the string, and it matches the regex.
Look-behind in regex asks us to see if the previous character was a '('. This fails, because it was a '['.
At least that's what I would guess is causing the problem.
Try this regex instead:
(?<=\[\()\w+(?=\)\])
Out of context, it is hard to judge, but the look-behind here is probably overkill. They are useful to exclude strings (as in strager's example) and in some other special circumstances where simple REs fail, but I often see them used where simpler expressions are easier to write, work in more RE flavors and are probably faster.
In your case, you could probably write (\b\w+\b) for example, or even (\w+) using natural bounds, or if you want to distinguish (foo) from -foo- (for example), using \((\w+)\).
Now, perhaps the context dictates this convoluted use (or perhaps you were just experimenting with look-behind), but it is good to know alternatives.
Now, if you are just curious why the second expression doesn't work: these are known as "zero-width assertions". They check that what follows or precedes conforms to what is expected, but they don't consume any of the string, so whatever the pattern matches next (for a look-ahead) or matched just before (for a look-behind) must also satisfy the assertion. E.g. if you put something after a positive look-ahead that doesn't match what is asserted, you can be sure the RE will fail.
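To make the difference concrete, here is a small sketch that runs the three patterns from this thread. LookaroundDemo and FirstMatch are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: the text of the first match, or null if none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Original pattern: the parentheses are asserted, not consumed.
        Console.WriteLine(FirstMatch("(foo)", @"(?<=\()\w+(?=\))") ?? "(no match)");
        // Prints: foo

        // After \[ consumes '[', the look-behind needs '(' as the previous
        // character, but the previous character is '[' itself.
        Console.WriteLine(FirstMatch("[(foo)]", @"\[(?<=\()\w+(?=\))\]") ?? "(no match)");
        // Prints: (no match)

        // Moving the brackets inside the assertions fixes it.
        Console.WriteLine(FirstMatch("[(foo)]", @"(?<=\[\()\w+(?=\)\])") ?? "(no match)");
        // Prints: foo
    }
}
```

A plain capturing group such as \[\((\w+)\)\] would give the same foo in group 1 without any look-around, along the lines of the simpler alternatives suggested above.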
