regex capture multi character delimiter - c#

I'm trying to learn regex, but still have no clue. I have this line of code, which successfully separates the placeholder 'FirstWord' from all following text at the '{' delimiter:
var regexp = new Regex(@"(?<FirstWord>.*?)\{(?<TextBetweenCurlyBrackets>.*?)\}");
Which reads this string with no problem:
Greetings{Hello World}
What I want to do is to replace the '{' with a sequence of characters, for instance '/>>',
so I tried this:
var regexp = new Regex(@"(?<FirstWord>.*?)\/>>(?<OtherText>.*?)\");
I removed the last bracket and replaced the first one with '/>>', but it throws an ArgumentException. What would the correct pattern look like?

/ does not need to be escaped unless you use it as the pattern delimiter:
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)\"
Also, remove the last \. In a verbatim string (@"...") it does not escape the closing ", so the pattern itself ends with a dangling backslash, which is an invalid escape sequence and the cause of the ArgumentException:
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)"
And since you most likely want to fetch everything up to the END of the string (.*? matches as few characters as needed to satisfy the expression), you should put $ at the end or use some other delimiter (whitespace, line break, etc.):
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)$"
Example:
(.*?)/>>(.*?)$
Removing the trailing $ will fetch the empty string for the second match group, because "" is the shortest string possible satisfying the expression .*?
(.*?)/>>(.*?)$ on This/>>Test One will match This and Test One
(.*?)/>>(.*?)\s on This/>>Test One will match This and Test
(.*?)/>>(.*?) on This/>>Test One will match This and ""
Note: I'm saying "" is the shortest string possible satisfying the expression .*? on purpose! A frequent mistake is to read .*?a as "the shortest possible text before an a":
Quantifiers are greedy by default. The ? makes .* lazy, but laziness only applies from the position where the match starts; it does not select the shortest match overall.
Searching for the expression (.*?)a$ in "caba" will NOT fail to match. It will return cab, because cab followed by a satisfies the expression AND cab is the shortest possible string for a match starting at the first character.
One might also expect b to be matched, but the regex engine works from left to right and stops as soon as it has found cab, even though b would be shorter.
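The behaviour described above can be checked with a small C# sketch. It assumes .NET's System.Text.RegularExpressions; the class DelimiterDemo and its methods Capture and IsValidPattern are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterDemo
{
    // Returns { FirstWord, OtherText }, or an empty array when the pattern does not match.
    public static string[] Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return Array.Empty<string>();
        }
        return new[] { m.Groups["FirstWord"].Value, m.Groups["OtherText"].Value };
    }

    // Reports whether the Regex constructor accepts the pattern.
    // Invalid patterns (such as one ending in a lone backslash) throw ArgumentException.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

For the input This/>>Test One, Capture with the pattern ending in $ yields This and Test One, while the same pattern without $ yields This and an empty string. IsValidPattern returns false for the pattern from the question that ends in a backslash.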

Related

RegEx to find non-existence of white space prefix but not include the character in the match?

So I have the following regex for the purpose of finding and adding whitespace:
(\S)(\()
So for a string like "SomeText(Somemoretext)" I want to update this to "SomeText (Somemoretext)". It matches "t(", so my replace eliminates the "t" from the string, which is not good. I also do not know what the character could be; I'm merely trying to find the non-existence of whitespace.
Is there a better expression to use, or is there a way to exclude the found character from the match returned, so that I can safely replace without catching characters I do not want to replace?
Thanks
I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", @"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.
If you need to detect the absence of a space before a specific character (such as a bracket) after a word, how about the following?
\b(?=[^\s])\(
This will detect words (made of [a-zA-Z0-9_]) that are followed by a bracket without a space in between.
If I understood your problem correctly, you can replace the full match with " (" (a space followed by the bracket) and get exactly what you need.
In case you need to look for the absence of spaces before a symbol (like a bracket) in any kind of text (the text may be non-word, such as punctuation), you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1 instead of the full match (which now contains the whole line, if a line is matched).
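For the original goal of keeping the preceding character out of the match, a lookbehind is another option. The following is a minimal sketch, assuming .NET's Regex class; SpaceBeforeParen and AddSpace are hypothetical names chosen for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SpaceBeforeParen
{
    // (?<=\S) asserts that a non-whitespace character precedes the "("
    // without consuming it, so the replacement only supplies " (".
    public static string AddSpace(string text)
    {
        return Regex.Replace(text, @"(?<=\S)\(", " (");
    }
}
```

Because the lookbehind is zero-width, only the bracket itself is replaced, and a bracket that already follows whitespace (or starts the string) is left alone.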

Regex to extract string between parentheses which also contains other parentheses

I've been trying to figure this out, but I don't think I understand Regex well enough to get to where I need to.
I have strings that resemble these:
filename.txt(1)attribute, 2)attribute(s), more!)
otherfile.txt(abc, def)
Basically, a string that always starts with a filename, then has some text between parentheses. And I'm trying to extract that part which is between the main parentheses, but the text that's there can contain absolutely anything, even some more parentheses (it often does.)
Originally, there was a 'hacky' expression made like this:
/\(([^#]+)\)\g
And it worked, until we ran into a case where the input string contained a # and we were stuck. Obviously...
I can't change the way the strings are generated, it's always a filename, then some parentheses and something of unknown length and content inside.
I'm hoping for a simple Regex expression, since I need this to work in both C# and in Perl -- is such a thing possible? Or does this require something more complex, like its own parsing method?
You can replace the [^#] exclusion in your regex with . (which matches any character) and quantify it to match zero or more times. You can also simplify the regex by dropping the capture group, in which case the match includes the outer parentheses:
\(.*\)
Here is the explanation for the regular expression:
\( matches the character ( literally.
.* matches any character (except for line terminators).
The * quantifier matches between zero and unlimited times, as many times as possible, giving back as needed (greedy).
\) matches the character ) literally.
You can use regex101 to compose and debug your regular expressions.
Regex seems overkill to me in this case. This can be achieved more reliably using string manipulation methods.
int first = str.IndexOf('(');
int last = str.LastIndexOf(')');
// Only extract when both parentheses exist and appear in the right order.
if (first != -1 && last > first)
{
    string subString = str.Substring(first + 1, last - first - 1);
}
I've never used Perl, but I'll venture a guess that it has equivalent methods.
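If a regex is still preferred, putting a capture group around the greedy .* returns the inner text without the outer parentheses. This is a minimal C# sketch; ParenExtractor and Inner are hypothetical names, and RegexOptions.Singleline is an added assumption so that the content may span several lines.

```csharp
using System.Text.RegularExpressions;

public static class ParenExtractor
{
    // Greedy .* runs to the end of the input and backtracks to the LAST ")",
    // so parentheses inside the content are kept as part of the capture.
    private static readonly Regex Outer = new Regex(@"\((.*)\)", RegexOptions.Singleline);

    // Returns the text between the first "(" and the last ")", or null if there is none.
    public static string Inner(string input)
    {
        Match m = Outer.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The pattern itself uses only basic syntax, so the same expression should carry over to Perl.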

Regex-like construction to match %([text]) where [text] can contain escaped parens

I'm trying to resolve tokens in a string.
What I would like is given input like this:
string input = "asdf %(text) %(123) %(a\)a) asdf";
That I could run that through regex.Replace() and have it replace on "%(text)", "%(123)" and "%(a\)a)".
That is, that it would match everything between a starting "%(" and a closing ")" unless the closing ")" was escaped. (But of course, then you could escape the slash with another slash, which would prevent it from escaping the end paren...)
I'm pretty sure standard regular expressions can't do this, but I'm wondering if any of the various fancy expanded capabilities of the C# regular expression library could, rather than just iterating across the string totally manually? Or some other method that could do this? I feel like it's a common enough program that there has to be some way to solve it without implementing the solution from scratch, given the immensity of the .net framework? If I do have to implement iterating through the string and replacing with string.Replace(), I will, but it just seems so inelegant.
How about
var regex = new Regex(@"%\(.*?(?<!\\)(?:\\\\)*\)");
var result = regex.Replace(source,"");
%\( match literal %(
.*? match anything non-greedy
(?<!\\) the character before the next part must not be \
(?:\\\\)* match zero or more literal \\ (i.e. escaped backslashes)
\) match literal )
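To see which tokens this pattern picks up, the matches can be listed instead of replaced. A small sketch, assuming .NET's Regex class; TokenFinder and Find are hypothetical names for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenFinder
{
    // Same pattern as above: "%(" ... ")" where the closing ")" is not
    // preceded by an odd number of backslashes.
    private static readonly Regex Token = new Regex(@"%\(.*?(?<!\\)(?:\\\\)*\)");

    // Returns every token found in the source, in order of appearance.
    public static List<string> Find(string source)
    {
        var found = new List<string>();
        foreach (Match m in Token.Matches(source))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

For the input from the question, Find returns the three tokens %(text), %(123) and %(a\)a). A token ending in an escaped backslash, such as %(a\\), is closed by the first ) that follows.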
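This works for me (in Java), although the greedy .* removes everything from the first %( to the last ), including any text in between: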
String something = "\"asdf %(text) %(123) %(a\\)a) asdf\";";
String change = something.replaceAll("%\\(.*\\)", "");
System.out.println(change);
The output
"asdf asdf";

Regular expression to replace a string

I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:
Regex.Replace(query, @"""[^""~]+""([^~]|$)",
m => string.Format(field + "_exact:{0}", m.Value))
What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.
As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:
"[^"~]+"([^~]|$)
You can test the regex and play with the individual parts for better understanding at http://www.regexpal.com/
1.) a single character
"
The first pattern is a literal character. Since it is not anchored to any position, it can occur anywhere.
2.) a character class
[^"~]
The next expression is the []-bracket. This is a character set. It defines the set of characters that may follow next. It is a placeholder for one single character. So let's look inside to see which content is allowed:
^"~
The definition of the character class begins with a caret (^), which is a special character. Typing a caret after the opening square bracket negates the character class. So it's "upside down": every character that is not listed in the class matches and is a valid character.
In this case, every literal character is possible, except the two excluded ones: " or ~.
3.) a special character
+
The next expression, a plus, tells the engine to attempt to match the preceding token once or more.
So the character class defined above must be repeated one or more times to match the given expression.
4.) a single character
"
To match, the input must then contain one more quotation mark, which closes the one from 1.), since the character class in 2.) (repeated by 3.)) does not permit a quotation mark.
5.) a capturing group
([^~]|$)
The first structure here to examine is the ()-bracket. This is a capturing group.
Unlike a lookaround, it is part of the match: if it matches a character, that character is consumed and included in the overall match.
The group contains two alternatives, connected by a logical OR through the pipe symbol: |
So what follows the closing quotation mark must either be
[^~] one single character that is anything except ~
or
$ the end of the string (or of a line, if multiline mode is enabled)
I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)
Update:
To "detect" an asterisk/star at the beginning or end of the line, you have to do the following:
First, it's a special character, so you have to escape it with a backslash: \*
To define the position, you can use:
^ to look at the beginning of the line,
$ for the end of the line
The overall expression would be:
^\* in front of the expression to search for an * at the beginning of the line, and \*$ at the end of the regex to demand an * at the end.
In your case you can add the * as another alternative in the last group to detect an * at the end:
([^~]|$|\*$)
and to force an * at the end, delete the other alternatives:
(\*$)
The @ makes it necessary to escape every " with a second ", giving "". Without it you would escape the " as \", but I consider it better to always use @ for regexes, because \ is used quite often, and it's tedious and unreadable to always have to escape it as \\.
Let's see what the regex really is:
Console.WriteLine(@"""[^""~]+""([^~]|$)");
is
"[^"~]+"([^~]|$)
So now we can look at the "real" regex.
It looks for a " followed by one or more non-" and non-~ followed by another " followed by a non-~ or the end of the string. Note that the match could start after the start of the string and it could end before the end of the string (with a non-~)
For example in
car"hello"help
it would match "hello"h
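The matching and replacing behaviour described above can be reproduced with a short C# sketch. ExactPhrase, Rewrite and FirstMatch are hypothetical names, and the field value passed in is an assumption standing in for whatever the inherited code uses.

```csharp
using System.Text.RegularExpressions;

public static class ExactPhrase
{
    // The pattern from the question, written as a verbatim string.
    private const string Pattern = @"""[^""~]+""([^~]|$)";

    // Prefixes every quoted phrase that is not followed by ~ with "<field>_exact:".
    public static string Rewrite(string query, string field)
    {
        return Regex.Replace(query, Pattern,
            m => string.Format(field + "_exact:{0}", m.Value));
    }

    // Returns the first match, or null when nothing matches.
    public static string FirstMatch(string query)
    {
        Match m = Regex.Match(query, Pattern);
        return m.Success ? m.Value : null;
    }
}
```

Since the character after the closing quote is part of the match, it is carried into the replacement through m.Value rather than being lost. A quoted phrase directly followed by ~ is not matched at all.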

Can Regular Expressions Achieve This?

I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = @"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions
You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
Use a token parser to split the input into tokens. Use regex to find string patterns.
'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.
While it would be possible to match ' and the text inside separately, and alternatively match the text alone, a regex does not allow an indefinite number of groups. Or better said, you can only match those objects you explicitly state in the expression. So ((\w+)+\b) could theoretically match all words one by one. The outer group will correctly match the whole text, and the inner group will also match the words separately, but in most regex flavors you will only be able to reference the last match (.NET additionally keeps every capture in Group.Captures).
There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.
Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If an opening quote is found, it matches non-quote characters up to the closing quote. Otherwise it looks at word characters. Your results are in groups named "quot" and "words".
You'll have a hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, @"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky: .NET returns a collection of Match objects. Each Match has several Groups. The first Group holds the whole match ('hello world'), and the rest hold the sub-matches (', hello world, '). You also get many empty, unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
             from g in match.Groups.Cast<Group>().Skip(1)
             where g.Success
             select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
You can first split on quoted string, and then further tokenize.
foreach (String s in Regex.Split(input, @"('[^']+')")) {
// Check first if s is a quote.
// If so, split out the quotes.
// If not, do what you intend to do.
}
(Note: you need the brackets in the pattern to make sure Regex.Split returns those too)
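The loop above leaves the per-piece handling open. One way to fill it in is sketched below, assuming unquoted text should simply be split on whitespace; QuoteTokenizer and Tokenize are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteTokenizer
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        // The capture group makes Regex.Split return the quoted pieces as well.
        foreach (string piece in Regex.Split(input, @"('[^']+')"))
        {
            if (Regex.IsMatch(piece, @"\A'[^']+'\z"))
            {
                // Quoted piece: emit the quotes as their own tokens around the inner text.
                tokens.Add("'");
                tokens.Add(piece.Substring(1, piece.Length - 2));
                tokens.Add("'");
            }
            else
            {
                // Unquoted piece: every run of non-whitespace characters is a token.
                foreach (Match word in Regex.Matches(piece, @"\S+"))
                {
                    tokens.Add(word.Value);
                }
            }
        }
        return tokens;
    }
}
```

For the three examples in the question this yields ', hello, ' for 'hello'; ', hello world, ' for 'hello world'; and hello, world for the unquoted input. Empty pieces returned by Regex.Split produce no tokens.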
Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds zero or more single quotes at the beginning and end of a string. Between them it finds 1 or more characters in the a-z set (unless you make it case-insensitive, it will only find lowercase characters). It groups these so that group 1 has the leading ', group 2 has a word (words are split by anything that is not a character from a to z), and the last group has the trailing single quote if it exists.
