Match all 'X' from 'Y' until 'Z' - c#

Well, I hope the title is not too confusing. My task is to match (and replace) all Xs that are between Y and Z.
I use X,Y,Z since those values may vary at runtime, but that's not a problem at all.
What I've tried so far is this:
pattern = ".*Y.*?(X).*?Z.*";
This actually works, but only for one X. I simply can't figure out how to match all Xs between those "tags".
I also tried this:
pattern = @"((Y|\G).*?)(?!Z)(X)";
But this matches all Xs, ignoring the "tags".
What is the correct pattern to solve my problem? Thanks in advance :)
Edit
some more information:
X is a single char, Y and Z are strings
A more real life test string:
Some.text.with.dots [nodots]remove.dots.here[/nodots] again.with.dots
=> match .s between [nodots] and [/nodots]
(note: I used xml-like syntax here, but that's not guaranteed so I can unfortunately not use a simple xml or html parser)

In C#, if you need to replace some text inside some block of text, you may match the block(s) with a simple regex like (?s)(START)(.*?)(END) and then inside a match evaluator make the necessary replacements in the matched blocks.
In your case, you may use something like
    var res = Regex.Replace(str, @"(?s)(\[nodots])(.*?)(\[/nodots])",
        m => string.Format(
            "{0}{1}{2}",
            m.Groups[1].Value,                  // Restoring start delimiter
            m.Groups[2].Value.Replace(".", ""), // Modifying inner contents
            m.Groups[3].Value                   // Restoring end delimiter
        )
    );
Pattern details:
(?s) - an inline version of the RegexOptions.Singleline modifier flag
(\[nodots]) - Group 1: starting delimiter (literal string [nodots])
(.*?) - Group 2: any 0+ chars as few as possible
(\[/nodots]) - Group 3: end delimiter (literal string [/nodots])
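Since the question says X, Y and Z vary at runtime, the same idea can be sketched as a small helper. This is only an illustration: the class and method names are made up here, and it assumes the delimiters should be treated as literal text, which is why they go through Regex.Escape.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimitedReplace
{
    // Replaces every occurrence of `target` that lies between `start` and `end`.
    // Text outside the delimiters is left untouched.
    public static string ReplaceBetween(string input, string start, string end,
                                        string target, string replacement)
    {
        // Regex.Escape turns the runtime-supplied delimiters into literal patterns.
        string pattern = "(" + Regex.Escape(start) + ")(.*?)(" + Regex.Escape(end) + ")";

        return Regex.Replace(
            input,
            pattern,
            m => m.Groups[1].Value                                     // start delimiter
                 + m.Groups[2].Value.Replace(target, replacement)      // modified contents
                 + m.Groups[3].Value,                                  // end delimiter
            RegexOptions.Singleline);                                  // same effect as (?s)
    }
}
```

For the test string from the question, ReplaceBetween(input, "[nodots]", "[/nodots]", ".", "") should strip the dots only inside the block.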

Related

RegEx to find non-existence of white space prefix but not include the character in the match?

I have the following RegEx for the purpose of finding and adding whitespace:
(\S)(\()
For a string like "SomeText(Somemoretext)" I want to update this to "SomeText (Somemoretext)". The pattern matches "t(", so my replace eliminates the "t" from the string, which is not good. I also do not know what the character could be; I'm merely trying to find the non-existence of whitespace.
Is there a better expression to use, or is there a way to exclude the found character from the match returned so that I can safely replace without catching characters I do not want to replace?
Thanks
I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", @"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.
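For comparison, here is a minimal sketch of the lookbehind variant, which answers the "exclude the found character from the match" part of the question directly. The class and method names are illustrative only.

```csharp
using System.Text.RegularExpressions;

public static class ParenSpacing
{
    // (?<=\S) asserts that a non-whitespace character precedes "(" without
    // consuming it, so only the parenthesis itself is part of the match.
    public static string AddSpaceBeforeParen(string input)
    {
        return Regex.Replace(input, @"(?<=\S)\(", " (");
    }
}
```

Because the preceding character is never part of the match, the replacement string does not need a $1 substitution.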
If you need to detect the absence of space before a specific character (such as bracket) after a word, how about the following?
\b(?=[^\s])\(
This matches an opening bracket that directly follows a word character ([a-zA-Z0-9_]), with no space in between.
(If I understood your problem correctly) you can replace the full match with " (" and get exactly what you need.
In case you need to look for the absence of spaces before a symbol (like a bracket) in any kind of text (as in, the text may be non-word, such as punctuation), you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1, instead of just full match (which now contains the whole line, if a line is matched).

Can Regular Expressions Achieve This?

I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = @"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions
You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
Use a token parser to split the input into tokens. Use regex to find string patterns.
'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.
While it would be possible to match ' and the text inside separately, and also alternatively match the text alone, RegExp does not allow an indefinite number of matches. Or better said, you can only match those objects you explicitly state in the expression. So ((\w+)+\b) could theoretically match all words one-by-one. The outer group will correctly match the whole text, and the inner group will also match the words separately, but you will only be able to reference the last match.
There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.
Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If a quote is found, then it matches until a non-quote is found. Otherwise looks at word characters. Your results are in groups named "quot" and "words".
You'll have hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, @"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky: .NET returns a collection of Match objects. Each Match has several Groups. The first Group has the whole string ('hello world'), but the rest have sub-matches (', hello world, '). You also get many empty, unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
             from g in match.Groups.Cast<Group>().Skip(1)
             where g.Success
             select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
You can first split on quoted string, and then further tokenize.
foreach (String s in Regex.Split(input, @"('[^']+')")) {
    // Check first if s is a quote.
    // If so, split out the quotes.
    // If not, do what you intend to do.
}
(Note: you need the brackets in the pattern to make sure Regex.Split returns those too)
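The split-then-tokenize outline above could be filled in roughly like this. It is a sketch under assumptions: the class and method names are invented, unquoted text is split on whitespace only, and quotes are never nested or escaped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteTokenizer
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();

        // The capturing group makes Regex.Split return the quoted pieces as well.
        foreach (string part in Regex.Split(input, @"('[^']+')"))
        {
            if (Regex.IsMatch(part, @"\A'[^']+'\z"))
            {
                // Quoted piece: emit the quotes as their own tokens.
                tokens.Add("'");
                tokens.Add(part.Substring(1, part.Length - 2));
                tokens.Add("'");
            }
            else
            {
                // Unquoted piece: split on whitespace and drop empty entries.
                tokens.AddRange(part.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return tokens;
    }
}
```

With the three examples from the question, Tokenize("'hello world'") should give the quote, the text, and the closing quote, while Tokenize("hello world") should give two word tokens.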
Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds zero or more single quotes at the beginning and end of a string. It then finds 1 or more characters in the a-z set (if you don't set it to be case insensitive it will only find lower case characters). It groups these so that group 1 has the leading ', group 2 has the word (words are split by anything that is not a character a-z) and the last group has the trailing single quote if it exists.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
This does my work by splitting the words on every occurrence of a ",". What I want to know is, say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state-machine implementation. Doesn't it work in the same way?
If a string starts with value_name followed by one or more whitespaces, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ,, and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times. Remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match at every single occurrence, so you have to build it not only to match 'value_name' at the start of the string, but also every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
    for (int i = 1; i < m.Groups.Count; i++) {
        Group g = m.Groups[i];
        if (g.Success) {
            // matched text: g.Value
            // match start: g.Index
            // match length: g.Length
        }
    }
    m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
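In .NET specifically, a repeated group keeps every value it matched in its Captures collection, so the ^value_name\s+(?:([^,]+),?)* pattern can return all fields from a single match. A minimal sketch, with illustrative class and method names and an added Trim() for stray spaces:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ValueList
{
    public static List<string> Parse(string line)
    {
        var values = new List<string>();
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        if (m.Success)
        {
            // Group 1 is repeated by the outer (?: ... )*; Captures holds
            // every iteration, not just the last one.
            foreach (Capture c in m.Groups[1].Captures)
            {
                values.Add(c.Value.Trim());
            }
        }
        return values;
    }
}
```

Groups[1].Value on its own would only hold the last field, which is why the loop reads Captures instead.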
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);

Regex for variable declaration and initialization in c#

I want to write a RegEx to pull out all the variable values and their names from the variable declaration statement. Say I have
int i,k = 10,l=0
I want to write a regex something like int\s^,?|(^,?)*
but this will also accept
k = 10 i.e. (without int preceding it)
Basically the idea is:
If the string starts with int, then
get the variable list separated by ,
I know how to extract CSV values, but here my string has some initial values as well. How can I resolve it?
Start thinking about the structure of a definition, say,
(a line can start with some spaces) followed by,
(Type) followed by
(at least one space)
(variable_1)
(optionally
    (comma // next var
    |
    '=' number // initialization
    ) ...
then try to convert each group:
^ \s* \w+ \s+ \w+ ? (',' | '=' \d+ ) ...
    ^        line start
    \s*      some spaces
    \w+      type (some chars)
    \s+      at least one space
    \w+      var (some chars)
    ?        optionally
    ','      more vars
    '=' \d+  or init val (some digits)
Left as homework to remove spaces and fix up the final regex.
Here is some useful information which you can use
http://compsci.ca/v3/viewtopic.php?t=6712
You could build up your regular expression from the [C# Grammar](http://msdn.microsoft.com/en-us/library/aa664812(VS.71).aspx). But building a parser would certainly be better.
Try this:
^(int|[sS]tring)\s+\w+\s*(=\s*[^,]+)?(,\s*\w+\s*(=\s*[^,]+)?)*$
It'll match your example code
int i,k = 10,l=0
And making a few assumptions about the language you may or may not be using, it'll also match:
int i, j, k=10, l=0
string i=23, j, k=10, l=0
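The pattern above validates a declaration but does not by itself hand back the names and values. One possible sketch of the extraction step follows: check the type prefix with a regex, then split the rest on commas and on the first '='. The class and method names are invented, and like the [^,]+ in the pattern it assumes initial values contain no commas.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DeclarationParser
{
    // Returns variable name -> initial value (null when there is no initializer).
    // Returns an empty dictionary when the line does not start with a known type.
    public static Dictionary<string, string> Parse(string declaration)
    {
        var result = new Dictionary<string, string>();

        Match m = Regex.Match(declaration, @"^\s*(?:int|[sS]tring)\s+(.+)$");
        if (!m.Success)
        {
            return result;
        }

        foreach (string part in m.Groups[1].Value.Split(','))
        {
            // Split on the first '=' only: left side is the name, right side the value.
            string[] pair = part.Split(new[] { '=' }, 2);
            string name = pair[0].Trim();
            string value = pair.Length > 1 ? pair[1].Trim() : null;
            result[name] = value;
        }
        return result;
    }
}
```

For "int i,k = 10,l=0" this should give i with no value, k with "10" and l with "0", while "k = 10" on its own is rejected because the type is missing.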

Regex - replace only last part of an expression

I'm attempting to find the best methodology for finding a specific pattern and then
replace the ending portion of the pattern. Here is a quick example (in C#):
//Find any year value starting with a bracket or underscore
string patternToFind = "[[_]2007";
Regex yearFind = new Regex(patternToFind);
//I want to change any of these values to x2008 where x is the bracket or underscore originally in the text. I was trying to use Regex.Replace(), but cannot figure out if it can be applied.
If all else fails, I can find Matches using the MatchCollection and then switch out the 2007 value with 2008; however, I'm hoping for something more elegant
MatchCollection matches = yearFind.Matches(" 2007 [2007 _2007");
foreach (Match match in matches){
    //use match to find and replace value
}
Your pattern does not work as described: as described you need to start with "\[|_" (the pipe means OR), and the solution to your actual problem is regex grouping. Surround the part of the pattern you are interested in in brackets "(" and ")" and you can access them in the replacer.
You therefore need a pattern like this: /^(\[|_)2007/
edit: .NET code
string s = Regex.Replace(source, @"^(\[|_)2007", "${1}2008");
n.b. misunderstood the requirement, pattern amended
You can wrap the part you want to keep in parentheses to create a sub-match group. Then in the replacement text, use a backreference to put it back in. If I'm understanding what you are trying to do correctly, you would do something like this:
Regex yearFind = new Regex("([[_])2007");
yearFind.Replace("_2007", "${1}2008"); // => "_2008"
yearFind.Replace("[2007", "${1}2008"); // => "[2008"
The "${1}" in the replacement text is replaced with whatever was matched inside the parentheses. The braces keep the digits of 2008 from being read as part of the group number.
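A minimal sketch of the backreference approach, run over the question's test string. It writes the group reference as ${1} because, as far as I know, .NET would otherwise read $12008 as a reference to a group numbered 12008 and, finding no such group, emit it literally. The class and method names are illustrative only.

```csharp
using System.Text.RegularExpressions;

public static class YearBump
{
    // Replaces 2007 with 2008 wherever it directly follows "[" or "_",
    // keeping the preceding character via group 1.
    public static string Bump(string input)
    {
        return Regex.Replace(input, @"([\[_])2007", "${1}2008");
    }
}
```

The bare " 2007" in the test string has neither a bracket nor an underscore before it, so it should be left alone.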
To show substitution (using vim in this case). if I have a file with the following contents:
aaa _2007
bbb , 2007
ccc [2007]
and I use the regular expression
:1,$ s/\([_[ ]\)\(2007\)/\12008/g
The first group (in the (, )) will match the character preceding the year and the second group will match the year 2007. The substitution substitutes in the first match and overwrites whatever was matched by the second group with 2008, giving:
aaa _2008
bbb , 2008
ccc [2008]
Different regex libraries will have minor syntactic variations on this principle.
