Regex - replace only last part of an expression - c#

I'm trying to find the best way to match a specific pattern and then
replace only the ending portion of the match. Here is a quick example (in C#):
//Find any year value starting with a bracket or underscore
string patternToFind = "[[_]2007";
Regex yearFind = new Regex(patternToFind);
//I want to change any of these values to x2008, where x is the bracket or underscore originally in the text. I was trying to use Regex.Replace(), but cannot figure out whether it can be applied.
If all else fails, I can find the matches using a MatchCollection and then swap each 2007 for 2008; however, I'm hoping for something more elegant:
MatchCollection matches = yearFind.Matches(" 2007 [2007 _2007");
foreach (Match match in matches){
//use match to find and replace value
}

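Your character class [[_] is easy to misread; the same thing can be written as the alternation "\[|_" (the pipe means OR). The solution to your actual problem is regex grouping: surround the part of the pattern you want to keep in parentheses "(" and ")" and you can refer to it in the replacement string.
You therefore need a pattern like this: /^(\[|_)2007/
edit: .NET code
string s = Regex.Replace(source, @"^(\[|_)2007", "${1}2008");
n.b. misunderstood the requirement, pattern amended. In .NET the group reference is written ${1} here, because $12008 would be read as a reference to group 12008. The ^ anchors the match to the start of the input; drop it if the year can appear anywhere in the text.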

You can wrap the part you want to keep in parentheses to create a sub-match group. Then in the replacement text, use a backreference to put it back in. If I'm understanding what you are trying to do correctly, you would do something like this:
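Regex yearFind = new Regex("([[_])2007");
yearFind.Replace("_2007", "${1}2008"); // => "_2008"
yearFind.Replace("[2007", "${1}2008"); // => "[2008"
The "${1}" in the replacement text is replaced with whatever was matched inside the parentheses. The braces are needed because a digit follows: a bare "$12008" would be read as a reference to group 12008 rather than group 1 followed by "2008".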
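A minimal runnable sketch of the backreference approach, applied to the sample string from the question. The class and method names (YearReplaceDemo, BumpYear) are made up for illustration. The group reference is written as ${1}, since in .NET a $1 directly followed by digits would be read as a longer group number.

```csharp
using System;
using System.Text.RegularExpressions;

public static class YearReplaceDemo
{
    // Keeps the leading '[' or '_' (group 1) and replaces the year that follows it.
    public static string BumpYear(string input)
    {
        // ${1} delimits the group number so that "2008" is treated as literal text.
        return Regex.Replace(input, @"([\[_])2007", "${1}2008");
    }

    public static void Main()
    {
        // The bare " 2007" has no bracket or underscore in front, so it is left alone.
        Console.WriteLine(BumpYear(" 2007 [2007 _2007")); // prints " 2007 [2008 _2008"
    }
}
```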

To show substitution (using vim in this case): if I have a file with the following contents:
aaa _2007
bbb , 2007
ccc [2007]
and I use the regular expression
:1,$ s/\([_[ ]\)\(2007\)/\12008/g
The first group (in the \( and \)) will match the character preceding the year and the second group will match the year 2007. The substitution puts back the first match and overwrites whatever was matched by the second group with 2008, giving:
aaa _2008
bbb , 2008
ccc [2008]
Different regex libraries will have minor syntactic variations on this principle.


RegEx different substitutions based on groups?

So I'm relatively n00bish at regular expressions, and doing a little practicing.
I'm playing with a dog-simple "deobfuscator" that just looks for [dot] or (dot) or [at] or (at), case-insensitively, with or without any number of spaces before or after the match(es).
This is for the usual someemail [AT] domain (dot) com type of thing, which I obviously want to turn into someemail@domain.com.
The RegEx I've come up with does the matching fine, but now I want to replace with either a . or an @ depending on the match.
i.e.
I want the group matching "dot" to be replaced with a literal ., and the group matching "at" with a literal @.
I know I could just write 2 different (almost identical) RegEx's and run it through both, but for the sake of education, I'm trying to see if I can do it all in one RegEx?
Here's the RegEx I came up with (probably not the smallest possible, which I'd also be interested in seeing):
+(\[|\()(dot)(\)|\]) +| +(\[|\()(at)(\)|\]) +
NOTE: before each + there's an empty space, for matching spaces.
What I'm looking for is what I would use to do the replacement(s) properly?
Update: Sorry all, forgot to add which language I was working with for this. In this case, I'm using a clipboard utility that can run RegEx's on its input (whatever gets copied to the clipboard), and the engine it uses is C#/VB.NET. Ultimate goal for this little project is to just be able to copy an "obfuscated" email address or URL, and run the RegEx on it so that it's set on the clipboard in its "unobfuscated" state.
That said, I do tend to use RegEx's on many different languages, so converting them between languages generally isn't an issue.
.NET regex does not support conditional replacement patterns.
for the sake of education, I'm trying to see if I can do it all in one RegEx?
Other regex engines do allow conditional logic in a single replacement operation, by means of conditional replacement patterns.
Three engines support this type of replacement: JGsoft V2, Boost, and PCRE2.
For conditionals to work in Boost, you need to pass regex_constants::format_all to regex_replace. For them to work in PCRE2, you need to pass PCRE2_SUBSTITUTE_EXTENDED to pcre2_substitute.
In PCRE2:
${1:+matched:unmatched} where 1 is a number between 1 and 99 referencing a numbered capturing group. If your regex contains named capturing groups then you can reference them in a conditional by their name: ${name:+matched:unmatched}.
If you want a literal colon in the matched part, then you need to escape it with a backslash. If you want a literal closing curly brace anywhere in the conditional, then you need to escape that with a backslash too. Plus signs have no special meaning beyond the :+ that starts the conditional, so they don't need to be escaped.
Also, see The Boost-Specific Format Sequences:
When specifying the format_all flag to regex_replace(), the escape sequences recognized are the same as those above for format_perl. In addition, conditional expressions of the following form are recognized:
?Ntrue-expression:false-expression
where N is a decimal digit representing a sub-match. If the corresponding sub-match participated in the full match, then the substitution is true-expression. Otherwise, it is false-expression. In this mode, you can use parens () for grouping. If you want a literal paren, you must escape it as \(.
In Boost replacement patterns, literal ( and ) must be escaped.
The syntax for JGsoft V2 replacement string conditionals is the same as that in the C++ Boost library.
So, your regex can be contracted to ( +)[[(](?:(dot)|(at))[])]( +):
( +) - Group 1: one or more spaces
[[(] - a [ or (
(?:(dot)|(at)) - Either (Group 2) a dot substring or (Group 3) an at substring
[])] - a ) or ]
( +) - Group 4: one or more spaces
And replace with $1(?{2}.:@)$4:
$1 - Group 1 value,
(?{2}.:@) - if Group 2 (dot) matched, replace with ., else with @
$4 - Group 4 value.
This is available in Notepad++.
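Since .NET has no conditional replacement patterns, a single-regex version there has to move the decision into code. A minimal sketch using a MatchEvaluator follows; the names EmailDeobfuscator and Deobfuscate are made up, and the pattern is a variant of the one above that uses \s* so the surrounding spaces are optional and get dropped, which is an assumption based on the stated goal.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailDeobfuscator
{
    // One pass: group 1 captures "dot", group 2 captures "at".
    public static string Deobfuscate(string input)
    {
        return Regex.Replace(
            input,
            @"\s*[\[(](?:(dot)|(at))[\])]\s*",
            // The evaluator plays the role of the conditional replacement:
            // "." when the "dot" group took part in the match, "@" otherwise.
            m => m.Groups[1].Success ? "." : "@",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Deobfuscate("someemail [AT] domain (dot) com")); // prints "someemail@domain.com"
    }
}
```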
If you are using Java, try the replaceAll method of the String class.
Finally, normalize the whitespace:
- Pure Java - String after = before.trim().replaceAll("\\s+", " ");
- Pure Java - String after = before.replaceAll("\\s{2,}", " ").trim();
- Apache commons lang3 - String after = StringUtils.normalizeSpace(before);
- ...

Match all 'X' from 'Y' until 'Z'

Well, I hope the title is not too confusing. My task is to match (and replace) all Xs that are between Y and Z.
I use X,Y,Z since those values may vary at runtime, but that's not a problem at all.
What I've tried so far is this:
pattern = ".*Y.*?(X).*?Z.*";
This actually works, but only for one X. I simply can't figure out how to match all Xs between those "tags".
I also tried this:
pattern = @"((Y|\G).*?)(?!Z)(X)"
But this matches all Xs, ignoring the "tags".
What is the correct pattern to solve my problem? Thanks in advance :)
Edit
some more information:
X is a single char, Y and Z are strings
A more real life test string:
Some.text.with.dots [nodots]remove.dots.here[/nodots] again.with.dots
=> match .s between [nodots] and [/nodots]
(note: I used xml-like syntax here, but that's not guaranteed so I can unfortunately not use a simple xml or html parser)
In C#, if you need to replace some text inside some block of text, you may match the block(s) with a simple regex like (?s)(START)(.*?)(END) and then inside a match evaluator make the necessary replacements in the matched blocks.
In your case, you may use something like
var res = Regex.Replace(str, @"(?s)(\[nodots])(.*?)(\[/nodots])",
    m => string.Format(
        "{0}{1}{2}",
        m.Groups[1].Value, // Restoring start delimiter
        m.Groups[2].Value.Replace(".",""), // Modifying inner contents
        m.Groups[3].Value // Restoring end delimiter
    )
);
Pattern details:
(?s) - an inline version of the RegexOptions.Singleline modifier flag
(\[nodots]) - Group 1: starting delimiter (literal string [nodots])
(.*?) - Group 2: any 0+ chars as few as possible
(\[/nodots]) - Group 3: end delimiter (literal string [/nodots])
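Because X, Y and Z vary at runtime in the question, the same idea can be wrapped in a helper that escapes the delimiters before building the pattern. This is only a sketch: BlockReplacer, RemoveBetween and the parameter names are hypothetical, and it assumes the delimiters do not nest.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockReplacer
{
    // Removes every occurrence of target that lies between start and end.
    public static string RemoveBetween(string text, string start, string end, string target)
    {
        // Regex.Escape makes delimiters such as "[nodots]" match literally.
        string pattern = "(?s)(" + Regex.Escape(start) + ")(.*?)(" + Regex.Escape(end) + ")";
        return Regex.Replace(
            text,
            pattern,
            // Delimiters are restored unchanged; only the inner text is modified.
            m => m.Groups[1].Value + m.Groups[2].Value.Replace(target, "") + m.Groups[3].Value);
    }

    public static void Main()
    {
        string input = "Some.text.with.dots [nodots]remove.dots.here[/nodots] again.with.dots";
        Console.WriteLine(RemoveBetween(input, "[nodots]", "[/nodots]", "."));
        // prints "Some.text.with.dots [nodots]removedotshere[/nodots] again.with.dots"
    }
}
```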

Don't use capturing groups in c# Regex

I am writing a regular expression in Visual Studio 2013 using C#
I have the following scenario:
Match match = Regex.Match("%%Text%%More text%%More more text", "(?<!^)%%[^%]+%%");
But my problem is that I don't want to use capturing groups. With the pattern above, match.Value contains %%More text%%, and my idea is to get the string More text directly in match.Value.
The string to get will always be between the second and the third group of %%
Put another way, the string will always be between the fourth and fifth %
I tried:
Regex.Match("%%Text%%More text%%More more text", "(?:(?<!^)%%[^%]+%%)");
But with no luck.
I want to use match.Value because all my regex are in a database table.
Is there a way to "transform" that regex into one that does not use capturing groups and leaves the desired string in match.Value?
If you are sure you have no %s inside double %%s, you can just use lookarounds like this:
(?<=^%%[^%]*%%)[^%]+(?=%%)
^^^^^^^^^^^^^^^     ^^^^^^
If you have single-% delimited strings (like %text1%text2%text3%text4%text5%text6, see demo):
(?<=^%[^%]*%)[^%]+(?=%)
And in case it is between the 4th and the 5th:
(?<=^%%(?:[^%]*%%){3})[^%]+(?=%%)
^^^^^^^^^^^^^^^^^^^^^^     ^^^^^^
For single-% delimited strings (see demo):
(?<=^%(?:[^%]*%){3})[^%]+(?=%)
Both regexes use a variable-width lookbehind and the same kind of lookahead to restrict the context in which the 1 or more characters other than % may appear.
The (?<=^%%[^%]*%%) makes sure there is %%[something_other_than_%]%% right after the beginning of the string, and (?<=^%%(?:[^%]*%%){3}) matches %%[substring_not_having_%]%%[substring_not_having_%]%%[substring_not_having_%]%% after the string start.
In case there can be single % symbols inside the double %%, you can use an unroll-the-loop regex (see demo):
(?<=^%%(?:[^%]*(?:%(?!%)[^%]*)*%%){3})[^%]*(?:%(?!%)[^%]*)*(?=%%)
This matches the same text as (?<=^%%(?:.*?%%){3}).*?(?=%%). For short strings, the .*? based solution should work faster. For very long input texts, use the unrolled version.
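To show that the lookaround patterns leave the wanted text directly in match.Value, here is a small sketch. The class and method names are made up for illustration, and the second input string is an invented example, not one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PercentFieldDemo
{
    // Text between the 2nd and 3rd %% delimiter, read straight from Match.Value.
    public static string SecondField(string input)
    {
        return Regex.Match(input, @"(?<=^%%[^%]*%%)[^%]+(?=%%)").Value;
    }

    // Text between the 4th and 5th %% delimiter.
    public static string FourthField(string input)
    {
        return Regex.Match(input, @"(?<=^%%(?:[^%]*%%){3})[^%]+(?=%%)").Value;
    }

    public static void Main()
    {
        // No capturing group is read; the delimiters sit inside zero-width lookarounds.
        Console.WriteLine(SecondField("%%Text%%More text%%More more text")); // prints "More text"
        Console.WriteLine(FourthField("%%a%%b%%c%%d%%e"));                   // prints "d"
    }
}
```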

Can Regular Expressions Achieve This?

I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = @"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions
You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
Use a token parser to split the input into tokens. Use regex to find string patterns.
'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.
While it would be possible to match ' and the text inside separately, and also alternatively match the text alone, RegExp does not allow an indefinite number of matches. Or better said, you can only match those objects you explicitly state in the expression. So ((\w+)+\b) could theoretically match all words one-by-one. The outer group will correctly match the whole text, and also the inner group will match the words separately correctly, but in most engines you will only be able to reference the last match (.NET is an exception: it keeps every capture of a repeated group in Group.Captures).
There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.
Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If a quote is found, it matches non-quote characters up to the closing quote. Otherwise it looks at word characters. Your results are in groups named "quot" and "words".
You'll have a hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, @"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky - .NET returns a collection of Match objects. Each Match has several Groups - the first Group has the whole string ('hello world'), but the rest have sub-matches (', hello world, '). Also, you get many empty unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
from g in match.Groups.Cast<Group>().Skip(1)
where g.Success
select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
You can first split on quoted string, and then further tokenize.
foreach (String s in Regex.Split(input, @"('[^']+')")) {
// Check first if s is a quote.
// If so, split out the quotes.
// If not, do what you intend to do.
}
(Note: you need the parentheses in the pattern so that Regex.Split returns the quoted strings too)
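The comments in that loop can be filled in along these lines. This is a sketch under assumptions: QuoteTokenizer and Tokenize are made-up names, unquoted text is split on whitespace, and quotes are assumed to be balanced and non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteTokenizer
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        // The capturing group makes Regex.Split keep the quoted pieces in its result.
        foreach (string part in Regex.Split(input, @"('[^']+')"))
        {
            if (Regex.IsMatch(part, @"\A'[^']+'\z"))
            {
                // Quoted piece: emit the opening quote, the inner text, the closing quote.
                tokens.Add("'");
                tokens.Add(part.Substring(1, part.Length - 2));
                tokens.Add("'");
            }
            else
            {
                // Unquoted piece: split on whitespace and drop empty entries.
                tokens.AddRange(part.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return tokens;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize("'hello world'"))); // prints "' | hello world | '"
        Console.WriteLine(string.Join(" | ", Tokenize("hello world")));   // prints "hello | world"
    }
}
```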
Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds zero or more single quotes at the beginning and end of each word. Between them it finds 1 or more characters in the a-z set (if you don't set it to be case insensitive it will only find lower case characters). Group 1 holds the leading ' if there is one, group 2 holds the word (words are split by anything that is not a character a-z), and the last group holds the trailing ' if it exists.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurrence of a ",". What I want to know is: say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state-machine implementation. Doesn't it work in the same way?
If a string starts with value_name followed by one or more whitespace characters, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the , and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times: you have to remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence in the string, so you have to build your Regex not only to match the start of the string 'value_name', but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
    for (int i = 1; i < m.Groups.Count; i++) {
        Group g = m.Groups[i];
        if (g.Success) {
            // matched text: g.Value
            // match start: g.Index
            // match length: g.Length
        }
    }
    m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
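In .NET specifically, a repeated group keeps every capture in Group.Captures, so the last pattern above can return all the fields from a single match. A minimal sketch, with made-up names (ValueListDemo, Values) and a sample line without spaces around the commas:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ValueListDemo
{
    // Returns every field captured by the repeated group 1, in order.
    public static string[] Values(string line)
    {
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        // Groups[1].Value would hold only the last capture; Captures holds all of them.
        // If the line does not match, Captures is empty and so is the result.
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Values("value_name v1,v2,v3,v4"))); // prints "v1 v2 v3 v4"
    }
}
```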
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
