Using Regular Expression to match fields with an arbitrary delimiter

I suppose this is an old question, but I didn't find a suitable solution in the forums after several hours of searching.
I'm using C#, and I know that the Regex.Split and String.Split methods can be used to achieve the expected result. However, I need to use a regular expression that matches the required fields for an arbitrary delimiter. For example, here is the string:
#DIV#This#DIV#is#DIV#"A "#DIV#string#DIV#
Here, #DIV# is the delimiter, and the string should be split into:
This
is
"A "
string
How can I use a regular expression to match these values?
By the way, leading and trailing #DIV# delimiters should be ignored. For example, the source strings below should give the same result as above:
#DIV#This#DIV#is#DIV#"A "#DIV#string
This#DIV#is#DIV#"A "#DIV#string#DIV#
This#DIV#is#DIV#"A "#DIV#string

UPDATE:
I think I found a way (mind that it is not efficient!) to get rid of empty values with a regex.
var splits = Regex.Matches(strIn, @"(?<=#DIV#|^)(?:(?!#DIV#).)+?(?=$|#DIV#)");
See the demo on regexstorm (the \r? used there is only needed to demo in Multiline mode; you do not need it in real code).
ORIGINAL ANSWER
Here is another approach using a regular Split:
var strIn = "#DIV#This#DIV#is#DIV#\"A # \"#DIV#string#DIV#";
var splitText = strIn.Split(new[] {"#DIV#"}, StringSplitOptions.RemoveEmptyEntries);
Or else, you can use a regex to match the fields you need and then remove empty items with LINQ:
var spltsTxt2 = Regex.Matches(strIn, @"(?<=#DIV#|^).*?(?=#DIV#|$)").Cast<Match>().Where(p => !string.IsNullOrEmpty(p.Value)).Select(p => p.Value).ToList();
Output (the same for both approaches):
This
is
"A # "
string
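The lookaround pattern from the update hard-codes #DIV#. Below is a minimal sketch of how it might be generalized to any literal delimiter. The helper name MatchFields is hypothetical, and the use of Regex.Escape assumes the delimiter should be treated as literal text rather than as a pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// MatchFields is a hypothetical helper name, not part of the original answer.
static string[] MatchFields(string input, string delimiter)
{
    // Escape the delimiter so characters such as '#', '|' or '.' are taken literally.
    string d = Regex.Escape(delimiter);

    // A field starts after a delimiter (or at the start of the input), holds at
    // least one character, and never runs into the delimiter itself.
    string pattern = $"(?<={d}|^)(?:(?!{d}).)+?(?={d}|$)";

    return Regex.Matches(input, pattern)
                .Cast<Match>()
                .Select(m => m.Value)
                .ToArray();
}

string[] fields = MatchFields("#DIV#This#DIV#is#DIV#\"A \"#DIV#string#DIV#", "#DIV#");
Console.WriteLine(string.Join(" | ", fields)); // This | is | "A " | string
```

Because the pattern requires at least one character per field, empty fields (for example between two adjacent delimiters) are skipped rather than returned as empty strings.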

#DIV#|(.+?)(?=#DIV#|$)
Try this. Grab the captures or groups. See the demo.
https://www.regex101.com/r/fJ6cR4/21
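As a rough illustration of "grab the groups" in C#: each match of this pattern is either a delimiter (group 1 is unset) or a field (group 1 is set), so filtering on Groups[1].Success should leave only the field values. The variable names are assumptions made for this sketch.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "#DIV#This#DIV#is#DIV#\"A \"#DIV#string#DIV#";

// Keep only the matches where the capturing group took part in the match.
var values = Regex.Matches(input, @"#DIV#|(.+?)(?=#DIV#|$)")
                  .Cast<Match>()
                  .Where(m => m.Groups[1].Success)
                  .Select(m => m.Groups[1].Value)
                  .ToList();

Console.WriteLine(string.Join(" | ", values)); // This | is | "A " | string
```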

You can use the following to match:
/#?DIV#?/g
And replace with ' ' (space)
However, this will sometimes leave leading and trailing spaces, which can be removed with String.Trim().
Edit1: If you want to match the field values you can use the following:
(?<=(#?DIV#?)|^)[^#]*?(?=(#?DIV#?)|$)
See DEMO
Edit2: More generalized regex for matching # in fields:
(?m)(?<=(^(?!#?DIV#)|(#?DIV#)))(.*?)(?=($|(#DIV#?)))

Related

RegEx for matching special chars no spaces or newlines

I have a string and want to use regex to match all the chars, but no spaces.
I tried to replace all the spaces with nothing, using:
// rating
var betyg = Regex.Replace(seller, @"[A-z](.+)", m => m.Groups[1].Value);
I expect the output of
"Iris-presenter | 5"
but, the output is
"Iris-presenter"
as seen in this demo.
The string is:
<spaces>Iris-presenter
<spaces>|
<spaces>5

Great question! I'm not quite sure whether this is what you are looking for. However, this expression matches your input string:
^((?!\s|\n).)*
Edit
Based on revo's advice, the expression can be much simplified, because
^((?!\s|\n).)* is equal to ^((?!\s).)* and both are equal to ^\S*.

I used (\s(.*?)) to get it to work. This removes all spaces and newlines, as seen here.
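For comparison, here is a minimal sketch of one way to reach the expected "Iris-presenter | 5". It is not taken from the answers above: it assumes the goal is to collapse each run of whitespace into a single space, and the sample input is reconstructed from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed input, reconstructed from the question: three values separated by
// line breaks and leading spaces.
string seller = "      Iris-presenter\n      |\n      5";

// Collapse every whitespace run (spaces, tabs, line breaks) into one space,
// then trim what is left at both ends.
string rating = Regex.Replace(seller, @"\s+", " ").Trim();

Console.WriteLine(rating); // Iris-presenter | 5
```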

How to replace all occurrences of `someObject.ToString()` with `Convert.ToString(someObject);`

I want to search through my code and replace all instances of someObject.ToString() with Convert.ToString(someObject).
For example if I have:
var x = someClassInstance.ToString()
I want to replace it with:
var x = Convert.ToString(someClassInstance)
Is it possible to do this through regular expression?

The solution will differ slightly based on your environment, but for example in Notepad++:
Search for: ([0-9a-zA-Z_]+)\.ToString\(\)
Replace with: Convert.ToString\($1\)

In C#, you can use the following regex:
\b(\w+)\.ToString\(\)
It starts by matching a word boundary and then grabs all word characters before the dot and ToString(). Note the escaped characters; unescaped, they have a special meaning in regex.
You then need to replace it with:
Convert.ToString($1)
Here '$1' will be replaced by the matched group 1 from the regex (the name of the object).
Edit:
The above regex will fail if the object is itself the result of a method call, like 'myMethod(param).ToString()'.
I have changed the regex to accept anything that is not a dot followed by 'ToString' (since the code already compiles, there's no need for further syntax checking):
\b((?!\.ToString)(?:[\w.()+*/-])*?)\.ToString\(\)
Now it should include function calls.
Example of match: 'SomeFunction(Int32.MaxValue-1).ToString()'
It will fail if there are spaces in the match.
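A minimal sketch of both patterns applied with Regex.Replace follows. The sample source lines and variable names are made up for illustration; on real code the result would still need a manual review.

```csharp
using System;
using System.Text.RegularExpressions;

string simple = "var x = someClassInstance.ToString()";
string nested = "var y = SomeFunction(Int32.MaxValue-1).ToString()";

// First pattern: a plain identifier directly before .ToString().
string r1 = Regex.Replace(simple, @"\b(\w+)\.ToString\(\)", "Convert.ToString($1)");

// Second pattern: also accepts dots, parentheses and arithmetic operators
// (but no spaces) before .ToString().
string r2 = Regex.Replace(
    nested,
    @"\b((?!\.ToString)(?:[\w.()+*/-])*?)\.ToString\(\)",
    "Convert.ToString($1)");

Console.WriteLine(r1); // var x = Convert.ToString(someClassInstance)
Console.WriteLine(r2); // var y = Convert.ToString(SomeFunction(Int32.MaxValue-1))
```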

RegEx - Match using symbols but don't replace them

I would like to use a symbols in a RegEx pattern to find matches, but I don't want them replaced. This is for class and namespace manipulation in C#.
For example:
MyNamespaceLib.EntityDataModelTests.TestsMyClassTests+MyInnerClassTests
must be replaced as:
MyNamespaceLib.EntityDataModel.TestsMyClass+MyInnerClass
(Note: "Tests" is only replaced when it appears at the end of a namespace or class name, not when it is at the start or in the middle of the name.)
I've managed to get the first part right in finding the matches, but I'm battling to keep the symbols in the replaced match.
So far I have:
var input = "MyNamespaceLib.EntityDataModelTests.TestsMyClassTests+MyInnerClassTests";
var output = Regex.Replace(input, "Tests[.+]|$", "");
I've tried using a non-capturing group, but I suspect it's not meant for the way I'm trying to use it.
Thanks!

So what you want to do is replace matches that are followed by a . or a +, or that sit at the end of the string, without consuming the symbol? Use a lookahead:
@"Tests(?=[.+]|$)"

You can use the MatchEvaluator overload of the Regex.Replace method, where the string that replaces the match is generated on the fly. I put the special symbol in a capturing group (the first capturing group is always Groups[1] of the match) and replace the match with that group's value, like this:
var output = Regex.Replace(input, @"Tests([.+]|$)", m => m.Groups[1].Value);
Also, per minitech's comment, you can use $1 for the first capturing group in the (string, string) overload of Regex.Replace, like:
var output = Regex.Replace(input, @"Tests([.+]|$)", "$1");
That said, a regex is often write-only code, so you can always do a dumb and simple replace that puts the symbol back:
var output = input.Replace("Tests+","+").Replace("Tests.",".") ...;
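A short runnable sketch comparing two ways of keeping the symbol: putting it back through a capturing group, and a positive lookahead that never consumes it. The variable names are chosen for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "MyNamespaceLib.EntityDataModelTests.TestsMyClassTests+MyInnerClassTests";

// Capture the symbol (or the end of the string) and put it back with $1.
string viaGroup = Regex.Replace(input, @"Tests([.+]|$)", "$1");

// Only look at the symbol; it is never part of the match, so nothing to put back.
string viaLookahead = Regex.Replace(input, @"Tests(?=[.+]|$)", "");

Console.WriteLine(viaGroup);     // MyNamespaceLib.EntityDataModel.TestsMyClass+MyInnerClass
Console.WriteLine(viaLookahead); // MyNamespaceLib.EntityDataModel.TestsMyClass+MyInnerClass
```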

Can Regular Expressions Achieve This?

I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = @"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions

You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.

Use a token parser to split the input into tokens, and use regex to find string patterns.

'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.

While it would be possible to match ' and the text inside separately, and alternatively match the text alone, RegExp does not allow an indefinite number of matches. Or, better said, you can only match those objects you explicitly state in the expression. So ((\w+)+\b) could theoretically match all words one by one. The outer group will correctly match the whole text, and the inner group will correctly match the words separately, but you will only be able to reference the last match.
There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.

Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If a quote is found, it matches non-quote characters up to the closing quote. Otherwise, it matches word characters. Your results are in the groups named "quot" and "words".
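A minimal sketch of how the conditional pattern might be turned into the token list from the question. The helper name Tokenize is hypothetical, and emitting the quotes as separate tokens is an assumption based on the examples in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Tokenize is a hypothetical helper built around the conditional pattern.
static List<string> Tokenize(string input)
{
    var tokens = new List<string>();
    string pattern = @"(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')";

    foreach (Match m in Regex.Matches(input, pattern))
    {
        bool quoted = m.Groups["quot"].Success;
        if (quoted) tokens.Add("'");          // opening quote as its own token
        tokens.Add(m.Groups["words"].Value);  // quoted text, or a single word
        if (quoted) tokens.Add("'");          // closing quote as its own token
    }
    return tokens;
}

Console.WriteLine(string.Join(" | ", Tokenize("'hello world'"))); // ' | hello world | '
Console.WriteLine(string.Join(" | ", Tokenize("hello world")));   // hello | world
```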

You'll have hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, @"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky: .NET returns a collection of Match objects. Each Match has several Groups. The first Group holds the whole match ('hello world'), while the rest hold the sub-matches (', hello world, '). You also get many empty, unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
             from g in match.Groups.Cast<Group>().Skip(1)
             where g.Success
             select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine

You can first split on quoted string, and then further tokenize.
foreach (string s in Regex.Split(input, @"('[^']+')")) {
    // Check first whether s is a quoted string.
    // If so, split out the quotes.
    // If not, do what you intend to do.
}
(Note: you need the parentheses in the pattern to make sure Regex.Split returns the quoted strings too.)
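One way the three comments in that loop could be filled in is sketched below. The helper name SplitTokens is hypothetical, and splitting the unquoted parts on whitespace is an assumption based on the examples in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// SplitTokens is a hypothetical helper name used only for this sketch.
static List<string> SplitTokens(string input)
{
    var tokens = new List<string>();
    char[] whitespace = { ' ', '\t', '\r', '\n' };

    // The capturing group makes Regex.Split return the quoted parts as well.
    foreach (string s in Regex.Split(input, @"('[^']+')"))
    {
        bool isQuoted = s.Length >= 3 && s[0] == '\'' && s[s.Length - 1] == '\'';
        if (isQuoted)
        {
            // Emit the quotes as their own tokens around the inner text.
            tokens.Add("'");
            tokens.Add(s.Substring(1, s.Length - 2));
            tokens.Add("'");
        }
        else
        {
            // Unquoted text: split on whitespace and drop empty pieces.
            tokens.AddRange(s.Split(whitespace, StringSplitOptions.RemoveEmptyEntries));
        }
    }
    return tokens;
}

Console.WriteLine(string.Join(" | ", SplitTokens("hello 'big world' bye"))); // hello | ' | big world | ' | bye
```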

Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds zero or more single quotes at the beginning and end of a string. It then finds one or more characters in the a-z set (unless you make it case insensitive, it will only find lowercase characters). It groups these so that group 1 has the opening ', group 2 has the word (anything that is not in a-z ends it), and the last group has the closing single quote, if it exists.

C# regex replace unexpected behavior

Given $displayHeight = "800";, I want to replace whatever number is in place of 800 with the int value y_res.
resultString = Regex.Replace(
im_cfg_contents,
@"\$displayHeight[\s]*=[\s]*""(.*)"";",
Convert.ToString(y_res));
In Python I'd use re.sub and it would work. In .NET it replaces the whole line, not the matched group.
What is a quick fix?

Building on a couple of the answers already posted. The Zero-width assertion allows you to do a regular expression match without placing those characters in the match. By placing the first part of the string in a group we've separated it from the digits that you want to be replaced. Then by using a zero-width lookbehind assertion in that group we allow the regular expression to proceed as normal but omit the characters in that group in the match. Similarly, we've placed the last part of the string in a group, and used a zero-width lookahead assertion. Grouping Constructs on MSDN shows the groups as well as the assertions.
resultString = Regex.Replace(
im_cfg_contents,
@"(?<=\$displayHeight[\s]*=[\s]*"")(.*)(?="";)",
Convert.ToString(y_res));
Another approach would be to use the following code. The modification to the regular expression is just placing the first part in a group and the last part in a group. Then in the replace string, we add back in the first and third groups. Not quite as nice as the first approach, but not quite as bad as writing out the $displayHeight part. Substitutions on MSDN shows how the $ characters work.
resultString = Regex.Replace(
im_cfg_contents,
@"(\$displayHeight[\s]*=[\s]*"")(.*)("";)",
"${1}" + Convert.ToString(y_res) + "${3}");
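Both approaches can be tried side by side in a small self-contained sketch. The sample config text and the value 600 are invented for illustration, and [\s]* is written as the equivalent \s*.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up sample content standing in for the real config file.
string im_cfg_contents = "$displayWidth = \"1280\";\n$displayHeight = \"800\";\n";
int y_res = 600;

// Lookarounds: only the value is part of the match, so only it is replaced.
string viaLookaround = Regex.Replace(
    im_cfg_contents,
    @"(?<=\$displayHeight\s*=\s*"")(.*)(?="";)",
    Convert.ToString(y_res));

// Groups: the whole assignment is matched, and groups 1 and 3 are put back.
// The braces in ${1} keep the group number apart from the digits that follow;
// "$1" + "600" would not be read as a reference to group 1.
string viaGroups = Regex.Replace(
    im_cfg_contents,
    @"(\$displayHeight\s*=\s*"")(.*)("";)",
    "${1}" + Convert.ToString(y_res) + "${3}");

Console.Write(viaLookaround);
Console.Write(viaGroups);
```

Both calls should leave the $displayWidth line untouched and change only the quoted value on the $displayHeight line.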

Try this:
resultString = Regex.Replace(
im_cfg_contents,
@"\$displayHeight[\s]*=[\s]*""(.*)"";",
@"$$displayHeight = """ + Convert.ToString(y_res) + @""";");

It replaces the whole string because you've matched the whole string. Nothing about this statement tells C# to replace just the matched group; it will find and store that group, sure, but the overall match is still the whole string.
You can either change your replacement to:
@"$$displayHeight = """ + Convert.ToString(y_res) + @""";"
..or you can change your pattern to just match the digits, i.e.:
@"[0-9]+"
..or, since .NET regular expressions support lookarounds, you could change your match accordingly.

You could also try this, though I think it is a little slower than my other method:
resultString = Regex.Replace(
im_cfg_contents,
"(?<=\\$displayHeight[\\s]*=[\\s]*\").*(?=\";)",
Convert.ToString(y_res));

Check this pattern out
(?<=(\$displayHeight\s*=\s*"))\d+(?=";)
A word about "lookaround".
