regex split to implement tokenizer - c#

I have a string that can contain any characters, and I want to split it on the following separators:
"+"
",OU="
Can anyone show me how to do this with Regex.Split?
I have tried many times, but with no luck.
I'm using C#.

I think you can use String.Split, which lets you specify multiple separators:
string[] separator = new string[] { "+", ",OU=" };
string[] resultTokens = testString.Split(separator, StringSplitOptions.None);

For the Regex version:
string[] split = Regex.Split(yourstring, @"\+|,OU=");

You probably needed a backslash in front of the "+" so that it is treated as a literal. If you define the regex in a regular (non-verbatim) string, the backslash itself must also be escaped. Square brackets can be easier to read:
"([+]|,[Oo][Uu]=)"

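One detail worth knowing about the bracketed pattern above: when the pattern passed to Regex.Split contains a capturing group, the text matched by that group is included in the result array. The sketch below contrasts the two behaviours; the class and method names are made up for illustration and are not from the original answers.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class SeparatorSplitDemo
{
    // Non-capturing group (?:...): the separators are dropped from the result.
    public static string[] SplitDroppingSeparators(string input)
    {
        return Regex.Split(input, @"(?:[+]|,[Oo][Uu]=)");
    }

    // Capturing group (...): Regex.Split also returns each matched separator,
    // which is useful when the separators are themselves tokens.
    public static string[] SplitKeepingSeparators(string input)
    {
        return Regex.Split(input, @"([+]|,[Oo][Uu]=)");
    }
}
```

For the input "CN=a+b,OU=Sales,ou=East", the first method returns CN=a, b, Sales, East, while the second also returns the separators +, ,OU= and ,ou= in their original positions. Which one fits depends on whether the tokenizer needs to see the separators.
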
Related

Split string by different marks

How can I split a string on several different symbols, for example a dot (.) and a hyphen (-), in C#?
string str = "sally-vikram.dean.sarah-ray";
I want to do this without first replacing them all with the same mark:
str = str.Replace("-", ".");
and then splitting on the dot:
string[] words = str.Split('.');
to get:
sally
vikram
dean
sarah
ray
string.Split can actually take an array of values:
string[] words = str.Split('.', '-');
For your use case, a regex character class (MSDN) is a good choice:
string[] words = Regex.Split(str, "[.-]");
Note: Since - is also used to define a character range such as a-z, it's good practice to put the - at the end of the character class. Otherwise, escape it as \-.
A regex is most appropriate if you expect to need further delimiters or other requirements, or find the regex more readable, and performance isn't an issue (Regex.Split is much slower than the equivalent String.Split).

Concatenate and split strings

Maybe this is a silly question. But I haven't found an answer yet.
I've got some strings. I would like to concatenate them and then split the resulting string at a later time. I would like to know if there's something available in the .NET Framework for this. The Join and Split methods of String work quite well; the problem is escaping the separator character.
For example, I would like to use the "#" as separator character. If I have "String1", "Str#ing2" and "String3", I would like to obtain "String1#Str##ing2#String3".
Is there something that does what I need or do I have to write my own function?
Thank you.
Just escape the separator on the way in.
var inputs = new[] { "String1", "Str#ing2", "String3" };
var joined = string.Join("#", inputs.Select(i => i.Replace("#", "##")));
You can then split on single # chars.
var split = Regex.Split(joined, "(?<!#)#(?!#)");
This uses zero-width negative lookbehind/lookahead assertions to require that the characters before and after the # are not another #. After splitting, replace "##" with "#" in each piece to restore the original strings. You should, however, run some tests on cases where # is at the start or end of your input strings.
Call .Replace("#", "##") on each string before passing them to .Join
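The doubling scheme becomes ambiguous when a string starts or ends with the separator: joining "a#" and "b" gives "a###b", and so does joining "a" and "#b". One way around this, sketched below, is to swap in a different technique: use a separate escape character (a backslash here) and split by scanning characters rather than with a regex. This is not what the answers above propose; the class name, method names and the choice of backslash are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EscapedJoin
{
    private const char Separator = '#';
    private const char Escape = '\\'; // assumed escape character

    // Joins the parts with '#', prefixing every '#' or '\' inside a part with '\'.
    public static string Join(IEnumerable<string> parts)
    {
        var sb = new StringBuilder();
        bool first = true;
        foreach (string part in parts)
        {
            if (!first) sb.Append(Separator);
            first = false;
            foreach (char c in part)
            {
                if (c == Separator || c == Escape) sb.Append(Escape);
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Reverses Join: an escaped character is copied literally,
    // an unescaped '#' ends the current part.
    public static List<string> Split(string joined)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < joined.Length; i++)
        {
            char c = joined[i];
            if (c == Escape && i + 1 < joined.Length)
            {
                current.Append(joined[++i]);
            }
            else if (c == Separator)
            {
                result.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        result.Add(current.ToString());
        return result;
    }
}
```

With this sketch, "String1", "Str#ing2" and "String3" join to String1#Str\#ing2#String3 and split back into the original three strings. One caveat: an empty list joins to an empty string, which splits back into a list containing one empty string.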

Regex.Split on "() " and "?"

myString= "First?Second Third";
String[] result = Regex.Split(myString, @"( )\?");
Should result:
First,
Second,
Third
What am I missing? (I also need brackets to split on for something else)
I guess that with ( ) you meant whitespace. You don't need a capturing group there. Just use alternation or a character class:
String[] result = Regex.Split(myString, @"\s|\?");
// OR
String[] result = Regex.Split(myString, @"[\s?]");
Using string methods:
myString= "First?Second Third";
String[] result = myString.Split(' ','?');
I'm not quite sure what you are trying to do with the quotes. Remember that in a regular expression, parentheses denote a logical group; they do not escape a space. What you want is to split on an explicit set of characters, which is denoted by brackets []. You should use the following pattern to split:
String[] result = Regex.Split(myString, @"[\?\s]");
Note that \? is an escaped question mark (as you had in your original), and white-space characters are matched by \s. Thus, my solution says to split the string on any of the characters listed inside the [], namely ? (escaped as \?) and whitespace (written as \s).
EDIT AFTER MORE INFO FROM OP:
I also saw, after answering this post, that you edited the question to say you wanted a logical grouping for the white-space, in which case I would go with the pattern below. Note that inside [] the parentheses are literal characters, so this pattern also splits on ( and ).
String[] result = Regex.Split(myString, @"[\?(\s)]");
You need to put those characters inside [] to create a character class: [\s\?]. This will split on:
a space
?
You can use \s to handle "any" whitespace char.
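Since the question also mentions needing to split on brackets, here is a minimal sketch of a character class that includes whitespace, ? and literal parentheses. It assumes that runs of adjacent delimiters should count as one separator and that empty entries are unwanted; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class DelimiterSplitDemo
{
    public static string[] SplitWords(string input)
    {
        // [\s?()]+ matches one or more of: whitespace, '?', '(' or ')'.
        // Inside a character class, '?', '(' and ')' need no escaping.
        string[] parts = Regex.Split(input, @"[\s?()]+");

        // A delimiter at the very start or end yields an empty entry; drop those.
        return Array.FindAll(parts, p => p.Length > 0);
    }
}
```

For "First?Second Third" this returns First, Second, Third; for "(First) Second?" it returns First, Second.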

Regular Expression - Remove zeroes inside an expression

I need to remove leading zeroes from the numerical part of an expression (using .net 2.0 C# Regex class).
Ex:
PAR0000034 -> PAR34
WP0003204 -> WP3204
I tried the following:
//keep starting characters, get rid of leading zeroes, keep remaining digits
string result = Regex.Replace(inputStr, "^(.+)(0+)(/d*)", "$1$3", RegexOptions.IgnoreCase)
Obviously, it did not work. I need a bit of help to find the mistake.
You don't need a regular expression for that, the Split method can do that for you.
Splitting on '0', removing empty entries (i.e. those between the multiple zeroes), and limiting the result to two strings will give you the two strings before and after the leading zeroes. Then you just put those two strings together again:
string result = String.Concat(
input.Split(new char[] { '0' }, 2, StringSplitOptions.RemoveEmptyEntries)
);
In your expression the .+ part is greedy, so it consumes as much of the string as it can, including the zeroes you want to remove. Furthermore, use a backslash instead of a slash for the digit class: \d.
string result = Regex.Replace(inputStr, @"^([^0]+)(0+)(\d*)", "$1$3");
Or use look behind instead:
string result = Regex.Replace(inputStr, "(?<=[a-zA-Z])0+", "");
This works for me:
Regex.Replace("PPP00001001", "([^0]*)0+(.*)", "$1$2");
The phrase "leading zeroes" is confusing, since the zeroes you're talking about aren't actually at the beginning of the string. But if I understand you correctly, you want this:
string result = Regex.Replace(inputStr, "^(.*?)0+", "$1");
There are actually several ways to do it, with and without regex, but the above is probably the shortest and easiest to understand. The important part is the .*? lazy quantifier. This will ensure that it a) finds only the first string of zeroes, and b) deletes all the "leading" zeroes in the string.
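The variants above differ on edge cases. For example, the lazy pattern ^(.*?)0+ removes the first run of zeroes wherever it occurs, so "PAR1020" becomes "PAR12". The sketch below is one possible stricter variant, not taken from the answers: it assumes the zeroes to strip come directly after a non-digit prefix, and that one zero should be kept when the numeric part is all zeroes. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ZeroStripDemo
{
    public static string StripZeroes(string input)
    {
        // ^([^0-9]*)  the non-digit prefix, kept via $1
        // 0+          the zeroes directly after the prefix
        // (?=\d)      at least one digit must remain after the removed zeroes
        return Regex.Replace(input, @"^([^0-9]*)0+(?=\d)", "$1");
    }
}
```

With this pattern, "PAR0000034" becomes "PAR34" and "WP0003204" becomes "WP3204", while "PAR1020" is left unchanged and "PAR0000" becomes "PAR0".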

Replacing numbers in strings with C#

I thought I'd do a regex replace
Regex r = new Regex("[0-9]");
return r.Replace(sz, "#");
on a file named aa514a3a.4s5. It works exactly as I expect: it replaces all the digits, including those in the extension. How do I make it NOT replace the digits in the extension? I have tried numerous regex strings, but I am beginning to think it's an all-or-nothing pattern, so I can't do this. Do I need to separate the extension from the string, or can I use a regex?
This one does it for me:
(?<!\.[0-9a-z]*)[0-9]
This does a negative lookbehind (the string must not occur before the matched string) on a period, followed by zero or more alphanumeric characters. This ensures only numbers are matched that are not in your extension.
Obviously, the [0-9a-z] must be replaced by whichever characters you expect in your extension.
I don't think you can do that with a single regular expression.
Probably best to split the original string into base and extension; do the replace on the base; then join them back up.
Yes, I think you'd be better off separating the extension.
If you are sure there is always a 3-character extension at the end of your string, the easiest, most readable/maintainable solution would be to only perform the replace on
yourString.Substring(0, yourString.Length - 4)
...and then append
yourString.Substring(yourString.Length - 4, 4)
Why not run the regex on the substring?
String filename = "aa514a3a.4s5";
String nameonly = filename.Substring(0,filename.Length-4);
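If the extension is not always three characters long, the split-then-replace idea can be sketched with the System.IO.Path helpers instead of fixed offsets. This is a minimal sketch that assumes the input is a bare file name with no directory part (Path.GetFileNameWithoutExtension would drop any directory); the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FileNameMasker
{
    public static string MaskDigitsInBaseName(string fileName)
    {
        // Extension including the leading dot, or "" when there is none.
        string extension = Path.GetExtension(fileName);

        // Everything before the last dot.
        string baseName = Path.GetFileNameWithoutExtension(fileName);

        // Replace digits only in the base name, then re-attach the extension.
        return Regex.Replace(baseName, "[0-9]", "#") + extension;
    }
}
```

For "aa514a3a.4s5" this returns "aa###a#a.4s5"; a name without an extension, such as "report2024", has all of its digits replaced.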
