Split string by char, but skip certain char combinations

Say I have a string in a form similar to this:
"First/Second//Third/Fourth" (notice the double slash between Second and Third)
I want to be able to split this string into the following substrings: "First", "Second//Third", "Fourth". Basically, I want to split the string by a char (in this case /), but not where that char is doubled (in this case //). I thought of a number of ways to do this, but couldn't get any of them working.
I can use a solution in C# and/or JavaScript.
Thanks!
Edit: I would like a simple solution. I have already thought of parsing the string char by char, but that is too complicated for my real-life usage.

Try this C# solution, which uses positive lookbehind and positive lookahead:
string s = @"First/Second//Third/Fourth";
var values = Regex.Split(s, @"(?<=[^/])/(?=[^/])", RegexOptions.None);
It says: delimiter is / which is preceded by any character except / and followed by any character except /.
Here is another, shorter, version that uses negative lookbehind and lookahead:
var values = Regex.Split(s, @"(?<!/)/(?!/)", RegexOptions.None);
This says: delimiter is / which is not preceded by / and not followed by /.
You can find out more about 'lookarounds' here.
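The two patterns above agree on the sample string but differ at the edges of the input. A minimal sketch to compare them follows; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class SingleSlashSplit
{
    // Negative lookarounds: split on a slash with no slash on either side.
    public static string[] SplitNegative(string input)
    {
        return Regex.Split(input, @"(?<!/)/(?!/)");
    }

    // Positive lookarounds: split on a slash that has a non-slash
    // character on both sides, so a slash at the very start or end
    // of the input is never a delimiter.
    public static string[] SplitPositive(string input)
    {
        return Regex.Split(input, @"(?<=[^/])/(?=[^/])");
    }
}
```

Both methods turn "First/Second//Third/Fourth" into First, Second//Third, Fourth. For an input such as "/a/b", SplitNegative returns an empty first element followed by a and b, while SplitPositive returns /a and b.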

In .NET Regex you can do it with negative assertions: (?<!/)/(?!/) will work. Use the Regex.Split method.

One thing you can do is split the string on /. The array you get back will contain an empty element wherever // was used. Loop through the array and, where i is the index of an empty element, join elements i-1 and i+1 back together with // between them.
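A rough sketch of that split-then-merge idea, assuming the input only ever contains single and double slashes (a run of three or more is not handled specially). The names are invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class illustrating the split-and-merge approach.
public static class SlashMerger
{
    public static List<string> Split(string input)
    {
        string[] pieces = input.Split('/');
        var result = new List<string>();

        for (int i = 0; i < pieces.Length; i++)
        {
            // An empty piece between two others marks a "//" in the input.
            bool isGap = pieces[i].Length == 0
                         && result.Count > 0
                         && i + 1 < pieces.Length;
            if (isGap)
            {
                // Glue the next piece onto the previous result and skip it.
                result[result.Count - 1] += "//" + pieces[i + 1];
                i++;
            }
            else
            {
                result.Add(pieces[i]);
            }
        }
        return result;
    }
}
```

Calling SlashMerger.Split("First/Second//Third/Fourth") yields First, Second//Third, Fourth.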

How about this:
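// Assumes "%%" never appears in the input. The g flag replaces every occurrence, not just the first.
var array = "First/Second//Third/Fourth".replace(/\/\//g, "%%").split("/");
array.forEach(function(element, index) {
  array[index] = element.replace(/%%/g, "//");
});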
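The same placeholder trick can be sketched in C#, since the question allows either language. The placeholder character below is an assumption: it must be something that never occurs in the real input.

```csharp
using System;
using System.Linq;

// Hypothetical helper class illustrating the placeholder approach.
public static class PlaceholderSplit
{
    // Assumption: this control character never appears in the input.
    private const string Placeholder = "\u0001";

    public static string[] Split(string input)
    {
        return input
            .Replace("//", Placeholder)                      // hide every double slash
            .Split('/')                                      // split on the single slashes left
            .Select(part => part.Replace(Placeholder, "//")) // put the double slashes back
            .ToArray();
    }
}
```

Unlike a JavaScript replace with a string pattern, String.Replace in .NET replaces every occurrence, so no loop or global flag is needed.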

Related

Split string by different marks

How can I split a string by several different symbols, for example a dot (.) and a hyphen (-), in C#?
string str = "sally-vikram.dean.sarah-ray";
I want to do it without first replacing all of them with the same mark:
str = str.Replace("-", ".");
and then splitting by dot, for example:
string[] words = str.Split('.');
to get:
sally
vikram
dean
sarah
ray

string.Split can actually take an array of values:
string[] words = str.Split('.', '-');

For your use case, a regex character class (MSDN) is a good choice:
string[] words = Regex.Split(str, "[.-]");
Note: Since - is also used to define a character range like a-z, it's good practice to put the - at the end of the character group. Otherwise, just escape it, e.g. \-.
This is most appropriate if you expect to need further delimiters or other requirements, find the regex more readable, and performance isn't an issue (Regex.Split is much slower than the String.Split equivalent).
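To see the two answers side by side, here is a small sketch with invented method names. The third method is an addition not mentioned above: it drops the empty entries that appear when two delimiters are adjacent.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class comparing the approaches.
public static class MarkSplit
{
    public static string[] WithStringSplit(string input)
    {
        return input.Split('.', '-');
    }

    public static string[] WithRegex(string input)
    {
        // The hyphen comes last in the class, so it is a literal, not a range.
        return Regex.Split(input, "[.-]");
    }

    public static string[] WithoutEmptyEntries(string input)
    {
        return input.Split(new[] { '.', '-' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For "sally-vikram.dean.sarah-ray" all three return sally, vikram, dean, sarah, ray. For "a.-b" the first two return an empty string between a and b, and the last one returns only a and b.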

Regex matching numbers without letters in front of it

I want to match numbers like "100", "1.1", "5.404", IF they do not include a letter in front like this: "V102".
Here is my current regular expression:
(?<![A-Za-z])[0-9.]+
This is supposed to match one or more repetitions of the characters 0-9 and ., provided the prefix (A-Za-z) is absent.
But what it actually does is match V102 as 02: it just chips away the V and one more digit, and then the rest fits, while it shouldn't match that case at all. How can I make it grab the whole number first, and then check that the prefix is nonexistent?

Add digits and decimal point to your negative lookbehind:
(?<![A-Za-z0-9.])[0-9.]+
This will force every match to start either at the beginning of the input or right after a character that is not a letter, digit, or dot (i.e., a space or other separator). That way the tail end of a number will not be a valid match either.
Demo: http://www.rubular.com/r/EDuI2D9jnW
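A quick way to compare the question's pattern with the corrected one in C#; the class, method, and constant names here are made up for the sketch.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class for trying out the two patterns.
public static class NumberFinder
{
    public const string OriginalPattern = @"(?<![A-Za-z])[0-9.]+";
    public const string FixedPattern = @"(?<![A-Za-z0-9.])[0-9.]+";

    // Returns the text of every match of the pattern in the input.
    public static string[] Find(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

On the input "100 1.1 V102 5.404", the original pattern finds 100, 1.1, 02, 5.404, while the fixed pattern finds only 100, 1.1, 5.404.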

Could you use word boundaries?
\b[0-9\.]+\b

Try the regex:
(?<![A-Za-z0-9])[0-9.]+

If you don't want letters or spaces anywhere in your string, then this should work:
^[0-9.]+$

A Non-Regex solution.
If you have the following string, then you can use double.TryParse to see if the string is a double. Try:
string str = "100 1.1 V100 d333 ABC 1.1";
double temp;
string[] result = str.Split().Where(r => (double.TryParse(r, out temp))).ToArray();
Or if you need a double array in return then:
double[] numberArray = str.Split()
.Where(r => double.TryParse(r, out temp))
.Select(r => double.Parse(r))
.ToArray();
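One caveat with this approach: the two-argument double.TryParse uses the current culture, so on a machine whose culture uses a comma as the decimal separator, "1.1" may not be read as one point one. A sketch that pins the culture, with invented names, might look like this:

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper class; parses whitespace-separated tokens.
public static class NumberTokens
{
    public static double[] Parse(string input)
    {
        return input
            // A null separator array means "split on whitespace".
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(token =>
            {
                double value;
                bool ok = double.TryParse(token, NumberStyles.Float,
                                          CultureInfo.InvariantCulture, out value);
                return new { ok, value };
            })
            .Where(x => x.ok)        // keep only tokens that parsed
            .Select(x => x.value)
            .ToArray();
    }
}
```

Parsing each token once also avoids the second double.Parse call. Note that NumberStyles.Float accepts signs and exponents, so tokens such as -3 or 1e5 count as numbers here.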

Try using the caret ^ operator. This operator indicates that you want your pattern to start at the beginning of the input. For example, ^[0-9.]+ will match inputs that begin with a digit or a . followed by any number of them.
Note that this pattern does not match only numbers: it also matches inputs with more than one dot, for example 2.00.2, which is not a valid number.
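If inputs like 2.00.2 should be rejected, one possible tightening is to allow at most one dot and anchor both ends. This is a sketch under the assumption that a number means digits with an optional fractional part; signs and exponents are not covered.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for whole-string validation.
public static class StrictNumber
{
    // Digits, optionally followed by a single dot and more digits.
    private static readonly Regex Pattern = new Regex(@"^[0-9]+(\.[0-9]+)?$");

    public static bool IsNumber(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

With this pattern, 100 and 5.404 are accepted, while 2.00.2, V102, and .5 are rejected.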

Regular Expression - Remove zeroes inside an expression

I need to remove leading zeroes from the numerical part of an expression (using .net 2.0 C# Regex class).
Ex:
PAR0000034 -> PAR34
WP0003204 -> WP3204
I tried the following:
//keep starting characters, get rid of leading zeroes, keep remaining digits
string result = Regex.Replace(inputStr, "^(.+)(0+)(/d*)", "$1$3", RegexOptions.IgnoreCase)
Obviously, it did not work. I need a bit of help to find the mistake.

You don't need a regular expression for that, the Split method can do that for you.
Splitting on '0', removing empty entries (i.e. between the multiple zeroes), and limiting the result to two strings will give you the two strings before and after the leading zeroes. Then you just put those two strings together again:
string result = String.Concat(
input.Split(new char[] { '0' }, 2, StringSplitOptions.RemoveEmptyEntries)
);

In your expression the .+ part is greedy, so it grabs as much of the string as it can. Also, use a backslash instead of a slash for the digit class: \d.
string result = Regex.Replace(inputStr, @"^([^0]+)(0+)(\d*)", "$1$3");
Or use look behind instead:
string result = Regex.Replace(inputStr, "(?<=[a-zA-Z])0+", "");
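As a quick check of the lookbehind version against the two samples from the question, here is a minimal sketch with an invented wrapper name.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class wrapping the lookbehind replacement.
public static class ZeroStripper
{
    public static string Strip(string input)
    {
        // Remove any run of zeroes that directly follows a letter.
        return Regex.Replace(input, "(?<=[a-zA-Z])0+", "");
    }
}
```

ZeroStripper.Strip("PAR0000034") gives PAR34 and ZeroStripper.Strip("WP0003204") gives WP3204; the zero inside 3204 stays because it follows a digit, not a letter. An input whose numeric part is all zeroes, such as PAR000, would be reduced to PAR.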

This works for me:
Regex.Replace("PPP00001001", "([^0]*)0+(.*)", "$1$2");

The phrase "leading zeroes" is confusing, since the zeroes you're talking about aren't actually at the beginning of the string. But if I understand you correctly, you want this:
string result = Regex.Replace(inputStr, "^(.*?)0+", "$1");
There are actually several ways to do it, with and without regex, but the above is probably the shortest and easiest to understand. The important part is the .*? lazy quantifier. This will ensure that it a) finds only the first string of zeroes, and b) deletes all the "leading" zeroes in the string.
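To see why the lazy quantifier matters, it can be compared with its greedy counterpart. The greedy variant below is only there for contrast and is not something the answer recommends; the names are invented.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class contrasting lazy and greedy prefixes.
public static class LazyVsGreedy
{
    public static string Lazy(string input)
    {
        // .*? stops at the first run of zeroes.
        return Regex.Replace(input, "^(.*?)0+", "$1");
    }

    public static string Greedy(string input)
    {
        // .* runs to the end and backtracks only to the last zero.
        return Regex.Replace(input, "^(.*)0+", "$1");
    }
}
```

For "WP0003204", Lazy returns WP3204, whereas Greedy returns WP000324 because it removes only the final zero.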

Regex to isolate a specific substring

I have this string I have retrieved from a File.ReadAllText:
6 11 rows processed
As you can see there is always an integer specifying the line number in this document. What I am interested in is the integer that comes after it and the words "rows processed". So in this case I am only interested in the substring "11 rows processed".
So, knowing that each line will start with an integer and then some white space, I need to be able to isolate the integer that follows it and the words "rows processed" and return that to a string by itself.
I have been told this is easy to do with Regex, but so far I haven't the faintest clue how to build it.

You don't need regular expressions for this. Just split on the whitespace:
var fields = s.Split(new char[0], StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(String.Join(" ", fields.Skip(1)));
Here, I am using the fact that if you pass an empty array as the char [] parameter to String.Split, it splits on all whitespace.

This should work for what you need:
\d+(.*)
This searches for one or more digits (\d+) and then puts everything afterwards in a group:
. = any character
* = repeater (zero or more of the preceding value, which is any character in the above)
() = grouping
However, Jason is correct that you only need to use a split function.

If you need to use a Regex it would be like this:
string result = null;
Match match = Regex.Match(row, @"^\s*\d+\s*(.*)");
if (match.Success)
    result = match.Groups[1].Value;
The regex matches from the start of the row: first spaces, if any, then digits, and then more spaces. Finally, it captures the rest of the line, which is returned as the result.

This is done easily with Regex.Replace() using the following regular expression...
^\d+\s+
So it'd be something like this:
return Regex.Replace(text, @"^\d+\s+", "");
Basically you're just trimming the first number \d and the whitespace \s that follows.
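Since the text comes from File.ReadAllText, it may well contain many lines at once. A sketch for that case, assuming every line starts with a line number followed by spaces or tabs, adds RegexOptions.Multiline so that ^ matches at the start of each line. The class name and the second sample line are invented.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for multi-line input.
public static class LineNumberTrimmer
{
    public static string Trim(string text)
    {
        // [ \t]+ instead of \s+ so the match never runs across a line break.
        return Regex.Replace(text, @"^\d+[ \t]+", "", RegexOptions.Multiline);
    }
}
```

LineNumberTrimmer.Trim("6 11 rows processed\n7 25 rows processed") returns the two lines "11 rows processed" and "25 rows processed", still separated by the newline.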

Example in PHP (the C# regex should be compatible):
$line = "6 11 rows processed";
$resp = preg_match("/[0-9]+\s+(.*)/",$line,$out);
echo $out[1];
I hope I understood your point.

Regex which ensures no character is repeated

I need to ensure that an input string follows these rules:
It should contain upper case characters only.
NO character should be repeated in the string.
eg. ABCA is not valid because 'A' is being repeated.
For the upper case thing, [A-Z] should be fine.
But I am lost as to how to ensure there are no repeating characters.
Can someone suggest a method using regular expressions?

You can do this with .NET regular expressions although I would advise against it:
string s = "ABCD";
bool result = Regex.IsMatch(s, @"^(?:([A-Z])(?!.*\1))*$");
Instead I'd advise checking that the length of the string is the same as the number of distinct characters, and checking the A-Z requirement separately:
bool result = s.Cast<char>().Distinct().Count() == s.Length;
Alternatively, if performance is a critical issue, iterate over the characters one by one and keep a record of which you have seen.
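The character-by-character alternative could be sketched as follows. It checks both rules in one pass; the class and method names are invented, and treating the empty string as valid is an assumption.

```csharp
using System.Collections.Generic;

// Hypothetical helper class for the single-pass check.
public static class UniqueUpper
{
    public static bool IsValid(string input)
    {
        var seen = new HashSet<char>();
        foreach (char c in input)
        {
            // Rule 1: only the upper case letters A to Z are allowed.
            if (c < 'A' || c > 'Z') return false;
            // Rule 2: HashSet.Add returns false if the character was already seen.
            if (!seen.Add(c)) return false;
        }
        return true;
    }
}
```

UniqueUpper.IsValid("ABCD") is true, while "ABCA" (repeated A) and "ABcD" (lower case c) are both rejected.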

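This cannot be done with regular expressions in the strict formal sense, because a regular language cannot remember which characters it has already seen (short of enumerating every combination). Backreferences, which engines such as .NET add, go beyond that formalism; without them, the only way to achieve this is to write the function by hand.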
See formal grammar for background theory.

Why not check instead for a character that is either repeated or not in uppercase? With something like ([A-Z])?.*?([^A-Z]|\1)

Use negative lookahead and backreference.
string pattern = @"^(?!.*(.).*\1)[A-Z]+$";
string s1 = "ABCDEF";
string s2 = "ABCDAEF";
string s3 = "ABCDEBF";
Console.WriteLine(Regex.IsMatch(s1, pattern));//True
Console.WriteLine(Regex.IsMatch(s2, pattern));//False
Console.WriteLine(Regex.IsMatch(s3, pattern));//False
\1 matches the first captured group. Thus the negative lookahead fails if any character is repeated.

This isn't regex, and would be slow, but you could create an array of the contents of the string, sort it, and then iterate through the array comparing element n to element n+1.

It can be done using what is called a backreference.
I am a Java programmer, so I will show you how it is done in Java (for C#, see here).
final Pattern aPattern = Pattern.compile("([A-Z]).*\\1");
final Matcher aMatcher1 = aPattern.matcher("ABCDA");
System.out.println(aMatcher1.find());
final Matcher aMatcher2 = aPattern.matcher("ABCDA");
System.out.println(aMatcher2.find());
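The regular expression is ([A-Z]).*\\1, which translates to: any character from 'A' to 'Z' as group 1 (([A-Z])), then anything else (.*), then group 1 again (\\1).
In a C# pattern the backreference is also written \1; $1 is the syntax used in replacement strings.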
Hope this helps.
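For readers working in C#, a rough equivalent of the Java snippet is shown here. Inside a .NET pattern the backreference is written \1, and a verbatim string avoids doubling the backslash. The class and method names are invented.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class mirroring the Java example.
public static class RepeatDetector
{
    // True if some upper case letter occurs again later in the input.
    public static bool HasRepeat(string input)
    {
        return Regex.IsMatch(input, @"([A-Z]).*\1");
    }
}
```

RepeatDetector.HasRepeat("ABCDA") is true and RepeatDetector.HasRepeat("ABCD") is false. This only detects a repeat; the upper-case-only rule still has to be checked separately.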
