Regex.Split adding empty strings to result array - c#

I have a Regex to split out words operators and brackets in simple logic statements (e.g. "WORD1 & WORD2 | (WORd_3 & !word_4 )". the Regex I've come up with is "(?[A-Za-z0-9_]+)|(?[&!\|()]{1})". Here is a quick test program.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("* Test Project *");
string testExpression = "!(LIONV6 | NOT_superCHARGED) &RHD";
string removedSpaces = testExpression.Replace(" ", "");
string[] expectedResults = new string[] { "!", "(", "LIONV6", "|", "NOT_superCHARGED", ")", "&", "RHD" };
string[] splits = Regex.Split(removedSpaces, #"(?[A-Za-z0-9_]+)|(?[&!\|()]{1})");
Console.WriteLine("Expected\n{0}\nActual\n{1}", expectedResults.AllElements(), splits.AllElements());
Console.WriteLine("*** Any Key to finish ***");
Console.ReadKey();
}
}
public static class Extensions
{
public static string AllElements(this string[] str)
{
string output = "";
if (str != null)
{
foreach (string item in str)
{
output += "'" + item + "',";
}
}
return output;
}
}
The Regex does the required job of splitting out words and operators into an array in the right sequence, but the result array contains many empty elements, and I can't work out why. Its not a serious problem as I just ignore empty elements when consuming the array but I'd like Regex to do all the work if possible, including ignoring spaces.

Try this:
string[] splits = Regex.Split(removedSpaces, #"(?[A-Za-z0-9_]+)|(?[&!\|()]{1})").Where(x => x != String.Empty);

The spaces are jsut becasue of the way the split works. From the help page:
If multiple matches are adjacent to one another, an empty string is inserted into the array.
What split is doing as standard is taking your matches as delimiters. So in effect the standard that would be returned is a lot of empty strings between the adjacent matches (imagine as a comparison what you might expect if you split ",,,," on ",", you'd probably expect all the gaps.
Also from that help page though is:
If capturing parentheses are used in a Regex.Split expression, any
captured text is included in the resulting string array.
This is the reason you are getting what you actually want in there at all. So effectively it is now showing you the text that has been split (all the empty strings) with the delimiters too.
What you are doing may well be better off done with just matching the regular expression (with Regex.Match) since what is in your regular expression is actually what you want to match.
Something like this (using some linq to convert to a string array):
Regex.Matches(testExpression, #"([A-Za-z0-9_]+)|([&!\|()]{1})")
.Cast<Match>()
.Select(x=>x.Value)
.ToArray();
Note that because this is taking positive matches it doesn't need the spaces to be removed first.

var matches = Regex.Matches(removedSpaces, #"(\w+|[&!|()])");
foreach (var match in matches)
Console.Write("'{0}', ", match); // '!', '(', 'LIONV6', '|', 'NOT_superCHARGED', ')', '&', 'RHD',
Actually, you don't need to delete spaces before extracting your identifiers and operators, the regex I proposed will ignore them anyway.

Related

Splitting a string with a space/two spaces after the character

Consider a number of strings, which are assumed to contain "keys" of the form "Wxxx", where x are digits from 0-9. Each one can contain either one only, or multiple ones, separated by ',' followed by two spaces. For example:
W123
W432
W546, W234, W167
The ones that contain multiple "keys" need to be split up, into an array. So, the last one in the above examples should be split into an array like this: {"W546", "W234", "W167"}.
As a quick solution, String.Split comes to mind, but as far as I am aware, it can take one character, like ','. The problem is that it would return an array with like this: {"W546", " W234", " W167"}. The two spaces in all the array entries from the second one onwards can probably be removed using Substring, but is there a better solution?
For context, these values are being held in a spreadsheet, and are assumed to have undergone data validation to ensure the "keys" are separated by a comma followed by two spaces.
while ((ws.Cells[row,1].Value!=null) && (ws.Cells[row,1].Value.ToString().Equals("")))
{
// there can be one key, or multiple keys separated by ','
if (ws.Cells[row,keysCol].Value.ToString().Contains(','))
{
// there are multiple
// need to split the ones in this cell separated by a comma
}
else
{
// there is one
}
row++;
}
You can just specify ',' and ' ' as separators and RemoveEmptyEntries.
Using your sample of single keys and a string containing multiple keys you can just handle them all the same and get your list of individual keys:
List<string> cells = new List<string>() { "W123", "W432", "W546, W234, W167" };
List<string> keys = new List<string>();
foreach (string cell in cells)
{
keys.AddRange(cell.Split(new char[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries));
}
Split can handle strings where's nothing to split and AddRange will accept your single keys as well as the multi-key split results.
You could use an old favorite--Regular Expressions.
Here are two flavors 'Loop' or 'LINQ'.
static void Main(string[] args)
{
var list = new List<string>{"W848","W998, W748","W953, W9484, W7373","W888"};
Console.WriteLine("LINQ");
list.ForEach(l => TestSplitRegexLinq(l));
Console.WriteLine();
Console.WriteLine("Loop");
list.ForEach(l => TestSplitRegexLoop(l));
}
private static void TestSplitRegexLinq(string s)
{
string pattern = #"[W][0-9]*";
var reg = new Regex(pattern);
reg.Matches(s).ToList().ForEach(m => Console.WriteLine(m.Value));
}
private static void TestSplitRegexLoop(string s)
{
string pattern = #"[W][0-9]*";
var reg = new Regex(pattern);
foreach (Match m in reg.Matches(s))
{
Console.WriteLine(m.Value);
}
}
Just replace the Console.Write with anything you want: eg. myList.Add(m.Value).
You will need to add the NameSpace: using System.Text.RegularExpressions;
Eliminate the extra space first (using Replace()), then use split.
var input = "W546, W234, W167";
var normalized = input.Replace(", ",",");
var array = normalized.Split(',');
This way, you treat a comma followed by a space exactly the same as you'd treat a comma. If there might be two spaces you can also replace that:
var input = "W546, W234, W167";
var normalized = input.Replace(" "," ").Replace(", ",",");
var array = normalized.Split(',');
After trying this in .NET fiddle, I think I may have a solution:
// if there are multiple
string keys = ws.Cells[row,keysCol].Value.ToString();
// remove spaces
string keys_normalised = keys.Replace(" ", string.Empty);
Console.WriteLine("Checking that spaces have been removed: " + keys3_normalised + "\n");
string[] splits = keys3_normalised.Split(',');
for (int i = 0; i < splits.Length; i++)
{
Console.WriteLine(splits[i]);
}
This produces the following output in the console:
Checking that spaces have been removed: W456,W234,W167
W456
W234
W167

How to use (?!...) regex pattern to skip the whole unmatched part?

I would like to use the ((?!(SEPARATOR)).)* regex pattern for splitting a string.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var separator = "__";
var pattern = String.Format("((?!{0}).)*", separator);
var regex = new Regex(pattern);
foreach (var item in regex.Matches("first__second"))
Console.WriteLine(item);
}
}
It works fine when a SEPARATOR is a single character, but when it is longer then 1 character I get an unexpected result. In the code above the second matched string is "_second" instead of "second". How shall I modify my pattern to skip the whole unmatched separator?
My real problem is to split lines where I should skip line separators inside quotes. My line separator is not a predefined value and it can be for example "\r\n".
You can do something like this:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "plum--pear";
string pattern = "-"; // Split on hyphens
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
}
}
// The method displays the following output:
// 'plum'
// ''
// 'pear'
The .NET regex does not does not support matching a piece of text other than a specific multicharacter string. In PCRE, you would use (*SKIP)(*FAIL) verbs, but they are not supported in the native .NET regex library. Surely, you might want to use PCRE.NET, but .NET regex can usually handle those scenarios well with Regex.Split
If you need to, say, match all but [anything here], you could use
var res = Regex.Split(s, #"\[[^][]*]").Where(m => !string.IsNullOrEmpty(m));
If the separator is a simple literal fixed string like __, just use String.Split.
As for your real problem, it seems all you need is
var res = Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
See the regex demo
It matches 1+ (due to the final +) occurrences of ", 0+ chars other than " and then " (the "[^"]*" branch) or (|) any char but CR, LF or/and " (see [^\r\n"]).

Is it possible to split a string into an array of strings and remove sections not between delimiters using String.Split or regex?

I was wanting to split a string with a known delimiter between different parts into an array of strings using a method (e.g. MethodToSplitIntoArray(String toSplit) like in the example below. The values are string values which can have any character except for '{', '}', or ',' so am unable to delimit on any other character. The string can also contain undesired white space at the start and end as the file can be generated from multiple different sources, the desired information will also be inbetween "{" "}" and separated by a comma.
String valueCombined = " {value},{value1},{value2} ";
String[] values = MethodToSplitIntoArray(valueCombined);
foreach(String value in values)
{
//Do something with array
Label.Text += "\r\nString: " + value;
}
Where the label would show:
String: value
String: value1
String: value2
My current implementation of splitting method is below. It splits the values but includes any spaces before the first parenthesis and anything between them.
private String[] MethodToSplitIntoArray(String toSplit)
{
return filesPassed.Split(new string[] { "{", "}" }, StringSplitOptions.RemoveEmptyEntries);
}
I though this would separate out the strings between the curly braces and remove the rest of the string, but my output is:
String:
String: value
String: ,
String: value1
String: ,
String: value2
String:
What am I doing wrong in my split that I'm still getting the string values outside of the parenthesis? Ideally I would like to use regex or String.Split if its possible
For those with similar problems check out DotNet Perls on splitting
Making the assumption that commas are not permitted inside a curly brace pair, and that outside a curly brace pair only commas or whitespace will appear, it seems to me that the most straightforward, easy-to-read way to approach this is to first split on commas, then trim the results of that (to remove whitespace), and then finally to remove the first and last characters (which at that point should only be the curly braces):
valuesCombined.Split(',').Select(s => s.Trim().Substring(1, s.Length - 2)).ToArray();
I believe that including the curly braces in the initial split operation just makes everything harder, and is more likely to break in hard-to-identify ways (i.e. bad data will result in weirder results than if you use something like the above).
Add , to delimeters:
return filesPassed.Split(new char[] { '{', '}', ',' }, StringSplitOptions.RemoveEmptyEntries);
Not sure if you are expecting those spaces in the front and end so added some trimming to prevent empty results for those.
private String[] MethodToSplitIntoArray(String toSplit)
{
return toSplit.Trim().Split(new char[] { '{', '}', ',' }, StringSplitOptions.RemoveEmptyEntries);
}
This might be one of the way to get all the values as u are looking for
String valueCombined = " {value},{value1},{value2} ";
String[] values = valueCombined.Split(new string[] { "},{" }, StringSplitOptions.RemoveEmptyEntries);
int lastVal = values.Count() - 1;
values[0] = values[0].Replace("{", "");
values[lastVal] = values[lastVal].Replace("}", "");
What I did here is that splited the string with "},{" and then removed { from the first array item and } from the last array item.
Try regex and linq.
return Regex.Split(toSplit, "[.{.}.,]").Where(x => !string.IsNullOrWhiteSpace(x)).ToArray();
Though very late but can you try this:
Regex.Split(" { value},{ value1},{ value2};", #"\s*},{\s*|{\s*|},?;?").Where(s => string.IsNullOrWhiteSpace(s) == false).ToArray()

Split string that contains array to get array of values C#

I have a string that contains an array
string str = "array[0]=[1,a,3,4,asdf54,6];array[1]=[1aaa,2,4,k=6,2,8];array[2]=[...]";
I'd like to split it to get an array like this:
str[0] = "[1,a,3,4,asdf54,6]";
str[1] = "[1aaa,2,4,k=6,2,8]";
str[2] = ....
I've tried to use Regex.Split(str, #"\[\D+\]") but it didn't work..
Any suggestions?
Thanks
SOLUTION:
After seen your answers I used
var arr = Regex.Split(str, #"\];array\[[\d, -]+\]=\[");
This works just fine, thanks all!
var t = str.Split(';').Select(s => s.Split(new char[]{'='}, 2)).Select(s => s.Last()).ToArray();
In regex, \d matches any digit, whilst \D matches anything that is not a digit. I assume your use of the latter is erroneous. Additionally, you should allow your regex to also match negation signs, commas, and spaces, using the character class [\d\-, ]. You can also include a lookahead lookbehind for the = character, written as (?<=\=), in order to avoid getting the [0], [1], [2], ...
string str = "array[0]=[1,2,3,4,5,6];array[1]=[1,2,4,6,2,8];array[2]=[...]";
string[] results = Regex.Matches(str, #"(?<=\=)\[[\d\-, ]+\]")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
Try this - using regular expression look behind to grab the relevant parts of your string.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
namespace RegexSplit
{
class Program
{
static void Main(string[] args)
{
string str = "array[0]=[1,2,3,4,5,6];array[1]=[1,2,4,6,2,8];array[2]=[...]";
Regex r = new Regex(#"(?<=\]=)(\[.+?\])");
string[] results = r.Matches(str).Cast<Match>().Select(p => p.Groups[1].Value).ToArray();
}
}
}
BONUS - convert to int[][] if you are fancy.
int[][] ints = results.Select(p => p.Split(new [] {'[', ',', ']'}, StringSplitOptions.RemoveEmptyEntries)
.Where(s => { int temp; return int.TryParse(s, out temp);}) //omits the '...' match from your sample. potentially you could change the regex pattern to only catch digits but that isnt what you wanted (i dont think)
.Select(s => int.Parse(s)).ToArray()).ToArray();
Regex would be an option, but it would be a bit complicated. Assuming you don't have a parser for your input, you can try the following:
-Split the string by ; characters, and you'd get a string array (e.g. string[]):
"array[0]=[1,2,3,4,5,6]", "array[1]=[1,2,4,6,2,8]", "array[2]=[...]". Let's call it list.
Then for each of the elements in that array (assuming the input is in order), do this:
-Find the index of ]=, let that be x.
-Take the substring of your whole string from the starting index x + 2. Let's call it sub.
-Assign the result string as the current string in your array, e.g. if you are iterating with a regular for loop, and your indexing variable is i such as for(int i = 0; i < len; i++){...}:
list[i] = sub.
I know it is a dirty and an error-prone solution, e.g. if input is array[0] =[1,2... instead of array[0]=[1,2,... it won't work due to the extra space there, but if your input mathces that exact pattern (no extra spaces, no newlines etc), it will do the job.
UPDATE: cosset's answer seems to be the most practical and easiest way to achieve your result, especially if you are familiar with LINQ.
string[] output=Regex.Matches(input,#"(?<!array)\[.*?\]")
.Cast<Match>()
.Select(x=>x.Value)
.ToArray();
OR
string[] output=Regex.Split(input,#";?array[\d+]=");
string str = "array[0]=[1,a,3,4,asdf54,6];array[1]=[1aaa,2,4,k=6,2,8];array[2]=[2,3,2,3,2,3=3k3k]";
string[] m1 = str.Split(';');
List<string> m3 = new List<string>();
foreach (string ms in m1)
{
string[] m2 = ms.Split('=');
m3.Add(m2[1]);
}

String.Split cut separator

Is it possible to use String.Split without cutting separator from string?
For example I have string
convertSource = "http://www.domain.com http://www.domain1.com";
I want to build array and use code below
convertSource.Split(new[] { " http" }, StringSplitOptions.RemoveEmptyEntries)
I get such array
[1] http://www.domain.com
[2] ://www.domain1.com
I would like to keep http, it seems String.Split not only separate string but also cut off separator.
This is screaming for Regular Expressions:
Regex regEx = new Regex(#"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)");
Match match= regEx.Match("http://www.domain.com http://www.domain1.com");
IList<string> values = new List<string>();
while (match.Success)
{
values.Add(match.Value);
match = match.NextMatch();
}
string[] array = Regex.Split(convertSource, #"(?=http://)");
That's because you use " http" as separator.
Try this:
string separator = " ";
convertSource.Split(separator.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
The Split method works in a way that when it comes to the separator you provide it cuts it off right there and removes the separator from the string also.
From what you are saying you want to do there are other ways to split the string keeping the delimiters and then if you only want to remove leading or trailing spaces from your string then I wouuld suggest that you use .Trim() method: convertSource.Trim()

Categories

Resources