How can I split the following string into a string array

I want to split the following:
name[]address[I]dob[]nationality[]occupation[]
So my results would be:
name[]
address[I]
dob[]
nationality[]
occupation[]
I have tried using Regex.Split but can't get these results.

You can use Regex.Split with the following regex:
(?<=])(?=[a-z])
which will split between a closing square bracket on the left and a letter on the right. This is done using lookaround assertions. They don't consume any characters, so in this situation they are handy for matching a position between characters.
Basically it means exactly what I wrote: (?<=]) will match a point in the string preceded by a closing bracket, while (?=[a-z]) matches a point in the string (both zero-width, i.e. between characters) where a letter follows. You can tweak that a little if your input data looks different from what you gave us in the question.
You could also simplify it a little, at the expense of readability, by using (?<=])\b. But I would advise against that since \b is tied to \w which is a really ugly thing, usually. It would work roughly the same, but not quite, as \b in this context amounts to (?=[\w]) and \w matches a lot more things, namely decimal digits and an underscore too.
Quick PowerShell test (it uses the same regex implementation since it's .NET underneath):
PS> 'name[]address[I]dob[]nationality[]occupation[]' -split '(?<=])(?=[a-z])'
name[]
address[I]
dob[]
nationality[]
occupation[]
Just for completeness, there is also another option. You can either split the string between the tokens you want to retain, or you could just collect all matches of tokens you want to keep. In the latter case you'll need a pattern that matches what you need, such as
[a-z]+\[[^\]]*]
or what Dennis gave as an answer (I just tend to avoid \w and \b except for quick and dirty hacks or golfing since I maintain that they have no useful application). You can use that with Regex.Matches.
Generally both approaches can work fine, it then depends on whether the split or the match pattern is easier to understand. And for Regex.Matches you'll get Match objects so you don't actually end up with a string[] if you need that, so that'd require .Select(m => m.Value) as well.
In this case I guess neither regex should be left alone without a comment explaining what it does. I can read them just fine, but many developers are a little uneasy around regexes and especially more advanced concepts like lookaround often warrant an explanation.
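Both approaches from this answer might look as follows in C#. This is only a minimal sketch: the class and method names (BracketTokens, BySplit, ByMatches) are made up for illustration, and it assumes the field names consist of lowercase letters as in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketTokens
{
    // Split at every position that is preceded by ']' and followed by a
    // lowercase letter. Both lookarounds are zero-width, so nothing is removed.
    public static string[] BySplit(string input) =>
        Regex.Split(input, @"(?<=])(?=[a-z])");

    // Alternative: collect every "name[...]" token instead of splitting
    // between them. Regex.Matches yields Match objects, hence the Select.
    public static string[] ByMatches(string input) =>
        Regex.Matches(input, @"[a-z]+\[[^\]]*]")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}
```

For the input from the question, both methods should return the same five elements, from name[] through occupation[].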

text.Split(new Char[] { ']' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s + "]").ToArray();

Use this regex pattern:
\w*\[\w*\]

Regular expressions should work fine. You could also locate the opening and closing square brackets with string.IndexOf, for example:
IEnumerable<string> Results(string input)
{
    int currentIndex = -1;
    while (true)
    {
        // Start each search one past the previous closing bracket
        currentIndex++;
        int openingBracketIndex = input.IndexOf("[", currentIndex);
        int closingBracketIndex = input.IndexOf("]", currentIndex);
        if (openingBracketIndex == -1 || closingBracketIndex == -1)
            yield break;
        yield return input.Substring(currentIndex, closingBracketIndex - currentIndex + 1);
        currentIndex = closingBracketIndex;
    }
}

string inputString = "name[]address[I]dob[]nationality[]occupation[]";
var result = Regex.Matches(inputString, @".*?\[I?\]").Cast<Match>().Select(m => m.Groups[0].Value).ToArray();

Related

Escaping \x from strings

Well, I got this little method:
static string escapeString(string str) {
    string s = str.Replace(@"\r", "\r").Replace(@"\n", "\n").Replace(@"\t", "\t");
    Regex regex = new Regex(@"\\x(..)");
    var matches = regex.Matches(s);
    foreach (Match match in matches) {
        s = s.Replace(match.Value, ((char)Convert.ToByte(match.Value.Replace(@"\x", ""), 16)).ToString());
    }
    return s;
}
It replaces "\x65" in the string I get in args[0].
But my problem is: "\\x65" will be replaced too, so I get "\e". I tried to figure out a regex which would check whether there is more than one backslash, but I had no luck.
Can somebody give me a hint?

You can continue to hack regexes together with things like "\s|\w\x(..)" to rule out the \\x65 case. Obviously that will be brittle, since there is no guarantee that your sequence \x65 always has a space or word character in front of it. It could be at the beginning of the file. Also, your regex will match \xTT, which obviously isn't a valid hex escape. Consider replacing the '.' with a character class, like "\\x([0-9a-fA-F]{2})".
If this was a school project, I would do something like the following. You can replace every "\\" with an unlikely sequence, like "#!!#!!#", run the regex and replacements, and then replace every occurrence of the unlikely sequence with "\". For example:
String s = inputString.Replace(@"\\", @"_#!!#!!#_");
// do all of the regex, replacements, etc here
String output = s.Replace(@"_#!!#!!#_", @"\");
However, you shouldn't do this in production code, because if your input stream ever contains the magic sequence then you will get extra backslashes.
It's obvious that you are writing some kind of interpolator. I feel obligated to recommend looking into something more robust, like lexers that use regexes to form finite state machines. Wiki has some great articles on this topic, and I'm a big fan of ANTLR. It may be overengineering now, but if you keep running into these special cases, consider solving your problem in a more general way.
Start reading here for the theory: http://en.wikipedia.org/wiki/Lexical_analysis
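Short of a full lexer, one way to avoid both the placeholder trick and the double-replacement problem is to decode every escape in a single left-to-right pass, so that "\\" is consumed as a unit before the following "x65" is ever looked at. The sketch below is only an illustration under assumptions: the names EscapeDecoder and Unescape are hypothetical, "\\" is assumed to decode to a single backslash, and unknown escapes are assumed to be left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDecoder
{
    // A backslash followed by either "x" plus two hex digits (group 1)
    // or any single character (group 2). Because the regex engine scans
    // left to right, "\\" is matched as one escape and the text after it
    // is not re-examined as part of another escape.
    private static readonly Regex Escape =
        new Regex(@"\\(?:x([0-9A-Fa-f]{2})|(.))", RegexOptions.Singleline);

    public static string Unescape(string input) =>
        Escape.Replace(input, m =>
        {
            if (m.Groups[1].Success)
            {
                // \xHH: decode the two hex digits into one character.
                return ((char)Convert.ToByte(m.Groups[1].Value, 16)).ToString();
            }
            switch (m.Groups[2].Value)
            {
                case "r": return "\r";
                case "n": return "\n";
                case "t": return "\t";
                case "\\": return "\\";   // assumption: "\\" means one backslash
                default: return m.Value;  // unknown escape: leave it as it is
            }
        });
}
```

With this sketch, the input \x65 decodes to e, while \\x65 decodes to the literal text \x65.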

Use a negative look-behind:
Regex regex = new Regex(@"(?<!([^\\]|^)\\)\\x(..)");
This asserts that the previous character is not a solo backslash, but without capturing the previous character (look-arounds do not capture).

C# - Splitting on a pipe with an escaped pipe in the data?

I've got a pipe delimited file that I would like to split (I'm using C#). For example:
This|is|a|test
However, some of the data can contain a pipe in it. If it does, it will be escaped with a backslash:
This|is|a|pip\|ed|test (this is a pip|ed test)
I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant and I can't help but think there's a better way. Thanks for any help.

Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.
I do a lot of parsing like this, and it is really pretty straightforward. This approach, if done correctly, will also tend to run faster.
EDIT
In fact, maybe something like this. It would be interesting to see how this compares performance-wise to a RegEx solution.
public List<string> ParseWords(string s)
{
    List<string> words = new List<string>();
    int pos = 0;
    while (pos < s.Length)
    {
        // Get word start
        int start = pos;
        // Get word end
        pos = s.IndexOf('|', pos);
        while (pos > 0 && s[pos - 1] == '\\')
        {
            pos++;
            pos = s.IndexOf('|', pos);
        }
        // Adjust for pipe not found
        if (pos < 0)
            pos = s.Length;
        // Extract this word
        words.Add(s.Substring(start, pos - start));
        // Skip over pipe
        if (pos < s.Length)
            pos++;
    }
    return words;
}

This oughta do it:
string test = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
string[] parts = Regex.Split(test, @"(?<!(?<!\\)*\\)\|");
The regular expression basically says: split on pipes that aren't preceded by an escape character. I shouldn't take any credit for this though, I just hijacked the regular expression from this post and simplified it.
EDIT
In terms of performance, compared to the manual parsing method provided in this thread, I found that this Regex implementation is 3 to 5 times slower than Jonathon Wood's implementation using the longer test string provided by the OP.
With that said, if you don't instantiate or add the words to a List<string> and return void instead, Jon's method comes in at about 5 times faster than the Regex.Split() method (0.01ms vs. 0.002ms) for purely splitting up the string. If you add back the overhead of managing and returning a List<string>, it was about 3.6 times faster (0.01ms vs. 0.00275ms), averaged over a few sets of a million iterations. I did not use the static Regex.Split() for this test, I instead created a new Regex instance with the expression above outside of my test loop and then called its Split method.
UPDATE
Using the static Regex.Split() function is actually a lot faster than reusing an instance of the expression. With this implementation, the use of regex is only about 1.6 times slower than Jon's implementation (0.0043ms vs. 0.00275ms)
The results were the same using the extended regular expression from the post I linked to.

I came across a similar scenario. In my case the number of unescaped pipes (not counting "\|") was fixed. This is how I handled it:
string sPipeSplit = "This|is|a|pip\\|ed|test (this is a pip|ed test)";
string sTempString = sPipeSplit.Replace("\\|", "¬"); //replace \| with a placeholder character that does not occur in the data
string[] sSplitString = sTempString.Split('|');
//string sFirstString = sSplitString[0].Replace("¬", "\\|"); //If you have a fixed number of fields and copy each one elsewhere, restore the pipe while copying.
/* Or you could use a loop to replace everything at once
for (int i = 0; i < sSplitString.Length; i++)
{
    sSplitString[i] = sSplitString[i].Replace("¬", "\\|");
}
*/

Here is another solution.
One of the most beautiful things about programming is that there are several ways to solve the same problem:
string text = @"This|is|a|pip\|ed|test"; //The original text
string parsed = ""; //Where you will store the parsed string
bool flag = false;
foreach (var x in text.Split('|')) {
    bool endsWithArroba = x.EndsWith(@"\");
    parsed += flag ? "|" + x + " " : endsWithArroba ? x.Substring(0, x.Length-1) : x + " ";
    flag = endsWithArroba;
}

Cory's solution is pretty good. But if you prefer not to work with Regex, you can simply search for "\|" and replace it with some other character, then do your split, then replace that character with "\|" again.
Another option is to do the split, then examine all the strings, and if the last character of one is a \, join it with the next string.
Of course, all this ignores what happens if you need an escaped backslash before a pipe, like "\\|".
Overall, I lean towards regex though.
Frankly, I prefer to use FileHelpers because, even though this isn't comma-delimited, it's basically the same thing. And they have a great story about why you shouldn't write this stuff yourself.

You can do this with a regex. Once you decide to use a backslash as your escape character, you have two escape cases to account for:
Escaping a pipe: \|
Escaping a backslash that you want interpreted literally.
Both of these can be done in the same regex. Escaped backslashes will always be two \ characters together. Consecutive, escaped backslashes will always be even numbers of \ characters. If you find an odd-numbered sequence of \ before a pipe, it means you have several escaped backslashes, followed by an escaped pipe. So you want to use something like this:
/^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*/
Confusing, perhaps, but it should work. Explanation:
^ #The start of a line
(?:...
[^|\\] #A character other than | or \ OR
(?:\\{2})* #An even number of \ characters OR
\\\| #A literal \ followed by a literal |
...)+ #Repeat the preceding at least once
(?:$|\|) #Either a literal | or the end of a line
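The two escape cases described above can also be handled without a regex, by walking the characters once and treating a backslash as "take the next character literally". The following is a minimal sketch, not a definitive implementation: the names PipeSplitter and Split are hypothetical, and it assumes that a backslash escapes whatever single character follows it and that a trailing lone backslash is kept as-is.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PipeSplitter
{
    // Splits on unescaped '|'. "\|" yields a literal pipe and "\\" a
    // literal backslash in the resulting field.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '\\' && i + 1 < line.Length)
            {
                // Escape: copy the next character verbatim and skip past it.
                current.Append(line[++i]);
            }
            else if (c == '|')
            {
                // Unescaped pipe: close the current field.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        // The last field has no trailing pipe.
        fields.Add(current.ToString());
        return fields;
    }
}
```

Unlike the split-based approaches, this one also removes the escape characters, so the field pip\|ed comes back as pip|ed.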

What is the best way of splitting up a string by capital letters in C#?

What is the best way of splitting up a string by capital letters in C#?
Example:
HelloStackOverflow Users.How Are you doing?
Expected result:
Hello Stack Overflow Users. How are you doing?

You can use a regex:
static readonly Regex splitter = new Regex(@"\s+|(?=\s*[A-Z]+)|(?<=[,.?!])");
var spacedOut = splitter.Replace(str, " ");
This uses a lookahead to match the spot before a capital letter (with \s* to swallow the whitespace).
It uses a lookbehind to match the spot after punctuation.

It depends how you define "best".
Unless you want a trivial implementation (blindly insert a space in front of every uppercase letter), I'd avoid regex and just write the few lines of code that do precisely what I need - create a destination StringBuilder, do a foreach through the characters of the string, copying characters across and inserting extra spaces when appropriate - you'll just need to keep a state variable to know if the previous character was uppercase. This will make it easy to handle all the possible special cases (first character is uppercase, acronyms, characters following punctuation or whitespace, single words like "A", culture-sensitive handling, etc).
Why wouldn't I use regex?
Firstly, if you want to handle all the special cases well, you'll probably need quite advanced regex skills, and the result will be an undecipherable "magic string" (difficult to read/maintain, as perfectly demonstrated by @SLaks IMHO - can you read and understand his regex in under 10 seconds?). A simple loop will be much easier to write, test, debug, read and upgrade unless you (and anyone else who might have to read/maintain your code in future) have been doing regexes for years.
Secondly, a loop through the characters is very simple. The regex will almost certainly be slower due to the higher level of generalisation it provides. This may or may not be an issue for you, but efficiency could be a significant factor when defining "best".
Thirdly, I'm an old dog and I don't see much point in using clever new tricks to solve problems that a simple for loop can handle :-) ... I often see programmers using "cool" obfuscated LINQ queries and Regexes in place of a simple 2-or-3-line loop, and it makes me think of the old adage "to a man with a hammer, everything looks like a nail". Regex, like all tools, has its place. And I'm not convinced this justifies anything that complex.

I'm an oldschool guy, I would write it using StringBuilder because I do not speak regexish:
var sb = new StringBuilder(input.Length);
int nextIndexToAdd = 0;
for (int i = 1; i < input.Length; i++)
    if (char.IsUpper(input[i])
        && !char.IsWhiteSpace(input[i - 1])
        && (!char.IsUpper(input[i - 1]) || (i < input.Length - 1 && !char.IsUpper(input[i + 1]))))
    {
        sb.Append(input.Substring(nextIndexToAdd, i - nextIndexToAdd));
        sb.Append(" ");
        nextIndexToAdd = i;
    }
sb.Append(input.Substring(nextIndexToAdd));
string result = sb.ToString();
This handles both IAmFromUSA and HelloStack...

Regex which ensures no character is repeated

I need to ensure that an input string follows these rules:
It should contain upper case characters only.
NO character should be repeated in the string.
e.g. ABCA is not valid because 'A' is repeated.
For the upper case part, [A-Z] should be fine.
But I am lost as to how to ensure there are no repeated characters.
Can someone suggest a method using regular expressions?

You can do this with .NET regular expressions although I would advise against it:
string s = "ABCD";
bool result = Regex.IsMatch(s, @"^(?:([A-Z])(?!.*\1))*$");
Instead I'd advise checking that the length of the string is the same as the number of distinct characters, and checking the A-Z requirement separately:
bool result = s.Cast<char>().Distinct().Count() == s.Length;
Alternatively, if performance is a critical issue, iterate over the characters one by one and keep a record of which ones you have seen.
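The one-pass variant could be sketched like this. The names UniqueUppercase and IsValid are hypothetical, and the sketch assumes that "upper case" means the ASCII letters A-Z, as in the regex above. Like that regex, it accepts the empty string.

```csharp
using System.Collections.Generic;

public static class UniqueUppercase
{
    // True when every character is in A-Z and none occurs twice.
    public static bool IsValid(string s)
    {
        var seen = new HashSet<char>();
        foreach (char c in s)
        {
            // HashSet<char>.Add returns false if c was already present,
            // so the loop stops at the first repeated or invalid character.
            if (c < 'A' || c > 'Z' || !seen.Add(c))
            {
                return false;
            }
        }
        return true;
    }
}
```

This does both checks in a single loop and returns early, without building an intermediate collection of distinct characters.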

This cannot be done via regular expressions, because they are context-free. You need at least a context-sensitive grammar, so the only way to achieve this is to write the function by hand.
See formal grammar for background theory.

Why not check for a character which is repeated or not in uppercase instead ? With something like ([A-Z])?.*?([^A-Z]|\1)

Use negative lookahead and backreference.
string pattern = @"^(?!.*(.).*\1)[A-Z]+$";
string s1 = "ABCDEF";
string s2 = "ABCDAEF";
string s3 = "ABCDEBF";
Console.WriteLine(Regex.IsMatch(s1, pattern));//True
Console.WriteLine(Regex.IsMatch(s2, pattern));//False
Console.WriteLine(Regex.IsMatch(s3, pattern));//False
\1 matches the first captured group. Thus the negative lookahead fails if any character is repeated.

This isn't regex, and would be slow, but You could create an array of the contents of the string, and then iterate through the array comparing n to n++
=Waldo

It can be done using what is called a backreference.
I am a Java programmer so I will show you how it is done in Java (for C#, see here).
final Pattern aPattern = Pattern.compile("([A-Z]).*\\1");
final Matcher aMatcher1 = aPattern.matcher("ABCDA");
System.out.println(aMatcher1.find());
final Matcher aMatcher2 = aPattern.matcher("ABCDA");
System.out.println(aMatcher2.find());
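The regular expression is ([A-Z]).*\\1 (the backslash is doubled for the Java string literal), which translates to: any letter from 'A' to 'Z' as group 1 ('([A-Z])'), then anything else (.*), then group 1 again.
In C# the backreference inside the pattern is also written \1; $1 is only used in replacement strings.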
Hope this helps.

Regex: I want this AND that AND that... in any order

I'm not even sure if this is possible or not, but here's what I'd like.
String: "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870"
I have a text box where I type in the search parameters, and they are space-delimited. Because of this, I want to return a match if string1 is in the string and then string2 is in the string, OR string2 is in the string and then string1 is in the string. I don't care what order the strings are in, but ALL of them (there will sometimes be more than 2) have to be in the string.
So for instance, in the provided string I would want:
"FEB Low"
or
"Low FEB"
...to return as a match.
I'm REALLY new to regex, only read some tutorials on here but that was a while ago and I need to get this done today. Monday I start a new project which is much more important and can't be distracted with this issue. Is there any way to do this with regular expressions, or do I have to iterate through each part of the search filter and permute the order? Any and all help is extremely appreciated. Thanks.
UPDATE:
The reason I don't want to iterate through a loop and am looking for the best performance wise is because unfortunately, the dataTable I'm using calls this function on every key press, and I don't want it to bog down.
UPDATE:
Thank you everyone for your help, it was much appreciated.
CODE UPDATE:
Ultimately, this is what I went with.
string sSearch = nvc["sSearch"].ToString().Replace(" ", ")(?=.*");
if (sSearch != null && sSearch != "")
{
Regex r = new Regex("^(?=.*" + sSearch + ").*$", RegexOptions.IgnoreCase);
_AdminList = _AdminList.Where<IPB>(
delegate(IPB ipb)
{
//Concatenated all elements of IPB into a string
bool returnValue = r.IsMatch(strTest); //strTest is the concatenated string
return returnValue;
}).ToList<IPB>();
}
}
The IPB class has X number of elements, and no two tables throughout the site I'm working on have their columns in the same order. Therefore, I needed an any-order search and I didn't want to have to write a lot of code to do it. There were other good ideas in here, but I know my boss really likes Regex (preaches them) and therefore I thought it'd be best if I went with that for now. If for whatever reason the site's performance slips (intranet site) then I'll try another way. Thanks everyone.

You can use (?=…) positive lookahead; it asserts that a given pattern can be matched. You'd anchor at the beginning of the string, and one by one, in any order, look for a match of each of your patterns.
It'll look something like this:
^(?=.*one)(?=.*two)(?=.*three).*$
This will match a string that contains "one", "two", "three", in any order (as seen on rubular.com).
Depending on the context, you may want to anchor on \A and \Z, and use single-line mode so the dot matches everything.
This is not the most efficient solution to the problem. The best solution would be to parse out the words in your input and put them into an efficient set representation, etc.
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
More practical example: password validation
Let's say that we want our password to:
Contain between 8 and 15 characters
Must contain an uppercase letter
Must contain a lowercase letter
Must contain a digit
Must contain one of special symbols
Then we can write a regex like this:
^(?=.{8,15}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).*$
 \__________/\_________/\_________/\_________/\______________/
    length      upper      lower      digit        symbol
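For the search box scenario in the question, the chain of lookaheads can be generated from the space-delimited input. The sketch below is only one possible way to do it: AllTermsSearch and Build are hypothetical names, and it assumes the terms should be matched literally and case-insensitively. It uses Regex.Escape so that characters such as '/' , '.' or '(' typed by the user are not interpreted as regex syntax, and it drops the trailing .*$ because IsMatch only needs the lookaheads to succeed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AllTermsSearch
{
    // Builds ^(?=.*term1)(?=.*term2)... from a space-delimited search string.
    public static Regex Build(string search)
    {
        var lookaheads = search
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Regex.Escape keeps every character of a term literal.
            .Select(term => "(?=.*" + Regex.Escape(term) + ")");

        // Singleline lets '.' match newlines too, so terms are found
        // anywhere in multi-line text.
        return new Regex("^" + string.Concat(lookaheads),
                         RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Against the sample string from the question, Build("FEB Low") and Build("Low FEB") should both match, while a search containing a term that is absent from the string should not.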

Why not just do a simple check for the text since order doesn't matter?
string test = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
test = test.ToUpper();
bool match = ((test.IndexOf("FEB") >= 0) && (test.IndexOf("LOW") >= 0));
Do you need it to use regex?

I think the most expedient thing for today will be to string.Split(' ') the search terms and then iterate over the results confirming that sourceString.Contains(searchTerm)
var source = @"NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870".ToLowerInvariant();
var search = "FEB Low";
var terms = search.Split(' ');
bool all_match = !terms.Any(term => !(source.Contains(term.ToLowerInvariant())));
Notice that we use Any() to set up a short-circuit, so if the first term fails to match, we skip checking the second, third, and so forth.
This is not a great use case for RegEx. The string manipulation necessary to take an arbitrary number of search strings and convert that into a pattern almost certainly negates the performance benefit of matching the pattern with the RegEx engine, though this may vary depending on what you're matching against.
You've indicated in some comments that you want to avoid a loop, but RegEx is not a one-pass solution. It is not hard to create horrifically non-performant searches that loop and step character by character, such as the infamous catastrophic backtracking, where a very simple match takes thousands of steps to return false.

The answer by @polygenelubricants is both complete and perfect, but I had a case where I wanted to match a date and something else, e.g. a 10-digit number. The lookahead did not work for that case and I could not do it with lookaheads alone, so I used named groups:
(?:.*(?P<1>[0-9]{10}).*(?P<2>2[0-9]{3}-(?:0?[0-9]|1[0-2])-(?:[0-2]?[0-9]|3[0-1])).*)+
and this way the number is always group 1 and the date is always group 2. Of course it has a few flaws but it was very useful for me and I just thought I should share it! ( take a look https://www.debuggex.com/r/YULCcpn8XtysHfmE )

var text = @"NS306Low FEBRUARY 2FEB0078/9/201013B1-9-1Low31 AUGUST 19870";
var matches = Regex.Matches(text, @"(FEB)|(Low)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}
output:
Low
FEB
FEB
Low
should get you started

You don't have to test each permutation, just split your search into multiple parts "FEB" and "Low" and make sure each part matches. That will be far easier than trying to come up with a regex which matches the whole thing in one go (which I'm sure is theoretically possible, but probably not practical in reality).

Use string.Split(). It will return an array of substrings that are delimited by a specified string/char. The code will look something like this.
int maximumSize = 100;
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string[] individualString = myString.Split(' ', maximumSize);
For more information
http://msdn.microsoft.com/en-us/library/system.string.split.aspx
Edit:
If you really want to use regular expressions, this pattern will work:
[^ ]+
Then just use Regex.Matches().
The code will be something like this:
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string pattern = "[^ ]+";
Regex rgx = new Regex(pattern);
foreach (Match match in rgx.Matches(myString))
{
    //do stuff with match.Value
}
