C# String split based on the delimiter pattern [,] - c#

I am having trouble splitting the text below. Is there an easier way to split it?
The input will be either like
"1[,]Group A[,]2[,]Group B[,]3[,]Group C[,]4[,]Group D"
OR
"a[,]Group A[,]b[,]Group B[,]c[,]Group C[,]d[,]Group D"
OR
"a)[,]Group A[,]b)[,]Group B[,]c)[,]Group C[,]d)[,]Group D"
Or sometimes it will look like the text below. How do I also detect that the pattern above is absent?
"1 Group A[,]2 Group B[,]3 Group C[,]4 Group D"
Expected output
Group A
Group B
Group C
Group D

Instead of splitting your string, you can try just picking the parts you want out of the string:
var r = new Regex("Group [A-Z]");
var m = r.Matches(inputstring);
var result = m.Select(t => t.Value).ToList();
That will match any "Group" followed by a single uppercase letter.
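To check this approach against the inputs from the question, here is a minimal runnable sketch. The class and method names (GroupExtractor, ExtractGroups) are made up for illustration, and it assumes every name really has the form "Group" plus one uppercase letter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupExtractor
{
    // Hypothetical helper (not from the answer): returns each "Group X" match in order.
    public static List<string> ExtractGroups(string input)
    {
        return Regex.Matches(input, "Group [A-Z]")
                    .Cast<Match>()          // required on .NET Framework, harmless elsewhere
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        string[] inputs =
        {
            "1[,]Group A[,]2[,]Group B[,]3[,]Group C[,]4[,]Group D",
            "a)[,]Group A[,]b)[,]Group B[,]c)[,]Group C[,]d)[,]Group D",
            "1 Group A[,]2 Group B[,]3 Group C[,]4 Group D"
        };

        foreach (string input in inputs)
            Console.WriteLine(string.Join(", ", ExtractGroups(input)));
        // Each line prints: Group A, Group B, Group C, Group D
    }
}
```

Because it ignores the delimiter entirely, this works for both input shapes, but it stops working as soon as a name does not fit the "Group [A-Z]" form.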

I cooked up this method real quick with a pseudo pattern check:
static void pattern(string input)
{
string[] splits = input.Split(new[] { "[,]" }, StringSplitOptions.None);
if (splits.Length < 2)
return;
//pseudo pattern check
char[] patternStart = splits[0].ToCharArray();
for(int i = 2; i < splits.Length; i+=2)
{
patternStart[0]++;
if (!patternStart.SequenceEqual(splits[i]))
{
Console.WriteLine("pattern fail");
return;
}
}
foreach (string entry in splits.Where((s, i) => i % 2 == 1))
Console.WriteLine(entry);
}
The pattern check is based on the idea that it is always the first character of the pattern that increases, and that it always advances by exactly 1 in the ASCII table (e.g. a, b, c or A, B, C or 1, 2, 3).
Running this with the provided patterns:
pattern("1[,]Group A[,]2[,]Group B[,]3[,]Group C[,]4[,]Group D");
Console.WriteLine();
pattern("a[,]Group A[,]b[,]Group B[,]c[,]Group C[,]d[,]Group D");
Console.WriteLine();
pattern("a)[,]Group A[,]b)[,]Group B[,]c)[,]Group C[,]d)[,]Group D");
Console.WriteLine();
pattern("1 Group A[,]2 Group B[,]3 Group C[,]4 Group D");
Console.WriteLine();
yields
Group A
Group B
Group C
Group D
Group A
Group B
Group C
Group D
Group A
Group B
Group C
Group D
pattern fail

Assuming your group names must be longer than two characters, you can simply use:
var groups = input.Replace("[,]","\0").Split( '\0' ).Where( x => x.Length > 2 );
var output = string.Join( " ", groups );
Or
var groups = input.Split( "[,]" ).Where( x => x.Length > 2 );
var output = string.Join( " ", groups );
If your group names might be 2 or fewer characters, your requirements are not complete, as there is ambiguity. For example, with this input:
a)[,]a)[,]b)[,]b)
The output could be either
a) b)
Or
a) a) b) b)
...so you will need to come up with a rule to distinguish the text you wish to keep from the text you wish to discard.
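As one possible rule, the sketch below treats a token as a label when it is a number or a single letter, optionally followed by ")". That rule is an assumption made for illustration, not something the question states, and the names LabelStripper and Names are hypothetical. Tokens that are only a label are dropped, and a leading label is stripped from combined tokens such as "1 Group A".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LabelStripper
{
    // Assumed rule: a label is a number or a single letter, optionally followed by ')'.
    private static readonly Regex LabelOnly = new Regex(@"^\s*(\d+|[A-Za-z])\)?\s*$");
    private static readonly Regex LeadingLabel = new Regex(@"^\s*(\d+|[A-Za-z])\)?\s+");

    public static List<string> Names(string input)
    {
        return input.Split(new[] { "[,]" }, StringSplitOptions.None)
                    .Where(token => !LabelOnly.IsMatch(token))         // drop "1", "a", "a)"
                    .Select(token => LeadingLabel.Replace(token, ""))  // "1 Group A" -> "Group A"
                    .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Names("a)[,]Group A[,]b)[,]Group B")));
        Console.WriteLine(string.Join(", ", Names("1 Group A[,]2 Group B")));
        // Both lines print: Group A, Group B
    }
}
```

Under this rule the ambiguous input a)[,]a)[,]b)[,]b) yields nothing, and a name that itself starts with a single letter and a space (for example "A team") would lose that letter, so the rule needs adjusting if such names can occur.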

Related

Find in the List of words with letters in certain positions

I'm doing a crossword puzzle maker. The user selects cells for words, and the program compiles a crossword puzzle from the dictionary (all words which can be used in the crossword) - List<string>.
I need to find a word (words) in a dictionary which matches given mask (pattern).
For example, I need to find all words which match
#a###g
pattern, i.e. all words of length 6 in the dictionary with "a" at index 1 and "g" at index 5
The number of letters and their positions are not known in advance.
How do I implement this?
You can convert word description (mask)
#a###g
into corresponding regular expression pattern:
^\p{L}a\p{L}{3}g$
Pattern explained:
^ - anchor, word beginning
\p{L} - arbitrary letter
a - letter 'a'
\p{L}{3} - exactly 3 arbitrary letters
g - letter 'g'
$ - anchor, word ending
and then get all words from dictionary which match this pattern:
Code:
using System.Linq;
using System.Text.RegularExpressions;
...
private static string[] Variants(string mask, IEnumerable<string> availableWords) {
// Each run of '#' becomes "\p{L}{n}", where n is the length of the run
Regex regex = new Regex("^" + Regex.Replace(mask, "#+", m => $@"\p{{L}}{{{m.Length}}}") + "$");
return availableWords
.Where(word => regex.IsMatch(word))
.OrderBy(word => word)
.ToArray();
}
Demo:
string[] allWords = new [] {
"quick",
"brown",
"fox",
"jump",
"rating",
"coding",
"lazy",
"paring",
"fang",
"dog",
};
string[] variants = Variants("#a###g", allWords);
Console.Write(string.Join(Environment.NewLine, variants));
Outcome:
paring
rating
I need to find a word in a list with "a" at index 1 and "g" at index 5, like the following
wordList.Where(word => word.Length == 6 && word[1] == 'a' && word[5] == 'g')
The length check must come first to prevent a crash (an index out of range on shorter words), unless your words are already arranged into separate lists by length.
If you mean that you literally will pass "#a###g" as the parameter that conveys the search term:
var term = "#a###g";
var search = term.Select((c,i) => (Chr:c,Idx:i)).Where(t => t.Chr != '#').ToArray();
var words = wordList.Where(word => word.Length == term.Length && search.All(t => word[t.Idx] == t.Chr));
How it works:
Take "#a###g" and project it to a sequence of the index of the char and the char itself, so ('#', 0),('a', 1),('#', 2),('#', 3),('#', 4),('g', 5)
Discard the '#', leaving only ('a', 1),('g', 5)
This means "'a' at position 1 and 'g' at 5"
Search the word list, demanding that the word length is the same as that of "#a###g", and also that All the search terms match: for each term we get the char out of the word at Idx and check that it equals the Chr in the search term.
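The steps above can be packaged as a small runnable sketch. The names MaskSearch and Find are invented for illustration, and the word list is the one from the regex demo earlier in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MaskSearch
{
    // Hypothetical helper: returns the words that fit the mask, where '#' means "any character".
    public static List<string> Find(string mask, IEnumerable<string> words)
    {
        // Keep only the fixed letters with their positions, e.g. ('a', 1) and ('g', 5).
        var fixedLetters = mask.Select((c, i) => new { Chr = c, Idx = i })
                               .Where(t => t.Chr != '#')
                               .ToArray();

        // The length check runs first, so indexing into the word is always safe.
        return words.Where(w => w.Length == mask.Length
                             && fixedLetters.All(t => w[t.Idx] == t.Chr))
                    .ToList();
    }

    public static void Main()
    {
        string[] allWords =
        {
            "quick", "brown", "fox", "jump", "rating",
            "coding", "lazy", "paring", "fang", "dog"
        };

        foreach (string word in Find("#a###g", allWords))
            Console.WriteLine(word);
        // Prints "rating" then "paring" (list order, not sorted)
    }
}
```

Unlike the \p{L} regex, this version accepts any character in a '#' position, not only letters.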

Regex replacement in a custom tag

I have a string that may contain one or more of the following tags:
<CHOICE [some words] [other words]>
I need to replace (C#) all occurrences of this tag as follows:
Example: I like <CHOICE [cars and bikes] [apple and oranges]>
Result: I like cars and bikes
Example: I like <CHOICE [cars and bikes] [apple and oranges]>, I also like <CHOICE [pizza] [pasta]>
Result: I like cars and bikes, I also like pizza
Basically, replace the entire tag with only the string appearing in the first set of brackets.
Looks like capture groups is the way to go but I wasn't able to understand how to make them work.
Any help is appreciated!
EDIT: Regex is not a requirement, I thought it would be the best approach, but I see some comments telling me that it's not needed so any other suggestion will be just as fine. Thanks!
Just for fun. Here is a school-yard foreach state-machine, with a linear O(n) time complexity.
var line = "I like <CHOICE [cars and bikes] [apple and oranges]>";
var result = new StringBuilder();
var state = 0;
foreach (char c in line)
{
if (state == 0 && c == '<') state = 1;
else if (state == 1 && c == '[') state = 2;
else if (state == 2 && c == ']') state = 3;
else if (state == 3 && c == '>') state = 0;
else if (state == 0 || state == 2) result.Append(c);
};
Output
I like cars and bikes
Get the tag matches first, then replace each matched tag with the first string found between [ and ]:
MatchCollection matches = Regex.Matches(InputStr, @"<CHOICE(.*?)>");
foreach(Match Item in matches)
{
MatchCollection matches1 = Regex.Matches(Item.ToString(), @"\[(.+?)]");
string FirstOccurence = matches1[0].Groups[1].ToString();
InputStr = InputStr.Replace(Item.ToString(), FirstOccurence);
}
string pattern = @"\< *CHOICE *((\[(?<choice>[a-zA-Z0-9 ]+)\]) *)+ *>";
Regex regex = new Regex(pattern);
string source = "I like <CHOICE [cars and bikes] [apple and oranges]>";
var match = regex.Match(source);
if (match.Success)
{
for (int i = 0; i < match.Groups["choice"].Captures.Count; i++)
{
Debug.WriteLine(match.Groups["choice"].Captures[i]);
}
string replaced = regex.Replace(source, match.Groups["choice"].Captures[0].Value);
Debug.WriteLine(replaced);
}
The output is:
cars and bikes
apple and oranges
I like cars and bikes
\< *CHOICE *
matches "<" "zero or more spaces" "CHOICE" "zero or more spaces"
([a-zA-Z0-9 ]+)
matches words and spaces
?<choice>
gives the group above the name "choice"
\[(?<choice>[a-zA-Z0-9 ]+)\]
matches one choice in []
((\[(?<choice>[a-zA-Z0-9 ]+)\]) *)
matches choices separated by zero or more spaces
+
means you must have at least one choice
*>
you can have zero or more spaces at the end before ">"
I assume this is the best way to do that.
string text = "This is some dummy text with the choice < CHOICE [ white black green cyan ] [yellow green]>." +
" The second choice <CHOICE [pink brown red] [blue cyan]>.";
string pattern = @"<\s*?CHOICE\s*\[\s*?(.+?)\s*?\].*?>";
var result = Regex.Replace(text, pattern, r => String.Join(" and ", r.Groups[1].Value.Split(' ', StringSplitOptions.RemoveEmptyEntries)));
Console.WriteLine(result);
Output
This is some dummy text with the choice white and black and green and cyan. The second choice pink and brown and red.
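Since the question asked specifically about capture groups, here is a minimal sketch of that route: a single Regex.Replace whose replacement string is "$1", the first captured group. The names ChoiceTag and KeepFirstChoice are hypothetical, and the pattern assumes that a choice never contains ']' or '>'.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChoiceTag
{
    // Group 1 captures the text inside the first [...] pair;
    // [^>]* then consumes the remaining choices up to the closing '>'.
    private static readonly Regex Tag = new Regex(@"<CHOICE\s*\[([^\]]*)\][^>]*>");

    // Replaces every tag in the input with the contents of its first bracket pair.
    public static string KeepFirstChoice(string input)
    {
        return Tag.Replace(input, "$1");
    }

    public static void Main()
    {
        string line = "I like <CHOICE [cars and bikes] [apple and oranges]>, " +
                      "I also like <CHOICE [pizza] [pasta]>";
        Console.WriteLine(KeepFirstChoice(line));
        // Prints: I like cars and bikes, I also like pizza
    }
}
```

Regex.Replace handles every occurrence in one call, so no loop over matches is needed.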

The specificity of sorting

Code of the character '-' is 45, code of the character 'a' is 97. It's clear that '-' < 'a' is true.
Console.WriteLine((int)'-' + " " + (int)'a');
Console.WriteLine('-' < 'a');
45 97
True
Hence the result of the following sort is correct
var a1 = new string[] { "a", "-" };
Console.WriteLine(string.Join(" ", a1));
Array.Sort(a1);
Console.WriteLine(string.Join(" ", a1));
a -
- a
But why is the result of the following sort wrong?
var a2 = new string[] { "ab", "-b" };
Console.WriteLine(string.Join(" ", a2));
Array.Sort(a2);
Console.WriteLine(string.Join(" ", a2));
ab -b
ab -b
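Array.Sort on strings uses a culture-sensitive comparison (a word sort) by default, and in that comparison the hyphen is effectively ignored:
"-" compares like "", which is less than "a",
and "-b" compares like "b", which is greater than "ab".
From the documentation: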
https://msdn.microsoft.com/en-us/library/system.globalization.compareoptions(v=vs.110).aspx
The .NET Framework uses three distinct ways of sorting: word sort, string
sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them. For example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases. Therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
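If you want the order that the character codes suggest, one option is to pass an ordinal comparer to Array.Sort. A minimal sketch follows; the class and method names are made up for illustration.

```csharp
using System;

public static class OrdinalSortDemo
{
    // Returns a copy of the array sorted by raw UTF-16 code unit values.
    public static string[] SortOrdinal(string[] items)
    {
        var copy = (string[])items.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    public static void Main()
    {
        var a2 = new[] { "ab", "-b" };
        Console.WriteLine(string.Join(" ", SortOrdinal(a2)));
        // Prints: -b ab   ('-' is 45, 'a' is 97)
    }
}
```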

Using group by in C#

Could anyone explain what this sample of code is doing? I can't quite grasp how the words string is being grouped. Is it taking the first letter of each word and grouping them somehow?
// Create a data source.
string[] words = { "apples", "blueberries", "oranges", "bananas", "apricots" };
// Create the query.
var wordGroups1 =
from w in words
group w by w[0] into fruitGroup
where fruitGroup.Count() >= 2
select new { FirstLetter = fruitGroup.Key, Words = fruitGroup.Count() };
The LINQ query groups all the words by their first character. It then removes all groups which contain only one element (=keeps all groups with two or more elements). At the end the groups are filled into new anonymous objects containing the first letter and number of words found starting with that letter.
The LINQ Documentation and samples should get you started reading and writing code like that.
// Create a data source.
string[] words = { "apples", "blueberries", "oranges", "bananas", "apricots" };
// Create the query.
var wordGroups1 =
from w in words //w is every single string in words
group w by w[0] into fruitGroup //group based on first character of w
where fruitGroup.Count() >= 2 //select those groups which have 2 or more members
//having the result so far, it makes what is needed with select
select new { FirstLetter = fruitGroup.Key, Words = fruitGroup.Count() };
Another example: show how often each string length occurs in the array:
var wordGroups1 =
from w in words
group w by w.Length into myGroup
select new { StringLength = myGroup.Key, Freq = myGroup.Count() };
//result: 1 6-length string
// 1 11-length string
// 2 7-length string
// 1 8-length string
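For comparison, the first query can also be written in method syntax, which some people find easier to read step by step. This is a sketch that is equivalent in intent; GroupByDemo and FirstLetterCounts are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupByDemo
{
    // Groups words by first character and keeps only groups with two or more words.
    public static List<string> FirstLetterCounts(IEnumerable<string> words)
    {
        return words.GroupBy(w => w[0])                     // group w by w[0]
                    .Where(g => g.Count() >= 2)             // where fruitGroup.Count() >= 2
                    .Select(g => g.Key + ": " + g.Count())  // project key and count to text
                    .ToList();
    }

    public static void Main()
    {
        string[] words = { "apples", "blueberries", "oranges", "bananas", "apricots" };
        foreach (string line in FirstLetterCounts(words))
            Console.WriteLine(line);
        // Prints:
        // a: 2
        // b: 2
    }
}
```

The "o" group holds only "oranges", so the Where step filters it out.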

How to parse fixed width string with complex rules into component fields with regex

I need to parse fixed width records using c# and Regular Expressions.
Each record contains a number of fixed width fields, with each field potentially having non-trivial validation rules. The problem I'm having is with a match being applied across the fixed width field boundaries.
Without the rules it is easy to break apart a fixed width string of length 13 into 4 parts like this:
(?=^.{13}$).{1}.{5}.{6}.{1}
Here is a sample field rule:
Field can be all spaces OR start with [A-Z] and be right padded with spaces. Spaces cannot occur between letters
If the field was the only thing I have to validate I could use this:
(?=^[A-Z ]{5}$)([ ]{5}|[A-Z]+[ ]*)
When I add this validation as part of a longer list I have to remove the ^ and $ from the lookahead and I start to get matches that are not of length 5.
Here is the full regex along with some sample text that should match and not match the expression.
(?=^[A-Z ]{13}$)A(?=[A-Z ]{5})([ ]{5}|(?>[A-Z]{1,5})[ ]{0,4})(?=[A-Z ]{6})([ ]{6}|(?>[A-Z]{1,6})[ ]{0,5})Z
How do I implement the rules so that, for each field, the immediate next XX characters are used for the match and ensure that matches do not overlap?
Lines that should match:
ABCDEFGHIJKLZ
A Z
AB Z
A G Z
AB G Z
ABCDEF Z
ABCDEFG Z
A GHIJKLZ
AB GHIJKLZ
Lines that should not match:
AB D Z
AB D F Z
AB F Z
A G I Z
A G I LZ
A G LZ
AB FG LZ
AB D FG Z
AB FG I Z
AB D FG i Z
The following 3 should not match but do.
AB FG Z
AB FGH Z
AB EFGH Z
EDIT:
General solution (based on Ωmega's answer) with named captures for clarity:
(?<F1>F1Regex)(?<=^.{Len(F1)})
(?<F2>F2Regex)(?<=^.{Len(F1+F2)})
(?<F3>F3Regex)(?<=^.{Len(F1+F2+F3)})
...
(?<Fn>FnRegex)
Another example. The spaces between each field regex and the zero-width positive lookbehind (?<=...) are only there for clarity.
(?<F1>\d{2}) (?<=^.{2})
(?<F2>[A-Z]{5}) (?<=^.{7})
(?<F3>\d{4}) (?<=^.{11})
(?<F4>[A-Z]{6}) (?<=^.{17})
(?<F5>\d{4})
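The example pattern above can be tried with a short sketch like the following. The sample record, the class name FixedWidthDemo and the trailing $ anchor are additions made for illustration. RegexOptions.IgnorePatternWhitespace lets the pattern keep its layout.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FixedWidthDemo
{
    // Each lookbehind pins the end of its field to an absolute offset in the record.
    // The final $ is an addition (assumption) so that trailing characters are rejected.
    private static readonly Regex Record = new Regex(@"
        (?<F1>\d{2})    (?<=^.{2})
        (?<F2>[A-Z]{5}) (?<=^.{7})
        (?<F3>\d{4})    (?<=^.{11})
        (?<F4>[A-Z]{6}) (?<=^.{17})
        (?<F5>\d{4})    $",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the five fields, or null when the record does not match.
    public static string[] Parse(string record)
    {
        Match m = Record.Match(record);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["F1"].Value, m.Groups["F2"].Value, m.Groups["F3"].Value,
            m.Groups["F4"].Value, m.Groups["F5"].Value
        };
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Parse("12ABCDE3456FGHIJK7890")));
        // Prints: 12|ABCDE|3456|FGHIJK|7890
    }
}
```

In this particular example every field regex already has a fixed length, so the lookbehinds are redundant. They start to matter when a field regex can match a variable number of characters, as in the question.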
If the input string is fixed in size, then you can match a specific position using look-aheads and look-behinds, like this:
(?<=^.{s})(?<fieldName>.*)(?=.{e}$)
where:
s = start position
e = string length - match length - s
If you concatenate multiple regexes, like this one, then you will get all the fields with specific positioning.
Example
Fixed length: 10
Field 1: start 0, length 3
Field 2: start 3, length 5
Field 3: start 8, length 2
Use this regex, ignoring white spaces:
var match = Regex.Match("0123456789", @"
(?<=^.{0})(?<name1>.*)(?=.{7}$)
(?<=^.{3})(?<name2>.*)(?=.{2}$)
(?<=^.{8})(?<name3>.*)(?=.{0}$)",
RegexOptions.IgnorePatternWhitespace);
var field1 = match.Groups["name1"].Value;
var field2 = match.Groups["name2"].Value;
var field3 = match.Groups["name3"].Value;
You can place whatever rule you want to match the fields.
I used .* for all of them, but you can place anything there.
Example 2
var match = Regex.Match(" 1a any-8888", @"
(?<=^.{0})(?<name1>\s*\d*[a-zA-Z])(?=.{9}$)
(?<=^.{3})(?<name2>.*)(?=.{4}$)
(?<=^.{8})(?<name3>(?<D>\d)\k<D>*)(?=.{0}$)
",
RegexOptions.IgnorePatternWhitespace);
var field1 = match.Groups["name1"].Value; // " 1a"
var field2 = match.Groups["name2"].Value; // " any-"
var field3 = match.Groups["name3"].Value; // "8888"
Here is your regex
I tested all of them; this sample uses one of the inputs you said should not match but did. This time it does not match:
var match = Regex.Match("AB FG Z", @"
^A
(?<=^.{1}) (?<name1>([ ]{5}|(?>[A-Z]{1,5})[ ]{0,4})) (?=.{7}$)
(?<=^.{6}) (?<name2>([ ]{6}|(?>[A-Z]{1,6})[ ]{0,5})) (?=.{1}$)
Z$
",
RegexOptions.IgnorePatternWhitespace);
// no match with this input string
Match match = Regex.Match(
Regex.Replace(text, @"^(.)(.{5})(.{6})(.)$", "$1,$2,$3,$4"),
@"^[A-Z ],[A-Z]*[ ]*,[A-Z]*[ ]*,[A-Z ]$");
I think it is possible to validate it by single regex pattern
^[A-Z ][A-Z]*[ ]*(?<=^.{6})[A-Z]*[ ]*(?<=^.{12})[A-Z ]$
If you need also capture all such groups, use
^([A-Z ])([A-Z]*[ ]*)(?<=^.{6})([A-Z]*[ ]*)(?<=^.{12})([A-Z ])$
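A small sketch for trying the capturing pattern above. To keep the field widths visible, the records are built from their parts with PadRight instead of being typed as literals. Build, IsValid and RecordValidator are hypothetical names, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RecordValidator
{
    private static readonly Regex Pattern =
        new Regex(@"^([A-Z ])([A-Z]*[ ]*)(?<=^.{6})([A-Z]*[ ]*)(?<=^.{12})([A-Z ])$");

    // Hypothetical helper: builds a 13-character record (1 + 5 + 6 + 1) from its fields.
    public static string Build(char first, string field1, string field2, char last)
    {
        return first + field1.PadRight(5) + field2.PadRight(6) + last;
    }

    public static bool IsValid(string record)
    {
        return Pattern.IsMatch(record);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("ABCDEFGHIJKLZ"));                // True
        Console.WriteLine(IsValid(Build('A', "B", "", 'Z')));       // True: both fields right-padded
        Console.WriteLine(IsValid(Build('A', "", "FG", 'Z')));      // True: field 1 is all spaces
        Console.WriteLine(IsValid(Build('A', "B D", "", 'Z')));     // False: space between letters
        Console.WriteLine(IsValid(Build('A', "   FG", "", 'Z')));   // False: letters after leading spaces
    }
}
```

Each lookbehind forces the preceding "letters then spaces" run to end exactly on a field boundary, which is what stops a run from spilling into the next field.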
I have already posted this before, but this answer is more specific to your question, and not generalized.
This solves all the cases you have presented in your question, the way you wanted.
Program to test all cases in your question
class Program
{
static void Main()
{
var strMatch = new string[]
{
// Lines that should match:
"ABCDEFGHIJKLZ",
"A Z",
"AB Z",
"A G Z",
"AB G Z",
"ABCDEF Z",
"ABCDEFG Z",
"A GHIJKLZ",
"AB GHIJKLZ",
};
var strNotMatch = new string[]
{
// Lines that should not match:
"AB D Z",
"AB D F Z",
"AB F Z",
"A G I Z",
"A G I LZ",
"A G LZ",
"AB FG LZ",
"AB D FG Z",
"AB FG I Z",
"AB D FG i Z",
// The following 3 should not match but do.
"AB FG Z",
"AB FGH Z",
"AB EFGH Z",
};
var pattern = @"
^A
(?<=^.{1}) (?<name1>([ ]{5}|(?>[A-Z]{1,5})[ ]{0,4})) (?=.{7}$)
(?<=^.{6}) (?<name2>([ ]{6}|(?>[A-Z]{1,6})[ ]{0,5})) (?=.{1}$)
Z$
";
foreach (var eachStrThatMustMatch in strMatch)
{
var match = Regex.Match(eachStrThatMustMatch,
pattern, RegexOptions.IgnorePatternWhitespace);
if (!match.Success)
throw new Exception("Should match.");
}
foreach (var eachStrThatMustNotMatch in strNotMatch)
{
var match = Regex.Match(eachStrThatMustNotMatch,
pattern, RegexOptions.IgnorePatternWhitespace);
if (match.Success)
throw new Exception("Should not match.");
}
}
}
