I have a string that may contain one or more of the following tags:
<CHOICE [some words] [other words]>
I need to replace (C#) all occurrences of this tag as follows:
Example: I like <CHOICE [cars and bikes] [apple and oranges]>
Result: I like cars and bikes
Example: I like <CHOICE [cars and bikes] [apple and oranges]>, I also like <CHOICE [pizza] [pasta]>
Result: I like cars and bikes, I also like pizza
Basically, replace the entire tag with only the string appearing in the first set of brackets.
It looks like capture groups are the way to go, but I couldn't work out how to use them.
Any help is appreciated!
EDIT: Regex is not a requirement, I thought it would be the best approach, but I see some comments telling me that it's not needed so any other suggestion will be just as fine. Thanks!
Just for fun. Here is a school-yard foreach state-machine, with a linear O(n) time complexity.
var line = "I like <CHOICE [cars and bikes] [apple and oranges]>";
var result = new StringBuilder();
var state = 0;
foreach (char c in line)
{
if (state == 0 && c == '<') state = 1;
else if (state == 1 && c == '[') state = 2;
else if (state == 2 && c == ']') state = 3;
else if (state == 3 && c == '>') state = 0;
else if (state == 0 || state == 2) result.Append(c);
}
Output
I like cars and bikes
First get all the tag matches, then replace each matched tag with the first string it contains between [ and ].
MatchCollection matches = Regex.Matches(InputStr, @"<CHOICE(.*?)>");
foreach(Match Item in matches)
{
MatchCollection matches1 = Regex.Matches(Item.ToString(), @"\[(.+?)]");
string FirstOccurence = matches1[0].Groups[1].ToString();
InputStr = InputStr.Replace(Item.ToString(), FirstOccurence);
}
string pattern = @"\< *CHOICE *((\[(?<choice>[a-zA-Z0-9 ]+)\]) *)+ *>";
Regex regex = new Regex(pattern);
string source = "I like <CHOICE [cars and bikes] [apple and oranges]>";
var match = regex.Match(source);
if (match.Success)
{
for (int i = 0; i < match.Groups["choice"].Captures.Count; i++)
{
Debug.WriteLine(match.Groups["choice"].Captures[i]);
}
string replaced = regex.Replace(source, match.Groups["choice"].Captures[0].Value);
Debug.WriteLine(replaced);
}
The output is:
cars and bikes
apple and oranges
I like cars and bikes
\< *CHOICE *
matches "<" "zero or more spaces" "CHOICE" "zero or more spaces"
([a-zA-Z0-9 ]+)
matches words and spaces
?<choice>
gives the group above the name "choice"
\[(?<choice>[a-zA-Z0-9 ]+)\]
matches one choice in []
((\[(?<choice>[a-zA-Z0-9 ]+)\]) *)
matches choices separated by zero or more spaces
+
means you must have at least one choice
*>
you can have zero or more spaces at the end before ">"
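The code above calls Replace with the first capture of the first match only, so a string with several tags would receive the same replacement text everywhere. As a hedged sketch (the class and method names are illustrative, and the bracket contents are matched with `[^\]]*` rather than the stricter character class above), a match evaluator can pick each tag's own first capture:

```csharp
using System.Text.RegularExpressions;

public static class ChoiceTags
{
    // Matches "<CHOICE [first] [second] ...>" and records every bracketed
    // choice as a separate capture of the named group "choice".
    private static readonly Regex Tag = new Regex(
        @"<\s*CHOICE\s*(?:\[(?<choice>[^\]]*)\]\s*)+>");

    // Replaces each tag with the text of its own first bracketed choice.
    public static string ReplaceWithFirstChoice(string input)
    {
        return Tag.Replace(input, m => m.Groups["choice"].Captures[0].Value);
    }
}
```

Calling `ChoiceTags.ReplaceWithFirstChoice` on the second example from the question should give "I like cars and bikes, I also like pizza".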
I assume this is the best way to do that.
string text = "This is some dummy text with the choice < CHOICE [ white black green cyan ] [yellow green]>." +
" The second choice <CHOICE [pink brown red] [blue cyan]>.";
string pattern = @"<\s*?CHOICE\s*\[\s*?(.+?)\s*?\].*?>";
var result = Regex.Replace(text, pattern, r => String.Join(" and ", r.Groups[1].Value.Split(' ', StringSplitOptions.RemoveEmptyEntries)));
Console.WriteLine(result);
Output
This is some dummy text with the choice white and black and green and cyan. The second choice pink and brown and red.
Related
Need to get three strings from the below mentioned string, need the possible solution in C# and ASP.NET:
"componentStatusId==2|3,screeningOwnerId>0"
I need to get '2','3' and '0' using a regular expression in C#
If all you want is the numbers from a string then you could use the regex in this code:
string re = "(?:\\b(\\d+)\\b[^\\d]*)+";
Regex regex = new Regex(re);
string input = "componentStatusId==2|3,screeningOwnerId>0";
MatchCollection matches = regex.Matches(input);
for (int ii = 0; ii < matches.Count; ii++)
{
Console.WriteLine("Match[{0}] // of 0..{1}:", ii, matches.Count - 1);
DisplayMatchResults(matches[ii]);
}
Function DisplayMatchResults is taken from this Stack Overflow answer.
The Console output from the above is:
Match[0] // of 0..0:
Match has 1 captures
Group 0 has 1 captures '2|3,screeningOwnerId>0'
Capture 0 '2|3,screeningOwnerId>0'
Group 1 has 3 captures '0'
Capture 0 '2'
Capture 1 '3'
Capture 2 '0'
match.Groups[0].Value == "2|3,screeningOwnerId>0"
match.Groups[1].Value == "0"
match.Groups[0].Captures[0].Value == "2|3,screeningOwnerId>0"
match.Groups[1].Captures[0].Value == "2"
match.Groups[1].Captures[1].Value == "3"
match.Groups[1].Captures[2].Value == "0"
Hence the numbers can be seen in match.Groups[1].Captures[...].
Another possibility is to use Regex.Split where the pattern is "non digits". The results from the code below will need post processing to remove empty strings. Note that Regex.Split does not have the StringSplitOptions.RemoveEmptyEntries of the string Split method.
string input = "componentStatusId==2|3,screeningOwnerId>0";
string[] numbers = Regex.Split(input, "[^\\d]+");
for (int ii = 0; ii < numbers.Length; ii++)
{
Console.WriteLine("{0}: '{1}'", ii, numbers[ii]);
}
The output from this is:
0: ''
1: '2'
2: '3'
3: '0'
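The post-processing mentioned above can be sketched as follows. This is a minimal illustration, not part of the original answer: the class and method names are made up, and the second method shows an alternative that matches the digit runs directly so that no empty entries arise in the first place.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Splits on runs of non-digits, then drops the empty entries that
    // Regex.Split leaves when the input starts or ends with a non-digit.
    public static string[] BySplit(string input)
    {
        return Regex.Split(input, @"\D+").Where(s => s.Length > 0).ToArray();
    }

    // Alternative: collect the digit runs themselves.
    public static string[] ByMatches(string input)
    {
        return Regex.Matches(input, @"\d+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

For the input "componentStatusId==2|3,screeningOwnerId>0", both methods should return the three strings "2", "3" and "0".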
Use the following regex and capture your values from groups 1, 2 and 3.
componentStatusId==(\d+)\|(\d+),screeningOwnerId>(\d+)
For generalizing componentStatusId and screeningOwnerId with any string, you can use \w+ in the regex and make it more general.
\w+==(\d+)\|(\d+),\w+>(\d+)
I'm doing a crossword puzzle maker. The user selects cells for words, and the program compiles a crossword puzzle from the dictionary (all words which can be used in the crossword) - List<string>.
I need to find a word (words) in a dictionary which matches given mask (pattern).
For example, I need to find all words which match
#a###g
pattern, i.e. all words of length 6 in the dictionary with "a" at index 1 and "g" at index 5
The number of letters and their position are unknown in advance
How can I implement this?
You can convert word description (mask)
#a###g
into corresponding regular expression pattern:
^\p{L}a\p{L}{3}g$
Pattern explained:
^ - anchor, word beginning
\p{L} - arbitrary letter
a - letter 'a'
\p{L}{3} - exactly 3 arbitrary letters
g - letter 'g'
$ - anchor, word ending
and then get all words from dictionary which match this pattern:
Code:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
...
private static string[] Variants(string mask, IEnumerable<string> availableWords) {
// Each run of '#' becomes "that many arbitrary letters"; other characters are kept as-is.
Regex regex = new Regex("^" + Regex.Replace(mask, "#+", m => $@"\p{{L}}{{{m.Length}}}") + "$");
return availableWords
.Where(word => regex.IsMatch(word))
.OrderBy(word => word)
.ToArray();
}
Demo:
string[] allWords = new [] {
"quick",
"brown",
"fox",
"jump",
"rating",
"coding",
"lazy",
"paring",
"fang",
"dog",
};
string[] variants = Variants("#a###g", allWords);
Console.Write(string.Join(Environment.NewLine, variants));
Outcome:
paring
rating
I need to find a word in a list with "a" at index 1 and "g" at index 5, like the following
wordList.Where(word => word.Length == 6 && word[1] == 'a' && word[5] == 'g')
The length check must come first: without it, word[5] throws an IndexOutOfRangeException on shorter words (unless your words are already arranged into separate lists by length).
If you mean that you literally will pass "#a###g" as the parameter that conveys the search term:
var term = "#a###g";
var search = term.Select((c,i) => (Chr:c,Idx:i)).Where(t => t.Chr != '#').ToArray();
var words = wordList.Where(word => word.Length == term.Length && search.All(t => word[t.Idx] == t.Chr));
How it works:
Take "#a###g" and project it to a sequence of the index of the char and the char itself, so ('#', 0),('a', 1),('#', 2),('#', 3),('#', 4),('g', 5)
Discard the '#', leaving only ('a', 1),('g', 5)
This means "'a' at position 1 and 'g' at 5"
Search the word list, demanding that the word length is the same as that of "#a###g", and also that All the search terms match: for each term, get the char out of the word at Idx and check that it equals the Chr in the search term
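The steps above can be wrapped into a reusable method. The following is a sketch only: the class name `MaskSearch` and the method name `Match` are made up for illustration, and '#' is assumed to be the only wildcard character.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class MaskSearch
{
    // Returns the words that have the mask's length and agree with the mask
    // at every position that is not the '#' wildcard.
    public static List<string> Match(string mask, IEnumerable<string> words)
    {
        // Keep only the fixed letters, together with their positions.
        var fixedLetters = mask
            .Select((chr, idx) => (Chr: chr, Idx: idx))
            .Where(t => t.Chr != '#')
            .ToArray();

        // The length check runs first, so indexing into the word is safe.
        return words
            .Where(w => w.Length == mask.Length
                        && fixedLetters.All(t => w[t.Idx] == t.Chr))
            .ToList();
    }
}
```

With the mask "#a###g" and the words "rating", "coding", "paring" and "fang", this should keep "rating" and "paring" in their input order.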
I am having trouble splitting the text below. Is there an easier way to split it?
the input will be either like
"1[,]Group A[,]2[,]Group B[,]3[,]Group C[,]4[,]Group D"
OR
"a[,]Group A[,]b[,]Group B[,]c[,]Group C[,]d[,]Group D"
OR
"a)[,]Group A[,]b)[,]Group B[,]c)[,]Group C[,]d)[,]Group D"
Or sometimes it will be like the text below. How do I also detect the absence of the above pattern?
"1 Group A[,]2 Group B[,]3 Group C[,]4 Group D"
Expected output
Group A
Group B
Group C
Group D
Instead of splitting your string, you can try just picking the parts you want out of the string:
var r = new Regex("Group [A-Z]");
var m = r.Matches(inputstring);
var result = m.Select(t => t.Value).ToList();
That will match any "Group" followed by a single uppercase letter.
I cooked up this method real quick with a pseudo pattern check:
static void pattern(string input)
{
string[] splits = input.Split(new[] { "[,]" }, StringSplitOptions.None);
if (splits.Length < 2)
return;
//pseudo pattern check
char[] patternStart = splits[0].ToCharArray();
for(int i = 2; i < splits.Length; i+=2)
{
patternStart[0]++;
if (!patternStart.SequenceEqual(splits[i]))
{
Console.WriteLine("pattern fail");
return;
}
}
foreach (string entry in splits.Where((s, i) => i % 2 == 1))
Console.WriteLine(entry);
}
The pattern check is based on the idea that it is always the first character in the pattern that increases, and that it always progresses by 1 in the ASCII table (e.g. a,b,c or A,B,C or 1,2,3).
Running this with the provided patterns:
pattern("1[,]Group A[,]2[,]Group B[,]3[,]Group C[,]4[,]Group D");
Console.WriteLine();
pattern("a[,]Group A[,]b[,]Group B[,]c[,]Group C[,]d[,]Group D");
Console.WriteLine();
pattern("a)[,]Group A[,]b)[,]Group B[,]c)[,]Group C[,]d)[,]Group D");
Console.WriteLine();
pattern("1 Group A[,]2 Group B[,]3 Group C[,]4 Group D");
Console.WriteLine();
yields
Group A
Group B
Group C
Group D
Group A
Group B
Group C
Group D
Group A
Group B
Group C
Group D
pattern fail
Assuming your group names must be longer than two characters, you can simply use:
var groups = input.Replace("[,]","\0").Split( '\0' ).Where( x => x.Length > 2 );
var output = string.Join( " ", groups );
Or
var groups = input.Split( "[,]" ).Where( x => x.Length > 2 );
var output = string.Join( " ", groups );
If your group names might be 2 or fewer characters, your requirements are not complete, as there is ambiguity. For example, with this input:
a)[,]a)[,]b)[,]b)
The output could be either
a) b)
Or
a) a) b) b)
...so you will need to come up with a rule to distinguish the text you wish to keep from the text you wish to discard.
I'm new to C# so expect some mistakes ahead. Any help / guidance would be greatly appreciated.
I want to limit the accepted inputs for a string to just:
a-z
A-Z
hyphen
Period
If the character is a letter, a hyphen, or period, it's to be accepted. Anything else will return an error.
The code I have so far is
string foo = "Hello!";
foreach (char c in foo)
{
/* Is there a similar way
To do this in C# as
I am basing the following
Off of my Python 3 knowledge
*/
if (c.IsLetter == true) // *Q: Can I cut out the == true part ?*
{
// Do what I want with letters
}
else if (c.IsDigit == true)
{
// Do what I want with numbers
}
else if (c.Isletter == "-") // Hyphen | If there's an 'or', include period as well
{
// Do what I want with symbols
}
}
I know that's a pretty poor set of code.
I had a thought whilst writing this:
Is it possible to create a list of the allowed characters and check the variable against that?
Something like:
foreach (char c in foo)
{
if (c != list)
{
// Unaccepted message here
}
else if (c == list)
{
// Accepted
}
}
Thanks in advance!
Easily accomplished with a Regex:
using System.Text.RegularExpressions;
var isOk = Regex.IsMatch(foo, @"^[A-Za-z0-9\-\.]+$");
Rundown:
match from the start
|set of possible matches
||
|+-------------+
||             |any number of matches is ok
||             ||match until the end of the string
||             |||
vv             vvv
^[A-Za-z0-9\-\.]+$
  ^  ^  ^   ^ ^
  |  |  |   | |
  |  |  |   | match dot
  |  |  |   match hyphen
  |  |  match 0 to 9
  |  match a-z (lowercase)
  match A-Z (uppercase)
You can do this in a single line with regular expressions:
Regex.IsMatch(myInput, @"^[a-zA-Z0-9\.\-]*$")
^ -> match start of input
[a-zA-Z0-9\.\-] -> match any of a-z , A-Z , 0-9, . or -
* -> 0 or more times (you may prefer + which is 1 or more times)
$ -> match the end of input
You can use Regex.IsMatch function and specify your regular expression.
Or manually define the characters you need. Something like this:
string foo = "Hello!";
char[] availableSymbols = {'-', ',', '!'};
char[] availableLetters = {'A', 'a', 'H'}; //etc.
char[] availableNumbers = {'1', '2', '3'}; //etc
foreach (char c in foo)
{
if (availableLetters.Contains(c))
{
// Do what I want with letters
}
else if (availableNumbers.Contains(c))
{
// Do what I want with numbers
}
else if (availableSymbols.Contains(c))
{
// Do what I want with symbols
}
}
Possible solution
You can use the CharUnicodeInfo.GetUnicodeCategory(char) method. It returns the UnicodeCategory of a character. The following Unicode categories might be what you're looking for:
UnicodeCategory.DecimalDigitNumber
UnicodeCategory.LowercaseLetter and UnicodeCategory.UppercaseLetter
An example:
string foo = "Hello!";
foreach (char c in foo)
{
UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
if (cat == UnicodeCategory.LowercaseLetter || cat == UnicodeCategory.UppercaseLetter)
{
// Do what I want with letters
}
else if (cat == UnicodeCategory.DecimalDigitNumber)
{
// Do what I want with numbers
}
else if (c == '-' || c == '.')
{
// Do what I want with symbols
}
}
Answers to your other questions
Can I cut out the == true part?:
Yes, you can cut the == true part, it is not required in C#
If there's an 'or', include period as well.:
To create "or" expressions, use the 'barbar' (||) operator, as I've done in the example above.
Whenever you have some kind of collection of similar things (an array, a list, a string of characters, whatever), you'll see in the definition of the collection that it implements IEnumerable<T>:
public class String : ..., IEnumerable<char>,
Here T is a char. It means that you can ask the class: "give me your first T", "give me your next T", "give me your next T" and so on until there are no more elements.
This is the basis for all of LINQ. LINQ has about 40 functions that act upon sequences. If you need to do something with a sequence of the same kind of items, consider using LINQ.
The functions in LINQ can be found in the class Enumerable. One of these functions is Contains. You can use it to find out whether a sequence contains a character.
char[] allowedChars = "abcdefgh....XYZ.-".ToCharArray();
Now you have a sequence of allowed characters. Suppose you have a character x and want to know if x is allowed:
char x = ...;
bool xIsAllowed = allowedChars.Contains(x);
Now Suppose you don't have one character x, but a complete string and you want only the characters in this string that are allowed:
string str = ...
var allowedInStr = str
.Where(characterInString => allowedChars.Contains(characterInString));
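To accept or reject a whole input string, as the question asks, the same idea can be combined with Enumerable.All. This is a minimal sketch: the class name `InputValidator` and the spelled-out allowed set (a-z, A-Z, hyphen, period) are assumptions taken from the question, not from the answer above.

```csharp
using System.Linq;

public static class InputValidator
{
    // Letters a-z and A-Z plus hyphen and period, as listed in the question.
    private const string AllowedChars =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-.";

    // True when every character of the input is in the allowed set.
    // Note that an empty string is accepted, because All is true for an
    // empty sequence; add a length check if that is not wanted.
    public static bool IsValid(string input)
    {
        return input.All(c => AllowedChars.IndexOf(c) >= 0);
    }
}
```

With this sketch, `InputValidator.IsValid("Hello!")` is false because of the exclamation mark, while `InputValidator.IsValid("Hello-World.")` is true.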
If you are going to do a lot with sequences of things, consider spending some time to familiarize yourself with LINQ.
You can use Regex.IsMatch with @"^[a-zA-Z.-]*$" to check for valid characters (letters, period and hyphen).
string foo = "Hello!";
if (!Regex.IsMatch(foo, @"^[a-zA-Z.-]*$"))
{
throw new ArgumentException("Exception description here");
}
Other than that you can create a list of chars and use string.Contains method to check if it is ok.
string validChars = "abcABC./";
foreach (char c in foo)
{
if (!validChars.Contains(c))
{
// Throw exception
}
}
Also, you don't need to compare with == true/false in an if condition. The two statements below are equivalent:
if (boolvariable) { /* do something */ }
if (boolvariable == true) { /* do something */ }
Format of file
POS ID PosScore NegScore SynsetTerms Gloss
a 00001740 0.125 0 able#1" able to swim"; "she was able to program her computer";
a 00002098 0 0.75 unable#1 "unable to get to town without a car";
a 00002312 0 0 dorsal#2 abaxial#1 "the abaxial surface of a leaf is the underside or side facing away from the stem"
a 00002843 0 0 basiscopic#1 facing or on the side toward the base
a 00002956 0 0.23 abducting#1 abducent#1 especially of muscles; drawing away from the midline of the body or from an adjacent part
a 00003131 0 0 adductive#1 adducting#1 adducent#1 especially of muscles;
In this file, I want to extract (ID,PosScore,NegScore and SynsetTerms) field. The (ID,PosScore,NegScore) field data extraction is easy and I use the following code for the data of these fields.
Regex expression = new Regex(@"(\t(\d+)|(\w+)\t)");
var results = expression.Matches(input);
foreach (Match match in results)
{
Console.WriteLine(match);
}
Console.ReadLine();
This gives the correct result, but the SynsetTerms field creates a problem because some lines have two or more words. How do I separate the words and pair each one with its PosScore and NegScore?
For example, in the fifth line there are two words, abducting#1 and abducent#1, and both have the same score.
So what would be the regex for such a line that gets each word and its score, like:
Word PosScore NegScore
abducting#1 0 0.23
abducent#1 0 0.23
The non-regex, string-splitting version might be easier:
var data =
lines.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
.Skip(1)
.Select(line => line.Split('\t'))
.SelectMany(parts => parts[4].Split().Select(word => new
{
ID = parts[1],
Word = word,
PosScore = decimal.Parse(parts[2]),
NegScore = decimal.Parse(parts[3])
}));
You can use this regex
^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$
You can create a list like this
var lst=Regex.Matches(input,regex,RegexOptions.Multiline) // Multiline lets ^ and $ match at every line of the file
.Cast<Match>()
.Select(x=>
new
{
pos=x.Groups["pos"].Value,
terms=Regex.Split(x.Groups["terms"].Value,@"\s+"),
gloss=x.Groups["gloss"].Value
}
);
and now you can iterate over it
foreach(var temp in lst)
{
Console.WriteLine(temp.pos);
//you can now iterate over terms
foreach(var t in temp.terms)
{
}
}