Separate strings with semicolon and delimiter and containing parentheses - c#

I need to separate strings with semicolon (;) as delimiter. The semicolon inside parenthesis should be ignored.
Example:
string inputString = "(Apple;Mango);(Tiger;Horse);Ant;Frog;";
The output list of strings should be:
(Apple;Mango)
(Tiger;Horse)
Ant
Frog
The other valid input strings can be :
string inputString = "The fruits are (mango;apple), and they are good"
The above string should split to a single string
"The fruits are (mango;apple), and they are good"
string inputString = "The animals in (African (Lion;Elephant) and Asian(Panda; Tiger)) are endangered species; Some plants are endangered too."
The above string should split to two strings as shown below:
"The animals in (African (Lion;Elephant) and Asian(Panda; Tiger)) are endangered species"
"Some plants are endangered too."
I searched a lot but could not find the answer to the above scenario.
Does anybody know how to achieve this without reinventing the wheel?

Use a regular expression that matches what you want to keep, not the separators:
string inputString = "(Apple;Mango);(Tiger;Horse);Ant;Frog;";
MatchCollection m = Regex.Matches(inputString, #"\([^;)]*(;[^;)]*)*\)|[^;]+");
foreach (Match x in m){
Console.WriteLine(x.Value);
}
Output:
(Apple;Mango)
(Tiger;Horse)
Ant
Frog
Expression comments:
\( opening parenthesis
[^;)]* characters before semicolon
(;[^;)]*)* optional semicolon and characters after it
\) closing parenthesis
| or
[^;]+ text with no semicolon
Note: The expression above also accepts values in parentheses without a semicolon, e.g. (Lark) and mulitple semicolons, e.g. (Lark;Pine;Birch). It will also skips empty values, e.g. ";;Pine;;;;Birch;;;" will be two items, not ten.

Handle the paranthesized case separately from the "normal" case, to ensure that semicolons are omitted in the former.
A regex to achieve this (matching a single element in your input) may look like the following (not tested):
"\([A-Za-z;]+\)|[A-Za-z]+"

Related

C# Regex split() without removing the split condition character

I am splitting a string with regex using its Split() method.
var splitRegex = new Regex(#"[\s|{]");
string input = "/Tests/ShowMessage { 'Text': 'foo' }";
//second version of the input:
//string input = "/Tests/ShowMessage{ 'Text': 'foo' }";
string[] splittedText = splitRegex.Split(input, 2);
The string is just a sample pattern of the input. There are two different structures of input, once with a space before the { or without the space. I want to split the input on the { bracket in order to get the following result:
/Tests/ShowMessage
{ 'Text': 'foo' }
If there is a space, the string gets splitted there (space gets removed) and i get my desired result. But if there isnt a space i split the string on the {, so the { gets removed, what i dont want though. How can i use Regex.Split() without removing the split condition character?
The square brackets create a character set, so you want it to match exactly one of those inner characters. For your desire start off by removing them.
So to match it a random count of whitespaces you have to add *, the result is this one\s*.
\s is a whitespace
* means zero-or-more
That you don't remove the split condition character, you can use lookahead assertion (?=...).
(?=...) or (?!...) is a lookahead assertion
The combined Regex looks like this: \s*(?={)
This is a really good and detailed documentation of all the different Regex parts, you might have a look at it. Furthermore you can test your Regex easy and for free here.
In order to not include the curly brace in the match you can put it into a look ahead
\s*(?={)
That will match any number of white spaces up to the position before a open curly brace.
You can use regular string split, on "{" and trim the spaces off:
var bits = "/Tests/ShowMessage { 'Text': 'foo' }".Split("{", StringSplitOptions.RemoveEmptyEntries);
bits[0] = bits[0].TrimEnd();
bits[1] = "{" + bits[1];
If you want to use the RegEx route, you can add the { back if you change the regex a bit:
var splitRegex = new Regex(#"\s*{");
string input = "/Tests/ShowMessage { 'Text': 'foo' }";
//second version of the input:
//string input = "/Tests/ShowMessage{ 'Text': 'foo' }";
string[] splittedText = splitRegex.Split(input, 2);
splittedText[1] = "{" + splittedText[1];
It means "split at occurrence of (zero or more whitespace followed by {)" - so the split operation nukes your spaces (you want), and your { (you don't want) but you can put the { back with certainty that it will mean you get what you want
var splitedList = srt.Text.Replace(".", ".#").Replace("?", "?#").Replace("!", "!#").Split(new[] { "#"}, StringSplitOptions.RemoveEmptyEntries).ToList();
This will split text for .!? and will not remove condition chars. For better result just replace # with some uniq char. Like this one for example '®' That is all. Simple as it is. No regex.split which is slow and difficult due to many different task criterias, etc...
passing-> "Hello. I'am dev!"
result (split condition character exist )
"Hello."
"I'am dev!"

Split a string based on spaces (unless in quotes) and remove the quotes afterwards [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.
I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
if (regexMatcher.group(1) != null) {
// Add double-quoted string without the quotes
matchList.add(regexMatcher.group(1));
} else if (regexMatcher.group(2) != null) {
// Add single-quoted string without the quotes
matchList.add(regexMatcher.group(2));
} else {
// Add unquoted word
matchList.add(regexMatcher.group());
}
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
The regex from Jan Goyvaerts is the best solution I found so far, but creates also empty (null) matches, which he excludes in his program. These empty matches also appear from regex testers (e.g. rubular.com).
If you turn the searches arround (first look for the quoted parts and than the space separed words) then you might do it in once with:
("[^"]*"|'[^']*'|[\S]+)+
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
It'll probably be easier to search the string, grabbing each part, vs. split it.
Reason being, you can have it split at the spaces before and after "will be". But, I can't think of any way to specify ignoring the space between inside a split.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
string = string.trim();
if (Regex(regex).test(string)) {
final.push(Regex(regex).match(string)[0]);
string = string.replace(regex, ""); // progress to next "word"
}
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"
String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
m.region(i, len);
if (m.lookingAt())
{
String s = m.group(1);
if ((s.startsWith("\"") && s.endsWith("\"")) ||
(s.startsWith("'") && s.endsWith("'")))
{
s = s.substring(1, s.length() - 1);
}
System.out.println(i + ": \"" + s + "\"");
i += (m.group(0).length() - 1);
}
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"
Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
If you are using c#, you can use
string input= "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
Regex.Matches(input, #"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach(var v in list1)
Console.WriteLine(v);
I have specifically added "|<(?[\w\s]*)>" to highlight that you can specify any char to group phrases. (In this case I am using < > to group.
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random
1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)
I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
if ((i % 2) == 0) {//even
String[] part1 = ss[i].split(" ");
for (String pp1 : part1) {
System.out.println("" + pp1);
}
} else {//odd
System.out.println("" + ss[i]);
}
}
The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>
().Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();
When you come across this pattern like this :
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in :[2022-11-10, 08:35:00,470, PLV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches spaces ONLY if it is followed by even number of double quotes.

Split the string based on regular expression

I've an array of strings like Name, Groups[0].Id, Types[11].Name.
I want to filter the string that has square brackets and split them into two parts. For ex., Groups[0].Id into Groups and Id.
How I can find the strings that has square brackets using regular expression?
You can try this
Regex.Split(input,#"\[.*?\][.]");
Just for splitting a single string like
string value = "Groups[0].Id";
use
string[] parts = Regex.Split(value, "\[\d+\]\.");
Explanation: you have to escape the square bracket and dot characters with a backslash (they have special meanings within a regular expression) and \d+ will accept only a string of number digits ('0'..'9') with at least one digit.
Links:
A nice .NET regex test page is http://regexhero.net/
MSDN documentation on Regex: http://msdn.microsoft.com/en-us/library/8yttk7sy.aspx
I'm not sure if you wanted to split the strings which is implied by your question title, or filter the list which seems to be what your asking at the end. You can split each element of the array with brackets and a periods this regex. This regex does not assume that the indices are digits alone -- for example it will allow an array keyed by strings.
Regex.Split(a, #"\[[^\]]+\]\.");
REY
You can use LINQ to Filter the array in one line.
string[] ary = new string[3] {"Name", "Groups[0].Id", "Types[11].Name" };
ary = ary.Where(a => Regex.Match(a, #"\[[^\]]+\]\.").Success).ToArray();
foreach (string str in ary)
{
Console.WriteLine(str);
}

How to get all words of a string in c#?

I have a paragraph in a single string and I'd like to get all the words in that paragraph.
My problem is that I don't want the suffixes words that end with punctuation marks such as (',','.',''','"',';',':','!','?') and /n /t etc.
I also don't want words with 's and 'm such as world's where it should only return world.
In the example
he said. "My dog's bone, toy, are missing!"
the list should be: he said my dog bone toy are missing
Expanding on Shan's answer, I would consider something like this as a starting point:
MatchCollection matches = Regex.Match(input, #"\b[\w']*\b");
Why include the ' character? Because this will prevent words like "we're" from being split into two words. After capturing it, you can manually strip out the suffix yourself (whereas otherwise, you couldn't recognize that re is not a word and ignore it).
So:
static string[] GetWords(string input)
{
MatchCollection matches = Regex.Matches(input, #"\b[\w']*\b");
var words = from m in matches.Cast<Match>()
where !string.IsNullOrEmpty(m.Value)
select TrimSuffix(m.Value);
return words.ToArray();
}
static string TrimSuffix(string word)
{
int apostropheLocation = word.IndexOf('\'');
if (apostropheLocation != -1)
{
word = word.Substring(0, apostropheLocation);
}
return word;
}
Example input:
he said. "My dog's bone, toy, are missing!" What're you doing tonight, by the way?
Example output:
[he, said, My, dog, bone, toy, are, missing, What, you, doing, tonight, by, the, way]
One limitation of this approach is that it will not handle acronyms well; e.g., "Y.M.C.A." would be treated as four words. I think that could also be handled by including . as a character to match in a word and then stripping it out if it's a full stop afterwards (i.e., by checking that it's the only period in the word as well as the last character).
Hope this is helpful for you:
string[] separators = new string[] {",", ".", "!", "\'", " ", "\'s"};
string text = "My dog's bone, toy, are missing!";
foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
Console.WriteLine(word);
See Regex word boundary expressions, What is the most efficient way to count all of the words in a richtextbox?. Moral of the story is that there are many ways to approach the problem, but regular expressions are probably the way to go for simplicity.
split on whitespace, trim anything that isn't a letter on the resulting strings.
Here's a looping replace method... not fast, but a way to solve it...
string result = "string to cut ' stuff. ! out of";
".',!#".ToCharArray().ToList().ForEach(a => result = result.Replace(a.ToString(),""));
This assumes you want to place it back in the original string, not a new string or a list.

C#: split a string into runs of characters, numbers and delimited strings and process it

OK my regex is a bit rusty and I've been struggling with this particular problem...
I need to split and process a string containing any number of the following, in any order:
Chars (lowercase letters only)
Quote delimited strings
Ints
The strings are pretty weird (I don't have control over them). When there's more than one number in a row in the string they're seperated by a comma. They need to be processed in the same order that they appeared in the original string.
For example, a string might look like:
abc20a"Hi""OK"100,20b
With this particular string the resulting call stack would look a bit like:
ProcessLetters( new[] { 'a', 'b', 'c' } );
ProcessInts( 20 );
ProcessLetters( 'a' );
ProcessStrings( new[] { "Hi", "OK" } );
ProcessInts( new[] { 100, 20 } );
ProcessLetters( 'b' );
What I could do is treat it a bit like CSV, where you build tokens by processing the characters one at a time, but I think it could be more easily done with a regex?
You can use the pattern contained in this string:
#"(""[^""]*""|[a-z]|\d+)"
to tokenize the input string you provided. This pattern captures three things: simple quoted strings (no embeded quotes), lower-case characters, and one or more digits.
If your quoted strings can have escaped quotes within them (e.g., "Hi\"There\"""OK""Pilgrim") then you can use this pattern to capture and tokenize them along with the rest of the input string:
#"((?:""[^""\\]*(?:\\.[^""\\]*)*"")|[a-z]|\d+)"
Here's an example:
MatchCollection matches = Regex.Matches(#"abc20a""Hi\""There\""""""OK""""Pilgrim""100,20b", #"((?:""[^""\\]*(?:\\.[^""\\]*)*"")|[a-z]|\d+)");
foreach (Match match in matches)
{
Console.WriteLine(match.Value);
}
Returns the string tokens:
a
b
c
20
a
"Hi\"There\""
"OK"
"Pilgrim"
100
20
b
One of the nice thing about this approach is you can just check the first character to see what stack you need to put your elements in. If the first character is alpha, then it goes into the ProcessLetters stack, if the character is numeric, then it goes into ProcessInts. If the first character is a quote, then it goes into ProcessStrings after trimming the leading and trailing quotes and calling Regex.Unescape() to unescape the embedded quotes.
You can make your regexp match each of the three separate options with the or operator |. This should catch valid tokens, skipping commas and other chars.
/[a-z]|[0-9]+|"[^"]"/
Can your strings contain escaped quotes?
static void Main(string[] args)
{
string test = #"abc20a""Hi""""OK""100,20b";
string[] results = Regex.Split(test, #"(""[a-zA-Z]+""|\d+|[a-zA-Z]+)");
foreach (string result in results)
{
if (!String.IsNullOrEmpty(result) && result != ",")
{
Console.WriteLine("result: " + result);
}
}
Console.ReadLine();
}

Categories

Resources