C# filter String with Regex

I'm not familiar with regular expressions, but I think a regex could help me a lot with my problem.
I have 2 kind of string in a big List<string> str (with or without description) :
str[0] = "[toto]";
str[1] = "[toto] descriptionToto";
str[2] = "[titi]";
str[3] = "[titi] descriptionTiti";
str[4] = "[tata]";
str[5] = "[tata] descriptionTata";
The list isn't really ordered.
I would like to go through the whole list and reformat each entry depending on what it contains.
If I find "[toto]", I would like to set str[0]="toto",
and if I find "[toto] descriptionToto", I would like to set str[1]="descriptionToto".
Do you have any ideas about the best way to get this result?

There are two regex options if you ask me:
Make a regex pattern with two capturing groups, then use group 1 or group 2 depending on whether group 1 is empty. In this case you'd use named capturing groups to get a clear relationship between the pattern and the code
Make a regex that matches string type 1 or string type 2, in which case you would get your end result directly from regex
If you're going for speed, using str[0].IndexOf(']') would get most of the job done.
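The first option (one pattern with named capturing groups) could look roughly like the sketch below. The class name BracketFilter, the method Extract and the group names tag and desc are made up for illustration, and the sketch assumes every entry starts with a bracketed name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketFilter
{
    // "tag" captures the text inside the brackets; "desc" captures an
    // optional description that follows the closing bracket.
    // (Both group names are illustrative, not from the question.)
    private static readonly Regex Pattern =
        new Regex(@"^\[(?<tag>[^\]]*)\]\s*(?<desc>.*)$");

    public static string Extract(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
            return input; // leave unrecognized strings untouched

        string desc = m.Groups["desc"].Value;
        // Prefer the description when present, otherwise fall back to the tag.
        return desc.Length > 0 ? desc : m.Groups["tag"].Value;
    }

    public static void Main()
    {
        var str = new List<string> { "[toto]", "[toto] descriptionToto" };
        for (int i = 0; i < str.Count; i++)
            str[i] = Extract(str[i]);

        Console.WriteLine(string.Join(", ", str)); // toto, descriptionToto
    }
}
```

The named groups keep the relationship between the pattern and the code readable: the code asks for "desc" and "tag" instead of group numbers.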

Rather than regex, I'd be inclined to just use string.split, something along the lines of:
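string[] tokens = str[0].Split(new char[] { '[', ']' });
if (tokens[2] == "") {
    str[0] = tokens[1]; // no description: keep the bracketed name
} else {
    str[0] = tokens[2].Trim(); // description present: drop the leading space
}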

You can use single regex:
string s = Regex.Match(str[0], @"(?<=\[)[^\]]*(?=]$)|(?<=] ).*").Value;
The idea is simple: if the text ends with ] and contains no other ], take everything between [ and ]; otherwise take everything after the first ] and the space that follows it.
Sample code:
List<string> strList = new List<string> {
"[toto]",
"[toto] descriptionToto",
"[titi]",
"[titi] descriptionTiti",
"[tata]",
"[tata] descriptionTata" };
foreach(string str in strList)
    Console.WriteLine(Regex.Match(str, @"(?<=\[)[^\]]*(?=]$)|(?<=] ).*").Value);
Sample output:
toto
descriptionToto
titi
descriptionTiti
tata
descriptionTata

If you only need the description for the entries that have one, you can split at a space character (" ") and store the second element of the resulting array back in the list; that element is the description.
If there is no description, the string contains no space.
So loop over the list and call str[i].Split(' ') on each entry. This splits an entry with a description into two elements.
So:
for (int i = 0; i < str.Count; i++)
{
    string[] words = str[i].Split(' ');
    if (words.Length > 1)
    {
        str[i] = words[1];
    }
}

If those are code strings and not literal variable notation this should work.
The replacement just concatenates capture groups 1 and 2; only one of them is ever non-empty.
Find: ^\s*(?:\[([^\[\]]*)\]\s*|\[[^\[\]]*\]\s*((?:\s*\S)+\s*))$
Replace: "$1$2"
^
\s*
(?:
\[
( [^\[\]]* ) # (1)
\] \s*
|
\[ [^\[\]]* \]
\s*
( # (2 start)
(?: \s* \S )+
\s*
) # (2 end)
)
$
Dot-Net test case
string str1 = "[titi]";
Console.WriteLine( Regex.Replace(str1, @"^\s*(?:\[([^\[\]]*)\]\s*|\[[^\[\]]*\]\s*((?:\s*\S)+\s*))$", @"$1$2"));
string str2 = "[titi] descriptionTiti";
Console.WriteLine( Regex.Replace(str2, @"^\s*(?:\[([^\[\]]*)\]\s*|\[[^\[\]]*\]\s*((?:\s*\S)+\s*))$", @"$1$2"));
Output >>
titi
descriptionTiti

Related

How to split string by another string

I have this string (it's from EDI data):
ISA*ESA?ISA*ESA?
The * indicates it could be any character and can be of any length.
? indicates any single character.
Only the ISA and ESA are guaranteed not to change.
I need this split into two strings which could look like this: "ISA~this is date~ESA|" and
"ISA~this is more data~ESA|"
How do I do this in c#?
I can't use string.Split, because the data doesn't really have a delimiter.
You can use Regex.Split for accomplishing this
string splitStr = "|", inputStr = "ISA~this is date~ESA|ISA~this is more data~ESA|";
var regex = new Regex($@"(?<=ESA){Regex.Escape(splitStr)}(?=ISA)", RegexOptions.Compiled);
var items = regex.Split(inputStr);
foreach (var item in items) {
Console.WriteLine(item);
}
Output:
ISA~this is date~ESA
ISA~this is more data~ESA|
Note that if the data between ISA and ESA can itself contain the pattern we are looking for (ESA, the separator, then ISA), you will need a smarter approach.
To explain the regex a bit:
(?<=ESA) Look-behind assertion. It requires ESA immediately before the current position, but ESA is not consumed as part of the match
(?=ISA) Look-ahead assertion. It requires ISA immediately after the current position, but ISA is not consumed as part of the match
Using these look-around assertions you can find the correct | character to split on
Simply use
int x = whateverString.indexOf("?ISA"); // replace ? with the actual character here
and then take the substring from 0 to that index, and the substring from that index to the end of the string.
Edit:
If ? is not known, we can use the regex Pattern and Matcher instead:
Matcher matcher = Pattern.compile("ISA.*?ESA").matcher(whateverString);
int x = -1;
if (matcher.find() && matcher.find()) { // skip the first segment, land on the second
    x = matcher.start();
}
Here x gives the start index of the second match, which is where the string should be split.
Edit: I mistakenly read this as a Java question. For C#:
string pattern = @"ISA.*?ESA"; // lazy quantifier, so each ISA...ESA segment is matched separately
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = myRegex.Match(whateverString); // m is the first match
while (m.Success)
{
    Console.WriteLine(m.Value);
    m = m.NextMatch(); // more matches
}
A regex will probably be the best fit for this.
Mask would be
ISA(?<data1>.*?)ESA.ISA(?<data2>.*?)ESA.
This will give you 2 groups with data you need
Match match = Regex.Match(input, @"ISA(?<data1>.*?)ESA.ISA(?<data2>.*?)ESA.", RegexOptions.IgnoreCase);
if (match.Success)
{
var data1 = match.Groups["data1"].Value;
var data2 = match.Groups["data2"].Value;
}
Use Regex.Matches if you need to find multiple matches, and specify different RegexOptions if needed.
It's kinda hacky but you could do...
string x = "ISA*ESA?ISA*ESA?";
x = x.Replace("*","~"); // OR SOME OTHER DELIMITER
string[] y = x.Split('~');
Not perfect in all situations, but it could solve your problem simply.
You could split by "ISA" and "ESA" and then put the parts back together.
string input = "ISA~this is date~ESA|ISA~this is more data~ESA|";
string start = "ISA",
end = "ESA";
var splitedInput = input.Split(new[] { start, end }, StringSplitOptions.None);
var firstPart = $"{start}{splitedInput[1]}{end}{splitedInput[2]}";
var secondPart = $"{start}{splitedInput[3]}{end}{splitedInput[4]}";
// firstPart is "ISA~this is date~ESA|"
// secondPart is "ISA~this is more data~ESA|"
Use a Regex like ISA(.+?)ESA and select the first group
string input = "ISA~mycontent+ESA";
Match match = Regex.Match(input, @"ISA(.+?)ESA", RegexOptions.IgnoreCase);
if (match.Success)
{
string key = match.Groups[1].Value;
}
Instead of "splitting" by a string, I would instead describe your question as "grouping" by a string. This can easily be done using a regular expression:
Regular expression: ^(ISA.*?(?=ESA)ESA.)(ISA.*?(?=ESA)ESA.)$
Explanation:
^ - asserts position at start of the string
( - start capturing group
ISA - match string ISA exactly
.*?(?=ESA) - match any character 0 or more times, positive lookahead on the
string ESA (basically match any character until the string ESA is found)
ESA - match string ESA exactly
. - match any character
) - end capturing group
repeat one more time...
$ - asserts position at end of the string
Try it on Regex101
Example:
string input = "ISA~this is date~ESA|ISA~this is more data~ESA|";
Regex regex = new Regex(@"^(ISA.*?(?=ESA)ESA.)(ISA.*?(?=ESA)ESA.)$",
RegexOptions.Compiled);
Match match = regex.Match(input);
if (match.Success)
{
string firstValue = match.Groups[1].Value; // "ISA~this is date~ESA|"
string secondValue = match.Groups[2].Value; // "ISA~this is more data~ESA|"
}
There are two answers to the question "How to split a string by another string".
var matches = input.Split(new [] { "ISA" }, StringSplitOptions.RemoveEmptyEntries);
and
var matches = Regex.Split(input, "ISA").ToList();
However, the first removes empty entries, while the second does not.

Replace Single WhiteSpace without Replacing Multiple WhiteSpace

I have a string in the format:
abc def ghi    xyz
I would like to end up with it in the format:
abcdefghi xyz
What is the best way to do this? In this particular case, I could just strip off the last three characters, remove spaces, and then add them back at the end, but this won't work for cases in which the multiple spaces are in the middle of the string.
In short, I want to remove all single whitespaces, and then replace each run of multiple whitespaces with a single one. Each of those steps is easy enough by itself, but combining them seems a bit less straightforward.
I'm willing to use regular expressions, but I would prefer not to.
This approach uses regular expressions but hopefully in a way that's still fairly readable. First, split your input string on multiple spaces
var pattern = @" {2,}"; // match two or more spaces
var groups = Regex.Split(input, pattern);
Next, remove the (individual) spaces from each token:
var tokens = groups.Select(group => group.Replace(" ", String.Empty));
Finally, join your tokens with single spaces
var result = String.Join(" ", tokens.ToArray());
This example uses a literal space character rather than 'whitespace' (which includes tabs, linefeeds, etc.) - substitute \s for ' ' if you need to split on multiple whitespace characters rather than actual spaces.
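Put together, the three steps above might be wrapped in a single method along these lines. This is only a sketch: the class name SpaceNormalizer and the method name Normalize are made up for illustration, and it handles literal spaces only, not tabs or line breaks.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SpaceNormalizer
{
    public static string Normalize(string input)
    {
        // 1. Split on runs of two or more spaces.
        string[] groups = Regex.Split(input, " {2,}");
        // 2. Remove the remaining single spaces inside each piece.
        var tokens = groups.Select(g => g.Replace(" ", string.Empty));
        // 3. Rejoin the pieces with exactly one space.
        return string.Join(" ", tokens);
    }

    public static void Main()
    {
        // Four spaces between "ghi" and "xyz".
        Console.WriteLine(Normalize("abc def ghi    xyz")); // abcdefghi xyz
    }
}
```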
Well, regular expressions would probably be the fastest here, but you could implement an algorithm that looks ahead for single spaces and then replaces multiple spaces in a loop:
// Remove every space that is followed by a non-space character
// (this drops single spaces and shortens each longer run by one)
for (int i = 0; i < sourceString.Length; i++)
{
    if (sourceString[i] == ' ')
    {
        if (i < sourceString.Length - 1 && sourceString[i + 1] != ' ')
            sourceString = sourceString.Remove(i, 1);
    }
}
// Collapse the remaining runs of spaces
while (sourceString.Contains("  ")) // Two spaces here!
    sourceString = sourceString.Replace("  ", " ");
But hey, that code is pretty ugly and slow compared to a proper regular expression...
For a Non-REGEX option you can use:
string str = "abc def ghi    xyz";
var result = str.Split(); //This will remove single spaces from the result
StringBuilder sb = new StringBuilder();
bool ifMultipleSpacesFound = false;
for (int i = 0; i < result.Length;i++)
{
if (!String.IsNullOrWhiteSpace(result[i]))
{
sb.Append(result[i]);
ifMultipleSpacesFound = false;
}
else
{
if (!ifMultipleSpacesFound)
{
ifMultipleSpacesFound = true;
sb.Append(" ");
}
}
}
string output = sb.ToString();
The output would be:
output = "abcdefghi xyz"
Here's an approach which uses some fairly subtle logic:
public static string RemoveUnwantedSpaces(string text)
{
var sb = new StringBuilder();
char lhs = '\0';
char mid = '\0';
foreach (char rhs in text)
{
if (rhs != ' ' || (mid == ' ' && lhs != ' '))
sb.Append(rhs);
lhs = mid;
mid = rhs;
}
return sb.ToString().Trim();
}
How it works:
We will examine each possible three-character subsequence linearly across the string (in a kind of three-character sliding window). These three characters will be represented, in order, by the variables lhs, mid and rhs.
For each rhs character in the string:
If it's not a space we should output it.
If it is a space, and the previous character was also space but the one before that isn't, then this is the second in a sequence of at least two spaces, and therefore we should output one space.
Otherwise, don't output a space because this is either the first or the third (or later) space in a sequence of two or more spaces and in either case we don't want to output a space: If this happens to be the first in a sequence of two or more spaces, a space will be output when the second space comes along. If this is the third or later, we've already output a space for it.
The subtlety here is that I've avoided special casing the beginning of the sequence by initialising the lhs and mid variables with non-space characters. It doesn't matter what those values are, as long as they are not spaces, but I made them \0 to indicate that they are special values.
On second thought, here is a one-line regex solution:
Regex.Replace("abc def ghi    xyz", "( )( )*([^ ])", "$2$3")
The result of this is "abcdefghi xyz"
ORIGINAL ANSWER:
A two-line regex solution:
var tmp = Regex.Replace("abc def ghi    xyz", "( )([^ ])", "$2");
tmp is "abcdefghi   xyz"
then:
var result = Regex.Replace(tmp, "( )+", " ");
result is "abcdefghi xyz"
Explanation:
The first line of code removes single spaces and removes one space from each run of multiple spaces (so tmp has 3 spaces between the letters i and x).
The second line just replaces multiple spaces with one.
In-depth explanation of the first line:
We match the input string against a regex that matches one space followed by a non-space character. We also put these two characters in separate groups (using ( ) for unnamed grouping).
So for the string "abc def ghi    xyz" we get these matches and groups:
match: " d" group1: " " group2: "d"
match: " g" group1: " " group2: "g"
match: " x" group1: " " group2: "x"
We use the substitution syntax of the Regex.Replace method to replace each match with the content of the second group (the non-space character).

Extract multiple values from a string

I need to extract values from a string.
string sTemplate = "Hi [FirstName], how are you and [FriendName]?";
Values I need returned:
FirstName
FriendName
Any ideas on how to do this?
You can use the following regex globally:
\[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
Example:
string input = "Hi [FirstName], how are you and [FriendName]?";
string pattern = @"\[(.*?)\]";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
        Console.WriteLine(" " + match.Groups[1].Value); // group 1 is the name without the brackets
}
If the format/structure of the text won't be changing at all, and assuming the square brackets were used as markers for the variable, you could try something like this:
string sTemplate = "Hi FirstName, how are you and FriendName?";
// Split the string into two parts. Before and after the comma.
string[] clauses = sTemplate.Split(',');
// Grab the last word in each part.
string[] names = new string[]
{
clauses[0].Split(' ').Last(), // Using LINQ for .Last()
clauses[1].Split(' ').Last().TrimEnd('?')
};
return names;
You will need to tokenize the text and then extract the terms.
char delimiter = ' ';
string[] tokenizedTerms = sTemplate.Split(delimiter);
string firstName = tokenizedTerms[1]; // 2nd word, still ending in ','
string friendName = tokenizedTerms[6]; // 7th word, still ending in '?'
// Drop the trailing punctuation character from each term.
// If the terms are wrapped in square brackets, the brackets remain.
char[] firstNameChars = firstName.ToCharArray();
firstName = new String(firstNameChars, 0, firstNameChars.Length - 1);
char[] friendNameChars = friendName.ToCharArray();
friendName = new String(friendNameChars, 0, friendNameChars.Length - 1);
Explanation:
Tokenizing separates the string into a string array in which each element is the character sequence between two delimiters; here the delimiter is a space, so the elements are the words. From this word array we know that we want the 2nd word (element 1) and the 7th word (element 6). However, each of these terms has punctuation at the end, so we convert the strings to a char array and then back to a string minus the last character, which is the punctuation.
Note:
This method assumes that each name is a single word. If the name is just Will, it will work. But if one of the names is Will Fisher (first and last name), it will not work.

Detect Two Consecutive Single Quotes Inside Single Quotes

I'm struggling to get this regex pattern exactly right, and am open to other options outside of regex if someone has a better alternative.
The situation:
I'm basically looking to parse a T-SQL "in" clause against a text column in C#. So, I need to take a string value like this:
"'don''t', 'do', 'anything', 'stupid'"
And interpret that as a list of values (I'll take care of the double single quotes later):
"don''t"
"do"
"anything"
"stupid"
I have a regex that works for most cases, but I'm struggling to generalize it to the point where it will accept any character OR a doubled-up single quote inside my group: (?:')([a-z0-9\s(?:'(?='))]+)(?:')[,\w]*
I'm fairly experienced with regexes, but have rarely, if ever, found a need for look-arounds (so downgrade my assessment of my regex experience accordingly).
So, to put this another way, I'm wanting to take a string of comma-delimited values, each enclosed in single quotes but can contain doubled single quotes, and output each such value.
EDIT
Here's a non-working example with my current regex (my problem is I need to handle all characters in my grouping and stop when I encounter a single quote not followed by a second single quote):
"'don''t', 'do?', 'anything!', '#stupid$'"
If you still think about a regex-based solution, you can use the following regex:
'(?:''|[^'])*'
Or an "un-rolled" version suggested by @sln:
'[^']*(?:''[^']*)*'
It is fairly simple: it matches either doubled single quotation marks or anything that is not a single quotation mark. There is no need for look-behinds or look-aheads. It does not handle any other escaped entities, but I do not see that requirement in your question.
Moreover, this regex returns matches that are easy to access and deal with:
var text = "'don''t', 'do', 'anything', 'stupid'";
var re = new Regex(@"'[^']*(?:''[^']*)*'"); // Updated thanks to @sln, previous (@"'(?:''|[^'])*'");
var match_values = re.Matches(text).Cast<Match>().Select(p => p.Value).ToList();
Output (each value keeps its enclosing quotes):
'don''t'
'do'
'anything'
'stupid'
If you want to use the Capture Collection feature, you can grab them all in a
single pass.
# @"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+"""
"
\s*
(?:
'
( # (1 start)
[^']*
(?:
'' [^']*
)*
) # (1 end)
'
\s*
(?:
, \s*
| (?= " )
)
)+
"
C# code:
string strSrc = "\"'don''t', 'do', 'anything', 'stupid'\"";
Regex rx = new Regex(@"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+""");
Match srcMatch = rx.Match(strSrc);
if (srcMatch.Success)
{
CaptureCollection cc = srcMatch.Groups[1].Captures;
for (int i = 0; i < cc.Count; i++)
Console.WriteLine("{0} = '{1}'", i, cc[i].Value);
}
Output:
0 = 'don''t'
1 = 'do'
2 = 'anything'
3 = 'stupid'
Why don't you split on ', ':
Regex regex = new Regex(#"'\s*,\s*'");
string[] substrings = regex.Split(str);
And then take care of the extra single quotes by Trimming
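A minimal sketch of that split-then-trim idea follows; the class name InClauseSplitter and the method name SplitTerms are made up for illustration. It assumes well-formed input, leaves doubled quotes as they are, and would break on a value that itself contains a quote, comma, quote sequence.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InClauseSplitter
{
    public static string[] SplitTerms(string input)
    {
        // Split on the closing quote, comma, opening quote between two terms.
        string[] parts = Regex.Split(input.Trim(), @"'\s*,\s*'");

        // Only the first and last pieces still carry an enclosing quote.
        int last = parts.Length - 1;
        if (parts[0].StartsWith("'", StringComparison.Ordinal))
            parts[0] = parts[0].Substring(1);
        if (parts[last].EndsWith("'", StringComparison.Ordinal))
            parts[last] = parts[last].Substring(0, parts[last].Length - 1);

        return parts;
    }

    public static void Main()
    {
        foreach (string term in SplitTerms("'don''t', 'do', 'anything', 'stupid'"))
            Console.WriteLine(term);
        // Prints don''t, do, anything and stupid on separate lines.
    }
}
```

Removing exactly one quote from each end, rather than calling Trim('\''), avoids eating a doubled quote that sits at the very start or end of a value.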
Looks to me like you're over-thinking the problem. A quoted string with an escaped quote looks just like two strings without escaped quotes, one right after the other (not even spaces between them).
(?:'[^']*')+
Of course, you'll have to remove the enclosing quotes, but you probably had to do some post-processing anyway, to unescape the escaped quotes.
Also note that I'm not trying to validate the input or work around possible errors; for example, I don't bother matching the commas between the strings. If the input is well formed, this regex should be all you need.
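As a rough sketch of the post-processing mentioned above, the following matches with (?:'[^']*')+, strips the enclosing quotes and unescapes the doubled ones. The class name QuotedListParser and the method name Parse are made up for illustration, and well-formed input is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedListParser
{
    public static List<string> Parse(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?:'[^']*')+"))
        {
            string quoted = m.Value; // e.g. 'don''t'
            // Drop the enclosing quotes; every match is at least two characters long.
            string inner = quoted.Substring(1, quoted.Length - 2);
            // Unescape the doubled single quotes.
            result.Add(inner.Replace("''", "'"));
        }
        return result;
    }

    public static void Main()
    {
        foreach (string term in Parse("'don''t', 'do', 'anything', 'stupid'"))
            Console.WriteLine(term);
        // Prints don't, do, anything and stupid on separate lines.
    }
}
```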
In the interest of maintainability, I decided against a regex and followed the advice of using a state machine. Here's the crux of my implementation:
string currentTerm = string.Empty;
State currentState = State.BetweenTerms;
foreach (char c in valueToParse)
{
switch (currentState)
{
// if between terms, only need to do something if we encounter a single quote, signalling to start a new term
// encloser is client-specified char to look for (e.g. ')
case State.BetweenTerms:
if (c == encloser)
{
currentState = State.InTerm;
}
break;
case State.InTerm:
if (c == encloser)
{
if (valueToParse.Length > index + 1 && valueToParse[index + 1] == encloser && valueToParse.Length > index + 2)
{
// if next character is also encloser then add it and move on
currentTerm += c;
}
else if (currentTerm.Length > 0 && currentTerm[currentTerm.Length - 1] != encloser)
{
// on an encloser and didn't just add encloser, so we are done
// converterFunc is a client-specified Func<string,T> to return terms in the specified type (to allow for converting to int, for example)
yield return converterFunc(currentTerm);
currentTerm = string.Empty;
currentState = State.BetweenTerms;
}
}
else
{
currentTerm += c;
}
break;
}
index++;
}
if (currentTerm.Length > 0)
{
yield return converterFunc(currentTerm);
}

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this and replace each match with an empty string. The negative lookahead checks that the quoted section does not contain the word dark before the section is matched.
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
Something like this would work. You can add all the strings you want to keep to the excludedString array.
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}
Another method uses the regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace each match with $1.
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
Explanation:
(...) is a capturing group.
'[^']*\bdark\b[^']*' matches every single-quoted string that contains the word dark. [^']* matches any character except ', zero or more times.
('[^']*\bdark\b[^']*') Because this part of the regex is inside a capturing group, the matched characters are stored in group 1.
| Next comes the regex alternation operator.
'[^']*' This matches all the remaining single-quoted strings (those that do not contain dark). It never matches a quoted string containing dark, because those strings have already been matched by the pattern before the | alternation operator.
Finally, replacing each match with the contents of group 1 gives the desired output.
I made this attempt along the lines I think you had in mind (a solution using Split, Contains, and so on, without regex):
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); // trim the leading and trailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); // trim the trailing space again
