Regular expressions: extract all words out of quotes - c#

Using regular expressions, how can I extract all text inside double quotes, and all words outside the quotes, from a string like this:
01AB "SET 001" IN SET "BACK" 09SS 76 "01 IN" SET
First regular expression should extract all text inside double quotes like
SET 001
BACK
01 IN
The second expression should extract all other words in the string:
01AB
IN
SET
09SS
76
SET
For the first case, "(.*?)" works fine. How can I extract all the words outside the quotes?

Try this expression:
(?:^|")([^"]*)(?:$|")
The groups matched by it will exclude the quotation marks, because they are enclosed in non-capturing parentheses (?: and ). Of course you need to escape the double-quotes for use in C# code.
If the target string starts and/or ends in a quoted value, this expression will match empty groups as well (for the initial and for the trailing quote).

Try this regex:
\"[^\"]*\"
Use Regex.Matches for texts in double quotes, and use Regex.Split for all other words:
var strInput = "01AB \"SET 001\" IN SET \"BACK\" 09SS 76 \"01 IN\" SET";
var otherWords = Regex.Split(strInput, "\"[^\"]*\"");
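Note that Regex.Split as used above returns the chunks between the quoted parts (for example "01AB " and " IN SET "), not individual words. A minimal sketch of one way to get both lists, assuming words are separated by spaces; the class and method names here are hypothetical and not part of the original answer:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteSplitter
{
    // Matches one double-quoted run; group 1 is the text between the quotes.
    private static readonly Regex Quoted = new Regex("\"([^\"]*)\"");

    // Text inside each pair of double quotes, without the quotes.
    public static string[] QuotedTexts(string input) =>
        Quoted.Matches(input).Cast<Match>().Select(m => m.Groups[1].Value).ToArray();

    // Words outside the quotes: blank out the quoted runs, then split on spaces.
    public static string[] UnquotedWords(string input) =>
        Quoted.Replace(input, " ")
              .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    public static void Main()
    {
        var strInput = "01AB \"SET 001\" IN SET \"BACK\" 09SS 76 \"01 IN\" SET";
        Console.WriteLine(string.Join("|", QuotedTexts(strInput)));   // SET 001|BACK|01 IN
        Console.WriteLine(string.Join("|", UnquotedWords(strInput))); // 01AB|IN|SET|09SS|76|SET
    }
}
```

Replacing the quoted runs before splitting avoids the capture group leaking into the Regex.Split result.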

Maybe you can try replacing the words inside quotes with empty string like:
Regex r = new Regex("\".*?\"", RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.Singleline);
string p = "01AB \"SET 001\" IN SET \"BACK\" 09SS 76 \"01 IN\" SET";
Console.Write(r.Replace(p, "").Replace("  ", " "));

You need to negate the pattern in your first expression.
(?!pattern)

If you need all the blocks of the string, both quoted and unquoted, there is a simpler way: split the source string with Regex.Split. Because the pattern is wrapped in a capturing group, the quoted delimiters are kept in the result:
static Regex QuotedTextRegex = new Regex(@"("".*?"")", RegexOptions.IgnoreCase | RegexOptions.Compiled);
var result = QuotedTextRegex
    .Split(sourceString)
    .Select(v => new
    {
        value = v,
        isQuoted = v.Length > 0 && v[0] == '\"'
    });

Related

How do I remove specific character before and after single quote using regex

I have a text string containing single quotes, and I'd like to remove the parentheses before and after the single-quoted part using a regular expression. Could anyone suggest how? Thank you.
For example,
I have (name equal '('John')') and the result I expect is name equal '('John')'
// Using Regex
string input = "(name equal '('John')')";
Regex rx = new Regex(@"^\((.*?)\)$");
Console.WriteLine(rx.Match(input).Groups[1].Value);
// Using Substring method
String input= "(name equal '('John')')";
var result = input.Substring (1, input.Length-2);
Console.WriteLine(result);
Result:
name equal '('John')'
Try this:
var replaced = Regex.Replace("(name equal '('John')')", @"\((.+?'\)')\)", "${1}");
The Regex class is in the System.Text.RegularExpressions namespace.
Use negative look behind (?<! ) and negative look ahead (?! ) which will stop a match if it encounters the ', such as
(?<!')\(|\)(?!')
The example explains it as a comment:
string pattern =
@"
(?<!')\(   # Match an open paren that does not have a tick behind it
|          # or
\)(?!')    # Match a closed paren that does not have a tick after it
";
var text = "(name equal '('John')')";
// Ignore Pattern whitespace allows us to comment the pattern ONLY, does not affect processing.
var final = Regex.Replace(text, pattern, string.Empty, RegexOptions.IgnorePatternWhitespace);
Result
name equal '('John')'

C# Regex - starts with pattern1 not contain pattern2

The input contains all of the following strings:
a1.aaa[SUBSCRIBED]
a1.bbb
a1.ccc
b1.ddd
d1.ddd[SUBSCRIBED]
I want to get the output:
bbb
ccc
That is, all the words that come after "a1." and do not contain the substring "[SUBSCRIBED]".
Why regex? The following is crystal clear:
var result = strings
.Where(s => s.StartsWith("a1.") && !s.Contains("[SUBSCRIBED]"))
.Select(s => s.Substring(3));
Tim's answer makes sense. However, if you insist on a regex, I would venture that it would look like this:
^a1\.(.*)(?<!\[SUBSCRIBED\])$
Here ^a1\. means the string starts with "a1.",
(.*) captures any number of characters,
and the negative lookbehind (?<!\[SUBSCRIBED\])$ rejects text ending with [SUBSCRIBED].
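A minimal sketch of how the lookbehind pattern above could be applied, assuming each entry arrives as a separate string; the class and method names are hypothetical. Note that this variant only rejects entries that end with [SUBSCRIBED], not entries that contain it elsewhere.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubscriptionFilter
{
    // Starts with "a1.", captures the rest, and must not end with "[SUBSCRIBED]".
    private static readonly Regex Pattern =
        new Regex(@"^a1\.(.*)(?<!\[SUBSCRIBED\])$");

    // Returns the captured part of every entry that matches the pattern.
    public static List<string> Unsubscribed(IEnumerable<string> entries)
    {
        var result = new List<string>();
        foreach (var entry in entries)
        {
            Match m = Pattern.Match(entry);
            if (m.Success)
            {
                result.Add(m.Groups[1].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        var input = new[] { "a1.aaa[SUBSCRIBED]", "a1.bbb", "a1.ccc", "b1.ddd", "d1.ddd[SUBSCRIBED]" };
        Console.WriteLine(string.Join(",", Unsubscribed(input))); // bbb,ccc
    }
}
```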
You may use
^a1\.(?!.*\[SUBSCRIBED])(.*)
See the regex demo.
Details
^ - start of string
a1\. - a literal a1. substring
(?!.*\[SUBSCRIBED]) - a negative lookahead that fails the match if a [SUBSCRIBED] substring is present after any 0+ chars (other than newline, unless the RegexOptions.Singleline option is used)
(.*) - Group 1: the rest of the line up to the end (if you use RegexOptions.Singleline option, . will match newlines as well).
C# code:
var result = string.Empty;
var m = Regex.Match(s, @"^a1\.(?!.*\[SUBSCRIBED])(.*)");
if (m.Success)
{
result = m.Groups[1].Value;
}

Regular Expression to split a string

I have a string like
[123,234,321,....]
Can I use a regular expression to extract only the numbers?
This is the code I am using now:
public static string[] Split()
{
string s = "[123,234,345,456,567,678,789,890,100]";
var temp = s.Replace("[", "").Replace("]", "");
char[] separator = { ',' };
return temp.Split(separator);
}
You can use string.Split for this - no need for a regular expression, though your code can be simplified:
var nums = s.Split('[', ']', ',');
Though you may want to exclude empty entries in the returned array:
var nums = s.Split(new[] { '[', ']', ',' },
StringSplitOptions.RemoveEmptyEntries);
There's an overload to Trim() that takes a character.
You could do this.
string s = "[123,234,345,456,567,678,789,890,100]";
var nums = s.Trim('[').Trim(']').Split(',');
If you want to use a regular expression, try:
string s = "[123,234,345,456,567,678,789,890,100]";
var matches = Regex.Matches(s, @"[0-9]+", RegexOptions.Compiled);
However, regular expressions tend to make your code less readable, so you might stick with your original approach.
Try using the string.Split method:
string s = "[123,234,345,456,567,678,789,890,100]";
var numbers = s.Split('[',']', ',');
foreach(var i in numbers )
Console.WriteLine(i);
Here is a DEMO.
EDIT: As Oded mentioned, you may want to use StringSplitOptions.RemoveEmptyEntries also.
string s = "[123,234,345,456,567,678,789,890,100]";
MatchCollection matches = Regex.Matches(s, @"(\d+)[,\]]");
string[] result = matches.OfType<Match>().Select(m => m.Groups[1].Value).ToArray();
Here the @ signifies a verbatim string literal, which allows the escape character '\' to be used directly in regular expression notation without escaping it as "\\".
\d is a digit, and \d+ means one or more digits. The parentheses signify a group, so (\d+) means "I want a group of digits". (The group is used a little later.)
[,\]] The square brackets, in brief, mean "choose any one of my elements", so this matches either the comma , or a closing square bracket ], which I had to escape.
So the regular expression finds runs of sequential digits followed by a , or ]. Matches returns the set of matches (which we use because there are multiple matches); then we go through each match, with some LINQ, and grab the group at index 1. "But we only made one group?" We only specified one group, but the group at index 0 is always the entire regular expression match, which in our case includes the , or ] that we don't want.
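The difference between the group at index 0 and the group at index 1 can be seen in a small sketch. The class and method names are hypothetical, and the shortened input string is chosen only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupIndexDemo
{
    // Returns { whole match, group 1 } for the first match, or null if there is none.
    public static string[] FirstMatchParts(string s)
    {
        Match m = Regex.Match(s, @"(\d+)[,\]]");
        if (!m.Success)
        {
            return null;
        }
        return new[] { m.Groups[0].Value, m.Groups[1].Value };
    }

    public static void Main()
    {
        string[] parts = FirstMatchParts("[123,234,345]");
        Console.WriteLine(parts[0]); // 123,  (whole match, including the delimiter)
        Console.WriteLine(parts[1]); // 123   (group 1, digits only)
    }
}
```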
While you can and probably should use string.Split as other answers indicate, the question specifically asks if you can do it with regex, and yes, you can:
var r = new Regex(@"\d+", RegexOptions.Compiled);
var matches = r.Matches("[123,234,345,456,567,678,789,890,100]");

How can I remove quoted string literals from a string in C#?

I have a string:
Hello "quoted string" and 'tricky"stuff' world
and want to get the string minus the quoted parts back. E.g.,
Hello and world
Any suggestions?
resultString = Regex.Replace(subjectString,
@"([""']) # Match a quote, remember which one
(?: # Then...
(?!\1) # (as long as the next character is not the same quote as before)
. # match any character
)* # any number of times
\1 # until the corresponding closing quote
\s* # plus optional whitespace
",
"", RegexOptions.IgnorePatternWhitespace);
will work on your example.
resultString = Regex.Replace(subjectString,
@"([""']) # Match a quote, remember which one
(?: # Then...
(?!\1) # (as long as the next character is not the same quote as before)
\\?. # match any escaped or unescaped character
)* # any number of times
\1 # until the corresponding closing quote
\s* # plus optional whitespace
",
"", RegexOptions.IgnorePatternWhitespace);
will also handle escaped quotes.
So it will correctly transform
Hello "quoted \"string\\" and 'tricky"stuff' world
into
Hello and world
Use a regular expression to match any quoted strings within the string and replace them with the empty string. Use the Regex.Replace() method to do the pattern matching and replacement.
In case, like me, you're afraid of regex, I've put together a functional way to do it, based on your example string. There's probably a way to make the code shorter, but I haven't found it yet.
private static string RemoveQuotes(IEnumerable<char> input)
{
string part = new string(input.TakeWhile(c => c != '"' && c != '\'').ToArray());
var rest = input.SkipWhile(c => c != '"' && c != '\'');
if(string.IsNullOrEmpty(new string(rest.ToArray())))
return part;
char delim = rest.First();
var afterIgnore = rest.Skip(1).SkipWhile(c => c != delim).Skip(1);
StringBuilder full = new StringBuilder(part);
return full.Append(RemoveQuotes(afterIgnore)).ToString();
}

C# Regex.Split - Subpattern returns empty strings

Hey, first time poster on this awesome community.
I have a regular expression in my C# application to parse an assignment of a variable:
NewVar = 40
which is entered in a Textbox. I want my regular expression to return (using Regex.Split) the name of the variable and the value, pretty straightforward. This is the Regex I have so far:
var r = new Regex(@"^(\w+)=(\d+)$", RegexOptions.IgnorePatternWhitespace);
var mc = r.Split(command);
My goal was to do the trimming of whitespace in the Regex and not use the Trim() method on the returned values. Currently, it works, but it returns an empty string at the beginning of the returned array and an empty string at the end.
Using the above input example, this is what's returned from Regex.Split:
mc[0] = ""
mc[1] = "NewVar"
mc[2] = "40"
mc[3] = ""
So my question is: why does it return an empty string at the beginning and the end?
Thanks.
The reason Regex.Split is returning four values is that you have exactly one match, so Regex.Split is returning:
All the text before your match, which is ""
All () groups within your match, which are "NewVar" and "40"
All the text after your match, which is ""
Regex.Split's primary purpose is to extract the text between regex matches; for example, you could use Regex.Split with a pattern of "[,;]" to split text on either commas or semicolons. In .NET Framework 1.0 and 1.1, Regex.Split only returned the split values, in this case "" and "", but in .NET Framework 2.0 it was modified to also include values matched by () within the Regex, which is why you are seeing "NewVar" and "40" at all.
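A small sketch of this behavior on current .NET versions: the same split with and without a capturing group, plus the whole-string pattern from the question. The class and method names are hypothetical, and the sample inputs are chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCaptureDemo
{
    // No capturing group: only the text between delimiters is returned.
    public static string[] WithoutGroup(string s) => Regex.Split(s, "[,;]");

    // Capturing group: the matched delimiters are included in the result.
    public static string[] WithGroup(string s) => Regex.Split(s, "([,;])");

    // The pattern from the question: one match that spans the whole input.
    public static string[] WholeStringPattern(string s) =>
        Regex.Split(s, @"^(\w+)=(\d+)$");

    public static void Main()
    {
        Console.WriteLine(string.Join("|", WithoutGroup("a,b;c")));           // a|b|c
        Console.WriteLine(string.Join("|", WithGroup("a,b;c")));              // a|,|b|;|c
        Console.WriteLine(string.Join("|", WholeStringPattern("NewVar=40"))); // |NewVar|40|
    }
}
```

The last line shows the four elements from the question: an empty string before the match, the two captured groups, and an empty string after the match.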
What you were looking for is Regex.Match, not Regex.Split. It will do exactly what you want:
var r = new Regex(@"^(\w+)=(\d+)$");
var match = r.Match(command);
var varName = match.Groups[1].Value;   // Groups[0] is the entire match
var valueText = match.Groups[2].Value;
Note that RegexOptions.IgnorePatternWhitespace means you can include extra spaces in your pattern; it has nothing to do with the matched text. Since you have no extra whitespace in your pattern, it is unnecessary.
From the docs, Regex.Split() uses the regular expression as the delimiter to split on. It does not split the captured groups out of the input string. Also, IgnorePatternWhitespace ignores unescaped whitespace in your pattern, not in the input.
Instead, try the following:
var r = new Regex(#"\s*=\s*");
var mc = r.Split(command);
Note that the whitespace is actually consumed as a part of the delimiter.
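A minimal sketch of what this split returns for the input from the question; the class and method names are hypothetical. Only the whitespace around the equals sign is consumed, so leading or trailing whitespace on the whole command would still remain in the results.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssignmentSplitter
{
    // Splits on "=" together with any whitespace directly around it.
    public static string[] SplitAssignment(string command) =>
        Regex.Split(command, @"\s*=\s*");

    public static void Main()
    {
        string[] parts = SplitAssignment("NewVar = 40");
        Console.WriteLine(parts[0]); // NewVar
        Console.WriteLine(parts[1]); // 40
    }
}
```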
