How can I remove quoted string literals from a string in C#? - c#

I have a string:
Hello "quoted string" and 'tricky"stuff' world
and want to get the string minus the quoted parts back. E.g.,
Hello and world
Any suggestions?

resultString = Regex.Replace(subjectString,
    @"([""'])   # Match a quote, remember which one
    (?:         # Then...
    (?!\1)      # (as long as the next character is not the same quote as before)
    .           # match any character
    )*          # any number of times
    \1          # until the corresponding closing quote
    \s*         # plus optional whitespace
    ",
    "", RegexOptions.IgnorePatternWhitespace);
will work on your example.
resultString = Regex.Replace(subjectString,
    @"([""'])   # Match a quote, remember which one
    (?:         # Then...
    (?!\1)      # (as long as the next character is not the same quote as before)
    \\?.        # match any escaped or unescaped character
    )*          # any number of times
    \1          # until the corresponding closing quote
    \s*         # plus optional whitespace
    ",
    "", RegexOptions.IgnorePatternWhitespace);
will also handle escaped quotes.
So it will correctly transform
Hello "quoted \"string\\" and 'tricky"stuff' world
into
Hello and world

Use a regular expression to match any quoted strings within the string and replace them with the empty string. The Regex.Replace() method does both the pattern matching and the replacement.
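A minimal, self-contained sketch of that approach, reusing the escaped-quote pattern from the answer above in its compact form. The class and method names (QuoteStripper, RemoveQuoted) are made up for illustration and are not part of any answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteStripper
{
    // A quote, then any (optionally backslash-escaped) characters that are
    // not the same quote, then the closing quote and trailing whitespace.
    private static readonly Regex Quoted =
        new Regex(@"([""'])(?:(?!\1)\\?.)*\1\s*");

    // Hypothetical helper: returns the input with quoted parts removed.
    public static string RemoveQuoted(string input)
    {
        return Quoted.Replace(input, string.Empty);
    }

    public static void Main()
    {
        string input = "Hello \"quoted string\" and 'tricky\"stuff' world";
        Console.WriteLine(RemoveQuoted(input)); // Hello and world
    }
}
```

Because the trailing \s* is part of the match, the whitespace after each quoted part is removed together with it, which is why the result has single spaces.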

In case, like me, you're afraid of regex, I've put together a functional way to do it, based on your example string. There's probably a way to make the code shorter, but I haven't found it yet.
private static string RemoveQuotes(IEnumerable<char> input)
{
    string part = new string(input.TakeWhile(c => c != '"' && c != '\'').ToArray());
    var rest = input.SkipWhile(c => c != '"' && c != '\'');
    if (string.IsNullOrEmpty(new string(rest.ToArray())))
        return part;
    char delim = rest.First();
    var afterIgnore = rest.Skip(1).SkipWhile(c => c != delim).Skip(1);
    StringBuilder full = new StringBuilder(part);
    return full.Append(RemoveQuotes(afterIgnore)).ToString();
}

Related

Find pipe in quotes ignore false positives [duplicate]

I'm trying to replace the pipe delimiter character inside quotes with a space. The issue is that I get too many false positives because some strings are empty. I only want to replace the pipe if there is text between the quotes. The regex pattern I'm using is from another Stack Overflow post, as my regex skills are lacking.
data sample:
"Hello"|"Green | Blue"|123.45|""|""|""|5|45
Code I'm using:
internal class Program
{
public static void Main()
{
string pattern = @"(?: (?<= "")|\G(?!^))(\s*[^"" |\s]+(?:\s +[^""|\s]+)*)\s*\|\s*(?=[^""] * "")";
string substitution = @"\1 \2";
string input = @"""20190430|""Test Text""|""""|""""|""Manual""|""""|""Machine""|""""|""""|10.00|""""|0.00|||0.00||5600.00||||""A+""|""""|40.00||""""|""Vision Service |Troubleshoot""|57|""Y""|838|""Yellow Maroon""|850||""FL""||||0.00|||||||||||""""||""""||""""|||""""||||||""""||""""|""""||""""|""""||||||""""|""""|""""||||||||1||""";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
Console.WriteLine("Result:" + result);
Console.ReadKey();
}
}
It replaces the pipe in "Green | Blue" just fine. But it also replaces pipes between quotes later in the line, which breaks the file because columns get removed.
I updated the code with an actual sample of the file I'm processing. The regex finds the match but doesn't replace the pipe. I'm missing something.
If there should be text between the double quotes and the text should be on both sides of the pipe, you might use:
(?<=")(\s*[^"\s|]+)\s*\|\s*([^\s"|]+\s*)(?=")
In the replacement use $1 $2
Explanation
(?<=") Positive lookbehind, assert what is on the left is "
(\s*[^"\s|]+) Capture in group 1 matching 0+ times a whitespace char, 1+ times not ", | or a whitespace char
\s*\|\s* Match a | between 0+ times a whitespace char
([^\s"|]+\s*) Capture in group 2 matching 1+ times not ", | or a whitespace char and match 0+ times a whitespace char
(?=") Positive lookahead, assert what is on the right is "
Edit
If you want to replace multiple pipes with a space between the double quotes you could make use of the \G anchor to assert the position at the end of previous match.
In the replacement use the first capturing group followed by a space $1
(?:(?<=")|\G(?!^))(\s*[^"|\s]+(?:\s+[^"|\s]+)*)\s*\|\s*(?=[^"]*")
Explanation
(?: Non capturing group
(?<=") Assert what is on the left is "
| Or
\G(?!^) Assert position at the end of the previous match
) Close non capturing group
( Capture group 1
\s*[^"|\s]+ Match 0+ times a whitespace char, followed by 1+ times not a ", | or whitespace char
(?:\s+[^"|\s]+)* Repeat 0+ times matching 1+ whitespace chars followed by 1+ times not a ", | or whitespace char
) Close capturing group 1
\s*\|\s* Match a | between 0+ times a whitespace char
(?=[^"]*") Assert what is on the right is a "
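The \G pattern above can be tried on the data sample from the question with a small sketch like the following. The class and method names (PipeInQuotes, ReplacePipes) are invented for the example; the pattern and the "$1 " replacement are the ones described in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PipeInQuotes
{
    // Same pattern as above, written as a C# verbatim string ("" is one quote).
    private static readonly Regex PipeInsideQuotes = new Regex(
        @"(?:(?<="")|\G(?!^))(\s*[^""|\s]+(?:\s+[^""|\s]+)*)\s*\|\s*(?=[^""]*"")");

    // Hypothetical helper: replaces pipes inside non-empty quoted fields
    // with a single space. Note that .NET replacements use $1, not \1.
    public static string ReplacePipes(string line)
    {
        return PipeInsideQuotes.Replace(line, "$1 ");
    }

    public static void Main()
    {
        string line = @"""Hello""|""Green | Blue""|123.45|""""|""""|""""|5|45";
        Console.WriteLine(ReplacePipes(line));
        // "Hello"|"Green Blue"|123.45|""|""|""|5|45
    }
}
```

The empty "" fields are left alone because group 1 requires at least one character that is not a quote, pipe or whitespace directly after the opening quote.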
My guess is that we might also want to keep only one space in the text. This expression,
"([^"]+?)\s+\|\s+([^"]+?)"
with a replacement of $1 $2, might work.
Example
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = #"""([^""]+?)\s+\|\s+([^""]+?)""";
string substitution = #"\1 \2";
string input = #"""Hello""|""Green | Blue""|123.45|""""|""""|""""|5|45";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
}
}

Getting the substring after a character in C# using regex

I have the following input string:
string val = "[01/02/70]\nhello world ";
I want to get all the text after the last ] character.
Example output for a sample string above:
\nhello world
In C#, use Substring() with IndexOf:
val = val.Substring(val.IndexOf(']') + 1);
If you have multiple ] symbols, and you want to get all the string after the last one, use LastIndexOf:
string val = "[01/02/70]\nhello [01/02/80] world ";
val = val.Substring(val.LastIndexOf(']') + 1); // => " world "
If you are a fan of Regex, you might want to use a Regex.Replace like
string val = "[01/02/70]\nhello [01/02/80] world ";
val = Regex.Replace(val, @"^.*\]", string.Empty, RegexOptions.Singleline); // => " world "
Notes on REGEX:
RegexOptions.Singleline makes . match a linebreak
^ - matches beginning of string
.* - matches 0 or more characters but as many as possible (greedy matching)
\] - matches literal ] (as it is a special regex metacharacter, it must be escaped).
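The two approaches above can be compared side by side with a short sketch. The class and method names (AfterBracket, ViaSubstring, ViaRegex) are hypothetical and only serve the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AfterBracket
{
    // LastIndexOf returns -1 when there is no ']', so Substring(0)
    // then simply returns the whole string.
    public static string ViaSubstring(string s)
    {
        return s.Substring(s.LastIndexOf(']') + 1);
    }

    // Greedy .* with Singleline runs up to the last ']' even across newlines;
    // if there is no ']', nothing matches and the input is returned unchanged.
    public static string ViaRegex(string s)
    {
        return Regex.Replace(s, @"^.*\]", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        string val = "[01/02/70]\nhello [01/02/80] world ";
        Console.WriteLine("'" + ViaSubstring(val) + "'"); // ' world '
        Console.WriteLine("'" + ViaRegex(val) + "'");     // ' world '
    }
}
```

For a fixed single-character delimiter like this, the Substring version is the simpler of the two and avoids the regex engine entirely.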
You need to use a lookbehind assertion. You also have to enable the DOTALL modifier, so that . matches the newline character in between as well.
"(?s)(?<=\\]).*"
(?s) - DOTALL modifier.
(?<=\\]) - lookbehind which asserts that the match must be preceded by a close bracket
.* - Matches any character zero or more times.
or
"(?s)(?<=\\])[\\s\\S]*"
Try this if you don't want to match the following newline character.
@"(?<=\][\n\r]*).*"

I think my regular expression pattern in C# is incorrect

I'm checking to see if my regular expression matches my string.
I have a filename that looks like somename_something.txt and I want to match it against somename_*.txt, but my code fails when I pass something that should match. Here is my code.
string pattern = "somename_*.txt";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
using (ZipFile zipFile = ZipFile.Read(fullPath))
{
foreach (ZipEntry e in zipFile)
{
Match m = r.Match("somename_something.txt");
if (!m.Success)
{
throw new FileNotFoundException("A filename with format: " + pattern + " not found.");
}
}
}
The asterisk is matching the underscore and throwing it off.
Try:
somename_(\w+).txt
The (\w+) here matches one or more word characters at this location and captures them in group 1.
You can see it match here: https://regex101.com/r/qS8wA5/1
In General
The regex given in this code applies the * to the _, meaning zero or more underscores, instead of what you intended. The * denotes zero or more of the previous item. Instead try
^somename_(.*)\.txt$
This matches exactly the first part "somename_".
Then anything (.*)
And finally the end ".txt". The backslash escapes the 'dot'.
More Specific
You can also say if you only want letters and not numbers or symbols in the middle part of the match with:
^somename_[a-z]*\.txt$
As written, your regular expression
somename_*.txt
matches (in a case-insensitive manner):
the literal text somename, followed by
zero or more underscore characters (_), followed by
any character (other than newline), followed by
the literal text txt
And it will match that anywhere in the source text. You probably want to write something like
Regex myPattern = new Regex( @"
    ^         # anchor the match to start-of-text, followed by
    somename  # the literal 'somename', followed by
    _         # a literal underscore character, followed by
    .*        # zero or more of any character (except newline), followed by
    \.        # a literal period/fullstop, followed by
    txt       # the literal text 'txt'
    $         # with the match anchored at end-of-text
    " , RegexOptions.IgnoreCase|RegexOptions.IgnorePatternWhitespace
    ) ;
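To see the difference between the original pattern and the anchored one, a small sketch along these lines may help. The class and member names (FileNamePattern, Original, Anchored) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNamePattern
{
    // The pattern from the question: "_*" is zero or more underscores,
    // "." is any single character, and nothing is anchored.
    public static readonly Regex Original =
        new Regex("somename_*.txt", RegexOptions.IgnoreCase);

    // The anchored pattern: ".*" is any run of characters, "\." a literal dot.
    public static readonly Regex Anchored =
        new Regex(@"^somename_.*\.txt$", RegexOptions.IgnoreCase);

    public static void Main()
    {
        Console.WriteLine(Original.IsMatch("somename_something.txt")); // False
        Console.WriteLine(Anchored.IsMatch("somename_something.txt")); // True
        Console.WriteLine(Original.IsMatch("somenameXtxt"));           // True
        Console.WriteLine(Anchored.IsMatch("somenameXtxt"));           // False
    }
}
```

The third line shows why the unescaped dot matters: the original pattern accepts a name with no underscore and no extension separator at all.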
Hi I think the pattern should be
string pattern = "somename_.*\\.txt";
Regards

How can I escape these quotes in regex expressions?

I have a string text which is like:
"ruf": "the text I want",
"puf":
I want to extract the text inside the quotes.
I tried this:
string cg="?<=\"ruf\":\")(.*?)(?=\",puf";
Regex g = new Regex(cg);
It didn't work.
Try with below regex:
(?<="ruf":\s\")[^"]*
String literals for use in programs:
C#
@"(?<=""ruf"":\s\"")[^""]*"
output:
the text I want
Pattern description:
(?<= look behind to see if there is:
"ruf": '"ruf":'
\s whitespace (\n, \r, \t, \f, and " ")
\" '"'
) end of look-behind
[^"]* any character except: '"' (0 or more times
(matching the most amount possible))
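As a rough illustration of how this lookbehind pattern could be used from code, here is a minimal sketch. The class and method names (RufExtractor, Extract) are hypothetical, and the input is the two-line sample from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RufExtractor
{
    // The lookbehind pins the match right after "ruf": and the opening quote;
    // [^"]* then consumes everything up to the next double quote.
    private static readonly Regex RufValue =
        new Regex(@"(?<=""ruf"":\s"")[^""]*");

    // Hypothetical helper: returns the value, or an empty string if not found
    // (Match.Value is empty for a failed match).
    public static string Extract(string text)
    {
        return RufValue.Match(text).Value;
    }

    public static void Main()
    {
        string text = "\"ruf\": \"the text I want\",\n\"puf\":";
        Console.WriteLine(Extract(text)); // the text I want
    }
}
```

This variant stops at the first double quote, so it only suits values that contain no quotes themselves.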
EDIT
Can you add puf? It is a long text which has multiple quotes in it.
If you are looking till "puf" is found then try below regex:
(?<="ruf":\s\")[\s\S]*(?=",\s*"puf")
String literals for use in programs:
C#
@"(?<=""ruf"":\s\"")[\s\S]*(?="",\s*""puf"")"
You could try the below regex with s modifier,
/(?<=\"ruf\": \")[^\"]*(?=\",.*?\"puf\":)/s
With the s modifier, dot matches even newline character also.
Do it like this:
var myRegex = new Regex(@"(?s)(?<=""ruf"": "")[^""]*(?="",\s*""puf"")");
string resultString = myRegex.Match(yourString).Value;
Console.WriteLine(resultString);

Regexp skip pattern

Problem
I need to replace all asterisk symbols('*') with percent symbol('%'). The asterisk symbols in square brackets should be ignored.
Example
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "Hel[*o], w*rld!";
var output = Regex.Replace(input, "What_pattern_should_be_there?", "%");
Assert.AreEqual("Hel[*o], w%rld!", output);
}
Try using a look ahead:
\*(?![^\[\]]*\])
Here's a bit stronger solution, which takes care of [] blocks better, and even escaped \[ characters:
string text = @"h*H\[el[*o], w*rl\]d!";
string pattern = @"
\\. # Match an escaped character. (to skip over it)
|
\[ # Match a character class
(?:\\.|[^\]])* # which may also contain escaped characters (to skip over it)
\]
|
(?<Asterisk>\*) # Match `*` and add it to a group.
";
text = Regex.Replace(text, pattern,
match => match.Groups["Asterisk"].Success ? "%" : match.Value,
RegexOptions.IgnorePatternWhitespace);
If you don't care about escaped characters you can simplify it to:
\[ # Skip a character class
[^\]]* # until the first ']'
\]
|
(?<Asterisk>\*)
Which can be written without comments as: @"\[[^\]]*\]|(?<Asterisk>\*)".
To understand why it works we need to understand how Regex.Replace works: for every position in the string it tries to match the regex. If it fails, it moves one character. If it succeeds, it moves over the whole match.
Here, we have dummy matches for the [...] blocks so we may skip over the asterisks we don't want to replace, and match only the lonely ones. That decision is made in a callback function that checks if Asterisk was matched or not.
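The skip-and-decide idea described above can be sketched with the simplified pattern as follows. The class and method names (AsteriskReplacer, ReplaceOutsideBrackets) are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsteriskReplacer
{
    // Either a whole [...] block (a dummy match that gets put back unchanged)
    // or a lone asterisk captured in the named group "Asterisk".
    private static readonly Regex BlockOrAsterisk =
        new Regex(@"\[[^\]]*\]|(?<Asterisk>\*)");

    public static string ReplaceOutsideBrackets(string text)
    {
        // The callback decides per match: replace only real asterisk matches.
        return BlockOrAsterisk.Replace(
            text,
            m => m.Groups["Asterisk"].Success ? "%" : m.Value);
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceOutsideBrackets("Hel[*o], w*rld!"));
        // Hel[*o], w%rld!
    }
}
```

Since the bracket alternative comes first and consumes the whole block, the asterisks inside it are never offered to the second alternative.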
I couldn't come up with a pure RegEx solution. Therefore I am providing you with a pragmatic solution. I tested it and it works:
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "H*]e*l[*o], w*rl[*d*o] [o*] [o*o].";
var actual = ReplaceAsterisksNotInSquareBrackets(input);
var expected = "H%]e%l[*o], w%rl[*d*o] [o*] [o*o].";
Assert.AreEqual(expected, actual);
}
private static string ReplaceAsterisksNotInSquareBrackets(string s)
{
Regex rx = new Regex(@"(?<=\[[^\[\]]*)(?<asterisk>\*)(?=[^\[\]]*\])");
var matches = rx.Matches(s);
s = s.Replace('*', '%');
foreach (Match match in matches)
{
s = s.Remove(match.Groups["asterisk"].Index, 1);
s = s.Insert(match.Groups["asterisk"].Index, "*");
}
return s;
}
EDITED
Okay here is my final attempt ;)
Using negative lookbehind (?<!) and negative lookahead (?!).
var output = Regex.Replace(input, @"(?<!\[)\*(?!\])", "%");
This also passes the test in the comment to another answer "Hel*o], w*rld!"
