Regex to remove single-line SQL comments (--)

Question:
Can anybody give me a working regular expression (C#/VB.NET) that removes single-line comments from a SQL statement?
I mean these comments:
-- This is a comment
not those
/* this is a comment */
because I already can handle the star comments.
I have made a little parser that removes those comments when they are at the start of a line, but they can also appear after code or, worse, inside a SQL string such as 'hello --Test -- World'.
Those comments should also be removed (except those inside a SQL string, of course, if possible).
Surprisingly, I couldn't get the regex working. I would have assumed the star comments to be more difficult, but actually they aren't.
As requested, here is my code to remove /**/-style comments.
(To make it ignore SQL-style strings, you have to substitute the strings with a unique identifier (I used four concatenated), then apply the comment removal, then substitute the strings back.)
static string RemoveCstyleComments(string strInput)
{
string strPattern = @"/[*][\w\d\s]+[*]/";
//strPattern = @"/\*.*?\*/"; // Doesn't work
//strPattern = "/\\*.*?\\*/"; // Doesn't work
//strPattern = @"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/ "; // Doesn't work
// http://stackoverflow.com/questions/462843/improving-fixing-a-regex-for-c-style-block-comments
strPattern = @"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/"; // Works !
string strOutput = System.Text.RegularExpressions.Regex.Replace(strInput, strPattern, string.Empty, System.Text.RegularExpressions.RegexOptions.Multiline);
Console.WriteLine(strOutput);
return strOutput;
} // End Function RemoveCstyleComments

I will disappoint all of you. This can't be done with regular expressions. Sure, it's easy to find comments that are not in a string (even the OP could do that); the real problem is a comment marker inside a string. There is a little hope with lookarounds, but that's still not enough. Knowing that there is a preceding quote on the line doesn't guarantee anything. The only thing that guarantees you something is the parity of the quotes (odd or even), and that is something you can't find with a regular expression. So simply go with a non-regular-expression approach.
EDIT:
Here's the c# code:
String sql = "--this is a test\r\nselect stuff where substaff like '--this comment should stay' --this should be removed\r\n";
char[] quotes = { '\'', '"'};
int newCommentLiteral, lastCommentLiteral = 0;
while ((newCommentLiteral = sql.IndexOf("--", lastCommentLiteral)) != -1)
{
int countQuotes = sql.Substring(lastCommentLiteral, newCommentLiteral - lastCommentLiteral).Split(quotes).Length - 1;
if (countQuotes % 2 == 0) //this is a comment, since there's an even number of quotes preceding
{
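int eol = sql.IndexOf("\r\n", newCommentLiteral); //search from the comment, not from the start of the string
if (eol == -1)
eol = sql.Length; //no more newline, meaning end of the string
else
eol += 2; //also remove the line break itself
sql = sql.Remove(newCommentLiteral, eol - newCommentLiteral);
lastCommentLiteral = newCommentLiteral;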
}
else //this is within a string, find string ending and moving to it
{
int singleQuote = sql.IndexOf("'", newCommentLiteral);
if (singleQuote == -1)
singleQuote = sql.Length;
int doubleQuote = sql.IndexOf('"', newCommentLiteral);
if (doubleQuote == -1)
doubleQuote = sql.Length;
lastCommentLiteral = Math.Min(singleQuote, doubleQuote) + 1;
//instead of finding the end of the string you could simply do += 2 but the program will become slightly slower
}
}
Console.WriteLine(sql);
What this does: it finds every comment marker (--). For each one, it checks whether the marker is inside a string by counting the quotes between the current match and the previous one. If the count is even, it's a comment, so it is removed (find the next end of line and remove what's in between). If it's odd, the marker is inside a string, so find the end of the string and move past it. This snippet relies on a weird SQL trick: 'this" is a valid string, even though the two quotes differ. If that's not true for your SQL dialect, you should try a completely different approach. I'll write a program for that too if that's the case, but this one is faster and more straightforward.
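The same quote-parity idea can also be written as a single pass over the characters. This is only a rough sketch, under the assumption that only single-quoted strings matter and that a doubled quote ('') inside a string is an escaped quote; StripLineComments is a made-up name, not part of the snippet above, and the sketch knows nothing about /* */ comments.

```csharp
using System;
using System.Text;

public static class SqlLineComments
{
    // Hypothetical helper: drops "--" comments that start outside a single-quoted string.
    public static string StripLineComments(string sql)
    {
        var sb = new StringBuilder(sql.Length);
        bool inString = false; // true while an odd number of quotes has been seen
        int i = 0;
        while (i < sql.Length)
        {
            char c = sql[i];
            if (c == '\'')
            {
                // Toggling also copes with '' escapes: the string closes and reopens at once.
                inString = !inString;
                sb.Append(c);
                i++;
            }
            else if (!inString && c == '-' && i + 1 < sql.Length && sql[i + 1] == '-')
            {
                // Skip to the end of the line but keep the line break itself.
                while (i < sql.Length && sql[i] != '\r' && sql[i] != '\n')
                    i++;
            }
            else
            {
                sb.Append(c);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

Unlike the snippet above, this keeps the line breaks, so code on the following line is never glued to the line that carried the comment.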

For the simple case you want something like this:
-{2,}.*
The -{2,} matches two or more consecutive dashes.
The .* matches the rest of the line, up to the newline.
For the edge cases, however, it appears that SinistraD is correct that you cannot catch everything. Here is an article about how this can be done in C# with a combination of code and regex.
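To make the limits of the simple pattern concrete, here is a small sketch of how it could be applied with Regex.Replace. StripNaive is a name made up for this illustration. Note that in .NET the dot also matches \r, so with \r\n line endings the \r gets removed along with the comment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveCommentDemo
{
    // Hypothetical helper: applies the simple pattern with no knowledge of SQL strings.
    public static string StripNaive(string sql)
    {
        return Regex.Replace(sql, "-{2,}.*", string.Empty);
    }

    public static void Main()
    {
        // Simple case: the trailing comment is removed, the line break stays.
        Console.WriteLine(StripNaive("SELECT a -- pick a\nFROM t"));
        // Edge case: the dashes inside the string literal are treated as a comment,
        // which truncates the statement.
        Console.WriteLine(StripNaive("SELECT '--x' FROM t"));
    }
}
```

The second call shows why the edge cases need more than this pattern: everything from the first -- inside the literal to the end of the line is lost.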

This seems to work well for me so far; it even ignores comments within strings, such as SELECT '--not a comment--' FROM ATable
private static string removeComments(string sql)
{
string pattern = @"(?<=^ ([^'""] |['][^']*['] |[""][^""]*[""])*) (--.*$|/\*(.|\n)*?\*/)";
return Regex.Replace(sql, pattern, "", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
}
Note: it is designed to eliminate both /**/-style comments as well as -- style. Remove |/\*(.|\n)*?\*/ to get rid of the /**/ checking. Also be sure you are using the RegexOptions.IgnorePatternWhitespace Regex option!!
I wanted to be able to handle double-quotes too, but since T-SQL doesn't support them, you could get rid of |[""][^""]*[""] too.
Adapted from here.
Note (Mar 2015): In the end, I wound up using Antlr, a parser generator, for this project. There may have been some edge cases where the regex didn't work. In the end I was much more confident with the results having used Antlr, and it's worked well.

using System.Text.RegularExpressions;
public static string RemoveSQLCommentCallback(Match SQLLineMatch)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
bool open = false; //opening of SQL String found
char prev_ch = ' ';
foreach (char ch in SQLLineMatch.ToString())
{
if (ch == '\'')
{
open = !open;
}
else if ((!open && prev_ch == '-' && ch == '-'))
{
break;
}
sb.Append(ch);
prev_ch = ch;
}
return sb.ToString().Trim('-');
}
The code
public static void Main()
{
string sqlText = "WHERE DEPT_NAME LIKE '--Test--' AND START_DATE < SYSDATE -- Don't go over today";
//for every matching line call callback func
string result = Regex.Replace(sqlText, ".*--.*", RemoveSQLCommentCallback);
}
The Replace call finds every line that contains a dash-dash sequence and invokes the callback for each match; the callback does the actual parsing.

As a late solution, the simplest way is to do it using ScriptDom-TSqlParser:
// https://michaeljswart.com/2014/04/removing-comments-from-sql/
// http://web.archive.org/web/*/https://michaeljswart.com/2014/04/removing-comments-from-sql/
// Requires "using System.Linq;" for Where/Select.
public static string StripCommentsFromSQL(string SQL)
{
Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser parser =
new Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser(true);
System.Collections.Generic.IList<Microsoft.SqlServer.TransactSql.ScriptDom.ParseError> errors;
Microsoft.SqlServer.TransactSql.ScriptDom.TSqlFragment fragments =
parser.Parse(new System.IO.StringReader(SQL), out errors);
// clear comments
string result = string.Join(
string.Empty,
fragments.ScriptTokenStream
.Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.MultilineComment)
.Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.SingleLineComment)
.Select(x => x.Text));
return result;
}
Alternatively, instead of the Microsoft parser, you can use the ANTLR4 TSqlLexer,
or do it without any parser at all:
private static System.Text.RegularExpressions.Regex everythingExceptNewLines =
new System.Text.RegularExpressions.Regex("[^\r\n]");
// http://drizin.io/Removing-comments-from-SQL-scripts/
// http://web.archive.org/web/*/http://drizin.io/Removing-comments-from-SQL-scripts/
public static string RemoveComments(string input, bool preservePositions, bool removeLiterals = false)
{
//based on http://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
var lineComments = @"--(.*?)\r?\n";
var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
// literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
// there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
//var blockComments = @"/\*(.*?)\*/"; //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
//so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
var nestedBlockComments = @"/\*
(?>
/\* (?<LEVEL>) # On opening push level
|
\*/ (?<-LEVEL>) # On closing pop level
|
(?! /\* | \*/ ) . # Match any char unless the opening and closing strings
)+ # /* or */ in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
\*/";
string noComments = System.Text.RegularExpressions.Regex.Replace(input,
nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
me => {
if (me.Value.StartsWith("/*") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
else if (me.Value.StartsWith("/*") && !preservePositions)
return "";
else if (me.Value.StartsWith("--") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
else if (me.Value.StartsWith("--") && !preservePositions)
return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
return me.Value; // do not remove object identifiers ever
else if (!removeLiterals) // Keep the literal strings
return me.Value;
else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
{
var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
}
else if (removeLiterals && !preservePositions) // wrap completely all literals
return "''";
else
throw new System.NotImplementedException();
},
System.Text.RegularExpressions.RegexOptions.Singleline | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
return noComments;
}

I don't know if C#/VB.net regex is special in some way but traditionally s/--.*// should work.

In PHP, I'm using this code to strip comments from SQL (single-line comments only):
$sqlComments = '#(([\'"`]).*?[^\\\]\2)|((?:\#|--).*?$)\s*|(?<=;)\s+#ms';
/* Commented version
$sqlComments = '#
(([\'"`]).*?[^\\\]\2) # $1 : Skip single & double quoted + backticked expressions
|((?:\#|--).*?$) # $3 : Match single line comments
\s* # Trim after comments
|(?<=;)\s+ # Trim after semi-colon
#msx';
*/
$uncommentedSQL = trim( preg_replace( $sqlComments, '$1', $sql ) );
preg_match_all( $sqlComments, $sql, $comments );
$extractedComments = array_filter( $comments[ 3 ] );
var_dump( $uncommentedSQL, $extractedComments );
To remove all comments see Regex to match MySQL comments

Related

Custom Uppercase on String

Hi, I was trying to write a program that converts certain words in a string to uppercase.
The words to convert are wrapped in a tag like this:
the <upcase>weather</upcase> is very <upcase>hot</upcase>
the result :
the WEATHER is very HOT
my code is like this :
string upKey = "<upcase>";
string lowKey = "</upcase>";
string quote = "the lazy <upcase>fox jump over</upcase> the dog <upcase> something here </upcase>";
int index = quote.IndexOf(upKey);
int indexEnd = quote.IndexOf(lowKey);
while(index!=-1)
{
for (int a = 0; a < index; a++)
{
Console.Write(quote[a]);
}
string upperQuote = "";
for (int b = index + 8; b < indexEnd; b++)
{
upperQuote += quote[b];
}
upperQuote = upperQuote.ToUpper().ToString();
Console.Write(upperQuote);
for (int c = indexEnd+9;c<quote.Length;c++)
{
if (quote[c]=='<')
{
break;
}
Console.Write(quote[c]);
}
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
}
Console.WriteLine();
}
I have also tried this code inside a while loop (while (indexEnd != -1)):
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
but that doesn't work; the program runs into an infinite loop. By the way, I'm a beginner, so please give an answer that I can understand :)

You can use a regular expression for this:
string input = "the <upcase>weather</upcase> is very <upcase>hot</upcase>";
var regex = new Regex("<upcase>(?<theMatch>.*?)</upcase>");
var result = regex.Replace(input, match => match.Groups["theMatch"].Value.ToUpper());
// result will be: "the WEATHER is very HOT"
Here's an explanation taken from here for the regular expression used above:
<upcase> matches the characters <upcase> literally (case sensitive)
(?<theMatch>.*?) Named capturing group theMatch
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
< matches the characters < literally
/ matches the character / literally
upcase> matches the characters upcase> literally (case sensitive)

The following will work as long as there are only matching tags and none of them are nested.
public static string Upper(string str)
{
const string start = "<upcase>";
const string end = "</upcase>";
var builder = new StringBuilder();
// Find the first start tag
int startIndex = str.IndexOf(start);
// If no start tag found then return the original
if (startIndex == -1)
return str;
// Append the part before the first tag as is
builder.Append(str.Substring(0, startIndex));
// Continue as long as we find another start tag.
while (startIndex != -1)
{
// Find the end tag for the current start tag
var endIndex = str.IndexOf(end, startIndex);
// Append the text between the start and end as upper case.
builder.Append(
str.Substring(
startIndex + start.Length,
endIndex - startIndex - start.Length).ToUpper());
// Find the next start tag.
startIndex = str.IndexOf(start, endIndex);
// Append the part after the end tag, but before the next start as is
builder.Append(
str.Substring(
endIndex + end.Length,
(startIndex == -1 ? str.Length : startIndex) - endIndex - end.Length));
}
return builder.ToString();
}

I'm not rewriting your code. Just answering your (main) question:
You need to keep a variable of the index you're at, and check for IndexOf from there only (See MSDN). Something like this:
int index = 0;
while (quote.IndexOf(upKey, index) != -1)
{
//Your code, including updating the value of index.
}
(I didn't check this on Visual Studio. This is just to point you in the direction that I think you're looking for.)
The reason for the infinite loop is that you're always testing IndexOf of the same index. Perhaps you mean to have quote.IndexOf(upKey, index += 1); which would change the value of index?
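To illustrate the point about keeping track of the index, here is one possible shape for the loop. It is a sketch only: the names UpcaseTags, Apply and pos are invented for this example, and it assumes the tags are not nested.

```csharp
using System;
using System.Text;

public static class UpcaseTags
{
    // Hypothetical helper: uppercases the text between <upcase> and </upcase>.
    public static string Apply(string quote)
    {
        const string upKey = "<upcase>";
        const string lowKey = "</upcase>";
        var sb = new StringBuilder();
        int pos = 0; // everything before pos has already been copied to sb
        while (true)
        {
            int start = quote.IndexOf(upKey, pos, StringComparison.Ordinal);
            if (start == -1) break;
            int end = quote.IndexOf(lowKey, start + upKey.Length, StringComparison.Ordinal);
            if (end == -1) break; // unmatched opening tag: leave the rest untouched

            sb.Append(quote, pos, start - pos); // plain text before the tag
            int innerStart = start + upKey.Length;
            sb.Append(quote.Substring(innerStart, end - innerStart).ToUpperInvariant());

            pos = end + lowKey.Length; // always moves forward, so the loop terminates
        }
        sb.Append(quote, pos, quote.Length - pos); // whatever follows the last tag
        return sb.ToString();
    }
}
```

The important part is that pos is advanced past the closing tag on every iteration, so each IndexOf call starts searching after the text that was already handled.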

The way to go here is probably to use Regex, but these easy parsing exercises are always fun to do manually. This can be easily solved using a very simple state machine.
What states can we have when dealing with strings of this nature? I can think of 4:
We are either parsing normal text
Or we are parsing an opening format tag '<...>'
Or we are parsing a closing format tag '</...>'
Or we are parsing text to be formatted between tags
I can't think of any other states. Now we need to think about the normal flow / transition between states. What should happen when we parse a string with the correct format?
Parser starts up expecting normal text. That is easy to understand.
If expecting normal text we encounter a '<' then the parser should switch to parsing opening format tag state. There is no other valid state transition.
If in parsing opening format tag state we encounter a '>' then the parser should switch to parsing text to be formatted. There is no other valid state transition.
If in parsing text to be formatted we encounter a '<' then the parser should switch to parsing closing tag. Again, there is no other valid state transition.
If in parsing closing tag we encounter a '>' then the parser should switch to normal text. Once more, there is no other valid transition. Note that we are disallowing nested tags.
Ok, so that seems pretty easy to understand. What do we need to implement this?
First we'll need something to represent the parsing states. A good old enum will do:
private enum ParsingState
{
UnformattedText,
OpenTag,
CloseTag,
FormattedText,
}
Now we need some string buffers to keep track of the final formatted string, the current format tag we are parsing and finally the substring we need to format. We will use several StringBuilder's for these as we don't know how long these buffers are and how many concatenations will be performed:
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
We will also need to keep track of the parser's state and the current active tag if any (so we can make sure that the parsed closing tag matches the current active tag):
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
And now we are good to go, but before we do, can we generalize this so it works with any format tag?
Yes we can, we just need to tell the parser what to do for each supported tag. We can do this easily by passing along a Dictionary that ties each tag to the action it should perform. We do this the following way:
var formatter = new Dictionary<string, Func<string, string>>();
formatter.Add("upcase", s => s.ToUpperInvariant());
formatter.Add("lcase", s => s.ToLowerInvariant());
Great! Now our implementation could be the following:
public static string Parse(this string str, Dictionary<string, Func<string,string>> formatter)
{
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
foreach (var c in str)
{
switch (state)
{
case ParsingState.UnformattedText:
{
if (c != '<')
{
formattedStringBuffer.Append(c);
}
else
{
state = ParsingState.OpenTag;
}
break;
}
case ParsingState.OpenTag:
{
if (c != '>')
{
tagBuffer.Append(c);
}
else
{
state = ParsingState.FormattedText;
activeFormatTag = tagBuffer.ToString();
tagBuffer.Clear();
}
break;
}
case ParsingState.FormattedText:
{
if (c != '<')
{
formatBuffer.Append(c);
}
else
{
state = ParsingState.CloseTag;
}
break;
}
case ParsingState.CloseTag:
{
if (c!='>')
{
tagBuffer.Append(c);
}
else
{
var expectedTag = $"/{activeFormatTag}";
var tag = tagBuffer.ToString();
if (tag != expectedTag)
throw new FormatException($"Expected closing tag not found: <{expectedTag}>.");
if (formatter.ContainsKey(activeFormatTag))
{
var formatted = formatter[activeFormatTag](formatBuffer.ToString());
formattedStringBuffer.Append(formatted);
tagBuffer.Clear();
formatBuffer.Clear();
state = ParsingState.UnformattedText;
}
else
throw new FormatException($"Format tag <{activeFormatTag}> not recognized.");
}
break;
}
}
}
if (state != ParsingState.UnformattedText)
throw new FormatException($"Bad format in specified string '{str}'");
return formattedStringBuffer.ToString();
}
Is it the most elegant solution? No, Regex will do a much better job, but since you are a beginner I would not recommend that you start solving these kinds of problems that way; you'll learn a whole lot more by solving them manually. You'll have plenty of time to learn Regex later on.

Replacing characters in a string with another string

So what I am trying to do is as follows :
example of a string is A4PC
I am trying to replace, for example, any occurrence of "A" with "[A4]", and similarly any occurrence of "4" with "[A4]", so that I would get
"[A4][A4]PC"
I tried doing a normal Replace on the string but found out I got
"[A[A4]]PC"
string badWordAllVariants =
restriction.Value.Replace("A", "[A4]").Replace("4", "[A4]")
since I have two A's in a row causing an issue.
So I was thinking it would be better rather than the replace on the string I need to do it on a character per character basis and then build up a string again.
Is there anyway in Linq or so to do something like this ?

You don't need any LINQ here - String.Replace works just fine:
string input = "AAPC";
string result = input.Replace("A", "[A4]"); // "[A4][A4]PC"
UPDATE: For your updated requirements I suggest to use regular expression replace
string input = "A4PC";
var result = Regex.Replace(input, "A|4", "[A4]"); // "[A4][A4]PC"

This works well for me:
string x = "AAPC";
string replace = x.Replace("A", "[A4]");
EDIT:
Based on the updated question, the issue is the second replacement. In order to replace multiple strings you will want to do this sequentially:
var original = "AAPC";
// add arbitrary room to allow for more new characters
StringBuilder resultString = new StringBuilder(original.Length + 10);
foreach (char currentChar in original.ToCharArray())
{
if (currentChar == 'A') resultString.Append("[A4]");
else if (currentChar == '4') resultString.Append("[A4]");
else resultString.Append(currentChar);
}
string result = resultString.ToString();
You can run this routine with any replacements you want to make (in this case the characters 'A' and '4') and it should work. If you want to replace strings rather than single characters, the code would be similar in structure, but you would need to "look ahead" and probably use a for loop. Hopefully this helps!
By the way - you want to use a string builder here and not string concatenation, because strings are immutable, which means a new string gets allocated every time you append in the loop. (Not good!)
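The per-character routine can be generalized with a lookup table, so the replacements are data rather than if/else branches. A small sketch under that assumption; CharReplacer and ReplaceChars are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharReplacer
{
    // Hypothetical helper: replaces each character that has an entry in the map.
    // Every input character is looked at exactly once, so text that was just
    // inserted (such as the 4 inside "[A4]") is never replaced again.
    public static string ReplaceChars(string input, IReadOnlyDictionary<char, string> map)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (map.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Usage would look like this:

```csharp
var map = new Dictionary<char, string> { { 'A', "[A4]" }, { '4', "[A4]" } };
string result = CharReplacer.ReplaceChars("A4PC", map); // "[A4][A4]PC"
```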

I think this should do the trick
string str = "AA4PC";
string result = Regex.Replace(str, @"(?<Before>[^A4]?)(?<Value>A|4)(?<After>[^A4]?)", (m) =>
{
string before = m.Groups["Before"].Value;
string after = m.Groups["After"].Value;
string value = m.Groups["Value"].Value;
if (before != "[" || after != "]")
{
return "[A4]";
}
return m.ToString();
});
It replaces each A and 4 that hasn't already been replaced with [A4].

How can I convert PascalCase to split words?

I have variables containing text such as:
ShowSummary
ShowDetails
AccountDetails
Is there a simple function / method in C# that I can apply to these variables to yield:
"Show Summary"
"Show Details"
"Account Details"
I was wondering about an extension method but I've never coded one and I am not sure where to start.

See this post by Jon Galloway and one by Phil

In the application I am currently working on, we have a delegate based split extension method. It looks like so:
public static string Split(this string target, Func<char, char, bool> shouldSplit, string splitFiller = " ")
{
if (target == null)
throw new ArgumentNullException("target");
if (shouldSplit == null)
throw new ArgumentNullException("shouldSplit");
if (String.IsNullOrEmpty(splitFiller))
throw new ArgumentNullException("splitFiller");
int targetLength = target.Length;
// We know the resulting string is going to be atleast the length of target
StringBuilder result = new StringBuilder(targetLength);
result.Append(target[0]);
// Loop from the second character to the last character.
for (int i = 1; i < targetLength; ++i)
{
char firstChar = target[i - 1];
char secondChar = target[i];
if (shouldSplit(firstChar, secondChar))
{
// If a split should be performed add in the filler
result.Append(splitFiller);
}
result.Append(secondChar);
}
return result.ToString();
}
It could then be used as follows:
string showSummary = "ShowSummary";
string spacedString = showSummary.Split((c1, c2) => Char.IsLower(c1) && Char.IsUpper(c2));
This allows you to split on any conditions between two chars, and insert a filler of your choice (default of a space).

The best would be to iterate through each character within the string. Check if the character is upper case. If so, insert a space character before it. Otherwise, move onto the next character.
Also, ideally start from the second character so that a space would not be inserted before the first character.
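As a rough sketch of that loop, written as an extension method since the question asked about one (PascalCaseExtensions and SplitPascalCase are names chosen for this example):

```csharp
using System;
using System.Text;

public static class PascalCaseExtensions
{
    // Hypothetical helper: inserts a space before every upper-case character
    // except the first character of the string.
    public static string SplitPascalCase(this string value)
    {
        if (string.IsNullOrEmpty(value))
            return value;

        var sb = new StringBuilder(value.Length + 4);
        sb.Append(value[0]); // start from the second character, as suggested
        for (int i = 1; i < value.Length; i++)
        {
            if (char.IsUpper(value[i]))
                sb.Append(' ');
            sb.Append(value[i]);
        }
        return sb.ToString();
    }
}
```

With this in scope, "ShowSummary".SplitPascalCase() gives "Show Summary". Be aware that runs of capitals are split letter by letter, so an acronym such as "HTMLParser" becomes "H T M L Parser"; handling that would need an extra check on the neighbouring characters.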

Try something like this (it needs using System.Linq;):
var word = "AccountDetails";
word = string.Join(string.Empty,word
.Select(c => new string(c, 1)).Select(c => char.IsUpper(c[0]) ? " " + c : c)).Trim();

Parsing CSV File enclosed with quotes in C#

I've seen lots of samples in parsing CSV File. but this one is kind of annoying file...
so how do you parse this kind of CSV
"1",1/2/2010,"The sample ("adasdad") asdada","I was pooping in the door "Stinky", so I'll be damn","AK"

The best answer in most cases is probably @Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.
The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char by char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.
So I rewrote for this case as a string extension. I think this is close.
Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.
The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As @Vash points out, you're better off following some standard and coding a little more OFfensively.
But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to @Vash's comment!
Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.
namespace YourFavoriteNamespace
{
using System;
using System.Collections.Generic;
using System.Text;
public static class Extensions
{
public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"',
bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
{
Queue<string> qReturn = new Queue<string>();
StringBuilder stringBuilder = new StringBuilder();
bool bInEscapeVal = false;
for (int i = 0; i < valToSplit.Length; i++)
{
if (!bInEscapeVal)
{
// Escape values must come immediately after a split.
// abc,"b,ca",cab has an escaped comma.
// abc,b"ca,c"ab does not.
if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || (i == 0 || (i != 0 && splittingChar == valToSplit[i - 1]))))
{
bInEscapeVal = true; // not capturing escapeChar as part of value; easy enough to change if need be.
}
else if (splittingChar == valToSplit[i])
{
qReturn.Enqueue(stringBuilder.ToString());
stringBuilder = new StringBuilder();
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
else
{
// Can't use switch b/c we're comparing to a variable, I believe.
if (escapeChar == valToSplit[i])
{
// Repeated escape always reduces to one escape char in this logic.
// So if you wanted "I'm ""double quote"" crazy!" to come out with
// the double double quotes, you're toast.
if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
{
i++;
stringBuilder.Append(escapeChar);
}
else if (!strictEscapeToSplitEvaluation)
{
bInEscapeVal = false;
}
// *** STINKY CONDITION ***
// Kinda defense, since only `", ` really makes sense.
else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
{
i = i+2;
stringBuilder.Append("\", ");
}
// *** EO STINKY CONDITION ***
else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
{
bInEscapeVal = false;
}
else
{
stringBuilder.Append(escapeChar);
}
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
}
// NOTE: The `captureEndingNull` flag is not tested.
// Catch null final entry? "abc,cab,bca," could be four entries, with the last an empty string.
if ((captureEndingNull && splittingChar == valToSplit[valToSplit.Length-1]) || (stringBuilder.Length > 0))
{
qReturn.Enqueue(stringBuilder.ToString());
}
return qReturn;
}
}
}
Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)
[Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.
If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.
Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.
Enjoy. ;^)

I very strongly recommend using TextFieldParser. Hand-coded parsers that use String.Split or regular expressions almost invariably mishandle things like quoted fields that have embedded quotes or embedded separators.
I would be surprised, though, if it handled your particular example. As others have said, that line is, at best, ambiguous.
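For reference, a minimal sketch of how TextFieldParser is typically set up for a well-formed line. It assumes the Microsoft.VisualBasic assembly is available to the project; CsvDemo and ReadRows are names made up for the example, and the sample line used below is a corrected version of the one in the question, with embedded quotes doubled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvDemo
{
    // Hypothetical helper: reads all rows of a comma-delimited text into string arrays.
    public static List<string[]> ReadRows(string csvText)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(csvText)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // commas inside quotes stay in the field

            while (!parser.EndOfData)
            {
                // A line the parser cannot make sense of is reported as an exception here.
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

Calling ReadRows on a line such as "1",1/2/2010,"The sample (""adasdad"") asdada","AK" should yield one row with four fields. The original line from the question, with its undoubled inner quotes, is the kind of input I would expect it to reject.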

Split based on
",
I would use MyString.IndexOf("\",")
and then substring the parts. Other than that, I'm sure someone has written a CSV parser out there that can handle this :)

I found a way to parse this malformed CSV. I looked for a pattern and found one: I first replace every "," sequence with a character such as "¤" and then split on it.
from this:
"Annoying","CSV File","poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby","yeah!"
to this:
"Annoying¤CSV File¤poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby¤yeah!"
then split it:
ArrayA[0]: "Annoying //this value will be trimmed by replace("\"","") same as the array[4]
ArrayA[1]: CSV File
ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
ArrayA[3]: yeah!"
After splitting, I replace the sequences ", and ," in ArrayA[2] with ¤ and then split again
from this
ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
to this
ArrayA[2]: poop@mypants.com¤1999,01-20-2001¤oh,boy¤01-20-2001¤yeah baby
then split it again and would turn to this
ArrayB[0]: poop@mypants.com
ArrayB[1]: 1999,01-20-2001
ArrayB[2]: oh,boy
ArrayB[3]: 01-20-2001
ArrayB[4]: yeah baby
and lastly... I'll split the Year only and the date from ArrayB[1] with , to ArrayC
It's tedious but there's no other way to do it...

There is another open source library, Cinchoo ETL, that handles quoted strings fine. Here is sample code:
string csv = @"""1"",1/2/2010,""The sample(""adasdad"") asdada"",""I was pooping in the door ""Stinky"", so I'll be damn"",""AK""";
using (var r = ChoCSVReader.LoadText(csv)
.QuoteAllFields()
)
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
Output:
[Count: 5]
Key: Column1 [Type: Int64]
Value: 1
Key: Column2 [Type: DateTime]
Value: 1/2/2010 12:00:00 AM
Key: Column3 [Type: String]
Value: The sample(adasdad) asdada
Key: Column4 [Type: String]
Value: I was pooping in the door Stinky, so I'll be damn
Key: Column5 [Type: String]
Value: AK

You could split the string by ",". It is recommended that each cell value in the CSV file be enclosed in quotes, like "1","2","3".....

I don't see how you could if each line is different. This line is malformed CSV. Quotes contained within a value must be doubled, as shown below. I can't even tell for sure where the values should be terminated.
"1",1/2/2010,"The sample (""adasdad"") asdada","I was pooping in the door ""Stinky"", so I'll be damn","AK"
Here's my code to parse a CSV file but I don't see how any code would know how to handle your line because it's malformed.
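As an independent illustration of the doubling rule (this is not the answer author's parser), here is a minimal sketch that splits one well-formed CSV line. The names CsvLineSketch and SplitCsvLine are hypothetical, and the sketch assumes the record contains no line breaks.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLineSketch
{
    // Splits a single CSV line on commas, honoring quoted fields.
    // Inside a quoted field, "" stands for one literal quote.
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote: keep one, skip the other
                    i++;
                }
                else
                {
                    inQuotes = false;    // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());  // last field has no trailing comma
        return fields;
    }
}
```

With the corrected line above, the third field comes back as The sample ("adasdad") asdada. With the original malformed line, the quote state flips in the wrong places, which is exactly why no parser can guess the intended boundaries.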

You might want to give CsvReader a try. It handles quoted strings fine, so you will just have to remove the leading and trailing quotes.
It will fail if your strings contain a comma. To avoid this, the quotes need to be doubled, as said in other answers.

As no (decent) .csv parser can parse non-csv-data correctly, the task isn't to parse the data, but to fix the file(s) (and then to parse the correct data).
To fix the data you need a list of bad rows (to be sent to the person responsible for the garbage for manual editing). To get such a list, you can
use Access with a correct import specification to import the file. You'll get a list of import failures.
write a script/program that opens the file via the OLEDB text driver.
Sample file:
"Id","Remark","DateDue"
1,"This is good",20110413
2,"This is ""good""",20110414
3,"This is ""good"","bad",and "ugly",,20110415
4,"This is ""good""" again,20110415
Sample SQL/Result:
SELECT * FROM [badcsv01.csv]
Id Remark DateDue
1 This is good 4/13/2011
2 This is "good" 4/14/2011
3 This is "good", NULL
4 This is "good" again 4/15/2011
SELECT * FROM [badcsv01.csv] WHERE DateDue Is Null
Id Remark DateDue
3 This is "good", NULL

First, do it for the column names:
DataTable pbResults = new DataTable();
OracleDataAdapter oda = new OracleDataAdapter(cmd);
oda.Fill(pbResults);
StringBuilder sb1 = new StringBuilder();
StringBuilder sb2 = new StringBuilder();
IEnumerable<string> columnNames = pbResults.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
sb1.Append(string.Join("\"" + "," + "\"", columnNames));
sb2.Append("\"");
sb2.Append(sb1);
sb2.AppendLine("\"");
Second, do it for each row:
foreach (DataRow row in pbResults.Rows)
{
IEnumerable<string> fields = row.ItemArray.Select(field => field.ToString());
sb2.Append("\"");
sb2.Append(string.Join("\"" + "," + "\"", fields));
sb2.AppendLine("\"");
}
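The code above wraps every value in quotes but does not double quotes that occur inside a value, which the other answers point out is required for well-formed CSV. A minimal sketch of a helper that does both, with hypothetical names CsvWriteSketch, QuoteField and BuildRow:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CsvWriteSketch
{
    // Wraps a value in double quotes and doubles any quotes inside it.
    // null becomes an empty quoted field.
    public static string QuoteField(object value)
    {
        string text = value == null ? "" : value.ToString();
        return "\"" + text.Replace("\"", "\"\"") + "\"";
    }

    // Joins the quoted fields of one row with commas.
    public static string BuildRow(IEnumerable<object> values)
    {
        return string.Join(",", values.Select(v => QuoteField(v)));
    }
}
```

In the loop above this could be used as sb2.AppendLine(CsvWriteSketch.BuildRow(row.ItemArray)), since ItemArray is an object array.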

Escape command line arguments in c#

Short version:
Is it enough to wrap the argument in quotes and escape \ and " ?
Code version
I want to pass the command line arguments string[] args to another process using ProcessStartInfo.Arguments.
ProcessStartInfo info = new ProcessStartInfo();
info.FileName = Application.ExecutablePath;
info.UseShellExecute = true;
info.Verb = "runas"; // Provides Run as Administrator
info.Arguments = EscapeCommandLineArguments(args);
Process.Start(info);
The problem is that I get the arguments as an array and must merge them into a single string. An argument could be crafted to trick my program:
my.exe "C:\Documents and Settings\MyPath \" --kill-all-humans \" except fry"
According to this answer I have created the following function to escape a single argument, but I might have missed something.
private static string EscapeCommandLineArguments(string[] args)
{
string arguments = "";
foreach (string arg in args)
{
arguments += " \"" +
arg.Replace ("\\", "\\\\").Replace("\"", "\\\"") +
"\"";
}
return arguments;
}
Is this good enough or is there any framework function for this?

It's more complicated than that though!
I was having a related problem (writing a front-end .exe that calls the back-end with all parameters passed on, plus some extra ones), so I looked at how people do that and ran into your question. Initially all seemed good doing it as you suggest: arg.Replace(@"\", @"\\").Replace(quote, @"\" + quote).
However, when I call with the arguments c:\temp a\\b, they get passed as c:\temp and a\\b, which leads to the back-end being called with "c:\\temp" "a\\\\b". That is incorrect, because there it will be two arguments c:\\temp and a\\\\b, not what we wanted! We have been overzealous in escaping (Windows is not Unix!).
So I read http://msdn.microsoft.com/en-us/library/system.environment.getcommandlineargs.aspx in detail, and it actually describes how those cases are handled: backslashes are treated as escapes only in front of a double quote.
There is a twist in how multiple \ are handled, and the explanation can leave one dizzy for a while. I'll try to rephrase the unescape rule here: say we have a substring of N \ followed by ". When unescaping, we replace that substring with int(N/2) \ and, iff N was odd, add a " at the end.
The encoding for such decoding goes like this: for an argument, find each substring of 0-or-more \ followed by " and replace it with twice as many \ followed by \". Which we can do like so:
s = Regex.Replace(arg, @"(\\*)" + "\"", @"$1$1\" + "\"");
That's all...
PS. ... not. Wait, wait - there is more! :)
We did the encoding correctly, but there is a twist because you are enclosing all parameters in double quotes (in case some of them contain spaces). There is a boundary issue: if a parameter ends in \, adding " after it will break the meaning of the closing quote. For example, c:\one\ two is parsed into c:\one\ and two, then re-assembled to "c:\one\" "two", which will be (mis)understood as one argument c:\one" two (I tried that, I am not making it up). So in addition we need to check whether the argument ends in \ and, if so, double the number of backslashes at the end, like so:
s = "\"" + Regex.Replace(s, @"(\\+)$", @"$1$1") + "\"";
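The two steps above can be combined into one helper. This is a minimal sketch under the rules described in this answer; the names ArgSketch and QuoteArgument are hypothetical. It uses \z (absolute end of input) where the answer uses $, an intentional deviation so that backslashes before a trailing newline are not doubled by mistake.

```csharp
using System.Text.RegularExpressions;

public static class ArgSketch
{
    // Returns the argument wrapped in double quotes so that the Windows
    // argument-splitting rules described above give back the original value.
    public static string QuoteArgument(string arg)
    {
        // Step 1: N backslashes followed by a quote become 2N+1 backslashes and the quote.
        string s = Regex.Replace(arg, @"(\\*)" + "\"", @"$1$1\" + "\"");

        // Step 2: double any trailing backslashes so they do not escape
        // the closing quote added below.
        s = Regex.Replace(s, @"(\\+)\z", @"$1$1");

        return "\"" + s + "\"";
    }
}
```

For example, the argument c:\one\ becomes "c:\one\\", while a\\b stays "a\\b" because its backslashes are not followed by a quote. A full command line is then the quoted arguments joined with spaces.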

My answer was similar to Nas Banov's answer but I wanted double quotes only if necessary.
Cutting out extra unnecessary double quotes
My code avoids putting unnecessary double quotes around every argument, which is important when you are getting close to the character limit for parameters.
/// <summary>
/// Encodes an argument for passing into a program
/// </summary>
/// <param name="original">The value that should be received by the program</param>
/// <returns>The value which needs to be passed to the program for the original value
/// to come through</returns>
public static string EncodeParameterArgument(string original)
{
if( string.IsNullOrEmpty(original))
return original;
string value = Regex.Replace(original, @"(\\*)" + "\"", @"$1\$0");
value = Regex.Replace(value, @"^(.*\s.*?)(\\*)$", "\"$1$2$2\"");
return value;
}
// This is an EDIT
// Note that this version does the same but handles new lines in the arguments
public static string EncodeParameterArgumentMultiLine(string original)
{
if (string.IsNullOrEmpty(original))
return original;
string value = Regex.Replace(original, @"(\\*)" + "\"", @"$1\$0");
value = Regex.Replace(value, @"^(.*\s.*?)(\\*)$", "\"$1$2$2\"", RegexOptions.Singleline);
return value;
}
explanation
To escape the backslashes and double quotes correctly, you can replace every run of zero or more backslashes followed by a double quote with:
string value = Regex.Replace(original, @"(\\*)" + "\"", @"\$1$0");
That is, one extra backslash plus twice the original backslashes, followed by the original double quote: '\' + originalbackslashes + originalbackslashes + '"'. I used $1$0 since $0 already contains the original backslashes and the original double quote, which makes the replacement nicer to read.
value = Regex.Replace(value, @"^(.*\s.*?)(\\*)$", "\"$1$2$2\"");
This can only ever match an entire line that contains whitespace.
If it matches, it adds double quotes to the beginning and end.
If there were originally backslashes at the end of the argument, they will not have been escaped; now that there is a double quote at the end, they need to be. So they are doubled, which escapes them all and prevents them from unintentionally escaping the final double quote.
It does a minimal match for the first section so that the last .*? doesn't eat into the final backslashes.
Output
So these inputs produce the following outputs (each input line is followed by its output):
hello
hello
\hello\12\3\
\hello\12\3\
hello world
"hello world"
\"hello\"
\\\"hello\\\"
\"hello\ world
"\\\"hello\ world"
\"hello\\\ world\
"\\\"hello\\\ world\\"
hello world\\
"hello world\\\\"

I have ported a C++ function from the article "Everyone quotes command line arguments the wrong way".
It works fine, but note that cmd.exe interprets the command line differently. If (and only if, as the original author of the article noted) your command line will be interpreted by cmd.exe, you should also escape shell metacharacters.
/// <summary>
/// This routine appends the given argument to a command line such that
/// CommandLineToArgvW will return the argument string unchanged. Arguments
/// in a command line should be separated by spaces; this function does
/// not add these spaces.
/// </summary>
/// <param name="argument">Supplies the argument to encode.</param>
/// <param name="force">
/// Supplies an indication of whether we should quote the argument even if it
/// does not contain any characters that would ordinarily require quoting.
/// </param>
private static string EncodeParameterArgument(string argument, bool force = false)
{
if (argument == null) throw new ArgumentNullException(nameof(argument));
// Unless we're told otherwise, don't quote unless we actually
// need to do so --- hopefully avoid problems if programs won't
// parse quotes properly
if (force == false
&& argument.Length > 0
&& argument.IndexOfAny(" \t\n\v\"".ToCharArray()) == -1)
{
return argument;
}
var quoted = new StringBuilder();
quoted.Append('"');
var numberBackslashes = 0;
foreach (var chr in argument)
{
switch (chr)
{
case '\\':
numberBackslashes++;
continue;
case '"':
// Escape all backslashes and the following
// double quotation mark.
quoted.Append('\\', numberBackslashes*2 + 1);
quoted.Append(chr);
break;
default:
// Backslashes aren't special here.
quoted.Append('\\', numberBackslashes);
quoted.Append(chr);
break;
}
numberBackslashes = 0;
}
// Escape all backslashes, but let the terminating
// double quotation mark we add below be interpreted
// as a metacharacter.
quoted.Append('\\', numberBackslashes*2);
quoted.Append('"');
return quoted.ToString();
}
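For the cmd.exe case mentioned above, the usual approach is to prefix each shell metacharacter with a caret after the arguments have been quoted. The sketch below assumes the metacharacter set ( ) % ! ^ " < > & | and uses the hypothetical names CmdSketch and EscapeForCmd. Treat the character list as an assumption and check it against the article before relying on it.

```csharp
using System.Text;

public static class CmdSketch
{
    // Characters assumed to be special to cmd.exe (an assumption, see above).
    private const string MetaChars = "()%!^\"<>&|";

    // Prefixes every assumed metacharacter with ^ so that cmd.exe passes it
    // through literally. Apply this to the already quoted command line.
    public static string EscapeForCmd(string commandLine)
    {
        var sb = new StringBuilder(commandLine.Length * 2);
        foreach (char c in commandLine)
        {
            if (MetaChars.IndexOf(c) >= 0)
            {
                sb.Append('^');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Only use this when the string really goes through cmd.exe. A process started directly would receive the carets as part of its arguments.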

I was running into issues with this, too. Instead of unparsing args, I went with taking the full original commandline and trimming off the executable. This had the additional benefit of keeping whitespace in the call, even if it isn't needed/used. It still has to chase escapes in the executable, but that seemed easier than the args.
var commandLine = Environment.CommandLine;
var argumentsString = "";
if(args.Length > 0)
{
// Re-escaping args to be the exact same as they were passed is hard and misses whitespace.
// Use the original command line and trim off the executable to get the args.
var argIndex = -1;
if(commandLine[0] == '"')
{
//Double-quotes mean we need to dig to find the closing double-quote.
var backslashPending = false;
var secondDoublequoteIndex = -1;
for(var i = 1; i < commandLine.Length; i++)
{
if(backslashPending)
{
backslashPending = false;
continue;
}
if(commandLine[i] == '\\')
{
backslashPending = true;
continue;
}
if(commandLine[i] == '"')
{
secondDoublequoteIndex = i + 1;
break;
}
}
argIndex = secondDoublequoteIndex;
}
else
{
// No double-quotes, so args begin after first whitespace.
argIndex = commandLine.IndexOf(" ", System.StringComparison.Ordinal);
}
if(argIndex != -1)
{
argumentsString = commandLine.Substring(argIndex + 1);
}
}
Console.WriteLine("argumentsString: " + argumentsString);

I published a small project on GitHub that handles most issues with command line encoding/escaping:
https://github.com/ericpopivker/Command-Line-Encoder
There is a CommandLineEncoder.Utils.cs class, as well as Unit Tests that verify the Encoding/Decoding functionality.

I wrote you a small sample to show you how to use escape chars in command line.
public static string BuildCommandLineArgs(List<string> argsList)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (string arg in argsList)
{
sb.Append("\"\"" + arg.Replace("\"", @"\" + "\"") + "\"\" ");
}
if (sb.Length > 0)
{
sb = sb.Remove(sb.Length - 1, 1);
}
return sb.ToString();
}
And here is a test method:
List<string> myArgs = new List<string>();
myArgs.Add("test\"123"); // test"123
myArgs.Add("test\"\"123\"\"234"); // test""123""234
myArgs.Add("test123\"\"\"234"); // test123"""234
string cmargs = BuildCommandLineArgs(myArgs);
// result: ""test\"123"" ""test\"\"123\"\"234"" ""test123\"\"\"234""
// when you pass this result to your app, you will get this args list:
// test"123
// test""123""234
// test123"""234
The point is to wrap each arg in double-double quotes ( ""arg"" ) and to replace every quote inside the arg value with an escaped quote ( test\"123 ).

static string BuildCommandLineFromArgs(params string[] args)
{
if (args == null)
return null;
string result = "";
if (Environment.OSVersion.Platform == PlatformID.Unix
||
Environment.OSVersion.Platform == PlatformID.MacOSX)
{
foreach (string arg in args)
{
result += (result.Length > 0 ? " " : "")
+ arg
.Replace(@"\", @"\\") // escape backslashes first so the ones added below are not doubled
.Replace(@" ", @"\ ")
.Replace("\t", "\\\t")
.Replace(@"""", @"\""")
.Replace(@"<", @"\<")
.Replace(@">", @"\>")
.Replace(@"|", @"\|")
.Replace(@"#", @"\#")
.Replace(@"&", @"\&");
}
}
else //Windows family
{
bool enclosedInApo, wasApo;
string subResult;
foreach (string arg in args)
{
enclosedInApo = arg.LastIndexOfAny(
new char[] { ' ', '\t', '|', '#', '^', '<', '>', '&'}) >= 0;
wasApo = enclosedInApo;
subResult = "";
for (int i = arg.Length - 1; i >= 0; i--)
{
switch (arg[i])
{
case '"':
subResult = @"\""" + subResult;
wasApo = true;
break;
case '\\':
subResult = (wasApo ? @"\\" : @"\") + subResult;
break;
default:
subResult = arg[i] + subResult;
wasApo = false;
break;
}
}
result += (result.Length > 0 ? " " : "")
+ (enclosedInApo ? "\"" + subResult + "\"" : subResult);
}
}
return result;
}

An Alternative Approach
If you're passing a complex object such as nested JSON and you have control over the system that's receiving the command line arguments, it's far easier to just encode the command line arg/s as base64 and then decode them from the receiving system.
See here: Encode/Decode String to/from Base64
Use Case: I needed to pass a JSON object that contained an XML string in one of the properties which was overly complicated to escape. This solved it.
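A minimal sketch of that round trip, assuming UTF-8 text and that you control both the sending and the receiving program. Base64ArgSketch, Encode and Decode are hypothetical names.

```csharp
using System;
using System.Text;

public static class Base64ArgSketch
{
    // Sender side: turn any text (JSON, XML, ...) into a Base64 token.
    // The token contains only letters, digits, '+', '/' and '=',
    // so it has no spaces, quotes or backslashes to escape.
    public static string Encode(string value)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
    }

    // Receiver side: recover the original text from the token.
    public static string Decode(string encoded)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
    }
}
```

The sender would set info.Arguments = Base64ArgSketch.Encode(json) and the receiver would call Base64ArgSketch.Decode(args[0]). The cost is that the encoded text is about a third longer, which matters near the command line length limit.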

Does a nice job of adding arguments, but doesn't escape. Added comment in method where escape sequence should go.
public static string ApplicationArguments()
{
List<string> args = Environment.GetCommandLineArgs().ToList();
args.RemoveAt(0); // remove executable
StringBuilder sb = new StringBuilder();
foreach (string s in args)
{
// todo: add escape double quotes here
sb.Append(string.Format("\"{0}\" ", s)); // wrap all args in quotes
}
return sb.ToString().Trim();
}

Copy the sample function from this URL:
http://csharptest.net/529/how-to-correctly-escape-command-line-arguments-in-c/index.html
You can then build the command line to execute, for example like this:
String cmdLine = EscapeArguments(Environment.GetCommandLineArgs().Skip(1).ToArray());
Skip(1) skips the executable name.


Regex to remove single-line SQL comments (--) - c#

I don't know if C#/VB.net regex is special in some way but traditionally s/--.*// should work.
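A plain --.* replacement would also cut into string literals such as 'hello --Test -- World'. One common workaround, sketched here as an assumption rather than a complete SQL lexer, is to let the regex match either a quoted string or a comment and only delete the comments. SqlCommentSketch and StripLineComments are hypothetical names, and block comments and quoted identifiers are not handled.

```csharp
using System.Text.RegularExpressions;

public static class SqlCommentSketch
{
    // First alternative: a single-quoted SQL string, where '' is an escaped quote.
    // Second alternative: a -- comment running to the end of the line.
    private static readonly Regex StringOrComment =
        new Regex(@"'(?:[^']|'')*'|--[^\r\n]*");

    // Removes -- comments but leaves string literals untouched.
    public static string StripLineComments(string sql)
    {
        return StringOrComment.Replace(
            sql,
            m => m.Value[0] == '\'' ? m.Value : "");
    }
}
```

Because the regex scans left to right, a string literal is consumed as a whole before any -- inside it can start a comment match. Line breaks are kept, so line numbers in later error messages stay the same.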
