This question already has answers here:
How to split csv whose columns may contain comma
(9 answers)
Closed 4 years ago.
I have the following comma-separated string that I need to split. The problem is that some of the content is within quotes and contains commas that shouldn't be used in the split.
String:
111,222,"33,44,55",666,"77,88","99"
I want the output:
111
222
33,44,55
666
77,88
99
I have tried this:
(?:,?)((?<=")[^"]+(?=")|[^",]+)
But it reads the comma between "77,88","99" as a hit and I get the following output:
111
222
33,44,55
666
77,88
,
99
Depending on your needs you may not be able to use a csv parser, and may in fact want to re-invent the wheel!!
You can do so with some simple regex
(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
This will do the following:
(?:^|,) = Match expression "Beginning of line or string ,"
(\"(?:[^\"]+|\"\")*\"|[^,]*) = A numbered capture group, this will select between 2 alternatives:
stuff in quotes
stuff between commas
This should give you the output you are looking for.
Example code in C#
static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
public static string[] SplitCSV(string input)
{
List<string> list = new List<string>();
string curr = null;
foreach (Match match in csvSplit.Matches(input))
{
curr = match.Value;
if (0 == curr.Length)
{
list.Add("");
}
list.Add(curr.TrimStart(','));
}
return list.ToArray();
}
private void button1_Click(object sender, RoutedEventArgs e)
{
Console.WriteLine(SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\""));
}
Warning As per #MrE's comment - if a rogue new line character appears in a badly formed csv file and you end up with an uneven ("string) you'll get catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) in your regex and your system will likely crash (like our production system did). Can easily be replicated in Visual Studio and as I've discovered will crash it. A simple try/catch will not trap this issue either.
You should use:
(?:^|,)(\"(?:[^\"])*\"|[^,]*)
instead
Fast and easy:
public static string[] SplitCsv(string line)
{
List<string> result = new List<string>();
StringBuilder currentStr = new StringBuilder("");
bool inQuotes = false;
for (int i = 0; i < line.Length; i++) // For each character
{
if (line[i] == '\"') // Quotes are closing or opening
inQuotes = !inQuotes;
else if (line[i] == ',') // Comma
{
if (!inQuotes) // If not in quotes, end of current string, add it to result
{
result.Add(currentStr.ToString());
currentStr.Clear();
}
else
currentStr.Append(line[i]); // If in quotes, just add it
}
else // Add any other character to current string
currentStr.Append(line[i]);
}
result.Add(currentStr.ToString());
return result.ToArray(); // Return array of all strings
}
With this string as input :
111,222,"33,44,55",666,"77,88","99"
It will return :
111
222
33,44,55
666
77,88
99
i really like jimplode's answer, but I think a version with yield return is a little bit more useful, so here it is:
public IEnumerable<string> SplitCSV(string input)
{
Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
yield return match.Value.TrimStart(',');
}
}
Maybe it's even more useful to have it like an extension method:
public static class StringHelper
{
public static IEnumerable<string> SplitCSV(this string input)
{
Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
yield return match.Value.TrimStart(',');
}
}
}
This regular expression works without the need to loop through values and TrimStart(','), like in the accepted answer:
((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))
Here is the implementation in C#:
string values = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";
MatchCollection matches = new Regex("((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))").Matches(values);
foreach (var match in matches)
{
Console.WriteLine(match);
}
Outputs
111
222
33,44,55
666
77,88
99
None of these answers work when the string has a comma inside quotes, as in "value, 1", or escaped double-quotes, as in "value ""1""", which are valid CSV that should be parsed as value, 1 and value "1", respectively.
This will also work with the tab-delimited format if you pass in a tab instead of a comma as your delimiter.
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
var inQuotes = false;
var quoteIsEscaped = false; //Store when a quote has been escaped.
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new {val, index}))
{
if (character.val == delimiter) //We hit a delimiter character...
{
if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
{
Console.WriteLine(currentString);
yield return currentString.ToString();
currentString.Clear();
}
else
{
currentString.Append(character.val);
}
} else {
if (character.val != ' ')
{
if(character.val == '"') //If we've hit a quote character...
{
if(character.val == '\"' && inQuotes) //Does it appear to be a closing quote?
{
if (row[character.index + 1] == character.val) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
{
quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
}
else if (quoteIsEscaped)
{
quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
currentString.Append(character.val);
}
else
{
inQuotes = false;
}
}
else
{
if (!inQuotes)
{
inQuotes = true;
}
else
{
currentString.Append(character.val); //...It's a quote inside a quote.
}
}
}
else
{
currentString.Append(character.val);
}
}
else
{
if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
{
currentString.Append(character.val);
}
}
}
}
}
With minor updates to the function provided by "Chad Hedgcock".
Updates are on:
Line 26: character.val == '\"' - This can never be true due to the check made on Line 24. i.e. character.val == '"'
Line 28: if (row[character.index + 1] == character.val) added !quoteIsEscaped to escape 3 consecutive quotes.
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
var inQuotes = false;
var quoteIsEscaped = false; //Store when a quote has been escaped.
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new {val, index}))
{
if (character.val == delimiter) //We hit a delimiter character...
{
if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
{
//Console.WriteLine(currentString);
yield return currentString.ToString();
currentString.Clear();
}
else
{
currentString.Append(character.val);
}
} else {
if (character.val != ' ')
{
if(character.val == '"') //If we've hit a quote character...
{
if(character.val == '"' && inQuotes) //Does it appear to be a closing quote?
{
if (row[character.index + 1] == character.val && !quoteIsEscaped) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
{
quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
}
else if (quoteIsEscaped)
{
quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
currentString.Append(character.val);
}
else
{
inQuotes = false;
}
}
else
{
if (!inQuotes)
{
inQuotes = true;
}
else
{
currentString.Append(character.val); //...It's a quote inside a quote.
}
}
}
else
{
currentString.Append(character.val);
}
}
else
{
if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
{
currentString.Append(character.val);
}
}
}
}
}
For Jay's answer, if you use a 2nd boolean then you can have nested double-quotes inside single-quotes and vice-versa.
private string[] splitString(string stringToSplit)
{
char[] characters = stringToSplit.ToCharArray();
List<string> returnValueList = new List<string>();
string tempString = "";
bool blockUntilEndQuote = false;
bool blockUntilEndQuote2 = false;
int characterCount = 0;
foreach (char character in characters)
{
characterCount = characterCount + 1;
if (character == '"' && !blockUntilEndQuote2)
{
if (blockUntilEndQuote == false)
{
blockUntilEndQuote = true;
}
else if (blockUntilEndQuote == true)
{
blockUntilEndQuote = false;
}
}
if (character == '\'' && !blockUntilEndQuote)
{
if (blockUntilEndQuote2 == false)
{
blockUntilEndQuote2 = true;
}
else if (blockUntilEndQuote2 == true)
{
blockUntilEndQuote2 = false;
}
}
if (character != ',')
{
tempString = tempString + character;
}
else if (character == ',' && (blockUntilEndQuote == true || blockUntilEndQuote2 == true))
{
tempString = tempString + character;
}
else
{
returnValueList.Add(tempString);
tempString = "";
}
if (characterCount == characters.Length)
{
returnValueList.Add(tempString);
tempString = "";
}
}
string[] returnValue = returnValueList.ToArray();
return returnValue;
}
The original version
Currently I use the following regex:
public static Regex regexCSVSplit = new Regex(#"(?x:(
(?<FULL>
(^|[,;\t\r\n])\s*
( (?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>) |
(?<QUODAT> (?<DAT> [^""',;\s\r\n]* )) )
(?=\s*([,;\t\r\n]|$))
) |
(?<FULL>
(^|[\s\t\r\n])
( (?<QUODAT> (?<QUO>[""'])(?<DAT> [^""',;\s\t\r\n]* )\k<QUO>) |
(?<QUODAT> (?<DAT> [^""',;\s\t\r\n]* )) )
(?=[,;\s\t\r\n]|$)
)
))", RegexOptions.Compiled);
This solution can handle pretty chaotic cases too like below:
This is how to feed the result into an array:
var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
Select(x => x.Groups["DAT"].Value).ToArray();
See this example in action HERE
Note: The regular expression contains two set of <FULL> block and each of them contains two <QUODAT> block separated by "or" (|). Depending on your task you may only need one of them.
Note: That this regular expression gives us one string array, and works on single line with or without <carrier return> and/or <line feed>.
Simplified version
The following regular expression will already cover many complex cases:
public static Regex regexCSVSplit = new Regex(#"(?x:(
(?<FULL>
(^|[,;\t\r\n])\s*
(?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>)
(?=\s*([,;\t\r\n]|$))
)
))", RegexOptions.Compiled);
See this example in action: HERE
It can process complex, easy and empty items too:
This is how to feed the result into an array:
var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
Select(x => x.Groups["DAT"].Value).ToArray();
The main rule here is that every item may contain anything but the <quotation mark><separators><comma> sequence AND each item shall being and end with the same <quotation mark>.
<quotation mark>: <">, <'>
<comma>: <,>, <;>, <tab>, <carrier return>, <line feed>
Edit notes: I added some more explanation to make it easier to understand and replaces the text "CO" with "QUO".
Try this:
string s = #"111,222,""33,44,55"",666,""77,88"",""99""";
List<string> result = new List<string>();
var splitted = s.Split('"').ToList<string>();
splitted.RemoveAll(x => x == ",");
foreach (var it in splitted)
{
if (it.StartsWith(",") || it.EndsWith(","))
{
var tmp = it.TrimEnd(',').TrimStart(',');
result.AddRange(tmp.Split(','));
}
else
{
if(!string.IsNullOrEmpty(it)) result.Add(it);
}
}
//Results:
foreach (var it in result)
{
Console.WriteLine(it);
}
I know I'm a bit late to this, but for searches, here is how I did what you are asking about in C sharp
private string[] splitString(string stringToSplit)
{
char[] characters = stringToSplit.ToCharArray();
List<string> returnValueList = new List<string>();
string tempString = "";
bool blockUntilEndQuote = false;
int characterCount = 0;
foreach (char character in characters)
{
characterCount = characterCount + 1;
if (character == '"')
{
if (blockUntilEndQuote == false)
{
blockUntilEndQuote = true;
}
else if (blockUntilEndQuote == true)
{
blockUntilEndQuote = false;
}
}
if (character != ',')
{
tempString = tempString + character;
}
else if (character == ',' && blockUntilEndQuote == true)
{
tempString = tempString + character;
}
else
{
returnValueList.Add(tempString);
tempString = "";
}
if (characterCount == characters.Length)
{
returnValueList.Add(tempString);
tempString = "";
}
}
string[] returnValue = returnValueList.ToArray();
return returnValue;
}
Don't reinvent a CSV parser, try FileHelpers.
I needed something a little more robust, so I took from here and created this... This solution is a little less elegant and a little more verbose, but in my testing (with a 1,000,000 row sample), I found this to be 2 to 3 times faster. Plus it handles non-escaped, embedded quotes. I used string delimiter and qualifiers instead of chars because of the requirements of my solution. I found it more difficult than I expected to find a good, generic CSV parser so I hope this parsing algorithm can help someone.
public static string[] SplitRow(string record, string delimiter, string qualifier, bool trimData)
{
// In-Line for example, but I implemented as string extender in production code
Func <string, int, int> IndexOfNextNonWhiteSpaceChar = delegate (string source, int startIndex)
{
if (startIndex >= 0)
{
if (source != null)
{
for (int i = startIndex; i < source.Length; i++)
{
if (!char.IsWhiteSpace(source[i]))
{
return i;
}
}
}
}
return -1;
};
var results = new List<string>();
var result = new StringBuilder();
var inQualifier = false;
var inField = false;
// We add new columns at the delimiter, so append one for the parser.
var row = $"{record}{delimiter}";
for (var idx = 0; idx < row.Length; idx++)
{
// A delimiter character...
if (row[idx]== delimiter[0])
{
// Are we inside qualifier? If not, we've hit the end of a column value.
if (!inQualifier)
{
results.Add(trimData ? result.ToString().Trim() : result.ToString());
result.Clear();
inField = false;
}
else
{
result.Append(row[idx]);
}
}
// NOT a delimiter character...
else
{
// ...Not a space character
if (row[idx] != ' ')
{
// A qualifier character...
if (row[idx] == qualifier[0])
{
// Qualifier is closing qualifier...
if (inQualifier && row[IndexOfNextNonWhiteSpaceChar(row, idx + 1)] == delimiter[0])
{
inQualifier = false;
continue;
}
else
{
// ...Qualifier is opening qualifier
if (!inQualifier)
{
inQualifier = true;
}
// ...It's a qualifier inside a qualifier.
else
{
inField = true;
result.Append(row[idx]);
}
}
}
// Not a qualifier character...
else
{
result.Append(row[idx]);
inField = true;
}
}
// ...A space character
else
{
if (inQualifier || inField)
{
result.Append(row[idx]);
}
}
}
}
return results.ToArray<string>();
}
Some test code:
//var input = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";
var input =
"111, 222, \"99\",\"33,44,55\" , \"666 \"mark of a man\"\", \" spaces \"77,88\" \"";
Console.WriteLine("Split with trim");
Console.WriteLine("---------------");
var result = SplitRow(input, ",", "\"", true);
foreach (var r in result)
{
Console.WriteLine(r);
}
Console.WriteLine("");
// Split 2
Console.WriteLine("Split with no trim");
Console.WriteLine("------------------");
var result2 = SplitRow(input, ",", "\"", false);
foreach (var r in result2)
{
Console.WriteLine(r);
}
Console.WriteLine("");
// Time Trial 1
Console.WriteLine("Experimental Process (1,000,000) iterations");
Console.WriteLine("-------------------------------------------");
watch = Stopwatch.StartNew();
for (var i = 0; i < 1000000; i++)
{
var x1 = SplitRow(input, ",", "\"", false);
}
watch.Stop();
elapsedMs = watch.ElapsedMilliseconds;
Console.WriteLine($"Total Process Time: {string.Format("{0:0.###}", elapsedMs / 1000.0)} Seconds");
Console.WriteLine("");
Results
Split with trim
---------------
111
222
99
33,44,55
666 "mark of a man"
spaces "77,88"
Split with no trim
------------------
111
222
99
33,44,55
666 "mark of a man"
spaces "77,88"
Original Process (1,000,000) iterations
-------------------------------
Total Process Time: 7.538 Seconds
Experimental Process (1,000,000) iterations
--------------------------------------------
Total Process Time: 3.363 Seconds
I once had to do something similar and in the end I got stuck with Regular Expressions. The inability for Regex to have state makes it pretty tricky - I just ended up writing a simple little parser.
If you're doing CSV parsing you should just stick to using a CSV parser - don't reinvent the wheel.
Here is my fastest implementation based upon string raw pointer manipulation:
string[] FastSplit(string sText, char? cSeparator = null, char? cQuotes = null)
{
string[] oTokens;
if (null == cSeparator)
{
cSeparator = DEFAULT_PARSEFIELDS_SEPARATOR;
}
if (null == cQuotes)
{
cQuotes = DEFAULT_PARSEFIELDS_QUOTE;
}
unsafe
{
fixed (char* lpText = sText)
{
#region Fast array estimatation
char* lpCurrent = lpText;
int nEstimatedSize = 0;
while (0 != *lpCurrent)
{
if (cSeparator == *lpCurrent)
{
nEstimatedSize++;
}
lpCurrent++;
}
nEstimatedSize++; // Add EOL char(s)
string[] oEstimatedTokens = new string[nEstimatedSize];
#endregion
#region Parsing
char[] oBuffer = new char[sText.Length];
int nIndex = 0;
int nTokens = 0;
lpCurrent = lpText;
while (0 != *lpCurrent)
{
if (cQuotes == *lpCurrent)
{
// Quotes parsing
lpCurrent++; // Skip quote
nIndex = 0; // Reset buffer
while (
(0 != *lpCurrent)
&& (cQuotes != *lpCurrent)
)
{
oBuffer[nIndex] = *lpCurrent; // Store char
lpCurrent++; // Move source cursor
nIndex++; // Move target cursor
}
}
else if (cSeparator == *lpCurrent)
{
// Separator char parsing
oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex); // Store token
nIndex = 0; // Skip separator and Reset buffer
}
else
{
// Content parsing
oBuffer[nIndex] = *lpCurrent; // Store char
nIndex++; // Move target cursor
}
lpCurrent++; // Move source cursor
}
// Recover pending buffer
if (nIndex > 0)
{
// Store token
oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex);
}
// Build final tokens list
if (nTokens == nEstimatedSize)
{
oTokens = oEstimatedTokens;
}
else
{
oTokens = new string[nTokens];
Array.Copy(oEstimatedTokens, 0, oTokens, 0, nTokens);
}
#endregion
}
}
// Epilogue
return oTokens;
}
Try this
private string[] GetCommaSeperatedWords(string sep, string line)
{
List<string> list = new List<string>();
StringBuilder word = new StringBuilder();
int doubleQuoteCount = 0;
for (int i = 0; i < line.Length; i++)
{
string chr = line[i].ToString();
if (chr == "\"")
{
if (doubleQuoteCount == 0)
doubleQuoteCount++;
else
doubleQuoteCount--;
continue;
}
if (chr == sep && doubleQuoteCount == 0)
{
list.Add(word.ToString());
word = new StringBuilder();
continue;
}
word.Append(chr);
}
list.Add(word.ToString());
return list.ToArray();
}
This is Chad's answer rewritten with state based logic. His answered failed for me when it came across """BRAD""" as a field. That should return "BRAD" but it just ate up all the remaining fields. When I tried to debug it I just ended up rewriting it as state based logic:
enum SplitState { s_begin, s_infield, s_inquotefield, s_foundquoteinfield };
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
SplitState state = SplitState.s_begin;
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new { val, index }))
{
//Console.WriteLine("character = " + character.val + " state = " + state);
switch (state)
{
case SplitState.s_begin:
if (character.val == delimiter)
{
/* empty field */
yield return currentString.ToString();
currentString.Clear();
} else if (character.val == '"')
{
state = SplitState.s_inquotefield;
} else
{
currentString.Append(character.val);
state = SplitState.s_infield;
}
break;
case SplitState.s_infield:
if (character.val == delimiter)
{
/* field with data */
yield return currentString.ToString();
state = SplitState.s_begin;
currentString.Clear();
} else
{
currentString.Append(character.val);
}
break;
case SplitState.s_inquotefield:
if (character.val == '"')
{
// could be end of field, or escaped quote.
state = SplitState.s_foundquoteinfield;
} else
{
currentString.Append(character.val);
}
break;
case SplitState.s_foundquoteinfield:
if (character.val == '"')
{
// found escaped quote.
currentString.Append(character.val);
state = SplitState.s_inquotefield;
}
else if (character.val == delimiter)
{
// must have been last quote so we must find delimiter
yield return currentString.ToString();
state = SplitState.s_begin;
currentString.Clear();
}
else
{
throw new Exception("Quoted field not terminated.");
}
break;
default:
throw new Exception("unknown state:" + state);
}
}
//Console.WriteLine("currentstring = " + currentString.ToString());
}
This is a lot more lines of code than the other solutions, but it is easy to modify to add edge cases.
I am using C# WinForm, and I have a RichTextBox that I am trying to make look like a C# script.
Means when using specific words, I want them to be colored. When they edit the word by changing it, I want it to go back to be black.
My approach works, but it really messy and cause bugs when the a scroll option is created and needed to be used to see the code below. (When typing, pretty much the richtextbox jumps up and down without stop)
private void ScriptRichTextBox_TextChanged(object sender, EventArgs e)
{
ScriptTextChange = ScriptRichTextBox.Text;
ScriptColorChange();
}
private void ScriptColorChange()
{
int index = ScriptRichTextBox.SelectionStart;
ScriptRichTextBox.Text = ScriptTextChange; //Only way I found to make the all current text black again, SelectAll() didn't work well.
ScriptRichTextBox.SelectionStart = index;
String[] coloredNames = {"Main", "ClickMouseDown", "ClickMouseUp", "PressKey", "StopMoving", "Delay", "GoRight", "GoLeft", "GoUp", "GoDown", "MousePosition", "LastHorizontalDirection", "LastVerticalDirections", "CurrentDirection", "Directions" };
String[] coloredNames2 = { "cm.", "if", "else", "while", "switch", "case", "break", "return", "new" };
String[] coloredNames3 = { "MyPosition", "MyHp", "MyMp", "OtherPeopleInMap", ".RIGHT", ".LEFT", ".UP", ".DOWN", ".STOP_MOVING" };
foreach (String s in coloredNames)
this.CheckKeyword(s, Color.LightSkyBlue, 0);
foreach (String s in coloredNames2)
this.CheckKeyword(s, Color.Blue, 0);
foreach (String s in coloredNames3)
this.CheckKeyword(s, Color.DarkGreen, 0);
}
private void CheckKeyword(string word, Color color, int startIndex)
{
if (this.ScriptRichTextBox.Text.Contains(word))
{
int index = 0;
int selectStart = this.ScriptRichTextBox.SelectionStart;
while ((index = this.ScriptRichTextBox.Text.IndexOf(word, (index + 1))) != -1)
{
this.ScriptRichTextBox.Select((index + startIndex), word.Length);
this.ScriptRichTextBox.SelectionColor = color;
this.ScriptRichTextBox.Select(selectStart, 0);
this.ScriptRichTextBox.SelectionColor = Color.Black;
}
}
}
I refactored your code a little to hopefully demonstrate a better approach to colouring the text. It is also not optimal to instantiate your string arrays every time you fire the TextChanged event.
Updated:The idea is to build up a word buffer that will be matched with your set of words when typing.
The buffer records each key and if it .IsLetterOrDigit it adds it to the StringBuilder buffer. The buffer has some additional bugs, with recording key press values and not removing recorded chars if you hit backspace etc..
Instead of the word buffer, use RegEx to match any of the words in your reserve word list. Build up the reserve word RegEx so you end up with something like \b(word|word2|word3....)\b This is done in the code in the BuildRegExPattern(..) method.
Once you hit any key other than a letter or number the buffer is checked for content and if the content matches a word then only the text right before the cursor in the ScriptRichTextBox.Text is checked and changed.
Remove the .(dots) from the reserve words as this just complicates the matching criteria. The RegEx in the built up patters will match the words exactly, so if you type something like FARRIGHT or cms the words will not partially change colour.
As an extra I also covered the paste process pressing Ctrl+V because it is a bit of a pain in WinForms and will probably happen quite often.
There are older questions eg. this one that cover the scrolling behaviour, where it shows how to interop by adding the [System.Runtime.InteropServices.DllImport("user32.dll")] attribute, but it can be done without it.
To prevent all the scroll jumping you can make use of the .DefWndProc(msg) method on the form. this question pointed me towards the WM_SETREDRAW property.
There is also this list of other properties that can be set.
The full implementation is this:
public partial class Form1 : Form
{
private readonly string[] _skyBlueStrings;
private readonly string[] _blueStrings;
private readonly string[] _greenStrings;
//for pasting
bool _IsCtrl;
bool _IsV;
//value to fix the colour not setting first character after return key pressed
int _returnIdxFix = 0;
//regex patterns to use
string _LightBlueRegX = "";
string _BlueRegX = "";
string _GreenRegX = "";
//match only words
Regex _rgxAnyWords = new Regex(#"(\w+)");
//colour setup
Color _LightBlueColour = Color.LightSkyBlue;
Color _BlueColour = Color.Blue;
Color _GreenColour = Color.DarkGreen;
Color _DefaultColour = Color.Black;
public Form1()
{
InitializeComponent();
_skyBlueStrings = new string[] { "Main", "ClickMouseDown", "ClickMouseUp", "PressKey", "StopMoving", "Delay", "GoRight", "GoLeft", "GoUp", "GoDown", "MousePosition", "LastHorizontalDirection", "LastVerticalDirections", "CurrentDirection", "Directions" };
_blueStrings = new string[] { "cm", "if", "else", "while", "switch", "case", "break", "return", "new" };
_greenStrings = new string[] { "MyPosition", "MyHp", "MyMp", "OtherPeopleInMap", "RIGHT", "LEFT", "UP", "DOWN", "STOP_MOVING" };
_LightBlueRegX = BuildRegExPattern(_skyBlueStrings);
_BlueRegX = BuildRegExPattern(_blueStrings);
_GreenRegX = BuildRegExPattern(_greenStrings);
}
string BuildRegExPattern(string[] keyworkArray)
{
StringBuilder _regExPatern = new StringBuilder();
_regExPatern.Append(#"\b(");//beginning of word
_regExPatern.Append(string.Join("|", keyworkArray));//all reserve words
_regExPatern.Append(#")\b");//end of word
return _regExPatern.ToString();
}
private void ProcessAllText()
{
BeginRtbUpdate();
FormatKeywords(_LightBlueRegX, _LightBlueColour);
FormatKeywords(_BlueRegX, _BlueColour);
FormatKeywords(_GreenRegX, _GreenColour);
//internal function to process words and set their colours
void FormatKeywords(string regExPattern, Color wordColour)
{
var matchStrings = Regex.Matches(ScriptRichTextBox.Text, regExPattern);
foreach (Match match in matchStrings)
{
FormatKeyword(keyword: match.Value, wordIndex: match.Index, wordColour: wordColour);
}
}
EndRtbUpdate();
ScriptRichTextBox.Select(ScriptRichTextBox.Text.Length, 0);
ScriptRichTextBox.Invalidate();
}
void ProcessWordAtIndex(string fullText, int cursorIdx)
{
MatchCollection anyWordMatches = _rgxAnyWords.Matches(fullText);
if (anyWordMatches.Count == 0)
{ return; } // no words found
var allWords = anyWordMatches.OfType<Match>().ToList();
//get the word just before cursor
var wordAtCursor = allWords.FirstOrDefault(w => (cursorIdx - _returnIdxFix) == (w.Index + w.Length));
if (wordAtCursor is null || string.IsNullOrWhiteSpace(wordAtCursor.Value))
{ return; }//no word at cursor or the match was blank
Color wordColour = CalculateWordColour(wordAtCursor.Value);
FormatKeyword(wordAtCursor.Value, wordAtCursor.Index, wordColour);
}
private Color CalculateWordColour(string word)
{
if (_skyBlueStrings.Contains(word))
{ return _LightBlueColour; }
if (_blueStrings.Contains(word))
{ return _BlueColour; }
if (_greenStrings.Contains(word))
{ return _GreenColour; }
return _DefaultColour;
}
private void FormatKeyword(string keyword, int wordIndex, Color wordColour)
{
ScriptRichTextBox.Select((wordIndex - _returnIdxFix), keyword.Length);
ScriptRichTextBox.SelectionColor = wordColour;
ScriptRichTextBox.Select(wordIndex + keyword.Length, 0);
ScriptRichTextBox.SelectionColor = _DefaultColour;
}
#region RichTextBox BeginUpdate and EndUpdate Methods
protected override void WndProc(ref Message m)
{
base.WndProc(ref m);
//wait until the rtb is visible, otherwise you get some weird behaviour.
if (ScriptRichTextBox.Visible && ScriptRichTextBox.IsHandleCreated)
{
if (m.LParam == ScriptRichTextBox.Handle)
{
rtBox_lParam = m.LParam;
rtBox_wParam = m.WParam;
}
}
}
IntPtr rtBox_wParam = IntPtr.Zero;
IntPtr rtBox_lParam = IntPtr.Zero;
const int WM_SETREDRAW = 0x0b;
const int EM_HIDESELECTION = 0x43f;
void BeginRtbUpdate()
{
Message msg_WM_SETREDRAW = Message.Create(ScriptRichTextBox.Handle, WM_SETREDRAW, (IntPtr)0, rtBox_lParam);
this.DefWndProc(ref msg_WM_SETREDRAW);
}
public void EndRtbUpdate()
{
Message msg_WM_SETREDRAW = Message.Create(ScriptRichTextBox.Handle, WM_SETREDRAW, rtBox_wParam, rtBox_lParam);
this.DefWndProc(ref msg_WM_SETREDRAW);
//redraw the RichTextBox
ScriptRichTextBox.Invalidate();
}
#endregion
private void ScriptRichTextBox_TextChanged(object sender, EventArgs e)
{
//only run all text if it was pasted NOT ON EVERY TEXT CHANGE!
if (_IsCtrl && _IsV)
{
_IsCtrl = false;
ProcessAllText();
}
}
protected void ScriptRichTextBox_KeyPress(object sender, KeyPressEventArgs e)
{
if (!char.IsLetterOrDigit(e.KeyChar))
{
//if the key was enter the cursor position is 1 position off
_returnIdxFix = (e.KeyChar == '\r') ? 1 : 0;
ProcessWordAtIndex(ScriptRichTextBox.Text, ScriptRichTextBox.SelectionStart);
}
}
private void ScriptRichTextBox_KeyDown(object sender, System.Windows.Forms.KeyEventArgs e)
{
if (e.KeyCode == Keys.ControlKey)
{
_IsCtrl = true;
}
if (e.KeyCode == Keys.V)
{
_IsV = true;
}
}
private void ScriptRichTextBox_KeyUp(object sender, System.Windows.Forms.KeyEventArgs e)
{
if (e.KeyCode == Keys.ControlKey)
{
_IsCtrl = false;
}
if (e.KeyCode == Keys.V)
{
_IsV = false;
}
}
}
It looks like this when you paste some "code" with keywords:
and typing looks like this:
Ok after 2 days of not finding something that actually works good or has annoying bugs. I managed to find a solution myself after a big struggle of trying to make it work. The big idea is people try to edit all the RichTextBox words at once, which cause bugs. Why to edit all of the rich text box when you can do your checks on the current word only to get the same result. Which is what I did, I checked if any of my array strings is in the current word, and colored all of them.
private void ScriptRichTextBox_TextChanged(object sender, EventArgs e)
{
FindStringsInCurrentWord();
}
private void FindStringsInCurrentWord()
{
RichTextBox script = ScriptRichTextBox;
String finalWord, forwards, backwards;
int saveLastSelectionStart = script.SelectionStart;
int index = script.SelectionStart;
String[] coloredNames = { "Main", "ClickMouseDown", "ClickMouseUp", "PressKey", "StopMoving", "Delay", "GoRight", "GoLeft", "GoUp", "GoDown", "MousePosition", "LastHorizontalDirection", "LastVerticalDirections", "CurrentDirection", "Directions" };
String[] coloredNames2 = { "cm.", "if", "else", "while", "switch", "case", "break", "return", "new" };
String[] coloredNames3 = { "MyPosition", "MyHp", "MyMp", "OtherPeopleInMap", ".RIGHT", ".LEFT", ".UP", ".DOWN", ".STOP_MOVING" };
String[] arr2 = coloredNames.Union(coloredNames2).ToArray();
Array arrAll = arr2.Union(coloredNames3).ToArray(); //Gets all arrays together
Array[] wordsArray = { coloredNames, coloredNames2, coloredNames3 }; //All found strings in the word
List<String> wordsFoundList = new List<String>();
int foundChangedColor = 0;
int wordsFound = 0;
char current = (char)script.GetCharFromPosition(script.GetPositionFromCharIndex(index)); //Where the editor thingy is
//Check forward text where he uses space and save text
while (!System.Char.IsWhiteSpace(current) && index < script.Text.Length)
{
index++;
current = (char)script.GetCharFromPosition(script.GetPositionFromCharIndex(index));
}
int lengthForward = index - saveLastSelectionStart;
script.Select(script.SelectionStart, lengthForward);
forwards = script.SelectedText;
//Debug.WriteLine("Forwards: " + forwards);
script.SelectionStart = saveLastSelectionStart;
this.ScriptRichTextBox.Select(script.SelectionStart, 0);
index = script.SelectionStart;
current = (char)script.GetCharFromPosition(script.GetPositionFromCharIndex(index));
int length = 0;
//Check backwords where he uses space and save text
while ((!System.Char.IsWhiteSpace(current) || length == 0) && index > 0 && index <= script.Text.Length)
{
index--;
length++;
current = (char)script.GetCharFromPosition(script.GetPositionFromCharIndex(index));
}
script.SelectionStart -= length;
script.Select(script.SelectionStart + 1, length - 1);
backwards = script.SelectedText;
//Debug.WriteLine("Backwards: " + backwards);
script.SelectionStart = saveLastSelectionStart;
this.ScriptRichTextBox.Select(saveLastSelectionStart, 0);
this.ScriptRichTextBox.SelectionColor = Color.Black;
finalWord = backwards + forwards; //Our all word!
//Debug.WriteLine("WORD:" + finalWord);
//Setting all of the word black, after it coloring the right places
script.Select(index + 1, length + lengthForward);
script.SelectionColor = Color.Black;
foreach (string word in arrAll)
{
if (finalWord.IndexOf(word) != -1)
{
wordsFound++;
wordsFoundList.Add(word);
script.Select(index + 1 + finalWord.IndexOf(word), word.Length);
if (coloredNames.Any(word.Contains))
{
script.SelectionColor = Color.LightSkyBlue;
foundChangedColor++;
}
else if (coloredNames2.Any(word.Contains))
{
script.SelectionColor = Color.Blue;
foundChangedColor++;
}
else if (coloredNames3.Any(word.Contains))
{
script.SelectionColor = Color.DarkGreen;
foundChangedColor++;
}
//Debug.WriteLine("Word to edit: " + script.SelectedText);
this.ScriptRichTextBox.Select(saveLastSelectionStart, 0);
this.ScriptRichTextBox.SelectionColor = Color.Black;
}
}
//No strings found, color it black
if (wordsFound == 0)
{
script.Select(index + 1, length + lengthForward);
script.SelectionColor = Color.Black;
//Debug.WriteLine("WORD??: " + script.SelectedText);
this.ScriptRichTextBox.Select(saveLastSelectionStart, 0);
this.ScriptRichTextBox.SelectionColor = Color.Black;
}
}