Censoring words in string[] by replacing - c#

I am making a censor program for a game .dll and cannot figure out how to do this. I have a string[] of words and sentences. I have worked out how to filter the words and block the messages. Now I am trying to replace each banned word with asterisks of the same length as the word. For example, if someone said "fuck that stupid ass" it would come out as "**** that stupid ***". Below is the code I am using:
public void Actionfor(ServerChatEventArgs args)
{
var player = TShock.Players[args.Who];
if (!args.Text.ToLower().StartsWith("/") || args.Text.ToLower().StartsWith("/w") || args.Text.ToLower().StartsWith("/r") || args.Text.ToLower().StartsWith("/me") || args.Text.ToLower().StartsWith("/c") || args.Text.ToLower().StartsWith("/party"))
{
foreach (string Word in config.BanWords)
{
if (player.Group.HasPermission("caw.staff"))
{
args.Handled = false;
}
else if (args.Text.ToLower().Contains(Word))
{
switch (config.Action)
{
case "kick":
args.Handled = true;
TShock.Utils.Kick(player, config.KickMessage, true, false);
break;
case "ignore":
args.Handled = true;
player.SendErrorMessage("Your message has been ignored for saying: {0}", Word);
break;
case "censor":
args.Handled = false;
var wordlength = Word.Length;
break;
case "donothing":
args.Handled = false;
break;
}
}
}
}
else
{
args.Handled = false;
}
}
public string[] BanWords = { "fuck", "ass", "can i be staff", "can i be admin" };
Some answers suggest putting something like this under my case "censor":
Word = Word.Replace(Word, new string("*", Word.Length));
However, I always get a "cannot convert string to char" error and cannot figure out what else to do.

The compiler is telling you the problem: the string constructor overload you want takes a char and an int, not a string and an int.
It is trying to convert "*" from a string to a char. Replace the double quotes " with single quotes '.

For chars, use single quotes ' instead of double quotes " like this:
new string('*', Word.Length)
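And in your code, you don't need Replace to build the mask. Word is the foreach iteration variable, which C# does not allow you to assign to, so store the mask in a new variable:
var mask = new string('*', Word.Length);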
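Building the mask is only half of the job, since the message text itself still has to be rewritten. As a rough illustration, here is a minimal Python sketch of the whole censoring step. The function name censor and the sample word list are made up for this example and are not part of TShock; it does plain case-insensitive substring matching, like the Contains check in the question.

```python
import re

def censor(text, banned_words):
    """Replace each banned word or phrase in text with asterisks of equal length."""
    if not banned_words:
        return text
    # Longest entries first so a phrase wins over a shorter word it contains.
    ordered = sorted(banned_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered), re.IGNORECASE)
    # Each match is replaced by as many asterisks as it has characters.
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)

print(censor("Darn that stupid heck", ["darn", "heck"]))  # **** that stupid ****
```

Because it matches substrings, a short entry is also masked inside longer, innocent words, which is the same behaviour as Contains.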

Related

How to solve this challenge using a LIFO stack?

I'm using a portion of C# code from Sanjit Prasad to solve the challenge of processing backspaces in a given string of words. The new challenge is to process the left-arrow and right-arrow keys in combination with backspaces, acting as a "corrector" for typos in writing.
The following string represents the problem and solution for the first challenge using a stack, which is last-in, first-out (credits to Sanjit Prasad). In it, "#" represents a backspace:
string: thiss# is a txt##ext with some typos
expected result: this is a text with some typos
This is the code to generate the expected result:
static String finalAnswer(String S)
{
Stack<Char> q = new Stack<Char>();
for (int i = 0; i < S.Length; ++i)
{
if (S[i] != '#') q.Push(S[i]);
else if (q.Count!=0) q.Pop();
}
String ans = "";
while (q.Count!=0)
{
ans += q.Pop();
}
String answer = "";
for(int j = ans.Length - 1; j >= 0; j--)
{
answer += ans[j];
}
return answer;
}
That code works great. Now the challenge is to process the following string:
string: ths#is is an te\\\#///xt wit some\\\\\h///// tpos###ypos
expected result: this is a text with some typos
In the above string, the character "\" represents a left arrow key pressed, and "/" a right arrow key pressed.
Thank you so much for all your comments. This is my very first question on Stack Overflow, and I would like to know an approach to solve this challenge.
Here is a version that gets the job done, but will need some checks for edge cases and some more error checks.
static string GenerateUpdatedString(string strInput)
{
var stringStack = new Stack<char>();//holds the string as it is read from the input
var workingStack = new Stack<char>();//hold chars when going back to fix typos
char poppedChar;
foreach (var ch in strInput)
{
switch (ch)
{
case '\\':
{
PushAndPopCharacters(workingStack, stringStack);
break;
}
case '/':
{
PushAndPopCharacters(stringStack, workingStack);
break;
}
case '#':
{
stringStack.TryPop(out poppedChar);
break;
}
default:
stringStack.Push(ch);
break;
}
}
return new string(stringStack.Reverse().ToArray());
}
static void PushAndPopCharacters(Stack<char> stackToPush, Stack<char> stackToPop)
{
char poppedChar;
if (stackToPop.TryPop(out poppedChar))
{
stackToPush.Push(poppedChar);
}
}
Usage
var result = GenerateUpdatedString(@"ths#is is an te\\\#///xt wit some\\\\\h///// tpos###ypos");
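One edge case the version above leaves open is a cursor that is not at the end of the text when the input stops: whatever is still parked on workingStack is dropped. A minimal Python sketch of the same two-stack idea that keeps those characters follows. The names apply_keystrokes, left and right are illustrative, not from the original code.

```python
def apply_keystrokes(keys):
    """Replay typed characters, '#' (backspace), '\\' (left) and '/' (right)."""
    left = []   # characters before the cursor, in reading order
    right = []  # characters after the cursor, nearest one on top
    for ch in keys:
        if ch == "\\":          # left arrow: move one character past the cursor
            if left:
                right.append(left.pop())
        elif ch == "/":         # right arrow: move one character back
            if right:
                left.append(right.pop())
        elif ch == "#":         # backspace: delete the character before the cursor
            if left:
                left.pop()
        else:
            left.append(ch)
    # Anything still to the right of the cursor belongs to the final text.
    return "".join(left) + "".join(reversed(right))

print(apply_keystrokes(r"ths#is is an te\\\#///xt wit some\\\\\h///// tpos###ypos"))
# this is a text with some typos
```

Each keystroke costs one push or pop, so the whole input is processed in a single linear pass.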

Interpreting and formatting user input

I'm designing a command line interpreter for my software and need to be able to format user input. Currently I have a system that basically splits everything on spaces; the problem is that I must not split anything inside double quotes.
As you can probably tell, my current implementation won't handle quoted paths very well.
This is my current interpreting and formatting logic (contained in a non-static method that gets called when the user presses Enter, in case anyone was wondering):
var command = ConsoleInput.Text;
ConsoleInput.Text = String.Empty;
string command_main = command.Split(new char[] { ' ' }).First();
string[] syntax = command.Split(new char[] { ' ' }).Skip(1).ToArray();
if (lCommands.ContainsKey(command_main))
{
Action<string[]> commandfunction;
lCommands.TryGetValue(command_main, out commandfunction);
commandfunction(syntax);
}
else
ConsoleOut($"Invalid Command - {command_main} {string.Join(" ", syntax)}");
I need quoted paths to be taken in as a single argument, instead of being split on spaces.
For example (this is just an illustration, not actual code), with the input "this is a test" and some more text, what I get now, and don't want, is: syntax[0] = "this, syntax[1] = is, and so on.
The expected outcome is: syntax[0] = "this is a test", syntax[1] = and, syntax[2] = some, and so on.
I'm stuck here. Does anyone have a solution? Thank you.
Here's a solution. It's a hacked together state machine that handles quoted strings that may contain spaces. It throws away extraneous whitespace between arguments, and considers a doubled up double-quote as if it were a single double-quote (but without any special meaning; as if it were any other character).
public IEnumerable<string> ParseLine(string toParse)
{
var result = new List<string>();
bool inQuotedString = false;
bool parsingDoubleQuote = false;
bool inWhiteSpace = false;
int length = toParse.Length;
var argBuffer = new StringBuilder();
for (var index = 0; index < length; ++index)
{
//if looking ahead for a double quote succeeded, just add the quote to the current argument
if (parsingDoubleQuote)
{
parsingDoubleQuote = false;
argBuffer.Append('"');
continue; //done with this character, time to just loop again
}
if (toParse[index] == '"')
{
inWhiteSpace = false;
//look ahead one character to see if there's a double quote
if (index < length - 1 && toParse[index + 1] == '"')
{
parsingDoubleQuote = true;
continue; //done with this character, time to just loop again
}
if (!inQuotedString)
{
inQuotedString = true;
continue; //done with this character, time to just loop again
}
else
{
//it's not a double quote, and we are in quotes string, so
inQuotedString = false;
//we don't add the buffer to the output args until a space or the end, so
continue; //done with this character, time to just loop again
}
}
//if we are here, there's no quote, so...
if (toParse[index] == ' ' || toParse[index] == '\t')
{
if (inQuotedString)
{
argBuffer.Append(toParse[index]);
continue; //done with this character, time to just loop again
}
if (inWhiteSpace)
{
//nothing to do
continue; //skip the extra whitespace and loop again
}
else
{
inWhiteSpace = true;
if (argBuffer.Length > 0)
{
result.Add(argBuffer.ToString());
argBuffer.Clear();
continue; //done with this character, time to just loop again
}
}
}
else
{
inWhiteSpace = false;
//no quote, no space, so...
argBuffer.Append(toParse[index]);
continue; //done with this character, time to just loop again
}
} //end of for loop
if (argBuffer.Length > 0)
{
result.Add(argBuffer.ToString());
}
return result;
}
I've given it only cursory testing, so you'll want to test it harder.
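For comparison, here is a compact sketch of the same kind of state machine in Python. It is a variant under slightly different assumptions, not a port: parse_line is a hypothetical name, a doubled double-quote is only treated as a literal quote inside a quoted section, and an empty pair of quotes produces an empty argument.

```python
def parse_line(line):
    """Split a command line on whitespace, keeping quoted sections together."""
    args = []
    buf = []            # characters of the argument being built
    in_quotes = False
    has_token = False   # True once the current argument has started
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '"':
            if in_quotes and i + 1 < len(line) and line[i + 1] == '"':
                buf.append('"')   # doubled quote inside quotes: literal quote
                i += 1            # skip the second quote of the pair
            else:
                in_quotes = not in_quotes
                has_token = True
        elif ch in " \t" and not in_quotes:
            if has_token:         # whitespace ends the current argument
                args.append("".join(buf))
                buf = []
                has_token = False
        else:
            buf.append(ch)
            has_token = True
        i += 1
    if has_token:
        args.append("".join(buf))
    return args

print(parse_line('"this is a test" and some more text'))
# ['this is a test', 'and', 'some', 'more', 'text']
```

Tracking has_token separately from the buffer is what lets an explicitly empty quoted argument survive while runs of whitespace are still collapsed.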

C# split comma separated values

How can I split comma separated strings with quoted strings that can also contain commas?
Example input:
John, Doe, "Sid, Nency", Smith
Expected output:
John
Doe
Sid, Nency
Smith
Splitting by commas was fine, but I now have a requirement that strings like "Sid, Nency" are allowed. I tried to use a regex to split such values. The regex ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)" comes from a Java question and does not work well in my .NET code: it duplicates some strings, finds extra results, and so on.
So what is the best way to split such strings?
It's because of the capture group. Just turn it into a non-capture group:
",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
      ^^
The capture group is including the captured part in your results.
var regexObj = new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
regexObj.Split(input).Select(s => s.Trim('\"', ' ')).ToList().ForEach(Console.WriteLine);
And just trim the results.
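The effect of the capture group is easy to reproduce. Python's re.split follows the same rule as .NET's Regex.Split (text captured by groups is returned along with the pieces), so a short Python sketch can show both patterns side by side. The helper name split_csv_line is an assumption made for this example.

```python
import re

CAPTURING = r',(?=([^"]*"[^"]*")*[^"]*$)'
NON_CAPTURING = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'

def split_csv_line(line, pattern=NON_CAPTURING):
    """Split on commas that are followed by an even number of double quotes."""
    pieces = re.split(pattern, line)
    # A capture group that did not participate shows up as None; skip those.
    return [p.strip(' "') for p in pieces if p is not None]

line = 'John, Doe, "Sid, Nency", Smith'
print(split_csv_line(line))             # ['John', 'Doe', 'Sid, Nency', 'Smith']
print(split_csv_line(line, CAPTURING))  # extra entries leaked from the group
```

The lookahead only succeeds when the rest of the line contains balanced pairs of quotes, which is why the comma inside "Sid, Nency" is left alone.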
Just go through your string. As you go, keep track of whether you're
inside a quoted "block" or not. If you are, don't treat a comma as
a separator; otherwise do. It's a simple algorithm, and I would write
it myself. When you encounter the first " you enter a block; when you
encounter the next " you leave it, and so on.
So you can do it with one pass through your string. Here is an example in Java:
import java.util.ArrayList;
public class Test003 {
public static void main(String[] args) {
String s = " John, , , , \" Barry, John \" , , , , , Doe, \"Sid , Nency\", Smith ";
StringBuilder term = new StringBuilder();
boolean inQuote = false;
boolean inTerm = false;
ArrayList<String> terms = new ArrayList<String>();
for (int i=0; i<s.length(); i++){
char ch = s.charAt(i);
if (ch == ' '){
if (inQuote){
if (!inTerm) {
inTerm = true;
}
term.append(ch);
}
else {
if (inTerm){
terms.add(term.toString());
term.setLength(0);
inTerm = false;
}
}
}else if (ch== '"'){
term.append(ch); // comment this out if you don't need it
if (!inTerm){
inTerm = true;
}
inQuote = !inQuote;
}else if (ch == ','){
if (inQuote){
if (!inTerm){
inTerm = true;
}
term.append(ch);
}else{
if (inTerm){
terms.add(term.toString());
term.setLength(0);
inTerm = false;
}
}
}else{
if (!inTerm){
inTerm = true;
}
term.append(ch);
}
}
if (inTerm){
terms.add(term.toString());
}
for (String t : terms){
System.out.println("|" + t + "|");
}
}
}
I use the following code within my Csv Parser class to achieve this:
private string[] ParseLine(string line)
{
List<string> results = new List<string>();
bool inQuotes = false;
int index = 0;
StringBuilder currentValue = new StringBuilder(line.Length);
while (index < line.Length)
{
char c = line[index];
switch (c)
{
case '\"':
{
inQuotes = !inQuotes;
break;
}
default:
{
if (c == ',' && !inQuotes)
{
results.Add(currentValue.ToString());
currentValue.Clear();
}
else
currentValue.Append(c);
break;
}
}
++index;
}
results.Add(currentValue.ToString());
return results.ToArray();
} // eo ParseLine
If you find the regular expression too complex you can do it like this:
string initialString = "John, Doe, \"Sid, Nency\", Smith";
IEnumerable<string> splitted = initialString.Split('"');
splitted = splitted.SelectMany((str, index) => index % 2 == 0 ? str.Split(',') : new[] { str });
splitted = splitted.Where(str => !string.IsNullOrWhiteSpace(str)).Select(str => str.Trim());
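The trick works because splitting on the quote character puts text outside quotes at even indexes and quoted text at odd indexes, so only the even pieces need a further split on commas. A small Python sketch of the same idea, using the hypothetical name split_quoted and assuming quotes are balanced and never escaped:

```python
def split_quoted(line):
    """Split on commas, treating text between double quotes as one field."""
    fields = []
    for index, part in enumerate(line.split('"')):
        if index % 2 == 0:
            fields.extend(part.split(","))   # outside quotes: commas separate
        else:
            fields.append(part)              # inside quotes: keep as one field
    # Drop the blank fragments left around the quoted sections and trim the rest.
    return [f.strip() for f in fields if f.strip()]

print(split_quoted('John, Doe, "Sid, Nency", Smith'))
# ['John', 'Doe', 'Sid, Nency', 'Smith']
```

Note that the final filter also discards fields that are genuinely empty, so this approach does not suit data where empty columns matter.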

C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas

In C#, using the Regex class, how does one parse comma-separated values, where some values might be quoted strings themselves containing commas?
using System ;
using System.Text.RegularExpressions ;
class Example
{
public static void Main ( )
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear" ;
Console.WriteLine ( "\nmyString is ...\n\t" + myString + "\n" ) ;
Regex regex = new Regex ( "(?<=,(\"|\')).*?(?=(\"|\'),)|(^.*?(?=,))|((?<=,).*?(?=,))|((?<=,).*?$)" ) ;
Match match = regex.Match ( myString ) ;
int j = 0 ;
while ( match.Success )
{
Console.WriteLine ( j++ + " \t" + match ) ;
match = match.NextMatch() ;
}
}
}
Output (in part) appears as follows:
0 cat
1 dog
2 "0 = OFF
3 1 = ON"
4 lion
5 tiger
6 'R = red
7 G = green
8 B = blue'
9 bear
However, desired output is:
0 cat
1 dog
2 0 = OFF, 1 = ON
3 lion
4 tiger
5 R = red, G = green, B = blue
6 bear
Try with this Regex:
"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*
Regex regexObj = new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
Match matchResults = regexObj.Match(input);
while (matchResults.Success)
{
Console.WriteLine(matchResults.Value);
matchResults = matchResults.NextMatch();
}
Outputs:
cat
dog
"0 = OFF, 1 = ON"
lion
tiger
'R = red, G = green, B = blue'
bear
Note: This regex solution will work for your case, but I recommend using a specialized library such as FileHelpers.
Why not heed the advice from the experts and not roll your own CSV parser?
Your first thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free and open source FileHelpers library.
It's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser to accomplish this for CSV files. Yes, it might feel a little strange adding a reference to Microsoft.VisualBasic in a C# app, maybe even a little dirty, but hey, it works.
Ah, RegEx. Now you have two problems. ;)
I'd use a tokenizer/parser, since it is quite straightforward, and more importantly, much easier to read for later maintenance.
This works, for example:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
class Program
{
static void Main(string[] args)
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear";
Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
CsvParser parser = new CsvParser(myString);
Int32 lineNumber = 0;
foreach (string s in parser)
{
Console.WriteLine(lineNumber++ + ": " + s);
}
Console.ReadKey();
}
}
internal enum TokenType
{
Comma,
Quote,
Value
}
internal class Token
{
public Token(TokenType type, string value)
{
Value = value;
Type = type;
}
public String Value { get; private set; }
public TokenType Type { get; private set; }
}
internal class StreamTokenizer : IEnumerable<Token>
{
private TextReader _reader;
public StreamTokenizer(TextReader reader)
{
_reader = reader;
}
public IEnumerator<Token> GetEnumerator()
{
String line;
StringBuilder value = new StringBuilder();
while ((line = _reader.ReadLine()) != null)
{
foreach (Char c in line)
{
switch (c)
{
case '\'':
case '"':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Quote, c.ToString());
break;
case ',':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Comma, c.ToString());
break;
default:
value.Append(c);
break;
}
}
// Thanks, dpan
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0; // reset so the next line does not inherit this line's trailing value
}
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
internal class CsvParser : IEnumerable<String>
{
private StreamTokenizer _tokenizer;
public CsvParser(Stream data)
{
_tokenizer = new StreamTokenizer(new StreamReader(data));
}
public CsvParser(String data)
{
_tokenizer = new StreamTokenizer(new StringReader(data));
}
public IEnumerator<string> GetEnumerator()
{
Boolean inQuote = false;
StringBuilder result = new StringBuilder();
foreach (Token token in _tokenizer)
{
switch (token.Type)
{
case TokenType.Comma:
if (inQuote)
{
result.Append(token.Value);
}
else
{
yield return result.ToString();
result.Length = 0;
}
break;
case TokenType.Quote:
// Toggle quote state
inQuote = !inQuote;
break;
case TokenType.Value:
result.Append(token.Value);
break;
default:
throw new InvalidOperationException("Unknown token type: " + token.Type);
}
}
if (result.Length > 0)
{
yield return result.ToString();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
Just adding the solution I worked on this morning.
var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");
foreach (Match m in regex.Matches("<-- input line -->"))
{
var s = m.Value;
}
As you can see, you need to call regex.Matches() once per line. It returns a MatchCollection with one item per column, and the Value property of each match is the parsed value.
This is still a work in progress, but it happily parses CSV strings like:
2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
CSV is not regular. Unless your regex language has sufficient power to handle the stateful nature of csv parsing (unlikely, the MS one does not) then any pure regex solution is a list of bugs waiting to happen as you hit a new input source that isn't quite handled by the last regex.
CSV reading is not that complex to write as a state machine since the grammar is simple but even so you must consider: quoted quotes, commas within quotes, new lines within quotes, empty fields.
As such you should probably just use someone else's CSV parser. I recommend CSVReader for .Net
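To illustrate what a library buys you, here is a short sketch that uses Python's standard csv module as a stand-in for the .NET libraries recommended above. The helper name parse_csv is made up for the example. It exercises the cases listed above: quoted quotes, commas within quotes, new lines within quotes and empty fields.

```python
import csv
import io

def parse_csv(text):
    """Parse CSV text into a list of rows using the standard library reader."""
    # newline="" leaves line endings untouched so the reader can handle
    # line breaks that occur inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))

rows = parse_csv('2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D\n'
                 'x,"two\nlines",z\n')
for row in rows:
    print(row)
```

The first row comes back with nine fields, including the three empty ones, and the second row keeps the embedded line break inside its middle field.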
Function:
private List<string> ParseDelimitedString (string arguments, char delim = ',')
{
bool inQuotes = false;
bool inNonQuotes = false; //used to trim leading WhiteSpace
List<string> strings = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in arguments)
{
if (c == '\'' || c == '"')
{
if (!inQuotes)
inQuotes = true;
else
inQuotes = false;
}else if (c == delim)
{
if (!inQuotes)
{
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else if (inQuotes || inNonQuotes || !char.IsWhiteSpace(c))
{
// keep every character of a field; only leading whitespace of an unquoted field is skipped
if (!inQuotes) inNonQuotes = true;
sb.Append(c);
}
}
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
return strings;
}
Usage
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear, text";
List<string> strings = ParseDelimitedString(myString);
foreach( string s in strings )
Console.WriteLine( s );
Output:
cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
text
I found a few bugs in that version; for example, it mishandles a non-quoted string that has a single quote in the value.
I agree that you should use the FileHelpers library when you can, but that library requires you to know what your data will look like, and I need a generic parser.
So I've updated the code to the following and thought I'd share:
static public List<string> ParseDelimitedString(string value, char delimiter)
{
bool inQuotes = false;
bool inNonQuotes = false;
bool secondQuote = false;
char curQuote = '\0';
List<string> results = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in value)
{
if (inNonQuotes)
{
// then quotes are just characters
if (c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else if (inQuotes)
{
// then quotes need to be double escaped
if ((c == '\'' && c == curQuote) || (c == '"' && c == curQuote))
{
if (secondQuote)
{
secondQuote = false;
sb.Append(c);
}
else
secondQuote = true;
}
else if (secondQuote && c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inQuotes = false;
}
else if (!secondQuote)
{
sb.Append(c);
}
else
{
// bad,as,"user entered something like"this,poorly escaped,value
// just ignore until second delimiter found
}
}
else
{
// not yet parsing a field
if (c == '\'' || c == '"')
{
curQuote = c;
inQuotes = true;
inNonQuotes = false;
secondQuote = false;
}
else if (c == delimiter)
{
// blank field
inQuotes = false;
inNonQuotes = false;
results.Add(string.Empty);
}
else
{
inQuotes = false;
inNonQuotes = true;
sb.Append(c);
}
}
}
if (inQuotes || inNonQuotes)
results.Add(sb.ToString());
return results;
}
Since the question "Regex to parse csv with nested quotes" points here, since this one is much more generic, and since a regex is not really the proper way to solve this problem (I have had many issues with catastrophic backtracking: http://www.regular-expressions.info/catastrophic.html),
here is a simple parser implementation in Python as well:
def csv_to_array(string):
    stack = []
    match = []
    matches = []
    for c in string:
        # do we have a double quote?
        if c == "\"":
            # is it a closing match?
            if len(stack) > 0 and stack[-1] == c:
                stack.pop()
            else:
                stack.append(c)
        elif (c == "," and len(stack) == 0) or (c == "\n"):
            matches.append("".join(match))
            match = []
        else:
            match.append(c)
    # flush the last field when the input does not end with a newline
    if match:
        matches.append("".join(match))
    return matches

What is the best algorithm for arbitrary delimiter/escape character processing?

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here are the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C#) would be, basically:
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.
A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
    # states
    parsing = 0
    escaped = 1
    state = parsing
    found = []
    parsed = ""
    for c in input:
        if state == parsing:
            if c == delim:
                found.append(parsed)
                parsed = ""
            elif c == escape:
                state = escaped
            else:
                parsed += c
        else:  # state == escaped
            parsed += c
            state = parsing
    if parsed:
        found.append(parsed)
    return found
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}
private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}
The implementation of this kind of tokenizer in terms of an FSM is fairly straightforward.
You do have a few decisions to make (for example, what to do with leading delimiters: strip them or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input)       action
=============================================================================
BEGIN(*):          token.clear(); state=START;
END(*):            return;
*(\n\0):           token.emit(); state=END;
START(DELIMITER):  ; // NB: the input is *not* added to the token!
START(ESCAPE):     state=ESC; // NB: the input is *not* added to the token!
START(*):          token.append(input); state=NORM;
NORM(DELIMITER):   token.emit(); token.clear(); state=START;
NORM(ESCAPE):      state=ESC; // NB: the input is *not* added to the token!
NORM(*):           token.append(input);
ESC(*):            token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive escapes naturally, and can easily be extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.append(TAB)).
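The transition table above can be turned into code almost line by line. The following Python sketch is one possible reading of it, not the author's implementation: the function name tokenize and the default delimiter and escape characters are assumptions, and a token is only emitted when it is non-empty.

```python
def tokenize(text, delimiter=",", escape="/"):
    """Split text on delimiter, honouring a single-character escape."""
    tokens = []
    token = []        # characters of the token being built
    state = "START"   # START, NORM or ESC, as in the table
    for ch in text:
        if ch in "\n\0":
            break                     # *(\n\0): stop; the final emit is below
        if state == "ESC":
            token.append(ch)          # ESC(*): take the character literally
            state = "NORM"
        elif ch == escape:
            state = "ESC"             # the escape itself is not added
        elif ch == delimiter:
            if state == "NORM":
                tokens.append("".join(token))   # NORM(DELIMITER): emit
                token = []
            state = "START"           # START(DELIMITER): skip repeated delimiters
        else:
            token.append(ch)          # START(*) and NORM(*)
            state = "NORM"
    if token:
        tokens.append("".join(token))
    return tokens

print(tokenize("a,b/,c,,d"))   # ['a', 'b,c', 'd']
print(tokenize("////////,"))   # ['////']
```

The second call is the case from the question: eight escape characters collapse to four literal ones, and the trailing delimiter then ends the token.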
Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.
Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
    string encodedString,
    char separator,
    char escape)
{
    var inEscapeSequence = false;
    var currentToken = new StringBuilder();
    foreach (var currentCharacter in encodedString)
        if (inEscapeSequence)
        {
            currentToken.Append(currentCharacter);
            inEscapeSequence = false;
        }
        else if (currentCharacter == escape)
            inEscapeSequence = true;
        else if (currentCharacter == separator)
        {
            yield return currentToken.ToString();
            currentToken.Clear();
        }
        else
            currentToken.Append(currentCharacter);
    yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.
You're looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.
