How to replace multiple substrings in a string in C#?

I have to replace multiple substrings in a string (the input string is at most 32 characters long). I have a big dictionary that can hold millions of key-value pairs. For each word in the input, I need to check whether it is present in the dictionary and, if so, replace it with the corresponding value. The input string can have multiple trailing spaces.
This method is called millions of times, so it is badly affecting performance.
Is there any scope for optimizing this code, or is there a better way to do this?
public static string RandomValueCompositeField(object objInput, Dictionary<string, string> g_rawValueRandomValueMapping) {
if (objInput == null)
return null;
string input = objInput.ToString();
if (input == "")
return input;
//List<string> ls = new List<string>();
int count = WhiteSpaceAtEnd(input);
foreach (string data in input.Substring(0, input.Length - count).Split(' ')) {
try {
string value;
gs_dictRawValueRandomValueMapping.TryGetValue(data, out value);
if (value != null) {
//ls.Add(value.TrimEnd());
input = input.Replace(data, value);
}
else {
//ls.Add(data);
}
}
catch(Exception ex) {
}
}
//if (count > 0)
// input = input + new string(' ', count);
//ls.Add(new string(' ', count));
return input;
}
EDIT:
I missed one important thing in the question: a substring can occur only once in the input string, and each dictionary key and its value have the same number of characters.

Here's a method that will take an input string and will build a new string by finding "words" (any consecutive non-whitespace) and then checking if that word is in a dictionary and replacing it with the corresponding value if found. This will fix the issues of Replace doing replacements on "sub-words" (if you have "hello hell" and you want to replace "hell" with "heaven" and you don't want it to give you "heaveno heaven"). It also fixes the issue of swapping. For example if you want to replace "yes" with "no" and "no" with "yes" in "yes no" you don't want it to first turn that into "no no" and then into "yes yes".
public string ReplaceWords(string input, Dictionary<string, string> replacements)
{
var builder = new StringBuilder();
int wordStart = -1;
int wordLength = 0;
for(int i = 0; i < input.Length; i++)
{
// If the current character is white space check if we have a word to replace
if(char.IsWhiteSpace(input[i]))
{
// If wordStart is not -1 then we have hit the end of a word
if(wordStart >= 0)
{
// get the word and look it up in the dictionary
// if found use the replacement, if not keep the word.
var word = input.Substring(wordStart, wordLength);
if(replacements.TryGetValue(word, out var replace))
{
builder.Append(replace);
}
else
{
builder.Append(word);
}
}
// Make sure to reset the start and length
wordStart = -1;
wordLength = 0;
// append whatever whitespace was found.
builder.Append(input[i]);
}
// If this isn't whitespace we set wordStart if it isn't already set
// and just increment the length.
else
{
if(wordStart == -1) wordStart = i;
wordLength++;
}
}
// If wordStart is not -1 then we have a trailing word we need to check.
if(wordStart >= 0)
{
var word = input.Substring(wordStart, wordLength);
if(replacements.TryGetValue(word, out var replace))
{
builder.Append(replace);
}
else
{
builder.Append(word);
}
}
return builder.ToString();
}
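For comparison, the same word-by-word lookup can be sketched with Regex.Replace and a match evaluator. This is a swapped-in technique, not the method above: it treats every run of non-whitespace (\S+) as a word and leaves all whitespace, including trailing spaces, untouched. The class name WordReplacer is made up for illustration, and whether it is faster than the hand-written loop is something you would have to measure.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is illustrative only.
public static class WordReplacer
{
    // A "word" is any run of consecutive non-whitespace characters.
    private static readonly Regex Word = new Regex(@"\S+", RegexOptions.Compiled);

    public static string ReplaceWords(string input, IReadOnlyDictionary<string, string> replacements)
    {
        // Each word is looked up exactly once against the original input,
        // so sub-word matches and swap chains ("yes" <-> "no") cannot occur.
        return Word.Replace(input, m =>
            replacements.TryGetValue(m.Value, out var replacement) ? replacement : m.Value);
    }
}
```

With a dictionary that maps "yes" to "no" and "no" to "yes", the input "yes no  " becomes "no yes  ", and the two trailing spaces are preserved.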

Related

List[i].Replace in for-loop won't return string

I'm trying to make Hangman (I'm still a newbie). The program chooses a random word out of a text file, and the word is turned into a char array. I have to show it in a label, with the label text modified according to what's in the letter list. The thing is, nothing shows up in the label and I can't figure out why.
The for-loop is the modifier: once it has modified every string in the list, it should return the word with the right letters or "_".
At first I tried letterlist[i] = Letter or letterlist[i] = "_", but then if I typed in a right letter it would show only that letter.
For example: word = "pen". If I typed in "p", it resulted in "ppp".
letterlist = new List<string>();
char[] wordarray = woord.GetWordcharArray(); //word in charArrays
string newwordstring = new string(wordarray);
for (int i = 0; i < wordarray.Length; i++)
{
letterlist.Add(" "); //adds empty strings in list with the length of the word
}
/*
* For-loop for every string in List to check and modify if it's correct or not
*/
for (int i = 0; i < letterlist.Count; i++)
{
if (letterlist[i].Contains(Letter) && newwordstring.Contains(Letter)) //right answer: letter[i] = Letter
{
letterlist[i].Replace(Letter, Letter);
}
else if (letterlist[i].Contains(" ") && newwordstring.Contains(Letter)) //right answer: letter[i] = ""
{
letterlist[i].Replace(" ", Letter);
}
else if (letterlist[i].Contains("_") && newwordstring.Contains(Letter)) //right answer: letter[i] = "_"
{
letterlist[i].Replace("_", Letter);
}
else if (letterlist[i].Contains(" ") && !newwordstring.Contains(Letter)) //wrong answer: letter[i] = ""
{
letterlist[i].Replace(" ", "_");
}
else if (letterlist[i].Contains("_") && !newwordstring.Contains(Letter)) //wrong answer: letter[i] = "_"
{
letterlist[i].Replace(" ", "_");
}
}
/*
* empty += every modified letterlist[i]-string
*/
string empty = "";
foreach (string letter in letterlist)
{
empty += letter;
}
return empty;
New code but it only shows "___" ("_" as many times as the amount of letters as word has):
char[] wordarray = woord.GetWordcharArray(); //word in charArrays
string newwordstring = new string(wordarray); //actual word
string GuessedWord = new string('_', newwordstring.Length);//word that shows in form
bool GuessLetter(char letterguess)
{
bool guessedright = false;
StringBuilder builder = new StringBuilder(GuessedWord);
for(int i = 0; i < GuessedWord.Length; i++)
{
if(char.ToLower(wordarray[i]) == Convert.ToChar(Letter))
{
builder[i] = wordarray[i];
guessedright = true;
}
}
GuessedWord = builder.ToString();
return guessedright;
}
return GuessedWord;
First of all, note that C# strings are immutable, which means letterlist[i].Replace(" ", "_") does not replace spaces with underscores in place. It returns a new string in which spaces have been replaced with underscores.
Therefore, you should reassign this result:
letterlist[i] = letterlist[i].Replace(" ", "_");
Second, Replace(Letter, Letter) won't do much.
Third, in your first for loop, you set every item in letterlist to " ".
I don't understand then why you expect (in your second for loop) letterlist[i].Contains("_") to ever be true.
Finally, I'll leave here something you might find interesting (especially the use of StringBuilder):
class Hangman
{
static void Main()
{
Hangman item = new Hangman();
item.Init();
Console.WriteLine(item.Guessed); // ____
item.GuessLetter('t'); // true
Console.WriteLine(item.Guessed); // T__t
item.GuessLetter('a'); // false
Console.WriteLine(item.Guessed); // T__t
item.GuessLetter('e'); // true
Console.WriteLine(item.Guessed); // Te_t
}
string Word {get;set;}
string Guessed {get;set;}
void Init()
{
Word = "Test";
Guessed = new string('_',Word.Length);
}
bool GuessLetter(char letter)
{
bool guessed = false;
// use a stringbuilder so you can change any character
var sb = new StringBuilder(Guessed);
// for each character of Word, we check if it is the one we claimed
for(int i=0; i<Word.Length; i++)
{
// Let's put both characters to lower case so we can compare them right
if(Char.ToLower(Word[i]) == Char.ToLower(letter)) // have we found it?
{
// Yeah! So we put it in the stringbuilder at the same place
sb[i] = Word[i];
guessed = true;
}
}
// reassign the stringbuilder's representation to Guessed
Guessed = sb.ToString();
// tell if you guessed right
return guessed;
}
}

Regular expression for pipe delimited and double quoted string

I have a string something like this:
"2014-01-23 09:13:45|\"10002112|TR0859657|25-DEC-2013>0000000000000001\"|10002112"
I would like to split by pipe apart from anything wrapped in double quotes so I have something like (similar to how csv is done):
[0] => 2014-01-23 09:13:45
[1] => 10002112|TR0859657|25-DEC-2013>0000000000000001
[2] => 10002112
I would like to know if there is a regular expression that can do this?
I think you may need to write your own parser.
You will need:
custom collection to keep results
boolean flag to decide whether pipe is inside quotation or outside quotation marks
string (or StringBuilder) to keep current word
The idea is that you read string char by char. Each char is appended to the word. If there is a pipe outside quotation marks you add the word to your result collection. If there is a quote you switch a flag so you don't treat the pipe as a divider anymore but you append it as a part of the word. Then if there is another quotation you switch the flag back again. So next pipe will result in adding the whole word (with pipes within quotation marks) to the collection. I tested the code below on your example and it worked.
private static List<string> ParseLine(string yourString)
{
bool ignorePipe = false;
string word = string.Empty;
List<string> divided = new List<string>();
foreach (char c in yourString)
{
if (c == '|' &&
!ignorePipe)
{
divided.Add(word);
word = string.Empty;
}
else if (c == '"')
{
ignorePipe = !ignorePipe;
}
else
{
word += c;
}
}
divided.Add(word);
return divided;
}
How about this Regular Expression:
/((["|]).*\2)/g
It looks like it could be used as valid split expression.
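A pattern that matches the fields rather than the separators tends to be easier to reason about. The following is a minimal sketch, assuming quoted fields never contain an embedded double quote and that empty fields can be skipped; the names PipeSplitter and SplitFields are invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class PipeSplitter
{
    // Either a double-quoted field (quotes dropped) or a run of non-pipe characters.
    // .NET allows the same group name to appear in both alternatives.
    private static readonly Regex Field =
        new Regex(@"""(?<v>[^""]*)""|(?<v>[^|]+)", RegexOptions.Compiled);

    public static List<string> SplitFields(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            // Whichever alternative matched, its text is in the group "v".
            fields.Add(m.Groups["v"].Value);
        }
        return fields;
    }
}
```

On the sample line from the question this yields the three expected fields, with the pipes inside the quoted part kept intact.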
I'm going to blatantly ignore the fact that you want a RegEx, because I think that making your own IEnumerable will be easier. Plus, you get instant access to Linq.
var line = "2014-01-23 09:13:45|\"10002112|TR0859657|25-DEC-2013>0000000000000001\"|10002112";
var data = GetPartsFromLine(line).ToList();
private static IEnumerable<string> GetPartsFromLine(string line)
{
    // position always points at the first character of the next field
    int position = 0;
    while (position < line.Length)
    {
        if (line[position] == '"')
        {
            //go find the closing "
            int endQuote = line.IndexOf('"', position + 1);
            if (endQuote == -1)
            {
                throw new FormatException("Unterminated quote at index " + position);
            }
            yield return line.Substring(position + 1, endQuote - position - 1);
            //skip the closing quote and the pipe that follows it, if any
            position = endQuote + 1;
            if (position < line.Length && line[position] == '|')
            {
                position++;
            }
        }
        else
        {
            //go find the next |
            int pipe = line.IndexOf('|', position);
            if (pipe == -1)
            {
                //hit the end of the line
                yield return line.Substring(position);
                position = line.Length;
            }
            else
            {
                yield return line.Substring(position, pipe - position);
                position = pipe + 1;
            }
        }
    }
}
This hasn't been fully tested, but it works with your example.

How to split a space-delimited list of paths where paths can include spaces in .NET 2?

For instance:
c:\dir1 c:\dir2 "c:\my files" c:\code "old photos" "new photos"
Should be read as a list:
c:\dir1
c:\dir2
c:\my files
c:\code
old photos
new photos
I can write a function which parses the string linearly but wondered if the .NET 2.0 toolbox has any cool tricks one could use?
Since you have to hit every character I think a brute force is going to give you the best performance.
That way you hit every character exactly once.
And it limits the number of comparisons performed.
static void Main(string[] args)
{
string input = @"c:\dir1 c:\dir2 ""c:\my files"" c:\code ""old photos"" ""new photos""";
List<string> splitInput = MySplit(input);
foreach (string s in splitInput)
{
System.Diagnostics.Debug.WriteLine(s);
}
System.Diagnostics.Debug.WriteLine(input);
}
public static List<string> MySplit(string input)
{
List<string> split = new List<string>();
StringBuilder sb = new StringBuilder();
bool splitOnQuote = false;
char quote = '"';
char space = ' ';
foreach (char c in input.ToCharArray())
{
if (splitOnQuote)
{
if (c == quote)
{
if (sb.Length > 0)
{
split.Add(sb.ToString());
sb.Length = 0; // StringBuilder.Clear() only exists from .NET 4 onward
}
splitOnQuote = false;
}
else { sb.Append(c); }
}
else
{
if (c == space)
{
if (sb.Length > 0)
{
split.Add(sb.ToString());
sb.Length = 0; // StringBuilder.Clear() only exists from .NET 4 onward
}
}
else if (c == quote)
{
if (sb.Length > 0)
{
split.Add(sb.ToString());
sb.Length = 0; // StringBuilder.Clear() only exists from .NET 4 onward
}
splitOnQuote = true;
}
else { sb.Append(c); }
}
}
if (sb.Length > 0) split.Add(sb.ToString());
return split;
}
Usually for this type of problem one could develop a regular expression to parse out the fields. ( "(.*?)" ) would give you all the string values in quotes. You could strip all those values from your string, and then do a simple split on space after all the quoted items are out.
static void Main(string[] args)
{
string myString = "\"test\" test1 \"test2 test3\" test4 test6 \"test5\"";
string myRegularExpression = @"""(.*?)""";
List<string> listOfMatches = new List<string>();
myString = Regex.Replace(myString, myRegularExpression, delegate(Match match)
{
string v = match.ToString();
listOfMatches.Add(v);
return "";
});
var array = myString.Split(' ');
foreach (string s in array)
{
if(s.Trim().Length > 0)
listOfMatches.Add(s);
}
foreach (string match in listOfMatches)
{
Console.WriteLine(match);
}
Console.Read();
}
Unfortunately, I don't think there is any sort of C# kungfu that makes it much simpler. I should add that obviously, this algorithm gives you the items out of order... so if that matters... this isn't a good solution.
Here's a regex-only solution which captures both space-delimited and quoted paths. Quoted paths are stripped of the quotes, multiple spaces don't cause empty list entries. Edge case of mixing a quoted path with a non-quoted path without intervening space is interpreted as multiple entries.
It can be optimized by disabling captures for unused groups but I opted for more readability instead.
static Regex re = new Regex(@"^([ ]*((?<r>[^ ""]+)|[""](?<r>[^""]*)[""]))*[ ]*$");
public static IEnumerable<string> RegexSplit(string input)
{
var m = re.Match(input ?? "");
if(!m.Success)
throw new ArgumentException("Malformed input.");
return from Capture capture in m.Groups["r"].Captures select capture.Value;
}
Assuming that a space acts as a delimiter except when enclosed in quotes (to allow paths to contain spaces), I'd recommend the following algorithm:
ignore_space = false;
i = 0;
list_of_breaks=[];
while(i < input_length)
{
    if(charat(i) is a space and ignore_space is false)
    {
        add i to list_of_breaks;
    }
    else if(charat(i) is a quote)
    {
        ignore_space = ! ignore_space
    }
    i = i + 1;
}
split the input at the indices listed in list_of_breaks

Determine if string has all unique characters

I'm working through an algorithm problem set which poses the following question:
"Determine if a string has all unique characters. Assume you can only use arrays".
I have a working solution, but I would like to see if there is anything better optimized in terms of time complexity. I do not want to use LINQ. Appreciate any help you can provide!
static void Main(string[] args)
{
FindDupes("crocodile");
}
static string FindDupes(string text)
{
if (text.Length == 0 || text.Length > 256)
{
Console.WriteLine("String is either empty or too long");
}
char[] str = new char[text.Length];
char[] output = new char[text.Length];
int strLength = 0;
int outputLength = 0;
foreach (char value in text)
{
bool dupe = false;
for (int i = 0; i < strLength; i++)
{
if (value == str[i])
{
dupe = true;
break;
}
}
if (!dupe)
{
str[strLength] = value;
strLength++;
output[outputLength] = value;
outputLength++;
}
}
return new string(output, 0, outputLength);
}
If time complexity is all you care about you could map the characters to int values, then have an array of bool values which remember if you've seen a particular character value previously.
Something like ... [not tested]
bool[] array = new bool[256]; // or larger for Unicode
foreach (char value in text)
if (array[(int)value])
return false;
else
array[(int)value] = true;
return true;
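Filled out into a complete method, the lookup-table idea might look like the sketch below. It assumes "character" means a single UTF-16 char, sizes the table for every possible char value so that no index can go out of range, and uses the made-up names UniqueChars and HasAllUniqueChars.

```csharp
// Hypothetical helper; names are illustrative only.
public static class UniqueChars
{
    public static bool HasAllUniqueChars(string text)
    {
        // One flag per possible UTF-16 code unit (65,536 entries).
        var seen = new bool[char.MaxValue + 1];
        foreach (char c in text)
        {
            if (seen[c])
            {
                return false; // second occurrence of c
            }
            seen[c] = true;
        }
        return true;
    }
}
```

This makes a single pass over the input, so the running time is linear in the string length, compared with the nested loops in the question. HasAllUniqueChars("crocodile") returns false because 'c' and 'o' repeat.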
try this,
string RemoveDuplicateChars(string key)
{
string table = string.Empty;
string result = string.Empty;
foreach (char value in key)
{
if (table.IndexOf(value) == -1)
{
table += value;
result += value;
}
}
return result;
}
usage
Console.WriteLine(RemoveDuplicateChars("hello"));
Console.WriteLine(RemoveDuplicateChars("helo"));
Console.WriteLine(RemoveDuplicateChars("Crocodile"));
output
helo
helo
Crocdile
public boolean ifUnique(String toCheck){
String str="";
for(int i=0;i<toCheck.length();i++)
{
if(str.contains(""+toCheck.charAt(i)))
return false;
str+=toCheck.charAt(i);
}
return true;
}
EDIT:
You may also consider omitting the boundary case where toCheck is an empty string.
The following code works:
static void Main(string[] args)
{
isUniqueChart("text");
Console.ReadKey();
}
static Boolean isUniqueChart(string text)
{
if (text.Length == 0 || text.Length > 256)
{
Console.WriteLine(" The text is empty or too large");
return false;
}
Boolean[] char_set = new Boolean[256];
for (int i = 0; i < text.Length; i++)
{
int val = text[i];
if (char_set[val]) //already found this char in the string
{
Console.WriteLine(" The text is not unique");
return false;
}
char_set[val] = true;
}
Console.WriteLine(" The text is unique");
return true;
}
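If the string has only lowercase letters (a-z) or only uppercase letters (A-Z), you can use a very optimized O(1) solution that also uses O(1) space.
C++ code (as written it handles a-z; subtract 'A' instead of 'a' for A-Z):
bool checkUnique(string s){
    if(s.size() > 26)
        return false; // more than 26 letters guarantees a repeat
    int unique = 0; // bit j is set once letter 'a' + j has been seen
    for (int i = 0; i < s.size(); ++i) {
        int j = s[i] - 'a';
        // Parenthesize the mask test: relational operators bind tighter than &.
        if((unique & (1 << j)) != 0)
            return false;
        unique = unique | (1 << j);
    }
    return true;
}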
Remove Duplicates in entire Unicode Range
Not all characters can be represented by a single C# char. If you need to take into account combining characters and extended unicode characters, you need to:
parse the characters using StringInfo
normalize the characters
find duplicates amongst the normalized strings
Code to remove duplicate characters:
We keep track of the entropy, storing the normalized characters (each character is a string, because many characters require more than 1 C# char). In case a character (normalized) is not yet stored in the entropy, we append the character (in specified form) to the output.
public static class StringExtension
{
public static string RemoveDuplicateChars(this string text)
{
var output = new StringBuilder();
var entropy = new HashSet<string>();
var iterator = StringInfo.GetTextElementEnumerator(text);
while (iterator.MoveNext())
{
var character = iterator.GetTextElement();
if (entropy.Add(character.Normalize()))
{
output.Append(character);
}
}
return output.ToString();
}
}
Unit Test:
Let's test a string that contains variations on the letter A, including the Angstrom sign Å. The Angstrom sign has Unicode code point U+212B, but can also be constructed as the letter A followed by the combining diacritic U+030A. Both represent the same character.
// ÅÅAaA
var input = "\u212BA\u030AAaA";
// ÅAa
var output = input.RemoveDuplicateChars();
Further extensions could allow for a selector function that determines how to normalize characters. For instance the selector (x) => x.ToUpperInvariant().Normalize() would allow for case-insensitive duplicate removal.
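One way that selector idea could look is sketched below. The overload, the class name StringSelectorExtension and the parameter name selector are assumptions made for illustration; the selector decides which key is used for duplicate detection, while the text element is appended in its original form.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical extension; names are illustrative only.
public static class StringSelectorExtension
{
    public static string RemoveDuplicateChars(this string text, Func<string, string> selector)
    {
        var output = new StringBuilder();
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var iterator = StringInfo.GetTextElementEnumerator(text);
        while (iterator.MoveNext())
        {
            var element = iterator.GetTextElement();
            // Add returns false when the selected key was already seen.
            if (seen.Add(selector(element)))
            {
                output.Append(element);
            }
        }
        return output.ToString();
    }
}
```

For example, "aAbB".RemoveDuplicateChars(x => x.ToUpperInvariant().Normalize()) keeps only the first spelling of each letter and returns "ab".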
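public static bool CheckUnique(string str)
{
    // A 32-bit int can track only 32 distinct characters, and C# masks an int
    // shift count to its low 5 bits, so offsets outside 0..31 would silently
    // alias onto other bits. Reject them instead of returning a wrong answer.
    int accumulator = 0;
    foreach (int asciiCode in str)
    {
        int offset = asciiCode - ' ';
        if (offset < 0 || offset > 31)
            throw new ArgumentOutOfRangeException("str", "Only characters ' ' through '?' are supported.");
        int shiftedBit = 1 << offset;
        // Compare with != 0: bit 31 makes the masked value negative.
        if ((accumulator & shiftedBit) != 0)
            return false;
        accumulator |= shiftedBit;
    }
    return true;
}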

What is the best algorithm for arbitrary delimiter/escape character processing?

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here's the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C#) would be, basically:
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.
A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
    # states
    parsing = 0
    escaped = 1
    state = parsing
    found = []
    parsed = ""
    for c in input:
        if state == parsing:
            if c == delim:
                found.append(parsed)
                parsed = ""
            elif c == escape:
                state = escaped
            else:
                parsed += c
        else: # state == escaped
            parsed += c
            state = parsing
    if parsed:
        found.append(parsed)
    return found
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}
private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}
The implementation of this kind of tokenizer in terms of an FSM is fairly straightforward.
You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input) action
========================
BEGIN(*): token.clear(); state=START;
END(*): return;
*(\n\0): token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
START(*): token.append(input); state=NORM;
NORM(DELIMITER): token.emit(); token.clear(); state=START;
NORM(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
NORM(*): token.append(input);
ESC(*): token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive escapes naturally, and can be easily extended to give special meaning to more escape sequences (e.g. add a rule like ESC(t) token.append(TAB)).
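The transition table can be turned into C# fairly directly. The sketch below is one possible reading of it, not a definitive port: the names FsmTokenizer and Tokenize are invented, the newline rule is left out, the final token is emitted only if it is non-empty, and the ESC(t) rule is included on the assumption that neither the delimiter nor the escape character is 't'.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; names are illustrative only.
public static class FsmTokenizer
{
    private enum State { Start, Norm, Esc }

    public static List<string> Tokenize(string input, char delimiter, char escape)
    {
        var tokens = new List<string>();
        var token = new StringBuilder();
        var state = State.Start;

        foreach (char c in input)
        {
            switch (state)
            {
                case State.Start:
                case State.Norm:
                    if (c == delimiter)
                    {
                        // NORM(DELIMITER) emits the token; START(DELIMITER) skips it.
                        if (state == State.Norm)
                        {
                            tokens.Add(token.ToString());
                            token.Clear();
                        }
                        state = State.Start;
                    }
                    else if (c == escape)
                    {
                        state = State.Esc; // the escape itself is not added
                    }
                    else
                    {
                        token.Append(c);
                        state = State.Norm;
                    }
                    break;

                case State.Esc:
                    // ESC(t) appends a tab; any other character is taken literally.
                    token.Append(c == 't' ? '\t' : c);
                    state = State.Norm;
                    break;
            }
        }

        // End of input: emit whatever was collected, if anything.
        if (token.Length > 0)
        {
            tokens.Add(token.ToString());
        }
        return tokens;
    }
}
```

With ',' as the delimiter and '/' as the escape, "a,b/,c,,d" produces the tokens "a", "b,c" and "d", and eight slashes followed by a comma produce the single token "////".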
Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.
Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
string encodedString,
char separator,
char escape)
{
var inEscapeSequence = false;
var currentToken = new StringBuilder();
foreach (var currentCharacter in encodedString)
if (inEscapeSequence)
{
currentToken.Append(currentCharacter);
inEscapeSequence = false;
}
else
if (currentCharacter == escape)
inEscapeSequence = true;
else
if (currentCharacter == separator)
{
yield return currentToken.ToString();
currentToken.Clear();
}
else
currentToken.Append(currentCharacter);
yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.
You're looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.
