Replace char(0x10) with a String (The Optimized way) - c#

This is a common question but I hope this does not get tagged as a duplicate since the nature of the question is different (please read the whole not only the title)
Unaware of the existence of String.Replace I wrote the following:
int theIndex = 0;
while ((theIndex = message.IndexOf(separationChar, theIndex)) != -1) //we found the character
{
theIndex++;
if (theIndex < message.Length)//not in the last position
{
message = message.Insert(theIndex, theTime);
}
else
{
// I dont' think this is really neccessary
break;
}
} //while finding characters
As you can see I am replacing occurrences of separationChar in the message String with a String called "theTime".
Now, this works ok for small strings but I have been given a really huge String (in the order of several hundred Kbytes- by the way is there a limit for String or StringBuilder??) and it takes a lot of time...
So my questions are:
1) Is it more efficient if I just do
oldString=separationChar.ToString();
newString=oldString.Insert(theTime);
message= message.Replace(oldString,newString);
2) Is there any other way I can process very long Strings to insert a String (theTime) when finding some char in a very fast and efficient way??
Thanks a lot

As Danny already mentioned, string.Insert() actually creates a new instance each time you use it, and these also have to be garbage collected at some point.
You could instead start with an empty StringBuilder to construct the result string:
public static string Replace(this string str, char find, string replacement)
{
StringBuilder result = new StringBuilder(str.Length); // initial capacity
int pointer = 0;
int index;
while ((index = str.IndexOf(find, pointer)) >= 0)
{
// Append the unprocessed data up to the character
result.Append(str, pointer, index - pointer);
// Append the replacement string
result.Append(replacement);
// Next unprocessed data starts after the character
pointer = index + 1;
}
// Append the remainder of the unprocessed data
result.Append(str, pointer, str.Length - pointer);
return result.ToString();
}
This will not cause a new string to be created (and garbage collected) for each occurrence of the character. Instead, when the internal buffer of the StringBuilder is full, it will create a new buffer chunk "of sufficient capacity". Quote from reference source, when its buffer is full:
Compute the length of the new block we need
We make the new chunk at least big enough for the current need (minBlockCharCount), but also as big as the current length (thus doubling capacity), up to a maximum
(so we stay in the small object heap, and never allocate really big chunks even if
the string gets really big).

Thank you for answering my question.
I am writing an answer because I have to report that I tried the solution in my question 1) and it is indeed more efficient according to the results of my program. String.Replace can replace a string(from a char) with another string very fast.
oldString=separationChar.ToString();
newString=oldString.Insert(theTime);
message= message.Replace(oldString,newString);

Related

Removing punctuation from an extremely long string

I'm working on a book encryption program for one of my courses and I've run into a problem. Our professor gave us the example of using say Pride and Prejudice as the book used to encrypt, so I chose that one to test my program. The current function I'm using to remove the punctuation from the string is taking so long that the program is being forced into break mode. This function works for smaller strings even pages long, but when I fed it Pride and Prejudice it takes way to long.
public void removePunctuation(ref string s) {
string result = "";
for (int i = 0; i < s.Length; i++) {
if (Char.IsWhiteSpace(s[i])) {
result += ' ';
} else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
// do nothing
} else {
result += s[i];
}
}
s = result;
}
So I think I need a faster way to remove punctuation from this string if anyone has any suggestions? I know looping through every character is horrible, but I'm stumped and I was never taught Regex in depth.
Edit: I was asked how I was storing the string in the dictionary class! This is the constructor for another class that actually uses the formatted string.
public CodeBook(string book)
{
BookMap = new Dictionary<string, List<int>>();
Key = book.Split(null).ToList(); // split string into words
foreach(string s in Key)
{
if (!BookMap.Keys.Contains(s))
{
BookMap.Add(s, Enumerable.Range(0, Key.Count).Where(i => Key[i] == s).ToList());
// add word and add list of occurrances of word
}
}
}
This is slow because you construct string by concatenations in a loop. You have several approaches that are more performant:
Use StringBuilder - unlike string concatenation which constructs a new object each time you add a character, this approach expands the string under construction by larger chunks, preventing excessive garbage creation.
Use LINQ's filtering with Where - this approach constructs an array of chars in a single shot, then constructs a single string from it.
Use regular expression's Replace - this method is optimized to deal with strings of virtually unlimited sizes.
Roll your own algorithm - create an array of chars that corresponds to the length of the original string. Walk through the string, and add the characters that you wish to keep to the array. Use string's constructor that takes the array, the initial index, and the length to construct the string at once.
Looping through every character once is not that bad. You're doing it all in one pass, that's not trivial to avoid.
The problem lies in the fact that the framework will need to allocate a new copy of the (partial) string whenever you do something like
result += s[i];
You can avoid that by introducing a StringBuilder documented here to append non-punctuation characters as you go.
public string removePunctuation(string s)
{
var result = new StringBuilder();
for (int i = 0; i < s.Length; i++) {
if (Char.IsWhiteSpace(s[i])) {
result.Append(" ");
} else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
// do nothing
} else {
result.Append(s[i]);
}
}
return result.ToString();
}
You could further reduce the number of necessary Append calls with a refined algorithm, for example look ahead to the next punctuation and append larger portions at once, or use an existing string manipulation library like RegEx. But the introduction of StringBuilder above should give you a noticable performance gain already.
I was never taught Regex in depth
Use the search provider of your choice, you may end up with a tested solution which you can just study and use: https://stackoverflow.com/a/5871826/1132334
You can use Regex to remove punctuations as below.
public string removePunctuation(string s)
{
string result = Regex.Replace(s, #"[^\w\s]", "");
return result;
}
^ Means: not these characters (letters, numbers).
\w Means: word characters.
\s Means: space characters.

Get UTF8.GetBytes from very large stringbuilder

I have a StringBuilder of length 1,539,121,968. When calling StringBuilder .ToString() on it fails with OutOfMemoryException. I tried creating a char array but was not allowed to create such big array.
I need to store it a byte array in UTF8 format. Is it possible?
I'd suggest looking at the documentation for streams. As this might help.
Another way to approach it would be to split it up. As for your last comment stating that you wish to store it as a ByteArray with UTF8 you'd need a char[] as else you'd lose your encoding. I'd reccomend splitting it into many smaller strings (or char[]s) stored in separate objects that can easily be reconstructed. Something like this might suffice, create many StringSlices:
public class StringSlice()
{
public Str {get;}
public Index {get;}
public StringSlice(string str, int index)
{
this.Str = str;
this.Index = index;
}
public static List<string> ReconstructString(IEnumerable<StringSlice> parts)
{
//Sort input by index return list with new strings in order. Probably have to use a buffer on the disc so as not to breach 2GB obj limit.
}
}
In essence what you would be doing here is similar to the way internet packets are split and reconstructed. I'm not entirely sure if I've answered your question but hopefully this comes some way to helping.

.NET Regular expressions on bytes instead of chars

I'm trying to do some parsing that will be easier using regular expressions.
The input is an array (or enumeration) of bytes.
I don't want to convert the bytes to chars for the following reasons:
Computation efficiency
Memory consumption efficiency
Some non-printable bytes might be complex to convert to chars. Not all the bytes are printable.
So I can't use Regex.
The only solution I know, is using Boost.Regex (which works on bytes - C chars), but this is a C++ library that wrapping using C++/CLI will take considerable work.
How can I use regular expressions on bytes in .NET directly, without working with .NET strings and chars?
Thank you.
There is a bit of impedance mismatch going on here. You want to work with Regular expressions in .Net which use strings (multi-byte characters), but you want to work with single byte characters. You can't have both at the same time using .Net as per usual.
However, to break this mismatch down, you could deal with a string in a byte oriented fashion and mutate it. The mutated string can then act as a re-usable buffer. In this way you will not have to convert bytes to chars, or convert your input buffer to a string (as per your question).
An example:
//BLING
byte[] inputBuffer = { 66, 76, 73, 78, 71 };
string stringBuffer = new string('\0', 1000);
Regex regex = new Regex("ING", RegexOptions.Compiled);
unsafe
{
fixed (char* charArray = stringBuffer)
{
byte* buffer = (byte*)(charArray);
//Hard-coded example of string mutation, in practice you would
//loop over your input buffers and regex\match so that the string
//buffer is re-used.
buffer[0] = inputBuffer[0];
buffer[2] = inputBuffer[1];
buffer[4] = inputBuffer[2];
buffer[6] = inputBuffer[3];
buffer[8] = inputBuffer[4];
Console.WriteLine("Mutated string:'{0}'.",
stringBuffer.Substring(0, inputBuffer.Length));
Match match = regex.Match(stringBuffer, 0, inputBuffer.Length);
Console.WriteLine("Position:{0} Length:{1}.", match.Index, match.Length);
}
}
Using this technique you can allocate a string "buffer" which can be re-used as the input to Regex, but you can mutate it with your bytes each time. This avoids the overhead of converting\encoding your byte array into a new .Net string each time you want to do a match. This could prove to be very significant as I have seen many an algorithm in .Net try to go at a million miles an hour only to be brought to its knees by string generation and the subsequent heap spamming and time spent in GC.
Obviously this is unsafe code, but it is .Net.
The results of the Regex will generate strings though, so you have an issue here. I'm not sure if there is a way of using Regex that will not generate new strings. You can certainly get at the match index and length information but the string generation violates your requirements for memory efficiency.
Update
Actually after disassembling Regex\Match\Group\Capture, it looks like it only generates the captured string when you access the Value property, so you may at least not be generating strings if you only access index and length properties. However, you will be generating all the supporting Regex objects.
Well, if I faced this problem, I would DO the C++/CLI wrapper, except I'd create specialized code for what I want to achieve. Eventually develop the wrapper with time to do general things, but this just an option.
The first step is to wrap the Boost::Regex input and output only. Create specialized functions in C++ that do all the stuff you want and use CLI just to pass the input data to the C++ code and then fetch the result back with the CLI. This doesn't look to me like too much work to do.
Update:
Let me try to clarify my point. Even though I may be wrong, I believe you wont be able to find any .NET Binary Regex implementation that you could use. That is why - whether you like it or not - you will be forced to choose between CLI wrapper and bytes-to-chars conversion to use .NET's Regex. In my opinion the wrapper is better choice, because it will be working faster. I did not do any benchmarking, this is just an assumption based on:
Using wrapper you just have to cast
the pointer type (bytes <-> chars).
Using .NET's Regex you have to
convert each byte of the input.
As an alternative to using unsafe, just consider writing a simple, recursive comparer like:
static bool Evaluate(byte[] data, byte[] sequence, int dataIndex=0, int sequenceIndex=0)
{
if (sequence[sequenceIndex] == data[dataIndex])
{
if (sequenceIndex == sequence.Length - 1)
return true;
else if (dataIndex == data.Length - 1)
return false;
else
return Evaluate(data, sequence, dataIndex + 1, sequenceIndex + 1);
}
else
{
if (dataIndex < data.Length - 1)
return Evaluate(data, sequence, dataIndex+1, 0);
else
return false;
}
}
You could improve efficiency in a number of ways (i.e. seeking the first byte match instead of iterating, etc.) but this could get you started... hope it helps.
I personally went a different approach and wrote a small state machine that can be extended. I believe if parsing protocol data this is much more readable than regex.
bool ParseUDSResponse(PassThruMsg rxMsg, UDScmd.Mode txMode, byte txSubFunction, out UDScmd.Response functionResponse, out byte[] payload)
{
payload = new byte[0];
functionResponse = UDScmd.Response.UNKNOWN;
bool positiveReponse = false;
var rxMsgBytes = rxMsg.GetBytes();
//Iterate the reply bytes to find the echod ECU index, response code, function response and payload data if there is any
//If we could use some kind of HEX regex this would be a bit neater
//Iterate until we get past any and all null padding
int stateMachine = 0;
for (int i = 0; i < rxMsgBytes.Length; i++)
{
switch (stateMachine)
{
case 0:
if (rxMsgBytes[i] == 0x07) stateMachine = 1;
break;
case 1:
if (rxMsgBytes[i] == 0xE8) stateMachine = 2;
else return false;
case 2:
if (rxMsgBytes[i] == (byte)txMode + (byte)OBDcmd.Reponse.SUCCESS)
{
//Positive response to the requested mode
positiveReponse = true;
}
else if(rxMsgBytes[i] != (byte)OBDcmd.Reponse.NEGATIVE_RESPONSE)
{
//This is an invalid response, give up now
return false;
}
stateMachine = 3;
break;
case 3:
functionResponse = (UDScmd.Response)rxMsgBytes[i];
if (positiveReponse && rxMsgBytes[i] == txSubFunction)
{
//We have a positive response and a positive subfunction code (subfunction is reflected)
int payloadLength = rxMsgBytes.Length - i;
if(payloadLength > 0)
{
payload = new byte[payloadLength];
Array.Copy(rxMsgBytes, i, payload, 0, payloadLength);
}
return true;
} else
{
//We had a positive response but a negative subfunction error
//we return the function error code so it can be relayed
return false;
}
default:
return false;
}
}
return false;
}

Recursive woes - reducing an input string

I'm working on a portion of code that is essentially trying to reduce a list of strings down to a single string recursively.
I have an internal database built up of matching string arrays of varying length (say array lengths of 2-4).
An example input string array would be:
{"The", "dog", "ran", "away"}
And for further example, my database could be made up of string arrays in this manner:
(length 2) {{"The", "dog"},{"dog", "ran"}, {"ran", "away"}}
(length 3) {{"The", "dog", "ran"}.... and so on
So, what I am attempting to do is recursively reduce my input string array down to a single token. So ideally it would parse something like this:
1) {"The", "dog", "ran", "away"}
Say that (seq1) = {"The", "dog"} and (seq2) = {"ran", "away"}
2) { (seq1), "ran", "away"}
3) { (seq1), (seq2)}
In my sequence database I know that, for instance, seq3 = {(seq1), (seq2)}
4) { (seq3) }
So, when it is down to a single token, I'm happy and the function would end.
Here is an outline of my current program logic:
public void Tokenize(Arraylist<T> string_array, int current_size)
{
// retrieve all known sequences of length [current_size] (from global list array)
loc_sequences_by_length = sequences_by_length[current_size-min_size]; // sequences of length 2 are stored in position 0 and so on
// escape cases
if (string_array.Count == 1)
{
// finished successfully
return;
}
else if (string_array.Count < current_size)
{
// checking sequences of greater length than input string, bail
return;
}
else
{
// split input string into chunks of size [current_size] and compare to local database
// of known sequences
// (splitting code works fine)
foreach (comparison)
{
if (match_found)
{
// update input string and recall function to find other matches
string_array[found_array_position] = new_sequence;
string_array.Removerange[found_array_position+1, new_sequence.Length-1];
Tokenize(string_array, current_size)
}
}
}
// ran through unsuccessfully, increment length and try again for new sequence group
current_size++;
if (current_size > MAX_SIZE)
return;
else
Tokenize(string_array, current_size);
}
I thought it was straightforward enough, but have been getting some strange results.
Generally it appears to work, but upon further review of my output data I'm seeing some issues. Mainly, it appears to work up to a certain point...and at that point my 'curr_size' counter resets to the minimum value.
So it is called with a size of 2, then 3, then 4, then resets to 2.
My assumption was that it would run up to my predetermined max size, and then bail completely.
I tried to simplify my code as much as possible, so there are probably some simple syntax errors in transcribing. If there is any other detail that may help an eagle-eyed SO user, please let me know and I'll edit.
Thanks in advance
One bug is:
string_array[found_array_position] = new_sequence;
I don't know where this is defined, and as far as I can tell if it was defined, it is never changed.
In your if statement, when if match_found ever set to true?
Also, it appears you have an extra close brace here, but you may want the last block of code to be outside of the function:
}
}
}
It would help if you cleaned up the code, to make it easier to read. Once we get past the syntactic errors it will be easier to see what is going on, I think.
Not sure what all the issues are, but the first thing I'd do is have your "catch-all" exit block right at the beginning of your method.
public void Tokenize(Arraylist<T> string_array, int current_size)
{
if (current_size > MAX_SIZE)
return;
// Guts go here
Tokenize(string_array, ++current_size);
}
A couple things:
Your tokens are not clearly separated from your input string values. This makes it more difficult to handle, and to see what's going on.
It looks like you're writing pseudo-code:
loc_sequences_by_length is not used
found_array_position is not defined
Arraylist should be ArrayList.
etc.
Overall I agree with James' statement:
It would help if you cleaned up the
code, to make it easier to read.
-Doug

How to optimize memory usage in this algorithm?

I'm developing a log parser, and I'm reading files of strings of more than 150MB.- This is my approach, Is there any way to optimize what is in the While statement? The problem is that is consuming a lot of memory.- I also tried with a stringbuilder facing the same memory comsuption.-
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
Im saving each line because after reading the log I will need to check for some information on every stored line.- The language is C#
Memory economy can be achieved if your log lines are actually can be parsed to a data row representation.
Here is a typical log line i can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes 200 bytes in memory.
At the same time, following representation just takes belo 16 bytes:
Enum LogReason { Operation, Error, Warning };
Enum EventKind short { DataStoreOperation, DataReadOperation };
Enum OperationStatus short { Success, Failed };
LogRow
{
DateTime EventTime;
LogReason Reason;
EventKind Kind;
OperationStatus Status;
}
Another optimization possibility is just parsing a line to array of string tokens,
this way you could make use of string interning.
For example, if a word "DataStoreOperation" takes 36 bytes, and if it has 1000000 entiries in the file, the economy is (18*2 - 4) * 1000000 = 32 000 000 bytes.
Try to make your algorithm sequential.
Using an IEnumerable instead of a List helps playing nice with memory, while keeping same semantic as working with a list, if you don't need random access to lines by index in the list.
IEnumerable<string> ReadLines()
{
// ...
while ((lineOfLog = logFile.ReadLine()) != null)
{
yield return lineOfLog;
}
}
//...
foreach( var line in ReadLines() )
{
ProcessLine(line);
}
I am not sure if it will fit your project but you can store the result in StringBuilder instead of strings list.
For example, this process on my machine takes 250MB memory after loading (file is 50MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while (( line=streamReader.ReadLine())!=null)
{
list.Add(line);
}
}
}
On the other hand, this code process will take only 100MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while (( line=streamReader.ReadLine())!=null)
{
stringBuilder.AppendLine(line);
}
}
}
Memory usage keeps going up because you're simply adding them to a List<string>, constantly growing. If you want to use less memory one thing you can do is to write the data to disk, rather than keeping it in scope. Of course, this will greatly cause speed to degrade.
Another option is to compress the string data as you're storing it to your list, and decompress it coming out but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation: (I'm speaking c/c++, substitute c# as needed)
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1;
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer.
You now have char * which holds the contents of the file as a
string.
Create a vector of const char * to hold pointers to the positions
in memory where each line can be found. Initialize the first element
of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n) Replace the
\r by \0 to make the line a string. Increment past the \n.
This new pointer location is pushed back onto the vector.
Repeat the above until all of the lines in the file have been NUL
terminated, and are pointed to by elements in the vector.
Iterate though the vector as needed to investigate the contents of
each line, in your business specific way.
When you are done, close the file, free the memory, and continue
happily along your way.
1) Compress the strings before you store them (ie see System.IO.Compression and GZipStream). This would probably kill the performance of your program though since you'd have to uncompress to read each line.
2) Remove any extra white space characters or common words you can do without. ie if you can understand what the log is saying with the words "the, a, of...", remove them. Also, shorten any common words (ie change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect performance of the rest.
What encoding is your original file? If it is ascii then just the strings alone are going to take over 2x the size of the file just to load up into your array. A C# character is 2 bytes and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the the messages. You most likely can parse the incoming line into a data structure which reduces the memory overhead. For example, if you have a timestamp in the log file you can convert that to a DateTime value which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be able to be turned into a code or an enum in a similar manner.
Even if you have the leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all you can probably cut down on your memory usage. If there are a lot of common strings you can Intern them and only pay for 1 string no matter how many you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
List<Byte[]> strings = new List<byte[]>();
using (TextReader tr = new StreamReader(#"C:\test.log"))
{
string s = tr.ReadLine();
while (s != null)
{
strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
s = tr.ReadLine();
}
}
// Get strings back
foreach( var str in strings)
{
Console.WriteLine(Encoding.UTF8.GetString(str));
}
}

Categories

Resources