C# perform string operations on a UTF-16 byte array - c#

I'm reading a file into a byte[] buffer. The file contains a lot of UTF-16 strings (millions) in the following format:
The first byte contains the string length in chars (range 0 .. 255).
The following bytes contain the string characters in UTF-16 encoding (each char is represented by 2 bytes, so byteCount = charCount * 2).
I need to perform standard string operations on all strings in the file, for example IndexOf, EndsWith and StartsWith, with StringComparison.OrdinalIgnoreCase and StringComparison.Ordinal.
For now, my code first converts each string from the byte array to a System.String. I found the following code to be the most efficient way to do so:
// position/length validation removed to minimize the code
string result;
byte charLength = _buffer[_bufferI++];
int byteLength = charLength * 2;
fixed (byte* pBuffer = &_buffer[_bufferI])
{
result = new string((char*)pBuffer, 0, charLength);
}
_bufferI += byteLength;
return result;
Still, new string(char*, int, int) is very slow because it performs an unnecessary copy for each string.
The profiler says that System.String.wstrcpy(char*,char*,int32) is the slow part.
I need a way to perform string operations without copying bytes for each string.
Is there a way to perform string operations on byte array directly?
Is there a way to create new string without copying its bytes?

No, you can't create a string without copying the character data.
The String object stores the metadata for the string (Length, etc.) in the same memory area as the character data, so you can't keep the character data in the byte array and pretend that it's a String object.
You could try other ways of constructing the string from the byte data, and see if any of them has less overhead, like Encoding.Unicode.GetString (Encoding.Unicode is the UTF-16 little-endian encoding).
If you are using a pointer, you could try to get multiple strings at a time, so that you don't have to fix the buffer for each string.

You could read the file with a StreamReader using Encoding.Unicode (UTF-16), so you do not have the "byte overhead" in between:
using (StreamReader sr = new StreamReader(filename, Encoding.Unicode))
{
string line;
while ((line = sr.ReadLine()) != null)
{
//Your Code
}
}

You could create extension methods on byte arrays to handle most of those string operations directly on the byte array and avoid the cost of converting. Not sure what all string operations you perform, so not sure if all of them could be accomplished this way.
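On newer runtimes, one way to sketch this idea is to reinterpret the bytes as chars instead of copying them. The helper below is hypothetical (the names Utf16Records and ReadChars are not from the question). It assumes .NET Core 2.1 or later (or the System.Memory package), a little-endian machine, and UTF-16LE data in the record layout described in the question. The span-based StartsWith, EndsWith and IndexOf overloads accept a StringComparison, so Ordinal and OrdinalIgnoreCase both work without allocating a string.

```csharp
using System;
using System.Runtime.InteropServices;

static class Utf16Records
{
    // Returns a view over the characters of the record that starts at `offset`
    // (one length byte followed by charLength * 2 bytes of UTF-16LE data).
    // `next` receives the offset of the following record. Nothing is copied.
    public static ReadOnlySpan<char> ReadChars(byte[] buffer, int offset, out int next)
    {
        int charLength = buffer[offset];
        int byteLength = charLength * 2;
        ReadOnlySpan<byte> bytes = buffer.AsSpan(offset + 1, byteLength);
        next = offset + 1 + byteLength;
        // Reinterprets the bytes in machine byte order (assumed little-endian).
        return MemoryMarshal.Cast<byte, char>(bytes);
    }
}
```

A comparison then looks like Utf16Records.ReadChars(buffer, pos, out pos).StartsWith("abc".AsSpan(), StringComparison.OrdinalIgnoreCase). Position and length validation is left out, as in the question.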

Related

C# .Net framework strings encoding from utf-8 bytes

I have an application written in C# that receives data over the network from a server using sockets (UDP, libenet).
In my application I have a function that processes the raw bytes sent in a packet.
One of the functions reads a string delimited by \0.
My problem is that the server sends UTF-8 encoded strings to the C# application, but when I display these strings in controls, I get gibberish instead of Polish letters.
Function that reads strings from buffer:
public override string ReadString()
{
StringBuilder sb = new StringBuilder();
while (true)
{
byte b;
if (Remaining > 0)
b = ReadByte();
else
b = 0;
if (b == 0) break;
// Probably here is the problem. Checked other encodings etc., but still same
sb.Append(Encoding.UTF8.GetString(new byte[] { b }, 0, 1));
}
return sb.ToString();
}
The function overrides the one from:
public class BitReader : BinaryReader
In my application I get:
You can't decode UTF-8 one byte at a time, because a single character may take more than one byte.
See How to convert byte[] to string? (first read everything into one byte array / List).
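A minimal sketch of that approach: find the terminator first, then decode the whole run of bytes in one call. The helper works on a plain byte array with an explicit position instead of the question's BitReader, and its names are hypothetical.

```csharp
using System;
using System.Text;

static class NullTerminatedUtf8
{
    // Decodes the bytes from `position` up to the next 0 byte (or the end of
    // the data) as one UTF-8 string, then moves `position` past the terminator.
    public static string Read(byte[] data, ref int position)
    {
        int start = position;
        while (position < data.Length && data[position] != 0)
        {
            position++;
        }
        string text = Encoding.UTF8.GetString(data, start, position - start);
        if (position < data.Length)
        {
            position++; // skip the 0 terminator
        }
        return text;
    }
}
```

Because the decoder sees every byte of a multi-byte sequence together, characters outside ASCII survive intact.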

String to byte array only converts first 16 bytes according to Intellisense

I'm trying to convert a string to a byte[] using the ASCIIEncoding object in the .NET library. The string will never contain non-ASCII characters, but it will usually have a length greater than 16. My code looks like the following:
public static byte[] Encode(string packet)
{
ASCIIEncoding enc = new ASCIIEncoding();
byte[] byteArray = enc.GetBytes(packet);
return byteArray;
}
By the end of the method, the byte array should contain packet.Length bytes, but Intellisense tells me that all bytes after byteArray[15] are literally question marks that cannot be observed. I used Wireshark to view byteArray after I sent it and it was received on the other side fine, but the end device did not follow the instructions encoded in byteArray. I'm wondering if this has anything to do with Intellisense not being able to display all elements in byteArray, or if my packet is completely wrong.
If your packet string basically contains characters in the range 0-255, then ASCIIEncoding is not what you should be using. ASCII only defines character codes 0-127; anything in the range 128-255 will get turned into question marks (as you have observed) because these characters are not defined in ASCII.
Consider using a method like this to convert the string to a byte array. (This assumes that the ordinal value of each character is in the range 0-255 and that the ordinal value is what you want.)
public static byte[] ToOrdinalByteArray(this string str)
{
if (str == null) { throw new ArgumentNullException("str"); }
var bytes = new byte[str.Length];
for (int i = 0; i < str.Length; ++i) {
// Wrapping the cast in checked() will trigger an OverflowException
// if the character being converted is out of range for a byte.
bytes[i] = checked((byte)str[i]);
}
return bytes;
}
The Encoding class hierarchy is specifically designed for handling text. What you have here doesn't seem to be text, so you should avoid using these classes.
The standard encoders use the replacement character fallback strategy. If a character doesn't exist in the target character set, they encode a replacement character ('?' by default).
To me, that's worse than a silent failure; it's data corruption. I prefer that libraries tell me when my assumptions are wrong.
You can derive an encoder that throws an exception:
Encoding.GetEncoding(
"us-ascii",
new EncoderExceptionFallback(),
new DecoderExceptionFallback());
If you are truly using only characters in Unicode's ASCII range then you'll never see an exception.
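As a small illustration, the strict encoding can be wrapped so that a non-ASCII character is reported instead of silently becoming '?'. The StrictAscii class and TryEncode method are hypothetical names, not part of .NET.

```csharp
using System;
using System.Text;

static class StrictAscii
{
    // An ASCII encoding that throws instead of substituting '?'.
    public static readonly Encoding Instance = Encoding.GetEncoding(
        "us-ascii",
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());

    // Returns true and the encoded bytes when every character is ASCII;
    // returns false when the encoder rejects a character.
    public static bool TryEncode(string text, out byte[] bytes)
    {
        try
        {
            bytes = Instance.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = null;
            return false;
        }
    }
}
```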

How can I convert a byte array to a string array?

Just to clarify something first. I am not trying to convert a byte array to a single string. I am trying to convert a byte-array to a string-array.
I am fetching some data from the clipboard using the GetClipboardData API, and then I'm copying the data from the memory as a byte array. When you're copying multiple files (hence a CF_HDROP clipboard format), I want to convert this byte array into a string array of the files copied.
Here's my code so far.
//Get pointer to clipboard data in the selected format
var clipboardDataPointer = GetClipboardData(format);
//Do a bunch of crap necessary to copy the data from the memory
//the above pointer points at to a place we can access it.
var length = GlobalSize(clipboardDataPointer);
var @lock = GlobalLock(clipboardDataPointer);
//Init a buffer which will contain the clipboard data
var buffer = new byte[(int)length];
//Copy clipboard data to buffer
Marshal.Copy(@lock, buffer, 0, (int)length);
GlobalUnlock(clipboardDataPointer);
snapshot.InsertData(format, buffer);
Now, here's my code for reading the buffer data afterwards.
var formatter = new BinaryFormatter();
using (var serializedData = new MemoryStream(buffer))
{
paths = (string[]) formatter.Deserialize(serializedData);
}
This won't work, and it'll crash with an exception saying that the stream doesn't contain a binary header. I suppose this is because it doesn't know which type to deserialize into.
I've tried looking the Marshal class through. Nothing seems of any relevance.
If the data came through the Win32 API then a string array will just be a sequence of null-terminated strings with a double-null terminator at the end. (Note that the strings will be UTF-16, so two bytes per character, including the terminator.) For CF_HDROP specifically, the block starts with a DROPFILES header whose first field, pFiles, holds the byte offset of the file list. You'll basically need to pull the strings out one at a time into an array.
The method you're looking for here is Marshal.PtrToStringUni, which you should use instead of Marshal.Copy since it works on an IntPtr. It will extract a string, up to the first null character, from your IntPtr and copy it to a string.
The idea would be to continually extract a single string, then advance the IntPtr past the null terminator to the start of the next string, until you reach the empty string that marks the end or run out of buffer. I have not tested this, and it could probably be improved, but the basic idea would be:
var handle = GetClipboardData(format);
var length = GlobalSize(handle);
var myptr = GlobalLock(handle);
var result = new List<string>();
// Skip the DROPFILES header: pFiles is the offset of the first path.
var pos = Marshal.ReadInt32(myptr);
myptr = IntPtr.Add(myptr, pos);
while ( pos < (int)length )
{
var str = Marshal.PtrToStringUni(myptr);
// An empty string is the second null of the double-null terminator.
if (string.IsNullOrEmpty(str)) break;
// Advance past the characters plus the two-byte UTF-16 null terminator.
var count = Encoding.Unicode.GetByteCount(str);
myptr = IntPtr.Add(myptr, count + 2);
pos += count + 2;
result.Add(str);
}
GlobalUnlock(handle);
return result.ToArray();
(By the way: the reason your deserialization doesn't work is because serializing a string[] doesn't just write out the characters as bytes; it writes out the structure of a string array, including additional internal bits that .NET uses like the lengths, and a binary header with type information. What you're getting back from the clipboard has none of that present, so it cannot be deserialized.)
How about this:
var strings = Encoding.Unicode
.GetString(buffer)
.Split(new[] { '\0' }, StringSplitOptions.RemoveEmptyEntries);

C# binary to string

I am converting an integer to binary and putting it in the header of a data message.
For instance, for the first message that arrives, I would convert the counter to binary, which takes 4 bytes, and then add the data message, which is a regular message containing a, b, c, etc.
Here is how I convert the counter:
//This is preparing the counter as binary
int nCtrIn = ...;
int nCtrInNetwork = System.Net.IPAddress.HostToNetworkOrder(nCtrIn);
byte[] ArraybyteFormat = BitConverter.GetBytes(nCtrInNetwork);
Now the problem is how to take another string, copy ArraybyteFormat to its beginning, and then append the string data.
I do that because I want to write to the file only once, at the end, using a binary writer:
m_brWriter.Write(ArraybyteFormat);
m_brWriter.Flush();
You can simplify by letting the BinaryWriter write the int directly - no need to convert to byte[] first.
The other problem is writing the message, it can be done like:
m_brWriter.Write(nCounterIn);
string msg = ....; // get it as 1 big string
byte[] textData = Encoding.UTF8.GetBytes(msg);
m_brWriter.Write(textData);
Or, even easier and also easier to read back:
m_brWriter.Write(nCounterIn);
m_brWriter.Write(msg);
But note that the BinaryWriter will now put its own length prefix in front of the string.
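A round-trip sketch of the length-prefixed variant, using a MemoryStream in place of the file. The CounterMessage class is a hypothetical name; the BinaryReader.ReadString call consumes the same prefix that BinaryWriter.Write(string) produces, which is why this form is easier to read back.

```csharp
using System;
using System.IO;
using System.Net;

static class CounterMessage
{
    // Writes the counter in network byte order followed by the message.
    public static byte[] Write(int counter, string message)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(IPAddress.HostToNetworkOrder(counter)); // 4 bytes
            writer.Write(message); // length prefix + UTF-8 bytes
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reads back what Write produced.
    public static void Read(byte[] data, out int counter, out string message)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            counter = IPAddress.NetworkToHostOrder(reader.ReadInt32());
            message = reader.ReadString(); // consumes the length prefix
        }
    }
}
```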
If it's important to write to the stream in single call you can concatenate the arrays:
var intArray = new byte[4]; // In real code assign
var stringArray = new byte[12]; // actual arrays
var concatenated = new byte[16];
intArray.CopyTo(concatenated, 0);
stringArray.CopyTo(concatenated, 4);
m_brWriter.Write(concatenated);
m_brWriter.Flush();
Did you consider writing the arrays in two calls to Write?

How to optimize memory usage in this algorithm?

I'm developing a log parser, and I'm reading files of strings of more than 150MB. This is my approach. Is there any way to optimize what is in the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and saw the same memory consumption.
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
I'm saving each line because, after reading the log, I will need to check for some information on every stored line. The language is C#.
Memory can be saved if your log lines can actually be parsed into a data row representation.
Here is a typical log line i can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes about 200 bytes in memory.
At the same time, the following representation takes only about 16 bytes:
enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }
struct LogRow
{
public DateTime EventTime;
public LogReason Reason;
public EventKind Kind;
public OperationStatus Status;
}
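A self-contained sketch of parsing the sample line into such a row follows. The type definitions are repeated so the block compiles on its own; the Parse and ValueOf helpers and the timestamp format string are assumptions derived from the single sample line, not a general log grammar.

```csharp
using System;
using System.Globalization;

enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;

    // Parses "Event at: <time>, Reason: <r>, Kind: <k>, Operation Status: <s>".
    public static LogRow Parse(string line)
    {
        string[] fields = line.Split(new[] { ", " }, StringSplitOptions.None);
        return new LogRow
        {
            EventTime = DateTime.ParseExact(ValueOf(fields[0]),
                "yyyy/MM/dd:H:mm:ss.fff", CultureInfo.InvariantCulture),
            Reason = (LogReason)Enum.Parse(typeof(LogReason), ValueOf(fields[1])),
            Kind = (EventKind)Enum.Parse(typeof(EventKind), ValueOf(fields[2])),
            Status = (OperationStatus)Enum.Parse(typeof(OperationStatus), ValueOf(fields[3])),
        };
    }

    // Returns the text after the first ": " of a "Name: value" field.
    private static string ValueOf(string field)
    {
        return field.Substring(field.IndexOf(": ", StringComparison.Ordinal) + 2);
    }
}
```

Storing a List of LogRow values instead of a List of strings keeps only the parsed fields, at the cost of one parse per line while reading.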
Another optimization possibility is to parse each line into an array of string tokens, so that you can make use of string interning.
For example, if the word "DataStoreOperation" takes 36 bytes and occurs 1000000 times in the file, the saving is (18*2 - 4) * 1000000 = 32 000 000 bytes.
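A minimal sketch of the interning idea, assuming tokens are separated by ", " as in the sample line (the TokenStore name is hypothetical). string.Intern returns the single pooled instance for a given value, so equal tokens from different lines end up sharing one object. One caveat worth keeping in mind: interned strings stay in the pool for the lifetime of the process.

```csharp
using System;

static class TokenStore
{
    // Splits a line on ", " and interns every token, so repeated tokens
    // across lines share a single string instance.
    public static string[] Tokenize(string line)
    {
        string[] tokens = line.Split(new[] { ", " }, StringSplitOptions.None);
        for (int i = 0; i < tokens.Length; i++)
        {
            tokens[i] = string.Intern(tokens[i]);
        }
        return tokens;
    }
}
```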
Try to make your algorithm sequential.
Using an IEnumerable instead of a List is easier on memory while keeping the same semantics as working with a list, as long as you don't need random access to lines by index.
IEnumerable<string> ReadLines()
{
// ...
while ((lineOfLog = logFile.ReadLine()) != null)
{
yield return lineOfLog;
}
}
//...
foreach( var line in ReadLines() )
{
ProcessLine(line);
}
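The framework already ships a lazy line enumerator, File.ReadLines, which can stand in for a hand-written iterator like the one above. A small sketch under the assumption that each line can be examined once and then discarded (LogScan and CountMatches are hypothetical names):

```csharp
using System;
using System.IO;
using System.Linq;

static class LogScan
{
    // Counts the lines containing `needle` without keeping the lines in memory.
    // File.ReadLines reads lazily, one line at a time, as the sequence is enumerated.
    public static int CountMatches(string path, string needle)
    {
        return File.ReadLines(path)
                   .Count(line => line.IndexOf(needle, StringComparison.Ordinal) >= 0);
    }
}
```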
I am not sure if it will fit your project but you can store the result in StringBuilder instead of strings list.
For example, this process on my machine takes 250MB memory after loading (file is 50MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while (( line=streamReader.ReadLine())!=null)
{
list.Add(line);
}
}
}
On the other hand, this code process will take only 100MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while (( line=streamReader.ReadLine())!=null)
{
stringBuilder.AppendLine(line);
}
}
}
Memory usage keeps going up because you're simply adding the lines to a List<string>, which grows constantly. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in memory. Of course, this will slow things down considerably.
Another option is to compress the string data as you're storing it to your list, and decompress it coming out but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation (I'm speaking C/C++; substitute C# as needed):
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1.
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer. You now have a char * which holds the contents of the file as a string.
Create a vector of const char * to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n). Replace the \r by \0 to make the line a string. Increment past the \n. This new pointer location is pushed back onto the vector.
Repeat the above until all of the lines in the file have been NUL terminated and are pointed to by elements in the vector.
Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.
When you are done, close the file, free the memory, and continue happily along your way.
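A rough C# counterpart of these steps keeps the file as one byte array and records where each line starts and how long it is, instead of storing pointers and NUL terminators. The LineIndex class is a hypothetical sketch; it assumes ASCII or UTF-8 content with \n or \r\n line endings, and a file small enough to fit in a single array.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

sealed class LineIndex
{
    private readonly byte[] _data;
    private readonly List<int> _starts = new List<int>();   // start offset of each line
    private readonly List<int> _lengths = new List<int>();  // length without \r\n

    public LineIndex(byte[] data)
    {
        _data = data;
        int start = 0;
        for (int i = 0; i <= data.Length; i++)
        {
            bool atEnd = i == data.Length;
            if (!atEnd && data[i] != (byte)'\n') continue;
            if (atEnd && start == i) break; // nothing left after the last newline
            int length = i - start;
            if (length > 0 && data[start + length - 1] == (byte)'\r') length--;
            _starts.Add(start);
            _lengths.Add(length);
            start = i + 1;
        }
    }

    public int Count { get { return _starts.Count; } }

    // Decodes one line on demand; assumes the file is ASCII or UTF-8.
    public string GetLine(int index)
    {
        return Encoding.UTF8.GetString(_data, _starts[index], _lengths[index]);
    }
}
```

It could be built with new LineIndex(File.ReadAllBytes(path)). Compared with a List of strings, this holds the file once at one byte per ASCII character plus two ints per line, and only materializes a string for the lines that are actually inspected.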
1) Compress the strings before you store them (e.g. see System.IO.Compression and GZipStream). This would probably kill the performance of your program, though, since you'd have to decompress to read each line.
2) Remove any extra whitespace characters or common words you can do without. For example, if you can understand what the log is saying without the words "the, a, of...", remove them. Also, shorten any common words (e.g. change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect performance of the rest.
What encoding is your original file? If it is ascii then just the strings alone are going to take over 2x the size of the file just to load up into your array. A C# character is 2 bytes and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You most likely can parse the incoming line into a data structure which reduces the memory overhead. For example, if you have a timestamp in the log file you can convert that to a DateTime value which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be able to be turned into a code or an enum in a similar manner.
Even if you have to leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all, you can probably cut down on your memory usage. If there are a lot of common strings you can intern them and only pay for one string no matter how many you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
List<Byte[]> strings = new List<byte[]>();
using (TextReader tr = new StreamReader(@"C:\test.log"))
{
string s = tr.ReadLine();
while (s != null)
{
// Encode the line straight to UTF-8; no intermediate UTF-16 byte array is needed.
strings.Add(Encoding.UTF8.GetBytes(s));
s = tr.ReadLine();
}
}
// Get strings back
foreach( var str in strings)
{
Console.WriteLine(Encoding.UTF8.GetString(str));
}
}
