Just to clarify something first: I am not trying to convert a byte array to a single string. I am trying to convert a byte array to a string array.
I am fetching data from the clipboard using the GetClipboardData API, and then copying the data from memory as a byte array. When multiple files are copied (hence the CF_HDROP clipboard format), I want to convert this byte array into a string array of the copied file paths.
Here's my code so far.
//Get pointer to clipboard data in the selected format
var clipboardDataPointer = GetClipboardData(format);
//Do a bunch of crap necessary to copy the data from the memory
//the above pointer points at to a place we can access it.
var length = GlobalSize(clipboardDataPointer);
var @lock = GlobalLock(clipboardDataPointer);
//Init a buffer which will contain the clipboard data
var buffer = new byte[(int)length];
//Copy clipboard data to buffer
Marshal.Copy(@lock, buffer, 0, (int)length);
GlobalUnlock(clipboardDataPointer);
snapshot.InsertData(format, buffer);
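(For reference, the P/Invoke declarations behind these calls are roughly the usual ones:)
[DllImport("user32.dll", SetLastError = true)]
static extern IntPtr GetClipboardData(uint uFormat);
[DllImport("kernel32.dll", SetLastError = true)]
static extern UIntPtr GlobalSize(IntPtr hMem);
[DllImport("kernel32.dll", SetLastError = true)]
static extern IntPtr GlobalLock(IntPtr hMem);
[DllImport("kernel32.dll", SetLastError = true)]
static extern bool GlobalUnlock(IntPtr hMem);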
Now, here's my code for reading the buffer data afterwards.
var formatter = new BinaryFormatter();
using (var serializedData = new MemoryStream(buffer))
{
paths = (string[]) formatter.Deserialize(serializedData);
}
This doesn't work; it crashes with an exception saying that the stream doesn't contain a binary header. I suppose this is because it doesn't know which type to deserialize into.
I've looked through the Marshal class, but nothing there seems relevant.
If the data came through the Win32 API, then a string array will just be a sequence of null-terminated strings with a double-null terminator at the end. (Note that the strings will be UTF-16, so two bytes per character.) You'll basically need to pull the strings out one at a time into an array.
The method you're looking for here is Marshal.PtrToStringUni, which you can use instead of Marshal.Copy since it works directly on an IntPtr. It will extract a string, up to the first null character, from your IntPtr and copy it into a managed string.
The idea would be to repeatedly extract a single string, then advance the IntPtr past the null terminator to the start of the next string, until you run out of buffer. I have not tested this, and it could probably be improved (in particular, I think there's a smarter way to detect the end of the buffer), but the basic idea would be:
var myptr = GetClipboardData(format);
var length = (int)GlobalSize(myptr);
var result = new List<string>();
var pos = 0;
while (pos < length)
{
    // Extract one null-terminated UTF-16 string starting at the current pointer
    var str = Marshal.PtrToStringUni(myptr);
    if (string.IsNullOrEmpty(str))
        break; // an empty string means we've hit the final double-null terminator
    result.Add(str);
    // Advance past the string plus its null terminator (2 bytes each in UTF-16)
    var count = Encoding.Unicode.GetByteCount(str) + 2;
    myptr = IntPtr.Add(myptr, count);
    pos += count;
}
return result.ToArray();
(By the way: the reason your deserialization doesn't work is that serializing a string[] doesn't just write out the characters as bytes; it writes out the structure of a string array, including additional internal bits that .NET uses, like the lengths, plus a binary header with type information. What you're getting back from the clipboard has none of that present, so it cannot be deserialized.)
How about this:
var strings = Encoding.Unicode
.GetString(buffer)
.Split(new[] { '\0' }, StringSplitOptions.RemoveEmptyEntries);
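(The RemoveEmptyEntries flag also swallows the empty entries produced by the trailing double-null terminator.)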
How can I create a byte data type from a string? For example: the device I am sending data to expects the data in hexadecimal format; more specifically, it needs to be in the format 0x{hexa_decimal_value}.
Hard-coded, sending data this way already worked.
I would create a byte array like this:
byte[] items_to_send_ = new byte[] {0x46, 0x30, 0x00};
Now I want to code it dynamically.
The code I am now trying to write looks like this:
var ListByte = new List<byte>();
foreach (char val in messageToConvert)
{
var hexa_decimal_val = Convert.ToInt32(val).ToString("X");
hexa_decimal_val = $"0x{hexa_decimal_val}";
byte val_ = CreateByteFromStringFunction(hexa_decimal_val); // How?
ListByte.Add(val_);
}
The step in question is creating the variable val_: I want to build the byte value from hexa_decimal_val, but I just don't know how. Casting does not work, and I did not find any other function that would do it for me.
It feels like there should be a really easy solution to this, but I just don't seem to find it.
What makes searching for the correct answer tricky is that I already know how to convert a string to its hexadecimal value; it's the conversion after that which is nowhere to be found.
You don't need to create bytes from characters one by one and append them to a list. You can use this:
var encodedByteList = Encoding.UTF8.GetBytes(messageToConvert);
If you still want to do it one character at a time, you can do something like this:
var encodedByteList = new List<byte>();
foreach (var character in messageToConvert)
{
var correspondingByte = (byte)character;
encodedByteList.Add(correspondingByte);
}
Or with LINQ, you can use this one-liner:
var encodedByteList = messageToConvert.Select(c => (byte)c).ToList();
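And if you specifically want to parse a hex string like "46" or "0x46" back into a byte, Convert.ToByte with base 16 does that:
// With fromBase 16, Convert.ToByte accepts an optional "0x" prefix
byte val_ = Convert.ToByte("0x46", 16); // 0x46 == 70 == 'F'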
I'm calling a JSON-RPC API that returns a UCHAR array representing a PDF file (so the result property on the response contains a string representation of a UCHAR array). I need to convert this result string into a byte array so I can handle it as a PDF file, i.e., save it and/or forward it as a file in a POST to another API.
I have tried the following (the result variable is the returned UCHAR string):
char[] pdfChar = result.ToCharArray();
byte[] pdfByte = new byte[pdfChar.Length];
for (int i = 0; i < pdfChar.Length; i++)
{
pdfByte[i] = Convert.ToByte(pdfChar[i]);
}
File.WriteAllBytes(basePath + "test.pdf", pdfByte);
I have also tried:
byte[] pdfByte = Encoding.ASCII.GetBytes(pdfObj.result);
File.WriteAllBytes(basePath + "test.pdf", pdfByte);
With both of these, when I try to open the resulting test.pdf file, it will not open, presumably because it was not converted properly.
Turns out that, although the output of the API function is UCHAR, by the time it arrives as part of the JSON string it is base64, so this works for me:
byte[] pdfBytes = Convert.FromBase64String(pdfObj.result);
I'm pretty sure the API is making that conversion "under the hood": while the function being called returns UCHAR, the API uses a framework to create the JSON-RPC responses, and that framework likely performs the conversion before sending the response out. If it is .NET that makes this conversion from UCHAR to base64, then please feel free to chime in and confirm this.
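For what it's worth, .NET JSON serializers do emit byte arrays as base64 strings by default, so that theory is plausible. A quick illustration with System.Text.Json (the four bytes below are the "%PDF" header):
using System;
using System.Text.Json;

class Base64Demo
{
    static void Main()
    {
        // byte[] values are serialized as base64 strings by default
        var json = JsonSerializer.Serialize(new { result = new byte[] { 0x25, 0x50, 0x44, 0x46 } });
        Console.WriteLine(json); // {"result":"JVBERg=="}
    }
}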
Do you know the file's encoding? Try this:
return System.Text.Encoding.UTF8.GetString(pdfObj.result)
EDIT:
The solution you found is also reported here
var base64EncodedBytes = System.Convert.FromBase64String(pdfObj.result);
return System.Text.Encoding.UTF8.GetString(base64EncodedBytes);
I'm reading a file into a byte[] buffer. The file contains a lot of UTF-16 strings (millions) in the following format:
The first byte contains the string length in chars (range 0..255).
The following bytes contain the string's characters in UTF-16 encoding (each char is represented by 2 bytes, meaning byteCount = charCount * 2).
I need to perform standard string operations for all strings in the file, for example IndexOf, EndsWith and StartsWith, with StringComparison.OrdinalIgnoreCase and StringComparison.Ordinal.
For now my code first converts each string from the byte array to the System.String type. I found the following code to be the most efficient way to do so:
// position/length validation removed to minimize the code
string result;
byte charLength = _buffer[_bufferI++];
int byteLength = charLength * 2;
fixed (byte* pBuffer = &_buffer[_bufferI])
{
result = new string((char*)pBuffer, 0, charLength);
}
_bufferI += byteLength;
return result;
Still, new string(char*, int, int) is very slow because it performs an unnecessary copy for each string.
The profiler says System.String.wstrcpy(char*, char*, int32) is what's slow.
I need a way to perform string operations without copying the bytes of each string.
Is there a way to perform string operations on the byte array directly?
Is there a way to create a new string without copying its bytes?
No, you can't create a string without copying the character data.
The String object stores the metadata for the string (Length, etc.) in the same memory area as the character data, so you can't keep the character data in the byte array and pretend that it's a String object.
You could try other ways of constructing the string from the byte data and see if any of them has less overhead, like Encoding.Unicode.GetString.
If you are using a pointer, you could try to get multiple strings at a time, so that you don't have to fix the buffer for each string.
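A minimal sketch of that idea, assuming the question's layout (a 1-byte char count followed by the UTF-16 payload) and that the number of strings is known up front:
// Decode every length-prefixed UTF-16 string under a single fixed() block,
// instead of pinning the buffer once per string.
static unsafe string[] ReadAllStrings(byte[] buffer, int stringCount)
{
    var result = new string[stringCount];
    int pos = 0;
    fixed (byte* pBuffer = buffer)
    {
        for (int i = 0; i < stringCount; i++)
        {
            byte charLength = pBuffer[pos++];               // 1-byte length prefix
            result[i] = new string((char*)(pBuffer + pos), 0, charLength);
            pos += charLength * 2;                          // skip the UTF-16 payload
        }
    }
    return result;
}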
You could read the file using a StreamReader with Encoding.Unicode, so you do not have the "byte overhead" in between:
using (StreamReader sr = new StreamReader(filename, Encoding.Unicode))
{
string line;
while ((line = sr.ReadLine()) != null)
{
//Your Code
}
}
You could create extension methods on byte arrays to handle most of those string operations directly on the byte array and avoid the cost of converting. I'm not sure which string operations you perform, so I'm not sure whether all of them could be accomplished this way; a sketch of the idea follows.
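Here is a hypothetical ordinal StartsWith for the question's record layout, comparing UTF-16 code units directly in the buffer (the method name and the little-endian layout are my assumptions):
// Ordinal StartsWith for a length-prefixed UTF-16 record beginning at 'offset':
// compares chars in place, without materializing a string.
static bool RecordStartsWith(byte[] buffer, int offset, string prefix)
{
    int charLength = buffer[offset];                        // 1-byte length prefix
    if (prefix.Length > charLength)
        return false;
    for (int i = 0; i < prefix.Length; i++)
    {
        // Each char is 2 bytes, little-endian
        char c = (char)(buffer[offset + 1 + 2 * i] | (buffer[offset + 2 + 2 * i] << 8));
        if (c != prefix[i])
            return false;
    }
    return true;
}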
I am converting an integer to binary and putting it in the header of a data message.
For instance, for the first message that arrives, I convert the counter to binary (which takes 4 bytes) and append the data message, which is a regular message containing a, b, c, etc.
Here is how I convert the counter:
//This is preparing the counter as binaryint
nCtrIn = ...;
int nCtrInNetwork = System.Net.IPAddress.HostToNetworkOrder(nCtrIn);
byte[] ArraybyteFormat = BitConverter.GetBytes(nCtrInNetwork);
Now the problem is that I need to take another string, copy ArraybyteFormat to the beginning of it, and in addition append the string data.
I do that because in the end I only want to write to the file using a binary writer:
m_brWriter.Write(ArraybyteFormat);
m_brWriter.Flush();
You can simplify by letting the BinaryWriter write the int directly; there is no need to convert it to a byte[] first.
Writing the message can then be done like this:
m_brWriter.Write(nCounterIn);
string msg = ....; // get it as 1 big string
byte[] textData = Encoding.UTF8.GetBytes(msg);
m_brWriter.Write(textData);
Or, even easier and also easier to read back:
m_brWriter.Write(nCounterIn);
m_brWriter.Write(msg);
But note that the BinaryWriter will now put its own length prefix in front of the string.
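Reading it back is then symmetrical. A minimal sketch, assuming the same counter-then-string layout (path is a stand-in for your file):
// Read back what the two Write calls above produced:
// a 4-byte int, then a length-prefixed string.
using (var reader = new BinaryReader(File.OpenRead(path)))
{
    int counter = reader.ReadInt32();  // matches Write(nCounterIn)
    string msg = reader.ReadString();  // matches Write(msg) and consumes the length prefix
}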
If it's important to write to the stream in a single call, you can concatenate the arrays:
var intArray = new byte[4]; // In real code assign
var stringArray = new byte[12]; // actual arrays
var concatenated = new byte[16];
intArray.CopyTo(concatenated, 0);
stringArray.CopyTo(concatenated, 4);
m_brWriter.Write(concatenated);
m_brWriter.Flush();
Did you consider writing the arrays in two calls to Write?
I'm developing a log parser, and I'm reading files of strings of more than 150 MB. This is my approach; is there any way to optimize what is inside the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and faced the same memory consumption.
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
I'm saving each line because after reading the log I will need to check some information on every stored line. The language is C#.
Memory savings can be achieved if your log lines can actually be parsed into a data-row representation.
Here is a typical log line I can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes 200 bytes in memory.
At the same time, the following representation takes under 16 bytes:
enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}
Another optimization possibility is parsing each line into an array of string tokens; that way you could make use of string interning.
For example, if the word "DataStoreOperation" takes 36 bytes and has 1,000,000 occurrences in the file, the saving is (18 * 2 - 4) * 1,000,000 = 32,000,000 bytes.
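A sketch of the interning idea: string.Intern returns one canonical instance per distinct value, so a million parsed duplicates of a token collapse into a single string plus cheap references (the token below is illustrative):
// Two runtime-built, equal strings collapse to one shared instance via Intern
string a = string.Intern(new string("DataStoreOperation".ToCharArray()));
string b = string.Intern(new string("DataStoreOperation".ToCharArray()));
Console.WriteLine(ReferenceEquals(a, b)); // True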
Try to make your algorithm sequential.
Using an IEnumerable instead of a List plays nicer with memory, while keeping the same semantics as working with a list, if you don't need random access to lines by index.
IEnumerable<string> ReadLines()
{
    // ...
    string lineOfLog;
    while ((lineOfLog = logFile.ReadLine()) != null)
    {
        yield return lineOfLog;
    }
}
//...
foreach( var line in ReadLines() )
{
ProcessLine(line);
}
I am not sure if it will fit your project, but you can store the result in a StringBuilder instead of a list of strings.
For example, this process takes 250 MB of memory on my machine after loading (the file is 50 MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while ((line = streamReader.ReadLine()) != null)
{
list.Add(line);
}
}
}
On the other hand, this code will take only 100 MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while ((line = streamReader.ReadLine()) != null)
{
stringBuilder.AppendLine(line);
}
}
}
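(The saving comes from per-object overhead: each string instance costs roughly an extra 20 bytes on top of its characters, while the StringBuilder keeps everything in a few large char buffers.)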
Memory usage keeps going up because you're simply adding the lines to a List<string>, which grows constantly. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in scope. Of course, this will greatly degrade speed.
Another option is to compress the string data as you store it in your list and decompress it on the way out, but I don't think this is a good method.
Side Note:
You need to add a using block around your StreamReader:
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation (I'm speaking C/C++; substitute C# as needed):
1) Use fseek/ftell to find the size of the file.
2) Use malloc to allocate a chunk of memory the size of the file + 1; set that last byte to '\0' to terminate the string.
3) Use fread to read the entire file into the memory buffer. You now have a char* which holds the contents of the file as a string.
4) Create a vector of const char* to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.
5) Find the carriage control characters (probably \r\n). Replace the \r by '\0' to make the line a string, and increment past the \n. This new pointer location is pushed back onto the vector.
6) Repeat the above until all of the lines in the file have been NUL-terminated and are pointed to by elements in the vector.
7) Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.
8) When you are done, close the file, free the memory, and continue happily along your way.
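A rough C# translation of that recipe, assuming an ASCII log: read the file once into a single byte[], keep (offset, length) pairs into that buffer instead of per-line string objects, and materialize a string only when a line is actually inspected. The file name is hypothetical.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class IndexedLog
{
    static void Main()
    {
        byte[] data = File.ReadAllBytes("big.log");  // hypothetical path
        var lines = new List<(int Offset, int Length)>();
        int start = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == (byte)'\n')
            {
                int len = i - start;
                if (len > 0 && data[i - 1] == (byte)'\r') len--;  // drop a trailing \r
                lines.Add((start, len));
                start = i + 1;
            }
        }
        if (start < data.Length) lines.Add((start, data.Length - start));  // final line
        // Decode a line only when it is actually needed:
        string first = Encoding.ASCII.GetString(data, lines[0].Offset, lines[0].Length);
        Console.WriteLine(first);
    }
}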
1) Compress the strings before you store them (see System.IO.Compression and GZipStream). This would probably kill the performance of your program, though, since you'd have to decompress each line in order to read it.
2) Remove any extra whitespace characters or common words you can do without. I.e., if you can understand what the log is saying without the words "the, a, of...", remove them. Also, shorten any common words (e.g. change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect the performance of the rest.
What encoding is your original file? If it is ASCII, then just the strings alone are going to take over 2x the size of the file once loaded into your array. A C# character is 2 bytes, and a C# string adds an extra 20 bytes per string on top of the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You most likely can parse the incoming line into a data structure that reduces the memory overhead. For example, if you have a timestamp in the log file, you can convert it to a DateTime value, which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in the log stream might be turned into a code or an enum in a similar manner.
Even if you have to leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all, you can probably cut down on memory usage. If there are a lot of common strings, you can intern them and pay for only one string no matter how many you have.
If you must store the raw data, and assuming your logs are mostly ASCII, then you can save some memory by storing UTF-8 bytes internally. Strings are UTF-16 internally, so you're storing an extra byte for each character. By switching to UTF-8 you cut memory use roughly in half (not counting class overhead, which is still significant). You can then convert back to normal strings as needed.
static void Main(string[] args)
{
List<byte[]> strings = new List<byte[]>();
using (TextReader tr = new StreamReader(@"C:\test.log"))
{
string s = tr.ReadLine();
while (s != null)
{
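// Re-encode the UTF-16 string as UTF-8 bytes for compact storage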
strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
s = tr.ReadLine();
}
}
// Get strings back
foreach( var str in strings)
{
Console.WriteLine(Encoding.UTF8.GetString(str));
}
}