CString strFile = _T("c:\\test.txt");
CStdioFile aFile;
UINT nOpenFlags = CFile::modeWrite | CFile::modeCreate | CFile::typeText;
CFileException anError;
if (!aFile.Open(strFile, nOpenFlags, &anError))
{
    return false;
}
int nSize = 2 * sizeof(double); // buffer below holds exactly 2 doubles
double* pData = new double[2];
CString strLine, str;
// Write begin of header
strLine = _T(">>> Begin of header <<<\n");
aFile.WriteString(strLine);
// Retrieve current position of file pointer
ULONGLONG lFilePos = aFile.GetPosition();
// Close file
aFile.Close();
nOpenFlags = CFile::modeWrite | CFile::typeBinary;
if (!aFile.Open(strFile, nOpenFlags, &anError))
{
    return false;
}
for (int i = 0; i < 2; i++)
{
    pData[i] = i;
}
// Set position of file pointer behind header
aFile.Seek(lFilePos, CFile::begin);
// Write complex vector
aFile.Write(pData, nSize);
// Write complex vector again
aFile.Write(pData, nSize);
// Free buffer and close file
delete[] pData;
aFile.Close();
My intention is to create a file that contains both text data and binary data. The code above is written in MFC. I want to create a similar file in C#, containing both text and binary data. Please let me know which stream class should be used for this.
Text can be written as binary data => simply use binary mode for the whole file and be done.
The only thing text mode does is convert "\n" to "\r\n" on write and back on read. Since the file is partly binary, and therefore not editable in a regular text editor anyway, you don't need that conversion. If the file is just for your application, you simply don't care; and if it's for another application, just write whatever newline sequence it requires manually.
As to C#, possibly this S.O. article can give you the answer you are looking for.
The C# solution could also guide you in writing something similar for C, but I suspect you are on your own there, i.e., you can use generic read/write calls on the file. In C++, you also have the option of formatted input/output from/to streams using operator>> and operator<<.
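To sketch the C# side (a minimal sketch, assuming .NET 4.5+ for the leaveOpen overloads; the file name is just a placeholder): open a single FileStream, write the text header through a StreamWriter, then write the raw values through a BinaryWriter on the same stream.
using System.IO;
using System.Text;

// One FileStream; leaveOpen keeps it alive between the two writers.
using (var fs = new FileStream(@"c:\test.txt", FileMode.Create, FileAccess.Write))
{
    using (var writer = new StreamWriter(fs, Encoding.ASCII, 1024, leaveOpen: true))
    {
        writer.WriteLine(">>> Begin of header <<<"); // text header
    }
    using (var binary = new BinaryWriter(fs, Encoding.ASCII, leaveOpen: true))
    {
        binary.Write(0.0); // real part (placeholder data)
        binary.Write(1.0); // imaginary part (placeholder data)
    }
}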
Related
I'm working with a .NET application that stores a long list of true/false values in a BitArray, which gets stored in SQL Server as a binary(32) value. The data comes back from the database as a byte[].
I'm trying to migrate the project over to Kotlin using Spring. After a lot of testing and messing around, I was finally able to get the same array in Kotlin that I do from the BitArray in C#. However, this solution seems like quite a kludge just to get an array of true/false values from a ByteArray.
I'm sure there's a cleaner way of going about this, but I'm still learning Kotlin and this is what I've come up with:
fun convertByteArrayToBitArray(array: ByteArray): List<Boolean> {
    val stringBuilder = StringBuilder()
    array.forEach {
        // Convert to Int in order to call Integer.toBinaryString
        var int = it.toInt()
        // Because the array has signed values, convert to an unsigned value. Either this or add
        // '.takeLast(8)' to the end of the String.format call
        if (int < 0) int += 256
        // Convert to a binary string, padding with leading 0s
        val binary = String.format("%8s", Integer.toBinaryString(int)).replace(" ", "0")
        // Through testing, this binary value needs to be reversed to give the right values
        // at the right index
        stringBuilder.append(binary.reversed())
    }
    // Convert the StringBuilder to a CharArray, then to a List of true/false values
    return stringBuilder.toString().toCharArray().map {
        Integer.parseInt(it.toString()) == 1
    }
}
If you don't need multiplatform code, then you can use BitSet from the Java stdlib:
val bytes = byteArrayOf(0x11, 0x11)
val bits = BitSet.valueOf(bytes)
println(bits[0]) // true
println(bits[1]) // false
println(bits[4]) // true
println(bits[5]) // false
println(bits[8]) // true
println(bits[9]) // false
val bytes2 = bits.toByteArray()
It stores the data least-significant-bit first (little endian); if I read your code correctly, your conversion uses the same order.
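For comparison, the C# BitArray on the other side of your migration uses the same least-significant-bit-first layout; a quick check mirroring the BitSet example above:
using System;
using System.Collections;

// 0x11 is 0001 0001 in binary, so bits 0 and 4 of each byte are set.
var bits = new BitArray(new byte[] { 0x11, 0x11 });
Console.WriteLine(bits[0]); // True
Console.WriteLine(bits[1]); // False
Console.WriteLine(bits[4]); // True
Console.WriteLine(bits[8]); // True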
If you need List<Boolean> specifically, for example because some of your code already depends on it, then you can convert between BitSet and List<Boolean> like this:
fun BitSet.toBooleanList() = List(length()) { this[it] }
fun List<Boolean>.toBitSet() = BitSet(size).also { bitSet ->
    forEachIndexed { i, item ->
        bitSet.set(i, item)
    }
}
Just note List<Boolean> is very memory-inefficient.
Right now I'm working on a game engine. To be more efficient and to hide data from the end user, I'm trying to use serialization on a modified form of Wavefront's *.OBJ format. I have multiple structs set up to represent the data, and serialization of the objects works fine, except that it takes up a significant amount of file space (at least 5× that of the original OBJ file).
To be specific, here's a quick example of what the final object would be (in a JSON-esque format):
{
    [{float 5.0, float 2.0, float 1.0}, {float 7.0, float 2.0, float 1.0}, ...]
    // ^^^ vertex positions
    // other similar structures for colors, normals, texture coordinates
    // ...
    [[{int 1, int 1, int 1}, {int 2, int 2, int 1}, {int 3, int 3, int 2}], ...]
    // ^^^ each inner list represents one face, laid out as
    // face[vertex{position index, tex coords index, normal index}, vertex{...}, ...]
}
Basically, my main issue with this method of serializing data (binary format) is that it saves the names of the structs, not just the values. I'd love to keep the data in the format I have already, just without the struct metadata in the file. I want to save something similar to the above, while still being able to recompile with a different struct name later.
Here's the main object I'm serializing and saving to a file:
[Serializable()] // the included structs also have this applied
public struct InstantGameworksObjectData
{
    public Position[] Positions;
    public TextureCoordinates[] TextureCoordinates;
    public Position[] Normals;
    public Face[] Faces;
}
Here's the method in which I serialize and save the data:
IFormatter formatter = new BinaryFormatter();
long Beginning = DateTime.Now.Ticks / 10000000;
foreach (string file in fileNames)
{
    Console.WriteLine("Begin " + Path.GetFileName(file));
    var output = InstantGameworksObject.ConvertOBJToIGWO(File.ReadAllLines(file));
    Console.WriteLine("Writing file");
    string outputFile = outputPath + @"\" + Path.GetFileNameWithoutExtension(file) + ".igwo";
    using (Stream fileOutputStream = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        formatter.Serialize(fileOutputStream, output);
    }
    Console.WriteLine(outputFile);
}
The output, of course, is binary (or hex, depending on the program you use to view the file), and that's great. But running it through a hex-to-text converter shows the struct and member names stored verbatim in the file. In the long run, this could mean gigabytes' worth of useless data. How can I save my C# object with the data in the correct format, just without the extra meta-clutter?
As you correctly note, the standard framework binary formatters include a host of metadata about the structure of the data. This is to keep the serialised data self-describing: if the data were separated from all that metadata, the smallest change to the structure of your classes would render previously serialised data useless. By that token, I doubt you'll find any standard framework method of serialising binary data that doesn't include all the metadata.
Even ProtoBuf includes the semantics of the data in the file data, albeit with less overhead.
Given that the structure of your data follows the reasonably common and well established form of 3D object data, you could roll your own format for your assets which strips the semantics and only stores the raw data. You can implement read and write methods easily using the BinaryReader/BinaryWriter classes (which would be my preference). If you're looking to obfuscate data from the end user, there are a variety of different ways that you could achieve that with this approach.
For example:
public static InstantGameworksObjectData ReadIgwoObject(BinaryReader pReader)
{
    var lOutput = new InstantGameworksObjectData();
    int lVersion = pReader.ReadInt32(); // Useful in case you ever want to change the format
    int lPositionCount = pReader.ReadInt32(); // Store the length of the Position array before the data so you can pre-allocate the array.
    lOutput.Positions = new Position[lPositionCount];
    for (int lPositionIndex = 0; lPositionIndex < lPositionCount; ++lPositionIndex)
    {
        lOutput.Positions[lPositionIndex] = new Position();
        lOutput.Positions[lPositionIndex].X = pReader.ReadSingle();
        lOutput.Positions[lPositionIndex].Y = pReader.ReadSingle();
        lOutput.Positions[lPositionIndex].Z = pReader.ReadSingle();
        // or if you prefer... lOutput.Positions[lPositionIndex] = Position.ReadPosition(pReader);
    }
    int lTextureCoordinateCount = pReader.ReadInt32();
    lOutput.TextureCoordinates = new TextureCoordinate[lTextureCoordinateCount];
    for (int lTextureCoordinateIndex = 0; lTextureCoordinateIndex < lTextureCoordinateCount; ++lTextureCoordinateIndex)
    {
        lOutput.TextureCoordinates[lTextureCoordinateIndex] = new TextureCoordinate();
        lOutput.TextureCoordinates[lTextureCoordinateIndex].X = pReader.ReadSingle();
        lOutput.TextureCoordinates[lTextureCoordinateIndex].Y = pReader.ReadSingle();
        lOutput.TextureCoordinates[lTextureCoordinateIndex].Z = pReader.ReadSingle();
        // or if you prefer... lOutput.TextureCoordinates[lTextureCoordinateIndex] = TextureCoordinate.ReadTextureCoordinate(pReader);
    }
    // ... normals and faces follow the same pattern
    return lOutput;
}
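Going the other way, the matching writer is just the mirror image (a sketch, assuming the same field layout as the reader above):
public static void WriteIgwoObject(BinaryWriter pWriter, InstantGameworksObjectData pData)
{
    pWriter.Write(1); // format version
    pWriter.Write(pData.Positions.Length); // length prefix so the reader can pre-allocate
    foreach (var lPosition in pData.Positions)
    {
        pWriter.Write(lPosition.X);
        pWriter.Write(lPosition.Y);
        pWriter.Write(lPosition.Z);
    }
    pWriter.Write(pData.TextureCoordinates.Length);
    foreach (var lTextureCoordinate in pData.TextureCoordinates)
    {
        pWriter.Write(lTextureCoordinate.X);
        pWriter.Write(lTextureCoordinate.Y);
        pWriter.Write(lTextureCoordinate.Z);
    }
    // ... normals and faces follow the same pattern
}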
As far as space efficiency and speed go, this approach is hard to beat. It works well for 3D objects because they're fairly well defined and the format is not likely to change, but it may not extend as well to the other assets you want to store.
If you find you are needing to change class structures frequently, you may find you have to write lots of if-blocks based on version to correctly read a file, and have to regularly debug issues where the data in the file is not quite in the format you expect. A happy medium might be to use something such as ProtoBuf for the bulk of your development until you're happy with the structure of your data object classes, and then writing raw binary Read/Write methods for each of them before you release.
I'd also recommend some Unit Tests to ensure that your Read and Write methods are correctly persisting the object to avoid pulling your hair out later.
Hope this helps
I'm running into a "compatibility" issue between two versions of the same program: the first one is written in Java, the second is a port to C#.
My goal is to write some data to a file (for example, in Java), like a sequence of numbers, and then be able to read it in C#. Obviously, the operation should also work in the reverse direction.
For example, I want to write 3 numbers in sequence, represented with the following schema:
first number as one 'byte' (8 bit)
second number as one 'integer' (32 bit)
third number as one 'integer' (32 bit)
So, I can put on a new file the following sequence: 2 (as byte), 120 (as int32), 180 (as int32)
In Java, the writing procedure is more or less this one:
FileOutputStream outputStream;
byte[] byteToWrite;
// ... initialization....
// first byte
outputStream.write(first_byte);
// integers
byteToWrite = ByteBuffer.allocate(4).putInt(first_integer).array();
outputStream.write(byteToWrite);
byteToWrite = ByteBuffer.allocate(4).putInt(second_integer).array();
outputStream.write(byteToWrite);
outputStream.close();
While the reading part it's the following:
FileInputStream inputStream;
ByteBuffer byteToRead;
// ... initialization....
// first byte
first_byte = inputStream.read();
// integers
byteToRead = ByteBuffer.allocate(4);
inputStream.read(byteToRead.array());
first_integer = byteToRead.getInt();
byteToRead = ByteBuffer.allocate(4);
inputStream.read(byteToRead.array());
second_integer = byteToRead.getInt();
inputStream.close();
C# code is the following. Writing:
FileStream fs;
byte[] byteToWrite;
// ... initialization....
// first byte
byteToWrite = new byte[1];
byteToWrite[0] = first_byte;
fs.Write(byteToWrite, 0, byteToWrite.Length);
// integers
byteToWrite = BitConverter.GetBytes(first_integer);
fs.Write(byteToWrite, 0, byteToWrite.Length);
byteToWrite = BitConverter.GetBytes(second_integer);
fs.Write(byteToWrite, 0, byteToWrite.Length);
Reading:
FileStream fs;
byte[] byteToRead;
// ... initialization....
// first byte
byte[] firstByteBuff = new byte[1];
fs.Read(firstByteBuff, 0, firstByteBuff.Length);
first_byte = firstByteBuff[0];
// integers
byteToRead = new byte[4 * 2];
fs.Read(byteToRead, 0, byteToRead.Length);
first_integer = BitConverter.ToInt32(byteToRead, 0);
second_integer = BitConverter.ToInt32(byteToRead, 4);
Please note that both procedures work when the same Java/C# version of the program writes and reads the file. The problem occurs when I try to read a file written by the Java program from the C# version, and vice versa. The integers read back are always "strange" numbers (like -1451020...).
There's surely a compatibility issue in the way Java stores and reads 32-bit integer values (always signed, right?) compared to C#. How do I handle this?
It's just an endian-ness issue. You can use my MiscUtil library to read big-endian data from .NET.
However, I would strongly advise a simpler approach to both your Java and your .NET:
In Java, use DataInputStream and DataOutputStream. There's no need to get complicated with ByteBuffer etc.
In .NET, use EndianBinaryReader from MiscUtil, which extends BinaryReader (and likewise EndianBinaryWriter for BinaryWriter)
Alternatively, consider just using text instead.
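If you'd rather not take on a library dependency, the fix is also small enough to hand-roll; here's a minimal sketch (not MiscUtil's API, just the underlying idea). Java's ByteBuffer and DataOutputStream default to big-endian, while BitConverter follows the machine's byte order, so reverse the bytes on little-endian machines:
using System;
using System.IO;

static int ReadBigEndianInt32(BinaryReader reader)
{
    byte[] bytes = reader.ReadBytes(4);
    if (BitConverter.IsLittleEndian)
        Array.Reverse(bytes); // Java wrote big-endian; flip for this machine
    return BitConverter.ToInt32(bytes, 0);
}
(On newer .NET, System.Buffers.Binary.BinaryPrimitives.ReadInt32BigEndian does the same thing directly.)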
I'd consider using a standard format like XML or JSON to store your data. Then you can use standard serializers in both Java and C# to read/write the file. This sort of approach lets you easily name the data fields, read it from many languages, be easily understandable if someone opens the file in a text editor, and more easily add data to be serialized.
E.g. you can read/write JSON with Gson in Java and Json.NET in C#. The class might look like this in C#:
public class MyData
{
    public byte FirstValue { get; set; }
    public int SecondValue { get; set; }
    public int ThirdValue { get; set; }
}
// serialize to string example
var myData = new MyData { FirstValue = 2, SecondValue = 5, ThirdValue = -1 };
string serialized = JsonConvert.SerializeObject(myData);
It would serialize to
{"FirstValue":2,"SecondValue":5,"ThirdValue":-1}
The Java would, similarly, be quite simple. You can find examples of how to read/write files in each library.
Or if an array would be a better model for your data:
string serialized = JsonConvert.SerializeObject(new[] { 2, 5, -1 }); // [2,5,-1]
Just to clarify something first. I am not trying to convert a byte array to a single string. I am trying to convert a byte-array to a string-array.
I am fetching some data from the clipboard using the GetClipboardData API and then copying the data from memory as a byte array. When multiple files have been copied (hence the CF_HDROP clipboard format), I want to convert this byte array into a string array of the copied file paths.
Here's my code so far.
//Get pointer to clipboard data in the selected format
var clipboardDataPointer = GetClipboardData(format);
//Do a bunch of crap necessary to copy the data from the memory
//the above pointer points at to a place we can access it.
var length = GlobalSize(clipboardDataPointer);
var @lock = GlobalLock(clipboardDataPointer);
//Init a buffer which will contain the clipboard data
var buffer = new byte[(int)length];
//Copy clipboard data to buffer
Marshal.Copy(@lock, buffer, 0, (int)length);
GlobalUnlock(clipboardDataPointer);
snapshot.InsertData(format, buffer);
Now, here's my code for reading the buffer data afterwards.
var formatter = new BinaryFormatter();
using (var serializedData = new MemoryStream(buffer))
{
paths = (string[]) formatter.Deserialize(serializedData);
}
This won't work, and it'll crash with an exception saying that the stream doesn't contain a binary header. I suppose this is because it doesn't know which type to deserialize into.
I've tried looking the Marshal class through. Nothing seems of any relevance.
If the data came through the Win32 API then a string array will just be a sequence of null-terminated strings with a double-null-terminator at the end. (Note that the strings will be UTF-16, so two bytes per character). You'll basically need to pull the strings out one at a time into an array.
The method you're looking for here is Marshal.PtrToStringUni, which you should use instead of Marshal.Copy since it works on an IntPtr. It will extract a string, up to the first null character, from your IntPtr and copy it to a string.
The idea would be to continually extract a single string, then advance the IntPtr past the null byte to the start of the next string, until you run out of buffer. I have not tested this, and it could probably be improved (in particular I think there's a smarter way to detect the end of the buffer) but the basic idea would be:
var myptr = GetClipboardData(format);
var length = GlobalSize(myptr);
var result = new List<string>();
var pos = 0;
while (pos < length)
{
    var str = Marshal.PtrToStringUni(myptr);
    if (string.IsNullOrEmpty(str))
        break; // reached the double-null terminator
    result.Add(str);
    // advance past the string plus its two-byte UTF-16 null terminator
    var count = Encoding.Unicode.GetByteCount(str) + 2;
    myptr = IntPtr.Add(myptr, count);
    pos += count;
}
return result.ToArray();
(By the way: the reason your deserialization doesn't work is because serializing a string[] doesn't just write out the characters as bytes; it writes out the structure of a string array, including additional internal bits that .NET uses like the lengths, and a binary header with type information. What you're getting back from the clipboard has none of that present, so it cannot be deserialized.)
How about this:
var strings = Encoding.Unicode
    .GetString(buffer)
    .Split(new[] { '\0' }, StringSplitOptions.RemoveEmptyEntries);
The double-null terminator at the end simply yields empty entries, which RemoveEmptyEntries discards.
I'm developing a log parser and reading files of more than 150 MB of strings. This is my approach; is there any way to optimize what is inside the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and saw the same memory consumption.
private void ReadLogInThread()
{
    string lineOfLog = string.Empty;
    try
    {
        StreamReader logFile = new StreamReader(myLog.logFileLocation);
        InformationUnit infoUnit = new InformationUnit();
        infoUnit.LogCompleteSize = myLog.logFileSize;
        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            myLog.transformedLog.Add(lineOfLog); // List<string>
            myLog.logNumberLines++;
            infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
            infoUnit.CurrentLine = lineOfLog;
            infoUnit.CurrentSizeRead += lineOfLog.Length;
            if (onLineRead != null)
                onLineRead(infoUnit);
        }
    }
    catch { throw; }
}
Thanks in advance!
EXTRA:
I'm saving each line because, after reading the log, I will need to check for some information on every stored line. The language is C#.
Memory economy can be achieved if your log lines can actually be parsed into a data-row representation.
Here is a typical log line i can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes 200 bytes in memory.
At the same time, the following representation takes only about 16 bytes:
enum LogReason { Operation, Error, Warning };
enum EventKind : short { DataStoreOperation, DataReadOperation };
enum OperationStatus : short { Success, Failed };

struct LogRow
{
    DateTime EventTime;
    LogReason Reason;
    EventKind Kind;
    OperationStatus Status;
}
Another optimization possibility is parsing each line into an array of string tokens; that way you can make use of string interning.
For example, if the word "DataStoreOperation" takes 36 bytes and has 1,000,000 entries in the file, the saving is (18*2 - 4) * 1,000,000 = 32,000,000 bytes.
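A minimal sketch of that interning idea (assuming whitespace-separated tokens; string.Intern keeps one shared copy of each distinct token):
// Hypothetical helper: repeated tokens across millions of lines
// end up sharing a single string instance.
static string[] TokenizeInterned(string line)
{
    string[] tokens = line.Split(' ');
    for (int i = 0; i < tokens.Length; i++)
    {
        tokens[i] = string.Intern(tokens[i]);
    }
    return tokens;
}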
Try to make your algorithm sequential.
Using an IEnumerable instead of a List plays nicer with memory while keeping the same semantics as working with a list, provided you don't need random access to lines by index.
IEnumerable<string> ReadLines()
{
    // ...
    while ((lineOfLog = logFile.ReadLine()) != null)
    {
        yield return lineOfLog;
    }
}

// ...
foreach (var line in ReadLines())
{
    ProcessLine(line);
}
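Note that .NET 4 and later already ship this pattern as File.ReadLines, which enumerates lazily (unlike File.ReadAllLines, which materializes the whole file at once):
// Only one line is held in memory at a time; ProcessLine stands in
// for whatever per-line work you need (a hypothetical handler).
foreach (var line in File.ReadLines(myLog.logFileLocation))
{
    ProcessLine(line);
}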
I am not sure if it will fit your project, but you can store the result in a StringBuilder instead of a list of strings.
For example, this process on my machine takes 250 MB of memory after loading a 50 MB file:
static void Main(string[] args)
{
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        var list = new List<string>();
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            list.Add(line);
        }
    }
}
On the other hand, this code will take only 100 MB:
static void Main(string[] args)
{
    var stringBuilder = new StringBuilder();
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            stringBuilder.AppendLine(line);
        }
    }
}
Memory usage keeps going up because you're simply adding the lines to a List<string>, which grows without bound. If you want to use less memory, one thing you can do is write the data out to disk rather than keeping it in scope. Of course, this will greatly degrade speed.
Another option is to compress the string data as you store it in your list and decompress it coming out, but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation (I'm speaking C/C++; substitute C# as needed):
1) Use fseek/ftell to find the size of the file.
2) Use malloc to allocate a chunk of memory the size of the file + 1; set that last byte to '\0' to terminate the string.
3) Use fread to read the entire file into the memory buffer. You now have a char* which holds the contents of the file as a string.
4) Create a vector of const char* to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.
5) Find the carriage-control characters (probably \r\n). Replace the \r by \0 to make the line a string, and increment past the \n. Push this new pointer location onto the vector.
6) Repeat the above until all of the lines in the file have been NUL-terminated and are pointed to by elements in the vector.
7) Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.
8) When you are done, close the file, free the memory, and continue happily along your way.
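Since the answer invites a C# substitution, a rough equivalent (a sketch; 'path' is a placeholder) is to read the file into one byte buffer and keep only line-start offsets, rather than one string object per line:
// Read the whole log once and index the line starts; no per-line strings.
byte[] buffer = File.ReadAllBytes(path); // 'path' is a hypothetical file path
var lineStarts = new List<int> { 0 };
for (int i = 0; i < buffer.Length; i++)
{
    if (buffer[i] == (byte)'\n')
    {
        lineStarts.Add(i + 1);
    }
}
// A line can then be materialized on demand, e.g.:
// string line = Encoding.ASCII.GetString(buffer, start, end - start);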
1) Compress the strings before you store them (i.e. see System.IO.Compression and GZipStream; a sketch follows below). This would probably kill the performance of your program, though, since you'd have to decompress each line to read it.
2) Remove any extra whitespace characters or common words you can do without. I.e., if you can understand what the log is saying without the words "the, a, of...", remove them. Also, shorten any common words (i.e. change "error" to "err" and "warning" to "wrn"). This would slow down this step of the process but shouldn't affect the performance of the rest.
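A minimal sketch of the first suggestion (Compress/Decompress are hypothetical helper names):
using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] Compress(string text)
{
    byte[] raw = Encoding.UTF8.GetBytes(text);
    using (var output = new MemoryStream())
    {
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        return output.ToArray();
    }
}

static string Decompress(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
        gzip.CopyTo(output);
        return Encoding.UTF8.GetString(output.ToArray());
    }
}
Note that gzip adds per-stream header overhead, so compressing each short line individually can actually grow it; it only pays off for long lines or batches of lines.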
What encoding is your original file? If it is ASCII, then just the strings alone are going to take over 2× the size of the file just to load into your array. A C# character is 2 bytes, and a C# string adds an extra ~20 bytes per string on top of the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You most likely can parse the incoming line into a data structure which reduces the memory overhead. For example, if you have a timestamp in the log file, you can convert it to a DateTime value, which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be turned into a code or an enum in a similar manner.
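For instance, parsing a timestamp token into an 8-byte DateTime (a sketch; the format string here matches a hypothetical "yyyy/MM/dd:H:mm:ss.fff" layout):
using System;
using System.Globalization;

// 22 characters of text (~44 bytes plus string overhead) become 8 bytes.
DateTime eventTime = DateTime.ParseExact(
    "2019/01/05:0:24:32.435",
    "yyyy/MM/dd:H:mm:ss.fff",
    CultureInfo.InvariantCulture);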
Even if you have to leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all, you can probably cut down on your memory usage. If there are a lot of common strings, you can Intern them and only pay for one string no matter how many occurrences you have.
If you must store the raw data, and assuming your logs are mostly ASCII, then you can save some memory by storing UTF-8 bytes internally. Strings are UTF-16 internally, so you're storing an extra byte for each character. By switching to UTF-8 you cut memory use roughly in half (not counting class overhead, which is still significant). You can then convert back to normal strings as needed.
static void Main(string[] args)
{
    List<byte[]> strings = new List<byte[]>();
    using (TextReader tr = new StreamReader(@"C:\test.log"))
    {
        string s = tr.ReadLine();
        while (s != null)
        {
            strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
            s = tr.ReadLine();
        }
    }

    // Get the strings back
    foreach (var str in strings)
    {
        Console.WriteLine(Encoding.UTF8.GetString(str));
    }
}