I am writing a C# application that needs to read about 130,000 (String, Int32) pairs at startup into a Dictionary. The pairs are stored in a .txt file and are thus easily modifiable by anyone, which is dangerous in this context. I would like to ask if there is a way to save this dictionary so that the information can be reasonably safely stored, without losing performance at startup. I have tried using BinaryFormatter, but the problem is that while the original program takes between 125ms and 250ms at startup to read the information from the txt and build the dictionary, deserializing the resulting binary files takes up to 2s, which is not too much by itself, but compared to the original performance is an 8-16x decrease in speed.
Note: Encryption is important, but the most important thing is a way to save and read the dictionary from the disk - possibly from a binary file - without having to use Convert.ToInt32 on each line, thus improving performance.
Interesting question. I did some quick tests and you are right - BinaryFormatter is surprisingly slow:
Serialize 130,000 dictionary entries: 547ms
Deserialize 130,000 dictionary entries: 1046ms
When I coded it with a StreamReader/StreamWriter with comma separated values I got:
Serialize 130,000 dictionary entries: 121ms
Deserialize 130,000 dictionary entries: 111ms
But then I tried just using a BinaryWriter/BinaryReader:
Serialize 130,000 dictionary entries: 22ms
Deserialize 130,000 dictionary entries: 36ms
The code for that looks like this:
public void Serialize(Dictionary<string, int> dictionary, Stream stream)
{
    BinaryWriter writer = new BinaryWriter(stream);
    writer.Write(dictionary.Count);
    foreach (var kvp in dictionary)
    {
        writer.Write(kvp.Key);
        writer.Write(kvp.Value);
    }
    writer.Flush();
}
public Dictionary<string, int> Deserialize(Stream stream)
{
    BinaryReader reader = new BinaryReader(stream);
    int count = reader.ReadInt32();
    var dictionary = new Dictionary<string, int>(count);
    for (int n = 0; n < count; n++)
    {
        var key = reader.ReadString();
        var value = reader.ReadInt32();
        dictionary.Add(key, value);
    }
    return dictionary;
}
As others have said though, if you are concerned about users tampering with the file, encryption, rather than binary formatting, is the way forward.
If you want to have the data relatively safely stored, you can encrypt the contents. If you just encrypt it as a string and decrypt it before your current parsing logic, you should be safe. And, this should not impact performance that much.
See Encrypt and decrypt a string for more information.
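As a rough illustration (a sketch only: it assumes AES via System.Security.Cryptography, and the key/IV handling shown is not real key management), you could wrap the file stream in a CryptoStream so your current parsing logic stays unchanged:
using System.IO;
using System.Security.Cryptography;

static Stream OpenEncryptedWrite(string path, byte[] key, byte[] iv)
{
    var aes = Aes.Create();
    aes.Key = key; // illustrative only; key management is the hard part
    aes.IV = iv;
    return new CryptoStream(File.Create(path), aes.CreateEncryptor(), CryptoStreamMode.Write);
}

static Stream OpenEncryptedRead(string path, byte[] key, byte[] iv)
{
    var aes = Aes.Create();
    aes.Key = key;
    aes.IV = iv;
    return new CryptoStream(File.OpenRead(path), aes.CreateDecryptor(), CryptoStreamMode.Read);
}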
Encryption comes at the cost of key management. And, of course, even the fastest encryption/decryption algorithms are slower than no encryption at all. Same with compression, which will only help if you are I/O-bound.
If performance is your main concern, start looking at where the bottleneck actually is. If the culprit really is the Convert.ToInt32() call, I imagine you can store the Int32 bits directly and get away with a simple cast, which should be faster than parsing a string value. To obfuscate the strings, you can xor each byte with some fixed value, which is fast but provides nothing more than a roadbump for a determined attacker.
Perhaps something like:
static void Serialize(string path, IDictionary<string, int> data)
{
    using (var file = File.Create(path))
    using (var writer = new BinaryWriter(file))
    {
        writer.Write(data.Count);
        foreach (var pair in data)
        {
            writer.Write(pair.Key);
            writer.Write(pair.Value);
        }
    }
}
static IDictionary<string, int> Deserialize(string path)
{
    using (var file = File.OpenRead(path))
    using (var reader = new BinaryReader(file))
    {
        int count = reader.ReadInt32();
        var data = new Dictionary<string, int>(count);
        while (count-- > 0)
        {
            data.Add(reader.ReadString(), reader.ReadInt32());
        }
        return data;
    }
}
Note this doesn't do anything regarding encryption; that is a separate concern. You might also find that adding deflate into the mix reduces file IO and increases performance:
static void Serialize(string path, IDictionary<string, int> data)
{
    using (var file = File.Create(path))
    using (var deflate = new DeflateStream(file, CompressionMode.Compress))
    using (var writer = new BinaryWriter(deflate))
    {
        writer.Write(data.Count);
        foreach (var pair in data)
        {
            writer.Write(pair.Key);
            writer.Write(pair.Value);
        }
    }
}
static IDictionary<string, int> Deserialize(string path)
{
    using (var file = File.OpenRead(path))
    using (var deflate = new DeflateStream(file, CompressionMode.Decompress))
    using (var reader = new BinaryReader(deflate))
    {
        int count = reader.ReadInt32();
        var data = new Dictionary<string, int>(count);
        while (count-- > 0)
        {
            data.Add(reader.ReadString(), reader.ReadInt32());
        }
        return data;
    }
}
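And for completeness, a minimal sketch of the XOR roadbump mentioned earlier (the single-byte key 0x5A is purely illustrative, and this is obfuscation, not encryption): serialize to a buffer first, then XOR the bytes before writing.
static void SerializeObfuscated(string path, IDictionary<string, int> data)
{
    var buffer = new MemoryStream();
    var writer = new BinaryWriter(buffer);
    writer.Write(data.Count);
    foreach (var pair in data)
    {
        writer.Write(pair.Key);
        writer.Write(pair.Value);
    }
    writer.Flush();
    byte[] bytes = buffer.ToArray();
    for (int i = 0; i < bytes.Length; i++)
    {
        bytes[i] ^= 0x5A; // fixed-value XOR: a roadbump, not encryption
    }
    File.WriteAllBytes(path, bytes);
}
Reading it back is the reverse: read all the bytes, XOR them with the same value, and wrap the result in a MemoryStream for the Deserialize method above.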
Is it safe enough to use BinaryFormatter instead of storing the contents directly in the text file? Obviously not, because others can easily "destroy" the file by opening it in Notepad and adding something, even though they will only see strange characters. It's better if you store it in a database. But if you insist on your solution, you can easily improve the performance a lot by using Parallel Programming in C# 4.0 (you can easily find a lot of useful examples by googling it). Something like this:
//just an example
Dictionary<string, int> source = GetTheDict();
var grouped = source.GroupBy(x =>
{
    if (x.Key.First() >= 'a' && x.Key.First() <= 'z') return "File1";
    else if (x.Key.First() >= 'A' && x.Key.First() <= 'Z') return "File2";
    return "File3";
});
Parallel.ForEach(grouped, g =>
{
    ThreeStreamsToWriteToThreeFilesParallelly(g);
});
An alternative to Parallel is creating several threads yourself; reading from and writing to different files in parallel will be faster.
Well, using a BinaryFormatter isn't really a safe way to store the pairs, as you can write a very simple program to deserialize it (after, say, running Reflector on your code to get the type).
How about encrypting the txt?
With something like this, for example? (For maximum performance, try without compression.)
I'm having problems with the formatting of a CSV created from C# code. In the Notepad file the output scrolls vertically down, one value per row (the values seen in the structs below are output in one row; there is a row of numbers as well that appears directly below the struct values, but the numbers should be in a new row beside the structs). When I open it in Excel it's a similar story, only the output from the structs is where it should be; however, the row of numbers appears directly below the struct values but one row to the right, if that makes sense, and the numbers should appear directly beside their corresponding struct values. The code I'm using is below.
Here are the structs for the dictionaries I'm working with.
public enum Genders
{
    Male,
    Female,
    Other,
    UnknownorDeclined,
}
public enum Ages
{
    Upto15Years,
    Between16to17Years,
    Between18to24Years,
    Between25to34Years,
    Between35to44Years,
    Between45to54Years,
    Between55to64Years,
    Between65to74Years,
    Between75to84Years,
    EightyFiveandOver,
    UnavailableorDeclined,
}
And here is the method that does the outputting to the csv file, using a StreamWriter and a StringBuilder.
public void CSVProfileCreate<T>(Dictionary<T, string> columns, Dictionary<T, int> data)
{
    StreamWriter write = new StreamWriter("c:/temp/testoutputprofile.csv");
    StringBuilder output = new StringBuilder();
    foreach (var pair in columns)
    {
        //output.Append(pair.Key);
        //output.Append(",");
        output.Append(pair.Value);
        output.Append(",");
        output.Append(Environment.NewLine);
    }
    foreach (var d in data)
    {
        //output.Append(pair.Key);
        output.Append(",");
        output.Append(d.Value);
        output.Append(Environment.NewLine);
    }
    write.Write(output);
    write.Dispose();
}
And finally the method to feed the dictionaries into the csv creator.
public void RunReport()
{
    CSVProfileCreate(genderKeys, genderValues);
    CSVProfileCreate(ageKeys, ageValues);
}
Any ideas?
UPDATE
I fixed it by doing this:
public void CSVProfileCreate<T>(Dictionary<T, string> columns, Dictionary<T, int> data)
{
    StreamWriter write = new StreamWriter("c:/temp/testoutputprofile.csv");
    StringBuilder output = new StringBuilder();
    IEnumerable<string> col = columns.Values.AsEnumerable();
    IEnumerable<int> dat = data.Values.AsEnumerable();
    for (int i = 0; i < col.Count(); i++)
    {
        output.Append(col.ElementAt(i));
        output.Append(",");
        output.Append(dat.ElementAt(i));
        output.Append(",");
        output.Append(Environment.NewLine);
    }
    write.Write(output);
    write.Dispose();
}
You write Environment.NewLine after every single value that you output.
Rather than having two loops, you should have just one loop that outputs
A "pair"
A value
Environment.NewLine
for each iteration.
Assuming columns and data have the same keys, that could look something like
foreach (T key in columns.Keys)
{
    string pair = columns[key];
    int d = data[key];
    output.Append(pair);
    output.Append(",");
    output.Append(d);
    output.Append(Environment.NewLine);
}
Note two complications:
If a cell value contains a comma, you need to surround the output of that cell with double quotes.
If a cell value contains a comma and also contains a double quote, you have to double up the double quote to escape it.
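A minimal sketch of those two rules as a helper method (illustrative only; it doesn't handle embedded newlines):
static string EscapeCsvCell(string value)
{
    // Quote the cell if it contains a comma or a quote, doubling up embedded quotes
    if (value.Contains(",") || value.Contains("\""))
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
    return value;
}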
Examples:
Smith, Jr
would have to be output
"Smith, Jr"
and
"Smitty" Smith, Jr
would have to be output
"""Smitty"" Smith, Jr"
UPDATE
Based on your comment about the keys...
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key. The order in which the items are returned is undefined.
http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
If you cannot use the key to associate the right pair with the right data, how do you make that association?
If you are iterating the dictionary and they happen to be in the order you hope, that is truly undefined behavior that could change with the next .NET service pack.
You need something reliable to relate the pair with the correct data.
About the var keyword
var is not a type, but rather a shortcut that frees you from writing out the entire type. You can use var if you wish, but the actual type is KeyValuePair<T, string> and KeyValuePair<T, int> respectively. You can see that if you write var and hover over that keyword with your mouse in Visual Studio.
About disposing resources
Your line
write.Dispose();
is risky. If any of your code throws an Exception prior to reaching that line, it will never run and write will not be disposed. It is strongly preferable to make use of the using keyword like this:
using (StreamWriter write = new StreamWriter("c:/temp/testoutputprofile.csv"))
{
    // Your code here
}
When the scope of using ends (after the associated }), write.Dispose() will be automatically called whether or not an Exception was thrown. This is the same as, but shorter than,
StreamWriter write = new StreamWriter("c:/temp/testoutputprofile.csv");
try
{
    // Your code here
}
finally
{
    write.Dispose();
}
For example, I have:
struct SomeStruct
{
    //some fields
    //each instance will store info read from a file, maybe 3kb, maybe more.
}
List<SomeStruct> lst = new List<SomeStruct>();
I will add a crazy number of objects to that list, so it will end up at 10GB or more in size.
Can I serialize lst without any errors like out of memory etc.? Can I deserialize it later?
If you can hold the list of items in memory at one time, you should have a decent chance of serializing/deserializing them. You may want to handle them individually, in a stream, rather than serializing/deserializing the entire list all at once, perhaps. That would take care of any edge cases you might have.
Pseudo-code:
private void SerializeObjects(List<foo> foos, Stream stream)
{
    foreach (var f in foos)
    {
        stream.Write(f);
    }
}
private void DeserializeObjects(List<foo> foos, Stream stream)
{
    foo f = stream.ReadFoo();
    while (f != null)
    {
        foos.Add(f);
        f = stream.ReadFoo();
    }
}
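To make that concrete, here is a minimal sketch with BinaryWriter/BinaryReader (the fields of SomeStruct are hypothetical, and a count prefix replaces the read-until-null loop):
struct SomeStruct
{
    public int Id;       // hypothetical fields
    public double Value;
}

static void SerializeObjects(List<SomeStruct> items, Stream stream)
{
    var writer = new BinaryWriter(stream);
    writer.Write(items.Count); // count prefix so the reader knows when to stop
    foreach (var item in items)
    {
        writer.Write(item.Id);
        writer.Write(item.Value);
    }
    writer.Flush();
}

static List<SomeStruct> DeserializeObjects(Stream stream)
{
    var reader = new BinaryReader(stream);
    int count = reader.ReadInt32();
    var items = new List<SomeStruct>(count);
    for (int i = 0; i < count; i++)
    {
        items.Add(new SomeStruct { Id = reader.ReadInt32(), Value = reader.ReadDouble() });
    }
    return items;
}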
I have a weird situation happening that I'm not quite understanding.
I have a 'dataset' class that holds various metadata about a monitoring buoy including a list of 'sensors'.
Each 'sensor' has a current 'sensorstate'.
Each 'sensorstate' has a bit of metadata about it (timestamp, reason for change etc) but most importantly it has a Dictionary<DateTime,float> of values.
These sensors generally have upwards of 50k data points (years' worth of 15-minute readings), so I wanted to find something a bit faster at serialising than the default .NET BinaryFormatter, and set up Protobuf-net, which serializes fantastically fast.
Unfortunately my problem occurs on deserialization, when my dictionary of values throws an exception because an item with the same key has already been added. The only way I can get it to deserialise is to enable 'OverwriteList', but I'm a little unsure why: there aren't any duplicate keys when serializing (it's a dictionary), so why are there duplicate keys when I deserialize? This also brings up data integrity issues.
Any help in explaining this would be highly appreciated.
(On a side note, when giving ProtoMember attribute ids, do they need to be unique to the class or to the whole project? Also, I'm looking for lossless compression recommendations to use in conjunction with protobuf-net, as the files are getting pretty large.)
Edit:
I've just put my source up on GitHub and here is the class in question
SensorState (Note: it currently has OverwriteList = true in order to have it working for other development)
Here is an example raw data file
I had already tried using the SkipConstructor flag, but even with it set to true it gets an exception unless OverwriteList is also true for the values dictionary.
If OverwriteList fixes it, then it suggests to me that the dictionary has some data in it by default, perhaps via a constructor or similar. If it is indeed coming from the constructor, you can disable that with [ProtoContract(SkipConstructor=true)].
If I have misunderstood the above, it may help to illustrate with a reproducible example, if possible.
With regard to the ids, they only need to be unique inside each type, and it is recommended to keep them small (due to "varint" encoding of tags, small keys are "cheaper" than large keys).
If you want to really minimise size, I would actually suggest looking at the content of the data, too. For example, you say that this is 15 minute readings... well, I'm guessing there are occasional gaps, but could you do, for example:
Block (class)
Start Time (DateTime)
Values (float[])
and have a Block for every contiguous bunch of 15-minute values (the assumption here is that every value is 15 after the last, else a new block is started). So you are storing multiple Block instances in place of a single dictionary. This has the advantages:
far fewer DateTime values to store
you can use "packed" encoding on the floats, which means it doesn't need to add all the intermediate tags; you do this by marking an array/list as ([ProtoMember({key}, IsPacked = true)]) - noting that it only works on a few basic data-types (not sub-objects)
combined, these two tweaks could yield significant savings
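A minimal sketch of such a Block (the member numbers here are illustrative):
[ProtoContract]
class Block
{
    [ProtoMember(1)]
    public DateTime StartTime { get; set; }

    // "packed" encoding: no per-element tags
    [ProtoMember(2, IsPacked = true)]
    public float[] Values { get; set; }
}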
If the data has a lot of strings, you could try GZIP/DEFLATE. You can of course try these either way, but without large amounts of string data I would be cautious of expecting too much extra from compression.
As an update based on the supplied (CSV) data file, there is no inherent problem here handling the dictionary - as shown:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using ProtoBuf;

class Program
{
    static void Main()
    {
        var data = new Data
        {
            Points =
            {
                {new DateTime(2009,09,1,0,0,0), 11.04F},
                {new DateTime(2009,09,1,0,15,0), 11.04F},
                {new DateTime(2009,09,1,0,30,0), 11.01F},
                {new DateTime(2009,09,1,0,45,0), 11.01F},
                {new DateTime(2009,09,1,1,0,0), 11F},
                {new DateTime(2009,09,1,1,15,0), 10.98F},
                {new DateTime(2009,09,1,1,30,0), 10.98F},
                {new DateTime(2009,09,1,1,45,0), 10.92F},
                {new DateTime(2009,09,1,2,00,0), 10.09F},
            }
        };
        var ms = new MemoryStream();
        Serializer.Serialize(ms, data);
        ms.Position = 0;
        var clone = Serializer.Deserialize<Data>(ms);
        Console.WriteLine("{0} points:", clone.Points.Count);
        foreach (var pair in clone.Points.OrderBy(x => x.Key))
        {
            float orig;
            data.Points.TryGetValue(pair.Key, out orig);
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value == orig ? "correct" : "FAIL");
        }
    }
}

[ProtoContract]
class Data
{
    private readonly Dictionary<DateTime, float> points = new Dictionary<DateTime, float>();
    [ProtoMember(1)]
    public Dictionary<DateTime, float> Points { get { return points; } }
}
This is where I apologize for ever suggesting it had anything to do with code that wasn't my own doing. And while I'm here, mad props to the team behind protobuf, and to Marc Gravell for protobuf-net; it's seriously fast.
What was happening was that in the Sensor class I had some logic to never let a couple of properties be null.
[ProtoMember(12)]
public SensorState CurrentState
{
    get { return (_currentState == null) ? RawData : _currentState; }
    set { _currentState = value; }
}
Link
[ProtoMember(16)]
public SensorState RawData
{
    get { return _rawData ?? (_rawData = new SensorState(this, DateTime.Now, new Dictionary<DateTime, float>(), "", true, null)); }
    private set { _rawData = value; }
}
Link
While this works fantastically when I'm using the properties, it messes up the serialization process.
The simple fix was to mark the underlying fields for serialization instead.
[ProtoMember(16)]
private SensorState _rawData;
[ProtoMember(12)]
private SensorState _currentState;
Link
I have a collection of objects that I need to write to a binary file.
I need the bytes in the file to be compact, so I can't use BinaryFormatter. BinaryFormatter throws in all sorts of info for deserialization needs.
If I try
byte[] myBytes = (byte[]) myObject
I get a runtime exception.
I need this to be fast so I'd rather not be copying arrays of bytes around. I'd just like the cast byte[] myBytes = (byte[]) myObject to work!
OK just to be clear, I cannot have any metadata in the output file. Just the object bytes. Packed object-to-object. Based on answers received, it looks like I'll be writing low-level Buffer.BlockCopy code. Perhaps using unsafe code.
To convert an object to a byte array:
// Convert an object to a byte array
public static byte[] ObjectToByteArray(Object obj)
{
    BinaryFormatter bf = new BinaryFormatter();
    using (var ms = new MemoryStream())
    {
        bf.Serialize(ms, obj);
        return ms.ToArray();
    }
}
You just need to copy this function into your code and pass it the object that you need to convert to a byte array. If you need to convert the byte array back to an object, you can use the function below:
// Convert a byte array to an Object
public static Object ByteArrayToObject(byte[] arrBytes)
{
    using (var memStream = new MemoryStream())
    {
        var binForm = new BinaryFormatter();
        memStream.Write(arrBytes, 0, arrBytes.Length);
        memStream.Seek(0, SeekOrigin.Begin);
        var obj = binForm.Deserialize(memStream);
        return obj;
    }
}
You can use these functions with custom classes. You just need to add the [Serializable] attribute to your class to enable serialization.
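For example, a hypothetical class would just need the attribute:
[Serializable]
public class MyData
{
    public string Name;
    public int Value;
}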
If you want the serialized data to be really compact, you can write serialization methods yourself. That way you will have a minimum of overhead.
Example:
public class MyClass
{
    public int Id { get; set; }
    public string Name { get; set; }

    public byte[] Serialize()
    {
        using (MemoryStream m = new MemoryStream())
        {
            using (BinaryWriter writer = new BinaryWriter(m))
            {
                writer.Write(Id);
                writer.Write(Name);
            }
            return m.ToArray();
        }
    }

    public static MyClass Deserialize(byte[] data)
    {
        MyClass result = new MyClass();
        using (MemoryStream m = new MemoryStream(data))
        {
            using (BinaryReader reader = new BinaryReader(m))
            {
                result.Id = reader.ReadInt32();
                result.Name = reader.ReadString();
            }
        }
        return result;
    }
}
Well, a cast from myObject to byte[] is never going to work unless you've got an explicit conversion or unless myObject is a byte[]. You need a serialization framework of some kind. There are plenty out there, including Protocol Buffers which is near and dear to me. It's pretty "lean and mean" in terms of both space and time.
You'll find that almost all serialization frameworks have significant restrictions on what you can serialize, however - Protocol Buffers more than some, due to being cross-platform.
If you can give more requirements, we can help you out more - but it's never going to be as simple as casting...
EDIT: Just to respond to this:
I need my binary file to contain the object's bytes. Only the bytes, no metadata whatsoever. Packed object-to-object. So I'll be implementing custom serialization.
Please bear in mind that the bytes in your objects are quite often references... so you'll need to work out what to do with them.
I suspect you'll find that designing and implementing your own custom serialization framework is harder than you imagine.
I would personally recommend that if you only need to do this for a few specific types, you don't bother trying to come up with a general serialization framework. Just implement an instance method and a static method in all the types you need:
public void WriteTo(Stream stream)
public static WhateverType ReadFrom(Stream stream)
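For instance, a minimal sketch of that pair of methods (the type and its two fields are hypothetical):
public class WhateverType
{
    public int Id { get; set; }
    public string Name { get; set; }

    public void WriteTo(Stream stream)
    {
        var writer = new BinaryWriter(stream);
        writer.Write(Id);
        writer.Write(Name);
        writer.Flush();
    }

    public static WhateverType ReadFrom(Stream stream)
    {
        var reader = new BinaryReader(stream);
        return new WhateverType { Id = reader.ReadInt32(), Name = reader.ReadString() };
    }
}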
One thing to bear in mind: everything becomes more tricky if you've got inheritance involved. Without inheritance, if you know what type you're starting with, you don't need to include any type information. Of course, there's also the matter of versioning - do you need to worry about backward and forward compatibility with different versions of your types?
I took Crystalonics' answer and turned them into extension methods. I hope someone else will find them useful:
public static byte[] SerializeToByteArray(this object obj)
{
    if (obj == null)
    {
        return null;
    }
    var bf = new BinaryFormatter();
    using (var ms = new MemoryStream())
    {
        bf.Serialize(ms, obj);
        return ms.ToArray();
    }
}

public static T Deserialize<T>(this byte[] byteArray) where T : class
{
    if (byteArray == null)
    {
        return null;
    }
    using (var memStream = new MemoryStream())
    {
        var binForm = new BinaryFormatter();
        memStream.Write(byteArray, 0, byteArray.Length);
        memStream.Seek(0, SeekOrigin.Begin);
        var obj = (T)binForm.Deserialize(memStream);
        return obj;
    }
}
Use of BinaryFormatter is now considered unsafe; see the Microsoft Docs.
Just use System.Text.Json:
To serialize to bytes:
JsonSerializer.SerializeToUtf8Bytes(obj);
To deserialize to your type:
JsonSerializer.Deserialize<MyType>(byteArray); // MyType being the type you serialized
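A hypothetical round trip (MyType stands in for your own class):
using System.Text.Json;

byte[] bytes = JsonSerializer.SerializeToUtf8Bytes(new MyType { Name = "abc", Value = 123 });
MyType roundTripped = JsonSerializer.Deserialize<MyType>(bytes);

public class MyType
{
    public string Name { get; set; }
    public int Value { get; set; }
}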
You are really talking about serialization, which can take many forms. Since you want small and binary, protocol buffers may be a viable option - giving version tolerance and portability as well. Unlike BinaryFormatter, the protocol buffers wire format doesn't include all the type metadata; just very terse markers to identify data.
In .NET there are a few implementations; in particular
protobuf-net
dotnet-protobufs
I'd humbly argue that protobuf-net (which I wrote) allows more .NET-idiomatic usage with typical C# classes ("regular" protocol-buffers tends to demand code-generation); for example:
[ProtoContract]
public class Person
{
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
}
....
Person person = new Person { Id = 123, Name = "abc" };
Serializer.Serialize(destStream, person);
...
Person anotherPerson = Serializer.Deserialize<Person>(sourceStream);
This worked for me:
byte[] bfoo = (byte[])foo;
foo is an Object that I'm 100% certain is a byte array.
I found that this method worked correctly for me, using Newtonsoft.Json:
public TData ByteToObj<TData>(byte[] arr)
{
    return JsonConvert.DeserializeObject<TData>(Encoding.UTF8.GetString(arr));
}

public byte[] ObjToByte<TData>(TData data)
{
    var json = JsonConvert.SerializeObject(data);
    return Encoding.UTF8.GetBytes(json);
}
Take a look at Serialization, a technique to "convert" an entire object to a byte stream. You may send it to the network or write it into a file and then restore it back to an object later.
To access the memory of an object directly (to do a "core dump") you'll need to head into unsafe code.
If you want something more compact than BinaryWriter or a raw memory dump will give you, then you need to write some custom serialisation code that extracts the critical information from the object and packs it in an optimal way.
EDIT:
P.S. It's very easy to wrap the BinaryWriter approach into a DeflateStream to compress the data, which will usually roughly halve the size of the data.
I believe what you're trying to do is impossible.
The junk that BinaryFormatter creates is necessary to recover the object from the file after your program stopped.
However, it is possible to get the object data; you just need to know the exact size of it (more difficult than it sounds):
public static unsafe byte[] Binarize(object obj, int size)
{
    var r = new byte[size];
    var rf = __makeref(obj);
    var a = **(IntPtr**)(&rf);
    Marshal.Copy(a, r, 0, size);
    return r;
}
this can be recovered via:
public unsafe static dynamic ToObject(byte[] bytes)
{
    var rf = __makeref(bytes);
    **(int**)(&rf) += 8;
    return GCHandle.Alloc(bytes).Target;
}
The reason why the above methods don't work for serialization is that the first four bytes in the returned data correspond to a RuntimeTypeHandle. The RuntimeTypeHandle describes the layout/type of the object but the value of it changes every time the program is ran.
EDIT: that is stupid, don't do that -->
If you already know the type of the object to be deserialized for certain, you can switch those bytes for BitConverter.GetBytes((int)typeof(yourtype).TypeHandle.Value) at the time of deserialization.
I found another way to convert an object to a byte[], here is my solution:
IEnumerable en = (IEnumerable) myObject;
byte[] myBytes = en.OfType<byte>().ToArray();
Regards
This method returns an array of bytes from an object.
private byte[] ConvertBody(object model)
{
    return Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(model));
}
Spans are very useful for something like this. To put it simply, they are very fast ref structs that have a pointer to the first element and a length. They guarantee a contiguous region of memory and the JIT compiler is able to optimize based on these guarantees. They work just like pointer arrays you can see all the time in the C and C++ languages.
Ever since spans have been added, you are able to use two MemoryMarshal functions that can get all bytes of an object without the overhead of streams. Under the hood, it is just a little bit of casting. Just like you asked, there are no extra allocations going down to the bytes unless you copy them to an array or another span. Here is an example of the two functions in use to get the bytes of one:
// Requires: using System.Runtime.CompilerServices; and using System.Runtime.InteropServices;
public static Span<byte> GetBytes<T>(ref T o)
    where T : struct
{
    if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
        throw new Exception($"Type {typeof(T)} is or contains a reference");

    var singletonSpan = MemoryMarshal.CreateSpan(ref o, 1);
    var bytes = MemoryMarshal.AsBytes(singletonSpan);
    return bytes;
}
The first function, MemoryMarshal.CreateSpan, takes a reference to an object with a length for how many adjacent objects of the same type come immediately after it. They must be adjacent because spans guarantee contiguous regions of memory. In this case, the length is 1 because we are only working with the single object. Under the hood, it is done by creating a span beginning at the first element.
The second function, MemoryMarshal.AsBytes, takes a span and turns it into a span of bytes. This span still covers the argument object so any changes to the bytes will be reflected within the object. Fortunately, spans have a method called ToArray which copies all of the contents from the span into a new array. Under the hood, it creates a span over bytes instead of T and adjusts the length accordingly. If there's a span you want to copy into instead, there's the CopyTo method.
The if statement is there to ensure that you are not copying the bytes of a type that is or contains a reference for safety reasons. If it is not there, you may be copying a reference to an object that doesn't exist.
The type T must be a struct because MemoryMarshal.AsBytes requires a non-nullable type.
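For example, using a hypothetical struct (note the returned span is a live view over the struct's memory; ToArray makes an independent copy):
struct Point
{
    public int X;
    public int Y;
}

var p = new Point { X = 1, Y = 2 };
Span<byte> view = GetBytes(ref p); // live view over p's 8 bytes
byte[] copy = view.ToArray();      // independent copy
view[0] = 5;                       // writes through: on little-endian, p.X becomes 5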
You can use the method below to convert a list of objects into a byte array, using System.Text.Json serialization.
private static byte[] ConvertToByteArray(List<object> mergedResponse)
{
    var options = new JsonSerializerOptions
    {
        PropertyNameCaseInsensitive = true,
    };
    if (mergedResponse != null && mergedResponse.Any())
    {
        return JsonSerializer.SerializeToUtf8Bytes(mergedResponse, options);
    }
    return new byte[] { };
}
I'm developing a log parser, and I'm reading files of more than 150MB. This is my approach. Is there any way to optimize what is in the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and faced the same memory consumption.
private void ReadLogInThread()
{
    string lineOfLog = string.Empty;
    try
    {
        StreamReader logFile = new StreamReader(myLog.logFileLocation);
        InformationUnit infoUnit = new InformationUnit();
        infoUnit.LogCompleteSize = myLog.logFileSize;
        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            myLog.transformedLog.Add(lineOfLog); //list<string>
            myLog.logNumberLines++;
            infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
            infoUnit.CurrentLine = lineOfLog;
            infoUnit.CurrentSizeRead += lineOfLog.Length;
            if (onLineRead != null)
                onLineRead(infoUnit);
        }
    }
    catch { throw; }
}
Thanks in advance!
EXTRA:
I'm saving each line because after reading the log I will need to check for some information on every stored line. The language is C#.
Memory economy can be achieved if your log lines can actually be parsed into a data-row representation.
Here is a typical log line I can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes 200 bytes in memory.
At the same time, the following representation takes below 16 bytes:
enum LogReason { Operation, Error, Warning };
enum EventKind : short { DataStoreOperation, DataReadOperation };
enum OperationStatus : short { Success, Failed };

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}
Another optimization possibility is parsing each line into an array of string tokens; this way you could make use of string interning.
For example, if the word "DataStoreOperation" takes 36 bytes, and it has 1,000,000 entries in the file, the economy is (18*2 - 4) * 1,000,000 = 32,000,000 bytes.
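A minimal sketch of the interning idea (splitting on commas is an assumption about the log format):
// Intern each token so repeated words such as "DataStoreOperation"
// are stored only once; each line then keeps just references to them.
string[] tokens = lineOfLog.Split(',');
for (int i = 0; i < tokens.Length; i++)
{
    tokens[i] = string.Intern(tokens[i].Trim());
}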
Try to make your algorithm sequential.
Using an IEnumerable instead of a List helps play nice with memory, while keeping the same semantics as working with a list, if you don't need random access to lines by index.
IEnumerable<string> ReadLines()
{
    // ... open the StreamReader as before ...
    while ((lineOfLog = logFile.ReadLine()) != null)
    {
        yield return lineOfLog;
    }
}
//...
foreach (var line in ReadLines())
{
    ProcessLine(line);
}
I am not sure if it will fit your project, but you can store the result in a StringBuilder instead of a list of strings.
For example, this process on my machine takes 250MB memory after loading (file is 50MB):
static void Main(string[] args)
{
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        var list = new List<string>();
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            list.Add(line);
        }
    }
}
On the other hand, this code will take only 100MB:
static void Main(string[] args)
{
    var stringBuilder = new StringBuilder();
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            stringBuilder.AppendLine(line);
        }
    }
}
Memory usage keeps going up because you're simply adding the lines to a List<string>, which keeps growing. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in scope. Of course, this will greatly degrade speed.
Another option is to compress the string data as you store it and decompress it coming out, but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation (I'm speaking C/C++; substitute C# as needed):
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1.
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer.
You now have a char * which holds the contents of the file as a string.
Create a vector of const char * to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n). Replace the \r by \0 to make the line a string, and increment past the \n. Push this new pointer location back onto the vector.
Repeat the above until all of the lines in the file have been NUL terminated and are pointed to by elements in the vector.
Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.
When you are done, close the file, free the memory, and continue happily along your way.
1) Compress the strings before you store them (i.e. see System.IO.Compression and GZipStream). This would probably kill the performance of your program, though, since you'd have to uncompress to read each line.
2) Remove any extra whitespace characters or common words you can do without. i.e. if you can understand what the log is saying without the words "the, a, of...", remove them. Also, shorten any common words (i.e. change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect the performance of the rest.
What encoding is your original file? If it is ASCII, then just the strings alone are going to take over 2x the size of the file just to load up into your array. A C# character is 2 bytes, and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You most likely can parse the incoming line into a data structure, which reduces the memory overhead. For example, if you have a timestamp in the log file you can convert that to a DateTime value, which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be able to be turned into a code or an enum in a similar manner.
Even if you have to leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all, you can probably cut down on your memory usage. If there are a lot of common strings you can Intern them and only pay for 1 string no matter how many you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
    List<byte[]> strings = new List<byte[]>();
    using (TextReader tr = new StreamReader(@"C:\test.log"))
    {
        string s = tr.ReadLine();
        while (s != null)
        {
            strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
            s = tr.ReadLine();
        }
    }

    // Get strings back
    foreach (var str in strings)
    {
        Console.WriteLine(Encoding.UTF8.GetString(str));
    }
}