I have a task which I know how to code (in C#), but I know a simple implementation will not meet ALL my needs.
So I am looking for tricks which might meet ALL my needs.
I am writing a simulation involving N number of entities interacting over time.
N will start at around 30 and grow into many thousands.
a. The number of entities will change during the course of the simulation.
b. I expect this will require each entity to have its own trace file.
Each entity has a minimum of 20 parameters, and potentially up to millions, that I want to track over time.
a. This will most likely mean we can't keep all values in memory at all times. Some subset should be fine.
b. The number of parameters per entity will initially be fixed, but I can think of some tests which would have the number of parameters slowly changing over time.
Simulation will last for millions of time steps and I need to keep every value for every parameter.
What I will be using these traces for:
a. Plotting a configurable subset of the parameters over a fixed window of time back from the current time step.
i. Normally on the order of 300 time steps.
ii. These plots are in real time while the simulation is running.
b. I will be using these traces to re-play the simulation, so I need to quickly access all the parameters at a given time step so I can quickly move to different times in the simulation.
i. This requires the values be stored in a file(s) which can be inspected/loaded after restarting the software.
ii. Using a database is NOT an option.
c. I will be using the parameters for follow up analysis which I can’t define up front so a more flexible system is desirable.
My initial thought:
One class per entity which holds all the parameters.
Backed by a memory mapped file.
Only a fixed, but moving, amount of the file is mapped to main memory
A second memory mapped file which holds time indexes to main file for quicker access during re-playing of simulation. This may be very important because each entity file will represent a different time slice of the full simulation.
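For what it's worth, here is a minimal sketch of that mapped-window idea in C#. The fixed-size record layout, window size, and class/field names are illustrative assumptions, not part of the design above:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: one trace file per entity, fixed-size records (one per time step),
// and only a sliding window of the file mapped into memory at a time.
class EntityTrace : IDisposable
{
    private const int WindowRecords = 4096;        // time steps kept mapped (assumption)

    private readonly MemoryMappedFile _file;
    private readonly int _recordSize;              // bytes per time step
    private readonly long _capacity;               // total file size in bytes
    private MemoryMappedViewAccessor _window;      // currently mapped slice
    private long _windowStart;                     // first time step covered by the window

    public EntityTrace(string path, int parameterCount, long maxTimeSteps)
    {
        _recordSize = parameterCount * sizeof(double);
        _capacity = maxTimeSteps * _recordSize;
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, _capacity);
        MapWindow(0);
    }

    private void MapWindow(long firstRecord)
    {
        _window?.Dispose();
        _windowStart = firstRecord;
        long offset = firstRecord * _recordSize;
        long size = Math.Min((long)WindowRecords * _recordSize, _capacity - offset);
        _window = _file.CreateViewAccessor(offset, size);
    }

    public void WriteStep(long timeStep, double[] values)
    {
        // Slide the window when the requested step falls outside the mapped range.
        if (timeStep < _windowStart || timeStep >= _windowStart + WindowRecords)
            MapWindow(timeStep);
        _window.WriteArray((timeStep - _windowStart) * _recordSize, values, 0, values.Length);
    }

    public void Dispose()
    {
        _window?.Dispose();
        _file.Dispose();
    }
}

Because every time step is the same number of bytes, the index file mentioned above only becomes necessary if the record size changes mid-run.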
I would start with SQLite. SQLite is essentially a binary file format plus a library that lets you query it conveniently and quickly. It is not really like a database server, in that you can run it in-process on any machine, with no installation whatsoever.
I strongly recommend against XML, given the requirement of millions of steps, potentially with millions of parameters.
EDIT: Given the sheer amount of data involved, SQLite may well end up being too slow for you. Don't get me wrong, SQLite is very fast, but it won't beat seeks & reads, and it looks like your use case is such that basic binary IO is rather appropriate.
If you go with the binary IO method you should expect some moderately involved coding, and the absence of such niceties as your file staying in a consistent state if the application dies halfway through (unless you code this specifically that is).
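If you do take the raw binary route, here is a minimal sketch of the idea, assuming fixed-width records of doubles so any time step can be located with a single seek (names and layout are assumptions):

using System;
using System.IO;

// Sketch of plain binary IO: each time step is a fixed-width record of doubles,
// so the offset of any step is just stepIndex * recordSize.
class BinaryTrace
{
    private readonly string _path;
    private readonly int _parameterCount;

    public BinaryTrace(string path, int parameterCount)
    {
        _path = path;
        _parameterCount = parameterCount;
    }

    // In a real simulation you would keep the writer open rather than reopen per step.
    public void AppendStep(double[] values)
    {
        using (var stream = new FileStream(_path, FileMode.Append, FileAccess.Write))
        using (var writer = new BinaryWriter(stream))
        {
            for (int i = 0; i < _parameterCount; i++)
                writer.Write(values[i]);
        }
    }

    public double[] ReadStep(long stepIndex)
    {
        var values = new double[_parameterCount];
        using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(stream))
        {
            // Seek straight to the record; no parsing or index needed.
            stream.Seek(stepIndex * _parameterCount * sizeof(double), SeekOrigin.Begin);
            for (int i = 0; i < _parameterCount; i++)
                values[i] = reader.ReadDouble();
        }
        return values;
    }
}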
KISS -- just write a logfile for each entity, and at each time slice write out every parameter in a specified order (so you don't double the size of the logfile by repeating parameter names). You can have a header in each logfile if you want to specify the parameter names of each column and the identity of the entity.
If there are many parameter values that will remain fixed or slowly changing during the course of the simulation, you can write these out to another file that encodes only changes to parameter values rather than every value at each time slice.
You should probably synchronize the logging so that each log entry is written out with the same time value. Rather than coordinate through a central file, just make the first value in each line of the file the time value.
Forget about database - too slow and too much overhead for simulation replay purposes. For replaying of a simulation, you will simply need sequential access to each time slice, which is most efficiently and fastest implemented by simply reading in the lines of the files one by one.
For the same reason - speed and space efficiency - forget XML.
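As a rough sketch of the logfile layout described above (the header format, delimiter, and class name are assumptions):

using System.Globalization;
using System.IO;
using System.Linq;

// Sketch: one text log per entity, time value first, then the parameters in a
// fixed order declared once in the header.
class EntityLog
{
    private readonly StreamWriter _writer;

    public EntityLog(string path, string entityId, string[] parameterNames)
    {
        _writer = new StreamWriter(path);
        // Header: entity identity plus the column order, written exactly once.
        _writer.WriteLine("# entity=" + entityId + " columns=time," + string.Join(",", parameterNames));
    }

    public void WriteSlice(double time, double[] values)
    {
        // Time first, then every parameter in the agreed order; names are never repeated.
        _writer.WriteLine(time.ToString(CultureInfo.InvariantCulture) + "," +
            string.Join(",", values.Select(v => v.ToString(CultureInfo.InvariantCulture))));
    }

    public void Close() { _writer.Dispose(); }
}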
Just for the memory part...
1. You can save the data as an XElement (sorry for not knowing much about LINQ), which holds the XML logic for you.
2. Hold a record counter.
After N records, save the XElement to an XML file (data1.xml, ..., dataN.xml).
It can be a perfect log for any parameter you have, with any logic you like:
<run>
<step id="1">
<param1 />
<param2 />
<param3 />
</step>
.
.
.
<step id="N">
<param1 />
<param2 />
<param3 />
</step>
</run>
This way your memory stays free and the data remains easy to work with.
You don't have to think too much about DB issues, and it's pretty amazing what LINQ can do for you... just open the current XML log file...
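A small sketch of that rolling scheme, assuming a flush every N records and numbered file names:

using System.Xml.Linq;

// Sketch: accumulate steps in an in-memory XElement and flush it to a new
// numbered file every N records so memory stays bounded.
class XmlStepLog
{
    private const int RecordsPerFile = 10000;   // N, an assumption

    private XElement _run = new XElement("run");
    private int _recordCount = 0;
    private int _fileIndex = 1;

    public void LogStep(int stepId, params XElement[] parameters)
    {
        _run.Add(new XElement("step", new XAttribute("id", stepId), parameters));

        if (++_recordCount >= RecordsPerFile)
        {
            _run.Save("data" + _fileIndex++ + ".xml");
            _run = new XElement("run");   // start a fresh document, freeing memory
            _recordCount = 0;
        }
    }
}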
Here is what I am doing now:
int lastTotal = 0;   // running total of bytes received at the previous tick

private void timer1_Tick(object sender, EventArgs e)
{
    // Work out how many bytes arrived since the last tick without losing the running total.
    int total = Convert.ToInt32(lblBytesReceived.Text);
    int bw = total - lastTotal;
    lastTotal = total;

    // SQL Server parameters use @, not #; parameterize both values.
    SqlCommand comnd = new SqlCommand(
        "insert into tablee (bandwidthh, timee) values (@bw, @timee)", conn);
    comnd.Parameters.Add("@bw", System.Data.SqlDbType.Int).Value = bw;
    comnd.Parameters.Add("@timee", System.Data.SqlDbType.Time).Value = DateTime.Now.TimeOfDay;
    conn.Open();
    comnd.ExecuteNonQuery();
    conn.Close();
}
I'm currently using SSIS to make an improvement on a project. I need to insert single documents into a MongoDB collection of type Time Series. At some point I want to retrieve rows of data after going through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
    // Add one output row per document and copy its "data" field into the DatalineX column.
    OutputBuffer.AddRow();
    OutputBuffer.DatalineX = (string)bson.GetValue("data");
}
But this piece of code, which works great with a small file, does not work with a 6 million line file. That is, there are no rows in the output. The following tasks validate, but react as if they had received nothing as input.
Where could the problem come from?
Your OutputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR, with a specific length. When you exceed that length, things go bad. Normal string columns have a maximum length of 8k or 4k characters respectively.
Neither of which is useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT. Those data types do not require a length as they are "max" types. There are lots of things to be aware of when using the LOB types:
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// Convert the string to bytes first (the encoding here is an assumption:
// Unicode for DT_NTEXT, or your code page for DT_TEXT).
byte[] bytes = System.Text.Encoding.Unicode.GetBytes((string)bson.GetValue("data"));
Output0Buffer.DatalineX.AddBlobData(bytes);
Longer example of questionable accuracy with regard to encoding the bytes that you get to solve at https://stackoverflow.com/a/74902194/181965
I want to implement a C# class (.NET 4.6.1) which contains time series data as follows:
the timeseries is a collection keyed on DateTime, each with an associated value (e.g. an array of doubles)
the points will be added strictly in time order
there will be a rolling time period - e.g. 1 hour - when adding a new point, any points at the start older than this period will be removed
the key issues for performance will be quickly adding new points, and finding the data point for a particular time (a binary search or better is essential). There will be 100k's of points in the collection sometimes.
it has to be thread safe
it has to be relatively memory efficient - e.g. it can't just keep all the points from the beginning of time - as time moves on the memory footprint has to be fairly stable.
So what would be a good approach in terms of underlying collection classes? Using a List will be slow, as the rolling period means we will need to remove from the start of the List a lot. A Queue or LinkedList would solve that, but I don't think they provide a fast way to access the nth item in the collection (ElementAt just iterates, so it is very slow).
The only solution I can think of involves storing the data twice: once in a collection that's easy to remove from the start of, and again in one that's easy to search in (with some awful background process to prune the stale points from the search collection somehow).
Is there a better way to approach this?
Thanks.
When I first saw the question I immediately thought of a queue, but most built-in queues do not efficiently allow indexed access, as you've found.
Best suggestion I can come up with is to use a ConcurrentDictionary. Thread-safe, near-constant access time by key, you can key directly on DateTimes, etc. It's everything you need functionally, except the behavior to manage size/timeline. To do that, use a little LINQ to scan the Keys property for keys more than one hour older than the newest one being added, then TryRemove() each one (you can wrap ConcurrentDictionary, or shadow TryAdd(), to do this automatically whenever anything new is added).
Only other potential issue is that it's not terribly memory-efficient; the hash-based implementation behind .NET dictionaries allocates bucket and entry arrays sized from the hash of the key, and those arrays will be sparsely populated even with 100k items in the collection.
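A rough sketch of that approach, wrapping ConcurrentDictionary and pruning on each add; the window length and type names are assumptions:

using System;
using System.Collections.Concurrent;
using System.Linq;

// Sketch: thread-safe time series keyed on DateTime, with entries older than
// the rolling window pruned each time a new point is added.
class RollingTimeSeries
{
    private readonly ConcurrentDictionary<DateTime, double[]> _points =
        new ConcurrentDictionary<DateTime, double[]>();
    private readonly TimeSpan _window;

    public RollingTimeSeries(TimeSpan window) { _window = window; }

    public void Add(DateTime time, double[] values)
    {
        _points[time] = values;

        // Prune everything older than the window, measured from the newest point.
        DateTime cutoff = time - _window;
        foreach (DateTime stale in _points.Keys.Where(k => k < cutoff).ToList())
        {
            double[] removed;
            _points.TryRemove(stale, out removed);
        }
    }

    public bool TryGet(DateTime time, out double[] values)
    {
        return _points.TryGetValue(time, out values);
    }
}

Note the prune is a linear scan of the keys, so for very large windows a time-ordered structure (e.g. a circular buffer guarded by a lock) may be worth measuring against this.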
I am using an ObservableCollection to store information on my CPU usage and transferring this information to a line chart. The information is updated every second. It is working fine, but I realize that this is going to jam up my memory over time because it just keeps adding information to the list.
What is the norm in this situation? Do you reset the list after every minute? I feel that would mess up how the chart looks every time it resets. Please advise how I could manage this memory issue swiftly. Thanks.
ObservableCollection<KeyValuePair<double, double>> chart1 = new ObservableCollection<KeyValuePair<double, double>>();
// The key type is double, so the timestamp must be converted (ToOADate is one option).
chart1.Add(new KeyValuePair<double, double>(DateTime.Now.ToOADate(), getCurrentCpuUsage()));
What exactly you should do depends on your requirements.
If you only need to retain a certain amount of data (e.g. 10 minutes or whatever), a bounded queue may be more appropriate than an ObservableCollection. That way, events that are "too old" automatically fall out of the data structure, allowing you to cap memory usage.
If you still want to be able to access older data in the future, you could write the data coming out of the end of the queue to a file or database instead of just dropping it.
For one implementation of a bounded queue see
Limit size of Queue<T> in .NET?
Since you may need an observable bounded queue, here are some notes on how to implement one (fairly straightforward)
Observable Stack and Queue
Bounded Queue Explained
A regular queue is like the line at a checkout stand. People line up (or as the British would say queue up) at the end of the line, and the cashier takes the person at the front of the line. First in, first out. FIFO.
A bounded queue sets a maximum length for the line. For a real-life line, new people would be prevented from joining the line if it is too long. Some bounded queues in software work that way too. The other option to keep the queue length from exceeding a limit is to remove from the front of the queue when the line is too long. In real life that's probably not very fair, but for software algorithms that's sometimes just what you need.
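A minimal sketch of that second kind of bounded queue (drop from the front when the limit is exceeded); the class name and capacity handling are assumptions:

using System.Collections.Generic;

// Sketch: a queue that never grows past a fixed capacity; when full, the oldest
// item is dropped (or could be written to a file) before the new one is kept.
class BoundedQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly int _capacity;

    public BoundedQueue(int capacity) { _capacity = capacity; }

    public void Enqueue(T item)
    {
        _items.Enqueue(item);
        while (_items.Count > _capacity)
            _items.Dequeue();   // oldest entries fall out automatically
    }

    public IEnumerable<T> Items { get { return _items; } }
}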
Store as much information as you need and drop (or save to file) the rest.
If you need to store data for very long time spans (more than, say, 1M data points) you may try compressing the data. But figure out your goals and measure processing time/memory usage first.
This question is related to the discussion here but can also be treated stand-alone.
Also, I think it would be nice to have the relevant results in one separate thread; I couldn't find anything on the web that deals with the topic comprehensively.
Let's say we need to work with a very, very large List<double> of data (e.g. 2 billion entries).
Having such a large list loaded into memory either leads to a System.OutOfMemoryException
or, if one uses <gcAllowVeryLargeObjects enabled="true" />, eventually just eats up the entire memory.
First: How to minimize the size of the container:
would an array of 2 billion doubles be less expensive?
use decimals instead of doubles?
is there a data type even less expensive than decimal in C#? (-100 to 100 in 0.00001 steps would do the job for me; see the sketch after this list)
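For what it's worth, the stated range of -100 to 100 in 0.00001 steps is only about 20 million distinct values, so a scaled 32-bit integer (4 bytes, versus 8 for double and 16 for decimal) would cover it. A rough illustration, where the scale factor is just the step size inverted:

// Fixed-point encoding: store value * 100000 in an int (4 bytes per entry).
// -100.00000 .. +100.00000 maps to -10,000,000 .. +10,000,000, well inside int range.
static int Encode(double value)  { return checked((int)System.Math.Round(value * 100000.0)); }
static double Decode(int stored) { return stored / 100000.0; }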
Second: Saving the data to disc and reading it
What would be the fastest way to save the List to disk and read it again?
Here I think the size of the file could be a problem. A txt file containing 2 billion entries will be huge - opening it could turn out to take years. (Perhaps some type of stream between the program and the txt file would do the job?) Some sample code would be much appreciated :)
Third: Iteration
if the List is in memory I think using yield would make sense.
if the List is saved to disk the speed of iteration is mostly limited by the speed at which we can read it from the disk.
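As a hedged sketch of the "stream between the program and the file" idea: writing the values as raw 8-byte binary doubles rather than text, and iterating them back lazily with yield so memory use stays flat (file layout and buffer size are assumptions):

using System.Collections.Generic;
using System.IO;

static class DoubleFile
{
    // Write the values as 8-byte binary doubles; far smaller and faster to parse than text.
    public static void Save(string path, IEnumerable<double> values)
    {
        using (var writer = new BinaryWriter(
            new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None, 1 << 20)))
        {
            foreach (double v in values)
                writer.Write(v);
        }
    }

    // Stream the values back one at a time; memory use stays constant
    // no matter how many entries the file holds.
    public static IEnumerable<double> Read(string path)
    {
        using (var reader = new BinaryReader(
            new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1 << 20)))
        {
            long count = reader.BaseStream.Length / sizeof(double);
            for (long i = 0; i < count; i++)
                yield return reader.ReadDouble();
        }
    }
}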
I need to process an XML file with the following structure:
<FolderSizes>
<Version></Version>
<DateTime Un=""></DateTime>
<Summary>
<TotalSize Bytes=""></TotalSize>
<TotalAllocated Bytes=""></TotalAllocated>
<TotalAvgFileSize Bytes=""></TotalAvgFileSize>
<TotalFolders Un=""></TotalFolders>
<TotalFiles Un=""></TotalFiles>
</Summary>
<DiskSpaceInfo>
<Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used=""
UsedBytes=""><![CDATA[ ]]></Drive>
</DiskSpaceInfo>
<Folder ScanState="">
<FullPath Name=""><![CDATA[ ]]></FullPath>
<Attribs Int=""></Attribs>
<Size Bytes=""></Size>
<Allocated Bytes=""></Allocated>
<AvgFileSz Bytes=""></AvgFileSz>
<Folders Un=""></Folders>
<Files Un=""></Files>
<Depth Un=""></Depth>
<Created Un=""></Created>
<Accessed Un=""></Accessed>
<LastMod Un=""></LastMod>
<CreatedCalc Un=""></CreatedCalc>
<AccessedCalc Un=""></AccessedCalc>
<LastModCalc Un=""></LastModCalc>
<Perc><![CDATA[ ]]></Perc>
<Owner><![CDATA[ ]]></Owner>
<!-- Special element; see paragraph below -->
<Folder></Folder>
</Folder>
</FolderSizes>
The <Folder> element is special in that it repeats within the <FolderSizes> element but can also appear within itself; I reckon up to about 5 levels.
The problem is that the file is really big at a whopping 11GB so I'm having difficulty processing it - I have experience with XML documents, but nothing on this scale.
What I would like to do is to import the information into a SQL database because then I will be able to process the information in any way necessary without having to concern myself with this immense, impractical file.
Here are the things I have tried:
Simply load the file and attempt to process it with a simple C# program using an XmlDocument or XDocument object
Before I even started I knew this would not work, as I'm sure everyone would agree, but I tried it anyway, and ran the application on a VM (since my notebook only has 4GB RAM) with 30GB memory. The application ended up using 24GB memory, and taking very, very long, so I just cancelled it.
Attempt to process the file using an XmlReader object
This approach worked better in that it didn't use as much memory, but I still had a few problems:
It was taking really long because I was reading the file one line at a time.
Processing the file one line at a time makes it difficult to really work with the data contained in the XML, because now you have to detect the start of a tag, then the end of that tag (hopefully), then create a document from that information, read the info, and attempt to determine which parent tag it belongs to because we have multiple levels... Sounds prone to problems and errors.
Did I mention it takes really long reading the file one line at a time? And that's still without actually processing the line - literally just reading it.
Import the information using SQL Server
I created a stored procedure using XQuery and ran it recursively within itself, processing the <Folder> elements. This went quite well - I think better than the other two approaches - until one of the <Folder> elements ended up being rather big, producing an "An XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." error. I read up about it and I don't think it's an adjustable limit.
Here are more things I think I should try:
Re-write my C# application to use unmanaged code
I don't have much experience with unmanaged code, so I'm not sure how well it will work and how to make it as unmanaged as possible.
I once wrote a little application that works with my webcam, receiving the image, inverting the colours, and painting it to a panel. Using normal managed code didn't work - the result was about 2 frames per second. Re-writing the colour inversion method to use unmanaged code solved the problem. That's why I thought that unmanaged might be a solution.
Rather go for C++ instead of C#
Not sure if this is really a solution. Would it necessarily be better than C#? Better than unmanaged C#?
The problem here is that I haven't actually worked with C++ before, so I'll need to get to know a few things about C++ before I can really start working with it, and then probably not very efficiently yet.
I thought I'd ask for some advice before I go any further, possibly wasting my time.
Thanks in advance for your time and assistance.
EDIT
So before I start processing the file I run through it and check the size, in an attempt to provide the user with feedback as to how long the processing might take; I made a screenshot of the calculation:
That's about 1,500 lines per second; if the average line length is about 50 characters, that's 50 bytes per line, or roughly 75 kilobytes per second, so an 11GB file should take about 40 hours, if my maths is correct. But this is only stepping through each line. It's not actually processing the line or doing anything with it, so when that starts, the processing rate drops significantly.
This is the method that runs during the size calculation:
private int _totalLines = 0;
private bool _cancel = false; // set to true when the cancel button is clicked

private void CalculateFileSize()
{
    xmlStream = new StreamReader(_filePath);
    xmlReader = new XmlTextReader(xmlStream);

    // Step through every node purely to count lines; nothing is processed here.
    while (xmlReader.Read())
    {
        if (_cancel)
            return;

        if (xmlReader.LineNumber > _totalLines)
            _totalLines = xmlReader.LineNumber;

        // Note: the UI labels are updated on every single node read.
        InterThreadHelper.ChangeText(
            lblLinesRemaining,
            string.Format("{0} lines", _totalLines));

        string elapsed = string.Format(
            "{0}:{1}:{2}:{3}",
            timer.Elapsed.Days.ToString().PadLeft(2, '0'),
            timer.Elapsed.Hours.ToString().PadLeft(2, '0'),
            timer.Elapsed.Minutes.ToString().PadLeft(2, '0'),
            timer.Elapsed.Seconds.ToString().PadLeft(2, '0'));
        InterThreadHelper.ChangeText(lblElapsed, elapsed);

        if (_cancel)
            return;
    }

    xmlStream.Dispose();
}
Still running, 27 minutes in :(
You can read XML as a logical stream of elements instead of trying to read it line by line and piecing it back together yourself. See the code sample at the end of this article.
Also, your question has already been asked here.
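For reference, a hedged sketch of what that element-by-element streaming can look like for this file, reading one top-level <Folder> subtree at a time so only that subtree is ever in memory (the class and method names are assumptions; note a single very large subtree will still be materialized in full):

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class FolderSizesReader
{
    // Yields each top-level <Folder> element as an XElement without ever loading
    // the whole file; nested <Folder>s come along inside their parent subtree.
    public static IEnumerable<XElement> ReadFolders(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();   // positions the reader on <FolderSizes>
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Folder")
                {
                    // ReadFrom consumes the whole subtree and leaves the reader just past it.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

Each yielded XElement can then be shredded into rows and pushed into SQL Server (for example with SqlBulkCopy) without the rest of the 11GB file ever being loaded.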