Writing to a file asynchronously, but in order

I've got some code which saves data from an object to XML. This locked the UI for a few seconds so I made it so it wouldn't.
foreach (Path path in m_canvasCompact.Children)
{
Task.Run(() => WritePathDataToXML(false, path));
}
private void WritePathDataToXML(bool is32x32, Path path)
{
//stuff going on...
xmlDoc.Root.Descendants.......Add(iconToAdd);
xmlDoc.Save(..);
}
The problem is (as expected) that the data ends up in the XML in a random order, depending (I assume) on the speed at which the tasks finish.
I could probably write some bodged code that looks at the XML and rearranges it once everything has completed, but that's not ideal. Is there any way to do this on a separate thread, but perhaps only one at a time, so the items get executed and saved in the correct order?
Thanks.

It sounds like you want a producer/consumer queue. You can rig that up fairly easily using BlockingCollection<T>.
Create the blocking collection
Start a task which will read from the collection until it's "finished" (simplest with GetConsumingEnumerable), writing to the file
Add all the relevant items to the collection - making sure you do everything that touches UI elements within the UI thread.
Tell the collection it's "finished" (CompleteAdding)
Alternatively, as suggested in comments:
In the UI thread, create a collection with all the information you need from UI elements - basically you don't want to touch the UI elements within a non-UI thread.
Start a task to write that collection to disk; optionally await that task (which won't block the UI)
That's simpler, but it does mean building up the whole collection in memory before you start writing. With the first approach, you can add to the collection as you write - although it's entirely possible that if building the collection is much faster than writing to disk, you'll end up with the whole thing in memory anyway. If this is infeasible, you'll need some way of adding "gently" from the UI thread, without blocking it. It would be nice if BlockingCollection had an AddAsync method, but I can't see one.
We don't know enough about what you're doing with the Path elements to give you sample code for this, but hopefully that's enough of a starting point.
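The first approach above could be sketched roughly as follows. This is a minimal illustration under assumptions, not the asker's code: the strings in items stand in for whatever data has already been extracted from each Path on the UI thread, and the write callback is a hypothetical placeholder for the real XML update.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class OrderedWriterSketch
{
    // "items" stands in for data already extracted from the UI elements on the UI thread.
    // "write" is a hypothetical placeholder for the real XML update.
    public static async Task WriteInOrderAsync(IEnumerable<string> items, Action<string> write)
    {
        using (var queue = new BlockingCollection<string>())
        {
            // Single consumer: items are taken one at a time, in the order they were added
            // (the default backing store is a FIFO ConcurrentQueue<T>).
            Task consumer = Task.Run(() =>
            {
                foreach (string item in queue.GetConsumingEnumerable())
                {
                    write(item);
                }
            });

            foreach (string item in items)
            {
                queue.Add(item);
            }

            queue.CompleteAdding(); // tell the consumer that no more items are coming
            await consumer;         // awaiting does not block the calling (UI) thread
        }
    }
}
```

Because there is only one consumer task, the writes happen strictly in the order the items were added, while the UI thread stays free.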

Run the whole loop in a Task:
Task.Run(()=>{
foreach (Path path in m_canvasCompact.Children)
{
WritePathDataToXML(false, path);
}
});
This will still take the same time, but should not block the UI.

Related

C# Safely using LINQ across threads

I have a program that is constantly reading and parsing a large stream of data from a WebSocket. All of the parsing happens on one thread within the client, and the data is organized into a SortedSet<T> tree for fast operation.
All of the data is added, updated, and removed without a hitch.
The problem comes when I try to access the data from another thread. It will run fine, but somewhere along the line there is a race condition that gets hit within a minute or two.
Consider this code (running on its own thread) to update the UI in near real-time:
private async Task RenderOrderBook()
{
var book = _client.OrderBook;
while (true)
{
try
{
var asks = book.Asks.OrderBy(i => i.Price).Take(5).OrderByDescending(i => i.Price);
var bids = book.Bids.OrderByDescending(i => i.Price).Take(5);
orderBookView.BeginInvoke(new MethodInvoker(() =>
{
// ...omitted due to irrelevance
}));
await Task.Delay(500);
}
catch (Exception ex)
{
ex.ToString();
}
}
}
The race condition lies within the LINQ operations on book. The common error is that i.Price (a decimal variable), or perhaps just the object i is referring to, is null. Additionally, my shoddy attempt to just swallow the exception does not actually work.
Regardless, my guess is that the data is being parsed and manipulated so fast that eventually, when using the LINQ OrderBy operation, it will hit a case where a node has been removed by the client, attempt to read from it, and throw an exception.
The book.Asks and book.Bids properties were initially of type SortedSet<T> and pointed directly to the data member itself. In an attempt to mitigate this race condition scenario, I attempted to change them to an array of the node, and use a _asks.ToArray() call to essentially make a copy to read from. This helped make the problem occur a bit less frequently, but nonetheless it still does happen.
How can I make this thread-safe?
Additional Code Snippets
public PriceNode[] Asks
{
get { return _asks.ToArray(); }
}
public PriceNode[] Bids
{
get { return _bids.ToArray(); }
}

My first rule of UI development is that you never perform I/O on the UI thread. Sounds like you've got that one covered.
My second rule is that once something is visible to the UI thread, you can't touch it from any other thread. There is exactly one exception to this rule, and that is for immutable data: if an object will not change, then any thread can touch it. Mutable data? No touch. Keep in mind that "mutable data" includes most collections.
Your life will be so much easier if you can follow these two rules. Following one without breaking the other can be tricky, but there are ways to do it, and once you have a decent grip of them, you'll be in a better place. The path to enlightenment begins here:
Your read thread (the thread reading off the socket) is allowed to create all the new objects it wants, but it can't update existing objects. It also can't modify any collections that the UI thread is using. If you're only adding new objects, this isn't so bad: your read thread can pull data off the socket and use it to cook up new objects. When those objects are ready, it has to hand them over to the UI thread, and the UI thread can add them to the relevant collections. The bulk of the work (and all of the I/O) happens on the read thread, which is what we want, per Strobel's Rule #1. The act of "committing" the already-populated objects should be trivial by comparison. Per Rule #2, once any mutable objects get handed off to the UI thread, your read thread can't touch them again. Ever.
Updating existing objects is trickier. There's a couple ways you can approach this. One is to have the read thread use the latest data to create new objects, which it then hands off to the UI thread. If you have very simple object graphs, the easiest option might be to simply replace the old objects with their newer versions, keeping in mind that any UI code referencing an old object will need to know that it's been replaced. Alternatively, the UI thread can use the data from the new object to update the existing object. If you're following Rule #2, this will be totally thread-safe, and any UI code that pointed to the old object automatically sees the new data without any torn reads or other race-related nastiness. This approach is probably your best bet.
If, after trying out the approaches in the previous paragraph, you find that you are generating unacceptable amounts of garbage, there is a third option. The read thread can copy the raw data for each object into a temporary buffer, then hand the buffers over to the UI thread, which can use the data in the buffers to update the existing objects. This means more work occurring on the UI thread, but at least the data is already in memory (the socket I/O is already done). Since the point of this approach is to create less garbage, it only makes sense if you reuse the buffers. That means you need a thread-safe buffer pool. The read thread acquires a temporary buffer, fills it from the socket, hands it to the UI thread, which returns it to the pool when it's done. Astute readers will note that passing mutable buffers between threads bumps up against Rule #2, so take care that once a thread hands over a buffer, it immediately forgets about it. Because this approach requires a stronger grasp of thread safety to make the pool work, I recommend it only as a last resort. If you can get away with one of the options in the previous paragraph, please do so.
Regardless of which approach you use for updating existing objects, you'll need a way to match up the new objects/data with the old objects. If each object has a unique identifier, you can use a Dictionary<,> as an efficient lookup mechanism. Replacing old objects with their newer copies is a bit more involved, because the old versions may be scattered across multiple collections, some of which may not support efficient replacement.
One last thing: when you hand over new/updated objects to the UI thread, it is vastly preferable to do it in batches. For example, you're better off posting a single operation to your UI thread to update 100 objects than posting 100 separate operations that each update one object.
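As one possible illustration of the immutable-data exception to Rule #2, applied to the order book from the question: the parsing thread keeps its mutable collection private and publishes a fresh, never-modified snapshot that any other thread may read. PriceLevel and BookSnapshotPublisher are hypothetical names, and the sketch assumes exactly one parsing thread calls ApplyAsk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Immutable: all state is set in the constructor and never changes afterwards.
public sealed class PriceLevel
{
    public PriceLevel(decimal price, decimal size) { Price = price; Size = size; }
    public decimal Price { get; }
    public decimal Size { get; }
}

public sealed class BookSnapshotPublisher
{
    // Mutable state: touched only by the parsing thread.
    private readonly SortedDictionary<decimal, decimal> _asks = new SortedDictionary<decimal, decimal>();

    // Latest published snapshot; the array and its elements are never modified after publication.
    private PriceLevel[] _topAsks = Array.Empty<PriceLevel>();

    // Call on the parsing thread only. A size of zero removes the level.
    public void ApplyAsk(decimal price, decimal size)
    {
        if (size == 0m) _asks.Remove(price);
        else _asks[price] = size;

        // Build a new snapshot of the five lowest asks, then publish it by swapping the reference.
        PriceLevel[] snapshot = _asks.Take(5)
                                     .Select(kv => new PriceLevel(kv.Key, kv.Value))
                                     .ToArray();
        Volatile.Write(ref _topAsks, snapshot);
    }

    // Safe to read from any thread: callers always get a complete snapshot.
    public IReadOnlyList<PriceLevel> TopAsks => Volatile.Read(ref _topAsks);
}
```

Rebuilding the snapshot on every update creates garbage, so in a busy feed it may be preferable to publish on a timer or in batches, in line with the batching advice above.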

Coroutine to run SQL in Unity3d

I'm trying to understand Unity coroutines more deeply. They can suspend execution by yielding null or a wait instruction, but they are not really new threads.
In my case Unity should read info from a database, which can take some time, and this is a synchronous operation. This single line of code may potentially block execution for seconds.
Part of me says to just start a new thread. But I'm wondering whether it can be achieved with Unity-style coroutines.
private IEnumerator FetchPlayerInfo(int id)
{
// fetch player from DB
using (var session...)
{
// this line is NHibernate SQL query, may take long time
player = session.QueryOver<Player>()...;
}
// raise event to notify listeners that player info is fetched
}
I just don't see where to put yield. Does anyone know?
Thx in advance.

You can only return yield instructions when the control flow is in your own coroutine. If you have to run a long synchronous operation, and it has no asynchronous API (which is something to be expected from a database), then you're really better off starting another thread.
However, be aware that using other threads in Unity is a little bit tricky: you won't be able to use any of Unity's API from them, and you'll have to check in the main thread for when the worker thread has a result ready. Consider looking at ready-made solutions, such as Loom.
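A rough sketch of that pattern, with the blocking call moved to a worker thread and a coroutine polling for the result on the main thread. It assumes a Unity version whose scripting runtime supports System.Threading.Tasks; LoadPlayerBlocking is a hypothetical stand-in for the NHibernate query, and the player is represented by a string for brevity.

```csharp
using System;
using System.Collections;
using System.Threading;
using System.Threading.Tasks;

public static class BackgroundFetchSketch
{
    // Hypothetical stand-in for the blocking database query; must not touch Unity's API.
    private static string LoadPlayerBlocking(int id)
    {
        Thread.Sleep(50); // simulate a slow synchronous call
        return "player-" + id;
    }

    // Start with StartCoroutine(FetchPlayerInfo(id, callback)) from a MonoBehaviour.
    public static IEnumerator FetchPlayerInfo(int id, Action<string> onFetched)
    {
        // Run the blocking work on a thread-pool thread.
        Task<string> work = Task.Run(() => LoadPlayerBlocking(id));

        // Poll from the coroutine; in Unity, "yield return null" resumes on the next frame.
        while (!work.IsCompleted)
        {
            yield return null;
        }

        // Back on the thread that drives the coroutine (the main thread in Unity).
        // Accessing Result rethrows, wrapped in an AggregateException, if the query failed.
        onFetched(work.Result);
    }
}
```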

You can think of yielding as breaking a large task into chunks. After each chunk you yield to let other things happen, then come back and do another chunk. In your case you would load X rows per chunk until all your data is loaded.

C# Threading without locking Producer or Consumer

TL;DR version of the main questions:
While working with threads, is it safe for one thread to read a list's contents while another writes to it, as long as you never delete list contents (or reorganize the order) and only read a new object after it has been fully added?
While an Int is being updated from "Old Value" to "New Value" by one thread, is there a risk that another thread reading this Int gets a value that is neither "Old Value" nor "New Value"?
Is it possible for a thread to "skip" a critical region if it's busy, instead of just going to sleep and waiting for the region's release?
I have 2 pieces of code running in separate threads and I want one to act as a producer for the other. I do not want either thread "sleeping" while waiting for access; instead each should skip forward in its own code if the other thread is accessing the shared data.
My original plan was to share the data via this approach (and once the counter got high enough, switch to a secondary list to avoid overflows).
Pseudocode of the flow as I originally intended it:
Producer
{
Int counterProducer;
bufferedObject newlyProducedObject;
List <buffered_Object> objectsProducer;
while(true)
{
<Do stuff until a new product is created and added to newlyProducedObject>;
objectsProducer.add(newlyProducedObject);
counterProducer++
}
}
Consumer
{
Int counterConsumer;
Producer objectProducer; (contains reference to Producer class)
List <buffered_Object> personalQueue
while(true)
<Do useful work, such as working on personal queue, and polish nails if no personal queue>
//get all outstanding requests and move to personal queue
while (counterConsumer < objectProducer.GetcounterProducer())
{
personalQueue.add(objectProducer.GetItem(counterconsumer+1));
counterConsumer++;
}
}
Looking at this, everything looked fine at first glance. I knew I would not be retrieving a half-constructed product from the queue, so the state of the list should not be a problem even if a thread switch occurs while the Producer is adding a new object. Is this assumption correct, or can there be problems here? (My guess is that, since the consumer asks for a specific location in the list, new objects are added to the end, and objects are never deleted, this will not be a problem.)
But what caught my eye was this: could "counterProducer" be at an unknown value while "counterProducer++" is executing? Could the value temporarily be "null" or some unknown value? Will this be a potential issue?
My goal is to have neither of the two threads block while waiting for a mutex, but instead continue their loops, which is why I wrote the version above first, as there is no locking.
If the usage of the list will cause problems, my workaround will be to make a linked list implementation and share it between the two classes, still using the counters to see if new work has been added and keeping the last location while the consumer moves new items to its personal queue. So the producer adds new links, and the consumer reads them and deletes the previous ones. (No counter on the list, just external counters to know how much has been added and removed.)
Alternative pseudocode to avoid the counterConsumer++ risk (need help with this):
Producer
{
Int publicCounterProducer;
Int privateCounterProducer;
bufferedObject newlyProducedObject;
List <buffered_Object> objectsProducer;
while(true)
{
<Do stuff until a new product is created and added to newlyProducedObject>;
objectsProducer.add(newlyProducedObject);
privateCounterProducer++
<Need Help: Some code that updates the publicCounterProducer to the privateCounterProducer if that variable is not
locked, else skips ahead, and the counter will get updated at next pass, at some point the consumer must be done reading stuff, and
new stuff is prepared already>
}
}
Consumer
{
Int counterConsumer;
Producer objectProducer; (contains reference to Producer class)
List <buffered_Object> personalQueue
while(true)
<Do useful work, such as working on personal queue, and polish nails if no personal queue>
//get all outstanding requests and move to personal queue
<Need Help: tries to read the publicProducerCounter and set readProducerCounter to this, else skips this code>
while (counterConsumer < readProducerCounter)
{
personalQueue.add(objectProducer.GetItem(counterconsumer+1));
counterConsumer++;
}
}
So the goal in the 2nd part of the code, which I have not been able to figure out how to write, is to make neither class wait for the other while the other is in the "critical region" of updating publicCounterProducer. If I read the lock functionality correctly, the threads will go to sleep waiting for the release, which is not what I want. I might end up having to use it though, in which case the first pseudocode would do, with a "lock" around getting the value.
Hope you can help me out with my many questions.

No, it is not safe. A context switch can occur within .Add after List has added the object, but before List has updated the internal data structure.
If it is int32, or if it is int64 and you are running in an x64 process, then there is no risk. But if you have any doubts, use the Interlocked class.
Yes, you can use a Semaphore, and when it is time to enter the critical region, use the WaitOne overload that takes a timeout. Pass a timeout of 0. If WaitOne returns true, then you successfully acquired the lock and can enter. If it returns false, then you did not acquire the lock and should not enter.
You should really look at the System.Collections.Concurrent namespace. In particular, look at BlockingCollection. It has a bunch of Try* methods you can use to add/remove items from the collection without blocking.
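The zero-timeout WaitOne idea from the third point could look something like this. SkippableRegion and TryRun are hypothetical names chosen for the sketch.

```csharp
using System;
using System.Threading;

public sealed class SkippableRegion
{
    // One slot: at most one thread can be inside the region at a time.
    private readonly Semaphore _gate = new Semaphore(1, 1);

    // Runs "work" if the region is free and returns true;
    // returns false immediately (without sleeping) if another caller holds it.
    public bool TryRun(Action work)
    {
        if (!_gate.WaitOne(0)) // timeout of 0: never wait
        {
            return false;
        }

        try
        {
            work();
            return true;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

A caller that gets false simply carries on with its other work and tries again on its next pass through the loop.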

While working with threads, is it safe for one thread to read a list's contents while another writes to it, as long as you never delete list contents (or reorganize the order) and only read a new object after it has been fully added?
No, it is not. A side effect of adding an item to a list may be to reallocate its underlying array. List<T> gives no guarantee about the order in which another thread observes the new array reference, the copied elements and the updated count, so a concurrent reader may see a list of the correct size that is missing data.
While an Int is being updated from "Old Value" to "New Value" by one thread, is there a risk that another thread reading this Int gets a value that is neither "Old Value" nor "New Value"?
Nope, reads and writes of an int are atomic, so a reader sees either the old or the new value. But if two threads increment counterProducer at the same time, updates can be lost; use Interlocked.Increment() in that case.
Is it possible for a thread to "skip" a critical region if it's busy, instead of just going to sleep and waiting for the region's release?
No, but you can use (for example) WaitHandle.WaitOne(int) with a timeout to see whether a wait succeeded, and branch accordingly. WaitHandle is the base class of several synchronization classes, such as ManualResetEvent.
Incidentally, is there a reason you are not using the built-in Producer/Consumer classes such as BlockingCollection<T>? BlockingCollection is easy to use (after you read the documentation!) and I'd recommend using it instead.
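For the scenario in the question, a consumer that never waits for the producer might be sketched like this with BlockingCollection<T>. PollingConsumer and DrainPending are hypothetical names; the producer side just calls Add on the shared collection.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class PollingConsumer<T>
{
    private readonly BlockingCollection<T> _shared;
    private readonly Queue<T> _personalQueue = new Queue<T>();

    public PollingConsumer(BlockingCollection<T> shared)
    {
        _shared = shared;
    }

    public int PersonalCount => _personalQueue.Count;

    // Moves whatever is currently available into the personal queue and returns.
    // TryTake without a timeout does not wait for new items to arrive.
    public int DrainPending()
    {
        int moved = 0;
        while (_shared.TryTake(out T item))
        {
            _personalQueue.Enqueue(item);
            moved++;
        }
        return moved;
    }
}
```

This replaces both the shared list and the hand-rolled counters: the collection itself tracks what has and has not been consumed.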

multithread read and process large text files

I have 10 lists of emails, each over 100 MB, and I want to process them with multiple threads as fast as possible and without loading them into memory (something like reading line by line or reading small blocks).
I have created a function that removes invalid ones based on a regex, and another one that sorts them into other lists by domain.
I managed to do it using one thread with:
while (reader.Peek() != -1)
but it takes too damn long.
How can I use multiple threads (around 100 - 200), and maybe a BackgroundWorker or something, to be able to use the form while processing the lists in parallel?
I'm new to csharp :P

Unless the data is on multiple physical discs, chances are that any more than a few threads will slow down, rather than speed up, the process.
What'll happen is that rather than reading consecutive data (pretty fast), you'll end up seeking to one place to read data for one thread, then seeking to somewhere else to read data for another thread, and so on. Seeking is relatively slow, so it ends up slower -- often quite a lot slower.
About the best you can do is dedicate one thread to reading data from each physical disc, then another to process the data -- but unless your processing is quite complex, or you have a lot of fast hard drives, one thread for processing may be entirely adequate.
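A sketch of that arrangement for a single file, with one task reading sequentially and the calling thread doing the processing. The regular expression is a deliberately simple, hypothetical pattern, and EmailFileSketch and GroupByDomain are names invented for this example; substitute your own validation and grouping logic.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class EmailFileSketch
{
    // Hypothetical, simplified pattern; group 1 captures the domain.
    private static readonly Regex Email =
        new Regex(@"^[^@\s]+@([^@\s]+\.[^@\s]+)$", RegexOptions.Compiled);

    public static Dictionary<string, List<string>> GroupByDomain(string path)
    {
        // Bounded, so the reader cannot run far ahead of the processor and fill memory.
        using (var lines = new BlockingCollection<string>(boundedCapacity: 10000))
        {
            Task reader = Task.Run(() =>
            {
                try
                {
                    // File.ReadLines streams the file line by line.
                    foreach (string line in File.ReadLines(path))
                    {
                        lines.Add(line);
                    }
                }
                finally
                {
                    lines.CompleteAdding(); // always release the consumer, even on I/O errors
                }
            });

            var byDomain = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
            foreach (string line in lines.GetConsumingEnumerable())
            {
                Match m = Email.Match(line.Trim());
                if (!m.Success) continue; // drop invalid entries

                string domain = m.Groups[1].Value;
                if (!byDomain.TryGetValue(domain, out List<string> list))
                {
                    byDomain[domain] = list = new List<string>();
                }
                list.Add(m.Value);
            }

            reader.Wait(); // rethrows any read error as an AggregateException
            return byDomain;
        }
    }
}
```

GroupByDomain blocks its caller until the file is done, so to keep a form responsive it would itself be started with Task.Run (or from a BackgroundWorker) rather than called on the UI thread.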

There are multiple approaches to it:
1.) You can create threads explicitly, like Thread t = new Thread(), but creating and managing threads yourself is expensive.
2.) You can use the .NET ThreadPool and pass your function to the static QueueUserWorkItem method of the ThreadPool class. This approach needs some manual code management and synchronization primitives.
3.) You can create an array of System.Threading.Tasks.Task, each processing one list; the tasks are executed in parallel across the available processors on the machine, and you pass the array to Task.WaitAll(Task[]) to wait for their completion. This approach is known as task parallelism, and you can find detailed information on MSDN.
Task[] tasks = new Task[10];
for (int i = 0; i < tasks.Length; i++)
{
    int listIndex = i; // copy the loop variable so each lambda captures its own value
    // Start a task on a ThreadPool thread. ProcessList is a placeholder for your own function.
    tasks[i] = Task.Run(() => ProcessList(listIndex));
}
try
{
    // Wait for all tasks to complete
    Task.WaitAll(tasks);
}
catch (AggregateException ae)
{
    // Handle the aggregate exception here.
    // It is raised if one or more tasks throw; the exceptions from all faulted tasks are collected in ae.InnerExceptions.
}
// continue your processing further

You will want to take a look at the Task Parallel Library (TPL).
This library is made for parallel work. It will typically run your work on the thread pool in an efficient fashion. The only thing I would caution is that if you run 100-200 threads at one time, you may run into context-switching overhead, unless you have 100-200 processors. A good rule of thumb is to only run as many tasks in parallel as you have processors.
Some other good resources to review how to use the TPL:
Why and how to use the TPL
How to start a task.
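The rule of thumb above can be expressed with Parallel.ForEach and ParallelOptions. This is only a sketch: CountValid and the isValid predicate are hypothetical, and the lists are assumed to be already in memory for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedParallelSketch
{
    // Counts the items accepted by "isValid" across all lists,
    // processing at most one list per processor at a time.
    public static int CountValid(IEnumerable<IReadOnlyList<string>> lists, Func<string, bool> isValid)
    {
        int total = 0;
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(lists, options, list =>
        {
            // Count locally first so threads only contend once per list.
            int local = 0;
            foreach (string item in list)
            {
                if (isValid(item)) local++;
            }
            Interlocked.Add(ref total, local);
        });

        return total;
    }
}
```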

I would be inclined to use parallel linq (plinq).
Something along the lines of:
Lists.AsParallel()
.SelectMany(list => list)
.Where(MyItemFilteringFunction)
.GroupBy(DomainExtractionFunction)
AsParallel tells LINQ it can do this in parallel (which means the ordering of everything that follows will not be maintained).
SelectMany takes your individual lists and unrolls them so that all items from all lists are effectively in a single Enumerable.
Where filters the items using your predicate function.
GroupBy collects them by key, where DomainExtractionFunction is a function that gets a key (the domain name in your case) from each item (i.e., the email).
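A self-contained version of that query might look like the following. IsValidEmail and DomainOf are hypothetical, simplified stand-ins for the asker's regex filter and domain extraction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlinqSketch
{
    // Hypothetical, simplified check: exactly one '@', not at either end.
    private static bool IsValidEmail(string s)
    {
        int at = s.IndexOf('@');
        return at > 0 && at == s.LastIndexOf('@') && at < s.Length - 1;
    }

    // Everything after the '@', lower-cased so domains group case-insensitively.
    private static string DomainOf(string email)
    {
        return email.Substring(email.IndexOf('@') + 1).ToLowerInvariant();
    }

    public static Dictionary<string, List<string>> GroupByDomain(IEnumerable<IEnumerable<string>> lists)
    {
        return lists.AsParallel()
                    .SelectMany(list => list)
                    .Where(e => IsValidEmail(e))
                    .GroupBy(e => DomainOf(e))
                    .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

The inner sequences can be lazy (for example File.ReadLines per file), so nothing forces the input files to be loaded up front, although the grouped result itself is held in memory.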

Avoiding BinaryReader.ReadString() in C#?

Good morning,
At the startup of the application I am writing I need to read about 1,600,000 entries from a file to a Dictionary<Tuple<String, String>, Int32>. It is taking about 4-5 seconds to build the whole structure using a BinaryReader (using a FileReader takes about the same time). I profiled the code and found that the function doing the most work in this process is BinaryReader.ReadString(). Although this process needs to be run only once and at startup, I would like to make it as quick as possible. Is there any way I can avoid BinaryReader.ReadString() and make this process faster?
Thank you very much.

Are you sure that you absolutely have to do this before continuing?
I would examine the possibility of hiving off the task to a separate thread which sets a flag when finished. Then your startup code simply kicks off that thread and continues on its merry way, pausing only when both:
the flag is not yet set; and
no more work can be done without the data.
Often, the illusion of speed is good enough, as anyone who has coded up a splash screen will tell you.
Another possibility, if you control the data, is to store it in a more binary form so you can just blat it all in with one hit (i.e., no interpretation of the data, just read in the whole thing). That, of course, makes it harder to edit the data from outside your application but you haven't stated that as a requirement.
If it is a requirement or you don't control the data, I'd still look into my first suggestion above.
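The first suggestion can be approximated with the built-in Task<T>, which already carries the "finished" flag. BackgroundLoad is a hypothetical wrapper written for this sketch, and the load delegate stands in for the code that builds the dictionary.

```csharp
using System;
using System.Threading.Tasks;

public sealed class BackgroundLoad<T>
{
    private readonly Task<T> _task;

    // Starts loading immediately on a thread-pool thread.
    public BackgroundLoad(Func<T> load)
    {
        _task = Task.Run(load);
    }

    // The "flag": true once loading has finished (successfully or not).
    public bool IsReady => _task.IsCompleted;

    // Returns the data, waiting only if loading is still in progress.
    // Rethrows the original exception if loading failed.
    public T Value => _task.GetAwaiter().GetResult();
}
```

Startup code would create the BackgroundLoad first, carry on with everything that does not need the data, and touch Value only at the point where no more work can be done without it.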

If you think that reading the file line by line is the bottleneck, and depending on its size, you can try to read it all at once:
// read the entire file at once
string entireFile = System.IO.File.ReadAllText(path);
If this doesn't help, you can try to add a separate thread with a semaphore, which would start reading in the background immediately when the program is started, but block the requesting thread at the moment you try to access the data.
This is called a Future, and there is an implementation in Jon Skeet's MiscUtil library.
You call it like this at the app startup:
// following line invokes "DoTheActualWork" method on a background thread.
// DoTheActualWork returns an instance of MyData when it's done
Future<MyData> calculation = new Future<MyData>(() => DoTheActualWork(path));
And then, some time later, you can access the value in the main thread:
// following line blocks the calling thread until
// the background thread completes
MyData result = calculation.Value;
If you look at the Future's Value property, you can see that it blocks at the AsyncWaitHandle if the thread is still running:
public TResult Value
{
get
{
if (!IsCompleted)
{
_asyncResult.AsyncWaitHandle.WaitOne();
_lock.WaitOne();
}
return _value;
}
}

If strings are repeated inside the tuples, you could reorganize your file to have all the distinct strings at the start, and have references to those strings (integers) in the body of the file. Your main Dictionary does not have to change, but you would need a temporary lookup during startup that maps each reference (key) to its string (value).
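One possible layout for such a file, sketched with BinaryWriter and BinaryReader. The format (string count, strings, entry count, then three integers per entry) and the StringTableFormat name are assumptions made for illustration; because the references are dense integers, a plain array serves as the temporary lookup here. It only pays off if strings really do repeat.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StringTableFormat
{
    public static void Write(Stream output, Dictionary<Tuple<string, string>, int> data)
    {
        var ids = new Dictionary<string, int>();
        var table = new List<string>();

        // Assigns each distinct string the next free index.
        int IdOf(string s)
        {
            if (!ids.TryGetValue(s, out int id))
            {
                id = table.Count;
                ids[s] = id;
                table.Add(s);
            }
            return id;
        }

        var rows = new List<int[]>();
        foreach (var kv in data)
        {
            rows.Add(new[] { IdOf(kv.Key.Item1), IdOf(kv.Key.Item2), kv.Value });
        }

        using (var w = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(table.Count);
            foreach (string s in table) w.Write(s); // each distinct string is stored once
            w.Write(rows.Count);
            foreach (int[] r in rows) { w.Write(r[0]); w.Write(r[1]); w.Write(r[2]); }
        }
    }

    public static Dictionary<Tuple<string, string>, int> Read(Stream input)
    {
        using (var r = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            // ReadString is called once per distinct string, not twice per entry.
            var table = new string[r.ReadInt32()];
            for (int i = 0; i < table.Length; i++) table[i] = r.ReadString();

            int count = r.ReadInt32();
            var result = new Dictionary<Tuple<string, string>, int>(count);
            for (int i = 0; i < count; i++)
            {
                string first = table[r.ReadInt32()];
                string second = table[r.ReadInt32()];
                int value = r.ReadInt32();
                result.Add(Tuple.Create(first, second), value);
            }
            return result;
        }
    }
}
```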
