Avoiding BinaryReader.ReadString() in C#?

Good morning,
At the startup of the application I am writing I need to read about 1,600,000 entries from a file to a Dictionary<Tuple<String, String>, Int32>. It is taking about 4-5 seconds to build the whole structure using a BinaryReader (using a FileReader takes about the same time). I profiled the code and found that the function doing the most work in this process is BinaryReader.ReadString(). Although this process needs to be run only once and at startup, I would like to make it as quick as possible. Is there any way I can avoid BinaryReader.ReadString() and make this process faster?
Thank you very much.

Are you sure that you absolutely have to do this before continuing?
I would examine the possibility of hiving off the task to a separate thread which sets a flag when finished. Then your startup code simply kicks off that thread and continues on its merry way, pausing only when both:
the flag is not yet set; and
no more work can be done without the data.
Often, the illusion of speed is good enough, as anyone who has coded up a splash screen will tell you.
Another possibility, if you control the data, is to store it in a more binary form so you can just blat it all in with one hit (i.e., no interpretation of the data, just read in the whole thing). That, of course, makes it harder to edit the data from outside your application but you haven't stated that as a requirement.
If it is a requirement or you don't control the data, I'd still look into my first suggestion above.
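The "one hit" idea can be sketched as follows. This is only a sketch under assumptions that are not in the question: the file layout (an Int32 count, then two length-prefixed strings and an Int32 per entry, as BinaryWriter produces them) and the names BulkLoadDemo, Save and Load are hypothetical. It pulls the whole file into memory with a single File.ReadAllBytes call and parses from there; whether that helps depends on how much of the profiled time is I/O rather than string decoding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BulkLoadDemo
{
    // Hypothetical format: Int32 count, then (string, string, Int32) per entry.
    public static void Save(string path, Dictionary<Tuple<string, string>, int> data)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            writer.Write(data.Count);
            foreach (var pair in data)
            {
                writer.Write(pair.Key.Item1);
                writer.Write(pair.Key.Item2);
                writer.Write(pair.Value);
            }
        }
    }

    public static Dictionary<Tuple<string, string>, int> Load(string path)
    {
        // One read for the whole file; all parsing then happens in memory.
        byte[] bytes = File.ReadAllBytes(path);
        using (var reader = new BinaryReader(new MemoryStream(bytes)))
        {
            int count = reader.ReadInt32();
            // Pre-sizing avoids repeated resizing while the entries are added.
            var result = new Dictionary<Tuple<string, string>, int>(count);
            for (int i = 0; i < count; i++)
            {
                string first = reader.ReadString();
                string second = reader.ReadString();
                result.Add(Tuple.Create(first, second), reader.ReadInt32());
            }
            return result;
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        var data = new Dictionary<Tuple<string, string>, int>
        {
            { Tuple.Create("a", "b"), 1 },
            { Tuple.Create("c", "d"), 2 },
        };
        Save(path, data);
        Console.WriteLine(Load(path).Count); // prints 2
        File.Delete(path);
    }
}
```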

If you think that reading the file line by line is the bottleneck, and depending on its size, you can try to read it all at once:
// read the entire file at once
string entireFile = System.IO.File.ReadAllText(path);
If this doesn't help, you can try adding a separate thread with a semaphore, which starts reading in the background as soon as the program starts but blocks the requesting thread at the moment it tries to access the data.
This is called a Future, and you have an implementation in Jon Skeet's miscutil library.
You call it like this at the app startup:
// following line invokes "DoTheActualWork" method on a background thread.
// DoTheActualWork returns an instance of MyData when it's done
Future<MyData> calculation = new Future<MyData>(() => DoTheActualWork(path));
And then, some time later, you can access the value in the main thread:
// following line blocks the calling thread until
// the background thread completes
MyData result = calculation.Value;
If you look at the Future's Value property, you can see that it blocks at the AsyncWaitHandle if the thread is still running:
public TResult Value
{
    get
    {
        if (!IsCompleted)
        {
            _asyncResult.AsyncWaitHandle.WaitOne();
            _lock.WaitOne();
        }
        return _value;
    }
}
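On .NET 4.5 and later, a similar effect is available without an external library through Task<T>. The following is a minimal sketch and not MiscUtil's implementation; MyData, DoTheActualWork and the file name are placeholders carried over from the snippets above, and the sleep merely stands in for the slow load.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class MyData
{
    public int EntryCount;
}

public static class FutureDemo
{
    // Placeholder for the slow load; the sleep stands in for file parsing.
    public static MyData DoTheActualWork(string path)
    {
        Thread.Sleep(200);
        return new MyData { EntryCount = 1600000 };
    }

    public static void Main()
    {
        // Start loading on a thread-pool thread as soon as the app starts.
        Task<MyData> calculation = Task.Run(() => DoTheActualWork("data.bin"));

        // ... other startup work that does not need the data goes here ...

        // Result blocks the calling thread until the background work completes.
        MyData result = calculation.Result;
        Console.WriteLine(result.EntryCount); // prints 1600000
    }
}
```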

If strings are repeated inside the tuples, you could reorganize your file so that all the distinct strings appear once at the start, and the body of the file refers to them by integer index. Your main Dictionary does not have to change, but during startup you would need a temporary lookup from each reference (key) to its string (value).
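A string-table layout along those lines might look like the sketch below. The layout itself (table count, the distinct strings, entry count, then two Int32 indices and an Int32 value per entry) and the names StringTableDemo, Save and Load are assumptions made for illustration. On load, a plain array indexed by reference serves as the temporary lookup.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StringTableDemo
{
    public static void Save(Stream output, Dictionary<Tuple<string, string>, int> data)
    {
        // Temporary map from each distinct string to its index in the table.
        var indexOf = new Dictionary<string, int>();
        var table = new List<string>();
        foreach (var key in data.Keys)
        {
            foreach (var s in new[] { key.Item1, key.Item2 })
            {
                if (!indexOf.ContainsKey(s))
                {
                    indexOf[s] = table.Count;
                    table.Add(s);
                }
            }
        }

        var writer = new BinaryWriter(output);
        writer.Write(table.Count);
        foreach (var s in table) writer.Write(s);   // each distinct string once
        writer.Write(data.Count);
        foreach (var pair in data)
        {
            writer.Write(indexOf[pair.Key.Item1]);  // references, not strings
            writer.Write(indexOf[pair.Key.Item2]);
            writer.Write(pair.Value);
        }
        writer.Flush();
    }

    public static Dictionary<Tuple<string, string>, int> Load(Stream input)
    {
        var reader = new BinaryReader(input);
        var table = new string[reader.ReadInt32()];
        for (int i = 0; i < table.Length; i++) table[i] = reader.ReadString();

        int count = reader.ReadInt32();
        var result = new Dictionary<Tuple<string, string>, int>(count);
        for (int i = 0; i < count; i++)
        {
            string first = table[reader.ReadInt32()];
            string second = table[reader.ReadInt32()];
            result.Add(Tuple.Create(first, second), reader.ReadInt32());
        }
        return result;
    }

    public static void Main()
    {
        var data = new Dictionary<Tuple<string, string>, int>
        {
            { Tuple.Create("alpha", "beta"), 1 },
            { Tuple.Create("beta", "alpha"), 2 },
        };
        var stream = new MemoryStream();
        Save(stream, data);
        stream.Position = 0;
        Console.WriteLine(Load(stream)[Tuple.Create("beta", "alpha")]); // prints 2
    }
}
```

With this layout each distinct string is decoded only once, so the saving grows with how often strings repeat across tuples.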

Related

Writing to a file asynchronously, but in order

I've got some code which saves data from an object to XML. This locked the UI for a few seconds so I made it so it wouldn't.
foreach (Path path in m_canvasCompact.Children)
{
    Task.Run(() => WritePathDataToXML(false, path));
}
private void WritePathDataToXML(bool is32x32, Path path)
{
    // stuff going on...
    xmlDoc.Root.Descendants.......Add(iconToAdd);
    xmlDoc.Save(..);
}
The problem is (as expected) that the data is written to the XML in a random order, depending on the speed at which the tasks finish (I assume).
I could probably write some bodged code which looks at the XML and rearranges it once everything has been completed, but that's not ideal. Is there any way to do this on a separate thread, but perhaps only one at a time, so that the items get executed and saved in the correct order?
Thanks.
It sounds like you want a producer/consumer queue. You can rig that up fairly easily using BlockingCollection<T>.
Create the blocking collection
Start a task which will read from the collection until it's "finished" (simplest with GetConsumingEnumerable), writing to the file
Add all the relevant items to the collection - making sure you do everything that touches UI elements within the UI thread.
Tell the collection it's "finished" (CompleteAdding)
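The four steps above can be sketched as follows. This is a minimal sketch under assumptions: plain strings stand in for whatever data is extracted from the Path elements, a StringWriter stands in for the XML document, and the names OrderedWriterDemo and WriteInOrder are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class OrderedWriterDemo
{
    public static string WriteInOrder(string[] items)
    {
        var output = new StringWriter();

        // 1. Create the blocking collection (FIFO by default).
        using (var queue = new BlockingCollection<string>())
        {
            // 2. A single consumer drains the queue, so items are written in order.
            Task consumer = Task.Run(() =>
            {
                foreach (string item in queue.GetConsumingEnumerable())
                {
                    output.WriteLine(item); // stand-in for the XML write
                }
            });

            // 3. The producer (the UI thread in the question) adds items in order.
            foreach (string item in items)
            {
                queue.Add(item);
            }

            // 4. Signal that no more items will arrive, then wait for the consumer.
            queue.CompleteAdding();
            consumer.Wait();
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.Write(WriteInOrder(new[] { "first", "second", "third" }));
    }
}
```

In a UI application the final wait would normally be an await on the consumer task rather than a blocking Wait, so that the UI thread stays responsive.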
Alternatively, as suggested in comments:
In the UI thread, create a collection with all the information you need from UI elements - basically you don't want to touch the UI elements within a non-UI thread.
Start a task to write that collection to disk; optionally await that task (which won't block the UI)
That's simpler, but it does mean building up the whole collection in memory before you start writing. With the first approach, you can add to the collection as you write - although it's entirely possible that if building the collection is much faster than writing to disk, you'll end up with the whole thing in memory anyway. If this is infeasible, you'll need some way of adding "gently" from the UI thread, without blocking it. It would be nice if BlockingCollection had an AddAsync method, but I can't see one.
We don't know enough about what you're doing with the Path elements to give you sample code for this, but hopefully that's enough of a starting point.
Run the whole loop in a Task:
Task.Run(() =>
{
    foreach (Path path in m_canvasCompact.Children)
    {
        WritePathDataToXML(false, path);
    }
});
This will still take the same time, but should not block the UI.

Threadpool - How to call a method (with params) in the main thread from a worker thread

I'm working through my first attempt to thread an application. The app works with a large data set that is split up into manageable chunks which are stored on disk, so the entire data set never has to reside in memory all at once. Instead, a subset of the data can be loaded piecemeal as needed. These chunks were previously being loaded one after the other in the main thread. Of course, this would effectively pause all GUI and other operation until the data was fully loaded.
So I decided to look into threading, and do my loading while the app continues to function normally. I was able to get the basic concept working with a ThreadPool by doing something along the lines of the pseudo-code below:
public class MyApp
{
    List<int> listOfIndiciesToBeLoaded; // This list gets updated based on user input
    Dictionary<int,Stuff> loadedStuff = new Dictionary<int,Stuff>();
    // The main thread queues items to be loaded by the ThreadPool
    void QueueUpLoads()
    {
        foreach (int index in listOfIndiciesToBeLoaded)
        {
            if (!loadedStuff.ContainsKey(index))
                loadedStuff.Add(index, new Stuff());
            LoadInfo loadInfo = new LoadInfo(index);
            ThreadPool.QueueUserWorkItem(LoadStuff, loadInfo);
        }
    }
    // LoadStuff is called from the worker threads
    public void LoadStuff(System.Object loadInfoObject)
    {
        LoadInfo loadInfo = loadInfoObject as LoadInfo;
        int index = loadInfo.index;
        int[] loadedValues = LoadValuesAtIndex(index); /* here I do my loading and ...*/
        // Then I put the loaded data in the corresponding entry in the dictionary
        loadedStuff[index].values = loadedValues;
        // Now it is accessible from the main thread and it is flagged as loaded
        loadedStuff[index].loaded = true;
    }
}
public class Stuff
{
    // As an example let's say the data being loaded is an array of ints
    public int[] values;
    public bool loaded = false;
}
// A class derived from System.Object to be passed via ThreadPool.QueueUserWorkItem
public class LoadInfo : System.Object
{
    public int index;
    public LoadInfo(int index)
    {
        this.index = index;
    }
}
This is very primitive compared to the quite involved examples I've come across while trying to learn this stuff in the past few days. Sure, it loads the data concurrently and stuffs it into a dictionary accessible from the main thread, but it also leaves me with a crucial problem. I need the main thread to be notified when an item is loaded and which item it is so that the new data can be processed and displayed. Ideally, I'd like to have each completed load call a function on the main thread and provide it the index and newly loaded data as parameters. I understand that I can't just call functions on the main thread from multiple other threads running concurrently. They have to be queued up in some way for the main thread to run them when it is not doing something else. But this is where my current understanding of thread communication falls off.
I've read over a few in-depth explanations of how events and delegates can be set up using Control.Invoke(delegate) when working with Windows Forms. But I'm not working with Windows Forms and haven't been able to apply these ideas. I suppose I need a more universal approach that doesn't depend on the Control class. If you do respond, please be detailed and maybe use some of the naming in my pseudo-code. That way it will be easier for me to follow. Threading appears to be a pretty deep topic, and I'm just coming to grips with the basics. Also please feel free to make suggestions on how I can refine my question to be more clear.
If you aren't using a GUI framework with some kind of dispatcher or GUI thread (like WPF or WinForms) then you'll have to do this manually.
One way to do this is to use a SynchronizationContext.
It's somewhat tricky to manage, but there are a few articles which go into how it works and how you can make your own:
http://www.codeproject.com/Articles/31971/Understanding-SynchronizationContext-Part-I
http://www.codeproject.com/Articles/32113/Understanding-SynchronizationContext-Part-II
However I would also consider using either a single 'DictionaryChanged' boolean which is regularly checked by your 'main thread' (when it is idle) to indicate that the dictionary is changed. The flag could then be reset on the main thread to indicate that this has been handled. Keep in mind that you'll need to do some locking there.
You could also queue messages using a thread safe queue which is written by the background thread and read from the main thread if a simple variable is not sufficient. This is essentially what most dispatcher implementations are actually doing under the hood.
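The thread-safe queue idea might be sketched like this, reusing some naming from the question. It is a sketch under assumptions: LoadResult, completedLoads, ProcessCompletedLoads and OnStuffLoaded are hypothetical names, the loaded values are dummies, and the "main loop" is reduced to a polling loop in Main.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class LoadResult
{
    public int index;
    public int[] values;
}

public class MyApp
{
    // Written by worker threads, read only by the main thread.
    private readonly ConcurrentQueue<LoadResult> completedLoads = new ConcurrentQueue<LoadResult>();
    public readonly List<int> processedIndices = new List<int>();

    // Called on a thread-pool thread.
    public void LoadStuff(object state)
    {
        int index = (int)state;
        int[] loadedValues = { index, index * 2 }; // stand-in for LoadValuesAtIndex
        completedLoads.Enqueue(new LoadResult { index = index, values = loadedValues });
    }

    // Called by the main thread whenever it is idle (for example once per frame).
    public void ProcessCompletedLoads()
    {
        LoadResult result;
        while (completedLoads.TryDequeue(out result))
        {
            OnStuffLoaded(result.index, result.values);
        }
    }

    // Always runs on the main thread, so it may touch main-thread-only state.
    private void OnStuffLoaded(int index, int[] values)
    {
        processedIndices.Add(index);
    }

    public static void Main()
    {
        var app = new MyApp();
        for (int i = 0; i < 3; i++)
        {
            ThreadPool.QueueUserWorkItem(app.LoadStuff, i);
        }
        // Simplified main loop: poll until all three loads have been handled.
        while (app.processedIndices.Count < 3)
        {
            app.ProcessCompletedLoads();
            Thread.Sleep(10);
        }
        Console.WriteLine(app.processedIndices.Count); // prints 3
    }
}
```

The worker threads never call into main-thread code directly; they only enqueue a result, and the main thread decides when to pick it up.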

How to run (create?) class in a separate thread?

I am writing a program that can be easily partitioned into several distinct parts. Simplified, it would look like this:
Reader class would work with getting data from a certain device,
Analyzer class would perform calculations on the data obtained from the device at regular intervals,
Form1 class that outputs the UI (a graphical representation of the data gathered by Reader and the numbers output by Analyzer).
Naturally, I'd like those three classes to run in separate threads (on separate cores). Meaning - all methods of Reader run in its own thread, all methods of Analyzer run in its own thread, and Form1 runs in default thread.
However, all that comes to mind is using Thread or BackgroundWorker classes, and then instead of calling some resource-heavy method on Reader or Analyzer I'd instead call
BackgroundWorker.RunWorkerAsync()
I suppose this is not the best way to do it, is it? I'd rather somehow create the class in a separate thread and leave it there for its lifespan, but I just don't get how do I do it... And I can't think of a suitable search query it seems because I haven't found answer when I searched for one.
EDIT: Thank you for the comments, I think I understand, the question itself was assuming that you can create a class "on a thread" - with implied meaning of "any method of this class called will execute on its thread" - which makes no sense, and cannot be done.
I think you are on the right track. You will need
two threads, Reader and Analyzer, started by Form1. They basically consist of big loops that run until some flag stopReader or stopAnalyzer is set;
two concurrent queues, let's call them readQueue and analyzedQueue. Reader will put stuff in readQueue, Analyzer will read from readQueue and write to analyzedQueue, and Form1 will read from analyzedQueue.
void runReader()
{
    while (!stopReader)
    {
        var data = ...; // read data from device
        readQueue.Enqueue(data);
    }
}
void runAnalyzer()
{
    while (!stopAnalyzer)
    {
        Data data;
        if (readQueue.TryDequeue(out data))
        {
            var result = ...; // analyze data
            analyzedQueue.Enqueue(result);
        }
        else
        {
            Thread.Sleep(...); // wait a while
        }
    }
}
Instead of Thread.Sleep, you could use a BlockingCollection to make Analyzer wait until a new data item is available. In that case, you might want to use a CancellationToken instead of a Boolean for stopAnalyzer, so that you can interrupt BlockingCollection.Take when stopping your algorithm.
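A sketch of that BlockingCollection variant is shown below. The queue of ints, the doubling "analysis" and the names PipelineDemo and Run are placeholders; only BlockingCollection.Take with a CancellationToken is the point of the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class PipelineDemo
{
    public static int Run()
    {
        var readQueue = new BlockingCollection<int>();
        var stopAnalyzer = new CancellationTokenSource();
        int analyzedSum = 0;

        Task analyzer = Task.Run(() =>
        {
            try
            {
                while (true)
                {
                    // Blocks without polling until an item arrives or the token is cancelled.
                    int data = readQueue.Take(stopAnalyzer.Token);
                    analyzedSum += data * 2; // stand-in for the analysis
                }
            }
            catch (OperationCanceledException)
            {
                // Take was interrupted by stopAnalyzer.Cancel(): leave the loop.
            }
        });

        // Stand-in for the Reader thread.
        for (int i = 1; i <= 3; i++) readQueue.Add(i);

        // Wait until everything queued so far has been taken, then stop the analyzer.
        while (readQueue.Count > 0) Thread.Sleep(10);
        stopAnalyzer.Cancel();
        analyzer.Wait();
        return analyzedSum;
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints 12
    }
}
```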

C# Threading without locking Producer or Consumer

TL;DR version of the main questions:
While working with threads, is it safe to read a list's contents with one thread while another writes to it, as long as you do not delete list contents (reorganize the order) and only read a new object after it has been fully added?
While an Int is being updated from "Old Value" to "New Value" by one thread, is there a risk that another thread reading this Int gets a value that is neither "Old Value" nor "New Value"?
Is it possible for a thread to "skip" a critical region if it's busy, instead of just going to sleep and waiting for the region's release?
I have two pieces of code running in separate threads and I want one to act as a producer for the other. I do not want either thread "sleeping" while waiting for access; instead each should skip forward in its internal code if the other thread is accessing the shared data.
My original plan was to share the data via this approach (and, once the counter got high enough, switch to a secondary list to avoid overflows).
Pseudo code of the flow as I originally intended it:
Producer
{
    Int counterProducer;
    bufferedObject newlyProducedObject;
    List<bufferedObject> objectsProducer;
    while(true)
    {
        <Do stuff until a new product is created and added to newlyProducedObject>;
        objectsProducer.add(newlyProducedObject);
        counterProducer++;
    }
}
Consumer
{
    Int counterConsumer;
    Producer objectProducer; // contains reference to Producer class
    List<bufferedObject> personalQueue;
    while(true)
    {
        <Do useful work, such as working on personal queue, and polish nails if no personal queue>
        // get all outstanding requests and move to personal queue
        while (counterConsumer < objectProducer.GetcounterProducer())
        {
            personalQueue.add(objectProducer.GetItem(counterConsumer+1));
            counterConsumer++;
        }
    }
}
Looking at this, everything looked fine at first glance. I knew I would not be retrieving a half-constructed product from the queue, so the state of the list should not be a problem even if a thread switch occurs while the Producer is adding a new object. Is this assumption correct, or can there be problems here? (My guess is that, as the consumer asks for a specific location in the list, new objects are added to the end, and objects are never deleted, this will not be a problem.)
But what caught my eye was this: could a similar problem occur where "counterProducer" is at an unknown value while "counterProducer++" is executing? Could this result in the value temporarily being "null" or some unknown value? Will this be a potential issue?
My goal is to have neither of the two threads block while waiting for a mutex but instead continue their loops, which is why I wrote the above first, as there is no locking.
If the usage of the list will cause problems, my workaround will be to make a linked list implementation and share it between the two classes, still using the counters to see whether new work has been added and keeping the last location while the consumer moves new items to its personal queue. So the producer adds new links, and the consumer reads them and deletes the previous ones. (No counter on the list, just external counters to know how much has been added and removed.)
Alternative pseudo code to avoid the counterProducer++ risk (need help with this):
Producer
{
    Int publicCounterProducer;
    Int privateCounterProducer;
    bufferedObject newlyProducedObject;
    List<bufferedObject> objectsProducer;
    while(true)
    {
        <Do stuff until a new product is created and added to newlyProducedObject>;
        objectsProducer.add(newlyProducedObject);
        privateCounterProducer++;
        <Need Help: Some code that updates the publicCounterProducer to the privateCounterProducer if that variable is not
        locked, else skips ahead, and the counter will get updated at next pass, at some point the consumer must be done reading stuff, and
        new stuff is prepared already>
    }
}
Consumer
{
    Int counterConsumer;
    Producer objectProducer; // contains reference to Producer class
    List<bufferedObject> personalQueue;
    while(true)
    {
        <Do useful work, such as working on personal queue, and polish nails if no personal queue>
        // get all outstanding requests and move to personal queue
        <Need Help: tries to read the publicCounterProducer and set readProducerCounter to this, else skips this code>
        while (counterConsumer < readProducerCounter)
        {
            personalQueue.add(objectProducer.GetItem(counterConsumer+1));
            counterConsumer++;
        }
    }
}
The goal of the second piece of code, which I have not been able to figure out how to write, is for neither class to wait for the other while the other is in the "critical region" of updating publicCounterProducer. If I read the lock functionality correctly, the threads will go to sleep waiting for the release, which is not what I want. I might end up having to use it though, in which case the first pseudocode would do, with just a "lock" around getting the value.
Hope you can help me out with my many questions.
No it is not safe. A context switch can occur within .Add after List has added the object, but before List has updated the internal data structure.
If it is int32, or if it is int64 and you are running in an x64 process, then there is no risk. But if you have any doubts, use the Interlocked class.
Yes, you can use a Semaphore, and when it is time to enter the critical region, use WaitOne overload that takes a timeout. Pass a timeout of 0. If WaitOne returns true, then you successfully acquired the lock and can enter. If it returns false, then you did not acquire the lock and should not enter.
You should really look at the System.Collections.Concurrent namespace. In particular, look at BlockingCollection<T>. It has a set of Try* methods (such as TryAdd and TryTake) that you can use to add or remove items without blocking.
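The zero-timeout WaitOne technique mentioned above could look like the following sketch. The names TryEnterDemo, gate and TryRunCritical are hypothetical; the semaphore has a capacity of one so that it behaves like a non-reentrant lock.

```csharp
using System;
using System.Threading;

public static class TryEnterDemo
{
    private static readonly Semaphore gate = new Semaphore(1, 1);

    // Returns true if the critical region ran, false if it was skipped because it was busy.
    public static bool TryRunCritical(Action criticalWork)
    {
        // A timeout of 0 means "acquire if free, otherwise return false immediately".
        if (!gate.WaitOne(0))
        {
            return false;
        }
        try
        {
            criticalWork();
            return true;
        }
        finally
        {
            gate.Release();
        }
    }

    public static void Main()
    {
        bool outer = TryRunCritical(() =>
        {
            // The gate is already held here, so a nested attempt is skipped.
            bool inner = TryRunCritical(() => { });
            Console.WriteLine(inner); // prints False
        });
        Console.WriteLine(outer); // prints True
    }
}
```

The nested call in Main only simulates a busy region within a single thread; in the producer/consumer case the second attempt would come from the other thread.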
While working with threads, is it safe to read a list's contents with one thread while another writes to it, as long as you do not delete list contents (reorganize the order) and only read a new object after it has been fully added?
No, it is not. List<T> is not safe for concurrent readers and writers. A side-effect of adding an item may be to reallocate the underlying array, and nothing guarantees the order in which another thread observes the updated count, the new array reference, and the stored element, so a reader may see a list of the correct size whose newest slot does not yet contain the data.
While an Int is being updated from "Old Value" to "New Value" by one thread, is there a risk that another thread reading this Int gets a value that is neither "Old Value" nor "New Value"?
Nope, int updates are atomic. But if two threads are both incrementing counterProducer at once, it will go wrong. You should use Interlocked.Increment() to increment it.
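As an illustration of the Interlocked.Increment advice, here is a small sketch in which several threads increment a shared counter. The class name CounterDemo and the iteration counts are arbitrary; counterProducer is borrowed from the question.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CounterDemo
{
    private static int counterProducer;

    public static int Run()
    {
        counterProducer = 0;
        // Up to four threads increment concurrently; a plain counterProducer++ could lose updates here.
        Parallel.For(0, 4, _ =>
        {
            for (int i = 0; i < 100000; i++)
            {
                Interlocked.Increment(ref counterProducer);
            }
        });
        // Volatile.Read makes sure the latest published value is observed.
        return Volatile.Read(ref counterProducer);
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints 400000
    }
}
```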
Is it possible for a thread to "skip" a critical region if it's busy, instead of just going to sleep and waiting for the region's release?
No, but you can use (for example) WaitHandle.WaitOne(int) to see if a wait succeeded, and branch accordingly. WaitHandle is implemented by several synchronization classes, such as ManualResetEvent.
Incidentally, is there a reason you are not using the built-in Producer/Consumer classes such as BlockingCollection<T>? BlockingCollection is easy to use (after you read the documentation!) and I'd recommend using it instead.

C# thread pool limiting threads

Alright...I've given the site a fair search and have read over many posts about this topic. I found this question: Code for a simple thread pool in C# especially helpful.
However, as it always seems, what I need varies slightly.
I have looked over the MSDN example and adapted it to my needs somewhat. The example I refer to is here: http://msdn.microsoft.com/en-us/library/3dasc8as(VS.80,printer).aspx
My issue is this. I have a fairly simple set of code that loads a web page via the HttpWebRequest and WebResponse classes and reads the results via a Stream. I fire off this method in a thread as it will need to be executed many times. The method itself is pretty short, but the number of times it needs to be fired (with varied data each time) varies. It can be anywhere from 1 to 200.
Everything I've read seems to indicate the ThreadPool class being the prime candidate. Here is where things get tricky. I might need to fire off this thing, say, 100 times, but I can only have 3 threads at most running (for this particular task).
I've tried setting the MaxThreads on the ThreadPool via:
ThreadPool.SetMaxThreads(3, 3);
I'm not entirely convinced this approach is working. Furthermore, I don't want to clobber other web sites or programs running on the system this will be running on. So, by limiting the # of threads on the ThreadPool, can I be certain that this pertains to my code and my threads only?
The MSDN example uses the event-driven approach and calls WaitHandle.WaitAll(doneEvents); which is how I'm doing this.
So the heart of my question is, how does one ensure or specify a maximum number of threads that can be run for their code, but have the code keep running more threads as the previous ones finish up until some arbitrary point? Am I tackling this the right way?
Sincerely,
Jason
Okay, I've added a semaphore approach and completely removed the ThreadPool code. It seems simple enough. I got my info from: http://www.albahari.com/threading/part2.aspx
It's this example that showed me how:
[text below here is a copy/paste from the site]
A Semaphore with a capacity of one is similar to a Mutex or lock, except that the Semaphore has no "owner" – it's thread-agnostic. Any thread can call Release on a Semaphore, while with Mutex and lock, only the thread that obtained the resource can release it.
In this following example, ten threads execute a loop with a Sleep statement in the middle. A Semaphore ensures that not more than three threads can execute that Sleep statement at once:
class SemaphoreTest
{
    static Semaphore s = new Semaphore(3, 3); // Available=3; Capacity=3
    static void Main()
    {
        for (int i = 0; i < 10; i++)
            new Thread(Go).Start();
    }
    static void Go()
    {
        while (true)
        {
            s.WaitOne();
            Thread.Sleep(100); // Only 3 threads can get here at once
            s.Release();
        }
    }
}
Note: if you are limiting this to "3" just so you don't overwhelm the machine running your app, I'd make sure this is a problem first. The threadpool is supposed to manage this for you. On the other hand, if you don't want to overwhelm some other resource, then read on!
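You can't usefully limit the size of the thread pool for just your task: ThreadPool.SetMaxThreads applies to the whole process, and it cannot be set below the number of processors on the machine.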
In this case, I'd use a semaphore to manage access to your resource. In your case, your resource is running the web scrape, or calculating some report, etc.
To do this, in your static class, create a single shared semaphore object:
static System.Threading.Semaphore S = new System.Threading.Semaphore(3, 3);
Then, in each thread, you do this (using that shared S, not a new semaphore per thread):
// wait your turn (decrement)
S.WaitOne();
try
{
    // do your thing
}
finally
{
    // release so others can go (increment)
    S.Release();
}
Each thread will block on the S.WaitOne() until it is given the signal to proceed. Once S has been decremented 3 times, all threads will block until one of them increments the counter.
This solution isn't perfect.
If you want something a little cleaner and more efficient, I'd recommend going with a blocking-queue approach (in .NET 4 and later, BlockingCollection<T>), wherein you enqueue the work you want performed into a global blocking queue object.
Meanwhile, you have three threads (which you created yourself, not thread-pool threads) popping work out of the queue to perform. This isn't that tricky to set up and is very fast and simple.
Examples:
Best threading queue example / best practice
Best method to get objects from a BlockingQueue in a concurrent program?
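The blocking-queue approach with a fixed number of dedicated threads could be sketched as follows, assuming .NET 4's BlockingCollection<T>. The names BoundedWorkersDemo and Run are hypothetical, the jobs are plain ints, and a short sleep stands in for the web request; the counters exist only to show that no more than workerCount jobs run at once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class BoundedWorkersDemo
{
    public static int Run(int jobCount, int workerCount)
    {
        var jobs = new BlockingCollection<int>();
        int running = 0, maxRunning = 0, completed = 0;
        object sync = new object();

        // Dedicated threads (not thread-pool threads) pull work from the queue.
        var workers = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = new Thread(() =>
            {
                foreach (int job in jobs.GetConsumingEnumerable())
                {
                    lock (sync) { running++; maxRunning = Math.Max(maxRunning, running); }
                    Thread.Sleep(5); // stand-in for the web request
                    lock (sync) { running--; completed++; }
                }
            });
            workers[w].Start();
        }

        // Enqueue all the work, then signal that nothing more is coming.
        for (int i = 0; i < jobCount; i++) jobs.Add(i);
        jobs.CompleteAdding();
        foreach (Thread worker in workers) worker.Join();

        Console.WriteLine("completed " + completed + ", peak concurrency " + maxRunning);
        return maxRunning;
    }

    public static void Main()
    {
        Run(20, 3);
    }
}
```

Because only workerCount threads ever exist, the concurrency limit holds by construction and no semaphore is needed.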
ThreadPool is a static class like any other, which means that anything you do with it affects every other thread in the current process. It doesn't affect other processes.
I consider this one of the larger design flaws in .NET, however. Who came up with the brilliant idea of making the thread pool static? As your example shows, we often want a thread pool dedicated to our task, without having it interfere with unrelated tasks elsewhere in the system.
