Using ThreadPools to search through object lists


I have a list of container objects (let's call them Container). Each Container in turn holds a list of DataItem objects (or types derived from DataItem). In a typical scenario a user will have 15-20 Container objects with 1000-5000 DataItems each. There are also some DataMatcher objects that can be used for different types of searches. These mostly work fine (I have several hundred unit tests on them), but to make my WPF application feel snappy and responsive, I decided to use the ThreadPool for this task. So I have a DataItemCommandRunner which runs on a Container object and basically invokes each delegate in a list it takes as a parameter on each DataItem in turn. I use the ThreadPool to queue one work item per Container, so that the search should in theory be as efficient as possible on multi-core computers.
This is basically done in a DataItemUpdater class that looks something like this:
public class DataItemUpdater
{
    private Container ch;
    private IEnumerable<DataItemCommand> cmds;
    public DataItemUpdater(Container container, IEnumerable<DataItemCommand> commandList)
    {
        ch = container;
        cmds = commandList;
    }
    // Signature matches WaitCallback so it can be queued on the ThreadPool.
    public void RunCommandsOnContainer(object useless)
    {
        Thread.CurrentThread.Priority = ThreadPriority.AboveNormal;
        foreach (DataItem di in ch.ItemList)
        {
            // Run every command against the current item.
            foreach (var cmd in cmds)
            {
                cmd(di);
            }
        }
        //Console.WriteLine("Done running for {0}", ch.DisplayName);
    }
}
(The useless object parameter for RunCommandsOnContainer is because I am experimenting with this with and without using threads, and one of them requires some parameter. Also, setting the priority to AboveNormal is just an experiment as well.)
This works fine for all but one scenario - when I use the AllWordsMatcher object type that will look for DataItem objects containing all words being searched for (as opposed to any words, exact phrase or regular expression for instance).
This is a pretty simple somestring.Contains(eachWord) based object, backed by unit tests. But herein lies some hairy strangeness.
When the RunCommandsOnContainer runs using ThreadPool threads, it will return insane results. Say I have a string like this:
var someString = "123123123 - just some numbers";
And I run this:
var res = someString.Contains("data");
When it runs, this will actually return true quite a lot. I have debugging information that shows it returning true for empty strings and other strings that simply do not contain the data. It will also sometimes return false even when the string actually contains the data being looked for.
The kicker in all this? Why do I suspect the ThreadPool and not my own code?
When I run the RunCommandsOnContainer() command for each Container in my main thread (i.e. locking the UI and everything), it works 100% correctly - every time! It never finds anything it shouldn't, and it never skips anything it should have found.
However, as soon as I use the ThreadPool, it starts finding a lot of items it shouldn't, while sometimes not finding items it should.
I realize this is a complex problem (it is painful trying to debug, that's for sure!), but any insight into why and how to fix this would be greatly appreciated!
Thanks!
Rune

It's a bit hard to see from the fragment you're posting, but judging by the symptoms I would look at the AllWordsMatcher (look for static state). If AllWordsMatcher is stateful you should also check that you're creating a new instance for each thread.
More generally, I'd look at all the instances involved in the matching/searching process, specifically at the working objects used when running multithreaded. From past experience, that is usually where the problem lies. (It's easy to focus too much on the object graph representing your business data, Container/DataItem in this case.)
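To illustrate the kind of shared state worth hunting for, here is a minimal sketch. The real AllWordsMatcher was never shown, so SharedStateMatcher, StatelessMatcher and MatcherDemo are hypothetical names invented for this example; the point is only how a static working field lets one thread overwrite another thread's input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical: keeps its working data in a static field, so every thread
// that calls Matches reads and overwrites the same variable.
public static class SharedStateMatcher
{
    private static string current; // shared by all threads

    public static bool Matches(string text, IEnumerable<string> words)
    {
        current = text;                              // this thread writes...
        return words.All(w => current.Contains(w));  // ...another may have replaced it
    }
}

// Hypothetical: everything lives in parameters and locals, so concurrent
// calls cannot see each other's data.
public static class StatelessMatcher
{
    public static bool Matches(string text, IEnumerable<string> words)
    {
        return words.All(w => text.Contains(w));
    }
}

public static class MatcherDemo
{
    // Runs the matcher over many strings in parallel and counts wrong answers.
    public static int CountWrongAnswers(Func<string, IEnumerable<string>, bool> matcher)
    {
        var words = new[] { "data" };
        var inputs = Enumerable.Range(0, 100000)
            .Select(i => i % 2 == 0 ? "some data here" : "123123123 - just some numbers")
            .ToArray();
        int wrong = 0;
        Parallel.For(0, inputs.Length, i =>
        {
            bool expected = i % 2 == 0; // even indexes contain "data"
            if (matcher(inputs[i], words) != expected)
                Interlocked.Increment(ref wrong);
        });
        return wrong;
    }
}
```

Calling MatcherDemo.CountWrongAnswers(StatelessMatcher.Matches) should always give 0. Passing SharedStateMatcher.Matches instead may give a non-zero count on a multi-core machine, which is the same "true for strings that do not contain the word" symptom described in the question.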

Related

C# Issue handling memory usage for SearchResultAttributeCollection LDAP

Using the System.DirectoryServices.Protocols library:
I have a class LdapItemOperator that takes a SearchResultEntry object from an LDAP query (not Active Directory related) and stores the attributes for the object in a field: readonly SearchResultAttributeCollection LdapAttributes.
The problem I am experiencing is that during a large operation the garbage collector never seems to reclaim these objects after they ought to have been disposed. I think the LdapAttributes field in my objects is the cause. How can I dispose of the objects when they are no longer required? I can't find a way to work a using statement in there, although I have only a little experience with it.
As an example, let's say I have the following logic:
List<LdapItemOperator> itemList = GetList(ldapFilter);
List<bool> resultList = new List<bool>();
foreach (LdapItemOperator item in itemList) {
    bool result = doStuff(item);
    resultList.Add(result);
}
// Even though we are out of the loop now, the objects are still stored in memory. How come?
// The same goes for the earlier objects in the loop; they seem to remain in memory.
Logic.WriteResultToLog(resultList);
After a good while of running the logic on large filesets, this process starts taking up enormous amounts of memory, of course...

I think you might be a little confused about how GC works. You can never know exactly when GC will run. And objects you are still holding a reference to will not be collected (unless it's a weak reference...).
Also "disposing" is yet another different concept, that hasn't much to do with GC.
Basically, all objects will be in memory already after the call to GetList. And memory consumption will not change much after that, the foreach loop shouldn't affect it at all.
Without knowing your implementation, maybe try returning an enumerable instead of a single list, or make multiple batched calls.
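As a rough sketch of the "return an enumerable" suggestion: the question's GetList and LdapItemOperator were not shown, so Item, GetItems and DoStuff below are hypothetical stand-ins. The iterator hands out one object at a time, so nothing holds a reference to the items already processed.

```csharp
using System;
using System.Collections.Generic;

public static class StreamingDemo
{
    // Hypothetical stand-in for an LDAP entry wrapper.
    public sealed class Item
    {
        public int Id;
        public byte[] Attributes = new byte[1024]; // pretend attribute payload
    }

    // Produces items lazily instead of building a full list up front.
    public static IEnumerable<Item> GetItems(int count)
    {
        for (int i = 0; i < count; i++)
            yield return new Item { Id = i };
    }

    // Hypothetical per-item work; returns a small result only.
    public static bool DoStuff(Item item)
    {
        return item.Id % 2 == 0;
    }

    public static List<bool> ProcessAll(int count)
    {
        var results = new List<bool>();
        foreach (Item item in GetItems(count))
        {
            // Only the current item is referenced here; earlier items become
            // eligible for collection once the loop has moved past them.
            results.Add(DoStuff(item));
        }
        return results;
    }
}
```

Only the small bool results accumulate. Whether this helps in the real code depends on whether the underlying LDAP query can itself be paged or streamed, which is an assumption here.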

Does the need to make the code simpler justify the use of wrong abstractions?

Suppose we have a CommandRunner class that runs Commands. When a Command is created, it is kept in the processingQueue for processing. If the execution of the Command finishes with errors, the Command is moved to the faultedQueue for later processing; if everything is OK, the Command is moved to the archiveQueue. The archiveQueue is not going to be processed in any way.
The CommandRunner looks something like this:
class CommandRunner
{
    private readonly IQueue<Command> processingQueue;
    private readonly IQueue<Command> faultedQueue;
    private readonly IQueue<Command> archiveQueue;
    public CommandRunner(IQueue<Command> processingQueue,
                         IQueue<Command> faultedQueue,
                         IQueue<Command> archiveQueue)
    {
        this.processingQueue = processingQueue;
        this.faultedQueue = faultedQueue;
        this.archiveQueue = archiveQueue;
    }
    public void RunCommands()
    {
        while (processingQueue.HasItems)
        {
            var current = processingQueue.Dequeue();
            var result = current.Run();
            // Failed commands are retried later; successful ones are archived.
            if (result.HasError)
                current.MoveTo(faultedQueue);
            else
                current.MoveTo(archiveQueue);
            ...
        }
    }
}
The CommandRunner receives its three dependencies as PersistentQueue instances. The PersistentQueue is responsible for the long-term storage of the Commands, which frees the CommandRunner from handling this.
The only purpose of making the archive a queue is to keep the design homogeneous and to keep the CommandRunner persistence-ignorant and with few dependencies.
For example, we can imagine a property like this:
IEnumerable<Command> AllCommands
{
get
{
return Enumerate(archiveQueue).Union(processingQueue).Union(faultedQueue);
}
}
Many portions of the class need to do the same (treat the archive as a queue to keep the code simple, as shown above).
Does it make sense to use a queue even if it's not the best abstraction, or should I use another abstraction for the archive concept?
What other alternatives would meet these requirements?

Keep in mind that code, especially running code, usually gets tangled and messy as time passes. Good names, good design, and meaningful comments are what combat this.
If you aren't going to process the archiveQueue, and it's just storage for messages that have been successfully processed, you can always store it as a different type (list, collection, set, whatever suits your needs), and then choose one of the following two:
1. Keep the name archiveQueue and change the underlying type. I would leave a comment where it's defined (or injected) saying: Notice that this might not be an actual queue. Name is for consistency reasons only.
2. Change the name to archiveRepository or something similar, while keeping the queue type. Obviously, since it's still a queue, you'll leave a comment saying: Notice, this is actually a queue.
Another thing to keep in mind is that if you have n people working on your code base, you'll probably get n+1 different preferences about which way it should be done :)
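One way to picture the "store it as a different type" option is a narrower abstraction that only allows adding and enumerating. This is a sketch under assumptions: IArchive, ListArchive and ArchiveDemo are hypothetical names, and the question's IQueue and PersistentQueue are replaced by the standard Queue<T> to keep the example self-contained.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical archive abstraction: items go in and can be listed,
// but there is no Dequeue, so nobody can "process" the archive by accident.
public interface IArchive<T> : IEnumerable<T>
{
    void Add(T item);
}

// Simple in-memory implementation backed by a list.
public sealed class ListArchive<T> : IArchive<T>
{
    private readonly List<T> items = new List<T>();

    public void Add(T item) { items.Add(item); }

    public IEnumerator<T> GetEnumerator() { return items.GetEnumerator(); }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

public static class ArchiveDemo
{
    // Drains a processing queue into the archive and returns the archived count.
    public static int Run()
    {
        var processing = new Queue<string>(new[] { "a", "b", "c" });
        IArchive<string> archive = new ListArchive<string>();
        while (processing.Count > 0)
            archive.Add(processing.Dequeue());

        // Because IArchive<T> is enumerable, it still composes with LINQ,
        // much like the AllCommands property in the question.
        return archive.Count();
    }
}
```

Since the archive is still an IEnumerable<T>, code that unions it with the other queues would keep working, which may remove most of the reason for pretending it is a queue.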

A queue is a useful structure when you need to care about the order of the items inside it. If your command post-processing needs to know the order in which the commands ran, then a queue can be a good choice.
If you don't need information about the order of the commands, maybe you can use a List<T> (in the System.Collections.Generic namespace).
I think your choice is good; in the same situation I would use queues. OS design offers a good example: inside the kernel, processes are queued for execution. The OS queues are clearly more complicated because they take other variables into account, such as priority and CPU utilization, but we can still learn from how queues are used as data structures in process management.

Looking for advice on thread safety using static methods to 'process' a class instance

I have recently inherited a system that uses a very basic approach to processing work items: it does them one by one. To be honest, up until recently this worked well. However, we are looking to implement a similar process for another type of work item, and I have been looking into the Task Parallel Library and think it will fit the bill. I do have some concerns about thread safety, which is an area where I lack knowledge, so I am asking only my second question on here in the hope that someone can give me some good pointers, as I have yet to find a definitive yes or no answer for this.
So we have our 'WorkItem' class
public class WorkItem
{
    public int Id { get; set; }
    public string Data { get; set; }
}
A List<WorkItem> will be generated and these will then be processed using a Parallel.ForEach loop.
The Parallel.ForEach will call a private method, which in turn will call static methods from another assembly:
// Windows service that will run the Parallel.ForEach
private void MainMethod(WorkItem item)
{
    item.Data = Processor.ProcessWorkItemDataProcess1(item.Data);
    item.Data = Processor.ProcessWorkItemDataProcess2(item.Data);
    SendToWorkFlow(item);
}
public static class Processor
{
    public static string ProcessWorkItemDataProcess1(string data)
    {
        // Process it here
        return data; // placeholder for the processed string
    }
    public static string ProcessWorkItemDataProcess2(string data)
    {
        // Process it here
        return data; // placeholder for the processed string
    }
}
And so on. All of these methods have logic in them to process the WorkItem instance at various different stages. Once complete, the MainMethod will send the processed WorkItem off to a Workflow System.
We will be processing these in batches of up to 30 in order not to overload the other systems. My concern is basically that 30 instances of WorkItem accessing the same static methods could cause data integrity issues. For example, ProcessWorkItemDataProcess2 is called with WorkItem1.Data and is subsequently called with WorkItem2.Data, and somehow WorkItem2.Data is returned when it should be WorkItem1.Data.
All of the static methods are self-contained in so far as they have defined logic and will only (in theory) use the WorkItem that it was called with. There are no methods such as DB access, file access, etc.
So, hopefully that explains what I am doing. Should I have any concerns? If so, will creating an instance of the Processor class for each WorkItem solve any potential problems?
Thanks in advance

The scenario you describe doesn't sound like it has any blatant threading issues. Your worries about a static method being called on two different threads and getting the data mixed up are unfounded, unless you write code to mix things up. ;>
Since the methods are static, they don't have any shared object instance to worry about. That's good. You have isolated the work into self-contained work items. That is good.
You will need to check to make sure that none of the static methods access any global state, like static variables or properties, or reading from a file (the same file name for multiple work items). Reading of global state is less of a concern, writing is what will throw a wrench in the works.
You should also review your code to see how data is assigned to your work items and whether any of the code that processes the work items modifies the work item data. If the work items are treated as strictly read only by the methods, that's good. If the methods write changes back to fields or properties of the work items, you will need to double check that the data in the work items is not shared with any other work items. If the code that constructs the work item instances assigns a cached value to a property of multiple work items, and the static methods modify properties of that value, you will have threading conflicts. If the work item construction always constructs new instances of values that are assigned to properties of the work item, this shouldn't be an issue.

In a nutshell, if you have multiple threads accessing shared state, and at least one is writing, then you need to worry about thread safety. If not then you're golden.
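A minimal sketch of the safe case described above, assuming the processing methods only touch their arguments and locals. The bodies of Process1 and Process2 are made up for illustration (the question's real processing logic was not shown), and BatchDemo is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class WorkItem
{
    public int Id { get; set; }
    public string Data { get; set; }
}

public static class Processor
{
    // Each method reads only its parameter and returns a new string.
    // No static fields are read or written, so concurrent calls cannot interfere.
    public static string Process1(string data) { return data.Trim(); }
    public static string Process2(string data) { return data.ToUpperInvariant(); }
}

public static class BatchDemo
{
    public static List<WorkItem> Run()
    {
        // A batch of 30 items, each with its own Data string.
        var items = Enumerable.Range(1, 30)
            .Select(i => new WorkItem { Id = i, Data = "  item" + i + "  " })
            .ToList();

        // Every iteration works on a different WorkItem instance,
        // so no state is shared between threads.
        Parallel.ForEach(items, item =>
        {
            item.Data = Processor.Process1(item.Data);
            item.Data = Processor.Process2(item.Data);
        });
        return items;
    }
}
```

Each parameter and local lives on the calling thread's own stack, which is why static methods by themselves do not mix up data. The picture changes only if Processor gains a static field that the methods write to.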

How to break down large 'macro' classes?

One application I work on does only one thing, as seen from the outside world: it takes a file as input and after ~5 minutes spits out another file.
What happens inside is actually a sequential series of actions. The application is, in our opinion, structured well, because each action is like a small box without too many dependencies.
Usually some later actions use information from previous ones, and just a few could be executed in parallel. For the sake of simplicity we prefer to keep the execution sequential.
Now the problem is that the function that executes all these actions is like a batch file: a long list of calls to different functions with different arguments. So the code looks like:
main
{
    try
    {
        result1 = Action1(inputFile);
        result2 = Action2(inputFile);
        result3 = Action3(result2.value);
        result4 = Action4(result1.value, inputFile);
        ... // You get the idea. There is no pattern to the passed parameters.
        resultN = ActionN(parameters);
        write output
    }
    catch
    {
        something went wrong, display the error
    }
}
How would you model the main function of this application so that it is not just a long list of commands?

Not everything needs to fit to a clever pattern. There are few more elegant ways to express a long series of imperative statements than as, well, a long series of imperative statements.
If there are certain kinds of flexibility you feel you are currently lacking, express them, and we can try to propose solutions.
If there are certain clusters of actions and results that are re-used often, you could pull them out into new functions and build "aggregate" actions from them.
You could look into dataflow languages and libraries, but I expect the gain to be small.

Not sure if it's the best approach, but you could have an object that would store all the results and you would give it to each method in turn. Every method would read the parameters it needs and write its result there. You could then have a collection of actions (either as delegates or objects implementing an interface) and call them in a loop.
class Results
{
    public int Result1 { get; set; }
    public string Result2 { get; set; }
    …
}
var actions = new Action<Results>[] { Action1, Action2, … };
Results results = new Results();
foreach (var action in actions)
    action(results);

You could consider implementing this as a Sequential Workflow in Windows Workflow Foundation.

First of all, this solution is far from bad. If the actions are independent, meaning there are no global parameters or other hidden dependencies between different actions or between actions and the environment, it's a good solution. It is easy to maintain and read; when you need to expand the functionality, you just add new actions, and when the "quantity" changes, you just add or remove lines from the macro sequence. If the process chain doesn't need to change frequently: don't move!
If it's a system where the implementation of the actions doesn't change often, but their order and parameters do, you may design a simple script language and transform the macro class into such a script. The script could then be maintained by someone other than you, someone who is familiar with the problem domain at the level of your "actions", so he/she can assemble the application using the script language without your assistance.
One nice approach for splitting that kind of problem is dataflow programming (a.k.a. flow-based programming). In dataflow programming there are pre-written components. Components are black boxes (from the view of the application developer); they have consumer (input) and producer (output) ports, which can be connected to form a processing network, which is then the application. If there is a good set of components for a domain, many applications can be created without programming new components. Components can also be built from other components (these are called composite components).
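The component-and-port idea can be sketched in a few lines of C#. This is only an illustration, not a flow-based programming framework: PipelineDemo and Component are hypothetical names, and BlockingCollection<T> from the standard library stands in for the ports.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PipelineDemo
{
    // A "component": a black box that reads from an input port,
    // applies its transform, and writes to an output port.
    static Task Component<TIn, TOut>(BlockingCollection<TIn> input,
                                     BlockingCollection<TOut> output,
                                     Func<TIn, TOut> transform)
    {
        return Task.Run(() =>
        {
            foreach (TIn item in input.GetConsumingEnumerable())
                output.Add(transform(item));
            output.CompleteAdding(); // tell downstream that no more data is coming
        });
    }

    public static List<string> Run()
    {
        // The connections between components.
        var numbers = new BlockingCollection<int>();
        var squares = new BlockingCollection<int>();
        var labels = new BlockingCollection<string>();

        // Wire the ports together; the wiring is the "application".
        Task square = Component(numbers, squares, n => n * n);
        Task format = Component(squares, labels, n => "value=" + n);

        // Feed the network and close its input.
        for (int i = 1; i <= 3; i++) numbers.Add(i);
        numbers.CompleteAdding();

        Task.WaitAll(square, format);
        return new List<string>(labels.GetConsumingEnumerable());
    }
}
```

The two components know nothing about each other; reordering or replacing them only changes the wiring in Run, which is the property the dataflow approach is after.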
Wikipedia (good starting point):
http://en.wikipedia.org/wiki/Dataflow_programming
http://en.wikipedia.org/wiki/Flow-based_programming
JPM's site (book, wiki, everything):
http://jpaulmorrison.com/fbp/
I think bigger systems must have that split point you describe as "macro". Even games have it; e.g. FPS games have a 3D engine and a game logic script, and SCUMM VM works the same way.

Lockless list help!

Hi, I'm trying to write a lockless list. I got the adding part working, I think, but the code that extracts objects from the list does not work too well :(
Well, the list is not a normal list. I have the interface IWorkItem:
interface IWorkItem
{
    DateTime ExecuteTime { get; }
    bool Cancelled { get; }
    void Execute(DateTime now);
}
and I have a list where I can add these :P The idea is that when I run Get() on the list, it should loop through it until it finds an IWorkItem for which
if (item.ExecuteTime < DateTime.Now)
is true, then remove it from the list and return it.
I have run tests with many threads on my dual-core CPU, and it seems that Add works (it has never failed so far), but the Get function loses some work items somewhere and I have no idea what's wrong.
PS: if I get this working, anyone is free to use the code :) Well, you are anyway, but I don't see the point while it's bugged :P
The code is here http://www.easy-share.com/1903474734/LinkedList.zip and if you try to run it you will see that it will sometimes not be able to get as many work items as it put in the list.
Edit: I have got a lockless list working. It was faster than using the lock(obj) statement, but I have a lock object that uses Interlocked that was still outperforming the lockless list. I'm going to try to make a lockless array list and see if I get the same results there. When I'm done I'll upload the result here.

The problem is your algorithm: Consider this sequence of events:
Thread 1 calls list.Add(workItem1), which completes fully.
Status is:
first=workItem1, workItem1.next = null
Then thread 1 calls list.Add(workItem2) and reaches the spot right before the second Replace (where you have the comment "//lets try").
Status is:
first=workItem1, workItem1.next = null, nextItem=workItem1
At this point thread 2 takes over and calls list.Get(). Assume workItem1's ExecuteTime has already passed, so the call succeeds and returns workItem1.
After this status is:
first = null, workItem1.next = null
(and in the other thread, nextItem is still workItem1).
Now we get back to the first thread, and it completes the Add() by setting workItem1.next:=workItem2.
If we call list.Get() now, we will get null, even though the Add() completed successfully.
You should probably look up a real peer-reviewed lock-free linked list algorithm. I think the standard one is this by John Valois. There is a C++ implementation here. This article on lock-free priority queues might also be of use.
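For a feel of what a correct compare-and-swap retry loop looks like, here is a sketch of a much simpler structure: a lock-free stack (often called a Treiber stack). It is deliberately not the time-ordered list from the question and not the Valois algorithm; LockFreeStack and StackDemo are hypothetical names. The key difference from the broken sequence above is that every change goes through a single CompareExchange on head, so an Add can never attach a node to something that has already been removed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class LockFreeStack<T>
{
    private sealed class Node
    {
        public readonly T Value;
        public Node Next; // written only before the node is published
        public Node(T value) { Value = value; }
    }

    private Node head;

    public void Push(T value)
    {
        var node = new Node(value);
        Node snapshot;
        do
        {
            snapshot = Volatile.Read(ref head);
            node.Next = snapshot;
            // Publish only if head is still what we saw; otherwise retry.
        }
        while (Interlocked.CompareExchange(ref head, node, snapshot) != snapshot);
    }

    public bool TryPop(out T value)
    {
        Node snapshot;
        do
        {
            snapshot = Volatile.Read(ref head);
            if (snapshot == null)
            {
                value = default(T);
                return false; // stack is empty
            }
            // Unlink only if head is still the node we inspected.
        }
        while (Interlocked.CompareExchange(ref head, snapshot.Next, snapshot) != snapshot);
        value = snapshot.Value;
        return true;
    }
}

public static class StackDemo
{
    // Pushes 4000 items from many threads, then pops them from many threads.
    public static int Run()
    {
        var stack = new LockFreeStack<int>();
        Parallel.For(0, 4000, i => stack.Push(i));

        int popped = 0;
        Parallel.For(0, 4000, i =>
        {
            int value;
            if (stack.TryPop(out value)) Interlocked.Increment(ref popped);
        });
        return popped;
    }
}
```

Nodes are never reused here and the garbage collector keeps a node alive while any thread still holds a reference to it, which sidesteps the ABA problem that makes this pattern harder in unmanaged languages. Removing an item from the middle of a list, as Get() in the question needs to, is considerably harder than popping the head.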

You can use a timestamping protocol for datastructures just fine, mirroring this example from the database world:
Concurrency
But be clear that each item needs both a read and write timestamp, and be sure to follow the rules of the algorithm clearly.
There are some additional difficulties of implementing this on a linked list though, I think. The database example would be fine for a vector where you know the array index of what you want. However, in a linked list, you may need to walk down the pointers -- and the structure of the list could change while you are searching! I guess you could solve that by some sort of nuance (or if you just want to traverse the "new" list as it is, do nothing), but it poses a problem. Try to solve it without introducing some rollback condition that makes it worse than locking the list!

So are you sure that it needs to be lockless? Depending on your work load the non-blocking solution can sometimes be slower. Check out this MSDN article for a little more. Also proving that a lockless data structure is correct can be very difficult.
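For comparison, a plain lock-based version of the list is short and easy to reason about. This is a sketch, assuming the IWorkItem interface from the question; LockedWorkList and DemoItem are hypothetical names, and Get takes the current time as a parameter so the behaviour is easy to test.

```csharp
using System;
using System.Collections.Generic;

public interface IWorkItem
{
    DateTime ExecuteTime { get; }
    bool Cancelled { get; }
    void Execute(DateTime now);
}

public sealed class LockedWorkList
{
    private readonly object gate = new object();
    private readonly LinkedList<IWorkItem> items = new LinkedList<IWorkItem>();

    public void Add(IWorkItem item)
    {
        lock (gate) { items.AddLast(item); }
    }

    // Removes and returns the first item that is due, or null if none is.
    public IWorkItem Get(DateTime now)
    {
        lock (gate)
        {
            for (var node = items.First; node != null; node = node.Next)
            {
                if (node.Value.ExecuteTime < now)
                {
                    IWorkItem due = node.Value;
                    items.Remove(node); // find and remove happen under one lock
                    return due;
                }
            }
            return null;
        }
    }
}

// Hypothetical work item used only to exercise the list.
public sealed class DemoItem : IWorkItem
{
    public DateTime ExecuteTime { get; set; }
    public bool Cancelled { get { return false; } }
    public void Execute(DateTime now) { }
}
```

Because finding and removing an item happen inside the same lock, an item cannot be lost between an Add and a Get. Measuring this against a lock-free version under a realistic workload is the only reliable way to tell which is faster.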

I am in no way an expert on the subject, but as far as I can see, you need to either make the ExecuteTime field in the implementation of IWorkItem volatile (of course it might already be that) or insert a memory barrier either after you set ExecuteTime or before you read it.
