List<> items lost during ToArray()? - c#

A while ago, I coded a system for collecting public transport disruptions. Information about any incident is collected in an MSSQL database, and consumers access the data by calling an .asmx web service. Data are fetched from the DB using ADO.NET; each data row populates a Deviation object, which is added to a List. In the service layer, ToArray() is called on the list and the result is returned to the consumer.
So far, so good. But the problem is that in some cases (5% or so), we have become aware that the array is somehow curtailed: instead of the usual 15-20 items, only half of them, or even fewer, are returned. The items that go missing are always at the end of the original list. And, even more rarely, a couple of items are repeated or shuffled at the beginning of the array.
After doing some testing on the different layers, it seems the curtailing occurs at the end of the process, i.e. during the conversion to an array or the SOAP serialization. But the code seems so innocent, huh?:
[WebMethod]
public Deviation[] GetDeviationsByTimeInterval(DateTime from, DateTime to)
{
    return DeviRoutines.GetDeviationsByTimeInterval(from, to).ToArray();
}
I am not 100% sure the error doesn't occur in the SQL or data access layer, but they have done their job correctly during testing. Any help on the subject would be greatly appreciated! :)

I'd do something like:
public Deviation[] GetDeviationsByTimeInterval(DateTime from, DateTime to)
{
    var v1 = DeviRoutines.GetDeviationsByTimeInterval(from, to);
    LogMe("v1: " + v1.Count);
    var v2 = v1.ToArray();
    LogMe("v2: " + v2.Length);
    return v2;
}
Proving what you expect usually pays off :-)

http://msdn.microsoft.com/en-us/library/x303t819.aspx
Type: T[]. An array containing copies of the elements of the List.
You didn't find a bug in .NET; it's most likely something in your GetDeviationsByTimeInterval.

I'd be willing to bet ToArray is doing exactly what it's told, but either your from or to values are occasionally junk (validation error?) or GetDeviationsByTimeInterval is misinterpreting them for some reason.
Stick some logging into both the web method and GetDeviationsByTimeInterval to see what values get passed in, and the next time it goes pear-shaped you'll be able to diagnose where the problem is.
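For example, a minimal sketch of that kind of logging (LogMe stands in for whatever logging facility you already use):

[WebMethod]
public Deviation[] GetDeviationsByTimeInterval(DateTime from, DateTime to)
{
    // Log the raw inputs so occasional junk values show up in the log.
    LogMe(string.Format("GetDeviationsByTimeInterval(from={0:o}, to={1:o})", from, to));

    var list = DeviRoutines.GetDeviationsByTimeInterval(from, to);
    LogMe("fetched: " + list.Count);

    var array = list.ToArray();
    LogMe("returned: " + array.Length);
    return array;
}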

Related

StackExchange.Redis HashScan returns all fields at once

I'm using the StackExchange.Redis SDK in C# and wish to scan my hash set.
I expected the SDK to behave like the redis client (when I execute "hscan myKey 0" it returns several key-value pairs plus a cursor that I use for the next scan). But when I call the SDK's HashScan method as follows:
redisCache.HashScan(myKey, pageSize: 10, cursor: 0)
it returns all the fields in "myKey" (there are 2,000 key-value pairs in it).
How can I make it return just a few results at a time?
In the future there will be millions of fields in "myKey"; if they are all returned at once it will cost a lot of memory. And will it block the online service? After all, redis is single-threaded.
Thanks!
It isn't doing quite what you think it is doing. The HashScan method here returns a custom iterator which maintains at most 2 pages of data; when you get near the end of one page, it fetches the next page automatically. Essentially, then, if you only want to read 20 items, just read 20 items; for example, LINQ's .Take(20) would work fine. If you call .ToList() on the iterator, then yes: it will walk from one end to the other, fetching data dynamically as it needs it. So: don't do that :)
Things it does not do:
fetch all the data in a single huge call to redis
perform lots of small calls to redis before returning from the HashScan method
As a side note: the custom iterator implements a custom interface to allow you to pick up and resume cursors, if you need that.
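To make that concrete, here is a minimal sketch (assuming a redis instance on localhost) of reading just a handful of entries; only a page or two is actually fetched from redis:

// requires: using System; using System.Linq; using StackExchange.Redis;
var muxer = ConnectionMultiplexer.Connect("localhost");
var db = muxer.GetDatabase();

// HashScan returns a lazy IEnumerable<HashEntry>; pages are pulled on demand,
// so Take(20) stops the iteration after at most a couple of SCAN calls.
foreach (var entry in db.HashScan("myKey", pageSize: 10).Take(20))
{
    Console.WriteLine("{0} = {1}", entry.Name, entry.Value);
}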

List queries 20 times faster than IQueryable?

Here is a test that I set up this evening. It was made to prove something different, but the outcome was not quite what I expected.
I'm running a test with 10,000 random queries against an IQueryable, and while testing I found out that if I do the same against a List, my test is 20 times faster.
See below. My CarBrandManager.GetList originally returns an IQueryable, but now I first call ToList(), and then it's way faster.
Can anyone tell me why I see this big difference?
var sw = new Stopwatch();
sw.Start();
int queries = 10000;
//IQueryable<Model.CarBrand> carBrands = CarBrandManager.GetList(context);
List<Model.CarBrand> carBrands = CarBrandManager.GetList(context).ToList();
Random random = new Random();
int randomChar = 65;
for (int i = 0; i < queries; i++)
{
    randomChar = random.Next(65, 90);
    Model.CarBrand carBrand = carBrands.Where(x => x.Name.StartsWith(((char)randomChar).ToString())).FirstOrDefault();
}
sw.Stop();
lblStopWatch.Text = String.Format("Queries: {0} Elapsed ticks: {1}", queries, sw.ElapsedTicks);
There are potentially two issues at play here. First: It's not obvious what type of collection is returned from GetList(context), apart from the knowledge that it implements IQueryable. That means when you evaluate the result, it could very well be creating an SQL query, sending that query to a database, and materializing the result into objects. Or it could be parsing an XML file. Or downloading an RSS feed or invoking an OData endpoint on the internet. These would obviously take more time than simply filtering a short list in memory. (After all, how many car brands can there really be?)
But let's suppose that the implementation it returns is actually a List, and therefore the only difference you're testing is whether it's cast as an IEnumerable or as an IQueryable. Compare the method signatures on the Enumerable class's extension methods with those on Queryable. When you treat the list as an IQueryable, you are passing in Expressions, which need to be evaluated, rather than just Funcs which can be run directly.
When you're using a custom LINQ provider like Entity Framework, this gives the framework the ability to evaluate the actual expression trees and produce a SQL query and materialization plan from them. However, LINQ to Objects just wants to evaluate the lambda expressions in-memory, so it has to either use reflection or compile the expressions into Funcs, both of which have a large performance hit associated with them.
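A tiny self-contained sketch of that difference, using string in place of Model.CarBrand:

// requires: using System; using System.Linq.Expressions;
Func<string, bool> func = name => name.StartsWith("A");             // a compiled delegate, ready to run
Expression<Func<string, bool>> expr = name => name.StartsWith("A"); // an expression tree describing the lambda

// LINQ to Objects cannot execute a tree directly; it must first turn it
// back into a delegate, and that compilation step is the overhead:
Func<string, bool> compiled = expr.Compile();
bool result = compiled("Audi"); // true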
You may be tempted to just call .ToList() or .AsEnumerable() on the result set to force it to use Funcs, but from an information hiding perspective this would be a mistake. You would be assuming that you know that the data returned from the GetList(context) method is some kind of in-memory object. That may be the case at the moment, or it may not. Regardless, it's not part of the contract that is defined for the GetList(context) method, and therefore you cannot assume it will always be that way. You have to assume that the type you get back could very well be something that you can query. And even though there are probably only a dozen car brands to search through at the moment, it's possible that some day there will be thousands (I'm talking in terms of programming practice here, not necessarily saying this is the case with the car industry). So you shouldn't assume that it will always be faster to download the entire list of cars and filter them in memory, even if that happens to be the case right now.
If CarBrandManager.GetList(context) might return an object backed by a custom LINQ provider (like an Entity Framework collection), then you probably want to leave the data typed as an IQueryable: even though your benchmark shows the list being 20 times faster, that difference is so small that no user is ever going to notice it. You may one day see performance gains of several orders of magnitude by calling .Where().Take().Skip() and only loading the data you really need from the data store, whereas you'd end up loading the whole table into your system's memory if you call .ToList() right off the bat.
However, if you know that CarBrandManager.GetList(context) will always return an in-memory list (as the name implies), it should be changed to return an IEnumerable<Model.CarBrand> instead of an IQueryable<Model.CarBrand>. Or, if you're on .NET 4.5, perhaps an IReadOnlyList<Model.CarBrand> or IReadOnlyCollection<Model.CarBrand>, depending on what contract you're willing to force your CarManager to abide by.
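For instance, assuming GetList(context) is backed by Entity Framework, a query like this sketch (reusing the question's names) only ever loads the rows it asks for:

// Translated by the provider into SQL with WHERE / ORDER BY / paging,
// instead of materializing the whole table first:
var page = CarBrandManager.GetList(context)
    .Where(b => b.Name.StartsWith("A"))
    .OrderBy(b => b.Name) // Skip/Take need a defined ordering
    .Skip(20)
    .Take(10)
    .ToList();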

code performance question

Let's say I have a relatively large list of an object MyObjectModel called MyBigList. One of the properties of MyObjectModel is an int called ObjectID. In theory, I think MyBigList could reach 15-20MB in size. I also have a table in my database that stores some scalars about this list so that it can be recomposed later.
What is going to be more efficient?
Option A:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int RowID = PutScalarsInDB(MyBigList);
Option B:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int TheCount = MyBigList.Count();
StringBuilder ListOfObjectID = new StringBuilder(); // must be initialized before Append
foreach (MyObjectModel ThisObject in MyBigList)
{
    ListOfObjectID.Append(ThisObject.ObjectID.ToString());
}
int RowID = PutScalarsInDB(TheCount, ListOfObjectID);
In option A I pass MyBigList to a function that extracts the scalars from the list, stores these in the DB and returns the row where these entries were made. In option B, I keep MyBigList in the page method where I do the extraction of the scalars and I just pass these to the PutScalarsInDB function.
Which is the better option, or is there perhaps a third that's better still? I'm concerned about passing around objects of this size and about memory usage.
I don't think you'll see a material difference between these two approaches. Note that passing MyBigList to a method passes a reference, not a copy of the list, so Option A doesn't actually move 15-20 MB around. From your description, it sounds like you'll be burning the same CPU cycles either way. The things that matter are:
Get the list
Iterate through the list to get the IDs
Iterate through the list to update the database
The order in which these three activities occur, and whether they occur within a single method or in a subroutine, doesn't matter. All other activities (declaring variables, assigning results, etc.) have zero to negligible performance impact.
Other things being equal, your first option may be slightly more performant because you'll only be iterating once, I assume, both extracting IDs and updating the database in a single pass. But the cost of iteration will likely be very small compared with the cost of updating the database, so it's not a performance difference you're likely to notice.
Having said all that, there are many, many more factors that may impact performance, such as the type of list you're iterating through, the speed of your connection to the database, etc., that could dwarf these considerations. It doesn't look like too much code either way; I'd strongly suggest building both and testing them.
Then let us know your results!
If you want to know which method performs better, you can use the Stopwatch class to measure the time each one needs. See here for Stopwatch usage: http://www.dotnetperls.com/stopwatch
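A minimal sketch, reusing the question's GetBigList and PutScalarsInDB:

// requires: using System; using System.Diagnostics;
var sw = Stopwatch.StartNew();
var list = GetBigList(/* some parameters */);
int rowId = PutScalarsInDB(list); // option A; repeat the same pattern for option B
sw.Stop();
Console.WriteLine("Option A took {0} ms", sw.ElapsedMilliseconds);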
I think there are other issues you need to consider for an ASP.NET application:
Where do you read your list from? If you read it from the database, would it be more efficient to do the work in the database, within a stored procedure?
Where is it stored? Is it only read and then discarded, or is it kept in session or application state?

Lockless list help!

Hi, I'm trying to write a lockless list. I've got the adding part working, I think, but the code that extracts objects from the list does not work too well :(
Well, the list is not a normal list. I have the interface IWorkItem:
interface IWorkItem
{
    DateTime ExecuteTime { get; }
    bool Cancelled { get; }
    void Execute(DateTime now);
}
and I have a list where I can add these :P The idea is that when I call Get() on the list, it should loop until it finds an IWorkItem where
if (item.ExecuteTime < DateTime.Now)
holds, remove it from the list, and return it.
I have run tests with many threads on my dual core CPU, and so far Add seems to work (it has never failed), but the Get function loses some work items somewhere and I have no idea what's wrong...
PS: If I get this working, anyone is free to use the code :) Well, you are anyway, but I don't see the point while it's bugged :P
The code is here http://www.easy-share.com/1903474734/LinkedList.zip and if you try to run it you will see that it sometimes cannot get as many work items out as were put into the list.
Edit: I have now got a lockless list working; it was faster than using the lock(obj) statement, but I have a lock object that uses Interlocked which still outperforms the lockless list. I'm going to try to make a lockless ArrayList and see if I get the same results there; when I'm done I'll upload the results here.
The problem is your algorithm: Consider this sequence of events:
Thread 1 calls list.Add(workItem1), which completes fully.
Status is:
first=workItem1, workItem1.next = null
Then thread 1 calls list.Add(workItem2) and reaches the spot right before the second Replace (where you have the comment "//lets try").
Status is:
first=workItem1, workItem1.next = null, nextItem=workItem1
At this point thread 2 takes over and calls list.Get(). Assume workItem1's executionTime is now, so the call succeeds and returns workItem1.
After this status is:
first = null, workItem1.next = null
(and in the other thread, nextItem is still workItem1).
Now we get back to the first thread, and it completes the Add() by setting workItem1.next = workItem2.
If we call list.Get() now, we will get null, even though the Add() completed successfully.
You should probably look up a real peer-reviewed lock-free linked list algorithm. I think the standard one is this by John Valois. There is a C++ implementation here. This article on lock-free priority queues might also be of use.
You can use a timestamping protocol for data structures just fine, mirroring this example from the database world:
Concurrency
But be clear that each item needs both a read and a write timestamp, and be sure to follow the rules of the algorithm carefully.
There are some additional difficulties in implementing this on a linked list, though. The database example works for a vector, where you know the array index of what you want. In a linked list, however, you may need to walk down the pointers, and the structure of the list could change while you are searching! I guess you could solve that with some sort of nuance (or, if you just want to traverse the "new" list as it is, do nothing), but it poses a problem. Try to solve it without introducing some rollback condition that makes it worse than locking the list!
So are you sure that it needs to be lockless? Depending on your workload, the non-blocking solution can sometimes be slower. Check out this MSDN article for a little more. Also, proving that a lockless data structure is correct can be very difficult.
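For comparison, a minimal sketch of the locked version of the semantics described above (list type and names assumed); this is far easier to verify than a lock-free variant:

// requires: using System; using System.Collections.Generic;
class LockedWorkList
{
    private readonly object gate = new object();
    private readonly LinkedList<IWorkItem> items = new LinkedList<IWorkItem>();

    public void Add(IWorkItem item)
    {
        lock (gate) { items.AddLast(item); }
    }

    public IWorkItem Get()
    {
        lock (gate)
        {
            // Find the first item that is due, unlink it, and return it.
            for (var node = items.First; node != null; node = node.Next)
            {
                if (node.Value.ExecuteTime < DateTime.Now)
                {
                    items.Remove(node);
                    return node.Value;
                }
            }
            return null; // nothing due yet
        }
    }
}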
I am in no way an expert on the subject, but as far as I can see, you need to make sure reads of the ExecuteTime field in the implementation of IWorkItem don't see stale values. Note that a DateTime field cannot be declared volatile in C#, so you would either store the ticks in a long and access it via Interlocked, or insert a memory barrier after you set ExecuteTime and before you read it.
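A sketch of the Interlocked variant (class and field names assumed):

// requires: using System; using System.Threading;
class WorkItem : IWorkItem
{
    // DateTime cannot be declared volatile, so the ticks live in a long and
    // all access goes through Interlocked for atomic, non-stale 64-bit reads/writes.
    private long executeTimeTicks;

    public DateTime ExecuteTime
    {
        get { return new DateTime(Interlocked.Read(ref executeTimeTicks)); }
    }

    public void SetExecuteTime(DateTime value)
    {
        Interlocked.Exchange(ref executeTimeTicks, value.Ticks);
    }

    public bool Cancelled { get; private set; }

    public void Execute(DateTime now)
    {
        // actual work goes here
    }
}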

using ThreadPools to search through object lists

I have these container objects (let's call them Container) in a list. Each Container object in turn has a list of DataItem objects (or derivates). In a typical scenario a user will have 15-20 Container objects with 1,000-5,000 DataItems each. Then there are some DataMatcher objects that can be used for different types of searches. These work mostly fine (I have several hundred unit tests on them), but in order to make my WPF application feel snappy and responsive, I decided to use the ThreadPool for this task. Thus I have a DataItemCommandRunner which runs on a Container object and performs each delegate in a list it takes as a parameter on each DataItem in turn; I use the ThreadPool to queue up one thread for each Container, so that the search should in theory be as efficient as possible on multi-core computers etc.
This is basically done in a DataItemUpdater class that looks something like this:
public class DataItemUpdater
{
    private Container ch;
    private IEnumerable<DataItemCommand> cmds;

    public DataItemUpdater(Container container, IEnumerable<DataItemCommand> commandList)
    {
        ch = container;
        cmds = commandList;
    }

    public void RunCommandsOnContainer(object useless)
    {
        Thread.CurrentThread.Priority = ThreadPriority.AboveNormal;
        foreach (DataItem di in ch.ItemList)
        {
            foreach (var cmd in cmds)
            {
                cmd(di);
            }
        }
        //Console.WriteLine("Done running for {0}", ch.DisplayName);
    }
}
(The useless object parameter for RunCommandsOnContainer is because I am experimenting with this with and without using threads, and one of them requires some parameter. Also, setting the priority to AboveNormal is just an experiment as well.)
This works fine for all but one scenario - when I use the AllWordsMatcher object type that will look for DataItem objects containing all words being searched for (as opposed to any words, exact phrase or regular expression for instance).
This is a pretty simple somestring.Contains(eachWord) based object, backed by unit tests. But herein lies some hairy strangeness.
When the RunCommandsOnContainer runs using ThreadPool threads, it will return insane results. Say I have a string like this:
var someString = "123123123 - just some numbers";
And I run this:
var res = someString.Contains("data");
When it runs, this will actually return true quite a lot. I have debugging information that shows it returning true for empty strings and other strings that simply do not contain the data. It will also sometimes return false even when the string actually contains the data being looked for.
The kicker in all this? Why do I suspect the ThreadPool and not my own code?
When I run the RunCommandsOnContainer() command for each Container in my main thread (i.e. locking the UI and everything), it works 100% correctly - every time! It never finds anything it shouldn't, and it never skips anything it should have found.
However, as soon as I use the ThreadPool, it starts finding a lot of items it shouldn't, while sometimes not finding items it should.
I realize this is a complex problem (it is painful trying to debug, that's for sure!), but any insight into why and how to fix this would be greatly appreciated!
Thanks!
Rune
It's a bit hard to see from the fragment you're posting, but judging by the symptoms I would look at the AllWordsMatcher (look for static state). If AllWordsMatcher is stateful, you should also check that you're creating a new instance for each thread.
More generally, I'd look at all the instances involved in the matching/searching process, specifically at the working objects being used when multithreaded. From past experience, the problem usually lies there. (It's easy to focus too much on the object graph representing your business data, Container/DataItem in this case.)
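To illustrate the kind of problem to look for, here is a hypothetical sketch (the question doesn't show AllWordsMatcher's internals) of shared scratch state producing exactly these symptoms:

// requires: using System.Linq;
class AllWordsMatcher
{
    // BROKEN: static scratch state is shared by all threads. Thread A can
    // overwrite the words while thread B is mid-search, producing both
    // false positives and false negatives.
    private static string[] currentWords;

    public bool IsMatch(string text, string[] words)
    {
        currentWords = words;
        return currentWords.All(word => text.Contains(word));
    }
}

The fix is to keep such state in locals or instance fields, and to give each worker thread its own matcher instance.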
