Let's say I have a relatively large list of an object MyObjectModel called MyBigList. One of the properties of MyObjectModel is an int called ObjectID. In theory, I think MyBigList could reach 15-20MB in size. I also have a table in my database that stores some scalars about this list so that it can be recomposed later.
What is going to be more efficient?
Option A:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int RowID = PutScalarsInDB(MyBigList);
Option B:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int TheCount = MyBigList.Count;
StringBuilder ListOfObjectID = new StringBuilder(); // must be instantiated, otherwise Append throws a NullReferenceException
foreach (MyObjectModel ThisObject in MyBigList)
{
ListOfObjectID.Append(ThisObject.ObjectID.ToString());
}
int RowID = PutScalarsInDB(TheCount, ListOfObjectID);
In option A I pass MyBigList to a function that extracts the scalars from the list, stores these in the DB and returns the row where these entries were made. In option B, I keep MyBigList in the page method where I do the extraction of the scalars and I just pass these to the PutScalarsInDB function.
Which is the better option, or is there yet another that is better? I'm concerned about passing around objects this size and about memory usage.
I don't think you'll see a material difference between these two approaches. From your description, it sounds like you'll be burning the same CPU cycles either way. The things that matter are:
Get the list
Iterate through the list to get the IDs
Iterate through the list to update the database
The order in which these three activities occur, and whether they occur within a single method or a subroutine, doesn't matter. All other activities (declaring variables, assigning results, etc.) have zero to negligible performance impact.
Other things being equal, your first option may be slightly more performant because you'll only be iterating once, I assume, both extracting IDs and updating the database in a single pass. But the cost of iteration will likely be very small compared with the cost of updating the database, so it's not a performance difference you're likely to notice.
Having said all that, there are many, many more factors that may impact performance, such as the type of list you're iterating through, the speed of your connection to the database, etc., that could dwarf these other considerations. It doesn't look like too much code either way. I'd strongly suggest building both and testing them.
Then let us know your results!
If you want to know which method performs better, you can use the Stopwatch class to measure the time each one takes. See here for Stopwatch usage: http://www.dotnetperls.com/stopwatch
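A minimal sketch of such a measurement, assuming each option is wrapped in a parameterless method. TimingSketch and Measure are hypothetical names introduced for illustration only; Stopwatch itself is the standard System.Diagnostics class.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs the given action exactly once and returns the elapsed wall-clock time.
    // Note: the first call of any method also includes JIT compilation time,
    // so a single measurement is only a rough indication.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Usage would look like TimeSpan a = TimingSketch.Measure(() => RunOptionA()); where RunOptionA is whatever method wraps Option A. Repeating the measurement several times and comparing the typical values is likely to be more reliable than a single run.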
I think there are other issues you need to check for an ASP.NET application:
Where do you read your list from? If you read it from the database, it might be more efficient to do the work in the database, within a stored procedure.
Where is it stored? Is it only read and destroyed or is it stored in session or application?
I'm building a multithreaded program that handles big data, and I wonder what I can do to tweak it.
Right now I have 50 000millions entries in a normal List, and as I use multithreading I use a lock statement.
public string getUsername()
{
string user = null;
lock (UsersToCheckExistList)
{
user = UsersToCheckExistList.First();
UsersToCheckExistList.Remove(user);
}
return user;
}
When I run smaller lists (500k lines) it works much faster. But when I load a bigger list (5-50 million) it starts to slow down. One way to solve this issue is to create many small lists dynamically and store them in a Dictionary, and this is the way I think I will go. But as I want to learn more about optimizing, I wonder if there is a better solution for this task?
All I want is to get a value from the collection and remove it from the collection at the same time.
You're using the wrong tools for the job - explicit locking is quite expensive, not to mention that the cost of removing the head of a List is O(Count). If you want a collection that is accessed concurrently it's best to use types in System.Collections.Concurrent, as they are heavily optimised for concurrent accesses. From your use case it seems you want a queue of users, so using a ConcurrentQueue:
ConcurrentQueue<string> UsersQueue;
public string getUsername()
{
string user = null;
UsersQueue.TryDequeue(out user);
return user;
}
The problem is that removing the first item from a list is O(n), so as your list grows it takes longer to remove the first item. You would probably be better off using a Queue instead. Since you need thread safety, you could use ConcurrentQueue, which handles efficient locking for you.
You can put them all in a ConcurrentBag (https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentbag-1?view=netframework-4.8). Each thread can then just use the TryTake method to grab one entry and remove it at the same time, so you don't need to worry about doing your own locking.
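A rough sketch of that TryTake pattern, assuming the work is spread over a fixed number of tasks. BagSketch, DrainInParallel and workerCount are hypothetical names for illustration; the per-user work is replaced by a simple counter. Keep in mind that a ConcurrentBag makes no promise about the order in which items come out.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class BagSketch
{
    // Starts workerCount tasks that each keep taking users until the bag is empty.
    // Returns the total number of users taken by all workers.
    public static int DrainInParallel(ConcurrentBag<string> users, int workerCount)
    {
        int taken = 0;
        Task[] workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Run(() =>
            {
                string user;
                // TryTake removes and returns one item atomically; no explicit lock needed.
                while (users.TryTake(out user))
                {
                    // Placeholder for the real per-user work.
                    Interlocked.Increment(ref taken);
                }
            });
        }
        Task.WaitAll(workers);
        return taken;
    }
}
```

If the order of processing matters, the same loop works with ConcurrentQueue and TryDequeue instead.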
If you have enough RAM for your data, you should definitely use ConcurrentQueue for FIFO access to your data.
But if you do not have enough RAM, you can try using a database. Modern databases can cache data very effectively, so you will have almost instant access to your data and spare the OS from swapping memory.
Today I noticed that when I run several LINQ statements on big data, the time taken may vary extremely.
Suppose we have a query like this:
var conflicts = features.Where(/* some condition */);
foreach (var c in conflicts) // log the conflicts
Where features is a list of objects representing rows in a table. These objects are quite complex, and even querying one simple property of them is a huge operation (including the actual database query, validation, state changes...), so I supposed that performing such a query would take a long time. Far from it: the first statement executes in quite a small amount of time, whereas simply looping over the results takes forever.
However, if I convert the collection retrieved by the LINQ expression to a List using ToList(), the first statement runs a bit slower and looping over the results is very fast. Having said this, the total duration of the second approach is much less than when not converting to a list.
var conflicts = features.Where(/* some condition */).ToList();
foreach (var c in conflicts) // log the conflicts
So I suppose that var conflicts = features.Where does not actually query but only prepares the data. But I do not understand why converting to a list and looping afterwards is so much faster. That's the actual question.
Has anybody an explanation for this?
This statement just declares your intention:
var conflicts = features.Where(...);
to get the data that fulfils the criteria in the Where clause. Then, when you write this:
foreach (var c in conflicts)
the actual query is executed and starts returning results. This is called lazy loading. Another term we use for this is deferred execution. We defer the execution of the query until we need its data.
On the other hand, if you had done something like this:
var conflicts = features.Where(...).ToList();
an in-memory collection would have been created, in which the results of the query would have been stored. In this case, the query would have been executed immediately.
Generally speaking, as you can read on Wikipedia:
Lazy loading is a design pattern commonly used in computer programming
to defer initialization of an object until the point at which it is
needed. It can contribute to efficiency in the program's operation if
properly and appropriately used. The opposite of lazy loading is eager
loading.
Update
And I suppose this in-memory collection is much faster than when doing
lazy load?
Here is a great article that answers your question.
Welcome to the wonderful world of lazy evaluation. With LINQ the query is not executed until the result is needed. Before you try to get the result (ToList() gets the result and puts it in a list) you are just creating the query. Think of it as writing code vs running the program. This may be confusing, and it may cause the code to execute at unexpected times, or even multiple times if, for example, you foreach over the query twice, but it is actually a good thing. It allows you to have a piece of code that returns a query (not the result but the actual query) and have another piece of code create a new query based on the original query. For example, you may add additional filters on top of the original, or page it.
The performance difference you are seeing is basically the database call happening at different places in your code.
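The effect can be observed with a small in-memory stand-in, sketched below under the assumption that a counter inside the predicate plays the role of the expensive per-row work. DeferredSketch and CountPredicateCalls are hypothetical names; this is LINQ to Objects, not a database provider, so it only illustrates when the work happens.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DeferredSketch
{
    // Returns the number of predicate evaluations observed
    // { after building the query, after the first loop, after the second loop }.
    public static int[] CountPredicateCalls(bool materialize)
    {
        int calls = 0;
        int[] features = { 1, 2, 3, 4, 5 };

        // Building the query does not evaluate the predicate yet.
        IEnumerable<int> conflicts = features.Where(f => { calls++; return f % 2 == 1; });

        // ToList() runs the query once and stores the results in memory.
        if (materialize)
        {
            conflicts = conflicts.ToList();
        }
        int afterBuild = calls;

        foreach (int c in conflicts) { }
        int afterFirstLoop = calls;

        foreach (int c in conflicts) { }
        int afterSecondLoop = calls;

        return new[] { afterBuild, afterFirstLoop, afterSecondLoop };
    }
}
```

Without ToList() the counts are 0, 5, 10: nothing runs when the query is built, and every loop re-runs the filter. With ToList() they are 5, 5, 5: the work is paid once up front and the loops only read the stored list.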
I'm looking to retrieve the last row of a table by the table's ID column. What I am currently using works:
var x = db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault();
Is there any way to get the same result more efficiently?
I cannot see that this queries through the entire table.
Do you not have an index on the ID column?
Can you add the results of analysing the query to your question, because this is not how it should be.
As well as the analysis results, the SQL produced. I can't see how it would be anything other than select top 1 * from MyTable order by id desc only with explicit column names and some aliasing. Nor if there's an index on id how it's anything other than a scan on that index.
Edit: That promised explanation.
Linq gives us a set of common interfaces, and in the case of C# and VB.NET some keyword support, for a variety of operations upon sources which return 0 or more items (e.g. in-memory collections, database calls, parsing of XML documents, etc.).
This allows us to express similar tasks regardless of the underlying source. Your query for example includes the source, but we could do a more general form of:
public static YourType FinalItem(IQueryable<YourType> source)
{
return source.OrderByDescending(d => d.ID).FirstOrDefault();
}
Now, we could do:
IEnumerable<YourType> l = SomeCallThatGivesUsAList();
var x = FinalItem(db.MyTable);//same as your code.
var y = FinalItem(l.AsQueryable());//item in list with highest id.
var z = FinalItem(db.MyTable.Where(d => d.ID % 10 == 0));//item with highest id that ends in zero.
But the really important part, is that while we've a means of defining the sort of operation we want done, we can have the actual implementation hidden from us.
The call to OrderByDescending produces an object that has information on its source, and the lambda function it will use in ordering.
The call to FirstOrDefault in turn has information on that, and uses it to obtain a result.
In the case with the list, the implementation is to produce the equivalent Enumerable-based code (Queryable and Enumerable mirror each other's public members, as do the interfaces they use such as IOrderedQueryable and IOrderedEnumerable and so on).
This is because, with a list that we don't know is already sorted in the order we care about (or in the opposite order), there isn't any faster way than to examine each element. The best we can hope for is an O(n) operation, and we might get an O(n log n) operation - depending on whether the implementation of the ordering is optimised for the possibility of only one item being taken from it*.
Or to put it another way, the best we could hand-code in code that only worked on enumerables is only slightly more efficient than:
public static YourType FinalItem(IEnumerable<YourType> source)
{
YourType highest = default(YourType);
int highestID = int.MinValue;
foreach(YourType item in source)
{
int curID = item.ID;
if(highest == null || curID > highestID)
{
highest = item;
highestID = curID;
}
}
return highest;
}
We can do slightly better with some micro-opts on handling the enumerator directly, but only slightly and the extra complication would just make for less-good example code.
Since we can't do any better than that by hand, and since the linq code doesn't know anything more about the source than we do, that's the best we could possibly hope for it matching. It might do less well (again, depending on whether the special case of our only wanting one item was thought of or not), but it won't beat it.
However, this is not the only approach linq will ever take. It'll take a comparable approach with an in-memory enumerable source, but your source isn't such a thing.
db.MyTable represents a table. To enumerate through it gives us the results of an SQL query more or less equivalent to:
SELECT * FROM MyTable
However, db.MyTable.OrderByDescending(d => d.ID) is not the equivalent of calling that, and then ordering the results in memory. Because queries get processed as a whole when they are executed, we actually get the result of an SQL query more or less like:
SELECT * FROM MyTable ORDER BY id DESC
Finally, the entire query db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault() results in a query like:
SELECT TOP 1 * FROM MyTable ORDER BY id DESC
Or
SELECT * FROM MyTable ORDER BY id DESC LIMIT 1
Depending upon what sort of database server you are using. Then the results get passed to code equivalent to the following ADO.NET-based code:
return dataReader.Read() ?
new MyType{ID = dataReader.GetInt32(0), dataReader.GetInt32(1), dataReader.GetString(2)}//or similar
: null;
You can't get much better.
And as for that SQL query. If there's an index on the id column (and since it looks like a primary key, there certainly should be), then that index will be used to very quickly find the row in question, rather than examining each row.
In all, because different linq providers use different means to fulfil the query, they can all try their best to do so in the best way possible. Of course, being in an imperfect world we'll no doubt find that some are better than others. What's more, they can even work to pick the best approach for different conditions. One example of this is that database-related providers can choose different SQL to take advantage of features of different versions of databases. Another is that the implementation of the version of Count() that works with in-memory enumerations works a bit like this:
public static int Count<T>(this IEnumerable<T> source)
{
var asCollT = source as ICollection<T>;
if(asCollT != null)
return asCollT.Count;
var asColl = source as ICollection;
if(asColl != null)
return asColl.Count;
int tally = 0;
foreach(T item in source)
++tally;
return tally;
}
This is one of the simpler cases (and a bit simplified again in my example here, I'm showing the idea not the actual code), but it shows the basic principle of code taking advantage of more efficient approaches when they're available - the O(1) length of arrays and the Count property on collections that is sometimes O(1) and it's not like we've made things worse in the cases where it's O(n) - and then when they aren't available falling back to a less efficient but still functional approach.
The result of all of this is that Linq tends to give very good bang for buck, in terms of performance.
Now, a decent coder should be able to match or beat its approach to any given case most of the time†, and even when Linq comes up with the perfect approach there are some overheads to it itself.
However, over the scope of an entire project, using Linq means that we can concisely create reasonably efficient code that relates to a relatively constrained number of well defined entities (generally one per table as far as databases go). In particular, the use of anonymous functions and joins means that we get queries that are very good. Consider:
var result = from a in db.Table1
join b in db.Table2
on a.relatedBs equals b.id
select new {a.id, b.name};
Here we're ignoring columns we don't care about, and the SQL produced will do the same. Consider what we would do if we were creating the objects that a and b relate to with hand-coded DAO classes:
1. Create a new class to represent this combination of a's id and b's name, and relevant code to run the query we need to produce instances.
2. Run a query to obtain all information about each a and the related b, and live with the waste.
3. Run a query to obtain the information on each a and b that we care about, and just set default values for the other fields.
Of these, option 2 will be wasteful, perhaps very wasteful. Option 3 will be a bit wasteful and very error prone (what if we accidentally try to use a field elsewhere in the code that wasn't set correctly?). Only option 1 will be more efficient than what the linq approach will produce, but this is just one case. Over a large project this could mean producing dozens or even hundreds or thousands of slightly different classes (and unlike the compiler, we won't necessarily spot the cases where they're actually the same). In practice, therefore, linq can do us some great favours when it comes to efficiency.
Good policies for efficient linq are:
1. Stay with the type of query you start with as long as you can. Whenever you grab items into memory with ToList() or ToArray etc, consider if you really need to. Unless you need to or you can clearly state the advantage doing so gives you, don't.
2. If you do need to move to processing in memory, favour AsEnumerable() over ToList() and the other means, so you only grab one at a time.
3. Examine long-running queries with SQLProfiler or similar. There are a handful of cases where policy 1 here is wrong and moving to memory with AsEnumerable() is actually better (most relate to uses of GroupBy that don't use aggregates on the non-grouped fields, and hence don't actually have a single SQL query they correspond with).
4. If a complicated query is hit many times, then CompiledQuery can help (less so with 4.5 since it has automatic optimisations that cover some of the cases it helps in), but it's normally better to leave that out of the first approach and then use it only in hot-spots that are efficiency problems.
5. You can get EF to run arbitrary SQL, but avoid it unless it's a strong gain because too much such code reduces the consistent readability using a linq approach throughout gives (I have to say though, I think Linq2SQL beats EF on calling stored procedures and even more so on calling UDFs, but even there this still applies - it's less clear from just looking at the code how things relate to each other).
*AFAIK, this particular optimisation isn't applied, but we're talking about the best possible implementation at this point, so it doesn't matter if it is, isn't, or is in some versions only.
†I'll admit though that Linq2SQL would often produce queries that use APPLY that I would not think of, as I was used to thinking of how to write queries in versions of SQLServer before 2005 introduced it, while code doesn't have those sort of human tendencies to go with old habits. It pretty much taught me how to use APPLY.
I have objects that have a DateTime property. How can I query for the oldest object?
After asking on the db4o forum, I got this answer:
It's quite easy: create a sorted SODA-Query and take the first / last object from the resulting ObjectSet. Don't iterate the ObjectSet (therefore the objects won't be activated), just take the required object directly via #ObjectSet.Get(index).
Please note: db4o supports just a limited set of performant sortings (alphabetical, numbers, object ids) in query execution, so maybe you have to store your DateTime as milliseconds to achieve good performance.
First of all, your object needs to keep track of the time itself, so it depends on your requirements:
class Customer
{
public DateTime DateSignedUp {get; private set;}
// ...
}
Now, you can query for the object in whatever way you like, using Linq, SODA, or Native Queries, e.g.
IObjectContainer container = ...;
Customer oldestCustomer = container.Query<Customer>().OrderBy(p => p.DateSignedUp).First();
However, there is a set of pitfalls:
Don't use DateTime in your persisted object. I have had massive problems with them. I can't reproduce the problem so I couldn't report it yet, but I can personally not recommend using them. Use a long instead and copy the ticks from the respective DateTime. Store all times in UTC, unless you're explicitly referring to local time, such as in the case of bus schedules.
Put an index on the time
The order operation could be very, very slow for large numbers of objects because of issue COR-1133. If you have a large number of objects and you know the approximate age of the object, try to impose a where constraint, because that will be fast. See also my blog post regarding that performance issue, which can become very annoying already at ~50-100k objects.
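The first pitfall (store a long holding UTC ticks rather than a DateTime) could be sketched like this. TimestampedCustomer and SignedUpTicksUtc are hypothetical names, and how db4o handles the read-only convenience property has not been verified here; the point is only the conversion in both directions.

```csharp
using System;

public class TimestampedCustomer
{
    // The value intended for persistence: ticks of the sign-up time in UTC.
    public long SignedUpTicksUtc { get; private set; }

    public TimestampedCustomer(DateTime signedUp)
    {
        // ToUniversalTime() treats a DateTime of unspecified kind as local time,
        // so callers should pass a value whose Kind is set explicitly.
        SignedUpTicksUtc = signedUp.ToUniversalTime().Ticks;
    }

    // Convenience view rebuilt from the stored ticks; always reported as UTC.
    public DateTime DateSignedUp
    {
        get { return new DateTime(SignedUpTicksUtc, DateTimeKind.Utc); }
    }
}
```

Ordering or filtering would then be done on SignedUpTicksUtc, which is a plain number and so falls into the kind of sorting the forum answer above describes as performant.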
Best,
Chris
I have a huge dictionary of blank values in a variable called current like so:
struct movieuser {blah blah blah}
Dictionary<movieuser, float> questions = new Dictionary<movieuser, float>();
So I am looping through this dictionary and need to fill in the "answers", like so:
for(var k = questions.Keys.GetEnumerator();k.MoveNext(); )
{
questions[k.Current] = retrieveGuess(k.Current.userID, k.Current.movieID);
}
Now, this doesn't work, because I get an InvalidOperationException from trying to modify the dictionary I am looping through. However, you can see that the code should work fine - since I am not adding or deleting any values, just modifying the value. I understand, however, why it is afraid of my attempting this.
What is the preferred way of doing this? I can't figure out a way to loop through a dictionary WITHOUT using iterators.
I don't really want to create a copy of the whole array, since it is a lot of data and will eat up my RAM like it's still Thanksgiving.
Thanks,
Dave
Matt's answer, getting the keys first, separately is the right way to go. Yes, there'll be some redundancy - but it will work. I'd take a working program which is easy to debug and maintain over an efficient program which either won't work or is hard to maintain any day.
Don't forget that if you make MovieUser a reference type, the array will only be the size of as many references as you've got users - that's pretty small. A million users will only take up 4MB or 8MB on x64. How many users have you really got?
Your code should therefore be something like:
IEnumerable<MovieUser> users = RetrieveUsers();
IDictionary<MovieUser, float> questions = new Dictionary<MovieUser, float>();
foreach (MovieUser user in users)
{
questions[user] = RetrieveGuess(user);
}
If you're using .NET 3.5 (and can therefore use LINQ), it's even easier:
IDictionary<MovieUser, float> questions =
RetrieveUsers().ToDictionary(user => user, user => RetrieveGuess(user));
Note that if RetrieveUsers() can stream the list of users from its source (e.g. a file) then it will be efficient anyway, as you never need to know about more than one of them at a time while you're populating the dictionary.
A few comments on the rest of your code:
Code conventions matter. Capitalise the names of your types and methods to fit in with other .NET code.
You're not calling Dispose on the IEnumerator<T> produced by the call to GetEnumerator. If you just use foreach your code will be simpler and safer.
MovieUser should almost certainly be a class. Do you have a genuinely good reason for making it a struct?
Is there any reason you can't just populate the dictionary with both keys and values at the same time?
foreach(var key in someListOfKeys)
{
questions.Add(key, retrieveGuess(key.userID, key.movieID));
}
Store the dictionary keys in a temporary collection, then loop over the temporary collection and use each key as your indexer parameter. This should get you around the exception.
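A minimal sketch of that approach, assuming the lookup is passed in as a delegate. DictionaryFillSketch and FillValues are hypothetical names, and the retrieveGuess parameter stands in for the question's retrieveGuess call.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryFillSketch
{
    // Copies the keys into a temporary list first, then assigns every value.
    // The loop enumerates the copy, not the dictionary, so assigning through
    // the indexer cannot invalidate the enumerator being used.
    public static void FillValues<TKey>(
        Dictionary<TKey, float> questions,
        Func<TKey, float> retrieveGuess)
    {
        List<TKey> keys = new List<TKey>(questions.Keys);
        foreach (TKey key in keys)
        {
            questions[key] = retrieveGuess(key);
        }
    }
}
```

Only the keys are copied, not the values, so the extra memory is one list of keys for the duration of the loop.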