IDataReader and "HasColumn", Best approach? - c#

I've seen two common approaches for checking if a column exists in an IDataReader:
public bool HasColumn(IDataReader reader, string columnName)
{
    try
    {
        reader.GetOrdinal(columnName);
        return true;
    }
    catch
    {
        return false;
    }
}
Or:
public bool HasColumn(IDataReader reader, string columnName)
{
    reader.GetSchemaTable()
          .DefaultView.RowFilter = "ColumnName='" + columnName + "'";
    return (reader.GetSchemaTable().DefaultView.Count > 0);
}
Personally, I've used the second one, as I hate using exceptions for this kind of check.
However, on a large dataset, I believe RowFilter might have to do a table scan per column, and this may be incredibly slow.
Thoughts?

I think I have a reasonable answer for this old gem.
I would go with the first approach because it's much simpler. If you want to avoid the exception, you can cache the field names and do a TryGet on the cache.
public Dictionary<string,int> CacheFields(IDataReader reader)
{
    var cache = new Dictionary<string,int>();
    for (int i = 0; i < reader.FieldCount; i++)
    {
        cache[reader.GetName(i)] = i;
    }
    return cache;
}
The upside of this approach is that it is simpler and gives you better control. Also, note that you may want to look into case-insensitive or kana-insensitive compares, which would make things a little trickier.
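For example, here is a minimal sketch of the exception-free lookup with a case-insensitive cache ("CustomerId" is just a placeholder column name, and the comparer choice is an assumption to adjust to your naming rules):
// Build the field cache once per reader, using a case-insensitive comparer.
var fields = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
for (int i = 0; i < reader.FieldCount; i++)
{
    fields[reader.GetName(i)] = i;
}
// No exception and no schema table: just a dictionary probe.
int ordinal;
if (fields.TryGetValue("CustomerId", out ordinal))
{
    object value = reader.GetValue(ordinal);
}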

A lot depends on how you're using HasColumn. Are you calling it just once or twice, or repeatedly in a loop? Is the column likely to be there or is that completely unknown in advance?
Setting a row filter probably would do a table scan each time. (Also, in theory, GetSchemaTable() could generate an entirely new table with every call, which would be even more expensive -- I don't believe SqlDataReader does this, but at the IDataReader level, who knows?) But if you only call it once or twice I can't imagine this being that much of an issue (unless you have thousands of columns or something).
(I would, however, at least store the result of GetSchemaTable() in a local var within the method to avoid calling it twice in quick succession, if not cache it somewhere on the off chance that your particular IDataReader DOES regenerate it.)
If you know in advance that under normal circumstances the column you ask for will be present, the exception method is a bit more palatable (because the column not being there is, in fact, an exceptional case). Even if not, it might perform slightly better, but again unless you're calling it repeatedly you should ask yourself if performance is really that much of a concern.
And if you ARE calling it repeatedly, you probably should consider a different approach anyway, such as: call GetSchemaTable() once up front, loop through the table, and load the field names into a Dictionary or some other structure that is designed for fast lookups.
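A rough sketch of that idea, assuming the provider populates the standard ColumnName column in its schema table ("Name" is just a placeholder column here):
// using System.Data;
// Call GetSchemaTable() once, up front, and keep only what we need for lookups.
DataTable schema = reader.GetSchemaTable();
var columns = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (DataRow row in schema.Rows)
{
    columns.Add((string)row["ColumnName"]);
}
// Repeated checks are now cheap set probes instead of row filters or exceptions.
bool hasName = columns.Contains("Name");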

I wouldn't worry about the performance impact. Even if you had a table with 1000 columns (which would be an enormous table), you are still only doing a "table scan" of 1000 rows. That is likely to be trivial.
Premature optimization will just lead you toward an unnecessarily complex implementation. Implement the version that seems best to you, and then measure the performance impact. If it is unacceptable compared to your performance requirements, then consider alternatives.

Related

What's the most efficient way to get only the final row of a SQL table using EF4?

I'm looking to retrieve the last row of a table by the table's ID column. What I am currently using works:
var x = db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault();
Is there any way to get the same result more efficiently?
I can't see how this would query through the entire table.
Do you not have an index on the ID column?
Can you add the results of analysing the query to your question, because this is not how it should be.
As well as the analysis results, the SQL produced. I can't see how it would be anything other than select top 1 * from MyTable order by id desc only with explicit column names and some aliasing. Nor if there's an index on id how it's anything other than a scan on that index.
Edit: That promised explanation.
Linq gives us a set of common interfaces, and in the case of C# and VB.NET some keyword support, for a variety of operations upon sources which return 0 or more items (e.g. in-memory collections, database calls, parsing of XML documents, etc.).
This allows us to express similar tasks regardless of the underlying source. Your query for example includes the source, but we could do a more general form of:
public static YourType FinalItem(IQueryable<YourType> source)
{
    return source.OrderByDescending(d => d.ID).FirstOrDefault();
}
Now, we could do:
IEnumerable<YourType> l = SomeCallThatGivesUsAList();
var x = FinalItem(db.MyTable);//same as your code.
var y = FinalItem(l);//item in list with highest id.
var z = FinalItem(db.MyTable.Where(d => d.ID % 10 == 0));//item with highest id that ends in zero.
But the really important part is that, while we have a means of defining the sort of operation we want done, we can have the actual implementation hidden from us.
The call to OrderByDescending produces an object that has information on its source, and the lambda function it will use in ordering.
The call to FirstOrDefault in turn has information on that, and uses it to obtain a result.
In the case with the list, the implementation is to produce the equivalent Enumerable-based code (Queryable and Enumerable mirror each other's public members, as do the interfaces they use such as IOrderedQueryable and IOrderedEnumerable and so on).
This is because, with a list that we don't know is already sorted in the order we care about (or in the opposite order), there isn't any faster way than to examine each element. The best we can hope for is an O(n) operation, and we might get an O(n log n) operation - depending on whether the implementation of the ordering is optimised for the possibility of only one item being taken from it*.
Or to put it another way, the best we could hand-code in code that only worked on enumerables is only slightly more efficient than:
public static YourType FinalItem(IEnumerable<YourType> source)
{
    YourType highest = default(YourType);
    int highestID = int.MinValue;
    foreach(YourType item in source)
    {
        int curID = item.ID;
        if(highest == null || curID > highestID)
        {
            highest = item;
            highestID = curID;
        }
    }
    return highest;
}
We can do slightly better with some micro-opts on handling the enumerator directly, but only slightly and the extra complication would just make for less-good example code.
Since we can't do any better than that by hand, and since the linq code doesn't know anything more about the source than we do, that's the best we could possibly hope for it matching. It might do less well (again, depending on whether the special case of our only wanting one item was thought of or not), but it won't beat it.
However, this is not the only approach linq will ever take. It'll take a comparable approach with an in-memory enumerable source, but your source isn't such a thing.
db.MyTable represents a table. To enumerate through it gives us the results of an SQL query more or less equivalent to:
SELECT * FROM MyTable
However, db.MyTable.OrderByDescending(d => d.ID) is not the equivalent of calling that, and then ordering the results in memory. Because queries get processed as a whole when they are executed, we actually get the result of an SQL query more or less like:
SELECT * FROM MyTable ORDER BY id DESC
Finally, the entire query db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault() results in a query like:
SELECT TOP 1 * FROM MyTable ORDER BY id DESC
Or
SELECT * FROM MyTable ORDER BY id DESC LIMIT 1
Depending upon what sort of database server you are using. Then the results get passed to code equivalent to the following ADO.NET-based code:
return dataReader.Read()
    ? new MyType{ ID = dataReader.GetInt32(0), SomeInt = dataReader.GetInt32(1), SomeString = dataReader.GetString(2) }//or similar
    : null;
You can't get much better.
And as for that SQL query. If there's an index on the id column (and since it looks like a primary key, there certainly should be), then that index will be used to very quickly find the row in question, rather than examining each row.
In all, because different linq providers use different means to fulfil the query, they can all try their best to do so in the best way possible. Of course, being in an imperfect world we'll no doubt find that some are better than others. What's more, they can even work to pick the best approach for different conditions. One example of this is that database-related providers can choose different SQL to take advantage of features of different versions of databases. Another is that the implementation of the version of Count() that works with in-memory enumerations works a bit like this:
public static int Count<T>(this IEnumerable<T> source)
{
    var asCollT = source as ICollection<T>;
    if(asCollT != null)
        return asCollT.Count;
    var asColl = source as ICollection;
    if(asColl != null)
        return asColl.Count;
    int tally = 0;
    foreach(T item in source)
        ++tally;
    return tally;
}
This is one of the simpler cases (and simplified a bit further in my example here; I'm showing the idea, not the actual code), but it shows the basic principle of code taking advantage of more efficient approaches when they're available - the O(1) length of arrays, and the Count property on collections that is sometimes O(1), without making things worse in the cases where it's O(n) - and falling back to a less efficient but still functional approach when they aren't.
The result of all of this is that Linq tends to give very good bang for buck, in terms of performance.
Now, a decent coder should be able to match or beat its approach to any given case most of the time†, and even when Linq comes up with the perfect approach there are some overheads to it itself.
However, over the scope of an entire project, using Linq means that we can concisely create reasonably efficient code that relates to a relatively constrained number of well-defined entities (generally one per table as far as databases go). In particular, the use of anonymous types and joins means that we get queries that are very good. Consider:
var result = from a in db.Table1
             join b in db.Table2
               on a.relatedBs equals b.id
             select new { a.id, b.name };
Here we're ignoring columns we don't care about, and the SQL produced will do the same. Consider what we would do if we were creating the objects that a and b relate to with hand-coded DAO classes:
1. Create a new class to represent this combination of a's id and b's name, and the relevant code to run the query we need to produce instances.
2. Run a query to obtain all information about each a and the related b, and live with the waste.
3. Run a query to obtain only the information on each a and b that we care about, and just set default values for the other fields.
Of these, option 2 will be wasteful, perhaps very wasteful. Option 3 will be a bit wasteful and very error prone (what if we accidentally try to use a field elsewhere in the code that wasn't set correctly?). Only option 1 will be more efficient than what the linq approach will produce, but this is just one case. Over a large project this could mean producing dozens or even hundreds or thousands of slightly different classes (and unlike the compiler, we won't necessarily spot the cases where they're actually the same). In practice, therefore, linq can do us some great favours when it comes to efficiency.
Good policies for efficient linq are:
Stay with the type of query you start with as long as you can. Whenever you grab items into memory with ToList() or ToArray etc, consider if you really need to. Unless you need to or you can clearly state the advantage doing so gives you, don't.
If you do need to move to processing in memory, favour AsEnumerable() over ToList() and the other means, so you only grab one item at a time.
Examine long-running queries with SQLProfiler or similar. There are a handful of cases where policy 1 here is wrong and moving to memory with AsEnumerable() is actually better (most relate to uses of GroupBy that don't use aggregates on the non-grouped fields, and hence don't actually have a single SQL query they correspond with).
If a complicated query is hit many times, then CompiledQuery can help (less so with 4.5 since it has automatic optimisations that cover some of the cases it helps in), but it's normally better to leave that out of the first approach and then use it only in hot-spots that are efficiency problems.
You can get EF to run arbitrary SQL, but avoid it unless it's a strong gain, because too much such code reduces the consistent readability that using a linq approach throughout gives. (I have to say, though, that I think Linq2SQL beats EF on calling stored procedures, and even more so on calling UDFs; but even there this still applies - it's less clear from just looking at the code how things relate to each other.)
*AFAIK, this particular optimisation isn't applied, but we're talking about the best possible implementation at this point, so it doesn't matter if it is, isn't, or is in some versions only.
†I'll admit though that Linq2SQL would often produce queries that use APPLY that I would not think of, as I was used to thinking of how to write queries in versions of SQLServer before 2005 introduced it, while code doesn't have those sort of human tendencies to go with old habits. It pretty much taught me how to use APPLY.

code performance question

Let's say I have a relatively large list of an object MyObjectModel called MyBigList. One of the properties of MyObjectModel is an int called ObjectID. In theory, I think MyBigList could reach 15-20MB in size. I also have a table in my database that stores some scalars about this list so that it can be recomposed later.
What is going to be more efficient?
Option A:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int RowID = PutScalarsInDB(MyBigList);
Option B:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int TheCount = MyBigList.Count();
StringBuilder ListOfObjectID = new StringBuilder();
foreach (MyObjectModel ThisObject in MyBigList)
{
    ListOfObjectID.Append(ThisObject.ObjectID.ToString());
}
int RowID = PutScalarsInDB(TheCount, ListOfObjectID);
In option A I pass MyBigList to a function that extracts the scalars from the list, stores these in the DB and returns the row where these entries were made. In option B, I keep MyBigList in the page method where I do the extraction of the scalars and I just pass these to the PutScalarsInDB function.
Which is the better option, or is there yet another that's better still? I'm concerned about passing around objects of this size and about memory usage.
I don't think you'll see a material difference between these two approaches. From your description, it sounds like you'll be burning the same CPU cycles either way. The things that matter are:
Get the list
Iterate through the list to get the IDs
Iterate through the list to update the database
The order in which these three activities occur, and whether they occur within a single method or a subroutine, doesn't matter. All other activities (declaring variables, assigning results, etc.,) are of zero to negligible performance impact.
Other things being equal, your first option may be slightly more performant because you'll only be iterating once, I assume, both extracting IDs and updating the database in a single pass. But the cost of iteration will likely be very small compared with the cost of updating the database, so it's not a performance difference you're likely to notice.
Having said all that, there are many, many more factors that may impact performance, such as the type of list you're iterating through, the speed of your connection to the database, etc., that could dwarf these other considerations. It doesn't look like too much code either way. I'd strongly suggest building both and testing them.
Then let us know your results!
If you want to know which method performs better, you can use the Stopwatch class to check the time each method needs. See here for Stopwatch usage: http://www.dotnetperls.com/stopwatch
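A minimal sketch of that kind of timing, using the variables from the question (the two PutScalarsInDB calls are just placeholders for Option A and Option B):
// using System.Diagnostics;
Stopwatch swA = Stopwatch.StartNew();
int RowIdA = PutScalarsInDB(MyBigList);                 // Option A (placeholder call)
swA.Stop();

Stopwatch swB = Stopwatch.StartNew();
int RowIdB = PutScalarsInDB(TheCount, ListOfObjectID);  // Option B (placeholder call)
swB.Stop();

Console.WriteLine("Option A: " + swA.ElapsedMilliseconds + " ms");
Console.WriteLine("Option B: " + swB.ElapsedMilliseconds + " ms");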
I think there are other issues you need to verify for an ASP.NET application:
Where do you read your list from? If you read it from the database, would it be more efficient to do the work in the database within a stored procedure?
Where is it stored? Is it only read and then discarded, or is it kept in session or application state?

Which LINQ query is more effective?

I have a huge IEnumerable (suppose the name is myItems). Which way is more effective?
Solution 1: Filter it first then ForEach.
Array.ForEach(myItems.Where(FILTER-IT-HERE).ToArray(),MY-ACTION);
Solution 2: Do RETURN in MY-ACTION if the item is not up to the mustard.
Array.ForEach(myItems.ToArray(),MY-ACTION-WITH-FILTER);
Is one of them always better than another? Or any other good suggestions? Thanks in advance.
Did you do any measurements? Since we can't measure the run time of MY-ACTION, only you can. Measure and decide.
Sometimes one has to create benchmarks, because similar-looking activities can produce radically different and unexpected results.
You do not say what your data source is so I'm going to assume it may be data on an SQL server in which case filtering at the server side will likely always be the best approach because you have minimized the amount of data transfer. Memory access is always faster than data transfer from disk to memory so whenever you can transfer fewer records, you are likely to have better performance.
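For instance, if myItems were actually an IQueryable backed by a database (an assumption; the question doesn't say what the source is), keeping the filter inside the query lets the server do the work. MyItem, db.MyItems, IsActive and MyAction are hypothetical names:
// The Where clause is translated to SQL, so only matching rows cross the wire.
IQueryable<MyItem> myItems = db.MyItems;
foreach (MyItem item in myItems.Where(x => x.IsActive))
{
    MyAction(item);
}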
Well, both times, you're converting to an array, which might not be so efficient if the IEnumerable is very large (like you said). You could create a generic extension method for IEnumerable, like:
public static void ForEach<T>(this IEnumerable<T> current, Action<T> action) {
    foreach (var i in current) {
        action(i);
    }
}
and then you could do this:
IEnumerable<int> ints = new List<int>();
ints.Where(i => i == 5).ForEach(i => Console.WriteLine(i));
If performance is a concern, it's unclear to me why you'd be bothering to construct an entire array in the first place. Why not just this?
foreach (var item in myItems.Where(FILTER-IT-HERE))
MY-ACTION;
Or:
foreach (var item in myItems)
MY-ACTION-WITH-FILTER;
I ask because, while the others are right that you can't really know without testing, I wouldn't expect there to be much difference between the above two options. I would expect there to be a difference, on the other hand, between creating/populating an array (seemingly for no reason) and not creating an array.
Everything else being equal, calling ToArray() first will impart a greater performance hit than when calling it last. Although, as has been stated by others before me,
Why use ToArray() and Array.ForEach() at all?
We don't know that everything else actually is equal since you do not reveal the implementation details of your filter and action.
The idea of LINQ is to work on enumerable collections, so the best LINQ query is the one where you don't use Array.ForEach() and .ToArray() at all.
I would say that this falls into the category of premature optimization. If, after establishing benchmarks, you find that the code is too slow, you can always try each approach and pick the result that works better for you.
Since we don't know how the IEnumerable<> is produced it's hard to say which approach will perform better. We also don't know how many items will remain after you apply your predicate - nor do we know whether the action or iteration steps are going to be the dominant factor in the execution of your code. The only way to know for sure is to try it both ways, profile the results, and pick the best.
Performance aside, I would choose the version that is most clear - which (for me) is to first filter and then apply the projection to the result.

C# performance question

Quandary: which of the following two methods performs best?
Goal: get an object of type Wrapper (defined below).
Criteria: speed over storage.
No. of records: about 1000-2000, max about 6K.
Choices: create the object on the fly, or do a lookup from a dictionary.
Execution speed: called x times per second.
NB: I need to deliver the working code first and then go for optimization, hence if any theorists can provide glimpses of the behind-the-scenes info, that'll help before I get to the actual performance test, possibly by EOD Thursday.
Definitions -
class Wrapper
{
    public readonly DataRow Row;
    public Wrapper(DataRow dr)
    {
        Row = dr;
    }
    public string ID { get { return Row["id"].ToString(); } }
    public string ID2 { get { return Row["id2"].ToString(); } }
    public string ID3 { get { return Row["id3"].ToString(); } }
    public double Dbl1 { get { return (double)Row["dbl1"]; } }
    // ... total about 12 such fields!
}
Dictionary<string,Wrapper> dictWrappers;
Method 1
Wrapper o = new Wrapper(dr);
/// some action with o
myMethod( o );
Method 2
Wrapper o;
if (!dictWrappers.TryGetValue(dr["id"].ToString(), out o))
{
    o = new Wrapper(dr);
    dictWrappers.Add(o.ID, o);
}
/// some action with o
myMethod( o );
Never optimize without profiling first.
Never profile unless the code does not meet specifications/expectations.
If you need to profile this code, write it both ways and benchmark it with your expected load.
EDIT: I try to favor the following over optimization unless performance is unacceptable:
Simplicity
Readability
Maintainability
Testability
I've (recently) seen highly-optimized code that was very difficult to debug. I refactored it to simplify it, then ran performance tests. The performance was unacceptable, so I profiled it, found the bottlenecks, and optimized only those. I re-ran the performance tests, and the new code was comparable to the highly-optimized version. And it's now much easier to maintain.
Here's a free profiling tool.
The first one would be faster, since it isn't actually doing a lookup, it is just doing a simple allocation and an assignment.
The two segments of code are not equivalent in function, however, because Method 1 could create many duplicates.
Without actually testing I would expect that caching the field values in Wrapper (that is, avoiding all the ToString calls and casts) would probably have more of an impact on performance.
Then once you are caching those values you will probably want to keep instances of Wrapper around rather than frequently recreate them.
Assuming that you're really worried about perf (hey, it happens), then your underlying wrapper itself could be improved. You're doing field lookups by string. If you're going to make the call a lot with the same field set in the row, it's actually faster to cache the ordinals and look up by ordinal.
Of course this is only if you really, really need to worry about performance, and the instances where this would make a difference are fairly rare (though in embedded devices it's not as rare as on the desktop).
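A rough sketch of that variation on the Wrapper above (not the asker's actual code; column names as in the original, using the same System.Data types):
class Wrapper
{
    public readonly DataRow Row;
    private readonly int idOrdinal;
    private readonly int dbl1Ordinal;

    public Wrapper(DataRow dr)
    {
        Row = dr;
        // Resolve each column position once, rather than doing a string lookup
        // on every property access; for many rows from the same table this
        // could be hoisted further into a shared per-table cache.
        idOrdinal = dr.Table.Columns["id"].Ordinal;
        dbl1Ordinal = dr.Table.Columns["dbl1"].Ordinal;
    }

    public string ID { get { return Row[idOrdinal].ToString(); } }
    public double Dbl1 { get { return (double)Row[dbl1Ordinal]; } }
    // ... remaining fields follow the same pattern
}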

Improving DAL performance

The way I currently populate business objects is by using something similar to the snippet below.
using (SqlConnection conn = new SqlConnection(Properties.Settings.Default.CDRDatabase))
{
    using (SqlCommand comm = new SqlCommand(SELECT, conn))
    {
        conn.Open();
        using (SqlDataReader r = comm.ExecuteReader(CommandBehavior.CloseConnection))
        {
            while (r.Read())
            {
                Ailias ailias = PopulateFromReader(r);
                tmpList.Add(ailias);
            }
        }
    }
}
private static Ailias PopulateFromReader(IDataReader reader)
{
    Ailias ailias = new Ailias();
    if (!reader.IsDBNull(reader.GetOrdinal("AiliasId")))
    {
        ailias.AiliasId = reader.GetInt32(reader.GetOrdinal("AiliasId"));
    }
    if (!reader.IsDBNull(reader.GetOrdinal("TenantId")))
    {
        ailias.TenantId = reader.GetInt32(reader.GetOrdinal("TenantId"));
    }
    if (!reader.IsDBNull(reader.GetOrdinal("Name")))
    {
        ailias.Name = reader.GetString(reader.GetOrdinal("Name"));
    }
    if (!reader.IsDBNull(reader.GetOrdinal("Extention")))
    {
        ailias.Extention = reader.GetString(reader.GetOrdinal("Extention"));
    }
    return ailias;
}
Does anyone have any suggestions of how to improve performance on something like this? Bear in mind that PopulateFromReader, for some types, contains more database look-ups in order to populate the object fully.
One obvious change would be to replace this kind of statement:
ailias.AiliasId = reader.GetInt32(reader.GetOrdinal("AiliasId"));
with
ailias.AiliasId = reader.GetInt32(constAiliasId);
where constAiliasId is a constant holding the ordinal of the field AiliasId.
This avoids the ordinal lookups in each iteration of the loop.
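One way to get that effect without hard-coding constants is to resolve the ordinals once, just before the read loop in the calling code (a sketch using the field names from the question):
// Look the ordinals up once, outside the while (r.Read()) loop.
int ordAiliasId = r.GetOrdinal("AiliasId");
int ordTenantId = r.GetOrdinal("TenantId");
int ordName = r.GetOrdinal("Name");
int ordExtention = r.GetOrdinal("Extention");

while (r.Read())
{
    Ailias ailias = new Ailias();
    if (!r.IsDBNull(ordAiliasId)) ailias.AiliasId = r.GetInt32(ordAiliasId);
    if (!r.IsDBNull(ordTenantId)) ailias.TenantId = r.GetInt32(ordTenantId);
    if (!r.IsDBNull(ordName)) ailias.Name = r.GetString(ordName);
    if (!r.IsDBNull(ordExtention)) ailias.Extention = r.GetString(ordExtention);
    tmpList.Add(ailias);
}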
If the data volume is high, then it can happen that the overhead of building a huge list can be a bottleneck; in which case, it can be more efficient to use a streaming object model; i.e.
public IEnumerable<YourType> SomeMethod(...args...) {
    using(connection+reader) {
        while(reader.Read()) {
            YourType item = BuildObj(reader);
            yield return item;
        }
    }
}
The consuming code (via foreach etc) then only has a single object to deal with (at a time). If they want to get a list, they can (with new List<SomeType>(sequence), or in .NET 3.5: sequence.ToList()).
This involves a few more method calls (an additional MoveNext()/Current per sequence item, hidden behind the foreach), but you will never notice this when you have out-of-process data such as from a database.
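Consumption would then look something like this (Process is just a placeholder for whatever the caller does per item):
// Only one item is materialized at a time while streaming.
foreach (YourType item in SomeMethod(/* args */))
{
    Process(item);
}

// If a full list really is needed:
List<YourType> all = new List<YourType>(SomeMethod(/* args */));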
Your code looks almost identical to a lot of our business object loading functions. When we suspect DAL performance issues, we take a look at a few things.
How many times are we hopping out to the DB? Is there any way we can connect less often and bring back larger chunks of data via multiple result sets (we use stored procedures)? So, instead of each child object loading its own data, the parent fetches all the data for itself and its children. You can run into fragile SQL (sort orders that need to match, etc.) and tricky loops to walk over the DataReaders, but we have found it faster than making multiple DB trips.
Fire up a packet sniffer/network monitor to see exactly how much data is being transmitted across the wire. You may be surprised to see how massive some of the result sets are. If they are, then you might think about alternate ways of approaching the issue. Like lazy/defer loading some child data.
Make sure that you are using all of the results you are asking for. For example, going from SELECT * FROM (with 30 fields being returned) to simply SELECT Id, Name FROM (if that is all you needed) could make a large difference.
AFAIK, that is as fast as it gets. Perhaps the slowness is in the SQL query/server. Or somewhere else.
It's likely the real problem is the multiple, per-object lookups that you mention. Have you looked closely to see if they can all be put into a single stored procedure?
