Recently I had to do some very processing-heavy work with data stored in a DataSet. It was heavy enough that I ended up using a tool to help identify some bottlenecks in my code. When I was analyzing the bottlenecks, I noticed that although DataSet lookups were not terribly slow (they weren't the bottleneck), they were slower than I expected. I had always assumed that DataSets used some sort of HashTable-style implementation, which would make lookups O(1) (or at least that's my understanding of HashTables). The speed of my lookups seemed to be significantly slower than that.
I was wondering if anyone who knows anything about the implementation of .NET's DataSet class would care to share what they know.
If I do something like this:
DataTable dt = new DataTable();
if(dt.Columns.Contains("SomeColumn"))
{
object o = dt.Rows[0]["SomeColumn"];
}
How fast would the lookup be for the Contains(...) method, and for retrieving the value to store in object o? I would have thought it would be very fast, like a HashTable (assuming what I understand about HashTables is correct), but it doesn't seem like it...
I wrote that code from memory so some things may not be "syntactically correct".
Actually, it's advisable to use an integer when referencing a column, which can improve performance quite a bit. To keep things manageable, you can declare a constant integer. So instead of what you did, you could do:
const int SomeTable_SomeColumn = 0;
DataTable dt = new DataTable();
// Contains() takes a column name, not an index, so check against the column count instead
if(SomeTable_SomeColumn < dt.Columns.Count)
{
object o = dt.Rows[0][SomeTable_SomeColumn];
}
Via Reflector the steps for DataRow["ColumnName"] are:
1. Get the DataColumn from the column name, using the row's DataColumnCollection["ColumnName"]. Internally, DataColumnCollection stores its DataColumns in a Hashtable. O(1)
2. Get the DataRow's row index. The index is stored in an internal member. O(1)
3. Get the DataColumn's value at that index using DataColumn[index]. DataColumn stores its data in a System.Data.Common.DataStorage (internal, abstract) member:
return dataColumnInstance._storage.Get(recordIndex);
A sample concrete implementation is System.Data.Common.StringStorage (internal, sealed). StringStorage (and the other concrete DataStorages I checked) store their values in an array. Get(recordIndex) simply grabs the object in the value array at recordIndex. O(1)
So overall you're O(1), but that doesn't mean the hashing and function calls along the way are free. It just means the cost doesn't grow as the number of DataRows or DataColumns increases.
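Since the constant cost is mostly the name-to-column hashing, you can shave it off in hot loops by resolving the DataColumn once up front. A quick sketch (dt stands for whatever DataTable you already have, and the column name is made up):
DataColumn col = dt.Columns["SomeColumn"]; // one name-to-column hash lookup, done once
foreach (DataRow row in dt.Rows)
{
    // Indexing by DataColumn object skips the name lookup entirely.
    object o = row[col];
}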
Interesting that DataStorage uses an array for values. Can't imagine that's easy to rebuild when you add or remove rows.
I imagine that any lookups would be O(n), as I don't think they use any type of hashtable; I'd expect more of an array scan for finding rows and columns.
Actually, I believe the column names are stored in a Hashtable, so it should be O(1), i.e. constant time, for case-sensitive lookups. If it had to look through each column, then of course it would be O(n).
Given an expression like so:
DataTable1.Columns.Add("value", typeof(double), "rate * loan_amt");
in a DataTable with 10,000 rows, where rate is the same for all rows and loan_amt varies.
When the rate changes, it changes for all rows.
Currently that means iterating through all the rows, like so:
foreach(DataRow dr in DataTable1.Rows) dr["rate"] = new_rate;
I'm wondering if there's a better way, using a ReferenceTable (with only 1 row) in the same DataSet and linking it somehow, like so:
DataTable1.Columns.Add("value", typeof(double), "RefTable.Row0.rate * loan_amt");
so changing the rate would be as simple as
RefTable.Rows[0]["rate"] = new_rate;
Or any other way?
That is a good idea, but you would have to rewrite any legacy code that accesses that data. It would certainly make updates to the rate more efficient, but you may run into backward-compatibility issues.
If there isn't much code accessing that table, it isn't such a big deal. But if this is a production system with multiple processes calling that data, you might end up with a runaway train of null-value exceptions when code tries to access the "rate" column of the original table, or with inconsistencies in your "value" depending on which code accessed which table to retrieve the rate.
If this is not the case, then it's no big deal. Go for it.
Found the answer; adding it for others who might land here.
The key is to add a DataRelation between the two tables/columns, and the expression would be:
Parent.rate * loan_amt
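For anyone wanting the wiring spelled out, here's a minimal sketch. The table and column names are invented, and the DataRelation is what makes "Parent" resolve:
using System.Data;

DataSet ds = new DataSet();

DataTable refTable = ds.Tables.Add("RefTable");
refTable.Columns.Add("id", typeof(int));
refTable.Columns.Add("rate", typeof(double));
refTable.Rows.Add(1, 0.05);

DataTable loans = ds.Tables.Add("DataTable1");
loans.Columns.Add("rate_id", typeof(int));
loans.Columns.Add("loan_amt", typeof(double));
loans.Rows.Add(1, 250000.0);

// The relation is what lets the child expression reach the parent row.
ds.Relations.Add("RefRate", refTable.Columns["id"], loans.Columns["rate_id"]);
loans.Columns.Add("value", typeof(double), "Parent.rate * loan_amt");

// Changing the single parent row now recalculates "value" on every child row.
refTable.Rows[0]["rate"] = 0.06;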
So I've been trying to better understand the difference between these two, but all I can really find info on is the difference between DataSets and DataTables. A single array can only hold one data type, whereas from what I can tell, DataTables are basically a generic multidimensional array with a 1:1 relationship to the DataSource stored in memory. Is this accurate? Are DataTables 'just' a generic multidimensional array, or am I missing some fundamental difference?
A DataTable models a database table in memory. The type can track changes etc in order to sync with a database. The columns (dimensions) can be referenced either by index or name.
A DataSet can hold a collection of such tables and the relationships between them (referential integrity constraints).
An array doesn't do any of that.
DataTable is kind of like a multidimensional array in that it's an in-memory data store of a certain "size", but there are significant additional features. For example, each "column" has name information and specific type information, there is change tracking for synchronization with the data store, rows can store null values, etc.
A DataSet is basically an entire "set" of data (i.e. multiple DataTables) held in memory.
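To make the contrast concrete, a small illustration (the column names are invented):
using System;
using System.Data;

DataTable dt = new DataTable("People");
dt.Columns.Add("Id", typeof(int));        // named, typed columns
dt.Columns.Add("Name", typeof(string));
dt.Rows.Add(1, "Alice");
dt.Rows.Add(2, DBNull.Value);             // per-cell null support
dt.AcceptChanges();                       // baseline the current state

dt.Rows[0]["Name"] = "Alicia";            // address columns by name or index
Console.WriteLine(dt.Rows[0].RowState);   // Modified -- change tracking for syncing

object[,] array = new object[2, 2];       // a plain array offers none of this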
I am working with SqlXml and a stored procedure that returns XML rather than raw data. How does one actually read the data when what comes back is XML and the column names aren't known? I used the versions below, and I have heard that getting data from a SqlDataReader by ordinal is faster than by column name. Please advise on which is best, with a valid reason or proof.
sqlDataReaderInstance.GetString(0);
sqlDataReaderInstance[0];
"and have heard getting data from SqlDataReader through ordinal is faster than through column name"
Both your examples are getting data through the index (ordinal), not the column name:
Getting data through the column name:
while(reader.Read())
{
...
var value = reader["MyColumnName"];
...
}
is potentially slower than getting data through the index:
int myColumnIndex = reader.GetOrdinal("MyColumnName");
while(reader.Read())
{
...
var value = reader[myColumnIndex];
...
}
because the first example must repeatedly find the index corresponding to "MyColumnName". If you have a very large number of rows, the difference might even be noticeable.
In most situations the difference won't be noticeable, so favour readability.
UPDATE
If you are really concerned about performance, an alternative to using ordinals is to use the DbEnumerator class as follows:
foreach(IDataRecord record in new DbEnumerator(reader))
{
...
var value = record["MyColumnName"];
...
}
The DbEnumerator class reads the schema once, and maintains an internal HashTable that maps column names to ordinals, which can improve performance.
Compared to the cost of getting the data from disk in the first place, both will be effectively the same speed.
The two calls aren't equivalent: the indexer version returns an object, whereas GetString() casts the value to a string, throwing an exception if that isn't possible (e.g. the column is DBNull).
So although GetString() might be slightly slower, you'll be casting to a string anyway when you use it.
Given all the above I'd use GetString().
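The difference in a sketch (reader is the SqlDataReader from above, and column 0 is assumed to hold strings):
object raw = reader[0];          // boxed value in its native type; DBNull.Value when NULL
string s1 = reader.GetString(0); // a string, but throws if the value is DBNull
string s2 = reader.IsDBNull(0) ? null : reader.GetString(0); // null-safe variant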
The indexer method is faster because it returns the data in its native format and uses the ordinal.
Have a look at these threads:
Maximize Performance with SqlDataReader
.NET SqlDataReader Item[] vs. GetString(GetOrdinal())?
I have a function that receives three different "people" objects and generates a new "compatibility" object based on the combined values in the "people" objects.
However, about 1/3 of the time, the three "people" objects it receives as input are the same as in a previous call, though possibly in a different order. In those cases I do NOT want to make a new "score" object, but simply return a value contained within the existing object.
Originally, the program just looped through the List<> of "compatibility" objects, searching for the one that belongs to these three "people" (each "compatibility" object contains an array of "people" objects). This method is really slow, considering that there are thousands of "compatibility" objects and over a million "people" objects.
I had the idea of using a dictionary where the key is a number generated by combining the three people objects' id values into a single UInt64 using XOR, and storing the score objects as the dictionary values rather than in a list. This cuts the time down by about half, and is acceptable in terms of time performance, but there are way too many collisions, and it returns a wrong score too often.
Any suggestions or pointers would be much appreciated.
Edit: To add to the original question, each "people" object has a bunch of other fields that I could use, but the problem is making a key that is UNIQUE and COMMUTATIVE.
I think you're looking at things in a much too complex manner. Take the 3 PersonID values and sort them, so that they're always in the same order no matter which order they were passed in. Then set a value in a hashtable using the three PersonIDs as the key, separated by a hyphen or some other character that won't occur in a PersonID value. Later, check whether there's a value in the hashtable with that key.
So if the three PersonIDs are 10, 5 and 22, the hash key could be something like "5-10-22".
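As a sketch, assuming numeric person IDs; Compatibility and ComputeScore stand in for the poster's real score type and score-building code:
using System;
using System.Collections.Generic;

// Build an order-independent key from three IDs.
static string MakeKey(ulong id1, ulong id2, ulong id3)
{
    ulong[] ids = { id1, id2, id3 };
    Array.Sort(ids);                      // commutative: argument order no longer matters
    return $"{ids[0]}-{ids[1]}-{ids[2]}"; // unique: decimal digits never contain '-'
}

var cache = new Dictionary<string, Compatibility>();
string key = MakeKey(22, 5, 10);          // "5-10-22"
if (!cache.TryGetValue(key, out Compatibility score))
{
    score = ComputeScore();               // your existing expensive calculation
    cache[key] = score;
}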
Create the key by concatenating the object IDs after sorting the trio in a pre-determined order.
Your best option would be a custom IEqualityComparer class. Declare your Dictionary like this:
Dictionary<List<People>, Compatibility> people =
    new Dictionary<List<People>, Compatibility>(new PersonListComparer());
You'll need to create a PersonListComparer class that implements IEqualityComparer<List<People>>. There are two methods you'll need to implement: one that gets a hash code and one that compares for equality. The Dictionary uses GetHashCode to determine whether two lists are POSSIBLY equal, and the Equals method to determine whether they actually are (in other words, the hash code is fast but can give a false positive, while never giving a false negative). Use your existing hashing algorithm (the XOR) for GetHashCode, then just compare the two lists explicitly in the Equals method.
This should do the trick!
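A sketch of what that comparer might look like, assuming each People object exposes an Id property (the property name is a guess):
using System.Collections.Generic;
using System.Linq;

class PersonListComparer : IEqualityComparer<List<People>>
{
    public int GetHashCode(List<People> list)
    {
        // XOR is commutative, so the three people can arrive in any order.
        int hash = 0;
        foreach (People p in list)
            hash ^= p.Id.GetHashCode();
        return hash;
    }

    public bool Equals(List<People> x, List<People> y)
    {
        if (x.Count != y.Count)
            return false;
        // Treat the lists as unordered sets of IDs; trivial for three elements.
        return x.Select(p => p.Id).OrderBy(id => id)
                .SequenceEqual(y.Select(p => p.Id).OrderBy(id => id));
    }
}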
Why not use the names of the people as the dictionary key? (Sort the names first, so that order of passing doesn't matter.)
I.e., John, Alice, and Bob become something like my_dictionary["Alice_Bob_John"]; if that key exists, you've already computed the score, otherwise you need to compute it. As an alternative to my string hacking above, you could actually use a structure:
NameTriple n = new NameTriple("John", "Alice", "Bob");
// NameTriple internally sorts the names.
my_dictionary[n] ...
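NameTriple isn't a built-in type, so here's one possible implementation, sorting in the constructor so argument order never matters:
using System;

struct NameTriple : IEquatable<NameTriple>
{
    private readonly string a, b, c;

    public NameTriple(string x, string y, string z)
    {
        string[] names = { x, y, z };
        Array.Sort(names, StringComparer.Ordinal); // canonical order
        a = names[0]; b = names[1]; c = names[2];
    }

    public bool Equals(NameTriple o) => a == o.a && b == o.b && c == o.c;
    public override bool Equals(object o) => o is NameTriple t && Equals(t);
    public override int GetHashCode() =>
        (a.GetHashCode() * 31 + b.GetHashCode()) * 31 + c.GetHashCode();
}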
If you want to keep everything in memory and not use a database, I'd recommend something akin to a tree structure. Assuming your object IDs are sortable and order doesn't matter, you can accomplish this with nested dictionaries.
Namely, a Dictionary<Key, Dictionary<Key, Dictionary<Key, Compatibility>>> should do the trick. Sort the IDs, and use the lowest value in the outer dictionary, the next value in the next, and the final value to find the compatibility object. This way, there will be no collisions, and lookup should be quite fast.
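A sketch of that lookup, with illustrative integer IDs and a stand-in Compatibility class:
using System;
using System.Collections.Generic;

var index = new Dictionary<int, Dictionary<int, Dictionary<int, Compatibility>>>();

int a = 10, b = 5, c = 22;   // the three IDs for this call

// Sort so (a, b, c) and (c, a, b) land in the same slot.
int[] ids = { a, b, c };
Array.Sort(ids);

// One dictionary level per ID; no collisions are possible.
Compatibility result = null;
if (index.TryGetValue(ids[0], out var byFirst) &&
    byFirst.TryGetValue(ids[1], out var bySecond))
    bySecond.TryGetValue(ids[2], out result);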
Or, now that I think again, this doesn't have to be that complicated. Just use a string as a key and concatenate the IDs together in sorted order with a "!" or something else in between that doesn't occur naturally in the IDs.
assuming all "Person" objects are unique, store a UUID in the object.
in your function staticly store the quad (P1,P2,P3,V) where P1,P2,P3 are UUID's of a Person object, sorted (to avoid the ordering problem) and V is the result from the previous calculation.
then your function checks to is if there is an entry for this triplet of Persons, if not it does the work and stores it.
You can store the (P1, P2, P3, V) values in a dictionary, keyed off some hash of the three P values.
I need to represent a lookup table in C#, here is the basic structure:
Name     Range    Multiplier
Active   10-20    0.5
What do you guys suggest?
I will need to look up by range and retrieve the multiplier.
I will also need to look up by name.
Update
It will have maybe 10-15 rows in total.
Range is an integer data type.
What you actually have is two lookup tables: one by Name and one by Range. There are several ways you can represent these in memory depending on how big the table will get.
The most likely fit for the "by-name" lookup is a dictionary:
var MultiplierByName = new Dictionary<string, double>() { {"Active",.5}, {"Other", 1.0} };
The range is trickier. For that you will probably want to store either just the minimum or the maximum item of each range, depending on how your ranges work. You may also need to write a function to reduce any given integer to its corresponding stored key value (hint: use integer division or the mod operator).
From there you can choose another dictionary (Dictionary<int, double>), or if it works out right you could make your reduce function return a sequential int and use a List<double> so that your 'key' just becomes an index.
But like I said: to know for sure what's best we really need to know the scope and nature of the data in the lookup, and the scenario you'll use to access it.
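As a sketch of that reduction idea, assuming fixed-width integer ranges of 10 (10-19, 20-29, ...):
using System.Collections.Generic;

var multiplierByName  = new Dictionary<string, double> { { "Active", 0.5 }, { "Other", 1.0 } };
var multiplierByRange = new Dictionary<int, double>    { { 10, 0.5 },       { 20, 1.0 } };

// Reduce any integer to the low end of its range with integer division.
static int Reduce(int x) => (x / 10) * 10;   // 17 -> 10, 23 -> 20

double m = multiplierByRange[Reduce(17)];    // 0.5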
Create a class to represent each row. It would have Name, RangeLow, RangeHigh and Multiplier properties. Create a list of such rows (read from a file or entered in the code), and then use LINQ to query it:
from r in LookupTable
where r.RangeLow <= x && r.RangeHigh >= x
select r.Multiplier;
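Fleshed out a bit (the class shape and sample data are illustrative):
using System.Collections.Generic;
using System.Linq;

var LookupTable = new List<LookupRow>
{
    new LookupRow { Name = "Active", RangeLow = 10, RangeHigh = 20, Multiplier = 0.5 }
};

int x = 15;
double? multiplier = (from r in LookupTable
                      where r.RangeLow <= x && r.RangeHigh >= x
                      select (double?)r.Multiplier).FirstOrDefault(); // null if nothing matches

class LookupRow
{
    public string Name { get; set; }
    public int RangeLow { get; set; }
    public int RangeHigh { get; set; }
    public double Multiplier { get; set; }
}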
Sometimes simplicity is best. How many entries are we looking at, and are the ranges integer ranges, as you seem to imply in your example? While there are several approaches I can think of, the first that comes to mind is to maintain two different lookup dictionaries, one for the name and one for the value (range), and just store redundant info in the range dictionary. Of course, if your range is keyed by doubles, or your ranges go into the tens of thousands, I'd look for something different, but simplicity rules in my book.
I would implement this using a DataTable, assuming there was no pressing reason to use another data type. DataTable.Select would work fine for running a lookup on Name or Range. You do lose some performance using a DataTable for this, but with 10-15 records, would it matter that much?
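A sketch of that approach (column names invented):
using System.Data;

var dt = new DataTable();
dt.Columns.Add("Name", typeof(string));
dt.Columns.Add("RangeLow", typeof(int));
dt.Columns.Add("RangeHigh", typeof(int));
dt.Columns.Add("Multiplier", typeof(double));
dt.Rows.Add("Active", 10, 20, 0.5);

// Select() takes a filter expression, so both lookups are one-liners.
DataRow[] byRange = dt.Select("RangeLow <= 15 AND RangeHigh >= 15");
DataRow[] byName  = dt.Select("Name = 'Active'");
double multiplier = (double)byRange[0]["Multiplier"]; // 0.5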