SqlDataReader: are these two the same, and which one is faster? - C#

I am working with SqlXml and a stored procedure that returns XML rather than raw data. How does one actually read the data when what is returned is XML and the column names are not known? I used the versions below, and I have heard that getting data from a SqlDataReader through the ordinal is faster than through the column name. Please advise on which is best, with a valid reason or proof.
sqlDataReaderInstance.GetString(0);
sqlDataReaderInstance[0];

and have heard getting data from SqlDataReader through ordinal is faster than through column name
Both your examples are getting data through the index (ordinal), not the column name:
Getting data through the column name:
while (reader.Read())
{
    ...
    var value = reader["MyColumnName"];
    ...
}
is potentially slower than getting data through the index:
int myColumnIndex = reader.GetOrdinal("MyColumnName");
while (reader.Read())
{
    ...
    var value = reader[myColumnIndex];
    ...
}
because the first example must repeatedly find the index corresponding to "MyColumnName". If you have a very large number of rows, the difference might even be noticeable.
In most situations the difference won't be noticeable, so favour readability.
UPDATE
If you are really concerned about performance, an alternative to using ordinals is to use the DbEnumerator class as follows:
foreach (IDataRecord record in new DbEnumerator(reader))
{
    ...
    var value = record["MyColumnName"];
    ...
}
The DbEnumerator class reads the schema once, and maintains an internal HashTable that maps column names to ordinals, which can improve performance.

Compared to the speed of getting the data off disk, both will be effectively as fast as each other.
The two calls aren't equivalent: the indexer version returns an object, whereas GetString() converts the value to a string, throwing an exception if this isn't possible (e.g. when the column is DBNull).
So although GetString() might be slightly slower, you'll be casting to a string anyway when you use it.
Given all the above I'd use GetString().
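As a quick sketch of that DBNull difference (assuming a nullable string column at ordinal 0; this is illustrative, not code from the question):
object raw = sqlDataReaderInstance[0];                   // indexer: returns object, DBNull.Value comes back as-is
string viaIndexer = raw == DBNull.Value ? null : (string)raw;

string viaGetString = sqlDataReaderInstance.IsDBNull(0)  // typed accessor: would throw on DBNull, so guard first
    ? null
    : sqlDataReaderInstance.GetString(0);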

The indexer method is faster because it returns the data in its native format and uses the ordinal.
Have a look at these threads:
Maximize Performance with SqlDataReader
.NET SqlDataReader Item[] vs. GetString(GetOrdinal())?

Related

SqlBulkCopy WriteToServer with an IDataReader instead of DataTable and Programmatically Adjusted Field Values

We have working code in C# that utilizes SqlBulkCopy to insert records into a table from a stored procedure source. At a high level:
Reads data from a stored procedure that puts the records into a DataTable. Essentially it calls the SP and uses a data adapter to put the records into the DataTable. Let's call this srcDataTable.
Dynamically maps the column names between source and destination through configuration, a table that's similar to the following:
TargetTableName | ColumnFromSource | ColumnInDestination | DefaultValue | Formatting
TableA          | StudentFirstName | FirstName           | NULL         | NULL
TableA          | StudentLastName  | LastName            | NULL         | NULL
TableA          | Birthday         | Birthdate           | 1/1/1900     | dd/MM/yyyy
Based on the mapping from #2, set up new rows from srcDataTable using DataTable.NewRow() into another DataTable that matches the structure of the destination table (the table that ColumnInDestination maps to). Let's call this targetDataTable. As you can see from the table, there may be instances where the value from the source is not specified, or needs to be formatted a certain way. This is the primary reason why we're having to add data rows on the fly to another data table, and the adjustment / defaulting of the values is handled in code.
Call SqlBulkCopy to write all the rows in targetDataTable to the actual SQL table.
This approach has been working alright in tandem with stored procedures that utilize FETCH and OFFSET so it only returns an X number of rows at a time to deal with memory constraints. Unfortunately, as we're getting more and more data sources that are north of 50 million rows, and that we're having to share servers, we're needing to find a faster way to do so while keeping memory consumption in check. Researching options, it seems like utilizing an IDataReader for SQLBulkCopy will allow us to limit the memory consumption of the code, and not having to delegate getting X number of records in the stored procedure itself anymore.
In terms of preserving current functionality, it looks like we can utilize SqlBulkCopyMappingOptions to allow us to maintain mapping the fields even if they're named differently. What I can't confirm however is the defaulting or formatting of the values.
Is there a way to extend the DataReader's Read() method so that we can introduce that same logic to revise whatever value will be written to the destination if there's configuration asking us to? So a) check if the current row has a value populated from the source, b) default its value to the destination table if configured, and c) apply formatting rules as it gets written to the destination table.
You appear to be asking "can I make my own class that implements IDataReader and has some altered logic to the Read() method?"
The answer's yes; you can write your own data reader that does whatever it likes in Read(), even format the server's hard disk as soon as it's called. When you're implementing an interface you aren't "extend[ing] the DataReader's Read method", you're providing your own implementation that externally appears to obey a specific contract, but the implementation detail is entirely up to you. If you want, upon every read, to pull down a row from db X into a temp array, then zip through the array tweaking the values to apply some default or other adjustment before returning true, that's fine...
...and if you wanted to do the value adjustment in the GetXXX methods instead, that's also fine; you're writing the reader, so you decide. All the bulk copier is going to do is call Read until it returns false and write the data it gets from e.g. GetValue. (If it wasn't immediately clear: Read doesn't produce the data that will be written, GetValue does. Read is just an instruction to move to the next set of data that must be written, and it doesn't even have to do that. You could implement Read as { return DateTime.Now.DayOfWeek == DayOfWeek.Monday; } and GetValue as { return Guid.NewGuid().ToString(); } and your copy operation would spend until 23:59:59.999 filling the database with GUIDs, but only on Monday.)
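To make that concrete, here is a rough sketch (not the poster's code; the class name and the defaulting rule are made up for illustration) of a wrapping IDataReader that forwards everything to an inner reader and adjusts values in GetValue on the way out:

using System;
using System.Data;

// Sketch: wraps any IDataReader and applies a hypothetical defaulting rule to one column.
public sealed class DefaultingDataReader : IDataReader
{
    private readonly IDataReader _inner;
    private readonly int _birthdateOrdinal;   // hypothetical configured column

    public DefaultingDataReader(IDataReader inner, int birthdateOrdinal)
    {
        _inner = inner;
        _birthdateOrdinal = birthdateOrdinal;
    }

    // The transformation hook: default a missing Birthdate to 1/1/1900.
    public object GetValue(int i)
    {
        object value = _inner.GetValue(i);
        if (i == _birthdateOrdinal && (value == null || value == DBNull.Value))
            return new DateTime(1900, 1, 1);
        return value;
    }

    // Route bulk reads through GetValue so the same rule applies either way.
    public int GetValues(object[] values)
    {
        int count = Math.Min(values.Length, FieldCount);
        for (int i = 0; i < count; i++) values[i] = GetValue(i);
        return count;
    }

    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public bool IsDBNull(int i) => GetValue(i) == DBNull.Value;

    // Everything else simply forwards to the wrapped reader.
    public bool Read() => _inner.Read();
    public int FieldCount => _inner.FieldCount;
    public int GetOrdinal(string name) => _inner.GetOrdinal(name);
    public string GetName(int i) => _inner.GetName(i);
    public string GetDataTypeName(int i) => _inner.GetDataTypeName(i);
    public Type GetFieldType(int i) => _inner.GetFieldType(i);
    public bool GetBoolean(int i) => _inner.GetBoolean(i);
    public byte GetByte(int i) => _inner.GetByte(i);
    public long GetBytes(int i, long offset, byte[] buffer, int bufferOffset, int length) => _inner.GetBytes(i, offset, buffer, bufferOffset, length);
    public char GetChar(int i) => _inner.GetChar(i);
    public long GetChars(int i, long offset, char[] buffer, int bufferOffset, int length) => _inner.GetChars(i, offset, buffer, bufferOffset, length);
    public Guid GetGuid(int i) => _inner.GetGuid(i);
    public short GetInt16(int i) => _inner.GetInt16(i);
    public int GetInt32(int i) => _inner.GetInt32(i);
    public long GetInt64(int i) => _inner.GetInt64(i);
    public float GetFloat(int i) => _inner.GetFloat(i);
    public double GetDouble(int i) => _inner.GetDouble(i);
    public string GetString(int i) => _inner.GetString(i);
    public decimal GetDecimal(int i) => _inner.GetDecimal(i);
    public DateTime GetDateTime(int i) => _inner.GetDateTime(i);
    public IDataReader GetData(int i) => _inner.GetData(i);
    public DataTable GetSchemaTable() => _inner.GetSchemaTable();
    public bool NextResult() => _inner.NextResult();
    public int Depth => _inner.Depth;
    public bool IsClosed => _inner.IsClosed;
    public int RecordsAffected => _inner.RecordsAffected;
    public void Close() => _inner.Close();
    public void Dispose() => _inner.Dispose();
}

Feeding it to the bulk copy would then just be bcp.WriteToServer(new DefaultingDataReader(srcReader, srcReader.GetOrdinal("Birthday"))), with the column mappings set up as usual.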
The question is a bit unclear. It looks like the actual question is whether it's possible to transform data before using SqlBulkCopy with a data reader.
There are a lot of ways to do it, and the appropriate one depends on what the rest of the ETL code does. Does it only work with data readers? Or does it load batches of rows that can be modified in memory?
Use IEnumerable<> and ObjectReader
FastMember's ObjectReader class creates an IDataReader wrapper over any IEnumerable<T> collection. This means that both strongly-typed .NET collections and iterator results can be sent to SqlBulkCopy.
IEnumerable<string> lines = File.ReadLines(filePath);
using (var bcp = new SqlBulkCopy(connection))
using (var reader = ObjectReader.Create(lines, "FileName"))
{
    bcp.DestinationTableName = "SomeTable";
    bcp.WriteToServer(reader);
}
It's possible to create a transformation pipeline using LINQ queries and iterator methods this way, and feed the result to SqlBulkCopy using ObjectReader. The code is a lot simpler than trying to create a custom IDataReader.
In this example, Dapper can be used to return query results as an IEnumerable<>:
IEnumerable<Order> orders = connection.Query<Order>(
    "select ... where category=@category",
    new { category = "Cars" });

var ordersWithDate = orders.Select(ord => new OrderWithDate {
    ....
    SaleDate = DateTime.Parse(ord.DateString, CultureInfo.GetCultureInfo("en-GB"))
});

using var reader = ObjectReader.Create(ordersWithDate, "Id", "SaleDate", ...);
Custom transforming data readers
It's also possible to create custom data readers by implementing the IDataReader interface. Libraries like ExcelDataReader and CsvHelper provide such wrappers over their results; CsvHelper's CsvDataReader creates an IDataReader wrapper over the parsed CSV results. The downside to this is that IDataReader has a lot of methods to implement. GetSchemaTable will have to be implemented to provide column and type information to later transformation steps and to SqlBulkCopy.
IDataReader may be dynamic, but it requires adding a lot of hand-coded type information to work. In CsvDataReader most methods just forward the call to the underlying CsvReader, e.g.:
public long GetInt64(int i)
{
    return csv.GetField<long>(i);
}

public string GetName(int i)
{
    return csv.Configuration.HasHeaderRecord
        ? csv.HeaderRecord[i]
        : string.Empty;
}
But GetSchemaTable() is 70 lines, with defaults that aren't optimal. Why use string as the column type when the parser can already parse date and numeric data, for example?
One way to get around this is to create a new custom IDataReader using a copy of the previous reader's schema table and adding the extra columns. CsvDataReader's constructor accepts a DataTable schemaTable parameter to handle cases where its own GetSchemaTable isn't good enough. That DataTable could be modified to add extra columns:
/// <param name="csv">The CSV.</param>
/// <param name="schemaTable">The DataTable representing the file schema.</param>
public CsvDataReader(CsvReader csv, DataTable schemaTable = null)
{
    this.csv = csv;

    csv.Read();

    if (csv.Configuration.HasHeaderRecord)
    {
        csv.ReadHeader();
    }
    else
    {
        skipNextRead = true;
    }

    this.schemaTable = schemaTable ?? GetSchemaTable();
}
A DerivedColumnReader could be created that does just that in its constructor:
// constructor of a DerivedColumnReader<TSource, TResult> class
public DerivedColumnReader(string sourceName, string targetName,
                           Func<TSource, TResult> func, DataTable schemaTable)
{
    ...
    AddSchemaColumn(schemaTable, targetName);
    _schemaTable = schemaTable;
}

void AddSchemaColumn(DataTable dt, string targetName)
{
    var row = dt.NewRow();
    row["AllowDBNull"] = true;
    row["BaseColumnName"] = targetName;
    row["ColumnName"] = targetName;
    row["ColumnMapping"] = MappingType.Element;
    row["ColumnOrdinal"] = dt.Rows.Count + 1;
    row["DataType"] = typeof(TResult);
    //20-30 more properties
    dt.Rows.Add(row);
}
That's a lot of boilerplate that's eliminated with LINQ.
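For comparison, a rough sketch of the same derived-column idea expressed with LINQ plus FastMember's ObjectReader (csvRecords, DateString and the column names here are assumptions for illustration; if ObjectReader doesn't accept the anonymous type, a small DTO serves the same purpose):
var rows = csvRecords.Select(r => new
{
    r.Id,
    SaleDate = DateTime.Parse(r.DateString, CultureInfo.GetCultureInfo("en-GB"))
});

using (var bcp = new SqlBulkCopy(connection))
using (var reader = ObjectReader.Create(rows, "Id", "SaleDate"))
{
    bcp.DestinationTableName = "SomeTable";
    bcp.WriteToServer(reader);
}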
Just providing closure to this. The main question really is how we can avoid running into out-of-memory exceptions when fetching data from SQL without employing FETCH and OFFSET in the stored procedure. The resolution didn't require getting fancy with a custom reader similar to SqlDataReader; it just needed count checking and calling SqlBulkCopy in batches. The code is similar to what's written below:
using (var dataReader = sqlCmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    int rowCount = 0;

    while (dataReader.Read())
    {
        DataRow dataRow = SourceDataSet.Tables[source.ObjectName].NewRow();
        for (int i = 0; i < SourceDataSet.Tables[source.ObjectName].Columns.Count; i++)
        {
            dataRow[SourceDataSet.Tables[source.ObjectName].Columns[i]] = dataReader[i];
        }
        SourceDataSet.Tables[source.ObjectName].Rows.Add(dataRow);
        rowCount++;

        if (rowCount % recordLimitPerBatch == 0)
        {
            // Apply our field mapping
            ApplyFieldMapping();
            // Write it up
            WriteRecordsIntoDestinationSQLObject();
            // Remove from our dataset once we get to this point
            SourceDataSet.Tables[source.ObjectName].Rows.Clear();
        }
    }

    // Flush whatever is left over after the final full batch
    if (SourceDataSet.Tables[source.ObjectName].Rows.Count > 0)
    {
        ApplyFieldMapping();
        WriteRecordsIntoDestinationSQLObject();
        SourceDataSet.Tables[source.ObjectName].Rows.Clear();
    }
}
Where ApplyFieldMapping() makes field-specific changes to the contents of the DataTable, and WriteRecordsIntoDestinationSQLObject() performs the actual SqlBulkCopy write. This allowed us to call the stored procedure just once to fetch the data, and let the loop keep memory in check by writing records out and clearing them whenever we hit the preset recordLimitPerBatch.

Reading data from reader in C# with Sql Server

In C#, which is the more efficient way of reading a reader object: through integer indexes or through named indexes?
ad.Name = reader.GetString(0);
OR
ad.Name = reader["Name"].ToString();
The name overload needs to find the index first.
From MSDN:
a case-sensitive lookup is performed first. If it fails, a second case-insensitive search is made (a case-insensitive comparison is done using the database collation). Unexpected results can occur when comparisons are affected by culture-specific casing rules. For example, in Turkish, the following example yields the wrong results because the file system in Turkish does not use linguistic casing rules for the letter 'i' in "file".
From GetOrdinal (which is what the name lookup uses internally):
Because ordinal-based lookups are more efficient than named lookups, it is inefficient to call GetOrdinal within a loop. Save time by calling GetOrdinal once and assigning the results to an integer variable for use within the loop.
so in a loop it might be more efficient to look up the ordinal index once and reuse it in the loop body.
However, the name lookup is backed by a class that uses a Hashtable, which is still very efficient.
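Putting the two quotes together, a minimal sketch of the recommended pattern for the question's assignment (ad comes from the question):
int nameOrdinal = reader.GetOrdinal("Name");   // resolve the name once
while (reader.Read())
{
    ad.Name = reader.GetString(nameOrdinal);   // typed, ordinal-based access inside the loop
}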
reader.GetString(index);
This will get the row value at that column index as a string. The second solution is more flexible because it allows you to get the value at that index and cast it to your own preferred type.
Example:
String name = reader["Name"].ToString();
int age = (int) reader["Age"];
ad.Name = reader["Name"].ToString();
This is the more robust way.
Even if you change the database table structure afterwards, there is no effect on this code, since you have referenced the column name directly.
But with a column index, the index will change when you add any column to the table before this one.

"cursor like" reading inside a CLR procedure/function

I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:
Execute several queries (usually 20-50, but up to 100-200) which all have the form select a,b,... from some_table order by xyz. There's an index which fits that query, so the result should be available more or less without any calculation.
Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.
Aggregate some result by stepping over the results. I will only consume the first parts of the results, but cannot predict how much I will need. The stop criteria depends on some threshold inside the algorithm.
My idea was to open several SqlDataReader, but I have two problems with that solution:
You can have only one SqlDataReader per connection and inside a CLR method I have only one connection - as far as I understand.
I don't know how to tell SqlDataReader to read data in chunks. I could not find documentation on how SqlDataReader is supposed to behave. As far as I understand, it prepares the whole result set and would load the whole result into memory, even if I consume only a small part of it.
Any hint how to solve that as a CLR method? Or is there a more low level interface to SQL server which is more suitable for my problem?
Update: I should have made two points more explicit:
I'm talking about big data sets, so a query might return 1 million records, but my algorithm would consume only the first 100-200. As I said before: I don't know the exact number beforehand.
I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.
SqlDataReader does not read the whole result set; you are confusing it with the DataSet class. It reads row by row, as the .Read() method is called. If a client does not consume the result set, the server will suspend the query execution because it has no room to write the output into (the selected rows). Execution resumes as the client consumes more rows (i.e. as SqlDataReader.Read is called). There is even a special command behavior flag, SequentialAccess, that instructs ADO.NET not to pre-load the entire row in memory, which is useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
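As a rough sketch of what that means for the algorithm in the question (cmd is an assumed SqlCommand, and the 200-row threshold is just the questioner's own figure):
using (SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    int consumed = 0;
    while (reader.Read())
    {
        // ...feed reader[0], reader[1], ... into the algorithm...
        if (++consumed >= 200)
            break;   // stop early; the remaining rows are never pulled from the server
    }
}
For a very large pending result set, calling cmd.Cancel() before the reader is disposed avoids draining the unread rows on close.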
You can have multiple active result sets (multiple SqlDataReaders) on a single connection when MARS is enabled. However, MARS is incompatible with SQLCLR context connections.
So you can create a CLR streaming TVF to do some of what you need in CLR, but only if you have a single SQL query source. Multiple queries would require you to abandon the context connection and instead use a fully fledged connection, i.e. connect back to the same instance in a loopback, and this would allow MARS and thus let you consume multiple result sets. But loopback has its own issues, as it breaks the transaction boundaries you have from the context connection. Specifically, with a loopback connection your TVF won't be able to read the changes made by the same transaction that called the TVF, because it is a different transaction on a different connection.
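A sketch of the two kinds of connection (the loopback connection string is an assumption; adjust for your instance):
// Context connection: in-process and fast, but no MARS, so only one open reader at a time.
using (var conn = new SqlConnection("context connection=true"))
{
    conn.Open();
    // ... a single SqlDataReader at a time ...
}

// Loopback connection: MARS can be enabled, allowing several readers on one connection,
// but it is a separate session and a separate transaction from the caller.
using (var conn = new SqlConnection(
    "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;MultipleActiveResultSets=True"))
{
    conn.Open();
    // ... several SqlDataReaders may be open concurrently ...
}
Note that a loopback (non-context) connection from SQLCLR generally requires the assembly to be granted at least EXTERNAL_ACCESS permission.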
SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.
1) write set based logic to update the data without cursors
2) use deterministic User Defined Functions with set based logic (you can do this with the SqlFunction attribute in CLR code). A non-deterministic function has the effect of turning the query into a cursor internally; it means the output value is not always the same given the same input.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
    int value3 = ... ;
    return value3;
}
3) use cursors as a last resort. This is a powerful way to execute logic per row on the database, but it has a performance impact. It appears from this article that CLR can outperform SQL cursors (thanks Martin).
I saw your comment that the complexity of using set based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems - CTE, Views, partitioning etc.
Of course you may well be right in your approach, and I don't know what you are trying to do, but my gut says leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a SP to run concurrent processing, but don't do this inside the CLR.
To answer your question, with CLR implementations (and IDataReader) you don't really need to page results in chunks, because you are not loading the whole result into memory or transporting it over the network in one go. IDataReader gives you access to the data stream row by row. By the sounds of it, your algorithm determines the number of records that need updating, so when that happens simply stop calling Read() and end at that point.
SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);

SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);

SqlDataReader reader = comm.ExecuteReader();

bool flag = true;
while (reader.Read() && flag)
{
    int value1 = Convert.ToInt32(reader[0]);
    int value2 = Convert.ToInt32(reader[1]);

    // some algorithm
    int newValue = ...;

    // populate the output record (SetInt32 lives on SqlDataRecord, not on the reader)
    record.SetInt32(0, value1);
    record.SetInt32(1, value2);
    record.SetInt32(2, newValue);
    SqlContext.Pipe.SendResultsRow(record);

    // keep going?
    flag = newValue < 100;
}
SqlContext.Pipe.SendResultsEnd();
Cursors are a SQL-only feature. If you wanted to read chunks of data at a time, some sort of paging would be required so that only a certain number of records is returned at once. If using LINQ, Skip and Take can be used to limit the results returned:
.Skip(Skip)
.Take(PageSize)
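For example, a rough LINQ paging sketch (the entity, ordering key, and variable names are placeholders):
int pageSize = 200;
int pageIndex = 0;

var page = dbContext.SomeTable
    .OrderBy(x => x.Xyz)              // paging needs a stable ordering
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .ToList();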
You can simply iterate over the DataReader by doing something like this:
using (IDataReader reader = Command.ExecuteReader())
{
while (reader.Read())
{
//Do something with this record
}
}
This iterates over the results one at a time, similar to a cursor in SQL Server.
For multiple recordsets at once, try MARS
(if SQL Server)
http://msdn.microsoft.com/en-us/library/ms131686.aspx

Initialising array to a dynamic size - C#

I am using an array to hold the results of an SQLite query; at the moment I am using a two-dimensional array to do this.
The problem I have is that I have to manually specify the size of one of the dimensions before the array can be used, which feels like a waste as I have no real knowledge of how many rows will be returned. I do use a property which lets me retrieve the number of columns present, and that is used in the array's initialisation.
Example:
using (SQLiteDataReader dr = cmd.ExecuteReader())
{
    results = new object[10, dr.FieldCount];
    while (dr.Read())
    {
        jIterator++;
        for (int i = 0; i < dr.FieldCount; i++)
        {
            results[jIterator, i] = dr.GetValue(i);
        }
    }
}
//I know there are a few count bugs
Example Storage:
I simply add the data to the array for as long as the while loop returns true; in this instance I set the first dimension to 10 as I already know how many elements are in the database and 10 will be more than enough.
How would I go about changing the array so its size can be dynamic? Is this even possible, given the way I am getting the database results?
You should not be using an array; you should be using a dynamically-sized container like List<T>. Or, better yet, you could use something like the ADO.NET DataTable, since this is exactly what it's designed to store, and using a DbDataAdapter will avoid having to repeat the same IDataReader code all over the place.
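A minimal sketch of both suggestions against the question's command object (the SQLite types are assumed to come from System.Data.SQLite):
// 1) A dynamically sized container: one object[] per row, no row count needed up front.
var rows = new List<object[]>();
using (SQLiteDataReader dr = cmd.ExecuteReader())
{
    while (dr.Read())
    {
        var row = new object[dr.FieldCount];
        dr.GetValues(row);
        rows.Add(row);
    }
}

// 2) Or let a data adapter fill a DataTable and skip the hand-rolled loop entirely.
var table = new DataTable();
using (var adapter = new SQLiteDataAdapter())
{
    adapter.SelectCommand = cmd;
    adapter.Fill(table);
}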
Don't use an array. Use a generic list, System.Collections.Generic.List<T>; you can even have a List of Lists. These are resizable and do all of that plumbing for you.
I would suggest defining a structure to hold your contact information, then just creating a generic list of that type. They can expand without limit.
You have several options:
Use a List<T[]> instead of a 2 dimensional array.
Load your datareader into a dataset instead of an array
Skip the datareader entirely and use a dataadapter to fill a dataset
Use an iterator block to transform the datareader into an IEnumerable instead of an array
Of these, in most cases by far your best option is the last; it means you don't need to have the entire result set of your query in memory at one time, which is the main point for using an SqlDataReader in the first place.
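A sketch of that last option as an iterator block (method and variable names are made up):
public static IEnumerable<object[]> ReadRows(SQLiteCommand cmd)
{
    using (SQLiteDataReader dr = cmd.ExecuteReader())
    {
        while (dr.Read())
        {
            var row = new object[dr.FieldCount];
            dr.GetValues(row);
            yield return row;   // rows are handed to the caller one at a time; nothing is buffered
        }
    }
}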
You can declare an array in C# without specifying a size up front:
int[] intarray;
but it still has to be allocated with a fixed size before use, so as other responses suggest, a List may be more appropriate.

Speed of DataSet row/column lookups?

Recently I had to do some very processing-heavy stuff with data stored in a DataSet. It was heavy enough that I ended up using a tool to help identify some bottlenecks in my code. When I was analyzing the bottlenecks, I noticed that although DataSet lookups were not terribly slow (they weren't the bottleneck), they were slower than I expected. I always assumed that DataSets used some sort of HashTable-style implementation, which would make lookups O(1) (or at least that's what I think HashTables do). The speed of my lookups seemed to be significantly slower than that.
I was wondering if anyone who knows anything about the implementation of .NET's DataSet class would care to share what they know.
If I do something like this:
DataTable dt = new DataTable();
if (dt.Columns.Contains("SomeColumn"))
{
    object o = dt.Rows[0]["SomeColumn"];
}
How fast would the lookup time be for the Contains(...) method, and for retrieving the value to store in object o? I would have thought it would be very fast, like a HashTable (assuming what I understand about HashTables is correct), but it doesn't seem like it...
I wrote that code from memory so some things may not be "syntactically correct".
Actually, it's advisable to use an integer when referencing a column, which can improve performance a lot. To keep things manageable, you could declare a constant integer. So instead of what you did, you could do:
const int SomeTable_SomeColumn = 0;
DataTable dt = new DataTable();
// Columns.Contains() takes a column name, so with an ordinal check the column count instead
if (dt.Columns.Count > SomeTable_SomeColumn)
{
    object o = dt.Rows[0][SomeTable_SomeColumn];
}
Via Reflector the steps for DataRow["ColumnName"] are:
Get the DataColumn from ColumnName. Uses the row's DataColumnCollection["ColumnName"]. Internally, DataColumnCollection stores its DataColumns in a Hashtable. O(1)
Get the DataRow's row index. The index is stored in an internal member. O(1)
Get the DataColumn's value at the index using DataColumn[index]. DataColumn stores its data in a System.Data.Common.DataStorage (internal, abstract) member:
return dataColumnInstance._storage.Get(recordIndex);
A sample concrete implementation is System.Data.Common.StringStorage (internal, sealed). StringStorage (and the other concrete DataStorages I checked) store their values in an array. Get(recordIndex) simply grabs the object in the value array at the recordIndex. O(1)
So overall you're O(1) but that doesn't mean the hashing and function calling during the operation is without cost. It just means it doesn't cost more as the number of DataRows or DataColumns increases.
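If the same column is read across many rows, even that per-row name-to-column hash lookup can be hoisted out of the loop by indexing with the DataColumn itself, e.g.:
DataColumn col = dt.Columns["SomeColumn"];   // the name lookup happens once
foreach (DataRow row in dt.Rows)
{
    object o = row[col];                     // DataRow has an indexer overload that takes a DataColumn
}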
Interesting that DataStorage uses an array for values. Can't imagine that's easy to rebuild when you add or remove rows.
I imagine that any lookups would be O(n), as I don't think they would use any type of hashtable, but would actually use more of an array for finding rows and columns.
Actually, I believe the columns names are stored in a Hashtable. Should be O(1) or constant lookup for case-sensitive lookups. If it had to look through each, then of course it would be O(n).
