Read a SQLDataReader more than once in C#? - c#

I believe that in Java you can reset a result set to point back to its first row. I have a SqlDataReader that has over 20 columns, and I need to pre-process some columns and calculate some values before I use the entire result set. Is it possible to re-read the data reader after reading it the first time, to get just selected columns? Or will I need to store the results in a DataTable and read from that later?

From the MSDN docs at http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqldatareader.aspx
Provides a way of reading a forward-only stream of rows from a SQL Server database.
You can read one row at a time, but you cannot go back to a row once you've read the next row.
You can read columns in a row in any order, any number of times, and skip fields if you want.
while (reader.Read())
{
    var a = reader[2]; // you can skip fields
    var b = reader[0]; // you don't have to read the fields in order
    var c = reader[2]; // you can re-read fields
}

No, it is not possible.
From SqlDataReader on MSDN:
Provides a way of reading a forward-only stream of rows from a SQL Server database.
(emphasis mine)
You can read all the needed data into a collection and iterate over the copy as often as you want.
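For example, here is a minimal sketch of that idea, where reader is the SqlDataReader from the question and "SomeColumn" is a placeholder column name: load the forward-only stream into a DataTable once, then make as many passes over the copy as you like.

var table = new DataTable();
table.Load(reader); // consumes the reader

// first pass: pre-process selected columns and calculate values
foreach (DataRow row in table.Rows)
{
    var rawValue = row["SomeColumn"]; // placeholder column name
    // ... calculate derived values ...
}

// second pass: use the entire result set
foreach (DataRow row in table.Rows)
{
    // ... main processing ...
}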

Perhaps you could run two queries as part of a stored procedure: the first limited to the columns needed for the pre-calculated values, the second returning the rest of the columns. Then, in your C# code, use reader.NextResult() to move from the first result set to the second.
Alternatively, SQL Server supports computed columns, which might make things simpler.
But personally I'd read the whole data into a List<T> for processing...
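As a rough sketch of the two-result-set idea, assuming the stored procedure returns the narrow pre-calculation set first and the full set second (the column ordinals here are made up), you read the first set, call NextResult(), and then read the second:

using (var reader = cmd.ExecuteReader())
{
    // first result set: only the columns needed for the pre-calculation
    while (reader.Read())
    {
        var key = reader.GetInt32(0);
        var amount = reader.GetDecimal(1);
        // ... accumulate whatever the second pass needs ...
    }

    reader.NextResult(); // advance to the second result set

    // second result set: all 20+ columns, processed using the values above
    while (reader.Read())
    {
        // ... main processing ...
    }
}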

Related

SqlBulkCopy WriteToServer with an IDataReader instead of DataTable and Programmatically Adjusted Field Values

We have a working code in C# that utilizes SqlBulkCopy to insert records into a table from a stored procedure source. At a high-level:
Reads data from a stored procedure into a DataTable. Essentially it calls the SP and uses a data adapter to fill the DataTable. Let's call this srcDataTable.
Dynamically maps the column names between source and destination through configuration, a table that's similar to the following:
TargetTableName | ColumnFromSource  | ColumnInDestination | DefaultValue | Formatting
TableA          | StudentFirstName  | FirstName           | NULL         | NULL
TableA          | StudentLastName   | LastName            | NULL         | NULL
TableA          | Birthday          | Birthdate           | 1/1/1900     | dd/MM/yyyy
Based on the mapping from #2, sets up new rows from srcDataTable (via DataTable.NewRow()) in another DataTable that matches the structure of the destination table (which is what ColumnInDestination refers to). Let's call this targetDataTable. As you can see from the table, there may be instances where the value from the source is not specified, or needs to be formatted a certain way. This is the primary reason we're adding data rows on the fly to another DataTable; the adjusting/defaulting of the values is handled in code.
Calls SqlBulkCopy to write all the rows in targetDataTable to the actual SQL table.
This approach has been working alright in tandem with stored procedures that use FETCH and OFFSET, so they only return X rows at a time to stay within memory constraints. Unfortunately, as we're getting more and more data sources that are north of 50 million rows, and as we're having to share servers, we need to find a faster way to do this while keeping memory consumption in check. Researching options, it seems like using an IDataReader with SqlBulkCopy will let us limit the memory consumption of the code, without having to delegate fetching X records at a time to the stored procedure itself anymore.
In terms of preserving current functionality, it looks like we can use SqlBulkCopy's ColumnMappings to keep mapping the fields even if they're named differently. What I can't confirm, however, is the defaulting or formatting of the values.
Is there a way to extend the DataReader's Read() method so that we can introduce that same logic to revise whatever value will be written to the destination if there's configuration asking us to? So a) check if the current row has a value populated from the source, b) default its value to the destination table if configured, and c) apply formatting rules as it gets written to the destination table.
You appear to be asking "can I make my own class that implements IDataReader and has some altered logic to the Read() method?"
The answer is yes; you can write your own data reader that does whatever it likes in Read(), even format the server's hard disk as soon as it's called. When you implement an interface you aren't "extend[ing] the DataReader's Read method"; you're providing your own implementation that externally appears to obey a specific contract, but the implementation detail is entirely up to you. If you want, upon every Read, to pull down a row from database X into a temp array, zip through the array tweaking the values to apply defaults or other adjustments, and then return true, that's fine.
If you wanted to do the value adjustment in the GetXXX methods instead, that's also fine; you're writing the reader, so you decide. All the bulk copier is going to do is call Read until it returns false and write the data it gets from e.g. GetValue. (If it wasn't immediately clear: Read doesn't produce the data that will be written, GetValue does. Read is just an instruction to move to the next set of data that must be written, and it doesn't even have to do that. You could implement Read as { return DateTime.Now.DayOfWeek == DayOfWeek.Monday; } and GetValue as { return Guid.NewGuid().ToString(); } and your copy operation would spend until 23:59:59.999 filling the database with GUIDs, but only on a Monday.)
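To make that concrete, here is a minimal sketch of such a wrapper (the class name and the transform delegate are mine, not from the question); the defaulting/formatting rules from the mapping configuration would live in the delegate, and everything else just forwards to the wrapped reader:

using System;
using System.Data;

public sealed class TransformingDataReader : IDataReader
{
    private readonly IDataReader inner;
    private readonly Func<int, object, object> transform; // (ordinal, raw value) -> value to write

    public TransformingDataReader(IDataReader inner, Func<int, object, object> transform)
    {
        this.inner = inner;
        this.transform = transform;
    }

    // The members SqlBulkCopy actually exercises.
    public bool Read() => inner.Read();
    public int FieldCount => inner.FieldCount;
    public object GetValue(int i) => transform(i, inner.GetValue(i));
    public int GetOrdinal(string name) => inner.GetOrdinal(name);
    public string GetName(int i) => inner.GetName(i);
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public bool IsDBNull(int i) { var v = GetValue(i); return v == null || v == DBNull.Value; }
    public int GetValues(object[] values)
    {
        int n = Math.Min(values.Length, FieldCount);
        for (int i = 0; i < n; i++) values[i] = GetValue(i);
        return n;
    }

    // Everything else simply delegates to the wrapped reader.
    public int Depth => inner.Depth;
    public bool IsClosed => inner.IsClosed;
    public int RecordsAffected => inner.RecordsAffected;
    public void Close() => inner.Close();
    public void Dispose() => inner.Dispose();
    public DataTable GetSchemaTable() => inner.GetSchemaTable();
    public bool NextResult() => inner.NextResult();
    public Type GetFieldType(int i) => inner.GetFieldType(i);
    public string GetDataTypeName(int i) => inner.GetDataTypeName(i);
    public IDataReader GetData(int i) => inner.GetData(i);
    public bool GetBoolean(int i) => (bool)GetValue(i);
    public byte GetByte(int i) => (byte)GetValue(i);
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => inner.GetBytes(i, fieldOffset, buffer, bufferOffset, length);
    public char GetChar(int i) => (char)GetValue(i);
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => inner.GetChars(i, fieldOffset, buffer, bufferOffset, length);
    public DateTime GetDateTime(int i) => (DateTime)GetValue(i);
    public decimal GetDecimal(int i) => (decimal)GetValue(i);
    public double GetDouble(int i) => (double)GetValue(i);
    public float GetFloat(int i) => (float)GetValue(i);
    public Guid GetGuid(int i) => (Guid)GetValue(i);
    public short GetInt16(int i) => (short)GetValue(i);
    public int GetInt32(int i) => (int)GetValue(i);
    public long GetInt64(int i) => (long)GetValue(i);
    public string GetString(int i) => (string)GetValue(i);
}

You would then pass new TransformingDataReader(sourceReader, yourTransform) to SqlBulkCopy.WriteToServer instead of the raw reader.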
The question is a bit unclear. It looks like the actual question is whether it's possible to transform data before using SqlBulkCopy with a data reader.
There are a lot of ways to do this, and the appropriate one depends on what the rest of the ETL code does. Does it only work with data readers? Or does it load batches of rows that can be modified in memory?
Use IEnumerable<> and ObjectReader
FastMember's ObjectReader class creates an IDataReader wrapper over any IEnumerable<T> collection. This means that both strongly-typed .NET collections and iterator results can be sent to SqlBulkCopy.
IEnumerable<string> lines = File.ReadLines(filePath);
using (var bcp = new SqlBulkCopy(connection))
using (var reader = ObjectReader.Create(lines, "FileName"))
{
    bcp.DestinationTableName = "SomeTable";
    bcp.WriteToServer(reader);
}
It's possible to create a transformation pipeline using LINQ queries and iterator methods this way, and feed the result to SqlBulkCopy using ObjectReader. The code is a lot simpler than trying to create a custom IDataReader.
In this example, Dapper can be used to return query results as an IEnumerable<>:
IEnumerable<Order> orders = connection.Query<Order>(
    "select ... where category=@category",
    new { category = "Cars" });

var ordersWithDate = orders.Select(ord => new OrderWithDate {
    ....
    SaleDate = DateTime.Parse(ord.DateString, CultureInfo.GetCultureInfo("en-GB"))
});

using var reader = ObjectReader.Create(ordersWithDate, "Id", "SaleDate", ...);
Custom transforming data readers
It's also possible to create custom data readers by implementing the IDataReader interface. Libraries like ExcelDataReader and CsvHelper provide such wrappers over their results; CsvHelper's CsvDataReader creates an IDataReader wrapper over the parsed CSV records. The downside is that IDataReader has a lot of methods to implement. GetSchemaTable will have to be implemented to provide column and type information to later transformation steps and to SqlBulkCopy.
IDataReader may be dynamic, but it requires adding a lot of hand-coded type information to work. In CsvDataReader most methods just forward the call to the underlying CsvReader, e.g.:
public long GetInt64(int i)
{
    return csv.GetField<long>(i);
}

public string GetName(int i)
{
    return csv.Configuration.HasHeaderRecord
        ? csv.HeaderRecord[i]
        : string.Empty;
}
But GetSchemaTable() is 70 lines, with defaults that aren't optimal. Why use string as the column type when the parser can already parse date and numeric data, for example?
One way to get around this is to create a new custom IDataReader using a copy of the previous reader's schema table and adding the extra columns. CsvDataReader's constructor accepts a DataTable schemaTable parameter to handle cases where its own GetSchemaTable isn't good enough. That DataTable could be modified to add extra columns:
/// <param name="csv">The CSV.</param>
/// <param name="schemaTable">The DataTable representing the file schema.</param>
public CsvDataReader(CsvReader csv, DataTable schemaTable = null)
{
this.csv = csv;
csv.Read();
if (csv.Configuration.HasHeaderRecord)
{
csv.ReadHeader();
}
else
{
skipNextRead = true;
}
this.schemaTable = schemaTable ?? GetSchemaTable();
}
A DerivedColumnReader could be created that does just that in its constructor:
// constructor of a DerivedColumnReader<TSource, TResult> class
public DerivedColumnReader(string sourceName, string targetName,
    Func<TSource, TResult> func, DataTable schemaTable)
{
    ...
    AddSchemaColumn(schemaTable, targetName);
    _schemaTable = schemaTable;
}
void AddSchemaColumn(DataTable dt, string targetName)
{
    var row = dt.NewRow();
    row["AllowDBNull"] = true;
    row["BaseColumnName"] = targetName;
    row["ColumnName"] = targetName;
    row["ColumnMapping"] = MappingType.Element;
    row["ColumnOrdinal"] = dt.Rows.Count + 1;
    row["DataType"] = typeof(TResult);
    //20-30 more properties
    dt.Rows.Add(row);
}
That's a lot of boilerplate that's eliminated with LINQ.
Just providing closure to this. The main question really was how we can avoid running into out-of-memory exceptions when fetching data from SQL Server without employing FETCH and OFFSET in the stored procedure. The resolution didn't require getting fancy with a custom reader similar to SqlDataReader; adding count checking and calling SqlBulkCopy in batches was enough. The code is similar to what's written below:
using (var dataReader = sqlCmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    int rowCount = 0;
    while (dataReader.Read())
    {
        DataRow dataRow = SourceDataSet.Tables[source.ObjectName].NewRow();
        for (int i = 0; i < SourceDataSet.Tables[source.ObjectName].Columns.Count; i++)
        {
            dataRow[SourceDataSet.Tables[source.ObjectName].Columns[i]] = dataReader[i];
        }
        SourceDataSet.Tables[source.ObjectName].Rows.Add(dataRow);
        rowCount++;
        if (rowCount % recordLimitPerBatch == 0)
        {
            // Apply our field mapping
            ApplyFieldMapping();
            // Write it up
            WriteRecordsIntoDestinationSQLObject();
            // Remove from our dataset once we get to this point
            SourceDataSet.Tables[source.ObjectName].Rows.Clear();
        }
    }
    // Note: any rows left over after the loop (a partial final batch) still need
    // one last ApplyFieldMapping/WriteRecordsIntoDestinationSQLObject call.
}
Where ApplyFieldMapping() makes field-specific changes to the contents of the DataTable, and WriteRecordsIntoDestinationSQLObject() performs the actual SqlBulkCopy write. This allowed us to call the stored procedure just once to fetch the data, and let the loop keep memory in check by writing records out and clearing them each time we hit the preset recordLimitPerBatch.

Migrating big data to new database

I'd like to transfer a large amount of data from SQL Server to MongoDB (Around 80 million records) using a solution I wrote in C#.
I want to transfer say 200 000 records at a time, but my problem is keeping track of what has already been transferred. Normally I'd do it as follows:
Gather IDs from destination to exclude from source scope
Read from source (Excluding IDs already in destination)
Write to destination
Repeat
The problem is that I build a string in C# containing all the IDs that exist in the destination, for the purpose of excluding those from source selection, eg.
select * from source_table where id not in (<My large list of IDs>)
Now you can imagine what happens here once I have already inserted 600 000+ records and then build a string with all those IDs: it gets huge and slows things down even more. So I'm looking for a way to iterate through, say, 200 000 records at a time, like a cursor, but I have never done something like this before, so I'm here looking for advice.
Just as a reference, I do my reads as follows
SqlConnection conn = new SqlConnection(myConnStr);
conn.Open();
SqlCommand cmd = new SqlCommand("select * from mytable where id not in (" + bigListOfIDs + ")", conn);
SqlDataReader reader = cmd.ExecuteReader();

if (reader.HasRows)
{
    while (reader.Read())
    {
        //Populate objects for insertion into MongoDB
    }
}
So basically, I want to know how to iterate through large amounts of data without selecting all that data in one go, or having to filter the data using large strings. Any help would be appreciated.
Need more rep to comment, but if you sort by your id column you could change your where clause to become
select * from source_table where *lastusedid* < id and id <= *lastusedid+200000*
which will give you the range of 200 000 you asked for, and you only need to store a single integer.
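A minimal sketch of that range approach in C# (assuming a positive integer id column; table and column names are taken from the question):

long maxId;
using (var maxCmd = new SqlCommand("select max(id) from source_table", conn))
{
    maxId = Convert.ToInt64(maxCmd.ExecuteScalar());
}

const long batchSize = 200000;
for (long lastUsedId = 0; lastUsedId < maxId; lastUsedId += batchSize)
{
    using (var cmd = new SqlCommand(
        "select * from source_table where id > @from and id <= @to", conn))
    {
        cmd.Parameters.AddWithValue("@from", lastUsedId);
        cmd.Parameters.AddWithValue("@to", lastUsedId + batchSize);
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                //Populate objects and insert this batch into MongoDB
            }
        }
    }
}

Each iteration touches at most one fixed window of ids, so the query stays the same size no matter how many records have already been transferred.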
There are many different ways of doing this, but I would suggest first that you don't try to reinvent the wheel but look at existing programs.
There are many programs designed to export and import data between different databases, some are very flexible and expensive, but others come with free options and most DBMS programs include something.
Option 1:
Use SQL Server Management Studio (SSMS) Export wizards.
This allows you to export to different sources. You can even write complex queries if required. More information here:
https://www.mssqltips.com/sqlservertutorial/202/simple-way-to-export-data-from-sql-server/
Option 2:
Export your data in ascending ID order.
Store the last exported ID in a table.
Export the next set of data where ID > lastExportedID
Option 3:
Create a copy of your data in a back-up table.
Export from this table, and delete the records as you export them.

Strange error when adding row to datatable

I'm hoping you guys can help me figure out why this is happening. I've been tearing my hair out trying to figure this out.
Here's an example directly from my code (with the boring bits cut out)
...(Set up the connection and command, initialize a datatable "dataTable")...
using (SqlDataReader reader = cmd.ExecuteReader())
{
    //Query storage object
    object[] buffer = new object[reader.FieldCount];

    //Set the datatable schema to the schema from the query
    dataTable = reader.GetSchemaTable();

    //Read the query
    while (reader.Read())
    {
        reader.GetValues(buffer);
        dataTable.Rows.Add(buffer);
    }
}
The error is
Input string was not in a correct format. Couldn't store in NumericScale Column. Expected type is Int16.
The specific column data types as returned by the schema are (ordered by column)
System.Data.SqlTypes.SqlInt32
System.Data.SqlTypes.SqlInt32
System.Data.SqlTypes.SqlByte
System.Data.SqlTypes.SqlMoney
System.Data.SqlTypes.SqlString
System.Data.SqlTypes.SqlGuid
System.Data.SqlTypes.SqlDateTime
It would appear that the data that should be in column #5 is actually appearing in column #3. But that is pure speculation.
What I know is that, in order to use a DataTable "dynamically" with a query that can contain any number of different types of data, the best route is to use GetSchemaTable() to retrieve the schema.
What I Saw In The Debugger
When I dropped into the debugger I took a look at dataTable's types built from the schema vs. the types returned to the object from reader.GetValues(). They are exactly the same.
It seems like dataTable.Rows.Add(buffer) is adding the columns a few columns off from where it should be. But this shouldn't be possible. Especially with the schema being directly built from the reader. I've played with options such as "CommandBehavior.KeyInfo" within ExecuteReader() and still had the same error occur.
Note: I need to run the query this way to enable the end-user to halt the query mid-read. Please do not suggest I scrap this and use an SqlDataAdapter or DataTable.Load() solution.
I'd really appreciate any help. Thank you!
The DbDataReader.GetSchemaTable() method returns a metadata table containing the column names and types of the result set. It is not the empty data table one might expect. For more details see MSDN.
I'm sorry, but GetSchemaTable retrieves the schema of the result set, while GetValues retrieves the actual row data. E.g. your dataTable will contain columns like column name, column type, etc. (see the MSDN reference), while your buffer will contain the actual data, whose representation differs from what the dataTable contains.
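One way to keep the row-by-row loop (so the user can still halt mid-read) is to use the schema table only to create matching columns in a fresh DataTable, and then add the data rows to that, roughly:

DataTable schema = reader.GetSchemaTable();
var dataTable = new DataTable();
foreach (DataRow col in schema.Rows)
{
    // "ColumnName" and "DataType" are standard schema-table columns
    dataTable.Columns.Add((string)col["ColumnName"], (Type)col["DataType"]);
}

object[] buffer = new object[reader.FieldCount];
while (reader.Read())
{
    reader.GetValues(buffer);
    dataTable.Rows.Add(buffer);
}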
Why don't you just use:
dataTable.Load(reader);
Why are you manually loading it one row at a time?
Please check what you are inserting into the column that has the GUID data type. I got the same error and found that, while inserting records read from a CSV file, there were a couple of empty spaces in the CSV file.

DataReader has rows and data, trying to read from it says "no data is present"

I haven't used DataReaders in ages (I prefer to use an ORM) but I'm forced to at work. I pull back the rows, and check that HasRows is true; debugging at this point and examining the reader shows that my data is there.
Now here's the issue: the moment I call reader.Read(), trying to expand the results says "The enumeration yielded no results" or whatever, and I get the "Invalid attempt to read when no data is present." error. I get the same thing if I don't call Read() (which is the default since the DataReader starts before the first record).
I cannot remember the proper way to handle this. The data is there when I check HasRows, but it's gone the moment I either try to read from it right afterwards or after I call Read(). That makes no sense: if I don't call Read, the reader should still be before the first record, and if the command behavior that starts it at the first record is set (SingleRow? I forget the name of it), then I should be able to read rows without calling Read. However, both ways seem to move past the row containing the data.
What am I forgetting? Code is fairly straightforward:
TemplateFile file = null;

using (DbDataReader reader = ExecuteDataReaderProc("GetTemplateByID", idParam))
{
    if (reader.HasRows) // reader has data at this point - verified with debugger
    {
        reader.Read(); // loses data at this point if I call Read()

        file = new TemplateFile
        {
            FileName = Convert.ToString(reader["FileName"]) // whether or not I call
                                                            // Read, says no data here
        };
    }
}
Just to clarify the answer: it was the debugger. Expanding the results view calls Read(), and therefore it moves past the row. As Marc Gravell said in a comment: Debugger considered harmful.
If you want to put the data into a file, start by loading a DataTable instead of using a DataReader.
With the DataReader, as has been mentioned in the comments, you might want to iterate through the result set with a while loop
while (reader.Read())
{
}
The loop reads one row at a time and quits when all of the rows have been read.
Once you move to the next row, the previous rows are no longer available unless you have put them into some other structure, like a list or DataTable.
But you can use a DataAdapter to fill a DataTable, so there might not be a reason to use a DataReader at all. Then you can write to a file from the DataTable, as sketched below.
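Something along these lines (the stored procedure and parameter come from the question; the connection variable and the rest of the setup are assumed):

var table = new DataTable();
using (var adapter = new SqlDataAdapter("GetTemplateByID", connection))
{
    adapter.SelectCommand.CommandType = CommandType.StoredProcedure;
    adapter.SelectCommand.Parameters.Add(idParam);
    adapter.Fill(table); // opens and closes the connection as needed
}

if (table.Rows.Count > 0)
{
    string fileName = Convert.ToString(table.Rows[0]["FileName"]);
    // write the rest of the row(s) to the file here
}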
In any event, I don't see how this line could work.
FileName = Convert.ToString(reader["FileName"])
I can post additional code for either approach if you like.
HTH Harvey Sather

Join multiple DataRows into a single DataRow

I am writing this in C# using .NET 3.5. I have a System.Data.DataSet object with a single DataTable that uses the following schema:
Id : uint
AddressA: string
AddressB: string
Bytes : uint
When I run my application, let's say the DataTable gets filled with the following:
1 192.168.0.1 192.168.0.10 300
2 192.168.0.1 192.168.0.20 400
3 192.168.0.1 192.168.0.30 300
4 10.152.0.13 167.10.2.187 80
I'd like to be able to query this DataTable where AddressA is unique and the Bytes column is summed together (I'm not sure I'm saying that correctly). In essence, I'd like to get the following result:
1 192.168.0.1 1000
2 10.152.0.13 80
I ultimately want this result in a DataTable that can be bound to a DataGrid, and I need to update/regenerate this result every 5 seconds or so.
How do I do this? DataTable.Select() method? If so, what does the query look like? Is there an alternate/better way to achieve my goal?
EDIT: I do not have a database. I'm simply using an in-memory DataSet to store the data, so a pure SQL solution won't work here. I'm trying to figure out how to do it within the DataSet itself.
For readability (and because I love it) I would try to use LINQ:
var aggregatedAddresses = from DataRow row in dt.Rows
                          group row by row["AddressA"] into g
                          select new {
                              Address = g.Key,
                              Byte = g.Sum(row => (uint)row["Bytes"])
                          };
// result is a DataTable with Id, AddressA and Bytes columns (built beforehand)
int i = 1;
foreach (var row in aggregatedAddresses)
{
    result.Rows.Add(i++, row.Address, row.Byte);
}
If a performance issue is discovered with the LINQ solution, I would go with a manual solution: summing up the rows in a loop over the original table and inserting them into the result table.
You can also bind the aggregatedAddresses directly to the grid instead of putting it into a DataTable.
The most efficient solution would be to do the sum in SQL directly:
select AddressA, SUM(bytes) from ... group by AddressA
I agree with Steven as well that doing this on the server side is the best option. If you are using .NET 3.5, though, you don't have to go through what Rune suggests. Rather, use the DataSet extension methods to help query and sum the values.
Then, you can map it easily to an anonymous type which you can set as the data source for your grid (assuming you don't allow edits to this, which I don't see how you can, since you are aggregating the data).
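A rough sketch of what that looks like with the System.Data.DataSetExtensions helpers (dt is the original DataTable from the question):

var aggregated = dt.AsEnumerable()
    .GroupBy(row => row.Field<string>("AddressA"))
    .Select(g => new
    {
        Address = g.Key,
        Bytes = g.Sum(row => (long)row.Field<uint>("Bytes"))
    });

// e.g. dataGrid.DataSource = aggregated.ToList();

The Field<T> calls keep the column access strongly typed, and the anonymous-type result can be bound straight to the grid.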
I agree with Steven that the best way to do this is to do it in the database. But if that isn't an option you can try the following:
Make a new datatable and add the columns you need manually using DataTable.Columns.Add(name, datatype)
Step through the first DataTable's Rows collection and, for each row, create a new row in your new DataTable using DataTable.NewRow()
Copy the values of the columns found in the first table into the new row
Find the matching row in the other data table using Select() and copy out the final value into the new data row
Add the row to your new data table using DataTable.Rows.Add(newRow)
This will give you a new DataTable containing the combined data from the two tables. It won't be very fast, but unless you have huge amounts of data it will probably be fast enough. Try to avoid doing a LIKE query in the Select(), though, as that is slow.
One optimization would be possible if both tables contain rows with identical primary keys: you could sort both tables and step through them, fetching both data rows by their array index. That would rid you of the Select() call.
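For illustration only, a rough sketch of that manual approach (the question itself has just one table, so firstTable, otherTable and the Id key below are hypothetical):

var combined = new DataTable();
combined.Columns.Add("AddressA", typeof(string));
combined.Columns.Add("Bytes", typeof(long));

foreach (DataRow src in firstTable.Rows)
{
    DataRow newRow = combined.NewRow();
    newRow["AddressA"] = src["AddressA"];

    // find the matching row by exact key, not with a LIKE query
    DataRow[] matches = otherTable.Select("Id = " + src["Id"]);
    if (matches.Length > 0)
    {
        newRow["Bytes"] = matches[0]["Bytes"];
    }

    combined.Rows.Add(newRow);
}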
