I'm using SqlBulkCopy to insert large amounts of data into SQL Server. The code is simple, like the following:
using (var copy = new SqlBulkCopy(connection))
{
    copy.BatchSize = 10000;
    copy.DestinationTableName = "MyTableName";
    // other parameters...
    copy.WriteToServer(reader);
}
The problem I'm having is that the data volume is large, so the copy operation takes a very long time to finish. Sometimes there are transient errors in the network or in SQL Server, which cause the WriteToServer call to fail.
In these situations, I want to be able to continue the copy from the failed batch instead of deleting all the data and redoing the copy from the very beginning. But I can't, because by the time I catch the exception thrown from WriteToServer, the reader object is already past the failed position.
One solution I can think of is to cache each batch in my own code. However, that would mean managing the memory for the cache myself, and I believe this approach would hurt performance a lot.
Is there a better way? Thanks.
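One approach that avoids caching whole batches, sketched below under some assumptions: if the source can be re-opened and produces rows in a deterministic order (e.g. a query with a stable ORDER BY), you can track how many rows have been copied and, after a transient failure, re-open the reader and skip that many rows. `OpenSourceReader` here is a hypothetical helper standing in for however the reader is created, and `RowsCopied` counts rows sent rather than rows durably committed, so treat this as a sketch to adapt rather than a guaranteed exactly-once scheme.

```csharp
long rowsCommitted = 0;
bool done = false;

while (!done)
{
    try
    {
        // Hypothetical helper: re-opens the source in the same stable order.
        using (IDataReader reader = OpenSourceReader())
        {
            // Skip the rows that earlier attempts already copied.
            for (long i = 0; i < rowsCommitted && reader.Read(); i++) { }

            // UseInternalTransaction commits each batch on its own, which is
            // what makes resuming at a batch boundary meaningful.
            using (var copy = new SqlBulkCopy(connection,
                       SqlBulkCopyOptions.UseInternalTransaction, null))
            {
                copy.DestinationTableName = "MyTableName";
                copy.BatchSize = 10000;
                copy.NotifyAfter = 10000; // same as BatchSize, so notifications line up with batches

                long baseline = rowsCommitted; // rows skipped before this attempt
                copy.SqlRowsCopied += (s, e) => rowsCommitted = baseline + e.RowsCopied;

                copy.WriteToServer(reader);
            }
        }
        done = true;
    }
    catch (SqlException)
    {
        // Transient failure: loop around and resume from rowsCommitted.
        // A real implementation would limit retries and inspect the error number.
    }
}
```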
When attempting to upload ~30,000 users into a DynamoDB table using the Amazon.DynamoDBv2 wrapper for .NET, not all records made it; however, there was no exception either.
var userBatch = _context.CreateBatchWrite<Authentication_User>();
userBatch.AddPutItems(users);
userBatch.ExecuteAsync();
Approximately 2,500 records were written to the table. Has anyone found a limit to the number or size of batch inserts?
From the documentation (emphasis mine):
When using the object persistence model, you can specify any number of operations in a batch.
I exited the process right after ExecuteAsync() returned. That was the problem: the write was still in flight when the process ended. When I let it keep running, I can see the data slowly build up. In the bulk insert I used a Task.Wait() because there was nothing else to do until the records had been loaded. Iridium's answer above assisted me with the second issue, which revolved around ProvisionedThroughput exceptions.
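Given the diagnosis above, the fix is simply to observe the Task that ExecuteAsync() returns before the process exits. A minimal sketch, using the names from the question:

```csharp
var userBatch = _context.CreateBatchWrite<Authentication_User>();
userBatch.AddPutItems(users);

// In an async method, await it so the write completes before moving on:
await userBatch.ExecuteAsync();

// Or, if the caller cannot be async (e.g. a console app shutting down):
// userBatch.ExecuteAsync().Wait();
```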
I wrote an app that runs on multiple client machines and connects to a remote Oracle database (master) to synchronize their local open-source databases (slaves). This works fine so far. But sometimes a local table needs to be fully initialized (dropped, then all rows of the master table inserted). If the master table is big enough (in column count, data type/size, and row count), the app sometimes runs into an OutOfMemoryException.
The app runs on Windows machines with .NET 4.0. The ODP.NET version is 4.122.18.3, and the Oracle database is 12c (12.1.0.2.0).
I don't want to show the data to any user (the app runs in the background); otherwise I could do some paging or filtering. Since not all tables contain keys or can be sorted automatically, it's pretty hard to fetch a table in parts. Initializing the local table should be done in one transaction, without multiple partial commits.
I can break the problem down to a simple code sample that shows managed memory allocation I didn't expect. At this point I'm not sure how to explain or solve the problem.
using (var connection = new OracleConnection(CONNECTION_STRING))
{
    connection.Open();
    using (var command = connection.CreateCommand())
    {
        command.CommandText = STATEMENT;
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                //reader[2].ToString();
                //reader.GetString(2);
                //reader.GetValue(2);
            }
        }
    }
}
When uncommenting any of the three reader.* calls, the memory for the requested column data seems to be pinned internally by ODP.NET (OracleInternal.Network.OraBuf) for each record. For a few thousand records this doesn't seem to be a problem, but when fetching 100k+ records the allocation grows to hundreds of MB, which leads to an OutOfMemoryException. The more data the specified column holds, the faster the OOM happens (mostly with NVARCHAR2).
Calling GC.Collect() manually doesn't help. The GC.Collect() calls shown in the image happen internally (they are not made by me).
Since I don't store the read data anywhere, I would have expected the data not to be cached while iterating the DbDataReader. Can you help me understand what is happening here and how to avoid it?
This seems to be a known bug (Bug 21975120) when reading CLOB column values via ExecuteReader() with the managed driver 12.1.0.2. The workaround is to use the OracleDataReader-specific methods (for example oracleDataReader.GetOracleValue(i)). The resulting OracleClob can be closed explicitly to free the memory allocation.
var item = oracleDataReader.GetOracleValue(columnIndex);
if (item is OracleClob clob) // the pattern match already guarantees clob is non-null
{
    // use clob.Value ...
    clob.Close(); // releases the internally pinned buffers
}
I've written a bit of code to manipulate data for a comprehensive transaction. However, I am running into endless problems and dead ends. If I run my code on a small data set, it works as expected. But now that I've had the production environment restored to the testing database to get a full scope of testing, I've basically wasted my time as best I can tell.
private static void AddProvisionsForEachSupplement2(ISupplementCoordinator supplmentCord)
{
    var time = DateTime.Now;
    using (var scope = new UnitOfWorkScope())
    {
        var supplements = supplmentCord.GetContracts(x => x.EffectiveDate <= new DateTime(2014, 2, 27)).AsEnumerable();
        foreach (var supplement in supplements)
        {
            var specialProvisionTable = supplement.TrackedTables.FirstOrDefault(x => x.Name == "SpecialProvisions");
            SetDefaultSpecialProvisions(specialProvisionTable, supplement);
            Console.Out.WriteLine(supplement.Id.ToString() + ": " + (DateTime.Now - time).TotalSeconds);
        }
    }
}
You can see I decided to time it: the loop takes roughly 300+ seconds to complete, and then the commit that follows is obscenely long, probably longer than the loop itself.
I get this error:
The transaction associated with the current connection has completed but has not been disposed. The transaction must be disposed before the connection can be used to execute SQL statements.
I added [Transaction(Timeout = 6000)] just to get that far; before that, I was getting a transaction timeout.
There is so much more info we would need, so giving you a definitive answer is going to be tough. However, dealing with large data sets in one massive transaction is always going to hurt performance.
The first thing I would do, to be honest, is hook up a SQL profiler (like NHProf, SQL Server Profiler, or even log4net) to see how many queries you are issuing. Then, as you test, you can see whether you can reduce the queries, which will in turn reduce the time.
Then, after working this out, you have a few options:
Use a stateless session to grab the data; this removes change tracking, so you will need to persist changes manually later
Reduce the single big transactions into smaller transactions (this is what I would start with)
See if eager loading a whole lot of data in one hit might be more efficient
See if using batching is going to help you
Don't give up; fight and understand why this is giving you issues. This is the key to becoming a master of NHibernate (rather than a slave to it)
Good luck :)
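As a sketch of the "smaller transactions" option, assuming direct access to the underlying NHibernate ISession (which the question's UnitOfWorkScope presumably wraps), the loop could commit in chunks and clear the session between chunks so change tracking doesn't accumulate every loaded entity. The chunk size of 50 is an arbitrary starting point to tune.

```csharp
// Process the supplements in chunks, each chunk in its own transaction.
const int chunkSize = 50;
foreach (var chunk in supplements
         .Select((s, i) => new { s, i })
         .GroupBy(x => x.i / chunkSize, x => x.s))
{
    using (var tx = session.BeginTransaction())
    {
        foreach (var supplement in chunk)
        {
            var table = supplement.TrackedTables
                .FirstOrDefault(x => x.Name == "SpecialProvisions");
            SetDefaultSpecialProvisions(table, supplement);
        }
        tx.Commit();     // flushes and commits just this chunk
    }
    session.Clear();     // drop tracked entities before the next chunk
}
```

A failure mid-run leaves earlier chunks committed, so this trades the all-or-nothing guarantee for throughput; whether that is acceptable depends on the data.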
I have some code that, at the end of the program's life, uploads the entire contents of six different lists into a database. The problem is, they're parallel lists with about 14,000 items each, and I have to run an INSERT query for each of their items. This takes a long time; is there a faster way to do this? Here's a sample of the relevant code:
public void uploadContent()
{
    var cs = Properties.Settings.Default.Database;
    SqlConnection dataConnection = new SqlConnection(cs);
    dataConnection.Open();
    for (int i = 0; i < urlList.Count; i++)
    {
        SqlCommand dataCommand = new SqlCommand(Properties.Settings.Default.CommandString, dataConnection);
        try
        {
            dataCommand.Parameters.AddWithValue("@user", userList[i]);
            dataCommand.Parameters.AddWithValue("@computer", computerList[i]);
            dataCommand.Parameters.AddWithValue("@date", timestampList[i]);
            dataCommand.Parameters.AddWithValue("@itemName", domainList[i]);
            dataCommand.Parameters.AddWithValue("@itemDetails", urlList[i]);
            dataCommand.Parameters.AddWithValue("@timesUsed", hitsList[i]);
            dataCommand.ExecuteNonQuery();
        }
        catch (Exception e)
        {
            using (StreamWriter sw = File.AppendText("errorLog.log"))
            {
                sw.WriteLine(e);
            }
        }
    }
    dataConnection.Close();
}
Here is the command string the code is pulling from the config file:
CommandString:
INSERT dbo.InternetUsage VALUES (@user, @computer, @date, @itemName, @itemDetails, @timesUsed)
As mentioned in @alerya's answer, doing the following will help (explanation added here):
1) Make Command and parameter creation outside of for loop
Since the same command is used each time, it doesn't make sense to re-create it on every iteration. In addition to allocating a new object (which takes time), the command must also be verified each time it is created (that the table exists, and so on). This introduces a lot of overhead.
2) Put the inserts within a transaction
Putting all of the inserts within a transaction will speed things up because, by default, a command that is not within a transaction is treated as its own transaction. Therefore, every time you insert something, the database server must verify that what it just inserted is durably saved (usually on a hard disk, which is limited by the speed of the disk). When multiple INSERTs are within one transaction, however, that check only needs to be performed once.
The downside to this approach, based on the code you've already shown, is that one bad INSERT will spoil the bunch. Whether or not this is acceptable depends on your specific requirements.
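Putting both points together, a sketch might look like the following (the parameter types and sizes are assumptions; match them to the actual dbo.InternetUsage columns):

```csharp
using (var dataConnection = new SqlConnection(cs))
{
    dataConnection.Open();
    using (var transaction = dataConnection.BeginTransaction())
    using (var dataCommand = new SqlCommand(
               Properties.Settings.Default.CommandString,
               dataConnection, transaction))
    {
        // Create the parameters once...
        dataCommand.Parameters.Add("@user", SqlDbType.NVarChar, 256);
        dataCommand.Parameters.Add("@computer", SqlDbType.NVarChar, 256);
        dataCommand.Parameters.Add("@date", SqlDbType.DateTime);
        dataCommand.Parameters.Add("@itemName", SqlDbType.NVarChar, 256);
        dataCommand.Parameters.Add("@itemDetails", SqlDbType.NVarChar, 2048);
        dataCommand.Parameters.Add("@timesUsed", SqlDbType.Int);

        for (int i = 0; i < urlList.Count; i++)
        {
            // ...and only change their values inside the loop.
            dataCommand.Parameters["@user"].Value = userList[i];
            dataCommand.Parameters["@computer"].Value = computerList[i];
            dataCommand.Parameters["@date"].Value = timestampList[i];
            dataCommand.Parameters["@itemName"].Value = domainList[i];
            dataCommand.Parameters["@itemDetails"].Value = urlList[i];
            dataCommand.Parameters["@timesUsed"].Value = hitsList[i];
            dataCommand.ExecuteNonQuery();
        }
        transaction.Commit(); // one durability check for all 14,000 rows
    }
}
```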
Aside
Another thing you really should be doing (though it won't speed things up in the short term) is using the IDisposable interface properly. This means either calling .Dispose() on all IDisposable objects (SqlConnection, SqlCommand) or, ideally, wrapping them in using blocks:
using (SqlConnection dataConnection = new SqlConnection(cs))
{
    //Code goes here
}
This will prevent resource leaks from these spots, which will become a problem quickly if your loops get large.
Move command and parameter creation outside of for (int i = 0; i < urlList.Count; i++)
Also, create the inserts within a transaction
If possible, create a stored procedure and pass the parameters as a DataTable.
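As a sketch of the DataTable idea (the table type dbo.InternetUsageType and procedure dbo.BulkInsertInternetUsage are hypothetical names that would have to be created in the database first), a table-valued parameter sends all rows in one round trip:

```csharp
// Assumes something like the following exists in the database:
//   CREATE TYPE dbo.InternetUsageType AS TABLE (...matching columns...)
//   CREATE PROCEDURE dbo.BulkInsertInternetUsage
//       @rows dbo.InternetUsageType READONLY
//   AS INSERT dbo.InternetUsage SELECT * FROM @rows
var table = new DataTable();
table.Columns.Add("User", typeof(string));
table.Columns.Add("Computer", typeof(string));
table.Columns.Add("Date", typeof(DateTime));
table.Columns.Add("ItemName", typeof(string));
table.Columns.Add("ItemDetails", typeof(string));
table.Columns.Add("TimesUsed", typeof(int));

for (int i = 0; i < urlList.Count; i++)
    table.Rows.Add(userList[i], computerList[i], timestampList[i],
                   domainList[i], urlList[i], hitsList[i]);

using (var cmd = new SqlCommand("dbo.BulkInsertInternetUsage", dataConnection))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.AddWithValue("@rows", table);
    p.SqlDbType = SqlDbType.Structured;          // table-valued parameter
    p.TypeName = "dbo.InternetUsageType";
    cmd.ExecuteNonQuery();                       // one round trip for all rows
}
```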
Sending INSERT commands one by one to a database really makes the whole process slow because of the round trips to the database server. If you're worried about performance, you should consider using a bulk insert strategy. You could:
Generate a flat file with all your information, in the format that BULK INSERT understands.
Use the BULK INSERT command to import that file to your database (http://msdn.microsoft.com/en-us/library/ms188365(v=sql.90).aspx).
P.S. I guess when you say SQL you mean MS SQL Server.
Why don't you run your uploadContent() method from a separate thread?
That way you don't need to worry about how long the query takes to execute.
I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:
Execute several queries (usually 20-50, but up to 100-200) which all have the form select a,b,... from some_table order by xyz. There's an index which fits that query, so the result should be available more or less without any calculation.
Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.
Aggregate some results by stepping over them. I will only consume the first part of each result, but cannot predict how much I will need. The stopping criterion depends on a threshold inside the algorithm.
My idea was to open several SqlDataReader, but I have two problems with that solution:
You can have only one open SqlDataReader per connection, and inside a CLR method I have only one connection, as far as I understand.
I don't know how to tell SqlDataReader to read data in chunks. I could not find documentation on how SqlDataReader is supposed to behave. As far as I understand, it prepares the whole result set and would load the whole result into memory, even if I consume only a small part of it.
Any hint how to solve that as a CLR method? Or is there a more low level interface to SQL server which is more suitable for my problem?
Update: I should have made two points more explicit:
I'm talking about big data sets, so a query might return a million records, but my algorithm would consume only the first 100-200 of them. As I said before, I don't know the exact number beforehand.
I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.
SqlDataReader does not read the whole data set; you are confusing it with the DataSet class. It reads row by row as the .Read() method is called. If a client does not consume the result set, the server suspends the query execution because it has no room to write the output (the selected rows) into. Execution resumes as the client consumes more rows (i.e., as SqlDataReader.Read is called). There is even a special command-behavior flag, SequentialAccess, that instructs ADO.NET not to pre-load the entire row in memory, which is useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
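A sketch of the SequentialAccess behavior just mentioned (the query, column layout, and destinationStream are assumptions for illustration; GetStream requires .NET 4.5 or later):

```csharp
using (var cmd = new SqlCommand("SELECT Id, Payload FROM dbo.Blobs", connection))
using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    while (reader.Read())
    {
        // With SequentialAccess, columns must be read in ordinal order.
        int id = reader.GetInt32(0);
        using (Stream blob = reader.GetStream(1)) // streams the BLOB instead of buffering the row
        {
            blob.CopyTo(destinationStream); // destinationStream: wherever the bytes should go
        }
    }
}
```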
You can have multiple active result sets (multiple SqlDataReaders) on a single connection when MARS is enabled. However, MARS is incompatible with SQLCLR context connections.
So you can create a CLR streaming TVF to do some of what you need, but only if you have a single SQL query source. Multiple queries would require you to abandon the context connection and instead use a fully fledged connection, i.e., connect back to the same instance in a loopback; this allows MARS and thus consuming multiple result sets. But loopback has its own issues, as it breaks the transaction boundaries you get from the context connection. Specifically, with a loopback connection your TVF won't be able to read changes made by the same transaction that called the TVF, because it is a different transaction on a different connection.
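For completeness, a sketch of the loopback-with-MARS approach (the server, database name, and the queries collection are placeholders, and the transaction caveat above still applies):

```csharp
var builder = new SqlConnectionStringBuilder
{
    DataSource = ".",                       // loopback to the same instance
    InitialCatalog = "MyDatabase",          // hypothetical database name
    IntegratedSecurity = true,
    MultipleActiveResultSets = true         // enables MARS
};

using (var conn = new SqlConnection(builder.ConnectionString))
{
    conn.Open();
    var readers = new List<SqlDataReader>();
    foreach (string query in queries)       // the 20-200 ordered queries
    {
        var cmd = new SqlCommand(query, conn);
        readers.Add(cmd.ExecuteReader());   // all stay open thanks to MARS
    }
    // Step through the readers as the algorithm dictates, calling
    // Read() only as far as each result is actually consumed.
}
```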
SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.
1) write set based logic to update the data without cursors
2) use deterministic user-defined functions with set-based logic (you can do this with the SqlFunction attribute in CLR code). A non-deterministic function, i.e. one whose output is not always the same for the same input, has the effect of internally turning the query into a cursor.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
int value3 = ... ;
return value3;
}
3) use cursors as a last resort. They are a powerful way to execute logic per row on the database but have a performance impact. It appears from this article that CLR can outperform SQL cursors (thanks Martin).
I saw your comment that the complexity of using set-based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems: CTEs, views, partitioning, etc.
Of course, you may well be right in your approach, and I don't know what you are trying to do, but my gut says to leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a stored procedure to run concurrent processing, but don't do this inside the CLR.
To answer your question: with CLR implementations (and IDataReader) you don't really need to page results in chunks, because you are not loading all the data into memory or transporting it over the network at once. IDataReader gives you access to the data stream row by row. By the sounds of it, your algorithm determines how many records need updating, so when that point is reached, simply stop calling Read() and end there.
SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);

SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);

SqlDataReader reader = comm.ExecuteReader();
bool flag = true;
while (reader.Read() && flag)
{
    int value1 = Convert.ToInt32(reader[0]);
    int value2 = Convert.ToInt32(reader[1]);
    // some algorithm
    int newValue = ...;
    // populate the output record (values are set on the record, not the reader)
    record.SetInt32(0, value1);
    record.SetInt32(1, value2);
    record.SetInt32(2, newValue);
    SqlContext.Pipe.SendResultsRow(record);
    // keep going?
    flag = newValue < 100;
}
SqlContext.Pipe.SendResultsEnd();
Cursors are a SQL-only construct. If you wanted to read chunks of data at a time, some sort of paging would be required so that only a certain number of records is returned. If using LINQ,
.Skip(Skip)
.Take(PageSize)
Skip and Take can be used to limit the results returned.
You can simply iterate over the DataReader by doing something like this:
using (IDataReader reader = Command.ExecuteReader())
{
    while (reader.Read())
    {
        //Do something with this record
    }
}
This iterates over the results one at a time, similar to a cursor in SQL Server.
For multiple recordsets at once, try MARS (if SQL Server):
http://msdn.microsoft.com/en-us/library/ms131686.aspx