Reading Oracle CLOB data from ODBC reader is super slow - c#

We're downloading data from an Oracle database using C# and .Net 4.5.
values[] is an object array;
reader is the ODBC reader with an open connection to the Oracle database table with CLOB data.
Here's the relevant code:
if (reader.Read())
{
    // Download and save the values
    for (int x = 0; x < reader.FieldCount; x++)
    {
        // Populate all the values
        values[x] = reader[x]; // this line seems to cause execution to hang
    }

    // blah blah blah
}
The C# code seems to hang on the line values[x] = reader[x];.
We're assigning every column in the row read to a special object array because we need to do separate stuff on that data later, and not have to worry about the data type at the moment.
The problem is that when the table has an Oracle CLOB column holding a large value (> 28,000 characters), that line never seems to complete.
If we eliminate the CLOB column from what the odbc reader reads, everything works perfectly.
Questions:
Why would this be? Shouldn't the array assignment be relatively quick?
What are some possible workarounds so we can keep the CLOB columns in the data downloaded? We
need to keep the ODBC reader as a generic ODBC reader (and not make it Oracle-specific).
The application is compiled and must remain 32-bit.
Thanks!

It's the CLOB type which is the problem!
Cast it to VARCHAR and the performance is OK:
change
select clob from table
to two selects:
select DBMS_LOB.SUBSTR(clob, 4000, 1) from table where length(clob) <= 4000;
select clob from table where length(clob) > 4000;
It's easy to verify: performance is slow for small CLOB values as well, but casting them to VARCHAR solves the problem!
The problem is that every single CLOB column fetch issues two extra network round trips from the client to the server!
Normally you'd have one network round trip for several rows.
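A rough sketch of what that split looks like from the C# side (MY_TABLE and CLOB_COL are placeholder names, and connection is assumed to be an already open OdbcConnection):
string shortSql = "select DBMS_LOB.SUBSTR(CLOB_COL, 4000, 1) from MY_TABLE where length(CLOB_COL) <= 4000";
string longSql = "select CLOB_COL from MY_TABLE where length(CLOB_COL) > 4000";

foreach (string sql in new[] { shortSql, longSql })
{
    using (var cmd = new OdbcCommand(sql, connection))
    using (var rdr = cmd.ExecuteReader())
    {
        while (rdr.Read())
        {
            // Short values come back as VARCHAR2 (cheap); only the genuinely long
            // rows in the second query still pay the per-row CLOB round trips.
            string value = rdr.IsDBNull(0) ? null : rdr.GetString(0);
            // ... process value ...
        }
    }
}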

Unless there is a really compelling reason to use ODBC, you are far better off using ODP.net or managed ODP.net. While I can't say with 100% certainty that ODBC is your issue, I can tell you I have used .NET with LOBs numerous times without issue, using ODP.net. I can also tell you that the Microsoft driver for Oracle (deprecated) caused countless mystery issues, most of them around performance. For example, query 1 works fine, but replacing a literal with a bind variable results in a 15 second performance delay on the execute. Using ODP.net, the delay goes away. My guess is that ODBC may have similar mystery issues.
ODP.net (or DevArt dotConnect) are the only two tools I know of that enable .NET to leverage the OCI, which has great features like bulk inserts and updates. It's possible that OCI has a better way of dealing with large LOBs. Is there a compelling reason you have to use ODBC?
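For what it's worth, here is a minimal ODP.NET sketch for reading a CLOB column (connStr, MY_TABLE, and CLOB_COL are placeholders; InitialLOBFetchSize = -1 asks the provider to fetch whole LOBs inline with each row):
using (var conn = new OracleConnection(connStr))
using (var cmd = new OracleCommand("select ID, CLOB_COL from MY_TABLE", conn))
{
    conn.Open();
    cmd.InitialLOBFetchSize = -1; // fetch entire LOB values with the row instead of deferring them
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            string doc = reader.IsDBNull(1) ? null : reader.GetString(1);
            // ... process doc ...
        }
    }
}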

Add LONGDATACOMPAT=1; to your connection string:
string conn = @"DSN=database;UID=user;PWD=password;LONGDATACOMPAT=1;";
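With that string in place, a minimal sketch of the generic ODBC read (the table name is a placeholder):
using (var connection = new OdbcConnection(conn))
using (var command = new OdbcCommand("select * from MY_TABLE", connection))
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        var values = new object[reader.FieldCount];
        while (reader.Read())
        {
            reader.GetValues(values); // same type-agnostic fetch as in the question
            // ... do the separate processing on values later ...
        }
    }
}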

Related

Huge managed memory allocation when reading (iterating) data with DbDataReader

I wrote an app that runs on multiple client machines and connects to a remote Oracle database (master) to synchronize their local open source database (slave). This works fine so far. But sometimes a local table needs to be fully initialized (dropped and afterwards all rows of the master table inserted). If the master table is big enough (column count, data type/size, and a certain row count), the app sometimes runs into an OutOfMemoryException.
The app is running on windows machines with .NET 4.0. Version of ODP.NET is 4.122.18.3. Oracle database is 12c (12.1.0.2.0).
I don't want to show the data to any user (the app is running in the background), otherwise I could do some paging or filtering. Since not all tables contain keys or can be sorted automatically, it's pretty hard to fetch the table in parts. The initialization of the local table should be done in one transaction without multiple partial commits.
I can break the problem down to a simple code sample showing the managed memory allocation which I didn't expect. At this point I'm not sure how to explain or solve the problem.
using (var connection = new OracleConnection(CONNECTION_STRING))
{
    connection.Open();
    using (var command = connection.CreateCommand())
    {
        command.CommandText = STATEMENT;
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                //reader[2].ToString();
                //reader.GetString(2);
                //reader.GetValue(2);
            }
        }
    }
}
When uncommenting any of the three reader.* calls the memory of the requested column data seems to be pinned internally by ODP.NET (OracleInternal.Network.OraBuf) for each record. For requesting a few thousand records this doesn't seem to be a problem. But when fetching 100k+ records the memory allocation gets to hundreds of MB, which leads to an OutOfMemoryException. The more data the specified column has the faster the OOM happens (mostly NVARCHAR2).
Additionally calling GC.Collect() manually doesn't do anything; the garbage collections that do occur are triggered internally by the runtime, not by my code.
Since I don't store the read data anywhere, I would have expected the data not to be cached while iterating the DbDataReader. Can you help me understand what is happening here and how to avoid it?
This seems to be a known bug (Bug 21975120) when reading CLOB column values using the ExecuteReader() method with the managed driver 12.1.0.2. The workaround is to use the OracleDataReader-specific methods (for example oracleDataReader.GetOracleValue(i)). The returned OracleClob value can then be closed explicitly to free the memory allocation.
var item = oracleDataReader.GetOracleValue(columnIndex);
if (item is OracleClob clob) // the pattern match already guarantees clob != null
{
    // use clob.Value ...
    clob.Close();
}
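Plugged into the read loop from the question, the workaround looks roughly like this (a sketch; column index 2 matches the reader[2] calls commented out in the question):
using (var reader = command.ExecuteReader())
{
    while (reader.Read())
    {
        // GetOracleValue returns the provider-specific type, so the CLOB buffer
        // can be released explicitly instead of staying pinned by the driver.
        var item = reader.GetOracleValue(2);
        if (item is OracleClob clob)
        {
            string text = clob.Value;
            clob.Close(); // frees the internal OraBuf allocation
            // ... use text ...
        }
        // non-LOB values in item can be used directly
    }
}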

Querying Intersystem Caché through ODBC

I'm querying Caché for a list of tables in two schemas and looping through those tables to obtain a count on the tables. However, this is incredibly slow. For instance, 13 million records took 8 hours to return results. When I query an Oracle database with 13 million records (on the same network), it takes 1.1 seconds to return results.
I'm using a BackgroundWorker to carry out the work apart from the UI (Windows Form).
Here's the code I'm using with the Caché ODBC driver:
using (OdbcConnection odbcCon = new OdbcConnection(strConnection))
{
    try
    {
        odbcCon.Open();
        OdbcCommand odbcCmd = new OdbcCommand();
        odbcCmd.Connection = odbcCon;
        foreach (var item in lstSchema)
        {
            odbcCmd.CommandText = "SELECT COUNT(*) FROM " + item;
            AppendTextBox(item + " Count = " + Convert.ToInt32(odbcCmd.ExecuteScalar()) + "\r\n");
            int intPercentComplete = (int)((float)(lstSchema.IndexOf(item) + 1) / (float)intTotalTables * 100);
            worker.ReportProgress(intPercentComplete);
            ModifyLabel(" (" + (lstSchema.IndexOf(item) + 1) + " out of " + intTotalTables + " processed)");
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.ToString());
        return;
    }
}
Is the driver the issue?
Thanks.
I suppose the devil is in the details. Your code does
SELECT COUNT(*) FROM Table
If the table has no indices then I wouldn't be surprised that it is slower than you expect. If the table has indices, especially bitmap indices, I would expect this to be on par with Oracle.
The other thing to consider is to understand how Caché is configured, i.e. what the global buffers are and what the disk performance looks like.
Intersystems Caché is slower for querying than any SQL database I have used, especially when you deal with large databases. Now add ODBC overhead to the picture and you will get even worse performance.
Some level of performance can be achieved through the use of bitmap indexes, but often the only way to get good performance is to create more data: for example, every time you add new data, force the database to increment a number somewhere for your count (or even multiple entries for the purpose of grouping). Then you can have performance at a reasonable level.
You might even find that you can allocate more memory for the database (but that never seemed to do much for me).
I wrote a little Intersystems performance test post on my blog...
http://tesmond.blogspot.co.uk/2013/09/intersystems-cache-performance-woe-is-me.html
Cache has a built in (smart) function that determines how to best execute queries. Of course having indexes, especially bitmapped, will drastically help query times. Though, a mere 13 million rows should take seconds tops. How much data is in each row? We have 260 million rows in many tables and 790 million rows in others. We can mow through the whole thing in a couple of minutes. A non-indexed, complex query may take a day, though that is understandable. Take a look at what's locking your globals. We have also discovered that apparently queries run even if the client is disconnected. You can kill the task with the management portal, but the system doesn't seem to like doing more than one ODBC query at once with larger queries because it takes gigs of temp data to do such a query. We use DBVisualizer for a JDBC connection.
Someone mentioned TuneTable; that's great to run if your table changes a lot, or at least a couple of times in the table's life. This is NOT something that you want to overuse. http://docs.intersystems.com/ens20151/csp/docbook/DocBook.UI.Page.cls?KEY=GSQL_optimizing is where you can find some documentation and other useful information about this and about improving performance. If it's not fast then someone broke it.
Someone also mentioned that SELECT COUNT(*) will count an index instead of the table itself using computed properties. This is related to the decision engine that compiles your SQL queries and decides what the most efficient method is to get your data. There is a tool in the portal that will show you how long a query takes and the other methods that the smart interpreter [I forget what it's called] considered. You can see the Query Plan on the same page where you can execute SQL in the browser, mentioned below: /csp/sys/exp/UtilSqlQueryShowPlan.csp
RE: I can't run this query from within the Management Portal because the tables are only made available from within an application and/or ODBC.
That isn't actually true. Within the management portal, go to System Explorer, SQL, then Execute SQL Statements. Please note that you must have adequate privileges to see this; %ALL will allow access to everything. Also, you can run SQL queries natively in TERMINAL by executing do $system.SQL.Shell() and then typing your queries. This interface should be faster than ODBC, as I think it uses object access. Also, keep in mind that embedded SQL and object access of data are the fastest ways to access data.
Please let me know if you have any more questions!

What is the best way to create XML for Stored Procedure Output

I have a stored procedure that returns a few thousand records. I need to apply an XSL transform to the output of this SP. What would be the best way to do that?
1. Read the data into a DataSet, use XmlDataDocument, and apply the XSL
2. Output XML from the SP and apply the XSL to that
For 2, I am worried that the XML size will be too big and the C# code reading it will time out. Please suggest.
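For reference, option 1 boils down to something like this sketch (the connection string, procedure name, and stylesheet path are placeholders):
// Fill a DataSet from the stored procedure, wrap it in an XmlDataDocument,
// and apply the XSL transform on the client side.
var ds = new DataSet("root");
using (var conn = new SqlConnection(connectionString))
using (var da = new SqlDataAdapter("exec dbo.MyProc", conn))
{
    da.Fill(ds, "row"); // Fill opens and closes the connection itself
}

var xmlDoc = new XmlDataDocument(ds);
var xslt = new XslCompiledTransform();
xslt.Load("MyXSL.xsl");
using (var writer = XmlWriter.Create("output.xml"))
{
    xslt.Transform(xmlDoc, writer);
}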
You can use the SqlXmlCommand type from the namespace Microsoft.Data.SqlXml.
From msdn here:
SqlXmlCommand cmd = new SqlXmlCommand(ConnString);
cmd.CommandText = "SELECT TOP 20 FirstName, LastName FROM Person.Contact FOR XML AUTO";
cmd.XslPath = "MyXSL.xsl";
cmd.RootTag = "root";
// ... then execute, e.g. with cmd.ExecuteStream()
If you're worried about timing out, just set a larger timeout on your connection string.
We use a similar approach on our platform, but we do not apply the XSL directly in the stored procedure, as we use MS SQL Server Express and it does not allow applying XSLs. At least that is what I found when looking into this particular subject; a higher edition of MS SQL Server is needed for such a transformation.
The stored procedure that outputs the XML is big and filthy (but runs smoothly) and we have no issues reading back the results. A couple of seconds per request at most, usually returning a couple of thousand rows and joining several tables. We use VS2010 and VB.NET with Telerik components.

"cursor like" reading inside a CLR procedure/function

I have to implement an algorithm on data which is (for good reasons) stored inside SQL server. The algorithm does not fit SQL very well, so I would like to implement it as a CLR function or procedure. Here's what I want to do:
Execute several queries (usually 20-50, but up to 100-200) which all have the form select a,b,... from some_table order by xyz. There's an index which fits that query, so the result should be available more or less without any calculation.
Consume the results step by step. The exact stepping depends on the results, so it's not exactly predictable.
Aggregate some result by stepping over the results. I will only consume the first part of each result, but cannot predict how much I will need. The stopping criterion depends on some threshold inside the algorithm.
My idea was to open several SqlDataReader, but I have two problems with that solution:
You can have only one SqlDataReader per connection and inside a CLR method I have only one connection - as far as I understand.
I don't know how to tell SqlDataReader to read data in chunks. I could not find documentation on how SqlDataReader is supposed to behave. As far as I understand, it prepares the whole result set and would load the whole result into memory, even if I consume only a small part of it.
Any hint how to solve that as a CLR method? Or is there a more low level interface to SQL server which is more suitable for my problem?
Update: I should have made two points more explicit:
I'm talking about big data sets, so a query might result in 1 million records, but my algorithm would consume only the first 100-200 of them. But as I said before: I don't know the exact number beforehand.
I'm aware that SQL might not be the best choice for that kind of algorithm. But due to other constraints it has to be a SQL server. So I'm looking for the best possible solution.
SqlDataReader does not read the whole data set; you are confusing it with the DataSet class. It reads row by row, as the .Read() method is being called. If a client does not consume the result set, the server will suspend the query execution because it has no room to write the output (the selected rows) into. Execution will resume as the client consumes more rows (i.e. as SqlDataReader.Read is called). There is even a special command behavior flag, SequentialAccess, that instructs ADO.Net not to pre-load the entire row in memory, which is useful for accessing large BLOB columns in a streaming fashion (see Download and Upload images from SQL Server via ASP.Net MVC for a practical example).
You can have multiple active result sets (SqlDataReader) active on a single connection when MARS is active. However, MARS is incompatible with SQLCLR context connections.
So you can create a CLR streaming TVF to do some of what you need in CLR, but only if you have one single SQL query source. Multiple queries would require you to abandon the context connection and instead use a fully fledged connection, i.e. connect back to the same instance in a loopback; this would allow MARS and thus let you consume multiple result sets. But loopback has its own issues, as it breaks the transaction boundaries you get from the context connection. Specifically, with a loopback connection your TVF won't be able to read changes made by the same transaction that called the TVF, because it is a different transaction on a different connection.
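If you do go the loopback route, the shape is roughly the following sketch (the connection string and query list are assumptions, and the assembly would need EXTERNAL_ACCESS rather than the SAFE permission that suffices for the context connection):
// Inside the CLR procedure: a loopback connection instead of "context connection=true".
// MARS lets several SqlDataReaders stay open on this one connection, but it runs in
// its own transaction, separate from the caller's.
string loopback = "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;" +
                  "MultipleActiveResultSets=True;Enlist=false";
using (var conn = new SqlConnection(loopback))
{
    conn.Open();
    var readers = new List<SqlDataReader>();
    foreach (string sql in orderedSelects) // the 20-50 "select a,b,... order by xyz" queries
    {
        readers.Add(new SqlCommand(sql, conn).ExecuteReader());
    }
    // Step through the readers in whatever order the algorithm dictates,
    // calling Read() lazily and stopping once the threshold is hit.
}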
SQL is designed to work against huge data sets, and is extremely powerful. With set based logic it's often unnecessary to iterate over the data to perform operations, and there are a number of built-in ways to do this within SQL itself.
1) write set based logic to update the data without cursors
2) use deterministic User Defined Functions with set based logic (you can do this with the SqlFunction attribute in CLR code). Non-deterministic functions have the effect of turning the query into a cursor internally; non-deterministic means the output value is not always the same given the same input.
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static int algorithm(int value1, int value2)
{
int value3 = ... ;
return value3;
}
3) use cursors as a last resort. This is a powerful way to execute logic per row on the database but has a performance impact. It appears from this article that CLR can outperform SQL cursors (thanks Martin).
I saw your comment that the complexity of using set based logic was too much. Can you provide an example? There are many SQL ways to solve complex problems - CTE, Views, partitioning etc.
Of course you may well be right in your approach, and I don't know what you are trying to do, but my gut says leverage the tools of SQL. Spawning multiple readers isn't the right way to approach the database implementation. It may well be that you need multiple threads calling into a SP to run concurrent processing, but don't do this inside the CLR.
To answer your question: with CLR implementations (and IDataReader) you don't really need to page results in chunks, because you are not loading the data into memory or transporting it over the network. IDataReader gives you access to the data stream row by row. By the sounds of it, your algorithm determines the number of records that need updating, so when that happens simply stop calling Read() and end at that point.
SqlMetaData[] columns = new SqlMetaData[3];
columns[0] = new SqlMetaData("Value1", SqlDbType.Int);
columns[1] = new SqlMetaData("Value2", SqlDbType.Int);
columns[2] = new SqlMetaData("Value3", SqlDbType.Int);

SqlDataRecord record = new SqlDataRecord(columns);
SqlContext.Pipe.SendResultsStart(record);

SqlDataReader reader = comm.ExecuteReader();
bool flag = true;
while (reader.Read() && flag)
{
    int value1 = Convert.ToInt32(reader[0]);
    int value2 = Convert.ToInt32(reader[1]);

    // some algorithm
    int newValue = ...;

    // the output row is written to the record, not to the reader
    record.SetInt32(0, value1);
    record.SetInt32(1, value2);
    record.SetInt32(2, newValue);
    SqlContext.Pipe.SendResultsRow(record);

    // keep going?
    flag = newValue < 100;
}
SqlContext.Pipe.SendResultsEnd();
Cursors are a SQL-only feature. If you wanted to read chunks of data at a time, some sort of paging would be required so that only a certain number of records is returned. If using LINQ,
.Skip(Skip)
.Take(PageSize)
Skip and Take can be used to limit the results returned.
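A small sketch of that paging pattern (assuming a LINQ provider such as LINQ to SQL or EF, with a hypothetical SomeTable set and Id key):
int pageSize = 200;
for (int page = 0; ; page++)
{
    var chunk = context.SomeTable
                       .OrderBy(r => r.Id) // a stable ORDER BY is needed for Skip/Take
                       .Skip(page * pageSize)
                       .Take(pageSize)
                       .ToList();
    if (chunk.Count == 0) break;
    // ... process chunk ...
}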
You can simply iterate over the DataReader by doing something like this:
using (IDataReader reader = Command.ExecuteReader())
{
while (reader.Read())
{
//Do something with this record
}
}
This iterates over the results one at a time, similar to a cursor in SQL Server.
For multiple recordsets at once, try MARS
(if SQL Server)
http://msdn.microsoft.com/en-us/library/ms131686.aspx

Execute multiple SQL commands in one round trip

I am building an application and I want to batch multiple queries into a single round-trip to the database. For example, lets say a single page needs to display a list of users, a list of groups and a list of permissions.
So I have stored procs (or just simple sql commands like "select * from Users"), and I want to execute three of them. However, to populate this one page I have to make 3 round trips.
Now I could write a single stored proc ("getUsersTeamsAndPermissions") or execute a single SQL command "select * from Users;exec getTeams;select * from Permissions".
But I was wondering if there was a better way to specify to do 3 operations in a single round trip. Benefits include being easier to unit test, and allowing the database engine to parallelize the queries.
I'm using C# 3.5 and SQL Server 2008.
Something like this (a cleaned-up version that properly disposes its objects):
using (var connection = new SqlConnection(ConnectionString))
using (var command = connection.CreateCommand())
{
    connection.Open();
    command.CommandText = "select id from test1; select id from test2";
    using (var reader = command.ExecuteReader())
    {
        do
        {
            while (reader.Read())
            {
                Console.WriteLine(reader.GetInt32(0));
            }
            Console.WriteLine("--next command--");
        } while (reader.NextResult());
    }
}
The single multi-part command and the stored procedure options that you mention are the two options. You can't do them in such a way that they are "parallelized" on the db. However, both of those options do result in a single round trip, so you're good there. There's no way to send them more efficiently. In SQL Server 2005 onwards, a multi-part command that is fully parameterized is very efficient.
Edit: adding information on why cram into a single call.
Although you don't want to care too much about reducing calls, there can be legitimate reasons for this.
I once was limited to a crummy ODBC driver against a mainframe, and there was a 1.2 second overhead on each call! I'm serious. There were times when I crammed a little extra into my db calls. Not pretty.
You also might find yourself in a situation where you have to configure your sql queries somewhere, and you can't just make 3 calls: it has to be one. It shouldn't be that way, bad design, but it is. You do what you gotta do!
Sometimes of course it can be very good to encapsulate multiple steps in a stored procedure. Usually not for saving round trips though, but for tighter transactions, getting ID for new records, constraining for permissions, providing encapsulation, blah blah blah.
Making one round trip vs. three will be more efficient indeed. The question is whether it is worth the trouble. The entire ADO.NET and C# 3.5 toolset and framework opposes what you're trying to do. TableAdapters, Linq2SQL, EF, all of these like to deal with simple one-call == one-result-set semantics. So you may lose some serious productivity by trying to beat the Framework into submission.
I would say that unless you have some serious measurements showing that you need to reduce the number of roundtrips, abstain. If you do end up requiring this, then use a stored procedure to at least give an API kind of semantics.
But if your query really is what you posted (i.e. select all users, all teams and all permissions) then you obviously have much bigger fish to fry before reducing the round trips... reduce the result sets first.
I think this link might be helpful.
Consider at least reusing the same connection; according to what it says there, opening a connection is almost the top leader in performance cost in Entity Framework.
Firstly, 3 round trips isn't really a big deal. If you were talking about 300 round trips then that would be another matter, but for just 3 round trips I would consider this to definitely be a case of premature optimisation.
That said, the way I'd do this would probably be to execute the 3 stored procedures using SQL:
exec dbo.p_myproc_1 @param_1 = @in_param_1, @param_2 = @in_param_2
exec dbo.p_myproc_2
exec dbo.p_myproc_3
You can then iterate through the returned result sets as you would if you had directly executed multiple rowsets.
Build a temp table? Insert all the results into the temp table and then select * from #temptable,
as in:
select field into #temptable from mytable
insert into #temptable select field2 from mytable2
select * from #temptable
etc... Only one trip to the database, though I'm not sure it is actually more efficient.
