I currently have a C# process that is saving millions of records to Oracle, currently all in a single thread and within a transaction. I am interested in doing some parallel processing on this where I can split the data across threads. Will an ADO.NET/Oracle transaction work properly across the threads? Do I just create the transaction on the main thread, or do I need to also create a sub-transaction for each thread?
Has anyone had experience with this yielding performance improvements, or is the bottleneck Oracle itself?
If your code is, essentially:
for each record
add record to database
Then it's unlikely that adding multiple threads is going to be of much help. You might be able to get a performance increase with two threads, with one gathering and transmitting a record while the other's record is being inserted. But it's unlikely that the overlap would be huge.
You're much better off doing something like:
while not end of records
add 1,000 records to block
call stored proc to insert 1,000 records
That should speed things up quite a bit because you reduce the amount of back-and-forth between client and server.
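The blocking step above can be sketched like this (a minimal sketch: `Blocks` is a hypothetical helper, and the actual insert call, e.g. a stored procedure invoked once per block, is left out):

```csharp
using System.Collections.Generic;

static class Batcher
{
    // Split the record stream into blocks of `blockSize` so each
    // round-trip to the server carries many records instead of one.
    public static IEnumerable<List<T>> Blocks<T>(IEnumerable<T> records, int blockSize)
    {
        var block = new List<T>(blockSize);
        foreach (var record in records)
        {
            block.Add(record);
            if (block.Count == blockSize)
            {
                yield return block;
                block = new List<T>(blockSize);
            }
        }
        if (block.Count > 0)
            yield return block;   // final partial block
    }
}
```

Each block would then be handed to the stored procedure in a single call, for instance via ODP.NET array binding, so the per-record client/server chatter disappears.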
The way to speed it up beyond that probably isn't to create multiple threads that run the loop, but rather to issue an asynchronous call so that the database can be doing the inserts while you're creating the next block of records. Something like this:
while not end of records
add 1,000 records to block
wait for pending asynchronous call to complete
issue asynchronous call to insert 1,000 records
There are many different ways to issue that asynchronous call. I would recommend using Tasks.
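A minimal sketch of that overlap using Tasks; `insertBlock` is a hypothetical stand-in for the real asynchronous database call:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class OverlappedInsert
{
    // Overlap block preparation with the insert of the previous block:
    // while the database works on block N, the client builds block N+1.
    public static async Task Run(IEnumerator<int> records, int blockSize,
                                 Func<List<int>, Task> insertBlock)
    {
        Task pending = Task.CompletedTask;
        while (true)
        {
            var block = new List<int>(blockSize);
            while (block.Count < blockSize && records.MoveNext())
                block.Add(records.Current);
            await pending;                  // wait for the previous insert to finish
            if (block.Count == 0) break;    // no more records; last insert already awaited
            pending = insertBlock(block);   // start the next insert asynchronously
        }
    }
}
```

The loop never has more than one insert in flight, which keeps ordering simple while still hiding the block-building time behind the database call.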
Edit
It occurs to me that you might have a problem trying to keep a transaction alive across asynchronous calls. If so, then you do the database insert on the main thread, and have the asynchronous task fill the buffer. It looks like this:
start transaction
buffer = fill_buffer(); // this is synchronous
while buffer.count > 0
{
task = start asynchronous task to fill the next buffer
call database to insert records from buffer
buffer = task.result // waits for task to complete
}
end transaction
This technique ensures that all database calls for the transaction occur on the main thread.
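The pseudocode above can be sketched in C# as follows, with `fillBuffer` and `insert` as hypothetical stand-ins for the real record source and database call (the transaction boundaries are indicated by comments only):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class MainThreadInserter
{
    // All database work happens on the calling thread while a
    // background task fills the next buffer.
    public static void Run(Func<List<int>> fillBuffer, Action<List<int>> insert)
    {
        // start transaction here in real code
        var buffer = fillBuffer();                   // synchronous first fill
        while (buffer.Count > 0)
        {
            var task = Task.Run(fillBuffer);         // fill the next buffer in the background
            insert(buffer);                          // DB call stays on this thread
            buffer = task.Result;                    // wait for the next buffer
        }
        // commit transaction here in real code
    }
}
```

Because `task.Result` synchronizes before the next iteration, only one background fill is ever running, and every `insert` call happens on the thread that owns the transaction.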
My suggestion would be, if you can (and your workplace allows it), to write this as a PL/SQL procedure using bulk inserts instead of relying on a middleware app. The improvement will be huge as long as it is coded well.
If you have to use middleware (.NET), I recommend you use ODP.NET, since Microsoft's ADO.NET provider for Oracle is deprecated (if I am not mistaken). In addition, ODP.NET will give you a performance boost because it takes advantage of Oracle 11g features and improvements.
As for the middleware, I have never done any parallel threading against Oracle, but I suspect you will have transaction issues (since you are inserting, and given the way relational databases work). I know it is possible, but for the extra effort, it is just better to move the processing to the database and let Oracle do its magic.
It is a .NET application which works with an external device. When some entity (corresponding to a row in a table) wants to communicate with the device, the corresponding row in the SQL Server table should be locked until the device returns a result or SQL Server times out.
I need to:
lock a specific row in a table so that the row can be read, but cannot be deleted or updated
the locking mechanism should run on a separate thread so that the application's main thread works as usual
the lock should be released when a response is received
the lock should be released after a while if no response is received
What is the best practice?
Is there any standardized way to accomplish this?
Should I:
run a new thread (task) in my C# code, begin a serializable transaction, select the desired row within that transaction, and wait until either the time is up or a cancellation token is signaled?
or use some combination of sp_getapplock and ...etc?
You cannot operate on locks across transactions or sessions. That approach is not feasible.
You need to run one transaction and keep it open for the duration that you want the lock to persist.
The kind of parallelism technology you use is immaterial. An async method with async ADO.NET IO would be suitable. So would a separate LongRunning task.
You probably need to pass a CancellationToken to the transaction code that when signaled makes the transaction shut down. That way you can implement arbitrary shutdown conditions without cluttering the transaction code.
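A sketch of that shape, with the transaction and row lock simulated; in real code the body would open a transaction and take the lock (for example via a SELECT with locking hints), which is only indicated in comments here:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class RowLockHolder
{
    // Hold a (simulated) row lock until the token is signaled or the
    // timeout elapses, whichever comes first. Returns true if the lock
    // was released early because a response arrived.
    public static async Task<bool> HoldAsync(TimeSpan timeout, CancellationToken token)
    {
        // begin transaction and take the row lock here in real code
        try
        {
            await Task.Delay(timeout, token);  // lock held for up to `timeout`
            return false;                      // timeout elapsed, no response
        }
        catch (OperationCanceledException)
        {
            return true;                       // token signaled: release early
        }
        // in real code a finally block would commit or roll back here
    }
}
```

The caller signals the token when the device responds, which is exactly the "arbitrary shutdown conditions without cluttering the transaction code" idea above.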
Here are a few points that you should consider:
sp_getapplock is not row-based, so I would assume it's not something you can use
"application's main thread works as usual." -- but if you're locking rows, any update / delete operation will get stuck, so is that working as usual?
Once the locking ends, is it OK for all the updates that were blocked to run right after that?
Is your blocker thread going to do updates too?
If the application and the external device are doing updates, how to be sure they are handled in correct order / way?
I would say you need to design your application to work properly in this situation, not just try to add this kind of feature as an add-on.
The title mentions releasing the lock in another transaction, but that's not really explained in the question.
My server application that is written in C# starts a new thread every time it needs to insert or remove data from the database. The problem is that since the execution of the threads is arbitrary it is not ensured that a delete command is executed after the insertion of the same object if those events occur almost at the same time.
E.g.: The server receives the command to insert multiple objects. It takes about 5 seconds to insert all the objects. After 1 second of execution the server receives the command to delete all those objects from the database again. Since the removal could happen before all objects are completely stored, the outcome is unknown.
How can the order of execution of certain threads be managed?
You can use transactions for this and specify different levels for different operations.
For example, you can use the highest isolation level for writes/updates/deletes but a low level for reads. You can also fine-tune this to block only specific rows rather than whole tables. The specific terminology depends on the database and data access library you use.
I would advise against relying on any ordering. Parallel and ordered just don't go well together. For example:
You need to horizontally scale servers: once you add a second server and a load balancer, a mutex-based solution will not work
In large, distributed systems a message queue won't work either: by the time one thread has completed a scan and decided that we are good to go, another thread can write a message that should have prevented the operation. Moreover, under high load, scanning the same queue multiple times is inefficient.
If you know that you receive insert before delete and the problem is just that you don't want to interrupt your insertion then you can just use lock on your insertion code.
static readonly object m_Lock = new object();

public void Insert()
{
    lock (m_Lock)
    {
        InsertRecords();
    }
}

public void Remove()
{
    lock (m_Lock)
    {
        RemoveRecords();
    }
}
This way you are sure that remove won't happen during insert.
P.S. It seems strange, though, that you need to insert and then delete right away.
I think the simplest way is to queue all incoming requests to insert objects in one collection, and to queue all incoming requests to delete objects in a second collection.
The server should have a basic loop that does:
a. check if there are incoming inserts; if so, perform all inserts.
b. check if there are incoming delete requests; if so, perform all delete requests.
c. sleep for X milliseconds.
Now, if you have a delete request on an object that does not exist, you have two options:
a. ignore this request and discard it.
b. ignore this request for this round and keep it in the collection for the next N rounds before finally deleting it (assuming it is simply a bad request and not a race condition).
Use a Queue (with a single servicing thread) to enforce the ordering. You can also use Task Parallel Library to manage tasks with dependencies on other tasks, though that's very difficult with arbitrary DB operations.
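A sketch of such a queue with a single servicing thread, using `BlockingCollection`; the enqueued `Action`s are stand-ins for the real database operations:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class DbWorkQueue
{
    // A single servicing thread drains the queue, so operations run
    // strictly in the order they were enqueued, regardless of which
    // thread produced them.
    private static readonly BlockingCollection<Action> queue = new BlockingCollection<Action>();
    private static readonly Task worker = Task.Run(() =>
    {
        foreach (var operation in queue.GetConsumingEnumerable())
            operation();                  // one at a time, FIFO
    });

    public static void Enqueue(Action operation) => queue.Add(operation);

    public static void Shutdown()
    {
        queue.CompleteAdding();           // no more work accepted
        worker.Wait();                    // drain what remains
    }
}
```

Producers from any thread call `Enqueue`; because only one thread ever touches the database, an insert enqueued before a delete is guaranteed to run first.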
I think you need to rethink how you manage the incoming operations, and whether or not their inter-dependencies are predictable enough that you can safely use multiple threads in this way. You may need to add some "depends on" information into incoming operations to achieve that goal.
I need to process data from a large number of files, each with thousands of rows. Earlier I was reading each file row by row and processing it. It took a lot of time to process all the files as their number increased. Then someone said that threads can be used to perform the task in less time. Can threading make this process faster? I'm using the C# language.
It certainly can, although it depends on the particular job in question. A very common pattern is to have one thread doing the file IO and multiple threads processing the actual lines.
How many processing threads to start will depend on how many processors/cores you have on your system, and how the results of the processing get written out. If the processing time per line is very small, however, you probably won't get much speed improvement from multiple processing threads, and a single processing thread would be optimal.
A good approach to performance questions is to assume that your code is doing something unnecessary and try to find what it is: measure, review, draw, whatever works for you. I'm not saying that the code you have is slow; it's just a way to look at it.
If you add multithreading to the mix first, you may find the code much harder to analyze.
More concretely for your task: combining multiple similar operations (like reading a record from a file, or committing to the DB) can save a significant amount of time (you need to prototype and measure).
I would recommend you do batch insert to your database.
You can have one thread that reads lines into a concurrent queue while another thread pulls the data from the queue, aggregating it if necessary or performing whatever operation you need on it, and then batch-inserts the data into the database. It will save you quite a bit of time.
Inserting one line at a time into the DB would be very slow; you have to do batch inserts.
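A sketch of that producer/consumer batching, with the file reading and the bulk insert simulated (`batchInsert` is a hypothetical stand-in for the real batch insert):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchingPipeline
{
    // One producer pushes lines into a bounded queue while the consumer
    // drains them into batches of `batchSize` and hands each batch to
    // `batchInsert`, i.e. one round-trip per batch, not per line.
    public static void Run(IEnumerable<string> lines, int batchSize,
                           Action<List<string>> batchInsert)
    {
        var queue = new BlockingCollection<string>(boundedCapacity: 10000);

        var producer = Task.Run(() =>
        {
            foreach (var line in lines)   // simulated file reading
                queue.Add(line);
            queue.CompleteAdding();
        });

        var batch = new List<string>(batchSize);
        foreach (var line in queue.GetConsumingEnumerable())
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                batchInsert(batch);
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0)
            batchInsert(batch);           // final partial batch

        producer.Wait();
    }
}
```

The bounded capacity keeps the reader from racing far ahead of the database when inserts are the slower side.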
Yes, using threads can speed things up.
Threads are best used when you have time-consuming tasks you can run in the background (for example, when you need to process say 10 files, you can have a thread process each of them, which will be a lot faster than processing them all on your main thread).
Please note that there may be related bugs, so you should make sure all threads have finished running before continuing and trying to access their results.
Look up "C#.NET multithreading".
Any thread can run a specified function, and BackgroundWorker is a nice class as well (I prefer pure multithreading, though).
Also note that this may backfire and wind up slower, but it's a good idea to try.
Threading is one way (there are others) of letting you overlap the processing with the I/O. That means instead of the total time being the sum of the time to read the data and the time to process the data, you can reduce it to (roughly) whichever of the two is larger (usually the I/O time).
If you mostly want to overlap the I/O time, you might want to look at overlapped I/O and/or I/O completion ports.
Edit: If you're going to do this, you normally want to base the number of I/O threads on the number of separate physical disks you're going to be reading from, and the number of processing threads on the number of processors you have available to do the processing (but only as many as necessary to keep up with the data being supplied by the reader thread). For a typical desktop machine, that will often mean only two threads, one to read and one to process data.
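The arrangement above can be sketched as one reader task feeding a configurable number of processing tasks, with the file I/O and per-line work simulated:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ReaderProcessorPipeline
{
    // One reader task feeds lines into a bounded queue; `workerCount`
    // processing tasks consume them in parallel. `readLines` and
    // `process` stand in for the real file I/O and per-line work.
    public static int Run(IEnumerable<string> readLines,
                          Func<string, int> process, int workerCount)
    {
        var queue = new BlockingCollection<string>(boundedCapacity: 1000);
        int total = 0;

        var reader = Task.Run(() =>
        {
            foreach (var line in readLines)   // the single I/O thread
                queue.Add(line);
            queue.CompleteAdding();
        });

        var workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    Interlocked.Add(ref total, process(line));
            }))
            .ToArray();

        Task.WaitAll(workers);
        reader.Wait();
        return total;
    }
}
```

A `workerCount` near the number of available cores is a reasonable starting point; for the typical desktop case described above, `workerCount` of 1 reproduces the two-thread setup.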
I have created a Windows application (using C# .NET) for a debugging contest in our department.
Many users use the same database to select the list of questions and update the marks in their respective IDs alone.
Is the threading concept required when they update their marks in the database?
Can anyone please help me?
Thanks in advance...
Multi-threading, or multiple threads, is used in scenarios where you want to do more than one task at a time, or do some tasks simultaneously. You should think about your scenario and the possible use of multiple threads in it. If you think some task can be divided into two separate tasks that can run in parallel, you can use multi-threading to gain performance improvements. Similarly, if you think some task is heavy and takes a long time, you can move that task to a background thread and use the main thread to deal with other work in parallel. It all depends on your scenario.
Now coming to your scenario: if it is a Windows Forms application, most likely there will be only one user of this app at a time, interacting through the UI. If this assumption is correct, I don't think you will need multi-threading. If the user enters input through the UI and clicks a save button at the end to save the info in the DB, a single UI thread will be enough to do this.
No, this is not needed. Each user will consume a connection from the database connection pool; those work concurrently, and no parallel programming is required.
If you update a database from different threads, it will not corrupt. This is different from regular C# objects, where you need to apply locks to protect them. You may be required to use transactions to ensure that your database updates don't interfere with each other at a higher level. Very simply put, transactions keep your database consistent when you edit multiple tables, or when your changes depend on the database contents; for example, when adding an order from a customer, transactions prevent you from adding an order for a deleted customer.
You need to use a non-UI thread for any database interactions; otherwise the UI may become unresponsive. E.g. if you execute a long query from the UI thread (or your connection is disrupted, or the database is under heavy use, or whatever; anything that can go wrong will go wrong), the UI thread gets blocked until the full response is received.
In situations where you have multiple users who may update the same data in the database, you may need to introduce transactions to ensure correct control and data flow - ACID.
My problem is that I'm apparently using too many tasks (threads?) that call a method that queries a SQL Server 2008 database. Here is the code:
for (int i = 0; i < 100000; i++)
{
    Task.Factory.StartNew(() => MethodThatQueriesDataBase())
        .ContinueWith(t => OtherMethod(t));
}
After a while I get a SQL timeout exception. I want to keep the actual number of concurrent tasks lower than 100,000, with a cap of say "no more than 10 at a time". I know I can manage my own threads using the ThreadPool, but I want to be able to use the beauty of the TPL with ContinueWith.
I looked at the Task.Factory.Scheduler.MaximumConcurrencyLevel but it has no setter.
How do I do that?
Thanks in advance!
UPDATE 1
I just tested the LimitedConcurrencyLevelTaskScheduler class (pointed out by Jon Skeet) and it is still doing the same thing (SQL timeout).
BTW, this database receives more than 800,000 events per day and has never had crashes or timeouts from those. It seems kind of weird that this would cause them.
You could create a TaskScheduler with a limited degree of concurrency, as explained here, then create a TaskFactory from that, and use that factory to start the tasks instead of Task.Factory.
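As an alternative sketch (not the linked scheduler itself), a SemaphoreSlim can cap the number of in-flight tasks while still letting you use ContinueWith; here `work` is a stand-in for MethodThatQueriesDataBase:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledTasks
{
    // Cap the number of in-flight tasks with a SemaphoreSlim instead of
    // a custom TaskScheduler: Wait blocks the enqueuing loop once
    // `maxConcurrent` tasks are running; each continuation frees a slot.
    public static void RunAll(int taskCount, int maxConcurrent, Action work)
    {
        var gate = new SemaphoreSlim(maxConcurrent);
        var tasks = Enumerable.Range(0, taskCount).Select(_ =>
        {
            gate.Wait();                                    // blocks at the cap
            return Task.Factory.StartNew(work)
                       .ContinueWith(t => gate.Release()); // free a slot when done
        }).ToArray();
        Task.WaitAll(tasks);
    }
}
```

The trade-off versus a limited-concurrency scheduler is that the loop that starts the tasks blocks at the cap, which is usually fine for a batch job like this one.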
Tasks are not 1:1 with threads - tasks are assigned threads for execution out of a pool of threads, and the pool of threads is normally kept fairly small (number of threads == number of CPU cores) unless a task/thread is blocked waiting for a long-running synchronous result - such as perhaps a synchronous network call or file I/O.
So spinning up 10,000 tasks should not result in the production of 10,000 actual threads. However, if every one of those tasks immediately dives into a blocking call, then you may wind up with more threads, but it still shouldn't be 10,000.
What may be happening here is you are overwhelming the SQL db with too many requests all at once. Even if the system only sets up a handful of threads for your thousands of tasks, a handful of threads can still cause a pileup if the destination of the call is single-threaded. If every task makes a call into the SQL db, and the SQL db interface or the db itself coordinates multithreaded requests through a single thread lock, then all the concurrent calls will pile up waiting for the thread lock to get into the SQL db for execution. There is no guarantee of which threads will be released to call into the SQL db next, so you could easily end up with one "unlucky" thread that starts waiting for access to the SQL db early but doesn't get into the SQL db call before the blocking wait times out.
It's also possible that the SQL back-end is multithreaded, but limits the number of concurrent operations due to licensing level. That is, a SQL demo engine only allows 2 concurrent requests but the fully licensed engine supports dozens of concurrent requests.
Either way, you need to do something to reduce your concurrency to more reasonable levels. Jon Skeet's suggestion of using a TaskScheduler to limit the concurrency sounds like a good place to start.
I suspect there is something wrong with the way you're handling DB connections. Web servers can have thousands of concurrent page requests running, all in various stages of SQL activity. I'm betting that the attempt to reduce the concurrent task count is really masking a different problem.
Can you profile the SQL connections? Check perfmon to see how many active connections there are, and see if you can grab, use, and release connections as quickly as possible.