I want to read a huge directory, including its subdirectories and files, and then write the results to a database. Everything works, but I put a trigger on a table that fires when a row is inserted and updates another table. The trigger works fine with a single SQL command, but because of the long-running process in the main program, the trigger is not fired. I am using a queue (enqueue/dequeue) and a BackgroundWorker thread (C#).
How can this problem be solved? Any ideas are appreciated.
I assume that the trigger is working OK, but you need all the data to be processed before seeing the trigger effects take place. Therefore, I suggest that you split the data into smaller pieces (batches) and insert them into the database one by one. Basically, choose the batch size that best suits your setup and load the data in iterations.
Here is some example C# code:
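// requires: using System.Collections.Generic; using System.Linq;
public void ProcessData(string rootDirectory, int batchSize)
{
    // Materialize the paths once: IEnumerable<string> has no Length property,
    // and this also avoids re-enumerating the source for every batch.
    List<string> pathsToProcess = GetPathsToProcess(rootDirectory).ToList();
    int currentBatch = 0;
    while (currentBatch * batchSize < pathsToProcess.Count)
    {
        // take a subset of the paths to process
        IEnumerable<string> batch = pathsToProcess
            .Skip(currentBatch * batchSize)
            .Take(batchSize);
        DoYourDatabaseLogic(batch);
        currentBatch++;
    }
}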
The code above will execute the database operation for a smaller subset of data, after which your trigger will execute against that data. This will happen for each of the batches. You will still have to wait for all the batches to complete, but you can see the changes for the ones that have completed.
This approach does, however, raise an important question: what happens if one of the batches fails for some reason?
If you must revert all the changes for the entire pathsToProcess collection when a single batch/subset of it fails, you should organize the above code to run in a single database transaction, and ensure the rollback takes place appropriately.
If the pathsToProcess collection is not required to be rolled back entirely, I still recommend using a transaction for each of the batches. In that case you may need to know which batch you last wrote successfully, in order to resume from it if the data is to be processed again.
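To make the "resume from the last successful batch" idea concrete, here is a minimal sketch. BatchRunner, Run and the processBatch delegate are names I made up for illustration; the delegate stands in for your own database logic, which I am assuming runs each batch in its own transaction and rolls back on error.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchRunner
{
    // Runs processBatch over items in batches of batchSize, starting at batch
    // index startBatch. Returns the index of the first batch that has NOT been
    // completed, i.e. the index to resume from (equal to the total number of
    // batches when everything succeeded).
    public static int Run(
        IReadOnlyList<string> items,
        int batchSize,
        int startBatch,
        Action<IReadOnlyList<string>> processBatch)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        int currentBatch = startBatch;
        while (currentBatch * batchSize < items.Count)
        {
            List<string> batch = items
                .Skip(currentBatch * batchSize)
                .Take(batchSize)
                .ToList();
            try
            {
                // Assumption: this delegate opens a transaction, inserts the
                // batch, and commits (or rolls back and rethrows on error).
                processBatch(batch);
            }
            catch (Exception)
            {
                // The failed batch is assumed to be rolled back, so it is the
                // one to retry later. Real code would also log the exception.
                return currentBatch;
            }
            currentBatch++;
        }
        return currentBatch;
    }
}
```

You would persist the returned index somewhere (a file, a small status table) and pass it back in as startBatch on the next run.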
Related
I am trying to make a C# WinForms application that fetches data from URLs saved in a table named "Links". Each link has a "Last Checked" and a "Next Check" datetime, and there is an "interval" that determines "Next Check" based on "Last Checked".
Right now, I fetch the ID with a query BEFORE doing the web scraping, and then set Last Checked to DateTime.Now and Next Check to null until everything is completed. Both are then updated after the web scraping is done.
The problem with this is that if an ongoing process is aborted, lastcheck will be a date but nextcheck will be null.
So I need a better way to keep two processes from working on the same row of the same table, but I am not sure how.
For a multithreaded solution, the standard engineering approach is to use a pool of workers and a pool of work.
This is just a conceptual sketch - you should adapt it to your circumstances:
A worker (i.e. a thread) looks at the pool of work. If there is some work available, it marks it as in_progress. This has to be done so that no two threads can take the same work. For example, you could use a lock in C# to do the query in a database, and to mark a row before returning it.
You need to have a way of un-marking it after the thread finishes. Successful or not, in_progress must be re-set. Typically, you could use a finally block so that you don't miss it in the event of any exception.
If there is no work available, the thread goes to sleep.
Whenever new work arrives (i.e. an INSERT, or a nextcheck comes due), one of the sleeping threads is awakened.
When your program starts, it should clear any in_progress flags in the event of a previous crash.
You should take advantage of DBMS transactions so that any changes a worker makes after completing its work are atomic - i.e. other threads perceive them as if they had happened all at once.
By changing the size of the worker pool, you can set the maximum number of simultaneously active workers.
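The claim/release part of the steps above could look roughly like this. It is an in-memory stand-in, not database code: WorkItem, WorkPool and their members are hypothetical names, and in your case TryClaim would run a query and an UPDATE under the same lock, while ClearInProgressFlags would be an UPDATE issued at startup.

```csharp
using System;
using System.Collections.Generic;

public sealed class WorkItem
{
    public int Id;
    public bool InProgress;
    public bool Done;
}

public sealed class WorkPool
{
    private readonly object _gate = new object();
    private readonly List<WorkItem> _items = new List<WorkItem>();

    public void Add(WorkItem item)
    {
        lock (_gate) { _items.Add(item); }
    }

    // Finds an unclaimed item and marks it in progress under one lock,
    // so no two threads can take the same work. Returns null if none is left.
    public WorkItem TryClaim()
    {
        lock (_gate)
        {
            foreach (WorkItem item in _items)
            {
                if (!item.InProgress && !item.Done)
                {
                    item.InProgress = true;
                    return item;
                }
            }
            return null;
        }
    }

    // Un-marks the item; it only counts as done if the work succeeded.
    public void Release(WorkItem item, bool succeeded)
    {
        lock (_gate)
        {
            item.InProgress = false;
            if (succeeded) item.Done = true;
        }
    }

    // Call once at program start to clear flags left behind by a crash.
    public void ClearInProgressFlags()
    {
        lock (_gate)
        {
            foreach (WorkItem item in _items) item.InProgress = false;
        }
    }

    // Claims one item and runs work on it. Returns false if nothing was
    // available. The finally block resets the flag even if work throws.
    public bool ProcessNext(Action<WorkItem> work)
    {
        WorkItem item = TryClaim();
        if (item == null) return false;

        bool succeeded = false;
        try
        {
            work(item);
            succeeded = true;
        }
        finally
        {
            Release(item, succeeded);
        }
        return true;
    }
}
```

Each worker thread would call ProcessNext in a loop and go to sleep when it returns false.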
First, the separation of controller and workers might be a better pattern, as mentioned in the other answer. It will work better if the number of threads gets large and the number of links to check is large.
But if your problem is this:
But problem with it is, if for any reason that scraping gets aborted/finishes halfway/doesn't work properly, LastCheck becomes DateTime.Now but NextCheck is left NULL, and previous LastCheck/NextCheck values are gone, and LastCheck/NextCheck values are updated for a link that is not actually checked
You just need to handle errors better.
The failure will result in an exception. Catch the exception and handle it by resetting the state in the database. For example:
void DoScraping(.....)
{
try
{
// ....
}
catch (Exception err)
{
// oh dear, it went wrong, reset lastcheck/nextcheck
}
}
What you reset last/nextcheck to is up to you. You could reset them to what they were at the start: when you determine 'the next thing to do', also read the values of last/nextcheck and store them in variables. Then, in the event of failure, just set them back to what they were before.
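Here is a small sketch of the "remember the old values, restore them on failure" idea. LinkState and LinkChecker are hypothetical names, and the object is just an in-memory stand-in for the row in your Links table; in real code the assignments would be UPDATE statements.

```csharp
using System;

public sealed class LinkState
{
    public DateTime? LastCheck;
    public DateTime? NextCheck;
}

public static class LinkChecker
{
    // Marks the link as being checked, runs scrape, and either schedules the
    // next check (success) or restores the previous values (failure).
    // Returns true if scraping succeeded.
    public static bool Check(LinkState link, DateTime now, TimeSpan interval, Action scrape)
    {
        // Remember what was there before we touch anything.
        DateTime? previousLast = link.LastCheck;
        DateTime? previousNext = link.NextCheck;

        link.LastCheck = now;
        link.NextCheck = null;
        try
        {
            scrape();
            link.NextCheck = now + interval;
            return true;
        }
        catch (Exception)
        {
            // oh dear, it went wrong: put lastcheck/nextcheck back
            link.LastCheck = previousLast;
            link.NextCheck = previousNext;
            return false;
        }
    }
}
```

Note that a catch block only helps when the process is still alive. If the whole program is killed mid-scrape, you would still need some clean-up at startup.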
I am using threads for the first time, in my Windows service application. I have to read data from a database and, if a row matches a condition, execute a function in a new thread. My main concern is that the function that runs in the new thread is lengthy and will take time. So my question is: will my program go back to the data reader code and read the next value from the database while my function keeps executing in the background thread? My application's execution logic is time-specific.
Here is the code..
while (dr.Read())
{
time = dr["SendingTime"].ToString();
if ((str = DateTime.Now.ToString("HH:mm")).Equals(time))
{
//Execute Function and send reports based on the data from the database.
Thread thread = new Thread(sendReports);
thread.Start();
}
}
Please help me..
Yep, as the comments said, you will have one thread per row. If you have 4-5 rows and you run that code, you'll get 4-5 threads working happily in the background.
You might be happy with it, and leave it, and in half a year, someone else will play with the DB, and you'll get 10K rows, and this will create 10K threads, and you'll be on a holiday and people will call you panicking because the program is broken ...
In other words, you don't want to do it, because it's a bad practice.
You should either use a queue of work units and have a fixed number of threads reading from that queue (in which case you might have 10K units there, but let's say 10 threads that will pick them up and process them until they are done), or use some other mechanism to make sure you don't create a thread per row.
Unless of course, you don't care ...
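For what it's worth, here is a minimal sketch of the queue-plus-fixed-threads variant using BlockingCollection from the framework. FixedWorkerPool and ProcessAll are names I invented; handler stands in for your sendReports logic.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class FixedWorkerPool
{
    // Processes every unit using exactly workerCount threads, no matter how
    // many units there are. Blocks until all units have been handled.
    public static void ProcessAll<T>(IEnumerable<T> units, int workerCount, Action<T> handler)
    {
        using (var queue = new BlockingCollection<T>())
        {
            var workers = new List<Thread>();
            for (int i = 0; i < workerCount; i++)
            {
                var worker = new Thread(() =>
                {
                    // Blocks while the queue is empty and ends once
                    // CompleteAdding has been called and the queue is drained.
                    // Note: an exception escaping handler would take down the
                    // process, so real code should catch and log inside it.
                    foreach (T unit in queue.GetConsumingEnumerable())
                    {
                        handler(unit);
                    }
                });
                worker.Start();
                workers.Add(worker);
            }

            foreach (T unit in units)
            {
                queue.Add(unit);
            }
            queue.CompleteAdding();

            foreach (Thread worker in workers)
            {
                worker.Join();
            }
        }
    }
}
```

With this shape, 10K rows just means a longer queue, not 10K threads.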
I have some code that at the end of the program's life, uploads the entire contents of 6 different lists into a database. The problem is, they're parallel lists with about 14,000 items in each, and I have to run an Insert query for each of their separate item(s). This takes a long time, is there a faster way to do this? Here's a sample of the relevant code:
public void uploadContent()
{
var cs = Properties.Settings.Default.Database;
SqlConnection dataConnection = new SqlConnection(cs);
dataConnection.Open();
for (int i = 0; i < urlList.Count; i++)
{
SqlCommand dataCommand = new SqlCommand(Properties.Settings.Default.CommandString, dataConnection);
try
{
dataCommand.Parameters.AddWithValue("@user", userList[i]);
dataCommand.Parameters.AddWithValue("@computer", computerList[i]);
dataCommand.Parameters.AddWithValue("@date", timestampList[i]);
dataCommand.Parameters.AddWithValue("@itemName", domainList[i]);
dataCommand.Parameters.AddWithValue("@itemDetails", urlList[i]);
dataCommand.Parameters.AddWithValue("@timesUsed", hitsList[i]);
dataCommand.ExecuteNonQuery();
}
catch (Exception e)
{
using (StreamWriter sw = File.AppendText("errorLog.log"))
{
sw.WriteLine(e);
}
}
}
dataConnection.Close();
}
Here is the command string the code is pulling from the config file:
CommandString:
INSERT dbo.InternetUsage VALUES (@user, @computer, @date, @itemName, @itemDetails, @timesUsed)
As mentioned in @alerya's answer, doing the following will help (added explanation here)
1) Make Command and parameter creation outside of for loop
Since the same command is being used each time, it doesn't make sense to re-create the command each time. In addition to creating a new object (which takes time), the command must also be verified each time it is created for several things (table exists, etc). This introduces a lot of overhead.
2) Put the inserts within a transaction
Putting all of the inserts within a transaction will speed things up because, by default, a command that is not within a transaction is treated as its own transaction. Therefore, every time you insert something, the database server must verify that what it just inserted is actually saved (usually on a hard disk, so it is limited by the speed of the disk). When multiple INSERTs are within one transaction, however, the check only needs to be performed once.
The downside to this approach, based on the code you've already shown, is that one bad INSERT will spoil the bunch. Whether or not this is acceptable depends on your specific requirements.
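Points 1) and 2) combined could look something like the sketch below. I've written it against the System.Data interfaces so it isn't tied to one provider (a SqlConnection can be passed in as the IDbConnection). UsageUploader, Upload and the object[] row layout are my own assumptions, not something from the question, and the connection is assumed to be open already.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class UsageUploader
{
    public const string InsertSql =
        "INSERT dbo.InternetUsage VALUES (@user, @computer, @date, @itemName, @itemDetails, @timesUsed)";

    private static readonly string[] ParameterNames =
        { "@user", "@computer", "@date", "@itemName", "@itemDetails", "@timesUsed" };

    // rows: each object[] holds the six values in the order of ParameterNames.
    // connection must already be open.
    public static void Upload(IDbConnection connection, IEnumerable<object[]> rows)
    {
        using (IDbTransaction transaction = connection.BeginTransaction())
        using (IDbCommand command = connection.CreateCommand())
        {
            // 1) Create the command and its parameters once, outside the loop.
            command.CommandText = InsertSql;
            command.Transaction = transaction;
            foreach (string name in ParameterNames)
            {
                IDbDataParameter parameter = command.CreateParameter();
                parameter.ParameterName = name;
                command.Parameters.Add(parameter);
            }

            try
            {
                foreach (object[] row in rows)
                {
                    // Only the values change from one row to the next.
                    for (int i = 0; i < ParameterNames.Length; i++)
                    {
                        ((IDbDataParameter)command.Parameters[i]).Value = row[i] ?? DBNull.Value;
                    }
                    command.ExecuteNonQuery();
                }
                // 2) One commit for all rows instead of one per INSERT.
                transaction.Commit();
            }
            catch
            {
                // One bad INSERT undoes the whole upload.
                transaction.Rollback();
                throw;
            }
        }
    }
}
```

If losing the whole upload on one bad row is not acceptable, you could commit in smaller groups instead of one big transaction.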
Aside
Another thing you really should be doing (though this won't speed things up in the short term) is properly using the IDisposable interface. This means either calling .Dispose() on all IDisposable objects (SqlConnection, SqlCommand), or, ideally, wrapping them in using() blocks:
using (SqlConnection dataConnection = new SqlConnection(cs))
{
//Code goes here
}
This will prevent memory leaks from these spots, which will become a problem quickly if your loops get too large.
Move the command and parameter creation outside of the for (int i = 0; i < urlList.Count; i++) loop.
Also run the inserts within a transaction.
If possible, create a stored procedure and pass the parameters as a DataTable.
Sending INSERT commands one by one to a database will really make the whole process slow, because of the round trips to the database server. If you're worried about performance, you should consider using a bulk insert strategy. You could:
Generate a flat file with all your information, in the format that BULK INSERT understands.
Use the BULK INSERT command to import that file to your database (http://msdn.microsoft.com/en-us/library/ms188365(v=sql.90).aspx).
Ps. I guess when you say SQL you're using MS SQL Server.
Why don't you run your uploadContent() method from a separate thread.
This way you don't need to worry about how much time the query takes to execute.
I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?
I've created a simple Repository method to save a record:
public void SaveOffer(Offer offer)
{
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
// add new offer
if (dbOffer == null)
{
this.db.Offers.InsertOnSubmit(offer);
}
//update existing offer
else
{
dbOffer = offer;
}
this.db.SubmitChanges();
}
But using this method, the program is much slower than inserting the data using ADO.net SQL inserts (new SqlConnection, new SqlCommand for select if exists, new SqlCommand for update/insert).
On 100k csv rows it takes about an hour vs 1 minute or so for the ADO.net way. For 2M csv rows it took ADO.net about 20 minutes. LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
Updates/Edits:
After 18 hours, the LINQ version had added just ~200K rows.
I've also tested the import with LINQ inserts only, and it is still really slow compared with ADO.net. I haven't seen a big difference between just inserts/submitchanges and selects/updates/inserts/submitchanges.
I still have to try batch commit, manually connecting to the db and compiled queries.
SubmitChanges does not batch changes, it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.
While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.
See question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx
It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx that uses extension methods to fix linq's inability to batch inserts and updates etc.
Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?
Transactions help throughput by reducing the needs for fsync()'s, and delaying db.SubmitChanges will reduce the number of .NET<->db roundtrips.
Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.
Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.
You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.
Mikes Knowledge Base - BulkInserts with LINQ
Good luck !
I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).
I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).
I would try ditching the data-context (submit-changes and replace with an empty one) every n operations for some n - maybe 250 or something.
Given that you're calling SubmitChanges per instance at the moment, you may also be wasting a lot of time checking the delta - pointless if you've only changed one row. Only call SubmitChanges in batches; not per record.
Alex gave the best answer, but I think a few things are being overlooked.
One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, then the DataContext will repeatedly open and close it itself. However, if you open it yourself, and then close it yourself when you're absolutely finished, things will run a lot faster since it won't have to reconnect to the database every time. I found this out when trying to find out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
A few other areas where you could speed things up:
While Linq To SQL doesn't support your straight up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.
If live data integrity isn't super crucial, you could retrieve a list of offer_id values from the server before you start checking whether an offer already exists. This could significantly reduce the number of times you call the server to look for an existing item when it's not even there.
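The second point could be sketched like this: load the existing ids once into a HashSet, then decide insert vs. update in memory. OfferRow and OfferPartitioner are hypothetical names standing in for your Offer entity and repository code, and I'm assuming offer_id is an int.

```csharp
using System;
using System.Collections.Generic;

public sealed class OfferRow
{
    public int offer_id;
    public string title;
}

public static class OfferPartitioner
{
    // existingIds: the offer_id values fetched from the server in ONE query.
    // Splits incoming rows into those to insert and those to update
    // without any further round trips.
    public static void Split(
        IEnumerable<OfferRow> incoming,
        IEnumerable<int> existingIds,
        out List<OfferRow> toInsert,
        out List<OfferRow> toUpdate)
    {
        var known = new HashSet<int>(existingIds);
        toInsert = new List<OfferRow>();
        toUpdate = new List<OfferRow>();

        foreach (OfferRow offer in incoming)
        {
            // HashSet.Add returns false if the id is already known. It also
            // records new ids, so a repeated id later in the same CSV is
            // treated as an update rather than a second insert.
            if (known.Add(offer.offer_id))
            {
                toInsert.Add(offer);
            }
            else
            {
                toUpdate.Add(offer);
            }
        }
    }
}
```

This only stays correct as long as nothing else writes to the table during the import, which is the "live data integrity" caveat above.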
Why not pass an offer[] into that method and do all the changes in the cache before submitting them to the database? Or you could submit in groups, so you don't run out of cache. The main thing is how long you wait before sending the data over; the biggest waste of time is in closing and opening the connection.
Converting this to a compiled query is the easiest way I can think of to boost your performance here:
Change the following:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
to:
Offer dbOffer = RetrieveOffer(this.db, offer.offer_id);
// MyDataContext is a placeholder for your generated DataContext subclass (the type of this.db)
private static readonly Func<MyDataContext, int, Offer> RetrieveOffer =
    CompiledQuery.Compile((MyDataContext context, int offerId) =>
        context.Offers.SingleOrDefault(o => o.offer_id == offerId));
This change alone will not make it as fast as your ado.net version, but it will be a significant improvement because without the compiled query you are dynamically building the expression tree every time you run this method.
As one poster already mentioned, you must refactor your code so that submit changes is called only once if you want optimal performance.
Do you really need to check whether the record exists before inserting it into the DB? It looked strange to me, as the data comes from a CSV file.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
I don't think it defeats the purpose at all; why would it? Just fill a simple dataset with all the data from the CSV and do a SqlBulkCopy. I did a similar thing with a collection of 30000+ rows, and the import time went from minutes to seconds.
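As a rough sketch of what I mean: do the transformations while filling a DataTable, then hand the finished table to SqlBulkCopy. The CSV layout (offer_id;title;price), the column names and OfferTableBuilder are all made up for the example, so adjust them to your Offer table.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Globalization;

public static class OfferTableBuilder
{
    // Builds an in-memory table from CSV lines of the (assumed) form
    // "offer_id;title;price", applying transformations along the way.
    public static DataTable Build(IEnumerable<string> csvLines)
    {
        var table = new DataTable("Offers");
        table.Columns.Add("offer_id", typeof(int));
        table.Columns.Add("title", typeof(string));
        table.Columns.Add("price", typeof(decimal));

        foreach (string line in csvLines)
        {
            string[] fields = line.Split(';');

            // Any per-row transformation goes here, before the bulk copy.
            int id = int.Parse(fields[0], CultureInfo.InvariantCulture);
            string title = fields[1].Trim();
            decimal price = decimal.Parse(fields[2], CultureInfo.InvariantCulture);

            table.Rows.Add(id, title, price);
        }
        return table;
    }
}
```

The resulting table can then be passed to SqlBulkCopy.WriteToServer(DataTable) after setting DestinationTableName. For millions of rows you would probably build and send the table in chunks rather than all at once, to keep memory use down.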
I suspect it isn't the inserting or updating operations that are taking a long time, rather the code that determines if your offer already exists:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
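A tiny helper along these lines is enough for that kind of measurement. Timing and Measure are just names I picked; Stopwatch itself is the standard System.Diagnostics class.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action once and returns how long it took.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

You could wrap the SingleOrDefault lookup in one call and SubmitChanges in another, add up the results over a few thousand rows, and see which one dominates.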
Usually, when not using Linq-to-Sql, you would have an insert/update procedure or sql script that would determine whether the record you pass already exists. You're doing this expensive operation in Linq, which certainly will never hope to match the speed of native sql (which is what's happening when you use a SqlCommand and select if the record exists) looking-up on a primary key.
Well, you must understand that LINQ generates code dynamically for all the ADO operations you do, instead of using handwritten code, so it will always take more time than your manual code. It's simply an easy way to write code, but if you want to talk about performance, ADO.NET code will always be faster, depending on how you write it.
I don't know whether LINQ tries to reuse its last statement or not; if it does, then separating the insert batch from the update batch may improve performance a little bit.
This code runs OK, and prevents large amounts of data from piling up in the pending change set:
if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
repository2.GeoItems.SubmitChanges();
}
Then, at the end of the bulk insertion, use this:
repository2.GeoItems.SubmitChanges();
I am new to threads and in need of help. I have a data entry app that takes an exorbitant amount of time to insert a new record (i.e. 50-75 seconds). So my solution was to send the insert statement out via a ThreadPool and let the user begin entering the data for the record while that insert, which returns a new record ID, is running. My problem is that a user can hit save before the new ID is returned from that insert.
I tried putting in a Boolean variable which gets set to true via an event from that thread when it is safe to save. I then put in
while (safeToSave == false)
{
Thread.Sleep(200);
}
I think that is a bad idea. If I run the save method before that thread returns, it gets stuck.
So my questions are:
Is there a better way of doing this?
What am I doing wrong here?
Thanks for any help.
Doug
Edit for more information:
It is doing an insert into a very large (approaching max size) FoxPro database. The file has about 200 fields and almost as many indexes on it.
And before you ask, no, I cannot change its structure, as it was here before I was and there is a ton of legacy code hitting it. The first problem is that, in order to get a new ID, I must first find the max(id) in the table, then increment and checksum it. That takes about 45 seconds. Then the first insert is simply an insert of that new id and an enterdate field. This table is not/cannot be put into a DBC, so that rules out auto-generating ids and the like.
@joshua.ewer
You have the process correct, and I think for the short term I will just disable the save button, but I will be looking into your idea of passing it into a queue. Do you have any references to MSMQ that I should take a look at?
1) Many :), for example you could disable the "save" button while the thread is inserting the object, or you could set up a worker thread which handles a queue of "save requests" (but I think the problem here is that the user wants to modify the newly created record, so disabling the button is maybe better)
2) I think we need some more code to be able to understand... (or maybe it's a synchronization issue; I am not a big fan of threads either)
btw, I just don't understand why an insert should take so long.. I think that you should check that code first! <- just as charles stated before (sorry, didn't read the post) :)
Everyone else, including you, addressed the core problems (insert time, why you're doing an insert, then update), so I'll stick with just the technical concerns with your proposed solution. So, if I get the flow right:
Thread 1: Start data entry for record
Thread 2: Background calls to DB to retrieve new Id
The save button is always enabled, if user tries to save before Thread 2 completes, you put #1 to sleep for 200 ms?
The simplest, not best, answer is to just have the button disabled, and have that thread make a callback to a delegate that enables the button. They can't start the update operation until you're sure things are set up appropriately.
Though, I think a much better solution (though it might be overblown if you're just building a Q&D front end to FoxPro), would be to throw those save operations into a queue. The user can key as quickly as possible, then the requests are put into something like MSMQ and they can complete in their own time asynchronously.
Use a future rather than a raw ThreadPool action. Execute the future, allow the user to do whatever they want, when they hit Save on the 2nd record, request the value from the future. If the 1st insert finished already, you'll get the ID right away and the 2nd insert will be allowed to kick off. If you are still waiting on the 1st operation, the future will block until it is available, and then the 2nd operation can execute.
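In .NET the closest thing to a future is Task<T>. A minimal sketch, assuming the slow insert can be wrapped in a Func<int> that returns the new ID (PendingRecord and Save are hypothetical names, and the returned string is only a stand-in for the real update statement):

```csharp
using System;
using System.Threading.Tasks;

public sealed class PendingRecord
{
    private readonly Task<int> _newId;

    // Kicks off the slow insert on the thread pool right away.
    public PendingRecord(Func<int> slowInsert)
    {
        _newId = Task.Run(slowInsert);
    }

    // Called when the user hits Save. Returns immediately if the insert has
    // already finished; otherwise blocks until the ID is available.
    public string Save(string enteredData)
    {
        // If slowInsert threw, the exception surfaces here wrapped in an
        // AggregateException.
        int id = _newId.Result;
        return id + ":" + enteredData;
    }
}
```

Keep in mind that blocking on Result from the UI thread freezes the form until the ID arrives, so you may want to show a wait cursor or call Save from a background thread.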
You're not saving any time unless the user is slower than the operation.
First, you should probably find out, and fix, the reason why an insert is taking so long... 50-75 seconds is unreasonable for any modern database for a single row insert, and indicates that something else needs to be addressed, like indices, or blocking...
Secondly, why are you inserting the record before you have the data? Normally, data entry apps are coded so that the insert is not attempted until all the necessary data for the insert has been gathered from the user. Are you doing this because you are trying to get the new Id back from the database first, and then "update" the new empty record with the user-entered data later? If so, almost every database vendor has a mechanism where you can do the insert only once, without knowing the new ID, and have the database return the new ID as well... What vendor database are you using?
Is a solution like this possible:
Pre-calculate the unique IDs before a user even starts to add. Keep a list of unique Id's that are already in the table but are effectively place holders. When a user is trying to insert, reserve them one of the unique IDs, when the user presses save, they now replace the place-holder with their data.
PS: It's difficult to confirm this, but be aware of the following concurrency issue with what you are proposing (with or without threads): User A, starts to add, user B starts to add, user A calculates ID 1234 as the max free ID, user B calculates ID 1234 as the max free ID. User A inserts ID 1234, User B inserts ID 1234 = Boom!