I've been investigating TPL as a means of quickly generating a large volume of files. I have about 10 million rows in a database (events which belong to patients) which I want to output into their own text files, in the location d:\EVENTS\PATIENTID\EVENTID.txt.
I'm using two nested Parallel.ForEach loops - the outer in which a list of patients is retrieved, and the inner in which the events for a patient are retrieved and written to files.
This is the code I'm using; it's pretty rough at the moment, as I'm just trying to get things working.
DataSet1TableAdapters.GetPatientsTableAdapter ta = new DataSet1TableAdapters.GetPatientsTableAdapter();
List<DataSet1.GetPatientsRow> Pats = ta.GetData().ToList();
List<DataSet1.GetPatientEventsRow> events = null;
string patientDir = null;
System.IO.DirectoryInfo di = new DirectoryInfo(txtAllEventsPath.Text);
di.GetDirectories().AsParallel().ForAll((f) => f.Delete(true));

//get at the patients
Parallel.ForEach(Pats
    , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
    , patient =>
    {
        patientDir = "D:\\Events\\" + patient.patientID.ToString();

        //Output directory
        Directory.CreateDirectory(patientDir);

        events = new DataSet1TableAdapters.GetPatientEventsTableAdapter().GetData(patient.patientID).ToList();

        if (Directory.Exists(patientDir))
        {
            Parallel.ForEach(events.AsEnumerable()
                , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
                , ev =>
                {
                    List<DataSet1.GetAllEventRow> anEvent =
                        new DataSet1TableAdapters.GetAllEventTableAdapter();

                    File.WriteAllText(patientDir + "\\" + ev.EventID.ToString() + ".txt", ev.EventData);
                });
        }
    });
The code I have produced works very quickly but throws an error after a few seconds (by which point about 6,000 files have been created). The error is one of two types:
DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.
Whenever this error is produced, the directory structure D:\Events\PATIENTID\ exists, as other files have been created within that directory. An if condition checks for the existence of D:\Events\PATIENTID\ before the second loop is entered.
The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.
When this error occurs, the indicated file sometimes exists and sometimes doesn't.
So, can anyone offer any advice as to why these errors are being produced? I don't understand either of them, and as far as I can see it should just work (and indeed it does, for a short while).
From MSDN:
Use the Parallel Loop pattern when you need to perform the same independent operation for each element of a collection or for a fixed number of iterations. The steps of a loop are independent if they don't write to memory locations or files that are read by other steps.
Parallel.For can speed up the processing of your rows by multi-threading, but it comes with a caveat: if it is not used correctly it will end with unexpected program behaviour, like the one you are having above.
The reason for the following error:
DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.
can be that one thread goes to write while the directory is not there, and meanwhile another thread creates it. In this code that race is made far more likely because patientDir and events are declared outside the loops and shared by every thread, so between Directory.CreateDirectory and File.WriteAllText another patient's iteration can reassign patientDir to a path that doesn't exist yet. When doing parallelism there can be race conditions, because we are multi-threading, and if we don't use proper mechanisms like locks or monitors (or keep each iteration's state local) then we end up with these kinds of issues.
As you are doing file writing, multiple threads trying to write to the same file end up with the second error you have, i.e.
The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.
as one thread is already writing to the file, so at that moment other threads fail to open it for writing. Because the events list is also shared between the outer iterations, two threads can end up iterating the same set of events and writing to the same EventID.txt file.
I would suggest using a normal loop instead of parallelism here.
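As a rough sketch only (untested, reusing the table adapters from the question), the sequential version with all per-patient state kept local to the loop body could look something like this:

var patientsAdapter = new DataSet1TableAdapters.GetPatientsTableAdapter();

foreach (DataSet1.GetPatientsRow patient in patientsAdapter.GetData())
{
    // declared inside the loop, so nothing is shared between iterations
    string patientDir = Path.Combine(@"D:\Events", patient.patientID.ToString());
    Directory.CreateDirectory(patientDir);

    var events = new DataSet1TableAdapters.GetPatientEventsTableAdapter()
        .GetData(patient.patientID);

    foreach (DataSet1.GetPatientEventsRow ev in events)
    {
        // one file per event: D:\Events\PATIENTID\EVENTID.txt
        File.WriteAllText(Path.Combine(patientDir, ev.EventID.ToString() + ".txt"), ev.EventData);
    }
}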
I am trying to use Thread for the first time in my Windows service application. As per my requirements, I have to read data from the database and, if it matches a condition, execute a function in a new thread. My main concern is that the function meant to execute in the new thread is lengthy and will take time, so my question is: will my program reach the data reader code and read the next value from the database while my function keeps executing in the background thread? My application's execution logic is time specific.
Here is the code..
while (dr.Read())
{
    time = dr["SendingTime"].ToString();

    if ((str = DateTime.Now.ToString("HH:mm")).Equals(time))
    {
        //Execute Function and send reports based on the data from the database.
        Thread thread = new Thread(sendReports);
        thread.Start();
    }
}
Please help me..
Yep, as the comments said, you will have one thread per row. If you have 4-5 rows and you run that code, you'll get 4-5 threads working happily in the background.
You might be happy with it, and leave it, and in half a year, someone else will play with the DB, and you'll get 10K rows, and this will create 10K threads, and you'll be on a holiday and people will call you panicking because the program is broken ...
In other words, you don't want to do it, because it's a bad practice.
You should either use a queue of work units and have a fixed number of threads reading from that queue (in which case you might have 10K units there, but let's say 10 threads that will pick them up and process them until they are done), or some other mechanism to make sure you don't create a thread per row.
Unless of course, you don't care ...
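To illustrate the queue idea, here is a minimal sketch using .NET 4's BlockingCollection; ReportQueue, SendReports, and the string work item are made-up placeholders for whatever your sendReports function actually needs:

using System.Collections.Concurrent;
using System.Threading;

class ReportQueue
{
    // hypothetical work item type: whatever data one report needs
    private readonly BlockingCollection<string> _work = new BlockingCollection<string>();

    public ReportQueue(int workerCount)
    {
        // a fixed number of worker threads, no matter how many rows get queued
        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                // blocks until an item is available, exits when CompleteAdding() is called
                foreach (string item in _work.GetConsumingEnumerable())
                {
                    SendReports(item);
                }
            });
            worker.IsBackground = true;
            worker.Start();
        }
    }

    // called from the database-reading loop instead of starting a new Thread per row
    public void Enqueue(string item)
    {
        _work.Add(item);
    }

    public void Shutdown()
    {
        _work.CompleteAdding();
    }

    private void SendReports(string item)
    {
        // ... existing report-sending logic goes here ...
    }
}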
I want to read a huge directory with its subdirectories and files, and then write to a database. Everything is fine, but I have put a trigger on a table that fires when data is inserted and updates another table. The trigger works fine with a single SQL command, but due to the long-running process in the main program, the trigger is not fired. I am using queue/dequeue and a BackgroundWorker thread (C#).
How can this problem be solved? Any ideas appreciated.
I assume that the trigger is working OK, but you need all the data to be processed before you can see the trigger's effects. Therefore, I suggest that you split the data into smaller pieces (batches) and insert them into the database one by one. Basically, choose a batch size that best suits your setup and load the data iteratively.
Here is some example C# code:
public void ProcessData(string rootDirectory, int batchSize)
{
    // materialize the paths so the collection can be counted and re-enumerated safely
    List<string> pathsToProcess = GetPathsToProcess(rootDirectory).ToList();
    int currentBatch = 0;

    while (currentBatch * batchSize < pathsToProcess.Count)
    {
        // take a subset of the paths to process
        IEnumerable<string> batch = pathsToProcess
            .Skip(currentBatch * batchSize)
            .Take(batchSize);

        DoYourDatabaseLogic(batch);
        currentBatch++;
    }
}
The code above will execute the database operation for a smaller subset of data, after which your trigger will execute against that data. This will happen for each of the batches. You will still have to wait for all the batches to complete, but you can see the changes for the ones that have completed.
Using this approach brings, however, an important issue to worry about: what happens if one of the batches fails for some reason?
In case you must revert all the changes for the entire pathsToProcess collection if a single batch/subset of it fails, you should organize the above code to run in a single database transaction and ensure the rollback takes place appropriately.
If the pathsToProcess collection does not have to be rolled back in its entirety, I still recommend using a transaction per batch. In that case you may need to record which batch was last written successfully, in order to resume from it if the data has to be processed again.
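As a minimal sketch of the per-batch transaction option (assuming DoYourDatabaseLogic opens its connection inside the scope so it enlists in the ambient System.Transactions transaction; the wrapper name is hypothetical):

// requires a reference to System.Transactions
private void DoYourDatabaseLogicInTransaction(IEnumerable<string> batch)
{
    using (var scope = new TransactionScope())
    {
        DoYourDatabaseLogic(batch);   // all inserts for this batch

        // if an exception is thrown before this line, the whole batch rolls back
        scope.Complete();
    }
}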
Let me preface my question by stating that I am a casual developer, so I don't really know what I am doing past the basics.
This is developed in C# and .NET 3.5.
The core of what my current application is doing is to connect to remote servers, execute a WMI call, retrieve some data and then place this data into a database. It is a simple healthcheck application.
The basic code runs fine, but I ran into an issue where if a few servers were offline, each would take 1 minute to time out (which is realistic because of network bandwidth etc). One execution of the application ran for 45 minutes (because 40 servers were offline), which is not efficient, since roughly 40 of those minutes were spent waiting on timeouts.
After some research I think that using threads would be the best way to get around it, spawning a thread for each of the servers as it is processed.
Here is my thread code:
for (int x = 0; x < mydataSet.Tables[0].Rows.Count; x++)
{
    Thread ts0 = new Thread(() =>
        executeSomeSteps(mydataSet.Tables[0].Rows[x]["IPAddress"].ToString(),
                         mydataSet.Tables[0].Rows[x]["ID"].ToString(), connString, filepath));
    ts0.Start();
}
The dataset contains an ID reference and an IP Address. The execute some steps function looks like this:
static void executeSomeSteps(string IPAddress, string ID, string connstring, string filepath)
{
    string executeStuff;
    executeStuff = funclib.ExecuteSteps(IPAddress, ID, connstring, filepath);
    executeStuff = null;
}
executeSomeSteps inserts data into a database based on the returned WMI results. This process works fine, as mentioned earlier, but the problem is that some of the threads in the above for loop end up with the same data, and it executes more than once per server. There are often up to 5 records for a single server once the process completes.
From the research I have done, I believe it might be an issue with more than one thread reading the same x value from the dataset.
So now onto my questions:
Assume there are 10 records in the dataset:
Why are there more than 10 executions happening?
Will I still gain the performance if I lock the dataset value?
Can someone point me into the right direction regarding how to deal with variable data being passed to a static function by multiple threads?
What Davide refers to is that at the time your thread executes, the captured value of x (and therefore Rows[x]) may be different (and is even likely to be different) from when the delegate was created. This is because the for loop carries on while the threads start running. This is a very common gotcha. It can even happen without servers timing out.
The solution to this "modified closure" problem is to use new variables for each thread:
for (int x = 0; x < mydataSet.Tables[0].Rows.Count; x++)
{
    string ip = mydataSet.Tables[0].Rows[x]["IPAddress"].ToString();
    string id = mydataSet.Tables[0].Rows[x]["ID"].ToString();

    Thread ts0 = new Thread(() => executeSomeSteps(ip, id, connString, filepath));
    ts0.Start();
}
You may even encounter a System.ArgumentOutOfRangeException because when the for loop has finished, the last x++ may have executed, making x 1 higher than the row indexes. Any thread reaching its Rows[x] part will then throw the exception.
Edit
This issue kept bugging me. I think what you describe in your comment (it looked like extra records were being generated by one iteration) is exactly what the modified closure does. A few threads happen to start at roughly the same time, all taking the value of x at that moment. You must also have found that some servers were skipped in one pass; I cannot imagine that not happening.
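A tiny standalone illustration of the closure behaviour (nothing to do with the database code, just the shared loop variable):

using System;
using System.Threading;

class ClosureDemo
{
    static void Main()
    {
        // Each lambda captures the variable x itself, not its value at creation time,
        // so the output is nondeterministic and values are frequently repeated
        // (you may well see 5 printed several times).
        for (int x = 0; x < 5; x++)
        {
            new Thread(() => Console.WriteLine(x)).Start();
        }
    }
}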
I am trying to isolate the source of a "memory leak" in my C# application. This application copies a large number of potentially large files into records in a database using the image column type in SQL Server. I am using a LinqToSql and associated objects for all database access.
The main loop iterates over a list of files and inserts. After removing much boilerplate and error handling, it looks like this:
foreach (Document doc in ImportDocs) {
    using (var dc = new DocumentClassesDataContext(connection)) {
        byte[] contents = File.ReadAllBytes(doc.FileName);
        DocumentSubmission submission = new DocumentSubmission() {
            Content = contents,
            // other fields
        };
        dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
        dc.SubmitChanges(); // (B)
    }
}
Running this program over the entire input results in an eventual OutOfMemoryException. CLR Profiler reveals that 99% of the heap consists of large byte[] objects corresponding to the sizes of the files.
If I comment both lines A and B, this leak goes away. If I uncomment only line A, the leak comes back. I don't understand how this is possible, as dc is disposed for every iteration of the loop.
Has anyone encountered this before? I suspect directly calling stored procedures or doing inserts will avoid this leak, but I'd like to understand this before trying something else. What is going on?
Update
Including GC.Collect(); after line (B) appears to make no significant change to any case. This does not surprise me much, as CLR Profiler was showing a good number of GC events without explicitly inducing them.
Which operating system are you running this on? Your problem may not be related to Linq2Sql, but to how the operating system manages large memory allocations. For instance, Windows Server 2008 is much better at managing large objects in memory than XP. I have had instances where the code working with large files was leaking on XP but was running fine on Win 2008 server.
HTH
I don't entirely understand why, but making a copy of the iterating variable fixed it. As near as I can tell, LinqToSql was somehow making a copy of the DocumentSubmission inside each Document.
foreach (Document doc in ImportDocs) {
    // make copy of doc that lives inside loop scope
    Document copydoc = new Document() {
        field1 = doc.field1,
        field2 = doc.field2,
        // complete copy
    };
    using (var dc = new DocumentClassesDataContext(connection)) {
        byte[] contents = File.ReadAllBytes(copydoc.FileName);
        DocumentSubmission submission = new DocumentSubmission() {
            Content = contents,
            // other fields
        };
        dc.DocumentSubmissions.InsertOnSubmit(submission); // (A)
        dc.SubmitChanges(); // (B)
    }
}
I've got an issue which shows up intermittently in my unit tests and I can't work out why.
The unit test itself is adding multiple documents to an index, then trying to query the index to get the documents back out again.
So 95% of the time it works without any problems. Then the other 5% of the time it cannot retrieve the documents back out of the index.
My unit test code is as follows:
[Test]
public void InsertMultipleDocuments()
{
    string indexPath = null;

    using (LuceneHelper target = GetLuceneHelper(ref indexPath))
    {
        target.InsertOrUpdate(
            target.MakeDocument(GetDefaultSearchDocument()),
            target.MakeDocument(GetSecondSearchDocument()));

        var doc = target.GetDocument(_documentID.ToString()).FirstOrDefault();
        Assert.IsNotNull(doc);
        Assert.AreEqual(doc.DocumentID, _documentID.ToString());

        doc = target.GetDocument(_document2ID.ToString()).FirstOrDefault();
        Assert.IsNotNull(doc);
        Assert.AreEqual(doc.DocumentID, _document2ID.ToString());
    }

    TidyUpTempFolder(indexPath);
}
I won't post the full code from my LuceneHelper, but the basic idea is that it holds a reference to an IndexSearcher, which is closed every time an item is written to the index (so it can be re-opened with all of the latest documents).
The actual unit test often fails when retrieving the second document. I assumed it was to do with the searcher not being closed and seeing cached data, but this isn't the case.
Does Lucene have any delay in adding documents to the index? I assumed that once it had added the document to the index it was available immediately as long as you closed any old search indexers and opened a new one.
Any ideas?
How do you close the IndexWriter you use for updating the index? The close method has an overload that takes a single boolean parameter specifying whether or not you want to wait for merges to complete. The default merge scheduler runs the merges in a separate thread and that might cause your problems.
Try closing the writer like this:
indexWriter.Close(true);
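For context, a rough sketch of the overall close-and-reopen pattern (hedged: this assumes a Lucene.NET 2.9-style API, and indexWriter, indexDirectory, doc1 and doc2 are placeholders for whatever your LuceneHelper already holds):

// write the new documents
indexWriter.AddDocument(doc1);
indexWriter.AddDocument(doc2);

// close the writer and wait for background merges to finish before searching
indexWriter.Close(true);

// open a fresh, read-only searcher so it sees the newly written documents
var searcher = new IndexSearcher(indexDirectory, true);
try
{
    // run the queries that fetch the documents back out
}
finally
{
    searcher.Close();
}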
More information can be found at Lucene.NET documentation.
Btw, which version of Lucene.NET are you using?