So I am creating a customer indexing program for my company, and I have essentially everything coded and working, except that I want the indexing program to watch the user-specified indexed directories and update the underlying data store on the fly, to reduce the need for frequent full re-indexing.
I coded everything in WPF/C# with an underlying SQLite database. I am sure the folder watchers would work well under light loads, but the problem is we use TortoiseSVN, and when a user does an SVN Update (or similar) that generates heavy file activity, the FileSystemWatcher and the SQLite updates just can't keep up, even with the maximum buffer size. Basically, I am doing a database insert every time a watcher event fires.
So my main question is: does anyone have suggestions on how to implement this file watcher so it can handle such heavy loads?
Some thoughts I had were: (1) create a staging collection for all the queries and use a timer and a worker thread to insert the data later; (2) write the queries to a file and use a timer thread later on for the insert.
Help....
You want to buffer the data received from your file-watch events in memory. When you receive events from your registered file watchers, accumulate them in memory as fast as possible during a burst of activity. Then, on a separate thread, read them from the in-memory buffer and do whatever persistent storage or other time-intensive processing you need.
You can use a queue to buffer all the requests. I have had good experience with MSMQ (Microsoft Message Queuing), which comes out of the box and is quite easy to use.
see http://www.c-sharpcorner.com/UploadFile/rajkpt/101262007012217AM/1.aspx
Then have a separate worker thread which grabs a predefined number of elements from the queue and inserts them into the database. Here I'd suggest merging the single inserts into one bulk insert.
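As a minimal sketch of that idea (using an in-process ConcurrentQueue rather than MSMQ, and batching pending inserts into a single SQLite transaction — the `PendingChange` type, the table schema, and the 5-second flush interval are illustrative assumptions, not from the question):

```csharp
using System;
using System.Collections.Concurrent;
using System.Data.SQLite;
using System.Threading;

record PendingChange(string Path, DateTime Seen);

class BatchingIndexer : IDisposable
{
    private readonly ConcurrentQueue<PendingChange> _queue = new();
    private readonly Timer _flushTimer;
    private readonly object _flushLock = new();
    private readonly string _connectionString;

    public BatchingIndexer(string connectionString)
    {
        _connectionString = connectionString;
        // Flush at most every 5 seconds instead of once per watcher event.
        _flushTimer = new Timer(_ => Flush(), null,
            TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(5));
    }

    // Called from the FileSystemWatcher event handler; cheap and non-blocking.
    public void Enqueue(string path) =>
        _queue.Enqueue(new PendingChange(path, DateTime.UtcNow));

    private void Flush()
    {
        if (_queue.IsEmpty) return;
        lock (_flushLock)   // prevent overlapping timer callbacks
        {
            using var conn = new SQLiteConnection(_connectionString);
            conn.Open();
            using var tx = conn.BeginTransaction();  // one transaction per batch
            using var cmd = conn.CreateCommand();
            cmd.CommandText = "INSERT INTO files(path, seen) VALUES (@p, @s)";
            var p = cmd.Parameters.Add("@p", System.Data.DbType.String);
            var s = cmd.Parameters.Add("@s", System.Data.DbType.DateTime);
            while (_queue.TryDequeue(out var change))
            {
                p.Value = change.Path;
                s.Value = change.Seen;
                cmd.ExecuteNonQuery();
            }
            tx.Commit();
        }
    }

    public void Dispose() { _flushTimer.Dispose(); Flush(); }
}
```

Wrapping the batch in one transaction is what matters most with SQLite: each standalone INSERT is its own transaction and forces a sync to disk, so hundreds of per-event inserts are dramatically slower than one transaction containing the same rows.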
If you want to be extra safe, you can check CPU and I/O load before performing the inserts.
Here is a code snippet for determining this process's CPU time (sample it twice over an interval and divide the delta by wall-clock time to get a utilization percentage):
Process.GetCurrentProcess().TotalProcessorTime.TotalMilliseconds / Environment.ProcessorCount
The easiest approach is to have each update kick off a timer (say, one minute). If another update comes in in the meantime, you queue the change and restart the timer. Only when a minute has gone by without activity do you start processing.
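A sketch of that debounce idea: each watcher event queues the change and resets a quiet-period timer, and processing only runs after a full minute with no new events (the one-minute interval is the answer's example; the class and method names are my own):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;

class DebouncedWatcher
{
    private readonly ConcurrentQueue<string> _pending = new();
    private readonly System.Timers.Timer _quietTimer;

    public DebouncedWatcher(FileSystemWatcher watcher)
    {
        // AutoReset = false: fires once per quiet period.
        _quietTimer = new System.Timers.Timer(60_000) { AutoReset = false };
        _quietTimer.Elapsed += (_, _) => ProcessPending();
        watcher.Changed += OnChanged;
        watcher.Created += OnChanged;
        watcher.Deleted += OnChanged;
    }

    private void OnChanged(object sender, FileSystemEventArgs e)
    {
        _pending.Enqueue(e.FullPath);
        _quietTimer.Stop();   // every new event restarts the quiet period
        _quietTimer.Start();
    }

    private void ProcessPending()
    {
        while (_pending.TryDequeue(out var path))
        {
            // insert/update the database entry for `path` here
        }
    }
}
```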
Related
I'm trying to come up with a repository that can by itself bundle several items to be added to the database, and then perform a bulk insert operation in certain intervals. Something like
public interface MyRepository
{
void InsertDeferred(DTO.Item item);
}
The repository would bundle all items and, every 30 seconds or so, perform a bulk insert operation into the underlying SQLite database (via EFCore) with all the items that were added since the last flush.
Is there any pattern that makes this as fail-safe as possible? Otherwise, if the application shuts down (expectedly or unexpectedly) just before the next flush, a lot of data might be lost...
I think the major logging libraries (log4net etc.), for example, also optionally flush their writes only at intervals. Does anyone know whether they came up with a clever way to prevent or minimize data loss?
Thanks!
I would write to a BlockingCollection, and have a single thread reading from there and writing to disk. You can control the potential data loss with the size of the BlockingCollection, and whether you have the reading thread wait for a certain interval or minimum number of items before flushing to disk.
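A sketch of that consumer loop (the batch cap, the 30-second wait, and the `writeBatch` delegate are assumptions for illustration; in practice the delegate would wrap the EFCore bulk insert):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class DeferredWriter<T>
{
    private readonly BlockingCollection<T> _buffer;
    private readonly Action<IReadOnlyList<T>> _writeBatch;

    public DeferredWriter(int capacity, Action<IReadOnlyList<T>> writeBatch)
    {
        // Bounded: capacity caps both memory use and worst-case loss on a crash.
        _buffer = new BlockingCollection<T>(capacity);
        _writeBatch = writeBatch;
        new Thread(ConsumeLoop) { IsBackground = true }.Start();
    }

    public void InsertDeferred(T item) => _buffer.Add(item);

    private void ConsumeLoop()
    {
        var batch = new List<T>();
        while (true)   // sketch only; a real version needs shutdown handling
        {
            // Wait up to 30s for the first item, then drain whatever is queued.
            if (_buffer.TryTake(out var first, TimeSpan.FromSeconds(30)))
            {
                batch.Add(first);
                while (batch.Count < 1000 && _buffer.TryTake(out var more))
                    batch.Add(more);
                _writeBatch(batch);
                batch.Clear();
            }
        }
    }
}
```

With a bounded collection, a crash loses at most `capacity` plus the in-flight batch; if even that is unacceptable, append each item to a simple journal file before queueing and replay it on startup.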
I have a c# console application running on a 64-bit Windows 2008 r2 server which also hosts MSSQL Server 2005.
This application runs through text files, reads the line, splits the line values into variables, and inserts the data into a SQL database hosted at localhost.
Each Text file is a new thread, each line is a new thread, and each SQL insert statement is executed under a new thread.
I am counting the number of each of these types of threads and decrementing when they complete. I'm wondering what the best way is to "pend" future threads from opening...
For example.. before a new SQL insert thread is opened I'm calling...
while (numberOfCurrentThreads > specifiedNumberOfThreads)
{
    // wait (busy-wait until a slot frees up)
}
new Thread(insertSQL).Start();
Where specifiedNumberOfThreads has been estimated to a value that does not throw System.OutOfMemoryExceptions. A lot of guesswork has gone into determining that number for each process.
My question is: is there a more 'efficient' or proper way to do this? Is there a way to read system memory, not physical memory, and wait based on a specified resource allotment?
To illustrate this idea...
while (System.MemoryInUse > System.MemoryAvailable / 2 || System.OutOfMemory)  // pseudocode; these APIs don't exist
{
    // wait
}
new Thread(insertSQL).Start();
The current method I am employing works and completes in a decent time, but it could do better. Some of the text files going through the process are much larger than others and do not necessarily make the best use of system resources.
For example, processing two text files at a time works perfectly when both files are under 300 KB. It does not work so well if one or both are over 100,000 KB.
There also seems to be a sweet spot where things process most efficiently, somewhere around 75% of all CPU resources. Crank these values too high and it will run at 100% CPU but process far slower, as it cannot keep up.
It's crazy to be creating a new thread for every file and for every line and for every SQL insert statement. You'd probably be much better off using three threads and a chained producer-consumer model, all of which communicate through thread-safe queues. In C#, that would be BlockingCollection.
First, you set up two queues, one for lines that have been read from a text file, and one for lines that have been processed:
const int MaxQueueSize = 10000;
BlockingCollection<string> _lines = new BlockingCollection<string>(MaxQueueSize);
BlockingCollection<DataObject> _dataObjects = new BlockingCollection<DataObject>(MaxQueueSize);
DataObject, by the way, is what I'm calling the object that you'll be inserting into the database. You don't say what that is. It doesn't really matter for the purposes of this discussion, but you'd replace it with whatever type you use to represent the processed string.
Now, you create three threads:
A thread that reads text files line-by-line and places the lines into the _lines queue.
A line processor that reads lines one-by-one from the _lines queue, processes it, and creates a DataObject which it then places on the _dataObjects queue.
A thread that reads the _dataObjects queue and inserts them into the database.
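Those three stages can be wired together with `GetConsumingEnumerable`, which handles the blocking and the end-of-stream signaling for you (the file list, the `DataObject` record, and the trivial "parse" step are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

record DataObject(string Payload);  // stands in for your parsed line type

class Pipeline
{
    const int MaxQueueSize = 10000;
    static readonly BlockingCollection<string> _lines = new(MaxQueueSize);
    static readonly BlockingCollection<DataObject> _dataObjects = new(MaxQueueSize);

    static void Main(string[] args)
    {
        var reader = Task.Run(() =>
        {
            foreach (var file in args)                 // one reader thread for all files
                foreach (var line in File.ReadLines(file))
                    _lines.Add(line);                  // blocks when the queue is full
            _lines.CompleteAdding();                   // signals end-of-input downstream
        });

        var processor = Task.Run(() =>
        {
            foreach (var line in _lines.GetConsumingEnumerable())
                _dataObjects.Add(new DataObject(line.Trim()));  // real parsing goes here
            _dataObjects.CompleteAdding();
        });

        var writer = Task.Run(() =>
        {
            foreach (var dto in _dataObjects.GetConsumingEnumerable())
            {
                // insert dto into the database, ideally accumulated into batches
            }
        });

        Task.WaitAll(reader, processor, writer);
    }
}
```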
Beyond simplicity (and this is very easy to put together), there are many benefits to this model.
First, having more than one thread reading from the disk concurrently usually leads to slower performance because the disk drive can only do one thing at a time. Having multiple threads hitting the disk at the same time just causes unnecessary head seeks. Just one thread will keep your input queue full.
Second, limiting the queues' sizes will prevent you from running out of memory. When the disk-reading thread tries to insert the 10,001st item into the queue, it will wait until the processing thread removes an item. That's the "blocking" part of BlockingCollection.
You might find that you can speed your SQL inserts by grouping them and sending a bunch of records at once, doing what is essentially a bulk insert of 100 or 1000 records at a time rather than sending 100 or 1000 individual transactions.
This solution prevents the problem of too many threads. You have a fixed number of threads, all of which are running as fast as they possibly can. And memory use is constrained by limiting the number of things that can be in the queues.
The solution also scales rather well. If you have files on multiple drives, you can add a second file reading thread to read the files from that other physical drive and places the lines in the same queue. BlockingCollection supports multiple producers and multiple consumers, so adding another producer is no trouble at all.
The same goes for consumers. If you find that the processing step is the bottleneck, you can add another processing thread. It, too, will read from the _lines queue and write to the _dataObjects queue.
However, having more threads than you have processor cores will likely make your program slower. If you have a four-core processor, creating 8 processing threads won't do you any good. It will make things slower because the operating system will be spending a lot of time on thread context switches rather than on doing useful work.
You'll have to do a little tuning to get the best performance. Queue sizes should be large enough to support continuous workflow (so no thread is starved of work, or spends too much time waiting for the output queue), but not so large to overfill memory. Depending on the relative speed of the three stages, one of the queues might have to be larger than the other. If one of the three stages is a bottleneck, you can add another thread to help at that stage.
I created a simple example of this model using text files for input and output. It should be pretty easy to extend for your situation. See Simple Multithreading, and the follow up, Part 2.
I have a network server which has a share that exposes a set of files. These files are consumed by processes that are running on multiple servers, sometimes several processes on the same machine.
The set of files are updated a couple times a day, and the set of files are fairly large.
We are attempting to reduce the bandwidth used by these processes retrieving these filesets by making processes that are on the same machine share the same fileset.
In order to do this, we want each process on the same machine to coordinate with the other processes that need the same files so that only one will attempt to download the files, and then the files will be shared by all the processes once complete.
Additionally, we need to prevent the server from performing an update on the fileset while a download is in progress.
In order to facilitate this requirement, I created a file lock class. This class opens a file called .lock in the specified location. The file is opened as read/write so that it will prevent another process from doing the same, regardless of what machine the process is running on. This is enclosed in a try/catch so that if the file is already locked, the exception is caught and the lock is not acquired. This already works correctly.
The problem I am trying to solve is that if a process hangs for some reason while it has the lock, all the other processes will indefinitely fail to sync these files because they cannot acquire the lock.
One solution we were exploring today was a multi-lock setup, where each lock file would have a GUID in its name; instead of fighting over a single hard lock, any number of locks could be acquired on request. However, processes would be responsible for making sure there is only one lock set when they begin a download. That way, if a process holding a lock hangs, we can consider its lock expired after a certain time limit, and nothing prevents a new process from requesting a lock alongside the hung one.
The problem here is that the creation of these multi locks needs to be synchronized between processes or else there could be a race condition on the creation and checking of the lock count.
I don't see a way to synchronize this without reintroducing a hard locking mechanism like the first solution, but then we are right back where we started where a hung process will block the others from doing a download.
Any suggestions?
A common way to tackle this is to use some sort of shareable lock file, with the real locking logic performed via its content. For example, consider an SQLite database file with a single table as a lock file, something like:
CREATE TABLE lock (
id INTEGER PRIMARY KEY AUTOINCREMENT,
host TEXT,
pid INTEGER,
expires INTEGER
)
A consumer (or the producer, for an update to the fileset) requests a lock by INSERTing a row into the table.
A process heartbeats by UPDATEing its own row, pushing its expiry into the future so the lock never expires while the process is alive.
Expired rows are discarded: crashed processes stop updating, and eventually their lock is discarded.
The lowest id holds the lock.
Processes on the same host can evaluate the host field to find out whether another process on the same machine already wants to copy, making it unnecessary to request another copy.
Of course this can be done via a database server (or indeed a locking server) instead of a database file if feasible, but the SQLite method has the advantage of requiring nothing more than file access.
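The acquire/heartbeat/reap cycle against that table might look like this in SQL (the 30-second lease, host name, and pid values are assumed for illustration; each statement maps directly onto whichever SQLite binding you use):

```sql
-- Request the lock: insert a row with a lease ~30 seconds in the future.
INSERT INTO lock (host, pid, expires)
VALUES ('myhost', 1234, strftime('%s','now') + 30);

-- Heartbeat: periodically extend our own lease while we are alive.
UPDATE lock SET expires = strftime('%s','now') + 30
WHERE host = 'myhost' AND pid = 1234;

-- Reap leases that crashed processes stopped renewing.
DELETE FROM lock WHERE expires < strftime('%s','now');

-- We hold the lock iff our row has the lowest surviving id.
SELECT id FROM lock ORDER BY id LIMIT 1;
```

Because SQLite serializes writers at the file level, the insert-then-check sequence does not race the way ad hoc lock-file counting would.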
The trick here is good use of caching.
The designated "download" process that updates the fileset should first grab it from the remote location and store it in a temp file. Then it should simply continue attempting to acquire a read/write lock on the local file(s) you want to replace. When it succeeds, do the swap and drop the lock. This part should go very very quickly.
Also, it is quite unlikely to "hang" when doing a simple file copy on a local drive. Meaning the other dependent processes will be able to continue functioning regardless of what happens with this one.
To make sure the downloading process is functioning correctly, you'll need a monitoring program that pings the download process every so often to ensure it's responsive. If it isn't, alert someone.
I currently have a C# console app where multiple instances run at the same time. The app accesses values in a database and processes them. While a row is being processed it is flagged so that no other instance attempts to process it at the same time. My question is: what is an efficient and graceful way to unflag those values in the event that an instance of the program crashes? If an instance crashed, I would only want to unflag the values currently being processed by that instance of the program.
Thanks
The potential solution will depend heavily on how you start the console applications.
In our case, the applications are started based on configuration records in the database. When one of these applications performs a lock, it uses the primary key from the database configuration record to perform the lock.
When the application starts up, the first thing it does is release all locks on the records that it previously locked.
To control all of the child processes, we have a service that uses the information from the configuration tables to start the processes and then keeps an eye on them, restarting them when they fail.
Each of the processes is also responsible for updating a status table in the database with the last time it was available with a maximum allowed delay of 2 minutes (for heavy processing). This status table is used by sysadmins to watch for problems, but it could also be used to manually release locks in case of a repeating failure in a given process.
If you don't have a structured approach like this, it could be very difficult to automatically unlock records unless you have a solid profile of your application performance that would allow you to know that any lock over 5 minutes old is invalid because it should only take, on average, 15 seconds to process a record with a maximum of 2 minutes.
To be able to handle any kind of crash, even a power failure, I would suggest additionally timestamping the records, and after some reasonable timeout treating records as unlocked even if they are still flagged.
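That timestamp effectively turns the flag into a lease. A claim query along these lines (table and column names, the instance identifier, and the 5-minute timeout are all illustrative assumptions; the date function shown is SQLite-flavored) both claims fresh rows and steals stale ones in a single atomic statement:

```sql
-- Claim one unworked (or stale) row; the timestamp doubles as the lease.
UPDATE work_items
SET claimed_by = 'instance-42',
    claimed_at = CURRENT_TIMESTAMP
WHERE id = @id
  AND (claimed_by IS NULL
       OR claimed_at < DATETIME('now', '-5 minutes'));  -- stale lease: safe to steal
```

Checking the affected-row count tells the instance whether it actually won the row, so no separate "unflag crashed rows" pass is needed.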
We're developing an application which reads data from a number of external hardware devices continuously. The data rate is between 0.5MB - 10MB / sec, depending on the external hardware configuration.
The reading of the external devices is currently being done on a BackgroundWorker. Trying to write the acquired data to disk with this same BackgroundWorker does not appear to be a good solution, so what we want to do is, to queue this data to be written to a file, and have another thread dequeue the data and write to a file. Note that there will be a single producer and single consumer for the data.
We're thinking of using a synchronized queue for this purpose. But we thought this wheel must have been invented so many times already, so we should ask the SO community for some input.
Any suggestions or comments on things that we should watch out for would be appreciated.
I would do a combination of what mr 888 does.
Basically, you have 2 background workers:
one that reads from the hardware device,
one that writes the data to disk.
Hardware background worker:
Adds chunks of data from the hardware to the Queue<>, in whatever format you have it in.
Write background worker:
Parses the data if needed and dumps it to disk.
One thing to consider here is: is getting the data from the hardware to disk as fast as possible important?
If yes, then I would have the write background worker run in a loop, sleeping 10-100 ms per iteration and checking whether the queue has data.
If no, then I would have it sleep a similar amount (making the assumption that the rate you get from your hardware changes periodically) and only write to disk once it has accumulated around 50-60 MB of data. I would consider doing it this way because a modern desktop hard drive can write about 60 MB per second (your enterprise ones could be much quicker), and constantly writing data to it in small chunks is a waste of I/O bandwidth.
I am pretty confident that your queue will be pretty much OK. But make sure that you use an efficient method of storing/retrieving the data, so you don't bog down your logging procedure with memory allocation/deallocation. I would go for a pre-allocated memory buffer, used as a circular queue.
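A minimal sketch of such a pre-allocated circular queue for the single-producer/single-consumer case described in the question (the fixed capacity, fixed slot size, and byte[] slot type are assumptions; fixed-size slots are what keep the steady state allocation-free):

```csharp
using System;
using System.Threading;

// Single-producer/single-consumer ring buffer over pre-allocated slots,
// so the steady state allocates nothing per item.
class SpscRingBuffer
{
    private readonly byte[][] _slots;
    private int _head, _tail;   // head: next write slot, tail: next read slot

    public SpscRingBuffer(int capacity, int slotSize)
    {
        _slots = new byte[capacity][];
        for (int i = 0; i < capacity; i++)
            _slots[i] = new byte[slotSize];   // allocate everything up front
    }

    // Producer thread only. Chunk must fit in a slot; returns false if full.
    public bool TryWrite(ReadOnlySpan<byte> chunk)
    {
        int head = _head;
        int next = (head + 1) % _slots.Length;
        if (next == Volatile.Read(ref _tail)) return false;   // buffer full
        chunk.CopyTo(_slots[head]);
        Volatile.Write(ref _head, next);   // publish only after the copy
        return true;
    }

    // Consumer thread only. Returns false if empty.
    public bool TryRead(Span<byte> destination)
    {
        int tail = _tail;
        if (tail == Volatile.Read(ref _head)) return false;   // buffer empty
        _slots[tail].AsSpan().CopyTo(destination);
        Volatile.Write(ref _tail, (tail + 1) % _slots.Length);
        return true;
    }
}
```

The Volatile reads/writes are enough for the one-writer/one-reader case; with more than one producer or consumer you would need locks or Interlocked operations instead.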
You might need queuing,
e.g. code:
protected Queue<byte[]> myQ;
or
protected Queue<Stream> myQ;
// when you have the content:
myQ.Enqueue(...);
and use another thread to drain the queue:
// another thread
protected void Logging()
{
    while (true)
    {
        while (myQ.Count > 0)
        {
            var content = myQ.Dequeue();
            // save content
        }
        System.Threading.Thread.Sleep(1000);
    }
}
Note that Queue<T> is not thread-safe, so either lock around Enqueue/Dequeue or use ConcurrentQueue<T> instead.
I have a similar situation; in my case I used an asynchronous lock-free queue with a LIFO of synchronization objects.
Basically, the threads that write to the queue set a sync object in the LIFO, while the other 'worker' threads reset a sync object in the LIFO.
We have a fixed number of sync objects, equal to the number of threads. The reason for using a LIFO is to keep the minimum number of threads running and make better use of the cache system.
Have you tried MSMQ?