I have not worked with the Queue collection yet, but based on the information I was able to gather it seems like the right approach to my problem.
I have a console app that scans a folder for new files of a certain type, based on specific criteria. Only new items are added to the queue.xml file. This is done at a set interval (every hour).
Another console app is triggered on a different schedule (every 4 hours). It reads the queue.xml file and passes each item on for processing. It seems the best way is to parse the XML file and create a Queue collection, so that each item is processed in order.
Here is the problem: processing a file can take a couple of hours, and during that time queue.xml may gain new items, so the Queue collection will not reflect those changes.
Is it possible to parse the XML file again and add new items to a Queue that is currently being processed?
Changing the size of a collection at runtime can cause problems. Is a Queue different in that way?
Is it possible to parse the XML file again and add new items to a Queue that is currently being processed?
Of course, you just have to define the rules by which it is safe for this to happen.
Use a mutex in both applications to lock the file during read/write, and in your processing application subscribe to a FileSystemWatcher event to detect when the file has changed.
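A rough sketch of that idea (the mutex name, file path, and XML element names are just placeholders for whatever your two apps agree on):

```csharp
using System;
using System.IO;
using System.Threading;
using System.Xml.Linq;

static class QueueFile
{
    // Both console apps open the same named mutex, so only one of them
    // touches queue.xml at a time.
    static readonly Mutex FileMutex = new Mutex(false, @"Global\QueueXmlMutex");

    public static void AppendItem(string path, string item)
    {
        FileMutex.WaitOne();
        try
        {
            var doc = File.Exists(path)
                ? XDocument.Load(path)
                : new XDocument(new XElement("queue"));
            doc.Root.Add(new XElement("item", item));
            doc.Save(path);
        }
        finally
        {
            FileMutex.ReleaseMutex();
        }
    }

    public static FileSystemWatcher WatchForChanges(string folder, string fileName, Action onChanged)
    {
        var watcher = new FileSystemWatcher(folder, fileName)
        {
            NotifyFilter = NotifyFilters.LastWrite
        };
        watcher.Changed += (s, e) => onChanged();   // re-parse and enqueue any new items here
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```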
Changing the size of a collection at runtime can cause problems. Is a Queue different in that way?
It can be safe to change the size of any collection at run time; that's usually why you use a collection in the first place (they have an Add() method for a reason). You just have to do it safely, in the context of your solution.
If there is multi-thread access to the queue, lock it.
If there is a chance that the queue size can change during iteration, iterate over a copy of the queue (see the sketch after this list).
If there is a chance that a process can change a file required by both applications, mutex it to control access.
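For the first two points, a minimal sketch might look like this (the string item type is just a stand-in for your real work item):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class WorkQueue
{
    private readonly object _sync = new object();
    private readonly Queue<string> _items = new Queue<string>();

    public void Enqueue(string item)
    {
        lock (_sync) { _items.Enqueue(item); }   // safe to grow at any time
    }

    public void ProcessPending(Action<string> process)
    {
        // Take a snapshot under the lock so items added while we are
        // processing don't invalidate the iteration.
        List<string> snapshot;
        lock (_sync)
        {
            snapshot = _items.ToList();
            _items.Clear();
        }

        foreach (var item in snapshot)
            process(item);
    }
}
```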
I have a service that, when invoked, performs expensive operations on a large dataset.
The dataset is a list of items, i.e. something like a List<Item> which contains an average of a few million Item instances.
All Item instances in the list are different from each other, and the service executes the same method, Process(Item item), on each of them. The Process(Item item) method is mostly CPU-bound; however, it requires exclusive access to a file on the file system to process the given Item correctly, which means the items in the list cannot be processed in parallel.
Due to the large amount of data that needs to be processed, I am looking into a way to improve the performance by processing the items in parallel.
A simple (but not elegant) way to do that would be to make a few copies of the file and run an equal number of threads: this would allow me to process as many Item instances in parallel as the number of file copies I make.
However, I wish to find a cleaner and more elegant way, as I don't want to manually handle those file copies.
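To be concrete, that simple approach would look roughly like this (the generic item type, the worker count, and the copy naming are simplified placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Rough sketch of the "N file copies, N workers" idea.
static class ParallelWithCopies
{
    public static void ProcessAll<TItem>(BlockingCollection<TItem> items,
                                         string sourceFile,
                                         int workerCount,
                                         Action<TItem, string> process)
    {
        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            string copy = $"{sourceFile}.worker{i}";
            File.Copy(sourceFile, copy, overwrite: true);   // one private copy per worker
            workers[i] = Task.Run(() =>
            {
                // Each worker has exclusive access to its own copy of the file.
                foreach (var item in items.GetConsumingEnumerable())
                    process(item, copy);
            });
        }
        Task.WaitAll(workers);
    }
}
```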
To do that, I am looking into using Docker containers and Kubernetes.
In such a setup, the Docker image would include both the service runtime as well as the file, so that each container (or Pod) that is created from that image would have its own copy of the file.
The question:
At this point, I am mostly missing how to orchestrate the processing of the Item instances across the various containers in a solid way.
How to do that?
Note that a similar question was raised in this StackOverflow question, and most answers suggested relying on Kubernetes liveness and readiness probes to prevent traffic from being routed to a given Pod, in my case a Pod that is already processing an Item instance.
However, I don't think probes were designed to be used this way, and the approach feels more like a hack to me, so I am looking for a more solid solution to better control how the Item instances are processed.
I'm still relatively new to TPL Dataflow, and not 100% sure if I am using it correctly or if I'm even supposed to use it.
I'm trying to employ this library to help out with file copying and file uploading.
Basically the structure/process of handling files in our application is as follows:
1) The end user will select a bunch of files from their disk and choose to import them into our system.
2) Some files have higher priority, while the others can complete at their own pace.
3) When a bunch of files is imported, here is the process:
Queue these import requests, one request maps to one file
These requests are stored into a local sqlite db
These requests also explicitly indicate if it demands higher priority or not
We currently have two active threads running (one to manage higher priority and one for lower)
They go into a waiting state until signalled.
When new requests come in, they get signalled to dig into the local db to process the requests.
Both threads are responsible for copying the file to a separate cached location, so it is just a simple File.Copy call. The difference is that one thread does the actual File.Copy call immediately, while the other thread just enqueues the copies onto the ThreadPool to run.
Once the files are copied, the request gets updated; the request has a Status enum property with different states like Copying, Copied, etc.
The request also requires a ServerTimestamp to be set. The ServerTimestamp is important because a user may be saving changes to what is essentially the same file in different versions, so the order is important.
Another separate thread is running that gets signalled to fetch requests from the local DB where the status is Copied. It will then ping an endpoint to ask for a ServerTimestamp and update the request with it.
Lastly, once the file copy is complete and the server timestamp is set, we can upload the physical file to the server.
So I'm toying around with using TransformBlocks.
1- File Copy TransformBlock
I'm thinking there could be two File Copy TransformBlocks: one for higher priority and one for lower priority.
My understanding is that it uses TaskScheduler.Current, which uses the ThreadPool behind the scenes. I was thinking maybe a custom TaskScheduler that spawns a new thread on the fly; this scheduler could be used for the higher-priority file copy block.
2- ServerTimestamp TransformBlock
So this one will be linked to the first block; it will take in all the copied files, get the server timestamp, and set it in the request.
3- UploadFile TransformBlock
This will upload the file.
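Roughly, the pipeline I have in mind would be wired up something like this (FileRequest, the status enum, and the helper delegates are simplified stand-ins for our actual types):

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

enum RequestStatus { Pending, Copying, Copied, Stamped, Uploaded }

class FileRequest
{
    public string SourcePath { get; set; }
    public RequestStatus Status { get; set; }
    public DateTime ServerTimestamp { get; set; }
}

static class ImportPipeline
{
    public static ITargetBlock<FileRequest> Build(
        Func<FileRequest, FileRequest> copyFile,
        Func<FileRequest, Task<DateTime>> getServerTimestampAsync,
        Func<FileRequest, Task> uploadAsync)
    {
        // 1- copy the file to the cache location
        var copyBlock = new TransformBlock<FileRequest, FileRequest>(request =>
        {
            var copied = copyFile(request);
            copied.Status = RequestStatus.Copied;
            return copied;
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 }); // preserve order

        // 2- ask the server for a timestamp
        var timestampBlock = new TransformBlock<FileRequest, FileRequest>(async request =>
        {
            request.ServerTimestamp = await getServerTimestampAsync(request);
            request.Status = RequestStatus.Stamped;
            return request;
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

        // 3- upload the physical file
        var uploadBlock = new ActionBlock<FileRequest>(uploadAsync);

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        copyBlock.LinkTo(timestampBlock, link);
        timestampBlock.LinkTo(uploadBlock, link);

        return copyBlock;   // Post/SendAsync requests to this block
    }
}
```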
Problems I'm facing:
Say for example we have 5 file requests enqueued in the local db.
File1
File2
File3-v1
File3-v2
File3-v3
We Post/SendAsync all 5 requests to the first block.
If File1, File2, File3-v1, and File3-v3 succeed but File3-v2 fails, I want the flow to stop before the next ServerTimestamp block, because it's important that the File3 versions are completely copied before proceeding, or else they will go out of order.
But that leads to the question of how it would retry correctly while still letting the other four files that have already been copied move on to the next block.
I'm not sure if I am structuring this correctly or if TPL Dataflow supports my usecase.
So I am creating a customer indexing program for my company, and I have basically everything coded and working, except that I want the indexing program to watch the user-specified indexed directories and update the underlying data store on the fly, to help eliminate the need for frequent full indexing.
I coded everything in WPF/C# with an underlying SQLite database, and I am sure the folder watchers would work well under "non-heavy" loads. The problem is that we use TortoiseSVN, and when the user does an SVN Update or similar, it creates a heavy file load that the FileSystemWatcher and SQLite updates just can't keep up with (even with the maximum buffer size). Basically I am doing a database insert every time a watcher event is hit.
So my main question is...does anyone have suggestions on how to implement this file watcher to handle such heavy loads?
Some thoughts I had were: (1) create a staging collection for all the queries and use a timer and thread to insert the data later on; (2) write the queries to a file and use a timer thread later on for the insert.
Help....
You want to buffer the data received from your file watcher events in memory. When the events arrive from your registered file watchers, you can accumulate them in memory as fast as possible during a burst of activity. Then, on a separate process or thread, you read them from your in-memory buffer and do whatever you need for persistent storage or whatever processing is more time-intensive.
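A rough sketch of that shape (the batch size and the InsertBatch call are placeholders for your actual SQLite write):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class BufferedIndexWatcher
{
    private readonly BlockingCollection<string> _buffer = new BlockingCollection<string>();

    public void Start(string folder)
    {
        var watcher = new FileSystemWatcher(folder) { IncludeSubdirectories = true };
        watcher.Created += (s, e) => _buffer.Add(e.FullPath);   // cheap: just queue the path
        watcher.Changed += (s, e) => _buffer.Add(e.FullPath);
        watcher.EnableRaisingEvents = true;

        // Drain on a separate thread; the slow database work never blocks the watcher.
        Task.Run(() =>
        {
            var batch = new List<string>();
            foreach (var path in _buffer.GetConsumingEnumerable())
            {
                batch.Add(path);
                if (_buffer.Count == 0 || batch.Count >= 500)
                {
                    InsertBatch(batch);   // placeholder for your SQLite insert(s)
                    batch.Clear();
                }
            }
        });
    }

    private void InsertBatch(List<string> paths) { /* one transaction for the whole batch */ }
}
```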
You can use a queue to hold all the requests. I have had good experience with the MS MessageQueue (MSMQ), which comes out of the box and is quite easy to use.
see http://www.c-sharpcorner.com/UploadFile/rajkpt/101262007012217AM/1.aspx
Then have a separate worker thread which grabs a predefined number of elements from the queue and inserts them into the database. Here I'd suggest merging the single inserts into one bulk insert.
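For the bulk-insert part, something along these lines (the table and column names are just examples, and I'm assuming the System.Data.SQLite provider; adapt the dequeue side to whatever queue you use):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SQLite;   // assuming the System.Data.SQLite provider

static class BulkInserter
{
    // Inserts a whole batch inside one transaction instead of one insert per event.
    public static void InsertBatch(string connectionString, IEnumerable<string> paths)
    {
        using (var conn = new SQLiteConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            using (var cmd = new SQLiteCommand(
                "INSERT INTO FileIndex (Path) VALUES (@path)", conn, tx))
            {
                var param = cmd.Parameters.Add("@path", DbType.String);
                foreach (var path in paths)
                {
                    param.Value = path;
                    cmd.ExecuteNonQuery();
                }
                tx.Commit();   // one commit for the whole batch
            }
        }
    }
}
```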
If you want to be extra careful, you can check CPU and I/O load before making the inserts.
Here is a snippet for determining the CPU time used by the current process, per core:
Process.GetCurrentProcess().TotalProcessorTime.TotalMilliseconds / Environment.ProcessorCount
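If you want an actual utilization percentage rather than cumulative CPU time, sample it over a short window, roughly like this:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class CpuLoad
{
    // Rough utilization of the current process over a short sampling window.
    public static double SampleProcessCpuPercent(int sampleMs = 500)
    {
        var proc = Process.GetCurrentProcess();
        TimeSpan cpuBefore = proc.TotalProcessorTime;
        var wall = Stopwatch.StartNew();

        Thread.Sleep(sampleMs);

        proc.Refresh();
        double cpuMs = (proc.TotalProcessorTime - cpuBefore).TotalMilliseconds;
        return 100.0 * cpuMs / (wall.ElapsedMilliseconds * Environment.ProcessorCount);
    }
}
```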
The easiest is to have the update kick off a timer (say, one minute). If another update comes in in the meantime, you queue the change and restart the timer. Only when a minute has gone by without activity do you start processing.
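A rough sketch of that debounce with System.Timers.Timer (the one-minute quiet period is just an example):

```csharp
using System.Collections.Concurrent;
using System.Timers;

class DebouncedIndexer
{
    private readonly ConcurrentQueue<string> _pending = new ConcurrentQueue<string>();
    private readonly Timer _quietTimer;

    public DebouncedIndexer(double quietPeriodMs = 60000)   // one minute, as an example
    {
        _quietTimer = new Timer(quietPeriodMs) { AutoReset = false };
        _quietTimer.Elapsed += (s, e) => ProcessPending();
    }

    // Call this from every FileSystemWatcher event: the change is queued and the
    // timer restarts, so processing only begins after a full quiet period.
    public void OnChangeDetected(string path)
    {
        _pending.Enqueue(path);
        _quietTimer.Stop();
        _quietTimer.Start();
    }

    private void ProcessPending()
    {
        while (_pending.TryDequeue(out var path))
        {
            // insert into the database here, ideally batched in one transaction
        }
    }
}
```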
In an Outlook AddIn I'm working on, I use a list to grab all the messages in the current folder, then process them, then save them. First, I create a list of all messages, then I create another list from the list of messages, then finally I create a third list of messages that need to be moved. Essentially, they are all copies of each other, and I made it this way to organize it. Would it increase performance if I used only one list? I thought lists were just references to the actual item.
Without seeing your code it is impossible to tell if you are creating copies of the list itself or copies of the reference to the list - the latter is preferable.
Another thing to consider is whether or not you could stream the messages from Outlook using an iterator block. By using a List<T> you are currently buffering the entire sequence of messages which means you must hold them all in memory, processing them one at a time. Streaming the messages would reduce the memory pressure on your application as you would only need to hold each message in memory long enough to process it.
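For example, something along these lines (the message type and the predicate are placeholders for your actual Outlook items and filtering logic):

```csharp
using System;
using System.Collections.Generic;

static class MessageStreaming
{
    // An iterator block defers the work: each message is produced, processed,
    // and released one at a time instead of being buffered in multiple lists.
    public static IEnumerable<TMessage> MessagesToMove<TMessage>(
        IEnumerable<TMessage> folderItems, Func<TMessage, bool> needsMove)
    {
        foreach (var message in folderItems)
        {
            if (needsMove(message))
                yield return message;
        }
    }
}
```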
Unless your lists contain 10 million items or more, it should not be a problem.
Outlook seems to have problems with much smaller mailboxes than that, so I would say you are pretty much safe.
I have a data structure, specifically a queue, that is growing so large it causes an out-of-memory exception. This was unexpected behavior given the relatively simple object it is holding (essentially a single string field).
Is there an easy way, or a built-in .NET way, to save this collection to a file on disk (there is no database to work with) and have it continue to function transparently as a queue?
Maybe a queue is not an appropriate data structure for your problem. How many objects are in your queue? Do you really need to store all those strings for later? Could you replace the strings with something smaller like enums or something more object-oriented like the Flyweight design pattern?
If you are processing lots of data, sometimes it's faster to recompute or reload the original data than saving a copy for later. Or you can process the data as you load it and avoid saving it for later processing.
I would first investigate why you are getting the OOM.
If you are adding to the queue, keep a check on its size and take some action when a threshold is breached.
Can you filter those items? Do the items have many duplicates? If so, you could replace the duplicates with a single pre-cached object.
I would use SQLite to save the data to disk.
In response to your comment on the question, I guess you could split your file-collecting thread into two threads:
The first thread merely counts the number of files to be processed and increments a volatile int count. This thread only updates the count; it does not store anything in the queue.
The second thread is very similar to the first one, except that it doesn't update the count, and instead it actually saves the data into the queue. When the size of the queue reaches a certain threshold, your thread should block for some time, and then resume adding data to the queue. This ensures that your queue is never larger than a certain threshold.
I would even guess that you wouldn't actually need the second thread. Since the first one would give you the count you need, you can find the actual files in your main thread, one file at a time. This way you'll save yourself from having the queue, which will reduce your memory requirements.
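If you do keep the second thread, a BlockingCollection with a bounded capacity gives you the threshold blocking described above for free; a rough sketch (the capacity and the string payload are just examples):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class BoundedWorkQueue
{
    // Add() blocks the producer once 10,000 items are queued, so the queue
    // can never grow past the threshold.
    private readonly BlockingCollection<string> _queue =
        new BlockingCollection<string>(boundedCapacity: 10000);

    public void Produce(IEnumerable<string> files)
    {
        foreach (var file in files)
            _queue.Add(file);
        _queue.CompleteAdding();
    }

    public Task StartConsumer(Action<string> process) =>
        Task.Run(() =>
        {
            foreach (var file in _queue.GetConsumingEnumerable())
                process(file);
        });
}
```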
However, I doubt that your queue is the reason you're getting out of memory exceptions. Even if you're adding one million entries to the queue, this would only take about 512 MB memory. I suggest you check your processing logic.
The specific answer is no, there is not an easy or built-in way to do this. You have to write it to disk "yourself".
Figure out why you are getting the out-of-memory exception; it might surprise you. Maybe it's string interning, or maybe you are fragmenting the GC heap with all the small object allocations. The CLR Profiler from Microsoft is fantastic for this.