I have a service that, when invoked, performs expensive operations on a large dataset.
The dataset is a list of items, i.e. something like a List&lt;Item&gt;, which typically contains a few million Item instances.
All Item instances in the list are distinct, and the service runs the same method, Process(Item item), on each of them. The Process method is mostly CPU-bound; however, it requires exclusive access to a file on the file system in order to process a given Item correctly. This means the items in the list cannot simply be processed in parallel.
Due to the large amount of data that needs to be processed, I am looking for a way to improve performance by processing the items in parallel.
A simple (but not elegant) way to do that would be to make a few copies of the file and run an equal number of threads: this would allow me to process as many Item instances in parallel as the number of file copies I make.
However, I would like a cleaner, more elegant approach, as I don't want to handle those file copies manually.
To do that, I am looking into using Docker containers and Kubernetes.
In such a setup, the Docker image would include both the service runtime as well as the file, so that each container (or Pod) that is created from that image would have its own copy of the file.
The question:
At this point, I am mostly missing how to orchestrate the processing of the Item instances across the various containers in a robust way.
How to do that?
Note that a similar question was raised in this StackOverflow question, and most answers suggested relying on Kubernetes liveness and readiness probes to prevent traffic from being routed to a given Pod, in my case a Pod that is already processing an Item instance.
However, I don't think probes were designed to be used this way; that approach feels more like a hack to me, so I am looking for a more solid solution to better control how the Item instances are processed.
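For reference, a common alternative to routing traffic at Pods is a pull-based work queue: each Pod asks a shared queue for its next Item, so a Pod never holds more than one item at a time and no probe trickery is needed. Below is a minimal single-process sketch of that pull model; BlockingCollection stands in for an external queue such as RabbitMQ or a database table, and all names are illustrative, not from the question:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PullWorkerSketch
{
    // Stand-in for an external queue (RabbitMQ, SQS, a DB table, ...).
    static readonly BlockingCollection<int> queue = new BlockingCollection<int>();

    static void Main()
    {
        for (int i = 0; i < 10; i++) queue.Add(i);
        queue.CompleteAdding();

        // Each "pod" runs this loop against its own copy of the file:
        // it pulls one item, processes it, then asks for the next.
        var pods = new Task[3];
        for (int p = 0; p < pods.Length; p++)
            pods[p] = Task.Run(() =>
            {
                foreach (int item in queue.GetConsumingEnumerable())
                    Process(item); // exclusive use of this pod's file copy
            });
        Task.WaitAll(pods);
    }

    static void Process(int item) => Console.WriteLine($"processed {item}");
}
```

In a Kubernetes setup the same loop would run inside each container, with the in-memory collection replaced by a message broker, so each Pod naturally throttles itself to one Item at a time.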
Related
I have not worked with the Queue collection yet, but based on the information I was able to gather it seems like the right approach to my problem.
I have a console app that scans a folder for new files of a certain type, based on specific criteria. Only new items are added to a queue.xml file. This runs at a fixed time interval (every hour).
Another console app is triggered at a different interval (every 4 hours). It reads the queue.xml file and passes each item on for processing. It seems the best way is to parse the XML file and build a Queue collection, so that each item is processed in order.
Here is the problem: processing a file can take a couple of hours, and during that time queue.xml may gain new items, so the Queue collection will not reflect those changes.
Is it possible to parse the xml file again and add new items to a Queue that is currently in progress?
Changing the size of a collection at runtime will cause problems. Is Queue different in that way?
Is it possible to parse the xml file again and add new items to a Queue that is currently in progress?
Of course, you just have to define the rules by which it is safe for this to happen.
Use a mutex in both applications to lock the file during read/write, and in your processing application subscribe to a FileSystemWatcher event to detect when the file has changed.
Changing the size of a collection at runtime will cause problems. Is Queue different in that way?
It can be safe to change the size of any collection at run time; that's usually why you use a collection in the first place (they have an Add() method for a reason)... you just have to do it safely, in the context of your solution.
If there is multi-thread access to the queue, lock it.
If there is a chance that the queue size can change during iteration, iterate over a copy of the queue.
If there is a chance that a process can change a file required by both applications, mutex it to control access.
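A rough sketch of the rules above: a lock guards the shared Queue, the reload step (which would be triggered from a FileSystemWatcher handler) de-duplicates items re-read from queue.xml, and the processor drains one item at a time. The XML parsing itself is stubbed out, and the names are illustrative:

```csharp
using System;
using System.Collections.Generic;

class QueueRefresher
{
    static readonly object padlock = new object();
    static readonly Queue<string> queue = new Queue<string>();
    static readonly HashSet<string> seen = new HashSet<string>();

    // Call this from a FileSystemWatcher.Changed handler whenever
    // queue.xml changes, passing the items parsed from the file.
    static void Reload(IEnumerable<string> itemsFromXml)
    {
        lock (padlock)
        {
            foreach (string item in itemsFromXml)
                if (seen.Add(item))      // only enqueue items we haven't seen
                    queue.Enqueue(item);
        }
    }

    // Called by the processing loop; returns null when the queue is empty.
    static string TryDequeue()
    {
        lock (padlock)
            return queue.Count > 0 ? queue.Dequeue() : null;
    }

    static void Main()
    {
        Reload(new[] { "a.dat", "b.dat" });
        Reload(new[] { "b.dat", "c.dat" }); // "b.dat" is not re-added
        string item;
        while ((item = TryDequeue()) != null)
            Console.WriteLine(item);        // a.dat, b.dat, c.dat
    }
}
```

Because every mutation of the queue happens under the same lock, the second app can keep processing while new items arrive, which is exactly the "queue grows during processing" scenario from the question.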
I need to download certain files using FTP. It is currently implemented without threads, and it takes too much time to download all the files.
So I need to use threads to speed up the process.
My code is like:
foreach (string str1 in files)
{
    download_FTP(str1);
}
I referred to this, but I don't want every file to be queued at once; say, 5 files at a time.
If the process is too slow, it means most likely that the network/Internet connection is the bottleneck. In that case, downloading the files in parallel won't significantly increase the performance.
It might be another story though if you are downloading from different servers. We may then imagine that some of the servers are slower than others. In that case, parallel downloads would increase the overall performance since the program would download files from other servers while being busy with slow downloads.
EDIT: OK, we have more info from you: Single server, many small files.
Downloading multiple files involves some overhead. You can decrease this overhead by somehow grouping the files (tar, zip, whatever) on server-side. Of course, this may not be possible. If your app would talk to a web server, I'd advise to create a zip file on the fly server-side according to the list of files transmitted in the request. But you are on an FTP server so I'll assume you have nearly no flexibility server-side.
Downloading several files in parallel may probably increase the throughput in your case. Be very careful though about restrictions set by the server such as the max amount of simultaneous connections. Also, keep in mind that if you have many simultaneous users, you'll end up with a big amount of connections on the server: users x threads. Which may prove counter-productive according to the scalability of the server.
A commonly accepted rule of good behaviour is to limit yourself to at most 2 simultaneous connections per user. YMMV.
Okay, as you're not using .NET 4 that makes it slightly harder - the Task Parallel Library would make it really easy to create five threads reading from a producer/consumer queue. However, it still won't be too hard.
Create a Queue<string> with all the files you want to download
Create 5 threads, each of which has a reference to the queue
Make each thread loop, taking an item off the queue and downloading it, or finishing if the queue is empty
Note that as Queue<T> isn't thread-safe, you'll need to lock to make sure that only one thread tries to fetch an item from the queue at a time:
string fileToDownload = null;
lock (padlock)
{
    if (queue.Count == 0)
    {
        return; // Done
    }
    fileToDownload = queue.Dequeue();
}
As noted elsewhere, threading may not speed things up at all - it depends where the bottleneck is. If the bottleneck is the user's network connection, you won't be able to get more data down the same size of pipe just by using multi-threading. On the other hand, if you have a lot of small files to download from different hosts, then it may be latency rather than bandwidth which is the problem, in which case threading will help.
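Putting the three steps together, the whole producer/consumer pattern looks roughly like this; the FTP download itself is stubbed out, and the file names are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class Downloader
{
    static readonly object padlock = new object();
    static readonly Queue<string> queue = new Queue<string>();

    static void Main()
    {
        // Step 1: a queue with all the files to download.
        foreach (string f in new[] { "a.txt", "b.txt", "c.txt",
                                     "d.txt", "e.txt", "f.txt" })
            queue.Enqueue(f);

        // Step 2: five threads, each sharing the same queue.
        var threads = new List<Thread>();
        for (int i = 0; i < 5; i++)
        {
            var t = new Thread(Worker);
            threads.Add(t);
            t.Start();
        }
        foreach (Thread t in threads)
            t.Join(); // wait for all downloads to finish
    }

    // Step 3: loop, taking an item off the queue, until it is empty.
    static void Worker()
    {
        while (true)
        {
            string fileToDownload;
            lock (padlock)
            {
                if (queue.Count == 0)
                    return; // queue drained, this thread is done
                fileToDownload = queue.Dequeue();
            }
            DownloadFtp(fileToDownload); // stand-in for the real FTP call
        }
    }

    static void DownloadFtp(string file) => Console.WriteLine("downloaded " + file);
}
```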
Look up ParameterizedThreadStart, which lets you pass an argument (here the file name) to each thread's entry point:

List<Thread> threadsToUse = new List<Thread>();
foreach (string str1 in files)
{
    // download_FTP must take a single object parameter
    // to match the ParameterizedThreadStart signature.
    Thread aThread = new Thread(new ParameterizedThreadStart(download_FTP));
    threadsToUse.Add(aThread);
    aThread.Start(str1);
}

Note this starts one thread per file; to download only 5 at a time, you would still need to cap the number of running threads.
I remember something about Thread.Join: calling Join on a thread blocks until that thread has finished, so you can use it to wait for all of the downloads to complete.
There is also something else you might want to look up, which I'm still trying to fully grasp: asynchronous calls, with which you are notified when the file has been downloaded. With a normal thread you will have to find another way to flag that it has finished.
This may or may not help your speed: if your line speed is low, it won't help much.
On the other hand, some servers cap each connection at a certain speed; in that case, setting up multiple connections to the server should, in theory, give some increase in speed. How much of an increase, though, I cannot say.
Hope this helps in some way.
I can add some experience to the comments already posted. In an app some years ago I had to generate a treeview of files on an FTP server. Listing files does not normally require actual downloading, but some of the files were zipped folders and I had to download and unzip these (sometimes recursively) to display the files/folders inside. For a multithreaded solution, this required a 'FolderClass' for each folder that could keep state and so handle both unzipped and zipped folders. To start the operation off, one of these was set up with the root folder and submitted to a P-C queue and a pool of threads. As each folder was LISTed and iterated, more FolderClass instances were submitted to the queue for each subfolder. When a FolderClass instance reached the end of its LIST, it PostMessaged itself (it was not C#, for which you would need BeginInvoke or the like) to the UI thread, where its info was added to the listview.
This activity was characterised by a lot of latency-sensitive TCP connect/disconnect with occasional download/unzip.
A pool of, IIRC, 4-6 threads (as already suggested by other posters) provided the best performance on the single-core system I had at the time and, in this particular case, was much faster than a single-threaded solution. I can't remember the figures exactly, but no stopwatch was needed to detect the performance boost: something like 3-4 times faster. On a modern box with multiple cores, where LISTs and unzips could occur concurrently, I would expect even more improvement.
There were some problems: the visual ListView component could not keep up with the incoming messages (because of the multiple threads, data arrived for apparently 'random' positions in the treeview and so required continual tree navigation for display), and so the UI tended to freeze during the operation. Another problem was detecting when the operation had actually finished. These snags are probably not relevant to your download-many-small-files app.
Conclusion - I expect that downloading a lot of small files is going to be faster if multithreaded with multiple connections, if only from mitigating the connect/disconnect latency which can be larger than the actual data download time. In the extreme case of a satellite connection with high speed but very high latency, a large thread pool would provide a massive speedup.
Note the valid caveats from the other posters - if the server, (or its admin), disallows or gets annoyed at the multiple connections, you may get no boost, limited bandwidth or a nasty email from the admin!
Rgds,
Martin
I have a program that we'd like to multi-thread at a certain point. We're using CSLA for our business rules. At one location in our program we iterate over a BusinessList object and run some sanity checks against the data one row at a time. When we up the row count to about 10k rows, the process takes some time to run (about a minute). Naturally this sounds like a perfect place to use a bit of TPL and make it multi-threaded.
I've done a fair amount of multithreaded work through the years, so I understand the pitfalls of switching from single to multithreaded code. I was surprised to find that the code bombed within the CSLA routines themselves. It seems to be related to the code behind the CSLA PropertyInfo classes.
All of our business object properties are defined like this:
public static readonly PropertyInfo<string> MyTextProperty = RegisterProperty<string>(c => c.MyText);
public string MyText {
    get { return GetProperty(MyTextProperty); }
    set { SetProperty(MyTextProperty, value); }
}
Is there something I need to know about multithreading and CSLA? Are there any caveats that aren't found in any written documentation (I haven't found anything as of yet).
--EDIT---
BTW: I implemented my multithreading by throwing all the rows into a ConcurrentBag and then spawning 5 or so tasks that grab objects from the bag until it is empty. So I don't think the problem is in my code.
As you've discovered, the CSLA.NET framework is not thread-safe.
To solve your particular problem, I would make use of the Wintellect Power Threading library; either the AsyncEnumerator/SyncGate combo or the ReaderWriterGate on its own.
The Power Threading library will allow you to queue 'read' and 'write' requests to a shared resource (your CSLA.NET collection). At any one moment, only a single 'write' request is allowed access to the shared resource, all without thread-blocking the queued 'read' or 'write' requests. It's very clever and super handy for safely accessing shared resources from multiple threads. You can spin up as many threads as you wish and the Power Threading library will synchronise access to your CSLA.NET collection.
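If taking a dependency on the Power Threading library is not an option, the same read/write gating idea can be sketched with the BCL's ReaderWriterLockSlim. Note this is a standard-library substitute, not the Power Threading API itself, and the class and member names here are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class GatedList
{
    static readonly ReaderWriterLockSlim gate = new ReaderWriterLockSlim();
    static readonly List<string> items = new List<string>();

    // Many threads may read concurrently...
    public static int Count()
    {
        gate.EnterReadLock();
        try { return items.Count; }
        finally { gate.ExitReadLock(); }
    }

    // ...but each write gets exclusive access.
    public static void Add(string item)
    {
        gate.EnterWriteLock();
        try { items.Add(item); }
        finally { gate.ExitWriteLock(); }
    }

    static void Main()
    {
        Add("row1");
        Add("row2");
        Console.WriteLine(Count()); // 2
    }
}
```

The same wrapping approach can be applied around access to the CSLA.NET collection, so the sanity-check tasks read under the read lock while any mutation takes the write lock.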
I'm presently working on a side-by-side application (C#, WinForms) that injects messages into an application via COM.
This application uses multiple foreach statements, polling entity metrics from the application that accepts COM. A ListBox is used to list each entity, and when a user selects one from this list, a thread is created and executed, calling a method that retrieves the required data.
When a user selects a different entity from the list, the running thread is aborted and a new thread is created for the newly selected entity.
I've spent a day looking into my threading and memory usage, and have concluded that everything is fine. There are never more than 6 threads running concurrently (all unique, executing different members), and according to the Windows task manager my application never peaks above 10% CPU or 29 MB of memory.
The only thing coming to mind is that the COM object you are using is designed to run in a single threaded apartment (STA). If that is the case then it will not matter how many threads you start; they will all eventually get serialized when calling into this COM object. And if your machine has multiple cores then you will definitely see less than 100% usage. 10% seems awfully low though. I would not be surprised to see something around 25% which would basically represent one pegged core of a quad core system, but the 10% figure might require another explanation. If your code or the COM object itself is waiting for IO operations to complete that might explain more of the low throughput.
In WinForms you can use SuspendLayout() and ResumeLayout(). If you are inserting a lot of items (or in general doing a lot of screen updates), you would first call SuspendLayout(), then do all of your updates, and then call ResumeLayout().
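A minimal sketch of that pattern, assuming a ListBox named listBox1 (ListBox additionally offers BeginUpdate()/EndUpdate() to suppress per-item repaints):

```csharp
using System.Windows.Forms;

static class EntityListHelper
{
    // listBox1 and the entity list are illustrative names.
    public static void Populate(ListBox listBox1, string[] entities)
    {
        listBox1.SuspendLayout(); // no layout passes during the bulk update
        listBox1.BeginUpdate();   // no per-item repaints either
        try
        {
            listBox1.Items.Clear();
            foreach (string e in entities)
                listBox1.Items.Add(e);
        }
        finally
        {
            listBox1.EndUpdate();
            listBox1.ResumeLayout(); // one layout/repaint at the end
        }
    }
}
```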
You don't mention what's slow, so it's very difficult to say anything with certainty. However, since you say that you insert items into a listbox, I'll make a complete guess and ask how many items is that each time? It can be very slow to insert a lot of items into a list box.
If that's the case, you could speed it up by instead of listing each entity in one listbox, only list a set of categories there and then when the user selects a category you'll populate another listbox with the entities related to that category.
I want to build a windows service that will use a remote encoding service (like encoding.com, zencoder, etc.) to upload video files for encoding, download them after the encoding process is complete, and process them.
In order to do that, I was thinking about having different queues: one for files currently waiting, one for files being uploaded, one for files waiting for encoding to complete, and one more for downloading them. Each queue has a limit; for example, only 5 files can be uploaded for encoding at any one time. The queues have to be visible and able to recover from a crash; currently we do that by writing the queue to an SQL table and managing the number of items in a separate table.
I also want the queues to run in the background, independent of each other, but able to transfer files from one queue to another as the process goes on.
My biggest question mark is about how to build and manage the queues, and less about limiting the number of items in each queue.
I am not sure what is the right approach for this and would really appreciate any help.
Thanks!
You probably don't need to separate the work into separate queues, as long as they are logically separated in some way (tagged with different "job types" or such).
As I see it, the challenge is to not pick up and process more than a given limited number of jobs from the queue, based on the type of job. I had a somewhat similar issue a while ago which led to a question here on SO, and a subsequent blog post with my solution, both of which might give you some ideas.
In short, my solution was to keep a list of "tokens". Whenever I want to perform a job that has some sort of limit, I first pick up a token; if no tokens are available, I wait for one to become available. Then you can use whatever queueing mechanism is suitable to handle the queue as such.
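The token idea maps naturally onto SemaphoreSlim, which is effectively a token pool; the job types and limits below are illustrative, not taken from the linked blog post:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class TokenLimiter
{
    // One token pool per job type, e.g. at most 5 concurrent uploads.
    static readonly Dictionary<string, SemaphoreSlim> tokens =
        new Dictionary<string, SemaphoreSlim>
        {
            ["upload"]   = new SemaphoreSlim(5),
            ["download"] = new SemaphoreSlim(3),
        };

    static async Task RunJob(string jobType, Func<Task> job)
    {
        SemaphoreSlim pool = tokens[jobType];
        await pool.WaitAsync();     // wait until a token is free
        try { await job(); }
        finally { pool.Release(); } // hand the token back
    }

    static async Task Main()
    {
        var jobs = new List<Task>();
        for (int i = 0; i < 20; i++)
        {
            int n = i;
            jobs.Add(RunJob("upload", async () =>
            {
                await Task.Delay(100); // pretend to upload a file
                Console.WriteLine($"uploaded {n}");
            }));
        }
        await Task.WhenAll(jobs);   // never more than 5 uploads in flight
    }
}
```

Because all jobs can sit in one logical queue, only the token pools differ per job type, which matches the suggestion above to tag jobs rather than maintain physically separate queues.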
There are various ways to approach this, and which one suits your case depends on reliability and resilience versus development and maintenance cost. You need to answer questions such as: if the server crashes, is it important to carry on where you left off?
The queue can be implemented in MSMQ, SQL Server, or simply in code with all queues in memory. For the workflow you can use Windows Workflow Foundation, or implement it yourself, which would probably be easier initially but more difficult to change later.
So if you give a few more hints, I should be able to help you better.