Is there a way to estimate the remaining time in .net remoting? - c#

I have two applications communicating via an IpcChannel.
I call a method (specifically a property getter) that returns a list of 80+ objects, each containing another 500-1000 objects.
This call takes about 40-60 seconds to complete. Is there a way to determine the estimated remaining time so I can give the user some feedback, apart from splitting up the list and fetching the objects one by one (which would let me compute the remaining time myself)?

Splitting the communication up into smaller pieces is actually a pretty good way to solve this problem.
You need an initial handshake to establish how many pieces will be transferred, but then you can give an accurate estimate of the remaining time. Make sure you don't break the data into pieces so small that the communication overhead degrades performance.
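As a rough sketch of that approach (the IDataService interface and its two methods are hypothetical stand-ins for the original property getter, not part of any real API):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical remoted interface: GetChunkCount() is the initial handshake,
// GetChunk() replaces the single property getter that returns everything at once.
public interface IDataService
{
    int GetChunkCount();
    IList<object> GetChunk(int index);
}

public static class ChunkedFetch
{
    public static List<object> FetchWithProgress(IDataService service)
    {
        var result = new List<object>();
        int chunks = service.GetChunkCount();   // handshake: how many pieces?
        var watch = Stopwatch.StartNew();

        for (int i = 0; i < chunks; i++)
        {
            result.AddRange(service.GetChunk(i));
            // Estimate remaining time from the average time per chunk so far.
            double perChunk = watch.Elapsed.TotalSeconds / (i + 1);
            double remaining = perChunk * (chunks - i - 1);
            Console.WriteLine($"Chunk {i + 1}/{chunks}, ~{remaining:F0} s remaining");
        }
        return result;
    }
}
```

The per-chunk average smooths out the estimate as more chunks arrive; keep chunks large enough (dozens of objects each) that the extra round trips stay cheap.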

Related

High volume blacklist contains operation - performance in C#

I am working on a desktop application which needs to perform web-site access checks. I have huge blacklists on the PC where the application runs, and I'm faced with this task:
How do I perform the fastest possible check against those blacklists?
I'm using the C#/.NET stack. My current idea is to load all the lists into a HashSet and call its Contains method, but I'm not sure that loading everything into memory is a good idea. Can you suggest another approach that saves memory on the one hand and is still as fast as possible on the other?
The files are plain text, currently a few megabytes in size, but this is expected to grow.
UPDATE:
I found blacklists of web sites; after downloading and unzipping them, the data is about 80 megabytes. So I'm not sure that keeping all of it in memory is a good idea.
UPDATE 2
I've created a performance test: I downloaded a blacklist with 2,339,643 items, loaded it into a HashSet, and ran 1,000 iterations to check the speed.
Results:
The longest a Contains call took was 0.2 milliseconds (the first call); the second call took about 0.0164 milliseconds, and later calls even less. The performance is good, but the application running the test takes about 250 MB of system memory, which is not as good as the HashSet's performance.
You can use a HashSet to store your blacklist; this data structure gives O(1) amortised time for inserts and for checking whether an item is present in the set.
If you need something more scalable, you can consider bringing in Redis or memcached.
Reading through the comments, I would consider creating a web service that performs the check. A user queries the web service, which in turn queries Redis, memcached, or SQL Server if you don't need it all in memory. Alternatively, I would suggest looking at whitelisting: if your blacklists grow too much, that may indicate a problem with the current approach.
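A minimal sketch of the in-memory approach (the file path and one-host-per-line format are assumptions about the blacklist files):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class Blacklist
{
    // Load every non-empty line of the blacklist file into a HashSet.
    // Case-insensitive comparison, assuming the entries are host names.
    public static HashSet<string> Load(string path)
    {
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in File.ReadLines(path))
        {
            string host = line.Trim();
            if (host.Length > 0)
                set.Add(host);
        }
        return set;
    }
}

// Usage: O(1) amortised membership check after a one-time load.
// var blocked = Blacklist.Load(@"blacklist.txt");
// if (blocked.Contains("example.com")) { /* deny access */ }
```

File.ReadLines streams the file rather than reading it all at once, so the only lasting memory cost is the set itself.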

Why does my Parallel.ForAll call end up using a single thread?

I have been using PLINQ recently to perform some data handling.
Basically I have about 4000 time series (instances of Dictionary&lt;DateTime, T&gt;) which I store in a list called timeSeries.
To perform my operation, I simply do:
timeSeries.AsParallel().ForAll(x => myOperation(x))
If I have a look at what is happening with my different cores, I notice that first, all my CPUs are being used and I see on the console (where I output some logs) that several time series are processed at the same time.
However, the process is lengthy, and after about 45 minutes, the logging clearly indicates that there is only one thread working. Why is that?
I tried to give it some thought, and I realized that timeSeries contains instances that are simpler to process (from myOperation's point of view) at the beginning and end of the list. So I wondered whether the algorithm PLINQ uses consists of splitting the 4000 instances across, say, 4 cores, giving each of them 1000. Then, when a core finishes its allocation of work, it goes back to idle. This would mean that one of the cores may face a much heavier workload.
Is my theory correct or is there another possible explanation?
Shall I shuffle my list before running it or is there some kind of parallelism parameters I can use to fix that problem?
Your theory is probably correct, although there is something called "work stealing" that should counter this. I'm not sure why it doesn't work here. Are there many (dozens or more) large jobs at the outer ends, or just a few?
Aside from shuffling your data, you could use the overload of AsParallel() that accepts a custom Partitioner. That would allow you to balance the work better.
Side note: for this situation I would prefer Parallel.ForEach(); it has more options and cleaner syntax.
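A sketch of the partitioner suggestion: Partitioner.Create with load balancing hands items out in small chunks on demand, instead of pre-splitting the list into fixed ranges, so a core that finishes its easy items keeps pulling work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public static class BalancedRun
{
    public static void Process<T>(IList<T> timeSeries, Action<T> myOperation)
    {
        // loadBalance: true => dynamic chunked partitioning rather than
        // static range partitioning, which avoids one idle-core hotspot.
        var partitioner = Partitioner.Create(timeSeries, loadBalance: true);
        partitioner.AsParallel().ForAll(x => myOperation(x));

        // Equivalent with Parallel.ForEach, as suggested above:
        // Parallel.ForEach(partitioner, myOperation);
    }
}
```

This keeps the original one-liner's shape while swapping in the load-balancing partitioner; no shuffling of the list is needed.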

Divide work among processes or threads?

I am interning for a company this summer, and I got handed this program, which is a total piece. It performs very computationally intensive operations throughout most of its run. It takes about 5 minutes to complete a small job, and the guy I work with said that larger jobs have taken up to 4 days. My job is to find a way to make it go faster. My idea is to split the input in half and pass the halves to two new threads or processes. I was wondering if I could get some feedback on how effective that might be, and on whether threads or processes are the way to go.
Any inputs would be welcomed.
Hunter
I'd take a strong look at the TPL that was introduced in .NET 4 :) PLINQ might be especially useful for easy speedups.
Generally speaking, splitting into different processes (exe files) is inadvisable for performance, since starting processes is expensive. Processes do have other merits, such as isolation (if part of a program crashes), but I don't think those apply to your problem.
If the jobs are splittable, then going multithreaded/multiprocessed will bring better speed. That is assuming, of course, that the computer they run on actually has multiple cores/cpus.
Threads or processes doesn't really matter regarding speed (if the threads don't share data). The only reason to use processes that I know of is when a job is likely to crash an entire process, which is not likely in .NET.
Use threads if there's lots of memory sharing in your code, but if you think you'd like to scale the program to run across multiple computers (when the required cores exceed 16), then develop it using processes with a client/server model.
The best way when optimising code, always, is to profile it to find out where the logjams are, IMO.
Sometimes you can find non-obvious, huge speed increases with little effort.
EQATEC and SlimTune are two free C# profilers which may be worth trying out.
(Of course the other comments about which parallelization architecture to use are spot on; it's just that I prefer analysis first.)
Have a look at the Task Parallel Library -- this sounds like a prime candidate problem for using it.
As for the threads vs processes dilemma: threads are fine unless there is a specific reason to use processes (e.g. if you were using buggy code that you couldn't fix, and you did not want a bad crash in that code to bring down your whole process).
If the problem has a parallel solution, then this is the right way to (ideally, though not always) significantly increase performance.
However, you can't easily control spawning additional processes short of writing an app that launches multiple mini-apps, which is not going to help with this problem.
You are going to need multiple threads. There is a pretty cool library added to .NET for parallel programming that you should take a look at: I believe its namespace is System.Threading.Tasks, together with the Parallel class in System.Threading.
Edit: I would definitely suggest, though, that you think about whether a sequential solution may fit better. Sometimes parallel solutions take even longer; it all depends on the problem in question.
If you need to communicate/pass data, go with threads (and if you can go .Net 4, use the Task Parallel Library as others have suggested). If you don't need to pass info that much, I suggest processes (scales a bit better on multiple cores, you get the ability to do multiple computers in a client/server setup [server passes info to clients and gets a response, but other than that not much info passing], etc.).
Personally, I would invest my effort into profiling the application first. You can gain a much better awareness of where the problem spots are before attempting a fix. You can parallelize this problem all day long, but it will only give you a linear improvement in speed (assuming that it can be parallelized at all). But, if you can figure out how to transform the solution into something that only takes O(n) operations instead of O(n^2), for example, then you have hit the jackpot. I guess what I am saying is that you should not necessarily focus on parallelization.
You might find spots that are looping through collections to find specific items. Instead you can transform these loops into hash table lookups. You might find spots that do frequent sorting. Instead you could convert those frequent sorting operations into a single binary search tree (SortedDictionary) which maintains a sorted collection efficiently through the many add/remove operations. And maybe you will find spots that repeatedly make the same calculations. You can cache the results of already made calculations and look them up later if necessary.
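A sketch of those two ideas together: replace linear scans with dictionary lookups, and cache repeated expensive calculations (memoization). ExpensiveCalc and its formula are hypothetical stand-ins for a hot function found by the profiler.

```csharp
using System;
using System.Collections.Generic;

public static class Optimizations
{
    static readonly Dictionary<int, double> cache = new Dictionary<int, double>();

    public static double ExpensiveCalc(int input)
    {
        double cached;
        if (cache.TryGetValue(input, out cached))
            return cached;                      // O(1) lookup instead of recomputing

        // Placeholder for the real expensive work.
        double result = Math.Sqrt(input) * Math.Log(input + 1);

        cache[input] = result;                  // remember it for next time
        return result;
    }
}
```

The same Dictionary/TryGetValue pattern replaces a loop-through-a-collection search; SortedDictionary gives the keep-it-sorted variant mentioned above.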

Algorithm: minimizing total tardiness

I am trying to implement an exact algorithm for minimizing total tardiness on a single machine. I searched the web for ideas on how to implement it using dynamic programming, and I read Lawler's paper proposing a pseudopolynomial algorithm back in 1977. However, I still haven't been able to translate it into Java or C# code.
Could you please help me providing some reference how to implement this exact algorithm effectively?
Edit 1: @bcat: not really. I need to implement it for our software. :( I'm still not able to find any guidance on implementing it. A greedy algorithm is easy to implement, but the resulting schedule is not that impressive.
Kind Regards,
Xiaon
You could have specified the exact characteristics of your particular problem along with limits for different variables and the overall characteristics of the input.
Without this, I assume you use this definition:
You want to minimize the total tardiness in a single machine scheduling problem with n independent jobs, each having its processing time and due date.
What you want to do is to pick up a subset of the jobs so that they do not overlap (single machine available) and you can also pick the order in which you do them, keeping the due dates.
I guess the first step to do is to sort by the due dates, it seems that there is no benefit of sorting them in some different way.
Next what is left is to pick the subset. Here is where the dynamic programming comes to help. I'll try to describe the state and recursive relation.
State:
[current time][current job] = maximum number of jobs done
Relation:
You either process the current job and call
f(current time + processing_time_of[current job], current job + 1)
or you skip it and call
f(current time, current job + 1)
Finally, you take the maximum of the two values returned by those calls and store it in state
[current time][current job]
And you start the recursion at time 0 and job 0.
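The recursion above can be sketched as memoized C#. Note this implements the relation as described, maximizing the number of jobs completed by their due dates, not Lawler's tardiness decomposition; the job data in the usage comment is made up, and jobs are assumed pre-sorted by due date.

```csharp
using System;

public class Scheduler
{
    readonly int[] proc, due;   // processing times and due dates, sorted by due date
    readonly int?[,] memo;      // memo[time, job] caches subproblem results

    public Scheduler(int[] proc, int[] due, int horizon)
    {
        this.proc = proc;
        this.due = due;
        memo = new int?[horizon + 1, proc.Length + 1];
    }

    // f(time, job): best result over jobs job..n-1 starting at 'time'.
    public int F(int time, int job)
    {
        if (job == proc.Length) return 0;
        if (memo[time, job].HasValue) return memo[time, job].Value;

        int best = F(time, job + 1);              // skip the current job
        if (time + proc[job] <= due[job])         // fits before its due date?
            best = Math.Max(best, 1 + F(time + proc[job], job + 1));

        memo[time, job] = best;
        return best;
    }
}

// Usage (hypothetical data): start the recursion at time 0 and job 0.
// int onTime = new Scheduler(new[] {2, 1, 3}, new[] {3, 4, 5}, horizon: 6).F(0, 0);
```

The table is pseudopolynomial, O(horizon × n) states, which matches the character of Lawler's result even though the objective here is the simpler on-time-job count.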
By the way, greedy seems to be doing pretty well, check this out:
http://www.springerlink.com/content/f780526553631475/
For single machine, Longest Processing Time schedule (also known as Decreasing-Time Algorithm) is optimal if you know all jobs ahead of time and can sort them from longest to shortest (so all you need to do is sort).
As has been mentioned already, you've specified no information about your scheduling problem, so it is hard to help beyond this (as no universally optimal and efficient scheduler exists).
Maybe the Job Scheduling Section of the Stony Brook Algorithm Repository can help you.

factory floor simulation

I would like to create a simulation of a factory floor, and I am looking for ideas on how to do this. My thoughts so far are:
• A factory is made up of a bunch of processes; some of these processes are in series and some are in parallel. Each process would communicate with its upstream, downstream, and parallel neighbours to let them know its throughput.
• Each process would have its own basic attributes, like maximum throughput and cost of maintenance as a function of throughput.
Obviously I have not fully thought this out, but I was hoping somebody might be able to give me a few ideas, or perhaps a link to an online resource.
update:
This project is only for my own entertainment, and perhaps to learn a little bit along the way. I am not employed as a programmer; programming is just a hobby for me. I have decided to write it in C#.
Simulating an entire factory accurately is a big job.
Firstly you need to figure out: why are you making the simulation? Who is it for? What value will it give them? What parts of the simulation are interesting? How accurate does it need to be? What parts of the process don't need to be simulated accurately?
To figure out the answers to these questions, you will need to talk to whoever it is that wants the simulation written.
Once you have figured out what to simulate, then you need to figure out how to simulate it. You need some models and some parameters for those models. You can maybe get some actual figures from real production and try to derive models from the figures. The models could be a simple linear relationship between an input and an output, a more complex relationship, and perhaps even a stochastic (random) effect. If you don't have access to real data, then you'll have to make guesses in your model, but this will never be as good so try to get real data wherever possible.
You might also want to consider the probabilities of components breaking down, and what effect that might have. What about workers going on strike? Unavailability of raw materials? Wear and tear on the machinery causing progressively lower output over time? Again, you might not want to model these details; it depends on what the customer wants.
If your simulation involves random events, you might want to run it many times and get an average outcome, for example using a Monte Carlo simulation.
To give a better answer, we need to know more about what you need to simulate and what you want to achieve.
Since your customer is yourself, you'll need to decide the answer to all of the questions that Mark Byers asked. However, I'll give you some suggestions and hopefully they'll give you a start.
Let's assume your factory takes a few different parts and assembles them into just one finished product. A flowchart of the assembly process might look like this:
Factory Flowchart http://img62.imageshack.us/img62/863/factoryflowchart.jpg
For the first diamond, where widgets A and B are assembled, assume the step takes 30 seconds on average. We'll assume the actual assembly time is normally distributed, with mean 30 s and variance 5 s². For the second diamond, assume it also takes 30 seconds on average, but most of the time it takes much less and occasionally it takes a lot longer. This is well approximated by an exponential distribution with mean 30 s, i.e. a rate parameter of 1/30, often represented in equations by a lambda.
For the first process, compute the time to assemble widgets A and B as:
timeA = randn(30, sqrt(5)); // assuming a helper that returns a normally
                            // distributed random number given mean and sigma
For the second process, compute the time to add widget C to the assembly as:
timeB = -Math.Log(1 - rand()) / lambda; // inverse-CDF sample of an exponential
                                        // from a uniform random number in [0, 1)
Now your total assembly time for each iGadget will be timeA + timeB + waitingTime. At each assembly point, store a queue of widgets waiting to be assembled. If the second assembly point is a bottleneck, its queue will fill up. You can enforce a maximum size for that queue and hold things further upstream when the maximum is reached. If an item is in a queue, its assembly time is increased by the time spent on all of the iGadgets ahead of it in the line. I'll leave it up to you to figure out how to code that up; you can run lots of trials to see what the total assembly time will be on average. What does the resulting distribution look like?
Ways to "spice this up":
Require 3 B widgets for every A widget. Play around with inventory. Replenish inventory at random intervals.
Add a quality assurance check (exponential distribution is good to use here), and reject some of the finished iGadgets. I suggest using a low rejection rate.
Try using different probability distributions than those I've suggested. See how they affect your simulation. Always try to figure out how the input parameters to the probability distributions would map into real world values.
You can do a lot with this simple simulation. The next step would be to generalize your code so that you can have an arbitrary number of widgets and assembly steps. This is not quite so easy. There is an entire field of applied math called operations research that is dedicated to this type of simulation and analysis.
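The two sampling steps above can be put together as a small Monte Carlo run. This is a sketch under the stated assumptions (Normal with mean 30 and variance 5 for stage 1, Exponential with mean 30 for stage 2; the trial count is arbitrary), ignoring queueing for simplicity:

```csharp
using System;

public static class FactorySim
{
    static readonly Random rng = new Random();

    // Normal(mean, sigma) via the Box-Muller transform of two uniform samples.
    static double Normal(double mean, double sigma)
    {
        double u1 = 1.0 - rng.NextDouble(), u2 = rng.NextDouble();
        return mean + sigma * Math.Sqrt(-2 * Math.Log(u1)) * Math.Cos(2 * Math.PI * u2);
    }

    // Exponential with the given mean, via the inverse CDF.
    static double Exponential(double mean)
    {
        return -mean * Math.Log(1.0 - rng.NextDouble());
    }

    public static void Main()
    {
        const int trials = 100000;
        double total = 0;
        for (int i = 0; i < trials; i++)
            total += Normal(30, Math.Sqrt(5)) + Exponential(30);

        // Expected value is 30 + 30 = 60 s; the sample mean should be close.
        Console.WriteLine($"Average assembly time: {total / trials:F1} s");
    }
}
```

Adding the queue between the two stages, and tracking each iGadget's waiting time, is the natural next step toward the full simulation.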
What you're describing is a classical problem addressed by discrete event simulation. A variety of both general purpose and special purpose simulation languages have been developed to model these kinds of problems. While I wouldn't recommend programming anything from scratch for a "real" problem, it may be a good exercise to write your own code for a small queueing problem so you can understand event scheduling, random number generation, keeping track of calendars, etc. Once you've done that, a general purpose simulation language will do all that stuff for you so you can concentrate on the big picture.
A good reference is Law & Kelton. ARENA is a standard package. It is widely used and, IMHO, is very comprehensive for these kind of simulations. The ARENA book is also a decent book on simulation and it comes with the software that can be applied to small problems. To model bigger problems, you'll need to get a license. You should be able to download a trial version of ARENA here.
It may be more than what you are looking for, but Visual Components is a good industrial simulation tool.
To be clear, I do not work for them, nor does the company I work for currently use them, but we have evaluated them.
Automod is the way to go.
http://www.appliedmaterials.com/products/automod_2.html
There is a lot to learn, and it won't be cheap.
ASI's Automod has been in the factory simulation business for about 30 years. It is now owned by Applied Materials. The big players who work with material handling in a warehouse use Automod because it is the proven leader.
