Combinatorial optimization where several criteria must be satisfied - c#

We are a group of first-year students studying computer science.
We are working on a project called "The electronical diet plan" (directly translated)
We want to make a program in C# that calculates, on a weekly basis, a diet plan that fulfils some criteria:
Your daily energy intake should not exceed the calculated calorie needs.
(Ex. If we calculate that a person should eat 2000 calories per day, the diet plan should plan approximately 2000 calories)
The daily energy (calories) should be distributed as follows:
Fat 25-35%
Carbohydrates 50-60%
Proteins 10-20%
We have a "database" with food and how much fat, carbohydrates and proteins it contains + the approximate price.
And we have a "database" with recipes and how much time each one takes to cook.
SO: We want to make a program that calculates, on a weekly basis, a good diet plan which satisfies the daily energy need (and how it should be distributed across fat, carbohydrates and proteins). The program should also produce a diet plan that does not take a lot of time and does not cost too much (the user defines an upper bound for the price per week).
SO.. We want help finding a method/algorithm that can combine 3-6 dishes per day so that all of this is satisfied ^^
We have been looking at a lot of combinatorial optimization algorithms/problems, but mostly "The knapsack problem".
But these algorithms/problems only satisfy one criterion or try to find the "cheapest" solution.
-> We want to satisfy a lot of criteria and want to find the best solution (not cheapest.. ex. fat has to be between 25-35%, not just be the lowest value)
We hope that some of you can help us with a good algorithm.

When it comes to finding the "cheapest" solution rather than the "best", you'll just have to redefine "cheap".
In optimization theory one often refers to the cost function, which is to be minimized - in your case, "cost" could be "fat percentage point difference from 30%", i.e. it costs nothing to eat 30% fat, and equally much to eat 20% as 40%. Of course, to make the method even more sophisticated, you could weigh it so it's more "expensive" to eat too much fat, than too little.
Now, if you create costs for each of your criteria, you also have to weigh them together, as mellamokb noted in a comment; to do this, simply calculate a weighted total cost. You'll end up with something like the following:
cost of diet = (importance of price) * price + (importance of time) * time + (importance of fat) * (deviation from fat goal) + etc ...
If you want to make it impossible to go over budget (in money spent), you could add terms like (over budget ? infinity : 0) to make the algorithm find solutions within the budget. You can also make constraints for repetition of meals etc - it's more or less your imagination (and computing power) that set the limits.
Now that you have a cost function, you can start working on your solution to the problem: minimizing the cost of the diet. And suddenly all those algorithms finding the "cheapest" solution make sense... ;)
Note that formulating this cost function is usually the difficult part. Depending on how you weigh your costs you'll find very different solutions to the problem; not all of them will be useful (in fact most of them probably won't be).
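As a rough sketch of such a cost function in C#: the weights, the names and the single fat term below are made up for illustration, and a real version would add similar terms for carbohydrates and proteins.

```csharp
using System;

public static class DietCost
{
    // Hypothetical weights: tuning them is the difficult part mentioned above.
    public const double PriceWeight = 1.0;
    public const double TimeWeight = 0.5;
    public const double FatWeight = 10.0;

    // price and budget are in the same currency, minutes is the cooking time,
    // fatPercent is the share of calories that comes from fat (0-100).
    public static double Cost(double price, double minutes, double fatPercent, double budget)
    {
        // Hard constraint: an over-budget plan is infinitely expensive.
        if (price > budget)
            return double.PositiveInfinity;

        // Distance from the 30% goal, so 20% and 40% cost the same.
        double fatDeviation = Math.Abs(fatPercent - 30.0);
        return PriceWeight * price + TimeWeight * minutes + FatWeight * fatDeviation;
    }
}
```

Any search method that minimizes a single number can then be pointed at DietCost.Cost; changing the weights changes which plans it prefers.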


Generate closest teams based on employee schedules C#

I am given a csv of employee schedules with columns:
employee ID, first last name, sunday schedule, monday schedule, ... , saturday schedule
1 week schedule for each employee. I've attached a screenshot of a portion of the csv file. The total file has around 300 rows.
I need to generate teams of 15 based on the employees' schedules (locations don't matter) so that the employees on each team have the closest schedules to each other. Pseudocode of what I have tried:
parse csv file into array of schedules (my own struct definition)
match employees who have the same exact schedule into teams (creates ~5 full sized teams, 20 - 25 half filled teams, leaves ~50 schedules who don't match with anyone)
for i = 1 to 14, for each member of teams of size i, find the team with the closest schedule (as a whole) and add the member to that team. Once a team reaches size 15, mark them as "done".
This worked somewhat but definitely did not give me the best teams. My question is does anyone know a better way to do this? Pseudocode or just a general idea will help, thanks.
EDIT: Here is an example of the formula of comparison
The comparison is based on half-hour blocks of difference between the agents' schedules. Agent 25 has a score of 16 because he has a difference of 8 half hours with each of Agents 23 and 24. The team's total score is 32, the sum of everyone's scores.
Not all agents work 8 hour days, and many have different days off, which have the greatest effect on their "closeness" score. Also, a few agents have a different schedule on a certain day than their normal schedule. For example, one agent might work 7am - 3pm on mondays but work 8am - 4pm on tuesday - friday.
Unless you find a method that gets you an exact best answer, I would add a hill-climbing phase at the end that repeatedly checks to see if swapping any pair of agents between teams would improve things, and swaps them if this is the case, only stopping when it has rechecked every pair of agents and there are no more improvements to be made.
I would do this for two reasons:
1) Such hill-climbing finds reasonably good solutions surprisingly often.
2) People are good at finding improvements like this. If you produce a computer-generated schedule and people can find simple improvements (perhaps because they notice they are often scheduled at the same time as somebody from another team) then you're going to look silly.
Thinking about (2) another way to find local improvements would be to look for cases where a small number of people from different teams are scheduled at the same time and see if you can swap them all onto the same team.
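A minimal sketch of that swap-based hill-climbing phase, assuming you already have a matrix of pairwise schedule distances (dist) and an initial assignment (team[i] is the team number of agent i). The class and parameter names are hypothetical. Swapping keeps team sizes unchanged, so teams of 15 stay teams of 15.

```csharp
using System;

public static class TeamHillClimber
{
    // Sum of distances from `agent` to the members of team `teamId`,
    // ignoring the agent itself and the agent `skip`.
    static int CostTo(int[,] dist, int[] team, int agent, int teamId, int skip)
    {
        int sum = 0;
        for (int m = 0; m < team.Length; m++)
            if (team[m] == teamId && m != agent && m != skip)
                sum += dist[agent, m];
        return sum;
    }

    // Repeatedly swaps pairs of agents between teams while that lowers the
    // total score; stops after a full pass over all pairs with no improvement.
    public static void Improve(int[,] dist, int[] team)
    {
        bool improved = true;
        while (improved)
        {
            improved = false;
            for (int a = 0; a < team.Length; a++)
            {
                for (int b = a + 1; b < team.Length; b++)
                {
                    if (team[a] == team[b]) continue;
                    int ta = team[a], tb = team[b];
                    int before = CostTo(dist, team, a, ta, b) + CostTo(dist, team, b, tb, a);
                    int after = CostTo(dist, team, a, tb, b) + CostTo(dist, team, b, ta, a);
                    if (after < before)
                    {
                        team[a] = tb;
                        team[b] = ta;
                        improved = true;
                    }
                }
            }
        }
    }
}
```

Each accepted swap strictly lowers an integer score, so the loop terminates. With about 300 agents this naive version is cheap enough; caching per-team sums would speed it up if needed.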
Can't say for sure about the schedules, but in string algorithms you can find an edit distance calculation. The idea is to count the operations you need to perform to get one string from another. For example, the distance between kitten and sitting is 3: 2 substitutions and 1 insertion. I think that you can define a metric between two employees' schedules in a similar way.
Now, after you have a distance function, you may start clustering. The k-means algorithm may be a good start for you, but its main disadvantage is that the number of groups is fixed initially. But I think that you can easily adjust the general logic for your needs. After that, you may try some additional ways to cluster your data, but you really should start with your distance function, and then simply optimize it on your employee records.
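One possible distance function for schedules, sketched under the assumption that a week is stored as 7 * 48 half-hour blocks and that the distance is the number of blocks where exactly one of the two employees is working. The exact formula from the question is not shown, so treat this as an illustration; all names here are hypothetical.

```csharp
using System;

public static class ScheduleDistance
{
    public const int BlocksPerDay = 48;
    public const int BlocksPerWeek = 7 * BlocksPerDay;

    // Marks the half-hour blocks worked on one day (0 = Sunday).
    // startBlock is inclusive, endBlock exclusive: 7am-3pm is blocks 14 to 30.
    public static void AddShift(bool[] week, int day, int startBlock, int endBlock)
    {
        for (int b = startBlock; b < endBlock; b++)
            week[day * BlocksPerDay + b] = true;
    }

    // Number of half-hour blocks in which the two schedules differ.
    public static int Distance(bool[] a, bool[] b)
    {
        int diff = 0;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) diff++;
        return diff;
    }
}
```

With this definition, 7am-3pm against 8am-4pm on the same day gives a distance of 4 (two blocks at the start and two at the end), and a differing day off contributes the whole shift length.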

Detect unstable trend (timeseries)

I'm looking for a way to detect faulty sensors in an IOT environment.
In this case a tank level sensor. The readings are always fluctuating somewhat, and the "hop" at the beginning is a tank refill, which is "normal". On Sep 16 the sensor started to malfunction and has given apparently random values ever since.
As a programmer ideally I'd like a simple way of detecting the problem (and as soon after it starts as possible).
I can mess about with "if direction of vector between two hourly averages changes direction more than once per day it is unstable". But I guess there are more sound and stable algorithms out there.
Two simple options:
domain knowledge based: If you know the max possible output of the tank (say 5 liter/h), any output above that would signal an error. I.e. in case of the example, if
t1-t2 > 5
assuming t1 and t2 are the tank level readings at hourly intervals. You might want to add a safety margin for sensor accuracy.
past data based: Assuming that all tanks are similar regarding output capacity and used sensor quality, calculate the following for all your data of non-faulty sensors:
The result is the error threshold to be used, similar to the value 5 above.
Note: tank refill operation might require additional consideration.
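The threshold check could be sketched like this in C#. The method name, the margin parameter and the choice to ignore rising levels (so a refill does not trigger it) are assumptions made for illustration.

```csharp
using System;

public static class TankCheck
{
    // Returns the index of the first hourly reading whose drop from the
    // previous reading exceeds the physically possible output (plus a
    // safety margin), or -1 if no such reading exists.
    // Rising levels give a negative drop and are therefore never flagged.
    public static int FirstSuspectIndex(double[] hourlyLevels, double maxDropPerHour, double margin)
    {
        for (int i = 1; i < hourlyLevels.Length; i++)
        {
            double drop = hourlyLevels[i - 1] - hourlyLevels[i];
            if (drop > maxDropPerHour + margin)
                return i;
        }
        return -1;
    }
}
```

For example, with a maximum output of 5 per hour and a margin of 0.5, the readings 100, 97, 94, 80 are flagged at the fourth value, because the level cannot have dropped by 14 in one hour.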
Additional methods are described e.g. here. You can find other papers for sure.
Standard deviation.
You're looking at how much variation there is between the measurements. Standard deviation is an easy formula, and well known. Look for a high value, and you know there's a problem.
You can also use the coefficient of variation, which is the ratio of the standard deviation to the mean.
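A small sketch of the standard deviation approach over a sliding window. The window size and threshold are values you would have to pick from your own healthy data, and the names are hypothetical. Note that a refill "hop" inside a window will also raise the deviation, so it needs separate handling.

```csharp
using System;

public static class SensorStability
{
    // Population standard deviation of values[start .. start + count - 1].
    public static double StdDev(double[] values, int start, int count)
    {
        double mean = 0;
        for (int i = start; i < start + count; i++)
            mean += values[i];
        mean /= count;

        double sumSquares = 0;
        for (int i = start; i < start + count; i++)
            sumSquares += (values[i] - mean) * (values[i] - mean);
        return Math.Sqrt(sumSquares / count);
    }

    // Start index of the first window whose standard deviation exceeds
    // the threshold, or -1 if every window looks stable.
    public static int FirstUnstableWindow(double[] readings, int window, double threshold)
    {
        for (int start = 0; start + window <= readings.Length; start++)
            if (StdDev(readings, start, window) > threshold)
                return start;
        return -1;
    }
}
```

A shorter window detects the fault sooner but is more easily fooled by normal fluctuation.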

Linear regression on variables that does not scale directly with the output

I've been trying to follow a machine learning course on coursera. So far, most of the linear regression models introduced use variables whose numerical values have a positive correlation with the output.
Input: square feet of the house
Output: house price.
I am, however, trying to implement a multivariate regression model in which some of the variables have numerical values that are not directly proportional to the output.
-what day is it (Mon,Tues..),
-what holiday is it (NewYear,Xmas..),
-what month is it(Jan,Feb),
-what time is it(0100,1300..)
-Number of visitors.
For the variables: what day is it, what holiday is it, what month is it, I am using an enumeration and assign a value for each value. (NewYear =1, Christmas =2, etc.). Is it better to do it this way or have separate variables? (IsNewYear, IsChristmas, etc.)
I understand that by applying higher orders of power in a variable, it can have a better fit, which is what I want for the holidays variable. Are there any methods that I can use to let the computer learn the best order by itself?
Are there any existing C# libraries that I can use that allows different orders of power for different variable? (e.g. 13 for holidays and quadratic for the time of the day)
For the variables: what day is it, what holiday is it, what month is it, I am using an enumeration and assign a value for each value. (NewYear =1, Christmas =2, etc.). Is it better to do it this way or have separate variables? (IsNewYear, IsChristmas, etc.)
Use separate variables. You should never encode an order in a variable which does not follow arithmetic: NewYear=1, Christmas=2, Thanksgiving=3 would mean that Christmas=(Thanksgiving+NewYear) / 2... not something you would like to have. One-hot encoding (IsNewYear etc.) is preferable, so that you do not encode false knowledge.
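For illustration, one-hot encoding of a holiday variable might look like this in C#. The enum and its members are hypothetical; "no holiday" is encoded as all zeros here, which is one common convention.

```csharp
using System;

public enum Holiday { None, NewYear, Christmas, Thanksgiving }

public static class OneHot
{
    // Returns one 0/1 feature per real holiday, in enum order:
    // [IsNewYear, IsChristmas, IsThanksgiving].
    public static double[] Encode(Holiday holiday)
    {
        int columns = Enum.GetValues(typeof(Holiday)).Length - 1; // skip None
        double[] features = new double[columns];
        if (holiday != Holiday.None)
            features[(int)holiday - 1] = 1.0;
        return features;
    }
}
```

The regression then learns one independent weight per holiday instead of being forced to place Christmas halfway between the other two.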
I understand that by applying higher orders of power in a variable, it can have a better fit, which is what I want for the holidays variable. Are there any methods that I can use to let the computer learn the best order by itself?
This is what non-linear methods do. Kernel methods (kernelized linear regression, SVR), neural networks, regression trees/forests etc.
Are there any existing C# libraries that I can use that allows different orders of power for different variable? (e.g. 13 for holidays and quadratic for the time of the day)
You should not think about this in such terms. You are not supposed to fit powers by hand; rather, give the model enough flexibility to fit higher orders by itself (see the previous point).

Norms, rules or guidelines for calculating and showing "ETA/ETC" for a process

ETC = "Estimated Time of Completion"
I'm counting the time it takes to run through a loop and showing the user some numbers that tells him/her how much time, approximately, the full process will take. I feel like this is a common thing that everyone does on occasion and I would like to know if you have any guidelines that you follow.
Here's an example I'm using at the moment:
int itemsLeft; //This holds the number of items to run through.
double timeLeft;
TimeSpan TsTimeLeft;
List<double> avrage = new List<double>();
double milliseconds; //This holds the time each loop takes to complete, reset every loop.

//The background worker calls this event once for each item. The total number
//of items are in the hundreds for this particular application and every loop takes
//roughly one second.
private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    //An item has been completed! Log how long it took.
    avrage.Add(milliseconds);
    //Get an average time per item and multiply it with items left.
    timeLeft = avrage.Sum() / avrage.Count * itemsLeft;
    TsTimeLeft = TimeSpan.FromSeconds(timeLeft);
    this.Text = String.Format("ETC: {0}:{1:D2}:{2:D2} ({3:N2}s/file)",
        TsTimeLeft.Hours, TsTimeLeft.Minutes, TsTimeLeft.Seconds,
        avrage.Sum() / avrage.Count);
    //Only using the last 20-30 logs in the calculation to prevent an unnecessarily long List<>.
    if (avrage.Count > 30)
        avrage.RemoveRange(0, 10);
    milliseconds = 0;
}

//this.profiler.Interval = 10;
private void profiler_Tick(object sender, EventArgs e)
{
    //Despite the name, the counter is in seconds: 0.01 s per 10 ms tick.
    milliseconds += 0.01;
}
As I am a programmer at the very start of my career I'm curious to see what you would do in this situation. My main concern is the fact that I calculate and update the UI for every loop, is this bad practice?
Are there any do's/don't's when it comes to estimations like this? Are there any preferred ways of doing it, e.g. update every second, update every ten logs, calculate and update UI separately? Also when would an ETA/ETC be a good/bad idea.
The real problem with estimating the time taken by a process is quantifying the workload. Once you can quantify that, you can make a better estimate.
Examples of good estimates
File system I/O or network transfer. Whatever the performance of the file system or network, you can know the total number of bytes to be processed in advance and you can measure the speed. Once you have these, and can monitor how many bytes you have transferred, you get a good estimate. Random factors may affect your estimate (e.g. another application starts in the meantime), but you still get a meaningful value.
Encryption on large streams. For the reasons above. Even if you are computing an MD5 hash, you always know how many blocks have been processed, how many are to be processed and the total.
Item synchronization. This is a little trickier. If you can assume that the per-unit workload is constant, or you can make a good estimate of the time required to process an item because the variance is low or insignificant, then you can make another good estimate of the process. Take email synchronization: if you don't know the byte size of the messages (otherwise you fall into case 1) but experience tells you that most emails have roughly the same size, then you can use the mean of the time taken to download/upload all processed emails to estimate the time taken to process a single email. This won't work in 100% of cases and is subject to error, but you still see the progress bar advancing on a large account.
In general the rule is that you can make a good estimate of ETC/ETA (ETA is actually the date and time the operation is expected to complete) if you have a homogeneous process whose numbers you know. Homogeneity guarantees that the time to process a work item is comparable to the others, i.e. the time taken to process previous items can be used to estimate future ones. The numbers are used to make correct calculations.
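For a homogeneous process like this, a small estimator that averages the last few item times is usually enough. The following is a sketch with hypothetical names, not a recommendation of a specific window size.

```csharp
using System;
using System.Collections.Generic;

public class EtaEstimator
{
    private readonly Queue<double> _recent = new Queue<double>();
    private readonly int _window;
    private double _sum;

    // window = how many of the most recent item times to average over.
    public EtaEstimator(int window)
    {
        _window = window;
    }

    // Records how long the item that just finished took, in seconds.
    public void AddSample(double seconds)
    {
        _recent.Enqueue(seconds);
        _sum += seconds;
        if (_recent.Count > _window)
            _sum -= _recent.Dequeue();
    }

    // Mean time of the recent items multiplied by the items still to do.
    public TimeSpan Remaining(int itemsLeft)
    {
        if (_recent.Count == 0)
            return TimeSpan.Zero; // nothing measured yet
        return TimeSpan.FromSeconds(_sum / _recent.Count * itemsLeft);
    }
}
```

The per-item time can be measured with System.Diagnostics.Stopwatch instead of counting timer ticks. Keeping the estimator separate from the form also makes it easy to refresh the displayed text at whatever rate suits the UI, for example once per second, rather than once per item.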
Examples of bad estimates
Operations on a number of files of unknown size. This time you know only how many files you want to process (e.g. to download) but you don't know their size in advance. Once the sizes of the files have a high variance, you run into trouble. Having downloaded half of the files, when these were the smallest and sum up to 10% of the total bytes, can you be said to be halfway? No! You just see the progress bar growing fast to 50% and then much more slowly.
Heterogeneous processes. E.g. Windows installations. As pointed out by #HansPassant, Windows installations provide a worse-than-bad estimate. Installing Windows software involves several processes including: file copy (this can be estimated), registry modifications (usually never estimated), execution of transactional code. The real problem is the last. Transactional processes involving execution of custom installer code are discussed below.
Execution of generic code. This can never be estimated. A code fragment involves conditional statements. Executing them means taking different paths depending on conditions external to the code. This means, for example, that a program behaves differently depending on whether you have a printer installed or not, whether you have a local or a domain account, etc.
Estimating the duration of a software process is neither an impossible task nor an exact (deterministic) one.
It's not impossible because, even in the case of code fragments, you can either find a model for your code (take an LU factorization as an example, this may be estimated), or you might redesign your code, splitting it into an estimation phase - where you first determine the branch conditions - and an execution phase, where all pre-determined branches are taken. I said might because this task is in practice impossible: most code determines branches as effects of previous conditions, meaning that estimating a branch actually involves running the code. A chicken-and-egg problem.
It's not a deterministic process. Computer systems, especially multitasking ones, are affected by a number of random factors that may impact your estimated process. You will never get a correct estimate before running your process. At most, you can detect external factors and re-estimate your process. The gap between your estimate and the real duration of the process converges to zero as you get closer to the end of the process (lim [x->N] |est(x) - N| == 0, where N is the real process duration and est(x) is the estimate of it made at time x).
If your user interface is so obscure that you have to explain that ETC doesn't mean Etcetera then you are doing it wrong. Every user understands what a progress bar does, no explanation needed.
Nothing is quite as annoying as an inaccurate progress bar. Particularly ones that promise a quick finish but then don't deliver. I'd give the progress bar displayed by any installer on Windows as a good example of one that is fundamentally broken. Just not a shining example of an implementation that you should pursue.
Such a progress bar is broken because it is utterly impossible to guess up front how long it is going to take to install a program. File systems have very unpredictable perf. This is a very common problem with estimating execution time. Better UI models are the spinning dots you'd see in a video player and many programs in Windows 8. Or the marquee style supported by the common ProgressBar control. Just feedback that says "I'm not dead, working on it". Even the hour-glass cursor is better than a bad estimate. If you have something to report beyond a technicality that no user is really interested in then don't hesitate to display that. Like the number of files you've processed or the number of kilobytes you've downloaded. The actual value of the number isn't that useful, seeing the rate at which it increases is the interesting tidbit.

A nearest neighbour when edge costs are asymmetric, some doubts

To clarify my post, I have edited it based on comments.
I was thinking about how to implement a nearest neighbour search efficiently when edge costs are asymmetric. I have in mind somewhere between 100 and 12000 cities.
In more detail, as an example, there's a cost COST1 on travelling from city A to city B, e.g. by foot, and a cost COST1/10 to travel from B to A, e.g. by train. In other words, the problem I see here is that if I have an asymmetric matrix C representing the costs of travelling between cities and I select one point A, how could I efficiently discover, say, the three nearest neighbouring cities B1, B2 and B3 in terms of travelling cost? I would like to run the queries repeatedly. Preprocessing time, if not huge, is all right.
Pondering efficiency led me to think of something like a k-d tree, which facilitates finding k nearest neighbours in O(lg(n)) time when costs between cities are symmetric. This is the snag with a basic k-d tree in my case, as the travelling costs aren't in general the same in both directions between any two cities. The gist of the matter seems to be, then: how could I do something like k-nearest neighbours in the asymmetric case?
To remedy the aforementioned symmetry assumption, I thought that instead of just one tree, I could construct two trees so that the costs are calculated in both directions, and then run a search through both trees. Then I began to wonder: does anyone know if there's already something specifically for asymmetric costs, and/or would using two trees be totally astray as an idea?
It may also be that k-d trees in two dimensions aren't necessarily the best fit. So pointers to other data structures and algorithms are welcome too, especially if someone has practical experience with my problem size. Wikipedia lists quite a bunch of approaches, and maybe even an approximate solution is good enough for what I'm trying to do (this is for a smallish game for learning purposes).
For each pair of cities, calculate the cost of every available travel type (foot, train, ...), convert the costs to a common unit, compare them and take the minimum. You can then use that cost in the search algorithm.
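Once there is a single cost per direction, the simplest query is a scan of one row of the cost matrix. This sketch assumes a dense matrix where cost[from, to] is the cheapest cost from one city to another; the names are hypothetical, and it is a brute-force baseline rather than a tree-based index.

```csharp
using System;
using System.Linq;

public static class AsymmetricNeighbours
{
    // Returns the indices of the k cities that are cheapest to reach
    // from `from`. Only row `from` is read, so cost[to, from] never matters.
    public static int[] Nearest(double[,] cost, int from, int k)
    {
        int n = cost.GetLength(0);
        return Enumerable.Range(0, n)
            .Where(to => to != from)
            .OrderBy(to => cost[from, to])
            .Take(k)
            .ToArray();
    }
}
```

Each query sorts one row, which is O(n log n). If the same queries repeat, the sorted row for each city could be computed once up front and reused, at the price of preprocessing time and memory.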

