Linear regression on variables that do not scale directly with the output - c#

I've been following a machine learning course on Coursera. So far, most of the linear regression models introduced use variables whose numerical values have a positive correlation with the output.
Input: square feet of the house
Output: house price.
However, I'm trying to implement a multivariate regression model in which the numerical values of some variables are not directly proportional to the output.
Inputs:
-what day is it (Mon,Tues..),
-what holiday is it (NewYear,Xmas..),
-what month is it(Jan,Feb),
-what time is it(0100,1300..)
Output:
-Number of visitors.
Questions:
For the variables what day is it, what holiday is it, and what month is it, I am using an enumeration and assigning a number to each value (NewYear = 1, Christmas = 2, etc.). Is it better to do it this way or to have separate variables (IsNewYear, IsChristmas, etc.)?
I understand that by applying higher orders of power in a variable, it can have a better fit, which is what I want for the holidays variable. Are there any methods that I can use to let the computer learn the best order by itself?
Are there any existing C# libraries that allow different orders of power for different variables? (e.g. 13 for holidays and quadratic for the time of the day)
Thanks.

For the variables what day is it, what holiday is it, and what month is it, I am using an enumeration and assigning a number to each value (NewYear = 1, Christmas = 2, etc.). Is it better to do it this way or to have separate variables (IsNewYear, IsChristmas, etc.)?
Yes, use separate variables. You should never encode an order in a variable that does not follow arithmetic: NewYear=1, Christmas=2, Thanksgiving=3 would mean that Christmas = (Thanksgiving + NewYear) / 2, which is not something you would like to have. One-hot encoding (IsNewYear etc.) is preferable, so that you do not encode false knowledge.
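As a minimal sketch of the one-hot idea (written in Python for brevity; the same logic carries over to C#), each category becomes its own 0/1 input instead of one ordered number. The names `HOLIDAYS` and `one_hot` are made up for this illustration, and the list of holidays is only an example.

```python
# Hypothetical category list; extend it with whatever holidays you track.
HOLIDAYS = ["NewYear", "Christmas", "Thanksgiving"]

def one_hot(value, categories):
    """Return one 0/1 flag per category; all zeros if value matches none."""
    return [1 if value == category else 0 for category in categories]

# Christmas becomes [0, 1, 0] rather than the number 2, so the model
# gets a separate coefficient per holiday and no artificial ordering.
christmas_features = one_hot("Christmas", HOLIDAYS)

# A day that is not a holiday is simply all zeros.
ordinary_day_features = one_hot("NoHoliday", HOLIDAYS)
```

The same function can be reused for day of week and month by passing a different category list and concatenating the resulting lists into one input row.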
I understand that by applying higher orders of power in a variable, it can have a better fit, which is what I want for the holidays variable. Are there any methods that I can use to let the computer learn the best order by itself?
This is what non-linear methods do: kernel methods (kernelized linear regression, SVR), neural networks, regression trees/forests, etc.
Are there any existing C# libraries that allow different orders of power for different variables? (e.g. 13 for holidays and quadratic for the time of the day)
You should not think about this in such terms. You are not supposed to fit powers by hand; you should rather give the model the flexibility to fit high orders by itself (see the previous point).

Related

Generate closest teams based on employee schedules C#

I am given a csv of employee schedules with columns:
employee ID, first last name, sunday schedule, monday schedule, ... , saturday schedule
1 week schedule for each employee. I've attached a screenshot of a portion of the csv file. The total file has around 300 rows.
I need to generate teams of 15 based on the employees' schedules (locations don't matter) so that the employees on each team have the closest schedules to each other. Pseudocode of what I have tried:
parse csv file into array of schedules (my own struct definition)
match employees who have the same exact schedule into teams (creates ~5 full sized teams, 20 - 25 half filled teams, leaves ~50 schedules who don't match with anyone)
for i = 1 to 14, for each member of teams of size i, find the team with the closest schedule (as a whole) and add the member to that team. Once a team reaches size 15, mark them as "done".
This worked somewhat but definitely did not give me the best teams. My question is does anyone know a better way to do this? Pseudocode or just a general idea will help, thanks.
EDIT: Here is an example of the formula of comparison
The comparison is based on half-hour blocks of difference between the agents' schedules. Agent 25 has a score of 16 because he has a difference of 8 half hours with each of Agents 23 and 24. The team's total score is 32, based on everyone's scores added together.
Not all agents work 8 hour days, and many have different days off, which have the greatest effect on their "closeness" score. Also, a few agents have a different schedule on a certain day than their normal schedule. For example, one agent might work 7am - 3pm on mondays but work 8am - 4pm on tuesday - friday.
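The half-hour comparison described in the question can be sketched as follows (in Python for brevity). This assumes each schedule is stored as a set of half-hour block indices over the week and that the "difference" is the number of blocks worked by one agent but not the other; both are assumptions, as are the helper names and the sample shifts, which were chosen only to reproduce the scores of 16 and 32 quoted above.

```python
BLOCKS_PER_DAY = 48  # half-hour blocks in one day

def blocks(day, start_hour, end_hour):
    """Half-hour block indices worked on `day` (0 = Sunday) between two hours."""
    first = day * BLOCKS_PER_DAY + int(start_hour * 2)
    last = day * BLOCKS_PER_DAY + int(end_hour * 2)
    return set(range(first, last))

def distance(a, b):
    """Number of half-hour blocks worked by exactly one of the two agents."""
    return len(a ^ b)

def agent_score(agent, team):
    """Sum of an agent's distances to every other member of the team."""
    return sum(distance(team[agent], team[other]) for other in team if other != agent)

def team_score(team):
    """Sum of all members' scores (each pair is counted twice)."""
    return sum(agent_score(agent, team) for agent in team)

# Hypothetical Monday shifts: two agents on 7am-3pm, one on 9am-5pm.
team = {
    23: blocks(1, 7, 15),
    24: blocks(1, 7, 15),
    25: blocks(1, 9, 17),
}
```

A full weekly schedule is just the union of the per-day sets, so days off and one-off shift changes are handled without special cases.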
Unless you find a method that gets you an exact best answer, I would add a hill-climbing phase at the end that repeatedly checks to see if swapping any pair of agents between teams would improve things, and swaps them if this is the case, only stopping when it has rechecked every pair of agents and there are no more improvements to be made.
I would do this for two reasons:
1) Such hill-climbing finds reasonably good solutions surprisingly often.
2) People are good at finding improvements like this. If you produce a computer-generated schedule and people can find simple improvements (perhaps because they notice they are often scheduled at the same time as somebody from another team) then you're going to look silly.
Thinking about (2) another way to find local improvements would be to look for cases where a small number of people from different teams are scheduled at the same time and see if you can swap them all onto the same team.
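A minimal sketch of such a swap-based hill-climbing phase, in Python, might look like this. It assumes schedules are sets of half-hour block indices and that a team's cost is the sum of pairwise differences within it (each pair counted once, which is half the score used in the question and ranks teams the same way). The function names are made up for the illustration.

```python
from itertools import combinations

def team_cost(team, schedules):
    """Sum of pairwise schedule differences within one team."""
    return sum(len(schedules[a] ^ schedules[b]) for a, b in combinations(team, 2))

def total_cost(teams, schedules):
    return sum(team_cost(team, schedules) for team in teams)

def hill_climb(teams, schedules):
    """Swap pairs of agents between teams until no swap lowers the cost."""
    teams = [list(team) for team in teams]
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(teams)), 2):
            for x in range(len(teams[i])):
                for y in range(len(teams[j])):
                    before = team_cost(teams[i], schedules) + team_cost(teams[j], schedules)
                    teams[i][x], teams[j][y] = teams[j][y], teams[i][x]
                    after = team_cost(teams[i], schedules) + team_cost(teams[j], schedules)
                    if after < before:
                        improved = True  # keep the swap and re-check later
                    else:
                        # Undo swaps that do not strictly improve the cost.
                        teams[i][x], teams[j][y] = teams[j][y], teams[i][x]
    return teams

# Toy data: two morning agents and two evening agents, badly mixed at first.
schedules = {"a": {0, 1}, "b": {0, 1}, "c": {10, 11}, "d": {10, 11}}
result = hill_climb([["a", "c"], ["b", "d"]], schedules)
```

Because only strictly improving swaps are kept and the cost is a non-negative integer, the loop always terminates. Swapping preserves team sizes, so teams of 15 stay at 15.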
Can't say for sure about the schedules, but in string algorithms you can find an edit distance calculation. The idea is to count the number of operations needed to turn one string into another. For example, the distance between kitten and sitting is 3: 2 substitutions and 1 insertion. I think you can define a metric between two employees' schedules in a similar way.
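For reference, here is a short Python sketch of the standard (Levenshtein) edit distance, computed row by row with dynamic programming. The function name is chosen for this example; a schedule metric would replace characters with whatever unit you compare, such as half-hour blocks.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or free match
            ))
        previous = current
    return previous[-1]

kitten_to_sitting = edit_distance("kitten", "sitting")
```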
Once you have a distance function, you can start clustering. The k-means algorithm may be a good start, but its main disadvantage is that the number of groups is fixed in advance. I think you can easily adjust the general logic to your needs. After that, you may try other ways to cluster your data, but you really should start with your distance function and then simply optimize it on your employee records.

Naming convention for number range

What are some possible ways to name a variable representing a range of numbers? For example, I am working on a metrics application that displays the age of certain items in a person's queue. They are measured in
0-50 days
51-100 days
100+ days
I've thought about spelling the range out: zeroToFifty, range0-50. I've also considered naming them by "sections": first, second, third, but this doesn't prove to be very descriptive at all. What have you guys done to represent number ranges?
First, a name like ZeroToFifty isn't really very descriptive, hardly any better than if (number < 50). Variable names should provide more information if possible, while still being brief.
Second, I'd advise against embedding the numerical values into the constants - if you decide that the bottom range goes to 60 then a ZeroToFifty naming won't match any more. It will be much easier to adjust the values later if you don't have to refactor a name change throughout your codebase. Also, users of the constant probably don't care about 50, they care about "is it young or old?".
So you need to think "what do these number ranges represent"?
It depends on the usage, but you may find Young, Mature, Old works well for your case, as it describes the age of the item (and thus gives you strong clues about the meaning or usage of the value). Or maybe Modern, Classic, Vintage. Or Baby, Child, Adult. (If they "fit" the usage you have in mind).
In C# if you use an enumerated type, the typename must always be used, and that also can help clarify the meaning: ItemAge.Young/Mature/Old or TimeInQueue.Short/Medium/Long.
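A small sketch of this naming approach, shown in Python (a C# enum would read much the same). The thresholds come from the ranges in the question; treating exactly 100 days as the middle range is an assumption, since the question lists both 51-100 and 100+. The names `ItemAge`, `classify_age` and the two constants are illustrative only.

```python
from enum import Enum

# The boundaries live in one place, not in the names, so changing
# 50 to 60 later does not force a rename across the codebase.
YOUNG_MAX_DAYS = 50
MATURE_MAX_DAYS = 100

class ItemAge(Enum):
    YOUNG = "young"
    MATURE = "mature"
    OLD = "old"

def classify_age(days_in_queue):
    """Map an item's age in days to a named range."""
    if days_in_queue <= YOUNG_MAX_DAYS:
        return ItemAge.YOUNG
    if days_in_queue <= MATURE_MAX_DAYS:
        return ItemAge.MATURE
    return ItemAge.OLD
```

Callers then write `classify_age(item_days) is ItemAge.OLD` and never mention 50 or 100 directly.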

Methodologies or algorithms for filling in missing data

I am dealing with datasets with missing data and need to be able to fill forward, backward, and gaps. So, for example, if I have data from Jan 1, 2000 to Dec 31, 2010, and some days are missing, when a user requests a timespan that begins before, ends after, or encompasses the missing data points, I need to "fill in" these missing values.
Is there a proper term to refer to this concept of filling in data? Imputation is one term, don't know if it is "the" term for it though.
I presume there are multiple algorithms and methodologies for filling in missing data (use the last measured value, use the median/average/moving average between 2 known numbers, etc.).
Does anyone know the proper term for this problem, any online resources on this topic, or ideally links to open source implementations of some algorithms (C# preferably, but any language would be useful)?
The term you're looking for is interpolation. (obligatory wiki link)
You're asking for a C# solution with datasets but you should also consider doing this at the database level like this.
A simple, brute-force approach in C# could be to build an array of consecutive dates with your beginning and ending values as the min/max values. Then use that array to merge "interpolated" date values into your data set by inserting rows where there is no matching date for your date array in the dataset.
Here is an SO post that gets close to what you need: interpolating missing dates with C#. There is no accepted solution but reading the question and attempts at answers may give you an idea of what you need to do next. E.g. use the DateTime data in terms of Ticks (long value type) and then use an interpolation scheme on that data. Then convert the interpolated long values back to DateTime values.
The algorithm you use will depend a lot on the data itself, the size of the gaps compared to the available data, and its predictability based on existing data. It could also incorporate other information you might know about what's missing, as is common in statistics, when your actual data may not reflect the same distribution as the universe across certain categories.
Linear and cubic interpolation are typical algorithms that are not difficult to implement; try googling those.
Here's a good primer with some code:
http://paulbourke.net/miscellaneous/interpolation/
The context of the discussion in that link is graphics but the concepts are universally applicable.
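As a rough sketch of linear interpolation applied to a daily series (in Python; the idea is the same in C# with DateTime), the function below fills every day in a requested span. Two choices here are assumptions rather than anything from the linked primer: days before the first known value reuse the first value, and days after the last known value reuse the last one. The name `fill_missing` is made up for the example.

```python
from datetime import date, timedelta

def fill_missing(known, start, end):
    """Return {date: value} for every day from start to end inclusive.

    `known` maps dates to measured values and must not be empty.
    """
    days = sorted(known)
    filled = {}
    current = start
    while current <= end:
        if current in known:
            filled[current] = known[current]
        elif current < days[0]:
            filled[current] = known[days[0]]    # backward fill (assumption)
        elif current > days[-1]:
            filled[current] = known[days[-1]]   # forward fill (assumption)
        else:
            # Linear interpolation between the nearest known neighbours.
            before = max(d for d in days if d < current)
            after = min(d for d in days if d > current)
            fraction = (current - before).days / (after - before).days
            filled[current] = known[before] + fraction * (known[after] - known[before])
        current += timedelta(days=1)
    return filled

known = {date(2000, 1, 2): 10.0, date(2000, 1, 4): 20.0}
series = fill_missing(known, date(2000, 1, 1), date(2000, 1, 5))
```

The neighbour lookup is linear in the number of known points, which is fine for a sketch; a sorted list with binary search would be the obvious refinement for large datasets.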
For the purpose of feeding statistical tests, a good search term is imputation - e.g. http://en.wikipedia.org/wiki/Imputation_%28statistics%29

NHibernate - Money Type - Multiple Fields in Table

I am attempting to save financial transactions to a database using NHibernate and have come across a number of blog posts suggesting the use of a Money Type whereby the amount is stored as a double and the currency is stored as a string - i.e. there will be two fields in the database.
For my purposes, I will have multiple financial records in the same table - e.g. Unit Price, Tax in dollars, savings in dollars, etc. The above approach will work, but will result in duplicated data as there will be a column for the currency type of each of these (in this example 3) fields. This is unnecessary as the currency will always be the same for savings as it is for price, etc. - if it is dollars for one, it will be dollars for the other...
Has anyone run into a similar issue and, if so, can you tell me the solution you ended up with?
Thanks
JP
I have seen the same thing many places too, so far I have seen no explanation of why. What good is it? This is not how people do business or how a logical system works. If I am doing international business from the US, my systems will still have an internal basis of the dollar.
Currency is important as a boundary condition for an event, where you need an exchange rate. Even an international bank. I currently send all my drug money, USD, to the Caymans where my money is available to me also accounted in dollars. Maybe yours is in euros. So they get it, and what they do is convert both to whatever currency they use internally, sea shells whatever. Now they have to keep up with this stuff, but what they have to know at any given time is the exchange rate between dollars/sea shells to know how much they are in to me for, and sea shells/euros to know how much they are in to you for. This little goody does not do squat for them either it is a value object, how could it? Currency in this case would be fixed at the account level, not a bunch of these things floating around.
In general currency will be fixed by something else, like you observe, at a row level. A row of data is related; you should be able to do math on money values, in which case the currency in a row would have to be the same. Maybe I do business in Europe and have to quote in euros. I might want to record for historical purposes a quote versus payment in both currencies. I question if this is a single row design but if I decided it were, a Money object you describe is a single value object with two components. It should be considered a single entity, and I think what I am describing here is semantically different from a "Money" object and might as well be described explicitly by decimal/currency columns that we do not try to composite as a single Money value because they cannot be compared or have math done on them.
I would just not go this route because it leads to confusing, subtle inconsistencies in semantics and probably adds nothing. Where are you ever really going to use it that a row-wise currency would not give you the same thing, typically in a more rational fashion?
But if a manager insists you do something that is fundamentally pointless or worse, in terms of NHibernate you have declared a single atomic entity, Money that happens to have 2 components, a decimal and string. Because Money is now a single atomic unit, you have to always have both columns for each money, there is no workaround for avoiding the duplicate columns.

Possible Combination of Knapsack problem and?

Alright quick overview
I have looked into the knapsack problem
http://en.wikipedia.org/wiki/Knapsack_problem
and I know it is what I need for my project, but the complicated part of my project is that I need multiple sacks inside a main sack.
The large knapsack that holds all the "bags" can only carry x amount of "bags" (let's say 9 for the sake of example). Each bag has different values:
Weight
Cost
Size
Capacity
and so on. All of those values are integers; let's assume from 0 to 100.
The inner bag will also be assigned a type, and there can only be one of that type within the outer bag, although the program input will be given multiple of the same type.
I need to assign a maximum weight that the main bag can hold, and all other properties of the smaller bags need to be grouped by weighted values.
Example
Outer Bag:
Can hold 9 smaller bags
Weight no more than 98 [Give or take 5 either side]
Must hold one of each type, Can only hold one of each type at a time.
Inner Bags:
Cost, Weighted at 100%
Size, Weighted at 67%
Capacity, Weighted at 44%
The program will be given an input of multiple bags, and then must work out combinations of Smaller Bags to go into the larger bag, there will be multiple solutions depending on the input, and the program would output the best solutions for me.
I am wondering what you guys think the best way for me to approach this would be.
I will be programming it in either Java or C#. I would love to program it in PHP but I'm afraid the algorithm would be very inefficient for web servers.
Thanks for any help you can give
-Zack
Okay, well, knapsack is NP-hard so I'm pretty certain this will be NP-hard as well (if it weren't, you could solve knapsack by doing this with only one outer bag). So for an exactly optimal solution, you're probably going to be able to do no better than searching all combinations. So the outline of the program you want will be like
    for each possible combination
    do
        if current combination is better than best previous
            save current combination as best so far
        fi
    od
and the run time will be exponential. It sounds, though, like you might be able to get a near solution with dynamic programming.
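A minimal Python sketch of that exhaustive search is given below (the same structure translates directly to Java or C#). Several things here are assumptions made for the illustration: "better" means a higher weighted sum of cost, size and capacity using the percentages from the question, the weight rule is treated as a simple upper limit (pass 103 to allow the "give or take 5"), and bags are plain dictionaries with the field names shown.

```python
from itertools import product

# Weightings taken from the question's example.
WEIGHTS = {"cost": 1.00, "size": 0.67, "capacity": 0.44}

def bag_score(bag):
    """Weighted value of one inner bag (assumed: higher is better)."""
    return sum(bag[field] * weight for field, weight in WEIGHTS.items())

def best_combination(bags, max_weight):
    """Try every way of picking exactly one bag per type; keep the best valid one."""
    by_type = {}
    for bag in bags:
        by_type.setdefault(bag["type"], []).append(bag)

    best, best_score = None, float("-inf")
    for combo in product(*by_type.values()):  # one bag from each type
        if sum(bag["weight"] for bag in combo) > max_weight:
            continue  # outer bag would be too heavy
        score = sum(bag_score(bag) for bag in combo)
        if score > best_score:
            best, best_score = combo, score
    return best

# Hypothetical input with two types and two candidates each.
bags = [
    {"name": "A1", "type": "A", "weight": 50, "cost": 10, "size": 10, "capacity": 10},
    {"name": "A2", "type": "A", "weight": 60, "cost": 20, "size": 20, "capacity": 20},
    {"name": "B1", "type": "B", "weight": 40, "cost": 5, "size": 5, "capacity": 5},
    {"name": "B2", "type": "B", "weight": 45, "cost": 8, "size": 0, "capacity": 0},
]
chosen = best_combination(bags, max_weight=98)
```

The number of combinations is the product of the candidate counts per type, so with 9 types this grows quickly; it is workable for a handful of candidates per type and a reasonable baseline to check any faster method against.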
Consider using Prolog for your logic programming. There are multiple implementations of it, including P# on Mono (.NET). There's a bit of a learning curve, but once you get used to it, it's pretty much in a league of its own for this kind of problem solving.
Hope this helps. Cheers!
link to P#
