I have an optimization issue and I'm not sure where to go from here. I have a program that tries to find the combination of inputs that returns the highest predicted R-squared value. The problem is that I have 21 total inputs (a List) and I need to evaluate them in sets of 15 inputs. The formula for the total number of combinations is:
n! / (r!(n - r)!) = 21! / (15!(21 - 15)!) = 54,264 possible combinations
Obviously, running through every combination and calculating the predicted R-squared is not an ideal solution, so is there a better way/algorithm/method I can use to skip or narrow down the bad combinations so that I'm only processing the fewest number of combinations? Here is my current pseudo code for this issue:
public BestCombo GetBestCombo(List<List<MultipleRegressionInfo>> combosList)
{
    BestCombo bestCombo = new BestCombo();

    foreach (var combo in combosList)
    {
        var predRsquared = CalculatePredictedRSquared(combo);
        if (predRsquared > bestCombo.predRSquared)
        {
            bestCombo.predRSquared = predRsquared;
            bestCombo.BestRSquaredCombo = combo;
        }
    }
    return bestCombo;
}
public class BestCombo
{
    public double predRSquared { get; set; }
    public IEnumerable<MultipleRegressionInfo> BestRSquaredCombo { get; set; }
}

public class MultipleRegressionInfo
{
    public List<double> input { get; set; }
    public List<double> output { get; set; }
}
public double CalculatePredictedRSquared(List<MultipleRegressionInfo> combo)
{
    Matrix<double> matrix = BuildMatrix(combo.Select(i => i.input).ToArray());
    Vector<double> vector = BuildVector(combo.ElementAt(0).output);
    var coefficients = CalculateWithQR(matrix, vector);
    // input and output below stand in for the combo's inputs/outputs (pseudo code)
    var y = CalculateYIntercept(coefficients, input, output);
    var estimateList = CalculateEstimates(coefficients, y, input, output);
    return GetPredRsquared(estimateList, output);
}
54,264 is not enormous for a computer - it might be worth timing a few calls to compute R^2 and multiplying up to see just how long this would take.
There is a branch and bound algorithm for this sort of problem, which relies on the fact that R^2(A,B,C) >= R^2(A,B) - that is, the R^2 can only decrease when you drop a variable. Recursively search the space of all sets of variables of size at least 15. After computing the R^2 for a set of variables, make recursive calls with the sets produced by dropping a single variable from the set, where any such drop must be to the right of any existing gap (so A.CDE produces A..DE, A.C.E, and A.CD., but not ..CDE, which will be produced by .BCDE). You can terminate the recursion when you get down to the desired set size, or when you find an R^2 that is no better than the best answer so far.
If it happens that you often find R^2 values no better than the best answer so far, this will save time - but this is not guaranteed. You can attempt to improve the efficiency by choosing to investigate the sets with the highest R^2 first, hoping that you find a new best answer good enough to rule out their siblings by the time you come to them, and by using a procedure to calculate the R^2 for A.CDE that reuses the calculations you have already done for ABCDE.
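To make the recursion concrete, here is a minimal sketch of that branch and bound, assuming variables are identified by index 0..20 and that ComputeRSquared is a stand-in for whatever R^2 routine you already have:

double bestR2 = double.NegativeInfinity;
List<int> bestSet = null;

// vars is the current set of variable indices; minDrop enforces that each drop
// happens at or to the right of the previous gap, so no set is visited twice.
void Search(List<int> vars, int minDrop, int targetSize)
{
    double r2 = ComputeRSquared(vars); // placeholder for your own routine
    if (vars.Count == targetSize)
    {
        if (r2 > bestR2) { bestR2 = r2; bestSet = new List<int>(vars); }
        return;
    }
    if (r2 <= bestR2) return; // bound: R^2 only falls as variables are dropped
    for (int i = minDrop; i < vars.Count; i++)
    {
        var child = new List<int>(vars);
        child.RemoveAt(i);
        Search(child, i, targetSize);
    }
}

// Initial call: Search(Enumerable.Range(0, 21).ToList(), 0, 15);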
Consider a special condition where we want to generate location data with some random speed.
public class Location
{
    public double Lat { get; set; }
    public double Lng { get; set; }
    public int Speed { get; set; }
    public DateTime Date { get; set; }
}
The speed can be randomly generated using the Random.Next() method.
Now consider that the speed is limited to the range 1 to 200, and we want most of the Random.Next(1, 200) results to be in the range 1 to 120 (for example, with 160 locations, about 60% to 80% of the speeds should fall around 1 to 120, and the rest in the range of 120 to 200).
I know some bad and ugly ways, such as dividing the locations randomly into two lists and then generating speeds separately for each list, but I'm looking for a better and more efficient way.
Thanks!
Edit:
I should mention that there is a property called Date, which is of type DateTime and defines the time of a location's occurrence.
The list of locations being generated will form a path, so the generated speeds should be consistent with the locations and times to look right (for example, two consecutive locations can't have unrelated speeds, like 80 km/h for the first and 140 km/h for the second within a short time span of 30 seconds). The speed, date/time, and location together should look like a plausible, normal path.
Can't you just use two random numbers? The first to determine which range, and the second to choose a number in the appropriate range?
Random rng = new Random(); // This should only be created once, somewhere.

double proportionInLowerRange = 0.7;
int speed;
if (rng.NextDouble() <= proportionInLowerRange)
    speed = rng.Next(1, 121);
else
    speed = rng.Next(120, 201);
Note: the distribution is uniform within each range, so if you wanted a normal distribution this wouldn't work.
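If you did want a bell-shaped spread instead, one common option (a sketch, not part of the answer above; the mean and standard deviation here are purely illustrative) is the Box-Muller transform, clamped to the valid range:

// Roughly normal speed centred on 90, with a spread chosen so that
// most samples land in 1..120; mean/stdDev are illustrative values.
static int NextNormalSpeed(Random rng, double mean = 90, double stdDev = 35)
{
    double u1 = 1.0 - rng.NextDouble(); // avoid log(0)
    double u2 = rng.NextDouble();
    double z = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Sin(2.0 * Math.PI * u2);
    int speed = (int)Math.Round(mean + stdDev * z);
    return Math.Max(1, Math.Min(200, speed)); // clamp to 1..200
}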
I'm not good at stats, so I tried to solve a simple problem in C#. The problem: "A given team has a 65% chance to win a single game against another team. What is the probability that they will win a best-of-5 set?"
I wanted to look at the relationship between that probability and the number of games in the set. How does a Bo3 compare to a Bo5, and so on?
I did this by creating Set and Game objects and running iterations. The win decision is done with this code:
Won = rnd.Next(1, 100) <= winChance;
rnd is, as you might expect, a static System.Random object.
Here's my Set object code:
public class Set
{
    public int NumberOfGames { get; private set; }
    public List<Game> Games { get; private set; }

    public Set(int numberOfGames, int winChancePct)
    {
        NumberOfGames = numberOfGames;
        GamesNeededToWin = Convert.ToInt32(Math.Ceiling(NumberOfGames / 2m));
        Games = Enumerable.Range(1, numberOfGames)
            .Select(i => new Game(winChancePct))
            .ToList();
    }

    public int GamesNeededToWin { get; private set; }
    public bool WonSet => Games.Count(g => g.Won) >= GamesNeededToWin;
}
My issue is that the results I get aren't quite what they should be. Someone who sucks less at stats did the math for me, and it seems my code is always overestimating the chance of winning the set, and the number of iterations doesn't improve the accuracy.
The results I get (% set win by games per set) are below. The first column is the games per set, the next is the statistical win rate (which my results should be approaching), and the remaining columns are my results based on the number of iterations. As you can see, more iterations don't seem to be making the numbers more accurate.
Games Per Set | Expected Set Win Rate | 10K   | 100K  | 1M    | 10M
1             | 65.0%                 | 66.0% | 65.6% | 65.7% | 65.7%
3             | 71.8%                 | 72.5% | 72.7% | 72.7% | 72.7%
5             | 76.5%                 | 78.6% | 77.4% | 77.5% | 77.5%
7             | 80.0%                 | 80.7% | 81.2% | 81.0% | 81.1%
9             | 82.8%                 | 84.1% | 83.9% | 83.9% | 83.9%
The entire project is posted on github here if you want to look.
Any insight into why this isn't producing accurate results would be greatly appreciated.
Darren Sisson's answer is correct; your computation is off by approximately 1%, and so all your results are as well.
My recommendation is that you solve the problem by encapsulating your desired semantics into an object which you can then test independently:
interface IDistribution<T>
{
    T Sample();
}

static class Extensions
{
    public static IEnumerable<T> Samples<T>(this IDistribution<T> d)
    {
        while (true) yield return d.Sample();
    }
}

class Bernoulli : IDistribution<bool>
{
    // Note that we could also make it IDistribution<int> and return
    // 0 and 1 instead of false and true; that would be the more
    // "classic" approach to a Bernoulli distribution. Your choice.
    private double d;
    private Random random = new Random();
    private Bernoulli(double d) { this.d = d; }
    public static Bernoulli Make(double d) => new Bernoulli(d);
    public bool Sample() => random.NextDouble() < d;
}
And now you have a biased coin flipper which you can test independently. You can now write code like:
int flips = 1000;
int heads = Bernoulli
    .Make(0.65)
    .Samples()
    .Take(flips)
    .Where(x => x)
    .Count();
to do 1000 coin flips with a 65% chance of heads.
Notice that what we are doing here is constructing a probability distribution monad and then using the tools of LINQ to express a conditional probability. This is a powerful technique; your application barely scratches the surface of what we can do with it.
Exercise: construct extension methods Where, Select and SelectMany which take not IEnumerable<T> but rather IDistribution<T>; can you express the semantics of the distribution in terms of the distribution type itself, rather than making a transformation from the distribution monad to the sequence monad? Can you do the same for zip joins?
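As a hint at the first exercise, here is one possible shape for Select (a sketch, not the canonical answer): sampling the projected distribution just samples the underlying one and maps the result.

// Hypothetical helper type for the exercise; not part of the code above.
class Projected<A, R> : IDistribution<R>
{
    private readonly IDistribution<A> underlying;
    private readonly Func<A, R> projection;
    public Projected(IDistribution<A> underlying, Func<A, R> projection)
    {
        this.underlying = underlying;
        this.projection = projection;
    }
    public R Sample() => projection(underlying.Sample());
}

static class DistributionExtensions
{
    public static IDistribution<R> Select<A, R>(this IDistribution<A> d, Func<A, R> f)
        => new Projected<A, R>(d, f);
}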
Exercise: construct other implementations of IDistribution<T>. Can you construct, say, a Cauchy distribution of doubles? What about a normal distribution? What about a dice-rolling distribution on a fair die of n sides? Now, can you put this all together? What is the distribution which is: flip a coin; if heads, roll four dice and add them together, otherwise roll two dice and discard all the doubles, and multiply the results.
Quick look: Random.Next's upper bound is exclusive, so it would need to be set to 101.
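That is, the win check needs to be one of the following, so that exactly winChance outcomes out of 100 count as a win:

Won = rnd.Next(1, 101) <= winChance; // Next(1, 101) yields 1..100 inclusive
Won = rnd.Next(100) < winChance;     // equivalent zero-based form, 0..99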
I'm storing some data in a Math.NET vector, as I have to do some calculations with it as a whole. The data comes with time information about when it was collected. So, for example:
Initial = 5, Time 2 = 7, Time 3 = 8, Time 4 = 10
So when I store the data in a Vector it looks like this.
stateVectorData = [5,7,8,10]
Now sometimes I need to extract a single entry of the vector, but I don't have the index itself, only the time information. So what I'm trying is a dictionary holding the time and the index of the data in my state vector.
Dictionary<int, int> stateDictionary = new Dictionary<int, int>(); //Dict(Time, index)
Every time I get new data, I add an entry to the dictionary (and of course to the state vector). So at Time 2 I did:
stateDictionary.Add(2,1);
Now this works as long as I don't change my vector. Unfortunately, I have to delete an entry from the vector when it gets too old. Assume Time 2 is too old; I delete the second entry and end up with a resulting vector of:
stateVector = [5,8,10]
Now my dictionary has the wrong index values stored.
I can think of two possible solutions:
1. Loop through the dictionary and decrease every value (with key > 2) by 1.
2. What I think would be more elegant: store a reference to a vector entry in the dictionary instead of the index.
So something like
Dictionary<int, ref int> stateDictionary =
    new Dictionary<int, ref int>(); // Dict(Time, reference to vector entry)
stateDictionary.Add(2, ref stateVector[1]);
With something like this, I wouldn't care about deleting some entries in the vector, as I would still have references to the remaining entries. But I know it's not possible to store a reference like this in C#.
So my question is: is there any alternative to looping through the whole dictionary? Or is there another solution without a dictionary that I'm not seeing at the moment?
Edit to answer juharr:
The time information doesn't always increase by one; that depends on a parallel running process and how long it takes. It probably increases by 1 to 3, but it could be more.
There are some values in the vector that never get deleted. I tried to show this with the initial value of 5, which stays in the vector.
Edit 2:
The vector stores at least 5000 to 6000 elements. The maximum is not defined at the moment, as it is restricted by the number of elements I can handle in real time; in my case I have about 0.01 s for my further calculations. This is why I'm looking for an efficient way, so I can increase the number of elements in the vector (or increase the maximum "age" of my vector entries).
I need the whole vector for calculations about 3 times as often as I add a value.
Deleting an entry happens with the lowest frequency, and finding a single value by its time key will be the most common case, maybe 30 to 100 times a second.
I know this all sounds very vague, but the frequency of the finding and deleting parts depends on another process, which can vary a lot.
I hope you can help me. Thanks so far.
Edit 3:
@Robinson:
The exact number of times I need the whole vector also depends on the parallel process: at minimum twice every iteration (so twice per 0.01 s), at maximum at least 4 to 6 times every iteration.
Again, the size of the vector is what I want to maximize, so assume it to be very big.
Edit Solution:
First, thanks to all who helped me.
After experimenting a bit, I'm using the following construction.
I'm using a List in which I save the indexes into my state vector.
Additionally, I use a Dictionary to map my time key to the corresponding List entry.
So when I delete something from the state vector, I only loop over the List, which seems to be much faster than looping over the dictionary.
So it is:
stateVectorData = [5,7,8,10]
IndexList = [1,2,3];
stateDictionary = { Time 2 -> indexInList 0, Time 3 -> indexInList 1, Time 4 -> indexInList 2 }
TimeKey -> stateDictionary -> indexInList -> IndexList -> indexInStateVector -> data
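A sketch of that chain and of the delete-time fix-up (names are illustrative, not the actual code): the dictionary's values stay stable because they point at slots in the index list, and only the index list entries after the removed position need shifting.

double Lookup(int timeKey) => stateVector[indexList[stateDictionary[timeKey]]];

void Remove(int timeKey)
{
    int slot = stateDictionary[timeKey];
    int pos = indexList[slot];
    stateDictionary.Remove(timeKey);
    indexList[slot] = -1; // keep slots stable for the remaining keys
    for (int i = 0; i < indexList.Count; i++)
        if (indexList[i] > pos) indexList[i]--; // shift positions after the hole
    // ...then remove position `pos` from the state vector itself...
}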
You can try this:
public class Vector
{
    private List<int> _timeElements = new List<int>();

    public Vector(int[] times)
    {
        Add(times);
    }

    public void Add(int time)
    {
        _timeElements.Add(time);
    }

    public void Add(int[] times)
    {
        _timeElements.AddRange(times);
    }

    public void Remove(int time)
    {
        _timeElements.Remove(time);
        if (OnRemove != null)
            OnRemove(this, time);
    }

    public List<int> Elements { get { return _timeElements; } }

    public event Action<Vector, int> OnRemove;
}
public class Vectors
{
    private Dictionary<int, List<Vector>> _timeIndex;

    public Vectors(int maxTimeSize)
    {
        _timeIndex = new Dictionary<int, List<Vector>>(maxTimeSize);
        for (var i = 0; i < maxTimeSize; i++)
            _timeIndex.Add(i, new List<Vector>());
        List = new List<Vector>();
    }

    public List<Vector> FindVectorsByTime(int time)
    {
        return _timeIndex[time];
    }

    public List<Vector> List { get; private set; }

    public void Add(Vector vector)
    {
        List.Add(vector);
        vector.Elements.ForEach(element => _timeIndex[element].Add(vector));
        vector.OnRemove += OnRemove;
    }

    private void OnRemove(Vector vector, int time)
    {
        _timeIndex[time].Remove(vector);
    }
}
To use:
var vectors = new Vectors(maxTimeSize: 6000);
var vector1 = new Vector(new[] { 5, 30, 8, 20 });
var vector2 = new Vector(new[] { 25, 5, 23, 11 });
vectors.Add(vector1);
vectors.Add(vector2);
var findsTwo = vectors.FindVectorsByTime(time: 5);
vector1.Remove(time: 5);
var findsOne = vectors.FindVectorsByTime(time: 5);
The same can be done for adding times; the code is just for illustration purposes.
Please help. I've been trying to generate a random binary search tree of size 1024, and the elements need to be a random sorted set. I'm able to write code that creates a binary search tree by adding elements manually, but I'm unable to write code that would generate a random balanced binary tree of size 1024 and then try to find a key in that tree. Please help, and thank you ahead of time.
Edit: added code from comments.
Yes, it is homework... and this is what I've got so far as code:
using System;

namespace bst
{
    public class Node
    {
        public int value;
        public Node Right = null;
        public Node Left = null;

        public Node(int value)
        {
            this.value = value;
        }
    }

    public class BST
    {
        public Node Root = null;

        public BST() { }

        public void Add(int new_value)
        {
            if (Search(new_value))
            {
                Console.WriteLine("value (" + new_value + ") already exists");
            }
            else
            {
                AddNode(this.Root, new_value);
            }
        }

        // Search and AddNode are not shown in the question.
    }
}
Use recursion.
Each branch generates a new branch: select the middle item of the unsorted set, the median, and put it in the current item of the tree. Copy all items less than the median to another array and pass it to a recursive call of the same method; copy all items greater than the median to another array and pass it to another recursive call.
A balanced tree has to have an odd number of items, unless the root node is left unfilled. When two values tie for the median, you need to decide whether the duplicate belongs on the lower or the upper branch. I put duplicates on the upper branch in my example.
The median is the number with an equal count of numbers less than it and greater than it. In 1, 2, 3, 3, 4, 18, 29, 105, 123 the median is 4, even though the mean (or average) is much higher.
I didn't include code that determines the median.
// The original pseudo code, tightened into C#; DetermineMedian is
// deliberately left out, as noted above.
void BuildTreeItem(TreeItem item, List<int> set)
{
    int median = DetermineMedian(set);
    item.Value = median;
    if (set.Count == 1)
        return;

    var smalls = new List<int>();
    var larges = new List<int>();
    bool medianTaken = false;
    foreach (int value in set)
    {
        if (value < median)
            smalls.Add(value);
        else if (value == median && !medianTaken)
            medianTaken = true;  // the current item consumes one copy of the median
        else
            larges.Add(value);   // duplicates go to the upper branch
    }

    if (smalls.Count > 0)
    {
        item.Lower = new TreeItem();
        BuildTreeItem(item.Lower, smalls);
    }
    if (larges.Count > 0)
    {
        item.Upper = new TreeItem();
        BuildTreeItem(item.Upper, larges);
    }
}
Unless it is homework, the easiest solution would be to sort the data first and then build the tree by using the middle item as the root and descending down each half. The method proposed by Xaade is similar.
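A minimal sketch of that sort-then-bisect idea, reusing the Node class from the question:

static Node BuildBalanced(int[] sorted, int lo, int hi)
{
    if (lo > hi) return null;
    int mid = (lo + hi) / 2; // the middle item becomes the root of this subtree
    Node node = new Node(sorted[mid]);
    node.Left = BuildBalanced(sorted, lo, mid - 1);
    node.Right = BuildBalanced(sorted, mid + 1, hi);
    return node;
}

// Usage: Array.Sort(values); var root = BuildBalanced(values, 0, values.Length - 1);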
The other option is to look at algorithms that build balanced trees (like http://en.wikipedia.org/wiki/Red-black_tree) to see whether one fits your requirements.
EDIT: removed the incorrect statement about the speed of Xaade's algorithm - it is actually as fast as quicksort (n log n: each element is checked on every level of recursion, with log n levels of recursion); not sure why I estimated it slower.
I need to calculate the standard deviation of a generic list. I'll try to include my code. It's a generic list with data in it; the data is mostly floats and ints. Here is my code that is relevant to it, without getting into too much detail:
namespace ValveTesterInterface
{
    public class ValveDataResults
    {
        private List<ValveData> m_ValveResults;

        public ValveDataResults()
        {
            if (m_ValveResults == null)
            {
                m_ValveResults = new List<ValveData>();
            }
        }

        public void AddValveData(ValveData valve)
        {
            m_ValveResults.Add(valve);
        }
Here is the function where the standard deviation needs to be calculated:
        public float LatchStdev()
        {
            float sumOfSqrs = 0;
            float meanValue = 0;
            foreach (ValveData value in m_ValveResults)
            {
                meanValue += value.LatchTime;
            }
            meanValue = (meanValue / m_ValveResults.Count) * 0.02f;
            for (int i = 0; i <= m_ValveResults.Count; i++) // note: runs one past the end
            {
                // note: this line doesn't compile; it should square the difference
                // between each element's LatchTime and the mean
                sumOfSqrs += Math.Pow((m_ValveResults - meanValue), 2);
            }
            return Math.Sqrt(sumOfSqrs / (m_ValveResults.Count - 1));
        }
    }
}
Ignore what's inside the LatchStdev() function, because I'm sure it's not right; it's just my poor attempt to calculate the standard deviation. I know how to do it for a list of doubles, but not for a generic data list. If someone has experience with this, please help.
The example above is slightly incorrect and could have a divide-by-zero error if your population set is 1. The following code is somewhat simpler and gives the "population standard deviation" result (http://en.wikipedia.org/wiki/Standard_deviation).
using System;
using System.Linq;
using System.Collections.Generic;

public static class Extend
{
    public static double StandardDeviation(this IEnumerable<double> values)
    {
        double avg = values.Average();
        return Math.Sqrt(values.Average(v => Math.Pow(v - avg, 2)));
    }
}
This article should help you. It creates a function that computes the standard deviation of a sequence of double values; all you have to do is supply a sequence of appropriate data elements.
The resulting function is:
private double CalculateStandardDeviation(IEnumerable<double> values)
{
    double standardDeviation = 0;

    if (values.Any())
    {
        // Compute the average.
        double avg = values.Average();

        // Perform the Sum of (value - avg)^2.
        double sum = values.Sum(d => Math.Pow(d - avg, 2));

        // Put it all together.
        standardDeviation = Math.Sqrt(sum / (values.Count() - 1));
    }

    return standardDeviation;
}
This is easy enough to adapt for any generic type, as long as we provide a selector for the value being computed. LINQ is great for that: the Select function lets you project, from your generic list of custom types, the sequence of numeric values for which to compute the standard deviation:
List<ValveData> list = ...
var result = CalculateStandardDeviation(list.Select(v => (double)v.SomeField));
Even though the accepted answer seems mathematically correct, it is wrong from a programming perspective: it enumerates the same sequence four times. This might be OK if the underlying object is a list or an array, but if the input is a filtered/aggregated/etc. LINQ expression, or if the data is coming directly from a database or network stream, it would cause much lower performance.
I would highly recommend not reinventing the wheel and instead using one of the better open-source math libraries, Math.NET. We have been using that lib in our company and are very happy with the performance.
PM> Install-Package MathNet.Numerics
var populationStdDev = new List<double> { 1d, 2d, 3d, 4d, 5d }.PopulationStandardDeviation();
var sampleStdDev = new List<double> { 2d, 3d, 4d }.StandardDeviation();
See http://numerics.mathdotnet.com/docs/DescriptiveStatistics.html for more information.
Lastly, for those who want to get the fastest possible result and can sacrifice some precision, read about the "one-pass" algorithms: https://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods
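For reference, a minimal sketch of Welford's method, one of the one-pass algorithms described on that page:

public static double OnePassSampleStdDev(IEnumerable<double> values)
{
    long n = 0;
    double mean = 0.0, m2 = 0.0;
    foreach (double x in values)
    {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // running sum of squared deviations
    }
    return n > 1 ? Math.Sqrt(m2 / (n - 1)) : 0.0; // sample standard deviation
}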
I see what you're doing, and I use something similar. It seems to me you're not going far enough, though. I tend to encapsulate all data processing into a single class; that way I can cache the values that are calculated until the list changes.
For instance:
// Requires using System.Collections.Generic and System.Linq.
public class StatProcessor
{
    private List<double> _data = new List<double>(); // this holds the current data
    private double _avg;     // we cache the average here
    private bool _avgValid;  // a flag to say whether the cached average is still valid

    // Calculate the average of the list, cache it in _avg, and set the flag.
    private void CalcAvg()
    {
        _avg = _data.Average();
        _avgValid = true;
    }

    public double Average
    {
        get
        {
            if (!_avgValid)  // if we don't HAVE to calculate the average, skip it
                CalcAvg();   // if we do, go ahead, cache it, then set the flag
            return _avg;     // now _avg is guaranteed to be good, so return it
        }
    }

    // ...more stuff

    public void Add(double value)
    {
        _data.Add(value);    // add stuff to the list here,
        _avgValid = false;   // and reset the flag
    }
}
You'll notice that using this method, only the first request for the average actually computes it. After that, as long as we don't add anything to the list (or remove or modify anything, but those aren't shown), we can get the average for basically nothing.
Additionally, since the average is used in the algorithm for the standard deviation, computing the standard deviation first will give us the average for free, and computing the average first will give us a little performance boost in the standard deviation calculation, assuming we remember to check the flag.
Furthermore, places like the average function, where you're looping through every value already, are a great time to cache things like the minimum and maximum values. Of course, requests for this information need to check first whether they've been cached, and that can cause a relative slowdown compared to just finding the max directly from the list, since it does the extra work of setting up all the concerned caches, not just the one you're accessing.
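For example, the recalculation in the sketch above could fill several caches in one pass (assuming extra _min and _max fields, and a non-empty list):

private double _min, _max; // additional cached values

private void CalcStats()
{
    double sum = 0, min = double.MaxValue, max = double.MinValue;
    foreach (double v in _data)
    {
        sum += v;
        if (v < min) min = v;
        if (v > max) max = v;
    }
    _avg = sum / _data.Count;
    _min = min;
    _max = max;
    _avgValid = true;
}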