or-tools - Compute the stdev from a SumArray() - c#

I need to generate schedules for employees using Google's Optimization Tools.
One of the constraints is that every employee should have approximately the same number of working hours.
Thus, I want to aggregate in a list how many hours each employee is working, and then minimize the standard deviation of this list.
var workingTimes = new List<SumArray>();
foreach (var employee in employees) {
// Gather the duration of each task the employee is
// assigned to in a list
// o.IsAssigned is an IntVar and task[o.Task].Duration is an int
var allDurations = shifts.Where(o => o.Employee == employee.Name)
.Select(o => o.IsAssigned * task[o.Task].Duration);
// Total time the employee is working
var workTime = new SumArray(allDurations);
workingTimes.Add(workTime);
}
Now I want to minimize the stdev of workingTimes. I tried the following:
IntegerExpression workingTimesMean = new SumArray(workingTimes) * (1/workingTimes.Count);
var gaps = workingTimes.Select(o => (o - workingTimesMean)*(o - workingTimesMean));
var stdev = new SumArray(gaps) * (1/gaps.Count());
model.Minimize(stdev);
But the LINQ query at the 2nd line of the last code snippet is throwing me an error:
Can't apply operator * to IntegerExpression and IntegerExpression
How can I compute the standard deviation of a Google.OrTools.Sat.SumArray?

The 'natural' API only supports linear expressions.
You need to use the AddProductEquality() API.
Please note that 1 / Gaps.Count() will always return 0 (we are in integer arithmetic).
So you need to scale everything up.
Personally, I would just minimize the unscaled sum of abs(val - average). No need to divide by the number of elements.
Just check that the computation of the average has the right precision (once again, we are in integer arithmetic).
You could also consider just minimizing the max(abs(val - average)). This is simpler and may be good enough.
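For illustration, here is a minimal sketch of the sum-of-absolute-deviations idea. It assumes each employee's total has been materialized as an IntVar (e.g. via model.Add(workTimeVar == new SumArray(allDurations))), that horizon is a known upper bound on any total, and that workingTimeVars holds those IntVars; AddAbsEquality handles the abs() part, and exact method signatures may differ slightly between OR-Tools releases:

// Sketch only: workingTimeVars is a List<IntVar>, one per employee,
// each constrained to equal that employee's total working time.
int n = workingTimeVars.Count;

// total = sum of all working times. Instead of dividing by n, we compare
// n * workTime against total, i.e. everything is scaled up by n.
IntVar total = model.NewIntVar(0, (long)n * horizon, "total");
model.Add(total == LinearExpr.Sum(workingTimeVars));

var deviations = new List<IntVar>();
for (int i = 0; i < n; i++)
{
    // diff_i = n * workTime_i - total (zero when the employee works exactly the average)
    IntVar diff = model.NewIntVar(-(long)n * horizon, (long)n * horizon, $"diff_{i}");
    model.Add(diff == n * workingTimeVars[i] - total);

    // dev_i = |diff_i|
    IntVar dev = model.NewIntVar(0, (long)n * horizon, $"dev_{i}");
    model.AddAbsEquality(dev, diff);
    deviations.Add(dev);
}

// Minimize the (scaled) sum of absolute deviations from the average.
model.Minimize(LinearExpr.Sum(deviations));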

Related

Compare if elements are almost equal in a list in C# .NET

I am still very much a beginner in C# and .NET and just need to do this simple test.
var odds = new System.Collections.Generic.List<double>();
// here is a code which adds the values in the list
foreach(var odd in odds)
{
System.Console.WriteLine(odd);
}
and the output is something like this:
13.098252624859418
14.098252624859349
13.098252624859577
13.098252624853423
14.098252624859398
So I would like to compare all the values inside the list and check whether they are almost equal. That means even a little difference between the numbers (such as 13 and 14) is still acceptable, as long as the difference is at most 2.
Check the difference between the maximum value and the minimum value in the list against a tolerance value (2 in your case). For example:
double delta = 2;
// getting largest element
var maxNum = odds.Max();
// getting smallest element
var minNum = odds.Min();
var almostEqual = maxNum - minNum <= delta;
You'll need to do it manually, as is recommended with every floating-point comparison (because floating-point math is unintuitive). Doing that is quite simple, something like this:
var a = 13.098252624859418;
var b = 14.098252624859398;
// define your acceptable range, i.e. 1.0 means numbers up to 1.0 larger or smaller are considered equal
var delta = 1.0;
var areNearlyEqual = Math.Abs(a - b) <= delta; // true
Now if you want to check if every element in a List is nearly equal to every other element, there is a naïve and a more "complicated" solution. I'll start with the naïve one:
(Don't actually use this implementation, this is for illustration purposes of how to check equality of all items in a list which aren't just numbers)
var allAreNearlyEqual = true; // Let's start of assuming all are equal
foreach (var x in odds)
{
if (!allAreNearlyEqual)
break;
foreach (var y in odds)
{
if (Math.Abs(x - y) > delta)
allAreNearlyEqual = false;
}
}
Console.WriteLine(allAreNearlyEqual);
As you can see we need to iterate over every element in the list (x) and compare it to every other element in the list (y), there is an easier to read (and also faster*) version of this:
var max = odds.Max();
var min = odds.Min();
if (Math.Abs(max - min) <= delta)
Console.WriteLine("All items are nearly equal");
else
Console.WriteLine("Not all items are nearly equal");
(This takes advantage of the fact that all other elements between the min and max are also close enough to be nearly equal, if the min and max are)
You can check out the implementation for Max here to see how they do it, but basically it's just a foreach loop which returns the highest value found.
*The second version is faster, because it's O(2N) whereas the first version is O(N^2). I added the first version to illustrate how you could do the same thing on a list of objects which are not just numbers.
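If you need this check in more than one place, the min/max idea can be wrapped in a small extension method (just a sketch; the name AllNearlyEqual is made up):

using System.Collections.Generic;
using System.Linq;

public static class NearlyEqualExtensions
{
    // True when the spread of the whole collection stays within delta.
    public static bool AllNearlyEqual(this IReadOnlyCollection<double> values, double delta)
    {
        if (values.Count < 2) return true; // nothing to compare
        return values.Max() - values.Min() <= delta;
    }
}

// usage: var ok = odds.AllNearlyEqual(2.0);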

(ML.NET) How to train a dataset that doesn't contain labels

For a webshop I want to create a model that gives recommendations based on what is on someone's wishlist: a "someone who has X on their wishlist we also recommend Y" scenario. The issue is that the trainers don't work, due to a lack of proper Labels (which I do not have in my dataset) or a lack of enough data altogether. This results in either inaccurate results or prediction scores of float.NaN (either all or most scores end up like this).
At my disposal I have all existing wishlists with the corresponding ProfileId and ItemId's (both are integers). These are grouped in ProfileID-ItemID combinations (representing an item on a wishlist, so a user with 3 items will have 3 combinations). In total, there are around 150,000 combinations I can work with, for 16,000 users and 50,000 items. Items that only appear on a single wishlist (or not at all) or users with only one item on their wishlist are excluded from the training data (the above numbers are already filtered). If I want to, I could add extra columns of data representing the category an item is a part of (toys, books, etc.), prices and other metadata.
What I do not have are ratings, since the webshop doesn't use those. Therefore, I cannot use them to represent the "Label"
public class WishlistItem
{
// these variables are either uint32 or a Single (float) based on the training algorithm.
public uint ProfileId;
public uint ItemId;
public float Label;
}
What I expect I need to fix the issue:
A combination of, or any one of, the following three:
1) that I need to use a different trainer. If so, which would be best suited?
2) that I need to insert different values for the Label variable. If so, how should it be generated?
3) that I need to generate a different 'fake' dataset to pad the training data. If so, how should it be generated?
Explanation of the problem and failed attempts to remedy it
I have tried training on the data using different trainers to see what would work best for my dataset: the FieldAwareFactorizationMachine, the MatrixFactorizationMachine and the OLSTrainer. I've also tried to use the MatrixFactorizationMachine with LossFunctionType.SquareLossOneClass, where, rather than ProfileID-ItemID combinations, combinations of ItemIds on a wishlist are inserted (e.g. item1-item2, item2-item3, item1-item3 from a wishlist where 3 items are present).
The machines are based on information found in their respective tutorials:
FieldAware: https://xamlbrewer.wordpress.com/2019/04/23/machine-learning-with-ml-net-in-uwp-field-aware-factorization-machine/
MatrixFactorization: https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/movie-recommendation
MatrixFactorization (OneClass): https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
OLS: https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.mklcomponentscatalog.ols?view=ml-dotnet
Here is an example of one of the pipelines, the others are very similar:
string profileEncoded = nameof(WishlistItem.ProfileId) + "Encoded";
string itemEncoded = nameof(WishlistItem.ItemId) + "Encoded";
// the Matrix Factorization pipeline
var options = new MatrixFactorizationTrainer.Options {
MatrixColumnIndexColumnName = profileEncoded,
MatrixRowIndexColumnName = itemEncoded,
LabelColumnName = nameof(WishlistItem.Label),
NumberOfIterations = 100,
ApproximationRank = 100
};
trainerEstimator = Context.Transforms.Conversion.MapValueToKey(outputColumnName: profileEncoded, inputColumnName: nameof(WishlistItem.ProfileId))
.Append(Context.Transforms.Conversion.MapValueToKey(outputColumnName: itemEncoded, inputColumnName: nameof(WishlistItem.ItemId)))
.Append(Context.BinaryClassification.Trainers.FieldAwareFactorizationMachine(new string[] { "Features" }));
In order to mitigate the issue of lacking labels, I've tried several workarounds:
leaving them blank (a 0f float value)
using the hashcodes of the itemid, profileid or a combination of both
counting the number of times a specific itemid or profileid appears, also manipulating that figure to create less extreme values in case an item is represented hundreds of times (using a square root or a log function, e.g. Label = Math.Log(amountoftimes) or Label = Math.Ceiling(Math.Log(amountoftimes)))
for the FieldAware machine, where the Label is a Boolean rather than a float, the calculation above is used to determine whether the result is above or below the average for all items
When testing, I use one of the following 2 methods to determine which recommendations "Y" can be created for item "X":
compare ItemID X to all existing items, with the ProfileID of the user.
List<WishlistItem> predictionsForUser = profileMatrix.DistinctBy(x => x.ItemID).Select(x => new WishlistItem(userId, x.GiftId, x.Label));
IDataView transformed = trainedModel.Transform(Context.Data.LoadFromEnumerable(predictionsForUser));
CoPurchasePrediction[] predictions = Context.Data.CreateEnumerable<CoPurchasePrediction>(transformed, false).ToArray();
IEnumerable<KeyValuePair<WishlistItem, CoPurchasePrediction>> results = Enumerable.Range(0, predictions.Length).ToDictionary(x => predictionsForUser[x], x => predictions[x]).OrderByDescending(x => x.Value.Score).Take(10);
return results.Select(x => x.Key.GiftId.ToString()).ToArray();
Compare the ItemID X to items on other people's wishlists where X is also present. This one is used for the FieldAware Factorization Trainer, which uses a Boolean as Label.
public IEnumerable<WishlistItem> CreatePredictDataForUser(string userId, IEnumerable<WishlistItem> userItems)
{
Dictionary<string, IEnumerable<WishlistItem>> giftIdGroups = profileMatrix.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => x.Select(y => y));
Dictionary<string, IEnumerable<WishlistItem>> profileIdGroup = profileMatrix.GroupBy(x => x.ProfileId).ToDictionary(x => x.Key, x => x.Select(y => y));
profileIdGroup.Add(userId, userItems);
List<WishlistItem> results = new List<WishlistItem>();
foreach (WishlistItem wi in userItems)
{
IEnumerable<WishlistItem> giftIdGroup = giftIdGroups[wi.GiftId];
foreach(WishlistItem subwi in giftIdGroup)
{
results.AddRange(profileIdGroup[subwi.ProfileId]);
}
}
IEnumerable<WishlistItem> filtered = results.ExceptBy(userItems, x => x.GiftId);
// get duplicates
Dictionary<string, float> duplicates = filtered.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => giftLabelValues[x.First().GiftId]);
float max = duplicates.Values.Max();
return filtered.DistinctBy(x => x.GiftId).Select(x => new WishlistItem(userId, x.GiftId, duplicates[x.GiftId] * 2 > max));
}
However, the testing data remains either completely or partially unusable (float.NaN), or always creates the same recommendation results (we recommend Y and Z for item X) regardless of the item inserted.
When evaluating the data using a test data view (DataOperationsCatalog.TrainTestData split = Context.Data.TrainTestSplit(data, 0.2)), it either shows promising results with high accuracy or random values all over the place, and it doesn't add up with the results I'm getting; high accuracy still results in float.NaN or 'always the same'.
Online it is pointed out that float.NaN may be the result of a small dataset. To compensate, I have tried creating 'fake' datasets: profile-item combinations (with label 0f or false, while the rest is 0f+ or true) that are randomly generated based on existing profileids and itemids. (It is checked beforehand that this random 'negative' data isn't also a 'real' combination set by accident.) However, this has shown little to no effect.
I don't think any of the solutions you have tried will work because, as you have pointed out, you do not have any label data. Faking the label data will not work either, as the ML algorithm will simply train on the faked labels.
What I believe you are looking for is a One-Class Matrix Factorization algorithm.
Your "label" or "score" is implicit - the fact that the item is in the user's wishlist itself indicates the label - that the user has an interest in the item. The One-Class Matrix Factorization uses this kind of implicit labelling.
Have a read through this article:
https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
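A hedged sketch of what that setup could look like, reusing the Context, profileEncoded and itemEncoded names from the question (the Alpha, Lambda and C values are illustrative and not tuned; every ProfileId-ItemId pair that exists in the wishlist data simply gets Label = 1, and absent pairs act as the implicit negative class):

// Sketch: one-class matrix factorization on implicit wishlist data.
var oneClassOptions = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = profileEncoded,
    MatrixRowIndexColumnName = itemEncoded,
    LabelColumnName = nameof(WishlistItem.Label), // constant 1 for every observed pair
    LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass,
    Alpha = 0.01,     // illustrative values, not tuned
    Lambda = 0.025,
    C = 0.00001,
    NumberOfIterations = 100,
    ApproximationRank = 100
};

var oneClassPipeline = Context.Transforms.Conversion
        .MapValueToKey(outputColumnName: profileEncoded, inputColumnName: nameof(WishlistItem.ProfileId))
    .Append(Context.Transforms.Conversion
        .MapValueToKey(outputColumnName: itemEncoded, inputColumnName: nameof(WishlistItem.ItemId)))
    .Append(Context.Recommendation().Trainers.MatrixFactorization(oneClassOptions));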
What you are looking for is a classic recommender system solution. Recommender systems are accustomed to missing and sparse data. There are many approaches to solve this problem, and I recommend starting with this article. Generally, there are two approaches in recommender systems - model-based and memory-based. In my experience, model-based methods perform much better than memory-based ones. There's a nice summary here regarding the different models and solutions. Look at the matrix factorization solution by Koren and Bell here which works very well in many cases.

Custom option to Search a Sorted list faster than Plain Binary Search

Following is the use-case:
Sorted List of DateTime type, with granularity in the millisecond
Search for nearest DateTime, which satisfy the supplied predicate delegate
Performance is an issue: the list has 100K+ records, the total time span from the minimum to the maximum index is 10 hours, and frequent calls (50+ per run) impact performance
What we currently do is a custom binary search, as follows:
public static int BinaryLastOrDefault<T>(this IList<T> list, Predicate<T> predicate)
{
var lower = 0;
var upper = list.Count - 1;
while (lower < upper)
{
var mid = lower + ((upper - lower + 1) / 2);
if (predicate(list[mid]))
{
lower = mid;
}
else
{
upper = mid - 1;
}
}
if (lower >= list.Count) return -1;
return !predicate(list[lower]) ? -1 : lower;
}
Can I use a Dictionary to make it O(1)?
My understanding is no, since the input value may not be there, and in that case we need to return the closest value (in the code above, if -1 is returned, the last element in the sorted list is the expected result).
Following is the option I am considering
Data structure like Dictionary<int,SortedDictionary<DateTime,int>>
The total DateTime span between the highest and the lowest value is 10 hours ~ 10 * 3600 * 1000 ms = 36 million ms
With buckets of 60 sec (60K ms) each, the total number of buckets ~ 36 million / 60 K = 600
For any supplied DateTime value, it is now easy to find the bucket, where a limited number of values is stored in a SortedDictionary with the DateTime value as key and the original index as value; if required, the data can then be enumerated to find the closest index
In my understanding, this implementation will make the search much faster than the binary search detailed above, since the data searched is substantially reduced (a sketch of this bucketing idea is shown below). Any suggestion what more can be done to improve the search time further in algorithmic terms? I can try the Parallel options for the various independent calls separately.
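For reference, a hypothetical sketch of the bucketing idea described above (timestamps stands in for the sorted List<DateTime>; real code would also handle lookups near bucket boundaries):

// Build 60-second buckets: bucket number -> (DateTime -> original index).
var buckets = new Dictionary<int, SortedDictionary<DateTime, int>>();
DateTime origin = timestamps[0];
for (int i = 0; i < timestamps.Count; i++)
{
    int bucket = (int)((timestamps[i] - origin).TotalSeconds / 60);
    if (!buckets.TryGetValue(bucket, out var entries))
        buckets[bucket] = entries = new SortedDictionary<DateTime, int>();
    entries[timestamps[i]] = i;
}
// A lookup maps the query DateTime to its bucket number the same way and then
// scans that bucket (and, near its edges, the neighbouring buckets) for the
// closest key, returning the stored original index.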
I made some performance tests using the native BinarySearch method of List<T>. The logic for finding the nearest DateTime is shown below:
public static DateTime GetNearest(List<DateTime> source, DateTime date)
{
var index = source.BinarySearch(date);
if (index >= 0) return source[index];
index = ~index;
if (index == 0) return source[0];
if (index == source.Count) return source[source.Count - 1];
var d1 = source[index - 1];
var d2 = source[index];
return (date - d1 < d2 - date) ? d1 : d2;
}
I created a random list of 1,000,000 sorted dates, covering a time span of 10 hours from min to max. Then I created an equally sized list with unsorted random dates to search, covering a slightly larger time span. Then I changed the build to Release and started the test. The result demonstrated that it is possible to make more than 800,000 searches in less than a second, using only a single core of a relatively slow machine.
Then I increased the complexity of the test by searching in a List<(DateTime, object)> containing 1,000,000 elements, so that each comparison needs two extra calls to a dateSelector function, which returns the DateTime property of each ValueTuple.
The result: 350,000 searches per thread per second.
I increased the complexity even further by using reference types as elements, populating a List<Tuple<DateTime, object>> with 1,000,000 tuples. The performance was still pretty decent: 270,000 searches per thread per second.
My conclusion is that the BinarySearch method is lightning fast, and it would be surprising if it was found to be the bottleneck of an application.
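For reference, a selector-based variant like the one used in the later tests could look roughly like this (a sketch, not the exact benchmark code; it assumes the list is non-empty and sorted by the selected DateTime):

public static T GetNearest<T>(List<T> source, DateTime date, Func<T, DateTime> dateSelector)
{
    // classic binary search over the selected DateTime values
    int lower = 0, upper = source.Count - 1;
    while (lower <= upper)
    {
        int mid = lower + (upper - lower) / 2;
        var midDate = dateSelector(source[mid]);
        if (midDate == date) return source[mid];
        if (midDate < date) lower = mid + 1;
        else upper = mid - 1;
    }

    // lower is now the index of the first element greater than date
    if (lower == 0) return source[0];
    if (lower == source.Count) return source[source.Count - 1];

    var d1 = dateSelector(source[lower - 1]);
    var d2 = dateSelector(source[lower]);
    return (date - d1 < d2 - date) ? source[lower - 1] : source[lower];
}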

C# - Best Way to Match 2 items from a List Without Nested Loops

Say I have a list that holds film durations in minutes, called filmDurations, of type int.
And I have an int parameter called flightDuration for the duration of any given flight in minutes.
My objective is:
For any given flightDuration, I want to match 2 films from filmDurations whose durations sum to exactly 30 minutes less than the flight duration.
For example :
filmDurations = {130,105,125,140,120}
flightDuration = 280
My output : (130 120)
I can do it with nested loops, but it is not efficient and it is time consuming.
I want to do it more efficiently.
I thought about using LINQ, but it is still O(n^2).
What is the most efficient way?
Edit: I want to clarify one thing.
I want to find filmDurations[i] and filmDurations[j] such that:
filmDurations[i] + filmDurations[j] == flightDuration - 30
And say I have a very large number of film durations.
You could sort all durations (removing duplicates) in O(n log n) and then iterate through them (up to the length flightDuration - 30), searching for the corresponding length of the second film (O(log n)).
This way you get all duration-pairs in O(n log n).
You can also use a HashMap (duration -> Films) to find matching pairs.
This way you can avoid sorting and binary search. Iterate through all durations and look up in the map whether there is an entry with duration = (flightDuration - 30 - currentDuration).
Filling the map takes O(n), each lookup is O(1), and you need to iterate over all durations.
-> Overall complexity O(n), but you lose the possibility of finding 'nearly matching' pairs (which would be easy to implement using the sorted-list approach described above).
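A minimal sketch of the sorted approach, using a two-pointer scan instead of a per-element binary search (same idea once the list is sorted; it returns the matching durations rather than their original indices):

// Returns a pair of durations summing to flightDuration - 30, or null if none exists.
public static (int, int)? FindPair(List<int> filmDurations, int flightDuration)
{
    int target = flightDuration - 30;
    var sorted = filmDurations.OrderBy(d => d).ToList();

    int lo = 0, hi = sorted.Count - 1;
    while (lo < hi)
    {
        int sum = sorted[lo] + sorted[hi];
        if (sum == target) return (sorted[lo], sorted[hi]);
        if (sum < target) lo++;   // need a larger sum
        else hi--;                // need a smaller sum
    }
    return null;
}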
As Leisen Chang said, you can put all items into a dictionary. After doing that, rewrite your equation
filmDurations[i] + filmDurations[j] == flightDuration - 30
as
filmDurations[i] == (flightDuration - 30 - filmDurations[j])
Now for each item in filmDurations, search for (flightDuration - 30 - filmDurations[j]) in the dictionary. If such an item is found, you have a solution.
The following code implements this concept:
public class IndicesSearch
{
    private readonly List<int> filmDurations;
    private readonly Dictionary<int, int> valuesAndIndices;

    public IndicesSearch(List<int> filmDurations)
    {
        this.filmDurations = filmDurations;

        // preprocessing O(n); for duplicate durations the last index wins
        valuesAndIndices = new Dictionary<int, int>();
        for (var i = 0; i < filmDurations.Count; ++i)
            valuesAndIndices[filmDurations[i]] = i;
    }

    public (int, int) FindIndices(
        int flightDuration,
        int diff = 30)
    {
        // search, also O(n)
        for (var i = 0; i < filmDurations.Count; ++i)
        {
            var filmDuration = filmDurations[i];
            var toFind = flightDuration - filmDuration - diff;

            // j != i guards against pairing a film with itself
            if (valuesAndIndices.TryGetValue(toFind, out var j) && j != i)
                return (i, j);
        }

        // no solution found
        return (-1, -1); // or throw exception
    }
}

Determining % of time above a certain value in a dataset

I have a dataset of voltages (sampled every 500ms). Let's say it looks something like this (in an array):
0ms -> 1.4v
500ms -> 1.3v
1000ms -> 1.2v
1500ms -> 1.5v
2000ms -> 1.3v
2500ms -> 1.3v
3000ms -> 1.2v
3500ms -> 1.3v
Assuming the transition between readings is linear (IE: 250ms = 1.35v), how would I go about calculating the total % of time that the reading is above or equal to 1.3v?
I was initially going to just get the % of values that are >= 1.3v (i.e. 6/8 in the sample array); however, this only works if the angle between points is 45 degrees. I am assuming I have to do something like create a line from point 1 to point 2 and find the intercept with the base line (1.3v), then do the same for point 2 and point 3 and find the distance between both intercepts (say 700ms), then repeat for all points and get that as a % of the total sample time.
EDIT
Maybe I wasn't clear when I initially asked. I need help with identifying how I can perform these calculations, i.e. objects/classes that I can use to help me virtually graph these lines and perform these calculations, or any 3rd party math packages that might offer these capabilities.
The important part is not to think in data points, but in intervals. Every interval (e.g. 0-500, 500-1000, ...) is one of three cases (starting with float variables above and below, both initialized to 0):
Trivial: Both start and end point are below your threshold - below += 1
Trivial: Both start and end point are above your threshold - above += 1
Interesting: One point is below, one above your threshold. Let's call the smaller value min and the higher value max. Now we do above += (max-threshold)/(max-min) and below += (threshold-min)/(max-min), so we linearly distribute this interval between both states.
Finally, normalize the results by dividing both above and below by the number of intervals. This will give you a pair of numbers that represent the fractions, i.e. they add up to 1 modulo rounding errors. Of course, multiplying by 100 gives you the percentages.
EDIT
#phoog pointed out in the comments that I did not mention an "equal" case. This is by design, as your question already contains that: you chose >= as a comparison, so I definitely meant to use the same comparison here.
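A minimal sketch of that interval logic, assuming equally spaced samples (readings is a hypothetical array holding just the voltage column in time order):

// Fraction of total time at or above the threshold, with linear interpolation
// inside the intervals that cross it.
public static double FractionAboveOrEqual(double[] readings, double threshold)
{
    double above = 0, below = 0;
    for (int i = 0; i < readings.Length - 1; i++)
    {
        double start = readings[i], end = readings[i + 1];
        if (start >= threshold && end >= threshold) above += 1;      // trivial: whole interval above
        else if (start < threshold && end < threshold) below += 1;   // trivial: whole interval below
        else
        {
            // crossing interval: split it linearly between both states
            double min = Math.Min(start, end), max = Math.Max(start, end);
            above += (max - threshold) / (max - min);
            below += (threshold - min) / (max - min);
        }
    }
    int intervals = readings.Length - 1;
    return above / intervals; // multiply by 100 for a percentage
}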
If I've understood the problem correctly, you can use a class like this to hold each entry:
public class DataEntry
{
public DataEntry(int time, double reading)
{
Time = time;
Reading = reading;
}
public int Time { get; set; }
public double Reading { get; set; }
}
And then the following LINQ statement to get segments above 1.3:
var entries = new List<DataEntry>()
{
new DataEntry(0, 1.4),
new DataEntry(500, 1.3),
new DataEntry(1000, 1.2),
new DataEntry(1500, 1.5),
new DataEntry(2000, 1.3),
new DataEntry(2500, 1.3),
new DataEntry(3000, 1.2),
new DataEntry(3500, 1.3)
};
double totalTime = entries
.OrderBy(e => e.Time)
.Take(entries.Count - 1)
.Where((t, i) => t.Reading >= 1.3 && entries[i + 1].Reading >= 1.3)
.Sum(t => 500);
var perct = (totalTime / entries.Max(e => e.Time));
This should give you the 500ms segments that remained above 1.3.
