(ML.NET) How to train a dataset that doesn't contain labels - C#

For a webshop I want to create a model that gives recommendations based on what is on someone's wishlist: a "for someone who has X on their wishlist, we also recommend Y" scenario. The issue is that the trainers don't work, either due to a lack of proper labels (which my dataset does not contain) or due to a lack of data altogether. The result is either inaccurate predictions or prediction scores of float.NaN (all or most scores end up like this).
At my disposal I have all existing wishlists with the corresponding ProfileIds and ItemIds (both are integers). These are grouped into ProfileId-ItemId combinations (each representing an item on a wishlist, so a user with 3 items will have 3 combinations). In total, there are around 150,000 combinations I can work with, for 16,000 users and 50,000 items. Items that appear on only a single wishlist (or on none at all) and users with only one item on their wishlist are excluded from the training data (the numbers above are already filtered). If I wanted to, I could add extra columns of data representing the category an item belongs to (toys, books, etc.), prices and other metadata.
What I do not have are ratings, since the webshop doesn't use them. Therefore, I cannot use ratings to represent the "Label".
public class WishlistItem
{
    // these variables are either a UInt32 or a Single (float) depending on the training algorithm.
    public uint ProfileId;
    public uint ItemId;
    public float Label;
}
What I expect I need to fix the issue:
A combination of, or one of, the following three:
1) a different trainer. If so, which would be best suited?
2) different values for the Label variable. If so, how should they be generated?
3) a 'fake' dataset to pad the training data. If so, how should it be generated?
Explanation of the problem and failed attempts to remedy it
I have tried to parse the data using different trainers to see what would work best for my dataset: the FieldAwareFactorizationMachine, the MatrixFactorization trainer and the OLS trainer. I've also tried the MatrixFactorization trainer with LossFunctionType.SquareLossOneClass, where, rather than ProfileId-ItemId combinations, combinations of ItemIds that appear on the same wishlist are inserted (e.g. item1-item2, item2-item3, item1-item3 from a wishlist with 3 items).
The pipelines are based on information found in their respective tutorials:
FieldAware: https://xamlbrewer.wordpress.com/2019/04/23/machine-learning-with-ml-net-in-uwp-field-aware-factorization-machine/
MatrixFactorization: https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/movie-recommendation
MatrixFactorization (OneClass): https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
OLS: https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.mklcomponentscatalog.ols?view=ml-dotnet
Here is an example of one of the pipelines; the others are very similar:
string profileEncoded = nameof(WishlistItem.ProfileId) + "Encoded";
string itemEncoded = nameof(WishlistItem.ItemId) + "Encoded";
// the Matrix Factorization pipeline
var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = profileEncoded,
    MatrixRowIndexColumnName = itemEncoded,
    LabelColumnName = nameof(WishlistItem.Label),
    NumberOfIterations = 100,
    ApproximationRank = 100
};
trainerEstimator = Context.Transforms.Conversion.MapValueToKey(outputColumnName: profileEncoded, inputColumnName: nameof(WishlistItem.ProfileId))
    .Append(Context.Transforms.Conversion.MapValueToKey(outputColumnName: itemEncoded, inputColumnName: nameof(WishlistItem.ItemId)))
    .Append(Context.Recommendation().Trainers.MatrixFactorization(options));
In order to mitigate the issue of lacking labels, I've tried several workarounds:
leaving them blank (a 0f float value)
using the hash codes of the ItemId, the ProfileId, or a combination of both
counting the number of times a specific ItemId or ProfileId appears, and dampening that figure to avoid extreme values when an item appears hundreds of times (using a square root or a log function, e.g. Label = Math.Log(amountOfTimes); or Label = Math.Ceiling(Math.Log(amountOfTimes));)
for the FieldAware machine, where the Label is a Boolean rather than a float, the calculation above is used to determine whether the value is above or below the average for all items
When testing, I use one of the following 2 methods to determine which recommendations "Y" can be made for item "X":
Compare item X to all existing items, using the ProfileId of the user:
List<WishlistItem> predictionsForUser = profileMatrix.DistinctBy(x => x.ItemId)
    .Select(x => new WishlistItem(userId, x.ItemId, x.Label))
    .ToList();
IDataView transformed = trainedModel.Transform(Context.Data.LoadFromEnumerable(predictionsForUser));
CoPurchasePrediction[] predictions = Context.Data.CreateEnumerable<CoPurchasePrediction>(transformed, false).ToArray();
IEnumerable<KeyValuePair<WishlistItem, CoPurchasePrediction>> results = Enumerable.Range(0, predictions.Length)
    .ToDictionary(i => predictionsForUser[i], i => predictions[i])
    .OrderByDescending(x => x.Value.Score)
    .Take(10);
return results.Select(x => x.Key.ItemId.ToString()).ToArray();
Compare item X to items on other people's wishlists where X is also present. This method is used for the FieldAwareFactorization trainer, which uses a Boolean as Label:
public IEnumerable<WishlistItem> CreatePredictDataForUser(string userId, IEnumerable<WishlistItem> userItems)
{
    Dictionary<string, IEnumerable<WishlistItem>> giftIdGroups = profileMatrix.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => x.AsEnumerable());
    Dictionary<string, IEnumerable<WishlistItem>> profileIdGroups = profileMatrix.GroupBy(x => x.ProfileId).ToDictionary(x => x.Key, x => x.AsEnumerable());
    profileIdGroups.Add(userId, userItems);
    List<WishlistItem> results = new List<WishlistItem>();
    foreach (WishlistItem wi in userItems)
    {
        IEnumerable<WishlistItem> giftIdGroup = giftIdGroups[wi.GiftId];
        foreach (WishlistItem subwi in giftIdGroup)
        {
            results.AddRange(profileIdGroups[subwi.ProfileId]);
        }
    }
    // exclude items already on the user's own wishlist (.NET 6 ExceptBy takes the keys to exclude)
    IEnumerable<WishlistItem> filtered = results.ExceptBy(userItems.Select(x => x.GiftId), x => x.GiftId);
    // get duplicates
    Dictionary<string, float> duplicates = filtered.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => giftLabelValues[x.Key]);
    float max = duplicates.Values.Max();
    return filtered.DistinctBy(x => x.GiftId).Select(x => new WishlistItem(userId, x.GiftId, duplicates[x.GiftId] * 2 > max));
}
However, the prediction results remain either completely or partially unusable (float.NaN), or the model always produces the same recommendations ("we recommend Y and Z for item X") regardless of the item inserted.
When evaluating with a test data view (DataOperationsCatalog.TrainTestData split = Context.Data.TrainTestSplit(data, 0.2)), the metrics either show promising results with high accuracy or values that are all over the place, and they don't match the results I'm getting: high accuracy still results in float.NaN or "always the same".
Online sources point out that float.NaN may be the result of a small dataset. To compensate, I have tried creating 'fake' data: profile-item combinations (with label 0f or false, while the real data is 0f+ or true) that are randomly generated from existing ProfileIds and ItemIds. (It is checked beforehand that this random 'negative' data isn't accidentally also a 'real' combination.) However, this has shown little to no effect.
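For reference, the generation of that random 'fake' data looks roughly like this (profileIds, itemIds, existingPairs and desiredCount are simplified stand-ins for my actual bookkeeping):
var random = new Random();
var negatives = new List<WishlistItem>();
while (negatives.Count < desiredCount)
{
    uint profileId = profileIds[random.Next(profileIds.Count)];
    uint itemId = itemIds[random.Next(itemIds.Count)];
    // rule out random combinations that are accidentally 'real'
    if (existingPairs.Contains((profileId, itemId)))
        continue;
    negatives.Add(new WishlistItem { ProfileId = profileId, ItemId = itemId, Label = 0f });
}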

I don't think any of the solutions you have tried will work since, as you have pointed out, you do not have any label data. Faking the label data will not work either, as the ML algorithm will simply train on the faked labels.
What I believe you are looking for is a One-Class Matrix Factorization algorithm.
Your "label" or "score" is implicit - the fact that the item is in the user's wishlist itself indicates the label - that the user has an interest in the item. The One-Class Matrix Factorization uses this kind of implicit labelling.
Have a read through this article:
https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
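To illustrate the idea, here is a minimal sketch of what a one-class setup can look like in ML.NET, along the lines of that article. It is not a drop-in implementation: wishlistItems stands for your existing ProfileId-ItemId pairs (each with Label = 1f), and the Alpha and C values are illustrative starting points, not tuned values.
using Microsoft.ML;
using Microsoft.ML.Trainers;

var mlContext = new MLContext();
// wishlistItems: your existing ProfileId-ItemId pairs, each with Label = 1f
IDataView data = mlContext.Data.LoadFromEnumerable(wishlistItems);

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("ProfileIdEncoded", nameof(WishlistItem.ProfileId))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("ItemIdEncoded", nameof(WishlistItem.ItemId)))
    .Append(mlContext.Recommendation().Trainers.MatrixFactorization(new MatrixFactorizationTrainer.Options
    {
        MatrixColumnIndexColumnName = "ProfileIdEncoded",
        MatrixRowIndexColumnName = "ItemIdEncoded",
        LabelColumnName = nameof(WishlistItem.Label),
        // one-class loss: user-item pairs missing from the data act as soft negatives
        LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass,
        Alpha = 0.01,   // illustrative: weight of the implicit negative entries
        C = 0.00001,    // illustrative: target value of the implicit negative entries
        NumberOfIterations = 100,
        ApproximationRank = 100
    }));

ITransformer model = pipeline.Fit(data);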

What you are looking for is a classic recommender-system solution. Recommender systems are built to cope with missing and sparse data. There are many approaches to this problem, and I recommend starting with this article. Generally, there are two approaches in recommender systems: model-based and memory-based. In my experience, model-based methods perform much better than memory-based ones. There's a nice summary here regarding the different models and solutions. Look at the matrix factorization solution by Koren and Bell here, which works very well in many cases.

Related

How can I increase the speed of filtering a list

I display data in a data grid and want to filter the data with range sliders (sliders with two handles). The change event of a range slider only sets the string variable filterTrigger; the filter itself is triggered via the mouseup event.
private void ApplyFilter()
{
    if (filterTrigger != "")
    {
        filteredData.Clear();
        suitableData.ForEach((item) =>
        {
            filteredData.Add(item); // create an unreferenced copy of list suitableData that was created in time-consuming calculations
        });
        switch (filterTrigger)
        {
            case "foo":
                // remove too-small and too-large Foos
                _ = filteredData.RemoveAll(x => x.Foo > fooRangeSliderHandlesMinMax.ElementAt(1) || x.Foo < fooRangeSliderHandlesMinMax.ElementAt(0));
                // set new minimum and maximum of the range slider
                barRangeSliderMinimum = filteredData.Min(x => x.Bar) - 0.1;
                barRangeSliderMaximum = filteredData.Max(x => x.Bar) + 0.1;
                // set new positions of the range slider handles
                barRangeSliderHandlesMinMax = new double[2] { Math.Max(barRangeSliderHandlesMinMax.ElementAt(0), barRangeSliderMinimum + 0.1), Math.Min(barRangeSliderHandlesMinMax.ElementAt(1), barRangeSliderMaximum - 0.1) };
                break;
            case "bar":
                _ = filteredData.RemoveAll(x => x.Bar > barRangeSliderHandlesMinMax.ElementAt(1) || x.Bar < barRangeSliderHandlesMinMax.ElementAt(0));
                fooRangeSliderMinimum = filteredData.Min(x => x.Foo) - 0.1;
                fooRangeSliderMaximum = filteredData.Max(x => x.Foo) + 0.1;
                fooRangeSliderHandlesMinMax = new double[2] { Math.Max(fooRangeSliderHandlesMinMax.ElementAt(0), fooRangeSliderMinimum + 0.1), Math.Min(fooRangeSliderHandlesMinMax.ElementAt(1), fooRangeSliderMaximum - 0.1) };
                break;
            default:
                break;
        }
        // remove values of Foo if filterTrigger was "bar" and vice versa
        _ = filteredData.RemoveAll(x => x.Foo > fooRangeSliderHandlesMinMax.ElementAt(1) || x.Foo < fooRangeSliderHandlesMinMax.ElementAt(0) || x.Bar > barRangeSliderHandlesMinMax.ElementAt(1) || x.Bar < barRangeSliderHandlesMinMax.ElementAt(0));
        // update data grid data
        IFilteredData = filteredData;
        dataGrid.Reload();
        filterTrigger = "";
    }
}
The code works fine when I comment out all the lines that start with a discard _. But of course, I need these lines. The problem is that they require a lot of processor power. It still works, but when I move the mouse with a handle of a filter clicked, the handle lags extremely (and my laptop sounds like a helicopter).
I know that part of the last filter is redundant, because when filterTrigger is foo, Foo was already filtered. But filtering only what was not filtered before will not solve the problem on its own, because I only show two filters above while there are actually about ten.
So, is there a way I could optimize this code?
When optimizing code, the first rule is to measure, preferably with a profiler that can tell you exactly which part of the code takes the most time.
The second rule is to use an optimal algorithm, but unless you have a huge number of items and some reasonable way to sort or index them, linear time is the best you can do.
Here are some guesses and suggestions of things that might be improved:
Avoid .ElementAt; it may create a new enumerator object each call, and that takes time, especially inside inner loops. Prefer indexers and/or store the value in a local variable.
Avoid LINQ. LINQ is great for readability, but it has some overhead, so when optimizing it can be worthwhile to use regular loops and see whether the overhead is significant.
Try to do all processing in one pass. Instead of iterating over all items once to find the minimum and once to find the maximum, do both at the same time. Memory is slow, and doing as much processing of an item as possible while it is already cached helps reduce memory traffic (see the sketch after this list).
I would consider replacing RemoveAll with a loop that copies the items that pass the check to an empty list. This should help ensure items are copied at most once.
A rule of thumb when optimizing is to use low-level language features. These are often easier for the JIT to optimize well, but they may make the code harder to read, so use a profiler and optimize only the places that need it most.
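A rough sketch of the single-pass idea; Item, Foo and Bar stand in for the question's data type and properties, and the bookkeeping for the remaining filters is omitted:
double fooMin = fooRangeSliderHandlesMinMax[0], fooMax = fooRangeSliderHandlesMinMax[1];
double barMin = barRangeSliderHandlesMinMax[0], barMax = barRangeSliderHandlesMinMax[1];

var result = new List<Item>(suitableData.Count);
double minBar = double.MaxValue, maxBar = double.MinValue;

foreach (var item in suitableData)
{
    // filter on both dimensions in the same pass, no RemoveAll needed
    if (item.Foo < fooMin || item.Foo > fooMax) continue;
    if (item.Bar < barMin || item.Bar > barMax) continue;

    result.Add(item); // each surviving item is copied exactly once

    // track the new slider bounds while the item is still in cache
    if (item.Bar < minBar) minBar = item.Bar;
    if (item.Bar > maxBar) maxBar = item.Bar;
}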

or-tools - Compute the stdev from a SumArray()

I need to generate schedules for employees using Google's Optimization Tools.
One of the constraints would be that every employee has approximately the same amount of working hours.
Thus, I want to aggregate in a list how many hours each employee is working, and then minimize the standard deviation of this list.
var workingTimes = new List<SumArray>();
foreach (var employee in employees)
{
    // Gather the durations of all tasks the employee is assigned to
    // (o.IsAssigned is an IntVar and task[o.Task].Duration is an int)
    var allDurations = shifts.Where(o => o.Employee == employee.Name)
                             .Select(o => o.IsAssigned * task[o.Task].Duration);
    // Total time the employee is working
    var workTime = new SumArray(allDurations);
    workingTimes.Add(workTime);
}
Now I want to minimize the stdev of workingTimes. I tried the following:
IntegerExpression workingTimesMean = new SumArray(workingTimes) * (1/workingTimes.Count);
var gaps = workingTimes.Select(o => (o - workingTimesMean)*(o - workingTimesMean));
var stdev = new SumArray(gaps) * (1/gaps.Count());
model.Minimize(stdev);
But the LINQ query on the 2nd line of the last code snippet throws an error:
Can't apply operator * to IntegerExpression and IntegerExpression
How can I compute the standard deviation of a Google.OrTools.Sat.SumArray?
The 'natural' API only supports linear expressions.
You need to use the AddProductEquality() API.
Please note that 1 / gaps.Count() will always return 0 (we are in integer arithmetic),
so you need to scale everything up.
Personally, I would just minimize the unscaled sum of abs(val - average). There is no need to divide by the number of elements.
Just check that the computation of the average has the right precision (once again, we are in integer arithmetic).
You could also consider just minimizing max(abs(val - average)). This is simpler and may be good enough. A sketch of the sum-of-absolute-gaps variant follows below.
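For illustration, here is a sketch of the unscaled-sum-of-absolute-gaps idea using the newer LinearExpr-based CP-SAT C# API (AddAbsEquality); the horizon bound is an assumption you would adjust to your data:
using Google.OrTools.Sat;
using System.Collections.Generic;

CpModel model = new CpModel();
// workingTimes: one linear expression per employee, e.g. built with LinearExpr.Sum(...)
int n = workingTimes.Count;
long horizon = 10000; // assumed upper bound on a single employee's total time

// total == sum of all working times
IntVar total = model.NewIntVar(0, horizon * n, "total");
model.Add(total == LinearExpr.Sum(workingTimes));

// Scale by n to avoid integer division:
// gap_i == n * workTime_i - total, i.e. n * (workTime_i - average)
var absGaps = new List<IntVar>();
for (int i = 0; i < n; i++)
{
    IntVar gap = model.NewIntVar(-horizon * n, horizon * n, $"gap_{i}");
    model.Add(gap == workingTimes[i] * n - total);
    IntVar absGap = model.NewIntVar(0, horizon * n, $"absGap_{i}");
    model.AddAbsEquality(absGap, gap); // absGap == |gap|
    absGaps.Add(absGap);
}
model.Minimize(LinearExpr.Sum(absGaps));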

Best way to group list of doubles, to add up to a specific value

I have a task that I'm unsure how to approach.
There's a list of doubles, and I need to group them together so they add up to a specific value.
Say I have:
14.6666666666666,
14.6666666666666,
2.37499999999999,
1.04166666666665,
1.20833333333334,
1.20833333333334,
13.9583333333333,
1.20833333333334,
3.41666666666714,
3.41666666666714,
1.20833333333334,
1.20833333333334,
14.5416666666666,
1.20833333333335,
1.04166666666666,
And I would like to group them into set values such as 12, 14 and 16.
I would like to take the highest value in the list, then group it with smaller ones to reach the closest target value above it.
For example:
take the double 14.6666666666666 and group it with 1.20833333333334 to bring me close to 16, and if there are any more small doubles left in the list, group them with it as well.
Then move on to the next double in the list..
That's literally the "Cutting Stock Problem" (sometimes called the 1-dimensional Bin Packing Problem). There are a number of well-documented solutions.
The only way to get the "optimal" solution (other than a quantum computer) is to cycle through every combination and select the best outcome.
A quicker way to get an "OK" solution is the "First Fit" algorithm. It takes the requested values in the order they come and removes them from the first piece of material that can fulfill the request.
The "First Fit" algorithm can be slightly improved by pre-ordering the values from largest to smallest and pre-ordering the materials from smallest to largest. You could also use the material that is closest to being completely consumed by the request, instead of the first piece that can fulfill it (see the sketch after this answer).
A compromise, but one that requires more code, is a "Genetic Algorithm". This is an oversimplification, but you could use the basic idea of "First Fit" and randomly swap two of the values before each pass. If the efficiency increases, you keep the change; if it decreases, you go back to the last state. Repeat until a fixed amount of time has passed or until you're happy.
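A minimal first-fit-decreasing sketch, under the assumption that the targets 12, 14 and 16 from the question are bin sizes that can be opened as needed and that every value fits in the largest bin:
var binSizes = new[] { 12.0, 14.0, 16.0 };
var bins = new List<(double Capacity, List<double> Items)>();

// pre-order the values from largest to smallest (first fit decreasing)
foreach (var value in values.OrderByDescending(v => v))
{
    // place the value in the first open bin with enough room left
    var bin = bins.FirstOrDefault(b => b.Capacity - b.Items.Sum() >= value);
    if (bin.Items != null)
    {
        bin.Items.Add(value);
    }
    else
    {
        // otherwise open the smallest bin size that can hold the value
        double size = binSizes.First(s => s >= value);
        bins.Add((size, new List<double> { value }));
    }
}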
Put the doubles in a list and sort them. Grab the highest value that is less than the target to start. Then loop through from the start of the list, adding values until adding the next value would put you over the limit.
var threshold = 16;
List<double> values = new List<double>();
values.Add(14.932034);
etc...
Sort the list:
values = values.OrderBy(p => p).ToList();
Grab the highest value that is less than your threshold:
// Highest value under threshold
var highestValue = values.Where(x => x < threshold).Max();
Now perform your search and calculations until you reach your solution:
var currentValue = highestValue;
Console.WriteLine("Starting with: " + currentValue);
foreach (var val in values)
{
    if (currentValue + val <= threshold)
    {
        currentValue = currentValue + val;
        Console.WriteLine(" + " + val.ToString());
    }
    else
        break;
}
Console.WriteLine("Finished with: " + currentValue.ToString());
Console.ReadLine();
Repeat the process for the next value and so on until you've output all of the solutions you want.

How to match / connect / pair integers from a List<T>

I have a list with an even number of nodes (it's always even). My task is to "match" all the nodes in the least costly way.
So I could have listDegree(1,4,5,6), which represents all the odd-degree nodes in my graph. How can I pair the nodes in listDegree and save the least costly combination to a variable, say int totalCost?
Something like this, where I return the least totalCost amount:
totalCost = (1,4) + (5,6)
totalCost = (1,5) + (4,6)
totalCost = (1,6) + (4,5)
--------------- More details (or a rewriting of the above) ---------------
I have a class that reads my input file and stores all the information I need, like the cost matrix for the graph, the edges, and the numbers of edges and nodes.
Next I have a Dijkstra shortest-path algorithm, which computes the shortest path in my graph (costMatrix) from a given start node to a given end node.
I also have a method that examines the graph (costMatrix) and stores all the odd-degree nodes in a list.
So what I was looking for were some hints on how to pair all the odd-degree nodes in the least costly way (shortest path). Using the data I have is easy once I know how to combine all the nodes in the list.
I don't need a solution, and this is not homework.
I just need a hint: when you have a list of, let's say, integers, how can you combine all of them pairwise?
Hope this explanation is better... :D
Perhaps:
List<int> totalCosts = listDegree
    .Select((num, index) => new { num, index })
    .GroupBy(x => x.index / 2)
    .Select(g => g.Sum(x => x.num))
    .ToList();
Demo
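For listDegree = { 1, 4, 5, 6 } this yields { 5, 11 }: it pairs adjacent elements in list order (1+4 and 5+6), so note that it does not search for the cheapest pairing.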
Edit:
After you edited your question, I understand your requirement. You need the total sum of all (pairwise) combinations of all elements in a list. I would use this combinatorics project, which is quite efficient and informative.
var listDegree = new[] { 1, 4, 5, 6 };
int lowerIndex = 2;
var combinations = new Facet.Combinatorics.Combinations<int>(
    listDegree,
    lowerIndex,
    Facet.Combinatorics.GenerateOption.WithoutRepetition
);
// get total costs overall
int totalCosts = combinations.Sum(c => c.Sum());
// get a List<List<int>> of all combinations (the inner list count is 2 = lowerIndex since you want pairs)
List<List<int>> allLists = combinations.Select(c => c.ToList()).ToList();
// output the result for demo purposes
foreach (IList<int> combis in combinations)
{
    Console.WriteLine(String.Join(" ", combis));
}
(Without more details on the cost, I am going to assume cost(1,5) = |1-5|, and that you want the sum to be as close as possible to 0.)
You are describing the even partition problem, which is NP-Complete.
The problem says: Given a list L, find two lists A,B such that sum(A) = sum(B) and #elements(A) = #elements(B), with each element from L must be in A or B (and never both).
The reduction to your problem is simple, each left element in the pair will go to A, and each right element in each pair will go to B.
Thus, there is no known polynomial solution to the problem, but you might want to try exponential exhaustive-search approaches (search all possible pairings; there are Choose(2n, n) = (2n)!/(n!*n!) of those).
An alternative is a pseudo-polynomial DP-based solution (feasible for small integers).
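As an illustration of the exhaustive search, a small recursive sketch that pairs the first unpaired node with each remaining node and keeps the cheapest total; cost is a stand-in for your shortest-path cost between two nodes:
static int MinPairingCost(List<int> nodes, Func<int, int, int> cost)
{
    if (nodes.Count == 0) return 0;

    int first = nodes[0];
    int best = int.MaxValue;
    for (int i = 1; i < nodes.Count; i++)
    {
        // pair 'first' with nodes[i], then solve the rest recursively
        var rest = new List<int>(nodes);
        rest.RemoveAt(i);
        rest.RemoveAt(0);
        int total = cost(first, nodes[i]) + MinPairingCost(rest, cost);
        best = Math.Min(best, total);
    }
    return best;
}

// e.g. MinPairingCost(new List<int> { 1, 4, 5, 6 }, (a, b) => shortestPath[a, b])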

Get closest/next match in .NET Hashtable (or other structure)

I have a scenario at work where we have several different tables of data in a format similar to the following:
Table Name: HingeArms
Hght Part #1 Part #2
33 S-HG-088-00 S-HG-089-00
41 S-HG-084-00 S-HG-085-00
49 S-HG-033-00 S-HG-036-00
57 S-HG-034-00 S-HG-037-00
Where the first column (and possibly more) contains numeric data sorted ascending that represents a range used to determine the proper record to get (e.g. height <= 33 then Part 1 = S-HG-088-00, height <= 41 then Part 1 = S-HG-084-00, etc.).
I need to look up and select the nearest match given a specified value. For example, given a height of 34.25, I need to get the second record in the set above:
41 S-HG-084-00 S-HG-085-00
These tables are currently stored in a VB.NET Hashtable "cache" of data loaded from a CSV file, where the key for the Hashtable is a composite of the table name and one or more columns from the table that represent the "key" for the record. For example, the Hashtable Add for the first record above would be:
ht.Add("HingeArms,33","S-HG-088-00,S-HG-089-00")
This seems less than optimal, and I have some flexibility to change the structure if necessary (the cache contains data from other tables where direct lookup is possible; these "range" tables just got dumped in because it was "easy"). I was looking for a "Next" method on a Hashtable/Dictionary to give me the closest matching record in the range, but that's obviously not available in the stock .NET classes.
Any ideas on a way to do what I'm looking for, with a Hashtable or a different structure? It needs to be performant, as the lookup will be called often from different sections of code. Any thoughts would be greatly appreciated. Thanks.
A hashtable is not a good data structure for this, because items are scattered around the internal array according to their hash code, not their values.
Use a sorted array or List<T> and perform a binary search, e.g.:
Setup:
var values = new List<HingeArm>
{
    new HingeArm(33, "S-HG-088-00", "S-HG-089-00"),
    new HingeArm(41, "S-HG-084-00", "S-HG-085-00"),
    new HingeArm(49, "S-HG-033-00", "S-HG-036-00"),
    new HingeArm(57, "S-HG-034-00", "S-HG-037-00"),
};
values.Sort((x, y) => x.Height.CompareTo(y.Height));
var keys = values.Select(x => x.Height).ToList();
Lookup:
var index = keys.BinarySearch(34.25);
if (index < 0)
{
    index = ~index;
}
var result = values[index];
// result == { Height = 41, Part1 = "S-HG-084-00", Part2 = "S-HG-085-00" }
You can use a sorted .NET array in combination with Array.BinarySearch().
If you get a non-negative value, it is the index of an exact match.
Otherwise, if the result is negative, use the formula
int index = ~Array.BinarySearch(sortedArray, value) - 1
to get the index of the previous "nearest" match.
The meaning of "nearest" is defined by the comparer you use. It must be the same one used when sorting the array. See:
http://gmamaladze.wordpress.com/2011/07/22/back-to-the-roots-net-binary-search-and-the-meaning-of-the-negative-number-of-the-array-binarysearch-return-value/
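A small sketch of that formula on a raw sorted array (values taken from the question's table):
double[] sortedHeights = { 33, 41, 49, 57 };
int result = Array.BinarySearch(sortedHeights, 34.25);
int index = result >= 0
    ? result        // exact match found
    : ~result - 1;  // index of the previous ("nearest below") entry
// For the next entry above (41 for 34.25, as the question asks), use ~result instead.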
How about LINQ-to-Objects? (This is by no means meant to be a performant solution, btw.)
var ht = new Dictionary<string, string>();
ht.Add("HingeArms,33", "S-HG-088-00,S-HG-089-00");
// ... add the remaining records the same way ...
decimal wantedHeight = 34.25m;
// take the smallest height that is >= the wanted height (the "next" match)
var foundIt = ht
    .Select(x => new { Height = decimal.Parse(x.Key.Split(',')[1]), x.Key, x.Value })
    .Where(x => x.Height >= wantedHeight)
    .OrderBy(x => x.Height)
    .FirstOrDefault();
if (foundIt != null)
{
    // Do something with your item in foundIt
}
