I am trying to find a way to determine the quality of an estimation function.
I have a dictionary that contains int values.
The total "sum" of this dictionary is Dictionary Key * Value.
public int RealValue
{
    get
    {
        return Items.Sum(x => x.Key * x.Value);
    }
}
The estimated sum of the Dictionary is calculated by using windows and weights.
public int EstimatedValue
{
    get
    {
        return Items.Where(x => x.Key < window1).Sum(x => weight1 * x.Value) +
               Items.Where(x => x.Key >= window1 && x.Key < window2).Sum(x => weight2 * x.Value) +
               Items.Where(x => x.Key >= window2 && x.Key < window3).Sum(x => weight3 * x.Value);
    }
}
Now I want to assign a rating to this estimation function, i.e. to the quality of the chosen windows and weights.
The estimation function is good if it can successfully determine which of two dictionaries contains the greater value. It does not matter how close the estimation is to the real sum. Of course the estimation function is supposed to work with any random pair of dictionaries that are candidates for testing.
What would be a good approach to solve the above problem?
It seems that in the end you are looking at a ranking problem, as you assign each object a specific order compared to the remaining ones ("which one is bigger?") and do not care about the actual value. There is a whole body of research regarding ranking evaluation and, in general, learning to rank. There are plenty of reliable metrics here, from the simplest ones - counting the number of correct pairwise comparisons (how many times your model gives the right ordering of two objects, out of all compared pairs) - to more statistical measures like the Kendall rank correlation coefficient.
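To make the pairwise metric concrete, here is a small sketch (a starting point, not a definitive implementation) of how such a score could be computed. The Dictionary<int, int> shape comes from the question; the class name, the delegates and the sample list are assumptions standing in for your RealValue/EstimatedValue properties and whatever test dictionaries you generate:
using System;
using System.Collections.Generic;

static class EstimatorRating
{
    // Fraction of dictionary pairs whose estimated ordering matches the real ordering.
    // 1.0 = the estimator always picks the bigger dictionary, ~0.5 = no better than guessing.
    public static double PairwiseAccuracy(
        IReadOnlyList<Dictionary<int, int>> samples,
        Func<Dictionary<int, int>, int> realValue,
        Func<Dictionary<int, int>, int> estimatedValue)
    {
        int correct = 0, total = 0;
        for (int i = 0; i < samples.Count; i++)
        {
            for (int j = i + 1; j < samples.Count; j++)
            {
                int realOrder = realValue(samples[i]).CompareTo(realValue(samples[j]));
                int estOrder = estimatedValue(samples[i]).CompareTo(estimatedValue(samples[j]));
                if (realOrder == 0)
                    continue; // ties in the real sum carry no ordering information
                total++;
                if (Math.Sign(realOrder) == Math.Sign(estOrder))
                    correct++;
            }
        }
        return total == 0 ? 0 : (double)correct / total;
    }
}
The Kendall rank correlation mentioned above is essentially a refinement of the same counting idea (concordant minus discordant pairs, normalized).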
Related
For a webshop I want to create a model that gives recommendations based on what is on someone's wishlist: a "someone who has X on their wishlist, we also recommend Y" scenario. The issue is that the trainers don't work, either due to a lack of proper Labels (which I do not have in my dataset) or due to a lack of enough data altogether. This results in either inaccurate predictions or prediction scores of float.NaN (either all or most scores end up like this).
At my disposal I have all existing wishlists with the subsequent ProfileId and ItemId's (both are integers). These are grouped in ProfileID-ItemID combinations (representing an item on a wishlist, so a user with 3 items will have 3 combinations). In total, there are around 150,000 combinations I can work with for 16,000 users and 50,000 items. Items that only appear on a single wishlist (or not at all) or users with only one item on their wishlist are excluded from the training data (the above numbers are already filtered). If I want to, I could add extra columns of data representing the category an item is a part of (toys, books, etc.), prices and other metadata.
What I do not have are ratings, since the webshop doesn't use those. Therefore, I cannot use them to represent the "Label"
public class WishlistItem
{
    // these variables are either uint32 or a Single (float) based on the training algorithm.
    public uint ProfileId;
    public uint ItemId;
    public float Label;
}
What I expect is needed to fix the issue (a combination of, or any one of, these three):
1) I need to use a different trainer. If so, which would be best suited?
2) I need to insert different values for the Label variable. If so, how should they be generated?
3) I need to generate a different 'fake' dataset to pad the training data. If so, how should it be generated?
Explanation of the problem and failed attempts to remedy it
I have tried to parse the data using different trainers to see what would work best for my dataset: the FieldAwareFactorizationMachine, the MatrixFactorizationMachine and the OLSTrainer. I've also tried to use the MatrixFactorizationMachine with LossFunctionType.SquareLossOneClass, where, rather than ProfileID-ItemID combinations, combinations of ItemIds on a wishlist are inserted (e.g. item1-item2, item2-item3, item1-item3 from a wishlist where 3 items are present).
The machines are based on information found in their subsequent tutorials:
FieldAware: https://xamlbrewer.wordpress.com/2019/04/23/machine-learning-with-ml-net-in-uwp-field-aware-factorization-machine/
MatrixFactorization: https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/movie-recommendation
MatrixFactorization (OneClass): https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
OLS: https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.mklcomponentscatalog.ols?view=ml-dotnet
Here is an example of one of the pipelines, the others are very similar:
string profileEncoded = nameof(WishlistItem.ProfileId) + "Encoded";
string itemEncoded = nameof(WishlistItem.ItemId) + "Encoded";
// the Matrix Factorization pipeline
var options = new MatrixFactorizationTrainer.Options {
MatrixColumnIndexColumnName = profileEncoded,
MatrixRowIndexColumnName = itemEncoded,
LabelColumnName = nameof(WishlistItem.Label),
NumberOfIterations = 100,
ApproximationRank = 100
};
trainerEstimator = Context.Transforms.Conversion.MapValueToKey(outputColumnName: profileEncoded, inputColumnName: nameof(WishlistItem.ProfileId))
.Append(Context.Transforms.Conversion.MapValueToKey(outputColumnName: itemEncoded, inputColumnName: nameof(WishlistItem.ItemId)))
.Append(Context.BinaryClassification.Trainers.FieldAwareFactorizationMachine(new string[] { "Features" }));
In order to mitigate the issue of lacking labels, I've tried several workarounds:
leaving them blank (a 0f float value)
using the hashcodes of the itemid, profileid or a combination of both
counting the number of times a specific itemid or profileid appears, also manipulating that figure to create less extreme values in case an item is represented hundreds of times (using a square root or a log function, e.g. Label = Math.Log(amountoftimes); or Label = Math.Ceiling(Math.Log(amountoftimes)))
for the FieldAware machine, where the Label is a Boolean rather than a Float, the calculation above is used to determine whether the float result is above or below the average for all items
When testing, I use one of the following 2 methods to determine which recommendations "Y" can be created for item "X":
compare ItemID X to all existing items, with the ProfileID of the user.
List<WishlistItem> predictionsForUser = profileMatrix.DistinctBy(x => x.ItemID).Select(x => new WishlistItem(userId, x.GiftId, x.Label)).ToList();
IDataView transformed = trainedModel.Transform(Context.Data.LoadFromEnumerable(predictionsForUser));
CoPurchasePrediction[] predictions = Context.Data.CreateEnumerable<CoPurchasePrediction>(transformed, false).ToArray();
IEnumerable<KeyValuePair<WishlistItem, CoPurchasePrediction>> results = Enumerable.Range(0, predictions.Length).ToDictionary(x => predictionsForUser[x], x => predictions[x]).OrderByDescending(x => x.Value.Score).Take(10);
return results.Select(x => x.Key.GiftId.ToString()).ToArray();
Compare the ItemID X to items on other people's wishlists where X is also present. This one is used for the FieldAware Factorization Trainer, which uses a Boolean as Label.
public IEnumerable<WishlistItem> CreatePredictDataForUser(string userId, IEnumerable<WishlistItem> userItems)
{
Dictionary<string, IEnumerable<WishlistItem>> giftIdGroups = profileMatrix.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => x.Select(y => y));
Dictionary<string, IEnumerable<WishlistItem>> profileIdGroup = profileMatrix.GroupBy(x => x.ProfileId).ToDictionary(x => x.Key, x => x.Select(y => y));
profileIdGroup.Add(userId, userItems);
List<WishlistItem> results = new List<WishlistItem>();
foreach (WishlistItem wi in userItems)
{
IEnumerable<WishlistItem> giftIdGroup = giftIdGroups[wi.GiftId];
foreach(WishlistItem subwi in giftIdGroup)
{
results.AddRange(profileIdGroup[subwi.ProfileId]);
}
}
IEnumerable<WishlistItem> filtered = results.ExceptBy(userItems, x => x.GiftId);
// get duplicates
Dictionary<string, float> duplicates = filtered.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => giftLabelValues[x.First().GiftId]);
float max = duplicates.Values.Max();
return filtered.DistinctBy(x => x.GiftId).Select(x => new WishlistItem(userId, x.GiftId, duplicates[x.GiftId] * 2 > max));
}
However, the testing data remains either completely or partially unusable (float.NaN), or always creates the same recommendation results ("we recommend Y and Z for item X") regardless of the item inserted.
When evaluating the model using a test data view (DataOperationsCatalog.TrainTestData split = Context.Data.TrainTestSplit(data, 0.2)), it either shows promising results with high accuracy or values that are all over the place, and that doesn't add up with the results I'm getting: high accuracy still results in float.NaN or 'always the same'.
Online it is pointed out that float.NaN may be the result of a small dataset. To compensate, I have tried creating 'fake' data: profile-item combinations (with label 0f or false, while the real ones are 0f+ or true) that are randomly generated based on existing profileids and itemids. (It is checked beforehand that this random 'negative' data isn't accidentally also a 'real' combination.) However, this has shown little to no effect.
I don't think any of the solutions you have tried will work because, as you have pointed out, you do not have any label data. Faking the label data will not work either, as the ML algorithm will simply learn from the faked labels.
What I believe you are looking for is a One-Class Matrix Factorization algorithm.
Your "label" or "score" is implicit - the fact that the item is in the user's wishlist itself indicates the label - that the user has an interest in the item. The One-Class Matrix Factorization uses this kind of implicit labelling.
Have a read through this article:
https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
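For reference, here is a rough sketch of what the one-class setup might look like in ML.NET; the column names and the WishlistItem class come from the question, while the MLContext/trainingData variables and the Alpha/C/iteration values are illustrative assumptions rather than tuned settings:
using Microsoft.ML;
using Microsoft.ML.Trainers;

var mlContext = new MLContext();

string profileEncoded = nameof(WishlistItem.ProfileId) + "Encoded";
string itemEncoded = nameof(WishlistItem.ItemId) + "Encoded";

var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = profileEncoded,
    MatrixRowIndexColumnName = itemEncoded,
    LabelColumnName = nameof(WishlistItem.Label), // set Label = 1f for every observed wishlist entry
    LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass,
    Alpha = 0.01,   // weight given to the implicit "unobserved" entries
    C = 0.00001,    // target value for those unobserved entries
    NumberOfIterations = 100,
    ApproximationRank = 100
};

var pipeline = mlContext.Transforms.Conversion
        .MapValueToKey(outputColumnName: profileEncoded, inputColumnName: nameof(WishlistItem.ProfileId))
    .Append(mlContext.Transforms.Conversion
        .MapValueToKey(outputColumnName: itemEncoded, inputColumnName: nameof(WishlistItem.ItemId)))
    .Append(mlContext.Recommendation().Trainers.MatrixFactorization(options));

var model = pipeline.Fit(trainingData); // trainingData: IDataView loaded from your WishlistItem rows
The important difference from the pipeline in the question is the LossFunction: with SquareLossOneClass the trainer treats every ProfileId-ItemId row as a positive observation and models the missing combinations itself, so there is no need to invent labels or fake negative rows.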
What you are looking for is a classic recommender system solution. Recommender systems are accustomed to missing and sparse data. There are many approaches to solve this problem, and I recommend starting with this article. Generally, there are two approaches in recommender systems - model-based and memory-based. In my experience, model-based methods perform much better than memory-based ones. There's a nice summary here regarding the different models and solutions. Look at the matrix factorization solution by Koren and Bell here which works very well in many cases.
I want to have all combinations of elements in a list for a result like this:
List: {1,2,3}
1
2
3
1,2
1,3
2,3
My problem is that I have 180 elements, and I want all combinations of up to 5 elements. In my tests with 4 elements it took a long time (2 minutes), but all went well. With 5 elements, however, I get an OutOfMemoryException.
My code presently is this:
public IEnumerable<IEnumerable<Rondin>> getPossibilites(List<Rondin> rondins)
{
var combin5 = rondins.Combinations(5);
var combin4 = rondins.Combinations(4);
var combin3 = rondins.Combinations(3);
var combin2 = rondins.Combinations(2);
var combin1 = rondins.Combinations(1);
return combin5.Concat(combin4).Concat(combin3).Concat(combin2).Concat(combin1).ToList();
}
With the function (taken from this question: Algorithm to return all combinations of k elements from n):
public static IEnumerable<IEnumerable<T>> Combinations<T>(this IEnumerable<T> elements, int k)
{
return k == 0 ? new[] { new T[0] } :
elements.SelectMany((e, i) =>
elements.Skip(i + 1).Combinations(k - 1).Select(c => (new[] { e }).Concat(c)));
}
I need to search the list for a combination whose elements add up to a value near (within a certain precision) a target, and to do this for each element of another list. Here is all my code for this part:
var possibilites = getPossibilites(opt.rondins);
possibilites = possibilites.Where(p => p.Sum(r => r.longueur + traitScie) < 144);
foreach(BilleOptimisee b in opt.billesOptimisees)
{
var proches = possibilites.Where(p => p.Sum(r => (r.longueur + traitScie)) < b.chute && Math.Abs(b.chute - p.Sum(r => r.longueur)) - (p.Count() * 0.22) < 0.01).OrderByDescending(p => p.Sum(r => r.longueur)).FirstOrDefault();
if(proches != null)
{
foreach (Rondin r in proches)
{
opt.rondins.Remove(r);
b.rondins.Add(r);
possibilites = possibilites.Where(p => !p.Contains(r));
}
}
}
With the code I have, how can I limit the memory taken by my list? Or is there a better solution to search a very big set of combinations?
Please, if my question is not good, tell me why and I will do my best to learn and ask better questions next time ;)
Your output list for combinations of 5 elements will have ~1.5*10^9 (that's billion with a b) sublists of size 5. If you use 32-bit integers, even neglecting list overhead and assuming a perfect list with 0 bytes of overhead, that is already ~30GB!
You should reconsider whether you actually need to generate the list the way you do; an alternative might be streaming the combinations - i.e. generating them on the fly.
That can be done by creating a function which takes the last combination as an argument and outputs the next one. (To see how it works, think about incrementing a number by one: you go from the last digit to the first, remembering a "carry over", until you are done.)
A streaming example for choosing 2 out of 4:
start: {4,3}
curr = start {4, 3}
curr = next(curr) {4, 2} // reduce last by one
curr = next(curr) {4, 1} // reduce last by one
curr = next(curr) {3, 2} // cannot reduce more, reduce the first by one, and set the follower to maximal possible value
curr = next(curr) {3, 1} // reduce last by one
curr = next(curr) {2, 1} // similar to {3,2}
done.
Now, you need to figure out how to do it for lists of size 2, then generalize it to arbitrary size - and program your streaming combination generator.
Good Luck!
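To make the idea concrete, here is one possible sketch of such a generator in C# (it walks index combinations in increasing order rather than the decreasing values shown above, but the carry-over mechanics are the same; the names are made up for illustration):
using System.Collections.Generic;

static class CombinationStream
{
    // Yields every k-combination of the indices {0, ..., n-1} one at a time,
    // in lexicographic order, without materializing the full result set.
    public static IEnumerable<int[]> Combinations(int n, int k)
    {
        if (k > n) yield break;

        var current = new int[k];
        for (int i = 0; i < k; i++)
            current[i] = i; // start at {0, 1, ..., k-1}

        while (true)
        {
            yield return (int[])current.Clone(); // hand out a copy, keep the state private

            // find the rightmost position that can still be increased
            int pos = k - 1;
            while (pos >= 0 && current[pos] == n - k + pos)
                pos--;

            if (pos < 0)
                yield break; // last combination reached

            current[pos]++;                      // the "carry over" step: bump this position
            for (int i = pos + 1; i < k; i++)    // and reset everything to its right
                current[i] = current[i - 1] + 1;
        }
    }
}
Because the combinations are produced lazily, the caller can filter them on the fly (for example, discard any combination whose summed length already exceeds 144) instead of holding roughly 1.5 billion sublists in memory at once.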
Let your precision be defined in the imaginary spectrum.
Use a real index to access the leaf and then traverse the leaf with the required precision.
See PrecisLise # http://net7mma.codeplex.com/SourceControl/latest#Common/Collections/Generic/PrecicseList.cs
While the implementation is not 100% complete as linked you can find where I used a similar concept here:
http://net7mma.codeplex.com/SourceControl/latest#RtspServer/MediaTypes/RFC6184Media.cs
Using this concept I was able to re-order h.264 Access Units and their underlying Network Abstraction Layer components in what I consider a very interesting way... besides being interesting, it also has the potential to be more efficient, using close to the same amount of memory.
et al, e.g, 0 can be proceeded by 0.1 or 0.01 or 0.001, depending on the type of the key in the list (double, float, Vector, inter alia) you may have the added benefit of using the FPU or even possibly Intrinsics if supported by your processor, thus making sorting and indexing much faster than would be possible on normal sets regardless of the underlying storage mechanism.
Using this concept allows for very interesting ordering... especially if you provide a mechanism to filter the precision.
I was also able to find several bugs in the bit-stream parser of quite a few well known media libraries using this methodology...
I found my solution; I'm writing it here so that other people who have a similar problem can have something to work with...
I made a recursive function that checks for a fixed number of possibilities that fit the conditions. When that number of possibilities is found, I return the list of possibilities, do some calculations with the results, and restart the process. I added a timer to stop the search when it takes too long. Since my condition is based on the sum of the elements, I only consider distinct values, and search for a small number of possibilities each time (like 1).
So the function returns a possibility with a very high precision; I do what I need to do with this possibility, remove the elements from the original list, and call the function again with the same precision until nothing is returned, so I can continue with another precision. When several precisions are done, there are only about 30 elements left in my list, so I can ask for all the possibilities (that still fit the maximum sum), and this part is much easier than the beginning.
Here is my code:
public List<IEnumerable<Rondin>> getPossibilites(IEnumerable<Rondin> rondins, int nbElements, double minimum, double maximum, int instance = 0, double longueur = 0)
{
if(instance == 0)
timer = DateTime.Now;
List<IEnumerable<Rondin>> liste = new List<IEnumerable<Rondin>>();
//Get all distinct rondins that can fit into the maximal length
foreach (Rondin r in rondins.Where(r => r.longueur < (maximum - longueur)).DistinctBy(r => r.longueur).OrderBy(r => r.longueur))
{
//Check the current length
double longueur2 = longueur + r.longueur + traitScie;
//If the current length is under the maximal length
if (longueur2 < maximum)
{
//Get all the possibilities with all rondins except the current one, and add them to the list
foreach (IEnumerable<Rondin> poss in getPossibilites(rondins.Where(rondin => rondin.id != r.id), nbElements - liste.Count, minimum, maximum, instance + 1, longueur2).Select(possibilite => possibilite.Concat(new Rondin[] { r })))
{
liste.Add(poss);
if (liste.Count >= nbElements && nbElements > 0)
break;
}
//If the current length is higher than the minimum, add it to the list
if (longueur2 >= minimum)
liste.Add(new Rondin[] { r });
}
//If we have enough possibilities, we stop the research
if (liste.Count >= nbElements && nbElements > 0)
break;
//If the research is taking too long, stop the research and return the list;
if (DateTime.Now.Subtract(timer).TotalSeconds > 30)
break;
}
return liste;
}
int [] n=new int[10]{2,3,33,33,55,55,123,33,88,234};
output=2,3,123,88,234;
use LINQ
I can do it using two for loops by continuously checking, but I need a simpler way using LINQ.
It's not just about removing duplicates..
removing duplicates with Distinct would give = 2,3,33,55,123,88,234
my output should be = 2,3,123,88,234;
I combined your grouping idea and matiash's count. Not sure about its speed.
var result = n.GroupBy(s => s).Where(g => g.Count() == 1).Select(g => g.Key);
Update: I have measured the speed and it seems the time is linear, so you can use it on large collections
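For completeness, a tiny runnable check of that query against the array from the question (just a console sketch):
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        int[] n = { 2, 3, 33, 33, 55, 55, 123, 33, 88, 234 };

        // keep only the values that occur exactly once, in order of first appearance
        var unique = n.GroupBy(x => x).Where(g => g.Count() == 1).Select(g => g.Key);

        Console.WriteLine(string.Join(",", unique)); // prints 2,3,123,88,234
    }
}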
var result = n.Where(d => n.Count(d1 => d1 == d) <= 1);
This reads: only take those elements that are present at most once in n.
It's quadratic though. Doesn't matter for short collections, but could possibly be improved.
EDIT Dmitry's solution is linear, and hence far better.
I have a list with an even number of nodes (always even). My task is to "match" all the nodes in the least costly way.
So I could have listDegree(1,4,5,6), which represents all the odd-degree nodes in my graph. How can I pair the nodes in listDegree, and save the least costly combination to a variable, say int totalCost?
Something like this, where I return the least totalCost amount:
totalCost = (1,4) + (5,6)
totalCost = (1,5) + (4,6)
totalCost = (1,6) + (4,5)
--------------- More details (or a rewriting of the above) ---------------
I have a class that reads my input file and stores all the information I need, like the costMatrix for the graph, the edges, and the number of edges and nodes.
Next I have a DijkstrasShortestPath algorithm, which computes the shortest path in my graph (costMatrix) from a given start node to a given end node.
I also have a method that examines the graph (costMatrix) and stores all the odd-degree nodes in a list.
So what I was looking for was some hints on how I can pair all the odd-degree nodes in the least costly way (shortest path). Using the data I have is easy once I know how to combine all the nodes in the list.
I don't need a solution, and this is not homework.
I just need a hint: when you have a list with, let's say, integers, how can you combine all the integers pairwise?
Hope this explanation is better... :D
Perhaps:
List<int> totalCosts = listDegree
.Select((num,index) => new{num,index})
.GroupBy(x => x.index / 2)
.Select(g => g.Sum(x => x.num))
.ToList();
Demo
Edit:
After you've edited your question I understand your requirement. You need a total sum of all (pairwise) combinations of all elements in a list. I would use this combinatorics project, which is quite efficient and informative.
var listDegree = new[] { 1, 4, 5, 6 };
int lowerIndex = 2;
var combinations = new Facet.Combinatorics.Combinations<int>(
listDegree,
lowerIndex,
Facet.Combinatorics.GenerateOption.WithoutRepetition
);
// get total costs overall
int totalCosts = combinations.Sum(c => c.Sum());
// get a List<List<int>> of all combination (the inner list count is 2=lowerIndex since you want pairs)
List<List<int>> allLists = combinations.Select(c => c.ToList()).ToList();
// output the result for demo purposes
foreach (IList<int> combis in combinations)
{
Console.WriteLine(String.Join(" ", combis));
}
(Without more details on the cost, I am going to assume cost(1,5) = 1-5, and you want the sum to get as close as possible to 0.)
You are describing the even partition problem, which is NP-Complete.
The problem says: Given a list L, find two lists A,B such that sum(A) = sum(B) and #elements(A) = #elements(B), with each element from L must be in A or B (and never both).
The reduction to your problem is simple, each left element in the pair will go to A, and each right element in each pair will go to B.
Thus, there is no known polynomial solution to the problem, but you might want to try exponential exhaustive search approaches (search all possible pairs, there are Choose(2n,n) = (2n)!/(n!*n!) of those).
An alternative is pseudo-polynomial DP based solutions (feasible for small integers).
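If the lists of odd-degree nodes stay small, the exhaustive route is easy to sketch: pair the first remaining node with every other remaining node and recurse on what is left. The names below are made up, and cost stands in for whatever you use to price a pair (e.g. the Dijkstra shortest-path distance between the two nodes); this is exponential, so treat it as a starting point for small inputs only:
using System;
using System.Collections.Generic;
using System.Linq;

static class OddNodeMatcher
{
    // Returns the cheapest total cost of pairing up all nodes (the list must have even length).
    public static int CheapestPairing(List<int> nodes, Func<int, int, int> cost)
    {
        if (nodes.Count == 0)
            return 0;

        int first = nodes[0];
        int best = int.MaxValue;

        // the first node must be paired with someone: try every candidate and recurse on the rest
        for (int i = 1; i < nodes.Count; i++)
        {
            var rest = nodes.Where((_, idx) => idx != 0 && idx != i).ToList();
            int total = cost(first, nodes[i]) + CheapestPairing(rest, cost);
            best = Math.Min(best, total);
        }
        return best;
    }
}
For listDegree = {1, 4, 5, 6} with cost(a, b) = Math.Abs(a - b) this returns 4, corresponding to the pairing (1,4) + (5,6).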
I have an object with two doubles:
class SurveyData
{
    double md;
    double tvd;
}
I have a list of these values that is already sorted ascending. I would like to find and return the index of the object in the list with the maximum tvd value that is less than or equal to a double. How can I efficiently accomplish this task?
Assuming you've got LINQ and are happy to use TakeUntil from MoreLINQ, I suspect you want:
var maxCappedValue = values.TakeUntil(data => data.Tvd >= limit)
.LastOrDefault();
That will get you the first actual value rather than the index, but you could always do:
var maxCappedPair = values.Select((value, index) => new { value, index })
.TakeUntil(pair => pair.value.Tvd >= limit)
.LastOrDefault();
for the index/value pair. In both cases the result would be null if all values were above the limit.
Of course, it would be more efficient to use a binary search - but also slightly more complicated. You could create a "dummy" value with the limit TVD, then use List<T>.BinarySearch(dummy, comparer) where comparer would be an implementation of IComparer<SurveyData> which compared by TVD. You'd then need to check whether the return value was non-negative (exact match found) or negative (exact match not found, return value is complement of where it would be inserted).
The difference in complexity is between O(n) for the simple scan, or O(log n) for the binary search. Without knowing how big your list is (or how important performance is), it's hard to advise whether the extra implementation complexity of the binary search would be worth it.
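For illustration, a rough sketch of that binary-search variant; the SurveyData shape mirrors the question, and the comparer/method names are made up for this example:
using System;
using System.Collections.Generic;

class SurveyData
{
    public double Md { get; set; }
    public double Tvd { get; set; }
}

class TvdComparer : IComparer<SurveyData>
{
    public int Compare(SurveyData x, SurveyData y) => x.Tvd.CompareTo(y.Tvd);
}

static class SurveySearch
{
    // Returns the index of the last element with Tvd <= limit, or -1 if every element is above it.
    public static int IndexOfMaxTvdAtMost(List<SurveyData> values, double limit)
    {
        var dummy = new SurveyData { Tvd = limit };
        int index = values.BinarySearch(dummy, new TvdComparer());

        if (index >= 0)
        {
            // exact match found; move forward over any equal Tvd values
            while (index + 1 < values.Count && values[index + 1].Tvd <= limit)
                index++;
            return index;
        }

        // no exact match: ~index is the insertion point, so the element just
        // before it is the largest value below the limit (if there is one)
        return ~index - 1;
    }
}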
First filter by the objects that are less than or equal to the filter value (Where), and then select the maximum of those objects' values.
Since it's already in ascending order, just iterate through the set until you find a value greater than the filter value, then return the previous index.
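A minimal sketch of that linear scan, assuming the list is sorted ascending by tvd and that the tvd value from the question's class is accessible:
using System.Collections.Generic;

// returns the index of the last element whose tvd is <= limit,
// or -1 if even the first element is already above the limit
static int LastIndexAtMost(IList<SurveyData> values, double limit)
{
    for (int i = 0; i < values.Count; i++)
    {
        if (values[i].tvd > limit)
            return i - 1;
    }
    return values.Count - 1; // every element is within the limit
}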
Here's a way to do it with Linq:
int indexOfMax =
data.Select((d, i) => new { Data = d, Index = i }) // associate an index with each item
.Where(item => item.Data.tvd <= maxValue) // filter values greater than maxValue
.Aggregate( // Compute the max
new { MaxValue = double.MinValue, Index = -1 },
(acc, item) => item.Data.tvd <= acc.MaxValue ? acc : new { MaxValue = item.Data.tvd, Index = item.Index },
acc => acc.Index);
But in a case like this, Linq is probably not the best option... a simple loop would be much clearer.