Get closest/next match in .NET Hashtable (or other structure) - c#

I have a scenario at work where we have several different tables of data in a format similar to the following:
Table Name: HingeArms
Hght Part #1 Part #2
33 S-HG-088-00 S-HG-089-00
41 S-HG-084-00 S-HG-085-00
49 S-HG-033-00 S-HG-036-00
57 S-HG-034-00 S-HG-037-00
Where the first column (and possibly more) contains numeric data sorted in ascending order and represents a range used to determine the proper record to retrieve (e.g. height <= 33 then Part 1 = S-HG-088-00, height <= 41 then Part 1 = S-HG-084-00, etc.).
I need to look up and select the nearest match given a specified value. For example, given a height of 34.25, I need to get the second record in the set above:
41 S-HG-084-00 S-HG-085-00
These tables are currently stored in a VB.NET Hashtable "cache" of data loaded from a CSV file, where the key for the Hashtable is a composite of the table name and one or more columns from the table that represent the "key" for the record. For example, for the above table, the Hashtable Add for the first record would be:
ht.Add("HingeArms,33","S-HG-088-00,S-HG-089-00")
This seems less than optimal and I have some flexibility to change the structure if necessary (the cache contains data from other tables where direct lookup is possible... these "range" tables just got dumped in because it was "easy"). I was looking for a "Next" method on a Hashtable/Dictionary to give me the closest matching record in the range, but that's obviously not available on the stock classes in VB.NET.
Any ideas on a way to do what I'm looking for with a Hashtable or in a different structure? It needs to be performant as the lookup will get called often in different sections of code. Any thoughts would be greatly appreciated. Thanks.

A hashtable is not a good data structure for this, because items are scattered around the internal array according to their hash code, not their values.
Use a sorted array or List<T> and perform a binary search, e.g.
Setup:
var values = new List<HingeArm>
{
    new HingeArm(33, "S-HG-088-00", "S-HG-089-00"),
    new HingeArm(41, "S-HG-084-00", "S-HG-085-00"),
    new HingeArm(49, "S-HG-033-00", "S-HG-036-00"),
    new HingeArm(57, "S-HG-034-00", "S-HG-037-00"),
};
values.Sort((x, y) => x.Height.CompareTo(y.Height));
var keys = values.Select(x => x.Height).ToList();
Lookup:
var index = keys.BinarySearch(34.25);
if (index < 0)
{
    // No exact match: ~index is the index of the first key larger than the value
    // (it equals keys.Count when the value is larger than every key).
    index = ~index;
}
var result = values[index];
// result == { Height = 41, Part1 = "S-HG-084-00", Part2 = "S-HG-085-00" }
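The HingeArm type isn't defined in the snippet above; a minimal sketch of what it could look like (the property names and the use of double for Height are assumptions):
public class HingeArm
{
    public double Height { get; }
    public string Part1 { get; }
    public string Part2 { get; }

    public HingeArm(double height, string part1, string part2)
    {
        Height = height;
        Part1 = part1;
        Part2 = part2;
    }
}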

You can use a sorted .NET array in combination with Array.BinarySearch().
If you get a non-negative value, it is the index of the exact match.
Otherwise, if the result is negative, use the formula
int index = ~Array.BinarySearch(sortedArray, value) - 1
to get the index of the previous "nearest" match.
The meaning of "nearest" is defined by the comparer you use. It must be the same comparer you used when sorting the array. See:
http://gmamaladze.wordpress.com/2011/07/22/back-to-the-roots-net-binary-search-and-the-meaning-of-the-negative-number-of-the-array-binarysearch-return-value/
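A quick illustration with the heights from the question (a sketch; note that ~raw without the - 1 gives the index of the next larger element, which is what the original question asks for):
double[] sortedHeights = { 33, 41, 49, 57 };
int raw = Array.BinarySearch(sortedHeights, 34.25); // negative: no exact match
int previousIndex = ~raw - 1;                       // 0 -> height 33 (previous "nearest" match)
int nextIndex = ~raw;                               // 1 -> height 41 (next match, as in the question)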

How about LINQ to Objects? (This is by no means meant to be a performant solution, by the way.)
var ht = new Dictionary<string, string>();
ht.Add("HingeArms,33", "S-HG-088-00,S-HG-089-00");
decimal wantedHeight = 34.25m;
var foundIt =
    ht.Select(x => new { Height = decimal.Parse(x.Key.Split(',')[1]), x.Key, x.Value })
      .Where(x => x.Height >= wantedHeight)
      .OrderBy(x => x.Height)
      .FirstOrDefault();
if (foundIt != null)
{
    // Do something with your item in foundIt
}

(ML.NET) How to train a dataset that doesn't contain labels

For a webshop I want to create a model that gives recommendations based on what is on someone's wishlist: a "someone who has X on their wishlist we also recommend Y" scenario. The issue is that the trainers don't work, either because of a lack of proper Labels (which I do not have in my dataset) or because of a lack of enough data altogether. This results in either inaccurate predictions or prediction scores of float.NaN (either all or most scores end up like this).
At my disposal I have all existing wishlists with the corresponding ProfileIds and ItemIds (both are integers). These are grouped in ProfileID-ItemID combinations (representing an item on a wishlist, so a user with 3 items will have 3 combinations). In total, there are around 150,000 combinations I can work with, for 16,000 users and 50,000 items. Items that only appear on a single wishlist (or not at all) and users with only one item on their wishlist are excluded from the training data (the above numbers are already filtered). If I wanted to, I could add extra columns of data representing the category an item is part of (toys, books, etc.), prices and other metadata.
What I do not have are ratings, since the webshop doesn't use those. Therefore, I cannot use them to represent the "Label".
public class WishlistItem
{
    // these variables are either uint32 or a Single (float) based on the training algorithm.
    public uint ProfileId;
    public uint ItemId;
    public float Label;
}
What I expect I need to fix the issue:
A combination of, or any one of, the following three:
1) a different trainer. If so, which would be best suited?
2) different values for the Label variable. If so, how should they be generated?
3) a different 'fake' dataset to pad the training data. If so, how should it be generated?
Explanation of the problem and failed attempts to remedy it
I have tried to parse the data using different trainers to see which would work best for my dataset: the FieldAwareFactorizationMachine, the MatrixFactorizationMachine and the OLSTrainer. I've also tried to use the MatrixFactorizationMachine with LossFunctionType.SquareLossOneClass, where, rather than ProfileID-ItemID combinations, combinations of ItemIds on a wishlist are inserted (e.g. item1-item2, item2-item3, item1-item3 from a wishlist where 3 items are present).
The machines are based on information found in their respective tutorials:
FieldAware: https://xamlbrewer.wordpress.com/2019/04/23/machine-learning-with-ml-net-in-uwp-field-aware-factorization-machine/
MatrixFactorization: https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/movie-recommendation
MatrixFactorization (OneClass): https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
OLS: https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.mklcomponentscatalog.ols?view=ml-dotnet
Here is an example of one of the pipelines; the others are very similar:
string profileEncoded = nameof(WishlistItem.ProfileId) + "Encoded";
string itemEncoded = nameof(WishlistItem.ItemId) + "Encoded";
// the Matrix Factorization pipeline
var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = profileEncoded,
    MatrixRowIndexColumnName = itemEncoded,
    LabelColumnName = nameof(WishlistItem.Label),
    NumberOfIterations = 100,
    ApproximationRank = 100
};
trainerEstimator = Context.Transforms.Conversion.MapValueToKey(outputColumnName: profileEncoded, inputColumnName: nameof(WishlistItem.ProfileId))
    .Append(Context.Transforms.Conversion.MapValueToKey(outputColumnName: itemEncoded, inputColumnName: nameof(WishlistItem.ItemId)))
    .Append(Context.Recommendation().Trainers.MatrixFactorization(options));
In order to mitigate the issue of lacking labels, I've tried several workarounds:
leaving them blank (a 0f float value)
using the hash codes of the ItemId, the ProfileId, or a combination of both
counting the number of times a specific ItemId or ProfileId appears, and manipulating that figure to create less extreme values in case an item is represented hundreds of times (using a square root or a log function, creating Label = Math.Log(amountOfTimes); or Label = Math.Ceiling(Math.Log(amountOfTimes)))
for the FieldAware machine, where the Label is a Boolean rather than a float, the calculation above is used to determine whether the float result is above or below the average for all items
When testing, I use one of the following 2 methods to determine what recommendations "Y" can be created for item "X":
Compare ItemID X to all existing items, with the ProfileID of the user.
List<WishlistItem> predictionsForUser = profileMatrix.DistinctBy(x => x.GiftId).Select(x => new WishlistItem(userId, x.GiftId, x.Label)).ToList();
IDataView transformed = trainedModel.Transform(Context.Data.LoadFromEnumerable(predictionsForUser));
CoPurchasePrediction[] predictions = Context.Data.CreateEnumerable<CoPurchasePrediction>(transformed, false).ToArray();
IEnumerable<KeyValuePair<WishlistItem, CoPurchasePrediction>> results = Enumerable.Range(0, predictions.Length).ToDictionary(x => predictionsForUser[x], x => predictions[x]).OrderByDescending(x => x.Value.Score).Take(10);
return results.Select(x => x.Key.GiftId.ToString()).ToArray();
Compare ItemID X to items on other people's wishlists where X is also present. This one is used for the FieldAware Factorization trainer, which uses a Boolean as the Label.
public IEnumerable<WishlistItem> CreatePredictDataForUser(string userId, IEnumerable<WishlistItem> userItems)
{
    Dictionary<string, IEnumerable<WishlistItem>> giftIdGroups = profileMatrix.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => x.Select(y => y));
    Dictionary<string, IEnumerable<WishlistItem>> profileIdGroup = profileMatrix.GroupBy(x => x.ProfileId).ToDictionary(x => x.Key, x => x.Select(y => y));
    profileIdGroup.Add(userId, userItems);
    List<WishlistItem> results = new List<WishlistItem>();
    foreach (WishlistItem wi in userItems)
    {
        IEnumerable<WishlistItem> giftIdGroup = giftIdGroups[wi.GiftId];
        foreach (WishlistItem subwi in giftIdGroup)
        {
            results.AddRange(profileIdGroup[subwi.ProfileId]);
        }
    }
    IEnumerable<WishlistItem> filtered = results.ExceptBy(userItems, x => x.GiftId);
    // get duplicates
    Dictionary<string, float> duplicates = filtered.GroupBy(x => x.GiftId).ToDictionary(x => x.Key, x => giftLabelValues[x.First().GiftId]);
    float max = duplicates.Values.Max();
    return filtered.DistinctBy(x => x.GiftId).Select(x => new WishlistItem(userId, x.GiftId, duplicates[x.GiftId] * 2 > max));
}
However, the testing data remains either completely or partially unusable (float.NaN), or the model always produces the same recommendation results (we recommend Y and Z for item X) regardless of the item inserted.
When evaluating with a test data view (DataOperationsCatalog.TrainTestData split = Context.Data.TrainTestSplit(data, 0.2)), it either shows promising results with high accuracy or random values all over the place, and that doesn't add up with the results I'm actually getting: high accuracy still ends in float.NaN or 'always the same'.
Online it is pointed out that float.NaN may be the result of a small dataset. To compensate, I have tried creating 'fake' datasets: profile-item combinations (with label 0f or false, while the real data is 0f+ or true) that are randomly generated based on existing ProfileIds and ItemIds. (It is checked beforehand that this random 'negative' data isn't accidentally also a 'real' combination.) However, this has shown little to no effect.
I don't think any of the solutions you have tried will work because, as you have pointed out, you do not have any label data. Faking the label data will not work either, as the ML algorithm will simply train on the faked labels.
What I believe you are looking for is a One-Class Matrix Factorization algorithm.
Your "label" or "score" is implicit: the fact that the item is in the user's wishlist itself indicates the label, namely that the user has an interest in the item. One-Class Matrix Factorization uses exactly this kind of implicit labelling.
Have a read through this article:
https://medium.com/machinelearningadvantage/build-a-product-recommender-using-c-and-ml-net-machine-learning-ab890b802d25
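For reference, a minimal sketch of what such a one-class pipeline could look like with the question's WishlistItem class (the Alpha/Lambda values and the convention of setting Label = 1 for every observed wishlist entry are assumptions for illustration, not prescriptions from the article):
// using Microsoft.ML; using Microsoft.ML.Trainers; (requires the Microsoft.ML.Recommender package)
var mlContext = new MLContext();

// Every profile/item pair that appears on a wishlist gets an implicit positive label.
IDataView data = mlContext.Data.LoadFromEnumerable(
    wishlistItems.Select(x => new WishlistItem { ProfileId = x.ProfileId, ItemId = x.ItemId, Label = 1f }));

var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = "ProfileIdEncoded",
    MatrixRowIndexColumnName = "ItemIdEncoded",
    LabelColumnName = nameof(WishlistItem.Label),
    LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass,
    Alpha = 0.01,   // weight given to the implicit "unobserved" cells (assumed value)
    Lambda = 0.025, // regularization (assumed value)
    NumberOfIterations = 50,
    ApproximationRank = 100
};

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("ProfileIdEncoded", nameof(WishlistItem.ProfileId))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("ItemIdEncoded", nameof(WishlistItem.ItemId)))
    .Append(mlContext.Recommendation().Trainers.MatrixFactorization(options));

ITransformer model = pipeline.Fit(data);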
What you are looking for is a classic recommender system solution. Recommender systems are designed to cope with missing and sparse data. There are many approaches to this problem, and I recommend starting with this article. Generally, there are two families of recommender systems: model-based and memory-based. In my experience, model-based methods perform much better than memory-based ones. There's a nice summary here regarding the different models and solutions. Look at the matrix factorization solution by Koren and Bell here, which works very well in many cases.

combining two different information in binary code

I have a Dictionary<string, T> where the string represents the key of a record, and I have two other pieces of information about the record that I need to maintain for each record in the dictionary: the category of the record and its redundancy (how many times it's repeated).
For example: the record XYZ1 is of category 1, and it's repeated 1 time. Therefore the implementation has to be something like this:
"XYZ1", {1,1}
Now moving on, I may encounter the same record again in my dataset, so the value for the key has to be updated like:
"XYZ1", {1,2}
"XYZ1", {1,3}
...
Since I am processing a big number of records, around 100K, I tried this approach, but it seems inefficient because the extra effort of fetching the value from the dictionary, slicing {1,1} and converting both slices into integers puts a lot of overhead on the execution.
I was thinking of using binary digits to represent both the category and the repetition count, and maybe a bitmask to fetch those pieces.
Edit: I tried to use an object with 2 properties, and then a Tuple<int,int>. Complexity got worse!
My question: is it possible to do so?
If not (in terms of complexity), any suggestions?
What is your type T? You could define a custom type which holds the information you need (category and occurrences).
class MyInfo {
    public int c { get; set; }
    public int o { get; set; }
}
Dictionary<String, MyInfo> data;
Then when traversing your data you can easily check whether some key is already present. If yes, just increment the occurrences, else insert a new element.
MyInfo d;
foreach (var e in elements) {
    if (!data.TryGetValue(e.key, out d))
        data.Add(e.key, new MyInfo { c = e.cat, o = 1 });
    else
        d.o++;
}
EDIT
You could also combine the category and the number of occurrences into one UInt64. For instance, take the category in the higher 32 bits (i.e. you can have 4 billion categories) and the number of occurrences in the lower 32 bits (i.e. each key can occur 4 billion times):
Dictionary<string, UInt64> data;
UInt64 d;
foreach (var e in elements) {
    if (!data.TryGetValue(e.key, out d))
        // cast before shifting: a 32-bit int shifted by 32 would not move at all
        data[e.key] = ((UInt64)e.cat << 32) + 1;
    else
        data[e.key] = d + 1;
}
And if you want to get the number of occurrences for one specific key you can just inspect the respective part of the value.
var d = data["somekey"];
var occurrences = d & 0xFFFFFFFF;
var category = d >> 32;
It seems like the category never changes. So rather than using a simple string for the key of your dictionary, I would instead do something like:
Dictionary<Tuple<string,int>,int>, where the key of the dictionary is a Tuple<string,int> in which the string is the record and the int is the category. Then the value in the dictionary is just a count.
A dictionary is probably going to be the fastest data structure for what you're trying to accomplish, as it has near constant time O(1) lookup and insertion.
You can speed it up a little bit by using the Tuple, as the category is now part of the key and no longer a separate piece of information you have to access.
At the same time, you could also keep the string as the key and store a Tuple<int,int> as the value, simply using Item1 as the category and Item2 as the count.
Either way is going to be roughly equivalent in speed. Processing 100K records in such a manner should be pretty fast either way.
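A minimal sketch of the Tuple-keyed variant (using a modern value tuple in place of Tuple<string,int>; the Key and Category member names on the input records are assumptions):
var counts = new Dictionary<(string Key, int Category), int>();
foreach (var e in records)
{
    var key = (e.Key, e.Category);
    counts.TryGetValue(key, out int current); // current stays 0 when the key is new
    counts[key] = current + 1;
}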

finding repeated sequences

I need help finding proper algorithms to solve my goal.
Let's say I have a dataset with 10000 records about some events. I have 50 event types, so each record in my dataset is assigned an event number (from 1 to 50).
Example of my dataset (2 columns: Record number, event number):
1. 13
2. 24
3. 6
4. 50
5. 24
6. 6
...
10000. 46
As you can see in this example, I have one repeated sequence of numbers: 24, 6. Now I would like to find out how many of these, and also other unknown sequences, there are in my dataset. I would also like to know the multiplicity of each sequence. I have checked the Rabin–Karp algorithm, but it seems that I have to specify the pattern/sequence first, whereas I would like the algorithm to find the sequences on its own.
I was also told to look at hierarchical clustering, but I am not sure it fits my requirements.
To sum up, I would like to find an algorithm that will find all repeated sequences, with their multiplicity, in a dataset like the one above.
I assumed you have this data in a text file with the same structure you provided.
I used LINQ to group and count each value, as shown in the following code:
static void Main(string[] args)
{
    // read lines from the text file
    var arr = File.ReadAllLines("dataset.txt").AsQueryable();
    // convert each line to an anonymous object with an Index and a Value
    var data = arr.Select(line => new { Index = line.Split('.')[0], Value = line.Split('.')[1] });
    // group the data by the value, then select each value and its count
    var res = data.GroupBy(item => item.Value).Select(group => new { Value = group.Key, Count = group.Count() });
    // print the result
    Console.WriteLine("Value\t\tCount");
    foreach (var item in res)
    {
        Console.WriteLine("{0}\t\t{1}", item.Value, item.Count);
    }
    Console.ReadLine();
}
The result of the previous code is a table listing each event number and how many times it occurs.
Hope that will help you.
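The grouping above counts single event numbers; since the question also asks about repeated runs such as 24, 6, the same idea can be applied to a sliding window. A minimal sketch for windows of length 2 (it assumes the event numbers have already been parsed into a List<int> called events; longer windows work the same way):
// Count how often each consecutive pair (bigram) of event numbers occurs.
var pairCounts = Enumerable.Range(0, events.Count - 1)
    .Select(i => (First: events[i], Second: events[i + 1]))
    .GroupBy(pair => pair)
    .Select(g => new { Pair = g.Key, Count = g.Count() })
    .Where(x => x.Count > 1)                // keep only pairs that actually repeat
    .OrderByDescending(x => x.Count);

foreach (var x in pairCounts)
{
    Console.WriteLine("{0}, {1}\t{2}", x.Pair.First, x.Pair.Second, x.Count);
}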

How to match / connect / pair integers from a List <T>

I have a list with an even number of nodes (always even). My task is to "match" all the nodes in the least costly way.
So I could have listDegree(1,4,5,6), which represents all the odd-degree nodes in my graph. How can I pair the nodes in listDegree and save the least costly combination to a variable, say int totalCost?
Something like this, where I return the smallest totalCost amount:
totalCost = (1,4) + (5,6)
totalCost = (1,5) + (4,6)
totalCost = (1,6) + (4,5)
--------------- More details (or a rewriting of the upper) ---------------
I have a class that reads my input file and stores all the information I need, like the cost matrix for the graph, the edges, and the number of edges and nodes.
Next I have a Dijkstra's shortest path algorithm, which computes the shortest path in my graph (costMatrix) from a given start node to a given end node.
I also have a method that examines the graph (costMatrix) and stores all the odd-degree nodes in a list.
So what I was looking for was some hints on how I can pair all the odd-degree nodes in the least costly way (shortest path). Using the data I have is easy once I know how to combine all the nodes in the list.
I don't need a solution, and this is not homework.
I just need a hint: when you have a list of, let's say, integers, how can you combine all the integers pairwise?
Hope this explanation is better... :D
Perhaps:
List<int> totalCosts = listDegree
    .Select((num, index) => new { num, index })
    .GroupBy(x => x.index / 2)
    .Select(g => g.Sum(x => x.num))
    .ToList();
Demo
Edit:
After you've edited your question, I understand your requirement. You need the total sum of all (pairwise) combinations of all elements in a list. I would use this combinatorics project, which is quite efficient and informative.
var listDegree = new[] { 1, 4, 5, 6 };
int lowerIndex = 2;
var combinations = new Facet.Combinatorics.Combinations<int>(
    listDegree,
    lowerIndex,
    Facet.Combinatorics.GenerateOption.WithoutRepetition
);
// get the total cost over all combinations
int totalCosts = combinations.Sum(c => c.Sum());
// get a List<List<int>> of all combinations (each inner list has lowerIndex = 2 elements since you want pairs)
List<List<int>> allLists = combinations.Select(c => c.ToList()).ToList();
// output the result for demo purposes
foreach (IList<int> combis in combinations)
{
    Console.WriteLine(String.Join(" ", combis));
}
(Without more details on the cost, I am going to assume cost(1,5) = 1-5, and that you want the sum to get as close as possible to 0.)
You are describing the even partition problem, which is NP-complete.
The problem says: given a list L, find two lists A and B such that sum(A) = sum(B) and #elements(A) = #elements(B), where each element from L must be in A or B (and never both).
The reduction to your problem is simple: each left element of a pair goes to A, and each right element of a pair goes to B.
Thus, there is no known polynomial solution to the problem, but you might want to try exponential exhaustive-search approaches (search all possible pairings; there are Choose(2n,n) = (2n)!/(n!*n!) of those).
An alternative is pseudo-polynomial DP-based solutions (feasible for small integers).
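Since the answer above suggests exhaustive search for small inputs, here is a minimal recursive sketch that tries every pairing and keeps the cheapest one (the cost function is an assumption supplied by the caller, e.g. the Dijkstra shortest-path distance between the two nodes):
// Enumerates every way to pair up the nodes and returns the cheapest total cost.
// Assumes nodes.Count is even and cost(a, b) is provided (e.g. shortest-path distance).
static int MinimumMatchingCost(List<int> nodes, Func<int, int, int> cost)
{
    if (nodes.Count == 0) return 0;

    int first = nodes[0];
    int best = int.MaxValue;
    for (int i = 1; i < nodes.Count; i++)
    {
        // Pair 'first' with nodes[i], then match the remaining nodes recursively.
        var rest = new List<int>(nodes);
        rest.RemoveAt(i);   // remove the chosen partner first (higher index)...
        rest.RemoveAt(0);   // ...then remove 'first'
        best = Math.Min(best, cost(first, nodes[i]) + MinimumMatchingCost(rest, cost));
    }
    return best;
}

// Usage with the question's example, assuming a simple difference cost:
// int total = MinimumMatchingCost(new List<int> { 1, 4, 5, 6 }, (a, b) => Math.Abs(a - b));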

Select items from List of structs

I've got a List of structs. In the struct there is a field x. I would like to select those structs which are close to each other by the parameter x. In other words, I'd like to cluster them by x.
I guess there should be a one-line solution.
Thanks in advance.
If I understood correctly what you want, then you might need to sort your list by the struct's field X.
Look at the GroupBy extension method:
var items = mylist.GroupBy(c => c.X);
This article gives a lot of examples using group by.
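Note that GroupBy(c => c.X) only groups structs with identical X values. If "close to each other" means within some tolerance, one simple one-liner variant is to group by a bucket instead (a sketch; the bucket width of 10.0 is an assumed value you would tune for your data):
double bucketSize = 10.0; // structs whose X falls into the same bucket of width 10 end up in one group
var clusters = mylist.GroupBy(c => (int)Math.Floor(c.X / bucketSize));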
If you're doing graph-style clustering, the easiest way to do it is to build up a list of clusters which is initially empty. Then loop over the input and, for each value, find all of the clusters which have at least one element close to the current value. All those clusters should then be merged together with the value. If there aren't any, the value goes into a new cluster by itself.
Here is some sample code for how to do it with a simple list of integers.
IEnumerable<int> input;
int threshold;
List<List<int>> clusters = new List<List<int>>();
foreach (var current in input)
{
    // Search the current list of clusters for ones which contain at least one
    // entry whose difference from the current value is within the threshold
    var matchingClusters =
        clusters.Where(
            cluster => cluster.Any(
                val => Math.Abs(current - val) <= threshold)
        ).ToList();
    // Merge all the clusters that were found, plus the current value, into a new cluster.
    // Replace all the matching clusters with this new one.
    IEnumerable<int> newCluster = new List<int>(new[] { current });
    foreach (var match in matchingClusters)
    {
        clusters.Remove(match);
        newCluster = newCluster.Concat(match);
    }
    clusters.Add(newCluster.ToList());
}
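For example (a hypothetical run), with input { 1, 2, 10, 11, 12, 30 } and threshold = 2, the loop above ends up with the clusters { 1, 2 }, { 10, 11, 12 } and { 30 }.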
