I am using the .NET Framework in my project and I have run into a problem.
I am using 7 DecisionVariables to create a decision tree: 5 of them are Continuous, 2 of them are Discrete, and I am using C45Learning.
This is how I create the DecisionVariables:
Continuous
new DecisionVariable(SupportedValueType.ToString(), DecisionVariableKind.Continuous)
Discrete (in my case I created a Discrete variable representing the day of the month)
int PossibleValues = 30;
new DecisionVariable(SupportedValueType.ToString(), PossibleValues)
Now when I create a tree, the leaf nodes that correspond to the Discrete decision variable have NULL as their output, so when I run
tree.Decide(sample)
and it ends up in such a leaf node, it returns NULL.
Can anybody tell me what the problem is?
When I was creating the input to build this decision tree, I did not "use" every one of the 30 possible values, only 2-3 of them. Could that be the problem?
For example (the x values stand for the other decision variables; of course I provide more input data than just 3 rows, but the extra rows only vary the x values and still use only these 3 days):
input: label:
x,x,x,x,x,1 -> Small
x,x,x,x,x,2 -> Medium
x,x,x,x,x,3 -> Big
Yes, my guess is that your tree is simply incomplete.
When using that algorithm, if I do not provide it with a comprehensive training set containing every possible combination of the input columns, then my chance of getting null leaves is much higher. However, this algorithm is known for having null paths anyway as a byproduct of its pruning process.
Check to see if other samples also return null. If they all do, then you might have an issue. If only a couple are returning NULL/unknown then it is probably simply a result of the way the tree built itself. In that case you will need to handle it with a default decision value.
I have read that there are default values you can provide for the algorithm to apply on its own, however I have never used those.
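If you do go the default-value route, a minimal sketch of such a fallback is shown below. It assumes, as described in the question, that Decide can come back null when the path ends in an empty leaf; majorityClass is whatever default label you pick yourself, e.g. the most frequent class in your training data:

// Sketch only: fall back to a default label whenever the tree walks into an
// unlabeled leaf. The nullable result mirrors the behaviour described in the
// question; adapt it to the exact return type of your tree implementation.
static int ClassifyWithFallback(DecisionTree tree, double[] sample, int majorityClass)
{
    int? decision = tree.Decide(sample);   // null when the sample ends in an empty leaf
    return decision ?? majorityClass;      // the "default decision value"
}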
Related
Suppose I have to create an object for which the user can set many values, for example 10. This object also has other values that are set by the system according to the values given by the user.
Is it a good idea to create tests for all the possible combinations of the values the user can set? If one value has 5 possible values, another 3, another 6... the number of combinations can get very large, even for a small method.
Another case is a property that is set by the system and does not depend on the values the user can set, for example the date of modification:
class MyClass
{
    public int Property01 { get; set; }
    public int Property02 { get; set; }
    public int Property03 { get; set; }
    public DateTime ModificationDate { get; set; }
}

void method01(int param01, int param02, MyClass paramObject)
{
    paramObject.ModificationDate = DateTime.Now;
    if (param01 < 0 && param02 > 0)
    {
        paramObject.ModificationDate = new DateTime(2000, 01, 01);
    }
}
In this case, if I don't test all the possible values of the ints, or at least the cases where param01 is lower than 0 and param02 is bigger than 0, I couldn't find the error: suppose I only test a case where param01 is bigger than 0, then I wouldn't be able to detect it.
But this case is easy; what would happen if I had more parameters, or more possible values per parameter? The number of combinations gets really big.
In general, I would like to know some common practices for testing objects that can take many values, because at the beginning I considered trying all possible combinations of all possible parameter values, but in some cases that is so much work that it makes me think there must be a better approach.
Thanks.
The number of combinations may become huge, but most likely only if you design your test cases with a black box approach.
In practice, code that has to distinguish between different parameter constellations typically follows a divide-and-conquer approach: Early in the code some fundamental distinctions are made.
For example, functions often check first whether the input arguments are in a valid range. The cases where there are invalid arguments are then handled separately (sometimes causing exceptions, or leading to an early exit of the function).
When doing glass box testing you can benefit from knowing about this approach: you don't have to create a cartesian product of all combinations of valid and invalid inputs. To test whether invalid inputs are detected and handled correctly, you write test cases that focus on one input at a time: to check that invalid data for input i1 is detected, you provide all other inputs with valid data and only i1 gets invalid data. Similarly for the other inputs. Thus, you can ignore combinations like "i1 invalid and i2 invalid and i3 invalid".
Such glass box knowledge can reduce the number of test cases tremendously. It comes at a price, however: when you change your implementation, you will also have to adjust your test cases. Moreover, with glass box testing you may overlook some scenarios of relevance. But all in all it is a trade-off that leaves you with an amount of test cases that can be handled.
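As an illustration, a glass box test for method01 from the code above could probe each boundary with one parameter at a time instead of the full cartesian product. This is only a sketch: it assumes xUnit and that method01 is reachable as a static helper.

using System;
using Xunit;

public class Method01Tests
{
    [Theory]
    [InlineData(-1, 1, true)]    // branch taken: param01 < 0 && param02 > 0
    [InlineData(0, 1, false)]    // boundary on param01 only
    [InlineData(-1, 0, false)]   // boundary on param02 only
    [InlineData(1, 1, false)]    // both "valid", branch not taken
    public void ModificationDate_IsResetOnlyForNegativeParam01AndPositiveParam02(
        int param01, int param02, bool expectReset)
    {
        var obj = new MyClass();
        method01(param01, param02, obj);

        bool wasReset = obj.ModificationDate == new DateTime(2000, 01, 01);
        Assert.Equal(expectReset, wasReset);
    }
}

Four rows cover the interesting boundaries instead of every possible (param01, param02) pair.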
I have to write a program that compares 10'000'000+ Entities against one another. The entities are basically flat rows in a database/csv file.
The comparison algorithm has to be pretty flexible, it's based on a rule engine where the end user enters rules and each entity is matched against every other entity.
I'm thinking about how I could possibly split this task into smaller workloads but I haven't found anything yet. Since the rules are entered by the end user pre-sorting the DataSet seems impossible.
What I'm trying to do now is fit the entire DataSet into memory and process each item. But that's not very efficient and requires approx. 20 GB of memory (compressed).
Do you have an idea how I could split the workload or reduce its size?
Thanks
If your rules are at the highest level of abstraction (e.g. any unknown comparison function), you can't achieve your goal. 10^14 comparison operations will run for ages.
If the rules are not completely general, I see 3 solutions to optimize different cases:
1. If the comparison is transitive and you can calculate a hash (somebody already recommended this), do it. Hashes can also be complicated, not only your rules =). Find a good hash function and it might help in many cases.
2. If the entities are sortable, sort them. For this purpose I'd recommend not sorting in place, but building an array of indexes (or IDs) of the items. If your comparison can be transformed to SQL (as I understand your data is in a database), you can perform this on the DBMS side more efficiently and read back the sorted indexes (for example 3,1,2, which means the item with ID=3 is the lowest, ID=1 is in the middle and ID=2 is the largest). Then you only need to compare adjacent elements.
3. If it's worth the effort, I would try some heuristic sorting or hashing. I mean a hash that does not necessarily uniquely identify equal elements, but can split your dataset into groups between which there are definitely no pairs of equal elements. Then all equal pairs will be inside the groups, and you can read the groups one by one and do the expensive rule evaluation in a group of not 10,000,000 but, for example, 100 elements (see the sketch after this list). The other sub-approach is heuristic sorting with the same purpose: to guarantee that equal elements don't end up at opposite ends of the dataset. After that you can read elements one by one and compare each with, say, the 1000 previously read elements kept in memory. I would keep about 1100 elements in memory and free the oldest 100 every time a new 100 arrives; this would optimize your DB reads. Yet another variant is possible if your rules contain conditions like (Attribute1 = Value1) AND (...), or (Attribute1 < Value2) AND (...), or any other simple rule: you can cluster first by these criteria and then compare items within the resulting clusters.
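A rough sketch of that third idea (bucketKey and rulesMatch are placeholders for your own coarse grouping key and rule engine; this is not a drop-in implementation):

using System;
using System.Collections.Generic;
using System.Linq;

static class PairwiseMatcher
{
    // Group entities by a coarse key that must never separate two rows your rules
    // could consider equal, then run the expensive pairwise comparison only inside
    // each (much smaller) bucket.
    public static IEnumerable<Tuple<TEntity, TEntity>> FindMatches<TEntity>(
        IEnumerable<TEntity> entities,
        Func<TEntity, string> bucketKey,
        Func<TEntity, TEntity, bool> rulesMatch)
    {
        foreach (var bucket in entities.GroupBy(bucketKey))
        {
            var items = bucket.ToList();
            for (int i = 0; i < items.Count; i++)
                for (int j = i + 1; j < items.Count; j++)
                    if (rulesMatch(items[i], items[j]))
                        yield return Tuple.Create(items[i], items[j]);
        }
    }
}

If the buckets come from a DB query (e.g. a GROUP BY on the simple attributes appearing in the rules), you can stream one bucket at a time instead of keeping the whole 20 GB in memory.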
By the way, what if your rule considers all 10,000,000 elements equal? Would you like to get 10^14 result pairs? This case proves that you can't solve this task in the general case. Try making some limitations and assumptions.
I would try to think about a rule hierarchy.
Let's say, for example, that rule A is "Color" and rule B is "Shape".
If you first divide the objects by color, then there is no need to compare a red circle with a blue triangle.
This will reduce the number of comparisons you will need to do.
I would create a hash code from each entity. You probably have to exclude the id from the hash generation and then test for equality. Once you have the hashes, you can order all the hash codes alphabetically. Having all entities in order makes it pretty easy to check for duplicates.
If you want to compare each entity with all other entities, then effectively you need to cluster the data; there are very few reasons to compare totally unrelated things (comparing Clothes with Human does not make sense). I think your rules will effectively try to cluster the data.
So you need to cluster the data; try some clustering algorithms like K-Means.
Also see Apache Mahout.
Are you looking for the most suitable sorting algorithm, of sorts, for this?
I think Divide and Conquer seems good.
If the algorithm seems good, you have plenty of other ways to do the calculation. In particular, parallel processing using MPICH or something similar may get you to your final destination.
But before deciding how to execute, you have to check whether the algorithm fits first.
I am solving the following problem:
Suppose I have a list of software packages whose names might look like this (the only known thing is that these names are formed like SOMETHING + VERSION, meaning that the version always comes after the name):
Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED
Efficient.Exclusive.Zip.Archiver.123.01
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip.Archiver14.06
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Custom.Zip.Archiver1
Now, I need to parse this list and select only the latest version of each package. For this example the expected result would be:
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
The current approach I use can be described as follows:
Split the initial strings into groups by their starting letter,
ignoring spaces, case and special symbols.
(`E`, `Z`, `C` for the example list above)
Foreach element {
Apply the regular expression (or a set of regular expressions),
which tries to deduce the version from the string and perform
the following conversion `STRING -> (VERSION, STRING_BEFORE_VERSION)`
// Example for this step:
// 'Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED' ->
// (122.24, Efficient.Exclusive.Zip.Archiver-PROPER)
Search through the corresponding group (in this example, the 'E' group)
and find every other string that starts with the 'STRING_BEFORE_VERSION' or
with a significant part of it. This comparison is performed in ignore-case and
ignore-special-symbols mode.
// The matches for this step:
// Efficient.Exclusive.Zip.Archiver-PROPER, {122.24}
// Efficient.Exclusive.Zip.Archiver, {123.01}
// Efficient-Exclusive.Zip.Archiver, {126.24, 2011}
// The last one will get picked, because year is ignored.
Get the possible version from each match, ***pick the latest, yield that match.***
Remove every possible match (including the initial element) from the list.
}
This algorithm (as I assume) should work for something like O(N * V + N lg N * M), where M stands for the average string matching time and V stands for the version regexp working time.
However, I suspect there is a better solution (there always is!), maybe specific data structure or better matching approach.
If you can suggest something or make some notes on the current approach, please do not hesitate to do this.
How about this? (Pseudo-Code)
var latestPackages = new Dictionary<string, string>(packageNameComparer);

foreach (var element in elements)
{
    (package, version) = applyRegex(element);
    if (!latestPackages.ContainsKey(package) || isNewer(version, latestPackages[package]))
    {
        latestPackages[package] = version;
    }
}

// print out latestPackages
Dictionary operations are O(1), so you have O(n) total runtime. No pre-grouping is necessary, and instead of storing all matches you only store the one that is currently the newest.
Dictionary has a constructor that accepts an IEqualityComparer object. There you can implement your own semantics of equality between package names. Keep in mind, however, that you also need to implement the GetHashCode method of that IEqualityComparer so that it returns the same value for objects you consider equal. To reproduce the example above you could return a hash code of the first character in the string, which would reproduce inside your dictionary the grouping you had. However, you will get more performance with a smarter hash code that doesn't have so many collisions, maybe by using more characters if that still yields good results.
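A sketch of such a comparer (the normalization rule here, lower-case letters and digits only, is just an illustration; plug in whatever "significant part" logic you already use):

using System.Collections.Generic;
using System.Linq;

class PackageNameComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return Normalize(x) == Normalize(y);
    }

    public int GetHashCode(string obj)
    {
        // Names considered equal must return the same hash code.
        return Normalize(obj).GetHashCode();
    }

    private static string Normalize(string name)
    {
        // Illustrative normalization: lower case, keep only letters and digits.
        return new string(name.ToLowerInvariant().Where(char.IsLetterOrDigit).ToArray());
    }
}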
I think you could probably use a DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) here to good effect. I think you could simply cycle down each node until you hit one that has only one "child". At that node you'll have the common prefix "up" the tree and the version strings below. From there, parse the version strings by removing everything that isn't a digit or a period, splitting the string on the periods and converting each element of the array to an integer. This should give you an int array for each version string. Identify the highest version, record it and travel on to the next node with only one child.
EDIT: Populating a large DAWG is a pretty expensive operation but lookup is really fast.
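For the version-parsing step described above, a rough sketch (the helper names are mine, not from any specific library):

using System;
using System.Linq;

static class VersionHelper
{
    // "v.122.24-EXTENDED" -> [122, 24]: strip everything but digits and periods,
    // split on periods, convert the pieces to integers.
    // (Note: a year like "(2011)" sitting right before the version would need
    // extra handling, since its digits would be kept as well.)
    public static int[] ParseVersion(string raw)
    {
        string cleaned = new string(raw.Where(c => char.IsDigit(c) || c == '.').ToArray());
        return cleaned.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
                      .Select(int.Parse)
                      .ToArray();
    }

    // Lexicographic comparison of the int arrays; missing components count as 0,
    // so 126.24 > 123.01 and 1.08 > 1.
    public static int CompareVersions(int[] a, int[] b)
    {
        for (int i = 0; i < Math.Max(a.Length, b.Length); i++)
        {
            int x = i < a.Length ? a[i] : 0;
            int y = i < b.Length ? b[i] : 0;
            if (x != y) return x.CompareTo(y);
        }
        return 0;
    }
}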
I need to do a text search based on user input in a relatively large list (about 37K lines with 50 to 100 chars each). The search is done after entering each character and the result is shown in a UITableView. This is my current code:
if (input.Any(x => Char.IsUpper(x)))
return _list.Where(x => x.Desc.Contains(input));
else
return _list.Where(x => x.Desc.ToLower().Contains(input));
It performs okay on a MacBook running the simulator, but it is too slow on an iPad.
One interesting thing I observed is that it takes longer and longer as the input grows. For example, say "examin" is the input: it takes about 1 second after entering e, 2 seconds after x, 5 seconds after a, but 28 seconds after m, and so on. Why is that?
I hope there is a simple way to improve it.
Always take care to avoid memory allocations in time sensitive code.
For example, we often write code that allocates strings without realizing it, e.g.
x => x.Desc.ToLower().Contains(input)
That will allocate a new string just to return from ToLower. From your description this will occur many times. You can easily avoid it by using:
x => x.Desc.IndexOf(input, StringComparison.OrdinalIgnoreCase) != -1
Note: just select the StringComparison.*IgnoreCase that matches your needs.
Also, LINQ is nice but it hides allocations in many cases; maybe not in your case, but measuring is key to getting things faster. In some cases using another algorithm (like the one suggested in another answer) could give you much better results (but keep the allocations in mind ;-)
UPDATE:
Mono's Contains(string) will call, after a few checks, the following:
CultureInfo.CurrentCulture.CompareInfo.IndexOf (this, value, 0, length, CompareOptions.Ordinal);
which, given your ToLower requirement, means that using StringComparison.OrdinalIgnoreCase is a perfect (i.e. identical) match for your existing code (it did not do any culture-specific comparison).
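Applied to the query from the question, the allocation-free variant might look like this (same _list and Desc as above):

// No ToLower allocation per row: ordinal, case-insensitive substring search.
return _list.Where(x => x.Desc.IndexOf(input, StringComparison.OrdinalIgnoreCase) >= 0);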
Generally I've found that contains operations are not preferable for search, so I'd recommend you take a look at the Mastering Core Data session video (login required) on the WWDC 2010 page (around the 10-minute mark). Apple knows that 'contains' is terrible with SQLite on mobile devices; you can essentially do what Apple does to sort of "hack" FTS on the version of SQLite they ship.
Essentially they do prefix matching by creating a table like:
[[ pk_id || input || normalized_input ]]
Where input and normalized_input are both explicitly indexed. Then they prefix-match against the normalized value. So for instance, if a user is searching for 'snuggles' and so far they've typed 'snu', the prefix-matching query would look like:
normalized_input >= 'snu' AND normalized_input < 'snv'
Not sure if this translates given your use case, but I thought it was worth mentioning. Hope it's helpful!
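If you want to build that range in code, a hypothetical helper for the exclusive upper bound might look like this (sketch only; it ignores edge cases such as the last character already being at its maximum value):

// "snu" -> "snv": increment the last character so that
// normalized_input >= prefix AND normalized_input < upperBound
// matches exactly the strings starting with the prefix.
static string PrefixUpperBound(string prefix)
{
    char[] chars = prefix.ToCharArray();
    chars[chars.Length - 1]++;
    return new string(chars);
}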
You need to use a trie. See http://en.wikipedia.org/wiki/Trie
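A minimal sketch of what that could look like in C# (illustrative only; a plain trie answers prefix queries, so for the Contains-style substring search above you would either insert every suffix of each line or switch to a suffix-oriented structure):

using System.Collections.Generic;
using System.Linq;

// Each node remembers every stored string whose prefix path passes through it,
// trading memory for O(prefix length) lookups instead of scanning all 37K lines.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<string> Entries = new List<string>();
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string text)
    {
        TrieNode node = _root;
        foreach (char c in text.ToLowerInvariant())
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
            node.Entries.Add(text);   // every node on the path keeps a reference
        }
    }

    public IEnumerable<string> StartsWith(string prefix)
    {
        TrieNode node = _root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return Enumerable.Empty<string>();
        }
        return node.Entries;
    }
}

Build it once from the 37K descriptions; each keystroke then only walks prefix-length nodes instead of re-scanning the whole list.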
I know there are quite a few questions out there on generating combinations of elements, but I think this one has a certain twist that makes it worth a new question:
For a pet project of mine I have to pre-compute a lot of state to improve the runtime behavior of the application later. One of the steps I struggle with is this:
Given N tuples of two integers (let's call them points from here on, although they aren't points in my use case; they are roughly X/Y related, though), I need to compute all valid combinations for a given rule.
The rule might be something like
"Every point included excludes every other point with the same X coordinate"
"Every point included excludes every other point with an odd X coordinate"
I hope and expect that this fact leads to an improvement in the selection process, but my math skills are just being resurrected as I type and I'm unable to come up with an elegant algorithm.
The set of points (N) starts small, but outgrows 64 soon (for the "use long as bitmask" solutions)
I'm doing this in C#, but solutions in any language are fine as long as they explain the underlying idea.
Thanks.
Update in response to Vlad's answer:
Maybe my idea to generalize the question was a bad one. My rules above were invented on the fly and just placeholders. One realistic rule would look like this:
"Every point included excludes every other point in the triagle above the chosen point"
By that rule and by choosing (2,1) I'd exclude
(2,2) - directly above
(1,3) (2,3) (3,3) - next line
and so on
So the rules are fixed, not general. They are unfortunately more complex than the X/Y samples I initially gave.
How about "the x coordinate of every point included is the exact sum of some subset of the y coordinates of the other included points". If you can come up with a fast algorithm for that simply-stated constraint problem then you will become very famous indeed.
My point being that the problem as stated is so vague as to admit NP-complete or NP-hard problems. Constraint optimization problems are incredibly hard; if you cannot put extremely tight bounds on the problem then it very rapidly becomes not analyzable by machines in polynomial time.
For some special rule types your task seems simple. For example, for your example rule #1 you need to choose a subset of all possible values of X, and then for each value from the subset assign an arbitrary Y.
For generic rules I doubt that it's possible to build an efficient algorithm without any AI.
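To make that concrete, here is a sketch for example rule #1 ("a point excludes every other point with the same X"): group the points by X, then every valid combination picks at most one point from each group. Names and types are illustrative, not from the question:

using System.Collections.Generic;
using System.Linq;

struct Point { public int X; public int Y; }

static class Rule1Combinations
{
    // Enumerates every subset of 'points' in which no two chosen points share an X
    // coordinate: pick either nothing or exactly one point from each X group.
    public static IEnumerable<List<Point>> ValidCombinations(IEnumerable<Point> points)
    {
        var groups = points.GroupBy(p => p.X).Select(g => g.ToList()).ToList();
        return Combine(groups, 0, new List<Point>());
    }

    private static IEnumerable<List<Point>> Combine(
        List<List<Point>> groups, int index, List<Point> current)
    {
        if (index == groups.Count)
        {
            yield return new List<Point>(current);   // snapshot of one valid combination
            yield break;
        }

        // Option 1: skip this X value entirely.
        foreach (var combo in Combine(groups, index + 1, current))
            yield return combo;

        // Option 2: include exactly one point with this X value.
        foreach (var point in groups[index])
        {
            current.Add(point);
            foreach (var combo in Combine(groups, index + 1, current))
                yield return combo;
            current.RemoveAt(current.Count - 1);
        }
    }
}

Note that the number of valid combinations can still be exponential, so for the more complex triangle-style rules a backtracking search with pruning (or a constraint solver) is probably what you end up with anyway.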
My understanding of the problem is: given a method bool property( Point x ) const, find all points in the set for which property() is true. Is that reasonable?
The brute-force approach is to run all the points through property() and store the ones that return true. The time complexity of this is O( N ), where (a) N is the total number of points, and (b) the property() method is O( 1 ). I guess you are looking for improvements over O( N ). Is that right?
For certain kinds of properties it is possible to improve on O( N ), provided a suitable data structure is used to store the points and suitable pre-computation (e.g. sorting) is done. However, this may not be true for an arbitrary property.