Preface: I'm currently learning about ANNs because I have ~18.5k images in ~83 classes. They will be used to train an ANN to recognize approximately equal images in real time. I followed the image example in the book, but it doesn't work for me, so I'm going back to the beginning since I've likely missed something.
I took the Encog XOR example and extended it to teach it how to add numbers less than 100. So far, the results are mixed, even for exact input after training.
Inputs (normalized by dividing by 100): 0+0, 1+2, 3+4, 5+6, 7+8, 1+1, 2+2, 7.5+7.5, 7+7, 50+50, 20+20.
Outputs are the sums, normalized the same way.
After 100,000 training iterations, here is some sample output for the training inputs:
0+0=1E-18 (great!)
1+2=6.95
3+4=7.99 (so close!)
5+6=9.33
7+8=11.03
1+1=6.70
2+2=7.16
7.5+7.5=10.94
7+7=10.48
50+50=99.99 (woo!)
20+20=41.27 (close enough)
From cherry-picked unseen data:
2+4=7.75
6+8=10.65
4+6=9.02
4+8=9.91
25+75=99.99 (!!)
21+21=87.41 (?)
I've messed with the layers, neuron counts, and [Resilient|Back]Propagation, but I'm not entirely sure whether it's getting better or worse. For the data above, the layers are 2, 6, 1.
I have no frame of reference for judging this. Is this normal? Do I not have enough input data? Is my data incomplete, not random enough, or too heavily weighted?
You are not the first to ask this. It seems logical to teach an ANN to add: we teach them to function as logic gates, so why not as addition or multiplication operators? I can't answer this completely, because I have not researched myself how well an ANN performs in this situation.
If you are just teaching addition or multiplication, you might get the best results with a linear output neuron and no hidden layer. For example, to learn addition, the two weights would need to converge to 1.0 and the bias weight would have to go to zero:
linear( (input1 * w1) + (input2 * w2) + bias )
becomes
linear( (input1 * 1.0) + (input2 * 1.0) + 0.0 ) = input1 + input2
Training with a sigmoid or tanh might be more problematic. The weights/bias and hidden layer would basically have to undo the sigmoid to truly get back to an addition like the one above.
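To make the linear case concrete, here is a minimal sketch in plain C# (my own illustration, not Encog code): a single linear unit with two inputs, trained with the delta rule, converges to exactly those weights.

using System;

// Hypothetical sketch: one linear "neuron" (two weights plus bias), trained by
// simple gradient descent (the delta rule) to learn addition. With inputs kept
// in 0..1, the weights should drift toward w1 = w2 = 1 and bias = 0.
double w1 = 0.1, w2 = -0.2, bias = 0.05, rate = 0.1;
var rng = new Random(42);

for (int i = 0; i < 10000; i++)
{
    double a = rng.NextDouble() * 0.5;      // operands in 0..0.5 so the sum stays in 0..1
    double b = rng.NextDouble() * 0.5;
    double target = a + b;

    double output = a * w1 + b * w2 + bias; // linear activation: nothing to "undo"
    double error = target - output;

    w1   += rate * error * a;               // delta rule updates
    w2   += rate * error * b;
    bias += rate * error;
}

Console.WriteLine($"w1={w1:F3} w2={w2:F3} bias={bias:F3}");              // ~1, ~1, ~0
Console.WriteLine($"0.07 + 0.08 -> {0.07 * w1 + 0.08 * w2 + bias:F3}");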
I think part of the problem is that the neural network is recognizing patterns, not really learning math.
An ANN can learn an arbitrary function, including arithmetic. For example, it has been proved that addition of N numbers can be computed by a polynomial-size network of depth 2. One way to teach an NN arithmetic is to use a binary representation (i.e. not input normalized by 100, but a set of input neurons each representing one binary digit, with the same representation for the output). This way you will be able to implement addition and other arithmetic. See this paper for further discussion and a description of ANN topologies used in learning arithmetic.
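A small sketch of that encoding (the helper below is my own, not from the paper): each operand becomes a vector of 0/1 inputs, one neuron per binary digit, and the target sum is encoded the same way.

using System;
using System.Linq;

// Illustrative only: binary input/output encoding for teaching addition,
// instead of a single value divided by 100.
static double[] ToBits(int value, int bits)
{
    var v = new double[bits];
    for (int i = 0; i < bits; i++)
        v[i] = (value >> i) & 1;            // least significant bit first
    return v;
}

// 7 + 8: two 7-bit operands in, one 8-bit sum out (enough for numbers below 100)
double[] input  = ToBits(7, 7).Concat(ToBits(8, 7)).ToArray();
double[] target = ToBits(7 + 8, 8);
Console.WriteLine(string.Join(" ", target));   // 1 1 1 1 0 0 0 0 (15 in binary, LSB first)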
PS. If you want to work on image recognition, it's not a good idea to start practicing with your original dataset. Try a well-studied dataset like MNIST, where it is known what results can be expected from correctly implemented algorithms. After mastering the classical examples, you can move on to your own data.
I am in the middle of a demo that teaches the computer how to multiply, and I'd like to share my progress: as Jeff suggested, I used the linear approach, in particular ADALINE. At the moment my program "knows" how to multiply by 5. This is the output I am getting:
1 x 5 ~= 5.17716232607829
2 x 5 ~= 10.147218373698
3 x 5 ~= 15.1172744213176
4 x 5 ~= 20.0873304689373
5 x 5 ~= 25.057386516557
6 x 5 ~= 30.0274425641767
7 x 5 ~= 34.9974986117963
8 x 5 ~= 39.967554659416
9 x 5 ~= 44.9376107070357
10 x 5 ~= 49.9076667546553
Let me know if you are interested in this demo. I'd be happy to share.
Overview
I am currently working on an executor for PMML normalization models in C#.
These PMML normalization models look like this:
<TransformationDictionary>
<DerivedField displayName="BU01" name="BU01*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 17 column(s)"/>
<NormContinuous field="BU01">
<LinearNorm orig="0.0" norm="-0.6148417019560395"/>
<LinearNorm orig="1.0" norm="-0.6140350877192982"/>
</NormContinuous>
</DerivedField>
(...)
I do know how min-max normalization works in theory, using
z_i = (x_i - min(x)) / (max(x) - min(x))
to normalize a dataset into the range of 0-1 and obviously it's not hard to reverse this equation.
Problem
So to execute the normalization and denormalization I somehow have to translate these orig/norm values into min/max values. But I just can't figure out how these orig/norm values are calculated and how they relate to min/max.
Question
So I'm asking whether someone knows an equation to transform orig/norm to min/max and back, or can explain how to use the orig/norm values directly to normalize/denormalize my fields.
Further Explanation
EDIT: It looks as if I did not state clearly what exactly the problem is, so here is another approach:
I am trying to normalize an attribute of a dataset into the range 0-1 using the min-max normalization method (aka feature scaling). Using the data analysis tool KNIME I can do this and export my "scaling" as a PMML model (the XML provided above is an example of this).
With these normalized attributes I train my MLP model. Now if I export my MLP model as PMML, I have to put normalized values in and get normalized output out when calculating a prediction. (Computing the MLP network already works.)
In a deployed scenario where KNIME can't do this normalization for me, I want to use my normalization model. As already described, I know the theory behind feature scaling and can easily compute de-/normalization if I am given the min and max of my attribute. The problem is that PMML uses another, let's say, "notation" for saving this min-max information, which is somehow encoded in the orig and norm values.
So what I am ultimately looking for is a way to convert orig/norm to min/max, or an explanation of how the min/max information is "encoded" in the orig/norm values.
Extra Info
[Why this "encoding" is done in the first place seems to be because computation speed reasons (which is not important in my scenario) and to easier encode min/max normlization info for ranges other than 0-1.]
Example #1
To give an example:
Let's say I want to normalize the array [0, 1, 2, 4, 8] into the range 0-1. Clearly the answer is [0, 0.125, 0.25, 0.5, 1], as computed by feature scaling with min = 0, max = 8. Easy. But now look at the PMML normalization model:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="0.0"/>
<LinearNorm orig="1.0" norm="0.125"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Example #2
[1, 2, 3, 4] -> [0, 0.333, 0.667, 1]
With:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="-0.3333333333333333"/>
<LinearNorm orig="1.0" norm="0.0"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Question
So how am I supposed to scale with orig/norm or compute min/max from these values?
What I'm about to say depends on what you mean by (min, max).
I'm going to assume that min equals the value where 0.5% of the total lies below and max equals the value where 0.5% of the total lies above.
If we agree on that, a symmetric normal distribution would have a mean value of approximately mean ~ (max+min)/2. (You call the mean the origin.)
Six standard deviations encompasses 99% of a normal distribution, so the standard deviation is approximately sigma ~ (max-min)/6.
The definition of normalized z = (x - mean)/sigma.
With those values you can get yourself back to the denormalized distribution.
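Expressed as a tiny C# sketch (variable names are mine; the min/max values are just the ones from Example #1, used for illustration):

using System;

// The formulas above in code: treat (min, max) as the ~1% tails of a normal
// distribution and recover mean and sigma from them.
double min = 0.0, max = 8.0;
double mean  = (max + min) / 2.0;           // centre of a symmetric distribution
double sigma = (max - min) / 6.0;           // ~six sigma between the tails
double x = 6.0;
double z = (x - mean) / sigma;              // normalized value
Console.WriteLine($"mean={mean}, sigma={sigma}, z={z:F2}");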
Found the answer. After carefully reading through the documentation again (which is extremely confusing, imo), I came across this sentence:
The sequence of LinearNorm elements defines a sequence of points for a stepwise linear interpolation function. The sequence must contain at least two elements. Within NormContinous the elements LinearNorm must be strictly sorted by ascending value of orig.
Which basically explains it all: normalization in PMML is done using a stepwise linear interpolation with only two points, so it is in fact just a simple linear conversion function.
In the case of normalization into the range 0-1 it gets even easier, as the two points will always be at x1=0 and x2=1 (the orig values). The y-axis intercept is therefore always the norm value at orig=0. The slope is also very easy to calculate: slope = (y2-y1)/(x2-x1) = (y2-y1)/(1-0) = y2-y1, which is just the difference of the two norm values.
So to get our interpolation function, which will always be a first-degree polynomial, we just calculate:
f(x) = ax + b = (y2-y1)x + y1 = (norm(orig=1) - norm(orig=0)) * x + norm(orig=0)    <- used for normalization
and now we can calculate the inverse:
x = (f(x) - norm(orig=0)) / (norm(orig=1) - norm(orig=0))    <- used for de-normalization
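As a small C# sketch of those two formulas (the parameter names n0/n1 are my own shorthand for norm(orig=0) and norm(orig=1)):

using System;

static double Normalize(double x, double n0, double n1)   => (n1 - n0) * x + n0;   // f(x)
static double Denormalize(double y, double n0, double n1) => (y - n0) / (n1 - n0); // inverse

// Example #1: orig=0 -> norm=0.0, orig=1 -> norm=0.125
Console.WriteLine(Normalize(8, 0.0, 0.125));        // 1.0
Console.WriteLine(Denormalize(0.25, 0.0, 0.125));   // 2.0
// Example #2: orig=0 -> norm=-0.3333..., orig=1 -> norm=0.0
Console.WriteLine(Normalize(3, -1.0 / 3, 0.0));     // ~0.667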
Hope this helps everyone who someday goes through the hassle of implementing their own PMML executor engine and gets stuck on this topic.
I have a random matrix, e.g.:
3 A 6 8
9 2 7* 1
6 6 9 1
2 #3 4 B
I need to find the shortest path from A to B. The * and # marks are jumping points: if you stand on the number marked *, you can jump to the number marked #.
I have thought about this a lot, but can't solve it.
How can I achieve this?
If the values in your matrix are the cost of moving from one field to another, the algorithm you need is A*. Wikipedia offers some pseudocode to get started, and if you ask Google you will find loads of example implementations in every language there is.
If the movement cost is always the same, it is still a shortest-path problem, but in that special case plain Dijkstra's algorithm is enough.
A* is basically Dijkstra's algorithm with the addition of a heuristic estimate of the remaining distance to the goal.
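As a rough illustration (not the only way to do it), here is a Dijkstra sketch in C# for the example grid, using .NET 6's PriorityQueue. It assumes the cell value is the cost of stepping onto that cell and models the * -> # jump as one extra edge; both of those are my assumptions about the problem.

using System;
using System.Collections.Generic;

// Grid from the question; 'A' and 'B' are treated as cost 0 (assumption).
int[,] cost =
{
    { 3, 0, 6, 8 },
    { 9, 2, 7, 1 },   // the 7 is the '*' cell
    { 6, 6, 9, 1 },
    { 2, 3, 4, 0 },   // the 3 is the '#' cell, the 0 is 'B'
};
(int r, int c) start = (0, 1), goal = (3, 3), jumpFrom = (1, 2), jumpTo = (3, 1);

int rows = cost.GetLength(0), cols = cost.GetLength(1);
var dist = new int[rows, cols];
for (int r = 0; r < rows; r++)
    for (int c = 0; c < cols; c++)
        dist[r, c] = int.MaxValue;

var queue = new PriorityQueue<(int r, int c), int>();
dist[start.r, start.c] = 0;
queue.Enqueue(start, 0);

while (queue.TryDequeue(out var cur, out var d))
{
    if (d > dist[cur.r, cur.c]) continue;                     // stale queue entry
    var neighbours = new List<(int r, int c)>
    {
        (cur.r - 1, cur.c), (cur.r + 1, cur.c),
        (cur.r, cur.c - 1), (cur.r, cur.c + 1),
    };
    if (cur == jumpFrom) neighbours.Add(jumpTo);              // the * -> # jump

    foreach (var (nr, nc) in neighbours)
    {
        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
        int nd = d + cost[nr, nc];                            // cost of entering the cell
        if (nd < dist[nr, nc]) { dist[nr, nc] = nd; queue.Enqueue((nr, nc), nd); }
    }
}

Console.WriteLine($"Cheapest path cost A -> B: {dist[goal.r, goal.c]}");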
I want to generate a fictional job title from some information I have about the visitor.
For this, I have a table of about 30 different job titles:
01 CEO
02 CFO
03 Key Account Manager
...
29 Window Cleaner
30 Dishwasher
I'm trying to find a way to generate one of these titles from a few different variables like name, age, education history, work history and so on. I want it to be somewhat random but still consistent, so that the same variables always result in the same title.
I also want the different variables to have some impact on the result. Lower numbers are "better" jobs and higher numbers are "worse" jobs, but it doesn't have to be very accurate, just not completely random.
So take these two people as an example.
Name: Joe Smith
Number of previous employers: 10
Number of years education: 8
Age: 56
Name: Samantha Smith
Number of previous employers: 1
Number of years education: 0
Age: 19
Now the reason I want the name in there is to have a bit of randomness, so that two co-workers of the same age with the same background don't get exactly the same title. So I was thinking of using the number of letters in the name to mix it up a bit.
Now I can generate consistent numbers in an infinite number of ways, like the number of letters in the name * age * years of education * number of employers. This would come out as 35 840 for Joe Smith and 247 for Samantha Smith. But I want it to be a number between 1-30, where Samantha is closer to 25-30 and Joe is closer to 1-5.
Maybe this is more of a math problem than a programming problem, but I have seen a lot of "What's your pirate name?" and similar apps out there, and I can't figure out how they work. "What's your pirate name?" might be a bad example, since it's probably completely random and I want my variables to matter somewhat, but the idea is the same.
What I have tried
I tried adding weights to variable groups so I would get an easier number to use in my calculations.
Age
01-20 5
20-30 4
30-40 3
40-50 2
...
Years of education
00-01 0
01-02 1
02-03 2
04-05 3
...
I added them together and played around with those numbers, but there were a lot of problems, like everyone ending up in pretty much the same mid-range (no one got to be CEO or dishwasher; everyone was somewhere in the middle), not to mention how messy the code was.
Is there a good way to accomplish what I want to do without having to build a massive math engine?
int numberOfTitles = 30;
var semiRandomID = person.Name.GetHashCode()
^ person.NumberOfPreviousEmployers.GetHashCode()
^ person.NumberOfYearsEducation.GetHashCode()
^ person.Age.GetHashCode();
var semiRandomTitle = Math.Abs(semiRandomID) % numberOfTitles;
// adjust semiRandomTitle as you see fit
semiRandomTitle += ((person.Age / 10) - 2);
semiRandomTitle += (person.NumberOfYearsEducation / 2);
The semiRandomID is a number generated from the hash of each component. The hashes are deterministic, so you will always generate the same number for "Joe", for example, but they don't mean anything; it's just a number. So we take all those numbers and generate one job title out of the 30 available. Every person has roughly the same chance to get each job title (probably some math freak will prove that there are edge cases to the contrary, but for all practical, non-cryptographic purposes it's sufficient).
Now each person has one job title assigned that looks random. However, as it's math and not randomness, they will get the same every time.
Now let's assume Joe got Taxi-Driver, number 20. However, he has 10 years of formal education, so you decide you want that aspect to carry some weight. You could just add the years onto the job title number, but that would make anyone with 30 years of college parties a CEO, so you decide (arbitrarily) that each year of education counts for half a job title. You add (NumberOfYearsEducation / 2) to the job title.
Let's assume Jane got CIO, number 5. However, she is only 22 years old, a little young to be that high on the list. Again, you could just add the years onto the job title number, but that would make anyone 30 years old a CEO, so you decide (arbitrarily) that each year counts as 1/10 of a job title. In addition, you think that being very young should instead subtract from the job title: all years below the first 20 count as a negative weight. So the formula would be ((Age / 10) - 2): one point for each 10 years of age, with the first 2 counting as negative.
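For completeness, a hypothetical usage sketch (the Person record, the sample values and the final clamp are my additions; also note that on recent .NET versions string.GetHashCode() is randomized per process, so use a deterministic hash instead if the result must survive restarts):

using System;

int numberOfTitles = 30;
var joe = new Person("Joe Smith", 10, 8, 56);

int id = joe.Name.GetHashCode()
       ^ joe.NumberOfPreviousEmployers.GetHashCode()
       ^ joe.NumberOfYearsEducation.GetHashCode()
       ^ joe.Age.GetHashCode();

int title = Math.Abs(id) % numberOfTitles;
title += (joe.Age / 10) - 2;                        // same weighting ideas as above
title += joe.NumberOfYearsEducation / 2;
title = Math.Clamp(title, 0, numberOfTitles - 1);   // keep it inside the 30-title table

Console.WriteLine($"{joe.Name} -> title #{title + 1}");

record Person(string Name, int NumberOfPreviousEmployers, int NumberOfYearsEducation, int Age);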
We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:
fee = fixed + variable*transaction_price
A "fee scheme" is the pair of (fixed, variable) used by a group of credit cards, e.g. "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete nor current list of fee schemes from our partners. (yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx schemes in use).
Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.
The data we get about each transaction is a data tuple: (card_id, transaction_price, fee).
transaction_price and fee are in integer cents. The bank rolls over fractional cents for each transaction until the cumulative amount is greater than one cent, and then a "rounding cent" is attached to the fees of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.
card_id identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id's. Multiple card_id's will share a fee scheme.
The data we get looks like this, and what we want to figure out is the last two columns.
card_id transaction_price fee fixed variable
=======================================================================
12345 200 22 ? ?
67890 300 21 ? ?
56789 150 8 ? ?
34567 150 8 ? ?
34567 150 "rounding cent"-> 9 ? ?
34567 150 8 ? ?
The end result we want is a short list like this with 10 or fewer entries showing the fee schemes that best fit our data. Like this:
fee_scheme_id fixed variable
======================================
1 22 0
2 21 0
3 ? ?
4 ? ?
...
The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.
The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.
We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days' worth of data.
Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.
Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.
Hough transform
Consider your problem in image terms: if you plot your input data on a diagram of price vs. fee, each scheme's entries form a straight line (with the rounding cents being noise). Treat the density map of that plot as an image, and the task reduces to finding straight lines in an image, which is exactly the job of the Hough transform.
You would essentially approach this by plotting one line for each transaction into a diagram of possible fixed fee versus possible variable fee, adding the values of lines where they cross. At the points of real fee schemes, many lines will intersect and form a large local maximum. By detecting this maximum, you find your fee scheme, and even a degree of importance for the fee scheme.
This approach will surely work, but might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classifier - and once you have classified your points into fee schemes, simple linear regression solves your problem.
Considering that a processing query's storage requirements are of the same order of magnitude as a day's worth of transaction data, I assume that such storage is not a problem, so:
First pass: group the transactions for each card_id by transaction_price, keeping card_id, transaction_price and the average fee. This can easily be done in SQL. This assumes there are no outliers, but you can catch those after this stage if required. The resulting number of rows is guaranteed to be no higher than the number of raw data points.
Second pass: per group, walk these new data points (with a cursor or in C#) and calculate the average value of b (the variable rate). Again, any outliers can be caught after this stage if desired.
Third pass: per group, calculate the average value of a (the fixed fee), now that b is known. This is basic SQL. Outliers as always.
If you decide to do the second step in a cursor you can stuff all that into a stored procedure.
Different card_id groups that use the same fee scheme can now be coalesced (sorry if this is the wrong word; I'm not a native English speaker) into fee schemes by rounding a and b to a sane precision and grouping again.
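A rough C#/LINQ sketch of those passes (the tuple shapes and the sample data are my own, not from the question):

using System;
using System.Collections.Generic;
using System.Linq;

// A transaction is (CardId, Price, Fee), all in integer cents.
var transactions = new List<(string CardId, int Price, int Fee)>
{
    ("34567", 150, 8), ("34567", 150, 9), ("34567", 150, 8), ("34567", 300, 14),
};

// Pass 1: average fee per (card_id, price) - this absorbs most rounding cents.
var points = transactions
    .GroupBy(t => (t.CardId, t.Price))
    .Select(g => (g.Key.CardId, Price: (double)g.Key.Price, Fee: g.Average(t => (double)t.Fee)))
    .ToList();

// Pass 2 + 3: per card_id, estimate the variable rate b from pairs of price levels,
// then the fixed part a from the averaged residuals.
var schemes = points
    .GroupBy(p => p.CardId)
    .Select(g =>
    {
        var pts = g.OrderBy(p => p.Price).ToList();
        double b = pts.Count < 2
            ? 0.0    // a single price level cannot separate fixed from variable
            : pts.Zip(pts.Skip(1), (p1, p2) => (p2.Fee - p1.Fee) / (p2.Price - p1.Price)).Average();
        double a = pts.Average(p => p.Fee - b * p.Price);
        return (CardId: g.Key, Fixed: a, Variable: b);
    })
    .ToList();

foreach (var s in schemes)
    Console.WriteLine($"{s.CardId}: fee ~= {s.Fixed:F2} + {s.Variable:F4} * price");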
The Hough transform is the most general answer, though I don't know how one would implement it in SQL (rather than pulling the data out and processing it in a general purpose language of your choice).
Alas, the naive version is known to be slow if you have a lot of input data (1000 points is kinda medium sized) and if you want high precision results (scales as size_of_the_input / (rho_precision * theta_precision)).
There is a faster approach based on 2^n-trees, but there are few implementations out on the web to just plug in. (I recently did one in C++ as a testbed for a project I'm involved in. Maybe I'll clean it up and post it somewhere.)
If there is some additional order to the data you may be able to do better (i.e. do the line segments form a piecewise function?).
Naive Hough transform
Define an accumulator in (theta, rho) space spanning [-pi, pi) and [0, max(hypotenuse(x, y))] as a 2D array.

Foreach point (x, y) in the input data
    Foreach bin in theta
        find the distance rho of the altitude from the origin to
        a line through (x, y) making angle theta with the horizontal:
            rho = x cos(theta) + y sin(theta)
        and increment the bin (theta, rho) in the accumulator

Find the maximum bin in the accumulator; this represents the most line-like structure in the data.

if (theta != 0) { a = rho / sin(theta); b = -1 / tan(theta); }
Reliably getting multiple lines out of a single pass takes a little more bookkeeping, but it is not significantly harder.
You can improve the result a little by smoothing the data near the candidate peaks and fitting to get sub-bin precision, which should be faster than using smaller bins and should pick up the effect of the "rounding" cents fairly smoothly.
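For reference, here is a hedged C# sketch of the naive accumulator loop above, specialised to (price, fee) points; the bin counts and the few sample points are arbitrary choices of mine, not tuned values:

using System;
using System.Collections.Generic;
using System.Linq;

var points = new List<(double X, double Y)> { (200, 22), (300, 21), (150, 8), (150, 9) };

int thetaBins = 360, rhoBins = 400;
double maxRho = points.Max(p => Math.Sqrt(p.X * p.X + p.Y * p.Y));
var accumulator = new int[thetaBins, rhoBins];

foreach (var (x, y) in points)
{
    for (int t = 0; t < thetaBins; t++)
    {
        double theta = -Math.PI + t * (2 * Math.PI / thetaBins);
        double rho = x * Math.Cos(theta) + y * Math.Sin(theta);
        if (rho < 0 || rho >= maxRho) continue;        // accumulator only spans [0, maxRho)
        accumulator[t, (int)(rho / maxRho * rhoBins)]++;
    }
}

// Strongest bin = most line-like structure; convert back to fee = a + b * price.
int bestT = 0, bestR = 0;
for (int t = 0; t < thetaBins; t++)
    for (int r = 0; r < rhoBins; r++)
        if (accumulator[t, r] > accumulator[bestT, bestR]) { bestT = t; bestR = r; }

double bestTheta = -Math.PI + bestT * (2 * Math.PI / thetaBins);
double bestRho = (bestR + 0.5) * maxRho / rhoBins;     // bin centre
if (Math.Sin(bestTheta) != 0)
{
    double a = bestRho / Math.Sin(bestTheta);          // intercept (fixed fee)
    double b = -1.0 / Math.Tan(bestTheta);             // slope (variable rate)
    Console.WriteLine($"fee ~= {a:F2} + {b:F4} * price");
}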
You're looking at the rounding cent as a significant source of noise in your calculations, so I'd focus on minimizing the noise due to that issue. The easiest way to do this IMO is to increase the sample size.
Instead of viewing your data as thousands of y = mx + b (+ rounding) equations, group your data into larger subsets:
If you combine X transactions with the same card_id and look at this as (sum of X fees) = (variable rate) * (sum of X transaction prices) + X * (base rate) (+ rounding), the rounding noise will likely fall by the wayside.
Get enough groups of size 'X' and you should be able to come up with a pretty close representation of the real numbers.
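As a sketch of that idea in C# (the group sums below are invented numbers consistent with a made-up scheme of fee = 5 + 0.04 * price, just to show the fit):

using System;
using System.Linq;

// Aggregate X transactions per group and fit (sum of fees) = a*X + b*(sum of prices).
// With X held constant this is a one-variable least-squares fit.
var groups = new[]                      // (count X, sum of prices, sum of fees) per group
{
    (X: 100, PriceSum: 12500.0, FeeSum: 1000.0),
    (X: 100, PriceSum: 15000.0, FeeSum: 1100.0),
    (X: 100, PriceSum: 20000.0, FeeSum: 1300.0),
};

double meanP = groups.Average(g => g.PriceSum), meanF = groups.Average(g => g.FeeSum);
double b = groups.Sum(g => (g.PriceSum - meanP) * (g.FeeSum - meanF))
         / groups.Sum(g => (g.PriceSum - meanP) * (g.PriceSum - meanP));
double a = (meanF - b * meanP) / groups[0].X;   // back out the per-transaction fixed fee

Console.WriteLine($"fee ~= {a:F2} + {b:F4} * price");   // ~5.00 + 0.0400 * price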
Recently I have been reading about lotto wheeling and combination generation. I thought I'd give it a whirl and looked around for example code. I managed to cobble together a number wheel based on some VB, but I've introduced an interesting bug while porting it.
http://www.xtremevbtalk.com/showthread.php?t=168296
It basically allows you to ID any combination: you feed it N numbers, K picks, and an index, and it returns the combination at that index in lexicographical order.
It works well at low values, but as the number of balls (N) rises I get extra numbers occurring. For example, with 40 balls and 2 picks, combination no. 780 returns 40 and 41! The more picks and numbers I add, the higher this goes. It seems to happen at the end of a run, when the preceding number is due to cycle.
The method for calculating the number of possible combinations on the VB forum didn't make a lot of sense to me, so I found a simpler one:
http://www.dreamincode.net/code/snippet2334.htm
Then I discovered that using doubles seems to cause a lack of resolution. Using long works, but now I can't use higher values of N because the multiplication goes out of range for a long! I then tried ulong and decimal; neither could go much past 26-28 numbers (N).
So I reverted to the version on the VB site.
http://www.xtremevbtalk.com/showthread.php?s=6548354125cb4f312fc555dd0864853e&t=129902
The code is a method for avoiding the 96-bit ceiling and claims to be able to calculate as high as N = 98, K = 49.
For some reason I cannot get this to behave; it spits out some very strange numbers.
After giving up for a while I decided to re-read the suggested wiki. While most of it was over my head, I discovered that certain ways of calculating a binomial coefficient are inaccurate. That wouldn't be appropriate for a system where you are essentially dialing up (wheeling) into a game. After a bit of searching and reading I came across this:
http://dmitrybrant.com/2008/04/29/binomial-coefficients-stirling-numbers-csharp
Turns out this is exactly the information I was looking for! The first method is accurate and plenty fast for anything I'm doing. Many thanks to psYchotic for going to the trouble of joining just to post here!
There are exactly 780 combinations of 2 numbers to pick out of a set of 40. If your combination generator uses a zero-based index, any index >= the number of combinations is invalid.
You can use the binomial coefficient to determine the number of combinations that can be formed.
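For reference, here is a sketch of an exact binomial coefficient in C# using the multiplicative formula with BigInteger (my own code, not taken from the linked article), which sidesteps the overflow and precision problems described above:

using System;
using System.Numerics;

static BigInteger Choose(int n, int k)
{
    if (k < 0 || k > n) return 0;
    if (k > n - k) k = n - k;              // C(n, k) == C(n, n - k)
    BigInteger result = 1;
    for (int i = 1; i <= k; i++)
        result = result * (n - k + i) / i; // stays an exact integer at every step
    return result;
}

Console.WriteLine(Choose(40, 2));          // 780 -> valid zero-based indexes are 0..779
Console.WriteLine(Choose(98, 49));         // far beyond what long or double can hold exactly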