I'm trying to understand how TrainingParameterScheduleDouble works in the CNTK C# API. Unfortunately, there is no documentation and the previous SO thread here appears to be incorrect/incomplete, so I've tried to reverse engineer the behavior myself. Can anyone confirm my conclusions and answer the lingering questions that I have?
Overload #1
TrainingParameterScheduleDouble(value, minibatchSize)
This sets the learning rate to value per minibatchSize number of samples, regardless of the actual minibatch size passed to GetNextMinibatch. Thus, using minibatchSize: 1 is an easy way to specify a per-sample learning rate.
It seems to me that calling the second parameter minibatchSize is very misleading in this context, since it's totally unrelated to the actual size of each minibatch. I think a better name would have been something like perNumSamples, or am I missing something?
Overload #2
TrainingParameterScheduleDouble(value)
This is the same as setting minibatchSize: 0 above, and has the effect of using the "natural" minibatchSize that's passed to GetNextMinibatch as the number of samples.
So if we have GetNextMinibatch(64) then new TrainingParameterScheduleDouble(0.001) will result in a 64x slower learning rate than new TrainingParameterScheduleDouble(0.001, 1).
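For example (a sketch only; these are just the two constructors discussed above, and the comments reflect my reading of the behavior rather than anything documented):

// Per-sample rate: 0.001 is applied per single sample, whatever size is
// passed to GetNextMinibatch.
var perSample = new TrainingParameterScheduleDouble(0.001, 1);

// "Natural" minibatch rate: 0.001 is applied per actual minibatch, so with
// GetNextMinibatch(64) each sample effectively sees 0.001 / 64, i.e. the
// 64x slower rate described above.
var perMinibatch = new TrainingParameterScheduleDouble(0.001);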
Overload #3
TrainingParameterScheduleDouble(schedule)
This changes the learning rate over time, using the "natural" minibatch size. So a schedule of (30, 0.321), (1, 0.123) will use a per-actual-minibatch learning rate of 0.321 for the first 30 minibatches and a rate of 0.123 thereafter.
Overload #4
TrainingParameterScheduleDouble(schedule, epochSize)
epochSize causes IsSweepBased() to return False instead of True, but otherwise has no apparent effect on the learning rate or anything else. This is surprising. Can anyone explain the purpose of epochSize in this context?
Overload #5
TrainingParameterScheduleDouble(schedule, epochSize, minibatchSize)
This is the only way to change the learning rate over time without using the natural minibatch size. So a schedule of (30, 0.321), (1, 0.123) with minibatchSize: 1 will use a per-sample learning rate of 0.321 for the first 30 samples (regardless of the actual minibatch size) and a rate of 0.123 thereafter. As before, the epoch size has no apparent effect.
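To make that reading concrete, here is a small stand-alone sketch (plain C#, not the CNTK API) of how I understand such a schedule to map a running unit count (samples or minibatches, depending on minibatchSize) to a rate:

using System.Collections.Generic;

// Illustrative only: (count, rate) pairs, e.g. (30, 0.321), (1, 0.123).
// Each entry applies for 'count' units; the last entry applies forever after.
static double RateAt(IList<(int Count, double Rate)> schedule, long unitsSeen)
{
    long boundary = 0;
    for (int i = 0; i < schedule.Count; i++)
    {
        boundary += schedule[i].Count;
        if (unitsSeen < boundary || i == schedule.Count - 1)
            return schedule[i].Rate;
    }
    return schedule[schedule.Count - 1].Rate;
}

// RateAt(schedule, 10) -> 0.321 (within the first 30 units)
// RateAt(schedule, 45) -> 0.123 (past the first 30 units)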
Assuming this is correct, it's not clear to me what happens if the learning rate changes in the middle of a minibatch. Can anyone clarify?
I'm looking for a way to detect faulty sensors in an IoT environment.
In this case a tank level sensor. The readings are always fluctuating somewhat, and the "hop" at the beginning is a tank refill which is "normal". On Sep 16 the sensor started to malfunction and just gives apparent random values after that.
As a programmer ideally I'd like a simple way of detecting the problem (and as soon after it starts as possible).
I could hack something together like "if the direction of the vector between two hourly averages changes more than once per day, it is unstable", but I guess there are more sound and stable algorithms out there.
Two simple options:
domain knowledge based: If you know the maximum possible output of the tank (say 5 liters/h), any output above that would signal an error. I.e., in the case of the example, flag an error if
t1 - t2 > 5
assuming t1 and t2 are the tank level readings at hourly intervals. You might want to add a safety margin related to sensor accuracy.
past data based: Assuming that all tanks are similar regarding output capacity and the sensor quality used, calculate the following over all your data from non-faulty sensors:
max(t1 - t2)
The result is the error threshold to use, analogous to the value 5 above.
Note: tank refill operation might require additional consideration.
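A minimal sketch combining both checks (hourly readings in a double[]; the threshold value and the refill handling are placeholders):

using System.Collections.Generic;
using System.Linq;

// Flag any hourly drop that exceeds a threshold derived either from domain
// knowledge (max tank output per hour) or from the max drop seen in known-good data.
static IEnumerable<int> SuspectHours(double[] hourlyLevels, double maxOutputPerHour)
{
    for (int i = 1; i < hourlyLevels.Length; i++)
    {
        double delta = hourlyLevels[i - 1] - hourlyLevels[i];    // consumption this hour
        if (delta < 0) continue;                                 // refill: handle separately
        if (delta > maxOutputPerHour) yield return i;            // physically impossible drop
    }
}

// Threshold learned from historical data of non-faulty sensors:
// double threshold = goodLevels.Zip(goodLevels.Skip(1), (a, b) => a - b).Max();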
Additional methods are described e.g. here. You can find other papers for sure.
http://bourbon.usc.edu/leana/pubs-full/sensorfaults.pdf
Standard deviation.
You're looking at how much variation there is between the measurements. Standard deviation is an easy formula, and well known. Look for a high value, and you know there's a problem.
You can also use the coefficient of variation, which is the ratio of the standard deviation to the mean.
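A minimal sketch of this (the window size and alert threshold are made-up placeholders):

using System;
using System.Linq;

// Standard deviation and coefficient of variation over a sliding window of
// recent readings; a sudden jump in either suggests a faulty sensor.
static (double StdDev, double Cv) WindowStats(double[] window)
{
    double mean = window.Average();
    double variance = window.Select(x => (x - mean) * (x - mean)).Average();
    double stdDev = Math.Sqrt(variance);
    return (stdDev, mean != 0 ? stdDev / mean : double.NaN);
}

// e.g. flag the sensor if the CV of the last 24 hourly readings exceeds a
// threshold learned from healthy data:
// bool faulty = WindowStats(last24Readings).Cv > 0.2;   // 0.2 is made up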
I'm implementing Ng's example of OCR neural network in C#.
I think I've got all formulas correctly implemented [vectorized version] and my app is training the network.
Any advice on how can I see my network improving in recognition - without testing examples manually by drawing them after the training is done? I want to see where my training is going while it's being trained.
I've tested my trained weights on drawn digits; the output on all neurons is quite similar (approx. 0.077, or something like that, on all neurons), and the largest value is on the wrong neuron. So the result doesn't match the drawn image.
This is the only test I'm doing so far: Cost Function changes with epochs
So, this is what happens with the cost function (some call it the objective function?) over 50 epochs.
My lambda value is set to 3.0, the learning rate is 0.01, and there are 5000 examples. I do a batch update after each epoch, i.e. after those 5000 examples. Activation function: sigmoid.
input: 400
hidden: 25
output: 10
I don't know what proper values are for lambda and learning rate so that my network can learn without overfitting or underfitting.
Any suggestions on how to find out whether my network is learning well?
Also, what value should the J cost function have after all this training?
Should it approach zero?
Should I have more epochs?
Is it bad that my examples are all ordered by digits?
Any help is appreciated.
Q: Any suggestions on how to find out whether my network is learning well?
A: Split the data into three groups: training, cross-validation and test. Validate your result with the test data. This is actually addressed later in the course.
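A minimal sketch of that idea (TrainEpoch, CostFunction and the data arrays are placeholders for whatever your implementation exposes):

// Watch training cost vs. validation cost each epoch.
// Training cost falling while validation cost rises => overfitting;
// both staying high => underfitting.
for (int epoch = 0; epoch < maxEpochs; epoch++)
{
    TrainEpoch(trainX, trainY);                          // your batch gradient step
    double trainCost = CostFunction(trainX, trainY);     // regularized cost J on training set
    double valCost = CostFunction(valX, valY);           // cost J on held-out validation set
    Console.WriteLine($"epoch {epoch}: train J = {trainCost:F4}, val J = {valCost:F4}");
}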
Q: Also, what value should the J cost function have after all this training? Should it approach zero?
A: I recall that in the homework Ng mentioned what the expected value is. The regularized cost should not be zero, since it includes a sum over all the weights.
Q: Should I have more epochs?
A: If you run your program long enough (less than 20 minutes?) you will see the cost stop getting smaller. I assume it has reached a local/global optimum, so more epochs would not be necessary.
Q: Is it bad that my examples are all ordered by digits?
A: The algorithm modifies the weights for every example, so a different order of the data affects each step (at least with per-example updates; with a single full-batch update per epoch, the order within the batch does not matter). Either way, the final result should not differ much.
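If you do switch to per-example or mini-batch updates, shuffling before each epoch is cheap insurance; a minimal Fisher-Yates sketch:

using System;

// Shuffle the example indices in place before each epoch.
static void Shuffle(int[] indices, Random rng)
{
    for (int i = indices.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        (indices[i], indices[j]) = (indices[j], indices[i]);
    }
}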
I have a stream of data that I get at around 10 snapshots per second. I wrote a C# controller that takes this data and adjusts some data-gathering parameters based on the distance away from the expected result. Currently, I just do a linear scale operation on my data so that the farther away I am from the expected result the more it corrects. The problem is that the incoming data stream is delayed by somewhere between 0.5 and 2 seconds (I can calculate that at runtime). Because of this delay, it is correcting for results from a while ago and is constantly overcorrecting and even correcting in the wrong direction sometimes.
I am looking for an algorithm to do the following:
Implement a correction algorithm (I'd prefer PID) that will attempt to home in on the optimal value
Predict a certain amount of time (or number of datasets) in advance based on the history of corrections and results
What options do I have in terms of algorithms that can accomplish this?
C# code samples would be appreciated; I am not great at converting complex pseudo-code into a working implementation.
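For the PID part, a minimal discrete sketch with a crude delay compensation (the gains are made up, and the "prediction" is just a linear extrapolation of the last measured trend over the known delay):

using System;

class PidController
{
    public double Kp = 1.0, Ki = 0.1, Kd = 0.05;   // made-up gains; tune for your system
    double integral, previousError;

    // setpoint: expected result; measured: latest (delayed) value;
    // measuredRate: estimated change per second; delaySeconds: measured stream delay.
    public double Step(double setpoint, double measured, double measuredRate,
                       double delaySeconds, double dt)
    {
        double predicted = measured + measuredRate * delaySeconds;  // naive look-ahead
        double error = setpoint - predicted;
        integral += error * dt;
        double derivative = (error - previousError) / dt;
        previousError = error;
        return Kp * error + Ki * integral + Kd * derivative;        // correction to apply
    }
}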
I'm facing a problem and I'm having trouble deciding on an approach to solve it. The problem is the following:
Given N phone calls to be made, schedule them so that as many of them as possible are made.
Known info:
Number of phone calls pending
Number of callers (people who will talk on the phone)
Type of phone call (Reminder, billing, negotiation, etc...)
Estimated duration of each phone call type (reminder: 1 min, billing: 3 min, negotiation: 15 min, etc...)
Ideal date for a given call
"Minimum" date of the a given call (can't happen before...)
"Maximum" date of the a given call (can't happen after...)
A day only have 8 hours
Rules:
Phone calls cannot be made before the "Minimum" or after the "Maximum" date
A reminder call placed awards 1 point; a reminder call missed, -2 points
A billing call placed awards 6 points; a billing call missed, -9 points
A negotiation call placed awards 20 points; a negotiation call missed, -25 points
A phone call to John should be placed by the first person to ever call him. Note that it does not HAVE to be, but that call will earn extra points if it is...
I know a little about A.I. and I can recognize that this is a problem that fits that class, but I just don't know which approach to take... Should I use neural networks? Graph search?
PS: this is not an academic question. This is a real-world problem that I'm facing.
PS2: The point system is still being created... the points sampled here are not the real ones...
PS3: The resulting algorithm can be executed several times (batch-job style) or it can be run online, depending on performance...
PS4: My contract states that I will charge the client based on: (amount of calls I place) + (ratio * the duration of the call), but there's a clause about quality of service, and only placing reminder calls is not good for me, because even when reminded, people still forget to attend their appointments... which reduces the "quality" of the service I provide... I don't know the exact numbers yet.
This does not seem like a problem for AI.
If it were me I would create a set of rules, ordered by priority. Then start filling in the caller's schedule.
Maybe one of the rules is to assign the shortest-duration call types first (to satisfy the "maximum number of calls made" criterion).
This is sounding more and more like a knapsack problem, where you would substitute call duration and call points for weight and value.
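A minimal greedy sketch of that substitution (the Call type and the points-per-minute priority are just for illustration; a real scheduler would also need to handle the min/max date windows across multiple days):

using System;
using System.Collections.Generic;
using System.Linq;

record Call(string Type, int DurationMin, int Points, DateTime MinDate, DateTime MaxDate);

// Fill one 8-hour (480-minute) day by "value density" (points per minute),
// the same trick used for the knapsack problem, skipping calls whose
// window does not include this day.
static List<Call> FillDay(IEnumerable<Call> pending, DateTime day, int capacityMin = 480)
{
    var scheduled = new List<Call>();
    int used = 0;
    foreach (var call in pending
                 .Where(c => day >= c.MinDate && day <= c.MaxDate)
                 .OrderByDescending(c => (double)c.Points / c.DurationMin))
    {
        if (used + call.DurationMin > capacityMin) continue;
        scheduled.Add(call);
        used += call.DurationMin;
    }
    return scheduled;
}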
This is just a very basic answer, but you could try to "brute force" an optimum solution:
Use the Combinatorics library (it's in NuGet too) to generate every permutation of calls for a given person to make in a given time period (looking one week into the future, for instance).
For each permutation, group the calls into 8-hour chunks by estimated duration, and assign a date to them.
Iterate through the chunks - if you get to a call too early, discard that permutation. Otherwise add or subtract points based on whether the call was made before the end date. Store the total score as the score for that permutation.
Choose the permutation with the highest score.
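A rough sketch of the scoring step (reusing the Call record from the sketch above; GetPermutations stands in for whatever permutation source you use, e.g. the Combinatorics package or your own recursion, and the scoring is simplified to "made within the window" vs. "missed"):

using System;
using System.Collections.Generic;
using System.Linq;

// Score one permutation by packing calls into consecutive 8-hour days and
// adding/subtracting points depending on whether each call fits its window.
static int ScorePermutation(IList<Call> order, DateTime startDay, int dayMinutes = 480)
{
    int score = 0, usedToday = 0;
    DateTime day = startDay;
    foreach (var call in order)
    {
        if (usedToday + call.DurationMin > dayMinutes) { day = day.AddDays(1); usedToday = 0; }
        if (day < call.MinDate) return int.MinValue;     // too early: discard this permutation
        score += day <= call.MaxDate ? call.Points : -call.Points;
        usedToday += call.DurationMin;
    }
    return score;
}

// Only feasible for small N, since the number of permutations grows as N!:
// var best = GetPermutations(calls).OrderByDescending(p => ScorePermutation(p, today)).First();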
This is a bit of an academic question as I'm struggling with the thinking behind Microsoft using double as the data type for the Interval property!
Firstly, from MSDN, Interval is "the time, in milliseconds, between Elapsed events"; I would interpret that to be a discrete number, so why the use of a double? Surely int or long makes more sense!?
Can Interval support values like 5.768585 (5.768585 ms)? Especially when one considers System.Timers.Timer to have nowhere near sub millisecond accuracy... Most accurate timer in .NET?
Seems a bit daft to me.. Maybe I'm missing something!
Disassembling shows that the interval is consumed via a call to (int)Math.Ceiling(this.interval) so even if you were to specify a real number, it would be turned into an int before use. This happens in a method called UpdateTimer.
Why? No idea, perhaps the spec said that double was required at one point and that changed? The end result is that double is not strictly required, because it is eventually converted to an int and cannot be larger than Int32.MaxValue according to the docs anyway.
Yes, the timer can "support" real numbers; it just doesn't tell you that it silently changed them. You can initialise and run the timer with 100.5d, and it turns it into 101.
And yes, it is all a bit daft: 4 wasted bytes, potential implicit casting, conversion calls, explicit casting, all needless if they'd just used int.
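A quick illustration of that behaviour (the expectation of 101 ms is based on the (int)Math.Ceiling call found in the disassembly, not on documentation):

using System;
using System.Timers;

// Set a fractional interval; per the disassembly, the effective interval
// becomes (int)Math.Ceiling(100.5) = 101 ms.
var timer = new System.Timers.Timer { Interval = 100.5, AutoReset = true };
timer.Elapsed += (s, e) => Console.WriteLine($"tick at {e.SignalTime:HH:mm:ss.fff}");
timer.Start();
Console.ReadLine();            // keep the process alive to observe the ticks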
The reason to use a double here is the attempt to provide enough accuracy.
In detail: the system's interrupt time slices are given by ActualResolution, which is returned by NtQueryTimerResolution(). NtQueryTimerResolution is exported by the native Windows NT library NTDLL.DLL. The system time increments are given by TimeIncrement, which is returned by GetSystemTimeAdjustment().
These two values determine the behavior of the system timers. They are integer values expressed in 100 ns units. However, this is already insufficient for certain hardware today. On some systems ActualResolution is returned as 9766, which would correspond to 0.9766 ms. But in fact these systems operate at 1024 interrupts per second (tuned by proper setting of the multimedia interface). 1024 interrupts a second give an interrupt period of 0.9765625 ms. This is too fine-grained; it reaches into the 100 ps regime and can therefore not be held in the standard ActualResolution format.
Therefore it was decided to put such time parameters into a double. But this does not mean that all possible values are supported/used. The granularity given by TimeIncrement will persist, no matter what.
When dealing with timers it is always advisable to look at the granularity of the parameters involved.
So back to your question: Can Interval support values like 5.768585 (ms) ?
No, the system I've taken as an example above cannot.
But it can support 5.859375 (ms)!
Other systems with different hardware may support other numbers.
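A tiny illustration of the arithmetic for the 1024-interrupts-per-second example:

using System;

double granularityMs = 1000.0 / 1024;        // 0.9765625 ms per interrupt
Console.WriteLine(granularityMs);            // exactly representable in a double
Console.WriteLine(6 * granularityMs);        // 5.859375 ms: a supported interval
Console.WriteLine(granularityMs * 10_000);   // 9765.625 "100 ns units" -> not an integer,
                                             // so it cannot be held in ActualResolution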
So introducing a double here is not such a stupid idea; it actually makes sense. Spending another 4 bytes to finally get things right is a good investment.
I've summarized some more details about Windows time matters here.