I'd like your advice: could you recommend a library that allows you to add/subtract/multiply/divide PDFs (Probability Density Functions) like real numbers?
Behind the scenes, it would have to run a Monte Carlo simulation to work out the result, so I'd probably prefer something fast and efficient that can take advantage of any GPU in the system.
Update:
This is the sort of C# code I am looking for:
var a = new Normal(0.0, 1.0); // Creates a PDF with mean=0, std. dev=1.0.
var b = new Normal(0.0, 2.0); // Creates a PDF with mean=0, std. dev=2.0.
var x = a + b; // Creates a PDF which is the sum of a and b.
// i.e. perform a Monte Carlo by taking thousands of samples
// of a and b to construct the resultant PDF.
Update:
What I'm looking for is a method to implement the algebra on "probability shapes" in The Flaw of Averages by Sam Savage. The video Monte Carlo Simulation in Matlab explains the effect I want - a library to perform math on a series of input distributions.
Update:
Searching for the following will produce info on the appropriate libraries:
"monte carlo library"
"monte carlo C++"
"monte carlo Matlab"
"monte carlo .NET"
The @RISK Developer Kit allows you to start with a set of probability density functions, then perform algebra on the inputs to get some output, i.e. P = A + B.
The keywords on this page can be used to find other competing offerings, e.g. try searching for:
"monte carlo simulation model C++"
"monte carlo simulation model .NET"
"risk analysis toolkit"
"distributing fitting capabilties".
It's not all that difficult to code this up in a language such as C++ or .NET. The Monte Carlo portion is probably only about 50 lines of code:
Read "The Flaw Of Averages" by Sam Savage to understand how you can use algebra on "probability shapes".
Have some method of generating a "probability shape", either by bootstrapping from some sampled data, or from a pre-determined probability density function, or by using the Math.NET probability library.
Take 10000 samples from the input probability shapes.
Do the algebra on the samples, i.e. +, -, /, *, etc, to get 1000 outputs. You can also form a probability tree which implies and, or, etc on the inputs.
Combine these 10000 outputs into a new "probability shape" by putting the results into 100 discrete "buckets".
Now that we have a new "probability shape", we can then use that as the input into a new probability tree, or perform an integration to get the area, which converts it back into a hard probability number given some threshold.
The video Monte Carlo Simulation in Matlab explains this entire process much better than I can.
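To make steps 3-6 concrete, here is a minimal C# sketch of the sampling, algebra, bucketing and thresholding, assuming Math.NET Numerics' Normal distribution class for the inputs; the 10,000 samples and 100 buckets mirror the numbers above, and everything else is illustrative rather than any specific library's API:

using System;
using System.Linq;
using MathNet.Numerics.Distributions;

class ProbabilityShapeDemo
{
    static void Main()
    {
        const int sampleCount = 10000;
        var a = new Normal(0.0, 1.0);   // input "probability shape" A
        var b = new Normal(0.0, 2.0);   // input "probability shape" B

        // Steps 3-4: sample both inputs and do the algebra sample by sample.
        var outputs = new double[sampleCount];
        for (int i = 0; i < sampleCount; i++)
            outputs[i] = a.Sample() + b.Sample();   // swap '+' for -, *, / as needed

        // Step 5: put the outputs into 100 discrete buckets to form the new shape.
        const int bucketCount = 100;
        double min = outputs.Min(), max = outputs.Max();
        double width = (max - min) / bucketCount;
        var buckets = new int[bucketCount];
        foreach (double value in outputs)
            buckets[Math.Min((int)((value - min) / width), bucketCount - 1)]++;
        // 'buckets' now holds the histogram of the resulting probability shape.

        // Step 6: integrate above a threshold to get back a hard probability.
        double threshold = 1.0;
        double p = outputs.Count(v => v > threshold) / (double)sampleCount;
        Console.WriteLine("P(A + B > {0}) ≈ {1:F3}", threshold, p);
    }
}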
@Gravitas - Based on that exchange with @user207442, it sounds like you just want an object that abstracts away a convolution for addition and subtraction. There is certainly a closed form solution for the product of two random variables, but it might depend on the distribution.
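For reference, the convolution behind the '+' case and the textbook closed form for the sum of two independent normals (standard results, not tied to any particular library) are:

f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(t)\, f_Y(z - t)\, dt,
\qquad
X \sim \mathcal{N}(\mu_1,\sigma_1^2),\ Y \sim \mathcal{N}(\mu_2,\sigma_2^2)
\;\Rightarrow\; X + Y \sim \mathcal{N}(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2).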
C#'s hot new stepsister, F#, lets you do some fun FP techniques, and it integrates seamlessly with C#. Your goal of abstracting out a "random variable" type that can be "summed" (convolved) or "multiplied" (??) seems like it is screaming for a monad. Here is a simple example.
Edit: do you need to reinvent MCMC in C#? We use WinBUGS for this at my school... this is the C++ library WinBUGS uses: http://darwin.eeb.uconn.edu/mcmc++/mcmc++.html. Rather than reinventing the wheel, could you just wrap your code around the C++ (again, it seems like monads would come in handy here)?
Take a look at the Math.NET Numerics library. Here is the page specific to probability distribution support.
I'm trying to reproduce the same curve fitting (called "trending") that Excel does, but in C#: Exponential, Linear, Logarithmic, Polynomial and Power.
I found linear and polynomial fits as:
Tuple<double, double> line = Fit.Line(xdata, ydata);
double[] poly2 = Fit.Polynomial(xdata, ydata, 2);
I also found an Exponential fit.
But I wonder how to do curve fitting for Power. Does anybody have an idea?
I should be able to get both constants, as shown in the Excel screenshot formula:
power
multiplier (before x)
Before anybody becomes the fifth person to vote to close this question...
I asked the question directly on the mathdotnet forum (which I discovered recently). Christoph Ruegg, the main developer of the library, gave me an excellent answer that I want to share here to help others with the same problem:
Assuming with power you're referring to a target function along the lines of y : x -> a*x^b, then this is a simpler version of what I've described in Linearizing non-linear models by transformation.
This seems to be used often enough, so I've started to add a new Fit.Power and Fit.Exponential locally for this case - not pushed yet since it first needs more testing, but I expect it to be part of v4.1.
Alternatively, by now we also support non-linear optimization, which could also be used for use cases like this (FindMinimum module).
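Until Fit.Power ships, the transformation he refers to is easy to do by hand: fit a straight line to (ln x, ln y) with Fit.Line and map the intercept back. Here is a minimal sketch, assuming Math.NET Numerics' Fit.Line and made-up sample data (not values from the Excel sheet):

using System;
using System.Linq;
using MathNet.Numerics;

// Fit y = a * x^b by linearizing: ln(y) = ln(a) + b * ln(x).
double[] xdata = { 1.0, 2.0, 3.0, 4.0, 5.0 };    // placeholder data
double[] ydata = { 2.0, 8.0, 18.0, 32.0, 50.0 }; // exactly y = 2 * x^2

double[] lnX = xdata.Select(Math.Log).ToArray();
double[] lnY = ydata.Select(Math.Log).ToArray();

// Fit.Line returns (intercept, slope) of the best-fit line on the log-log data.
Tuple<double, double> line = Fit.Line(lnX, lnY);

double multiplier = Math.Exp(line.Item1); // the multiplier before x (a ≈ 2)
double power = line.Item2;                // the power (b ≈ 2)
Console.WriteLine("y ≈ {0:F4} * x^{1:F4}", multiplier, power);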
Link to my question: mathdotnet - Curve fitting: Power
I started using the Math.NET Numerics library and I need it to calculate the largest eigenvalues and their corresponding eigenvectors of my adjacency matrix.
When using a large number of points, my adjacency matrix gets quite big (e.g. 5782x5782 entries).
Most of the entries are '0', so I thought I could use the SparseMatrix type. But when I use it, the computation still takes ages; in fact, I never actually waited long enough for it to finish.
I tried the whole thing in MATLAB and there wasn't any problem at all; MATLAB solved it within a few seconds.
Do you have any suggestions for me?
Here is what I'm doing:
// initialize matrix and fill it with zeros
Matrix<double> A = SparseMatrix.Create(count, count, 0);
... fill matrix with values ...
// get eigenvalues and eigenvectors / this part takes centuries =)
Evd<double> eigen = A.Evd(Symmetricity.Symmetric);
Vector<Complex> eigenvalues = eigen.EigenValues;
Matrix<double> eigenvectors = eigen.EigenVectors;
Math.NET Numerics' implementation is purely C# based. Therefore, performance may not be on par with tools such as MATLAB, which mostly rely on native, highly optimized BLAS libraries for numerical computations.
You may want to use the native wrappers that come with Math.Net that leverage highly optimized linear algebra libraries (such as Intel's MKL or AMD's ACML). There is a guide on this MSDN page that explains how to build Math.NET with ACML support (look under Compiling and Using AMD ACML in Math.NET Numerics).
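For example, with one of the native provider packages installed (e.g. the MKL package for your platform), enabling it is roughly a one-liner. A minimal sketch, assuming the native binaries are deployed alongside the application:

using MathNet.Numerics;
using MathNet.Numerics.LinearAlgebra;

// Route Math.NET's linear algebra through the native MKL provider;
// this should fail fast if the native binaries cannot be found.
Control.UseNativeMKL();

// Build a symmetric test matrix and run the same Evd as above.
var a = Matrix<double>.Build.Random(1000, 1000);
var symmetric = a + a.Transpose();
var evd = symmetric.Evd(Symmetricity.Symmetric);
var eigenvalues = evd.EigenValues;   // Vector<Complex>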
Long story short: I have to solve 20..200 block-tridiagonal linear systems during an iterative process. The systems are 50..100 blocks in size, each block being 50..100 x 50..100. I will write down my thoughts on it here, and I ask you to share your opinion on them, as it is possible that I am mistaken in one regard or another.
To solve those equations, I use a matrix version of the Thomas algorithm. It's exactly like the scalar one, except that instead of scalar coefficients I have matrices (i.e. instead of "a_i x_{i-1} + b_i x_i + c_i x_{i+1} = f_i" I have "A_i X_{i-1} + B_i X_i + C_i X_{i+1} = F_i", where A_i, B_i, C_i are matrices and F_i, X_i are vectors).
The asymptotic complexity of this algorithm is O(N*M^3), where N is the size of the overall matrix in blocks and M is the size of each block.
Right now my bottleneck is the inversion operation. Deep inside nested loops I have to calculate a lot of inversions that look like "(c_i - a_i * alpha_i)^-1", where alpha_i is a dense MxM matrix. I am doing it using the Gauss-Jordan algorithm, using additional memory (which I will have to use anyway later in the program) and O(M^3) operations.
Trying to find info on how to optimize the inversion operation, I've found only threads about solving AX=B systems 'canonically', i.e. X = A^-1 B, with suggestions to use LU factorization instead. Sadly, as my inversion is part of the Thomas algorithm, if I resort to LU factorization I will have to do it for an (M*N)x(M*N) matrix, which will raise the complexity of solving the linear system by an extra N^2, to O(N^3*M^3). That's a slowdown by a factor of 2500..10000, which is quite bad.
Approximate or iterative inversions are out of scope too, as the slightest residual compared to an exact inversion will accumulate very fast and cause the global iterative process to explode.
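For readers unfamiliar with the block variant, this is roughly what the forward sweep and back substitution look like. It is a sketch of one common formulation (indexing conventions differ from the question's), written with Math.NET matrix types for brevity rather than the plain double arrays the question uses, with the bottleneck inversion marked:

using MathNet.Numerics.LinearAlgebra;

static class BlockThomas
{
    // Solves A[i]*X[i-1] + B[i]*X[i] + C[i]*X[i+1] = F[i] for i = 0..n-1,
    // assuming A[0] and C[n-1] are zero. Each block is M x M.
    public static Vector<double>[] Solve(
        Matrix<double>[] A, Matrix<double>[] B, Matrix<double>[] C, Vector<double>[] F)
    {
        int n = B.Length;
        var alpha = new Matrix<double>[n];
        var beta = new Vector<double>[n];

        // Forward sweep. The (B[i] - A[i]*alpha[i-1]) inverse below is the
        // O(M^3) per-block bottleneck discussed above.
        var d0 = B[0].Inverse();
        alpha[0] = d0 * C[0];
        beta[0] = d0 * F[0];
        for (int i = 1; i < n; i++)
        {
            var dInv = (B[i] - A[i] * alpha[i - 1]).Inverse();
            alpha[i] = dInv * C[i];
            beta[i] = dInv * (F[i] - A[i] * beta[i - 1]);
        }

        // Back substitution: X[n-1] = beta[n-1], X[i] = beta[i] - alpha[i]*X[i+1].
        var x = new Vector<double>[n];
        x[n - 1] = beta[n - 1];
        for (int i = n - 2; i >= 0; i--)
            x[i] = beta[i] - alpha[i] * x[i + 1];
        return x;
    }
}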
I do calculations in parallel with Parallel.For(), solving each of the 20..200 systems separately.
Right now, solving 20 such systems with N,M=50 takes 872 ms on average (i7-3630QM, 2.4 GHz, 4 cores (8 with hyperthreading)).
And finally, here come the questions.
Am I correct in what I wrote here? Is there an algorithm that would significantly speed up the calculations over what they are now?
Inside the number-crunching part of my program I use only for loops (most of them with constant bounds, the exception being one of the loops inside the inversion algorithm), double arithmetic (+, -, *, /) and standard arrays ([], [,], [,,]). Will there be any speed-up if I rewrite this part as unsafe code? Or as a library in C?
How much overhead does C# add on such tasks (grinding through double arrays)? Are C compilers better at optimizing such simple code than the C# compiler and JIT?
What should I look at when optimizing a number cruncher in C#? Is C# suited for such tasks at all?
I want to try to create an application which rates a user's Facebook posts based on their content (sentiment analysis).
I tried creating an algorithm myself initially, but I felt it wasn't that reliable.
I created a dictionary list of words, scanned the posts against the dictionary, and rated each post as positive or negative.
However, I feel this is minimal. I would like to rate the mood or feelings/personality traits of the person based on the posts. Is this possible?
I would hope to make use of some online APIs; please assist. Thanks ;)
As @Jared pointed out, using a dictionary-based approach can work quite well in some situations, depending on the quality of your training corpus. This is actually how CLiPS' pattern library and TextBlob's implementations work.
Here's an example using TextBlob:
from text.blob import TextBlob
b = TextBlob("StackOverflow is very useful")
b.sentiment # returns (polarity, subjectivity)
# (0.39, 0.0)
By default, TextBlob uses pattern's dictionary-based algorithm. However, you can easily swap out algorithms. You can, for example, use a Naive Bayes classifier trained on a movie reviews corpus.
from text.blob import TextBlob
from text.sentiments import NaiveBayesAnalyzer
b = TextBlob("Today is a good day", analyzer=NaiveBayesAnalyzer())
b.sentiment # returns (label, prob_pos, prob_neg)
# ('pos', 0.7265237431528468, 0.2734762568471531)
The algorithm you describe should actually work well, but the quality of the result depends greatly on the word list used. For Sentimental, we take comments on Facebook posts and score them based on sentiment. Using the AFINN 111 word list to score the comments word by word, this approach is (perhaps surprisingly) effective. By normalizing and stemming the words first, you should be able to do even better.
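A word-list scorer of that kind is only a few lines. Here is a minimal C# sketch, with a tiny hypothetical word list standing in for AFINN-111 (a real implementation would load the full list from its data file and add normalization/stemming):

using System;
using System.Collections.Generic;
using System.Linq;

// Tiny stand-in for the AFINN-111 word list: word -> score in [-5, +5].
var afinn = new Dictionary<string, int>
{
    ["good"] = 3, ["great"] = 3, ["useful"] = 2,
    ["bad"] = -3, ["terrible"] = -3, ["boring"] = -2
};

int ScorePost(string post) =>
    post.ToLowerInvariant()
        .Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
        .Sum(word => afinn.TryGetValue(word, out var score) ? score : 0);

Console.WriteLine(ScorePost("Today is a good day"));          // positive total
Console.WriteLine(ScorePost("What a terrible, boring post")); // negative total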
There are lots of sentiment analysis APIs that you can easily incorporate into your app, also many have a free usage allowance (usually, 500 requests a day). I started a small project that compares how each API (currently supporting 10 different APIs: AIApplied, Alchemy, Bitext, Chatterbox, Datumbox, Lymbix, Repustate, Semantria, Skyttle, and Viralheat) classifies a given set of texts into positive, negative or neutral: https://github.com/skyttle/sentiment-evaluation
Each specific API can offer lots of other features, like classifying emotions (delight, anger, sadness, etc) or linking sentiment to entities the sentiment is attributed to. You just need to go through available features and pick the one that suits your needs.
TextBlob is another possibility, though it will only classify texts into pos/neg/neu.
If you are looking for an open-source implementation of a sentiment analysis engine based on a Naive Bayes classifier in C#, take a peek at https://github.com/amrishdeep/Dragon. It works best on a large corpus of words, like blog posts or multi-paragraph product reviews. However, I am not sure whether it would work for Facebook posts that have only a handful of words.
This is probably a long shot, but I asked a question earlier about converting one of the Statistics Toolbox codes into C#, realising that it was just a huge and lengthy process and there was not much available to automate it (which is really what I wanted, as the references I provided explained why it was so hard to do by hand; the comments I got were "why don't you try to convert it and ask questions where you get stuck", so obviously my question wasn't understood!).
The reason I was looking to do this is the long processing time MATLAB needs to complete what I'm working on (k-means and Bayes classifiers on large data sets). So I thought: why not just convert the code into C# and try my hand at multithreading and multiprocessing? This might provide a practical means to decrease the processing time. But obviously it's extremely hard to convert all of MATLAB's functions to C# by hand to accommodate this.
So my question is: if I import MATLAB files into C#, is it possible to have them used/run in a multithreaded and multiprocessed fashion, or will the imported files just run like they do in MATLAB?
The reason (I think) it runs slowly in MATLAB is that only some of the functions used by the Statistics Toolbox benefit from multithreading, specifically:
MATHEMATICS
Arrays and matrices
• Basic information: ISFINITE, ISINF, ISNAN, MAX, MIN
• Operators: +, -, .*, ./, .\, .^, *, ^, \ (MLDIVIDE), / (MRDIVIDE)
• Array operations: PROD, SUM
• Array manipulation: BSXFUN, SORT
Linear algebra
• Matrix Analysis: DET, RCOND
• Linear Equations: CHOL, INV, LINSOLVE, LU, QR
• Eigenvalues and singular values: EIG, HESS, SCHUR, SVD, QZ
Elementary math
• Trigonometric: ACOS, ACOSD, ACOSH, ASIN, ASIND, ASINH, ATAN, ATAND, ATANH, COS, COSD, COSH,HYPOT, SIN, SIND, SINH, TAN, TAND, TANH
• Exponential: EXP, POW2, SQRT
• Complex: ABS
• Rounding and remainder: CEIL, FIX, FLOOR, MOD, REM, ROUND
Special Functions
• ERF, ERFC, ERFCINV, ERFCX, ERFINV, GAMMA, GAMMALN
DATA ANALYSIS
• CONV2, FILTER, FFT and IFFT of multiple columns or long vectors, FFTN, IFFTN
So I'm not too sure how, or in what way, I could potentially decrease the processing time; the k-means and Bayes classifiers, when processing tens of thousands of records, are just unbearable in their processing time (understandably).
This is not something you will be able to do easily. In fact I would say it is not possible.
If you attempt it you have the following issues to deal with:
Find a (semi-)automated way to convert MATLAB functionality into C#
This does not exist to my knowledge.
Alter the resulting code to be multithreading-enabled
Modifying a mathematical algorithm to support multiple threads is very difficult and sometimes even impossible due to the data structures used.
Also keep in mind that some mathematical problems do not scale with the number of processors, so you might not even get the benefit you expected.