I am currently working on a Bayesian spam filter. I built the filter using an algorithm, but it will not work for long emails: there are just too many values to multiply, and the product exceeds the range of double. I thought about taking only the 10 or 20 most important values (the highest for both spam and ham) and multiplying only those. I thought about putting them in another Dictionary and then multiplying the values from it.
This is how it looks right now:
if (countsWordOccurenceSpam.ContainsKey(word.Key) && (!countsWordOccurenceOk.ContainsKey(word.Key)))
{
    int spamValue = 0;
    countsWordOccurenceSpam.TryGetValue(word.Key, out spamValue);
    totals = spamValue;
    fprob_spam = ((double)spamValue) / ile_spam;
    sum_spam = (((weight * probability) + (totals * fprob_spam)) / (totals + weight));
    sum_ok = ((weight * probability) / (totals + weight));
    sum_spam = Math.Pow(sum_spam, word.Value);
    sum_ok = Math.Pow(sum_ok, word.Value);
    wp_spam_1 = wp_spam_1 * sum_spam;
    last_o_1 = last_o_1 * sum_ok;
}
This is one part of the algorithm. Now I am thinking about putting all the values of sum_spam into one Dictionary and all the values of sum_ok into another, then using .Take(10) to select the 10 highest values and multiplying those.
Does this seem right? I suspect it would be very inefficient. Is there a better way to do it?
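One commonly used alternative to truncating the word list, not taken from the code above, is to add logarithms instead of multiplying probabilities, so the running value never leaves the range of double. The sketch below assumes every per-word probability is greater than zero; `LogSpaceSketch`, `LogProduct` and `LooksLikeSpam` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LogSpaceSketch
{
    // Returns log(p1^c1 * p2^c2 * ...) computed as c1*log(p1) + c2*log(p2) + ...
    // Assumption: every probability is strictly greater than zero.
    public static double LogProduct(IEnumerable<(double Probability, int Count)> terms)
    {
        double logSum = 0.0;
        foreach (var (probability, count) in terms)
        {
            // Math.Pow(p, count) followed by a multiplication becomes count * log(p) and an addition.
            logSum += count * Math.Log(probability);
        }
        return logSum;
    }

    // Because log is monotonic, comparing the two log scores gives the same
    // answer as comparing the two (possibly underflowing) products.
    public static bool LooksLikeSpam(double logSpam, double logHam) => logSpam > logHam;
}
```

With this shape, the two running products (wp_spam_1 and last_o_1 in the question) would become running sums, and only their comparison or difference would be used at the end.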
After several searches and mistakes on my part, I finally managed to get a result from the MSF (Microsoft Solver Foundation) solver.
However, it's not perfect, because my C# code still gives a different result from Excel.
In the Excel workbook I have 6 solvers, relatively identical.
Only one solver per tab, but I have a lot of calculations.
In order to best stick to the Excel workbook, I created one method per cell containing a formula.
My code works, in the sense that if I give it the same data as Excel I get the same results, but with the solver there is a small difference.
Here's what I did. I'd like you to tell me if there's anything I can improve while keeping my methods (which represent my Excel cells).
Each cell is implemented twice.
I need the value of the cell as a double for other calculations, and it seems that I can't use methods returning a double in the solver.
Classic method:
private double Cell_I5()
{
    double res = 0;
    res = (Math.Exp(-Var.Calc.Var4 * Var.Calc.De * M23) - 1) / (-Var.Calc.Var4 * Var.Calc.De * M23);
    return res;
}
Method for the solver:
private Term Solv_I5()
{
    Term res = 0;
    res = (Model.Exp(-Var.Calc.Var4 * Var.Calc.De * Solver_M23) - 1) / (-Var.Calc.Var4 * Var.Calc.De * Solver_M23);
    return res;
}
'M23' is a double
'Solver_M23' is a Decision
'Var4' is a double as well as 'De'.
So I use 'Term' as the return type and I change all the Math functions to 'Model', except Math.PI, which is a constant.
You can imagine that there are 60 to 70 methods like that.
My method for the solver:
public void StartSolver()
{
var solver = SolverContext.GetContext();
solver.ClearModel();
var model = solver.CreateModel();
//Instantiate the solver variables as Real (double), non-negative
Solver_M22 = new Decision(Domain.RealNonnegative, "M22");
Solver_M23 = new Decision(Domain.RealNonnegative, "M23");
Solver_M24 = new Decision(Domain.RealNonnegative, "M24");
Solver_M25 = new Decision(Domain.RealNonnegative, "M25");
Solver_M26 = new Decision(Domain.RealNonnegative, "M26");
model.AddDecision(Solver_M22);
model.AddDecision(Solver_M23);
model.AddDecision(Solver_M24);
model.AddDecision(Solver_M25);
model.AddDecision(Solver_M26);
model.AddConstraint("M22a", Solver_M22 <= 4);
model.AddConstraint("M22b", Solver_M22 >= 0);
model.AddConstraint("M23a", Solver_M23 <= 2);
model.AddConstraint("M23b", Solver_M23 >= 0.001);
model.AddConstraint("M24a", Solver_M24 <= 2);
model.AddConstraint("M24b", Solver_M24 >= 0);
model.AddConstraint("M25a", Solver_M25 <= 2);
model.AddConstraint("M25b", Solver_M25 >= 0);
model.AddConstraint("M26a", Solver_M26 <= 2);
model.AddConstraint("M26b", Solver_M26 >= 0.001);
//Test with classical calculation methods
double test = Cell_H33() + Cell_H23();
//Adding Solver Methods
model.AddGoal("SommeDesCarresDesEquartsGlobal", GoalKind.Minimize, Solv_H33() + Solv_H23());
// Solve our problem
var solution = solver.Solve();
// Get our decisions
M22 = Solver_M22.ToDouble();
M23 = Solver_M23.ToDouble();
M24 = Solver_M24.ToDouble();
M25 = Solver_M25.ToDouble();
M26 = Solver_M26.ToDouble();
string s = solution.Quality.ToString();
//For test
double testSortie = Cell_H33() + Cell_H23();
}
Questions:
1)
At no point do I indicate whether the calculation is linear or not. How do I indicate it, if necessary?
In Excel it is declared nonlinear.
I saw that the solver looks for the best method on its own.
2)
Is there something I'm not doing right, given that I don't get the same values as Excel? I checked all the methods one by one several times, but with so many of them I may have missed something. I will recheck tomorrow.
3)
Apart from redoing the calculation with the classic methods, I have not found a way to get my result from the 'solution' object.
How can I extract it from the result, if possible?
4)
Here are the values of the 5 variables that I get from MSF in C#:
0.06014756519010750
0.07283670953453890
0.07479568348101340
0.02864805010533950
0.00100000002842722
And what I get from the Excel solver:
0.0000
0.0010
0.0141
0.0000
0.0010
Is there a way to restrict the number of decimal places directly in the calculations?
When I round manually (after the calculation), it changes my result quite a bit.
Thank you.
[EDIT] I forgot to post this message; it was still pending.
This morning I ran the C# solver calculation again, and the result is now hugely different.
As a reminder, I want to minimize the result.
Excel = 3.92
C# = 8122.34
This result is not acceptable at all.
[EDIT 2]
I may have a clue:
When I do a simple calculation, such as:
private Term Solv_I5()
{
Term res = 0;
res = Model.Exp(-Var.Calc.Var4 * Var.Calc.Den * Solver_M25);
return res;
}
the result is:
{Exp(Times(-4176002161226263/70368744177664, M25))}
Why "Times"?
All formulas with multiplication contain Times.
For divisions there is 'Quotient', for additions 'Plus', but for multiplications 'Times'!
Question 5)
Am I doing the multiplications wrong in a 'Term'?
Do you have any idea?
[EDIT 3]
I just realised that "Times" is not a strange term; it was another misunderstanding of English on my part, sorry.
So that doesn't solve my problem.
Can you help me, please?
List<int> NPower = new List<int>();
List<double> list = new List<double>();
try
{
for (int i = 1; i < dataGridView1.Rows.Count; i++)
{
for (int n = 0; n < i + 30; n++)
{
NPower.Add(Convert.ToInt32(dataGridView1.Rows[i + n].Cells[6].Value));
}
}
average = NPower.Average();
total = Math.Pow(average, 4);
NPower.Clear();
}
catch (Exception)
{
average = NPower.Average();
NP = Convert.ToInt32(Math.Pow(average, (1.0 / 3.0)));
label19.Text = "Normalised Power: " + NP.ToString();
NPower.Clear();
}
Hi, I'm trying to calculate the normalized power for a cycling polar cycle. I know that for the normalized power you need to:
1) starting at the 30 s mark, calculate a rolling 30 s average (of the preceding time points, obviously).
2) raise all the values obtained in step #1 to the 4th power.
3) take the average of all of the values obtained in step #2.
4) take the 4th root of the value obtained in step #3.
I think I have done that, but the normalized power comes out as 16, which isn't correct. Could anyone look at my code and see if they can figure out a solution? Thank you. Sorry about my code; I'm still quite new to this, so it might be formatted incorrectly.
I'm not sure that I understand your requirements or code completely, but a few things I noticed:
Since you're supposed to start taking the rolling average after 30 seconds, shouldn't i be initialized to 30 instead of 1?
Since it's a rolling average, shouldn't n be initialized to the value of i instead of 0?
Why is the final result calculated inside a catch block?
Shouldn't it be Math.Pow(average, (1.0 / 4.0)) since you want the fourth root, not the third?
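A minimal sketch of the four steps with the points above applied, assuming one power sample per second held in a list rather than read from the DataGridView. The names `NormalisedPowerSketch`, `Compute` and the `window` parameter are hypothetical and not from the question's code.

```csharp
using System;
using System.Collections.Generic;

public static class NormalisedPowerSketch
{
    // watts: power readings, assumed to be one sample per second.
    // window: length of the rolling average in samples (30 for a 30 s window).
    public static double Compute(IReadOnlyList<double> watts, int window = 30)
    {
        if (window < 1 || watts.Count < window)
            throw new ArgumentException("Need at least one full window of samples.");

        double windowSum = 0.0;          // sum of the samples currently in the window
        double sumOfFourthPowers = 0.0;  // running total for step 3
        int averagesTaken = 0;

        for (int i = 0; i < watts.Count; i++)
        {
            windowSum += watts[i];
            if (i >= window) windowSum -= watts[i - window]; // drop the sample that left the window

            if (i >= window - 1)
            {
                double rollingAverage = windowSum / window;       // step 1
                sumOfFourthPowers += Math.Pow(rollingAverage, 4); // step 2
                averagesTaken++;
            }
        }

        double meanOfFourthPowers = sumOfFourthPowers / averagesTaken; // step 3
        return Math.Pow(meanOfFourthPowers, 1.0 / 4.0);                // step 4: fourth root
    }
}
```

For a perfectly steady ride the result equals the steady power, which makes a quick sanity check before feeding in real data.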
I'm having a problem generating the Terras number sequence.
Here is my unsuccessful attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Terras
{
class Program
{
public static int Terras(int n)
{
if (n <= 1)
{
int return_value = 1;
Console.WriteLine("Terras generated : " + return_value);
return return_value;
}
else
{
if ((n % 2) == 0)
{
// Even number
int return_value = 1 / 2 * Terras(n - 1);
Console.WriteLine("Terras generated : " + return_value);
return return_value;
}
else
{
// Odd number
int return_value = 1 / 2 * (3 * Terras(n - 1) + 1);
Console.WriteLine("Terras generated : " + return_value);
return return_value;
}
}
}
static void Main(string[] args)
{
Console.WriteLine("TERRAS1");
Terras(1); // should generate 1
Console.WriteLine("TERRAS2");
Terras(2); // should generate 2 1 ... instead of 1 and 0
Console.WriteLine("TERRAS5");
Terras(5); // should generate 5,8,4,2,1 not 1 0 0 0 0
Console.Read();
}
}
}
What am I doing wrong?
I know the basics of recursion, but I don’t understand why this doesn’t work.
I observe that the first number of the sequence is actually the number that you pass in, and subsequent numbers are zero.
Change 1 / 2 * Terros(n - 1); to Terros(n - 1)/2;
Also 1 / 2 * (3 * Terros(n - 1) + 1); to (3 * Terros(n - 1) + 1)/2;
1/2 * ... is simply 0 * ... with int math.
[Edit]
The recursion is wrong and the formula is misguided. Simply iterate:
public static void Terros(int n) {
    Console.Write("Terros generated :");
    int t = n;
    Console.Write(" " + t);
    while (t > 1) {
        int t_previous = t;
        if (t_previous%2 == 0) {
            t = t_previous/2;
        }
        else {
            t = (3*t_previous+1)/2;
        }
        Console.Write(", " + t);
    }
    Console.WriteLine("");
}
The "n is even" should be "t(subscript n-1) is even" - same for "n is odd".
int return_value = 1 / 2 * Terros(n - 1);
int return_value = 1 / 2 * (3 * Terros(n - 1) + 1);
Unfortunately you've hit a common mistake people make with ints.
(int)1 / (int)2 will always be 0.
Since 1/2 is an integer division, it is always 0. To correct the math, just swap the terms: not 1/2*n but n/2; instead of 1/2* (3 * n + 1) put (3 * n + 1) / 2.
Another issue: do not put computation (Terros) and output (Console.WriteLine) in the same function.
public static String TerrosSequence(int n) {
    StringBuilder Sb = new StringBuilder();
    // Again: dynamic programming is far better here than recursion
    while (n > 1) {
        if (Sb.Length > 0)
            Sb.Append(",");
        Sb.Append(n);
        n = (n % 2 == 0) ? n / 2 : (3 * n + 1) / 2;
    }
    if (Sb.Length > 0)
        Sb.Append(",");
    Sb.Append(n);
    return Sb.ToString();
}
// Output: "Terros generated : 5,8,4,2,1"
Console.WriteLine("Terros generated : " + TerrosSequence(5));
The existing answers point you in the right direction, but none of them is complete. I thought that summing up and adding detail would help you and future visitors.
The problem name
The original name of this question was “Conjuncture of Terros”. First, the word is conjecture; second, the modification of the original Collatz sequence that you used comes from Riho Terras* (not Terros!), who proved the Terras Theorem, which states that for almost all t₀ there exists n ∈ ℕ such that tₙ < t₀. You can read more about it on MathWorld and in chux’s question on Math.SE.
* While searching for who the R. Terras mentioned on MathWorld is, I found not only the record on Geni.com, but also the probable author of that record, his niece Astrid Terras, and her family’s genealogy. Just for the really curious ones. ☺
The formula
You got the formula wrong in your question. As the table of sequences for different t₀ shows, you should be testing for parity of tₙ₋₁ instead of n.
tₙ = tₙ₋₁ / 2 if tₙ₋₁ is even, and tₙ = (3 · tₙ₋₁ + 1) / 2 if tₙ₋₁ is odd. (Formula taken from MathWorld.)
Also the second table column heading is wrong, it should read t₀, t₁, t₂, … as t₀ is listed too.
You repeat the mistake with testing n instead of tₙ₋₁ in your code, too. If output of your program is precisely specified (e.g. when checked by an automatic judge), think once more whether you should output t₀ or not.
Integer vs float arithmetic
When making an operation with two integers, you get an integer. If a float is involved, the result is float. In both branches of your condition, you compute an expression of this form:
1 / 2 * …
1 and 2 are integers, therefore the division is integer division. Integer division in C# truncates toward zero, so the expression is in fact
0 * …
which is (almost*) always zero. Mystery solved. But how to fix it?
Instead of multiplying by one half, you can divide by two. In the even branch, division by 2 gives no remainder. In the odd branch, tₙ₋₁ is odd, so 3 · tₙ₋₁ is odd too. Odd plus 1 is even, so division by two leaves a remainder of zero in both branches. Integer division is enough; the result is exact.
Alternatively, you could use float division by replacing 1 with 1.0. But this will probably not give correct results. You see, all members of the sequence are integers and you’re getting float results! So rounding with Math.Round() and casting to integer? Nah… If you can, always avoid floats. There are very few use cases for them, I think, most having something to do with graphics or numerical algorithms. Most of the time you don’t really need them and they just introduce round-off errors.
* Zero times whatever could produce NaN too, but let’s ignore the possibility of “whatever” being from special float values. I’m just pedantic.
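To make the difference concrete, here is a small sketch of both forms. The class and method names are hypothetical, chosen only for this illustration.

```csharp
using System;

public static class IntegerDivisionSketch
{
    // 1 / 2 is evaluated first as integer division and gives 0, so the result is always 0.
    public static int HalfWrong(int x) => 1 / 2 * x;

    // Dividing the value itself keeps the magnitude; exact whenever x is even.
    public static int HalfRight(int x) => x / 2;

    // One step of the sequence, testing the parity of the previous term.
    public static int TerrasStep(int previous) =>
        (previous % 2 == 0) ? previous / 2 : (3 * previous + 1) / 2;
}
```

Calling HalfWrong(8) gives 0 while HalfRight(8) gives 4, and TerrasStep(5) gives 8, matching the 5, 8, 4, 2, 1 sequence expected in the question.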
Recursive solution
Apart from the problems mentioned above, your whole recursive approach is flawed. Obviously you intended Terras(n) to be tₙ. That’s not utterly bad. But then you forgot that you supply t₀ and search for n instead of the other way round.
To fix your approach, you would need to set up a “global” variable int t0 that would be set to given t₀ and returned from Terras(0). Then Terras(n) would really return tₙ. But you wouldn’t still know the value of n when the sequence stops. You could only repeat for bigger and bigger n, ruining time complexity.
Wait. What about caching the results of intermediate Terras() calls in a List<int> t? t[i] will contain the result for Terras(i), or zero if not initialized. At the top of Terras() you would add if (n < t.Count() && t[n] != 0) return t[n]; to return the value immediately if it is cached, without repeating the computation. Otherwise the computation is really performed and, just before returning, the result is cached:
if (n < t.Count()) {
    t[n] = return_value;
} else {
    for (int i = t.Count(); i < n; i++) {
        t.Add(0);
    }
    t.Add(return_value);
}
Still not good enough. Time complexity is saved, but having the List increases space complexity. Try tracing (preferably manually, pencil & paper) the computation for t0 = 3; t.Add(t0);. You don’t know the final n beforehand, so you must go from 1 up, till Terras(n) returns 1.
Noticed anything? First, each time you increment n and make a new Terras() call, you add the computed value at the end of the cache (t). Second, you’re always looking just one item back. You’re computing the whole sequence from the bottom up and you don’t need that big stupid List, just its last item!
Iterative solution
OK, let’s forget that complicated recursive solution trying to follow the top-down definition and move to the bottom-up approach that popped up from gradual improvement of the original solution. Recursion is not needed anymore, it just clutters the whole thing and slows it down.
End of sequence is still found by incrementing n and computing tₙ, halting when tₙ = 1. Variable t stores tₙ, t_previous stores previous tₙ (now tₙ₋₁). The rest should be obvious.
public static void Terras(int t) {
    Console.Write("Terras generated:");
    Console.Write(" " + t);
    while (t > 1) {
        int t_previous = t;
        if (t_previous % 2 == 0) {
            t = t_previous / 2;
        } else {
            t = (3 * t_previous + 1) / 2;
        }
        Console.Write(", " + t);
    }
    Console.WriteLine("");
}
Variable names taken from chux’s answer, just for the sake of comparability.
This can be deemed a primitive instance of dynamic-programming technique. The evolution of this solution is common to the whole class of such problems. Slow recursion, call result caching, dynamic “bottom-up” approach. When you are more experienced with dynamic programming, you’ll start seeing it directly even in more complicated problems, not even thinking about recursion.
Algorithm to be coded in C#:
fn = f(xn)
f′n = df(xn)/dx
∆xn = -fn / f′n
Update: xn+1 = xn + ∆xn
Repeat the process until ∆xn ≤ e
I must use the Newton-Raphson method to solve this, but I do not know how to write a loop that feeds the previous answer back in each time. How do I compute this?
This is my broken code:
double a = 1, Lspan = 30, Lcable = 33, fn, fdn, dfn, j;
fn = (2 * a * (Math.Sinh(Lspan / 2 * a))) - Lcable;
fdn = (2 * (Math.Sinh(Lspan / 2 * a)) - ((Lspan / 2 * a) * Math.Cosh(Lspan / 2 * a)));
dfn = -fn / fdn;
do
j = a + dfn;
while (dfn > 0.00000000001);
Console.WriteLine( " {0} ",j) ;
Console.ReadKey();
Your loop performs the same calculation each time, because neither a nor dfn changes between iterations. I'm sure I've actually implemented a Newton-Raphson method myself years ago, but I don't remember enough about it to check that your arithmetic is correct without looking it up.
I expect that you intended fdn and dfn to be updated on each iteration - although your pseudocode statement of the method is ambiguous since it implies that only the whole solution is updated on each iteration, whereas actually each term needs to be updated or you'll just keep adding the starting value of ∆xn forever. I think the solution is to move the second, third and fourth lines inside the loop.
Does this make sense?
(It looks as though you were expecting C# to work with symbolic mathematics, which isn't the case. C# is basically procedural within the body of a method, so making an assignment statement fn = some terms; happens once, when the program hits that line. There is no knowledge built into that variable of how it was calculated, it's just a box with a number in it.)
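As a rough sketch of what moving the calculations inside the loop could look like, the generic routine below re-evaluates f and its derivative on every pass and stops when the absolute step is small. The names `NewtonRaphsonSketch`, `Solve` and `CatenaryParameter` are hypothetical. The catenary form f(a) = 2a·sinh(Lspan / (2a)) − Lcable and its derivative are an assumption about what the question intended; note that the question's own code writes Lspan / 2 * a, which C# evaluates as (Lspan / 2) * a.

```csharp
using System;

public static class NewtonRaphsonSketch
{
    public static double Solve(Func<double, double> f, Func<double, double> df,
                               double x0, double tolerance = 1e-11, int maxIterations = 100)
    {
        double x = x0;
        for (int i = 0; i < maxIterations; i++)
        {
            double fn = f(x);       // fn  = f(xn), recomputed every iteration
            double dfn = df(x);     // f'n = df(xn)/dx, recomputed every iteration
            double dx = -fn / dfn;  // step: -fn / f'n
            x += dx;                // xn+1 = xn + step
            if (Math.Abs(dx) <= tolerance) return x; // compare the magnitude of the step
        }
        throw new InvalidOperationException("Newton-Raphson did not converge.");
    }

    // Assumed equation: f(a) = 2a*sinh(span / (2a)) - cableLength.
    public static double CatenaryParameter(double span, double cableLength, double start)
    {
        Func<double, double> f = a => 2 * a * Math.Sinh(span / (2 * a)) - cableLength;
        // Derivative of the assumed f with respect to a.
        Func<double, double> df = a =>
            2 * Math.Sinh(span / (2 * a)) - (span / a) * Math.Cosh(span / (2 * a));
        return Solve(f, df, start);
    }
}
```

The iteration cap is there so that a poor starting value or a zero derivative ends in an exception rather than an endless loop.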
I need to compare each element of a 1-dimensional array with every other element. The array contains strings sorted from longest to shortest. No 2 items in the array are equal; however, there will be items with the same length. Currently I am making N*(N+1)/2 comparisons (127.8 billion) and I'm trying to reduce the overall number of comparisons.
I have implemented a feature that basically says: if the strings differ in length by more than x percent, then don't bother, they're not equal, AND the strings below it aren't equal either, so just break out of the loop and move on to the next element.
I am currently trying to further reduce this by saying that: if element A matches elements C and D, then it stands to reason that elements C and D would also match, so don't bother checking them (i.e. skip that operation). This is as far as I've got, since I don't currently know of a data structure that will allow me to do that.
The question here is: does anyone know of such a data structure? Or does anyone know how I can further reduce my comparisons?
My current implementation is estimated to take 3.5 days to complete, against a time window of 10 hours (i.e. it's too long), and my only options left are either to reduce the execution time, which may or may not be possible, or to distribute the workload across dozens of systems, which may not be practical.
Update: My bad. Replace the word "equal" with "closely matches". I'm calculating the Levenshtein distance.
The idea is to find out whether there are other strings in the array that closely match each element in the array. The output is a database mapping of the strings that were closely related.
Here is the partial code from the method. Prior to executing this code block, there is code that loads items into the database.
public static void RelatedAddressCompute() {
TableWipe("RelatedAddress");
decimal _requiredDistance = Properties.Settings.Default.LevenshteinDistance;
SqlConnection _connection = new SqlConnection(Properties.Settings.Default.AML_STORE);
_connection.Open();
string _cacheFilter = "LevenshteinCache NOT IN ('','SAMEASABOVE','SAME')";
SqlCommand _dataCommand = new SqlCommand(@"
SELECT
COUNT(DISTINCT LevenshteinCache)
FROM
Address
WHERE
" + _cacheFilter + @"
AND
LEN(LevenshteinCache) > 12", _connection);
_dataCommand.CommandTimeout = 0;
int _addressCount = (int)_dataCommand.ExecuteScalar();
_dataCommand = new SqlCommand(@"
SELECT
Data.LevenshteinCache,
Data.CacheCount
FROM
(SELECT
DISTINCT LevenshteinCache,
COUNT(LevenshteinCache) AS CacheCount
FROM
Address
WHERE
" + _cacheFilter + @"
GROUP BY
LevenshteinCache) Data
WHERE
LEN(LevenshteinCache) > 12
ORDER BY
LEN(LevenshteinCache) DESC", _connection);
_dataCommand.CommandTimeout = 0;
SqlDataReader _addressReader = _dataCommand.ExecuteReader();
string[] _addresses = new string[_addressCount + 1];
int[] _addressInstance = new int[_addressCount + 1];
int _itemIndex = 1;
while (_addressReader.Read()) {
string _address = (string)_addressReader[0];
int _count = (int)_addressReader[1];
_addresses[_itemIndex] = _address;
_addressInstance[_itemIndex] = _count;
_itemIndex++;
}
_addressReader.Close();
decimal _comparasionsMade = 0;
decimal _comparisionsAttempted = 0;
decimal _comparisionsExpected = (decimal)_addressCount * ((decimal)_addressCount + 1) / 2;
decimal _percentCompleted = 0;
DateTime _startTime = DateTime.Now;
Parallel.For(1, _addressCount, delegate(int i) {
for (int _index = i + 1; _index <= _addressCount; _index++) {
_comparisionsAttempted++;
decimal _percent = _addresses[i].Length < _addresses[_index].Length ? (decimal)_addresses[i].Length / (decimal)_addresses[_index].Length : (decimal)_addresses[_index].Length / (decimal)_addresses[i].Length;
if (_percent < _requiredDistance) {
decimal _difference = new Levenshtein().threasholdiLD(_addresses[i], _addresses[_index], 50);
_comparasionsMade++;
if (_difference <= _requiredDistance) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[_index], _difference);
}
}
else {
_comparisionsAttempted += _addressCount - _index;
break;
}
}
if (_addressInstance[i] > 1 && _addressInstance[i] < 31) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[i], 0);
}
_percentCompleted = (_comparisionsAttempted / _comparisionsExpected) * 100M;
TimeSpan _estimatedDuration = new TimeSpan((long)((((decimal)(DateTime.Now - _startTime).Ticks) / _percentCompleted) * 100));
TimeSpan _timeRemaining = _estimatedDuration - (DateTime.Now - _startTime);
string _timeRemains = _timeRemaining.ToString();
});
}
InsertRelatedAddress is a function that updates the database, and there are 500,000 items in the array.
OK. With the updated question, I think it makes more sense. You want to find pairs of strings with a Levenshtein distance less than a preset distance. I think the key is that you don't compare every pair of strings, and instead rely on the properties of Levenshtein distance to search for strings within your preset limit. The answer involves computing the tree of possible changes. That is, compute the possible changes to a given string with distance < n and see if any of those strings are in your set. I suppose this is only faster if n is small.
It looks like the question posted here: Finding closest neighbour using optimized Levenshtein Algorithm.
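A minimal sketch of that idea for a distance of 1 only: generate every string one edit away from the input and look each candidate up in a hash set, instead of comparing against every other string. The names `EditNeighbourSketch`, `EditsWithinOne` and `NeighboursInSet`, and the fixed `alphabet` parameter, are assumptions made for illustration. Larger distances would need the generation step applied repeatedly, which grows quickly.

```csharp
using System;
using System.Collections.Generic;

public static class EditNeighbourSketch
{
    // Yields every string exactly one insertion, deletion or substitution away from s,
    // using only characters from the supplied alphabet. Duplicates are possible.
    public static IEnumerable<string> EditsWithinOne(string s, string alphabet)
    {
        for (int i = 0; i <= s.Length; i++)
        {
            string head = s.Substring(0, i);
            string tail = s.Substring(i);

            foreach (char c in alphabet)
                yield return head + c + tail;                 // insertion before position i

            if (i < s.Length)
            {
                yield return head + tail.Substring(1);        // deletion of s[i]
                foreach (char c in alphabet)
                    if (c != s[i])
                        yield return head + c + tail.Substring(1); // substitution of s[i]
            }
        }
    }

    // Returns the members of 'known' that are one edit away from s.
    public static HashSet<string> NeighboursInSet(string s, HashSet<string> known, string alphabet)
    {
        var hits = new HashSet<string>();
        foreach (string candidate in EditsWithinOne(s, alphabet))
            if (known.Contains(candidate))
                hits.Add(candidate);
        return hits;
    }
}
```

Each lookup costs roughly (alphabet size × string length) hash probes, independent of how many strings are in the set, which is why the approach only pays off for small distances.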
More info required. What is your desired outcome? Are you trying to get a count of all unique strings? You state that you want to see if 2 strings are equal and that if 'they are different in length by x percent then don't bother they not equal'. Why are you checking with a constraint on length by x percent? If you're checking whether they are equal, they must be the same length.
I suspect you are trying to do something slightly different from determining an exact match, in which case I need more info.
Thanks
Neil