This excellent article on implementing a Hidden Markov Model in C# does a fair job of classifying a single bit sequence based on training data.
How can the algorithm be modified, or built out (multiple HMMs?), to support classifying multiple simultaneous bit sequences?
Example
Instead of classifying just one stream:
double t1 = hmm.Evaluate(new int[] { 0,1 }); // 0.49999423004045024
double t2 = hmm.Evaluate(new int[] { 0,1,1,1 }); // 0.11458685045803882
Rather, classify a dual bit stream:
double t1 = hmm.Evaluate(new[] { new[] { 0, 0 }, new[] { 0, 1 } });
double t2 = hmm.Evaluate(new[] { new[] { 0, 0 }, new[] { 1, 1 }, new[] { 0, 1 }, new[] { 1, 1 } });
Or even better, three streams:
double t1 = hmm.Evaluate(new[] { new[] { 0, 0, 1 }, new[] { 0, 0, 1 } });
double t2 = hmm.Evaluate(new[] { new[] { 0, 0, 1 }, new[] { 1, 1, 0 }, new[] { 0, 1, 1 }, new[] { 1, 1, 1 } });
Obviously the training data would also be expanded.
The trick is to model the set of observations as the n-ary Cartesian product of all possible values of each sequence; in your case the HMM will have 2^n output symbols, where n is the number of bit sequences.
Example: for three bit sequences, the 8 symbols are 000, 001, 010, 011, 100, 101, 110, 111. It is as if we created a mega-variable whose values are all the possible tuples of values of the individual observation sequences (the 0/1 of each bit sequence).
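For instance, a helper along these lines (a hypothetical sketch in plain C#, independent of any HMM library) packs each time step's bits into one symbol:
// Merge n parallel bit sequences into one sequence of symbols in [0, 2^n).
// Hypothetical helper: streams[k][t] is the bit of stream k at time step t.
static int[] Combine(int[][] streams)
{
    int n = streams.Length;          // number of parallel bit sequences
    int length = streams[0].Length;  // all streams assumed equally long
    var symbols = new int[length];
    for (int t = 0; t < length; t++)
    {
        int symbol = 0;
        for (int k = 0; k < n; k++)
            symbol = (symbol << 1) | streams[k][t]; // pack the t-th bit of each stream
        symbols[t] = symbol;
    }
    return symbols;
}
The combined stream can then be fed to an ordinary single-stream HMM, e.g. hmm.Evaluate(Combine(streams)).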
The article mentioned deals with the hidden Markov model implementation in the Accord.NET Framework. When using the complete version of the framework, and not just the subproject available in that article, one can use the generic HiddenMarkovModel model with any suitable emission symbol distribution. If you would like to express the joint probability between two or three discrete variables, it is worth using the JointDistribution class.
If, however, there are many symbol variables, such that expressing all possible variable combinations is not practical, it may be better to use a continuous representation for the features and a multivariate normal distribution instead.
An example would be:
// Specify an initial normal distribution for the samples.
var initialDensity = new MultivariateNormalDistribution(3); // 3 dimensions

// Create a continuous hidden Markov model with two states organized in an
// ergodic topology and an underlying multivariate normal distribution
// as the probability density.
var model = new HiddenMarkovModel<MultivariateNormalDistribution>(new Ergodic(2), initialDensity);
// Configure the learning algorithm to train the model until the
// difference in the average log-likelihood changes only by as little as 0.0001.
var teacher = new BaumWelchLearning<MultivariateNormalDistribution>(model)
{
    Tolerance = 0.0001,
    Iterations = 0,
};
// Fit the model
double likelihood = teacher.Run(sequences);
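For completeness, a sketch of how the training sequences might be shaped for this model, assuming the Accord.NET convention of one double[] observation vector per time step (the literal values here are made up):
// Each sequence is an array of observation vectors, one vector per time step.
double[][][] sequences =
{
    // First training sequence: two time steps of 3-dimensional observations.
    new[] { new double[] { 0, 0, 1 }, new double[] { 0, 1, 1 } },
    // Second training sequence: three time steps.
    new[] { new double[] { 0, 0, 1 }, new double[] { 1, 1, 0 }, new double[] { 0, 1, 1 } },
};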
Related
I have a large binary file that is around 70MB in size. In my program, I have a method that looks up byte[] array patterns against the file, to see whether they exist within it or not. I have around 1-10 million patterns to run against the file. The options I see are the following:
Read the file into memory by doing byte[] file = File.ReadAllBytes(path) then perform byte[] lookup of byte[] pattern(s) against the file bytes. I have used multiple methods for doing that from different topics on SO such as:
byte[] array pattern search
Find an array (byte[]) inside another array?
Best way to check if a byte[] is contained in another byte[]
Though, byte[] versus byte[] lookups are extremely slow when the source is large. It would take weeks to run 1 million patterns on normal computers.
Convert both the file and the patterns into hex strings, then do the comparisons using the contains() method to perform the lookup. This is faster than byte[] lookups, but converting bytes to hex makes the file larger in memory, which results in more processing time.
Convert both the file and the patterns into strings using Encoding.GetEncoding(1252).GetBytes() and perform the lookups. Then compensate for the limitation of binary-to-string conversion (I know they're incompatible) by running the matches of contains() through another method which performs byte[] lookups (the first option). This is the fastest option for me.
Using the third approach, which is the fastest, 1 million patterns would take 2/3 of a day to a day depending on CPU. I need information on how to speed up the lookups.
Thank you.
Edit: Thanks to @MySkullCaveIsADarkPlace I now have a fourth approach. I was using slower byte[] lookup algorithms; I am now using the MemoryExtensions.IndexOf() byte[] lookup method, which is slightly faster than the three approaches above. But even though this method is faster, the lookups are still slow: it takes 1 minute for 1000 pattern lookups.
The patterns are 12-20 bytes each.
I assume that you are looking up one pattern after the other. I.e., you are doing 1 to 10 million pattern searches at every position in the file!
Consider doing it the other way round. Loop once through your file bytes and determine if the current position is the start of a pattern.
To do this efficiently, I suggest organizing the patterns in an array of lists of patterns. Each pattern is stored in a list at array index 256 * byte[0] + byte[1].
With 10 million patterns you will have an average of 152 patterns in the lists at each array position. This allows a fast lookup.
You could also use the first 3 bytes (256 * (256 * byte[0] + byte[1]) + byte[2]), resulting in an array of length 256^3 ~ 16 million (I worked with longer arrays; no problem for C#). Then you would have less than one pattern per array position on average. This results in a nearly linear search time, O(n) with respect to the file length: a huge improvement compared to the quadratic O(num_of_patterns * file_length) of a straightforward algorithm.
We can use a simple byte by byte comparison to compare the patterns, since we can compare starting at a known position. (Boyer Moore is of no use here.)
2 bytes index (patterns must be at least 2 bytes long)
byte[] file = { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
byte[][] patterns = {
    new byte[] { 12, 3, 5, 76, 8, 0, 6, 125, 11 },
    new byte[] { 211, 122, 22, 4 },
    new byte[] { 17, 211, 5, 8 },
    new byte[] { 22, 4, 7, 89, 76, 64 },
};
var patternMatrix = new List<byte[]>[256 * 256];
// Add patterns to matrix.
// We assume pattern.Length >= 2.
foreach (byte[] pattern in patterns) {
    int index = 256 * pattern[0] + pattern[1];
    patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
    patternMatrix[index].Add(pattern);
}
// The search: loop through the file.
for (int fileIndex = 0; fileIndex < file.Length - 1; fileIndex++) { // Length - 1 because we need 2 bytes.
    int patternIndex = 256 * file[fileIndex] + file[fileIndex + 1];
    List<byte[]> candidatePatterns = patternMatrix[patternIndex];
    if (candidatePatterns != null) {
        foreach (byte[] candidate in candidatePatterns) {
            if (fileIndex + candidate.Length <= file.Length) {
                bool found = true;
                // We know that the first 2 bytes match, so start comparing at the 3rd.
                for (int i = 2; i < candidate.Length; i++) {
                    if (candidate[i] != file[fileIndex + i]) {
                        found = false;
                        break;
                    }
                }
                if (found) {
                    Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]}, ..}} found at file index {fileIndex}");
                }
            }
        }
    }
}
Same algorithm with 3 bytes (even faster!)
3 bytes index (patterns must be at least 3 bytes long)
var patternMatrix = new List<byte[]>[256 * 256 * 256];
// Add patterns to matrix.
// We assume pattern.Length >= 3.
foreach (byte[] pattern in patterns) {
    int index = 256 * 256 * pattern[0] + 256 * pattern[1] + pattern[2];
    patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
    patternMatrix[index].Add(pattern);
}
// The search: loop through the file.
for (int fileIndex = 0; fileIndex < file.Length - 2; fileIndex++) { // Length - 2 because we need 3 bytes.
    int patternIndex = 256 * 256 * file[fileIndex] + 256 * file[fileIndex + 1] + file[fileIndex + 2];
    List<byte[]> candidatePatterns = patternMatrix[patternIndex];
    if (candidatePatterns != null) {
        foreach (byte[] candidate in candidatePatterns) {
            if (fileIndex + candidate.Length <= file.Length) {
                bool found = true;
                // We know that the first 3 bytes match, so start comparing at the 4th.
                for (int i = 3; i < candidate.Length; i++) {
                    if (candidate[i] != file[fileIndex + i]) {
                        found = false;
                        break;
                    }
                }
                if (found) {
                    Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]}, ..}} found at file index {fileIndex}");
                }
            }
        }
    }
}
Why is it faster?
A simple nested-loops algorithm compares up to ~ 70 * 10^6 * 10^7 = 7 * 10^14 (700 trillion) patterns! 70 * 10^6 is the length of the file; 10^7 is the number of patterns.
My algorithm with a 2-byte index makes ~ 70 * 10^6 * 152 ~ 10^10 pattern comparisons. The number 152 comes from the fact that there are on average 152 patterns for a given 2-byte index, ~ 10^7 / (256 * 256). This is 65,536 times faster.
With 3 bytes you get less than about 70 * 10^6 pattern comparisons. This is more than 10 million times faster. This is the case because we store all the patterns in an array whose length (16 million) is greater than the number of patterns (10 million or less). Therefore, at any byte position plus the 2 following positions within the file, we can pick up only the patterns starting with the same 3 bytes, and that is on average less than one pattern. Sometimes there may be 0 or 1, sometimes 2 or 3, but rarely more patterns at any array position.
Try it. The shift is from O(n^2) to nearly O(n). The initialization time is O(n). The assumption is that the first 2 or 3 bytes of the patterns are more or less randomly distributed; if this was not the case, my algorithm would degrade to O(n^2) in the worst case.
Okay, that's the theory. Since the 3 bytes index version is slower at initialization it may have only an advantage with huge data sets. Other improvements could be made by using Span<byte>.
See: Big O notation - Wikipedia.
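For instance, the inner byte-by-byte comparison loop could be replaced by a span comparison (a sketch only; not benchmarked here):
// Inside the candidate loop of the 2-byte version: we already know the first
// 2 bytes match, so compare only the tails using MemoryExtensions.SequenceEqual.
if (fileIndex + candidate.Length <= file.Length &&
    file.AsSpan(fileIndex + 2, candidate.Length - 2)
        .SequenceEqual(candidate.AsSpan(2)))
{
    Console.WriteLine($"pattern found at file index {fileIndex}");
}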
One idea is to group the patterns by their length, put each group in a HashSet<byte[]> for searching with O(1) complexity, and then scan the source byte[] index by index for all groups. Since the number of groups in your case is small (only 9 groups), this optimization should yield significant performance improvements. Here is an implementation:
IEnumerable<byte[]> FindMatches(byte[] source, byte[][] patterns)
{
    Dictionary<int, HashSet<ArraySegment<byte>>> buckets = new();
    ArraySegmentComparer comparer = new();
    foreach (byte[] pattern in patterns)
    {
        HashSet<ArraySegment<byte>> bucket;
        if (!buckets.TryGetValue(pattern.Length, out bucket))
        {
            bucket = new(comparer);
            buckets.Add(pattern.Length, bucket);
        }
        bucket.Add(pattern); // Implicit cast byte[] => ArraySegment<byte>
    }
    for (int i = 0; i < source.Length; i++)
    {
        foreach (var (length, bucket) in buckets)
        {
            if (i + length > source.Length) continue;
            ArraySegment<byte> slice = new(source, i, length);
            if (bucket.TryGetValue(slice, out var pattern))
            {
                yield return pattern.Array;
                bucket.Remove(slice);
            }
        }
    }
}
Currently (.NET 6) there is no equality comparer for sequences available in the standard libraries, so you'll have to provide a custom one:
class ArraySegmentComparer : IEqualityComparer<ArraySegment<byte>>
{
    public bool Equals(ArraySegment<byte> x, ArraySegment<byte> y)
    {
        return x.AsSpan().SequenceEqual(y);
    }

    public int GetHashCode(ArraySegment<byte> obj)
    {
        HashCode hashcode = new();
        hashcode.AddBytes(obj);
        return hashcode.ToHashCode();
    }
}
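Usage might look like this, with file and patterns as in the earlier snippets:
// Enumerate every pattern that occurs somewhere in the file.
foreach (byte[] match in FindMatches(file, patterns))
{
    Console.WriteLine($"pattern of length {match.Length} found");
}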
This algorithm assumes that there are no duplicates in the patterns. If there are, only one of each set of duplicates will be emitted.
On my (not very speedy) PC this algorithm takes around 10 seconds to create the buckets dictionary (for 10,000,000 patterns of size 12-20), and then an additional 5-6 minutes to scan a source byte[] of size 70,000,000 (around 200,000 bytes per second). The number of patterns does not affect the scanning phase (as long as the number of groups does not increase).
Parallelizing this algorithm is not trivial, because the buckets are mutated during the scan.
I am trying to solve a matrix equation in Math.NET where one of the actual solutions is 0, but I am getting NaN as results.
Here is an example matrix which has already been reduced for simplicity.
1 0 1 | 10000
0 1 -1 | 1000
0 0 0 | 0
Code example:
public void DoExample()
{
    Matrix<double> A = Matrix<double>.Build.DenseOfArray(new double[,] {
        { 1, 0, 1 },
        { 0, 1, -1 },
        { 0, 0, 0 },
    });
    Vector<double> B = Vector<double>.Build.Dense(new double[] { 10000, 1000, 0 });
    var result = A.Solve(B);
}
The solution I am hoping to get to is [ 10000, 1000, 0 ].
As you can see, the result I want is already the augmented vector. This is because I simplified the matrix to reduced row echelon form (RREF) by hand using Gauss-Jordan for this example. If I could somehow use Gauss-Jordan operations within Math.NET to do this, I could check for the scenario where an all-0 row exists in the RREF matrix. Can this be done?
Otherwise, is there any way I can recognize when 0 is the only possible solution for one of the variables using the existing Math.Net linear algebra solver operations?
Thanks!
This is a degenerate matrix with rank 2; you cannot expect to get a unique solution (there are infinitely many solutions).
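If you want to recognize this case programmatically, Math.NET exposes a Rank() method on matrices which you could check before solving; a sketch:
// Rank below the column count means the system has no unique solution
// (either none, or infinitely many).
if (A.Rank() < A.ColumnCount)
{
    Console.WriteLine("A is rank-deficient; Solve() may produce NaNs.");
}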
The iterative solver can actually handle this, for example
using MathNet.Numerics.LinearAlgebra.Double.Solvers;
A.SolveIterative(B, new MlkBiCgStab());
returns
[10000, 1000, 0]
Interestingly, with the MKL Native Provider this also works with the normal Solve routine, but not with the managed provider (as you have found out) nor with e.g. the OpenBLAS native provider.
I have a maths issue within my program. I think the problem is simple but I'm not sure what terms to use, hence my own searches returned nothing useful.
I receive some values in a method; the only thing I know (in terms of logic) is that the numbers will be something that can be reproduced.
In other words, the numbers I could receive are predictable and would be one of the following
1
2
4
16
256
65536
etc
I need to know at what index they appear. In other words, 1 is always at index 0, 2 at index 1, 4 at index 2, 16 at index 3, 256 at index 4, etc.
I know I could write a big switch statement, but I was hoping a formula would be tidier. Do you know if one exists, or any clues as to the names of the math formulas involved?
The numbers you listed are powers of two. The inverse function of raising a number to a power is the logarithm, so that's what you use to go backwards from (using your terminology here) a number to an index.
var num = 256;
var ind = Math.Log(num, 2);
Above, ind is the base-2 logarithm of num. This code will work for any base; just substitute that base for 2. If you are only going to be working with powers of 2 then you can use a special-case solution that is faster based on the bitwise representation of your input; see What's the quickest way to compute log2 of an integer in C#?
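For example, on .NET Core 3.0 and later there is a built-in, intrinsic-backed helper (my note, not from the linked question):
using System.Numerics;

int num = 256;
// BitOperations.Log2 returns the integer base-2 logarithm of its argument,
// which is exactly the index when num is a power of two.
int ind = BitOperations.Log2((uint)num); // 8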
Try
Math.Log(num, base)
where base is 2.
MSDN: http://msdn.microsoft.com/en-us/library/hd50b6h5.aspx
The logarithm returns the power to which you must raise the base to get your number.
But that is only the case if your numbers really are powers of 2; otherwise you have to understand exactly what you have and what you need.
It also looks like the numbers were raised to the power of 2 twice, i.e. they follow 2^(2^k), so try this:
private static int getIndexOfSeries(UInt64 aNum)
{
    if (aNum == 1)
        return 0;
    else if (aNum == 2)
        return 1;
    else
    {
        int lNum = (int)Math.Log(aNum, 2);
        return 1 + (int)Math.Log(lNum, 2);
    }
}
Result for UInt64[] Arr = new UInt64[] { 1, 2, 4, 16, 256, 65536, 4294967296 } is:
Num[0] = 1
Num[1] = 2
Num[2] = 4
Num[3] = 16
Num[4] = 256
Num[5] = 65536
Num[6] = 4294967296 //65536*65536
where [i] is the index
You should calculate the base 2 logarithm of the number
Hint: for each index x and its corresponding value:
x   value
0   2
1   4
2   16
3   256
4   65536
5   4294967296
etc.
The formula is, for a given integer x:
Math.Pow(2, Math.Pow(2, x));
that is
2 to the power (2 to the power (x) )
Once the formula is known, one could solve it for x (I won't go through that since you already got an answer).
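Following this answer's indexing (2 at index 0, 4 at index 1, and so on), solving for x gives a nested logarithm; a minimal sketch, with rounding to absorb floating-point error:
// x = log2(log2(n)), the inverse of n = 2^(2^x).
// Note: n = 1 has no index under this formula, since log2(1) = 0.
static int IndexOf(double n)
{
    return (int)Math.Round(Math.Log(Math.Log(n, 2), 2));
}
// IndexOf(2) == 0, IndexOf(4) == 1, IndexOf(16) == 2, IndexOf(256) == 3, ...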
Question
Even with only 52 cards, the permutationIndex that I describe in the Explanations section would be a huge number: it is one of 52! values and needs 29 bytes to store.
Thus I don't know a simple way to calculate the permutationIndex over such a huge range, nor how to store the index at minimal cost (or maybe it can also be computed on the fly). I'm thinking the solution to this question consists of three algorithms:
An algorithm which computes the correct permutationIndex to implement the Dealing method
An algorithm which computes the correct permutationIndex to implement the Collect method
An algorithm which stores (or computes) permutationIndex at minimal cost
Explanations
I originally tried to implement an integer handle generator for the range from int.MinValue to int.MaxValue using permutations.
Because that range is really huge, I started by implementing a Dealer class with 52 cards which doesn't really store a deck of cards as a hashset or array, and doesn't even need randomness (except initially).
For a given range of ordinal numbers, I consider every sequence among the full set of permutations to have an index, which I named permutationIndex. I use the index to remember which permutation it is, without really storing a sequence. The sequence is one possible order of the deck of cards.
And here is an animated example showing what I had in mind.
Every time I deal a card, I change permutationIndex and dealt (the count of dealt cards), so I know which cards are dealt and which are still in hand. When I collect a dealt card back, I know the card number, and I put it on the top; it also becomes the card for the next deal. In the animation, collected is the card number.
Further details follow.
Description of code
A conceptual sample Dealer class for only three cards follows.
The code is written in C#, but I'm also considering language-agnostic solutions.
Here are some notes on the sample code:
With the method Dealing(), we get the number of the card that is treated as dealt. It always returns the right-most number (relative to the array) and then rolls the number left of it (say, the next available) to the right-most position by changing permutationIndex.
The method Collect(int) is for collecting the dealt cards and putting them back into the deck.
It also changes permutationIndex, according to which card number was returned to the dealer.
The integer dealt tells how many cards we've dealt; the positions from the left-most up to the count stored in dealt are dealt cards. With permutationIndex, we know the sequence of cards.
The int[,] array in the sample code is not used; it is just there to help imagine the permutations. The switch statements are meant to be replaced with algorithms which compute permutationIndex.
The permutationIndex is the same thing described in this answer to
Fast permutation -> number -> permutation mapping algorithms
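For reference, the usual way to compute such an index is the factorial number system (Lehmer code); a minimal sketch, separate from the Dealer class below:
using System.Numerics;

// Map a permutation of 0..n-1 to its index in 0..n!-1 (Lehmer code).
// BigInteger is needed because 52! does not fit in 64 bits (29 bytes, as above).
static BigInteger PermutationToIndex(int[] perm)
{
    BigInteger index = 0;
    for (int i = 0; i < perm.Length; i++)
    {
        // Count the later elements that are smaller than perm[i].
        int smaller = 0;
        for (int j = i + 1; j < perm.Length; j++)
            if (perm[j] < perm[i])
                smaller++;
        // Accumulate in the mixed-radix (factorial) number system.
        index = index * (perm.Length - i) + smaller;
    }
    return index;
}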
Sample code
public static class Dealer {
    public static void Collect(int number) {
        if (1 > dealt)
            throw new IndexOutOfRangeException();

        switch (permutationIndex) {
            case 5:
            case 0:
                switch (number) {
                    case 3:
                        break;
                    case 2:
                        permutationIndex = 1;
                        break;
                    case 1:
                        permutationIndex = 4;
                        break;
                }
                break;
            case 4:
            case 3:
                switch (number) {
                    case 3:
                        permutationIndex = 5;
                        break;
                    case 2:
                        permutationIndex = 2;
                        break;
                    case 1:
                        break;
                }
                break;
            case 2:
            case 1:
                switch (number) {
                    case 3:
                        permutationIndex = 0;
                        break;
                    case 2:
                        break;
                    case 1:
                        permutationIndex = 3;
                        break;
                }
                break;
        }

        --dealt;
    }

    public static int Dealing() {
        if (dealt > 2)
            throw new IndexOutOfRangeException();

        var number = 0;
        switch (permutationIndex) {
            case 5:
                permutationIndex = 3;
                number = 3;
                break;
            case 4:
                permutationIndex = 0;
                number = 1;
                break;
            case 3:
                permutationIndex = 1;
                number = 1;
                break;
            case 2:
                permutationIndex = 4;
                number = 2;
                break;
            case 1:
                permutationIndex = 5;
                number = 2;
                break;
            case 0:
                permutationIndex = 2;
                number = 3;
                break;
        }

        ++dealt;
        return number;
    }

    static int[,] sample =
        new[,] {
            { 1, 2, 3 }, // 0
            { 1, 3, 2 }, // 1
            { 3, 1, 2 }, // 2
            { 3, 2, 1 }, // 3
            { 2, 3, 1 }, // 4
            { 2, 1, 3 }, // 5
        };

    static int permutationIndex;
    static int dealt;
}
Not exactly what you are trying to accomplish here, but if you want to deal from a random ordering of a deck of cards, you use a shuffle algorithm. The typical shuffle algorithm is Fisher-Yates. The shuffle creates an array listing the card numbers in random order (13, 5, 7, 18, 22, ... etc.). To deal, you start at the first element of the array and continue forward.
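A minimal Fisher-Yates sketch (the Random instance would normally be created once and reused):
// Fisher-Yates: produces a uniformly random permutation of the deck in place.
static void Shuffle(int[] deck, Random rng)
{
    for (int i = deck.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1); // pick 0 <= j <= i
        (deck[i], deck[j]) = (deck[j], deck[i]);
    }
}

// Usage: var deck = Enumerable.Range(0, 52).ToArray(); Shuffle(deck, new Random());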
I am also struggling to see the whole picture here, but you could convert each permutation to base 52, with a single character representing each card, and have a string representing each permutation.
So Spades could be 1-9 (ace - 9), 0ABC (10, J Q K), then DEFG... starting the hearts and so on.
So a deck of 3 cards, 2 Spade (2), 3 Heart (F) and 2 Diamond (say e), would have these permutation numbers:
2Fe
2eF
F2e
Fe2
eF2
e2F
You could convert these back and forth to a int/long/bigint by doing base 52 to base 10 conversions.
Here's an introduction to converting between bases.
So e2F would be F + 2*52 + e * 52^2 which would be 16 + 2*52 + 43*52*52 = 116392
So 116392 would be your permutation number.
(BTW, I'm guessing about the 2 of Diamonds being 'e' and 43; you can count it up and see exactly what it would be.)
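A sketch of the decoding step; the 52-character alphabet below is an arbitrary assumption (any fixed mapping of one character per card works):
// Hypothetical digit alphabet: the character at position v represents value v.
const string Digits = "1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnop";

// Interpret a permutation string as a base-52 number, leftmost digit most significant.
// (For a full 52-card string this overflows long; use BigInteger in that case.)
static long FromBase52(string s)
{
    long value = 0;
    foreach (char c in s)
        value = value * 52 + Digits.IndexOf(c);
    return value;
}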
One way to tackle this is to use a (pseudo)random number generator (like a Mersenne Twister), then store only the seed number for each deal. Since the same seed yields the same sequence of random numbers each time, the seed serves to represent the whole deal (the random numbers generated from it drive which cards are dealt).
Some pseudo-code for the deal:
while (#cards < cardsNeeded)
    card = getCard(random())
    if (alreadyHaveThisCard(card))
        continue
    [do something with the card...]
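In C# the same idea might look like this (using System.Random, where the seed alone reproduces the deal):
// Reproduce a deal from a seed: the same seed always yields the same cards.
static List<int> Deal(int seed, int cardsNeeded)
{
    var rng = new Random(seed);
    var dealt = new List<int>();
    while (dealt.Count < cardsNeeded)
    {
        int card = rng.Next(52);  // draw a candidate card 0..51
        if (dealt.Contains(card)) // already have this card
            continue;
        dealt.Add(card);
    }
    return dealt;
}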
If I've understood you right, the following code implements this:
public class Dealer {
    public int Dealing() {
        var number =
            _freeCards.Count > 0
                ? _freeCards.Dequeue()
                : _lastNumber++;
        _dealtCards.Add(number);
        return number;
    }

    public void Collect(int number) {
        if (!_dealtCards.Remove(number))
            throw new ArgumentException("Card is not in use", "number");
        _freeCards.Enqueue(number);
    }

    readonly HashSet<int> _dealtCards = new HashSet<int>();
    readonly Queue<int> _freeCards = new Queue<int>(); // "Pool" of free cards.
    int _lastNumber;
}
While I have a bit of a problem understanding what you are really trying to accomplish here, I suppose a coprime will generate a bunch of permutation numbers; that is, if you don't care too much about the distribution. You can use the Euclidean algorithm to check coprimality.
Modular arithmetic tells us that you can simply use x = (x + coprime) % set.Length to reach all elements in the set. I suppose each coprime is a permutation number as you describe it.
That said, I'm not sure what distribution you get when using a generated coprime as a 'random number generator'; I suppose certain distributions will occur more frequently than others, and a lot of distributions will be excluded from the generated numbers, for the simple reason that the generator picks numbers in a ring. I'm being a bit creative here, so perhaps it fits your needs, although it probably won't be the answer you're looking for.
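To illustrate the ring traversal (a toy example, not from the question):
// Stepping by an increment coprime to the length visits every index exactly
// once per cycle: gcd(10, 7) == 1, so this prints a permutation of 0..9.
int length = 10, step = 7, x = 0;
for (int i = 0; i < length; i++)
{
    Console.Write($"{x} "); // 0 7 4 1 8 5 2 9 6 3
    x = (x + step) % length;
}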
I really don't get your question, but I interpret it like this: you want to calculate the permutationIndex of a sequence of 52 cards. A given permutation index maps one-to-one to a sequence of cards. Since there are 52! possible arrangements of 52 cards, you'll need at least 226 bits, or 29 bytes. So, your permutationIndex will already be very big!
Since your permutation index is already 29 bytes long, some extra bytes won't make much of a difference and make the solution a lot easier.
For example, you could map each letter of the Latin alphabet to a card. Given 26 lower case letters and 26 upper case letters, we have, lo and behold, 52 letters available to represent the 52 cards.
abcdefghijklm nopqrstuvwxyz
♥ A234567890JQK ♦ A234567890JQK
ABCDEFGHIJKLM NOPQRSTUVWXYZ
♣ A234567890JQK ♠ A234567890JQK
Now you can make a string of 52 letters. Each unique letter string represents a unique permutation of 52 cards. With this you can:
Generate a random string of letters to get a random permutation.
Immediately find out what card is where just by looking at the letter at a given position.
Shuffle, reorder, insert and remove cards easily.
Each character in a string is represented (in C#) as a 16-bit Unicode value, but for 52 cards you would only need 6 bits. So you have some more options to choose a representation:
1. 832 bits, or 104 bytes: a string of 52 Unicode characters
2. 416 bits, or 52 bytes: an array of 52 bytes
3. 320 bits, or 40 bytes: an array of 10 32-bit integers to hold 52 * 6 bits
4. 312 bits, or 39 bytes: an array of 39 bytes to hold 52 * 6 bits
5. 226 bits, or 29 bytes: the absolute lower bound
Representations 3 and 4 require quite some clever bit fiddling to get the 6 bits for a specific card out of the sequence. I would recommend representation 2, which preserves most of the advantages mentioned above.
When you are using a binary representation instead of a character string representation, then you can create an enum with a unique value for each card, and use that:
public enum Cards : byte
{
    HeartsAce,
    HeartsTwo,
    // ...
    HeartsTen,
    HeartsJack,
    HeartsQueen,
    HeartsKing,
    DiamondsAce,
    DiamondsTwo,
    // ...
    SpadesTen,
    SpadesJack,
    SpadesQueen,
    SpadesKing
}
There is a working and extremely efficient C# example for the kth permutation of order n (aka PermutationIndex) at this very old post:
http://msdn.microsoft.com/en-us/library/Aa302371.aspx#permutat_topic3
For those interested in the topic of combinations:
http://msdn.microsoft.com/en-us/magazine/cc163957.aspx
http://msdn.microsoft.com/en-us/library/Aa289166(VS.71).aspx
I suggest that you read through, before going into specific implementation.
Like the others, I am not sure what you want to do, but if you want to save as much space as possible on the communication/storage of the dealt cards, I would do the following:
I would store the dealt cards in a single long using an enum with the Flags attribute, so I could use bitwise comparisons to see which cards have been dealt.
Each card is a separate "flag" with a unique value, set to a power of 2, so they will never clash.
In total, even if you deal all the cards, the storage will still be 8 bytes; any extra data you need you can bolt onto the end.
Please see the working example below.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication12
{
    class Program
    {
        static void Main(string[] args)
        {
            // Because each card is unique you can use a [Flags] enum (see below)
            // and give each item a unique value; I used the powers of 2 up to 2^51.
            Cards cardsdealt = Cards.Clubs_10 | Cards.Clubs_2 | Cards.Diamonds_3;

            if ((cardsdealt & Cards.Clubs_10) == Cards.Clubs_10)
            {
                Console.WriteLine("Card.Clubs_10 was dealt");
            }

            // Storage would always be 8 bytes for the long data type.
        }

        [Flags]
        public enum Cards : long
        {
            Spades_Ace = 1,
            Spades_2 = 2,
            Spades_3 = 4,
            Spades_4 = 8,
            Spades_5 = 16,
            Spades_6 = 32,
            Spades_7 = 64,
            Spades_8 = 128,
            Spades_9 = 256,
            Spades_10 = 512,
            Spades_Jack = 1024,
            Spades_Queen = 2048,
            Spades_King = 4096,
            Hearts_Ace = 8192,
            Hearts_2 = 16384,
            Hearts_3 = 32768,
            Hearts_4 = 65536,
            Hearts_5 = 131072,
            Hearts_6 = 262144,
            Hearts_7 = 524288,
            Hearts_8 = 1048576,
            Hearts_9 = 2097152,
            Hearts_10 = 4194304,
            Hearts_Jack = 8388608,
            Hearts_Queen = 16777216,
            Hearts_King = 33554432,
            Diamonds_Ace = 67108864,
            Diamonds_2 = 134217728,
            Diamonds_3 = 268435456,
            Diamonds_4 = 536870912,
            Diamonds_5 = 1073741824,
            Diamonds_6 = 2147483648,
            Diamonds_7 = 4294967296,
            Diamonds_8 = 8589934592,
            Diamonds_9 = 17179869184,
            Diamonds_10 = 34359738368,
            Diamonds_Jack = 68719476736,
            Diamonds_Queen = 137438953472,
            Diamonds_King = 274877906944,
            Clubs_Ace = 549755813888,
            Clubs_2 = 1099511627776,
            Clubs_3 = 2199023255552,
            Clubs_4 = 4398046511104,
            Clubs_5 = 8796093022208,
            Clubs_6 = 17592186044416,
            Clubs_7 = 35184372088832,
            Clubs_8 = 70368744177664,
            Clubs_9 = 140737488355328,
            Clubs_10 = 281474976710656,
            Clubs_Jack = 562949953421312,
            Clubs_Queen = 1125899906842624, // 2^50
            Clubs_King = 2251799813685248,  // 2^51
        }
    }
}
I have a List<int> and I need to remove the outliers, so I want to use an approach where I only take the middle n. I want the middle in terms of values, not index.
For instance, given the following list, if I wanted the middle 80% I would expect the 11 and the 100 to be removed:
11, 22, 22, 33, 44, 44, 55, 55, 55, 100
Is there an easy / built in way to do this in LINQ?
I have a List<int> and I need to remove the outliers so want to use an approach where I only take the middle n. I want the middle in terms of values, not index.
Removing outliers correctly depends entirely on the statistical model that accurately describes the distribution of the data -- which you have not supplied for us.
On the assumption that it is a normal (Gaussian) distribution, here's what you want to do.
First compute the mean. That's easy; it's just the sum divided by the number of items.
Second, compute the standard deviation. Standard deviation is a measure of how "spread out" the data is around the mean. Compute it by:
take the difference of each point from the mean
square the difference
take the mean of the squares -- this is the variance
take the square root of the variance -- this is the standard deviation
In a normal distribution 80% of the items are within 1.2 standard deviations of the mean. So, for example, suppose the mean is 50 and the standard deviation is 20. You would expect that 80% of the sample would fall between 50 - 1.2 * 20 and 50 + 1.2 * 20. You can then filter out items from the list that are outside of that range.
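A sketch of those steps in LINQ, using the sample list from the question and 1.2 standard deviations for the 80% interval:
List<int> values = new List<int> { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };
double mean = values.Average();                                 // 44.1
double variance = values.Average(v => (v - mean) * (v - mean)); // mean of squared differences
double stdDev = Math.Sqrt(variance);                            // about 23.8
// Keep values within 1.2 standard deviations of the mean: about [15.5, 72.7],
// which drops exactly the 11 and the 100 from the sample data.
var middle80 = values.Where(v => Math.Abs(v - mean) <= 1.2 * stdDev).ToList();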
Note however that this is not removing "outliers". This is removing elements that are more than 1.2 standard deviations from the mean, in order to get an 80% interval around the mean. In a normal distribution one expects to see "outliers" on a regular basis. 99.73% of items are within three standard deviations of the mean, which means that if you have a thousand observations, it is perfectly normal to see two or three observations more than three standard deviations outside the mean! In fact, anywhere up to, say, five observations more than three standard deviations away from the mean when given a thousand observations probably does not indicate an outlier.
I think you need to very carefully define what you mean by outlier and describe why you are attempting to eliminate them. Things that look like outliers are potentially not outliers at all, they are real data that you should be paying attention to.
Also, note that none of this analysis is correct if the normal distribution is incorrect! You can get into big, big trouble eliminating what look like outliers when in fact you've actually got the entire statistical model wrong. If the model is more "tail heavy" than the normal distribution then outliers are common, and not actually outliers. Be careful! If your distribution is not normal then you need to tell us what the distribution is before we can recommend how to identify outliers and eliminate them.
You could use the Enumerable.OrderBy method to sort your list, then use Enumerable.Skip and the Enumerable.Take functions, e.g.:
var result = nums.OrderBy(x => x).Skip(1).Take(8);
Where nums is your list of integers.
Figuring out what values to use as arguments for Skip and Take should look something like this, if you just want the "middle n values":
nums.OrderBy(x => x).Skip((nums.Count - n) / 2).Take(n);
However, when nums.Count - n is odd, (nums.Count - n) / 2 truncates; how do you want the code to behave in that case?
Assuming you're not doing any weighted-average funny business:
List<int> ints = new List<int>() { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };
int min = ints.Min();
double range = (ints.Max() - min);
var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });
You can then filter on Weight as needed, dropping the top/bottom n% as desired. In your case:
results.Where(o => o.Weight >= .1 && o.Weight < .9)
Edit: As an extension method, because I like extension methods:
public static class Lulz
{
    public static List<int> MiddlePercentage(this List<int> ints, double Percentage)
    {
        int min = ints.Min();
        double range = (ints.Max() - min);
        var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });

        double tolerance = (1 - Percentage) / 2;
        return results.Where(o => o.Weight >= tolerance && o.Weight < 1 - tolerance)
                      .Select(o => o.IntegralValue).ToList();
    }
}
Usage:
List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };
var results = ints.MiddlePercentage(.8);
Normally, if you wanted to exclude statistical outliers from a set of values, you'd compute the arithmetic mean and standard deviation for the set, and then remove values lying further from the mean than you'd like (measure in standard deviations). A normal distribution — your classic bell-shaped curve — exhibits the following properties:
About 68% of the data will lie within +/- 1 standard deviation from the mean.
About 95% of the data will lie within +/- 2 standard deviations from the mean.
About 99.7% of the data will lie within +/- 3 standard deviations of the mean.
You can get Linq extension methods for computation of standard deviation (and other statistical functions) at http://www.codeproject.com/KB/linq/LinqStatistics.aspx
I am not going to question the validity of calculating outliers since I had a similar need to do exactly this kind of selection. The answer to the specific question of taking the middle n is:
List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };
var result = ints.Skip(1).Take(ints.Count() - 2);
This skips the first item, and stops before the last giving you just the middle n items. Here is a link to a .NET Fiddle demonstrating this query.
https://dotnetfiddle.net/p1z7em
I have a List and I need to remove the outliers so want to use an approach where I only take the middle n. I want the middle in terms of values, not index.
If I understand correctly we want to keep any values that fall into the middle 80% of the 11-100 range, or
min + (max - min - (max - min) * 0.8) / 2 < x < max - (max - min - (max - min) * 0.8) / 2
Assuming an ordered list, we can SkipWhile the values are lower than the lowerBound, and then TakeWhile the numbers are lower than the upperBound.
public void Calculate()
{
    var numbers = new[] { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };
    var percentage = 0.8;
    var result = RemoveOutliers(numbers, percentage);
}

private IEnumerable<int> RemoveOutliers(int[] numbers, double percentage)
{
    int min = numbers.First();
    int max = numbers.Last();
    double range = (max - min);
    double lowerBound = min + (range - range * percentage) / 2;
    double upperBound = max - (range - range * percentage) / 2;
    return numbers.SkipWhile(n => n < lowerBound).TakeWhile(n => n < upperBound);
}