I understand that some matrices have a lot of data, while others have mainly 0's or are empty. But what is the advantage of creating a SparseMatrix object to hold a sparsely populated matrix over creating a DenseMatrix object to hold a sparsely populated matrix? They both seem to offer more or less the same operations as far as methods go.
I'm also wondering when you would use a Matrix object to hold data -- as in are there any advantages or situations where this would be preferred over using the other two.
For small matrices (say, smaller than 1000x1000) dense matrices work well. But in practice there are many problems that need much larger matrices, where almost all values are zero (often with the non-zero values close to the diagonal). With sparse matrices it is possible to handle very large matrices in cases where a dense structure is infeasible, because it would need too much memory or be way too expensive to compute with in CPU time. For example, a 100,000x100,000 dense matrix of doubles would already need about 80 GB just to store its entries.
Note that as of today the Math.NET Numerics direct matrix decomposition methods are optimized for dense matrices only; use iterative solvers for sparse data instead.
Regarding types, in Math.NET Numerics v3 the hierarchy for double-valued matrices is as follows:
Matrix<double>
|- Double.Matrix
   |- Double.DenseMatrix
   |- Double.SparseMatrix
   |- Double.DiagonalMatrix
By Matrix<T> I mean the full type MathNet.Numerics.LinearAlgebra.Matrix<T>, by Double.Matrix I mean MathNet.Numerics.LinearAlgebra.Double.Matrix, and so on.
Matrix<double>: always declare all variables, properties and arguments using this generic type only. Indeed, in most cases this is the only type needed in user code.
Double.Matrix: do not use
Double.DenseMatrix: use for creating a dense matrix only - if you do not wish to use the builder (Matrix<double>.Build.Dense...)
Double.SparseMatrix: use for creating a sparse matrix only - if you do not wish to use the builder
Double.DiagonalMatrix: use for creating a diagonal matrix only - if you do not wish to use the builder
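For example, a sketch of the two creation styles in Math.NET Numerics v3 (the exact factory overloads you use may differ):
using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics.LinearAlgebra.Double;

// Using the builder (recommended) - variables stay typed as Matrix<double>:
Matrix<double> d = Matrix<double>.Build.Dense(3, 3);          // 3x3 dense, all zeros
Matrix<double> s = Matrix<double>.Build.Sparse(1000, 1000);   // 1000x1000 sparse, all zeros

// Using the concrete types directly, only at the point of creation:
Matrix<double> d2 = new DenseMatrix(3, 3);
Matrix<double> s2 = new SparseMatrix(1000, 1000);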
Each of them is optimized for its specific use. For example, SparseMatrix stores its data in the compressed sparse row (CSR) format:
Compressed sparse row (CSR or CRS)
CSR is effectively identical to the Yale Sparse Matrix format, except
that the column array is normally stored ahead of the row index array.
I.e. CSR is (val, col_ind, row_ptr), where val is an array of the
(left-to-right, then top-to-bottom) non-zero values of the matrix;
col_ind is the column indices corresponding to the values; and,
row_ptr is the list of value indexes where each row starts. The name
is based on the fact that row index information is compressed relative
to the COO format. One typically uses another format (LIL, DOK, COO)
for construction. This format is efficient for arithmetic operations,
row slicing, and matrix-vector products. See scipy.sparse.csr_matrix.
See the Wikipedia article on sparse matrices for more info.
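As a small concrete illustration (my own example, not taken from the quote above), the three CSR arrays for a 4x4 matrix look like this:
// The 4x4 matrix:
// [ 1 0 0 2 ]
// [ 0 3 0 0 ]
// [ 0 0 0 0 ]
// [ 4 0 5 0 ]
double[] val    = { 1, 2, 3, 4, 5 };   // non-zero values, left-to-right, top-to-bottom
int[]    colInd = { 0, 3, 1, 0, 2 };   // column index of each value
int[]    rowPtr = { 0, 2, 3, 3, 5 };   // index in val where each row starts; last entry = number of non-zeros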
Related
I need a fast collection that maps a 2D integer point to a custom class in C#.
The collection needs to have:
Fast lookup (coordinates to custom class), adding a point if it does not exist
Fast removal of a range of key points (those outside a given rect). This actually rules out Dictionary<Point2D, ...>, as profiling showed this operation takes 35% of the entire frame time in my sample implementation :-(
EDIT: To stress: I want to remove all fields OUTSIDE of the given rect (to kill the unused cache)
The coordinates can take any int values (they are used to cache tiles of an [almost] infinite isometric 2D map that are near the camera in Unity).
The points will always be organized in a rect-like structure (I can relax the requirement that they always follow a rect; I am actually using an isometric projection).
The structure itself is used for caching tile-specific data (like tile transitions)
EDIT: Updated with outcome of discussion
You can use a sparse, static matrix for each "Chunk" in the cache and a cursor to represent the current viewport. You can then use either modulus math or a quadtree to access each chunk, depending on the specific use case.
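For illustration, a rough sketch of the modulus-math part; the ChunkSize value and the method name are made up for the example:
using System;

static class ChunkMath
{
    const int ChunkSize = 16; // assumption: 16x16 tiles per chunk

    // Maps a world tile coordinate to its chunk coordinate plus the local
    // offset inside that chunk; floor-division keeps negative coordinates correct.
    public static (int chunkX, int chunkY, int localX, int localY) Locate(int x, int y)
    {
        int chunkX = (int)Math.Floor(x / (double)ChunkSize);
        int chunkY = (int)Math.Floor(y / (double)ChunkSize);
        int localX = x - chunkX * ChunkSize; // always in [0, ChunkSize)
        int localY = y - chunkY * ChunkSize;
        return (chunkX, chunkY, localX, localY);
    }
}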
Old Answer:
If they are uniformly spaced, then why do you need to hash at all? You could just use a matrix of objects, with null as the default value where nothing is cached.
Since you are storing objects, the array just holds references under the hood, so its memory footprint wouldn't really be affected by the null values.
If you truly need it to be infinite, you can nest the matrices with a quadtree and create some kind of "Chunk" system.
I think this is what you need: RTree
Here's an interesting data structure conundrum that perhaps you all can help me with; for context, I am writing C#.
Context/Constraints:
I'm using a library (the new Unity ECS preview package, specifically) that allows me to store data in a very compact/efficient/native fashion for lightning-fast access and manipulation with no garbage collection. For some time it supported storing data in FixedArrays:
ComponentType.FixedArray<T>(int fixedCapacity) //pseudo-code
The API does not allow any sort of managed data to be stored in these arrays, for performance and safety reasons, which means they must all be linear (no nested arrays or multiple dimensions) and the data elements themselves must be extremely simple (primitives or directly serializable structs, no fancy LinkedLists or references to other data structures). I cannot use a HashTable or Dictionary or any other similar high-level data structure to solve this problem; I must use the provided data structure!
Problem:
I am trying to store a basic "Entity" object in the array which has an associated 3D integer point coordinate. I want to access the structure with this coordinate in hand and retrieve/modify/delete my object at said coordinate. So it was a simple problem of accessing a linearly indexed, fixed-width array using 3D coordinates, made possible by a hashing function.
//Pseudo-Code, this is not the actual code itself.
//In actuality the Fixed Arrays are associated with Entities in an ECS system.
var myStructure = new ComponentType.FixedArray<Entity>(512);//define array
struct DataPair {
Entity entity;//the element we're storing
Vector3 threeDIntegerCoordinate;//from 1x1x1 to 8x8x8
}
//...
int lookupFunction(Vector3 coordinate) {...} //converts a 3D coordinate to a 1D linear index
DataPair exampleDataPair = new DataPair(...);
//Data WAS stored like this:
myStructure[lookupFunction(exampleDataPair.threeDIntegerCoordinate)] = exampleDataPair.entity;
//Extremely fast access/lookup time due to using coordinate as index value.
Basically, I generate a variable number of Entities (1 to 512, one 8x8x8 cube) and store them by index in the FixedArray using a translation function that correlates a linear index value with every 3D point coordinate. Lookup in a fixed array by coordinate value was extremely fast, as simple as accessing an index.
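For reference, the translation function is essentially a mapping like the following sketch (simplified; assuming zero-based coordinates inside the 8x8x8 cube):
// Maps a 3D coordinate with x, y, z in [0, 8) to a linear index in [0, 512).
static int LookupIndex(int x, int y, int z)
{
    return x + (y * 8) + (z * 8 * 8);
}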
However!
The package has been updated and they replaced the FixedArray with a new DynamicBuffer data structure, which is now variable-width. The same constraints apply to what data can be stored, but now, if the cube of Entities is sparse (not entirely full), it does not need to reserve space for the non-existent Entity references in the structure. This will drastically cut down on my memory usage considering most cubes of Entities are not entirely full, and I'm storing literally millions of these buffers in memory at a time. The buffer elements are indexed by integer. It is possible to use multiple DynamicBuffers at once (which means we could store the coordinates alongside the elements in two parallel buffers if necessary).
//New data structure provided. Variable-width! Also indexed linearly.
var myStructure = new ComponentType.DynamicBuffer<Entity>();
//Very similar to a C# List or a Java ArrayList; for example, it contains these functions:
myStructure.Add(T element);
myStructure.AddRange(...);
myStructure.Length();
myStructure.resizeUninitialized(int capacity);
myStructure.Clear();
In essence, what is the most efficient way to store these variable number of elements in a dynamic, dimensionless data structure (similar to a List) while maintaining 3D coordinate-based indexing, without using complex nested data structures to do so? My system is more performance-bound by the lookup/access time than it is by memory space.
Possible Solutions:
The naive solution is just to make every DynamicBuffer's length equal to the maximum number of elements I would ever want to store (a 512-Entity volume), simulating FixedArrays. This would require minimal changes to my codebase and would allow me to access them by coordinate using my current translation function, but it would not take advantage of the space-saving features of the dynamic data structure. It would look like this:
//Naive Solution:
myStructure.resizeUninitialized(512); //resize buffer to 512 elements
//DynamicBuffer is now indexed identically to FixedArray
Entity elementToRetrieve = myStructure[lookupFunction(exampleDataPair.threeDIntegerCoordinate)];
My projected solution is to use two parallel DynamicBuffers: one with all the Entities, the other with all the 3D points. Then when I want to find an Entity by coordinate, I look up the 3D point in the coordinate buffer and use the index of that element to find the appropriate Entity in the primary buffer.
//Possible better solution:
myStructure1 = new ComponentType.DynamicBuffer<Entity>();
myStructure2 = new ComponentType.DynamicBuffer<Vector3>();
//to access an element:
Entity elementToRetrieve = myStructure1[myStructure2.Find(exampleDataPair.threeDIntegerCoordinate)];
//I would have to create this theoretical Find function.
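A rough sketch of what that theoretical Find might look like, staying in the pseudo-code style above (a plain linear scan, no sorting yet):
//Pseudo-code sketch of the theoretical Find: a plain linear scan.
int Find(ComponentType.DynamicBuffer<Vector3> coordinates, Vector3 target)
{
    for (int i = 0; i < coordinates.Length(); i++)
    {
        if (coordinates[i] == target)
            return i; // this index is shared with the parallel Entity buffer
    }
    return -1; // not found
}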
Cons of this solution:
Requires searching, which means it would probably also require sorting.
Sorting would need to be performed every time the structure is significantly modified, which is going to add a LARGE amount of computational overhead.
Would need to write my own search/sort algorithms on top of an already extremely complicated data structure, which is not designed to be searched/sorted (it is possibly not stored linearly in memory).
Locality of reference: it is very important for performance that processor caching/speculative execution is preserved.
How can I find a happy medium between the naive solution and a complex solution involving searching/sorting? Are there any theoretical data structures or algorithms to solve this problem that I'm just completely missing? Basically I need to efficiently use a List like a Map.
Sorry if this is a really long question, but I wanted to get this right, this is my first ever post here on StackExchange. Please be gentle! Thanks for all your help!
I started using the Math.NET Numerics library and I need it to calculate the largest eigenvalues with their corresponding eigenvectors of my adjacency matrix.
When using a large number of points my adjacency matrix gets quite big (e.g. 5782x5782 entries).
Most of the entries are '0' so I thought I could use the SparseMatrix. But when I use it, it still takes ages to compute. In fact, I never actually waited long enough for it to finish.
I tried the whole thing in MATLAB and there wasn't any problem at all; MATLAB solved it within a few seconds.
Do you have any suggestions for me?
Here is what I'm doing:
// initialize matrix and fill it with zeros
Matrix<double> A = SparseMatrix.Create(count, count, 0);
... fill matrix with values ...
// get eigenvalues and eigenvectors / this part takes centuries =)
Evd<double> eigen = A.Evd(Symmetricity.Symmetric);
Vector<Complex> eigenvalues = eigen.EigenValues; // EigenValues returns the eigenvalues as a complex vector
Math.NET Numerics' default implementation is purely C# based. Therefore, performance may not be on par with tools such as MATLAB, which mostly rely on native, highly optimized BLAS libraries for numerical computations.
You may want to use the native wrappers that come with Math.Net that leverage highly optimized linear algebra libraries (such as Intel's MKL or AMD's ACML). There is a guide on this MSDN page that explains how to build Math.NET with ACML support (look under Compiling and Using AMD ACML in Math.NET Numerics).
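Once one of the native providers is installed, switching to it is a one-liner; a sketch assuming the MKL native binaries are available next to your application:
using MathNet.Numerics;

// Tell Math.NET Numerics to use the native MKL provider for linear algebra
// (throws if the native binaries cannot be found).
Control.UseNativeMKL();

// ... then build the matrix and call A.Evd(...) as before.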
public double[] result = new double[ ??? ];
I am storing results, and the total number of results is bigger than 2,147,483,647, which is the max Int32 value.
I tried BigInteger, ulong, etc., but all of them gave me errors.
How can I get an array that can store more than 50,147,483,647 results (doubles)?
Thanks...
An array of 2,147,483,648 doubles will occupy 16GB of memory. For some people, that's not a big deal. I've got servers that won't even bother to hit the page file if I allocate a few of those arrays. Doesn't mean it's a good idea.
When you are dealing with huge amounts of data like that you should be looking to minimize the memory impact of the process. There are several ways to go with this, depending on how you're working with the data.
Sparse Arrays
If your array is sparsely populated - lots of default/empty values with a small percentage of actually valid/useful data - then a sparse array can drastically reduce the memory requirements. You can write various implementations to optimize for different distribution profiles: random distribution, grouped values, arbitrary contiguous groups, etc.
Works fine for any type of contained data, including complex classes. Has some overheads, so can actually be worse than naked arrays when the fill percentage is high. And of course you're still going to be using memory to store your actual data.
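As a minimal illustration (not a tuned implementation), a sparse array can be as simple as a dictionary keyed by index:
using System.Collections.Generic;

// Indices with no entry read back as default(T); only explicitly set values cost memory.
public class SparseArray<T>
{
    private readonly Dictionary<long, T> items = new Dictionary<long, T>();

    public T this[long index]
    {
        get
        {
            T value;
            return items.TryGetValue(index, out value) ? value : default(T);
        }
        set { items[index] = value; }
    }

    public long StoredCount { get { return items.Count; } }
}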
Simple Flat File
Store the data on disk, create a read/write FileStream for the file, and enclose that in a wrapper that lets you access the file's contents as if it were an in-memory array. The simplest implementation of this will give you reasonable usefulness for sequential reads from the file. Random reads and writes can slow you down, but you can do some buffering in the background to help mitigate the speed issues.
This approach works for any type that has a static size, including structures that can be copied to/from a range of bytes in the file. Doesn't work for dynamically-sized data like strings.
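A bare-bones sketch of such a wrapper for doubles, with no buffering or error handling, purely to show the shape of the idea:
using System;
using System.IO;

// Exposes a file of 8-byte doubles as if it were a big array indexed by long.
public sealed class FileBackedDoubleArray : IDisposable
{
    private readonly FileStream stream;

    public FileBackedDoubleArray(string path, long length)
    {
        stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        stream.SetLength(length * sizeof(double));
    }

    public double this[long index]
    {
        get
        {
            stream.Position = index * sizeof(double);
            var buffer = new byte[sizeof(double)];
            stream.Read(buffer, 0, buffer.Length);
            return BitConverter.ToDouble(buffer, 0);
        }
        set
        {
            stream.Position = index * sizeof(double);
            stream.Write(BitConverter.GetBytes(value), 0, sizeof(double));
        }
    }

    public void Dispose() { stream.Dispose(); }
}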
Complex Flat File
If you need to handle dynamic-size records, sparse data, etc. then you might be able to design a file format that can handle it elegantly. Then again, a database is probably a better option at this point.
Memory Mapped File
Same as the other file options, but using a different mechanism to access the data. See System.IO.MemoryMappedFile for more information on how to use Memory Mapped Files from .NET.
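A short sketch of the memory-mapped approach for doubles (file name and element count are placeholders):
using System.IO;
using System.IO.MemoryMappedFiles;

long count = 3000000000;  // more doubles than a normal array can hold
using (var mmf = MemoryMappedFile.CreateFromFile(
           "data.bin", FileMode.OpenOrCreate, null, count * sizeof(double)))
using (var accessor = mmf.CreateViewAccessor())
{
    accessor.Write(42L * sizeof(double), 3.14);            // write element 42
    double x = accessor.ReadDouble(42L * sizeof(double));  // read it back
}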
Database Storage
Depending on the nature of the data, storing it in a database might work for you. For a large array of doubles, however, this is unlikely to be a great option. There are the overheads of reading/writing data in the database, plus the storage overheads: each row will at least need a row identity, probably a BIGINT (8-byte integer) for a large recordset, doubling the size of the data right off the bat. Add in the overheads for indexing, row storage, etc. and you can very easily multiply the size of your data.
Databases are great for storing and manipulating complicated data. That's what they're for. If you have variable-width data - strings and the like - then a database is probably one of your best options. The flip-side is that they're generally not an optimal solution for working with large amounts of very simple data.
Whichever option you go with, you can create an IList<T>-compatible class that encapsulates your data. This lets you write code that doesn't have any need to know how the data is stored, only what it is.
BCL arrays cannot do that.
Someone wrote a chunked BigArray<T> class that can.
However, that will not magically create enough memory to store it.
You can't. Even with gcAllowVeryLargeObjects, the maximum size of any single dimension in an array (of non-byte types) is 2,146,435,071 elements.
So you'll need to rethink your design, or use an alternative implementation such as a jagged array.
Another possible approach is to implement your own BigList. First note that List is implemented as an array. Also, you can set the initial size of the List in the constructor, so if you know it will be big, get a big chunk of memory up front.
Then
public class myBigList<T> : List<List<T>>
{
}
or, maybe more preferable, use a has-a approach:
public class myBigList<T>
{
List<List<T>> theList;
}
In doing this you will need to re-implement the indexer so you can use division and modulo to find the correct indexes into your backing store. Then you can use a BigInteger as the index. In your custom indexer you will decompose the BigInteger into two legal-sized ints.
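A compressed sketch of the has-a version with such an indexer (the chunk size is an arbitrary choice):
using System.Collections.Generic;

public class myBigList<T>
{
    private const int ChunkSize = 1 << 20;                 // ~1M elements per inner list (arbitrary)
    private readonly List<List<T>> chunks = new List<List<T>>();

    public void Add(T item)
    {
        // start a new inner list whenever the last one is full
        if (chunks.Count == 0 || chunks[chunks.Count - 1].Count == ChunkSize)
            chunks.Add(new List<T>(ChunkSize));
        chunks[chunks.Count - 1].Add(item);
    }

    public T this[long index]
    {
        get { return chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)]; }
        set { chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)] = value; }
    }
}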
I ran into the same problem. I solved it using a list of lists, which mimics an array very well but can go well beyond the 2 GB limit, e.g. List<List<sbyte>>. It worked for a 250k x 250k matrix of sbyte running on a 32 GB computer, even though this elephant represents 60+ GB of space :-)
C# arrays are limited in size to System.Int32.MaxValue.
For bigger than that, use List<T> (where T is whatever you want to hold).
More here: What is the Maximum Size that an Array can hold?
I have very little data for my analysis, and so I want to produce more data for analysis through interpolation.
My dataset contains 23 independent attributes and 1 dependent attribute. How can interpolation be done for this?
EDIT:
My main problem is a shortage of data; I have to increase the size of my dataset. The attributes are categorical, for example attribute A may be low, medium, or high. So is interpolation the right approach for this or not?
This is a mathematical problem, but there is too little information in the question to answer it properly. Depending on the distribution of your real data you may try to find a function that it follows. You can also try to interpolate data using an artificial neural network, but that would be complex. The thing is that to find interpolations you need to analyze the data you already have, and that defeats the purpose. There is probably more to this problem than is explained. What is the nature of the data? Can you place it in n-dimensional space? What do you expect to get from the analysis?
Roughly speaking, to interpolate an array:
double[] data = LoadData();
double requestedIndex = /* set to the index you want - e.g. 1.25 to interpolate between values at data[1] and data[2] */;
int previousIndex = (int)requestedIndex; // in example, would be 1
int nextIndex = previousIndex + 1; // in example, would be 2
double factor = requestedIndex - (double)previousIndex; // in example, would be 0.25
// in example, this would give 75% of data[1] plus 25% of data[2]
double result = (data[previousIndex] * (1.0 - factor)) + (data[nextIndex] * factor);
This is really pseudo-code; it doesn't perform range-checking, assumes your data is in an object or array with an indexer, and so on.
Hope that helps to get you started - any questions please post a comment.
If the 23 independent variables are sampled in a hyper-grid (regularly spaced), then you can choose to partition into hyper-cubes and do linear interpolation of the dependent value from the vertex closest to the origin along the vectors defined from that vertex along the hyper-cube edges away from the origin. In general, for a given partitioning, you project the interpolation point onto each vector, which gives you a new 'coordinate' in that particular space, which can then be used to compute the new value by multiplying each coordinate by the difference of the dependent variable, summing the results, and adding to the dependent value at the local origin. For hyper-cubes, this projection is straightforward (you simply subtract the nearest vertex position closest to the origin.)
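To make the hyper-cube scheme concrete, here is its two-dimensional special case, plain bilinear interpolation inside one grid cell, where fx and fy are the fractional coordinates of the query point within the cell:
// v00, v10, v01, v11 are the dependent values at the four cell corners.
static double Bilinear(double v00, double v10, double v01, double v11,
                       double fx, double fy)
{
    double bottom = v00 * (1 - fx) + v10 * fx; // interpolate along x at the lower edge
    double top    = v01 * (1 - fx) + v11 * fx; // interpolate along x at the upper edge
    return bottom * (1 - fy) + top * fy;       // then interpolate along y
}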
If your samples are not uniformly spaced, then the problem is much more challenging, as you would need to choose an appropriate partitioning if you wanted to perform linear interpolation. In principle, Delaunay triangulation generalizes to N dimensions, but it's not easy to do and the resulting geometric objects are a lot harder to understand and interpolate than a simple hyper-cube.
One thing you might consider is whether your data set is naturally amenable to projection so that you can reduce the number of dimensions. For instance, if two of your independent variables dominate, you can collapse the problem to 2 dimensions, which is much easier to solve. Another thing you might consider is taking the sampling points and arranging them in a matrix. You can perform an SVD decomposition and look at the singular values. If there are a few dominant singular values, you can use this to perform a projection onto the hyper-plane defined by those basis vectors and reduce the dimensions for your interpolation. Basically, if your data is spread in a particular set of dimensions, you can use those dominating dimensions to perform your interpolation, since you don't really have much information in the other dimensions anyway.
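Since Math.NET Numerics comes up elsewhere in this thread, a quick sketch of inspecting the singular values that way (assuming your samples are arranged as the rows of a Matrix<double>):
using MathNet.Numerics.LinearAlgebra;

// samples: one row per observation, one column per independent variable.
// Random placeholder data here; substitute your real sample matrix.
Matrix<double> samples = Matrix<double>.Build.Random(100, 23);

var svd = samples.Svd();
Vector<double> singularValues = svd.S;   // look for a few dominant values here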
I agree with the other commentators, however, that your premise may be off. You generally don't want to interpolate to perform analysis, as you're just choosing to interpolate your data in different ways and the choice of interpolation biases the analysis. It only makes sense if you have a compelling reason to believe that a particular interpolation is physically consistent and you simply need additional points for a particular algorithm.
May I suggest cubic spline interpolation:
http://www.coastrd.com/basic-cubic-spline-interpolation
Unless you have a very specific need, this is easy to implement and calculates splines well.
Have a look at the regression methods presented in The Elements of Statistical Learning; most of them can be tested in R. There are plenty of models that can be used: linear regression, local models and so on.