I have very little data for my analysis, and so I want to produce more data for analysis through interpolation.
My dataset contains 23 independent attributes and 1 dependent attribute. How can this interpolation be done?
EDIT:
My main problem is a shortage of data: I have to increase the size of my dataset, and the attributes are categorical. For example, attribute A may be low, medium, or high. Is interpolation the right approach for this or not?
This is a mathematical problem, but there is too little information in the question to answer it properly. Depending on the distribution of your real data, you may try to find a function that it follows. You can also try to interpolate the data using an artificial neural network, but that would be complex. The thing is that to find interpolations you need to analyze the data you already have, and that defeats the purpose. There is probably more to this problem that has not been explained. What is the nature of the data? Can you place it in n-dimensional space? What do you expect to get from the analysis?
Roughly speaking, to interpolate an array:
double[] data = LoadData();
double requestedIndex = 1.25; // the index you want; 1.25 interpolates between the values at data[1] and data[2]
int previousIndex = (int)requestedIndex; // in example, would be 1
int nextIndex = previousIndex + 1; // in example, would be 2
double factor = requestedIndex - (double)previousIndex; // in example, would be 0.25
// in example, this would give 75% of data[1] plus 25% of data[2]
double result = (data[previousIndex] * (1.0 - factor)) + (data[nextIndex] * factor);
This is really pseudo-code; it doesn't perform range-checking, assumes your data is in an object or array with an indexer, and so on.
Hope that helps to get you started - any questions please post a comment.
If the 23 independent variables are sampled in a hyper-grid (regularly spaced), then you can choose to partition into hyper-cubes and do linear interpolation of the dependent value from the vertex closest to the origin along the vectors defined from that vertex along the hyper-cube edges away from the origin. In general, for a given partitioning, you project the interpolation point onto each vector, which gives you a new 'coordinate' in that particular space, which can then be used to compute the new value by multiplying each coordinate by the difference of the dependent variable, summing the results, and adding to the dependent value at the local origin. For hyper-cubes, this projection is straightforward (you simply subtract the nearest vertex position closest to the origin.)
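As a rough illustration of the hyper-cube case described above, here is a minimal Python sketch. It assumes a regular grid anchored at the origin with non-negative coordinates; `interpolate_on_grid`, `value_at`, and the `plane` test function are hypothetical names made up for this example, and it implements only the first-order scheme described in this answer (value at the local origin plus one weighted difference per edge), not full multilinear interpolation.

```python
import math

def interpolate_on_grid(point, spacing, value_at):
    """First-order interpolation inside one hyper-cube of a regular grid.

    point    -- coordinates of the query point (any number of dimensions)
    spacing  -- grid spacing along each dimension
    value_at -- hypothetical callback returning the dependent value at an
                integer grid index (a tuple), e.g. a dictionary lookup
    """
    # Local origin: the lower corner of the cube that contains the point.
    base = tuple(math.floor(p / h) for p, h in zip(point, spacing))
    # Projection onto each cube edge, as a fraction between 0 and 1.
    frac = [p / h - b for p, h, b in zip(point, spacing, base)]

    f0 = value_at(base)
    result = f0
    for axis, t in enumerate(frac):
        # Vertex one step away from the local origin along this axis.
        neighbour = base[:axis] + (base[axis] + 1,) + base[axis + 1:]
        result += t * (value_at(neighbour) - f0)
    return result

# Made-up dependent variable that is linear in the grid indices,
# so the first-order estimate is exact here.
def plane(index):
    i, j, k = index
    return 2 * i + 3 * j - k + 1

estimate = interpolate_on_grid((1.25, 0.75, 3.0), (1.0, 0.5, 2.0), plane)
```

The same function works unchanged for 23 dimensions, since it only looks at one neighbouring vertex per axis.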
If your samples are not uniformly spaced, then the problem is much more challenging, as you would need to choose an appropriate partitioning if you wanted to perform linear interpolation. In principle, Delaunay triangulation generalizes to N dimensions, but it's not easy to do and the resulting geometric objects are a lot harder to understand and interpolate than a simple hyper-cube.
One thing you might consider is if your data set is naturally amenable to projection so that you can reduce the number of dimensions. For instance, if two of your independent variables dominate, you can collapse the problem to 2-dimensions, which is much easier to solve. Another thing you might consider is taking the sampling points and arranging them in a matrix. You can perform an SVD decomposition and look at the singular values. If there are a few dominant singular values, you can use this to perform a projection to the hyper-plane defined by those basis vectors and reduce the dimensions for your interpolation. Basically, if your data is spread in a particular set of dimensions, you can use those dominating dimensions to perform your interpolation, since you don't really have much information in the other dimensions anyway.
I agree with the other commentators, however, that your premise may be off. You generally don't want to interpolate to perform analysis, as you're just choosing to interpolate your data in different ways and the choice of interpolation biases the analysis. It only makes sense if you have a compelling reason to believe that a particular interpolation is physically consistent and you simply need additional points for a particular algorithm.
May I suggest Cubic Spline Interpolation
http://www.coastrd.com/basic-cubic-spline-interpolation
Unless you have a very specific need, this is easy to implement and calculates splines well.
Have a look at the regression methods presented in Elements of statistical learning; most of them may be tested in R. There are plenty of models that can be used: linear regression, local models and so on.
I've got two lists of points, let's call them L1( P1(x1, y1), ... Pn(xn, yn)) and L2(P'1(x'1, y'1), ... P'n(x'n, y'n)).
My task is to find the best match between their points for minimizing the sum of their distances.
Any clue on some algorithm? The two lists contain approx. 200-300 points.
Thanks and bests.
If your use case involves matching every point in list L1 with a point in list L2, then the Hungarian algorithm is a perfect fit.
The weights in your Hungarian matrix are the distances between the point assigned to the row and the point assigned to the column. The overall runtime of the optimized Hungarian algorithm is O(n^3), which comfortably fits your constraint of n = 300.
A pretty nice tutorial covering the idea and implementation of the Hungarian algorithm is https://www.topcoder.com/community/competitive-programming/tutorials/assignment-problem-and-hungarian-algorithm/
Instead of the Hungarian algorithm, you can also recast the problem as a min-cost max-flow problem. I'll omit the details for now, but can discuss them if required.
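For reference, here is a compact Python sketch of the O(n^3) Hungarian algorithm applied to two point lists. The function names (`hungarian`, `match_points`) and the sample points are made up for illustration; the cost matrix uses plain Euclidean distance, as suggested above, and the code assumes the first list is no longer than the second.

```python
import math

def hungarian(cost):
    """Return assignment[i] = column matched to row i, minimizing total cost.

    Assumes len(cost) <= len(cost[0]). Uses the potentials formulation.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j] = row currently matched to column j
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment

def match_points(list1, list2):
    """Match every point of list1 to a distinct point of list2."""
    cost = [[math.hypot(ax - bx, ay - by) for (bx, by) in list2]
            for (ax, ay) in list1]
    assignment = hungarian(cost)
    total = sum(cost[i][j] for i, j in enumerate(assignment))
    return assignment, total

# Made-up sample data: each point of l1 has one obvious partner in l2.
l1 = [(0, 0), (5, 5), (10, 0)]
l2 = [(10, 1), (0, 1), (5, 6)]
assignment, total = match_points(l1, l2)
```

In this toy example each point of `l1` is paired with the point of `l2` directly above it, for a total distance of 3.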
I am having trouble implementing this into my current path finding algorithm.
Currently I have Dijkstra written and it works as it should, but I need to go a step further and add a limit (range). I can explain better with an image:
Let's say I have a range of 80 and I want to go from A to E. My current algorithm works as it should, so it returns A->B->E.
However, I may only use edges whose weight is not more than the range (80), which means that A->B->E is no longer an option, but A->C->D->B->E is (considering that the range/limit resets at every stop).
So far, I have implemented a bool named Possible, which tells me whether a single part of the path (e.g. A->B) is possible given my limit/range.
My main problem is that I do not know where or how to start. My only idea was to find where Possible is false (A->B on the total route A->B->E) and run the algorithm from A to E again, excluding the B stop/vertex.
Is this a good approach? As far as I understand it, this would double my big-O running time.
I see two ways of doing this:
Create a new graph G' that contains only the edges with weight <= 80, and look for the shortest path there. The reduction takes O(V+E) time and O(V+E) additional memory.
Change Dijkstra's algorithm to ignore edges > 80: just skip those edges when assigning values to neighbor vertices. The complexity and memory usage stay the same in this case.
Create a temporary version of your graph, and set all weights above the threshold to infinity. Then run the ordinary Dijkstra algorithm on it.
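Building the temporary graph adds one O(E) pass over the edges on top of your version of the algorithm:
if you have the O(V^2) version, the total becomes O(E + V^2)
if you have the O(E log V) version, the total becomes O(E + E log V)
if you have the O(E + V log V) version, it remains the same
Note that in the first two cases the extra O(E) term is dominated (E <= V^2 in a simple graph, and E <= E log V), so the asymptotic bound does not actually change; only the constant factor does.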
As noted by ArsenMkrt, you can just as well remove these edges, which makes even more sense but costs the O(V+E) time and memory needed to build the reduced graph. Modifying the algorithm to just skip those edges seems to be the best option though, as he suggested in his answer.
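To make the "skip the edges" variant concrete, here is a small Python sketch. The original picture is not available, so the graph below is invented; its weights are chosen only so that A->B exceeds the range of 80. `shortest_path_with_limit` is a hypothetical name, and the only change from textbook Dijkstra is the single `continue` for over-limit edges.

```python
import heapq

def shortest_path_with_limit(graph, start, goal, limit):
    """Dijkstra that ignores every edge heavier than `limit`.

    graph -- dict mapping a node to a list of (neighbour, weight) pairs
    Returns (path, total_weight), or (None, inf) if the goal is unreachable.
    """
    dist = {start: 0}
    prev = {}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            if weight > limit:
                continue  # this leg is out of range, so skip it
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]

# Invented undirected graph; A-B is the only edge above the range of 80.
edges = [("A", "B", 100), ("B", "E", 30), ("A", "C", 50),
         ("C", "D", 60), ("D", "B", 70)]
graph = {}
for a, b, w in edges:
    graph.setdefault(a, []).append((b, w))
    graph.setdefault(b, []).append((a, w))

unlimited = shortest_path_with_limit(graph, "A", "E", float("inf"))
limited = shortest_path_with_limit(graph, "A", "E", 80)
```

With this made-up graph the unlimited search returns A, B, E with weight 130, while the limited search returns A, C, D, B, E with weight 210. No second run and no graph copy are needed.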
I am doing some research into methods of comparing time series data. One of the algorithms that I have found being used for matching this type of data is the DTW (Dynamic Time Warping) algorithm.
The data I have resembles the following structure (this can be one path):
Path  Event  Time     Location (x,y)
1     1      2:30:02  1,5
1     2      2:30:04  2,7
1     3      2:30:06  4,4
...
...
Now, I was wondering whether there are other algorithms that would be suitable to find the closest match for the given path.
The keyword you are looking for is "(dis-)similarity measures".
Euclidean Distance (ED), as referred to by Adam Mihalcin (first answer), is easy to compute and roughly matches the everyday meaning of the word distance. Yet when comparing two time series, DTW is to be preferred, especially when applied to real-world data.
1) ED can only be applied to series of equal length. Therefore, when points are missing, ED simply is not computable (unless you also cut the other sequence, thus losing more information).
2) ED does not allow time-shifting or time-warping, as opposed to all algorithms that are based on DTW.
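To illustrate both points, here is a minimal Python sketch of the classic dynamic-programming form of DTW for 2D paths. `dtw_distance` is a hypothetical helper name, it uses Euclidean distance between individual locations as the local cost, and it ignores the timestamps entirely; a real implementation would probably add a warping window.

```python
import math

def dtw_distance(path_a, path_b):
    """Dynamic Time Warping distance between two lists of (x, y) points."""
    n, m = len(path_a), len(path_b)
    INF = float("inf")
    # cost[i][j] = best alignment cost of the first i points of path_a
    # with the first j points of path_b.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ax, ay = path_a[i - 1]
            bx, by = path_b[j - 1]
            local = math.hypot(ax - bx, ay - by)
            cost[i][j] = local + min(cost[i - 1][j],      # repeat point of path_b
                                     cost[i][j - 1],      # repeat point of path_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

path = [(1, 5), (2, 7), (4, 4)]
# Same route, but the object lingers at the first location for one sample.
delayed = [(1, 5), (1, 5), (2, 7), (4, 4)]
shifted = dtw_distance(path, delayed)
```

The two paths have different lengths, so ED cannot be applied directly, while DTW aligns the repeated sample and reports a distance of 0.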
Thus ED is not a real alternative to DTW, because its requirements and restrictions are much stricter. But to answer your question, I want to recommend this paper to you:
Time-series clustering – A decade review
Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, Teh Ying Wah
http://www.sciencedirect.com/science/article/pii/S0306437915000733
This paper gives an overview of the (dis-)similarity measures used in time series clustering. Here is a short excerpt to motivate you to actually read the paper:
If two paths are the same length, say n, then they are really points in a 2n-dimensional space. The first location determines the first two dimensions, the second location determines the next two dimensions, and so on. For example, if we just take the three points in your example, the path can be represented as the single 6-dimensional point (1, 5, 2, 7, 4, 4). If we want to compare this to another three-point path, we can compute either the Euclidean distance (square root of the sum of squares of the per-dimension differences between the two points) or the Manhattan distance (sum of the absolute per-dimension differences).
For example, the boring path that stays at (0, 0) for all three times becomes the 6-dimensional point (0, 0, 0, 0, 0, 0). Then the Euclidean distance between this point and your example path is sqrt((1-0)^2 + (5-0)^2 + (2-0)^2 + (7-0)^2 + (4-0)^2 + (4-0)^2) = sqrt(111) ≈ 10.54. The Manhattan distance is abs(1-0) + abs(5-0) + abs(2-0) + abs(7-0) + abs(4-0) + abs(4-0) = 23. This kind of difference between the metrics is not unusual, since the Manhattan distance is provably at least as great as the Euclidean distance.
Of course, one problem with this approach is that not all paths will be of the same length. However, you can easily cut off the longer path to the same length as the shorter path, or assume that the shorter of the two paths stays at the same location (or keeps moving in the same direction) after its measurements end, until both paths are the same length. Either approach will introduce some inaccuracies, but no matter what you do, you have to deal with the fact that you are missing data on the short path and have to make up for it somehow.
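As a quick check of the numbers above, and as one possible take on the "stay at the same location" option, here is a small Python sketch. `pad_to_length` and `path_distances` are hypothetical helper names; the padding simply repeats the last known location of the shorter path.

```python
import math

def pad_to_length(path, length):
    """Extend a path by repeating its last location until it has `length` points."""
    return list(path) + [path[-1]] * (length - len(path))

def path_distances(path1, path2):
    """Return (Manhattan, Euclidean) distance between two paths of (x, y) points."""
    length = max(len(path1), len(path2))
    a = pad_to_length(path1, length)
    b = pad_to_length(path2, length)
    # Flatten both paths into 2n-dimensional points and take per-dimension differences.
    diffs = [p - q for point_a, point_b in zip(a, b)
             for p, q in zip(point_a, point_b)]
    manhattan = sum(abs(d) for d in diffs)
    euclidean = math.sqrt(sum(d * d for d in diffs))
    return manhattan, euclidean

example = [(1, 5), (2, 7), (4, 4)]
boring = [(0, 0)]  # one sample only; padding makes it stay at (0, 0)
manhattan, euclidean = path_distances(example, boring)
```

Here `manhattan` is 23 and `euclidean` is sqrt(111), roughly 10.54, matching the worked example.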
EDIT:
Assuming that path1 and path2 are both List<Tuple<int, int>> objects containing the points, we can cut off the longer list to match the shorter list as:
// Enumerable.Zip stops when it finishes one of the sequences
List<Tuple<int, int, int, int>> matchingPoints = Enumerable.Zip(path1, path2,
    (tupl1, tupl2) =>
        Tuple.Create(tupl1.Item1, tupl1.Item2, tupl2.Item1, tupl2.Item2))
    .ToList(); // Zip returns an IEnumerable, so materialize it as a List
Then, you can use the following code to find the Manhattan distance:
int manhattanDistance = matchingPoints
    .Sum(tupl => Math.Abs(tupl.Item1 - tupl.Item3)
               + Math.Abs(tupl.Item2 - tupl.Item4));
With the same assumptions as for the Manhattan distance, we can generate the Euclidean distance as:
// Math.Pow returns a double, so the sum must be stored in a double
double euclideanDistanceSquared = matchingPoints
    .Sum(tupl => Math.Pow(tupl.Item1 - tupl.Item3, 2)
               + Math.Pow(tupl.Item2 - tupl.Item4, 2));
double euclideanDistance = Math.Sqrt(euclideanDistanceSquared);
There's another question here that might be of some help. If you already have a given path, you can find the closest match by using the cross-track distance algorithm; on the other hand, if you actually want to solve the pattern-recognition problem, you might want to find out more about Levenshtein distance and Elastic Matching (from Wikipedia: "Elastic matching can be defined as an optimization problem of two-dimensional warping specifying corresponding pixels between subjected images").
Imagine I want to, say, compute the first one million terms of the Fibonacci sequence using the GPU. (I realize this will exceed the precision limit of a 32-bit data type - just used as an example)
Given a GPU with 40 shaders/stream processors, and cheating by using a reference book, I can break up the million terms into 40 blocks of 25,000 terms each, and seed each shader with the two start values:
unit 0: 1,1 (which then calculates 2,3,5,8,blah blah blah)
unit 1: 25,000th term
unit 2: 50,000th term
...
How, if possible, could I go about ensuring that pixels are processed in order? If the first few pixels in the input texture have values (with RGBA for simplicity)
0,0,0,1 // initial condition
0,0,0,1 // initial condition
0,0,0,2
0,0,0,3
0,0,0,5
...
How can I ensure that I don't try to calculate the 5th term before the first four are ready?
I realize this could be done in multiple passes by setting a "ready" bit whenever a value is calculated, but that seems incredibly inefficient and sort of eliminates the benefit of performing this type of calculation on the GPU.
OpenCL/CUDA/etc probably provide nice ways to do this, but I'm trying (for my own edification) to get this to work with XNA/HLSL.
Links or examples are appreciated.
Update/Simplification
Is it possible to write a shader that uses values from one pixel to influence the values from a neighboring pixel?
You cannot determine the order in which the pixels are processed. If you could, that would break the massive pixel throughput of the shader pipelines. What you can do is calculate the Fibonacci sequence using the non-recursive (closed-form) formula.
In your question, you are actually trying to serialize the shader units to run one after another. You might as well use the CPU directly, and it will be much faster.
By the way, multiple passes aren't as slow as you might think, but they won't help you in your case. You cannot really calculate any next value without knowing the previous ones, thus killing any parallelization.
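For what it's worth, the non-recursive formula is usually taken to mean Binet's closed form, F(n) = (phi^n - psi^n) / sqrt(5). Here is a tiny Python sketch of it (not HLSL, just to show the idea); `fib_closed_form` is a made-up name, and it uses the common rounding shortcut, which is only reliable for small n in double precision.

```python
import math

SQRT5 = math.sqrt(5.0)
PHI = (1.0 + SQRT5) / 2.0  # golden ratio

def fib_closed_form(n):
    """n-th Fibonacci number with F(1) = F(2) = 1, computed without recursion.

    The psi^n term of Binet's formula is always smaller than 0.5 in
    magnitude, so rounding phi^n / sqrt(5) gives the same integer.
    Floating-point error makes this unreliable for large n.
    """
    return round(PHI ** n / SQRT5)

first_terms = [fib_closed_form(n) for n in range(1, 11)]
```

Because each term depends only on its own index n, every pixel could evaluate its term independently, with no ordering between pixels required.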
I would like to efficiently generate natural-looking positions for objects on a given surface. As you probably guessed, this is for a game. The surface is actually a 3D terrain, but the third dimension does not matter, as it is determined by the terrain height.
The problem is that I would like to do this in the most efficient and easy way, but still get good results. What I mean by "natural" is something like what is mentioned in this article about Perlin noise (trees forming forests, large to small groups spread out on the land). The approach is nice, but too complicated. I need to do this quite often, and preferably without any more textures involved, even at the cost of worse performance (so the results won't be as pretty, but still good enough to give a nice natural terrain with vegetation).
The number of objects placed varies, but is generally around 50. A nice enhancement would be to somehow restrict placement of objects in areas with very high altitude (mountains), but I guess that could be done by placing a few more objects and deleting those placed above a given altitude.
This might not be the answer you are looking for, but I believe that Perlin Noise is the solution to your problem.
Perlin Noise itself involves no textures; I believe you have a misunderstanding about what it is. For your purposes, it is basically a 2D map that gives, for each point, a value between 0 and 1. You don't need to generate any textures. See this description of it for more information and an elegant explanation. The basics of Perlin Noise involve making a few random noise maps, starting with one with very few points, with each new one having twice as many points of randomness (and lower amplitude), and adding them together.
Especially, if your map is discretely tiled, you don't even have to generate the noise at a high resolution :)
How "often" are you planning to do this? If you're going to be doing it 10+ times every single frame, then Perlin Noise might not be your answer. However, if you're doing it once every few seconds (or less), then I don't think that you should have any worries about speed impact -- at least, for 2D Perlin Noise.
Establishing that, you could look at this question and my personal answer to it, which is trying to do something very similar to what you are trying to do. The basic steps involve this:
Generate perlin noise; higher turbulence = less clumping and more isolated features.
Set a "threshold" (e.g., 0.5): anything above this threshold is considered "on" and anything below it is considered "off". Higher threshold = less frequent, lower threshold = more frequent.
Populate "on" tiles with whatever you are making.
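The three steps above can be sketched in Python roughly as follows. Note the assumptions: this is a simplified octave-summed value noise that follows the "few random maps, each with twice as many points and lower amplitude" description, not Ken Perlin's gradient noise, and every name in it (`smooth_noise_layer`, `fractal_noise`, `on_tiles`) is invented for the example.

```python
import random

def smooth_noise_layer(size, cells, rng):
    """One octave: random values on a coarse lattice, bilinearly interpolated."""
    lattice = [[rng.random() for _ in range(cells + 1)] for _ in range(cells + 1)]
    layer = []
    for y in range(size):
        gy = y * cells / size
        y0 = int(gy)
        ty = gy - y0
        row = []
        for x in range(size):
            gx = x * cells / size
            x0 = int(gx)
            tx = gx - x0
            top = lattice[y0][x0] * (1 - tx) + lattice[y0][x0 + 1] * tx
            bottom = lattice[y0 + 1][x0] * (1 - tx) + lattice[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        layer.append(row)
    return layer

def fractal_noise(size, octaves=4, seed=0):
    """Sum several octaves; each has twice the detail and half the amplitude."""
    rng = random.Random(seed)
    total = [[0.0] * size for _ in range(size)]
    amplitude, cells, norm = 1.0, 2, 0.0
    for _ in range(octaves):
        layer = smooth_noise_layer(size, cells, rng)
        for y in range(size):
            for x in range(size):
                total[y][x] += amplitude * layer[y][x]
        norm += amplitude
        amplitude /= 2
        cells *= 2
    # Normalize back into the 0..1 range.
    return [[value / norm for value in row] for row in total]

def on_tiles(noise, threshold):
    """Tiles whose noise value is above the threshold get an object."""
    return [(x, y) for y, row in enumerate(noise)
            for x, value in enumerate(row) if value > threshold]

noise = fractal_noise(50)          # 50x50 tile map
forest = on_tiles(noise, 0.5)      # candidate tree positions
```

Raising the threshold passed to `on_tiles` leaves fewer, smaller clumps; lowering it gives bigger ones. Nothing here touches a texture, and for 50x50 tiles the cost is a few thousand multiplications.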
Here are some samples of using Perlin Noise to generate a 50x50 tile-based map. Note that the only difference between the two is the "threshold". Bigger clumps mean a lower threshold, smaller clumps mean a higher one.
A forest, with blue trees and brown undergrowth
A marsh, with deep areas surrounded by shallower areas
Note you'll have to tweak the constants a bit, but you could do something like this:
First, pick a random point (say 24,50).
Next, identify points of interest for this object. If it's a rock, your points might be the two mountains at 15,13 and 50,42. If it's a forest, you might compute some metric to find the "center" of a couple of local forests.
Next, calculate the distance vectors between the point and the points of interest, and scale them by some constant.
Now, add all those vectors to the point.
Next determine if the object is in a legal position. If it is, move to the next object. If it's not, repeat the process.
Adapt as necessary. :-)
One thing: if you want to reject things like trees on mountains, you don't add extra tries; instead, you keep trying to place an object until you find a suitable location, or until you've tried a bunch of times and need to bail out because it doesn't look placeable.
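Putting the steps and the bail-out rule together, a rough Python sketch could look like the following. Everything specific in it is an assumption for illustration: the `place_object` name, the `pull` constant, the map size, and the legality rule (here, a pretend mountain range covering everything with x >= 40).

```python
import random

def place_object(points_of_interest, is_legal, width, height,
                 pull=0.3, max_tries=20, rng=random):
    """Try to place one object; return (x, y), or None if we had to bail out."""
    for _ in range(max_tries):
        # Step 1: pick a random point.
        x = rng.uniform(0, width)
        y = rng.uniform(0, height)
        # Steps 2-4: add the scaled vectors towards each point of interest.
        dx = sum((px - x) * pull for px, _ in points_of_interest)
        dy = sum((py - y) * pull for _, py in points_of_interest)
        x, y = x + dx, y + dy
        # Step 5: keep the position only if it is on the map and legal.
        if 0 <= x <= width and 0 <= y <= height and is_legal(x, y):
            return (x, y)
    return None  # does not look placeable, so give up on this object

# Hypothetical setup: rocks are attracted to the two mountains from the
# example, and anything with x >= 40 counts as too high to place on.
mountains = [(15, 13), (50, 42)]
rng = random.Random(1)
spot = place_object(mountains, lambda x, y: x < 40, 64, 64, rng=rng)
nowhere = place_object(mountains, lambda x, y: False, 64, 64, rng=rng)
```

With two points of interest and `pull=0.3`, each candidate moves 60% of the way towards their midpoint; keep the sum of the pulls below 1 to avoid overshooting. For around 50 objects the retry loop is negligible.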