I am doing some research into methods of comparing time series data. One of the algorithms that I have found being used for matching this type of data is the DTW (Dynamic Time Warping) algorithm.
The data I have resembles the following structure (this can be one path):
Path  Event  Time     Location (x,y)
1     1      2:30:02  1,5
1     2      2:30:04  2,7
1     3      2:30:06  4,4
...
Now, I was wondering whether there are other algorithms that would be suitable to find the closest match for the given path.
The keyword you are looking for is "(dis-)similarity measures".
Euclidean Distance (ED), as referred to by Adam Mihalcin (first answer), is easy to compute and roughly reflects the natural understanding of the word "distance" in everyday language. Yet when comparing two time series, DTW is to be preferred, especially when applied to real-world data.
1) ED can only be applied to series of equal length. Therefore, when points are missing, ED simply is not computable (unless you also cut the other sequence, thus losing even more information).
2) ED does not allow time-shifting or time-warping, as opposed to the algorithms that are based on DTW.
Thus ED is not a real alternative to DTW, because its requirements and restrictions are much stricter. But to answer your question, I want to recommend this paper to you:
Time-series clustering – A decade review
Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, Teh Ying Wah
http://www.sciencedirect.com/science/article/pii/S0306437915000733
This paper gives an overview of the (dis-)similarity measures used in time series clustering; it is well worth reading in full.
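To make the comparison concrete, here is a minimal DTW sketch in C# (my own illustration, not taken from the paper), assuming each path is a list of (x, y) points and using the Euclidean distance between individual points as the local cost:
using System;
using System.Collections.Generic;

static class Dtw
{
    // Classic O(n*m) dynamic-programming DTW between two 2D paths.
    public static double Distance(IList<Tuple<int, int>> a, IList<Tuple<int, int>> b)
    {
        int n = a.Count, m = b.Count;
        var cost = new double[n + 1, m + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                cost[i, j] = double.PositiveInfinity;
        cost[0, 0] = 0.0;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                // Local cost: Euclidean distance between the two points.
                double dx = a[i - 1].Item1 - b[j - 1].Item1;
                double dy = a[i - 1].Item2 - b[j - 1].Item2;
                double d = Math.Sqrt(dx * dx + dy * dy);

                // Extend the cheapest of the three allowed warping steps.
                cost[i, j] = d + Math.Min(cost[i - 1, j],           // repeat a point of b
                                 Math.Min(cost[i, j - 1],           // repeat a point of a
                                          cost[i - 1, j - 1]));     // advance both
            }
        }
        return cost[n, m];
    }
}
Unlike ED, this works for paths of different lengths and tolerates local time shifts, which is exactly the point made above.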
If two paths are the same length, say n, then they are really points in a 2n-dimensional space. The first location determines the first two dimensions, the second location determines the next two dimensions, and so on. For example, if we just take the three points in your example, the path can be represented as the single 6-dimensional point (1, 5, 2, 7, 4, 4). If we want to compare this to another three-point path, we can compute either the Euclidean distance (square root of the sum of squares of per-dimension distances between the two points) or the Manhattan distance (sum of the absolute per-dimension differences).
For example, the boring path that stays at (0, 0) for all three times becomes the 6-dimensional point (0, 0, 0, 0, 0, 0). Then the Euclidean distance between this point and your example path is sqrt((1-0)^2 + (5-0)^2 + (2-0)^2 + (7-0)^2 + (4-0)^2 + (4-0)^2) = sqrt(111) = 10.54. The Manhattan distance is abs(1-0) + abs(5-0) + abs(2-0) + abs(7-0) + abs(4-0) + abs(4-0) = 23. This kind of a difference between the metrics is not unusual, since the Manhattan distance is provably at least as great as the Euclidean distance.
Of course, one problem with this approach is that not all paths will be of the same length. However, you can easily cut off the longer path to the same length as the shorter path, or assume that the shorter of the two paths stays at its last location, or keeps moving in the same direction, after its measurements end, until both paths are the same length. Either approach will introduce some inaccuracies, but no matter what you do you have to deal with the fact that you are missing data on the short path and have to make up for it somehow.
EDIT:
Assuming that path1 and path2 are both List<Tuple<int, int>> objects containing the points, we can cut off the longer list to match the shorter list as:
// Enumerable.Zip stops when it reaches the end of the shorter sequence,
// so the longer path is cut off automatically; ToList() materialises the result.
List<Tuple<int, int, int, int>> matchingPoints = Enumerable
    .Zip(path1, path2, (tupl1, tupl2) =>
        Tuple.Create(tupl1.Item1, tupl1.Item2, tupl2.Item1, tupl2.Item2))
    .ToList();
Then, you can use the following code to find the Manhattan distance:
int manhattanDistance = matchingPoints
    .Sum(tupl => Math.Abs(tupl.Item1 - tupl.Item3)
               + Math.Abs(tupl.Item2 - tupl.Item4));
With the same assumptions as for the Manhattan distance, we can generate the Euclidean distance as:
// Math.Pow returns double, so the squared distance must be a double as well.
double euclideanDistanceSquared = matchingPoints
    .Sum(tupl => Math.Pow(tupl.Item1 - tupl.Item3, 2)
               + Math.Pow(tupl.Item2 - tupl.Item4, 2));
double euclideanDistance = Math.Sqrt(euclideanDistanceSquared);
There's another question here that might be of some help. If you already have a given path, you can find the closest match by using the cross-track distance algorithm; on the other hand, if you actually want to solve the pattern-recognition problem, you might want to find out more about Levenshtein distance and Elastic Matching (from Wikipedia: "Elastic matching can be defined as an optimization problem of two-dimensional warping specifying corresponding pixels between subjected images").
Related
Consider the following 2 dimensional jagged array
[0,0] [0,1] [0,2]
[1,0] [1,1]
[2,0] [2,1] [2,2]
Let's say I want to know all the elements that fall within a certain range, for example [0,0]-[2,0]. I would like a list of all of those elements. What would be the best approach for achieving this, and are there any pre-existing algorithms that do it?
I have attempted to implement this in C# but didn't get much further than some for loops.
The example below provides further detail on what I would like to achieve.
Using the array defined above, let's say the start index of the range is [0,0] and the end index is [2,1].
I would like to create a method that returns the values of all the indexes that fall within this range.
Expected results would be for the method to return the stored values for the following indexes:
[0,0] [0,1] [0,2] [1,0] [1,1] [2,0] [2,1]
If the 2D array is "sorted", meaning that the y value increases as you go from left to right within each 1D array and the x value increases as you go from top to bottom, you can find the first point and the last point that you need to report using binary searches in O(logn) total time, and after that report every point between those two points in O(k), where k is the number of points you need to report (notice that the time complexity will be Omega(k) for any algorithm).
If the 2D array is not sorted and you just want to output all the pairs between pair A and pair B:
should_print = False
should_stop = False
for i in range(len(array_2d)):
    for j in range(len(array_2d[i])):
        should_print = should_print or (array_2d[i][j] == A)
        if should_print:
            print(array_2d[i][j])
            should_stop = (array_2d[i][j] == B)
        if should_stop:
            break
    if should_stop:
        break
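Since the question mentions a C# attempt, here is the same scan as a C# sketch (my own translation of the Python above; A and B are again the boundary values, and the array is assumed to be a jagged int[][]):
using System.Collections.Generic;

static class RangeScan
{
    // Collects all values from the first occurrence of A up to and including
    // the first occurrence of B, scanning the jagged array row by row.
    public static List<int> ValuesBetween(int[][] array2d, int A, int B)
    {
        var result = new List<int>();
        bool shouldCollect = false;
        foreach (int[] row in array2d)
        {
            foreach (int value in row)
            {
                shouldCollect = shouldCollect || value == A;
                if (shouldCollect)
                {
                    result.Add(value);
                    if (value == B)
                        return result;   // stop once the end of the range is reached
                }
            }
        }
        return result;
    }
}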
If you just have n general 2D points and you wish to answer the query "find me all the points in a given rectangle", there are 2 data structures that can help you - kd-trees and range trees. These 2 data structures give you a good query time, but they are a bit complicated. I am not sure what your current level is, but if you are just starting to get into data structures and algorithms, they are probably overkill.
Edit (a bit about range trees and kd trees):
First I'll explain the basic concept behind range trees. Let's start by trying to answer the range query for 1D points. This is easy - just build a BST (balanced search tree), and with it you can answer queries in O(logn + k), where k is the number of points being reported. It takes O(nlogn) time to build the BST and it takes O(n) space.
Now, let us try to take this solution and make it work in 2D. We will build a BST on the x coordinates of the points. For each node t in the BST, denote the set of all points in the subtree of t by sub(t). Now, for every node t we will build a BST on the y coordinates of the points in sub(t).
Now, given a range, we first find all the subtrees contained in the x range using the first BST, and for each such subtree we find all the points contained in the y range (note that the BST built for sub(t) is stored at the node t).
A query takes O(log^2n) time. Building the DS takes O(nlog^2n) time and finally, it takes O(nlogn) space. I'll let you prove these statements. With more work the query time can be reduced to O(logn) and the building time can be reduced to O(nlogn). You can read about it here: http://www.cs.uu.nl/docs/vakken/ga/2021/slides/slides5b.pdf.
Now a word about kd-trees. The idea is to split the 2D space down the middle with a vertical line, then split each side down the middle with a horizontal line, and so on. The query time in this DS is O(sqrt(n) + k), the building time is O(nlogn) and the space it takes is O(n). You can read more about this DS here: http://www.cs.uu.nl/docs/vakken/ga/2021/slides/slides5a.pdf
I'm trying to find the fastest and easiest way in a C# program to calculate the intersections of two circles. From what I can tell there are two possible methods, and you'll have to forgive me for not knowing the official names for them.
We're assuming you know the center points for both circles and their exact radii, from which you can calculate the distance between them, so all that is missing are the point(s) of intersection. This is taking place on a standard x-y plot.
The first is a kind of substitution method like the one described here, where you combine the two circle formulas and isolate either x or y, then substitute it back into one of the original formulas to end up with a quadratic equation that can be solved for two (or possibly one or no) coordinates on one axis, which then lets you find the corresponding coordinates on the other axis.
The second I have seen a reference to is using a Law of Cosines method to determine the angles, which would then let you plot a line for each side on the grid, and put in your radius to find the actual intersection point.
I have written out the steps for the first method, and it seems rather lengthy. The second one is going to take some research/learning to write out, but it sounds simpler. What I've never done is translate processes like this into code, so I don't know which one will ultimately be the easiest for that purpose. Does anyone have advice on that? Or am I perhaps going about this in completely the wrong way? Is there a library already out there that I can use instead of reinventing the wheel?
Some context: I'm worried mainly about the cost to the CPU to do these calculations. I plan on the application doing a heck of a lot of them at once, repeatedly, hence why I want the simplest way to accomplish it.
Computational geometry is almost always a pain to implement. It's also almost always quite CPU-intensive. That said, this problem is just algebra if you set it up right.
Compute d = hypot(x2-x1, y2-y1), the distance between the two centres. If r1 + r2 < d, there is no intersection. If r1+r2 == d, the intersection is at (x1, y1) + r1/(r1+r2) * (x2-x1,y2-y1). If d < abs(r1-r2), one circle is contained in the other and there is no intersection. You can work out the case where the two circles are tangent and one is contained in the other. I will only deal with the remaining case.
You want to find distances h orthogonal to (x2-x1,y2-y1) and p parallel to (x2-x1,y2-y1) so that p^2 + h^2 = r1^2 and (d-p)^2 + h^2 = r2^2. Subtract the two equations to get a linear equation in p: d^2-2dp = r2^2-r1^2. Solve this linear equation for p. Then h = sqrt(r1^2 - p^2).
The coordinates of the two points are (x1,y1) + p (x2-x1,y2-y1) / d +/- h (y2-y1, x1-x2) / d. If you work through the derivation above and solve for p/d and h/d instead, you may get something that does fewer operations.
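Here is a C# sketch of that construction (my own code, not an existing library routine); it returns zero, one, or two intersection points:
using System;
using System.Collections.Generic;

static class Circles
{
    // Intersections of circle 1 (centre x1,y1, radius r1) and circle 2 (centre x2,y2, radius r2),
    // following the p / h construction described above.
    public static List<(double X, double Y)> Intersect(
        double x1, double y1, double r1, double x2, double y2, double r2)
    {
        var points = new List<(double X, double Y)>();
        double dx = x2 - x1, dy = y2 - y1;
        double d = Math.Sqrt(dx * dx + dy * dy);

        // Too far apart, one circle inside the other, or coincident centres: nothing to report.
        if (d > r1 + r2 || d < Math.Abs(r1 - r2) || d == 0.0)
            return points;

        // p: distance from centre 1 towards centre 2; h: offset orthogonal to that line.
        double p = (d * d + r1 * r1 - r2 * r2) / (2.0 * d);
        double h = Math.Sqrt(Math.Max(0.0, r1 * r1 - p * p));

        double baseX = x1 + p * dx / d;
        double baseY = y1 + p * dy / d;

        points.Add((baseX + h * dy / d, baseY - h * dx / d));
        if (h > 0.0)
            points.Add((baseX - h * dy / d, baseY + h * dx / d));
        return points;
    }
}
The tangent cases fall out of h being (numerically) zero; with real-world floating-point inputs you would compare against a small epsilon rather than relying on exact equality.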
I'm measuring some system performance data to store it in a database. From those data points I'm drawing line graphs over time. By their nature, those data points are a bit noisy, i.e. every single point deviates at least a bit from the local mean value. When drawing the line graph straight from one point to the next, it produces jagged graphs. At a large time scale like > 10 data points per pixel, this noise is compressed into a wide jagged line area that is, say, 20px high instead of 1px as at smaller scales.
I've read about line smoothing, anti-aliasing, simplifying and all these things. But everything I've found seems to be about something else.
I don't need anti-aliasing, .NET already does that for me when drawing the line on the screen.
I don't want simplification. I need the extreme values to remain visible, at least most of them.
I think it goes in the direction of spline curves, but I couldn't find many example images to evaluate whether that is what I want. I did find a highly scientific book on Google Books though, full of half-page-long formulas, which I didn't feel like reading through right now...
To give you an example, just look at Linux/GNOME's system monitor application. It draws the recent CPU/memory/network usage with a smoothed line. This may be a bit oversimplified, but I'd give it a try and see if I can tweak it.
I'd prefer C# code but algorithms or code in other languages is fine, too, as long as I can port it to C# without external references.
You can do some data smoothing. Instead of using the real data, apply a simple smoothing algorithm that preserves the peaks, such as a Savitzky-Golay filter.
You can get the coefficients here.
The easiest thing to do is to take the top coefficients from the website I linked to:
// np = 5 data points; the site lists half of the symmetric coefficient set: 17, 12, -3
var h = 35.0;                                          // normalisation factor for np = 5
var easyCoeff = new double[] { -3, 12, 17, 12, -3 };   // mirrored, it's symmetrical
var center = 2;                                        // index of the centre of easyCoeff

// now for every point of your data (except the first and last two)
// you calculate a smoothed point:
smoothed[x] =
    ((data[x - 2] * easyCoeff[center - 2]) +
     (data[x - 1] * easyCoeff[center - 1]) +
     (data[x - 0] * easyCoeff[center - 0]) +
     (data[x + 1] * easyCoeff[center + 1]) +
     (data[x + 2] * easyCoeff[center + 2])) / h;
The first 2 and the last 2 points cannot be smoothed when using 5 points.
If you want your data to be more "smoothed", you can experiment with coefficients for more data points.
Now you can draw a line through your "smoothed" data. The larger np (the number of points), the smoother your data. You also lose some peak accuracy, but not as much as when simply averaging points together.
You cannot fix this in the graphics code. If your data is noisy then the graph is going to be noisy as well, no matter what kind of line smoothing algorithm you use. You'll need to filter the data first. Create a second data set with points that are interpolated from the original data. A Least Squares fit is a common technique. Averaging is simple to implement but tends to hide extremes.
I think what you are looking for is a routine to provide 'splines'. Here is a link describing splines:
http://en.wikipedia.org/wiki/Spline_(mathematics)
If that is the case I don't have any recommendations for a spline library, but an initial google search turned up a bunch.
Sorry for no code, but hopefully knowing the terminology will aid you in your search.
Bob
Reduce the number of data points using MIN/MAX/AVG before you display them. It'll look nicer and it'll be faster.
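A minimal C# sketch of that idea (the names are my own), collapsing each bucket of samples into its min, max, and average before plotting:
using System.Linq;

static class Downsample
{
    // Collapse each run of bucketSize samples into (min, max, avg) for plotting.
    public static (double Min, double Max, double Avg)[] Buckets(double[] samples, int bucketSize)
    {
        return samples
            .Select((value, index) => new { value, index })
            .GroupBy(p => p.index / bucketSize)
            .Select(g => (g.Min(p => p.value), g.Max(p => p.value), g.Average(p => p.value)))
            .ToArray();
    }
}
Drawing the min/max band plus the average line keeps the extremes visible while removing the per-pixel jitter.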
Graphs of network traffic often use a weighted average. You can sample once per second into a circular list of length 10 and for the graph, at each sample, graph the average of the samples.
If 10 isn't enough you can store many more. You don't need to recalculate the average from scratch, either:
new_average = (old_average*10 - replaced_sample + new_sample)/10
If you don't want to store all 10, however, you can approximate with this:
new_average = old_average*9/10 + new_sample/10
Lots of routers use this to save on storage. This ramps toward the current traffic rate exponentially.
If you do implement this, do something like this:
new_average = old_average*min(9,number_of_samples)/10 + new_sample/10
number_of_samples++
to avoid the initial ramp-up. You should also adjust the 9/10, 1/10 ratio to actually reflect the time period of each sample, because your timer won't fire exactly once per second.
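Here is a small C# sketch of that running average (the class and names are my own; I've read the start-up adjustment as dividing by min(9, samples) + 1 rather than a fixed 10, so the very first sample is not shrunk):
using System;

class RunningAverage
{
    // Exponentially weighted running average: once 10 samples have been seen this is
    // new_average = old_average * 9/10 + new_sample/10; before that, the weight ramps up.
    private double average;
    private int numberOfSamples;

    public double Add(double newSample)
    {
        int weight = Math.Min(9, numberOfSamples);
        average = (average * weight + newSample) / (weight + 1);
        numberOfSamples++;
        return average;
    }
}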
I have very little data for my analysis, and so I want to produce more data for analysis through interpolation.
My dataset contains 23 independent attributes and 1 dependent attribute. How can interpolation be done on this?
EDIT:
My main problem is a shortage of data; I have to increase the size of my dataset. The attributes are categorical - for example, attribute A may be low, medium, or high. Is interpolation the right approach for this or not?
This is a mathematical problem, but there is too little information in the question to answer it properly. Depending on the distribution of your real data, you may try to find a function that it follows. You could also try to interpolate the data using an artificial neural network, but that would be complex. The thing is that to find interpolations you need to analyze the data you already have, and that defeats the purpose. There is probably more to this problem than is explained here. What is the nature of the data? Can you place it in n-dimensional space? What do you expect to get from the analysis?
Roughly speaking, to interpolate an array:
double[] data = LoadData();
double requestedIndex = /* set to the index you want - e.g. 1.25 to interpolate between values at data[1] and data[2] */;
int previousIndex = (int)requestedIndex; // in example, would be 1
int nextIndex = previousIndex + 1; // in example, would be 2
double factor = requestedIndex - (double)previousIndex; // in example, would be 0.25
// in example, this would give 75% of data[1] plus 25% of data[2]
double result = (data[previousIndex] * (1.0 - factor)) + (data[nextIndex] * factor);
This is really pseudo-code; it doesn't perform range-checking, assumes your data is in an object or array with an indexer, and so on.
Hope that helps to get you started - any questions please post a comment.
If the 23 independent variables are sampled on a hyper-grid (regularly spaced), then you can choose to partition the space into hyper-cubes and do linear interpolation of the dependent value from the vertex closest to the origin, along the vectors defined from that vertex along the hyper-cube edges away from the origin. In general, for a given partitioning, you project the interpolation point onto each vector, which gives you a new 'coordinate' in that particular space; this can then be used to compute the new value by multiplying each coordinate by the difference of the dependent variable along that edge, summing the results, and adding the dependent value at the local origin. For hyper-cubes, this projection is straightforward (you simply subtract the position of the vertex closest to the origin).
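To make the hyper-cube scheme concrete in the two-dimensional case, here is a bilinear interpolation sketch in C# (my own illustration; grid[i][j] is assumed to hold the dependent value at regularly spaced sample points, and no bounds checking is done):
using System;

static class GridInterp
{
    // Bilinear interpolation on a regular grid: (x, y) are fractional grid coordinates,
    // e.g. (1.25, 3.5) lies inside the cell with corners (1,3), (2,3), (1,4), (2,4).
    public static double Bilinear(double[][] grid, double x, double y)
    {
        int i = (int)Math.Floor(x), j = (int)Math.Floor(y);
        double fx = x - i, fy = y - j;   // position inside the cell, each in [0, 1)

        double v00 = grid[i][j],     v10 = grid[i + 1][j];
        double v01 = grid[i][j + 1], v11 = grid[i + 1][j + 1];

        // Interpolate along x on both edges of the cell, then along y between the results.
        double a = v00 * (1 - fx) + v10 * fx;
        double b = v01 * (1 - fx) + v11 * fx;
        return a * (1 - fy) + b * fy;
    }
}
The same pattern extends to more dimensions, but the number of corner values doubles with each added dimension.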
If your samples are not uniformly spaced, then the problem is much more challenging, as you would need to choose an appropriate partitioning if you wanted to perform linear interpolation. In principle, Delaunay triangulation generalizes to N dimensions, but it's not easy to do and the resulting geometric objects are a lot harder to understand and interpolate than a simple hyper-cube.
One thing you might consider is if your data set is naturally amenable to projection so that you can reduce the number of dimensions. For instance, if two of your independent variables dominate, you can collapse the problem to 2-dimensions, which is much easier to solve. Another thing you might consider is taking the sampling points and arranging them in a matrix. You can perform an SVD decomposition and look at the singular values. If there are a few dominant singular values, you can use this to perform a projection to the hyper-plane defined by those basis vectors and reduce the dimensions for your interpolation. Basically, if your data is spread in a particular set of dimensions, you can use those dominating dimensions to perform your interpolation, since you don't really have much information in the other dimensions anyway.
I agree with the other commentators, however, that your premise may be off. You generally don't want to interpolate to perform analysis, as you're just choosing to interpolate your data in different ways and the choice of interpolation biases the analysis. It only makes sense if you have a compelling reason to believe that a particular interpolation is physically consistent and you simply need additional points for a particular algorithm.
May I suggest Cubic Spline Interpolation?
http://www.coastrd.com/basic-cubic-spline-interpolation
Unless you have a very specific need, this is easy to implement and calculates splines well.
Have a look at the regression methods presented in The Elements of Statistical Learning; most of them can be tested in R. There are plenty of models that can be used: linear regression, local models and so on.
I have implemented a basic Karplus-Strong algorithm.
Ring buffer, filled with white noise; output a sample from the front, append the average of the first two elements to the end, and delete the first element. Repeat the last two steps.
For better results and more control over them, I tried to implement an extended version of the algorithm.
Therefore, instead of an averaging filter I need a frequency filter such as a low-pass filter.
My averaging filter has two inputs and one output: avg(a,b) = (a+b)/2
The sample code on the wikipedia page gives as many outputs as inputs.
http://en.wikipedia.org/wiki/Low-pass_filter
I have found other (mathematic) versions like:
http://cnx.org/content/m15490/latest/
H(z) = (1+(1/z))/2
I guess z is a complex number.
Both versions have two inputs but also two outputs.
How do I get one meaningful value out of this?
Or do I have to rewrite bigger parts of the algorithm?
If that's the case, where can I find a good explanation of it?
Your filter is a specialization of the Finite Impulse Response filter. You're using the moving average method to select the coefficients, using N = 1. It already is a low-pass filter.
Calculating the coefficients and order of the filter to tune it to a specific frequency response involves tricky math. The best thing to do is to use a software package to calculate the coefficients if a moving average doesn't fit the bill. MATLAB is the usual choice; GNU Octave is an open-source option.
Filters can expressed in a number of ways:
On the complex plane, your example H(z) = (1+(1/z))/2
As a filter, y[i] = h[0]*x[i] + h[1]*x[i-1] + h[2]*x[i-2] + ...
In the frequency domain, Y[f] = H[f] * X[f]
The second of these is actually a convolution of the h and x arrays. This is also the easiest to understand.
The previous answer explained where to start on constructing a filter. Assuming you have your filter coefficients, the h's, applying the filter is then simply that sum, taken over the terms whose indices are non-negative.
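As a quick sketch (my own code, matching the second form above), that looks like:
// Direct-form FIR: y[i] = h[0]*x[i] + h[1]*x[i-1] + ...  (missing earlier samples treated as 0)
static double[] ApplyFir(double[] x, double[] h)
{
    var y = new double[x.Length];
    for (int i = 0; i < x.Length; i++)
    {
        double sum = 0.0;
        for (int k = 0; k < h.Length && k <= i; k++)
            sum += h[k] * x[i - k];
        y[i] = sum;
    }
    return y;
}
With h = { 0.5, 0.5 } this reduces to the original avg(a, b) = (a + b)/2 averaging filter, i.e. H(z) = (1 + 1/z)/2.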
I believe I see what you are asking, though you do not need more than one output. From the Wikipedia page, the Karplus-Strong string synthesis algorithm needs a buffer of length L. If we have M filter coefficients (h), that gives an output of the form,
y[i] = x[i] + h[0]*y[i-L] + h[1]*y[i-(L+1)] + h[2]*y[i-(L+2)] + ...
The Karplus-Strong synthesis from here uses a ring buffer to hold the last L outputs, y[i-1],...,y[i-L]. This is initialised to the x[i] noise values for i<=L; for i>L, x[i]=0. The algorithm is space-efficient, as you only store L values. The signal x[i] for i>L is simply added into the ring buffer.
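Putting this together, here is a sketch of the Karplus-Strong loop in C# (my own illustration) using the simple two-tap average as the feedback filter; swapping in a longer set of FIR coefficients h gives the extended version discussed above:
using System;

static class KarplusStrong
{
    // Generate `length` samples using a ring buffer of size L seeded with white noise.
    public static double[] Pluck(int L, int length, int seed = 0)
    {
        var rng = new Random(seed);
        var ring = new double[L];
        for (int i = 0; i < L; i++)
            ring[i] = rng.NextDouble() * 2.0 - 1.0;   // white noise in [-1, 1)

        var output = new double[length];
        int pos = 0;
        for (int i = 0; i < length; i++)
        {
            double current = ring[pos];
            double next = ring[(pos + 1) % L];
            output[i] = current;

            // Feedback: the filtered value replaces the sample just read,
            // which is equivalent to appending it to the end of the buffer.
            ring[pos] = 0.5 * (current + next);
            pos = (pos + 1) % L;
        }
        return output;
    }
}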
Finally, as a note of warning: if you are not careful with both the number of coefficients h and their values, the outputs y may not have the desired behaviour.