I've found similar posts about this before, but nothing really answers the question.
In my fingerprinting, I produce a recordset which has 5 integers. For example:
33,42,88,121,194
These correspond to the frequencies which have the highest magnitude for a particular sample of music.
E.g. for a 30 ms audio sample I have buckets of the following frequencies:
0-40
40-80
80-120
120-180
180-250
I'm trying to produce a hash (a forgiving one) which will ideally produce the same hash for
33,42,88,121,194 as it would for say
33,43,88,122,195
i.e. where there are minor differences in the frequencies, a similar hash would be formed.
First off, is this LSH? I have read that this is best for audio fingerprinting.
If not, could anyone provide some pseudocode or C# for a function that might do what I'm looking for? I have read up on LSH and found MATLAB and Perl implementations, but I don't understand them, so posting a link to them won't really help me much.
Thanks again!
This might be a duplicate of: Compare two spectogram to find the offset where they match algorithm. What it appears you are trying to do is produce a histogram of the rough distribution of the peaks in the sample. There are several methods to do this.
One method of doing this is to use a Fast Fourier Transform of the peak data and its distribution (over time) to produce a rough equivalent of the sample in a distilled form. To do this, you do something roughly like the following:
Divide the sample into discrete parts (say, 1 second each)
For each part, develop a fingerprint that approximates the sample (say, taking 5-7 high and low peaks, normalizing them, and then hashing them; see the sketch after these steps)
You can now either keep each fingerprint individually (in a collection) or run a transform over the sequence to generate a single fingerprint, depending on your needs. Mostly you would just append the sequences together to get a linear fingerprint in 1-second intervals.
To compare the fingerprints, you run the same process over the second sample and then use a diff algorithm to compare the two, using some "fuzz" to decide how close they are. You will need to compare the fingerprints on two dimensions: the order of the discrete fingerprints, as well as the overall difference in each sample.
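As a rough illustration of the quantize-and-hash idea above (a sketch only, not a canonical method), here is a small C# example that snaps each peak to a coarse step before packing the peaks into one value, so that small frequency jitters usually collapse to the same hash:

using System;

static class PeakHash
{
    // "Forgiving" hash: quantize each peak to a coarse step (here 4 Hz) so that small
    // differences (121 vs 122, 194 vs 195) collapse to the same value, then pack the
    // quantized peaks into a single long. Tune the step to control the fuzziness.
    public static long Hash(int[] peaks, int step = 4)
    {
        long hash = 0;
        foreach (int p in peaks)
            hash = hash * 64 + (p / step);   // 64 slots per peak covers 0-250 Hz at step 4
        return hash;
    }

    static void Main()
    {
        long a = Hash(new[] { 33, 42, 88, 121, 194 });
        long b = Hash(new[] { 33, 43, 88, 122, 195 });
        Console.WriteLine(a == b);           // True: the two samples collide on purpose
    }
}

Note that peaks straddling a quantization boundary (say 39 vs 40) will still hash differently, which is why fingerprinting systems typically store several hashes per sample or fall back to the fuzzy comparison described above.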
This article on making a rough Java equivalent of Shazam was posted a while ago: http://www.redcode.nl/blog/2010/06/creating-shazam-in-java/ and may be of some help to you.
What are the best clustering algorithms to use in order to cluster data with more than 100 dimensions (sometimes even 1000)? I would appreciate it if you know of any implementation in C, C++ or especially C#.
It depends heavily on your data. See curse of dimensionality for common problems. Recent research (Houle et al.) showed that you can't really go by the numbers. There may be thousands of dimensions and the data clusters well, and of course there is even one-dimensional data that just doesn't cluster. It's mostly a matter of signal-to-noise.
This is why for example clustering of TF-IDF vectors works rather well, in particular with cosine distance.
But the key point is that you first need to understand the nature of your data. You then can pick appropriate distance functions, weights, parameters and ... algorithms.
In particular, you also need to know what constitutes a cluster for you. There are many definitions, in particular for high-dimensional data. They may be in subspaces, they may or may not be arbitrarily rotated, and they may or may not overlap (k-means, for example, doesn't allow overlaps or subspaces).
Well, I know of something called vector quantization; it's a nice algorithm for clustering stuff with many dimensions.
I've used k-means on data with hundreds of dimensions. It is very common, so I'm sure there's an implementation in any language; worst case scenario, it is very easy to implement yourself.
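To back up the "easy to implement yourself" point, here is a bare-bones k-means sketch in C# using squared Euclidean distance and random initialization (the names are illustrative; for hundreds of dimensions you would likely combine it with the dimensionality reduction suggested below):

using System;
using System.Linq;

static class KMeans
{
    // Minimal k-means: returns the cluster index of each point.
    // data[i] is one point with an arbitrary number of dimensions.
    public static int[] Cluster(double[][] data, int k, int maxIterations = 100)
    {
        var rnd = new Random(0);
        int dims = data[0].Length;
        // Initialize centroids with k randomly chosen points.
        double[][] centroids = data.OrderBy(_ => rnd.Next()).Take(k)
                                   .Select(p => (double[])p.Clone()).ToArray();
        int[] assignment = new int[data.Length];

        for (int iter = 0; iter < maxIterations; iter++)
        {
            bool changed = false;

            // Assignment step: each point goes to its nearest centroid (squared Euclidean).
            for (int i = 0; i < data.Length; i++)
            {
                int best = 0;
                double bestDist = double.MaxValue;
                for (int c = 0; c < k; c++)
                {
                    double d = 0;
                    for (int j = 0; j < dims; j++)
                    {
                        double diff = data[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break;

            // Update step: move each centroid to the mean of its assigned points.
            for (int c = 0; c < k; c++)
            {
                var members = Enumerable.Range(0, data.Length)
                                        .Where(i => assignment[i] == c).ToList();
                if (members.Count == 0) continue;      // leave empty clusters where they are
                for (int j = 0; j < dims; j++)
                    centroids[c][j] = members.Average(i => data[i][j]);
            }
        }
        return assignment;
    }
}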
It might also be worth trying some dimensionality reduction techniques like Principal Component Analysis or an auto-associative neural net before you try to cluster it. It can turn a huge problem into a much smaller one.
After that, go with k-means or a mixture of Gaussians.
The EM-tree and K-tree algorithms in the LMW-tree project can cluster high-dimensional problems like this. It is implemented in C++ and supports many different representations.
We have novel algorithms for clustering binary vectors created by LSH / random projections, or anything else that emits binary vectors that can be compared via Hamming distance for similarity.
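Since the comparison there boils down to Hamming distance on packed bit vectors, here is a tiny C# sketch of that operation (assumes .NET Core 3.0 or later for BitOperations.PopCount):

using System.Numerics;

static class Hamming
{
    // Hamming distance between two binary fingerprints packed into ulong words.
    // A smaller distance means more similar vectors (e.g. LSH / random-projection codes).
    public static int Distance(ulong[] a, ulong[] b)
    {
        int dist = 0;
        for (int i = 0; i < a.Length; i++)
            dist += BitOperations.PopCount(a[i] ^ b[i]);  // count differing bits per word
        return dist;
    }
}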
I want to make a program that will determine the rank of a given matrix in a C# console application, but I am not able to come up with the algorithm for that. Can you please help me with the algorithm?
You can just use the basic Gauss elimination method. Counting the number of non-zero rows will give you the rank. But this method is not really numerically robust. As the Wikipedia article says, there are a few other algorithms, such as singular value decomposition (SVD) or QR decomposition with pivoting. For both, you should easily be able to find basic implementations.
But since you need accurate numbers for this, you always have to think about the numerical inaccuracies of the IEEE representation of floats in the computer.
Read more about it on:
http://en.wikipedia.org/wiki/IEEE_754
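A minimal C# sketch of the Gauss elimination approach mentioned above, using a small tolerance instead of an exact zero test precisely because of those floating-point issues (the tolerance value is an assumption to tune for your data):

using System;

static class MatrixRank
{
    // Rank via Gaussian elimination with partial pivoting.
    // The tolerance guards against floating-point noise: values that should be
    // exactly zero rarely are after elimination.
    public static int Rank(double[,] a, double tolerance = 1e-10)
    {
        int rows = a.GetLength(0), cols = a.GetLength(1);
        var m = (double[,])a.Clone();          // don't destroy the caller's matrix
        int rank = 0;

        for (int col = 0; col < cols && rank < rows; col++)
        {
            // Find the pivot row: largest absolute value in this column at or below 'rank'.
            int pivot = rank;
            for (int r = rank + 1; r < rows; r++)
                if (Math.Abs(m[r, col]) > Math.Abs(m[pivot, col])) pivot = r;

            if (Math.Abs(m[pivot, col]) < tolerance) continue;   // column is (numerically) zero

            // Swap the pivot row into position.
            for (int c = 0; c < cols; c++)
                (m[rank, c], m[pivot, c]) = (m[pivot, c], m[rank, c]);

            // Eliminate this column from all rows below.
            for (int r = rank + 1; r < rows; r++)
            {
                double factor = m[r, col] / m[rank, col];
                for (int c = col; c < cols; c++)
                    m[r, c] -= factor * m[rank, c];
            }
            rank++;
        }
        return rank;
    }
}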
This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
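As a sketch of reading a single record from such a fixed-width file without loading the whole thing (which also addresses the edit about C#): compute the record's byte offset from the five indices and Seek to it. The record width, encoding and number format below are assumptions you would match to however the file is actually written:

using System;
using System.Globalization;
using System.IO;
using System.Text;

static class LookupTable
{
    const int EntriesPerDimension = 20;   // 20 values per variable, as in the question
    const int RecordWidth = 16;           // bytes per fixed-width record, newline included (assumed)

    // Read one outcome without loading the whole file: compute the record's byte offset
    // from the five indices (each 0-19) and Seek straight to it.
    public static double Lookup(string path, int i1, int i2, int i3, int i4, int i5)
    {
        // Row-major index into the flattened 20x20x20x20x20 table.
        long index = ((((long)i1 * EntriesPerDimension + i2) * EntriesPerDimension + i3)
                       * EntriesPerDimension + i4) * EntriesPerDimension + i5;

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(index * RecordWidth, SeekOrigin.Begin);
            var buffer = new byte[RecordWidth];
            int read = fs.Read(buffer, 0, RecordWidth);
            return double.Parse(Encoding.ASCII.GetString(buffer, 0, read).Trim(),
                                CultureInfo.InvariantCulture);
        }
    }
}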
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.
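For the generation side, a hedged C# sketch of that "pick a traversal and dump it comma-delimited" idea, with computeOutcome standing in for whatever the real algorithm is:

using System;
using System.Globalization;
using System.IO;
using System.Linq;

static class TableDump
{
    // Write the whole 5-dimensional table as comma-delimited text, one line per
    // (v1, v2, v3, v4) slice, with the outcomes for every value of v5 on that line.
    public static void Dump(string path, double[] values,
                            Func<double, double, double, double, double, double> computeOutcome)
    {
        using (var writer = new StreamWriter(path))
        {
            foreach (var a in values)
            foreach (var b in values)
            foreach (var c in values)
            foreach (var d in values)
            {
                string line = string.Join(",", values.Select(e =>
                    computeOutcome(a, b, c, d, e).ToString(CultureInfo.InvariantCulture)));
                writer.WriteLine(line);
            }
        }
    }
}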
I am looking for some kind of intelligent library (I was thinking AI or a neural network) that I can feed a list of historical data, and it will predict the next sequence of outputs.
As an example, I would like to feed the library the following figures: 1,2,3,4,5
and based on this, it should predict that the next sequence is 6,7,8,9,10 etc.
The inputs will be a lot more complex and contain much more information.
This will be used in a C# application.
If you have any recommendations or warnings, that would be great.
Thanks
EDIT
What I am trying to do is, using historical sales data, predict what amount a specific client is most likely going to spend in the next period.
I do understand that there are dozens of external factors that can influence a client's purchases, but for now I need to base it merely on the sales history and then plot a graph showing past sales and predicted sales.
If you're looking for a .NET API, then I would recommend you try AForge.NET http://code.google.com/p/aforge/
If you just want to try various machine learning algorithms on a data set that you have at your disposal, then I would recommend that you play around with Weka; it's (relatively) easy to use and it implements a lot of ML/AI algorithms. Run multiple runs with different settings for each algorithm and try as many algorithms as you can. Most of them will have some predictive power and if you combine the right ones, then you might really get something useful.
If I understand your question correctly, you want to approximate and extrapolate an unknown function. In your example, you know the function values
f(0) = 1
f(1) = 2
f(2) = 3
f(3) = 4
f(4) = 5
A good approximation for these points would be f(x) = x+1, and that would yield f(5) = 6... as expected. The problem is, you can't solve this without knowledge about the function you want to extrapolate: Is it linear? Is it a polynomial? Is it smooth? Is it (approximately or exactly) cyclic? What is the range and domain of the function? The more you know about the function you want to extrapolate, the better your predictions will be.
I just have a warning, sorry. =)
Mathematically, there is no reason for your sequence above to be followed by a "6". I can easily give you a simple function whose next value is any value you like. It's just that humans like simple rules and therefore tend to see a connection in these sequences that in reality is not there. Therefore, this is an impossible task for a computer if you do not feed it additional information.
Edit:
In the case that you suspect your data has a known functional dependence and there are uncontrollable outside factors, regression analysis may have good results. To start easy, look at linear regression first.
If you cannot assume linear dependence, there is a nice application that looks for functions fitting your historical data... I'll update this post with its name as soon as I remember. =)
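To make the linear-regression suggestion concrete, here is a minimal ordinary-least-squares sketch in C# that fits a straight line to past periods and extrapolates one period ahead (a toy example; real sales data will usually need more than a straight line):

using System;
using System.Linq;

static class SalesForecast
{
    // Ordinary least-squares fit of y = slope * x + intercept, then extrapolate.
    // x could be the period number, y the amount the client spent in that period.
    public static double PredictNext(double[] x, double[] y)
    {
        double meanX = x.Average(), meanY = y.Average();
        double covariance = 0, variance = 0;
        for (int i = 0; i < x.Length; i++)
        {
            covariance += (x[i] - meanX) * (y[i] - meanY);
            variance   += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = covariance / variance;
        double intercept = meanY - slope * meanX;

        double nextX = x.Max() + 1;                 // the next period
        return slope * nextX + intercept;           // predicted spend for that period
    }
}

// PredictNext(new double[] {1,2,3,4,5}, new double[] {1,2,3,4,5}) returns 6, matching the example above.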
I need an algorithm that can compare two text files and highlight their differences and (even better!) compute their difference in a meaningful way (i.e. two similar files should have a similarity score higher than two dissimilar files, with the word "similar" defined in the usual sense). It sounds easy to implement, but it's not.
The implementation can be in C# or Python.
Thanks.
I can recommend taking a look at Neil Fraser's code and articles:
google-diff-match-patch
Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.
Neil Fraser: Diff Strategies - for theory and implementation notes
In Python, there is difflib, as others have also suggested.
difflib offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:
import difflib

def text_compare(text1, text2, isjunk=None):
    return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
Look at difflib. (Python)
That will calculate the diffs in various formats. You could then use the size of the context diff as a measure of how different two documents are?
My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers "middle-snake" method with the Hirschberg linear space refinement.
The Myers algorithm is described in:
E. Myers, "An O(ND) Difference Algorithm and Its Variations," Algorithmica 1, 2 (1986), 251-266.
The GNU diff utility uses the Myers algorithm.
The "similarity score" you speak of is called the "edit distance" in the literature which is the number of inserts or deletes necessary to transform one sequence into the other.
Note that a number of people have cited the Levenshtein distance algorithm but that is, albeit easy to implement, not the optimal solution as it is inefficient (requires the use of a possibly huge n*m matrix) and does not provide the "edit script" which is the sequence of edits that could be used to transform one sequence into the other and vice versa.
For a good Myers / Hirschberg implementation look at:
http://www.ioplex.com/~miallen/libmba/dl/src/diff.c
The particular library that it is contained within is no longer maintained but to my knowledge the diff.c module itself is still correct.
Mike
Bazaar contains an alternative difference algorithm, called patience diff (there's more info in the comments on that page) which is claimed to be better than the traditional diff algorithm. The file 'patiencediff.py' in the bazaar distribution is a simple command line front end.
If you need a finer granularity than lines, you can use Levenshtein distance. Levenshtein distance is a straightforward measure of how similar two texts are.
You can also use it to extract the edit operations and produce a very fine-grained diff, similar to that on the edit history pages of SO.
Be warned, though, that Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib, as Douglas Leder suggested, is most likely going to be faster.
Cf. also this answer.
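For reference, here is a compact C# sketch of the Levenshtein distance using the usual two-row optimization, which keeps memory proportional to the shorter string instead of the full n*m matrix mentioned elsewhere in this thread:

using System;

static class Levenshtein
{
    // Classic dynamic-programming Levenshtein distance with two rolling rows,
    // i.e. O(min(n, m)) memory instead of the full n*m matrix.
    public static int Distance(string a, string b)
    {
        if (a.Length < b.Length) (a, b) = (b, a);    // make b the shorter string

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(Math.Min(current[j - 1] + 1,      // insertion
                                               previous[j] + 1),        // deletion
                                      previous[j - 1] + cost);          // substitution
            }
            (previous, current) = (current, previous);                  // roll the rows
        }
        return previous[b.Length];
    }
}

// Distance("kitten", "sitting") == 3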
There are a number of distance metrics; as paradoja mentioned, there is the Levenshtein distance, but there are also NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first; it implements a bunch of algorithms.
As stated, use difflib. Once you have the diffed output, you may compute the Levenshtein distance of the differing strings to give a "value" of how different they are.
You could use the solution to the Longest Common Subsequence (LCS) problem. See also the discussion about possible ways to optimize this solution.
One method I've employed for a different functionality, to calculate how much data was new in a modified file, could perhaps work for you as well.
I have a diff/patch implementation in C# that allows me to take two files, presumably old and new versions of the same file, and calculate the "difference", but not in the usual sense of the word. Basically I calculate a set of operations that I can perform on the old version to update it to have the same contents as the new version.
To use this for the functionality initially described, i.e. to see how much data was new, I simply ran through the operations: every operation that copied from the old file verbatim had a 0-factor, and every operation that inserted new text (distributed as part of the patch, since it didn't occur in the old file) had a 1-factor. All characters were given this factor, which gave me basically a long list of 0's and 1's.
All I then had to do was to tally up the 0's and 1's. In your case, with my implementation, a low number of 1's compared to 0's would mean the files are very similar.
This implementation would also handle cases where the modified file had inserted copies from the old file out of order, or even duplicates (i.e. you copy a part from the start of the file and paste it near the bottom), since they would both be copies of the same original part of the old file.
I experimented with weighting copies, so that the first copy counted as 0 and subsequent copies of the same characters had progressively higher factors, in order to give a copy/paste operation some "new-factor", but I never finished it, as the project was scrapped.
If you're interested, my diff/patch code is available from my Subversion repository.
Take a look at the Fuzzy module. It has fast algorithms (written in C) for Soundex, NYSIIS and double metaphone.
A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528