.NET sorted sets - efficient range dissection operations - C#

In an attempt to not reinvent the wheel, I've been looking for a fast, efficient data structure that I can use in my code which is somewhat analogous to Redis' sorted sets.
Adding ranges of items is easy enough, but I've discovered that there doesn't seem to be anything that allows me to drop ranges of entries based on an upper and lower bound. For example, if my set is keyed on a double, I'd like to be able to drop all values between 0.3 and 0.7. I was quite surprised that I couldn't find any easy way to do this that felt like the correct approach. Also, performance is important for my use case, both in terms of speed and memory usage.
Examples and implementations from NuGet, CodeProject, GitHub, etc. are all acceptable.
p.s. For what it's worth, I'm wanting to maintain a cache of items indexed across 4 dimensions, and need to be able to discard items across different dimensional ranges when usage strays too far from a given range. Feel free to suggest something to this effect as well.
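For illustration, here is a minimal sketch (not a recommendation of any particular library) of the range-drop operation using the built-in SortedSet<double>. Note it only holds plain values in one dimension, not the score/member pairs of a Redis sorted set, and RemoveWhere scans the whole set:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RangeDropSketch
{
    static void Main()
    {
        var set = new SortedSet<double> { 0.1, 0.25, 0.3, 0.4, 0.55, 0.7, 0.9 };

        // Option 1: RemoveWhere - simple, but it scans the whole set (O(n)).
        set.RemoveWhere(v => v >= 0.3 && v <= 0.7);

        // Option 2: GetViewBetween returns a live, inclusive-bounds view of the range;
        // materialise it first so we aren't removing while enumerating.
        var set2 = new SortedSet<double> { 0.1, 0.25, 0.3, 0.4, 0.55, 0.7, 0.9 };
        foreach (var v in set2.GetViewBetween(0.3, 0.7).ToList())
            set2.Remove(v);

        Console.WriteLine(string.Join(", ", set));   // remaining: 0.1, 0.25, 0.9
        Console.WriteLine(string.Join(", ", set2));  // remaining: 0.1, 0.25, 0.9
    }
}
```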

Related

Implement Full Text Search

I am implementing full text search on a single entity, Document, which has a name and content. The content can be quite big (20+ pages of text). I am wondering how to do it.
Currently I am looking at using Redis and RedisSearch, but I am not sure if it can handle search in big chunks of text. We are talking about a multitenant application with each customer having more than 1000 documents that are quite big.
TL;DR: What should I use to search large chunks of text content?
This space is a bit unclear to me, sorry for the confusion. Will update the question when I have more clarity.
I can't tell you what the right answer is, but I can give you some ideas about how to decide.
Normally, if I had documents/content in a DB, I'd be inclined to search there - assuming that the search functionality I could implement (a) was functionally effective enough, (b) didn't require code that was super ugly, and (c) wasn't going to kill the database. There's usually a lot of messing around when implementing the search features and filters you want to provide to the user - UI components, logic components, and then reconciling all that with how the database & query language actually work.
So, based on what you've said, the key trade-offs are probably:
Functionality / functional fit (creating the features you need, to work in a way that's useful).
Ease of development & maintenance.
Performance - purely on the basis that gathering search results across "documents" is not necessarily the fastest thing you can do with an IT system.
Have you tried doing a simple whiteboard "options analysis" exercise? If not try this:
Get a small number of interested and smart people around a whiteboard. You can do this exercise alone, but bouncing ideas around with others is almost always better.
Agree what the high level options are. In your case you could start with two: one based on MSSQL, the other based on Redis.
Draw up a big table - each option has its own column (starting at column 2).
In column 1, list out all the important things that will drive your decision, e.g. functional fit, ease of development & maintenance, performance, cost, etc.
For each driver in column 1, do a score for each option.
How you do it is up to you: you could use a 1-5 point system (optionally with a planning-poker-style approach to avoid anchoring), or you could just write down a few key notes.
Be ready to note down any questions that come up, important assumptions, etc so they don't get lost.
Sometimes as you work through the exercise the answer becomes obvious. If it's really close you can rely on scores - but that's not ideal. It's more likely that of all the drivers listed some will be more important than others, so don't ignore the significance of those.

High Dimensional Data Clustering

What are the best clustering algorithms to use to cluster data with more than 100 dimensions (sometimes even 1000)? I would appreciate it if you know of any implementation in C, C++ or, especially, C#.
It depends heavily on your data. See the curse of dimensionality for common problems. Recent research (Houle et al.) showed that you can't really go by the numbers: there may be thousands of dimensions and the data may still cluster well, and of course there is one-dimensional data that just doesn't cluster. It's mostly a matter of signal-to-noise.
This is why, for example, clustering of TF-IDF vectors works rather well, in particular with cosine distance.
But the key point is that you first need to understand the nature of your data. You then can pick appropriate distance functions, weights, parameters and ... algorithms.
In particular, you also need to know what constitutes a cluster for you. There are many definitions, in particular for high-dimensional data: clusters may live in subspaces, they may or may not be arbitrarily rotated, and they may or may not overlap (k-means, for example, doesn't allow overlaps or subspaces).
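To make the cosine-distance remark concrete, here is a small sketch (a hypothetical helper, not taken from any library) of cosine similarity over dense, equal-length double vectors; clustering code would typically use 1 - similarity as the distance:

```csharp
using System;

static class VectorMath
{
    // Cosine similarity of two equal-length dense vectors: dot(a,b) / (|a| * |b|).
    // A typical clustering distance is then 1 - CosineSimilarity(a, b).
    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```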
Well, I know something called vector quantization; it's a nice algorithm for clustering data with many dimensions.
I've used k-means on data with hundreds of dimensions. It is very common, so I'm sure there's an implementation in any language; worst case, it is very easy to implement yourself.
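To back up the "easy to implement yourself" claim, here is a rough sketch of a plain k-means in C# - Euclidean distance, random initialisation, a fixed iteration count, and no k-means++ or empty-cluster handling:

```csharp
using System;
using System.Linq;

static class KMeansSketch
{
    // Clusters 'data' (one row per point, any number of dimensions) into k groups
    // and returns the cluster index assigned to each point.
    public static int[] Cluster(double[][] data, int k, int iterations = 50, int seed = 0)
    {
        var rng = new Random(seed);
        int dims = data[0].Length;

        // Initialise centroids with k randomly chosen points.
        double[][] centroids = data.OrderBy(_ => rng.Next())
                                   .Take(k)
                                   .Select(p => (double[])p.Clone())
                                   .ToArray();
        var assignment = new int[data.Length];

        for (int iter = 0; iter < iterations; iter++)
        {
            // Assignment step: each point goes to its nearest centroid (squared Euclidean).
            for (int i = 0; i < data.Length; i++)
            {
                int best = 0;
                double bestDist = double.MaxValue;
                for (int c = 0; c < k; c++)
                {
                    double d = 0;
                    for (int j = 0; j < dims; j++)
                    {
                        double diff = data[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assignment[i] = best;
            }

            // Update step: move each centroid to the mean of its assigned points.
            for (int c = 0; c < k; c++)
            {
                int[] members = Enumerable.Range(0, data.Length)
                                          .Where(i => assignment[i] == c)
                                          .ToArray();
                if (members.Length == 0) continue;   // keep empty clusters where they are
                for (int j = 0; j < dims; j++)
                    centroids[c][j] = members.Average(i => data[i][j]);
            }
        }
        return assignment;
    }
}
```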
It might also be worth trying some dimensionality reduction techniques like Principal Component Analysis or an auto-associative neural net before you try to cluster it. It can turn a huge problem into a much smaller one.
After that, go with k-means or a mixture of Gaussians.
The EM-tree and K-tree algorithms in the LMW-tree project can cluster high-dimensional problems like this. The project is implemented in C++ and supports many different representations.
We have novel algorithms for clustering binary vectors created by LSH / random projections, or anything else that emits binary vectors that can be compared via Hamming distance for similarity.

When Would You Implement Your Own Sorting Algorithm?

Forgive me if this is a silly question... but I think back to my Comp. Sci. classes and I distinctly remember learning, and being quizzed on, several sorting algorithms and the corresponding 'Big O' notation.
Outside of the classroom though, I've never actually written code to sort.
When I get results from a database, I use 'Order By'. Otherwise, I use a collection class that implements a sort. I have implemented IComparable to allow sorting; but I've never gone beyond that.
Was sorting always just an academic pursuit for those of us who don't implement languages/frameworks? Or is it just that modern languages running on modern hardware make it a trivial detail to worry about?
Finally, when I call .Sort on a List(Of String), for example, what sort algorithm is being used under the hood?
While you might only rarely need to implement a sorting algorithm yourself, understanding the different algorithms and their complexity might help you in solving more complex problems.
Finally, when I call .Sort on a List(Of String), for example, what sort algorithm is being used under the hood?
Quicksort
I've never once implemented my own sorting algorithm since I took my CS classes in college, and if I were ever even contemplating writing my own, I'd want my head examined first.
List<T> uses Quicksort per the MSDN documentation:
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
You probably won't implement your own sorting algorithm if you are using high-level languages...
What you learnt in the classroom was merely there to teach you the existence and importance of big-O (omicron) notation.
It was there to make you aware that optimization is always a goal in programming, and that when you code something you must always think about how it will execute.
It teaches you that loops inside loops and recursion can lead to big performance problems if they aren't analyzed/optimized well before coding starts.
It is guidance for checking your design up front and being able to approximate the execution speed.
It is important for a programmer to know how these algorithms work. One reason is that, in certain conditions, certain algorithms are better - although sorting is rarely the bottleneck.
In some frameworks, the .Sort function uses various methods, depending on the situation.
Modern languages running on modern hardware make it a trivial detail to worry about, unless a profiler shows that sorting is the bottleneck of your code.
According to this, List.Sort uses Array.Sort, which uses QuickSort.
IMO, it's become a bit of an academic exercise. You need to understand algorithmic complexity, and sorting is a good example for working through it because you can easily see the results and calculate the different complexities. In real life, though, there's almost certainly a library call that sorts your range faster than you would be able to do if you try to roll your own.
I don't know what the .NET libraries use for their default implementation, but I'd guess it's Quicksort or Shellsort. I'd be interested to find out if it's something else.
I've occasionally had to write my own sort methods, but only when I was writing for a relatively immature and underpowered platform (like .Net 1.1 in Windows CE or embedded java or somesuch).
On anything resembling a modern computer, the best sorting algorithm is the ORDER BY clause in T-SQL.
Implementing your own sort is the kind of thing that you do to gain insight in how algorithms work, what the tradeoffs are, which tried-and-true approaches that solve a wide array of problems efficiently are known, etc.
As Darin Dimitrov's answer states, library sort routines need to have a very competitive average-case performance, so quicksort is typically chosen.
Was sorting always just an academic pursuit for those of us who don't implement languages/frameworks? Or is it just that modern languages running on modern hardware make it a trivial detail to worry about?
Even if you're not implementing your own routine, you may know how your data is likely to be arranged, and may want to choose a suitable library algorithm. For example, see this discussion on nearly sorted data.
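To make the nearly-sorted point concrete, here is an insertion-sort sketch; on input that is already almost in order the inner loop exits almost immediately, so the overall cost approaches O(n) rather than O(n^2):

```csharp
static class NearlySorted
{
    // Insertion sort: O(n^2) in the worst case, but close to O(n) when the input
    // is already nearly sorted, because the inner while loop barely runs.
    public static void InsertionSort(int[] a)
    {
        for (int i = 1; i < a.Length; i++)
        {
            int value = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > value)
            {
                a[j + 1] = a[j];   // shift larger elements one slot to the right
                j--;
            }
            a[j + 1] = value;
        }
    }
}
```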
I think there are times when you need to have a custom sorting method.
What if you wanted to sort by the make of cars, but not alphabetically?
For example, you have a database with the makes: Honda, Lexus, Toyota, Acura, Nissan, Infiniti
If you use a plain sort, you get the order: Acura, Honda, Infiniti, Lexus, Nissan, Toyota
What if you wanted to sort them based on a car company's standard and luxury class together? Lexus, Toyota, Honda, Acura, Nissan, Infiniti.
I think you would need a custom sort method if that's the case.
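For example (a sketch using the makes above; the brand ordering is just the one listed), the custom order can be expressed as a rank lookup fed either to LINQ's OrderBy or to List<T>.Sort via a Comparison<string>:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CustomMakeOrder
{
    // The desired non-alphabetical ordering described above.
    static readonly string[] Order = { "Lexus", "Toyota", "Honda", "Acura", "Nissan", "Infiniti" };
    static readonly Dictionary<string, int> Rank =
        Order.Select((make, index) => (make, index))
             .ToDictionary(x => x.make, x => x.index);

    static void Main()
    {
        var makes = new List<string> { "Honda", "Lexus", "Toyota", "Acura", "Nissan", "Infiniti" };

        // Option 1: LINQ OrderBy over the rank lookup.
        Console.WriteLine(string.Join(", ", makes.OrderBy(m => Rank[m])));

        // Option 2: a Comparison<string> passed straight to List<T>.Sort.
        makes.Sort((a, b) => Rank[a].CompareTo(Rank[b]));
        Console.WriteLine(string.Join(", ", makes)); // Lexus, Toyota, Honda, Acura, Nissan, Infiniti
    }
}
```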

Computing, storing, and retrieving values to and from an N-Dimensional matrix

This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:

1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
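To illustrate the "look without reading the entire file" point, here is a sketch under assumed layout choices (20 samples per variable, each value stored as a fixed 16-character ASCII field, no line breaks), using a row-major flattened index:

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

class FixedWidthLookup
{
    const int ValuesPerDim = 20;   // 20 sampled values per variable, as in the question
    const int FieldWidth   = 16;   // every stored number padded to 16 ASCII characters

    // Reads outcome(i0, i1, i2, i3, i4) without loading the whole file into memory.
    static double Lookup(string path, int i0, int i1, int i2, int i3, int i4)
    {
        // Row-major flattening of the 5-D index into a single record number.
        long record = ((((long)i0 * ValuesPerDim + i1) * ValuesPerDim + i2)
                       * ValuesPerDim + i3) * ValuesPerDim + i4;

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(record * FieldWidth, SeekOrigin.Begin);
            var buffer = new byte[FieldWidth];
            fs.Read(buffer, 0, FieldWidth);   // assumes the full field is returned in one read
            return double.Parse(Encoding.ASCII.GetString(buffer).Trim(),
                                CultureInfo.InvariantCulture);
        }
    }

    static void Main()
    {
        // "outcomes.dat" stands in for the file produced offline by the calculation engine.
        Console.WriteLine(Lookup("outcomes.dat", 3, 0, 19, 7, 12));
    }
}
```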
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.
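A sketch of that dumping step, with a placeholder Outcome function standing in for the real algorithm and 20 samples per variable; the traversal order just has to match whatever the reader on the device expects:

```csharp
using System.IO;

class TableDump
{
    const int N = 20;   // samples per variable

    // Placeholder standing in for the real 5-variable algorithm.
    static double Outcome(int a, int b, int c, int d, int e) => a + b + c + d + e;

    static void Main()
    {
        using (var writer = new StreamWriter("lookup.csv"))
        {
            // Fixed traversal: a, b, c, d vary slowly; e varies fastest.
            for (int a = 0; a < N; a++)
            for (int b = 0; b < N; b++)
            for (int c = 0; c < N; c++)
            for (int d = 0; d < N; d++)
            {
                var fields = new string[N];
                for (int e = 0; e < N; e++)
                    fields[e] = Outcome(a, b, c, d, e).ToString("R");
                writer.WriteLine(string.Join(",", fields));   // one innermost slice per line
            }
        }
    }
}
```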

.NET Neural Network or AI for Future Predictions

I am looking for some kind of intelligent library (I was thinking AI or a neural network) that I can feed a list of historical data, and it will predict the next sequence of outputs.
As an example I would like to feed the library the following figures 1,2,3,4,5
and based on this, it should predict the next sequence is 6,7,8,9,10 etc.
The inputs will be a lot more complex and contain much more information.
This will be used in a C# application.
If you have any recommendations or warnings, that would be great.
Thanks
EDIT
What I am trying to do is, using historical sales data, predict what amount a specific client is most likely to spend in the next period.
I do understand that there are dozens of external factors that can influence a client's purchases, but for now I need to base it merely on the sales history and then plot a graph showing past sales and predicted sales.
If you're looking for a .NET API, then I would recommend you try AForge.NET http://code.google.com/p/aforge/
If you just want to try various machine learning algorithms on a data set that you have at your disposal, then I would recommend that you play around with Weka; it's (relatively) easy to use and it implements a lot of ML/AI algorithms. Run multiple runs with different settings for each algorithm and try as many algorithms as you can. Most of them will have some predictive power and if you combine the right ones, then you might really get something useful.
If I understand your question correctly, you want to approximate and extrapolate an unknown function. In your example, you know the function values
f(0) = 1
f(1) = 2
f(2) = 3
f(3) = 4
f(4) = 5
A good approximation for these points would be f(x) = x+1, and that would yield f(5) = 6... as expected. The problem is, you can't solve this without knowledge about the function you want to extrapolate: Is it linear? Is it a polynomial? Is it smooth? Is it (approximately or exactly) cyclic? What is the range and domain of the function? The more you know about the function you want to extrapolate, the better your predictions will be.
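As a concrete illustration with those five points, here is a sketch of an ordinary least-squares line fit (the simplest "assume it's linear" model), which recovers f(x) = x + 1 and extrapolates f(5) = 6:

```csharp
using System;
using System.Linq;

class LinearFitSketch
{
    // Ordinary least squares for y = slope * x + intercept.
    static (double slope, double intercept) Fit(double[] x, double[] y)
    {
        double meanX = x.Average(), meanY = y.Average();
        double covXY = x.Zip(y, (xi, yi) => (xi - meanX) * (yi - meanY)).Sum();
        double varX  = x.Sum(xi => (xi - meanX) * (xi - meanX));
        double slope = covXY / varX;
        return (slope, meanY - slope * meanX);
    }

    static void Main()
    {
        double[] x = { 0, 1, 2, 3, 4 };
        double[] y = { 1, 2, 3, 4, 5 };           // the known values f(0)..f(4)
        var (slope, intercept) = Fit(x, y);       // slope = 1, intercept = 1
        Console.WriteLine(slope * 5 + intercept); // extrapolated f(5) = 6
    }
}
```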
I just have a warning, sorry. =)
Mathematically, there is no reason for your sequence above to be followed by a "6". I can easily give you a simple function whose next value is any value you like. It's just that humans like simple rules, and therefore tend to see a connection in these sequences that in reality is not there. Therefore, this is an impossible task for a computer if you do not want to feed it additional information.
Edit:
If you suspect that your data has a known functional dependence, and there are uncontrollable outside factors, regression analysis may give good results. To start easy, look at linear regression first.
If you cannot assume linear dependence, there is a nice application that looks for functions fitting your historical data... I'll update this post with its name as soon as I remember. =)
