C# - Efficiently search through a data set

For my project, I have to search through a data set to find a matching string. Previously this was implemented by comparing every single item with the target string, but now my team wants it to run faster. I cannot use a hash map because we are searching through multiple data sets for a string, so I need an alternative. Please help, thank you!
//Looking for a specific string (time) in each data set
foreach (var dataset in datasets)
{
    foreach (DataRow row in dataset.Tables[0].Rows)
    {
        if (row.ItemArray[0].ToString() == time.ToString())
        {
            //Enter this block
        }
    }
}

Use bitmaps. For example this one here - https://github.com/Auralytical/CRoaring.Net
Split your files into tokens and normalize them (lower-case, trim, etc.).
Assign an index to each unique token (a plain list is fine, for example; any new token is just appended at the end).
If a file contains a token, set the bit at that token's index.
You will get a compact index for each file (a sparse bitmap).
All you have to do now is do the same with your query to get a query bitmap, then just scan through all the bitmaps to find which files match. You can build a binary search tree over them to make it logarithmic.
Pros: the final complexity will be O(N) for indexing and O(log N) for finding any query you desire, where N is the count of documents.
Cons: if your tokens are random (hashes, MD5s, telephone numbers, datetimes, etc.) this will work very slowly, because the domain is practically infinite; you should ignore those tokens and put them into some kind of offloading tree (or just into a key-value database).
PS: This is how most full-text search engines do it, so you can just use an ElasticSearch or Lucene shard directly (which I recommend, because it is a simple and very flexible engine, it is production ready, and gazillions of people use it for the most perverted business logic, like building their own search engines or porn story catalogs).

Related

Compare 10 Million Entities

I have to write a program that compares 10'000'000+ Entities against one another. The entities are basically flat rows in a database/csv file.
The comparison algorithm has to be pretty flexible, it's based on a rule engine where the end user enters rules and each entity is matched against every other entity.
I'm thinking about how I could possibly split this task into smaller workloads but I haven't found anything yet. Since the rules are entered by the end user pre-sorting the DataSet seems impossible.
What I'm trying to do now is fit the entire DataSet in memory and process each item. But that's not highly efficient and requires approx. 20 GB of memory (compressed).
Do you have an idea how I could split the workload or reduce its size?
Thanks
If your rules are at the highest level of abstraction (e.g. any unknown comparison function), you can't achieve your goal. 10^14 comparison operations will run for ages.
If the rules are not completely general I see 3 solutions to optimize different cases:
if the comparison is transitive and you can calculate a hash (somebody already recommended this), do it. Hashes can also be complicated, not only your rules =). Find a good hash function and it might help in many cases.
if the entities are sortable, sort them. For this purpose I'd recommend not sorting in-place, but building an array of indexes (or IDs) of the items. If your comparison can be transformed to SQL (as I understand it, your data is in a database), you can perform this on the DBMS side more efficiently and read back the sorted indexes (for example 3, 1, 2, which means that the item with ID=3 is the lowest, ID=1 is in the middle and ID=2 is the largest). Then you need to compare only adjacent elements (see the sketch after this answer).
if neither of those works, I would try some heuristic sorting or hashing. I mean, I would create a hash which does not necessarily uniquely identify equal elements, but can split your dataset into groups between which there is definitely no pair of equal elements. Then all equal pairs will be inside the groups, and you can read the groups one by one and do the expensive manual comparison within a group of, say, 100 elements rather than 10,000,000. The other sub-approach is heuristic sorting, with the same purpose of guaranteeing that equal elements don't end up at opposite ends of the dataset. After that you can read elements one by one and compare each with, for example, the 1000 previous elements (already read and kept in memory). I would keep, say, 1100 elements in memory and free the oldest 100 every time a new 100 arrive. This would optimize your DB reads. Another implementation of this is possible if your rules contain clauses like (Attribute1 = Value1) AND (...), or (Attribute1 < Value2) AND (...), or any other simple rule. Then you can cluster first by these criteria and then compare items within the created clusters.
By the way, what if your rule considers all 10,000,000 elements equal? Would you like to get 10^14 result pairs? This case shows that you can't solve the task in the general case. Try making some limitations and assumptions.
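To make option 2 above concrete, here is a hedged sketch of sorting an index array by a rule-derived key and comparing only adjacent elements; Entity, keyOf and rulesMatch are placeholders invented for the example, not names from the question:
using System;
using System.Collections.Generic;
using System.Linq;

class Entity { public int Id; public string Name; }

static class AdjacentCompare
{
    // keyOf: a sortable projection derived from the user's rules
    // rulesMatch: the full (possibly expensive) comparison
    public static IEnumerable<(Entity, Entity)> FindMatches(
        IList<Entity> entities,
        Func<Entity, string> keyOf,
        Func<Entity, Entity, bool> rulesMatch)
    {
        // Sort an array of indexes instead of the entities themselves.
        var order = Enumerable.Range(0, entities.Count)
                              .OrderBy(i => keyOf(entities[i]))
                              .ToArray();

        // Only adjacent elements in sort order can possibly be equal.
        for (int k = 1; k < order.Length; k++)
        {
            var a = entities[order[k - 1]];
            var b = entities[order[k]];
            if (rulesMatch(a, b))
                yield return (a, b);
        }
    }
}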
I would try to think about a rule hierarchy.
Let's say, for example, that rule A is "Color" and rule B is "Shape".
If you first divide objects by color, then there is no need to compare a red circle with a blue triangle.
This will reduce the number of comparisons you will need to do.
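As a small illustration of that idea (Item, Color and Shape are made-up names, not from the question), group by the first rule and only compare within each group:
using System.Collections.Generic;
using System.Linq;

class Item { public string Color; public string Shape; }

static class HierarchyCompare
{
    public static IEnumerable<(Item, Item)> EqualPairs(IEnumerable<Item> items)
    {
        // Divide by Color first: a red circle never has to meet a blue triangle.
        foreach (var group in items.GroupBy(i => i.Color))
        {
            var bucket = group.ToList();
            for (int i = 0; i < bucket.Count; i++)
                for (int j = i + 1; j < bucket.Count; j++)
                    if (bucket[i].Shape == bucket[j].Shape)   // rule B, applied inside the bucket
                        yield return (bucket[i], bucket[j]);
        }
    }
}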
I would create a hash code from each entity. You probably have to exclude the ID from the hash generation and then test for equality. Once you have the hashes you can order all the hash codes alphabetically. Having all entities in order makes it pretty easy to check for duplicates.
If you want to compare each entity with all other entities, then effectively you need to cluster the data; there is very little reason to compare totally unrelated things (comparing Clothes with Humans does not make sense). I think your rules will try to cluster the data anyway.
So you need to cluster the data; try some clustering algorithms like K-Means.
Also see Apache Mahout.
Are you looking for the best-suited sorting algorithm for this?
I think divide and conquer seems good.
If the algorithm fits, you have plenty of other ways to do the calculation. In particular, parallel processing using MPICH or something similar may get you to the final destination.
But before deciding how to execute, you have to check whether the algorithm fits first.

Data structure for indexed searches of subsets

I'm working on a C# jQuery implementation and am trying to figure out an efficient algorithm for locating elements in a subset of the entire DOM (e.g. a subselector). At present I am creating an index of common selectors: class, id, and tag when the DOM is built.
The basic data structure is as one would expect, a tree of Elements which contain IEnumerable<Element> Children and a Parent. This is simple when searching the whole domain using a Dictionary<string, HashSet<Element>> to store the index.
I have not been able to get my head around the most effective way to search subsets of elements using an index. I use the term "subset" to refer to the starting set from which a subsequent selector in a chain will be run against. The following are methods I've thought of:
1. Retrieve matches from the entire DOM for a subquery, and eliminate those that are not part of the subset. This requires traversing up the parents of each match until the root is found (and it is eliminated) or a member of the subset is found (and it is a child, hence included).
2. Maintain the index separately for each element.
3. Maintain a set of parents for each element (to make #1 fast by eliminating traversal).
4. Rebuild the entire index for each subquery.
5. Just search manually except for primary selectors.
The cost of each possible technique depends greatly on the exact operation being done. #1 is probably pretty good most of the time, since most of the time when you do a sub-select, you're targeting specific elements. The number of iterations required would be the number of results * the average depth of each element.
The 2nd method would be by far the fastest for selecting, but at the expense of storage requirements that increase exponentially with depth, and difficult index maintenance. I've pretty much eliminated this.
The 3rd method has a fairly bad memory footprint (though much better than #2) - it may be reasonable, but in addition to the storage requirements, adding and removing elements becomes substantially more expensive and complicated.
The 4th method requires traversing the entire selection anyway, so it seems pointless, since most subqueries are only going to be run once. It would only be beneficial if a subquery was expected to be repeated. (Alternatively, I could just do this while traversing a subset anyway - except some selectors don't require searching the whole subdomain, e.g. ID and position selectors).
The 5th method will be fine for limited subsets, but much worse than the 1st method for subsets that are much of the DOM.
Any thoughts or other ideas about how best to accomplish this? I could do some hybrid of #1 and #4 by guessing which is more efficient given the size of the subset being searched vs. the size of the DOM but this is pretty fuzzy and I'd rather find some universal solution. Right now I am just using #4 (only full-DOM queries use the index) which is fine, but really bad if you decided to do something like $('body').Find('#id')
Disclaimer: This is early optimization. I don't have a bottleneck that needs solving, but as an academic problem I can't stop thinking about it...
Solution
Here's the implementation of the data structure proposed by the answer. It works perfectly as a near drop-in replacement for a dictionary.
interface IRangeSortedDictionary<TValue> : IDictionary<string, TValue>
{
    IEnumerable<string> GetRangeKeys(string subKey);
    IEnumerable<TValue> GetRange(string subKey);
}

public class RangeSortedDictionary<TValue> : IRangeSortedDictionary<TValue>
{
    protected SortedSet<string> Keys = new SortedSet<string>();
    protected Dictionary<string, TValue> Index =
        new Dictionary<string, TValue>();

    public IEnumerable<string> GetRangeKeys(string subkey)
    {
        if (string.IsNullOrEmpty(subkey))
        {
            yield break;
        }
        // create the next possible string match
        string lastKey = subkey.Substring(0, subkey.Length - 1) +
            Convert.ToChar(Convert.ToInt32(subkey[subkey.Length - 1]) + 1);

        foreach (var key in Keys.GetViewBetween(subkey, lastKey))
        {
            // GetViewBetween is inclusive, exclude the last key just in case
            // there's one with the next value
            if (key != lastKey)
            {
                yield return key;
            }
        }
    }

    public IEnumerable<TValue> GetRange(string subKey)
    {
        foreach (var key in GetRangeKeys(subKey))
        {
            yield return Index[key];
        }
    }

    // implement dictionary interface against internal collections
}
Code is here: http://ideone.com/UIp9R
If you suspect name collisions will be uncommon, it may be fast enough to just walk up the tree.
If collisions are common though, it might be faster to use a data structure that excels at ordered prefix searches, such as a tree. Your various subsets make up the prefix. Your index keys would then include both selectors and total paths.
For the DOM:
<path>
<to>
<element id="someid" class="someclass" someattribute="1"/>
</to>
</path>
You would have the following index keys:
<element>/path/to/element
#someid>/path/to/element
.someclass>/path/to/element
#someattribute>/path/to/element
Now if you search these keys based on prefix, you can limit the query to any subset you want:
<element> ; finds all <element>, regardless of path
.someclass> ; finds all .someclass, regardless of path
.someclass>/path ; finds all .someclass that exist in the subset /path
.someclass>/path/to ; finds all .someclass that exist in the subset /path/to
#id>/body ; finds all #id that exist in the subset /body
A tree can find the lower bound (the first element >= to your search value) in O(log n), and because it is ordered from there you simply iterate until you come to a key that no longer matches the prefix. It will be very fast!
.NET doesn't have a suitable tree structure (it has SortedDictionary but that unfortunately doesn't expose the required LowerBound method), so you'll need to either write your own or use an existing third party one. The excellent C5 Generic Collection Library features trees with suitable Range methods.
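For what it's worth, a hedged usage sketch of the structure above combined with the prefixed keys just described (the element variables and exact key strings are assumptions, and the indexer comes from the omitted IDictionary implementation):
var index = new RangeSortedDictionary<Element>();

// key = selector + ">" + full path of the element
index[".someclass>/path/to/element"] = someClassElement;
index["#someid>/path/to/element"] = someIdElement;

// all .someclass elements, regardless of path
var everywhere = index.GetRange(".someclass>");

// only .someclass elements inside the subset /path/to
var scoped = index.GetRange(".someclass>/path/to");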

Efficient Datastructure for tags?

Imagine you wanted to serialize and deserialize stackoverflow posts including their tags as space efficiently as possible (in binary), but also for performance when doing tag lookups. Is there a good datastructure for that kind of scenario?
Stack Overflow has about 28,532 different tags. You could create a table with all tags and assign each an integer; furthermore, you could sort them by frequency so that the most common tags have the lowest numbers. Still, storing them simply as a string in the format "1 32 45" seems a bit inefficient both from a searching and a storing perspective.
Another idea would be to save tags as a variable bitarray which is attractive from a lookup and serializing perspective. Since the most common tags are first you potentially could fit tags into a small amount of memory.
The problem would be of course that uncommon tags would yield huge bitarrays. Is there any standard for "compressing" bitarrays for large spans of 0's? Or should one use some other structure completely?
EDIT
I'm not looking for a DB solution or a solution where I need to keep entire tables in memory, but a structure for filtering individual items
Not to undermine your question but 28k records is really not all that many. Are you perhaps optimizing prematurely?
I would first stick to using 'regular' indices on a DB table. The hashing heuristics they use are typically very efficient and not trivial to beat (and if you can beat them, is it really worth the effort and time, and are the gains large enough?).
Also depending on where you actually do the tag query, is the user really noticing the 200ms time gain you optimized for?
First measure then optimize :-)
EDIT
Without a DB I would probably have a master table holding all tags together with an ID (if possible hold it in memory). Keep a regular sorted list of IDs together with each post.
Not sure how much storage based on commonality would help. A sorted list in which you can do a regular binary search may prove fast enough; measure :-)
Here you would need to iterate all posts for every tag query though.
If this ends up being too slow you could resort to storing a pocket of post identifiers for each tag. This data structure may become somewhat large though and may require a file to seek and read against.
For a smaller table you could resort to build one based on a hashed value (with duplicates). This way you could use it to quickly get down to a smaller candidate list of posts that need further checking to see if they match or not.
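A rough sketch of that layout (a master tag table plus a sorted ID list per post); the type and member names are assumptions for the example, not the answerer's:
using System;
using System.Collections.Generic;
using System.Linq;

class TagStore
{
    private readonly Dictionary<string, int> _tagIds = new Dictionary<string, int>();   // master tag table
    private readonly Dictionary<int, int[]> _postTags = new Dictionary<int, int[]>();   // post => sorted tag IDs

    public void SetTags(int postId, IEnumerable<string> tags)
    {
        var ids = tags.Select(GetTagId).OrderBy(i => i).ToArray();   // keep the per-post list sorted
        _postTags[postId] = ids;
    }

    // Iterate all posts for a tag query; binary search on each small sorted list.
    public IEnumerable<int> PostsWithTag(string tag)
    {
        if (!_tagIds.TryGetValue(tag, out int id)) yield break;
        foreach (var kv in _postTags)
            if (Array.BinarySearch(kv.Value, id) >= 0)
                yield return kv.Key;
    }

    private int GetTagId(string tag)
    {
        if (!_tagIds.TryGetValue(tag, out int id))
        {
            id = _tagIds.Count;
            _tagIds[tag] = id;
        }
        return id;
    }
}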
You need a second table with 2 fields: tag_id, question_id.
That's it. Then you create indexes on (tag_id, question_id) and (question_id, tag_id) - those would be covering indexes, so all your queries would be very fast.
I have a feeling you abstracted your question too much; you didn't say very much about how you want to access the data structure, which is very important.
That being said, I suggest counting the number of occurrences of each tag and then using Huffman coding to come up with the shortest encoding for the tags. This is not entirely perfect, but I'd stick with it until you've demonstrated that it's inappropriate. You can then associate the codes with each question.
If you want to look up questions with a specific tag efficiently, you will need some kind of index. Maybe all Tag objects could hold an array of references (references, pointers, numeric IDs, etc.) to all the questions that are tagged with that particular tag. That way you simply need to find the tag object and you have an array pointing to all the questions with that tag.

What is the fastest way to count the unique elements in a list of billion elements?

My problem is not usual. Let's imagine a few billion strings. The strings are usually less than 15 characters long. In this list I need to find the number of unique elements.
First of all, what object should I use? You shouldn't forget that if I add a new element I have to check whether it already exists in the list. That is not a problem in the beginning, but after a few million words it can really slow down the process.
That's why I thought that a Hashtable would be ideal for this task, because checking the list is ideally only O(1). Unfortunately a single object in .NET can be only 2 GB.
The next step would be to implement a custom hashtable which contains a list of 2 GB hashtables.
I am wondering whether some of you know a better solution.
(Computer has extremely high specification.)
I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug, just use a database. They are really good at answering queries like this.
I'd consider a Trie or a Directed acyclic word graph which should be more space-efficient than a hash table. Testing for membership of a string would be O(len) where len is the length of the input string, which is probably the same as a string hashing function.
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
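Here is a hedged sketch of that approach, assuming ASCII strings that fit in memory: an LSD radix sort (counting sort per character position) followed by one linear pass that counts distinct adjacent strings. The class and method names are illustrative:
using System;
using System.Linq;

static class DistinctCounter
{
    public static int CountUnique(string[] items)
    {
        if (items.Length == 0) return 0;

        int d = items.Max(s => s.Length);              // max length (~15 here)
        const int K = 128;                             // ASCII alphabet size
        string[] src = items, dst = new string[items.Length];   // note: the input array gets reordered

        // Sort from the last character position to the first (LSD radix sort).
        for (int pos = d - 1; pos >= 0; pos--)
        {
            int[] count = new int[K + 1];              // extra bucket for "no character here"
            foreach (var s in src)
                count[CharAt(s, pos)]++;
            for (int i = 1; i < count.Length; i++)     // prefix sums -> bucket end offsets
                count[i] += count[i - 1];
            for (int i = src.Length - 1; i >= 0; i--)  // stable placement
                dst[--count[CharAt(src[i], pos)]] = src[i];
            (src, dst) = (dst, src);
        }

        // One linear pass over the sorted data counts the unique strings.
        int unique = 1;
        for (int i = 1; i < src.Length; i++)
            if (src[i] != src[i - 1])
                unique++;
        return unique;
    }

    private static int CharAt(string s, int pos) =>
        pos < s.Length ? s[pos] + 1 : 0;               // shorter strings sort first
}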
If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none come built into the Framework). Be sure to get one that is balanced, like a Red-Black Tree or an AVL Tree.
The advantage is that each object in the tree is relatively small (it only contains its object and links to its parent and two children), so you can have a whole slew of them.
Also, because it's sorted, the retrieval and insertion times are both O(log n).
Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and needs only an amount of extra space equal to what you have. Start by dividing the input into pieces that fit into memory, sort those and then start merging.
With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit int, yielding roughly 4 billion unique hash values. If you have as few as 100 million unique strings, the risk of hash collision may be unacceptably high). Statistics isn't my strongest point, but some google research turns up that the probability of a collision for a perfectly distributed 32-bit hash is (N - 1) / 2^32, where N is the number of unique things that are hashed.
You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.
Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
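A hedged sketch of that partitioning idea, routing each string into one of 256 sets by hash value so no single collection grows too large; the class name and bucket count are assumptions for the example:
using System;
using System.Collections.Generic;

class PartitionedSet
{
    private readonly HashSet<string>[] _buckets;

    public PartitionedSet(int bucketCount = 256)
    {
        _buckets = new HashSet<string>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            _buckets[i] = new HashSet<string>();
    }

    // Returns true if the value has not been seen before.
    public bool Add(string value)
    {
        int bucket = (value.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
        return _buckets[bucket].Add(value);
    }

    public long UniqueCount
    {
        get
        {
            long total = 0;
            foreach (var b in _buckets) total += b.Count;
            return total;
        }
    }
}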
Divide and conquer - partition the data by the first 2 letters (say):
a dictionary of xx => dictionary of string => count
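A minimal sketch of that nested-dictionary layout (the names are mine, purely for illustration):
using System.Collections.Generic;

static class PrefixPartition
{
    // "xx" => (string => count), as in the answer above
    private static readonly Dictionary<string, Dictionary<string, int>> Partitions
        = new Dictionary<string, Dictionary<string, int>>();

    public static void Add(string word)
    {
        string prefix = word.Length >= 2 ? word.Substring(0, 2) : word;
        if (!Partitions.TryGetValue(prefix, out var counts))
            Partitions[prefix] = counts = new Dictionary<string, int>();
        counts.TryGetValue(word, out int c);
        counts[word] = c + 1;
    }

    public static int UniqueCount()
    {
        int total = 0;
        foreach (var counts in Partitions.Values) total += counts.Count;
        return total;
    }
}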
I would use a database, any database would do.
Probably the fastest because modern databases are optimized for speed and memory usage.
You need only one column with index, and then you can count the number of records.
+1 for the SQL/DB solutions; it keeps things simple and will allow you to focus on the real task at hand.
But just for academic purposes, I will like to add my 2 cents.
-1 for hashtables. (I cannot vote down yet.) Because they are implemented using buckets, the storage cost can be huge in many practical implementations. Plus I agree with Eric J: the chance of collisions will undermine the time-efficiency advantages.
Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (which will be the case if you may need to perform search-like operations on the set of strings in the future as well, and you have ample memory available), tries can be a good choice.
Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.
Below is my solution for a one-time duplicate count with limits on how much space can be used.
If we have enough storage available to hold 1 billion elements in memory, we can sort them in place with heap-sort in Θ(n log n) time and then simply traverse the collection once in O(n) time, doing this:
if (a[i] == a[i+1])
    dupCount++;
If we do not have that much memory available, we can divide the input file on disk into smaller files (till the size becomes small enough to hold the collection in memory); then sort each such small file by using the above technique; then merge them together. This requires many passes on the main input file.
I would like to keep away from quick-sort because the dataset is huge. If I could squeeze in some memory for the second case, I would rather use it to reduce the number of passes than waste it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).
Edit: SQL/DB solutions are good only when you need to store this data for a long duration.
Have you tried a Hash-map (Dictionary in .Net)?
Dictionary<String, byte> would only take up 5 bytes per entry on x86 (4 for the pointer to the string pool, 1 for the byte), which is about 400M elements. If there are many duplicates, they should be able to fit. Implementation-wise, it might be verrryy slow (or not work), since you also need to store all those strings in memory.
If the strings are very similar, you could also write your own Trie implementation.
Otherwise, your best bets would be to sort the data in-place on disk (after which counting unique elements is trivial), or use a lower-level, more memory-tight language like C++.
A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.
I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.
If what you need is a close approximation of the unique count, then look at the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes. Many modern databases have implemented it. It is pretty fast and can work with minimal memory.

Writing an Inverted Index in C# for an information retrieval application

I am writing an in-house application that holds several pieces of text information as well as a number of pieces of data about these pieces of text. These pieces of data will be held within a database (SQL Server, although this could change) in order of entry.
I'd like to be able to search for the most relevant of these pieces of information, with the most relevant of these to be at the top. I originally looked into using SQL Server Full-Text Search but it's not as flexible for my other needs as I had hoped so it seems that I'll need to develop my own solution to this.
From what I understand what is needed is an inverted index, then for the contents of said inverted index to be restored and modified based on the results of the additional information held (although for now this can be left for a later date as I just want the inverted index to index the main text from the database table/strings provided).
I've had a crack at writing this code in Java using a Hashtable with the key as the words and the value as a list of the occurrences of the word but in all honesty I'm still rather new at C# and have only really used things like DataSets and DataTables when handling information. If requested I'll upload the Java code soon once I've cleared this laptop of viruses.
If given a set of entries from a table or from a List of Strings, how could one create an inverted index in C# that will preferably save into a DataSet/DataTable?
EDIT: I forgot to mention that I have already tried Lucene and Nutch, but require my own solution as modifying Lucene to meet my needs would take far longer than writing an inverted index. I'll be handling a lot of meta-data that'll also need handling once the basic inverted index is completed, so all I require for now is a basic full-text search on one area using the inverted index. Finally, working on an inverted index isn't something I get to do every day so it'd be great to have a crack at it.
Here's a rough overview of an approach I've used successfully in C# in the past:
struct WordInfo
{
    public int position;
    public int fieldID;
}

Dictionary<string, List<WordInfo>> invertedIndex = new Dictionary<string, List<WordInfo>>();

public void BuildIndex()
{
    foreach (int fieldID in GetDatabaseFieldIDS())
    {
        string textField = GetDatabaseTextFieldForID(fieldID);
        string word;
        int position = 0;
        while (GetNextWord(textField, out word, ref position) == true)
        {
            // create the posting list for a word the first time it is seen
            if (invertedIndex.ContainsKey(word) == false)
            {
                invertedIndex.Add(word, new List<WordInfo>());
            }
            WordInfo wi = new WordInfo();
            wi.position = position;
            wi.fieldID = fieldID;
            invertedIndex[word].Add(wi);
        }
    }
}
Notes:
GetNextWord() iterates through the field and returns the next word and its position. To implement it look at using string.IndexOf() and the char type-checking methods (char.IsLetter etc).
GetDatabaseTextFieldForID() and GetDatabaseFieldIDS() are self explanatory, implement as required.
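As a hedged usage note (the Query method below is my addition, not part of the answer), looking a word up afterwards is just a dictionary read:
// Every recorded occurrence of a word; empty list for unknown words.
public List<WordInfo> Query(string word)
{
    List<WordInfo> postings;
    if (invertedIndex.TryGetValue(word, out postings))
        return postings;
    return new List<WordInfo>();
}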
Lucene.net might be your best bet. It's a mature full-text search engine using inverted indexes.
http://codeclimber.net.nz/archive/2009/09/02/lucene.net-your-first-application.aspx
UPDATE:
I wrote a little library for indexing against in-memory collections using Lucene.net - it might be useful for this. https://github.com/mcintyre321/Linqdex
If you're looking to spin your own, the Dictionary<TKey, TValue> class is most likely going to be your base, like your Java hashtables. As far as what is stored as the values in the dictionary, it's hard to tell based on the information you provide, but typically search algorithms use some type of Set structure so you can run unions and intersections. LINQ gives you much of that functionality on any IEnumerable, although a specialized Set class may boost performance.
One such implementation of a Set is in the Wintellect PowerCollections. I'm not sure if that would give you any performance benefit or not over LINQ.
As far as saving to a DataSet, I'm not sure what you're envisioning. I'm not aware of anything that "automagically" writes to a DataSet. I suspect you will have to write this yourself, especially since you mentioned several times about other third-party options not being flexible enough.
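For illustration, a hedged sketch of the union/intersection style of lookup that answer mentions, using plain HashSet<int> posting sets keyed by term (all names here are assumptions, not from the answer):
using System.Collections.Generic;
using System.Linq;

static class PostingSets
{
    // Documents containing ALL of the query terms (intersection).
    public static IEnumerable<int> MatchAll(
        Dictionary<string, HashSet<int>> postings, params string[] terms)
    {
        return terms
            .Select(t => postings.TryGetValue(t, out var docs) ? docs : new HashSet<int>())
            .Aggregate((a, b) => new HashSet<int>(a.Intersect(b)));
    }

    // Documents containing ANY of the query terms (union).
    public static IEnumerable<int> MatchAny(
        Dictionary<string, HashSet<int>> postings, params string[] terms)
    {
        return terms
            .SelectMany(t => postings.TryGetValue(t, out var docs) ? docs : Enumerable.Empty<int>())
            .Distinct();
    }
}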
