Best data structure for a collection of strings in C# [closed] - c#

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 9 years ago.
I have a huge collection of strings. I will frequently need to find all the strings that start with a given character. What would be the best collection for this? I will initialize the collection in sorted order.
Thanks

If you want a map from a character to all strings starting with that character, you might find ILookup<TKey, TElement> suitable. It's very similar to a Dictionary<TKey, TValue>, with two main differences:
Instead of a 1:1 mapping, it performs a 1:n mapping (i.e. there can be more than one value per key).
You cannot instantiate (new) nor populate it (.Add(…)) yourself; instead, you let .NET derive a fully populated instance from another collection by calling .ToLookup(…) on the latter.
Here's an example of how to build such a 1:n map:
using System; // for Console.WriteLine(…)
using System.Collections.Generic; // for List<T>
using System.Linq; // for ILookup<TKey, TElement> and .ToLookup(…)

// This represents the source of your strings. It doesn't have to be sorted:
var strings = new List<string>() { "Foo", "Bar", "Baz", "Quux", … };

// This is how you would build a 1:n lookup table mapping from first characters
// to all strings starting with that character. Empty strings are excluded:
ILookup<char, string> stringsByFirstCharacter =
    strings.Where(str => !string.IsNullOrEmpty(str)) // exclude empty strings
           .ToLookup(str => str[0]);                 // key := first character

// This is how you would look up all strings starting with 'B'.
// The output will be Bar and Baz:
foreach (string str in stringsByFirstCharacter['B'])
{
    Console.WriteLine(str);
}
P.S.: The above hyperlink for ILookup<…> (the interface) refers you to the help page for Lookup<…> (the implementation class). This is on purpose, as I find the documentation for the class easier to read. I would, however, recommend using the interface in your code.

If you need to search a huge collection of strings regularly, then use a hash table. Make sure the hash function distributes the keys evenly across the table's buckets to keep the look-up operation fast.

Well, so you need to create an index keyed on a function of the string.
For this I'd suggest using a
Dictionary<string, List<string>> data structure.
ToLookup isn't as good here because it limits your ability to manipulate the data structure afterwards (a lookup is immutable once built).
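For illustration, here is a minimal sketch of that idea, assuming the key is the first character of each string stored as a one-character string (the variable name index is made up for the example):
using System.Collections.Generic;

var index = new Dictionary<string, List<string>>();

foreach (var s in new[] { "Foo", "Bar", "Baz", "Quux" })
{
    if (string.IsNullOrEmpty(s)) continue;
    string key = s.Substring(0, 1);                    // first character as a one-character string
    if (!index.TryGetValue(key, out List<string> bucket))
    {
        bucket = new List<string>();
        index[key] = bucket;
    }
    bucket.Add(s);
}

// Unlike an ILookup, the dictionary stays mutable, so entries can be added or removed later:
var startingWithB = index["B"];                        // contains "Bar" and "Baz"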

Related

Generate short, unique identifiers [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I'm looking for an algorithm which generates identifiers suitable for both, external use in e.g. URLs as well as persistence with the following requirements:
Short, like a max. of 8 characters
URL-friendly, so no special characters
Human-friendly, e.g. no ambiguous characters like L/l, 0/O
Incremental for fast indexing
Random to prevent guessing without knowing the algorithm (would be nice, but not important)
Unique without requiring a check against the database
I looked at various solutions, but all I found have some major tradeoffs. For example:
GUID: Too long, not incremental
GUID base64 encoded: Still too long, not incremental
GUID ascii85 encoded: Short, not incremental, too many unsuitable characters
GUID encodings like base32, base36: Short, but loss of information
Comb GUID: Too long, however incremental
All others based on random: Require checking the DB for uniqueness
Time-based: Prone to collisions in clustered or multi-threaded environments
Edit: Why has this been marked off-topic? The requirements describe a specific problem to which numerous legitimate solutions can be provided. In fact, some of the solutions here are so good, I'm struggling with choosing the one to mark as answer.
If at all possible I'd keep the user requirements (short, readable) and the database requirements (incremental, fast indexing) separate. User-facing requirements change. You don't want to have to modify your tables because tomorrow you decide to change the length or other specifics of your user-facing ID.
One approach is to generate your ID using user-friendly characters, like
23456789ABCDEFGHJKLMNPQRSTUVWXYZ and just make it random.
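For illustration, here is a minimal sketch of generating such a random, user-friendly ID; the 8-character length and the method name GenerateFriendlyId are assumptions for the example:
using System.Security.Cryptography;

public static class FriendlyId
{
    // 32 unambiguous characters; since 256 is divisible by 32, the modulo below introduces no bias.
    private const string Alphabet = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ";

    // Hypothetical helper: returns something like "7GKP2XWZ".
    public static string GenerateFriendlyId(int length = 8)
    {
        var bytes = new byte[length];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            chars[i] = Alphabet[bytes[i] % Alphabet.Length];
        }
        return new string(chars);
    }
}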
But when inserting into the database, don't make that value the primary key for the record it references or even store it in that table. Insert it into its own table with an identity primary key, and then store that int or bigint key with your record.
That way your primary table can have an incremental primary key. If you need to reference a record by its "friendly" ID then you join to your friendly ID table.
My guess is that if you're generating a high enough volume of these IDs that you're concerned about index performance then the rate at which human users retrieve those values will be much lower. So the slightly slower lookup of the random value in the friendly ID table won't be a problem.
The following uses a combination of an ID that is known to be unique (because it comes from a unique ID column in a relational database) and a random sequence of letters and numbers to generate a token:
using System.Linq;                      // for .Select(…) and .ToArray()
using System.Security.Cryptography;     // for RNGCryptoServiceProvider

public static string GenerateAccessToken(string uniqueId) // generates a unique, random, and alphanumeric token
{
    const string availableChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    using (var generator = new RNGCryptoServiceProvider())
    {
        var bytes = new byte[16];
        generator.GetBytes(bytes);
        var chars = bytes.Select(b => availableChars[b % availableChars.Length]);
        var token = new string(chars.ToArray());
        return uniqueId + token;
    }
}
The token is guaranteed to be both unique and random (or at least "pseudo random"). You can manipulate the length by changing the length of bytes.
To avoid confusion between "0" and "O" or "l" and "1", you can remove those characters from availableChars.
Edit
I just realized this doesn't quite pass the "no database check" requirement, though when I've used code like this, I've always already had an entity in memory that I knew contained a unique ID, so I'm hoping the same applies to your situation. I don't think it's possible to quite achieve all your requirements, so I'm hoping this would still be a good balance of attributes.
Have you tried proquints?
A Proquint is a PRO-nouncable QUINT-uplet of alternating unambiguous consonants and vowels, for example: "lusab".
I think they meet almost all your requirements.
See the proposal here.
And here is the official implementation in C and Java.
I've worked on a port to .NET that you can download as Proquint.NET.
A simple solution I implemented before does not fulfill all of your constraints, but it might be acceptable if you think about your problem a little differently.
First, I used a function to obfuscate the database id: func(id) => y, and func(y) => id. (I used a Feistel cipher; here is an example of implementing such a function.) Second, convert the obfuscated id to base 62 so it becomes short and URL-friendly. (You can use a smaller character set to make it human-friendly.) This creates a one-to-one mapping from database ids to string identifiers. In my implementation, 1 and 2 map to 2PawdM and 5eeGE8 respectively, and I can get the database ids 1 and 2 back from the strings 2PawdM and 5eeGE8. The mapping would be entirely different with a different obfuscation function.
With this solution the identifiers themselves are NOT incremental; however, because the identifiers map directly to database ids, you can compute the corresponding database id and run any query directly against the id column instead. You don't need to generate a string identifier and store it in the database, and uniqueness is guaranteed by the database itself when you store the record with an auto-incremented id column.
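For illustration, here is a minimal sketch of the base-62 step only (the Feistel obfuscation is omitted; the names ToBase62 and FromBase62 are made up for the example):
using System.Text;

public static class Base62
{
    private const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Hypothetical helper: encodes a non-negative obfuscated id as base 62.
    public static string ToBase62(long value)
    {
        if (value == 0) return "0";
        var sb = new StringBuilder();
        while (value > 0)
        {
            sb.Insert(0, Digits[(int)(value % 62)]);
            value /= 62;
        }
        return sb.ToString();
    }

    // Hypothetical helper: decodes the string back to the obfuscated id.
    public static long FromBase62(string s)
    {
        long value = 0;
        foreach (char c in s)
        {
            value = value * 62 + Digits.IndexOf(c);
        }
        return value;
    }
}
The obfuscation function would be applied before ToBase62 and after FromBase62, so the round trip recovers the original database id.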

List vs Dictionary when referring by index/key [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
I have mainly been using lists to retrieve small amounts of data from a database that feeds into a web application, but I have recently come across dictionaries, which produce more readable code with keys. What is the performance difference when just referring to items by index/key?
I understand that a dictionary uses more memory, but what is best practice in this scenario, and is it worth the performance/maintenance trade-off, bearing in mind that I will not be performing searches or sorting the data?
When you want to find an item in a list, you may have to look at ALL the items until you find the one with the matching key.
Let's look at a basic example. You have
public class Person
{
    public int ID { get; set; }
    public string Name { get; set; }
}
and you have a collection List<Person> persons, and you want to find a person by ID:
var person = persons.FirstOrDefault(x => x.ID == 5);
As written it has to enumerate the entire List until it finds the entry in the List that has the correct ID (does entry 0 match the lambda? No... Does entry 1 match the lambda? No... etc etc). This is O(n).
However, if you look it up through a Dictionary<int, Person> dictPersons:
var person = dictPersons[5];
If you want to find a certain element by key in a dictionary, it can jump straight to where the item is stored - this is O(1) per lookup, or O(n) if you do one lookup for every person. (If you want to know how this is done: the Dictionary runs a mathematical operation on the key, which turns it into a position inside the dictionary - the same position the item was placed at when it was inserted. This is called a hash function.)
So a Dictionary is faster than a List because the Dictionary does not iterate through the whole collection; instead it takes the item from the exact place the hash function calculates. It is a better algorithm.
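For illustration, here is a minimal sketch of building and querying such a dictionary, reusing the persons list and the dictPersons name from the example above:
using System.Collections.Generic;
using System.Linq;

// Build the dictionary once, keyed by ID (assumes IDs are unique):
Dictionary<int, Person> dictPersons = persons.ToDictionary(p => p.ID);

// O(n): scans the list until a match is found.
var fromList = persons.FirstOrDefault(x => x.ID == 5);

// O(1): jumps straight to the bucket computed from the key's hash.
var fromDict = dictPersons[5];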
Dictionary relies on chaining (maintaining a list of items for each hash table bucket) to resolve collisions, whereas Hashtable uses rehashing for collision resolution (when a collision occurs, it tries another hash function to map the key to a bucket). You can read up on how hash functions work and on the difference between chaining and rehashing.
Unless you're actually experiencing performance issues and need to optimize it's better to go with what's more readable and maintainable. That's especially true since you mentioned that it's small amounts of data. Without exaggerating - it's possible that over the life of the application the cumulative difference in performance (if any) won't equal the time you save by making your code more readable.
To put it in perspective, consider the work that your application already does just to read request headers and parse views and read values from configuration files. Not only will the difference in performance between the list and the dictionary be small, it will also be a tiny fraction of the overall processing your application does just to serve a single page request.
And even then, if you were to see performance issues and needed to optimize, there would probably be plenty of other optimizations (like caching) that would make a bigger difference.

Faster Update of Dictionary values [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
My dictionary is
Dictionary<string, string> d = new Dictionary<string, string>();
I'm iterating through an XML file (very large) and saving key/value pairs in a dictionary.
The following snippet of code is very slow in execution and I want to make it faster. It takes more than an hour to complete; by the end, my ctr value reaches 3,332,130.
if (d.ContainsKey(dKey))
{
    dValue = d[dKey];
    d[dKey] = dValue + "," + ctr;
}
else
    d.Add(dKey, ctr.ToString());
ctr++;
3,332,130 entries is a lot to hold in memory; ideally you should not keep such a big collection in memory at all.
That being said, let's try to optimize this.
var d = new Dictionary<string, StringBuilder>();
StringBuilder builder;
if (d.TryGetValue(dKey, out builder))
{
    builder.Append(",");
    builder.Append(ctr);
}
else
{
    d.Add(dKey, new StringBuilder(ctr.ToString()));
}
String concatenation in a tight loop is awfully slow; use a
StringBuilder instead.
Use TryGetValue, which saves you the separate lookup dValue = d[dKey];.
I believe this should increase performance significantly.
Performing a number of repeated concatenations, not known at compile time, on large strings is an inherently wasteful thing to do. If you end up concatenating a lot of values together, and they are not particularly small, that could easily be the source of your problem.
If so, it would have nothing at all to do with the dictionary. You should consider using a StringBuilder, or building up a collection of separate strings that you can join using string.Join once you have all of the strings you'll need for that value.
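For illustration, here is a minimal sketch of the second suggestion: collect the counter values per key while looping, then join each list once at the end (this reuses the dKey and ctr variables from the question):
using System.Collections.Generic;
using System.Linq;

var d = new Dictionary<string, List<string>>();

// Inside the loop: just collect the pieces; no string concatenation happens here.
List<string> parts;
if (!d.TryGetValue(dKey, out parts))
{
    parts = new List<string>();
    d.Add(dKey, parts);
}
parts.Add(ctr.ToString());
ctr++;

// After the loop: join each list exactly once.
var result = d.ToDictionary(kvp => kvp.Key, kvp => string.Join(",", kvp.Value));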
You may want to consider using StringBuilders instead of strings:
var d = new Dictionary<string, StringBuilder>();
And append the values like this:
if (d.ContainsKey(dKey))
{
    d[dKey].Append("," + ctr);
}
else
    d.Add(dKey, new StringBuilder(ctr.ToString()));
++ctr;
But I suspect that the bottleneck is in fact somewhere else.
In addition to the string concatenation improvements, you can also split your XML into several data sets and then populate a ConcurrentDictionary with them in parallel. Depending on your data and the framework you are using, the performance could improve several times over.
More examples here and here
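For illustration, here is a minimal sketch of that idea; the chunk shape and variable names are assumptions, standing in for whatever your XML splitting produces:
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Assume each chunk is a list of (key, counter) pairs parsed from one piece of the XML.
var chunks = new[]
{
    new List<(string Key, long Ctr)> { ("a", 1), ("b", 2) },
    new List<(string Key, long Ctr)> { ("a", 3), ("c", 4) },
};

var map = new ConcurrentDictionary<string, ConcurrentQueue<long>>();

Parallel.ForEach(chunks, chunk =>
{
    foreach (var (key, ctr) in chunk)
    {
        // ConcurrentQueue is safe to append to from multiple threads.
        map.GetOrAdd(key, _ => new ConcurrentQueue<long>()).Enqueue(ctr);
    }
});

// Join each queue once at the end, as in the StringBuilder approach.
// Note: with parallel population, the order of counters within a key is no longer guaranteed.
var result = map.ToDictionary(kvp => kvp.Key, kvp => string.Join(",", kvp.Value));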

Efficiently find nearest dictionary key

I have a bunch of pairs of dates and monetary values in a SortedDictionary<DateTime, decimal>, corresponding to loan balances calculated into the future at contract-defined compounding dates. Is there an efficient way to find a date key that is nearest to a given value? (Specifically, the nearest key less than or equal to the target). The point is to store only the data at the points when the value changed, but efficiently answer the question "what was the balance on x date?" for any date in range.
A similar question was asked ( What .NET dictionary supports a "find nearest key" operation? ) and the answer was "no" at the time, at least from the people who responded, but that was almost 3 years ago.
The question How to find point between two keys in sorted dictionary presents the obvious solution of naively iterating through all keys. I am wondering if any built-in framework function exists to take advantage of the fact that the keys are already indexed and sorted in memory -- or alternatively a built-in Framework collection class that would lend itself better to this kind of query.
Since SortedDictionary is sorted on the key, you can create a sorted list of keys with
var keys = new List<DateTime>(dictionary.Keys);
and then efficiently perform binary search on it:
var index = keys.BinarySearch(key);
As the documentation says, if index is positive or zero then the key exists; if it is negative, then ~index is the index where key would be found at if it existed. Therefore the index of the "immediately smaller" existing key is ~index - 1. Make sure you handle correctly the edge case where key is smaller than any of the existing keys and ~index - 1 == -1.
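For illustration, here is a minimal sketch of that lookup, wrapped in a hypothetical helper named TryFindFloorKey that returns the largest key less than or equal to the target:
using System;
using System.Collections.Generic;

static bool TryFindFloorKey(List<DateTime> keys, DateTime target, out DateTime floorKey)
{
    int index = keys.BinarySearch(target);
    if (index < 0)
        index = ~index - 1;        // index of the immediately smaller existing key
    if (index < 0)                 // target is smaller than every key in the list
    {
        floorKey = default(DateTime);
        return false;
    }
    floorKey = keys[index];
    return true;
}

// Usage: if TryFindFloorKey(keys, queryDate, out var balanceDate) returns true,
// dictionary[balanceDate] is the balance in effect on queryDate.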
Of course the above approach really only makes sense if keys is built up once and then queried repeatedly; since it involves iterating over the whole sequence of keys and doing a binary search on top of that there's no point in trying this if you are only going to search once. In that case even naive iteration would be better.
Update
As digEmAll correctly points out, you could also switch to SortedList<DateTime, decimal> so that the Keys collection implements IList<T> (which SortedDictionary.Keys does not). That interface provides enough functionality to perform a binary search on it manually, so you could take e.g. this code and make it an extension method on IList<T>.
You should also keep in mind that SortedList performs worse than SortedDictionary during construction if the items are not inserted in already-sorted order, although in this particular case it is highly likely that dates are inserted in chronological (sorted) order which would be perfect.
So, this doesn't directly answer your question, because you specifically asked for something built-in to the .NET framework, but facing a similar problem, I found the following solution to work best, and I wanted to post it here for other searchers.
I used the TreeDictionary<K, V> from the C5 Collections (GitHub/NuGet), which is an implementation of a red-black tree.
It has Predecessor/TryPredecessor and WeakPredecessor/TryWeakPredecessor methods (as well as similar methods for successors) to easily find the items nearest to a key.
More useful in your case, I think, is the RangeFrom/RangeTo/RangeFromTo methods that allow you to retrieve a range of key-value-pairs between keys.
Note that all of these methods can also be applied to the TreeDictionary<K, V>.Keys collection, which allow you to work with only the keys as well.
It really is a very neat implementation, and something like it deserves to be in the BCL.
It is not possible to efficiently find the nearest key with SortedList, SortedDictionary or any other "built-in" .NET type, if you need to interleave queries with inserts (unless your data arrives pre-sorted, or the collection is always small).
As I mentioned on the other question you referenced, I created three data structures related to B+ trees that provide find-nearest-key functionality for any sortable data type: BList<T>, BDictionary<K,V> and BMultiMap<K,V>. Each of these data structures provide FindLowerBound() and FindUpperBound() methods that work like C++'s lower_bound and upper_bound.
These are available in the Loyc.Collections NuGet package, and BDictionary typically uses about 44% less memory than SortedDictionary.
// Rounds a date down to the start of its compounding period.
// PeriodLength is assumed here to be a TimeSpan field defining the length of one period.
public static DateTime RoundDown(DateTime dateTime)
{
    long remainingTicks = dateTime.Ticks % PeriodLength.Ticks;
    return dateTime - new TimeSpan(remainingTicks);
}

Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and they have zero or one matching strings in the other object.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array once for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
    from item1 in array1
    join item2 in array2 on item1.Name equals item2.Name
    select new { item1, item2 };

foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
Sort the second array using the Array.Sort method, then for each object in the first array find its match in the second array using a binary search (Array.BinarySearch).
Generally, for 30-50 items this would be a little faster than the brute-force x*y approach.
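For illustration, here is a minimal sketch of that approach; the Item class and the PairByName method name are assumptions standing in for your actual object type:
using System;
using System.Collections.Generic;

class Item
{
    public string Name { get; set; }
    // ... other members
}

class ByNameComparer : IComparer<Item>
{
    public int Compare(Item x, Item y) => string.CompareOrdinal(x.Name, y.Name);
}

static List<(Item First, Item Second)> PairByName(Item[] array1, Item[] array2)
{
    var comparer = new ByNameComparer();
    Array.Sort(array2, comparer);                                // O(n log n), done once

    var pairs = new List<(Item, Item)>();
    foreach (var item in array1)
    {
        int index = Array.BinarySearch(array2, item, comparer);  // O(log n) per item
        if (index >= 0)                                          // a negative index means no match
            pairs.Add((item, array2[index]));
    }
    return pairs;
}
Note that this sorts array2 in place; sort a copy instead if the original order matters elsewhere.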
