Quick (sub)string search in large set of data - c#

Given a city:
public class City
{
public int Id { get; set; }
public string Name { get; set; }
public string Country { get; set; }
public LatLong Location { get; set; }
}
I have a list of close to 3,000,000 cities (and towns and villages etc.) in a file. This file is read into memory; I have been playing with arrays, lists, dictionaries (key = Id) etc.
I want to find, as quickly as possible, all cities matching a substring (case insensitive). So when I search for 'yor' I want to get all matches (1000+) ASAP (matching 'York Town', 'Villa Mayor', 'New York', ...).
Functionally you could write this as:
cities.Values.Where(c => c.Name.IndexOf("yor", StringComparison.OrdinalIgnoreCase) >= 0)
I don't mind doing some pre-processing when reading the file; as a matter of fact: that's what I'm mostly looking for. Read the file, "chew" on the data creating some sort of index or... and then be ready to answer queries like "yor".
I want this to be standalone and self-contained. I do not want to add dependencies like an RDBMS, ElasticSearch or whatever. I don't mind having (parts of) the list in memory more than once, and I don't mind spending some memory on a data structure to help me find my results quickly. I don't want libraries or packages; I want an algorithm I can implement myself.
Basically I want the above LINQ statement, but optimized for my case; currently, plowing through almost 3,000,000 records takes about 2 seconds. I want this to take less than 0.1 second so I can use the search and its results as 'autocomplete'.
Creating an "index"(-alike) structure is probably what I need. As I'm writing I remember something about a "bloom filter" but I'm not sure if that would help or even supports substring search. Will look into that now.
Any tips, pointers, help very much appreciated.

I created a bit of a hybrid based on a suffix array / dictionary. Thanks to saibot for suggesting it first and all other people helping and suggesting.
This is what I came up with:
public class CitiesCollection
{
private Dictionary<int, City> _cities;
private SuffixDict<int> _suffixdict;
public CitiesCollection(IEnumerable<City> cities, int minLen)
{
_cities = cities.ToDictionary(c => c.Id);
_suffixdict = new SuffixDict<int>(minLen, _cities.Values.Count);
foreach (var c in _cities.Values)
_suffixdict.Add(c.Name, c.Id);
}
public IEnumerable<City> Find(string find)
{
var normalizedFind = _suffixdict.NormalizeString(find);
foreach (var id in _suffixdict.Get(normalizedFind).Where(v => _cities[v].Name.IndexOf(normalizedFind, StringComparison.OrdinalIgnoreCase) >= 0))
yield return _cities[id];
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, int capacity)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix))
AddDict(s, value);
}
public IEnumerable<T> Get(string suffix)
{
return Find(suffix).Distinct();
}
private IEnumerable<T> Find(string suffix)
{
foreach (var s in GetSuffixes(suffix))
{
if (_dict.TryGetValue(s, out var result))
foreach (var i in result)
yield return i;
}
}
public string NormalizeString(string value)
{
return value.Normalize().ToLowerInvariant();
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
private IEnumerable<string> GetSuffixes(string value)
{
var nv = NormalizeString(value);
for (var i = 0; i <= nv.Length - _suffixsize ; i++)
yield return nv.Substring(i, _suffixsize);
}
}
Usage (where I assume mycities to be an IEnumerable<City> with the given City object from the question):
var cc = new CitiesCollection(mycities, 3);
var results = cc.Find("york");
Some results:
Find: sterda elapsed: 00:00:00.0220522 results: 32
Find: york elapsed: 00:00:00.0006212 results: 155
Find: dorf elapsed: 00:00:00.0086439 results: 6095
Memory usage is very, very acceptable. Only 650MB total having the entire collection of 3,000,000 cities in memory.
In the above I'm storing Id's in the "SuffixDict" and I have a level of indirection (dictionary lookups to find id=>city). This can be further simplified to:
public class CitiesCollection
{
private SuffixDict<City> _suffixdict;
public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
{
_suffixdict = new SuffixDict<City>(minLen, capacity);
foreach (var c in cities)
_suffixdict.Add(c.Name, c);
}
public IEnumerable<City> Find(string find, StringComparison stringComparison = StringComparison.OrdinalIgnoreCase)
{
var normalizedFind = SuffixDict<City>.NormalizeString(find);
foreach (var city in _suffixdict.Find(normalizedFind).Where(v => v.Name.IndexOf(normalizedFind, stringComparison) >= 0))
yield return city;
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, int capacity = 1000)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix, _suffixsize))
AddDict(s, value);
}
public IEnumerable<T> Find(string suffix)
{
var normalizedfind = NormalizeString(suffix);
var find = normalizedfind.Substring(0, Math.Min(normalizedfind.Length, _suffixsize));
if (_dict.TryGetValue(find, out var result))
foreach (var i in result)
yield return i;
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
public static string NormalizeString(string value)
{
return value.Normalize().ToLowerInvariant();
}
private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
{
var nv = NormalizeString(value);
if (nv.Length < suffixSize) // compare the normalized length, since that is the string being sliced
{
yield return nv;
}
else
{
for (var i = 0; i <= nv.Length - suffixSize; i++)
yield return nv.Substring(i, suffixSize);
}
}
}
This bumps the load time up from 00:00:16.3899085 to 00:00:25.6113214, memory usage goes down from 650MB to 486MB. Lookups/searches perform a bit better since we have one less level of indirection.
Find: sterda elapsed: 00:00:00.0168616 results: 32
Find: york elapsed: 00:00:00.0003945 results: 155
Find: dorf elapsed: 00:00:00.0062015 results: 6095
I'm happy with the results so far. I'll do a little polishing and refactoring and call it a day! Thanks everybody for the help!
And this is how it performs with 2,972,036 cities:
This has evolved into a case-insensitive, accent-insensitive search by modifying the code to this:
public static class ExtensionMethods
{
public static T FirstOrDefault<T>(this IEnumerable<T> src, Func<T, bool> testFn, T defval)
{
return src.Where(aT => testFn(aT)).DefaultIfEmpty(defval).First();
}
public static int IndexOf(this string source, string match, IEqualityComparer<string> sc)
{
return Enumerable.Range(0, source.Length) // for each position in the string
.FirstOrDefault(i => // find the first position where either
// match is Equals at this position for length of match (or to end of string) or
sc.Equals(source.Substring(i, Math.Min(match.Length, source.Length - i)), match) ||
// match is Equals to on of the substrings beginning at this position
Enumerable.Range(1, source.Length - i - 1).Any(ml => sc.Equals(source.Substring(i, ml), match)),
-1 // else return -1 if no position matches
);
}
}
public class CaseAccentInsensitiveEqualityComparer : IEqualityComparer<string>
{
private static readonly CompareOptions _compareoptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth | CompareOptions.IgnoreSymbols;
private static readonly CultureInfo _cultureinfo = CultureInfo.InvariantCulture;
public bool Equals(string x, string y)
{
return string.Compare(x, y, _cultureinfo, _compareoptions) == 0;
}
public int GetHashCode(string obj)
{
return obj != null ? RemoveDiacritics(obj).ToUpperInvariant().GetHashCode() : 0;
}
private string RemoveDiacritics(string text)
{
return string.Concat(
text.Normalize(NormalizationForm.FormD)
.Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
).Normalize(NormalizationForm.FormC);
}
}
public class CitiesCollection
{
private SuffixDict<City> _suffixdict;
private HashSet<string> _countries;
private Dictionary<int, City> _cities;
private readonly IEqualityComparer<string> _comparer = new CaseAccentInsensitiveEqualityComparer();
public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
{
_suffixdict = new SuffixDict<City>(minLen, _comparer, capacity);
_countries = new HashSet<string>();
_cities = new Dictionary<int, City>(capacity);
foreach (var c in cities)
{
_suffixdict.Add(c.Name, c);
_countries.Add(c.Country);
_cities.Add(c.Id, c);
}
}
public City this[int index] => _cities[index];
public IEnumerable<string> Countries => _countries;
public IEnumerable<City> Find(string find, StringComparison stringComparison = StringComparison.OrdinalIgnoreCase)
{
foreach (var city in _suffixdict.Find(find).Where(v => v.Name.IndexOf(find, _comparer) >= 0))
yield return city;
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, IEqualityComparer<string> stringComparer, int capacity = 1000)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity, stringComparer);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix, _suffixsize))
AddDict(s, value);
}
public IEnumerable<T> Find(string suffix)
{
var find = suffix.Substring(0, Math.Min(suffix.Length, _suffixsize));
if (_dict.TryGetValue(find, out var result))
{
foreach (var i in result)
yield return i;
}
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
{
if (value.Length < suffixSize) // names shorter than the suffix size are indexed as a whole
{
yield return value;
}
else
{
for (var i = 0; i <= value.Length - suffixSize; i++)
yield return value.Substring(i, suffixSize);
}
}
}
With credit also to Netmage and Mitsugui. There are still some issues / edge-cases but it's continually improving!

You could use a suffix tree: https://en.wikipedia.org/wiki/Suffix_tree
It requires enough space to store about 20 times your list of words in memory
Suffix array is a space efficient alternative: https://en.wikipedia.org/wiki/Suffix_array
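To make the suffix array idea concrete, here is a minimal sketch, not taken from any of the answers. It assumes the names fit in memory and that ordinal comparison on lower-cased names is good enough; NameSuffixArray and its members are hypothetical names. Every suffix of every name is stored as a (name index, offset) pair, the pairs are sorted by the suffix text, and a query becomes a binary search for the block of suffixes that start with it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: a naive suffix array over a list of names.
public sealed class NameSuffixArray
{
    private readonly string[] _names;                     // lower-cased copies of the names
    private readonly (int Name, int Offset)[] _suffixes;  // sorted by suffix text

    public NameSuffixArray(IEnumerable<string> names)
    {
        _names = names.Select(n => n.ToLowerInvariant()).ToArray();

        // One entry per suffix of every name.
        var entries = new List<(int Name, int Offset)>();
        for (var n = 0; n < _names.Length; n++)
            for (var o = 0; o < _names[n].Length; o++)
                entries.Add((n, o));

        _suffixes = entries.ToArray();
        Array.Sort(_suffixes, (a, b) =>
        {
            var x = _names[a.Name];
            var y = _names[b.Name];
            return string.CompareOrdinal(x, a.Offset, y, b.Offset, Math.Max(x.Length, y.Length));
        });
    }

    // Compares only the first q.Length characters of the suffix with q.
    private int ComparePrefix((int Name, int Offset) s, string q) =>
        string.CompareOrdinal(_names[s.Name], s.Offset, q, 0, q.Length);

    // Returns the indices (into the original list) of all names containing the query.
    public IEnumerable<int> Find(string query)
    {
        var q = query.ToLowerInvariant();
        if (q.Length == 0) yield break;

        // Lower bound: first suffix whose prefix is >= the query.
        int lo = 0, hi = _suffixes.Length;
        while (lo < hi)
        {
            var mid = lo + (hi - lo) / 2;
            if (ComparePrefix(_suffixes[mid], q) < 0) lo = mid + 1;
            else hi = mid;
        }

        // All matching suffixes are adjacent; a name may match more than once.
        var seen = new HashSet<int>();
        for (var i = lo; i < _suffixes.Length && ComparePrefix(_suffixes[i], q) == 0; i++)
            if (seen.Add(_suffixes[i].Name))
                yield return _suffixes[i].Name;
    }
}
```

This stores one entry per character of input, so for millions of names the array itself is the main memory cost, and the comparison-based sort is the simple rather than the fast way to build it.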

In my benchmark, Contains was considerably faster than IndexOf(...) >= 0 (note that Contains with a single argument is case-sensitive):
cities.Values.Where(c => c.Name.Contains("yor"))

Related

C# sorted list - fast, with removable, duplicated Keys

I'm making an application with a compression mechanism and need my own dictionary. On every cycle, my app adds a new element to myDictionary and updates it (appends a char to some previous elements in myDictionary). I was doing this with a normal list and a quicksort function, but it was really slow. I'm looking for a new way to do this, but SortedList, Dictionary and Lookup don't seem to be what I'm looking for. Is it better to write my own SortedList, or is that too hard/complex to manage?
Some of the code:
public class MyDictionary
{
private List<string> Contexts;
private List<string> Contents;
private int Count; //words count
// Constructor
public MyDictionary()
{
Count = 0;
Contexts = new List<string>();
Contents = new List<string>();
}
#region Public Functions
public void AddChar(char ch, int contentSize)
{
for (int i = 0; i < Count; i++)
{
if (Contents[i].Length < contentSize)
{
Contents[i] = Contents[i] + ch;
}
}
}
public void Add(string context, string content)
{
Contexts.Add(Reverse(context)); // reversed context
Contents.Add(content);
Count++;
}
public void update()
{
quicksort(Contexts, Contents, 0, Count-1);
}
private void quicksort(List<String> context, List<String> content, int left, int right)
{
int i = left, j = right;
string pivot = context[(left + right) / 2];
while (i <= j)
{
while (context[i].CompareTo(pivot) < 0)
{
i++;
}
while (context[j].CompareTo(pivot) > 0)
{
j--;
}
if (i <= j)
{
swap(i,j);
i++;
j--;
}
}
// Recursive calls
if (left < j)
{
quicksort(context, content, left, j);
}
if (i < right)
{
quicksort(context, content, i, right);
}
}
private static string Reverse(string s)
{
char[] charArray = s.ToCharArray();
Array.Reverse(charArray);
return new string(charArray);
}
Here is a class that acts like a SortedDictionary, but can hold multiple values with the same key. You may need to flesh it out a little bit, with methods like Remove, and adding support for your own IComparer<TKey> if you need them.
public class SortedMultiValue<TKey, TValue> : IEnumerable<TValue>
{
private SortedDictionary<TKey, List<TValue>> _data;
public SortedMultiValue()
{
_data = new SortedDictionary<TKey, System.Collections.Generic.List<TValue>>();
}
public void Clear()
{
_data.Clear();
}
public void Add(TKey key, TValue value)
{
if (!_data.TryGetValue(key, out List<TValue> items))
{
items = new List<TValue>();
_data.Add(key, items);
}
items.Add(value);
}
public IEnumerable<TValue> Get(TKey key)
{
if (_data.TryGetValue(key, out List<TValue> items))
{
return items;
}
throw new KeyNotFoundException();
}
public IEnumerator<TValue> GetEnumerator()
{
return CreateEnumerable().GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator()
{
return CreateEnumerable().GetEnumerator();
}
IEnumerable<TValue> CreateEnumerable()
{
foreach (IEnumerable<TValue> values in _data.Values)
{
foreach (TValue value in values)
{
yield return value;
}
}
}
}
You can use it like this:
var data = new SortedMultiValue<string, string>();
data.Add("Dog", "Buddy");
data.Add("Dog", "Mr. Peanutbutter");
data.Add("cat", "Charlie");
data.Add("cat", "Sam");
data.Add("cat", "Leo");
foreach (string item in data)
{
Console.WriteLine(item);
}
Console.WriteLine();
foreach (string item in data.Get("cat"))
{
Console.WriteLine(item);
}
Console.WriteLine();
foreach (string item in data.Get("Dog"))
{
Console.WriteLine(item);
}
It produces this as the output (notice that the first group of names is sorted by the key they were inserted with):
Charlie
Sam
Leo
Buddy
Mr. Peanutbutter
Charlie
Sam
Leo
Buddy
Mr. Peanutbutter
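The answer leaves Remove and custom IComparer<TKey> support to the reader. The following is a rough sketch of how those two pieces might look, assuming the same SortedDictionary-of-lists layout. SortedMultiMap and its member names are hypothetical and not part of the original answer.

```csharp
using System.Collections.Generic;

// Hypothetical variant of the class above with Remove and comparer support.
public class SortedMultiMap<TKey, TValue>
{
    private readonly SortedDictionary<TKey, List<TValue>> _data;

    // Pass a comparer to control key order; the default comparer is used otherwise.
    public SortedMultiMap(IComparer<TKey> comparer = null)
    {
        _data = new SortedDictionary<TKey, List<TValue>>(comparer ?? Comparer<TKey>.Default);
    }

    public void Add(TKey key, TValue value)
    {
        if (!_data.TryGetValue(key, out var items))
        {
            items = new List<TValue>();
            _data.Add(key, items);
        }
        items.Add(value);
    }

    // Removes one occurrence of value under key; drops the key once its list is empty.
    public bool Remove(TKey key, TValue value)
    {
        if (!_data.TryGetValue(key, out var items) || !items.Remove(value))
            return false;
        if (items.Count == 0)
            _data.Remove(key);
        return true;
    }

    // Removes the key together with all of its values.
    public bool RemoveAll(TKey key)
    {
        return _data.Remove(key);
    }

    // All values, grouped by key in comparer order.
    public IEnumerable<TValue> Values
    {
        get
        {
            foreach (var items in _data.Values)
                foreach (var item in items)
                    yield return item;
        }
    }
}
```

Removing a value is linear in the number of values stored under that key, which is fine for short lists but worth revisiting if one key collects many entries.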

c# Dictionary<object, T> lookup value

Not sure how to best phrase this which is probably why I'm having difficulty looking it up. Here is a sample console application to demonstrate my meaning.
class Program
{
static void Main(string[] args)
{
var item1 = new Item("Number");
var item2 = new Item("Number");
var dict = new Dictionary<Item, string>();
dict.Add(item1, "Value");
Console.WriteLine(dict.ContainsKey(item2));
var dict2 = new Dictionary<string, string>();
dict2.Add("Number", "Value");
Console.WriteLine(dict2.ContainsKey("Number"));
Console.Read();
}
class Item
{
readonly string number;
public Item(string number)
{
this.number = number;
}
}
}
In this example dict.ContainsKey(item2) returns false and dict2.ContainsKey("Number") returns true. Can Item be defined in such a way that it would behave like a string? The best I can come up with is
static void Main(string[] args)
{
var item1 = new Item("Number");
var item2 = new Item("Number");
var dict = new Dictionary<string, string>();
dict.Add(item1.ToString(), "Test");
Console.WriteLine(dict.ContainsKey(item2.ToString()));
Console.Read();
}
class Item
{
readonly string number;
public Item(string number)
{
this.number = number;
}
public override string ToString()
{
return number;
}
}
This example is contrived; Item would have more fields and ToString() would join them all up.
You need to override Equals and GetHashCode. Dictionary uses the Equals and GetHashCode methods to compare keys for equality.
class Item
{
readonly string number;
public Item(string number)
{
this.number = number;
}
public override bool Equals(object obj)
{
return Equals(obj as Item);
}
public override int GetHashCode()
{
// this is c# 6 feature
return number?.GetHashCode() ?? 0;
// If you are not using c# 6, you can use
// return number == null ? 0 : number.GetHashCode();
}
private bool Equals(Item another)
{
if (another == null)
return false;
return number == another.number;
}
}
If you have more than one field, you need to account for all of them in both Equals and GetHashCode.
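As an illustration of the multi-field case, here is a small sketch with two fields. The class name MultiFieldItem and the second field, region, are made up for the example. The hash is combined with the common multiply-and-add pattern, which is one reasonable choice among several.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical two-field version of Item; "region" is an invented field.
public sealed class MultiFieldItem : IEquatable<MultiFieldItem>
{
    private readonly string number;
    private readonly string region;

    public MultiFieldItem(string number, string region)
    {
        this.number = number;
        this.region = region;
    }

    // Two items are equal only if every field matches.
    public bool Equals(MultiFieldItem other)
    {
        if (other is null)
            return false;
        return number == other.number && region == other.region;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as MultiFieldItem);
    }

    // Every field used in Equals also feeds the hash code.
    public override int GetHashCode()
    {
        unchecked // integer overflow is fine for hashing
        {
            var hash = 17;
            hash = hash * 31 + (number?.GetHashCode() ?? 0);
            hash = hash * 31 + (region?.GetHashCode() ?? 0);
            return hash;
        }
    }
}
```

With this in place, two separately constructed instances holding the same field values find each other in a Dictionary, while instances differing in any field do not.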

Return a finite set matching a regex expression

Something similar to http://regexio.com/prototype.html, I'm trying to get a set matching a particular regex.
Basically, you need to parse the regular expression and then, instead of reading input while walking the parsed expression, output the variants.
I have hacked the following program doing what you need for a very simple regular expression (only alternate options using |, iteration using *, grouping using (), and escaping using \ is supported). Note that the iteration is simply done 0–4 times (see the loop in Repeat.Generate); conversion to possibly infinite iteration is left as an exercise for the reader ;-).
I have used a straightforward recursive-descent parser building an abstract syntax tree in memory; this tree is in the end walked and all possible sets are built. The solution is probably not optimal at all, but it works. Enjoy:
public class TestPrg
{
static void Main()
{
var expression = new RegexParser("a(b|c)*d").Parse();
foreach (var item in expression.Generate())
{
Console.WriteLine(item);
}
}
}
public static class EnumerableExtensions
{
// Build a Cartesian product of a sequence of sequences
// Code by Eric Lippert, copied from <http://blogs.msdn.com/b/ericlippert/archive/2010/06/28/computing-a-cartesian-product-with-linq.aspx>
public static IEnumerable<IEnumerable<T>> CartesianProduct<T>(this IEnumerable<IEnumerable<T>> sequences)
{
IEnumerable<IEnumerable<T>> emptyProduct = new[] { Enumerable.Empty<T>() };
return sequences.Aggregate(
emptyProduct,
(accumulator, sequence) =>
from accseq in accumulator
from item in sequence
select accseq.Concat(new[] { item }));
}
}
public class RegexParser
{
private const char EOF = '\x0000';
private readonly string str;
private char curr;
private int pos;
public RegexParser(string s)
{
str = s;
}
public RegExpression Parse()
{
pos = -1;
Read();
return ParseExpression();
}
private void Read()
{
++pos;
curr = pos < str.Length ? str[pos] : EOF;
}
private RegExpression ParseExpression()
{
var term = ParseTerm();
if (curr == '|')
{
Read();
var secondExpr = ParseExpression();
return new Variants(term, secondExpr);
}
else
{
return term;
}
}
private RegExpression ParseTerm()
{
var factor = ParseFactor();
if (curr != '|' && curr != '+' && curr != '*' && curr != ')' && curr != EOF)
{
var secondTerm = ParseTerm();
return new Concatenation(factor, secondTerm);
}
else
{
return factor;
}
}
private RegExpression ParseFactor()
{
var element = ParseElement();
if (curr == '*')
{
Read();
return new Repeat(element);
}
else
{
return element;
}
}
private RegExpression ParseElement()
{
switch (curr)
{
case '(':
Read();
var expr = ParseExpression();
if (curr != ')') throw new FormatException("Closing paren expected");
Read();
return expr;
case '\\':
Read();
var escapedChar = curr;
Read();
return new Literal(escapedChar);
default:
var literal = curr;
Read();
return new Literal(literal);
}
}
}
public abstract class RegExpression
{
protected static IEnumerable<RegExpression> Merge<T>(RegExpression head, RegExpression tail, Func<T, IEnumerable<RegExpression>> selector)
where T : RegExpression
{
var other = tail as T;
if (other != null)
{
return new[] { head }.Concat(selector(other));
}
else
{
return new[] { head, tail };
}
}
public abstract IEnumerable<string> Generate();
}
public class Variants : RegExpression
{
public IEnumerable<RegExpression> Subexpressions { get; private set; }
public Variants(RegExpression term, RegExpression rest)
{
Subexpressions = Merge<Variants>(term, rest, c => c.Subexpressions);
}
public override IEnumerable<string> Generate()
{
return Subexpressions.SelectMany(sub => sub.Generate());
}
}
public class Concatenation : RegExpression
{
public IEnumerable<RegExpression> Subexpressions { get; private set; }
public Concatenation(RegExpression factor, RegExpression rest)
{
Subexpressions = Merge<Concatenation>(factor, rest, c => c.Subexpressions);
}
public override IEnumerable<string> Generate()
{
foreach (var variant in Subexpressions.Select(sub => sub.Generate()).CartesianProduct())
{
var builder = new StringBuilder();
foreach (var item in variant) builder.Append(item);
yield return builder.ToString();
}
}
}
public class Repeat : RegExpression
{
public RegExpression Expr { get; private set; }
public Repeat(RegExpression expr)
{
Expr = expr;
}
public override IEnumerable<string> Generate()
{
foreach (var subexpr in Expr.Generate())
{
for (int cnt = 0; cnt < 5; ++cnt)
{
var builder = new StringBuilder(subexpr.Length * cnt);
for (int i = 0; i < cnt; ++i) builder.Append(subexpr);
yield return builder.ToString();
}
}
}
}
public class Literal : RegExpression
{
public char Ch { get; private set; }
public Literal(char c)
{
Ch = c;
}
public override IEnumerable<string> Generate()
{
yield return new string(Ch, 1);
}
}
My answer to a similar question might work for you. If the regex doesn't involve
any * operations (so that the language it recognizes is finite), it should be easy
to rewrite the regex as a BNF grammar, then do a bottom-up analysis producing
the finite sets corresponding to each nonterminal symbol until you reach the
start symbol, at which point you're done.

Namespace dict?

I'm devising a template language. In it, there are 3 kinds of tokens: tags, directives, and variables. Each of these tokens has a name, and there are getting to be quite a few of them. They're extensible too.
To allow name reuse I want to add namespaces.
Right now all the variables are just stored in a dict. The key is the variable name, and the value is the variable value. That way I can quickly retrieve the value of a variable. However, supposing I want to allow dot-notation, namespace.variable, how can I store these variables, such that the namespace is optional? If the namespace is included the dict should only scan that namespace, if not, I guess it scans all namespaces.
Is there a container that will do this?
You should structure your symbol data internally as a dictionary of dictionary of string. The top level dictionary is for namespaces, and each dictionary below each namespace name is the container for all symbols in that namespace.
Looking up an unqualified symbol is simply a matter of looking for the symbol in each namespace in a particular order. In C# or Delphi, the order is determined by the order in which the namespaces are declared at the top of the source file, in reverse order of declaration (most recent is the first to be searched).
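A small sketch of that layout follows, assuming the most recently declared namespace should win for unqualified names. SymbolTable, Define and TryResolve are hypothetical names chosen for the example. A separate list records the order in which namespaces were first seen, because a plain Dictionary makes no ordering promise.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical symbol table: a dictionary of dictionaries plus a declaration-order list.
public sealed class SymbolTable<T>
{
    private readonly Dictionary<string, Dictionary<string, T>> _namespaces =
        new Dictionary<string, Dictionary<string, T>>();
    private readonly List<string> _order = new List<string>(); // namespaces in declaration order

    public void Define(string ns, string name, T value)
    {
        if (!_namespaces.TryGetValue(ns, out var symbols))
        {
            symbols = new Dictionary<string, T>();
            _namespaces.Add(ns, symbols);
            _order.Add(ns);
        }
        symbols[name] = value;
    }

    // Accepts "namespace.variable" or just "variable".
    public bool TryResolve(string name, out T value)
    {
        var dot = name.LastIndexOf('.');
        if (dot >= 0)
        {
            // Qualified: look only in the named namespace.
            var ns = name.Substring(0, dot);
            var key = name.Substring(dot + 1);
            if (_namespaces.TryGetValue(ns, out var symbols))
                return symbols.TryGetValue(key, out value);
            value = default(T);
            return false;
        }

        // Unqualified: search namespaces from most recently declared to oldest.
        for (var i = _order.Count - 1; i >= 0; i--)
        {
            if (_namespaces[_order[i]].TryGetValue(name, out value))
                return true;
        }
        value = default(T);
        return false;
    }
}
```

Walking the list forwards instead of backwards would give first-declared-wins semantics, if that suits the template language better.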
You can create your own implementation of IDictionary<string, object> instead of using the framework's Dictionary<TKey, TValue>.
Externally, there would be no change to the way you are consuming it.
Internally, it would consist of a Dictionary<string, Dictionary<string, object>>.
So, if your dictionary is asked for the value matching key "namespace.variable", internally it would split that string, get the Dictionary<string, Dictionary<string, object>> with key "namespace" and then return the value in that Dictionary<string, object> for key "variable."
To make the namespace optional, you have one entry where the key is string.Empty. Whether adding or getting items, any time a key is provided that does not contain ., you'll use the entry with key string.Empty.
My solution:
Class
public class NamespaceDictionary<T> : IDictionary<string, T>
{
private SortedDictionary<string, Dictionary<string, T>> _dict;
private const char _separator = '.';
public NamespaceDictionary()
{
_dict = new SortedDictionary<string, Dictionary<string, T>>();
}
public NamespaceDictionary(IEnumerable<KeyValuePair<string, T>> collection)
: this()
{
foreach (var item in collection)
Add(item);
}
#region Implementation of IEnumerable
public IEnumerator<KeyValuePair<string, T>> GetEnumerator()
{
return _dict.SelectMany(x => x.Value).GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
#endregion
private static Tuple<string, string> Split(string name)
{
int pos = name.LastIndexOf(_separator);
string ns = pos == -1 ? "" : name.Substring(0, pos);
string var = name.Substring(pos + 1);
return new Tuple<string, string>(ns, var);
}
#region Implementation of ICollection<KeyValuePair<string,TValue>>
public void Add(KeyValuePair<string, T> item)
{
Add(item.Key, item.Value);
}
public void Clear()
{
_dict.Clear();
}
public bool Contains(KeyValuePair<string, T> item)
{
throw new NotImplementedException();
}
public void CopyTo(KeyValuePair<string, T>[] array, int arrayIndex)
{
throw new NotImplementedException();
}
public bool Remove(KeyValuePair<string, T> item)
{
return Remove(item.Key);
}
public int Count
{
get { return _dict.Sum(p => p.Value.Count); }
}
public bool IsReadOnly
{
get { return false; }
}
#endregion
#region Implementation of IDictionary<string,TValue>
public bool ContainsKey(string name)
{
var tuple = Split(name);
return ContainsKey(tuple.Item1, tuple.Item2);
}
public bool ContainsKey(string ns, string key)
{
if (ns == "")
return _dict.Any(pair => pair.Value.ContainsKey(key));
return _dict.ContainsKey(ns) && _dict[ns].ContainsKey(key);
}
public void Add(string name, T value)
{
var tuple = Split(name);
Add(tuple.Item1, tuple.Item2, value);
}
public void Add(string ns, string key, T value)
{
if (!_dict.ContainsKey(ns))
_dict[ns] = new Dictionary<string, T>();
_dict[ns].Add(key, value);
}
public bool Remove(string ns, string key)
{
if (_dict.ContainsKey(ns) && _dict[ns].ContainsKey(key))
{
if (_dict[ns].Count == 1) _dict.Remove(ns);
else _dict[ns].Remove(key);
return true;
}
return false;
}
public bool Remove(string key)
{
var tuple = Split(key);
return Remove(tuple.Item1, tuple.Item2);
}
public bool TryGetValue(string name, out T value)
{
var tuple = Split(name);
return TryGetValue(tuple.Item1, tuple.Item2, out value);
}
public bool TryGetValue(string ns, string key, out T value)
{
if (ns == "")
{
foreach (var pair in _dict)
{
if (pair.Value.ContainsKey(key))
{
value = pair.Value[key];
return true;
}
}
}
else if (_dict.ContainsKey(ns) && _dict[ns].ContainsKey(key))
{
value = _dict[ns][key];
return true;
}
value = default(T);
return false;
}
public T this[string ns, string key]
{
get
{
if (ns == "")
{
foreach (var pair in _dict)
if (pair.Value.ContainsKey(key))
return pair.Value[key];
}
else if (_dict.ContainsKey(ns) && _dict[ns].ContainsKey(key))
return _dict[ns][key];
throw new KeyNotFoundException();
}
set
{
if (!_dict.ContainsKey(ns))
_dict[ns] = new Dictionary<string, T>();
_dict[ns][key] = value;
}
}
public T this[string name]
{
get
{
var tuple = Split(name);
return this[tuple.Item1, tuple.Item2];
}
set
{
var tuple = Split(name);
this[tuple.Item1, tuple.Item2] = value;
}
}
public ICollection<string> Keys
{
get { return _dict.SelectMany(p => p.Value.Keys).ToArray(); }
}
public ICollection<T> Values
{
get { return _dict.SelectMany(p => p.Value.Values).ToArray(); }
}
#endregion
}
Test
var dict = new NamespaceDictionary<int>();
dict.Add("ns1.var1", 1);
dict.Add("ns2.var1", 2);
dict.Add("var2", 3);
dict.Add("ns2.var2", 4);
dict.Add("ns3", "var1", 5);
dict["ns4.var1"] = 6;
Console.WriteLine(dict["var1"]);
Console.WriteLine(dict["ns2.var1"]);
Console.WriteLine(dict["var2"]);
Console.WriteLine(dict["ns2.var2"]);
Console.WriteLine(dict["ns2", "var2"]);
Console.WriteLine(dict["ns3.var1"]);
Console.WriteLine(dict["ns4", "var1"]);
Output
1
2
3
4
4
5
6
Help
I used a SortedDictionary thinking it would retain the order in which the namespaces were added, but it's actually sorting the namespaces alphabetically. Is there a dictionary class that will retain the order in which the items were added, without sorting them?

Optimized Generic List Split

Read the edit below for more information.
I have some code below that I use to split a generic list of Object when the item is of a certain type.
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type) {
List<List<object>> t = new List<List<object>>();
int currentT = 0;
t.Add(new List<object>());
foreach (object list in tokens) {
if ((list is Token) && (list as Token).TokenType == type) {
currentT++;
t.Add(new List<object>());
}
else if ((list is TokenType) && ((TokenType)list )== type) {
currentT++;
t.Add(new List<object>());
}
else {
t[currentT].Add(list);
}
}
return t.ToArray();
}
I don't have a clear question so much as I am curious whether anyone knows of any way to optimize this code. I call it many times and it seems to be quite the beast as far as clock cycles go. Any ideas? I can also make it a Wiki if anyone is interested, maybe we can keep track of the latest changes.
Update: I'm trying to parse out specific tokens. It's a list of some other class and Token classes. Token has a property (enum) of TokenType. I need to find all the Token classes and split on each of them.
{a b c T d e T f g h T i j k l T m}
would split like
{a b c}{d e}{f g h}{i j k l}{m}
EDIT UPDATE:
It seems like all of my speed problems come from the constant creation of, and addition to, generic lists. Does anyone know how I can go about this without that?
This is the profile of what is happening if it helps anyone.
Your code looks fine.
My only suggestion would be replacing IEnumerable<object> with the non-generic IEnumerable. (In System.Collections)
EDIT:
On further inspection, you're casting more times than necessary.
Replace the if with the following code:
var token = list as Token;
if (token != null && token.TokenType == type) {
Also, you can get rid of your currentT variable by writing t[t.Count - 1] or t.Last(). This will make the code clearer, but might have a tiny negative effect on performance.
Alternatively, you could store a reference to the inner list in a variable and use it directly. (This will slightly improve performance)
Finally, if you can change the return type to List<List<Object>>, you can return t directly; this will avoid an array copy and will be noticeably faster for large lists.
By the way, your variable names are confusing; you should swap the names of t and list.
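Putting those suggestions together, the method might look roughly like the sketch below. The Token and TokenType definitions are stand-ins invented so the example compiles, since the question does not show them, and the enum members are arbitrary. The changes are a single as cast per item, a direct reference to the current inner list, and returning the List<List<object>> without copying it to an array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the question's types.
public enum TokenType { Comma, Semicolon }

public class Token
{
    public TokenType TokenType { get; set; }
}

public static class TokenSplitter
{
    public static List<List<object>> Split(this IEnumerable<object> tokens, TokenType type)
    {
        var groups = new List<List<object>>();
        var current = new List<object>(); // the inner list currently being filled
        groups.Add(current);

        foreach (var item in tokens)
        {
            var token = item as Token;    // one cast instead of "is" followed by "as"
            var isDelimiter =
                (token != null && token.TokenType == type) ||
                (item is TokenType && (TokenType)item == type);

            if (isDelimiter)
            {
                current = new List<object>();
                groups.Add(current);
            }
            else
            {
                current.Add(item);
            }
        }

        return groups; // no ToArray() copy
    }
}
```

Whether this is measurably faster depends on the input; profiling against the original remains the only way to tell.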
Type-testing and casts can be a performance killer. If at all possible, your token types should implement a common interface or abstract class. Instead of passing in an object, you should pass in an IToken which wraps your object.
Here's some concept code you can use to get started:
using System;
using System.Collections.Generic;
namespace Juliet
{
interface IToken<T>
{
bool IsDelimeter { get; }
T Data { get; }
}
class DelimeterToken<T> : IToken<T>
{
public bool IsDelimeter { get { return true; } }
public T Data { get { throw new Exception("No data"); } }
}
class DataToken<T> : IToken<T>
{
public DataToken(T data)
{
this.Data = data;
}
public bool IsDelimeter { get { return false; } }
public T Data { get; private set; }
}
class TokenFactory<T>
{
public IToken<T> Make()
{
return new DelimeterToken<T>();
}
public IToken<T> Make(T data)
{
return new DataToken<T>(data);
}
}
class Program
{
static List<List<T>> SplitTokens<T>(IEnumerable<IToken<T>> tokens)
{
List<List<T>> res = new List<List<T>>();
foreach (IToken<T> token in tokens)
{
if (token.IsDelimeter)
{
res.Add(new List<T>());
}
else
{
if (res.Count == 0)
{
res.Add(new List<T>());
}
res[res.Count - 1].Add(token.Data);
}
}
return res;
}
static void Main(string[] args)
{
TokenFactory<string> factory = new TokenFactory<string>();
IToken<string>[] tokens = new IToken<string>[]
{
factory.Make("a"), factory.Make("b"), factory.Make("c"), factory.Make(),
factory.Make("d"), factory.Make("e"), factory.Make(),
factory.Make("f"), factory.Make("g"), factory.Make("h"), factory.Make(),
factory.Make("i"), factory.Make("j"), factory.Make("k"), factory.Make("l"), factory.Make(),
factory.Make("m")
};
List<List<string>> splitTokens = SplitTokens(tokens);
for (int i = 0; i < splitTokens.Count; i++)
{
Console.Write("{");
for (int j = 0; j < splitTokens[i].Count; j++)
{
Console.Write("{0}, ", splitTokens[i][j]);
}
Console.Write("}");
}
Console.ReadKey(true);
}
}
}
In principle, you can create instances of IToken<object> to have it generalized to tokens of multiple classes.
A: An all-lazy implementation will suffice if you just iterate through the results in a nested foreach:
using System;
using System.Collections.Generic;
public static class Splitter
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> source, Predicate<T> match)
{
using (IEnumerator<T> enumerator = source.GetEnumerator())
{
while (enumerator.MoveNext())
{
yield return Split(enumerator, match);
}
}
}
static IEnumerable<T> Split<T>(IEnumerator<T> enumerator, Predicate<T> match)
{
do
{
if (match(enumerator.Current))
{
yield break;
}
else
{
yield return enumerator.Current;
}
} while (enumerator.MoveNext());
}
}
Use it like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace MyTokenizer
{
class Program
{
enum TokenTypes { SimpleToken, UberToken }
class Token { public TokenTypes TokenType = TokenTypes.SimpleToken; }
class MyUberToken : Token { public MyUberToken() { TokenType = TokenTypes.UberToken; } }
static void Main(string[] args)
{
List<object> objects = new List<object>(new object[] { "A", Guid.NewGuid(), "C", new MyUberToken(), "D", new MyUberToken(), "E", new MyUberToken() });
var splitOn = TokenTypes.UberToken;
foreach (var list in objects.Split(x => x is Token && ((Token)x).TokenType == splitOn))
{
foreach (var item in list)
{
Console.WriteLine(item);
}
Console.WriteLine("==============");
}
Console.ReadKey();
}
}
}
B: If you need to process the results some time later and you wish to do it out-of-order, or you partition on one thread and then possibly dispatch the segments to multiple threads, then this would probably provide a good starting point:
using System;
using System.Collections.Generic;
using System.Linq;
public static class Splitter2
{
public static IEnumerable<IEnumerable<T>> SplitToSegments<T>(this IEnumerable<T> source, Predicate<T> match)
{
T[] items = source.ToArray();
for (int startIndex = 0; startIndex < items.Length; startIndex++)
{
int endIndex = startIndex;
for (; endIndex < items.Length; endIndex++)
{
if (match(items[endIndex])) break;
}
yield return EnumerateArraySegment(items, startIndex, endIndex - 1);
startIndex = endIndex;
}
}
static IEnumerable<T> EnumerateArraySegment<T>(T[] array, int startIndex, int endIndex)
{
for (; startIndex <= endIndex; startIndex++)
{
yield return array[startIndex];
}
}
}
C: If you really must return a collection of List<T> instances (which I doubt, unless you explicitly want to mutate them later on), then try to initialize them to a given capacity before copying:
public static List<List<T>> SplitToLists<T>(this IEnumerable<T> source, Predicate<T> match)
{
List<List<T>> lists = new List<List<T>>();
T[] items = source.ToArray();
for (int startIndex = 0; startIndex < items.Length; startIndex++)
{
int endIndex = startIndex;
for (; endIndex < items.Length; endIndex++)
{
if (match(items[endIndex])) break;
}
List<T> list = new List<T>(endIndex - startIndex);
list.AddRange(EnumerateArraySegment(items, startIndex, endIndex - 1));
lists.Add(list);
startIndex = endIndex;
}
return lists;
}
D: If this is still not enough, I suggest you roll your own lightweight List implementation that can copy a range directly to its internal array from another instance.
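A simpler variant of that idea, short of writing a full list class, is to block-copy each segment into a plain array with `Array.Copy`. The sketch below is only an illustration: `SplitToArrays` is a hypothetical name, and unlike the versions above it keeps empty segments (including one for empty input).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SegmentCopySketch
{
    // Hypothetical helper: returns each segment as its own array.
    public static List<T[]> SplitToArrays<T>(IEnumerable<T> source, Predicate<T> match)
    {
        T[] items = source.ToArray();
        var segments = new List<T[]>();
        int startIndex = 0;
        for (int i = 0; i <= items.Length; i++)
        {
            // A segment ends at a delimiter or at the end of the input.
            if (i == items.Length || match(items[i]))
            {
                var segment = new T[i - startIndex];
                // Copy the whole range in one call instead of item by item.
                Array.Copy(items, startIndex, segment, 0, segment.Length);
                segments.Add(segment);
                startIndex = i + 1;
            }
        }
        return segments;
    }
}
```

Whether this actually beats `List<T>.AddRange` for your data is something to measure rather than assume.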
My first thought would be instead of looking up t[currentT] all the time, just store a currentList and add directly to that.
This is the best I could do to eliminate as much of the allocation times as possible for the function (should only allocate when it goes over the capacity, which should be no more than what is required to create the largest sub list in the results). I've tested this implementation and it works as you described.
Please note that the results of the prior sub list are destroyed when the next list in the group is accessed.
public static IEnumerable<IEnumerable> Split(this IEnumerable tokens, TokenType type)
{
ArrayList currentT = new ArrayList();
foreach (object list in tokens)
{
Token token = list as Token;
if ((token != null) && token.TokenType == type)
{
yield return currentT;
currentT.Clear();
//currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
}
else if ((list is TokenType) && ((TokenType)list) == type)
{
yield return currentT;
currentT.Clear();
//currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
}
else
{
currentT.Add(list);
}
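}
// Emit the items after the last delimiter; without this the final group is lost.
yield return currentT;
}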
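To make the caveat about destroyed sub lists concrete, here is a small sketch of the same buffer-reuse pattern. It uses `int` items and a hypothetical `SplitReusingBuffer` method rather than the `Token` types, purely to keep it self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReusedBufferDemo
{
    // Simplified stand-in for the method above: one list is reused for every group.
    public static IEnumerable<List<int>> SplitReusingBuffer(IEnumerable<int> source, int delimiter)
    {
        var buffer = new List<int>();
        foreach (int item in source)
        {
            if (item == delimiter)
            {
                yield return buffer;
                buffer.Clear();      // wipes the list the caller was just handed
            }
            else
            {
                buffer.Add(item);
            }
        }
        yield return buffer;         // items after the last delimiter
    }

    // Returns true when caching the groups stored the same instance twice.
    public static bool CachedGroupsShareOneInstance()
    {
        List<List<int>> cached = SplitReusingBuffer(new[] { 1, 2, 0, 3 }, 0).ToList();
        return ReferenceEquals(cached[0], cached[1]);
    }
}
```

Consuming each group inside a `foreach` before asking for the next one is safe. Calling `ToList()` on the outer sequence is not: every cached entry refers to the same buffer, which ends up holding only the last group.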
EDIT
Here's another version that doesn't make use of another list at all (shouldn't be doing any allocations). Not sure how well this will compare, but it does work as requested (also I've got no idea how this one will go if you try to cache a sub 'array').
Also, both of these require a "using System.Collections" statement (in addition to the Generic namespace).
private static IEnumerable SplitInnerLoop(IEnumerator iter, TokenType type)
{
do
{
Token token = iter.Current as Token;
if ((token != null) && token.TokenType == type)
{
break;
}
else if ((iter.Current is TokenType) && ((TokenType)iter.Current) == type)
{
break;
}
else
{
yield return iter.Current;
}
} while (iter.MoveNext());
}
public static IEnumerable<IEnumerable> Split(this IEnumerable tokens, TokenType type)
{
IEnumerator iter = tokens.GetEnumerator();
while (iter.MoveNext())
{
yield return SplitInnerLoop(iter, type);
}
}
I think there are broken cases in the following scenarios, assuming that the list items are lower-case letters and the item with the matching token type is T:
{ T a b c ... };
{ ... x y z T };
{ ... j k l T T m n o ... };
{ T }; and
{ }
Which will result in:
{ { } { a b c ... } };
{ { ... x y z } { } };
{ { ... j k l } { } { } { m n o ... } };
{ { } }; and
{ }
Doing a straight refactoring:
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
TokenType type) {
var outer = new List<List<object>>();
var inner = new List<object>();
foreach (var item in tokens) {
Token token = item as Token;
if (token != null && token.TokenType == type) {
outer.Add(inner);
inner = new List<object>();
continue;
}
inner.Add(item);
}
outer.Add(inner);
return outer.ToArray();
}
To fix the broken cases (assuming that those are truly broken), I recommend:
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
TokenType type) {
var outer = new List<List<object>>();
var inner = new List<object>();
var enumerator = tokens.GetEnumerator();
while (enumerator.MoveNext()) {
Token token = enumerator.Current as Token;
if (token == null || token.TokenType != type) {
inner.Add(enumerator.Current);
}
else if (inner.Count > 0) {
outer.Add(inner);
inner = new List<object>();
}
}
if (inner.Count > 0) {
outer.Add(inner); // keep the group that follows the last delimiter
}
return outer.ToArray();
}
Which will result in:
{ { a b c ... } };
{ { ... x y z } };
{ { ... j k l } { m n o ... } };
{ }; and
{ }
Using LINQ you could try this: (I did not test it...)
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type)
{
List<List<object>> l = new List<List<object>>();
l.Add(new List<object>());
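return tokens.Aggregate(l, (c, n) =>
{
var t = n as Token;
if (t != null && t.TokenType == type)
{
c.Add(new List<object>()); // start a new group on the accumulator, not on the token
}
else
{
c.Last().Add(n);
}
return c; // Aggregate expects the accumulator back
}).ToArray();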
}
Second try:
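// An iterator (yield return) cannot return an array, so this version returns a lazy sequence instead.
public static IEnumerable<IEnumerable<object>> Split(this IEnumerable<object> tokens, TokenType type)
{
// Positions of the delimiter tokens, materialized up front
var indexes = tokens.Select((t, index) => new { token = t as Token, index })
.Where(t => t.token != null && t.token.TokenType == type).Select(t => t.index).ToList();
int prevIndex = -1; // -1 so that the item at index 0 is not skipped
foreach (int item in indexes)
{
int start = prevIndex, end = item; // copies: the lambda runs lazily, after prevIndex has changed
yield return tokens.Where((t, index) => index > start && index < end);
prevIndex = item;
}
int last = prevIndex;
yield return tokens.Where((t, index) => index > last); // items after the last delimiter
}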
Here is one possibility
The Token class (could be whatever class)
public class Token
{
public string Name { get; set; }
public TokenType TokenType { get; set; }
}
Now the type enum (this could be whatever other grouping factor)
public enum TokenType
{
Type1,
Type2,
Type3,
Type4,
Type5,
}
The extension method (declare this any way you choose)
public static class TokenExtension
{
public static IEnumerable<Token>[] Split(this IEnumerable<Token> tokens)
{
return tokens.GroupBy(token => ((Token)token).TokenType).ToArray();
}
}
Sample use (I used a web project to try this out)
List<Token> tokens = new List<Token>();
tokens.Add(new Token { Name = "a", TokenType = TokenType.Type1 });
tokens.Add(new Token { Name = "b", TokenType = TokenType.Type1 });
tokens.Add(new Token { Name = "c", TokenType = TokenType.Type1 });
tokens.Add(new Token { Name = "d", TokenType = TokenType.Type2 });
tokens.Add(new Token { Name = "e", TokenType = TokenType.Type2 });
tokens.Add(new Token { Name = "f", TokenType = TokenType.Type3 });
tokens.Add(new Token { Name = "g", TokenType = TokenType.Type3 });
tokens.Add(new Token { Name = "h", TokenType = TokenType.Type3 });
tokens.Add(new Token { Name = "i", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "j", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "k", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "l", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "m", TokenType = TokenType.Type5 });
StringBuilder stringed = new StringBuilder();
foreach (Token token in tokens)
{
stringed.Append(token.Name);
stringed.Append(", ");
}
Response.Write(stringed.ToString());
Response.Write("<br/>");
var q = tokens.Split();
foreach (var list in tokens.Split())
{
stringed = new StringBuilder();
foreach (Token token in list)
{
stringed.Append(token.Name);
stringed.Append(", ");
}
Response.Write(stringed.ToString());
Response.Write("<br/>");
}
So all I am doing is using LINQ; feel free to add or remove things. You can actually go crazy with this and group on many different properties.
Do you need to convert it to an array? You could potentially use LINQ and delayed execution to return the results.
EDIT:
With the clarified question it would be hard to bend LINQ to make it return the results the way you want. If you still want to have the execution of each cycle delayed you could write your own enumerator.
I recommend perf testing this compared to the other options to see if there is a performance gain for your scenario if you attempt this approach. It might cause more overhead managing the iterator which would be bad for cases with little data.
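For the perf comparison, a rough timing harness along these lines may be enough to start with. `SplitTiming.Measure` is a hypothetical helper, not part of any library, and it uses `int` items only to stay self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class SplitTiming
{
    // Hypothetical helper: runs 'split' several times and returns the best elapsed time.
    public static TimeSpan Measure(Func<IEnumerable<IEnumerable<int>>> split, int runs)
    {
        TimeSpan best = TimeSpan.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            // Enumerate every group fully; a lazy splitter does no work until this happens.
            foreach (IEnumerable<int> group in split())
            {
                foreach (int item in group) { }
            }
            watch.Stop();
            if (watch.Elapsed < best) best = watch.Elapsed;
        }
        return best;
    }
}
```

Pass each candidate as a lambda that builds and returns its result, and compare the times on both small and large inputs, since iterator overhead tends to matter most when there is little data.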
I hope this helps.
// This is the easy way to make your own iterator using the C# syntax
// It will return sets of separated tokens in a lazy fashion
// This sample is based on the version provided by @Ants
public static IEnumerable<IEnumerable<object>> Split(this IEnumerable<object> tokens,
TokenType type) {
var current = new List<object>();
foreach (var item in tokens)
{
Token token = item as Token;
if (token != null && token.TokenType == type)
{
if( current.Count > 0)
{
yield return current;
current = new List<object>();
}
}
else
{
current.Add(item);
}
}
if( current.Count > 0)
yield return current;
}
Warning: This compiles but might still have hidden bugs. It is getting late here.
// This is doing the same thing but doing it all by hand.
// You could use this method as well to lazily iterate through the 'current' list as well
// This is probably overkill and substantially more complex
public class TokenSplitter : IEnumerable<IEnumerable<object>>, IEnumerator<IEnumerable<object>>
{
IEnumerator<object> _enumerator;
IEnumerable<object> _tokens;
TokenType _target;
List<object> _current;
bool _isDone = false;
public TokenSplitter(IEnumerable<object> tokens, TokenType target)
{
_tokens = tokens;
_target = target;
Reset();
}
// Cruft from the IEnumerable and generic IEnumerator
public IEnumerator<IEnumerable<object>> GetEnumerator() { return this; }
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
public IEnumerable<object> Current { get { return _current; } }
public void Dispose() { }
#region IEnumerator Members
object System.Collections.IEnumerator.Current { get { return Current; } }
// Advance to the next non-empty set of tokens, if there is one
public bool MoveNext()
{
if (_isDone) return false;
FillCurrent();
return _current.Count > 0;
}
// Reset the enumerator so that you could reuse this structure if you wanted
public void Reset()
{
_isDone = false;
_enumerator = _tokens.GetEnumerator();
_current = new List<object>();
}
// Collects tokens into a fresh list until a matching token closes a non-empty set
private void FillCurrent()
{
_current = new List<object>();
while (_enumerator.MoveNext())
{
Token token = _enumerator.Current as Token;
if (token == null || token.TokenType != _target)
{
_current.Add(_enumerator.Current);
}
else if (_current.Count > 0)
{
return; // a delimiter ends the set; leading and repeated delimiters are skipped
}
}
_isDone = true; // source exhausted; _current holds the last set, which may be empty
}
#endregion
}
