Read the edit below for more information.
I have some code below that I use to split a generic list of Object when the item is of a certain type.
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type) {
List<List<object>> t = new List<List<object>>();
int currentT = 0;
t.Add(new List<object>());
foreach (object list in tokens) {
if ((list is Token) && (list as Token).TokenType == type) {
currentT++;
t.Add(new List<object>());
}
else if ((list is TokenType) && ((TokenType)list )== type) {
currentT++;
t.Add(new List<object>());
}
else {
t[currentT].Add(list);
}
}
return t.ToArray();
}
I dont have a clear question as much as I am curious if anyone knows of any ways I can optimize this code. I call it many times and it seems to be quite the beast as far as clock cycles go. Any ideas? I can also make it a Wiki if anyone is interested, maybe we can keep track of the latest changes.
Update: Im trying to parse out specific tokens. Its a list of some other class and Token classes. Token has a property (enum) of TokenType. I need to find all the Token classes and split on each of them.
{a b c T d e T f g h T i j k l T m}
would split like
{a b c}{d e}{f g h}{i j k l}{m}
EDIT UPDATE:
It seems like all of my speed problems come into the constant creation and addition of Generic Lists. Does anyone know how I can go about this without that?
This is the profile of what is happening if it helps anyone.
Your code looks fine.
My only suggestion would be replacing IEnumerable<object> with the non-generic IEnumerable. (In System.Collections)
EDIT:
On further inspection, you're casting more times than necessary.
Replace the if with the following code:
var token = list as Token;
if (token != null && token.TokenType == type) {
Also, you can get rid your currentT variable by writing t[t.Count - 1] or t.Last(). This will make the code clearer, but might have a tiny negative effect on performance.
Alternatively, you could store a reference to the inner list in a variable and use it directly. (This will slightly improve performance)
Finally, if you can change the return type to List<List<Object>>, you can return t directly; this will avoid an array copy and will be noticeably faster for large lists.
By the way, your variable names are confusing; you should swap the names of t and list.
Type-testing and casts can be a performance killer. If at all possible, your token types should implement a common interface or abstract class. Instead of passing in and object, you should pass in an IToken which wraps your object.
Here's some concept code you can use to get started:
using System;
using System.Collections.Generic;
namespace Juliet
{
interface IToken<T>
{
bool IsDelimeter { get; }
T Data { get; }
}
class DelimeterToken<T> : IToken<T>
{
public bool IsDelimeter { get { return true; } }
public T Data { get { throw new Exception("No data"); } }
}
class DataToken<T> : IToken<T>
{
public DataToken(T data)
{
this.Data = data;
}
public bool IsDelimeter { get { return false; } }
public T Data { get; private set; }
}
class TokenFactory<T>
{
public IToken<T> Make()
{
return new DelimeterToken<T>();
}
public IToken<T> Make(T data)
{
return new DataToken<T>(data);
}
}
class Program
{
static List<List<T>> SplitTokens<T>(IEnumerable<IToken<T>> tokens)
{
List<List<T>> res = new List<List<T>>();
foreach (IToken<T> token in tokens)
{
if (token.IsDelimeter)
{
res.Add(new List<T>());
}
else
{
if (res.Count == 0)
{
res.Add(new List<T>());
}
res[res.Count - 1].Add(token.Data);
}
}
return res;
}
static void Main(string[] args)
{
TokenFactory<string> factory = new TokenFactory<string>();
IToken<string>[] tokens = new IToken<string>[]
{
factory.Make("a"), factory.Make("b"), factory.Make("c"), factory.Make(),
factory.Make("d"), factory.Make("e"), factory.Make(),
factory.Make("f"), factory.Make("g"), factory.Make("h"), factory.Make(),
factory.Make("i"), factory.Make("j"), factory.Make("k"), factory.Make("l"), factory.Make(),
factory.Make("m")
};
List<List<string>> splitTokens = SplitTokens(tokens);
for (int i = 0; i < splitTokens.Count; i++)
{
Console.Write("{");
for (int j = 0; j < splitTokens[i].Count; j++)
{
Console.Write("{0}, ", splitTokens[i][j]);
}
Console.Write("}");
}
Console.ReadKey(true);
}
}
}
In principle, you can create instances of IToken<object> to have it generalized to tokens of multiple classes.
A: An all-lazy implementation will suffice if you just iterate through the results in a nested foreach:
using System;
using System.Collections.Generic;
public static class Splitter
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> source, Predicate<T> match)
{
using (IEnumerator<T> enumerator = source.GetEnumerator())
{
while (enumerator.MoveNext())
{
yield return Split(enumerator, match);
}
}
}
static IEnumerable<T> Split<T>(IEnumerator<T> enumerator, Predicate<T> match)
{
do
{
if (match(enumerator.Current))
{
yield break;
}
else
{
yield return enumerator.Current;
}
} while (enumerator.MoveNext());
}
}
Use it like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace MyTokenizer
{
class Program
{
enum TokenTypes { SimpleToken, UberToken }
class Token { public TokenTypes TokenType = TokenTypes.SimpleToken; }
class MyUberToken : Token { public MyUberToken() { TokenType = TokenTypes.UberToken; } }
static void Main(string[] args)
{
List<object> objects = new List<object>(new object[] { "A", Guid.NewGuid(), "C", new MyUberToken(), "D", new MyUberToken(), "E", new MyUberToken() });
var splitOn = TokenTypes.UberToken;
foreach (var list in objects.Split(x => x is Token && ((Token)x).TokenType == splitOn))
{
foreach (var item in list)
{
Console.WriteLine(item);
}
Console.WriteLine("==============");
}
Console.ReadKey();
}
}
}
B: If you need to process the results some time later and you wish to do it out-of-order, or you partition on one thread and then possibly dispatch the segments to multiple threads, then this would probably provide a good starting point:
using System;
using System.Collections.Generic;
using System.Linq;
public static class Splitter2
{
public static IEnumerable<IEnumerable<T>> SplitToSegments<T>(this IEnumerable<T> source, Predicate<T> match)
{
T[] items = source.ToArray();
for (int startIndex = 0; startIndex < items.Length; startIndex++)
{
int endIndex = startIndex;
for (; endIndex < items.Length; endIndex++)
{
if (match(items[endIndex])) break;
}
yield return EnumerateArraySegment(items, startIndex, endIndex - 1);
startIndex = endIndex;
}
}
static IEnumerable<T> EnumerateArraySegment<T>(T[] array, int startIndex, int endIndex)
{
for (; startIndex <= endIndex; startIndex++)
{
yield return array[startIndex];
}
}
}
C: If you really must return a collection of List<T> -s - which I doubt, unless you explicitly want to mutate them some time later on -, then try to initialize them to a given capacity before copying:
public static List<List<T>> SplitToLists<T>(this IEnumerable<T> source, Predicate<T> match)
{
List<List<T>> lists = new List<List<T>>();
T[] items = source.ToArray();
for (int startIndex = 0; startIndex < items.Length; startIndex++)
{
int endIndex = startIndex;
for (; endIndex < items.Length; endIndex++)
{
if (match(items[endIndex])) break;
}
List<T> list = new List<T>(endIndex - startIndex);
list.AddRange(EnumerateArraySegment(items, startIndex, endIndex - 1));
lists.Add(list);
startIndex = endIndex;
}
return lists;
}
D: If this is still not enough, I suggest you roll your own lightweight List implementation that can copy a range directly to its internal array from another instance.
My first thought would be instead of looking up t[currentT] all the time, just store a currentList and add directly to that.
This is the best I could do to eliminate as much of the allocation times as possible for the function (should only allocate when it goes over the capacity, which should be no more than what is required to create the largest sub list in the results). I've tested this implementation and it works as you described.
Please note that the results of the prior sub list are destroyed when the next list in the group is accessed.
public static IEnumerable<IEnumerable> Split(this IEnumerable tokens, TokenType type)
{
ArrayList currentT = new ArrayList();
foreach (object list in tokens)
{
Token token = list as Token;
if ((token != null) && token.TokenType == type)
{
yield return currentT;
currentT.Clear();
//currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
}
else if ((list is TokenType) && ((TokenType)list) == type)
{
yield return currentT;
currentT.Clear();
//currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance
}
else
{
currentT.Add(list);
}
}
}
EDIT
Here's another version that doesn't make use of another list at all (shouldn't be doing any allocations). Not sure how well this will compare, but it does work as requested (also I've got no idea how this one will go if you try to cache a sub 'array').
Also, both of these require a "using System.Collections" statement (in addition to the Generic namespace).
private static IEnumerable SplitInnerLoop(IEnumerator iter, TokenType type)
{
do
{
Token token = iter.Current as Token;
if ((token != null) && token.TokenType == type)
{
break;
}
else if ((iter.Current is TokenType) && ((TokenType)iter.Current) == type)
{
break;
}
else
{
yield return iter.Current;
}
} while (iter.MoveNext());
}
public static IEnumerable<IEnumerable> Split(this IEnumerable tokens, TokenType type)
{
IEnumerator iter = tokens.GetEnumerator();
while (iter.MoveNext())
{
yield return SplitInnerLoop(iter, type);
}
}
I think that there are broken cases for these scenarios where assuming that list items are lower case letters, and the item with matching token type is T:
{ T a b c ... };
{ ... x y z T };
{ ... j k l T T m n o ... };
{ T }; and
{ }
Which will result in:
{ { } { a b c ... } };
{ { ... x y z } { } };
{ { ... j k l } { } { } { m n o ... } };
{ { } }; and
{ }
Doing a straight refactoring:
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
TokenType type) {
var outer = new List<List<object>>();
var inner = new List<object>();
foreach (var item in tokens) {
Token token = item as token;
if (token != null && token.TokenType == type) {
outer.Add(inner);
inner = new List<object>();
continue;
}
inner.Add(item);
}
outer.Add(inner);
return outer.ToArray();
}
To fix the broken cases (assuming that those are truly broken), I recommend:
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens,
TokenType type) {
var outer = new List<List<object>>();
var inner = new List<object>();
var enumerator = tokens.GetEnumerator();
while (enumerator.MoveNext()) {
Token token = enumerator.Current as token;
if (token == null || token.TokenType != type) {
inner.Add(enumerator.Current);
}
else if (inner.Count > 0) {
outer.Add(inner);
inner = new List<object>();
}
}
return outer.ToArray();
}
Which will result in:
{ { a b c ... } };
{ { ... x y z } };
{ { ... j k l } { m n o ... } };
{ }; and
{ }
Using LINQ you could try this: (I did not test it...)
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type)
{
List<List<object>> l = new List<List<object>>();
l.Add(new List<object>());
return tokens.Aggregate(l, (c, n) =>
{
var t = n as Token;
if (t != null && t.TokenType == type)
{
t.Add(new List<object>());
}
else
{
l.Last().Add(n);
}
return t;
}).ToArray();
}
Second try:
public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type)
{
var indexes = tokens.Select((t, index) => new { token = t, index = index }).OfType<Token>().Where(t => t.token.TokenType == type).Select(t => t.index);
int prevIndex = 0;
foreach (int item in indexes)
{
yield return tokens.Where((t, index) => (index > prevIndex && index < item));
prevIndex = item;
}
}
Here is one possibility
The Token class ( could be what ever class )
public class Token
{
public string Name { get; set; }
public TokenType TokenType { get; set; }
}
Now the Type enum ( this could be what ever other grouping factor )
public enum TokenType
{
Type1,
Type2,
Type3,
Type4,
Type5,
}
The Extention Method (Declare this anyway you choose)
public static class TokenExtension
{
public static IEnumerable<Token>[] Split(this IEnumerable<Token> tokens)
{
return tokens.GroupBy(token => ((Token)token).TokenType).ToArray();
}
}
Sample of use ( I used a web project to spin this )
List<Token> tokens = new List<Token>();
tokens.Add(new Token { Name = "a", TokenType = TokenType.Type1 });
tokens.Add(new Token { Name = "b", TokenType = TokenType.Type1 });
tokens.Add(new Token { Name = "c", TokenType = TokenType.Type1 });
tokens.Add(new Token { Name = "d", TokenType = TokenType.Type2 });
tokens.Add(new Token { Name = "e", TokenType = TokenType.Type2 });
tokens.Add(new Token { Name = "f", TokenType = TokenType.Type3 });
tokens.Add(new Token { Name = "g", TokenType = TokenType.Type3 });
tokens.Add(new Token { Name = "h", TokenType = TokenType.Type3 });
tokens.Add(new Token { Name = "i", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "j", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "k", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "l", TokenType = TokenType.Type4 });
tokens.Add(new Token { Name = "m", TokenType = TokenType.Type5 });
StringBuilder stringed = new StringBuilder();
foreach (Token token in tokens)
{
stringed.Append(token.Name);
stringed.Append(", ");
}
Response.Write(stringed.ToString());
Response.Write("</br>");
var q = tokens.Split();
foreach (var list in tokens.Split())
{
stringed = new StringBuilder();
foreach (Token token in list)
{
stringed.Append(token.Name);
stringed.Append(", ");
}
Response.Write(stringed.ToString());
Response.Write("</br>");
}
So all I am soing is using Linq, feel free to add or remove, you can actualy go crazy on this and group on many diferent properties.
Do you need to convert it to an array? You could potentially use LINQ and delayed execution to return the results.
EDIT:
With the clarified question it would be hard to bend LINQ to make it return the results the way you want. If you still want to have the execution of each cycle delayed you could write your own enumerator.
I recommend perf testing this compared to the other options to see if there is a performance gain for your scenario if you attempt this approach. It might cause more overhead managing the iterator which would be bad for cases with little data.
I hope this helps.
// This is the easy way to make your own iterator using the C# syntax
// It will return sets of separated tokens in a lazy fashion
// This sample is based on the version provided by #Ants
public static IEnumerable<IEnumerable<object>> Split(this IEnumerable<object> tokens,
TokenType type) {
var current = new List<object>();
foreach (var item in tokens)
{
Token token = item as Token;
if (token != null && token.TokenType == type)
{
if( current.Count > 0)
{
yield return current;
current = new List<object>();
}
}
else
{
current.Add(item);
}
}
if( current.Count > 0)
yield return current;
}
Warning: This compiles but has still might have hidden bugs. It is getting late here.
// This is doing the same thing but doing it all by hand.
// You could use this method as well to lazily iterate through the 'current' list as well
// This is probably overkill and substantially more complex
public class TokenSplitter : IEnumerable<IEnumerable<object>>, IEnumerator<IEnumerable<object>>
{
IEnumerator<object> _enumerator;
IEnumerable<object> _tokens;
TokenType _target;
List<object> _current;
bool _isDone = false;
public TokenSplitter(IEnumerable<object> tokens, TokenType target)
{
_tokens = tokens;
_target = target;
Reset();
}
// Cruft from the IEnumerable and generic IEnumerator
public IEnumerator<IEnumerable<object>> GetEnumerator() { return this; }
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
public IEnumerable<object> Current { get { return _current; } }
public void Dispose() { }
#region IEnumerator Members
object System.Collections.IEnumerator.Current { get { return Current; } }
// See if there is anything left to get
public bool MoveNext()
{
if (_isDone) return false;
FillCurrent();
return !_isDone;
}
// Reset the enumerators so that you could reuse this structure if you wanted
public void Reset()
{
_isDone = false;
_enumerator = _tokens.GetEnumerator();
_current = new List<object>();
FillCurrent();
}
// Fills the current set of token and then begins the next set
private void FillCurrent()
{
// Try to accumulate as many tokens as possible, this too could be an enumerator to delay the process more
bool hasNext = _enumerator.MoveNext();
for( ; hasNext; hasNext = _enumerator.MoveNext())
{
Token token = _enumerator.Current as Token;
if (token == null || token.TokenType != _target)
{
_current.Add(_enumerator.Current);
}
else
{
_current = new List<object>();
}
}
// Continue removing matching tokens and begin creating the next set
for( ; hasNext; hasNext = _enumerator.MoveNext())
{
Token token = _enumerator.Current as Token;
if (token == null || token.TokenType != _target)
{
_current.Add(_enumerator.Current);
break;
}
}
_isDone = !hasNext;
}
#endregion
}
Related
How to create a List variable that can contain array of int or string in C#?
I have a class:
public class GoogleAdSlot
{
public IList<int[]> Size { get; set; }
}
How do I create a List such that
return new GoogleAdData
{
AdSlots = new Dictionary<string, GoogleAdSlot>
{
{
"googleAdSquareTile1",
new GoogleAdSlot {
Size = new List<int[]>
{
new[] {250, 250},
new[] {300, 300},
}
}
}
}
};
And:
return new GoogleAdData
{
AdSlots = new Dictionary<string, GoogleAdSlot>
{
{
"googleAdSquareTile1",
new GoogleAdSlot {
Size = new List<int[]>
{
new[] {fluid},
}
}
}
}
};
are both valid.
sample here
List<object> lst = new List<object>() { "12",1,"apple"};
lst.ForEach(m => { Console.WriteLine((m is int) ? "int Varible" : "String Varibale"); });
You can only store one Type and its derived types in a generic List, in the case you described you would have to use List<object> or it's non generic counterpart ArrayList, but non-generic collections are not recommended and don't support newer features like LINQ.
So that would look something like:
var list = new List<object> {"lol", 101};
foreach (var value in list)
{
if(value is string s)
Console.WriteLine(s);
if (value is int i)
Console.WriteLine(i);
}
I'm still not sure exactly what your requirements are, but here's a solution for the most general case.
First
Create a class capable of holding a collection of either integers or strings (but not both). Note that it uses the Collection initialization pattern (twice) (see https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/object-and-collection-initializers)
Note that the various GetEnumerator implementations are there only to satisfy the collection initialization pattern. They throw if called, the expectation is that a consumer would call GetIntegerSizes or GetStringSizes.
Here's the code:
public class AdSlotSizes : IEnumerable<int>, IEnumerable<string>
{
public enum CollectionType
{
Uninitialized,
Integers,
Strings,
}
private List<int> _integerSizes;
private List<string> _stringSizes;
public void Add(int sizeToAdd)
{
InitializeList(CollectionType.Integers);
_integerSizes.Add(sizeToAdd);
}
public void Add(string sizeToAdd)
{
InitializeList(CollectionType.Strings);
_stringSizes.Add(sizeToAdd);
}
public CollectionType SizesCollectionType => _collectionType;
private CollectionType _collectionType = CollectionType.Uninitialized;
private void InitializeList(CollectionType forCollectionType)
{
CollectionType oppositeCollectionType = (CollectionType)(((int) CollectionType.Strings + 1) - (int) forCollectionType);
if (_collectionType == oppositeCollectionType)
{
throw new ArgumentException($"A single {nameof(AdSlotSizes)} instance can only hold one type of sizes (int or string)");
}
if (_collectionType != CollectionType.Uninitialized)
{
return;
}
_collectionType = forCollectionType;
if (forCollectionType == CollectionType.Strings)
{
_stringSizes = _stringSizes ?? new List<string>();
}
if (forCollectionType == CollectionType.Integers)
{
_integerSizes = _integerSizes ?? new List<int>();
}
}
public IEnumerable<int> GetIntegerSizes()
{
if (_collectionType != CollectionType.Integers)
{
throw new ArgumentException("Size collection not initialized for integers");
}
foreach (var size in _integerSizes)
{
yield return size;
}
}
public IEnumerable<string> GetStringSizes()
{
if (_collectionType != CollectionType.Strings)
{
throw new ArgumentException("Size collection not initialized for strings");
}
foreach (var size in _stringSizes)
{
yield return size;
}
}
IEnumerator<string> IEnumerable<string>.GetEnumerator()
{
throw new NotImplementedException();
}
IEnumerator<int> IEnumerable<int>.GetEnumerator()
{
throw new NotImplementedException();
}
IEnumerator IEnumerable.GetEnumerator()
{
throw new NotImplementedException();
}
}
Second
Now I create a class that collections AdSlotSizes instances (again, using the collection initialization pattern). It also implements an indexed property (public AdSlotSizes this[int index]), allowing the consumer to index into the collection:
public class GoogleAddSlot : IEnumerable<AdSlotSizes>
{
private readonly List<AdSlotSizes> _slotSizes = new List<AdSlotSizes>();
public void Add(AdSlotSizes sizes)
{
_slotSizes.Add(sizes);
}
public AdSlotSizes this[int index] => _slotSizes[index];
public IEnumerator<AdSlotSizes> GetEnumerator()
{
foreach (var sizes in _slotSizes)
{
yield return sizes;
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return (IEnumerator < AdSlotSizes > )this.GetEnumerator();
}
}
Finally, a test method:
public static void TestCollectionSizes()
{
var slot = new GoogleAddSlot
{
{new AdSlotSizes {200, 300, 400} },
{new AdSlotSizes {"fluid", "solid"}},
};
foreach (int size in slot[0].GetIntegerSizes())
{
Debug.WriteLine($"Integer Size: {size}");
}
foreach (string size in slot[1].GetStringSizes())
{
Debug.WriteLine($"String Size: {size}");
}
}
When run, this test method outputs:
Integer Size: 200
Integer Size: 300
Integer Size: 400
String Size: fluid
String Size: solid
Given a city:
public class City
{
public int Id { get; set; }
public string Name { get; set; }
public string Country { get; set; }
public LatLong Location { get; set; }
}
I have a list of close to 3,000,000 cities (and towns and villages etc.) in a file. This file is read into memory; I have been playing with arrays, lists, dictionaries (key = Id) etc.
I want to find, as quick as possible, all cities matching a substring (case insensitive). So when I search for 'yor' I want to get all matches (1000+) ASAP (matching 'York Town', 'Villa Mayor', 'New York', ...).
Functionally you could write this as:
cities.Values.Where(c => c.Name.IndexOf("yor", StringComparison.OrdinalIgnoreCase) >= 0)
I don't mind doing some pre-processing when reading the file; as a matter of fact: that's what I'm mostly looking for. Read the file, "chew" on the data creating some sort of index or... and then be ready to answer queries like "yor".
I want this to be standalone, self-contained. I do not want to add dependencies like an RDBMS, ElasticSearch or whatever. I don't mind having (parts of) the list in memory more than once. I don't mind spending some memory on a datastructure to help me find my results quickly. I don't want libraries or packages. I want an algorithm I can implement myself.
Basically I want the above LINQ statement, but optimized for my case; currently plowing through almost 3,000,000 records takes about +/- 2 seconds. I want this sub 0.1 second so I could use the search and it's results as 'autocomplete'.
Creating an "index"(-alike) structure is probably what I need. As I'm writing I remember something about a "bloom filter" but I'm not sure if that would help or even supports substring search. Will look into that now.
Any tips, pointers, help very much appreciated.
I created a bit of a hybrid based on a suffix array / dictionary. Thanks to saibot for suggesting it first and all other people helping and suggesting.
This is what I came up with:
public class CitiesCollection
{
private Dictionary<int, City> _cities;
private SuffixDict<int> _suffixdict;
public CitiesCollection(IEnumerable<City> cities, int minLen)
{
_cities = cities.ToDictionary(c => c.Id);
_suffixdict = new SuffixDict<int>(minLen, _cities.Values.Count);
foreach (var c in _cities.Values)
_suffixdict.Add(c.Name, c.Id);
}
public IEnumerable<City> Find(string find)
{
var normalizedFind = _suffixdict.NormalizeString(find);
foreach (var id in _suffixdict.Get(normalizedFind).Where(v => _cities[v].Name.IndexOf(normalizedFind, StringComparison.OrdinalIgnoreCase) >= 0))
yield return _cities[id];
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, int capacity)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix))
AddDict(s, value);
}
public IEnumerable<T> Get(string suffix)
{
return Find(suffix).Distinct();
}
private IEnumerable<T> Find(string suffix)
{
foreach (var s in GetSuffixes(suffix))
{
if (_dict.TryGetValue(s, out var result))
foreach (var i in result)
yield return i;
}
}
public string NormalizeString(string value)
{
return value.Normalize().ToLowerInvariant();
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
private IEnumerable<string> GetSuffixes(string value)
{
var nv = NormalizeString(value);
for (var i = 0; i <= nv.Length - _suffixsize ; i++)
yield return nv.Substring(i, _suffixsize);
}
}
Usage (where I assume mycities to be an IEnumerable<City> with the given City object from the question):
var cc = new CitiesCollection(mycities, 3);
var results = cc.Find("york");
Some results:
Find: sterda elapsed: 00:00:00.0220522 results: 32
Find: york elapsed: 00:00:00.0006212 results: 155
Find: dorf elapsed: 00:00:00.0086439 results: 6095
Memory usage is very, very acceptable. Only 650MB total having the entire collection of 3,000,000 cities in memory.
In the above I'm storing Id's in the "SuffixDict" and I have a level of indirection (dictionary lookups to find id=>city). This can be further simplified to:
public class CitiesCollection
{
private SuffixDict<City> _suffixdict;
public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
{
_suffixdict = new SuffixDict<City>(minLen, capacity);
foreach (var c in cities)
_suffixdict.Add(c.Name, c);
}
public IEnumerable<City> Find(string find, StringComparison stringComparison = StringComparison.OrdinalIgnoreCase)
{
var normalizedFind = SuffixDict<City>.NormalizeString(find);
var x = _suffixdict.Find(normalizedFind).ToArray();
foreach (var city in _suffixdict.Find(normalizedFind).Where(v => v.Name.IndexOf(normalizedFind, stringComparison) >= 0))
yield return city;
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, int capacity = 1000)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix, _suffixsize))
AddDict(s, value);
}
public IEnumerable<T> Find(string suffix)
{
var normalizedfind = NormalizeString(suffix);
var find = normalizedfind.Substring(0, Math.Min(normalizedfind.Length, _suffixsize));
if (_dict.TryGetValue(find, out var result))
foreach (var i in result)
yield return i;
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
public static string NormalizeString(string value)
{
return value.Normalize().ToLowerInvariant();
}
private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
{
var nv = NormalizeString(value);
if (value.Length < suffixSize)
{
yield return nv;
}
else
{
for (var i = 0; i <= nv.Length - suffixSize; i++)
yield return nv.Substring(i, suffixSize);
}
}
}
This bumps the load time up from 00:00:16.3899085 to 00:00:25.6113214, memory usage goes down from 650MB to 486MB. Lookups/searches perform a bit better since we have one less level of indirection.
Find: sterda elapsed: 00:00:00.0168616 results: 32
Find: york elapsed: 00:00:00.0003945 results: 155
Find: dorf elapsed: 00:00:00.0062015 results: 6095
I'm happy with the results so far. I'll do a little polishing and refactoring and call it a day! Thanks everybody for the help!
And this is how it performs with 2,972,036 cities:
This has evolved into a case-insensitive, accent-insensitive search by modifying the code to this:
public static class ExtensionMethods
{
public static T FirstOrDefault<T>(this IEnumerable<T> src, Func<T, bool> testFn, T defval)
{
return src.Where(aT => testFn(aT)).DefaultIfEmpty(defval).First();
}
public static int IndexOf(this string source, string match, IEqualityComparer<string> sc)
{
return Enumerable.Range(0, source.Length) // for each position in the string
.FirstOrDefault(i => // find the first position where either
// match is Equals at this position for length of match (or to end of string) or
sc.Equals(source.Substring(i, Math.Min(match.Length, source.Length - i)), match) ||
// match is Equals to on of the substrings beginning at this position
Enumerable.Range(1, source.Length - i - 1).Any(ml => sc.Equals(source.Substring(i, ml), match)),
-1 // else return -1 if no position matches
);
}
}
public class CaseAccentInsensitiveEqualityComparer : IEqualityComparer<string>
{
private static readonly CompareOptions _compareoptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth | CompareOptions.IgnoreSymbols;
private static readonly CultureInfo _cultureinfo = CultureInfo.InvariantCulture;
public bool Equals(string x, string y)
{
return string.Compare(x, y, _cultureinfo, _compareoptions) == 0;
}
public int GetHashCode(string obj)
{
return obj != null ? RemoveDiacritics(obj).ToUpperInvariant().GetHashCode() : 0;
}
private string RemoveDiacritics(string text)
{
return string.Concat(
text.Normalize(NormalizationForm.FormD)
.Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
).Normalize(NormalizationForm.FormC);
}
}
public class CitiesCollection
{
private SuffixDict<City> _suffixdict;
private HashSet<string> _countries;
private Dictionary<int, City> _cities;
private readonly IEqualityComparer<string> _comparer = new CaseAccentInsensitiveEqualityComparer();
public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
{
_suffixdict = new SuffixDict<City>(minLen, _comparer, capacity);
_countries = new HashSet<string>();
_cities = new Dictionary<int, City>(capacity);
foreach (var c in cities)
{
_suffixdict.Add(c.Name, c);
_countries.Add(c.Country);
_cities.Add(c.Id, c);
}
}
public City this[int index] => _cities[index];
public IEnumerable<string> Countries => _countries;
public IEnumerable<City> Find(string find, StringComparison stringComparison = StringComparison.OrdinalIgnoreCase)
{
foreach (var city in _suffixdict.Find(find).Where(v => v.Name.IndexOf(find, _comparer) >= 0))
yield return city;
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, IEqualityComparer<string> stringComparer, int capacity = 1000)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity, stringComparer);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix, _suffixsize))
AddDict(s, value);
}
public IEnumerable<T> Find(string suffix)
{
var find = suffix.Substring(0, Math.Min(suffix.Length, _suffixsize));
if (_dict.TryGetValue(find, out var result))
{
foreach (var i in result)
yield return i;
}
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
{
if (value.Length < 2)
{
yield return value;
}
else
{
for (var i = 0; i <= value.Length - suffixSize; i++)
yield return value.Substring(i, suffixSize);
}
}
}
With credit also to Netmage and Mitsugui. There are still some issues / edge-cases but it's continually improving!
You could use a suffix tree: https://en.wikipedia.org/wiki/Suffix_tree
It requires enough space to store about 20 times your list of words in memory
Suffix array is a space efficient alternative: https://en.wikipedia.org/wiki/Suffix_array
in query benchmark contains very faster then indexOf >0
cities.Values.Where(c => c.Name.Contans("yor"))
I making a application with compression mechanism and need my own Dictionary. Every cicle in my app, It adds new element into a myDictionary and update(add a char to some previous elements in myDictionary ). I was doing it with normal list and Quicksort function, but it was really slow. I'm searching for some new methods how to do this but SortedList, Dictionary or LookUp doesnt seems like what I looking for. Is it better to make my own SortedList or is too hard/complex to manage?
Some of the code:
public class MyDictionary
{
private List<string> Contexts;
private List<string> Contents;
private int Count; //words count
//Konstruktor
public MyDictionary()
{
Count = 0;
Contexts = new List<string>();
Contents = new List<string>();
}
region Public Functions
public void AddChar(char ch, int contentSize)
{
for (int i = 0; i < Count; i++)
{
if (Contents[i].Length < contentSize)
{
Contents[i] = Contents[i] + ch;
}
}
}
public void Add(string context, string content)
{
Contexts.Add(Reverse(context)); //otočený kontext
Contents.Add(content);
Count++;
}
public void update()
{
quicksort(Contexts, Contents, 0, Count-1);
}
private void quicksort(List<String> context, List<String> content, int left, int right)
{
int i = left, j = right;
string pivot = context[(left + right) / 2];
while (i <= j)
{
while (context[i].CompareTo(pivot) < 0)
{
i++;
}
while (context[j].CompareTo(pivot) > 0)
{
j--;
}
if (i <= j)
{
swap(i,j);
i++;
j--;
}
}
// Recursive calls
if (left < j)
{
quicksort(context, content, left, j);
}
if (i < right)
{
quicksort(context, content, i, right);
}
}
private static string Reverse(string s)
{
char[] charArray = s.ToCharArray();
Array.Reverse(charArray);
return new string(charArray);
}
Here is a class that acts like a SortedDictionary, but can hold multiple values with the same key. You may need to flesh it out a little bit, with methods like Remove, and adding support for your own IComparer<TKey> if you need them. LINQPad file
public class SortedMultiValue<TKey, TValue> : IEnumerable<TValue>
{
private SortedDictionary<TKey, List<TValue>> _data;
public SortedMultiValue()
{
_data = new SortedDictionary<TKey, System.Collections.Generic.List<TValue>>();
}
public void Clear()
{
_data.Clear();
}
public void Add(TKey key, TValue value)
{
if (!_data.TryGetValue(key, out List<TValue> items))
{
items = new List<TValue>();
_data.Add(key, items);
}
items.Add(value);
}
public IEnumerable<TValue> Get(TKey key)
{
if (_data.TryGetValue(key, out List<TValue> items))
{
return items;
}
throw new KeyNotFoundException();
}
public IEnumerator<TValue> GetEnumerator()
{
return CreateEnumerable().GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator()
{
return CreateEnumerable().GetEnumerator();
}
IEnumerable<TValue> CreateEnumerable()
{
foreach (IEnumerable<TValue> values in _data.Values)
{
foreach (TValue value in values)
{
yield return value;
}
}
}
}
You can use it like this:
var data = new SortedMultiValue<string, string>();
data.Add("Dog", "Buddy");
data.Add("Dog", "Mr. Peanutbutter");
data.Add("cat", "Charlie");
data.Add("cat", "Sam");
data.Add("cat", "Leo");
foreach (string item in data)
{
Console.WriteLine(item);
}
Console.WriteLine();
foreach (string item in data.Get("cat"))
{
Console.WriteLine(item);
}
Console.WriteLine();
foreach (string item in data.Get("Dog"))
{
Console.WriteLine(item);
}
It produces this as the output (notice that the first group of names is sorted by the key they were inserted with):
Charlie
Sam
Leo
Buddy
Mr. Peanutbutter
Charlie
Sam
Leo
Buddy
Mr. Peanutbutter
Is there a StartWith method for arrays in .NET? Or something similar to it in LINQ?
var arr1 = { "A", "B, "C" }
var arr2 = { "A", "B, "C", "D" }
var arr3 = { "A", "B, "CD" }
var arr4 = { "E", "A, "B", "C" }
arr2.StartWith(arr1) // true
arr1.StartWith(arr2) // false
arr3.StartWith(arr1) // false
arr4.StartWith(arr1) // false
Or I should do it straightforward:
bool StartWith(string[] arr1, string[] arr2)
{
if (arr1.Count() < arr2.Count) return false;
for (var i = 0; i < arr2.Count(), i++)
{
if (arr2[i] != arr1[i]) return false;
}
return true;
}
I'm looking for the most efficient way to do that.
bool answer = arr2.Take(arr1.Length).SequenceEqual(arr1);
Your "striaghtformward" way is the way most LINQ methods would be doing it anyway. There are a few tweaks you could do. For example make it a extension method and use a comparer for the comparison of the two types so custom comparers could be used.
public static class ExtensionMethods
{
static bool StartWith<T>(this T[] arr1, T[] arr2)
{
return StartWith(arr1, arr2, EqualityComparer<T>.Default);
}
static bool StartWith<T>(this T[] arr1, T[] arr2, IEqualityComparer<T> comparer)
{
if (arr1.Length < arr2.Length) return false;
for (var i = 0; i < arr2.Length, i++)
{
if (!comparer.Equals(arr2[i], arr1[i])) return false;
}
return true;
}
}
UPDATE: For fun I decided to take the time and write a little more "advanced" version that would work with any IEnumerable<T> and not just arrays.
public static class ExtensionMethods
{
static bool StartsWith<T>(this IEnumerable<T> #this, IEnumerable<T> #startsWith)
{
return StartsWith(#this, startsWith, EqualityComparer<T>.Default);
}
static bool StartsWith<T>(this IEnumerable<T> #this, IEnumerable<T> startsWith, IEqualityComparer<T> comparer)
{
if (#this == null) throw new ArgumentNullException("this");
if (startsWith == null) throw new ArgumentNullException("startsWith");
if (comparer == null) throw new ArgumentNullException("comparer");
//Check to see if both types implement ICollection<T> to get a free Count check.
var thisCollection = #this as ICollection<T>;
var startsWithCollection = startsWith as ICollection<T>;
if (thisCollection != null && startsWithCollection != null && (thisCollection.Count < startsWithCollection.Count))
return false;
using (var thisEnumerator = #this.GetEnumerator())
using (var startsWithEnumerator = startsWith.GetEnumerator())
{
//Keep looping till the startsWithEnumerator runs out of items.
while (startsWithEnumerator.MoveNext())
{
//Check to see if the thisEnumerator ran out of items.
if (!thisEnumerator.MoveNext())
return false;
if (!comparer.Equals(thisEnumerator.Current, startsWithEnumerator.Current))
return false;
}
}
return true;
}
}
You can do:
var result = arr2.Take(arr1.Length).SequenceEqual(arr1);
To optimize it further you can add the check arr2.Length >= arr1.Length in the start like:
var result = arr2.Length >= arr1.Length && arr2.Take(arr1.Length).SequenceEqual(arr1);
The end result would be same.
Try Enumerable.SequenceEqual(a1, a2) but trim your first array, i.e.,
var arr1 = { "A", "B, "C" }
var arr2 = { "A", "B, "C", "D" }
if (Enumerable.SequenceEqual(arr1, arr2.Take(arr1.Length))
You don't want to require everything to be an array, and you don't want to call Count() on an IEnumerable<T> that may be a large query, when you only really want to sniff at the first four items or whatever.
public static class Extensions
{
public static void Test()
{
var a = new[] { "a", "b" };
var b = new[] { "a", "b", "c" };
var c = new[] { "a", "b", "c", "d" };
var d = new[] { "x", "y" };
Console.WriteLine("b.StartsWith(a): {0}", b.StartsWith(a));
Console.WriteLine("b.StartsWith(c): {0}", b.StartsWith(c));
Console.WriteLine("b.StartsWith(d, x => x.Length): {0}",
b.StartsWith(d, x => x.Length));
}
public static bool StartsWith<T>(
this IEnumerable<T> sequence,
IEnumerable<T> prefixCandidate,
Func<T, T, bool> compare = null)
{
using (var eseq = sequence.GetEnumerator())
using (var eprefix = prefixCandidate.GetEnumerator())
{
if (compare == null)
{
compare = (x, y) => Object.Equals(x, y);
}
eseq.MoveNext();
eprefix.MoveNext();
do
{
if (!compare(eseq.Current, eprefix.Current))
return false;
if (!eprefix.MoveNext())
return true;
}
while (eseq.MoveNext());
return false;
}
}
public static bool StartsWith<T, TProperty>(
this IEnumerable<T> sequence,
IEnumerable<T> prefixCandidate,
Func<T, TProperty> selector)
{
using (var eseq = sequence.GetEnumerator())
using (var eprefix = prefixCandidate.GetEnumerator())
{
eseq.MoveNext();
eprefix.MoveNext();
do
{
if (!Object.Equals(
selector(eseq.Current),
selector(eprefix.Current)))
{
return false;
}
if (!eprefix.MoveNext())
return true;
}
while (eseq.MoveNext());
return false;
}
}
}
Here are some different ways of doing that. I didn't optimize or fully validated everything, there is room for improvement everywhere. But this should give you some idea.
The best performance will always be going low level, if you grab the iterator and go step by step you can get much faster results.
Methods and performance results:
StartsWith1 00:00:01.9014586
StartsWith2 00:00:02.1227468
StartsWith3 00:00:03.2222109
StartsWith4 00:00:05.5544177
Test method:
var watch = new Stopwatch();
watch.Start();
for (int i = 0; i < 10000000; i++)
{
bool test = action(arr2, arr1);
}
watch.Stop();
return watch.Elapsed;
Methods:
public static class IEnumerableExtender
{
public static bool StartsWith1<T>(this IEnumerable<T> source, IEnumerable<T> compare)
{
if (source.Count() < compare.Count())
{
return false;
}
using (var se = source.GetEnumerator())
{
using (var ce = compare.GetEnumerator())
{
while (ce.MoveNext() && se.MoveNext())
{
if (!ce.Current.Equals(se.Current))
{
return false;
}
}
}
}
return true;
}
public static bool StartsWith2<T>(this IEnumerable<T> source, IEnumerable<T> compare) =>
compare.Take(source.Count()).SequenceEqual(source);
public static bool StartsWith3<T>(this IEnumerable<T> source, IEnumerable<T> compare)
{
if (source == null)
{
throw new ArgumentNullException(nameof(source));
}
if (compare == null)
{
throw new ArgumentNullException(nameof(compare));
}
if (source.Count() < compare.Count())
{
return false;
}
return compare.SequenceEqual(source.Take(compare.Count()));
}
public static bool StartsWith4<T>(this IEnumerable<T> arr1, IEnumerable<T> arr2)
{
return StartsWith4(arr1, arr2, EqualityComparer<T>.Default);
}
public static bool StartsWith4<T>(this IEnumerable<T> arr1, IEnumerable<T> arr2, IEqualityComparer<T> comparer)
{
if (arr1.Count() < arr2.Count()) return false;
for (var i = 0; i < arr2.Count(); i++)
{
if (!comparer.Equals(arr2.ElementAt(i), arr1.ElementAt(i))) return false;
}
return true;
}
}
Something similar to http://regexio.com/prototype.html, I'm trying to get a set matching a particular regex.
Basically, you need to parse the regular expression and then, instead of reading input while walking the parsed expression, output the variants.
I have hacked the following program doing what you need for a very simple regular expression (only alternate options using |, iteration using *, grouping using (), and escaping using \ is supported). Note that the iteration is done simply 0–5 times, conversion to possibly infinite iteration left as an exercise for the reader ;-).
I have used a straightforward recursive-descent parser building an abstract syntax tree in memory; this tree is in the end walked and all possible sets are built. The solution is probably not optimal at all, but it works. Enjoy:
public class TestPrg
{
static void Main()
{
var expression = new RegexParser("a(b|c)*d").Parse();
foreach (var item in expression.Generate())
{
Console.WriteLine(item);
}
}
}
public static class EnumerableExtensions
{
// Build a Cartesian product of a sequence of sequences
// Code by Eric Lippert, copied from <http://blogs.msdn.com/b/ericlippert/archive/2010/06/28/computing-a-cartesian-product-with-linq.aspx>
public static IEnumerable<IEnumerable<T>> CartesianProduct<T>(this IEnumerable<IEnumerable<T>> sequences)
{
IEnumerable<IEnumerable<T>> emptyProduct = new[] { Enumerable.Empty<T>() };
return sequences.Aggregate(
emptyProduct,
(accumulator, sequence) =>
from accseq in accumulator
from item in sequence
select accseq.Concat(new[] { item }));
}
}
public class RegexParser
{
private const char EOF = '\x0000';
private readonly string str;
private char curr;
private int pos;
public RegexParser(string s)
{
str = s;
}
public RegExpression Parse()
{
pos = -1;
Read();
return ParseExpression();
}
private void Read()
{
++pos;
curr = pos < str.Length ? str[pos] : EOF;
}
private RegExpression ParseExpression()
{
var term = ParseTerm();
if (curr == '|')
{
Read();
var secondExpr = ParseExpression();
return new Variants(term, secondExpr);
}
else
{
return term;
}
}
private RegExpression ParseTerm()
{
var factor = ParseFactor();
if (curr != '|' && curr != '+' && curr != '*' && curr != ')' && curr != EOF)
{
var secondTerm = ParseTerm();
return new Concatenation(factor, secondTerm);
}
else
{
return factor;
}
}
private RegExpression ParseFactor()
{
var element = ParseElement();
if (curr == '*')
{
Read();
return new Repeat(element);
}
else
{
return element;
}
}
private RegExpression ParseElement()
{
switch (curr)
{
case '(':
Read();
var expr = ParseExpression();
if (curr != ')') throw new FormatException("Closing paren expected");
Read();
return expr;
case '\\':
Read();
var escapedChar = curr;
Read();
return new Literal(escapedChar);
default:
var literal = curr;
Read();
return new Literal(literal);
}
}
}
public abstract class RegExpression
{
protected static IEnumerable<RegExpression> Merge<T>(RegExpression head, RegExpression tail, Func<T, IEnumerable<RegExpression>> selector)
where T : RegExpression
{
var other = tail as T;
if (other != null)
{
return new[] { head }.Concat(selector(other));
}
else
{
return new[] { head, tail };
}
}
public abstract IEnumerable<string> Generate();
}
public class Variants : RegExpression
{
public IEnumerable<RegExpression> Subexpressions { get; private set; }
public Variants(RegExpression term, RegExpression rest)
{
Subexpressions = Merge<Variants>(term, rest, c => c.Subexpressions);
}
public override IEnumerable<string> Generate()
{
return Subexpressions.SelectMany(sub => sub.Generate());
}
}
public class Concatenation : RegExpression
{
public IEnumerable<RegExpression> Subexpressions { get; private set; }
public Concatenation(RegExpression factor, RegExpression rest)
{
Subexpressions = Merge<Concatenation>(factor, rest, c => c.Subexpressions);
}
public override IEnumerable<string> Generate()
{
foreach (var variant in Subexpressions.Select(sub => sub.Generate()).CartesianProduct())
{
var builder = new StringBuilder();
foreach (var item in variant) builder.Append(item);
yield return builder.ToString();
}
}
}
public class Repeat : RegExpression
{
public RegExpression Expr { get; private set; }
public Repeat(RegExpression expr)
{
Expr = expr;
}
public override IEnumerable<string> Generate()
{
foreach (var subexpr in Expr.Generate())
{
for (int cnt = 0; cnt < 5; ++cnt)
{
var builder = new StringBuilder(subexpr.Length * cnt);
for (int i = 0; i < cnt; ++i) builder.Append(subexpr);
yield return builder.ToString();
}
}
}
}
public class Literal : RegExpression
{
public char Ch { get; private set; }
public Literal(char c)
{
Ch = c;
}
public override IEnumerable<string> Generate()
{
yield return new string(Ch, 1);
}
}
My answer to a similar question might work for you. If the regex doesn't involve
any * operations (so that the language it recognizes is finite), it should be easy
to rewrite the regex as a BNF grammar, then do a bottom-up analysis producing
the finite sets corresponding to each nonterminal symbol until you reach the
start symbol, at which point you're done.