C# get also the key from parsing substring from string - c#

I would like to refactor my method. I also need to get which value was first? So which anyOf? Is it possible to get it from here?
Example:
List<string> anyOf = new List<string>(){"at", "near", "by", "above"};
string source = "South Branch Raritan River near High Bridge at NJ"
public static int IndexOfAny(this string source, IEnumerable<string> anyOf, StringComparison stringComparisonType = StringComparison.CurrentCultureIgnoreCase)
{
var founds = anyOf
.Select(sub => source.IndexOf(sub, stringComparisonType))
.Where(i => i >= 0);
return founds.Any() ? founds.Min() : -1;
}
I would like to get back what is first in string. "near" or "at".

You could use:
public static (int index, string? firstMatch) IndexOfAny(this string source, IEnumerable<string> anyOf, StringComparison stringComparisonType = StringComparison.CurrentCultureIgnoreCase)
{
return anyOf
.Select(s => (Index: source.IndexOf(s, stringComparisonType), String: s))
.Where(x => x.Index >= 0)
.DefaultIfEmpty((-1, null))
.First();
}

I couldn't resist creating a more efficient implementation.
Working here.
Whilst this looks more complicated, its better because,
It allocates only,
an array for the valid search terms,
a array of indices for each search term and,
an array of lengths for each search term.
The source text is enumerated only once and, if a match is found,
that loop will exit early.
Additionally, the code incorporates parameter checking which you'll want as extension methods should be resusable.
public static class Extensions
{
public static int IndexOfAny<T>(
this IEnumerable<T> source,
IEnumerable<IEnumerable<T>> targets,
IEqualityComparer<T> comparer = null)
{
// Parameter Handling
comparer = comparer ?? EqualityComparer<T>.Default;
ArgumentNullException.ThrowIfNull(targets);
var clean = targets
.Where(t => t != null)
.Select(t => t.ToArray())
.Where(t => t.Length > 0)
.ToArray();
if (clean.Length == 0)
{
throw new ArgumentException(
$"'{nameof(targets)}' does not contain a valid search sequence");
}
// Prep
var lengths = clean.Select(t => t.Length).ToArray();
var indices = clean.Select(_ => 0).ToArray();
int i = 0;
// Process
foreach(var t in source)
{
i++;
for(var j = 0; j < clean.Length; j++)
{
var index = indices[j];
if (comparer.Equals(clean[j][index], t))
{
index += 1;
if (index == lengths[j])
{
return i - lengths[j];
}
indices[j] = index;
}
else
{
if (index != 0)
{
indices[j] = 0;
}
}
}
}
return -1;
}
public static int IndexOfAny(
this string source,
IEnumerable<string> targets,
StringComparer comparer = null)
{
comparer = comparer ?? StringComparer.Ordinal;
ArgumentNullException.ThrowIfNull(targets);
return source.ToCharArray().IndexOfAny(
targets.Select(t => t.ToCharArray()),
new CharComparerAdapter(comparer));
}
}
public class CharComparerAdapter : IEqualityComparer<char>
{
private StringComparer Comparer { get; }
public CharComparerAdapter(StringComparer comparer)
{
ArgumentNullException.ThrowIfNull(comparer);
Comparer = comparer;
}
public bool Equals(char left, char right)
{
return Comparer.Equals(left.ToString(), right.ToString());
}
public int GetHashCode(char v)
{
return v;
}
}

Related

Check List<T> for duplications with optional words to exclude

I have a method as below. Method return either false/true either when list contains duplicates or not. I would like to extend my method to say for instance (optional) that i want to exclude specific items from check. For instance i want to check entire list as it is now or i want to say for instance exclude: string.empty items or for instance string.empty and "some word". Is it possible?
public static bool IsListContainsDuplicates<T>(List<T> list)
{
return list.GroupBy(n => n).Any(c => c.Count() > 1);
}
public static bool ContainsDuplicates<T>(this IEnumerable<T> items, IEnumerable<T> itemsToExclude = null)
{
if (itemsToExclude == null) itemsToExclude = Enumerable.Empty<T>();
return items.Except(itemsToExclude)
.GroupBy(n => n)
.Any(c => c.Count() > 1);
}
But i'd prefer this implementation because it's more performant:
public static bool ContainsDuplicates<T>(this IEnumerable<T> items, IEnumerable<T> itemsToExclude = null)
{
if (itemsToExclude == null) itemsToExclude = Enumerable.Empty<T>();
HashSet<T> set = new HashSet<T>();
return !items.Except(itemsToExclude).All(set.Add);
}
Instead of making your method more complicated, you should open it more to combine it with others:
public static class MyLinqMethods
{
public static bool HasDuplicates<T>(this IEnumerable<T> sequence)
{
return sequence.GroupBy(n => n).Any(c => c.Count() > 1);
}
}
Now you can use it with Linq:
var original = new[] { string.Empty, "Hello", "World", string.Empty };
var duplicatesInOriginal = original.HasDuplicates();
var duplicatesIfStringEmptyIsIgnored = original.Where(o => o != string.Empty).HasDuplicates();
You can use Except(). From MSDN:
Produces the set difference of two sequences by using the default
equality comparer to compare values.
return list.Except(listToExclude).GroupBy(n => n).Any(c => c.Count() > 1);
This will also help, using a 'params' in arguments and then doing Except()
public static bool IsListContainsDuplicates<T>(List<T> list, params T[] optional)
{
return list.Except(optional).GroupBy(n => n).Any(c => c.Count() > 1);
}
You can call like this if you doesn't want to exclude anything:
IsListContainsDuplicates(list)
Else, just pass the params values, for example, if the list is an integer list then,
IsListContainsDuplicates(list,5,4)

Quick (sub)string search in large set of data

Given a city:
public class City
{
public int Id { get; set; }
public string Name { get; set; }
public string Country { get; set; }
public LatLong Location { get; set; }
}
I have a list of close to 3,000,000 cities (and towns and villages etc.) in a file. This file is read into memory; I have been playing with arrays, lists, dictionaries (key = Id) etc.
I want to find, as quick as possible, all cities matching a substring (case insensitive). So when I search for 'yor' I want to get all matches (1000+) ASAP (matching 'York Town', 'Villa Mayor', 'New York', ...).
Functionally you could write this as:
cities.Values.Where(c => c.Name.IndexOf("yor", StringComparison.OrdinalIgnoreCase) >= 0)
I don't mind doing some pre-processing when reading the file; as a matter of fact: that's what I'm mostly looking for. Read the file, "chew" on the data creating some sort of index or... and then be ready to answer queries like "yor".
I want this to be standalone, self-contained. I do not want to add dependencies like an RDBMS, ElasticSearch or whatever. I don't mind having (parts of) the list in memory more than once. I don't mind spending some memory on a datastructure to help me find my results quickly. I don't want libraries or packages. I want an algorithm I can implement myself.
Basically I want the above LINQ statement, but optimized for my case; currently plowing through almost 3,000,000 records takes about +/- 2 seconds. I want this sub 0.1 second so I could use the search and it's results as 'autocomplete'.
Creating an "index"(-alike) structure is probably what I need. As I'm writing I remember something about a "bloom filter" but I'm not sure if that would help or even supports substring search. Will look into that now.
Any tips, pointers, help very much appreciated.
I created a bit of a hybrid based on a suffix array / dictionary. Thanks to saibot for suggesting it first and all other people helping and suggesting.
This is what I came up with:
public class CitiesCollection
{
private Dictionary<int, City> _cities;
private SuffixDict<int> _suffixdict;
public CitiesCollection(IEnumerable<City> cities, int minLen)
{
_cities = cities.ToDictionary(c => c.Id);
_suffixdict = new SuffixDict<int>(minLen, _cities.Values.Count);
foreach (var c in _cities.Values)
_suffixdict.Add(c.Name, c.Id);
}
public IEnumerable<City> Find(string find)
{
var normalizedFind = _suffixdict.NormalizeString(find);
foreach (var id in _suffixdict.Get(normalizedFind).Where(v => _cities[v].Name.IndexOf(normalizedFind, StringComparison.OrdinalIgnoreCase) >= 0))
yield return _cities[id];
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, int capacity)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix))
AddDict(s, value);
}
public IEnumerable<T> Get(string suffix)
{
return Find(suffix).Distinct();
}
private IEnumerable<T> Find(string suffix)
{
foreach (var s in GetSuffixes(suffix))
{
if (_dict.TryGetValue(s, out var result))
foreach (var i in result)
yield return i;
}
}
public string NormalizeString(string value)
{
return value.Normalize().ToLowerInvariant();
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
private IEnumerable<string> GetSuffixes(string value)
{
var nv = NormalizeString(value);
for (var i = 0; i <= nv.Length - _suffixsize ; i++)
yield return nv.Substring(i, _suffixsize);
}
}
Usage (where I assume mycities to be an IEnumerable<City> with the given City object from the question):
var cc = new CitiesCollection(mycities, 3);
var results = cc.Find("york");
Some results:
Find: sterda elapsed: 00:00:00.0220522 results: 32
Find: york elapsed: 00:00:00.0006212 results: 155
Find: dorf elapsed: 00:00:00.0086439 results: 6095
Memory usage is very, very acceptable. Only 650MB total having the entire collection of 3,000,000 cities in memory.
In the above I'm storing Id's in the "SuffixDict" and I have a level of indirection (dictionary lookups to find id=>city). This can be further simplified to:
public class CitiesCollection
{
private SuffixDict<City> _suffixdict;
public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
{
_suffixdict = new SuffixDict<City>(minLen, capacity);
foreach (var c in cities)
_suffixdict.Add(c.Name, c);
}
public IEnumerable<City> Find(string find, StringComparison stringComparison = StringComparison.OrdinalIgnoreCase)
{
var normalizedFind = SuffixDict<City>.NormalizeString(find);
var x = _suffixdict.Find(normalizedFind).ToArray();
foreach (var city in _suffixdict.Find(normalizedFind).Where(v => v.Name.IndexOf(normalizedFind, stringComparison) >= 0))
yield return city;
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, int capacity = 1000)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix, _suffixsize))
AddDict(s, value);
}
public IEnumerable<T> Find(string suffix)
{
var normalizedfind = NormalizeString(suffix);
var find = normalizedfind.Substring(0, Math.Min(normalizedfind.Length, _suffixsize));
if (_dict.TryGetValue(find, out var result))
foreach (var i in result)
yield return i;
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
public static string NormalizeString(string value)
{
return value.Normalize().ToLowerInvariant();
}
private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
{
var nv = NormalizeString(value);
if (value.Length < suffixSize)
{
yield return nv;
}
else
{
for (var i = 0; i <= nv.Length - suffixSize; i++)
yield return nv.Substring(i, suffixSize);
}
}
}
This bumps the load time up from 00:00:16.3899085 to 00:00:25.6113214, memory usage goes down from 650MB to 486MB. Lookups/searches perform a bit better since we have one less level of indirection.
Find: sterda elapsed: 00:00:00.0168616 results: 32
Find: york elapsed: 00:00:00.0003945 results: 155
Find: dorf elapsed: 00:00:00.0062015 results: 6095
I'm happy with the results so far. I'll do a little polishing and refactoring and call it a day! Thanks everybody for the help!
And this is how it performs with 2,972,036 cities:
This has evolved into a case-insensitive, accent-insensitive search by modifying the code to this:
public static class ExtensionMethods
{
public static T FirstOrDefault<T>(this IEnumerable<T> src, Func<T, bool> testFn, T defval)
{
return src.Where(aT => testFn(aT)).DefaultIfEmpty(defval).First();
}
public static int IndexOf(this string source, string match, IEqualityComparer<string> sc)
{
return Enumerable.Range(0, source.Length) // for each position in the string
.FirstOrDefault(i => // find the first position where either
// match is Equals at this position for length of match (or to end of string) or
sc.Equals(source.Substring(i, Math.Min(match.Length, source.Length - i)), match) ||
// match is Equals to on of the substrings beginning at this position
Enumerable.Range(1, source.Length - i - 1).Any(ml => sc.Equals(source.Substring(i, ml), match)),
-1 // else return -1 if no position matches
);
}
}
public class CaseAccentInsensitiveEqualityComparer : IEqualityComparer<string>
{
private static readonly CompareOptions _compareoptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth | CompareOptions.IgnoreSymbols;
private static readonly CultureInfo _cultureinfo = CultureInfo.InvariantCulture;
public bool Equals(string x, string y)
{
return string.Compare(x, y, _cultureinfo, _compareoptions) == 0;
}
public int GetHashCode(string obj)
{
return obj != null ? RemoveDiacritics(obj).ToUpperInvariant().GetHashCode() : 0;
}
private string RemoveDiacritics(string text)
{
return string.Concat(
text.Normalize(NormalizationForm.FormD)
.Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
).Normalize(NormalizationForm.FormC);
}
}
public class CitiesCollection
{
private SuffixDict<City> _suffixdict;
private HashSet<string> _countries;
private Dictionary<int, City> _cities;
private readonly IEqualityComparer<string> _comparer = new CaseAccentInsensitiveEqualityComparer();
public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
{
_suffixdict = new SuffixDict<City>(minLen, _comparer, capacity);
_countries = new HashSet<string>();
_cities = new Dictionary<int, City>(capacity);
foreach (var c in cities)
{
_suffixdict.Add(c.Name, c);
_countries.Add(c.Country);
_cities.Add(c.Id, c);
}
}
public City this[int index] => _cities[index];
public IEnumerable<string> Countries => _countries;
public IEnumerable<City> Find(string find, StringComparison stringComparison = StringComparison.OrdinalIgnoreCase)
{
foreach (var city in _suffixdict.Find(find).Where(v => v.Name.IndexOf(find, _comparer) >= 0))
yield return city;
}
}
public class SuffixDict<T>
{
private readonly int _suffixsize;
private ConcurrentDictionary<string, IList<T>> _dict;
public SuffixDict(int suffixSize, IEqualityComparer<string> stringComparer, int capacity = 1000)
{
_suffixsize = suffixSize;
_dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity, stringComparer);
}
public void Add(string suffix, T value)
{
foreach (var s in GetSuffixes(suffix, _suffixsize))
AddDict(s, value);
}
public IEnumerable<T> Find(string suffix)
{
var find = suffix.Substring(0, Math.Min(suffix.Length, _suffixsize));
if (_dict.TryGetValue(find, out var result))
{
foreach (var i in result)
yield return i;
}
}
private void AddDict(string suffix, T value)
{
_dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
}
private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
{
if (value.Length < 2)
{
yield return value;
}
else
{
for (var i = 0; i <= value.Length - suffixSize; i++)
yield return value.Substring(i, suffixSize);
}
}
}
With credit also to Netmage and Mitsugui. There are still some issues / edge-cases but it's continually improving!
You could use a suffix tree: https://en.wikipedia.org/wiki/Suffix_tree
It requires enough space to store about 20 times your list of words in memory
Suffix array is a space efficient alternative: https://en.wikipedia.org/wiki/Suffix_array
in query benchmark contains very faster then indexOf >0
cities.Values.Where(c => c.Name.Contans("yor"))

C# remove duplicates from List<string[]>

I got a lot of data from a database, which are results from a search function. Now I've a List<string[]> which has duplicated elements of type string[]. The string[] in the list are the search results.
I know that every new created array has a different instance so i can't use MyListOfArrays.Distinct().ToList().
Maybe it's a very basic question...
My question is, are there any functions built in to remove a duplicated string[] form the List<string[]>? Or do I have to write it by my selfe?
Thank you
You can use distinct method with custom equalityComparer
IEnumerable<string[]> distinct = inputStringArrayList.Distinct(new EqualityComparer());
EqualityComparer
class EqualityComparer : IEqualityComparer<string[]>
{
public bool Equals(string[] x, string[] y)
{
if (x.Length != y.Length)
{
return false;
}
if (x.Where((t, i) => t != y[i]).Any())
{
return false;
}
return true;
}
public int GetHashCode(string[] obj)
{
return obj.GetHashCode();
}
}
Alternative Equals Method
public bool Equals(string[] x, string[] y)
{
return x.SequenceEqual(y);
}
Here I am assuming you are having exact same string arrays with same content at same index.
Correction from Matthew Watson
public int GetHashCode(string[] obj)
{
if (obj == null)
return 0;
int hash = 17;
unchecked
{
foreach (string s in obj)
hash = hash*23 + ((s == null) ? 0 : s.GetHashCode());
}
return hash;
}
I have corrected the answer from #Muctadir Dinar.
(He deserves credit for the answer - I am just correcting it and providing a complete test program):
using System;
using System.Collections.Generic;
using System.Linq;
namespace Demo
{
sealed class EqualityComparer: IEqualityComparer<string[]>
{
public bool Equals(string[] x, string[] y)
{
if (ReferenceEquals(x, y))
return true;
if (x == null || y == null)
return false;
return x.SequenceEqual(y);
}
public int GetHashCode(string[] obj)
{
if (obj == null)
return 0;
int hash = 17;
unchecked
{
foreach (string s in obj)
hash = hash*23 + ((s == null) ? 0 : s.GetHashCode());
}
return hash;
}
}
class Program
{
private void run()
{
var list = new List<string[]>
{
strings(1, 10),
strings(2, 10),
strings(3, 10),
strings(2, 10),
strings(4, 10)
};
dump(list);
Console.WriteLine();
var result = list.Distinct(new EqualityComparer());
dump(result);
}
static void dump(IEnumerable<string[]> list)
{
foreach (var array in list)
Console.WriteLine(string.Join(",", array));
}
static string[] strings(int start, int count)
{
return Enumerable.Range(start, count)
.Select(element => element.ToString())
.ToArray();
}
static void Main(string[] args)
{
new Program().run();
}
}
}
A simple and not very efficient approach would be to use string.Join on the string[]:
list = list
.GroupBy(strArr => string.Join("|", strArr))
.Select(g => g.First())
.ToList();

Custom array sort

Each item/string in my array starts with two letters followed by two or three numbers and then sometimes followed by another letter.
Examples, RS01 RS10 RS32A RS102 RS80 RS05A RS105A RS105B
I tried to sort this using the default Array.Sort but it came back with this...
RS01
RS05A
RS10
RS102
RS105A
RS105B
RS32A
RS80
But I need it like this..
RS01
RS05A
RS10
RS32A
RS80
RS102
RS105A
RS105B
Any Ideas?
Here is sorting with custom comparison delegate and regular expressions:
string[] array = { "RS01", "RS10", "RS32A", "RS102",
"RS80", "RS05A", "RS105A", "RS105B" };
Array.Sort(array, (s1, s2) =>
{
Regex regex = new Regex(#"([a-zA-Z]+)(\d+)([a-zA-Z]*)");
var match1 = regex.Match(s1);
var match2 = regex.Match(s2);
// prefix
int result = match1.Groups[1].Value.CompareTo(match2.Groups[1].Value);
if (result != 0)
return result;
// number
result = Int32.Parse(match1.Groups[2].Value)
.CompareTo(Int32.Parse(match2.Groups[2].Value));
if (result != 0)
return result;
// suffix
return match1.Groups[3].Value.CompareTo(match2.Groups[3].Value);
});
UPDATE (little refactoring, and moving all stuff to separate comparer class). Usage:
Array.Sort(array, new RSComparer());
Comparer itself:
public class RSComparer : IComparer<string>
{
private Dictionary<string, RS> entries = new Dictionary<string, RS>();
public int Compare(string x, string y)
{
if (!entries.ContainsKey(x))
entries.Add(x, new RS(x));
if (!entries.ContainsKey(y))
entries.Add(y, new RS(y));
return entries[x].CompareTo(entries[y]);
}
private class RS : IComparable
{
public RS(string value)
{
Regex regex = new Regex(#"([A-Z]+)(\d+)([A-Z]*)");
var match = regex.Match(value);
Prefix = match.Groups[1].Value;
Number = Int32.Parse(match.Groups[2].Value);
Suffix = match.Groups[3].Value;
}
public string Prefix { get; private set; }
public int Number { get; private set; }
public string Suffix { get; private set; }
public int CompareTo(object obj)
{
RS rs = (RS)obj;
int result = Prefix.CompareTo(rs.Prefix);
if (result != 0)
return result;
result = Number.CompareTo(rs.Number);
if (result != null)
return result;
return Suffix.CompareTo(rs.Suffix);
}
}
}
You can use this linq query:
var strings = new[] {
"RS01","RS05A","RS10","RS102","RS105A","RS105B","RS32A","RS80"
};
strings = strings.Select(str => new
{
str,
num = int.Parse(String.Concat(str.Skip(2).TakeWhile(Char.IsDigit))),
version = String.Concat(str.Skip(2).SkipWhile(Char.IsDigit))
})
.OrderBy(x => x.num).ThenBy(x => x.version)
.Select(x => x.str)
.ToArray();
DEMO
Result:
RS01
RS05A
RS10
RS32A
RS80
RS102
RS105A
RS105B
You'll want to write a custom comparer class implementing IComparer<string>; it's pretty straightforward to break your strings into components. When you call Array.Sort, give it an instance of your comparer and you'll get the results you want.

In C#, what's the best way to search a list of a class by one of the class's members?

Say I have a class that looks something like this:
class SomeClass {
int m_member;
public int Member {
get { return m_member; }
set { m_member = value; }
}
}
And somewhere else, I have a list of type List<SomeClass> list.
If I want to search the list for a particular instance of the class, I can just do
int index = list.IndexOf(someInstance);
But if I want to search the list by Member, I have to do this:
int index = -1;
for (int i = 0; i < list.Count; i++) {
if (list[i].Member == someMember) {
index = i;
break;
}
}
Is there a better way to do this?
int index = list.FindIndex(m => m.Member == someMember);
If you can use Linq
SomeClass aClasss = list.FirstOrDefault(item => item.Member == someMember);
You can have an extended search method(using reflection) for the list like below where property name and search value are input parameters.
public static List<T> SearchList<T>(this List<T> list, string PropertyName, string SearchValue)
{
return list.Select(item =>
new
{
i = item,
Props = item.GetType().GetProperties(BindingFlags.Instance | BindingFlags.Public)
})
.Where(item => item.Props.Any(p =>
{
var val = p.GetValue(item.i, null);
return val != null
&& (p.Name.ToLower() == PropertyName.ToLower() || string.IsNullOrEmpty(PropertyName))
&& (val.ToString().ToLower().Contains(SearchValue.ToLower()) || string.IsNullOrEmpty(SearchValue));
}))
.Select(item => item.i)
.ToList();
}
calling:
List<Employee> employees = new List<Employee>();
var searchedEmployees = data.SortList(serachProperty, searchValue);

Categories

Resources