Searching for dictionary keys contained in a string array

I have a List of strings where each item is free text describing a skill, so it looks something like this:
List<string> list = new List<string> { "very good right now", "pretty good",
    "convinced me that is good", "pretty medium", "just medium" /* ... */ };
And I want to keep a user score for these free texts. So for now, I use conditions:
foreach (var item in list)
{
    if (item.Contains("good"))
    {
        score += 2.5;
        Console.WriteLine("good skill, score+= 2.5, is now {0}", score);
    }
    else if (item.Contains("low"))
    {
        score += 1.0;
        Console.WriteLine("low skill, score+= 1.0, is now {0}", score);
    }
}
Suppose in the future I want to use a dictionary for the score mapping, such as:
Dictionary<string, double> dic = new Dictionary<string, double>
{ { "good", 2.5 }, { "low", 1.0 }};
What would be a good way to match the dictionary keys against the string list? The way I see it now is to use a nested loop:
foreach (var item in list)
{
foreach (var key in dic.Keys)
if (item.Contains(key))
score += dic[key];
}
But I'm sure there are better ways. Better being faster, or more pleasant to the eye (LINQ) at the very least.
Thanks.

var scores = from item in list
from word in item.Split()
join kvp in dic on word equals kvp.Key
select kvp.Value;
var totalScore = scores.Sum();
Note: your current solution checks whether the item in the list contains a key from the dictionary. But it will return true even if the key is only part of some word in the item. E.g. "follow the rabbit" contains "low". Splitting the item into words solves this issue.
Also, LINQ join builds a hash-based lookup from the second sequence (the dictionary entries) and probes it once per word. That gives you O(1) lookup per word instead of O(N) when you enumerate all entries of the dictionary.
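The same word-level matching can also be written as a plain loop that looks each word up in the dictionary directly. This is only a sketch of the idea above, not the answerer's code; SkillScoring and TotalScore are hypothetical names, and it assumes words are separated by whitespace and that matching is case-sensitive.

```csharp
using System;
using System.Collections.Generic;

public static class SkillScoring
{
    // Sums the score of every whole word that appears as a key in the map.
    public static double TotalScore(IEnumerable<string> items, IDictionary<string, double> scoreByWord)
    {
        double total = 0.0;
        foreach (var item in items)
        {
            // A null separator array splits on whitespace, so "follow" never matches the key "low".
            foreach (var word in item.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                double value;
                if (scoreByWord.TryGetValue(word, out value))
                    total += value;
            }
        }
        return total;
    }
}
```

Each word costs one hash lookup, so the work grows with the number of words in the list rather than with the number of dictionary entries.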

If your code finds N skill strings containing the word "good", then it adds the score 2.5 N times.
So you can just count the skill strings containing each dictionary word and multiply that count by the corresponding score.
var scores = from pair in dic
let word = pair.Key
let score = pair.Value
let count = list.Count(x => x.Contains(word))
select score * count;
var totalScore = scores.Sum();

It's not really faster, but you can use LINQ:
score = list.Select(s => dic.Where(d => s.Contains(d.Key))
.Sum(d => d.Value))
.Sum();
Note that your example loop will hit 2 different keys if the string matches both; I kept that behavior in my solution.

Well, you aren't really using the Dictionary as a dictionary, so we can simplify this a bit with a new class:
class TermValue
{
public string Term { get; set; }
public double Value { get; set; }
public TermValue(string t, double v)
{
Term = t;
Value = v;
}
}
With that, we can be a bit more direct:
void Main()
{
var dic = new TermValue[] { new TermValue("good", 2.5), new TermValue("low", 1.0)};
List<string> list = new List<string> {"very good right now", "pretty good",
"convinced me that is good", "pretty medium", "just medium" };
double score = 0.0;
foreach (var item in list)
{
var entry = dic.FirstOrDefault(d =>item.Contains(d.Term));
if (entry != null)
score += entry.Value;
}
}
From here, we can just play a bit (the compiled code for this will probably be the same as above):
double score = 0.0;
foreach (var item in list)
{
score += dic.FirstOrDefault(d =>item.Contains(d.Term))?.Value ?? 0.0;
}
Then (in the words of the Purple One), we can go crazy:
double score = list.Aggregate(0.0,
(scre, item) =>scre + (dic.FirstOrDefault(d => item.Contains(d.Term))?.Value ?? 0.0));

Related

c# : avoid branching while iterating on a dictionary

I have the code below, in which I branch for each sample in a dictionary. Is there a way, either with LINQ or some other method, to avoid the branching? Maybe a functional approach?
Dictionary<string, int> samples = new Dictionary<string, int>()
{
{"a", 1},
{"aa", 2},
{"b", 1},
{"bb", 3}
};
foreach (var sample in samples)
{
if (sample.Value == 1)
{
Console.WriteLine("sample passed");
}
else if (sample.Value == 2)
{
Console.WriteLine("sample isolated");
}
else if (sample.Value == 3)
{
Console.WriteLine("sample biased");
}
}
UPD
What if I have a different kind of comparison:
foreach (var sample in samples)
{
if (sample.Value <= 1)
{
Console.WriteLine("sample passed");
}
else if (sample.Value <= 2)
{
Console.WriteLine("sample isolated");
}
else if (sample.Value <= 3)
{
Console.WriteLine("sample biased");
}
}

One option would be to create a list of Actions that you wish to perform, then execute them based on the index. This way your methods can be quite varied. If you need to perform very similar actions for each option, then storing a list of values would be better than storing Actions.
List<Action> functions = new List<Action>();
functions.Add(() => Console.WriteLine("sample passed"));
functions.Add(() => Console.WriteLine("sample isolated"));
functions.Add(() => Console.WriteLine("sample biased"));
foreach (var sample in samples)
{
Action actionToExecute = functions[sample.Value - 1];
actionToExecute();
}
If you wanted to use a dictionary as your comment implies:
Dictionary<int, Action> functions = new Dictionary<int, Action>();
functions.Add(1, () => Console.WriteLine("sample passed"));
functions.Add(2, () => Console.WriteLine("sample isolated"));
functions.Add(3, () => Console.WriteLine("sample biased"));
foreach (var sample in samples)
{
Action actionToExecute = functions[sample.Value];
actionToExecute();
}

For this concrete case you can introduce another map (a Dictionary or an array, as I did):
Dictionary<string, int> samples = new Dictionary<string, int>()
{
{"a", 1},
{"aa", 2},
{"b", 1},
{"bb", 3}
};
var map = new []
{
"sample passed",
"sample isolated",
"sample biased"
};
foreach (var sample in samples)
{
Console.WriteLine(map[sample.Value - 1]);
}
As for actual code, it highly depends on your use cases and on how you want to handle faulty situations.
UPD
It seems that if you use a dictionary for your map there will still be some branching, but if there are no misses, branch prediction should take care of it.
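For the range-based comparison in the question's update (<= 1, <= 2, <= 3), a direct index no longer works, but the thresholds can still be moved into a table. The following is only a sketch under that assumption; SampleClassifier, Bands and Classify are hypothetical names, and it needs C# 7 or later for the tuple syntax.

```csharp
using System;
using System.Linq;

public static class SampleClassifier
{
    // Upper bounds in ascending order, each paired with its label.
    private static readonly (int UpperBound, string Label)[] Bands =
    {
        (1, "sample passed"),
        (2, "sample isolated"),
        (3, "sample biased"),
    };

    // Returns the label of the first band whose upper bound is >= value,
    // or null when the value is above every bound.
    public static string Classify(int value)
    {
        return Bands.Where(b => value <= b.UpperBound)
                    .Select(b => b.Label)
                    .FirstOrDefault();
    }
}
```

The comparisons have not disappeared; they have moved out of the if/else chain and into data, so adding a band means adding a row rather than another branch.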

So you have a Dictionary<string, int>. Every item in the dictionary is a KeyValuePair<string, int>. I assume that the string is the name of the sample (identifier), and the int is a number that says something about the sample:
if the number equals 0 or 1, the sample is qualified as Passed;
if the number equals 2, then you call it Isolated
if the number equals 3, then you call it Biased.
All higher numbers are not interesting for you.
You want to group the samples in Passed / Isolated / Biased samples.
Whenever you have a sequence of similar items and you want to make groups of items, where every element has something in common with the other elements in the group, consider using one of the overloads of Enumerable.GroupBy.
Let's first define an enum to hold your qualifications, and a method that converts the integer value of the sample into the enum:
enum SampleQualification
{
Passed,
Isolated,
Biased,
}
SampleQualification FromNumber(int number)
{
switch (number)
{
case 2:
return SampleQualification.Isolated;
case 3:
return SampleQualification.Biased;
default:
return SampleQualification.Passed;
}
}
Ok, so you have your dictionary of samples, where every key is a name of the sample and the value is a number that can be converted to a SampleQualification.
Dictionary<string, int> samples = ...
var qualifiedSamples = samples // implements IEnumerable<KeyValuePair<string, int>>
// keep only samples with Value 0..3
.Where(sample => 0 <= sample.Value && sample.Value <= 3)
// Decide where the sample is Passed / Isolated / Biased
.Select(sample => new
{
Qualification = FromNumber(sample.Value),
Name = sample.Key, // the name of the sample
Number = sample.Value,
})
// Make groups of Samples with same Qualification:
.GroupBy(
// KeySelector: make groups with same qualification:
sample => sample.Qualification,
// ResultSelector: take the qualification, and all samples with this qualification
// to make one new:
(qualification, samplesWithThisQualification) => new
{
Qualification = qualification,
Samples = samplesWithThisQualification.Select(sample => new
{
Name = sample.Name,
Number = sample.Number,
})
.ToList(),
});
The result is a sequence of items, where every item has a property Qualification, which holds Passed / Isolated / Biased. Every item also has a list of the samples that have this qualification.
// Process Result
foreach (var qualifiedSample in qualifiedSamples)
{
Console.WriteLine("All samples with qualification " + qualifiedSample.Qualification);
foreach (var sample in qualifiedSample.Samples)
{
Console.WriteLine("{0} - {1}", sample.Name, sample.Number);
}
}

Sum values in double nested List of Dictionaries in C#

I have (for certain reasons not to get into now...) a List of the following structure:
List1<Dictionary1<string, List2<Dictionary2<string, string>>>>
(I added the 1 and 2 naming for clarity).
I want to iterate over List1 and sum up Dictionary1, so that all values of identical keys in Dictionary2 will add up.
For example if each Dictionary1 item contains a Dictionary2:
{ "Price", 23},
{ "Customers", 3}
then I want to iterate over all List2 elements, and over all List1 elements, and have a final dictionary of the total sum of all prices and customers as a single key for each category:
{ "Price", 15235},
{ "Customers", 236}
I hope that's clear.. In other words, I want to sum up this double-nested list in a way that I'm left with all unique keys across all nested dictionaries and have the values summed up.
I believe it can be done with LINQ, but I'm not sure how to do that..

This may be the ugliest thing I've ever written, and makes some assumptions on what you're doing, but I think this gets you what you want:
var query = from outerDictionary in x
from listOfDictionaries in outerDictionary.Values
from innerDictionary in listOfDictionaries
from keyValuePairs in innerDictionary
group keyValuePairs by keyValuePairs.Key into finalGroup
select new
{
Key = finalGroup.Key,
Sum = finalGroup.Sum(a => Convert.ToInt32(a.Value))
};
Where x is your main List.
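For comparison, here is roughly the same flattening in method syntax, wrapped so it can be run on its own. It is a sketch, not the answerer's code: NestedSum and SumByKey are hypothetical names, and it assumes every value in the inner dictionaries parses as an integer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NestedSum
{
    // Flattens List1 -> Dictionary1 values -> List2 -> Dictionary2 pairs, then sums per key.
    public static Dictionary<string, int> SumByKey(
        List<Dictionary<string, List<Dictionary<string, string>>>> outerList)
    {
        return outerList
            .SelectMany(outer => outer.Values)   // every List2
            .SelectMany(inner => inner)          // every Dictionary2
            .SelectMany(d => d)                  // every key/value pair
            .GroupBy(pair => pair.Key)
            .ToDictionary(g => g.Key, g => g.Sum(pair => int.Parse(pair.Value)));
    }
}
```

int.Parse throws on a non-numeric value; if the data may contain such values, int.TryParse with a filter would be the safer choice.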

Ok, so it looks like you were attempting to create a Dictionary of items with various properties (Price, Customers, etc...), which begs the question: why not just create a class?
After all, it would be pretty simple to turn your dictionary of dictionaries into a single dictionary of property objects, such as below.
public class ItemProperties
{
    public double Price { get; set; } = 0;
    public int Customers { get; set; } = 0;
    // Whichever other properties you were thinking of using.
}
static ItemProperties AddAll(Dictionary<string, ItemProperties> items)
{
    ItemProperties finalitem = new ItemProperties();
    // Iterate over the values; the keys are not needed for the totals.
    foreach (var item in items.Values)
    {
        finalitem.Price += item.Price;
        finalitem.Customers += item.Customers;
        // Repeat for all other existing properties.
    }
    return finalitem;
}
Of course, this only works if the number and kind of properties is immutable. Another way to approach this problem would be to use TryParse to find properties in Dictionary2 that you think can be added. This is problematic however, and requires some good error checking.
static Dictionary<string, double> AddNestedDictionary(Dictionary<string, Dictionary<string, string>> items)
{
    // The totals are kept as doubles so that the values can actually be summed.
    Dictionary<string, double> finalitem = new Dictionary<string, double>();
    foreach (var item in items.Values)
    {
        foreach (var prop in item)
        {
            double i;
            // Skip values that are not numeric.
            if (!Double.TryParse(prop.Value, out i))
                continue;
            double total;
            finalitem.TryGetValue(prop.Key, out total); // total stays 0 for a new key
            finalitem[prop.Key] = total + i;
        }
    }
    return finalitem;
}
Again, not the best answer compared to the simplicity of a dedicated class. But that is the price you pay for a lack of clarity.

Iterate a list and keep an index based on the name of the item

I have a list of items which have names and I need to iterate over them, but for each item I also need to know how many times an item with the same name has occurred so far. Here is an example:
-----
|1|A|
|2|B|
|3|C|
|4|C|
|5|C|
|6|A|
|7|B|
|8|C|
|9|C|
-----
So, when I'm iterating and I'm on row 1, I want to know it is the first A; when I'm on row 6, I want to know it is the second A; when I'm on row 9, I want to know it is the 5th C, etc. How can I achieve this? Is there some index I can keep track of? I was also thinking of filling a hash while iterating, but perhaps that's too much.

You can use a Dictionary<char, int> to keep a count of each character in your list.
Here the key is the character and the value is the number of occurrences of that character in the list.
Dictionary<char, int> occurances = new Dictionary<char, int>();
List<char> elements = new List<char>{'A', 'B','C','C','C','A','B', 'C', 'C'};
int result = 0;
foreach(char element in elements)
{
if(occurances.TryGetValue(element, out result))
occurances[element] = result + 1;
else
occurances.Add(element, 1);
}
foreach(KeyValuePair<char, int> kv in occurances)
Console.WriteLine("Key: "+ kv.Key + " Value: "+kv.Value);
Output:
Key: A Value: 2
Key: B Value: 2
Key: C Value: 5

Use a dictionary to keep track of the counts.
List<string> input = new List<string> { "A", "B", "C", "C", "C", "A", "B", "C", "C" };
Dictionary<string, int> output = new Dictionary<string, int>();
foreach(var item in input)
{
if (output.ContainsKey(item))
{
output[item] = output[item] + 1;
}
else
{
output.Add(item, 1);
}
}
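The loops above produce the final totals. The question also asks for the running number while iterating (first A, second A, fifth C), which the same dictionary can provide if the count is read at each step. A minimal sketch, assuming string names; RunningCount and WithOccurrence are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class RunningCount
{
    // Pairs each item with how many times its name has been seen so far (1-based).
    public static List<KeyValuePair<string, int>> WithOccurrence(IEnumerable<string> items)
    {
        var seen = new Dictionary<string, int>();
        var result = new List<KeyValuePair<string, int>>();
        foreach (var item in items)
        {
            int count;
            seen.TryGetValue(item, out count); // count stays 0 for a name not seen before
            count++;
            seen[item] = count;
            result.Add(new KeyValuePair<string, int>(item, count));
        }
        return result;
    }
}
```

With the example rows from the question, the entry for row 6 carries the value 2 and the entry for row 9 carries the value 5.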

I think you'll need an inverted index instead of a row-store index.
A row-store index is like the table you described, while an inverted index maps each term to the rows where it occurs.
Probably like this:
A 1,6
B 2,7
C 3,4,5,8,9
Search engines such as Elasticsearch and Solr store terms like this.
If you are in C#, a Dictionary<string, List<int>> is a good fit. There you can keep your data in inverted-index form.
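Building such an index could look roughly like this. It is a sketch only; InvertedIndex and Build are hypothetical names, and rows are numbered from 1 to match the table in the question.

```csharp
using System;
using System.Collections.Generic;

public static class InvertedIndex
{
    // Maps each name to the 1-based row numbers where it occurs.
    public static Dictionary<string, List<int>> Build(IList<string> rows)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < rows.Count; i++)
        {
            List<int> positions;
            if (!index.TryGetValue(rows[i], out positions))
            {
                positions = new List<int>();
                index[rows[i]] = positions;
            }
            positions.Add(i + 1);
        }
        return index;
    }
}
```

The occurrence number of a given row is then its position within the list for that name, and the list length is the total count.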

The clean way is to implement your own list whose items are your own object type. With this approach you get your own iterator-style collection, with an additional property on the item and your own Add() method. The new collection inherits from List<Item> and hides the Add() method of List with the new modifier (List<T>.Add is not virtual, so it cannot be overridden).
I implemented this for myself; you can use it. Keep in mind that this is only one of several possible solutions. However, I think it is one of the best with respect to SOLID and OO principles.
public class CounterIterator : List<Item>
{
public new void Add(Item item)
{
base.Add(item);
foreach (var listItem in this)
{
if (listItem.Equals(item))
{
item.CountRepeat++;
}
}
}
}
public class Item
{
public Item(string value)
{
Value = value;
}
public string Value { get; private set; }
public int CountRepeat { get; set; }
public override bool Equals(object obj)
{
var item = obj as Item;
return item != null && Value.Equals(item.Value);
}
}
I tested the code above. It is an extension of List which has an added behavior. If anyone thinks it is not a correct answer, please mention me in comments. I will try to clarify the issue.

How to display all mistaken words

I have some text in richTextBox1.
I have to sort the words by their frequency and display them in richTextBox2. It seems to work.
I also have to find all misspelled words and display them in richTextBox4. I'm using Hunspell.
Apparently I'm missing something. Almost all words are displayed in richTextBox4, not only the wrong ones.
Code:
foreach (Match match in wordPattern.Matches(str))
{
if (!words.ContainsKey(match.Value))
words.Add(match.Value, 1);
else
words[match.Value]++;
}
string[] words2 = new string[words.Keys.Count];
words.Keys.CopyTo(words2, 0);
int[] freqs = new int[words.Values.Count];
words.Values.CopyTo(freqs, 0);
Array.Sort(freqs, words2);
Array.Reverse(freqs);
Array.Reverse(words2);
Dictionary<string, int> dictByFreq = new Dictionary<string, int>();
for (int i = 0; i < freqs.Length; i++)
{
dictByFreq.Add(words2[i], freqs[i]);
}
Hunspell hunspell = new Hunspell("en_US.aff", "en_US.dic");
StringBuilder resultSb = new StringBuilder(dictByFreq.Count);
foreach (KeyValuePair<string, int> entry in dictByFreq)
{
resultSb.AppendLine(string.Format("{0} [{1}]", entry.Key, entry.Value));
richTextBox2.Text = resultSb.ToString();
bool correct = hunspell.Spell(entry.Key);
if (correct == false)
{
richTextBox4.Text = resultSb.ToString();
}
}

In addition to the above answer (which should work if your Hunspell.Spell method works correctly), I have a few suggestions to shorten your code. You are adding Matches to your dictionary, and counting the number of occurrences of each match. Then you appear to be sorting them in descending value of the frequency (so the highest occurrence match will have index 0 in the result). Here are a few code snippets which should make your function a lot shorter:
IOrderedEnumerable<KeyValuePair<string, int>> dictByFreq = words.OrderBy<KeyValuePair<string, int>, int>((KeyValuePair<string, int> kvp) => -kvp.Value);
This uses the .NET framework to do all your work for you. words.OrderBy takes a Func argument which provides the value to sort on. A plain sort would work on the keys, whereas you want to sort on the values, and this call does exactly that. Negating the value makes the order descending, so the most frequent match comes first. It returns an IOrderedEnumerable object, which has to be stored. And since that is enumerable, you don't even have to put it back into a dictionary! If you really need to do other operations on it later, you can call dictByFreq.ToList(), which returns an object of type List<KeyValuePair<string, int>>.
So your whole function then becomes this:
foreach (Match match in wordPattern.Matches(str))
{
if (!words.ContainsKey(match.Value))
words.Add(match.Value, 1);
else
words[match.Value]++;
}
IOrderedEnumerable<KeyValuePair<string, int>> dictByFreq = words.OrderBy<KeyValuePair<string, int>, int>((KeyValuePair<string, int> kvp) => -kvp.Value);
Hunspell hunspell = new Hunspell("en_US.aff", "en_US.dic");
StringBuilder resultSb = new StringBuilder(words.Count);
foreach (KeyValuePair<string, int> entry in dictByFreq)
{
resultSb.AppendLine(string.Format("{0} [{1}]", entry.Key, entry.Value));
richTextBox2.Text = resultSb.ToString();
bool correct = hunspell.Spell(entry.Key);
if (correct == false)
{
richTextBox4.Text += entry.Key + Environment.NewLine;
}
}

You are displaying the same text in richTextBox4 as in richTextBox2 :)
I think this should work:
foreach (KeyValuePair<string, int> entry in dictByFreq)
{
resultSb.AppendLine(string.Format("{0} [{1}]", entry.Key, entry.Value));
richTextBox2.Text = resultSb.ToString();
bool correct = hunspell.Spell(entry.Key);
if (correct == false)
{
richTextBox4.Text += entry.Key + Environment.NewLine;
}
}

Appropriate datastructure for key.contains(x) Map/Dictionary

I am somewhat struggling with the terminology and complexity of my explanations here, feel free to edit it.
I have 1,000 to 20,000 objects. Each one can contain several name words (first, second, middle, last, title...), normalized numbers (home, business...), email addresses, or even physical addresses and spouse names.
I want to implement a search that lets users freely combine word parts and number parts. When I search for "LL 676" I want to find all objects that contain any String with "LL" AND "676".
Currently I iterate over every object and every property of each object, split the search string on " " and do a stringInstance.Contains(searchWord) for each part.
This is too slow, so I am looking for a better solution.
What is the appropriate language agnostic data structure for this?
In my case I need it for C#.
Is the following data structure a good solution?
It's based on a HashMap/Dictionary.
First I create a String that contains all the name parts and phone numbers I want to look through; one example would be "William Bill Henry Gates III 3. +436760000 billgatesstreet 12".
Then I split on " " and for every word x I create all possible substrings y that fulfill x.contains(y). I put every one of those substrings into the hashmap/dictionary.
On lookup/search I just need to look up every search word and then join the results. Naturally, the lookup speed is blazingly fast (native HashMap/Dictionary speed).
EDIT: Inserts are very fast as well (insignificant time) now that I use a smarter algorithm to get the substrings.
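The substring step described above can be sketched as follows. This is an illustration of the idea, not the asker's actual code; SubstringIndex and AllSubstrings are hypothetical names. A word of length n has n(n+1)/2 contiguous substrings, fewer after duplicates are removed.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringIndex
{
    // Returns every distinct contiguous substring y of the word, i.e. every y with word.Contains(y).
    public static HashSet<string> AllSubstrings(string word)
    {
        var result = new HashSet<string>();
        for (int start = 0; start < word.Length; start++)
        {
            for (int length = 1; start + length <= word.Length; length++)
            {
                result.Add(word.Substring(start, length));
            }
        }
        return result;
    }
}
```

Each substring would then become a dictionary key pointing at the objects whose words contain it, which trades memory for lookup speed.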

It's possible I've misunderstood your algorithm or requirement, but this seems like it could be a potential performance improvement:
foreach (string arg in searchWords)
{
if (String.IsNullOrEmpty(arg))
continue;
tempList = new List<T>();
if (dictionary.ContainsKey(arg))
foreach (T obj in dictionary[arg])
if (list.Contains(obj))
tempList.Add(obj);
list = new List<T>(tempList);
}
The idea is that you do the first search word separately before this, and only put all the subsequent words into the searchWords list.
That should allow you to remove your final foreach loop entirely. Results only stay in your list as long as they keep matching every searchWord, rather than initially having to pile everything that matches a single word in then filter them back out at the end.
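The narrowing idea described above can also be expressed with a HashSet, which avoids the repeated list.Contains scans. This is a sketch under the assumption that the index maps each substring to a list of objects; AndSearch and Search are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class AndSearch
{
    // Returns the objects that are listed under every non-empty search word.
    public static HashSet<T> Search<T>(IDictionary<string, List<T>> index, IEnumerable<string> searchWords)
    {
        HashSet<T> result = null;
        foreach (var word in searchWords)
        {
            if (string.IsNullOrEmpty(word))
                continue;
            List<T> matches;
            if (!index.TryGetValue(word, out matches))
                return new HashSet<T>(); // one word without matches empties the AND result
            if (result == null)
                result = new HashSet<T>(matches); // the first word seeds the candidate set
            else
                result.IntersectWith(matches);    // later words can only shrink it
        }
        return result ?? new HashSet<T>();
    }
}
```

Because the candidate set only ever shrinks, no final filtering pass is needed.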

In case anyone cares for my solution:
Disclaimer:
This is only a rough draft.
I have only done some synthetic testing, and I have written a lot of it without testing it again. I have revised my code: inserts are now ((n^2)/2)+(n/2) instead of 2^n-1, which is far faster. Word length is now irrelevant.
namespace MegaHash
{
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
public class GenericConcurrentMegaHash<T>
{
// After doing a bulk add, call AwaitAll() to ensure all data was added!
private ConcurrentBag<Task> bag = new ConcurrentBag<Task>();
private ConcurrentDictionary<string, List<T>> dictionary = new ConcurrentDictionary<string, List<T>>();
// consider changing this to include for example '-'
public char[] splitChars;
public GenericConcurrentMegaHash()
: this(new char[] { ' ' })
{
}
public GenericConcurrentMegaHash(char[] splitChars)
{
this.splitChars = splitChars;
}
public void Add(string keyWords, T o)
{
keyWords = keyWords.ToUpper();
foreach (string keyWord in keyWords.Split(splitChars))
{
if (keyWord == null || keyWord.Length < 1)
continue; // skip empty tokens (e.g. from double spaces) instead of dropping the remaining keywords
this.bag.Add(Task.Factory.StartNew(() => { AddInternal(keyWord, o); }));
}
}
public void AwaitAll()
{
lock (this.bag)
{
foreach (Task t in bag)
t.Wait();
this.bag = new ConcurrentBag<Task>();
}
}
private void AddInternal(string key, T y)
{
for (int i = 0; i < key.Length; i++)
{
for (int i2 = 0; i2 < i + 1; i2++)
{
string desire = key.Substring(i2, key.Length - i);
// GetOrAdd avoids the race between checking for the key and inserting a new list.
List<T> l = dictionary.GetOrAdd(desire, _ => new List<T>());
lock (l)
{
if (!l.Contains(y))
l.Add(y);
}
}
}
}
public IList<T> FulltextSearch(string searchString)
{
searchString = searchString.ToUpper();
List<T> list = new List<T>();
string[] searchWords = searchString.Split(splitChars);
foreach (string arg in searchWords)
{
if (arg == null || arg.Length < 1)
continue;
if (dictionary.ContainsKey(arg))
foreach (T obj in dictionary[arg])
if (!list.Contains(obj))
list.Add(obj);
}
List<T> returnList = new List<T>();
foreach (T o in list)
{
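foreach (string arg in searchWords)
{
if (arg == null || arg.Length < 1)
continue; // ignore empty tokens, as in the first loop
List<T> matches;
// TryGetValue does not throw when a search word is missing from the dictionary.
if (!dictionary.TryGetValue(arg, out matches) || !matches.Contains(o))
goto BREAK;
}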
returnList.Add(o);
BREAK:
continue;
}
return returnList;
}
}
}
