C# using dictionaries - c#

I'm sorry in advance if it's bad to ask for this sort of help... but I don't know who else to ask.
I have an assignment to read two text files and find the 10 longest words in the first file (and the number of times each is repeated) which don't exist in the second file.
I currently read both of the files with File.ReadAllLines then split them into arrays, where every element is a single word (punctuation marks removed as well) and removed empty entries.
The idea I had to pick out the words fitting the requirements was: to make a dictionary containing a string Word and an int Count. Then make a loop repeating for the first file's length.... firstly comparing the element with the entire dictionary - if it finds a match, increase the Count by 1. Then if it doesn't match with any of the dictionary elements - compare the given element with every element in the 2nd file through another loop, if it finds a match - just go on to the next element of the first file, if it doesn't find any matches - add the word to the dictionary, and set Count to 1.
So my first question is: Is this actually the most efficient way to do this? (Don't forget I've only recently started studying c# and am not allowed to use linq)
Second question: How do I work with the dictionary, because most of the results I could find were very confusing, and we have not yet met them at university.
My code so far:
// Reading and making all the words lowercase for comparisons
string punctuation = " ,.?!;:\"\r\n";
string Read1 = File.ReadAllText("#\\..\\Book1.txt");
Read1 = Read1.ToLower();
string Read2 = File.ReadAllText("#\\..\\Book2.txt");
Read2 = Read2.ToLower();
//Working with the 1st file
string[] FirstFileWords = Read1.Split(punctuation.ToCharArray());
var temp1 = new List<string>();
foreach (var word in FirstFileWords)
{
if (!string.IsNullOrEmpty(word))
temp1.Add(word);
}
FirstFileWords = temp1.ToArray();
Array.Sort(FirstFileWords, (x, y) => y.Length.CompareTo(x.Length));
//Working with the 2nd file
string[] SecondFileWords = Read2.Split(punctuation.ToCharArray());
var temp2 = new List<string>();
foreach (var word in SecondFileWords)
{
if (!string.IsNullOrEmpty(word))
temp2.Add(word);
}
SecondFileWords = temp2.ToArray();

Well I think you've done very well so far. Not being able to use Linq here is torture ;)
As for performance, you should consider making your SecondFileWords a HashSet<string>, as this would tremendously speed up the check whether a word exists in the 2nd file, without much effort. I wouldn't go much further in terms of performance optimization for an exercise like that if performance is not a key requirement.
A HashSet stores every word only once (Add simply ignores a word that is already present), so the Contains check below is optional. Change your current implementation to something like:
HashSet<string> temp2 = new HashSet<string>();
foreach (var word in SecondFileWords)
{
if (!string.IsNullOrEmpty(word) && !temp2.Contains(word))
{
temp2.Add(word);
}
}
Don't convert this back to an Array again, this is not necessary.
This brings me back to your FirstFileWords which would contain duplicates too. This will cause issues later on when the top words might contain the same word multiple times. So let's get rid of them. Here it's more complicated as you need to retain the information how often a word appeared in your first list.
So let's bring a Dictionary<string, int> into play here now. A Dictionary stores a lookup key, as the HashSet, but in addition, also a value. We will use the key for the word, and the value for a number that contains the amount of how often the word appeared in the first list.
Dictionary<string, int> temp1 = new Dictionary<string, int>();
foreach (var word in FirstFileWords)
{
if (string.IsNullOrEmpty(word))
{
continue;
}
if (temp1.ContainsKey(word))
{
temp1[word]++;
}
else
{
temp1.Add(word, 1);
}
}
Now a dictionary cannot be sorted, which complicates things at this point as you still need to get your sorting by word length done. So let's get back to your Array.Sort method which I think is a good choice when you are not allowed to use Linq:
KeyValuePair<string, int>[] firstFileWordsWithCount = temp1.ToArray();
Array.Sort(firstFileWordsWithCount, (x, y) => y.Key.Length.CompareTo(x.Key.Length));
Note: You are using .ToArray() in your example, so I think it's OK to use it. But strictly speaking, calling .ToArray() on a Dictionary goes through a Linq extension method, so this would also fall under using Linq IMHO.
Now all that's left is working through your firstFileWordsWithCount array until you got 10 words that do not exist in the HashSet temp2. Something like:
int foundWords = 0;
foreach(KeyValuePair<string, int> candidate in firstFileWordsWithCount)
{
if (!temp2.Contains(candidate.Key))
{
Console.WriteLine($"{candidate.Key}: {candidate.Value}");
foundWords++;
}
if (foundWords >= 10)
{
break;
}
}
If anything is unclear, just ask.
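Putting the pieces above together, here is a minimal sketch without Linq. It works on two in-memory strings instead of files so that it can be run as-is; the class name LongestWordsSketch, the method LongestUniqueWords and the sample texts are made up for illustration and are not part of the assignment. Note that List.Sort is not a stable sort, so words of equal length come out in no particular order.

```csharp
using System;
using System.Collections.Generic;

static class LongestWordsSketch
{
    // Returns up to `limit` words of `first` that do not occur in `second`,
    // longest first, each paired with its number of occurrences in `first`.
    public static List<KeyValuePair<string, int>> LongestUniqueWords(string first, string second, int limit)
    {
        char[] separators = " ,.?!;:\"\r\n".ToCharArray();

        // Words of the second text, stored in a HashSet for fast lookups.
        var excluded = new HashSet<string>(
            second.ToLower().Split(separators, StringSplitOptions.RemoveEmptyEntries));

        // Count every word of the first text that is not excluded.
        var counts = new Dictionary<string, int>();
        foreach (string word in first.ToLower().Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (excluded.Contains(word))
            {
                continue;
            }
            int current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }

        // Copy the pairs into a list and sort by descending word length.
        var sorted = new List<KeyValuePair<string, int>>(counts);
        sorted.Sort((x, y) => y.Key.Length.CompareTo(x.Key.Length));

        // Keep only the first `limit` entries.
        if (sorted.Count > limit)
        {
            sorted.RemoveRange(limit, sorted.Count - limit);
        }
        return sorted;
    }

    static void Main()
    {
        string book1 = "The quick brown fox; the QUICK elephant!";
        string book2 = "the fox";
        foreach (KeyValuePair<string, int> pair in LongestUniqueWords(book1, book2, 10))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

Filtering against the HashSet before counting means the dictionary only ever holds candidate words, which keeps the later sort small.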

This is what you'll get when using dictionaries:
string File1 = "AMD Intel Skylake Processors Graphics Cards Nvidia Architecture Microprocessor Skylake SandyBridge KabyLake";
string File2 = "Graphics Nvidia";
Dictionary<string, int> Dic = new Dictionary<string, int>();
string[] File1Array = File1.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
Array.Sort(File1Array, (s1, s2) => s2.Length.CompareTo(s1.Length));
foreach (string s in File1Array)
{
if (Dic.ContainsKey(s))
{
Dic[s]++;
}
else
{
Dic.Add(s, 1);
}
}
string[] File2Array = File2.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach (string s in File2Array)
{
if (Dic.ContainsKey(s))
{
Dic.Remove(s);
}
}
int i = 0;
foreach (KeyValuePair<string, int> kvp in Dic)
{
i++;
Console.WriteLine(kvp.Key + " " + kvp.Value);
if (i == 10)
{
break;
}
}
My earlier attempt used LINQ, which is apparently not allowed; I missed that at first.
string[] Results = File1.Split(" ".ToCharArray()).Except(File2.Split(" ".ToCharArray())).OrderByDescending(s => s.Length).Take(10).ToArray();
for (int i = 0; i < Results.Length; i++)
{
Console.WriteLine(Results[i] + " " + Regex.Matches(File1, Results[i]).Count);
}

Related

C# human name checking and matching algorithm

Is there an algorithm or C# library to determine whether a human name is spelled correctly and, if not, to find its nearest match?
I found algorithms for string matching like the Levenshtein distance algorithm, but all of them check the match between one string and another, and I want to check the match between one name and all the possible names in English (for example), to check whether the name was written wrongly.
For example:
Someone enters the name "Giliam" while it should be "William". I want to know if there is any algorithm (or group of them) to detect the error and propose a correction.
All solutions that come to my mind involve implementing a huge dictionary of human names and checking each input name against it, one entry at a time... And that sounds like terrible performance to me, so I want to ask for a better approach.
Thanks.
What you are in effect asking is how to create a spell checker with a given dictionary. One way to do this that doesn't involve looking up and testing every possible entry in a list is to do the inverse of the problem: Generate a list of possible variations of the user input, and test each one of those to see if it's in the list. This is a much more manageable problem.
For instance, you could use a function like this to generate every variation that is one "edit" away from a given word:
static HashSet<string> GenerateEdits(string word)
{
// Normalize the case
word = word.ToLower();
var splits = new List<Tuple<string, string>>();
// Include i == word.Length so that insertions at the end of the word are generated too
for (int i = 0; i <= word.Length; i++)
{
splits.Add(new Tuple<string, string>(word.Substring(0, i), word.Substring(i)));
}
var ret = new HashSet<string>();
// All cases of one character removed
foreach (var cur in splits)
{
if (cur.Item2.Length > 0)
{
ret.Add(cur.Item1 + cur.Item2.Substring(1));
}
}
// All transposed possibilities
foreach (var cur in splits)
{
if (cur.Item2.Length > 1)
{
ret.Add(cur.Item1 + cur.Item2[1] + cur.Item2[0] + cur.Item2.Substring(2));
}
}
var letters = "abcdefghijklmnopqrstuvwxyz";
// All replaced characters
foreach (var cur in splits)
{
if (cur.Item2.Length > 0)
{
foreach (var letter in letters)
{
ret.Add(cur.Item1 + letter + cur.Item2.Substring(1));
}
}
}
// All inserted characters
foreach (var cur in splits)
{
foreach (var letter in letters)
{
ret.Add(cur.Item1 + letter + cur.Item2);
}
}
return ret;
}
And then exercise the code to see if a given user input can be easily converted to one of these entries. Finding the best match can be done by weighted averages, or simply by presenting the list of possibilities to the user:
// Example file from:
// https://raw.githubusercontent.com/smashew/NameDatabases/master/NamesDatabases/first%20names/all.txt
string source = @"all.txt";
var names = new HashSet<string>();
using (var sr = new StreamReader(source))
{
string line;
while ((line = sr.ReadLine()) != null)
{
names.Add(line.ToLower());
}
}
var userEntry = "Giliam";
var found = false;
if (names.Contains(userEntry.ToLower()))
{
Console.WriteLine("The entered value of " + userEntry + " looks good");
found = true;
}
if (!found)
{
// Try edits one edit away from the user entry
foreach (var test in GenerateEdits(userEntry))
{
if (names.Contains(test))
{
Console.WriteLine(test + " is a possibility for " + userEntry);
found = true;
}
}
}
if (!found)
{
// Try edits two edits away from the user entry
foreach (var test in GenerateEdits(userEntry))
{
foreach (var test2 in GenerateEdits(test))
{
if (names.Contains(test2))
{
Console.WriteLine(test2 + " is a possibility for " + userEntry);
found = true;
}
}
}
}
kiliam is a possibility for Giliam
liliam is a possibility for Giliam
viliam is a possibility for Giliam
wiliam is a possibility for Giliam
Of course, since you're talking about human names, you had best make this a suggestion, and be very prepared for odd spellings and for names you've never seen. And if you want to support other languages, the implementation of GenerateEdits gets more complex as you consider what counts as a 'typo'.
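One simple way to order the suggestions is by Levenshtein distance, which the question already mentions. The following is only a sketch under that assumption: NameRankingSketch, Levenshtein and RankByDistance are illustrative names, and the candidate list is made up. Plain Levenshtein counts a transposition as two edits, whereas GenerateEdits treats it as one, so the two notions of "distance" differ slightly.

```csharp
using System;
using System.Collections.Generic;

static class NameRankingSketch
{
    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and replacements each cost 1).
    public static int Levenshtein(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just computed becomes the previous row of the next round.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Returns the candidates sorted so that the closest match comes first.
    public static List<string> RankByDistance(string entry, IEnumerable<string> candidates)
    {
        string key = entry.ToLower();
        var ranked = new List<string>(candidates);
        ranked.Sort((x, y) =>
            Levenshtein(key, x.ToLower()).CompareTo(Levenshtein(key, y.ToLower())));
        return ranked;
    }

    static void Main()
    {
        var candidates = new List<string> { "william", "gilbert", "wiliam" };
        foreach (string name in RankByDistance("Giliam", candidates))
        {
            Console.WriteLine(name);
        }
    }
}
```

Because the ranking only runs over the handful of candidates that GenerateEdits found in the name list, the cost of computing the distances stays small.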

adding a string to an array

I'm just doing a little project in C# (I'm a beginner), my code is basically asking you "how many words are in this sentence?" and then asks you for every word, once it gets all of them it prints it out with "ba" attached to every word.
I know I'm a real beginner and my code's probably a joke but could you please help me out with this one?
Console.WriteLine("How many words are in this sentence?");
int WordAmount = Convert.ToInt32(Console.ReadLine());
int i = 1;
while (i <= WordAmount)
{
Console.WriteLine("Enter a word");
string[] word = new string[] { Console.ReadLine() };
i++;
}
Console.WriteLine(word + "ba");
You're close, you've just got one issue.
string[] word = new string[] { Console.ReadLine() };
You are creating a new array inside the scope of the while loop. Not only is it replaced on every iteration, meaning you never save the old words, but you also can't use it outside of the loop, making it useless.
Create a string[] wordList = new string[WordAmount]; before the loop. Then iterate to store each Console.ReadLine() in it, and finally iterate through it once more and Console.WriteLine(item + "ba");
string[] wordList = new string[WordAmount];
while (i <= WordAmount)
{
Console.WriteLine("Enter a word");
wordList[i-1] = Console.ReadLine() ;
i++;
}
foreach (var item in wordList)
Console.WriteLine(item + "ba");
Working Fiddle: https://dotnetfiddle.net/7UJKwN
Your code has multiple issues. First, you need to define your array outside of your while loop, and then fill it one element at a time.
In order to read or write an array of strings (string[]), you need to loop through (iterate over) it.
My code iterates over wordList twice: the while loop fills the array, and the foreach loop prints it.
First of all, consider storing your words in some kind of collection, for example a list.
List<string> words = new List<string>();
while (i <= WordAmount)
{
Console.WriteLine("Enter a word");
string word = Console.ReadLine();
words.Add(word);
i++;
}
I don't think your code compiles - the reason is you are trying to use the word variable outside of the scope that it is defined in. In my solution I have declared and initialized a list of strings (so list of the words in this case) outside of the scope where user has to input words, it is possible to access it in the inner scope (the area between curly brackets where user enters the words).
To print all the words, you have to iterate over the list and add a "ba" part. Something like this:
foreach(var word in words)
{
Console.WriteLine(word + "ba");
}
Or more concisely:
words.ForEach(o => Console.WriteLine(o + "ba"));
If you want to print the sentence without using line breaks, you can use LINQ:
var wordsWithBa = words.Select(o => o + "ba ").Aggregate((a, b) => a + b);
Console.WriteLine(wordsWithBa);
Although I would recommend learning LINQ after you are a bit more familiarized with C# :)
It is worth familiarizing yourself with the concepts of collections and variable scope.
You could also use the StringBuilder class for this task (my LINQ method is not very memory-efficient, but I believe it is good enough for your purpose).
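A minimal sketch of the StringBuilder variant mentioned above. The class name BaSentenceSketch and the method AppendBa are made up for illustration, and the words are hard-coded instead of being read from the console so that the example runs on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class BaSentenceSketch
{
    // Appends "ba" to every word and joins the results with single spaces.
    public static string AppendBa(IEnumerable<string> words)
    {
        var sb = new StringBuilder();
        foreach (string word in words)
        {
            if (sb.Length > 0)
            {
                sb.Append(' '); // separator before every word except the first
            }
            sb.Append(word).Append("ba");
        }
        return sb.ToString();
    }

    static void Main()
    {
        var words = new List<string> { "hello", "big", "world" };
        Console.WriteLine(AppendBa(words)); // prints "helloba bigba worldba"
    }
}
```

Unlike repeated string concatenation, the StringBuilder grows one buffer instead of allocating a new string for every word.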

Replace character at specific index in List<string>, but indexer is read only [duplicate]

This is kind of a basic question, but I learned programming in C++ and am just transitioning to C#, so my ignorance of the C# methods is getting in my way.
A client has given me a few fixed length files and they want the 484th character of every odd numbered record, skipping the first one (3, 5, 7, etc...) changed from a space to a 0. In my mind, I should be able to do something like the below:
static void Main(string[] args)
{
List<string> allLines = System.IO.File.ReadAllLines(@"C:\...").ToList();
foreach(string line in allLines)
{
//odd numbered logic here
line[483] = '0';
}
...
//write to new file
}
However, the property or indexer cannot be assigned to because it is read only. All my reading says that I have not set a setter for the variable, and I have tried what was shown at this SO article, but I am doing something wrong every time. Should what is shown in that article work? Should I do something else?
You cannot modify C# strings directly, because they are immutable. You can convert strings to char[], modify it, then make a string again, and write it to file:
File.WriteAllLines(
@"c:\newfile.txt"
, File.ReadAllLines(@"C:\...").Select((s, index) => {
if (index % 2 == 0) {
return s; // Lines at even zero-based indexes are returned unchanged
}
var chars = s.ToCharArray();
chars[483] = '0';
return new string(chars);
})
);
Since strings are immutable, you can't treat one as a char[] and modify the character at a specific index. However, you can "modify" it by assigning a new string in its place.
We can use the Substring() method to return any part of the original string. Combining this with some concatenation, we can take the first part of the string (up to the character you want to replace), add the new character, and then add the rest of the original string.
Also, since we can't directly modify the items in a collection being iterated over in a foreach loop, we can switch your loop to a for loop instead. Now we can access each line by index, and can modify them on the fly:
for (int i = 0; i < allLines.Count; i++) // allLines is a List<string>, so use Count
{
if (allLines[i].Length > 483)
{
allLines[i] = allLines[i].Substring(0, 483) + "0" + allLines[i].Substring(484);
}
}
It's possible that, depending on how many lines you're processing and how many in-line concatenations you end up doing, there is some chance that using a StringBuilder instead of concatenation will perform better. Here is an alternate way to do this using a StringBuilder. I'll leave the perf measuring to you...
var sb = new StringBuilder();
for (int i = 0; i < allLines.Count; i++) // allLines is a List<string>, so use Count
{
if (allLines[i].Length > 483)
{
sb.Clear();
sb.Append(allLines[i].Substring(0, 483));
sb.Append("0");
sb.Append(allLines[i].Substring(484));
allLines[i] = sb.ToString();
}
}
The iteration variable of a foreach loop (string line in this case) is read-only inside the loop body, and strings are immutable as well, so you can't assign to line or to line[483]. Try using a regular for loop instead.
A foreach loop is meant for iterating over a container and is read-only in nature. You should use a regular for loop. It will work.
static void Main(string[] args)
{
List<string> allLines = System.IO.File.ReadAllLines(@"C:\...").ToList();
for (int i = 0; i < allLines.Count; ++i)
{
if (allLines[i].Length > 483)
{
// Keep the rest of the line after the replaced character
allLines[i] = allLines[i].Substring(0, 483) + "0" + allLines[i].Substring(484);
}
}
...
//write to new file
}
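None of the snippets above implements the "odd numbered records, skipping the first one" part of the question. Here is a hedged sketch of that logic, assuming records are numbered from 1 so that records 3, 5, 7, ... sit at zero-based indexes 2, 4, 6, .... The names FixedWidthSketch and PatchOddRecords are made up for illustration, and the demo uses short lines and position 1 instead of the real files and position 483.

```csharp
using System;

static class FixedWidthSketch
{
    // Returns a copy of `lines` in which the character at zero-based `position`
    // is replaced in records 3, 5, 7, ... (record numbers are 1-based).
    public static string[] PatchOddRecords(string[] lines, int position, char replacement)
    {
        var result = new string[lines.Length];
        for (int i = 0; i < lines.Length; i++)
        {
            int recordNumber = i + 1;
            bool isTarget = recordNumber > 1
                && recordNumber % 2 == 1
                && lines[i].Length > position;

            if (isTarget)
            {
                // Strings are immutable: edit a char[] copy and build a new string.
                char[] chars = lines[i].ToCharArray();
                chars[position] = replacement;
                result[i] = new string(chars);
            }
            else
            {
                result[i] = lines[i];
            }
        }
        return result;
    }

    static void Main()
    {
        string[] lines = { "a b", "c d", "e f", "g h", "i j" };
        foreach (string line in PatchOddRecords(lines, 1, '0'))
        {
            Console.WriteLine(line); // records 3 and 5 become "e0f" and "i0j"
        }
    }
}
```

For the real files this would be called with position 483 on the array returned by File.ReadAllLines, and the result written back with File.WriteAllLines.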

How to import Dictionary text file and check for word matches?

I generate a random string of 500 characters and want to check for words.
bliduuwfhbgphwhsyzjnlfyizbjfeeepsbpgplpbhaegyepqcjhhotovnzdtlracxrwggbcmjiglasjvmscvxwazmutqiwppzcjhijjbguxfnduuphhsoffaqwtmhmensqmyicnciaoczumjzyaaowbtwjqlpxuuqknxqvmnueknqcbvkkmildyvosczlbnlgumohosemnfkmndtiubfkminlriytmbtrzhwqmovrivxxojbpirqahatmydqgulammsnfgcvgfncqkpxhgikulsjynjrjypxwvlkvwvigvjvuydbjfizmbfbtjprxkmiqpfuyebllzezbxozkiidpplvqkqlgdlvjbfeticedwomxgawuphocisaejeonqehoipzsjgbfdatbzykkurrwwtajeajeornrhyoqadljfjyizzfluetynlrpoqojxxqmmbuaktjqghqmusjfvxkkyoewgyckpbmismwyfebaucsfueuwgio
I import a Dictionary Words txt file and check the string to see if it contains each word. If a match is found, it's added to a list.
I read using Dictionary<> is faster than Array for a words list.
When I use that method, I can see the cpu working the foreach loop in the debugger, and my loop counter goes up, about 10,000+ times in 10 seconds, but the loop continues on forever and does not return any results.
When I use Array for Dictionary, the program works, but slower at around 500 times in 10 seconds.
Not Working
Using Dictionary<>
// Random Message
public string message = Random(500);
// Dictionary Words Reference
public Dictionary<string, string> dictionary = new Dictionary<string, string>();
// Matches Found
public static List<string> matches = new List<string>();
public MainWindow()
{
InitializeComponent();
// Import Dictionary File
dictionary = File
.ReadLines(@"C:\dictionary.txt")
.Select((v, i) => new { Index = i, Value = v })
.GroupBy(p => p.Index / 2)
.ToDictionary(g => g.First().Value, g => g.Last().Value);
// If Message Contains word, add to Matches List
foreach (KeyValuePair<string, string> entry in dictionary)
{
if (message.Contains(entry.Value))
{
matches.Add(entry.Value);
}
}
}
Working
Using Array
// Random Message
public string message = Random(500);
// Dictionary Words Reference
public string[] dictionary = File.ReadAllLines(@"C:\dictionary.txt");
// Matches Found
public List<string> matches = new List<string>();
public MainWindow()
{
InitializeComponent();
// If Message Contains word, add to Matches List
foreach (var entry in dictionary)
{
if (message.Contains(entry))
{
matches.Add(entry);
}
}
}
I doubt if you want Dictionary<string, string> as a dictionary ;) HashSet<string> will be enough:
using System.Linq;
...
string source = "bliduuwfhbgphwhsyzjnlfyizbj";
HashSet<string> allWords = new HashSet<string>(File
.ReadLines(@"C:\dictionary.txt")
.Select(line => line.Trim())
.Where(line => !string.IsNullOrEmpty(line)), StringComparer.OrdinalIgnoreCase);
int shortestWord = allWords.Min(word => word.Length);
int longestWord = allWords.Max(word => word.Length);
// If you want duplicates, change HashSet<string> to List<string>
HashSet<string> wordsFound = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
for (int length = shortestWord; length <= longestWord; ++length) {
for (int position = 0; position <= source.Length - length; ++position) {
string extract = source.Substring(position, length);
if (allWords.Contains(extract))
wordsFound.Add(extract);
}
}
Test: with the dictionary from
https://raw.githubusercontent.com/dolph/dictionary/master/popular.txt
downloaded as the file C:\dictionary.txt,
Console.WriteLine(string.Join(", ", wordsFound.OrderBy(x => x)));
produces the output
id, li, lid
Using a Dictionary in this scenario doesn't make much sense. A Dictionary is, essentially, a collection of key/value pairs: think of it as a list of variables that stores both the variable name (the key) and the variable value.
I could have the following:
int age = 21;
int money = 21343;
int distance = 10;
int year = 2017;
And convert it to a Dictionary instead, using the following:
Dictionary<string, int> numbers = new Dictionary<string, int>()
{
{ "age", 21 },
{ "money", 21343},
{ "distance", 10 },
{ "year", 2017 }
};
And then I can access a value in the dictionary using its key (the first value). So, for example, if I want to know what "age" is, I would use:
Console.WriteLine(numbers["age"]);
This is only a single example of the power of dictionaries - there is a LOT more that they can do, and they can make your life a lot easier. In this scenario, however, they aren't going to do what you're expecting them to do. I would suggest just using the Array, or a List.
You are misusing the dictionary.
You are basically using the dictionary as a list, so it only adds overhead to the program without helping in any way.
It would have been useful if you had something you wanted to look up in the dictionary, not the other way around.
Also, in any case, what you want is a HashSet, not a Dictionary, since the key in your dictionary is not the word you are querying against.
You can read more about Dictionary and HashSet here:
dictionary: https://www.dotnetperls.com/dictionary
hashset: https://www.dotnetperls.com/hashset

C# fastest way to remove duplicates from string: Split vs. Loop

I have a string of dash-separated numbers that I am removing duplicate numbers from
string original = "45-1-3-45-10-3-15";
string new = "45-1-3-10-15";
I have tried two approaches, and used Stopwatch to determine which method is faster, but I am getting inconsistent time elapses so I was hoping for some insight into which method would be more efficient for achieving the new duplicate-free list.
Method 1: While loop
List<string> temp = new List<string>();
bool moreNumbers = true;
while (moreNumbers)
{
if (original.Contains("-"))
{
string number = original.Substring(0, original.IndexOf("-"));
//don't add if the number is already in the list
int index = temp.FindIndex(item => item == number);
if (index < 0)
temp.Add(value);
original = original.Substring(original.IndexOf("-") + 1);
}
else
moreNumbers = false;
}
//add remaining value in
string lastNumber = original;
//don't add if the number is already in the list
int indexLast = temp.FindIndex(item => item == lastNumber);
if (indexLast < 0)
temp.Add(lastNumber);
string new = "";
foreach (string number in temp)
{
new += "-" + number;
}
if (new[0] == '-')
new = new.Substring(1);
Method 2: Split
List<string> temp = original.Split('-').Distinct().ToList();
string new = "";
foreach (string number in temp)
{
new += "-" + number;
}
if (new[0] == '-')
new = new.Substring(1);
I think the second method is more readable, but possibly slower? Which of these methods would be more efficient or a better approach?
This should be well optimized, but you should test it for performance.
string result = string.Join("-", original.Split('-').Distinct());
You have some inefficiencies in both your examples.
Method 1: repeatedly manipulating a string is never efficient. Strings are immutable, so every Substring call allocates a new string.
Method 2: there is no need to create a List, and you should use a StringBuilder instead of string concatenation.
Lastly, new is a C# reserved word so none of your code will compile.
In the first approach, you're using several Substring calls and several IndexOf calls. I don't know exactly the internal implementation, but I guess they are O(n) in time complexity.
Since, for each number in the list, you'll do a full loop in the other list (you're using strings as lists), you'll have an O(n^2) time complexity.
For the second option I first assumed O(n^2) too, because Distinct has to iterate the list. As far as I know, though, Enumerable.Distinct tracks the items it has already seen in a hash-based set, so it should be closer to O(n).
I think one optimized approach to the problem is:
1) Loop over the main string and, at each "-" or at the end of the string, save the number (this will be more economical than Split in terms of space).
2) Put each number in a Dictionary. This won't be economical in terms of space, but it provides O(1) time to check whether the item already exists. Hashing small strings shouldn't be too costly.
3) Loop the Dictionary to retrieve the distinct values.
This implementation will be O(n), better than O(n^2).
Note that only using the dictionary can deliver the result string in a different order. If the order is important, use the Dictionary to check if the item is duplicated, but put in an auxiliary list. Again, this will have a space cost.
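The steps above can be sketched as follows. This version swaps in a HashSet<string> instead of a Dictionary (only the keys are needed) and appends each first occurrence straight to a StringBuilder, which preserves the original order. DedupSketch and RemoveDuplicates are illustrative names, and the sketch assumes the input contains no empty tokens such as "1--2".

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class DedupSketch
{
    // Removes repeated tokens from a separator-delimited string in one pass,
    // keeping the first occurrence of every token in its original position.
    public static string RemoveDuplicates(string input, char separator)
    {
        var seen = new HashSet<string>();
        var result = new StringBuilder();
        int start = 0;

        // i == input.Length acts as a virtual separator that closes the last token.
        for (int i = 0; i <= input.Length; i++)
        {
            if (i == input.Length || input[i] == separator)
            {
                string token = input.Substring(start, i - start);
                if (seen.Add(token)) // Add returns false if the token was already seen
                {
                    if (result.Length > 0)
                    {
                        result.Append(separator);
                    }
                    result.Append(token);
                }
                start = i + 1;
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveDuplicates("45-1-3-45-10-3-15", '-')); // prints "45-1-3-10-15"
    }
}
```

Each character is visited once and each token is hashed once, so the expected running time is linear in the length of the input.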
