Dictionary re-adding words as new keys instead of increasing value - c#

I am writing a program that finds every unique word in a text and prints it in a text box. I do this by printing each key in a dictionary however my dictionary is adding each word as a separate key instead of ignoring words that are already there.
The function is being called correctly and it does work it simpy prints the entire text I hand it however.
EDIT: I am reading the string from a text file then sending it to the function.
This is the input string and the output:
Output:
To be or not to that is the question Whether tis nobler in mind suffer
The slings and arrows of outrageous fortune Or take arms against a sea
troubles And by opposing end them die sleep No more sleep say we end
The heartache thousand natural shocks That flesh heir Tis consummation
public string FindUniqueWords(string text)
{
Dictionary<string, int> dictionary = new Dictionary<string, int>();
string uniqueWord = "";
text = text.Replace(",", ""); //Just cleaning up a bit
text = text.Replace(".", ""); //Just cleaning up a bit
string[] arr = text.Split(' '); //Create an array of words
foreach (string word in arr) //let's loop over the words
{
if (dictionary.ContainsKey(word)) //if it's in the dictionary
dictionary[word] = dictionary[word] + 1; //Increment the count
else
dictionary[word] = 1; //put it in the dictionary with a count 1
}
foreach (KeyValuePair<string, int> pair in dictionary) //loop through the dictionary
{
uniqueWord += (pair.Key + " ");
}
uniqueWords.Text = uniqueWord;
return ("");
}

You're reading the text with System.IO.File.ReadAllText, so text may also contain newline characters.
Replace arr = text.Split(' ') by arr = text.Split(' ', '\r', '\n') or add another replace: text = text.Replace(Environment.NewLine, " ");
Of course, by looking at arr in the debugger, you could have found out by yourself.

A shorter way: (Dont forget to use Using System.Linq)
string strInput = "TEST TEST Text 123";
var words = strInput.Split().Distinct();
foreach (var word in words )
{
Console.WriteLine(word);
}

Your code works as it's supposed to (ignoring case though). The problem almost certainly lies with showing the results in your application, or with how you are calling the FindUniqueWords method (not the complete text at once).
Also, pretty important to note here: a Dictionary<TKey, TValue> by default simply cannot contain a single key multiple times. It would defeat the whole purpose of the dictionary in the first place. It's only possible if you override the Equality comparison somewhere, which you aren't doing.
If I try your code, with the following input:
To be or not to that is is is is is is is the question
The output becomes :
To be or not to that is the question
It works like it's supposed to.

Related

grouping adjacent similar substrings

I am writing a program in which I want to group the adjacent substrings, e.g ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options I want to find the resultant string with the minimum length.
Here is code what i have written so far but not doing job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
class Program
{
static void Main(string[] args)
{
string input;
Console.WriteLine("Please enter the set of operations: ");
input = Console.ReadLine();
char[] array = input.ToCharArray();
List<string> list = new List<string>();
string temp = "";
string firstTemp = "";
foreach (var x in array)
{
if (temp.Contains(x))
{
firstTemp = temp;
if (list.Contains(firstTemp))
{
list.Add(firstTemp);
}
temp = "";
list.Add(firstTemp);
}
else
{
temp += x;
}
}
/*foreach (var item in list)
{
Console.WriteLine(item);
}*/
Console.ReadLine();
}
}
}
You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a python solution should do the trick, too.
So you have an input string ABCABCBC. And you want to transform this into an advanced variant of run length encoding (let's called it advanced RLE).
My idea consists of a general first idea onto which I then apply recursion:
The overall target is to find the shortest representation of the string using advanced RLE, let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
This input can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix (e.g. for the prefix ABC and the suffix ABCBC, the prefix is only one time at the beginning of the suffix, making a total of 2 ABC following each other. So, we can compact this prefix / suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using the function shorten_prefix one time, there still is a suffix for most of the string combinations. E.g. in the output (2ABC, BC), there still is the string BC as suffix. So, need to find the shortest representation for this remaining suffix. Wooo, we still have a function for this called shortest_repr, so let's just call this onto the remaining suffix.
This image displays how this recursion works (I only expanded one of the node after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr to the string ABABB (I selected a shorter sample for the image). Then, we split this string at all possible split positions and get a list of prefix / suffix pairs in the second row. On each of the elements of this list we first call the prefix/suffix optimization (shorten_prefix) and retrieve a shortened prefix/suffix combination, which already has the run-length numbers in the prefix (third row). Now, on each of the suffix, we call our recursion function shortest_repr.
I did not display the upward-direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return the empty string immediately.
When the result of the call to shortest_repr was received inside our loop, we just select the shortest string inside the loop and return this.
This is some quickly hacked code that does the trick:
def shorten_beginning(beginning, ending):
count = 1
while ending.startswith(beginning):
count += 1
ending = ending[len(beginning):]
return str(count) + beginning, ending
def find_shortest_repr(string):
possible_variants = []
if not string:
return ''
for i in range(1, len(string) + 1):
beginning = string[:i]
ending = string[i:]
shortened, new_ending = shorten_beginning(beginning, ending)
shortest_ending = find_shortest_repr(new_ending)
possible_variants.append(shortened + shortest_ending)
return min([(len(x), x) for x in possible_variants])[1]
print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the recursive levenshtein distance calculation. It calculates the same suffices multiple times. So, it would be a nice exercise to try to implement this with dynamic programming.
If this is not a school assignment or performance critical part of the code, RegEx might be enough:
string input = "ABCABCBC";
var re = new Regex(#"(.+)\1+|(.+)", RegexOptions.Compiled); // RegexOptions.Compiled is optional if you use it more than once
string output = re.Replace(input,
m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)

C# Syntax - Remove last occurrence of ';' in string from split

I have a list of strings stored in an ArrayList. I want to split them by every occurrence of ';'. The problem is, whenever I try to display them using MessageBox, there's an excess space or unnecessary value that gets displayed.
Sample input (variable = a):
Arial;16 pt;None;None;None;None;None;None;FF0000;None;100;Normal;None;Normal;
Below is a line of code I used to split them:
string[] display_document = (a[0] + "").Split(';');
Code to display:
foreach (object doc_properties in display_document)
{
TextBox aa = new TextBox();
aa.Font = new Font(aa.Font.FontFamily, 9);
aa.Text = doc_properties.ToString();
aa.Location = new Point(pointX, pointY);
aa.Size = new System.Drawing.Size(80, 25);
aa.ReadOnly = true;
doc_panel.Controls.Add(aa);
doc_panel.Show();
pointY += 30;
}
The output that displays are the following:
How do I remove the last occurrence of that semicolon? I really need help fixing this. Thank you so much for all of your help.
Wouldnt It be easiest to check if the input ends with a ";" before splitting it, and if so remove the last character? Sample code:
string a = "Arial;16 pt;None;None;None;None;None;None;FF0000;None;100;Normal;None;Normal;";
if (a.EndsWith(";"))
{
a = a.Remove(a.LastIndexOf(";"));
}
//Proceed with split
Split will not print last semicolon if no space character is added and your input is a string.
I don't know why you prefer an array list (which probably is the reason of this strange behaviour) but if you could use your input as a string you could try that
string a = "Arial;16pt;None;None;None;None;None;None;FF0000;None;100;Normal;None;Normal;";
string[] display_document = a.Split(';');
foreach (object doc_properties in display_document)
{
//The rest of your code
}

Find a delimiter of csv or text files in c#

I want to find a delimiter being used to separate the columns in csv or text files.
I am using TextFieldParser class to read those files.
Below is my code,
String path = #"c:\abc.csv";
DataTable dt = new DataTable();
if (File.Exists(path))
{
using (Microsoft.VisualBasic.FileIO.TextFieldParser parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(path))
{
parser.TextFieldType = FieldType.Delimited;
if (path.Contains(".txt"))
{
parser.SetDelimiters("|");
}
else
{
parser.SetDelimiters(",");
}
parser.HasFieldsEnclosedInQuotes = true;
bool firstLine = true;
while (!parser.EndOfData)
{
string[] fields = parser.ReadFields();
if (firstLine)
{
foreach (var val in fields)
{
dt.Columns.Add(val);
}
firstLine = false;
continue;
}
dt.Rows.Add(fields);
}
}
lblCount.Text = "Count of total rows in the file: " + dt.Rows.Count.ToString();
dgvTextFieldParser1.DataSource = dt;
Instead of passing the delimiters manually based on the file type, I want to read the delimiter from the file and then pass it.
How can I do that?
Mathematically correct but totally useless answer: It's not possible.
Pragmatical answer: It's possible but it depends on how much you know about the file's structure. It boils down to a bunch of assumptions and depending on which we'll make, the answer will vary. And if you can't make any assumptions, well... see the mathematically correct answer.
For instance, can we assume that the delimiter is one or any of the elements in the set below?
List<char> delimiters = new List<char>{' ', ';', '|'};
Or can we assume that the delimiter is such that it produces elements of equal length?
Should we try to find a delimiter that's a single character or can a word be one?
Etc.
Based on the question, I'll assume that it's the first option and that we have a limited set of possible characters, precisely one of which is be a delimiter for a given file.
How about you count the number of occurrences of each such character and assume that the one that's occurring most frequently is the one? Is that sufficiently rigid or do you need to be more sure than that?
List<char> delimiters = new List<char>{' ', ';', '-'};
Dictionary<char, int> counts = delimiters.ToDictionary(key => key, value => 0);
foreach(char c in delimiters)
counts[c] = textArray.Count(t => t == c);
I'm not in front of a computer so I can't verify but the last step would be returning the key from the dictionary the value of which is the maximal.
You'll need to take into consideration a special case such that there's no delimiters detected, there are equally many delimiters of two types etc.
Very simple guessing approach using LINQ:
static class CsvSeperatorDetector
{
private static readonly char[] SeparatorChars = {';', '|', '\t', ','};
public static char DetectSeparator(string csvFilePath)
{
string[] lines = File.ReadAllLines(csvFilePath);
return DetectSeparator(lines);
}
public static char DetectSeparator(string[] lines)
{
var q = SeparatorChars.Select(sep => new
{Separator = sep, Found = lines.GroupBy(line => line.Count(ch => ch == sep))})
.OrderByDescending(res => res.Found.Count(grp => grp.Key > 0))
.ThenBy(res => res.Found.Count())
.First();
return q.Separator;
}
}
What this does is it reads the file line by line (note that CSV files may include line breaks), then checks for each potential separator how often it occurs in each line.
Then we check which separator occurs on the most lines, and of those which occur on the same number of lines, we take the one with the most even distribution (e.g. 5 occurences on every line are ranked higher than one that occurs once in one line and 10 times in another line).
Of course you might have to tweak this for your own purposes, add error handling, fallback logic and so forth. I'm sure it's not perfect, but it's good enough for me.
You could probably take n bytes from the file, count possible delimiter characters(or all characters found) using a hash map/dictionary, and then the character repeated most is probably the delimiter you're looking for. It would make sense to me that the characters used as delimiters would be the ones used the most. When done you reset the stream, but since you're using a text reader you would have to probably initialize another text reader or something. This would get slightly more hairy if the CSV used more than one delimiter. You would probably have to ignore some characters like alpha and numeric.
In python we can do this easily by using csv sniffer. It will cater for text files and also if you just need to read some bytes from the file.

C#: Getting Substring between two different Delimiters

I have problems splitting this Line. I want to get each String between "#VAR;" and "#ENDVAR;". So at the End, there should be a output of:
Variable=Speed;Value=Fast;
Variable=Fabricator;Value=Freescale;Op==;
Later I will separate each Substring, using ";" as a delimiter but that I guess wont be that hard. This is how a line looks like:
#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;
I tried some split-options, but most of the time I just get an empty string. I also tried a Regex. But either the Regex was wrong or it wasnt suitable to my String. Probably its wrong, at school we learnt Regex different then its used in C#, so I was confused while implementing.
Regex.Match(t, #"/#VAR([a-z=a-z]*)/#ENDVAR")
Edit:
One small question: I am iterating over many lines like the one in the question. I use NoIdeas code on the line to get it in shape. The next step would be to print it as a Text-File. To print an Array I would have to loop over it. But in every iteration, when I get a new line, I overwrite the Array with the current splitted string. I put the Rest of my code in the question, would be nice if someone could help me.
string[] w ;
foreach (EA.Element theObjects in myPackageObject.Elements)
{
theObjects.Type = "Object";
foreach (EA.Element theElements in PackageHW.Elements)
{
if (theObjects.ClassfierID == theElements.ElementID)
{
t = theObjects.RunState;
w = t.Replace("#ENDVAR;", "#VAR;").Replace("#VAR;", ";").Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);
foreach (string s in w)
{
tw2.WriteLine(s);
}
}
}
}
The piece with the foreach-loop is wrong pretty sure. I need something to print each splitted t. Thanks in advance.
you can do it without regex using
str.Replace("#ENDVAR;", "#VAR;")
.Split(new string[] { "#VAR;" }, StringSplitOptions.RemoveEmptyEntries);
and if you want to save time you can do:
str.Replace("#ENDVAR;", "#VAR;")
.Replace("#VAR;", ";")
.Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);
You can use a look ahead assertion here.
#VAR;(.*?)(?=#ENDVAR)
If your string never consists of whitespace between #VAR; and #ENDVAR; you could use the below line, this will not match empty instances of your lines.
#VAR;([^\s]+)(?=#ENDVAR)
See this demo
Answer using raw string manipulation.
IEnumerable<string> StuffFoundInside(string biggerString)
{
var closeDelimeterIndex = 0;
do
{
int openDelimeterIndex = biggerString.IndexOf("#VAR;", startingIndex);
if (openDelimeterIndex != -1)
{
closeDelimeterIndex = biggerString.IndexOf("#ENDVAR;", openDelimeterIndex);
if (closeDelimiterIndex != -1)
{
yield return biggerString.Substring(openDelimeterIndex, closeDelimeterIndex - openDelimiterIndex);
}
}
} while (closeDelimeterIndex != -1);
}
Making a list and adding each item to the list then returning the list might be faster, depending on how the code using this code would work. This allows it to terminate early, but has the coroutine overhead.
Use this regex:
(?i)#VAR;(.+?)#ENDVAR;
Group 1 in each match will be your line content.
(If you don't like regexs)
Code:
var s = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
var tokens = s.Split(new String [] {"#ENDVAR;#VAR;"}, StringSplitOptions.None);
foreach (var t in tokens)
{
var st = t.Replace("#VAR;", "").Replace("#ENDVAR;", "");
Console.WriteLine(st);
}
Output:
Variable=Speed;Value=Fast;Op==;
Variable=Fabricator;Value=Freescale;Op==;
Regex.Split works well but yields empty entries that have to be removed as shown here:
string[] result = Regex.Split(input, #"#\w+;")
.Where(s => s != "")
.ToArray();
I tried some split-options, but most of the time I just get an empty string.
In this case the requirements seem to be simpler than you're stating. Simply splitting and using linq will do your whole operation in one statement:
string test = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
List<List<string>> strings = (from s in test.Split(new string[]{"#VAR;",";#ENDVAR;"},StringSplitOptions.RemoveEmptyEntries)
let s1 = s.Split(new char[]{';'},StringSplitOptions.RemoveEmptyEntries).ToList<string>()
select (s1)).ToList<List<string>>();
the outpout is:
?strings[0]
Count = 3
[0]: "Variable=Speed"
[1]: "Value=Fast"
[2]: "Op=="
?strings[1]
Count = 3
[0]: "Variable=Fabricator"
[1]: "Value=Freescale"
[2]: "Op=="
To write the data to a file something like this will work:
foreach (List<string> s in strings)
{
System.IO.File.AppendAllLines("textfile1.txt", s);
}

String Concatenation / Overwriting?

This is a program that reads in a CSV file, adds the values to a dictionary class and then analyses a string in a textbox to see if any of the words match the dictionary entry. It will replace abbreviations (LOL, ROFL etc) into their real words. It matches strings by splitting the inputted text into individual words.
public void btnanalyze_Click(object sender, EventArgs e)
{
var abbrev = new Dictionary<string, string>();
using (StreamReader reader = new StreamReader("C:/Users/Jordan Moffat/Desktop/coursework/textwords0.csv"))
{
string line;
string[] row;
while ((line = reader.ReadLine()) != null)
{
row = line.Split(',');
abbrev.Add(row[0], row[1]);
Console.WriteLine(abbrev);
}
}
string twitterinput;
twitterinput = "";
// string output;
twitterinput = txtInput.Text;
{
char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = twitterinput;
string[] words = twitterinput.Split(delimiterChars);
string merge;
foreach (string s in words)
{
if (abbrev.ContainsKey(s))
{
string value = abbrev[s];
merge = string.Join(" ", value);
}
if (!abbrev.ContainsKey(s))
{
string not = s;
merge = string.Join(" ", not);
}
;
MessageBox.Show(merge);
}
The problem so far is that the final string is outputted into a text box, but only prints the last word as it overwrites. This is a University assignment, so I'm looking for a push in the correct direction as opposed to an actual answer. Many thanks!
string.Join() takes a collection of strings, concatenates them together and returns the result. But in your case, the collection contains only one item: value, or not.
To make your code work, you could use something like:
merge = string.Join(" ", merge, value);
But because of the way strings work, this will be quite slow, so you should use StringBuilder instead.
This is the problem:
string not = s;
merge = string.Join(" ", not);
You are just joining a single element (the latest) with a space delimiter, thus overwriting what you previously put into merge.
If you want to stick with string you need to use Concat to append the new word onto the output, though this will be slow as you are recreating the string each time. It will be more efficient to use StringBuilder to create the output.
If your assignment requires that you use Join to build up the output, then you'll need to replace the target words in the words array as you loop over them. However, for that you'll need to use some other looping mechanism than foreach as that doesn't let you modify the array you're looping over.
Better to User StringBuilder Class for such purpose
http://msdn.microsoft.com/en-us/library/system.text.stringbuilder.aspx

Categories

Resources