HashSet IntersectWith count words but only unique

I have a RichTextBox control on a form and a text file. I read the text file into one array and richTextBox1.Text into another, then compare them and count the matching words.
For example, suppose the word "name" appears twice in the text file and three times in the RichTextBox. A word cannot match more often in the RichTextBox than it occurs in the text file, so the third occurrence must be a wrong word and must not be counted. But HashSet keeps only unique values and does not account for duplicates in the text file. I want to compare every word in the text file with the words in the RichTextBox.
Here is my code:
StreamReader sr = new StreamReader("c:\\test.txt",Encoding.Default);
string[] word = sr.ReadLine().ToLower().Split(' ');
sr.Close();
string[] word2 = richTextBox1.Text.ToLower().Split(' ');
var set1 = new HashSet<string>(word);
var set2 = new HashSet<string>(word2);
set1.IntersectWith(set2);
MessageBox.Show(set1.Count.ToString());

Inferring that you want:
file:
foo
foo
foo
bar
text box:
foo
foo
bar
bar
to result in '3' (2 foos and one bar)
Dictionary<string,int> fileCounts = new Dictionary<string, int>();
using (var sr = new StreamReader("c:\\test.txt",Encoding.Default))
{
foreach (var word in sr.ReadLine().ToLower().Split(' '))
{
int c = 0;
if (fileCounts.TryGetValue(word, out c))
{
fileCounts[word] = c + 1;
}
else
{
fileCounts.Add(word, 1);
}
}
}
int total = 0;
foreach (var word in richTextBox1.Text.ToLower().Split(' '))
{
int c = 0;
if (fileCounts.TryGetValue(word, out c))
{
total++;
if (c - 1 > 0)
fileCounts[word] = c - 1;
else
fileCounts.Remove(word);
}
}
MessageBox.Show(total.ToString());
Note that this destructively modifies the dictionary built from the file. You can avoid that (and so only have to build the dictionary once) by counting the words in the rich text box in the same way, then taking the Min of the two counts for each word and summing the results.
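The non-destructive variant described above could look roughly like this. It is only a sketch: the WordMatch class and its method names are made up for illustration, and the words are assumed to be already lower-cased and split.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordMatch
{
    // Counts how often each word occurs.
    public static Dictionary<string, int> CountWords(IEnumerable<string> words)
    {
        return words.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
    }

    // For every word present on both sides, adds the smaller of the two counts.
    public static int CountMatches(IEnumerable<string> fileWords, IEnumerable<string> boxWords)
    {
        var fileCounts = CountWords(fileWords);
        var boxCounts = CountWords(boxWords);
        int total = 0;
        foreach (var pair in fileCounts)
        {
            int boxCount;
            if (boxCounts.TryGetValue(pair.Key, out boxCount))
                total += Math.Min(pair.Value, boxCount);
        }
        return total;
    }
}
```

With the foo/bar example from this answer, WordMatch.CountMatches(new[] { "foo", "foo", "foo", "bar" }, new[] { "foo", "foo", "bar", "bar" }) gives 3, and neither dictionary is modified along the way.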

You need the counts to be the same? You need to count the words, then...
static Dictionary<string, int> CountWords(string[] words) {
// use (StringComparer.{your choice}) for case-insensitive
var result = new Dictionary<string, int>();
foreach (string word in words) {
int count;
if (result.TryGetValue(word, out count)) {
result[word] = count + 1;
} else {
result.Add(word, 1);
}
}
return result;
}
...
var set1 = CountWords(word);
var set2 = CountWords(word2);
var matches = from val in set1
where set2.ContainsKey(val.Key)
&& set2[val.Key] == val.Value
select val.Key;
foreach (string match in matches)
{
Console.WriteLine(match);
}

Related

Detecting and modifying ListBox entries that contain digits

My program has about 25 entries, most of them string only. However, some of them are supposed to have digits in them, and I don't need those digits in the output (output should be string only). So, how can I "filter out" integers from strings?
Also, if I have integers, strings AND chars, how could I do it (for example, one ListBox entry is E#2, and should be renamed to E# and then printed as output)?

Assuming that your entries are in a List<string>, you can loop through the list and then through each character of each entry, then check if it is a number and remove it. Something like this:
List<string> list = new List<string>{ "abc123", "xxx111", "yyy222" };
for (int i = 0; i < list.Count; i++) {
var no_numbers = "";
foreach (char c in list[i]) {
if (!Char.IsDigit(c))
no_numbers += c;
}
list[i] = no_numbers;
}
This only removes digits as it seems you wanted from your question. If you want to remove all other characters except letters, you can change the logic a bit and use Char.IsLetter() instead of Char.IsDigit().

You can remove all digits from a string with this LINQ solution:
string numbers = "Ho5w ar7e y9ou3?";
string noNumbers = new string(numbers.Where(c => !char.IsDigit(c)).ToArray());
// noNumbers is now "How are you?"
But you can also remove all numbers from a string by using a foreach loop :
string numbers = "Ho5w ar7e y9ou3?";
List<char> noNumList = new List<char>();
foreach (var c in numbers)
{
if (!char.IsDigit(c))
noNumList.Add(c);
}
string noNumbers = string.Join("", noNumList);
If you want to remove all numbers from strings inside a collection :
List<string> myList = new List<string>() {
"Ho5w ar7e y9ou3?",
"W9he7re a3re y4ou go6ing?",
"He2ll4o!"
};
List<char> noNumList = new List<char>();
for (int i = 0; i < myList.Count; i++)
{
foreach (var c in myList[i])
{
if(!char.IsDigit(c))
noNumList.Add(c);
}
myList[i] = string.Join("", noNumList);
noNumList.Clear();
}
myList Output :
"How are you?"
"Where are you going?"
"Hello!"

I don't know exactly what your scenario is, but given a string, you can loop through its characters and leave out of the output any character that is a digit.
Maybe this is what you're looking for:
string entry = "E#2";
char[] output = new char[entry.Length];
int j = 0; // number of characters kept so far
for (int i = 0; i < entry.Length; i++)
{
    if (!Char.IsDigit(entry[i]))
    {
        output[j] = entry[i];
        j++;
    }
}
// Print only the filled part of the buffer; the rest still holds '\0' characters.
Console.WriteLine(new string(output, 0, j));
I've tried to give you a simple solution with one loop and two index variables, avoiding string concatenations, which can hurt performance.

If I am not wrong, maybe this is how your list looks?
ABCD123
EFGH456
And your expected output is:
ABCD
EFGH
Is that correct? If so, assuming that it's a List<string>, you can use the code below:
List<string> mylist = new List<string>();
foreach (string item in mylist)
{
    // To get the letters
    var letters = new String(item.Where(Char.IsLetter).ToArray());
    // To get symbol characters (note: '#' counts as punctuation, so test it with Char.IsPunctuation)
    var symbols = new String(item.Where(Char.IsSymbol).ToArray());
}
Now you can easily combine the two :)
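One possible way to combine the two filters is to test each character once and keep it if it is a letter, a symbol, or punctuation. This is a sketch only; EntryCleaner and KeepLettersAndMarks are hypothetical names, and it also drops whitespace, which may or may not be what you want.

```csharp
using System;
using System.Linq;

static class EntryCleaner
{
    // Keeps letters, symbols and punctuation; drops digits and everything else.
    public static string KeepLettersAndMarks(string item)
    {
        return new string(item
            .Where(c => char.IsLetter(c) || char.IsSymbol(c) || char.IsPunctuation(c))
            .ToArray());
    }
}
```

For the entries mentioned in the question, KeepLettersAndMarks("E#2") yields "E#" and KeepLettersAndMarks("ABCD123") yields "ABCD".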

Replace string with multiple different options

Hi there, wonderful people of Stack Overflow.
I am currently totally stuck. What I want to do is take a word out of a text and replace it with a synonym. I thought about it for a while and figured out how to do it if I have ONLY one possible synonym, with this code.
string pathToDesk = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
string text = System.IO.File.ReadAllText(pathToDesk + "/Text.txt");
string replacementsText = System.IO.File.ReadAllText(pathToDesk + "/Replacements.txt");
string wordsToReplace = System.IO.File.ReadAllText(pathToDesk + "/WordsToReplace.txt");
string[] words = text.Split(' ');
string[] reWords = wordsToReplace.Split(' ');
string[] replacements = replacementsText.Split(' ');
for(int i = 0; i < words.Length; i++) {//for each word
for(int j = 0; j < replacements.Length; j++) {//compare with the possible synonyms
if (words[i].Equals(reWords[j], StringComparison.InvariantCultureIgnoreCase)) {
words[i] = replacements[j];
}
}
}
string newText = "";
for(int i = 0; i < words.Length; i++) {
newText += words[i] + " ";
}
txfInput.Text = newText;
But let's say we get the word "hi". Then I want to be able to replace it with one of {"Hello","Yo","Hola"} (for example).
My code is then no good, since the words will no longer have the same position in the arrays.
Is there a smart solution to this? I would really like to know.

You need to store your synonyms differently.
In your file you need something like
hello yo hola hi
awesome fantastic great
Then, for each line, split the words and put them in an array of arrays.
Now use that to find replacement words.
This won't be super optimized, but you can easily index each word to a group of synonyms as well.
Something like:
public class SynonymReplacer
{
private Dictionary<string, List<string>> _synonyms;
public void Load(string s)
{
_synonyms = new Dictionary<string, List<string>>();
var lines = s.Split(new[] {'\r', '\n'}, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
var words = line.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries).ToList();
words.ForEach(word => _synonyms.Add(word, words));
}
}
public string Replace(string word)
{
if (_synonyms.ContainsKey(word))
{
return _synonyms[word].OrderBy(a => Guid.NewGuid())
.FirstOrDefault(w => w != word) ?? word;
}
return word;
}
}
The OrderBy gets you a random synonym...
then
var s = new SynonymReplacer();
s.Load("hi hello yo hola\r\nawesome fantastic great\r\n");
Console.WriteLine(s.Replace("hi"));
Console.WriteLine(s.Replace("ok"));
Console.WriteLine(s.Replace("awesome"));
var words = new string[] {"hi", "you", "look", "awesome"};
Console.WriteLine(string.Join(" ", words.Select(s.Replace)));
and you get something like this (the synonym picked is random):
hello
ok
fantastic
hello you look fantastic

Your first task will be to build a list of words and synonyms. A Dictionary will be perfect for this. The text file containing this list might look like this:
word1|synonym11,synonym12,synonym13
word2|synonym21,synonym22,synonym23
word3|synonym31,synonym32,synonym33
Then you can construct the dictionary like this:
public Dictionary<string, string[]> GetSynonymSet(string synonymSetTextFileFullPath)
{
var dict = new Dictionary<string, string[]>();
string line;
// Read the file line by line.
using (var file = new StreamReader(synonymSetTextFileFullPath))
{
while((line = file.ReadLine()) != null)
{
var split = line.Split('|');
if (!dict.ContainsKey(split[0]))
{
dict.Add(split[0], split[1].Split(','));
}
}
}
return dict;
}
The eventual code will look like this
public string ReplaceWordsInText(Dictionary<string, string[]> synonymSet, string text)
{
var newText = new StringBuilder();
string[] words = text.Split(' ');
for (int i = 0; i < words.Length; i++) //for each word
{
string[] synonyms;
if (synonymSet.TryGetValue(words[i], out synonyms))
{
// The exact synonym you wish to use is up to you.
// I will just use the first one
words[i] = synonyms[0];
}
newText.AppendFormat("{0} ", words[i]);
}
return newText.ToString();
}
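The comment in the loop leaves the choice of synonym open. If a random pick is wanted, as in the question, something along these lines could replace synonyms[0]. This is a sketch under assumptions: SynonymPicker and Pick are hypothetical names, and a single shared Random instance is assumed to be acceptable (it is not thread-safe).

```csharp
using System;
using System.Collections.Generic;

static class SynonymPicker
{
    private static readonly Random Rng = new Random();

    // Returns a randomly chosen synonym for the word, or the word itself
    // when the set has no usable entry for it.
    public static string Pick(Dictionary<string, string[]> synonymSet, string word)
    {
        string[] synonyms;
        if (synonymSet.TryGetValue(word, out synonyms) && synonyms.Length > 0)
            return synonyms[Rng.Next(synonyms.Length)];
        return word;
    }
}
```

Inside ReplaceWordsInText the assignment would then read words[i] = SynonymPicker.Pick(synonymSet, words[i]); and the TryGetValue check around it becomes unnecessary.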

How to parse out all unique variables with a certain naming convention?

I have a code file and I need to find all unique objects of type TADODataSet, but they aren't defined in this 30,000 line file I have.
I wrote a console application that splits each line into individual words and adds that word to a list if it contains ADODataSet (the naming convention prefix for the objects I'm interested in) but this didn't work quite right because of how I'm splitting my lines of code.
This is all of my code:
static void Main(string[] args)
{
string file = @"C:\somePath\Form1.cs";
string output = @"C:\someOtherPath\New Text Document.txt";
List<string> datasets = new List<string>();
string[] lines = File.ReadAllLines(file);
foreach (string line in lines)
{
string[] words = line.Split(' ');
foreach (string word in words)
{
if (word.ToLower().Contains("adodataset"))
datasets.Add(word);
}
}
if (datasets.Count > 0)
{
using (StreamWriter sw = new StreamWriter(output))
{
foreach (string dataset in datasets.Distinct())
{
sw.WriteLine(dataset);
}
}
Console.WriteLine(String.Format("Wrote {0} data sets to {1}", datasets.Distinct().Count(), output));
Console.ReadKey();
}
}
But this didn't work as I hoped, and added "words" such as these:
SQLText(ADODataSetEnrollment->FieldByName("Age1")->AsString)
SQLText(ADODataSetEnrollment->FieldByName("Age2")->AsString)
SQLText(ADODataSetEnrollment->FieldByName("Age3")->AsString)
I'm only interested in ADODataSetEnrollment, so I should only have 1 entry for that variable in my output file but because that line of code doesn't contain a space it's treated as a single "word".
How can I split my lines array instead, so that way I can find unique variables?

Have you tried Regex matching? With Regex you can, for example, say
Regex.IsMatch(word, @"(?i)(?<!\w)adodataset(?!\w)")
> (?i) means ignore case (upper case and lower case are treated alike)
> (?<!\w) means not preceded by a word character (a letter, digit, or underscore)
> (?!\w) means not followed by a word character
> Regex.IsMatch(...) returns a bool value

Ended up with this as a solution:
string file = @"C:\somePath\Form1.cs";
string output = @"C:\someOtherPath\New Text Document.txt";
List<string> datasets = new List<string>();
string[] lines = File.ReadAllLines(file);
decimal i = 0;
foreach (string line in lines)
{
string[] words = line.Split(' ');
foreach (string word in words)
{
if (word.ToLower().Contains("adodataset"))
{
int start = word.ToLower().IndexOf("adodataset");
string dsWord = String.Empty;
string temp = word.Substring(start, word.Length - start);
foreach (char c in temp)
{
if (Char.IsLetter(c))
dsWord += c;
else
break;
}
if (dsWord != String.Empty)
datasets.Add(dsWord);
}
}
i++;
Console.Write("\r{0}% ", Math.Round(i / lines.Count() * 100, 2));
}
if (datasets.Count > 0)
{
using (StreamWriter sw = new StreamWriter(output))
{
foreach (string dataset in datasets.Distinct())
sw.WriteLine(dataset);
}
Console.WriteLine(String.Format("Wrote {0} data sets to {1}", datasets.Distinct().Count(), output));
Console.ReadKey();
}
Pretty crude, but it did what I needed it to do. I'll happily accept someone else's answer, though, if they know of a better way to use Regex to pull just the variable name out of the line of code, rather than the whole line itself.

You can try this solution:
string file = File.ReadAllText(@"text.txt");
string output = @"C:\someOtherPath\New Text Document.txt";
List<string> datasets = new List<string>();
var a = Regex.Matches(file, @"\W(ADODataSet\w*)", RegexOptions.IgnoreCase);
foreach (Match m in a)
{
datasets.Add(m.Groups[1].Value);
}
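Building on the Regex.Matches approach, the matches still need to be reduced to unique names before writing them out. A sketch of that step follows. DataSetFinder and FindNames are hypothetical names, and the pattern uses a lookbehind instead of a leading \W so that a name at the very start of the input is also found.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class DataSetFinder
{
    // Matches an identifier that starts with "ADODataSet" (any casing) and is
    // not preceded by another word character.
    private static readonly Regex NamePattern =
        new Regex(@"(?<!\w)adodataset\w*", RegexOptions.IgnoreCase);

    // Returns each distinct identifier once (comparison is case-sensitive).
    public static List<string> FindNames(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => NamePattern.Matches(line).Cast<Match>())
            .Select(m => m.Value)
            .Distinct()
            .ToList();
    }
}
```

The result of DataSetFinder.FindNames(File.ReadAllLines(file)) can be passed straight to File.WriteAllLines(output, ...). For the three SQLText(ADODataSetEnrollment->...) lines from the question it contains the single entry ADODataSetEnrollment.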

Holding number of strings

I'm trying to make a concordance.
I have a dictionary that holds each word and how often it appears in the text.
Now I need to store the numbers of the lines where each word occurs.
To do that, I plan to make a container that stores each line.
Something like this:
List<String> eachLine = new List<string>();
using (var strReader = new StreamReader(@"pathToFile/Text.txt"))
{
string line;
while ((line = strReader.ReadLine()) != null)
{
eachLine.Add(line);
}
}
Here is the dictionary:
Dictionary<string, int> concordanceDictionary = new Dictionary<string, int>();
string lines = File.ReadAllText("path:Text.txt").ToLower();
string[] words = SplitWords(lines);
foreach (var word in words)
{
int i = 1;
if (!concordanceDictionary.ContainsKey(word))
{
concordanceDictionary.Add(word, i);
}
else
{
concordanceDictionary[word]++;
}
}
var list =concordanceDictionary.Keys.ToList();
list.Sort();
To store the line numbers, I'll create a List holding the index of every line where a word occurs, using the Contains method to check whether each word in the dictionary is in List<String> eachLine.
The problem is how to display each word with its list of line numbers.
Maybe you can suggest a more elegant and easier way to do it?

Created a console app for you to run
class Program
{
static void Main(string[] args)
{
ReadTextToDictionary read = new ReadTextToDictionary();
var strings = read.TextToListString(@"C:\stackoverflow\first.txt");
var dictionarys = read.TextToDictionaryString(@"C:\stackoverflow\second.txt");
foreach(var s in strings) {
var compare = dictionarys.Where(a=>a.Value.Contains(s.ToString()));
foreach(var f in compare)
{
Console.WriteLine(s+" in line "+f.Key.ToString() + " " + f.Value);
}
}
Console.ReadKey();
}
}
class ReadTextToDictionary
{
public List<string> TextToListString(string path){
var lines = System.IO.File.ReadAllLines(path);
return lines.ToList();
}
public Dictionary<int,string> TextToDictionaryString(string path)
{
Dictionary<int, string> dstr = new Dictionary<int, string>();
var lines = System.IO.File.ReadAllLines(path);
int count = 0;
foreach (var s in lines)
{
count++;
dstr.Add(count, s);
}
return dstr;
}
}

I would use a Dictionary<String,List<Int32>> where the key is String which is the current word, and the List<Int32> is the list of line-numbers the word appears on. To get the occurrence count just dereference the list's Count property: dictionary[ word ].Count.
As an aside, you don't need to read everything into memory at once (as String[] instances). You can simply read through the file character-by-character and identify whitespace and line-breaks.
To that end, this is my implementation:
void Run() {
Dictionary< String, List<Int32> > dict = new Dictionary< String, List<Int32> >();
foreach(Tuple<String,Int32> wordOccurrence in GetWords()) {
String word = wordOccurrence.Item1;
Int32 line = wordOccurrence.Item2;
if( !dict.ContainsKey( word ) ) dict.Add( word , new List<Int32>() );
dict[ word ].Add( line );
}
foreach(String word in dict.Keys) {
Console.WriteLine("\"{0}\" appeared {1} times, on these lines:", word, dict[word].Count);
foreach(Int32 line in dict[word]) Console.WriteLine("\t{0}", line );
Console.WriteLine("");
}
}
IEnumerable< Tuple<String,Int32> > GetWords() {
using(StreamReader rdr = new StreamReader("fileName")) {
StringBuilder sb = new StringBuilder();
Int32 nc; Char c;
Int32 lineNumber = 0; // zero-based: the first line is line 0
while( (nc = rdr.Read()) != -1 ) {
c = (Char)nc;
if( Char.IsWhiteSpace(c) ) {
if( sb.Length > 0 ) {
yield return Tuple.Create( sb.ToString(), lineNumber );
sb.Length = 0;
}
if( c == '\n' ) lineNumber++;
} else {
sb.Append( c );
}
}
if( sb.Length > 0 ) yield return Tuple.Create( sb.ToString(), lineNumber );
}
}

One way to do it would be to store each line where the word occurs in a list, as the value portion of a dictionary with the word as a key.
In other words, you would have a Dictionary<string, List<string>> where the key is a word and the associated list is all the lines containing the word.
This way, you can quickly access the lines AND you get the number of occurrences for free (dict[someWord].Count;)
For example:
// words dictionary has a word key and a list of lines containing the word
var words = new Dictionary<string, List<string>>();
using (var strReader = new StreamReader(@"pathToFile/Text.txt"))
{
string line;
// Read each line
while ((line = strReader.ReadLine()) != null)
{
// Get each word from the line
var wordsInLine = line.ToLower().Split(' ');
foreach (var word in wordsInLine)
{
// If this word already exists, add this line to its list
if (words.ContainsKey(word))
{
words[word].Add(line);
}
// Otherwise, add a new word whose list starts with this line
else
{
words.Add(word, new List<string> {line});
}
}
}
}
And if you really wanted to get all the lines, you could either add them to a list in the loop above, or do something like this:
var allLines = words.SelectMany(w => w.Value).Distinct().ToList();
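Since the question asks for line numbers rather than the line text, the same idea can store numbers instead. A minimal sketch follows, assuming 1-based line numbers and splitting on spaces only. The Concordance class and its Build and Format methods are illustrative names, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

static class Concordance
{
    // Maps each lower-cased word to the sorted set of 1-based line numbers
    // on which it occurs.
    public static SortedDictionary<string, SortedSet<int>> Build(IEnumerable<string> lines)
    {
        var index = new SortedDictionary<string, SortedSet<int>>(StringComparer.Ordinal);
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            var words = line.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                SortedSet<int> numbers;
                if (!index.TryGetValue(word, out numbers))
                {
                    numbers = new SortedSet<int>();
                    index.Add(word, numbers);
                }
                numbers.Add(lineNumber);
            }
        }
        return index;
    }

    // Produces one "word: 1, 3" string per entry, in alphabetical order.
    public static IEnumerable<string> Format(SortedDictionary<string, SortedSet<int>> index)
    {
        foreach (var entry in index)
            yield return entry.Key + ": " + string.Join(", ", entry.Value);
    }
}
```

Because a set is used, a word that occurs twice on one line is recorded once for that line, so the set's Count is the number of lines containing the word, not the total number of occurrences. Keep the separate frequency dictionary if you need both.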

split string in to several strings at specific points

I have a text file with lines of text laid out like so
12345MLOL68
12345MLOL68
12345MLOL68
I want to read the file and add commas to the 5th point, 6th point and 9th point and write it to a different text file so the result would be.
12345,M,LOL,68
12345,M,LOL,68
12345,M,LOL,68
This is what I have so far
public static void ToCSV(string fileWRITE, string fileREAD)
{
int count = 0;
string x = "";
StreamWriter commas = new StreamWriter(fileWRITE);
string FileText = new System.IO.StreamReader(fileREAD).ReadToEnd();
var dataList = new List<string>();
IEnumerable<string> splitString = Regex.Split(FileText, "(.{1}.{5})").Where(s => s != String.Empty);
foreach (string y in splitString)
{
dataList.Add(y);
}
foreach (string y in dataList)
{
x = (x + y + ",");
count++;
if (count == 3)
{
x = (x + "NULL,NULL,NULL,NULL");
commas.WriteLine(x);
x = "";
count = 0;
}
}
commas.Close();
}
The problem I'm having is trying to figure out how to split the original string lines I read in at several points. The line
IEnumerable<string> splitString = Regex.Split(FileText, "(.{1}.{5})").Where(s => s != String.Empty);
Is not working in the way I want to. It's just adding up the 1 and 5 and splitting all strings at the 6th char.
Can anyone help me split each string at specific points?

Simpler code:
public static void ToCSV(string fileWRITE, string fileREAD)
{
string[] lines = File.ReadAllLines(fileREAD);
string[] splitLines = lines.Select(s => Regex.Replace(s, "(.{5})(.)(.{3})(.*)", "$1,$2,$3,$4")).ToArray();
File.WriteAllLines(fileWRITE, splitLines);
}

Just insert at the right place in descending order like this.
string str = "12345MLOL68";
int[] indices = {5, 6, 9};
indices = indices.OrderByDescending(x => x).ToArray();
foreach (var index in indices)
{
str = str.Insert(index, ",");
}
We do this in descending order because inserting in ascending order would shift all the later indices, which makes them hard to track.
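The same descending-order idea can be wrapped in a small helper and applied to every line of the file. This is only a sketch: CsvSplitter and InsertCommas are hypothetical names, and positions that fall outside a line are skipped rather than treated as errors.

```csharp
using System;
using System.Linq;
using System.Text;

static class CsvSplitter
{
    // Inserts a comma before each of the given zero-based positions.
    public static string InsertCommas(string line, params int[] positions)
    {
        var sb = new StringBuilder(line);
        // Work from the highest position down so earlier inserts do not shift later ones.
        foreach (int pos in positions.OrderByDescending(p => p))
        {
            if (pos > 0 && pos < line.Length)
                sb.Insert(pos, ',');
        }
        return sb.ToString();
    }
}
```

CsvSplitter.InsertCommas("12345MLOL68", 5, 6, 9) returns "12345,M,LOL,68". To convert a whole file, File.WriteAllLines(fileWRITE, File.ReadAllLines(fileREAD).Select(l => CsvSplitter.InsertCommas(l, 5, 6, 9))) should do it.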

Why don't you use Substring? For example:
editedString = input.Substring(0, 5) + "," + input.Substring(5, 1) + "," + input.Substring(6, 3) + "," + input.Substring(9);
This should suit your needs.
