Searching for words in richTextBox (windows forms, c#) - c#

I need help counting repeated words in a richTextBox, and I also want to assign a number or word to each of them. There are 500,000+ words in the box, and each of them appears only 3 or 6 times.
One solution is to write down the words I'm looking for in advance, search the box for each one, and assign a number/word to it. But there are too many words to do that; if there is a quicker way, it would help me a lot.
Thanks in advance!
Here is the code for what I'm doing right now; it doesn't represent what I'm asking.
for (int i = 0; i < Url.Length; i++)
{
doc = web.Load(Url[i]);
int x=0;
int y=0;
//here was a long list of whiles for *fors* down there,
//removed to be cleaner for you to see
//one example:
//while (i == 0)
//{
// x = 18;
// y = 19;
// break;
//}
for (int w = 0; w < x; w++)
{
string metascore = doc.DocumentNode.SelectNodes("//*[@class=\"trow8\"]")[w].InnerText;
richTextBox1.AppendText("-----------");
richTextBox1.AppendText(metascore);
richTextBox1.AppendText("\r\n");
}
for (int z = 0; z < y; z++)
{
string metascore1 = doc.DocumentNode.SelectNodes("//*[@class=\"trow2\"]")[z].InnerText;
richTextBox1.AppendText("-----------");
richTextBox1.AppendText(metascore1);
richTextBox1.AppendText("\r\n");
}
}

Let's assume you have your text, read from a file, downloaded from some webpage, or RTB... Using LINQ will give you all you need.
string textToCount = "...";
// split the text by delimiters of your own choosing
List<string> words = textToCount.Split(' ', ',', '!', '?', '.', ';', '\r', '\n') // split by whatever characters
    .Where(s => !string.IsNullOrWhiteSpace(s)) // drop empty and whitespace-only entries
    .Select(w => w.ToLower()) // lower-case for easier comparison/count
    .ToList();
// use the LINQ helper methods to group the words and project them into a dictionary
// with the word as key and the total number of appearances as value
// this is the actual magic. With this in place, it's easy to get statistics.
var groupedWords = words.GroupBy(w => w)
    .ToDictionary(k => k.Key, v => v.Count());
// to get various statistics
// total number of different words - count your dictionary entries
int distinctWords = groupedWords.Count;
// the most frequent word is mentioned this many times
int maxCount = groupedWords.Values.Max();
// the most frequent word - look for the word whose value equals that maximum
string mostFrequent = groupedWords.First(kvp => kvp.Value == maxCount).Key;
// similar for min, just replace the call to Max() with a call to Min()
// print out your dictionary to see how often every word is mentioned
foreach (var wordStats in groupedWords)
{
    Console.WriteLine($"{wordStats.Key} - {wordStats.Value}");
}
This solution is similar to the previous post. The main difference is that it groups by word, which is simple, and puts the result into a dictionary. Once the counts are there, a lot can be found easily.
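The question also asks for a number to be assigned to each word. A minimal single-pass sketch of that idea follows; the class name `WordIndexer`, the separator list, and the choice to number words in order of first appearance are all assumptions made for illustration, not something stated in the question.

```csharp
using System;
using System.Collections.Generic;

public static class WordIndexer
{
    // Hypothetical helper: maps each distinct word to an (Id, Count) pair.
    // Ids are assigned in order of first appearance, starting at 1.
    // Words are compared case-insensitively (an assumption).
    public static Dictionary<string, (int Id, int Count)> Index(string text)
    {
        char[] separators = { ' ', ',', '!', '?', '.', ';', '\t', '\r', '\n' };
        var index = new Dictionary<string, (int Id, int Count)>(StringComparer.OrdinalIgnoreCase);

        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (index.TryGetValue(word, out var entry))
                index[word] = (entry.Id, entry.Count + 1); // seen before: bump the count
            else
                index[word] = (index.Count + 1, 1);        // new word: next free id
        }
        return index;
    }
}
```

In a Windows Forms application this could be called as `var index = WordIndexer.Index(richTextBox1.Text);`. Because the text is walked only once, the cost grows linearly with the number of words, which matters at 500,000+ words.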

Assuming you have a RichTextBox running around somewhere, we will call him Bob:
// Grab all the text from the RichTextBox.
// TextRange is the WPF API; on a Windows Forms RichTextBox, use Bob.Text instead.
TextRange BobRange = new TextRange(
    // TextPointer to the start of content in the RichTextBox.
    Bob.Document.ContentStart,
    // TextPointer to the end of content in the RichTextBox.
    Bob.Document.ContentEnd
);
// Assume words are separated by spaces, commas or periods; split on those and throw the words in a List
List<string> BobsWords = BobRange.Text.Split(new char[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries).ToList();
// Now use LINQ to grab what you want
int CountOfTheWordThe = BobsWords.Count(w => w == "The");
// Now that we have the list of words, we can create a dictionary of counts
Dictionary<string, int> BobsWordsAndCounts = new Dictionary<string, int>();
foreach (string word in BobsWords)
{
    if (!BobsWordsAndCounts.ContainsKey(word))
        BobsWordsAndCounts.Add(word, BobsWords.Count(w => w == word));
}
Is that what you meant by assign a number to a word? You want to know the count of each word in the RichTextBox?
Perhaps add a bit of context?

Related

I'm trying to find a specific word within a char array

Using C# I'm trying to find a specific word within a char array. Also, I don't want the same letter used more than once i.e. the word is 'hello' and I'm trying to find it within a random array of letters, so if the letter 'l' is used out of the random array of letters, I don't want it to be used again. There should be another 'l' within the array of letters to be used as the second 'l' in "hello". Just trying to be precise. A simple answer would be very helpful. Thank you.
Here is my attempt so far.
public static char [] Note = "hello".ToCharArray();
public static char [] Newspaper = "ahrenlxlpoz".ToCharArray();
static void main(string[] args)
{
Array.Sort(Note);
Array.Sort(Newspaper);
if(Newspaper.Contains<Note>)
{
Console.Write("It should display the letters of Note found within Newspaper");
}
}
I assume by "contains" you mean that Newspaper has enough copies of each letter to make up Note. For example, you need at least two l's to make up the word "hello". If so, you basically need to count the occurrences of each letter in both strings, and make sure the count of each letter in Note is less than or equal to the count of that letter in Newspaper.
var dictNote = Note.GroupBy(c => c).ToDictionary(g => g.Key, g => g.Count());
var dictNews = Newspaper.GroupBy(c => c).ToDictionary(g => g.Key, g => g.Count());
bool contains = dictNote.All(x =>
dictNews.ContainsKey(x.Key) && x.Value <= dictNews[x.Key]);
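The same counting idea can also be written with a single dictionary that is decremented as letters are consumed. This is only a sketch; the class and method names (`LetterCheck`, `CanBuild`) are invented for illustration.

```csharp
using System.Collections.Generic;

public static class LetterCheck
{
    // Returns true when 'source' holds enough copies of every letter in 'word'.
    public static bool CanBuild(string word, string source)
    {
        // Count how many times each letter is available in the source.
        var available = new Dictionary<char, int>();
        foreach (char c in source)
        {
            available.TryGetValue(c, out int n); // n is 0 when c is not yet present
            available[c] = n + 1;
        }

        // Consume one copy per letter of the word; fail as soon as one is missing.
        foreach (char c in word)
        {
            if (!available.TryGetValue(c, out int n) || n == 0)
                return false;
            available[c] = n - 1;
        }
        return true;
    }
}
```

With the data from the question, `LetterCheck.CanBuild("hello", "ahrenlxlpoz")` returns true, while a source with only one 'l' or no 'o' returns false.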
In fact, a string can be indexed just like a char array. And the most "classic" way to do this would be:
string Note = "hello";
char[] Newspaper = "ahrenlxlpoz".ToCharArray();
string res = "";
for (int i = 0; i < Note.Length; i++)
for (int j = 0; j < Newspaper.Length; j++)
if (Note[i] == Newspaper[j])
{
res += Newspaper[j];
Newspaper[j] = ' ';
break;
}
//This prints the remaining characters in Newspaper. I avoid repeating chars.
for (int i = 0; i < Newspaper.Length; i++ )
Console.Write(Newspaper[i]+"\n");
Console.Write("\n\n");
if (Note.Equals(res)) Console.Write("Word found");
else Console.Write("Word NOT found");
Console.Read();
At the end, res will be "hello" if every letter was found; print res to the console to check. Each matched character in Newspaper is overwritten with ' ' so that it cannot be reused, as someone pointed out in another answer. Finally, the code compares the result with the word and tells you whether the word was found in the array. Try changing Newspaper to "ahrenlxlpaz" and it will tell you the word is NOT found :)
Try this:
public static char[] Note = "hello".ToCharArray();
public static char[] Newspaper = "ahrenlxlpoz".ToCharArray();
foreach (char c in Note) //check each character of Note
{
if (Newspaper.Contains(c))
{
Console.Write(c); //it will display hello
}
}

What's the best way to split a list of strings to match first and last letters?

I have a long list of words in C#, and I want to find all the words within that list that have the same first and last letters and that have a length of between, say, 5 and 7 characters. For example, the list might have:
"wasted was washed washing was washes watched watches wilts with wastes wits washings"
It would return
Length: 5-7, First letter: w, Last letter: d, "wasted, washed, watched"
Length: 5-7, First letter: w, Last letter: s, "washes, watches, wilts, wastes"
Then I might change the specification for a length of 3-4 characters which would return
Length: 3-4, First letter: w, Last letter: s, "was, wits"
I found this method of splitting which is really fast, made each item unique, used the length and gave an excellent start:
Spliting string into words length-based lists c#
Is there a way to modify/use that to take account of first and last letters?
EDIT
I originally asked about the 'fastest' way because I usually solve problems like this with lots of string arrays (which are slow and involve a lot of code). LINQ and lookups are new to me, but I can see that the ILookup used in the solution I linked to is amazing in its simplicity and is very fast. I don't actually need the minimum processor time. Any approach that avoids me creating separate arrays for this information would be fantastic.
This one-liner will give you the groups with the same first/last letter in your length range (str is the input string):
int min = 5;
int max = 7;
var results = str.Split()
.Where(s => s.Length >= min && s.Length <= max)
.GroupBy(s => new { First = s.First(), Last = s.Last()});
var minLength = 5;
var maxLength = 7;
var firstPart = "w";
var lastPart = "d";
var words = new List<string> { "washed", "wash" }; // so on
var matches = words.Where(w => w.Length >= minLength && w.Length <= maxLength &&
w.StartsWith(firstPart) && w.EndsWith(lastPart))
.ToList();
For the most part, this should be fast enough, unless you're dealing with tens of thousands of words and worrying about milliseconds. Then we can look further.
Just in LINQPad I created this:
void Main()
{
var words = new []{"wasted", "was", "washed", "washing", "was", "washes", "watched", "watches", "wilts", "with", "wastes", "wits", "washings"};
var firstLetter = "w";
var lastLetter = "d";
var minimumLength = 5;
var maximumLength = 7;
var sortedWords = words.Where(w => w.StartsWith(firstLetter) && w.EndsWith(lastLetter) && w.Length >= minimumLength && w.Length <= maximumLength);
sortedWords.Dump();
}
If that isn't fast enough, I would create a lookup table:
Dictionary<char, Dictionary<char, List<string>>> lookupTable;
and do:
lookupTable[firstLetter][lastLetter].Where(<check length>)
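One way such a lookup table might be built is with LINQ's `ToLookup`, keyed on the (first letter, last letter) pair instead of nested dictionaries. This is a sketch under that assumption; `WordLookup`, `Build` and `Find` are hypothetical names, and duplicates are removed with `Distinct()` to match the expected output in the question.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordLookup
{
    // Groups the distinct, non-empty words by their (first letter, last letter) pair.
    public static ILookup<(char First, char Last), string> Build(IEnumerable<string> words)
    {
        return words
            .Where(w => w.Length > 0)
            .Distinct()
            .ToLookup(w => (First: w[0], Last: w[w.Length - 1]));
    }

    // Returns the words for one letter pair whose length lies in [minLength, maxLength].
    // An ILookup returns an empty sequence for a missing key, so no key check is needed.
    public static List<string> Find(ILookup<(char First, char Last), string> lookup,
                                    char first, char last, int minLength, int maxLength)
    {
        return lookup[(first, last)]
            .Where(w => w.Length >= minLength && w.Length <= maxLength)
            .ToList();
    }
}
```

The lookup is built once; changing the length range afterwards (for example from 5-7 to 3-4) only requires another call to `Find`.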
Here's a method that does exactly what you want. You are only given a list of strings and the min/max length, correct? You aren't given the first and last letters to filter on. This method processes all the first/last letters in the strings.
private static void ProcessInput(string[] words, int minLength, int maxLength)
{
var groups = from word in words
where word.Length > 0 && word.Length >= minLength && word.Length <= maxLength
let key = new Tuple<char, char>(word.First(), word.Last())
group word by key into @group
orderby Char.ToLowerInvariant(@group.Key.Item1), @group.Key.Item1, Char.ToLowerInvariant(@group.Key.Item2), @group.Key.Item2
select @group;
Console.WriteLine("Length: {0}-{1}", minLength, maxLength);
foreach (var group in groups)
{
Console.WriteLine("First letter: {0}, Last letter: {1}", group.Key.Item1, group.Key.Item2);
foreach (var word in group)
Console.WriteLine("\t{0}", word);
}
}
Just as a quick thought, I have no clue if this would be faster or more efficient than the linq solutions posted, but this could also be done fairly easily with regular expressions.
For example, if you wanted to get 5-7 letter length words that begin with "w" and end with "s", you could use a pattern along the lines of:
\bw[A-Za-z]{3,5}s\b
(and this could fairly easily be made to be more variable driven - For example, have a variable for first letter, min length, max length, last letter and plug them in to the pattern to replace w, 3, 5 & s)
Then, using the Regex library, you could just take your matches to be your list.
Again, I don't know how this compares efficiency-wise to linq, but I thought it might deserve mention.
Hope this helps!!
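The variable-driven pattern described above could look roughly like the following sketch. The names `RegexWordFinder` and `Find` are made up for illustration, and the code assumes words consist only of ASCII letters and that `maxLength` is at least 2.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordFinder
{
    // Finds the distinct words in 'text' that start with 'first', end with 'last'
    // and have a total length between minLength and maxLength (maxLength >= 2 assumed).
    public static List<string> Find(string text, char first, char last,
                                    int minLength, int maxLength)
    {
        // The first and last letters take up two positions, so the middle part
        // is two characters shorter than the whole word.
        int minMiddle = Math.Max(0, minLength - 2);
        int maxMiddle = maxLength - 2;

        string pattern = @"\b" + Regex.Escape(first.ToString())
                       + "[A-Za-z]{" + minMiddle + "," + maxMiddle + "}"
                       + Regex.Escape(last.ToString()) + @"\b";

        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .Distinct()
                    .ToList();
    }
}
```

For `first = 'w'`, `last = 's'`, `minLength = 5` and `maxLength = 7`, the generated pattern is the `\bw[A-Za-z]{3,5}s\b` shown above.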

Find which phrases have been used multiple times in a string

It's easy to count occurrences of words in a file by using a Dictionary to identify which words are used the most frequently, but given a text file, how can I find commonly used phrases where a "phrase" is a set of two or more consecutive words?
For example, here is some sample text:
Except oral wills, every will shall be in writing, but may be
handwritten or typewritten. The will shall contain the testator's signature
or by some other person in the testator's conscious presence
and at the testator's express direction . The will shall be attested
and subscribed in the conscious presence of the testator, by two or
more competent witnesses, who saw the testator subscribe, or heard the
testator acknowledge the testator's signature.
For purposes of this section, conscious presence means within the
range of any of the testator's senses, excluding the sense of sight or
sound that is sensed by telephonic, electronic, or other distant
communication.
How can I identify that the phrases "conscious presence" (3 times) and "testator's signature" (2 times) appear more than once (apart from brute-force searching for every set of two or three words)?
I'll be writing this in c#, so c# code would be great, but I can't even identify a good algorithm so I'll settle for any code at all or even pseudo code for how to solve this.
Thought I'd have a quick go at this - not sure if this isn't the brute force method you were trying to avoid - but :
static void Main(string[] args)
{
string txt = @"Except oral wills, every will shall be in writing,
but may be handwritten or typewritten. The will shall contain the testator's
signature or by some other person in the testator's conscious presence and at the
testator's express direction . The will shall be attested and subscribed in the
conscious presence of the testator, by two or more competent witnesses, who saw the
testator subscribe, or heard the testator acknowledge the testator's signature.
For purposes of this section, conscious presence means within the range of any of the
testator's senses, excluding the sense of sight or sound that is sensed by telephonic,
electronic, or other distant communication.";
//split the string using common separators - could add more or use a regex.
string[] words = txt.Split(',', '.', ';', ' ', '\n', '\r');
//trim each string and get rid of any empty ones
words = words.Select(t => t.Trim()).Where(t => t != string.Empty).ToArray();
const int MaxPhraseLength = 20;
Dictionary<string, int> Counts = new Dictionary<string,int>();
for (int phraseLen = MaxPhraseLength; phraseLen >= 2; phraseLen--)
{
for (int i = 0; i < words.Length - 1; i++)
{
//get the phrase to match based on phraselen
string[] phrase = GetPhrase(words, i, phraseLen);
string sphrase = string.Join(" ", phrase);
Console.WriteLine("Phrase : {0}", sphrase);
int index = FindPhraseIndex(words, i+phrase.Length, phrase);
if (index > -1)
{
Console.WriteLine("Phrase : {0} found at {1}", sphrase, index);
if(!Counts.ContainsKey(sphrase))
Counts.Add(sphrase, 1);
Counts[sphrase]++;
}
}
}
foreach (var foo in Counts)
{
Console.WriteLine("[{0}] - {1}", foo.Key, foo.Value);
}
Console.ReadKey();
}
static string[] GetPhrase(string[] words, int startpos, int len)
{
return words.Skip(startpos).Take(len).ToArray();
}
static int FindPhraseIndex(string[] words, int startIndex, string[] matchWords)
{
for (int i = startIndex; i < words.Length; i++)
{
int j;
for(j=0; j<matchWords.Length && (i+j)<words.Length; j++)
if(matchWords[j]!=words[i+j])
break;
if (j == matchWords.Length)
return i; // index where the match actually starts, not where the search began
}
return -1;
}
Try this out. It's in no way fool-proof, but should get the job done for now.
Yes, this only matches 2-word combos, does not strip punctuation, and is brute-force. No, the ToList is not necessary.
string text = "that big long text block";
var splitBySpace = text.Split(' ');
var doubleWords = splitBySpace
.Select((x, i) => new { Value = x, Index = i })
.Where(x => x.Index != splitBySpace.Length - 1)
.Select(x => x.Value + " " + splitBySpace.ElementAt(x.Index + 1)).ToList();
var duplicates = doubleWords
.GroupBy(x => x)
.Where(x => x.Count() > 1)
.Select(x => new { x.Key, Count = x.Count() }).ToList();
I got the following results:
Here is my attempt at getting more than 2 word combos. Again, same warning as previous.
List<string> multiWords = new List<string>();
//i is the number of words to combine
//in this case, 2-6 words
for (int i = 2; i <= 6; i++)
{
multiWords.AddRange(splitBySpace
.Select((x, index) => new { Value = x, Index = index })
.Where(x => x.Index <= splitBySpace.Length - i) // keep only start positions with i words available
.Select(x => CombineItems(splitBySpace, x.Index, x.Index + i - 1)));
}
var duplicates = multiWords
.GroupBy(x => x)
.Where(x => x.Count() > 1)
.Select(x => new { x.Key, Count = x.Count() }).ToList();
private string CombineItems(IEnumerable<string> source, int startIndex, int endIndex)
{
return string.Join(" ", source.Where((x, i) => i >= startIndex && i <= endIndex).ToArray());
}
The results this time:
Now I just want to say there is a high chance of an off-by-one error in my code. I did not fully test it, so make sure you test it before you use it.
If I were doing it, I would probably be starting with the brute force approach, but it sounds like you want to avoid that. A two-phase approach could do a count of each word, take the top few results (only start with the top few words that appear the most times), then search for and count only phrases that include these popular words. Then you won't spend your time searching over all phrases.
I have this feeling that CS folks will correct me saying that this would actually take more time than a straight up brute force. And maybe some linguists will pitch in with some methods for detecting phrases or something.
Good luck!
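For comparison, a straightforward dictionary-based count of every run of consecutive words (word n-grams) can be sketched as below. This is still the brute-force approach, just done in one pass per phrase length. The names `PhraseCounter` and `RepeatedPhrases` are hypothetical, the text is lower-cased, and punctuation is treated purely as a separator, so phrases that span a sentence boundary are counted too (all of these are assumptions).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseCounter
{
    // Counts every run of minWords..maxWords consecutive words and
    // returns only the phrases that occur more than once.
    public static Dictionary<string, int> RepeatedPhrases(string text, int minWords, int maxWords)
    {
        char[] separators = { ' ', ',', '.', ';', '\t', '\r', '\n' };
        string[] words = text.ToLowerInvariant()
                             .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();
        for (int n = minWords; n <= maxWords; n++)
        {
            // Slide a window of n words over the text.
            for (int start = 0; start + n <= words.Length; start++)
            {
                string phrase = string.Join(" ", words, start, n);
                counts.TryGetValue(phrase, out int seen); // seen is 0 for a new phrase
                counts[phrase] = seen + 1;
            }
        }

        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

The work is proportional to the number of words times the number of phrase lengths tried, at the price of holding every distinct phrase in memory.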

Calculating all possible combinations of a string, with a twist

I'm trying to allow a user to enter text in a textbox, and have the program generate all possible combinations of it, except with a minimum of 3 characters and maximum of 6. I don't need useless words like 'as', 'a', 'i', 'to', etc cluttering up my array. I'll also be checking each combination against a dictionary to make sure it's a real word.
I have the dictionary complete (painstakingly generated; here's a link to it in return. WARNING: gigantic load time, at least for me!).
Anyways, if the user enters 'ABCDEF' (in no particular order), how could I generate, for example:
'ABC'
'BAC'
'CAB'
...
'ABD'
'ABE'
'ABF'
etc... EVERY possible combination, no matter what order? I understand that there are a ridiculous amount of these combinations, but it only needs to be calculated once, so I'm not too worried about that.
I've found code samples to recursively find combinations (not permutations, I don't need those) of just the fixed-width string (ABCDEF, ABCDFE ... ACDBFE, etc). They don't do what I need, and I haven't the slightest clue about where to even start with this project.
This isn't homework, it started out as a personal project of mine that's grown to take over my life over such a simple problem... I can't believe I can't figure this out!
It sounds to me like you're describing the power set.
Here's an implementation I had lying around my personal library:
// Helper method to count set bits in an integer
public static int CountBits(int n)
{
int count = 0;
while (n != 0)
{
count++;
n &= (n - 1);
}
return count;
}
public static IEnumerable<IEnumerable<T>> PowerSet<T>(
IEnumerable<T> src,
int minSetSize = 0,
int maxSetSize = int.MaxValue)
{
// we want fast random access to the source, so we'll
// need to ToArray() it
var cached = src.ToArray();
var setSize = 1 << cached.Length; // 2^n subsets; assumes fewer than 31 source elements
for(int i=0; i < setSize; i++)
{
var subSetSize = CountBits(i);
if(subSetSize < minSetSize ||
subSetSize > maxSetSize)
{
continue;
}
T[] set = new T[subSetSize];
var temp = i;
var srcIdx = 0;
var dstIdx = 0;
while(temp > 0)
{
if((temp & 0x01) == 1)
{
set[dstIdx++] = cached[srcIdx];
}
temp >>= 1;
srcIdx++;
}
yield return set;
}
yield break;
}
And a quick test rig:
void Main()
{
var src = "ABCDEF";
var combos = PowerSet(src, 3, 6);
// hairy joins for great prettiness
Console.WriteLine(
string.Join(" , ",
combos.Select(subset =>
string.Concat("[",
string.Join(",", subset) , "]")))
);
}
Output:
[A,B,C] , [A,B,D] , [A,C,D] , [B,C,D] , [A,B,C,D] , [A,B,E] , [A,C,E] , [B,C,E] , [A,B,C,E] ,
[A,D,E] , [B,D,E] , [A,B,D,E] , [C,D,E] , [A,C,D,E] , [B,C,D,E] , [A,B,C,D,E] , [A,B,F] ,
[A,C,F] , [B,C,F] , [A,B,C,F] , [A,D,F] , [B,D,F] , [A,B,D,F] , [C,D,F] , [A,C,D,F] ,
[B,C,D,F] , [A,B,C,D,F] , [A,E,F] , [B,E,F] , [A,B,E,F] , [C,E,F] , [A,C,E,F] , [B,C,E,F] ,
[A,B,C,E,F] , [D,E,F] , [A,D,E,F] , [B,D,E,F] , [A,B,D,E,F] , [C,D,E,F] , [A,C,D,E,F] ,
[B,C,D,E,F] , [A,B,C,D,E,F]
The best way to do this is to use a for loop and convert each character from an int to a char and concatenate them together in a string.
For example:
for(int i = 0; i < 26; i++)
{
Console.WriteLine((char)(i + 'A')); // add first, then cast; (char)i + 'A' would print an int
}
Supposing you also want stuff like "AAB", the "cross product" of your set of letters should be it.
Generation can be as simple as a LINQ:
string myset = "ABCDE";
var All = (from char l1 in myset
from char l2 in myset
from char l3 in myset
select new string(new char[] { l1, l2, l3})).ToList();
Note: constructing many strings and char arrays is not fast. You may want to replace the new string and new char[] with a custom class like so:
select new MyCustomClass(l1, l2, l3)).ToList();
If you don't want things like "AAB" (or "EEL") then I'd point you to wikipedia for "combinations".
To get from fixed-length to "any length from 3 to 6" join multiple sets, if the limits are dynamic then use a loop.
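If each letter of the input may be used only once per result but the order matters (as in 'ABC', 'BAC', 'CAB' from the question), a small recursive generator covering every length from 3 to 6 might look like this. It is a sketch only: `Arrangements` and `Generate` are invented names, and repeated letters in the input will produce repeated results.

```csharp
using System;
using System.Collections.Generic;

public static class Arrangements
{
    // Yields every ordered arrangement of the input letters whose length is
    // between minLength and maxLength, using each input position at most once.
    public static IEnumerable<string> Generate(string letters, int minLength, int maxLength)
    {
        var used = new bool[letters.Length];
        var current = new char[Math.Min(maxLength, letters.Length)];
        foreach (string s in Extend(letters, used, current, 0, minLength))
            yield return s;
    }

    private static IEnumerable<string> Extend(string letters, bool[] used,
                                              char[] current, int depth, int minLength)
    {
        // Everything built so far is a valid result once it is long enough.
        if (depth >= minLength)
            yield return new string(current, 0, depth);
        if (depth == current.Length)
            yield break;

        for (int i = 0; i < letters.Length; i++)
        {
            if (used[i]) continue;      // this position is already part of the prefix
            used[i] = true;
            current[depth] = letters[i];
            foreach (string s in Extend(letters, used, current, depth + 1, minLength))
                yield return s;
            used[i] = false;            // backtrack
        }
    }
}
```

For "ABCDEF" with lengths 3 to 6 this yields 120 + 360 + 720 + 720 = 1920 strings, each of which could then be checked against the dictionary.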

How to find all the different combinations as units of different lengths of the characters of a string

Hi, I want to find all the different ways to split a given string into consecutive units of different sizes, without changing the order of its characters. Example:
Let's say the word is "HAVING".
Then it can have combinations like these (spaces separate the individual units):
HA VI N G
HAV ING
H AV I N G
HAVIN G
H AVING
H AVIN G
HA VING
H AVI NG
....
And so on, for all the different selections of units of different lengths.
Can someone give some prototype code or an idea for an algorithm?
Thanks,
Kalyana
In a string of size n, you have n-1 positions where you could place spaces (= between each pair of consecutive letters). Thus, you have 2^(n-1) options, each represented by a binary number with n-1 digits, e.g., your first example would be 01011.
That should give you enough information to get started (no full solution; this sounds like homework).
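The binary-number idea can be sketched in C# as follows. The names `Splitter` and `AllSplits` are hypothetical, and the code assumes the word has at most 31 characters so that the mask fits in an int. Note that here bit i (counting from the least significant bit) controls the gap after letter i, so the bits read in the opposite direction from the 01011 notation above.

```csharp
using System.Collections.Generic;
using System.Text;

public static class Splitter
{
    // Yields every way to split 'word' into consecutive units, with the
    // units separated by single spaces. A word of n letters gives 2^(n-1) results.
    public static IEnumerable<string> AllSplits(string word)
    {
        if (string.IsNullOrEmpty(word))
            yield break;

        int gaps = word.Length - 1; // positions where a space may be placed
        for (int mask = 0; mask < (1 << gaps); mask++)
        {
            var sb = new StringBuilder();
            for (int i = 0; i < word.Length; i++)
            {
                sb.Append(word[i]);
                // Bit i of the mask decides whether a space follows letter i.
                if (i < gaps && (mask & (1 << i)) != 0)
                    sb.Append(' ');
            }
            yield return sb.ToString();
        }
    }
}
```

For "HAVING" this produces 32 strings, starting with the unsplit word (mask 0) and ending with "H A V I N G" (all bits set).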
Simple recursive solution. Two sets are the first letter and the rest of the word. Find all combinations on the rest of the word. Then put the second letter with the first, and find all combinations on the rest of the word. Repeat until the rest of the word is 1 letter.
It's just a power set isn't it? For every position between two letters in your string you do or do not have a space. So for a string of length 6 there are 2 to the power of 5 possibilities.
A simple recursive function will enumerate all the possibilities. Do you need help writing such a function or were you just looking for the algorithm?
Between each letter, there's either a separator or there isn't. So recursively go through the word, trying out all combinations of separators/no-separators.
function f( word )
set = word[0] + f( substring(word, 1) )
set += word[0] + " " + f( substring(word, 1) )
return set;
My solution
Make it something like
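#include <iostream>
#include <string>
using namespace std;

// Prints every way to split s into consecutive units;
// 'done' holds the units chosen so far, separated by spaces.
void func(const string& s, const string& done = "")
{
    if (s.empty())
    {
        cout << done << "\n";
        return;
    }
    // Take the first i characters as the next unit, then recurse on the rest.
    for (size_t i = 1; i <= s.length(); i++)
        func(s.substr(i), done + s.substr(0, i) + " ");
}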
Recursion. In Java:
public static List<List<String>> split (String str) {
List<List<String>> res = new ArrayList<List<String>>();
if (str == null) {
return res;
}
for (int i = 0; i < str.length() - 1; i++) {
for (List<String> list : split(str.substring(i + 1))) {
List<String> tmpList = new ArrayList<String>();
tmpList.add(str.substring(0, i + 1));
for (String s : list) {
tmpList.add(s);
}
res.add(tmpList);
}
}
List<String> tmpList = new ArrayList<String>();
tmpList.add(str);
res.add(tmpList);
return res;
}
public static void main(String[] args) {
for (List<String> intermed : split("HAVING")) {
for (String str : intermed) {
System.out.print(str);
System.out.print(" ");
}
System.out.println();
}
}
