Counting the Frequency of Specific Words in Text File - c#

I have a text file stored as a string variable. The text file is processed so that it only contains lowercase words and spaces. Now, say I have a static dictionary, which is just a list of specific words, and I want to count, from within the text file, the frequency of each word in the dictionary. For example:
Text file:
i love love vb development although i m a total newbie
Dictionary:
love, development, fire, stone
The output I'd like to see is something like the following, listing both the dictionary word and its count. If it makes coding simpler, it can also only list the dictionary word that appeared in the text.
===========
WORD, COUNT
love, 2
development, 1
fire, 0
stone, 0
============
Using a regex (e.g. "\w+") I can get all the word matches, but I don't know how to count only the words that are also in the dictionary, so I'm stuck. Efficiency is crucial here since the dictionary is quite large (~100,000 words) and the text files are not small either (~200kb each).
I appreciate any kind help.

You can count the words in the string by grouping them and turning it into a dictionary:
Dictionary<string, int> count =
theString.Split(' ')
.GroupBy(s => s)
.ToDictionary(g => g.Key, g => g.Count());
Now you can check whether each dictionary word exists in this count dictionary, and show its count if it does.
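A minimal sketch of that lookup step, assuming the text is a single space-separated string as in the question. The class and method names (WordFrequency, CountDictionaryWords) are made up for illustration and are not part of the original answer. Words that never appear in the text are reported with a count of 0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFrequency
{
    // Hypothetical helper: returns one (word, count) pair per dictionary word,
    // with 0 for words that do not occur in the text.
    public static List<KeyValuePair<string, int>> CountDictionaryWords(
        string text, IEnumerable<string> dictionaryWords)
    {
        // Count every word of the text once (a single pass over the text).
        Dictionary<string, int> counts = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

        // Each check is a hash lookup, so a long word list stays cheap.
        return dictionaryWords
            .Select(w => new KeyValuePair<string, int>(
                w, counts.TryGetValue(w, out int n) ? n : 0))
            .ToList();
    }
}
```

Calling WordFrequency.CountDictionaryWords with the example text and the words love, development, fire and stone returns the pairs in the same order as the word list.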

var dict = new Dictionary<string, int>();
// file is the text as a single string, so split it into words first
foreach (var word in file.Split(' '))
    if (dict.ContainsKey(word))
        dict[word]++;
    else
        dict[word] = 1;

Using Groovy's regex facility, I would do it as below:
def input="""
i love love vb development although i m a total newbie
"""
def dictionary=["love", "development", "fire", "stone"]
dictionary.each{
def pattern= ~/${it}/
match = input =~ pattern
println "${it}" + "-"+ match.count
}

Try this. The words variable is obviously your string of text. The keywords array is a list of keywords you want to count.
This won't return a 0 for dictionary words that aren't in the text, but you specified that this behavior is okay. This should give you relatively good performance while meeting the requirements of your application.
string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };
Regex regex = new Regex("\\w+");
var frequencyList = regex.Matches(words)
.Cast<Match>()
.Select(c => c.Value.ToLowerInvariant())
.Where(c => keywords.Contains(c))
.GroupBy(c => c)
.Select(g => new { Word = g.Key, Count = g.Count() })
.OrderByDescending(g => g.Count)
.ThenBy(g => g.Word);
//Convert to a dictionary
Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);
//Or iterate through them as is
foreach (var item in frequencyList)
Response.Write(String.Format("{0}, {1}", item.Word, item.Count));
Since you indicated that everything is lower case and separated by spaces, you can achieve the same thing without Regex by modifying the above code like so:
string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };
var frequencyList = words.Split(' ')
.Select(c => c)
.Where(c => keywords.Contains(c))
.GroupBy(c => c)
.Select(g => new { Word = g.Key, Count = g.Count() })
.OrderByDescending(g => g.Count)
.ThenBy(g => g.Word);
Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);
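One thing to watch with a large keyword list: keywords.Contains(c) on a string[] scans the array for every word in the text. A possible variation, sketched here under the assumption that the text is already lower case and space-separated, loads the keywords into a HashSet<string> so each membership check is a hash lookup. The names KeywordCounter and Count are invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordCounter
{
    // Hypothetical helper: counts only the words that are in the keyword set.
    public static Dictionary<string, int> Count(string words, IEnumerable<string> keywords)
    {
        // Build the set once; Contains is then a hash lookup instead of an array scan.
        var keywordSet = new HashSet<string>(keywords, StringComparer.Ordinal);
        var counts = new Dictionary<string, int>();

        foreach (string word in words.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!keywordSet.Contains(word)) continue;

            // TryGetValue leaves current at 0 when the word has not been seen yet.
            counts.TryGetValue(word, out int current);
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

As in the answer above, keywords that never occur in the text are simply absent from the result.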

Related

The most common word in spaceless string

I have a very long string of text that is many words separated by camelCase like so:
AedeagalAedilityAedoeagiAefaldnessAegeriidaeAeginaAeipathyAeneolithicAeolididaeAeonialAerialityAerinessAerobia
I need to find the most common word and the number of times it has been used. I don't know how to do this because there are no spaces, and I am new to C#.
I have tried many methods but none seem to work. I'd be very grateful for any advice you have.
I have a github repo with the file being downloaded and a few tests already done here: https://github.com/Imstupidpleasehelp/C-code-test
Thank you.

You can try querying the string with a help of regular expressions and Linq:
string source = ...
var result = Regex
.Matches(source, "[A-Z][a-z]*")
.Cast<Match>()
.Select(match => match.Value)
.GroupBy(word => word)
.Select(group => (word : group.Key, count : group.Count()))
.OrderByDescending(pair => pair.count)
.First();
Console.Write($"{result.word} appears {result.count} time(s)");

string[] split = Regex.Split(exampleString, "(?<=[A-Za-z])(?=[A-Z][a-z])");
var result = split.GroupBy(s => s)
.Where(g=> g.Count()>=1 )
.OrderByDescending(g => g.Count())
.Select(g => new{ Word = g.Key, Occurrences = g.Count()});
result will contain pairs of (Word, Occurrences) for all words.
If you want just the first one (the one with the most occurrences) use
var result = split.GroupBy(s => s)
.Where(g=> g.Count()>=1 )
.OrderByDescending(g => g.Count())
.Select(g => new{ Word = g.Key, Occurrences = g.Count()}).First();
Keep in mind that two or more words can have the same number of occurrences, in which case First() gives you only one of them.

A non-linq approach using for loop and IsUpper to separate the words.
string data = "AedeagalAedilityAedoeagiAefaldness";
var words = new List<string>();
var temp = new StringBuilder();
for(int i = 0;i < data.Length;i++)
{
temp.Append(data[i]);
if (i == data.Length-1 || char.IsUpper(data[i+1]))
{
words.Add(temp.ToString());
temp.Clear();
}
}
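The loop above only fills the words list. A possible follow-up step, sketched here with invented names (MostCommon, Find), counts the entries and returns every word that shares the highest count, so ties are not silently dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MostCommon
{
    // Hypothetical helper: returns all words tied for the highest count, plus that count.
    public static (List<string> Words, int Count) Find(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            // TryGetValue leaves current at 0 for a word seen for the first time.
            counts.TryGetValue(word, out int current);
            counts[word] = current + 1;
        }

        if (counts.Count == 0) return (new List<string>(), 0);

        int max = counts.Values.Max();
        List<string> top = counts.Where(p => p.Value == max).Select(p => p.Key).ToList();
        return (top, max);
    }
}
```

With the list built by the loop, MostCommon.Find(words) gives the most frequent word or words together with the number of occurrences.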

Identifying and grouping similar items in a collection of strings

I have a collection of strings like the following:
List<string> codes = new List<string>
{
"44.01", "44.02", "44.03", "44.04", "44.05", "44.06", "44.07", "44.08", "46", "47.10"
};
Each string is made up of two components separated by a full stop - a prefix code and a subcode. Some of the strings don't have sub codes.
I want to be able to combine the strings whose prefixes are the same and output them as follows, along with the other codes:
44(01,02,03,04,05,06,07,08),46,47.10
I'm stuck at the first hurdle of this, which is how to identify and group together the codes whose prefix values are the same, so that I can combine them into a single string as you can see above.

You can do:
var query = codes.Select(c =>
new
{
SplitArray = c.Split('.'), //to avoid multiple split
Value = c
})
.Select(c => new
{
Prefix = c.SplitArray.First(), //you can avoid multiple split if you split first and use it later
PostFix = c.SplitArray.Last(),
Value = c.Value,
})
.GroupBy(r => r.Prefix)
.Select(grp => new
{
Key = grp.Key,
Items = grp.Count() > 1 ? String.Join(",", grp.Select(t => t.PostFix)) : "",
Value = grp.First().Value,
});
This is how it works:
Split each item in the list on the delimiter and populate an anonymous type with the prefix, postfix and original value.
Then group on Prefix.
After that, select the key and combine the postfix values using string.Join.
For output:
foreach (var item in query)
{
if(String.IsNullOrWhiteSpace(item.Items))
Console.WriteLine(item.Value);
else
Console.WriteLine("{0}({1})", item.Key, item.Items);
}
Output would be:
44(01,02,03,04,05,06,07,08)
46
47.10

Try this:-
var result = codes.Select(x => new { SplitArr = x.Split('.'), OriginalValue = x })
.GroupBy(x => x.SplitArr[0])
.Select(x => new
{
Prefix= x.Key,
subCode = x.Count() > 1 ?
String.Join(",", x.Select(z => z.SplitArr[1])) : "",
OriginalValue = x.First().OriginalValue
});
You can print your desired output like this:-
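foreach (var item in result)
{
    // A group with a single code has an empty subCode, so print the original value instead
    if (item.subCode == "")
        Console.Write("{0},", item.OriginalValue);
    else
        Console.Write("{0}({1}),", item.Prefix, item.subCode);
}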

Outlined idea:
Use Dictionary<string, List<string>> for collecting your result
in a loop over your list, use string.Split() .. the first element will be your Dictionary key ... create a new List<string> there if the key doesn't exist yet
if the result of split has a second element, append that to the List
use a second loop to format that Dictionary to your output string
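The outlined steps could look roughly like the following. This is only a sketch: the names CodeGrouper and Combine are invented here, and the extra list that remembers the order in which prefixes first appear is an assumption about the desired output order, not something the outline specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CodeGrouper
{
    // Hypothetical helper following the outline: collect subcodes per prefix, then format.
    public static string Combine(IEnumerable<string> codes)
    {
        var groups = new Dictionary<string, List<string>>();
        var order = new List<string>(); // prefixes in first-seen order (assumption)

        foreach (string code in codes)
        {
            string[] parts = code.Split('.');
            if (!groups.TryGetValue(parts[0], out List<string> subcodes))
            {
                subcodes = new List<string>();
                groups[parts[0]] = subcodes;
                order.Add(parts[0]);
            }
            // Only codes of the form "prefix.subcode" contribute a subcode.
            if (parts.Length > 1) subcodes.Add(parts[1]);
        }

        // Second pass: format each prefix according to how many subcodes it has.
        IEnumerable<string> formatted = order.Select(prefix =>
        {
            List<string> subcodes = groups[prefix];
            if (subcodes.Count == 0) return prefix;
            if (subcodes.Count == 1) return prefix + "." + subcodes[0];
            return prefix + "(" + string.Join(",", subcodes) + ")";
        });
        return string.Join(",", formatted);
    }
}
```

Because the subcodes are kept as strings, leading zeros such as 01 are preserved.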
Of course, linq is possible too, e.g.
List<string> codes = new List<string>() {
"44.01", "44.05", "47", "42.02", "44.03" };
var result = string.Join(",",
codes.OrderBy(x => x)
.Select(x => x.Split('.'))
.GroupBy(x => x[0])
.Select((x) =>
{
if (x.Count() == 0) return x.Key;
else if (x.Count() == 1) return string.Join(".", x.First());
else return x.Key + "(" + string.Join(",", x.Select(e => e[1]).ToArray()) + ")";
}).ToArray());
Gotta love linq ... haha ... I think this is a monster.

You can do it all in one clever LINQ:
var grouped = codes.Select(x => x.Split('.'))
.Select(x => new
{
Prefix = int.Parse(x[0]),
Subcode = x.Length > 1 ? int.Parse(x[1]) : (int?)null
})
.GroupBy(k => k.Prefix)
.Select(g => new
{
Prefix = g.Key,
Subcodes = g.Where(s => s.Subcode.HasValue).Select(s => s.Subcode)
})
.Select(x =>
x.Prefix +
(x.Subcodes.Count() == 1 ? string.Format(".{0}", x.Subcodes.First()) :
x.Subcodes.Count() > 1 ? string.Format("({0})", string.Join(",", x.Subcodes))
: string.Empty)
).ToArray();
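First, it splits each string into a code and a subcode.
Then it groups by code and collects all subcodes of each code.
Finally, it selects each group in the appropriate format.
Note that int.Parse drops leading zeros, so a subcode such as 01 is printed as 1.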
Looking at the problem, I think you should stop just before the last Select and let the data presentation be done in another part/method of your application.

The old fashioned way:
List<string> codes = new List<string>() {"44.01", "44.05", "47", "42.02", "44.03" };
string output = "";
for (int i = 0; i < codes.Count; i++)
{
    // Appending ".." guarantees that items[1] exists even for codes without a subcode
    string[] items = (codes[i] + "..").Split('.');
    int pos1 = output.IndexOf("," + items[0] + "(");
    if (pos1 < 0) output += "," + items[0] + "(" + items[1] + ")"; // first occurrence of code: add it
    else
    {   // Code already inserted: find the insert point (the closing parenthesis of its group)
        int pos2 = output.IndexOf(')', pos1);
        output = output.Substring(0, pos2) + "," + items[1] + output.Substring(pos2);
    }
}
if (output.Length > 0) output = output.Substring(1).Replace("()", "");

This will work, including the correct formats for no subcodes, a single subcode, multiple subcodes. It also doesn't assume the prefix or subcodes are numeric, so it leaves leading zeros as is. Your question didn't show what to do in the case you have a prefix without subcode AND the same prefix with subcode, so it may not work in that edge case (44,44.01). I have it so that it ignores the prefix without subcode in that edge case.
List<string> codes = new List<string>
{
"44.01", "44.02", "44.03", "44.04", "44.05", "44.06", "44.07", "44.08", "46", "47.10"
};
var result=codes.Select(x => (x+".").Split('.'))
.Select(x => new
{
Prefix = x[0],
Subcode = x[1]
})
.GroupBy(k => k.Prefix)
.Select(g => new
{
Prefix = g.Key,
Subcodes = g.Where(s => s.Subcode!="").Select(s => s.Subcode)
})
.Select(x =>
x.Prefix +
(x.Subcodes.Count() == 0 ? string.Empty :
string.Format(x.Subcodes.Count()>1?"({0})":".{0}",
string.Join(",", x.Subcodes)))
).ToArray();

This is the general idea, but I'm sure replacing the Substring calls with Regex would be a lot better as well:
List<string> newCodes = new List<string>();
foreach (string sub1 in codes.Select(item => item.Substring(0, 2)).Distinct())
{
    StringBuilder code = new StringBuilder();
    code.Append(sub1 + "(");
    // Substring(2) keeps the leading '.', so trim it; Join avoids a trailing comma
    code.Append(string.Join(",", codes.Where(item => item.Substring(0, 2) == sub1)
        .Select(item => item.Substring(2).TrimStart('.'))));
    code.Append(")");
    newCodes.Add(code.ToString());
}

You could go a couple ways... I could see you making a Dictionary<string,List<string>> so that you could have "44" map to a list of {".01", ".02", ".03", etc.} This would require you processing the codes before adding them to this list (i.e. separating out the two parts of the code and handling the case where there is only one part).
Or you could put them into a SortedSet and provide your own IComparer<string> that knows these are codes and how to sort them (at least that would be more reliable than grouping them alphabetically). Iterating over this SortedSet would still require special logic, though, so perhaps the Dictionary-to-List option above is still preferable.
In either case you would still need to handle a special case "46" where there is no second element in the code. In the dictionary example, would you insert a String.Empty into the list? Not sure what you'd output if you got a list {"46", "46.1"} -- would you display as "46(null,1)" or... "46(0,1)"... or "46(,1)" or "46(1)"?

Counting Word Occurrences

Below I have code that takes a test file, splits it into two groups: Apollo and Sabre, and is supposed to tell me how many times the word "Processed" is used in each group, however whenever I run this, it just tells me how many lines the file is, which I already know. Could someone please explain why this is not working and a solution on how to fix this?
var lines1 = File.ReadLines(path);
List<string> apollo = lines1.Take(7678).ToList();
List<string> sabre = lines1.Skip(7678).Take(5292).ToList();
var g = apollo.GroupBy(i => i);
foreach (var grp in g)
{
Console.WriteLine("{0} {1}", grp.Key, grp.Count());
}

Perhaps you need to actually check for the value:
var g = apollo
.Where(line => line == "Processed")
.GroupBy(i => i);
However - perhaps you can just use Count()
var apolloCount = apollo.Count(line => line == "Processed");
var sabreCount = sabre.Count(line => line == "Processed");
If the lines contain multiple words (unclear from your question), you can do something like this:
var apolloCount = apollo
.SelectMany(line => line.Split(' ')) //Get the individual words from the line
.Count(word => word == "Processed");
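If the word can sit next to punctuation (for example "Processed," or "Processed."), splitting on spaces would miss those occurrences. A possible alternative, sketched here with invented names (OccurrenceCounter, CountWord), counts whole-word regex matches per line and sums them. The match is case-sensitive, which is an assumption about what is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OccurrenceCounter
{
    // Hypothetical helper: total whole-word occurrences of a word across all lines.
    public static int CountWord(IEnumerable<string> lines, string word)
    {
        // \b anchors the match at word boundaries; Regex.Escape guards special characters.
        var pattern = new Regex(@"\b" + Regex.Escape(word) + @"\b");
        return lines.Sum(line => pattern.Matches(line).Count);
    }
}
```

It would be called once per group, e.g. OccurrenceCounter.CountWord(apollo, "Processed") and OccurrenceCounter.CountWord(sabre, "Processed").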

C# File to Dictionary, but taking pairs of words

I am thinking about making a dictionary that contains words pairs as well as single words from a file.
Standard "single word" looks like:
private Dictionary<string, int> tempDict = new Dictionary<string, int>();
private void GetWords(string[] file)
{
tempDict = file
.SelectMany(i => File.ReadLines(i)
.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)))
.GroupBy(word => word)
.ToDictionary(g => g.Key, g => g.Count());
}
And the string:
Adam likes coffee
will be:
Adam ; likes ; coffee
But I want to make it so it matches pairs as well (but only the neighbouring ones) so it would look like:
Adam ; Adam likes ; likes ; likes coffee ; coffee
I am not sure if it is manageable to do, and need some help with this one.

MoreLINQ has a Pairwise extension method (MoreEnumerable.Pairwise), which takes the current value, its predecessor and a projection function.
Returns a sequence resulting from applying a function to each element in the source sequence and its predecessor, with the exception of the first element which is only returned as the predecessor of the second element.
Concatenating that with the original split value array would output:
var sentence = "Adam likes coffee";
var splitWords = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
var pairWise = splitWords.Pairwise((first, second) => string.Format("{0} {1}", first,
second))
.Concat(splitWords)
.GroupBy(x => x)
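.ToDictionary(x => x.Key, x => x.Count());
This would result in a dictionary with the keys "Adam likes", "likes coffee", "Adam", "likes" and "coffee", each with a count of 1.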
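If adding MoreLINQ is not an option, the same neighbouring pairs can be produced with the standard Zip and Skip operators. This is a sketch under that assumption, not the MoreLINQ implementation; the names WordPairs and CountWordsAndPairs are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordPairs
{
    // Hypothetical helper: counts single words and neighbouring word pairs of one sentence.
    public static Dictionary<string, int> CountWordsAndPairs(string sentence)
    {
        string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Zip each word with its successor; the last word has no successor and forms no pair.
        IEnumerable<string> pairs = words.Zip(words.Skip(1), (first, second) => first + " " + second);

        return words.Concat(pairs)
            .GroupBy(x => x)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

To apply it to files as in the question, it would need to be called per line (or per file), merging the resulting counts, since pairs should not span unrelated lines.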

Using Dictionary to count the number of appearances

My problem is that I am trying to take a body of text from a text box for example
"Spent the day with "insert famous name" '#excited #happy #happy"
then I want to count how many times each hashtag appears in the body, which can be any length of text.
so the above would return this
excited = 1
happy = 2
I was planning on using a dictionary, but I am not sure how I would implement the search for the hashtags and add them to the dictionary.
This is all I have so far
string body = txtBody.Text;
Dictionary<string, string> dic = new Dictionary<string, string>();
foreach(char c in body)
{
}
thanks for any help

This can be achieved with a couple of LINQ methods:
var text = "Spent the day with <insert famous name> #excited #happy #happy";
var hashtags = text.Split(new[] { ' ' })
.Where(word => word.StartsWith("#"))
.GroupBy(hashtag => hashtag)
.ToDictionary(group => group.Key, group => group.Count());
Console.WriteLine(string.Join("; ", hashtags.Select(kvp => kvp.Key + ": " + kvp.Value)));
This will print
#excited: 1; #happy: 2

This will find any hashtags in a string of the form a hash followed by one or more non-whitespace characters and create a dictionary of them versus their count.
You did mean Dictionary<string, int> really, didn't you?
var input = "Spent the day with \"insert famous name\" '#excited #happy #happy";
Dictionary<string, int> dic =
Regex
.Matches(input, @"(?<=\#)\S+")
.Cast<Match>()
.Select(m => m.Value)
.GroupBy(s => s)
.ToDictionary(g => g.Key, g => g.Count());
