I'm trying to build a concordance.
I have a dictionary that holds each word and the number of times it appears in the text.
Now I also need to store the numbers of the lines on which each word occurs.
To do that, I plan to create a container that stores every line.
Something like this:
List<String> eachLine = new List<string>();
using (var strReader = new StreamReader(@"pathToFile/Text.txt"))
{
string line;
while ((line = strReader.ReadLine()) != null)
{
eachLine.Add(line);
}
}
Here is the dictionary:
Dictionary<string, int> concordanceDictionary = new Dictionary<string, int>();
string lines = File.ReadAllText(@"path:Text.txt").ToLower();
string[] words = SplitWords(lines);
foreach (var word in words)
{
int i = 1;
if (!concordanceDictionary.ContainsKey(word))
{
concordanceDictionary.Add(word, i);
}
else
{
concordanceDictionary[word]++;
}
}
var list =concordanceDictionary.Keys.ToList();
list.Sort();
To store the line numbers, I'll create a List for each word in the dictionary and fill it with the indexes of the lines where that word occurs, using the Contains method to check whether the word is in List<String> eachLine.
The problem is: how do I display each word together with its list of line numbers?
Maybe you can suggest a more elegant and easier way to do it.
Created a console app for you to run
class Program
{
static void Main(string[] args)
{
ReadTextToDictionary read = new ReadTextToDictionary();
var strings = read.TextToListString(@"C:\stackoverflow\first.txt");
var dictionarys = read.TextToDictionaryString(@"C:\stackoverflow\second.txt");
foreach(var s in strings) {
var compare = dictionarys.Where(a=>a.Value.Contains(s.ToString()));
foreach(var f in compare)
{
Console.WriteLine(s+" in line "+f.Key.ToString() + " " + f.Value);
}
}
Console.ReadKey();
}
}
class ReadTextToDictionary
{
public List<string> TextToListString(string path){
var lines = System.IO.File.ReadAllLines(path);
return lines.ToList();
}
public Dictionary<int,string> TextToDictionaryString(string path)
{
Dictionary<int, string> dstr = new Dictionary<int, string>();
var lines = System.IO.File.ReadAllLines(path);
int count = 0;
foreach (var s in lines)
{
count++;
dstr.Add(count, s);
}
return dstr;
}
}
I would use a Dictionary<String,List<Int32>> where the key is String which is the current word, and the List<Int32> is the list of line-numbers the word appears on. To get the occurrence count just dereference the list's Count property: dictionary[ word ].Count.
As an aside, you don't need to read everything into memory at once (as String[] instances). You can simply read through the file character-by-character and identify whitespace and line-breaks.
To that end, this is my implementation:
void Run() {
Dictionary< String, List<Int32> > dict = new Dictionary< String, List<Int32> >();
foreach(Tuple<String,Int32> wordOccurrence in GetWords()) {
String word = wordOccurrence.Item1;
Int32 line = wordOccurrence.Item2;
if( !dict.ContainsKey( word ) ) dict.Add( word , new List<Int32>() );
dict[ word ].Add( line );
}
foreach(String word in dict.Keys) {
Console.WriteLine("\"{0}\" appeared {1} times, on these lines:", word, dict[word].Count);
foreach(Int32 line in dict[word]) Console.WriteLine("\t{0}", line );
Console.WriteLine("");
}
}
IEnumerable< Tuple<String,Int32> > GetWords() {
using(StreamReader rdr = new StreamReader("fileName")) {
StringBuilder sb = new StringBuilder();
Int32 nc; Char c;
Int32 lineNumber = 0; // zero-based: the first line is line 0
while( (nc = rdr.Read()) != -1 ) {
c = (Char)nc;
if( Char.IsWhiteSpace(c) ) {
if( sb.Length > 0 ) {
yield return Tuple.Create( sb.ToString(), lineNumber );
sb.Length = 0;
}
if( c == '\n' ) lineNumber++;
} else {
sb.Append( c );
}
}
if( sb.Length > 0 ) yield return Tuple.Create( sb.ToString(), lineNumber );
}
}
One way to do it would be to store each line where the word occurs in a list, as the value portion of a dictionary with the word as a key.
In other words, you would have a Dictionary<string, List<string>> where the key is a word and the associated list holds all the lines containing the word.
This way, you can quickly access the lines and you get the number of occurrences for free (dict[someWord].Count).
For example:
// words dictionary has a word key and a list of lines containing the word
var words = new Dictionary<string, List<string>>();
using (var strReader = new StreamReader(@"pathToFile/Text.txt"))
{
string line;
// Read each line
while ((line = strReader.ReadLine()) != null)
{
// Get each word from the line
var wordsInLine = line.ToLower().Split(' ');
foreach (var word in wordsInLine)
{
// If this word already exists, add this line to its list
if (words.ContainsKey(word))
{
words[word].Add(line);
}
// Otherwise, add a new entry whose list starts with this line
else
{
words.Add(word, new List<string> {line});
}
}
}
}
}
And if you really wanted to get all the lines, you could either add them to a list in the loop above, or do something like this:
var allLines = words.SelectMany(w => w.Value).Distinct().ToList();
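To show each word next to the line numbers it occurs on, one option is to store line numbers rather than line text and keep both the words and the numbers sorted. The following is only a sketch under some assumptions: the names ConcordanceSketch, Build and Format are made up for illustration, words are split on spaces and tabs only, and lines are numbered from 1. Because a SortedSet drops duplicates, its Count gives the number of lines a word appears on, not the total number of occurrences.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ConcordanceSketch
{
    // Maps each lower-cased word to the sorted set of 1-based line numbers it appears on.
    public static SortedDictionary<string, SortedSet<int>> Build(IEnumerable<string> lines)
    {
        var index = new SortedDictionary<string, SortedSet<int>>(StringComparer.Ordinal);
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            // Assumption: words are separated by spaces or tabs only.
            string[] wordsInLine = line.ToLowerInvariant()
                .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in wordsInLine)
            {
                SortedSet<int> lineNumbers;
                if (!index.TryGetValue(word, out lineNumbers))
                {
                    lineNumbers = new SortedSet<int>();
                    index.Add(word, lineNumbers);
                }
                lineNumbers.Add(lineNumber); // a repeated line number is ignored
            }
        }
        return index;
    }

    // Produces one display string per word, in alphabetical (ordinal) order.
    public static IEnumerable<string> Format(SortedDictionary<string, SortedSet<int>> index)
    {
        return index.Select(entry => entry.Key + ": " + string.Join(", ", entry.Value));
    }
}
```

It could be called with File.ReadLines(path) as the input and each formatted string passed to Console.WriteLine.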
Hi there, wonderful people of Stack Overflow.
I am currently totally stuck. What I want to do is take a word out of a text and replace it with a synonym. I thought about it for a while and figured out how to do it if I have ONLY one possible synonym, with this code:
string pathToDesk = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
string text = System.IO.File.ReadAllText(pathToDesk + "/Text.txt");
string replacementsText = System.IO.File.ReadAllText(pathToDesk + "/Replacements.txt");
string wordsToReplace = System.IO.File.ReadAllText(pathToDesk + "/WordsToReplace.txt");
string[] words = text.Split(' ');
string[] reWords = wordsToReplace.Split(' ');
string[] replacements = replacementsText.Split(' ');
for(int i = 0; i < words.Length; i++) {//for each word
for(int j = 0; j < replacements.Length; j++) {//compare with the possible synonyms
if (words[i].Equals(reWords[j], StringComparison.InvariantCultureIgnoreCase)) {
words[i] = replacements[j];
}
}
}
string newText = "";
for(int i = 0; i < words.Length; i++) {
newText += words[i] + " ";
}
txfInput.Text = newText;
But let's say that we get the word hi. Then I want to be able to replace it with any of {"Hello","Yo","Hola"} (for example).
Then my code is no good, since the words and their synonyms will no longer have the same positions in the arrays.
Is there a smart solution to this? I would really like to know.
You need to store your synonyms differently.
In your file you need something like:
hello yo hola hi
awesome fantastic great
Then, for each line, split the words and put them in an array, giving you an array of arrays.
Now use that to find replacement words.
This won't be super optimized, but you can easily index each word to a group of synonyms as well.
Something like:
public class SynonymReplacer
{
private Dictionary<string, List<string>> _synonyms;
public void Load(string s)
{
_synonyms = new Dictionary<string, List<string>>();
var lines = s.Split(new[] {'\r', '\n'}, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
var words = line.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries).ToList();
words.ForEach(word => _synonyms.Add(word, words));
}
}
public string Replace(string word)
{
if (_synonyms.ContainsKey(word))
{
return _synonyms[word].OrderBy(a => Guid.NewGuid())
.FirstOrDefault(w => w != word) ?? word;
}
return word;
}
}
The OrderBy(a => Guid.NewGuid()) call shuffles the group, so you get a random synonym.
Then:
var s = new SynonymReplacer();
s.Load("hi hello yo hola\r\nawesome fantastic great\r\n");
Console.WriteLine(s.Replace("hi"));
Console.WriteLine(s.Replace("ok"));
Console.WriteLine(s.Replace("awesome"));
var words = new string[] {"hi", "you", "look", "awesome"};
Console.WriteLine(string.Join(" ", words.Select(s.Replace)));
and you get something like this (the synonym is picked at random, so the output varies between runs):
hello
ok
fantastic
hello you look fantastic
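Sorting by Guid.NewGuid() shuffles the whole group just to take one element. A possible alternative, sketched below, filters out the original word and picks a random index directly. The names RandomSynonymSketch and Pick are hypothetical, and the caller is assumed to supply and reuse a single Random instance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RandomSynonymSketch
{
    // Returns a random entry of the group other than the word itself,
    // or the word unchanged when the group offers no alternative.
    public static string Pick(IReadOnlyList<string> group, string word, Random random)
    {
        List<string> candidates = group.Where(w => w != word).ToList();
        if (candidates.Count == 0)
        {
            return word;
        }
        return candidates[random.Next(candidates.Count)];
    }
}
```

Inside SynonymReplacer.Replace this could stand in for the OrderBy/FirstOrDefault expression, with _synonyms[word] passed as the group.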
Your first task will be to build a list of words and synonyms. A Dictionary will be perfect for this. The text file containing this list might look like this:
word1|synonym11,synonym12,synonym13
word2|synonym21,synonym22,synonym23
word3|synonym31,synonym32,synonym33
Then you can construct the dictionary like this:
public Dictionary<string, string[]> GetSynonymSet(string synonymSetTextFileFullPath)
{
var dict = new Dictionary<string, string[]>();
string line;
// Read the file line by line and add each word with its synonyms.
using (var file = new StreamReader(synonymSetTextFileFullPath))
{
while((line = file.ReadLine()) != null)
{
var split = line.Split('|');
if (!dict.ContainsKey(split[0]))
{
dict.Add(split[0], split[1].Split(','));
}
}
}
return dict;
}
The replacement code will then look like this:
public string ReplaceWordsInText(Dictionary<string, string[]> synonymSet, string text)
{
var newText = new StringBuilder();
string[] words = text.Split(' ');
for (int i = 0; i < words.Length; i++) //for each word
{
string[] synonyms;
if (synonymSet.TryGetValue(words[i], out synonyms))
{
// The exact synonym you wish to use is up to you.
// I will just use the first one
words[i] = synonyms[0];
}
newText.AppendFormat("{0} ", words[i]);
}
return newText.ToString();
}
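The question compares words with StringComparison.InvariantCultureIgnoreCase, while a default Dictionary<string, string[]> is case-sensitive, so "Hi" would not find an entry stored as "hi". The sketch below shows one way to keep the lookup case-insensitive by passing a comparer to the dictionary. SynonymTextSketch, Parse and Replace are illustrative names, the word|synonym1,synonym2 format is the one proposed above, and the choice of StringComparer.OrdinalIgnoreCase is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SynonymTextSketch
{
    // Parses lines of the form "word|synonym1,synonym2" into a case-insensitive lookup.
    public static Dictionary<string, string[]> Parse(IEnumerable<string> lines)
    {
        var dict = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in lines)
        {
            string[] split = line.Split('|');
            if (split.Length != 2 || dict.ContainsKey(split[0]))
            {
                continue; // skip malformed lines and repeated words
            }
            dict.Add(split[0], split[1].Split(','));
        }
        return dict;
    }

    // Replaces every space-separated word that has synonyms with its first synonym.
    public static string Replace(Dictionary<string, string[]> synonymSet, string text)
    {
        IEnumerable<string> replaced = text.Split(' ').Select(word =>
        {
            string[] synonyms;
            return synonymSet.TryGetValue(word, out synonyms) && synonyms.Length > 0
                ? synonyms[0]
                : word;
        });
        return string.Join(" ", replaced);
    }
}
```

Using string.Join also avoids the trailing space that appending "{0} " for every word leaves at the end of the text.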
I have a code file and I need to find all unique objects of type TADODataSet, but they aren't defined in this 30,000 line file I have.
I wrote a console application that splits each line into individual words and adds that word to a list if it contains ADODataSet (the naming convention prefix for the objects I'm interested in) but this didn't work quite right because of how I'm splitting my lines of code.
This is all of my code:
static void Main(string[] args)
{
string file = @"C:\somePath\Form1.cs";
string output = @"C:\someOtherPath\New Text Document.txt";
List<string> datasets = new List<string>();
string[] lines = File.ReadAllLines(file);
foreach (string line in lines)
{
string[] words = line.Split(' ');
foreach (string word in words)
{
if (word.ToLower().Contains("adodataset"))
datasets.Add(word);
}
}
if (datasets.Count > 0)
{
using (StreamWriter sw = new StreamWriter(output))
{
foreach (string dataset in datasets.Distinct())
{
sw.WriteLine(dataset);
}
}
Console.WriteLine(String.Format("Wrote {0} data sets to {1}", datasets.Distinct().Count(), output));
Console.ReadKey();
}
}
But this didn't work as I hoped, and added "words" such as these:
SQLText(ADODataSetEnrollment->FieldByName("Age1")->AsString)
SQLText(ADODataSetEnrollment->FieldByName("Age2")->AsString)
SQLText(ADODataSetEnrollment->FieldByName("Age3")->AsString)
I'm only interested in ADODataSetEnrollment, so I should only have 1 entry for that variable in my output file but because that line of code doesn't contain a space it's treated as a single "word".
How can I split my lines array instead, so that way I can find unique variables?
Have you tried regex matching? With Regex you can, for example, write:
Regex.IsMatch(word, @"(?i)(?<!\w)adodataset(?!\w)")
> (?i) makes the match case-insensitive.
> (?<!\w) means "not preceded by a word character" (a letter, digit, or underscore).
> (?!\w) means "not followed by a word character".
> Regex.IsMatch(...) returns a bool value.
Ended up with this as a solution:
string file = @"C:\somePath\Form1.cs";
string output = @"C:\someOtherPath\New Text Document.txt";
List<string> datasets = new List<string>();
string[] lines = File.ReadAllLines(file);
decimal i = 0;
foreach (string line in lines)
{
string[] words = line.Split(' ');
foreach (string word in words)
{
if (word.ToLower().Contains("adodataset"))
{
int start = word.ToLower().IndexOf("adodataset");
string dsWord = String.Empty;
string temp = word.Substring(start, word.Length - start);
foreach (char c in temp)
{
if (Char.IsLetter(c))
dsWord += c;
else
break;
}
if (dsWord != String.Empty)
datasets.Add(dsWord);
}
}
i++;
Console.Write("\r{0}% ", Math.Round(i / lines.Count() * 100, 2));
}
if (datasets.Count > 0)
{
using (StreamWriter sw = new StreamWriter(output))
{
foreach (string dataset in datasets.Distinct())
sw.WriteLine(dataset);
}
Console.WriteLine(String.Format("Wrote {0} data sets to {1}", datasets.Distinct().Count(), output));
Console.ReadKey();
}
Pretty ghetto, but it did what I needed it to do. I'll happily accept someone else's answer though if they know of a better way to use Regex to just pull out the variable name from within the line of code, rather than the whole line itself.
You can try this solution:
string file = File.ReadAllText(@"text.txt");
string output = @"C:\someOtherPath\New Text Document.txt";
List<string> datasets = new List<string>();
var a = Regex.Matches(file, @"\W(ADODataSet\w*)", RegexOptions.IgnoreCase);
foreach (Match m in a)
{
datasets.Add(m.Groups[1].Value);
}
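The regex approach can be taken one step further so that the unique names come back directly. This is a sketch, not a drop-in replacement: DataSetNameSketch and FindUniqueNames are made-up names, and the pattern uses \b (a word boundary) instead of the leading \W so that a name at the very start of the text is also found. Names are de-duplicated case-sensitively, as Distinct() does by default.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class DataSetNameSketch
{
    // Returns each distinct identifier that starts with "ADODataSet" (any casing),
    // in order of first appearance.
    public static List<string> FindUniqueNames(string source)
    {
        return Regex.Matches(source, @"\bADODataSet\w*", RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct()
            .ToList();
    }
}
```

Because \b requires a boundary before the match, the type name TADODataSet itself is skipped: the leading T is a word character.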
I am writing a simple console application that should let me count the occurrences of each unique word.
For example, the console lets the user type a sentence; once they press Enter, the program should count the number of times each word occurs. So far I can only count characters. Any help would be appreciated.
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Please enter string");
string input = Convert.ToString(Console.ReadLine());
Dictionary<string, int> objdict = new Dictionary<string, int>();
foreach (var j in input)
{
if (objdict.ContainsKey(j.ToString()))
{
objdict[j.ToString()] = objdict[j.ToString()] + 1;
}
else
{
objdict.Add(j.ToString(), 1);
}
}
foreach (var temp in objdict)
{
Console.WriteLine("{0}:{1}", temp.Key, temp.Value);
}
Console.ReadLine();
}
}
Try this method:
private void countWordsInALine(string line, Dictionary<string, int> words)
{
var wordPattern = new Regex(@"\w+");
foreach (Match match in wordPattern.Matches(line))
{
int currentCount=0;
words.TryGetValue(match.Value, out currentCount);
currentCount++;
words[match.Value] = currentCount;
}
}
Call the above method like this:
var words = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);
countWordsInALine(line, words);
In the words dictionary you will find each word (the key) along with its occurrence frequency (the value).
Just call the Split method with a single space (assuming words are separated by single spaces). It gives you a collection of the words; then iterate over that collection with the same logic you already have.
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Please enter string");
string input = Convert.ToString(Console.ReadLine());
Dictionary<string, int> objdict = new Dictionary<string, int>();
foreach (var j in input.Split(' '))
{
if (objdict.ContainsKey(j))
{
objdict[j] = objdict[j] + 1;
}
else
{
objdict.Add(j, 1);
}
}
foreach (var temp in objdict)
{
Console.WriteLine("{0}:{1}", temp.Key, temp.Value);
}
Console.ReadLine();
}
}
You need to split the string on spaces (or any other characters which you consider to delimit words). Try changing the loop to this:
foreach (string Word in input.Split(' ')) {
}
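Splitting on a single space leaves punctuation attached to words and produces empty entries when two spaces appear in a row. A minimal sketch of a more forgiving split follows. The delimiter list, the case-insensitive comparer, and the names WordCountSketch and Count are all assumptions chosen for illustration, not part of the question.

```csharp
using System;
using System.Collections.Generic;

static class WordCountSketch
{
    // Assumed set of characters that separate words; adjust as needed.
    private static readonly char[] Delimiters = { ' ', '\t', ',', '.', ';', ':', '!', '?' };

    // Counts how often each word occurs, ignoring case and empty entries.
    public static Dictionary<string, int> Count(string input)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in input.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries))
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

With this comparer the key keeps the casing of the first occurrence, while lookups such as counts["the"] match regardless of case.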
Might I suggest a ternary-tree to make things efficient?
Here's a link to a C# implementation: http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/
After first inserting into the tree, you could simply call "Contains" with one of the implementations above to make things quick
Try this...
var theList = new List<string>() { "Alpha", "Alpha", "Beta", "Gamma", "Delta" };
theList.GroupBy(txt => txt)
.Where(grouping => grouping.Count() > 1)
.ToList()
.ForEach(groupItem => Console.WriteLine("{0} duplicated {1} times with these values {2}",
groupItem.Key,
groupItem.Count(),
string.Join(" ", groupItem.ToArray())));
Console.ReadKey();
http://omegacoder.com/?p=792
I have a text file with this info in it:
01 Index
home
about
02 Home
first
second
third
The line that starts with a number signifies a key and following it, till a blank line, are the values.
I want a Dictionary object in which each key is a line that starts with a number, and the value is a string array holding the lines that follow it, up to the next blank line.
Like so:
key = "01 Index"
value = string[]
where the values in the array would be:
string[0] = "home"
string[1] = "about"
The next key would be "02 Home" and its following lines as string[].
This method reads the text file:
string[] ReadFileEntries(string path)
{
return File.ReadAllLines(path);
}
This gives all the lines, 8 in the sample above, in the string[] where the 4th line would be the blank line.
How do I create the required dictionary from this?
Thanks in advance.
Regards.
To parse the file into, say, a Dictionary<String, String[]>, you can do something like this:
Dictionary<String, String[]> data = new Dictionary<String, String[]>();
String key = null;
List<String> values = new List<String>();
foreach (String line in File.ReadLines(path)) {
// Skip blank lines
if (String.IsNullOrEmpty(line))
continue;
// Check if it's a key (that should start from digit)
if ((line[0] >= '0' && line[0] <= '9')) { // <- Or use regular expression
if (!Object.ReferenceEquals(null, key))
data.Add(key, values.ToArray());
key = line;
values.Clear();
continue;
}
// it's a value (not a blank line and not a key)
values.Add(line);
}
if (!Object.ReferenceEquals(null, key))
data.Add(key, values.ToArray());
This should do what you're wanting to do. There is room for improvement, and I suggest refactoring a bit.
var result = new Dictionary<string, string[]>();
var input = File.ReadAllLines(@"c:\temp\test.txt");
var currentValue = new List<string>();
var currentKey = string.Empty;
foreach (var line in input)
{
if (currentKey == string.Empty)
{
currentKey = line;
}
else if (!string.IsNullOrEmpty(line))
{
currentValue.Add(line);
}
if (string.IsNullOrEmpty(line))
{
result.Add(currentKey, currentValue.ToArray());
currentKey = string.Empty;
currentValue = new List<string>();
}
}
if (currentKey != string.Empty)
{
result.Add(currentKey, currentValue.ToArray());
}
We declare a dictionary of that nature in this manner:
Dictionary<string, string[]> myDict = new Dictionary<string, string[]>();
To populate it we do this:
myDict.Add(someString, someStringArray);
I think something like this would do it:
string[] TheFileAsAnArray = ReadFileEntries(path);
Dictionary<string, string[]> myDict = new Dictionary<string, string[]>();
string key = String.Empty;
List<string> values = new List<string>();
for(int i = 0; i < TheFileAsAnArray.Length; i++)
{
if(String.IsNullOrEmpty(TheFileAsAnArray[i].Trim()))
{
// A blank line ends the current group (skip it if no key has been read yet)
if(key != String.Empty)
myDict.Add(key, values.ToArray());
key = String.Empty;
values = new List<string>();
}
else
{
if(key == String.Empty)
key = TheFileAsAnArray[i];
else
values.Add(TheFileAsAnArray[i]);
}
}
// Add the last group when the file does not end with a blank line
if(key != String.Empty)
myDict.Add(key, values.ToArray());
I have a RichTextBox control on a form and a text file. I read the text file into one array and richTextBox1.Text into another, then compare them and count the matching words.
But suppose, for example, that the text file contains the word "name" twice and the RichTextBox contains the word "and" three times. If a word appears twice in the text file, it must not be counted 3 or more times from the RichTextBox; anything beyond 2 is a wrong word and must not be counted. But HashSet keeps only unique values and does not account for duplicates in the text file. I want to compare every word in the text file with the words in the RichTextBox.
My code is here:
StreamReader sr = new StreamReader("c:\\test.txt",Encoding.Default);
string[] word = sr.ReadLine().ToLower().Split(' ');
sr.Close();
string[] word2 = richTextBox1.Text.ToLower().Split(' ');
var set1 = new HashSet<string>(word);
var set2 = new HashSet<string>(word2);
set1.IntersectWith(set2);
MessageBox.Show(set1.Count.ToString());
Inferring that you want:
file:
foo
foo
foo
bar
text box:
foo
foo
bar
bar
to result in '3' (2 foos and one bar)
Dictionary<string,int> fileCounts = new Dictionary<string, int>();
using (var sr = new StreamReader("c:\\test.txt",Encoding.Default))
{
foreach (var word in sr.ReadLine().ToLower().Split(' '))
{
int c = 0;
if (fileCounts.TryGetValue(word, out c))
{
fileCounts[word] = c + 1;
}
else
{
fileCounts.Add(word, 1);
}
}
}
int total = 0;
foreach (var word in richTextBox1.Text.ToLower().Split(' '))
{
int c = 0;
if (fileCounts.TryGetValue(word, out c))
{
total++;
if (c - 1 > 0)
fileCounts[word] = c - 1;
else
fileCounts.Remove(word);
}
}
MessageBox.Show(total.ToString());
Note that this destructively modifies the dictionary read from the file. You can avoid this (and so only have to read the file once) by simply counting the words in the rich text box in the same way, then taking the Min of the two counts for each word and summing them.
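The non-destructive variant described in that note can be sketched as follows: count both word lists the same way, then add up the smaller of the two counts for every word that appears in both. MatchCountSketch, CountWords and CountMatches are hypothetical names, and the inputs are assumed to be already lower-cased and split.

```csharp
using System;
using System.Collections.Generic;

static class MatchCountSketch
{
    // Counts how many times each word occurs in the sequence.
    public static Dictionary<string, int> CountWords(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }
        return counts;
    }

    // Sums min(count in file, count in text box) over all words found in both.
    public static int CountMatches(IEnumerable<string> fileWords, IEnumerable<string> boxWords)
    {
        Dictionary<string, int> fileCounts = CountWords(fileWords);
        Dictionary<string, int> boxCounts = CountWords(boxWords);
        int total = 0;
        foreach (KeyValuePair<string, int> entry in fileCounts)
        {
            int boxCount;
            if (boxCounts.TryGetValue(entry.Key, out boxCount))
            {
                total += Math.Min(entry.Value, boxCount);
            }
        }
        return total;
    }
}
```

For the sample above (file: foo foo foo bar; text box: foo foo bar bar) this gives min(3, 2) + min(1, 2) = 3, and neither dictionary is modified.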
You need the counts to be the same? You need to count the words, then...
static Dictionary<string, int> CountWords(string[] words) {
// use (StringComparer.{your choice}) for case-insensitive
var result = new Dictionary<string, int>();
foreach (string word in words) {
int count;
if (result.TryGetValue(word, out count)) {
result[word] = count + 1;
} else {
result.Add(word, 1);
}
}
return result;
}
...
var set1 = CountWords(word);
var set2 = CountWords(word2);
var matches = from val in set1
where set2.ContainsKey(val.Key)
&& set2[val.Key] == val.Value
select val.Key;
foreach (string match in matches)
{
Console.WriteLine(match);
}