Loading data from text file into a dictionary

Loading data from text file into a dictionary - c#

I have a file consisting of a list of text which looks as follows:
ABC Abbey something
ABD Aasdasd
This is the text file
The first string will always be the length of 3. So I want to loop through the file content, store those first 3 letters as Key and remaining as value. I am removing white space between them and Substringing as follows to store. The key works out fine but the line where I am storing the value returns following error. ArgumentOutOfRangeException
This is the exact code causing the problem.
line.Substring(4, line.Length)
If I call the subString between 0 and line.length it works fine. As long as I call it between 1and upwards - line.length I get the error. Honestly don't get it and been at it for hours. Some assistance please.
class Program {
static string line;
static Dictionary<string, string> stations = new Dictionary<string, string>();
static void Main(string[] args) {
var lines = File.ReadLines("C:\\Users\\username\\Desktop\\a.txt");
foreach (var l in lines) {
line = l.Replace("\t", "");
stations.Add(line.Substring(0, 3), line.Substring(4, line.Length));//error caused by this line
}
foreach(KeyValuePair<string, string> item in stations) {
//Console.WriteLine(item.Key);
Console.WriteLine(item.Value);
}
Console.ReadLine();
}
}

This is because the documentation specifies it will throw an ArgumentOutOfRangeException if:
startIndex plus length indicates a position not within this instance.
With the signature:
public string Substring(int startIndex, int length)
Since you use line.Length, you know that startIndex plus length will be 4+line.Length which is definitely not a position of this instance.
I recommend using the one parameter version:
public string Substring(int startIndex)
Thus line.Substring(3) (credit to #adv12 for spotting that). Since here you only should provide the startIndex. Of course you can use line.SubString(3,line.Length-3), but as always, better use a library since libraries are made to make programs fool-proof (this is not intended as offensive, simply make sure you reduce the amount of brain cycles for this task). Mind however that it still can throw an error if:
startIndex is less than zero or greater than the length of this instance.
So better provide checks that 3 is less than or equal to line.length...
Additional advice
Perhaps you should take a look to regex capturing. Now each key in your file contains three characters. But it is possible that in the (near) future four characters will be possible. Using regex capture, you could specify a pattern such that it is less likely that errors will occur during parsing.

You need to actually get less than the length of total line:
line.Substring(4, line.Length - 4) //subtract the chars which you're skipping
Your string:
ABC Abbey something
Length = 19
Start = 4
Remaining chars = 19 - 4 = 15 //and you are expecting 19, that is the error

I know this is a late answer that doesn't address what's wrong with your code but I feel that has already been done by other people. Instead I have different way to make the dictionary that doesn't involve substring at all so it's a little more robust, IMHO.
As long as you can guarantee that the two values are always separated by tab then this would work even if there were more or less characters in the key. It uses LINQ which should be fine from .NET 3.5.
// LINQ
using System.Linq;
// Creates a string[][] array with the list of keys in the first array position
// and the values in the second
var lines = File.ReadAllLines(#"path/to/file.txt")
.Select(s => s.Split('\t'))
.ToArray();
// Your dictionary
Dictionary<string, string> stations = new Dictionary<string, string>();
// Loop through the array and add the key/value pairs to the dictionary
for (int i = 0; i < lines.Length; i++)
{
// For example lines[i][0] = ABW, lines[i][1] = Abbey Wood
stations[lines[i][0]] = lines[i][1];
}
// Prove it works
foreach (KeyValuePair<string, string> entry in stations)
{
MessageBox.Show(entry.Key + " - " + entry.Value);
}
Hope this makes sense and gives you an alternate to consider ;-)

Related

Read multiple lines from a large file in non-ascending order

I have a very large text file, over 1GB, and I have a list of integers that represent line numbers, and the need is to produce another file containing the text of the original files line numbers in the new file.
Example of original large file:
ogfile line 1
some text here
another line
blah blah
So when I get a List of "2,4,4,1" the output file should read:
some text here
blah blah
blah blah
ogfile line 1
I have tried
string lineString = File.ReadLines(filename).Skip(lineNumList[i]-1).Take(1).First();
but this takes way to long as the file has to be read in, skipped to the line in question, then reread the next time... and we are talking millions of lines in the 1GB file and my List<int> is thousands of line numbers.
Is there a better/faster way to read a single line, or have the reader skip to a specific line number without "skipping" line by line?

The high-order bit here is: you are trying to solve a database problem using text files. Databases are designed to solve big data problems; text files, as you've discovered, are terrible at random access. Use a database, not a text file.
If you are hell-bent upon using a text file, what you have to do is take advantage of stuff you know about the likely problem parameters. For example, if you know that, as you imply, there are ~1M lines, each line is ~1KB, and the set of lines to extract is ~0.1% of the total lines, then you can come up with an efficient solution like this:
Make a set containing the line numbers to be read. The set must be fast to check for membership.
Make a dictionary that maps from line numbers to line contents. This must be fast to look up by key and fast to add new key/value pairs.
Read each line of the file one at a time; if the line number is in the set, add the contents to the dictionary.
Now iterate the list of line numbers and map the dictionary contents; now we have a sequence of strings.
Dump that sequence to the destination file.
We have five operations, so hopefully it is around five lines of code.
void DoIt(string pathIn, IEnumerable<int> lineNumbers, string pathOut)
{
var lines = new HashSet<int>(lineNumbers);
var dict = File.ReadLines(pathIn)
.Select((lineText, index) => new KeyValuePair<int, string>(index, lineText))
.Where(p => lines.Contains(p.Key))
.ToDictionary(p => p.Key, p => p.Value);
File.WriteAllLines(pathOut, lineNumbers.Select(i => dict[i]));
}
OK, got it in six. Pretty good.
Notice that I made use of all those assumptions; if the assumptions are violated then this stops being a good solution. In particular we assume that the dictionary is going to be small compared to the size of the input file. If that is not true, then you'll need a more sophisticated technique to get efficiencies.
Conversely, can we extract additional efficiencies? Yes, provided we know facts about likely inputs. Suppose for example we know that the same file will be iterated several times but with different line number sets, but those sets are likely to have overlap. In that case we can re-use dictionaries instead of rebuilding them. That is, suppose a previous operation has left a Dictionary<int, string> computed for lines (10, 20, 30, 40) and file X. If a request then comes in for lines (30, 20, 10) for file X, we already have the dictionary in memory.
The key thing I want to get across in this answer is that you must know something about the inputs in order to build an efficient solution; the more restrictions you can articulate on the inputs, the more efficient a solution you can build. Take advantage of all the knowledge you have about the problem domain.

Use a StreamReader, so you don't have to read the entire file, just until the last desired line, and store them in a Dictionary, for later fast search.
Edit: Thanks to Erick Lippert, I included a HashSet for fast lookup.
List<int> lineNumbers = new List<int>{2,4,4,1};
HashSet<int> lookUp = new HashSet<int>(lineNumbers);
Dictionary<int,string> lines = new Dictionary<int,string>();
using(StreamReader sr = new StreamReader(inputFile)){
int lastLine = lookUp.Max();
for(int currentLine=1;currentLine<=lastLine;currentLine++){
if(lookUp.Contains(currentLine)){
lines[currentLine]=sr.ReadLine();
}
else{
sr.ReadLine();
}
}
}
using(StreamWriter sw = new StreamWriter(outputFile)){
foreach(var line in lineNumbers){
sw.WriteLine(lines[line]);
}
}

You may use a StreamReader and ReadLine method to read line by line without shocking the memory:
var lines = new Dictionary<int, string>();
var indexesProcessed = new HashSet<int>();
var indexesNew = new List<int> { 2, 4, 4, 1 };
using ( var reader = new StreamReader(#"c:\\file.txt") )
for ( int index = 1; index <= indexesNew.Count; index++ )
if ( reader.Peek() >= 0 )
{
string line = reader.ReadLine();
if ( indexesNew.Contains(index) && !indexesProcessed.Contains(index) )
{
lines[index] = line;
indexesProcessed.Add(index);
}
}
using ( var writer = new StreamWriter(#"c:\\file-new.txt", false) )
foreach ( int index in indexesNew )
if ( indexesProcessed.Contains(index) )
writer.WriteLine(lines[index]);
It reads the file and select the desired indexes then save them in the desired order.
We use a HashSet to store processed indexes to speedup Contains calls as you indicate the file can be over 1GB.
The code is made to avoid index out of bound in case of mismatches between the source file and the desired indexes, but it slows down the process. You can optimize if you are sure that there will be no problem. In this case you can remove all usage of indexesProcessed.
Output:
some text here
blah blah
blah blah
ogfile line 1

One way to do this would be to simply read the input file once (and store the result in a variable), and then grab the lines you need and write them to the output file.
Since the line number is 1-based and arrays are 0-based (i.e. line number 1 is array index 0), we subtract 1 from the line number when specifying the array index:
static void Main(string[] args)
{
var inputFile = #"f:\private\temp\temp.txt";
var outputFile = #"f:\private\temp\temp2.txt";
var fileLines = File.ReadAllLines(inputFile);
var linesToDisplay = new[] {2, 4, 4, 1};
// Write each specified line in linesToDisplay from fileLines to the outputFile
File.WriteAllLines(outputFile,
linesToDisplay.Select(lineNumber => fileLines[lineNumber - 1]));
GetKeyFromUser("\n\nDone! Press any key to exit...");
}
Another way to do this that should be more efficient is to only read the file up to the maximum line number (using the ReadLines method), rather than reading the whole file (using the ReadAllLines method), and save just the lines we care about in a dictionary that maps the line number to the line text:
static void Main(string[] args)
{
var inputFile = #"f:\private\temp\temp.txt";
var outputFile = #"f:\private\temp\temp2.txt";
var linesToDisplay = new[] {2, 4, 4, 1};
var maxLineNumber = linesToDisplay.Max();
var fileLines = new Dictionary<int, string>(linesToDisplay.Distinct().Count());
// Start lineNumber at 1 instead of 0
int lineNumber = 1;
// Just read up to the largest line number we need
// and save the lines we care about in our dictionary
foreach (var line in File.ReadLines(inputFile))
{
if (linesToDisplay.Contains(lineNumber))
{
fileLines[lineNumber] = line;
}
// Increment our lineNumber and break if we're done
if (++lineNumber > maxLineNumber) break;
}
// Write the output to our file
File.WriteAllLines(outputFile, linesToDisplay.Select(line => fileLines[line]));
GetKeyFromUser("\n\nDone! Press any key to exit...");
}

How can I implement the Viterbi algorithm in C# to split conjoined words?

In short - I want to convert the first answer to the question here from Python into C#. My current solution to splitting conjoined words is exponential, and I would like a linear solution. I am assuming no spacing and consistent casing in my input text.
Background
I wish to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather" using C#. I have created a working solution, a recursive function using exponential time, which is simply not efficient enough for my purposes (processing at least over 100 joined words). Here the questions I have read so far, which I believe may be helpful, but I cannot translate their responses from Python to C#.
How can I split multiple joined words?
Need help understanding this Python Viterbi algorithm
How to extract literal words from a consecutive string efficiently?
My Current Recursive Solution
This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.
My current solution works out all possible combinations of words, finds the most probable output and displays. I am currently defining the most probable output as the one which uses the longest individual words - I would prefer to use a different method. Here is my current solution, using a recursive algorithm.
static public string find_words(string instring)
{
if (words.Contains(instring)) //where words is my dictionary of words
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
string bestSolution = "";
string solution = "";
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
//if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
This algorithm relies on having no spacing or other symbols in the entry text (not really a problem here, I'm not fussed about splitting up punctuation). Random additional letters added within the string can cause an error, unless I store each letter of the alphabet as a "word" within my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would prefer an output of "wicked weather dykjs".
My updated exponential solution:
static List<string> words = File.ReadLines("E:\\words.txt").ToList();
static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
private void btnAutoSpacing_Click(object sender, EventArgs e)
{
string text = txtText.Text;
text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
if (text.Length > 150)
{
//possibly split the text up into more manageable chunks?
//considering using textSplit() for this.
}
else
{
txtText.Text = find_words(text);
}
}
static IEnumerable<string> textSplit(string str, int chunkSize)
{
return Enumerable.Range(0, str.Length / chunkSize)
.Select(i => str.Substring(i * chunkSize, chunkSize));
}
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
static public string find_words(string instring)
{
string bestSolution = "";
string solution = "";
if (compiledWords[instring[0]].Contains(instring))
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
How I would like to use the Viterbi Algorithm
I would like to create an algorithm which works out the most probable solution to a conjoined string, where the probability is calculated according to the position of the word in a text file that I provide the algorithm with. Let's say the file starts with the most common word in the English language first, and on the next line the second most common, and so on until the least common word in my dictionary. It looks roughly like this
the
be
and
...
attorney
Here is a link to a small example of such a text file I would like to use.
Here is a much larger text file which I would like to use
The logic behind this file positioning is as follows...
It is reasonable to assume that they follow Zipf's law, that is the
word with rank n in the list of words has probability roughly 1/(n log
N) where N is the number of words in the dictionary.
Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution to the problem from Python into C#, but after many hours spent attempting this I haven't been able to produce a working solution.
I also remain open to the idea that perhaps relative frequencies with the Viterbi algorithm isn't the best way to split words, any other suggestions for creating a solution using C#?

Written text is highly contextual and you may wish to use a Markov chain to model sentence structure in order to estimate joint probability. Unfortunately, sentence structure breaks the Viterbi assumption -- but there is still hope, the Viterbi algorithm is a case of branch-and-bound optimization aka "pruned dynamic programming" (something I showed in my thesis) and therefore even when the cost-splicing assumption isn't met, you can still develop cost bounds and prune your population of candidate solutions. But let's set Markov chains aside for now... assuming that the probabilities are independent and each follows Zipf's law, what you need to know is that the Viterbi algorithm works on accumulating additive costs.
For independent events, joint probability is the product of the individual probabilities, making negative log-probability a good choice for the cost.
So your single-step cost would be -log(P) or log(1/P) which is log(index * log(N)) which is log(index) + log(log(N)) and the latter term is a constant.

Can't help you with the Viterbi Algorithm but I'll give my two cents concerning your current approach. From your code its not exactly clear what words is. This can be a real bottleneck if you don't choose a good data structure. As a gut feeling I'd initially go with a Dictionary<char, HashSet<string>> where the key is the first letter of each word:
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
And I'd also consider serializing it to disk to avoid building it up every time.
Not sure how much improvement you can make like this (dont have full information of you current implementation) but benchmark it and see if you get any improvement.
NOTE: I'm assuming all words are cased consistently.

How to improve performance of this algorithm?

I have a text file with 100000 pairs: word and frequency.
test.in file with words:
1 line - total count of all word-frequency pairs
2 line to ~100 001 - word-frequency pairs
100 002 line - total count of user input words
from 100 003 to the end - user input words
I parse this file and put the words in
Dictionary<string,double> dictionary;
And I want to execute some search + order logic in the following code:
for(int i=0;i<15000;i++)
{
tempInputWord = //take data from file(or other sources)
var adviceWords = dictionary
.Where(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
.OrderByDescending(ks => ks.Value)
.ThenBy(ks => ks.Key,StringComparer.Ordinal)
.Take(10)
.ToList();
//some output
}
The problem: This code must run in less than 10 seconds.
On my computer (core i5 2400, 8gb RAM) with Parallel.For() - about 91 sec.
Can you give me some advice how to increase performance?
UPDATE :
Hooray! We did it!
Thank you #CodesInChaos, #usr, #T_D and everyone who was involved in solving the problem.
The final code:
var kvList = dictionary.OrderBy(ks => ks.Key, StringComparer.Ordinal).ToList();
var strComparer = new MyStringComparer();
var intComparer = new MyIntComparer();
var kvListSize = kvList.Count;
var allUserWords = new List<string>();
for (int i = 0; i < userWordQuantity; i++)
{
var searchWord = Console.ReadLine();
allUserWords.Add(searchWord);
}
var result = allUserWords
.AsParallel()
.AsOrdered()
.Select(searchWord =>
{
int startIndex = kvList.BinarySearch(new KeyValuePair<string, int>(searchWord, 0), strComparer);
if (startIndex < 0)
startIndex = ~startIndex;
var matches = new List<KeyValuePair<string, int>>();
bool isNotEnd = true;
for (int j = startIndex; j < kvListSize ; j++)
{
isNotEnd = kvList[j].Key.StartsWith(searchWord, StringComparison.Ordinal);
if (isNotEnd) matches.Add(kvList[j]);
else break;
}
matches.Sort(intComparer);
var res = matches.Select(s => s.Key).Take(10).ToList();
return res;
});
foreach (var adviceWords in result)
{
foreach (var adviceWord in adviceWords)
{
Console.WriteLine(adviceWord);
}
Console.WriteLine();
}
6 sec (9 sec without manual loop (with linq)))

You are not at all using any algorithmic strength of the dictionary. Ideally, you'd use a tree structure so that you can perform prefix lookups. On the other hand you are within 3.7x of your performance goal. I think you can reach that by just optimizing the constant factor in your algorithm.
Don't use LINQ in perf-critical code. Manually loop over all collections and collect results into a List<T>. That turns out to give a major speed-up in practice.
Don't use a dictionary at all. Just use a KeyValuePair<T1, T2>[] and run through it using a foreach loop. This is the fastest possible way to traverse a set of pairs.
Could look like this:
KeyValuePair<T1, T2>[] items;
List<KeyValuePair<T1, T2>> matches = new ...(); //Consider pre-sizing this.
//This could be a parallel loop as well.
//Make sure to not synchronize too much on matches.
//If there tend to be few matches a lock will be fine.
foreach (var item in items) {
if (IsMatch(item)) {
matches.Add(item);
}
}
matches.Sort(...); //Sort in-place
return matches.Take(10); //Maybe matches.RemoveRange(10, matches.Count - 10) is better
That should exceed a 3.7x speedup.
If you need more, try stuffing the items into a dictionary keyed on the first char of Key. That way you can look up all items matching tempInputWord[0]. That should reduce search times by the selectivity that is in the first char of tempInputWord. For English text that would be on the order of 26 or 52. This is a primitive form of prefix lookup that has one level of lookup. Not pretty but maybe it is enough.

I think the best way would be to use a Trie data structure instead of a dictionary. A Trie data structure saves all the words in a tree structure. A node can represent all the words that start with the same letters. So if you look for your search word tempInputWord in a Trie you will get a node that represents all the words starting with tempInputWord and you just have to traverse through all the child nodes. So you just have one search operation. The link to the Wikipedia article also mentions some other advantages over hash tables (that's what an Dictionary is basically):
Looking up data in a trie is faster in the worst case, O(m) time
(where m is the length of a search string), compared to an imperfect
hash table. An imperfect hash table can have key collisions. A key
collision is the hash function mapping of different keys to the same
position in a hash table. The worst-case lookup speed in an imperfect
hash table is O(N) time, but far more typically is O(1), with O(m)
time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is
associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
And here are some ideas for creating a trie in C#.
This should at least speed up the lookup, however, building the Trie might be slower.
Update:
Ok, I tested it myself using a file with frequencies of english words that uses the same format as yours. This is my code which uses the Trie class that you also tried to use.
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
sw.Start();
var trie = new Trie<KeyValuePair<string,int>>();
//build trie with your value pairs
var lines = File.ReadLines("en.txt");
foreach(var line in lines.Take(100000))
{
var split = line.Split(' ');
trie.Add(split[0], new KeyValuePair<string,int>(split[0], int.Parse(split[1])));
}
Console.WriteLine("Time needed to read file and build Trie with 100000 words: " + sw.Elapsed);
sw.Reset();
//test with 10000 search words
sw.Start();
foreach (string line in lines.Take(10000))
{
var searchWord = line.Split(' ')[0];
var allPairs = trie.Retrieve(searchWord);
var bestWords = allPairs.OrderByDescending(kv => kv.Value).ThenBy(kv => kv.Key).Select(kv => kv.Key).Take(10);
var output = bestWords.Aggregate("", (s1, s2) => s1 + ", " + s2);
Console.WriteLine(output);
}
Console.WriteLine("Time to process 10000 different searchWords: " + sw.Elapsed);
}
My results on a pretty similar machine:
Time needed to read file and build Trie with 100000 words: 00:00:00.7397839
Time to process 10000 different searchWords: 00:00:03.0181700
So I think you are doing something wrong that we cannot see. For example the way you measure the time or the way you read the file. As my results show this stuff should be really fast. The 3 seconds are mainly due to the Console output in the loop which I needed so that the bestWords variable is used. Otherwise the variable would have been optimized away.

Replace the dictionary by a List<KeyValuePair<string, decimal>>, sorted by the key.
For the search I use that a substring sorts directly before its prefixes with ordinal comparisons. So I can use a binary search to find the first candidate. Since the candidates are contiguous I can replace Where with TakeWhile.
int startIndex = dictionary.BinarySearch(searchWord, comparer);
if(startIndex < 0)
startIndex = ~startIndex;
var adviceWords = dictionary
.Skip(startIndex)
.TakeWhile(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
.OrderByDescending(ks => ks.Value)
.ThenBy(ks => ks.Key)
.Select(s => s.Key)
.Take(10).ToList();
Make sure to use ordinal comparison for all operations, including the initial sort, the binary search and the StartsWith check.
I would call Console.ReadLine outside the parallel loop. Probably using AsParallel().Select(...) on the collection of search words instead of Parallel.For.

If you want profiling, separate the reading of the file and see how long that takes.
Also data calculation, collection, presentation could be different steps.
If you want concurrence AND a dictionary, look at the ConcurrentDictionary, maybe even more for reliability than for performance, but probably for both:
http://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx

Assuming the 10 is constant, then why is everyone storing the entire data set? Memory is not free. The fastest solution is to store the first 10 entries into a list, sort it. Then, maintain the 10-element-sorted-list as you traverse through the rest of the data set, removing the 11th element every time you insert an element.
The above method works best for small values. If you had to take the first 5000 objects, consider using a binary heap instead of a list.

How to get String Line number in Foreach loop from reading array?

The program helps users to parse a text file by grouping certain part of the text files into "sections" array.
So the question is "Are there any methods to find out the line numbers/position within the array?" The program utilizes a foreach loop to read the "sections" array.
May someone please advise on the codes? Thanks!
namespace Testing
{
class Program
{
static void Main(string[] args)
{
TextReader tr = new StreamReader(#"C:\Test\new.txt");
String SplitBy = "----------------------------------------";
// Skip 5 lines of the original text file
for(var i = 0; i < 5; i++)
{
tr.ReadLine();
}
// Read the reststring
String fullLog = tr.ReadToEnd();
String[] sections = fullLog.Split(new string[] { SplitBy }, StringSplitOptions.None);
//String[] lines = sections.Skip(5).ToArray();
int t = 0;
// Tried using foreach (String r in sections.skip(4)) but skips sections instead of the Text lines found within each sections
foreach (String r in sections)
{
Console.WriteLine("The times are : " + t);
// Is there a way to know or get the "r" line number?
Console.WriteLine(r);
Console.WriteLine("============================================================");
t++;
}
}
}
}

A foreach loop doesn't have a loop counter of any kind. You can keep your own counter:
int number = 1;
foreach (var element in collection) {
// Do something with element and number,
number++;
}
or, perhaps easier, make use of LINQ's Enumerable.Select that gives you the current index:
var numberedElements = collection.Select((element, index) => new { element, index });
with numberedElements being a collection of anonymous type instances with properties element and index. In the case a file you can do this:
var numberedLines = File.ReadLines(filename)
.Select((Line,Number) => new { Line, Number });
with the advantage that the whole thing is processed lazily, so it will only read the parts of the file into memory that you actually use.

As far as I know, there is not a way to know which line number you are at within the file. You'd either have to keep track of the lines yourself, or read the file again until you get to that line and count along the way.
Edit:
So you're trying to get the line number of a string inside the array after the master string's been split by the SplitBy?
If there's a specific delimiter in that sub string, you could split it again - although, this might not give you what you're looking for, except...
You're essentially back at square one.
What you could do is try splitting the section string by newline characters. This should spit it out into an array that corresponds with line numbers inside the string.

Yes, you can use a for loop instead of foreach. Also, if you know the file isn't going to be too large, you can read all of the lines into an array with:
string[] lines = File.ReadAllLines(#"C:\Test\new.txt");

Well, don't use a foreach, use a for loop
for( int i = 0; i < sections.Length; ++ )
{
string section = sections[i];
int lineNum = i + 1;
}
You can of course maintain a counter when using a foreach loop as well, but there is no reason to since you have the standard for loop at your disposal which is made for this sort of thing.
Of course, this won't necessarily give you the line number of the string in the text file unless you split on Environment.NewLine. You are splitting on a large number of '-' characters and I have no idea how your file is structured. You'll likely end up underestimating the line number because all of the '---' bits will be discarded.

Not as your code is written. You must track the line number for yourself. Problematic areas of your code:
You skip 5 lines at the beginning of your code, you must track this.
Using the Split method, you are potentially "removing" lines from the original collection of lines. You must find away to know how many splits you have made, because they are an original part of the line count.
Rather than taking the approach you have, I suggest doing the parsing and searching within a classic indexed for-loop that visits each line of the file. This probably means giving up conveniences like Split, and rather looking for markers in the file manually with e.g. IndexOf.

I've got a much simpler solution to the questions after reading through all the answers yesterday.
As the string had a newline after each line, it is possible to split the strings and convert it into a new array which then is possible to find out the line number according to the array position.
The Codes:
foreach (String r in sections)
{
Console.WriteLine("The times are : " + t);
IList<String> names = r.Split('\n').ToList<String>();
}

Counting occurrences of a string in an array and then removing duplicates

I am fairly new to C# programming and I am stuck on my little ASP.NET project.
My website currently examines Twitter statuses for URLs and then adds those URLs to an array, all via a regular expression pattern matching procedure. Clearly more than one person will update a with a specific URL so I do not want to list duplicates, and I want to count the number of times a particular URL is mentioned in, say, 100 tweets.
Now I have a List<String> which I can sort so that all duplicate URLs are next to each other. I was under the impression that I could compare list[i] with list[i+1] and if they match, for a counter to be added to (count++), and if they don't match, then for the URL and the count value to be added to a new array, assuming that this is the end of the duplicates.
This would remove duplicates and give me a count of the number of occurrences for each URL. At the moment, what I have is not working, and I do not know why (like I say, I am not very experienced with it all).
With the code below, assume that a JSON feed has been searched for using a keyword into srchResponse.results. The results with URLs in them get added to sList, a string List type, which contains only the URLs, not the message as a whole.
I want to put one of each URL (no duplicates), a count integer (to string) for the number of occurrences of a URL, and the username, message, and user image URL all into my jagged array called 'urls[100][]'. I have made the array 100 rows long to make sure everything can fit but generally, this is too big. Each 'row' will have 5 elements in them.
The debugger gets stuck on the line: if (sList[i] == sList[i + 1]) which is the crux of my idea, so clearly the logic is not working. Any suggestions or anything will be seriously appreciated!
Here is sample code:
var sList = new ArrayList();
string[][] urls = new string[100][];
int ctr = 0;
int j = 1;
foreach (Result res in srchResponse.results)
{
string content = res.text;
string pattern = #"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)";
MatchCollection matches = Regex.Matches(content, pattern);
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
sList.Add(groups[0].Value.ToString());
}
}
sList.Sort();
foreach (Result res in srchResponse.results)
{
for (int i = 0; i < 100; i++)
{
if (sList[i] == sList[i + 1])
{
j++;
}
else
{
urls[ctr][0] = sList[i].ToString();
urls[ctr][1] = j.ToString();
urls[ctr][2] = res.text;
urls[ctr][3] = res.from_user;
urls[ctr][4] = res.profile_image_url;
ctr++;
j = 1;
}
}
}
The code then goes on to add each result into a StringBuilder method with the HTML.
Is now edite

The description of your algorithm seems fine. I don't know what's wrong with the implementation; I haven't read it that carefully. (The fact that you are using an ArrayList is an immediate red flag; why aren't you using a more strongly typed generic collection?)
However, I have a suggestion. This is exactly the sort of problem that LINQ was intended to solve. Instead of writing all that error-prone code yourself, just describe the transformation you're interested in, and let the compiler work it out for you.
Suppose you have a list of strings and you wish to determine the number of occurrences of each:
var notes = new []{ "Do", "Fa", "La", "So", "Mi", "Do", "Re" };
var counts = from note in notes
group note by note into g
select new { Note = g.Key, Count = g.Count() }
foreach(var count in counts)
Console.WriteLine("Note {0} occurs {1} times.", count.Note, count.Count);
Which I hope you agree is much easier to read than all that array logic you wrote. And of course, now you have your sequence of unique items; you have a sequence of counts, and each count contains a unique Note.

I'd recommend using a more sophisticated data structure than an array. A Set will guarantee that you have no duplicates.
Looks like C# collections doesn't include a Set, but there are 3rd party implementations available, like this one.

Your loop fails because when i == 99, (i + 1) == 100 which is outside the bounds of your array.
But as other have pointed out, .Net 3.5 has ways of doing what you want more elegantly.

If you don't need to know how many duplicates a specific entry has you could do the following:
LINQ Extension Methods
.Count()
.Distinct()
.Count()

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Loading data from text file into a dictionary - c#

You need to actually get less than the length of total line: line.Substring(4, line.Length - 4) //subtract the chars which you're skipping Your string: ABC Abbey something Length = 19 Start = 4 Remaining chars = 19 - 4 = 15 //and you are expecting 19, that is the error

Related

Read multiple lines from a large file in non-ascending order

How can I implement the Viterbi algorithm in C# to split conjoined words?

How to improve performance of this algorithm?

How to get String Line number in Foreach loop from reading array?

Counting occurrences of a string in an array and then removing duplicates

Categories

Resources