Removing lines from a string + old entries from a Dictionary in C#

I have a Dictionary<string, string>. The first string, the key, must never change; it can't be deleted or anything. But I keep adding lines and lines to the value, creating new lines with \r\n or \r. I'm wondering what the easiest way would be to retain just the last 50 lines and delete anything beyond that. I'm doing this because when I return the value I have to put it through a char array and go through each character, which can be slow if there is too much data. Any suggestions?

Guffa's general idea is right - your data structure should reflect what you actually want, which is a list of strings rather than a single string. The concept of "the last 50 lines" is pretty obviously to do with a collection rather than a single string, even if you've originally read it that way.
However, I'd suggest using a LinkedList<T> rather than a List<T>: every time you remove the first element of a List<T>, everything else has to shuffle up. List<T> is great for giving random access and not too bad at adding to the end, but sucks for removing from the start. LinkedList<T> is great at giving you iterator access, adding to / removing from the start, and adding to / removing from the end. It's a better fit. (If you really wanted to go to town you could even write your own fixed-size circular buffer type which encapsulated the logic for you; this would give the best of both worlds, in the situation where you don't want to be able to expand beyond a certain size.)
Regarding your comments to Guffa's answer: it's pretty common to convert input into a form which is more appropriate for processing, then convert it back to the original format for output. The reason why you do it is precisely the "more appropriate" bit. You don't want to have to parse the string for line breaks as part of the "updating the dictionary" action, IMO. In particular, it sounds like you're currently introducing the idea of "lines" where the original text is just being read in as strings. You're effectively creating your own "collection" class backed by a string, by delimiting strings with line breaks. That's inefficient, error-prone, and much harder to manage than using the built-in collections. It's easy to perform the conversion to a line-break-delimited string at the end if you want it, but it sounds like you're doing it way too early.
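A minimal sketch of the LinkedList<T> idea described above, assuming a limit of 50 lines per key. The names LineHistory, AddLine and ToText are hypothetical and only serve the illustration; joining with "\r\n" at the end is one possible output format, not a requirement.

```csharp
using System;
using System.Collections.Generic;

public static class LineHistory
{
    public const int MaxLines = 50; // assumed limit, taken from the question

    // Appends a line under the given key and drops the oldest line once the limit is exceeded.
    public static void AddLine(Dictionary<string, LinkedList<string>> history, string key, string line)
    {
        LinkedList<string> lines;
        if (!history.TryGetValue(key, out lines))
        {
            lines = new LinkedList<string>();
            history.Add(key, lines);
        }

        lines.AddLast(line);
        if (lines.Count > MaxLines)
        {
            lines.RemoveFirst(); // O(1) on a linked list, no shuffling of the remaining items
        }
    }

    // Converts back to a single line-break-delimited string only when the output is needed.
    public static string ToText(LinkedList<string> lines)
    {
        return string.Join("\r\n", lines);
    }
}
```

The key is never removed or replaced here; only the contents of its list change.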

Instead of concatenating the lines, use a Dictionary<string, List<string>>. When you are about to add a string to the list you can check the count and remove the first string if the list already has 50 strings:
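List<string> list;
if (!theDictionary.TryGetValue(key, out list)) {
    // First line for this key: create the list and store it under the key.
    theDictionary.Add(key, list = new List<string>());
}
if (list.Count == 50) {
    // Drop the oldest line so the list never holds more than 50 entries.
    list.RemoveAt(0);
}
list.Add(line);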

Related

Call Length Property on Returned Array in Chained String/LINQ Methods of C#

I found this post on selecting a range from an array, and have to use the LINQ option:
Selecting a range of items inside an array in C#
Ultimately, I'm trying to get the last four lines from some text file. After I've read in the lines and cleaned them of unwanted characters and empty lines, I have an array with all of the lines. I'm using the following to do so:
string[] allLines = GetEachLine(results);
string[] lastFourLines = allLines.Skip(allLines.Length - 4).Take(4).ToArray();
This works fine, but I'm wondering if I could somehow skip assigning to the allLines variable altogether. Such as:
string[] lastFourLines = GetEachLine(results).Skip(returnedArrayLength - 4).Take(4).ToArray();
It would be better to change GetEachLine and code preceding it (however results is computed) to use IEnumerable<T> and avoid using an array to read the entire file in memory for the last four lines (unless you use all of results for something else) - consider using File.ReadLines.
However, if you are using .Net Core 2.0 or greater, you can use Enumerable.TakeLast to efficiently return the last four lines:
var lastFourLines = GetEachLine(results).TakeLast(4);
If GetEachLine() returns string[], that should work fine, though null checking may be needed.
As you chain more you may want to use line breaks to increase readability:
string[] lastFourLines = GetEachLine(results)
    .Skip(allLines.Length - 4)
    .Take(4)
    .ToArray();
allLines.Length won't exist unless you still have the first line of code from your question. You can avoid calling GetEachLine() twice by using TakeLast():
string[] lastFourLines = GetEachLine(results)
    .TakeLast(4)
    .ToArray();
If you are looking to efficiently retrieve the last N (filtered) lines of a large file, you really need to start at the point where you are reading the file contents.
Consider a 1GB log file containing 10M records, where you only want the last few lines. Ideally, you would want to start by reading the last couple KB and then start extracting lines by searching for line breaks from the end, extracting each line and returning them in an iterator yield. If you run out of data, read the preceding block. Continue only as long as the consumer requests more values from the iterator.
Offhand, I don't know a built-in way to do this, and coding this from scratch could get pretty involved. Luckily, a search turned up this similar question having a highly rated answer.
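As a rough illustration of the block-wise approach described above (not the linked answer's code), here is a sketch that reads a file from the end and yields lines lazily, last line first. It assumes UTF-8 or ASCII text without a byte order mark and "\n" or "\r\n" line endings; the names ReverseLineReader and ReadLinesBackwards and the default block size are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ReverseLineReader
{
    // Lazily yields the lines of a file from last to first, reading fixed-size blocks from the end.
    public static IEnumerable<string> ReadLinesBackwards(string path, int blockSize = 4096)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var pending = new List<byte>(); // bytes of the line being assembled, last byte first
            var block = new byte[blockSize];
            long position = stream.Length;
            bool atEnd = true;              // true until the first byte has been examined

            while (position > 0)
            {
                // Read the block that ends at the current position.
                int toRead = (int)Math.Min(blockSize, position);
                position -= toRead;
                stream.Seek(position, SeekOrigin.Begin);
                int read = 0;
                while (read < toRead)
                {
                    int n = stream.Read(block, read, toRead - read);
                    if (n == 0) throw new EndOfStreamException();
                    read += n;
                }

                // Scan the block backwards, emitting a line at every line break.
                for (int i = toRead - 1; i >= 0; i--)
                {
                    if (block[i] == (byte)'\n')
                    {
                        // A newline at the very end of the file terminates the last line
                        // rather than starting an empty one, so it is skipped.
                        if (!atEnd)
                        {
                            yield return Decode(pending);
                            pending.Clear();
                        }
                    }
                    else
                    {
                        pending.Add(block[i]);
                    }
                    atEnd = false;
                }
            }

            if (!atEnd)
            {
                yield return Decode(pending); // the first line of the file
            }
        }
    }

    // Restores byte order, drops a trailing '\r' and decodes the line as UTF-8.
    private static string Decode(List<byte> reversed)
    {
        byte[] bytes = reversed.ToArray();
        Array.Reverse(bytes);
        int length = bytes.Length;
        if (length > 0 && bytes[length - 1] == (byte)'\r') length--;
        return Encoding.UTF8.GetString(bytes, 0, length);
    }
}
```

Because the iterator is lazy, a consumer that takes only four lines causes only as many blocks to be read as those four lines span. Splitting on the byte 0x0A is safe in UTF-8 because that value never occurs inside a multi-byte sequence; other encodings would need different handling.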

To find out the number of occurrences of words in a file

I came across this question in an interview:
We have to find the number of occurrences of two given words in a text file with <= n words between them.
Example1:
text:`this is first string this is second string`
Keywords:`this, string`
n= 4
output= 2
"this is first string" is the first occurrence, and the number of words between this and string is 2 (is, first), which is less than 4.
"this is second string" is the remaining string. The number of words between this and string is 2 (is, second), which is less than 4.
Therefore the answer is 2.
I have thought that I will use
Dictionary<string, List<int>>.
My idea was that I use the dictionary and get the list of places where the particular word is repeated and then iterate through both the lists, increment the count if a condition is met and then display the count.
Is my thinking process correct? Please provide any suggestions to improve my solution.
Thanks,
Not an answer per se (as quite honestly, I don't understand the question :P), but to add some general interview advice to the other answers:
In interviews the interviewer is always looking for the thought process and that you are a critical, logical thinker. Not necessarily that you have excellent coding recall and can compile code in your brain.
In addition interviews are a stressful process. By slowing down and talking out loud as you work things out you not only look like a better communicator and logical thinker (even if getting the question wrong), you also give yourself time to think.
Use a pen and paper, speak as you think, start off from the top and work through it. I've got jobs even if I didn't know the answers to tech questions by demonstrating that I can at least try to work things out ;-)
In short, it's not just down to technical prowess
I think it depends on whether the call is made only once or multiple times per string. If it's something like
int getOccurences(String str, String reference, int min_size) { ... }
then you don't really need the dictionary, not even a list. You can just iterate through the string to find occurrences of words and then check the number of separators between them.
If on the other hand the problem is for arbitrary search/indexing, IMHO you do need a dictionary. I'd go for a dictionary where the key is the word and the value is a list of indexes where it occurs.
HTH
If you need to do that repeatedly for different pairs of words in the same text, then a word dictionary with a list of indexes is a good solution. However, if you were only looking for one pair, then two lists of indexes for those two words would be sufficient.
The lists allow you to separate the word detection operation from the counting logic.
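A sketch of the dictionary-of-indexes idea, under one possible reading of the question: the first keyword must come before the second, each counted pair uses the nearest following occurrence of the second keyword, and counted pairs do not overlap. That reading is an assumption chosen so the example in the question yields 2. The names WordPairCounter, BuildIndex and CountPairs are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class WordPairCounter
{
    // Maps each word to the ascending list of positions at which it occurs.
    public static Dictionary<string, List<int>> BuildIndex(string text)
    {
        var index = new Dictionary<string, List<int>>();
        string[] words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i++)
        {
            List<int> positions;
            if (!index.TryGetValue(words[i], out positions))
            {
                positions = new List<int>();
                index.Add(words[i], positions);
            }
            positions.Add(i);
        }
        return index;
    }

    // Counts non-overlapping pairs (first ... second) with at most maxBetween words in between.
    public static int CountPairs(Dictionary<string, List<int>> index, string first, string second, int maxBetween)
    {
        List<int> starts, ends;
        if (!index.TryGetValue(first, out starts) || !index.TryGetValue(second, out ends)) return 0;

        int count = 0;
        int j = 0;        // cursor into the positions of the second word
        int lastEnd = -1; // position where the previously counted pair ended
        foreach (int start in starts)
        {
            if (start <= lastEnd) continue;                 // inside an already counted pair
            while (j < ends.Count && ends[j] <= start) j++; // nearest second word after start
            if (j == ends.Count) break;
            if (ends[j] - start - 1 <= maxBetween)
            {
                count++;
                lastEnd = ends[j];
                j++;
            }
        }
        return count;
    }
}
```

Building the index once keeps the word detection separate from the counting, so CountPairs can be called repeatedly for different keyword pairs on the same text.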

PeekRange on a stack in C#?

I have a program that needs to store data values and periodically get the last 'x' data values.
I initially thought a stack was the way to go, but I need to be able to see more than just the top value: something like a PeekRange method where I can peek the last 'x' number of values.
At the moment I'm just using a list and get the last, say, 20 values like this:
var last20 = myList.Skip(myList.Count - 20).ToList();
The list grows all the time the program runs, but I only ever want the last 20 values. Could someone give some advice on a better data structure?
I'd probably use a ring buffer. It's not hard to implement one on your own; AFAIK there's no implementation provided by the Framework.
Well since you mentioned the stack, I guess you only need modifications at the end of the list?
In that case the list is actually a nice solution (cache efficient and with fast insertion/removal at the end). However your way of extracting the last few items is somewhat inefficient, because IEnumerable<T> won't expose the random access provided by the List. So the Skip()-Implementation has to scan the whole List until it reaches the end (or do a runtime type check first to detect that the container implements IList<T>). It is more efficient, to either access the items directly by index, or (if you need a second array) to use List<T>.CopyTo().
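As a small sketch of the index-based alternatives mentioned above, the following uses List<T>.GetRange and the four-argument List<T>.CopyTo overload. The class and method names are hypothetical, and count is assumed to be non-negative.

```csharp
using System;
using System.Collections.Generic;

public static class ListTail
{
    // Returns a copy of the last `count` items (fewer if the list is shorter), oldest first.
    public static List<T> LastItems<T>(List<T> list, int count)
    {
        int n = Math.Min(count, list.Count);
        return list.GetRange(list.Count - n, n);
    }

    // Same idea with CopyTo, for when an array is wanted.
    public static T[] LastItemsAsArray<T>(List<T> list, int count)
    {
        int n = Math.Min(count, list.Count);
        var result = new T[n];
        list.CopyTo(list.Count - n, result, 0, n);
        return result;
    }
}
```

Both methods copy only the requested tail and never enumerate the items before it.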
If you need fast removal/insertion at the beginning, you may want to consider a ring buffer or a (doubly) linked list (see LinkedList<T>). The linked list will be less cache-efficient, but it is easy and efficient to navigate and alter from both directions. The ring buffer is a bit harder to implement, but will be more cache- and space-efficient. So it's probably better if only small value types or reference types are stored, especially when the buffer's size is fixed.
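A minimal fixed-size ring buffer along these lines might look as follows. This is a sketch, not a Framework type; the class name RingBuffer and the PeekRange method (named after the question's wish) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class RingBuffer<T>
{
    private readonly T[] _items;
    private int _next;  // index at which the next item will be written
    private int _count; // number of valid items, at most the capacity

    public RingBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException("capacity");
        _items = new T[capacity];
    }

    public int Count { get { return _count; } }

    // Overwrites the oldest item once the buffer is full.
    public void Add(T item)
    {
        _items[_next] = item;
        _next = (_next + 1) % _items.Length;
        if (_count < _items.Length) _count++;
    }

    // Yields up to `count` items, most recent first, without removing them.
    public IEnumerable<T> PeekRange(int count)
    {
        int n = Math.Min(count, _count);
        for (int i = 1; i <= n; i++)
        {
            int index = (_next - i + _items.Length) % _items.Length;
            yield return _items[index];
        }
    }
}
```

With a capacity of 20, memory use stays constant no matter how long the program runs, and adding an item never moves existing ones.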
You could just call RemoveAt(0) after each Add (if the list is longer than 20), so the list will never be longer than 20 items.
You said stack, but you also said you only ever want the last 20 items. I don't think these two requirements really go together.
I would say that Johannes is right about a ring buffer. It is VERY easy to implement this yourself in .NET; just use a Queue<T> and once you reach your capacity (20) start dequeuing (popping) on every enqueue (push).
If you want your PeekRange to enumerate from the most recent to least recent, you can define GetEnumerator to do something like return _queue.Reverse().GetEnumerator();
Woops, .Take() won't do it.
Here's an implementation of .TakeLast()
http://www.codeproject.com/Articles/119666/LINQ-Introducing-The-Take-Last-Operators.aspx
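For frameworks without a built-in TakeLast, a queue-based version can be sketched as below. This is not the code from the linked article; the method is deliberately named TakeLastItems (a hypothetical name) so it does not clash with Enumerable.TakeLast where that exists.

```csharp
using System;
using System.Collections.Generic;

public static class TakeLastExtensions
{
    // Returns the last `count` items of the sequence, oldest first.
    // The source is enumerated once and at most `count` items are held at any time.
    public static IEnumerable<T> TakeLastItems<T>(this IEnumerable<T> source, int count)
    {
        if (source == null) throw new ArgumentNullException("source");

        var window = new Queue<T>();
        if (count <= 0) return window;

        foreach (T item in source)
        {
            if (window.Count == count) window.Dequeue(); // drop the oldest item
            window.Enqueue(item);
        }
        return window;
    }
}
```

This is the same "dequeue on every enqueue once full" idea as the Queue<T> suggestion above, applied to an arbitrary sequence.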

c# Save Multidim ArrayList to Properties.Settings

Is it possible to save a multidimensional ArrayList to the Properties.Settings? I get errors when I try to.
If not, is there any other way I can save some sort of multidimensional thingie to the Properties.Settings?
Ah, I remember fighting with Properties.Settings... never a fun battle :(. The quick and dirty option would be to use different separator characters to do a string "serialization" (in quotes because a real serialization would be much better about handling edge cases etc.). Something like this:
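int[][] myArray = GetArrayFromElsewhere();
string stringVersion = string.Join(";", myArray.Select(subArray => string.Join(",", subArray)));
// StringVersion is assumed to be a user-scoped string setting defined in the settings designer.
Properties.Settings.Default.StringVersion = stringVersion;
Properties.Settings.Default.Save();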
You could also use esoteric Unicode characters instead of ; and ,, so as to avoid accidentally splitting a string, and you could use a loop (or probably recursion) to generalize this to any number of dimensions.
But, this is of course a quick and dirty workaround. The real solution would be to do some sort of serialization of the multidimensional array. You might be able to get away with just some simple XmlSerializer or BinaryFormatter or even JavaScriptSerializer code---actually, I think the last of those might work really well---but if you need to get more complicated, this question discusses a similar solution for a hash table, with lots of gory details.
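To show the quick and dirty format as a round trip, here is a sketch that also parses the string back into a jagged array. It assumes int values and the same ";" and "," separators; the class and method names are hypothetical, and an empty outer array does not survive the round trip (it comes back as one empty row).

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class JaggedArrayText
{
    // Rows are separated by ';' and cells by ','.
    public static string Serialize(int[][] data)
    {
        return string.Join(";", data.Select(row =>
            string.Join(",", row.Select(cell => cell.ToString(CultureInfo.InvariantCulture)))));
    }

    // Reverses Serialize; an empty row is read back as an empty array.
    public static int[][] Deserialize(string text)
    {
        return text.Split(';')
                   .Select(row => row.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                                     .Select(cell => int.Parse(cell, CultureInfo.InvariantCulture))
                                     .ToArray())
                   .ToArray();
    }
}
```

The resulting string can be stored in an ordinary string setting and passed to Deserialize when the settings are loaded.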

C# Datatype for large sorted collection with position?

I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own CSV file. My little C# console application loads up the two text/CSV files, compares them for differences and saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .compare() on the ArrayList as each line is read from the second CSV file. It then saves the records that don't match.
The application works but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know a datatype in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and both files have the same number of records, you could skip the data structure entirely and do in-place analysis.
// Verbatim strings (@"...") keep the backslashes literal; "C:\file1.csv" would contain a form feed (\f).
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // Do your comparison; plain string inequality is used here as a placeholder.
    bool areDifferent = lineOne != lineTwo;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}
one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.
That being said, you could likely just load a couple of byte[] arrays from a FileStream and do a byte comparison. Don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences and you want speed, I would reckon byte differences are where it's at.
This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
StreamWriter differences = new StreamWriter("Output.csv");
// ReadLine returns null once a file is exhausted.
String lineOne = one.ReadLine();
String lineTwo = two.ReadLine();
while (lineOne != null || lineTwo != null)
{
    // Strings cannot be compared with < in C#. CompareOrdinal is used here, which
    // assumes both files are sorted in ordinal order; use a comparison that matches
    // the sort order of the SQL query if it differs.
    int comparison;
    if (lineOne == null) comparison = 1;        // only file two has lines left
    else if (lineTwo == null) comparison = -1;  // only file one has lines left
    else comparison = String.CompareOrdinal(lineOne, lineTwo);

    if (comparison == 0)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
    }
    else if (comparison < 0)
    {
        // lineOne is missing from file two
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    else
    {
        // lineTwo is missing from file one
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
one.Close();
two.Close();
differences.Close();
Standard caveat about code written off the top of my head applies, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.
Maybe I misunderstand, but the ArrayList will maintain its elements in the same order by which you added them. This means you can compare the two ArrayLists within one pass only - just increment the two scanning indices according to the comparison results.
One question I have is have you considered "out-sourcing" your comparison. There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite got your problem specified well enough to be answered. First off, it depends what kind of differences you want to track. Are you wanting the differences to be output like in WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone to give you a suitable answer to your problem.
If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, and so on, I would recommend a technique that does not store so much in memory. You could read from two file streams, as a previous commenter posted, and write out your results "in real time" as you find them; this would not explicitly store anything in memory. You could also load chunks of each file into memory, say one thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1 I'd recommend looking into creating a hash of each line. That way you can compare hashes quick and easy using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, with the itemId as the key and the rest of the line as the value. You can then quickly check whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
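A sketch of the keyed comparison, assuming the itemId is everything before the first comma of each line (real CSV with quoted fields would need a proper parser). The names KeyedCompare, IdOf and Compare, and the report labels, are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class KeyedCompare
{
    // Treats everything before the first comma as the record's id.
    private static string IdOf(string line)
    {
        int comma = line.IndexOf(',');
        return comma < 0 ? line : line.Substring(0, comma);
    }

    // Reports records that changed or that exist in only one of the two inputs.
    public static List<string> Compare(IEnumerable<string> fileOne, IEnumerable<string> fileTwo)
    {
        // Index the first file by id; a repeated id overwrites the earlier record.
        var first = new Dictionary<string, string>();
        foreach (string line in fileOne)
        {
            first[IdOf(line)] = line;
        }

        var report = new List<string>();
        foreach (string line in fileTwo)
        {
            string id = IdOf(line);
            string original;
            if (!first.TryGetValue(id, out original))
            {
                report.Add("ONLY IN FILE 2: " + line);
            }
            else
            {
                if (original != line) report.Add("CHANGED: " + line);
                first.Remove(id); // matched, so it is not a leftover
            }
        }

        // Whatever was never matched exists only in the first file.
        foreach (KeyValuePair<string, string> leftover in first)
        {
            report.Add("ONLY IN FILE 1: " + leftover.Value);
        }
        return report;
    }
}
```

Only the first file is held in memory; the second can be streamed, for example by passing File.ReadLines(path) as fileTwo. Unlike the merge approach, this does not rely on the files being sorted.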
