I have a large text file (~100MB) whose lines I keep in a list of strings.
My WinForms application occasionally needs to show a part of it, for example 500,000 lines.
I have tried using a ListBox, a RichTextBox and a TextBox, but the drawing takes too much time.
For example, a TextBox takes 25 seconds to show 500,000 lines,
whereas Notepad opens a text file of this size immediately.
What would be the fastest solution for this purpose?
Why not open a file stream and just read the first few lines? You can use seek as the user scrolls and display the appropriate lines. The point is that reading the whole file into memory takes too long, so don't do that!
Starter Code
The following is a short code snippet that isn't complete but it should at least get you started:
// estimate the average line length in bytes somehow:
int averageLineLengthBytes = 100;
// also need to store the current scroll location in "lines"
int currentScroll = 0;
using (var reader = new StreamReader(new FileStream(fileName, FileMode.Open, FileAccess.Read)))
{
    if (reader.BaseStream.CanSeek)
    {
        // seek to the approximate location to read:
        reader.BaseStream.Seek(averageLineLengthBytes * currentScroll, SeekOrigin.Begin);
        // read the next few lines using this command
        // (note: the first ReadLine after a seek will usually return a partial line)
        reader.ReadLine();
    }
    else
    {
        // revert to a slower implementation here!
    }
}
The biggest trick is going to be estimating how long the scroll bar needs to be (how many lines are in the file). For that you are going to have to either alter the scroll bar as the user scrolls or you can use prior knowledge of how long typical lines are in this file and estimate the length based on the total number of bytes. Either way, hope this helps!
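For instance, a first guess at the line count (and therefore the scroll bar range) might come from the file size; a sketch, reusing the averageLineLengthBytes guess from above:

long fileSizeBytes = new FileInfo(fileName).Length;
long estimatedLineCount = fileSizeBytes / averageLineLengthBytes;
// size the scroll bar from estimatedLineCount and refine it as the user scrolls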
A Note About Virtual Mode
Virtual mode is a way of using a ListBox or similar list control that loads items on an as-needed basis. The control executes a callback to retrieve items by index as the user scrolls within the control. This is a viable solution only if your data meets the following criteria:
You must know (up front) the number of data items that you wish to present. If you need to read the entire file to get this total, it isn't going to work for you!
You must be able to retrieve a specific data item based on an index for that item without reading the entire file.
You must be willing to present the data in an icon, small icon, details, or other supported format (or be willing to go to a ton of extra work to write a custom list view).
If you cannot meet these criteria, then virtual mode is not going to be particularly helpful. The answer I presented with seek will work regardless of whether or not you can perform these actions. Of course, if you can meet these minimum criteria, then by all means - look up virtual mode for list views and you should find some really useful information!
ListView has a VirtualMode property. It allows you to load only the data that is in view, using the RetrieveVirtualItem event. So when that event is triggered for, say, item number 40,000, you would perform a seek on the file and read in the line.
You can also find an example of a virtual list box from Microsoft. It's really old, but it gives you the basic idea.
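For illustration, the wiring might look roughly like this (a sketch only; lineOffsets is an assumed List<long> of byte offsets for each line, built once by scanning the file):

// assumes: using System.IO; using System.Windows.Forms;
listView1.VirtualMode = true;
listView1.VirtualListSize = lineOffsets.Count;
listView1.RetrieveVirtualItem += (sender, e) =>
{
    using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    using (var reader = new StreamReader(stream))
    {
        // jump straight to the start of the requested line and read it
        stream.Seek(lineOffsets[e.ItemIndex], SeekOrigin.Begin);
        e.Item = new ListViewItem(reader.ReadLine());
    }
};

In practice you would probably keep one reader open instead of reopening the file for every item, but this shows the shape of the callback.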
Related
I did check whether any existing questions matched mine, but I didn't see any; if I missed one, my mistake.
I have two text files to compare against each other. One is a temporary log file that is overwritten sometimes, and the other is a permanent log which collects and appends the contents of the temp log into one file (it picks up the lines added to the temp log since it last checked and appends them to the end of the complete log). After a point, though, the complete log may become quite large and therefore inefficient to compare against, so I have been thinking about different ways to approach this.
My first idea is to "buffer" the temp log's lines (it will normally be the smaller of the two) into a list and simply loop through the archive log, doing something like:
List<string> bufferedlines = new List<string>(File.ReadLines(tempPath));
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
    string line;
    while ((line = ArchiveStream.ReadLine()) != null)
    {
        if (bufferedlines.Contains(line))
        {
            // this archive line also appears in the temp log
        }
    }
}
Now there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time; if you can, that might make things easier), then open a write stream in append mode and write the list to the file. Alternatively, cutting out the buffering of the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of was limited by my knowledge of whether it is even possible: rather than buffer either file, compare the streams side by side as they are read and append lines on the fly. Something like:
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
    using (StreamReader templogStream = new StreamReader(tempPath))
    {
        if (!(ArchiveStream.ReadAllLines.Contains(templogStream.ReadLine())))
        {
            // write the line to the file
        }
    }
}
As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask and see if anyone has insight into how this might properly be implemented, and whether there is a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(TempPath)
    .Except(File.ReadLines(ArchivePath))
    .ToList(); // can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines of the archive file into memory, because that's just how Except is implemented: it creates a HashSet of the second sequence's items so that it can efficiently find matches in the first sequence.
Presumably the number of lines that need to be added is pretty small, so the fact that the lines we find all need to be stored in memory isn't a problem. If there could potentially be a lot of them, you'd want to write them out to another file besides the first one (possibly concatenating the two files together when done, if needed).
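If that situation comes up, the streaming variant might look like this (a sketch; newLines.log is just a placeholder name for the side file):

var archived = new HashSet<string>(File.ReadLines(ArchivePath));
using (var writer = new StreamWriter("newLines.log", append: true))
{
    foreach (var line in File.ReadLines(TempPath))
    {
        // only lines not already in the archive get written out
        if (!archived.Contains(line))
            writer.WriteLine(line);
    }
}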
I need to build an index for a very big (50GB+) ASCII text file that will enable fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where the map[i][j] element is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. read the whole file, populating the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].
The only problem I see is that I can't predict the total count of lines/words, so I will hit O(n) reallocations as each List grows, and I have no idea how to avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: The file will not be altered during runtime. There are no other ways the content will be retrieved beyond what I've listed.
Increasing the size of a large list is a very expensive operation, so it's better to reserve the list's capacity at the beginning.
I'd suggest using two lists. The first contains the indexes of words within the file, and the second contains indexes into the first list (the index of the first word of each line).
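A minimal sketch of that layout (the names are placeholders, and the local function stands in for whatever lookup method you would actually write):

// wordOffsets[k] = byte offset of the k-th word in the file, all lines flattened
// lineStarts[i]  = index into wordOffsets of the first word of line i
var wordOffsets = new List<long>();
var lineStarts = new List<int>();

// byte position of the j-th word of the i-th line:
long PositionOf(int i, int j) => wordOffsets[lineStarts[i] + j];

This keeps one flat List<long> instead of millions of small inner lists, which also works better with the advice above about reserving capacity up front.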
You are very likely to exceed all available RAM, and when the system starts paging GC-managed memory in and out, the program's performance will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: Memory-mapped files are effective when you need to work with huge amounts of data that don't fit in RAM. Basically, they are your only choice if your index becomes bigger than the available RAM.
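If even the index outgrows RAM, the offsets could be written to a binary file as fixed-size 8-byte values and read back through the memory-mapped file API; roughly (index.bin is a made-up file name):

// requires: using System.IO.MemoryMappedFiles;
using (var mmf = MemoryMappedFile.CreateFromFile("index.bin", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
{
    long k = 123456; // example entry number
    // each entry is 8 bytes, so entry k starts at byte k * sizeof(long)
    long wordPosition = accessor.ReadInt64(k * sizeof(long));
}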
I'm reading a 500MB+ log file, and I was wondering which of the following would be faster in terms of speed.
Currently I'm using a ListView in virtual mode. I want to be able to scroll through the ~1 million entries, and thanks to virtual mode that (hopefully) shouldn't be slow. It loads successfully; however, scrolling is a bit laggy.
QUESTION - what should the RetrieveVirtualItem handler do? The two options I'm weighing are:
(1) Store each log entry in a list, and have the RetrieveVirtualItem callback display the item at the requested index, such as list[e.ItemIndex].
(2) Store the beginning of each log entry (its position in the log file) in a list, then call a read function that reads from that position up to the escape character (one log entry). For example, the first entry would read from 0 to the escape character, the second entry would read 16-43, the third 43-60 (the sizes of the log entries all vary).
There are pros and cons of both, but I am curious to see what others think in terms of speed.
On one hand, (1) reads all the data for ~1 million entries into a list and then serves items from memory; virtual mode helps by displaying only the items that are viewable at the time (approximately 10). However, the overhead is that all of the data is in memory.
With (2), no actual log entries are stored in memory; however, it has to go back to the file, seek to a specific position and start reading there, and it has to make that call for each item.
Is there an alternative? These are the fastest ways I found in my research.
I am using a multiline text box in C# to just log some trace information. I simply use AppendText("text-goes-here\r\n") as I need to add lines.
I've let this program run for a few days (with a lot of active tracing) and I noticed it was using a lot of memory. Long story short, it appears that even with the MaxLength value set to something very small (256), the content of the text box just keeps expanding.
I thought it worked like a FIFO (throwing away the oldest text once it exceeds the MaxLength size). It doesn't; it just keeps increasing in size. This is apparently the cause of my memory waste. Does anybody know what I'm doing wrong?
Added a few hours after initial question...
OK, I tried the suggested code below. To quickly test it, I added a timer to my app, and from that timer's tick I now call a method that does essentially the same thing as the code below. The tick rate is high so that I can observe the memory usage of the process and quickly determine whether there is a leak. There wasn't, which was good; however, when I put this in my application, memory usage did not change (still leaking). That sure seems to imply that I have a leak somewhere else :-( However, if I simply add a return at the top of that method, the usage drops back to stable. Any thoughts on this? The timer-tick-invoked code did not accumulate memory, but my real code (same method) does. The difference is that I'm calling the method from a variety of different places in the real code. Can the context of the call affect this somehow? (Note, if it isn't already obvious, I'm not a .NET expert by any means.)
TextBox will allow you to append text regardless of the MaxLength value - it's only used to limit user entry. You can create a method that adds new text after verifying that the maximum length has not been exceeded, and if it has, removes the excess from the beginning.
You could use a simple function to append text:
private int maxLength = 256;

private void AppendText(string text)
{
    textBox1.AppendText(text);

    // keep only the last maxLength characters, trimming from the front
    if (textBox1.Text.Length > maxLength)
        textBox1.Text = textBox1.Text.Substring(textBox1.Text.Length - maxLength);
}
I am writing a log of lots and lots of formatted text to a textbox in a .net windows form app.
It is slow once the data gets over a few megs. Since I am appending, the string has to be reallocated every time, right? I only need to set the value of the text box once, but in my code I am doing line += data tens of thousands of times.
Is there a faster way to do this? Maybe a different control? Is there a linked list string type I can use?
StringBuilder will not help if the text box is added to incrementally, like log output for example.
But, if the above is true and your updates are frequent enough, it may behoove you to cache some number of updates and then append them in one step (rather than appending constantly). That would save you many string reallocations... and then StringBuilder would be helpful.
Notes:
1. Create a class-scoped StringBuilder member (_sb)
2. Start a timer (or use a counter)
3. Append text updates to _sb
4. When the timer ticks or a certain count is reached, reset and append to the text box
5. Restart the process from #1
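As a rough sketch of those steps (assuming a WinForms form with a TextBox named textBox1 and a System.Windows.Forms.Timer named flushTimer - the names and wiring are made up):

// requires: using System.Text;
private readonly StringBuilder _sb = new StringBuilder();

// call this wherever the log text is produced
private void Log(string text)
{
    _sb.AppendLine(text);
}

// hooked up to flushTimer.Tick
private void flushTimer_Tick(object sender, EventArgs e)
{
    if (_sb.Length == 0) return;

    // one append per tick instead of one per log line
    textBox1.AppendText(_sb.ToString());
    _sb.Clear();
}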
No one has mentioned virtualization yet, which is really the only way to provide predictable performance for massive volumes of data. Even using a StringBuilder and converting it to a string every half a second will be very slow once the log gets large enough.
With data virtualization, you would only hold the necessary data in memory (i.e. what the user can see, and perhaps a little more on either side) whilst the rest would be stored on disk. Old data would "roll out" of memory as new data comes in to replace it.
In order to make the TextBox appear as though it has a lot of data in it, you would tell it that it does. As the user scrolls around, you would replace the data in the buffer with the relevant data from the underlying source (using random file access). So your UI would be monitoring a file, not listening for logging events.
Of course, this is all a lot more work than simply using a StringBuilder, but I thought it worth mentioning just in case.
Build your string with a StringBuilder, convert it to a String using ToString(), and assign the result to the textbox.
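For example (a sketch; logLines stands in for whatever produces your text):

var sb = new StringBuilder();
foreach (string line in logLines)
    sb.AppendLine(line);

// a single assignment instead of tens of thousands of string concatenations
textBox1.Text = sb.ToString();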
I have found that setting the textbox's WordWrap property to false greatly improves performance, as long as you're ok with having to scroll to the right to see all of your text. In my case, I wanted to paste a 20-50 MB file into a MultiLine textbox to do some processing on it. That took several minutes with WordWrap on, and just several seconds with WordWrap off.