How to read a huge log file while remaining performant? - c#

I'm reading a 500 MB+ log file, and I'm wondering which of the two approaches below would be faster.
I want to be able to scroll through the ~1 million entries, and virtual mode should (hopefully) keep that from being slow. The file loads successfully, but scrolling lags a bit.
Currently I'm using a ListView in virtual mode.
The question is what to do in the RetrieveVirtualItem handler:
(1) Store every log entry in a list up front, and have the retrieve-item call display the requested entry straight from that list, e.g. list[e.ItemIndex].
(2) Store only the starting position of each log entry (its offset in the log file) in a list, then call a read function that reads from that position up to the entry terminator to get a single entry. For example, the first entry would be read from offset 0 up to the terminator (say position 16), the second from 16 to 43, the third from 43 to 60 (the entries all vary in size).
There are pros and cons to both, but I'm curious what others think purely in terms of speed.
With (1), all ~1 million entries are read into a list, and items are then served from memory; virtual mode means only the items visible at the time (roughly 10) are displayed. The overhead is that the entire data set sits in memory.
With (2), no actual log entries are kept in memory, but every displayed item requires going back to the file, seeking to a specific position and reading from there.
Is there an alternative? These are the fastest approaches my research turned up.
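For reference, option (1) amounts to wiring the ListView's virtual mode to an in-memory list, roughly like this (a minimal sketch; the form and control names are illustrative):

using System.Collections.Generic;
using System.Windows.Forms;

public partial class InMemoryLogForm : Form   // hypothetical form hosting listView1
{
    private List<string> entries = new List<string>();   // (1): every entry pre-loaded into memory

    private void WireUpVirtualMode()
    {
        listView1.VirtualMode = true;
        listView1.VirtualListSize = entries.Count;        // must be known up front
        listView1.RetrieveVirtualItem += listView1_RetrieveVirtualItem;
    }

    // Called only for the rows currently visible (roughly 10 at a time).
    private void listView1_RetrieveVirtualItem(object sender, RetrieveVirtualItemEventArgs e)
    {
        e.Item = new ListViewItem(entries[e.ItemIndex]);
    }
}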

Related

Fastest way to draw a large text file in C# winforms

I have a large text file (~100 MB) whose lines I keep in a list of strings.
My WinForms app occasionally needs to show part of it, for example 500,000 lines.
I have tried using a ListBox, a RichTextBox and a TextBox, but the drawing takes too much time.
For example, a TextBox takes 25 seconds to show 500,000 lines,
whereas Notepad opens a text file of this size immediately.
What would be the fastest solution for this purpose?
Why not open a file stream and just read the first few lines? You can seek as the user scrolls in the file and display the appropriate lines. The point is: reading the whole file into memory takes too long, so don't do that!
Starter Code
The following is a short code snippet that isn't complete but it should at least get you started:
// estimate the average line length in bytes somehow:
int averageLineLengthBytes = 100;
// also need to track the current scroll location in "lines":
int currentScroll = 0;

using (var reader = new StreamReader(new FileStream(fileName, FileMode.Open, FileAccess.Read)))
{
    if (reader.BaseStream.CanSeek)
    {
        // seek to the approximate location to read
        // (this can land mid-line, so the first ReadLine may return a partial line):
        reader.BaseStream.Seek((long)averageLineLengthBytes * currentScroll, SeekOrigin.Begin);
        // read the next few lines using this command:
        reader.ReadLine();
    }
    else
    {
        // revert to a slower implementation here!
    }
}
The biggest trick is going to be estimating how long the scroll bar needs to be (how many lines are in the file). For that you are going to have to either adjust the scroll bar as the user scrolls, or use prior knowledge of how long typical lines in this file are and estimate the line count from the total number of bytes. Either way, hope this helps!
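For example, reusing fileName and averageLineLengthBytes from the snippet above, a rough line-count estimate for sizing the scroll bar could be:

// Rough, assumption-based estimate of the number of lines in the file.
long fileLengthBytes = new FileInfo(fileName).Length;
int estimatedLineCount = (int)(fileLengthBytes / averageLineLengthBytes);
// "verticalScrollBar" is a placeholder for whatever scroll control is in use.
verticalScrollBar.Maximum = Math.Max(0, estimatedLineCount - 1);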
A Note About Virtual Mode
Virtual mode is a method of using a ListBox or similar list control to load the items on an as needed basis. The control will execute a callback to retrieve the items based on an index when the user scrolls within the control. This is a viable solution only if your data meets the following criteria:
You must know (up front) the number of data items that you wish to present. If you need to read the entire file to get this total, it isn't going to work for you!
You must be able to retrieve a specific data item based on an index for that item without reading the entire file.
You must be willing to present the data in an icon, small details, details or other supported format (or be willing to go to a ton of extra work to write a custom list view).
If you cannot meet these criteria, then virtual mode is not going to be particularly helpful. The answer I presented with seek will work regardless of whether or not you can perform these actions. Of course, if you can meet these minimum criteria, then by all means - look up virtual mode for list views and you should find some really useful information!
ListView has a VirtualMode property. It allows you to load only the data that is in view, using the RetrieveVirtualItem event. So when that event is triggered for item number 40,000, for example, you would seek into the file and read in that line.
You can also find an example of a virtual list box on Microsoft's site. It's really old, but it gives you the basic idea.
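Putting the seek idea and virtual mode together, a sketch of option (2) from the question might look like this (it assumes one newline-terminated entry per line and a file that ends with a newline; names and paths are illustrative):

using System.Collections.Generic;
using System.IO;
using System.Windows.Forms;

public partial class SeekingLogForm : Form   // hypothetical form; listView1 has VirtualMode = true
{
    private readonly List<long> entryOffsets = new List<long>();
    private readonly string logFilePath = @"C:\logs\big.log";   // placeholder path

    // One pass over the file to record where each entry starts.
    // Assumes one entry per line and a file that ends with a newline.
    private void BuildOffsetIndex()
    {
        using (var stream = new FileStream(logFilePath, FileMode.Open, FileAccess.Read))
        {
            entryOffsets.Add(0);
            long position = 0;
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b == '\n') entryOffsets.Add(position);   // the next entry starts here
            }
        }
        listView1.VirtualListSize = entryOffsets.Count - 1;
    }

    // Wired to listView1.RetrieveVirtualItem; called only for the rows currently visible.
    private void listView1_RetrieveVirtualItem(object sender, RetrieveVirtualItemEventArgs e)
    {
        using (var stream = new FileStream(logFilePath, FileMode.Open, FileAccess.Read))
        {
            stream.Seek(entryOffsets[e.ItemIndex], SeekOrigin.Begin);
            using (var reader = new StreamReader(stream))
            {
                e.Item = new ListViewItem(reader.ReadLine() ?? string.Empty);
            }
        }
    }
}

Reopening the FileStream for every visible row is the expensive part of this sketch; keeping one reader open for the lifetime of the form, or caching a small block of entries around the visible range, should remove most of the scroll lag.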

data structure for indexing big file

I need to build an index for a very big (50 GB+) ASCII text file which will enable me to provide fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where the map[i][j] element is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. by reading the whole file and populating the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].
The only problem I see is that I can't predict the total count of lines/words, so I will hit an O(n) copy on every List reallocation, and I have no idea how to avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: The file will not be altered at runtime. There are no other ways of retrieving content besides the ones I've listed.
Increasing the size of a large list is a very expensive operation, so it's better to reserve the list's capacity at the beginning.
I'd suggest using two lists. The first contains the offsets of the words within the file, and the second contains, for each line, the index in the first list of that line's first word.
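A minimal sketch of that two-list layout (the names, whitespace word splitting and skipping of empty lines are my assumptions, not the answer's):

using System.Collections.Generic;
using System.IO;

class WordIndexBuilder
{
    // Builds the two lists described above for an ASCII file.
    public static void Build(string path,
                             out List<long> wordPositions,     // byte offset of every word
                             out List<int> lineStartIndexes)   // index into wordPositions of each line's first word
    {
        wordPositions = new List<long>();
        lineStartIndexes = new List<int>();

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long position = 0;
            bool inWord = false, atLineStart = true;
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                char c = (char)b;                       // single-byte ASCII, per the question
                if (c == '\n') { atLineStart = true; inWord = false; }
                else if (char.IsWhiteSpace(c)) { inWord = false; }
                else if (!inWord)
                {
                    if (atLineStart) { lineStartIndexes.Add(wordPositions.Count); atLineStart = false; }
                    wordPositions.Add(position);        // a new word starts at this offset
                    inWord = true;
                }
                position++;
            }
        }
    }
}

// Byte offset of the j-th word of the i-th line (both zero-based):
// long offset = wordPositions[lineStartIndexes[i] + j];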
You are very likely to exceed all available RAM, and when the system starts paging GC-managed memory in and out, the program's performance will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: Memory-mapped files are effective when you need to work with huge amounts of data that don't fit in RAM. Basically, they're your only choice if your index becomes bigger than the available RAM.
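For illustration, the sort of memory-mapped storage being suggested might look like this (a sketch using System.IO.MemoryMappedFiles from .NET 4; the file name, capacity and word number are made up):

using System.IO;
using System.IO.MemoryMappedFiles;

class MmfIndexSketch
{
    static void Main()
    {
        // Hypothetical index file holding one 8-byte offset per word,
        // so the index itself never lives in GC-managed memory.
        const long maxWords = 100000000;          // assumed upper bound on the word count
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   "wordindex.bin", FileMode.OpenOrCreate, "wordIndex", maxWords * sizeof(long)))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // while building the index: store the byte offset of word #42
            accessor.Write(42L * sizeof(long), 123456L);

            // later, during lookups: random access without loading anything into managed lists
            long offsetOfWord42 = accessor.ReadInt64(42L * sizeof(long));
        }
    }
}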

Best Way to Load a File, Manipulate the Data, and Write a New File

I have an issue where I need to load a fixed-length file, process some of the fields, generate a few others, and finally output a new file. The difficult part is that the file contains part numbers, and some of the products are superseded by other products (which can themselves be superseded). What I need to do is follow the supersession trail to get the information I need to replace some of the fields in the row I am looking at. So how can I best handle about 200,000 lines from a file and the need to move up and down within the given products? I thought about using a collection to hold the data, or a DataSet, but I just don't think that is the right way. Here is an example of what I am trying to do:
Before
Part Number    List Price    Description        Superseding Part Number
0913982                                         3852943
3852943        0006710       CARRIER,BEARING
After
Part Number    List Price    Description        Superseding Part Number
0913982        0006710       CARRIER,BEARING    3852943
3852943        0006710       CARRIER,BEARING
As usual any help would be appreciated, thanks.
Wade
Create a structure for the given fields.
Read the file and put the structures in a collection. You can use the part number as the key in a hashtable to provide the fastest searching.
Scan the collection and fix the data.
200,000 objects built from the given lines will fit easily in memory.
For example, if your structure is 50 bytes, you will need only about 10 MB of memory, which is nothing for a modern PC.
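A rough sketch of that approach; the type, field names and chain-following helper are mine, not from the question:

using System.Collections.Generic;

class PartRecord   // hypothetical structure built from the fixed-length fields
{
    public string PartNumber;
    public string ListPrice;
    public string Description;
    public string SupersedingPartNumber;   // empty when the part is current
}

static class SupersessionResolver
{
    // Follows the supersession chain to its end.
    // Assumes the chains are not circular.
    public static PartRecord ResolveCurrent(Dictionary<string, PartRecord> partsByNumber,
                                            PartRecord start)
    {
        PartRecord current = start;
        PartRecord next;
        while (!string.IsNullOrEmpty(current.SupersedingPartNumber)
               && partsByNumber.TryGetValue(current.SupersedingPartNumber, out next))
        {
            current = next;
        }
        return current;
    }
}

// While writing the output file, copy the fields from the end of the chain:
//   PartRecord resolved = SupersessionResolver.ResolveCurrent(partsByNumber, record);
//   record.ListPrice   = resolved.ListPrice;
//   record.Description = resolved.Description;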

Is it better to have more smaller records or fewer larger records in Lucene?

I'm in the process of indexing a huge set of log files for an application I work on using Lucene.net. Right now I am parsing my log files per entry (i.e. an entry can span multiple lines until the next log entry) and adding each log entry as a document in Lucene.
Each document contains the log entry (which is analyzed) and has some other fields (which are just stored), such as the log line time, the log line number and what kind of log it came from. I'm also giving each log entry document a GUID so I can map a sequence of log entries back to the original source document and reorder them by line number.
While I like the granularity of being able to search per line entry in my index (and I can rebuild the original document by hinging off the GUID I've assigned to each log file), I'm curious whether this kind of index creation will be sustainable. As it is, I already have something like 25 million entries representing logs from just a single year. My search speeds are still pretty fast; I can search these 25 million records in about a second or two.
Is it better to have fewer documents where each document is larger? Does it matter? Will I run into performance bottlenecks with Lucene when I have 50 million entries? 100 million? 500 million? If I indexed only per log file I'd probably have three orders of magnitude fewer documents, if I estimate that each log file has around 1,000-20,000 lines.
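For context, "one document per log entry" with stored metadata fields looks roughly like the following. This is a sketch assuming the Lucene.Net 3.0.3 API; the field names and method names here are illustrative, not the poster's:

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

static class LogIndexer
{
    public static IndexWriter OpenWriter(string indexPath)
    {
        var directory = FSDirectory.Open(new DirectoryInfo(indexPath));
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        // dispose the writer when indexing is done (this closes and commits the index)
        return new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Adds a single parsed log entry as its own Lucene document.
    public static void AddEntry(IndexWriter writer, string sourceGuid,
                                int lineNumber, string logType, string entryText)
    {
        var doc = new Document();
        doc.Add(new Field("sourceGuid", sourceGuid, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new NumericField("lineNumber", Field.Store.YES, false).SetIntValue(lineNumber));
        doc.Add(new Field("logType", logType, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("entry", entryText, Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }
}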
The advice with all these things is: performance will almost certainly not be your major problem. If the required functionality works best with a document per line, then do it that way.
That being said, Lucene's term dictionary looks something like:
term1 -> doc1 doc4 doc32 ...
term2 -> doc1 doc3 doc8
So having more documents will increase the size of the index.
Before you conclude that this is bad for performance, ask how you'll manage to return each line as its own search result if you do index the entire file as one document. You'll have to implement some secondary search on your search results, which is almost guaranteed to be slower than what Lucene does. So just let Lucene handle it.
As to your question about how high Lucene can scale: a patch was submitted a few years ago because the 32 bit IDs Lucene uses are too small. So there are people with indexes containing more than 2^32 = 4.2 billion documents.
RavenDB uses Lucene internally for all its querying, and perf tests have shown that fewer indexes with more fields perform better than more indexes with fewer fields.
See this thread for some actual numbers, for instance:
100 Indexes with a single property each : 00:05:08
1 Index with 100 properties : 00:02:01
This is for 25,600 docs (each having 100 string properties filled with guids).
Note that these numbers are for RavenDB, but it uses Lucene extensively, so I'd be surprised if there were a big difference when using Lucene directly.

Writing huge amounts of text to a textbox

I am writing a log of lots and lots of formatted text to a textbox in a .NET Windows Forms app.
It gets slow once the data grows past a few megabytes. Since I am appending, the string has to be reallocated every time, right? I only need to set the text box's value once, but in my code I am doing line += data tens of thousands of times.
Is there a faster way to do this? Maybe a different control? Is there a linked-list string type I can use?
StringBuilder will not help if the text box is added to incrementally, like log output for example.
But if the above is true and your updates are frequent enough, it may behoove you to cache some number of updates and then append them in one step (rather than appending constantly). That would save you many string reallocations... and then a StringBuilder would be helpful.
Notes:
1. Create a class-scoped StringBuilder member (_sb)
2. Start a timer (or use a counter)
3. Append text updates to _sb
4. When the timer ticks or a certain count is reached, reset and append the buffered text to the text box
5. Restart the process from #1
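A minimal sketch of that timer-based batching, assuming a WinForms form with a multiline textBox1 (the names are illustrative):

using System;
using System.Text;
using System.Windows.Forms;

public partial class LogForm : Form   // hypothetical form with a multiline textBox1
{
    private readonly StringBuilder _sb = new StringBuilder();   // #1: class-scoped buffer
    private readonly Timer _flushTimer = new Timer();           // System.Windows.Forms.Timer

    public LogForm()
    {
        InitializeComponent();
        _flushTimer.Interval = 500;          // #2: flush roughly twice a second
        _flushTimer.Tick += FlushToTextBox;
        _flushTimer.Start();
    }

    // #3: callers append here instead of touching the text box directly
    public void AppendLog(string line)
    {
        _sb.AppendLine(line);
    }

    // #4/#5: one cheap AppendText per tick instead of thousands of string concats
    private void FlushToTextBox(object sender, EventArgs e)
    {
        if (_sb.Length == 0) return;
        textBox1.AppendText(_sb.ToString());
        _sb.Length = 0;                      // reset the buffer
    }
}

If the log lines arrive on a background thread, guard _sb with a lock (or marshal the appends to the UI thread), since the WinForms Timer ticks on the UI thread.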
No one has mentioned virtualization yet, which is really the only way to provide predictable performance for massive volumes of data. Even using a StringBuilder and converting it to a string every half a second will be very slow once the log gets large enough.
With data virtualization, you would only hold the necessary data in memory (i.e. what the user can see, and perhaps a little more on either side) whilst the rest would be stored on disk. Old data would "roll out" of memory as new data comes in to replace it.
In order to make the TextBox appear as though it has a lot of data in it, you would tell it that it does. As the user scrolls around, you would replace the data in the buffer with the relevant data from the underlying source (using random file access). So your UI would be monitoring a file, not listening for logging events.
Of course, this is all a lot more work than simply using a StringBuilder, but I thought it worth mentioning just in case.
Build your string with a StringBuilder, then convert it to a string using ToString(), and assign that to the textbox.
I have found that setting the textbox's WordWrap property to false greatly improves performance, as long as you're ok with having to scroll to the right to see all of your text. In my case, I wanted to paste a 20-50 MB file into a MultiLine textbox to do some processing on it. That took several minutes with WordWrap on, and just several seconds with WordWrap off.
