Best strategy to implement reader for large text files - c#

We have an application which logs its processing steps into text files. These files are used during implementation and testing to analyse problems. Each file is up to 10MB in size and contains up to 100,000 text lines.
Currently the analysis of these logs is done by opening a text viewer (Notepad++ etc) and looking for specific strings and data depending on the problem.
I am building an application which will help the analysis. It will enable a user to read files, search, highlight specific strings and other specific operations related to isolating relevant text.
The files will not be edited!
While playing a little with some concepts, I found out immediately that TextBox (and RichTextBox) don't handle the display of large text very well. I managed to implement a viewer using DataGridView with acceptable performance, but that control does not support color highlighting of specific strings.
I am now thinking of holding the entire text file in memory as a string, and only displaying a very limited number of records in the RichTextBox. For scrolling and navigating I thought of adding an independent scrollbar.
One problem I have with this approach is how to get specific lines from the stored string.
If anyone has any ideas, or can highlight problems with my approach, thank you.

I would suggest loading the whole thing into memory, but as a collection of strings rather than a single string. It's very easy to do that:
string[] lines = File.ReadAllLines("file.txt");
Then you can search for matching lines with LINQ, display them easily etc.
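For example, something along these lines (the search term and the line-number projection are just placeholders for whatever the user types in):

string searchTerm = "ERROR";   // whatever the user is looking for
var matches = lines
    .Select((text, index) => new { LineNumber = index + 1, Text = text })
    .Where(x => x.Text.IndexOf(searchTerm, StringComparison.OrdinalIgnoreCase) >= 0)
    .ToList();

foreach (var match in matches)
    Console.WriteLine(match.LineNumber + ": " + match.Text);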

Here is an approach that scales well on modern CPU's with multiple cores.
You create an iterator block that yields the lines from the text file (or multiple text files if required):
IEnumerable<String> GetLines(String fileName) {
    using (var streamReader = File.OpenText(fileName)) {
        while (!streamReader.EndOfStream)
            yield return streamReader.ReadLine();
    }
}
You then use PLINQ to search the lines in parallel. Doing that can speed up the search considerably if you have a modern CPU.
GetLines(fileName)
    .AsParallel()
    .AsOrdered()
    .Where(line => ...)
    .ForAll(line => ...);
You supply a predicate in Where that matches the lines you need to extract. You then supply an action to ForAll that will send the lines to their final destination.
This is a simplified version of what you need to do. Your application is a GUI application and you cannot perform the search on the main thread. You will have to start a background task for this. If you want this task to be cancellable you need to check a cancellation token in the while loop in the GetLines method.
ForAll will call the action on threads from the thread pool. If you want to add the matching lines to a user interface control you need to make sure that this control is updated on the user interface thread. Depending on the UI framework you use there are different ways to do that.
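As a rough sketch of those last two points, assuming a WinForms form with a ListBox named resultsListBox and a GetLines overload that takes the token (names and UI framework are assumptions, not part of the original answer):

private CancellationTokenSource _cts;

private void StartSearch(string fileName, Func<string, bool> predicate)
{
    _cts = new CancellationTokenSource();
    CancellationToken token = _cts.Token;

    Task.Factory.StartNew(() =>
    {
        GetLines(fileName, token)      // the iterator checks the token in its while loop
            .AsParallel()
            .AsOrdered()
            .WithCancellation(token)
            .Where(predicate)
            .ForAll(line =>
            {
                // Marshal each match back to the UI thread before touching the control.
                resultsListBox.BeginInvoke((Action)(() => resultsListBox.Items.Add(line)));
            });
        // Cancellation surfaces as an OperationCanceledException inside this task.
    }, token);
}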
This solution assumes that you can extract the lines you need by doing a single forward pass of the file. If you need to do multiple passes perhaps based on user input you may need to cache all lines from the file in memory instead. Caching 10 MB is not much but lets say you decide to search multiple files. Caching 1 GB can strain even a powerful computer but using less memory and more CPU as I suggest will allow you to search very big files within a reasonable time on a modern desktop PC.

I suppose that when one has multiple gigabytes of RAM available, one naturally gravitates towards the "load the whole file into memory" path, but is anyone here really satisfied with such a shallow understanding of the problem? What happens when this guy wants to load a 4 gigabyte file? (Yeah, probably not likely, but programming is often about abstractions that scale and the quick fix of loading the whole thing into memory just isn't scalable.)
There are, of course, competing pressures: do you need a solution yesterday, or do you have the luxury of time to dig into the problem and learn something new? The framework also influences your thinking by presenting block-mode files as streams... you have to check the stream's BaseStream.CanSeek value and, if that is true, access the BaseStream.Seek() method to get random access. Don't get me wrong, I absolutely love the .NET framework, but I see a construction site where a bunch of "carpenters" can't put up the frame for a house because the air-compressor is broken and they don't know how to use a hammer. Wax-on, wax-off, teach a man to fish, etc.
So if you have time, look into a sliding window. You can probably do this the easy way by using a memory-mapped file (let the framework/OS manage the sliding window), but the fun solution is to write it yourself. The basic idea is that you only have a small chunk of the file loaded into memory at any one time (the part of the file that is visible in your interface with maybe a small buffer on either side). As you move forward through the file, you can save the offsets of the beginning of each line so that you can easily seek to any earlier section of the file.
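A very rough sketch of that idea, assuming plain ASCII/UTF-8 text with '\n' line endings (method and variable names are just for illustration):

// Pass 1: record the byte offset where each line starts.
List<long> BuildLineOffsets(string fileName)
{
    var offsets = new List<long> { 0 };
    using (var stream = File.OpenRead(fileName))
    {
        int b;
        long position = 0;
        while ((b = stream.ReadByte()) != -1)
        {
            position++;
            if (b == '\n')
                offsets.Add(position);   // the next line starts right after the newline
        }
    }
    return offsets;
}

// Later: read only the lines that are currently visible by seeking straight to them.
List<string> ReadWindow(string fileName, List<long> offsets, int firstLine, int count)
{
    var window = new List<string>();
    using (var stream = File.OpenRead(fileName))
    {
        stream.Seek(offsets[firstLine], SeekOrigin.Begin);
        using (var reader = new StreamReader(stream))
        {
            for (int i = 0; i < count && !reader.EndOfStream; i++)
                window.Add(reader.ReadLine());
        }
    }
    return window;
}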
Yes, there are performance implications... welcome to the real world where one is faced with various requirements and constraints and must find the acceptable balance between time and memory utilization. This is the fun of programming... figuring out the various ways that a goal can be reached and learning what the tradeoffs are between the various paths. This is how you grow beyond the skill levels of that guy in the office who sees every problem as a nail because he only knows how to use a hammer.
[/rant]

I would suggest using MemoryMappedFile in .NET 4 (or via DllImport in earlier versions) to handle just the small portion of the file that is visible on screen, instead of wasting memory and time loading the entire file.
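For instance, a minimal sketch with System.IO.MemoryMappedFiles (the offset and length here are placeholders that would come from your scrolling/indexing scheme, and you would still want to snap them to line boundaries):

using (var mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
using (var view = mmf.CreateViewStream(visibleOffset, visibleLength, MemoryMappedFileAccess.Read))
using (var reader = new StreamReader(view))
{
    string visibleText = reader.ReadToEnd();   // hand just this chunk to the viewer control
}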

Related

How can I access a C# memory-mapped file from ColdFusion 10?

I have a C# application that generates data every second (stock tick data) which can be discarded after each iteration.
I would like to pass this data to a ColdFusion (10) application. I have considered having the C# application write the data to a file every second and having the ColdFusion application read that data, but this is likely to cause issues, with both applications potentially trying to read or write the file at the same time.
I was wondering if using memory-mapped files would be a better approach? If so, how could I access the memory-mapped file from ColdFusion?
Any advice would be greatly appreciated. Thanks.
We have produced a number of stock applications that include tick-by-tick tracking of watchlists, charting, etc. I think a file is probably not a great approach unless you are talking about a single stock with regular intervals. In my experience, a change every "second" is probably way understating the case. Some stocks (AAPL or GOOG are good examples) have hundreds of "ticks" per second during peak times.
So if you are NOT taking every tick but really are "updating the file" every 1 second then your idea has some merit in that you could use a file watching gateway to fire events for you and "see" that the file is updated.
But keep in mind that you are in effect introducing something "in the middle". A file now stands between your Java or CF applications and the quote engine. That's going to introduce latency no matter what you choose to do (acquiring and releasing file handles, etc.), and the locks of one process may interfere with the other.
When you are dealing with Facebook updates, milliseconds don't really matter much - in spite of all the teenage girls who probably disagree with me :) With stock quotes, however, half of the task is shaving off milliseconds to get your processes as close to real time as possible.
Our choice is usually sockets instead of something in the middle bridging the data. The quote engine then keeps its watchlist and updates its arrays as normal, but also sends any updates downstream to the socket engine, which pushes them to something that can handle it (a chart application, watchlist, socket gateway for a web page, etc.).
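To make the socket idea concrete, the C# producer side can be as small as this (host, port, and the tick format are made up for illustration; the receiving end would be your socket gateway):

using (var client = new TcpClient("localhost", 9100))
using (var writer = new StreamWriter(client.GetStream()) { AutoFlush = true })
{
    writer.WriteLine("AAPL,531.17,2013-01-15T14:30:00.123Z");   // one tick per line, pushed as it happens
}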
Hope this helps - it's not a clear answer but more of a clarification to the hurdles you face.

Most efficient way to search for files

I am writing a program that searches and copies mp3-files to a specified directory.
Currently I am using a List that is filled with all the MP3s in a directory (which, not surprisingly, takes a very long time). Then I use taglib-sharp to compare the ID3 tags with the artist and title entered. If they match, I copy the file.
Since this is my first program and I am very new to programming I figure there must be a better/more efficient way to do this. Does anybody have a suggestion on what I could try?
Edit: I forgot to add an important detail: I want to be able to specify which directories should be searched every time I start a search (the directory to be searched will be specified in the program itself). So storing all the files in a database or something similar isn't really an option (unless there is a way to do this every time that is still efficient). I am basically looking for the best way to search through all the files in a directory when the files are indexed every time. (I am aware that this is probably not a good idea, but I'd like to do it that way. If there is no real way to do this, I'll have to reconsider, but for now I'd like to do it like that.)
You are mostly saddled with the bottleneck that is IO, a consequence of the hardware with which you are working. It is the copying of the files that will dominate here (finding the files is dwarfed by comparison).
There are other ways to go about file management, each exposing interfaces better suited to particular purposes, such as NTFS change journals and low-level sector handling (not recommended), but if this is your first program in C# then maybe you don't want to venture into p/invoking native calls.
Other than alternatives to actual processes, you might consider mechanisms to minimise disk access - i.e. not redoing anything you have already done, or don't need to do.
Use a database (a simple binary-serialized file or an embedded database like RavenDB) to cache all the files, and query that cache instead.
Also store modified time for each folder in the database. Compare the time in the database with the time on the folder each time you start your application (and sync changed folders).
That ought to give you much better performance. Threading will not really help searching folders since it's the disk IO that takes time, not your application.
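A rough sketch of that cache using a binary-serialized dictionary rather than a real embedded database (the FolderCacheEntry type and cache file path are made-up names for illustration):

[Serializable]
class FolderCacheEntry
{
    public DateTime LastWriteUtc;
    public List<string> Mp3Files;
}

Dictionary<string, FolderCacheEntry> LoadOrRefreshCache(string root, string cacheFile)
{
    Dictionary<string, FolderCacheEntry> cache;
    if (File.Exists(cacheFile))
    {
        using (var stream = File.OpenRead(cacheFile))
            cache = (Dictionary<string, FolderCacheEntry>)new BinaryFormatter().Deserialize(stream);
    }
    else
    {
        cache = new Dictionary<string, FolderCacheEntry>();
    }

    // Re-scan only the folders whose last-write time has changed since the last run.
    foreach (var folder in new[] { root }.Concat(
                 Directory.GetDirectories(root, "*", SearchOption.AllDirectories)))
    {
        var lastWrite = Directory.GetLastWriteTimeUtc(folder);
        FolderCacheEntry entry;
        if (!cache.TryGetValue(folder, out entry) || entry.LastWriteUtc != lastWrite)
        {
            cache[folder] = new FolderCacheEntry
            {
                LastWriteUtc = lastWrite,
                Mp3Files = Directory.GetFiles(folder, "*.mp3").ToList()
            };
        }
    }

    using (var stream = File.Create(cacheFile))
        new BinaryFormatter().Serialize(stream, cache);

    return cache;
}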

How can I make my program more responsive? (program that loads at least 200 files) - I might have one idea

First - I want to say sorry for my butchered English.
I am building a program that uses a lot of files. I have a lot of foreach loops that loop through the hard disk and those files (at least 200 files, about 600 bytes each on average), and each loop uses XPath to search for values in the files (the files are XML files, of course).
I need to find a way to make my program more responsive - I thought of one idea, which is the following:
A computer's memory is much faster to read than its hard disk, so I thought maybe I should load those files into memory and then loop over the memory instead of looping over the hard disk. By the way, if someone can tell me how much faster a computer's memory is than its hard disk, thanks.
Thanks in advance.
Din
If someone didn't understand my English, I will try to explain again.
The best approach I can think of is PLINQ in C# 4.0. Group these XML files and query them with LINQ to XML in parallel. The following is a simple example, which loads all the XML files in C:\XmlFolder and chooses those documents that contain an element named "key".
List<XDocument> xmls = Directory.EnumerateFiles(@"C:\XmlFolder", "*.xml").AsParallel()
                                .Select(path => XDocument.Load(path))
                                .Where(doc => doc.Descendants()
                                                 .Any(ele => ele.Name.LocalName == "key"))
                                .ToList();
You should parse the XML files in a different thread and create objects with the required information; this way you will have instant access to the information.
Define "responsive." Do you mean that you want UI cues to continue to happen, or that you want to continue to be able to do other things in the UI while it's processing the files?
The former is easy: you can just toss the occasional Application.DoEvents() into your loops. This will prompt the UI to perform any cues that are waiting (such as drawing the window, etc.).
The latter is going to involve multi-threading. Diving into that is a bit more complex than can be taught in a paragraph or two, but some Google searches for "c# .net multi threading tutorial" should yield a ton of results. If you're not familiar with the basic concept of what multi-threading provides, I can further explain it.
Use a BackgroundWorker or the ThreadPool to spawn multiple threads for the I/O, and have them read the data into a Queue (this assumes the total size of your data is not too large). Have one or more other threads read from that Queue and run your internal XPath logic to pull whatever you need from those files.
Essentially, think of it as an instance of the Producer/Consumer design pattern, wherein your I/O reader threads are producers, and your XPath logic threads are consumers.
The type of the object in the queue could be just a byte-array, but I'd suggest a custom C# class that contains the byte array, as well as some of the file metadata in case you need it for whatever reason.
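A sketch of that shape using BlockingCollection from .NET 4 (FileChunk, rootFolder, and the consumer count are assumptions for illustration):

class FileChunk
{
    public string Path;
    public byte[] Data;
}

var queue = new BlockingCollection<FileChunk>(100);   // bounded so the producer cannot race ahead of memory

// Producer: an I/O thread reads the raw bytes and pushes them onto the queue.
var producer = Task.Factory.StartNew(() =>
{
    foreach (var path in Directory.EnumerateFiles(rootFolder, "*.xml"))
        queue.Add(new FileChunk { Path = path, Data = File.ReadAllBytes(path) });
    queue.CompleteAdding();
});

// Consumers: worker threads parse the XML and run the XPath/LINQ-to-XML logic.
var consumers = Enumerable.Range(0, 2).Select(_ => Task.Factory.StartNew(() =>
{
    foreach (var chunk in queue.GetConsumingEnumerable())
    {
        var doc = XDocument.Load(new MemoryStream(chunk.Data));
        // ... run your XPath queries against doc here ...
    }
})).ToArray();

Task.WaitAll(consumers);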
You can use a database for storing the XML files; it will be faster, more secure, and more reliable than your current scheme. You can build indexes, concurrent access is supported, XQuery/XPath is supported, and there are many more "pluses".
If you have only XML files, you can consider native XML databases, or if you have other types as well you can consider XML-enabled DBMSs (such as Oracle or DB2).

Reading from SerialPort & Memory Management - C#

I am working on a program that reads in from a serial port, then parses, formats, and displays the information appropriately. This program, however, needs to run for upwards of 12 hours, constantly handling a stream of incoming data. I am finding that as I let my app run for a while, the memory usage increases at a linear rate - not good for a 12-hour run.
I have implemented a logger that writes the raw incoming binary data to a file - is there a way I can utilize this idea to clear my memory cache at regular intervals? I.e. how can I, every so often, write to the log file in such a way that the data doesn't need to be stored in memory?
Also - are there other aspects of a Windows Form Application that would contribute to this? E.g. I print the formatted strings to a textbox, which ends up displaying the entire string. Because this is running for so long, it easily displays hundreds of thousands of lines of text. Should I be writing this to a file and clearing the text? Or something else?
Obviously, if the string grows over time, your app's memory usage will also grow over time. Also, WinForms textboxes can have trouble dealing with very large strings. How large does the string get?
Unless you really want to display the entire string onscreen, you should definitely clear it periodically (depending on your users' expectations); this will save memory and probably improve performance.
Normally, memory management in .NET is completely automatic. You should be careful about extrapolating short observations (minutes) to a 12-hour period. And please note that Task Manager is not a very good tool for measuring memory usage.
Writing out the incoming data should not increase memory usage significantly. But there are a few things you should avoid doing, and concatenating to a string over and over is one of them. Your TextBox is probably costing a lot more than you seem to think. Using a ListBox would be more efficient. And easier.
I have several serial applications which run either as an application or as a windows service. These are required to be up 24/7-365. The best mechanism I have found to avoid this same problem is two-fold.
1) Write the information out to a log file. For a service, this is the only way of getting the info out. The log file does not increase your memory usage.
2) For the application, write the information out to a log file as well as put it into a listbox. I generally limit the listbox to the last 500 or 1000 entries. With the newer .NET controls, listboxes are virtualized, which helps, and you also avoid other memory issues such as textbox concatenation.
You can take a system down with a textbox by constantly appending the string over a number of hours as it is not intended for that kind of abuse out of the box.
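A small sketch of that pattern on the application side (the control name, log path, and cap are illustrative, and this must be called on the UI thread):

private const int MaxDisplayedLines = 1000;

private void AppendLine(string line)
{
    File.AppendAllText(logFilePath, line + Environment.NewLine);   // full history goes to disk

    displayListBox.BeginUpdate();
    displayListBox.Items.Add(line);
    while (displayListBox.Items.Count > MaxDisplayedLines)
        displayListBox.Items.RemoveAt(0);                          // drop the oldest entry
    displayListBox.EndUpdate();
}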

How can I quickly create large (>1gb) text+binary files with "natural" content? (C#)

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.
The content of the files should be neither completely random nor uniform.
A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
I'd like to keep the number of files at a manageable level, say O(10).
For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:
// size = total bytes to write, sz = buffer size (e.g. 512k), zeroes = flag to leave the
// buffer all zeros, _rnd = a Random instance; these all come from the surrounding class.
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
}
With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.
For text files, the approach I have taken is to use Lorem Ipsum and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.
Neither of these is quite satisfactory for me.
I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.
The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.
What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
Currently I have an approach that sort of works but it takes too long to run.
Has anyone else solved this?
Is there a much faster way to write a text file than via StreamWriter?
Suggestions?
EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.
For text, you could use the Stack Overflow community dump; there are about 300 megs of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source code and XML mixed in).
You could also use something like the wikipedia dump, it seems to ship in MySQL format which would make it super easy to work with.
If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.
Edit
Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio) which is available for download via BitTorrent.
You could always code yourself a little web crawler...
UPDATE
Calm down guys, this would be a good answer, if he hadn't said that he already had a solution that "takes too long".
A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.
I think you might be looking for something like a Markov chain process to generate this data. It's stochastic (randomised), but also structured, in that it operates based on a finite state machine.
Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you can see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort, if you need these enormous amounts of test data.
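To make that a bit more concrete, here is a deliberately simplified word-level, order-1 Markov generator; a real one would want smarter tokenisation and probably a higher order, but the structure is the same (it assumes the training text contains at least a handful of words):

static string GenerateMarkovText(string trainingText, int wordCount, Random rng)
{
    var words = trainingText.Split(new[] { ' ', '\r', '\n', '\t' },
                                   StringSplitOptions.RemoveEmptyEntries);

    // Map each word to the list of words observed to follow it.
    var transitions = new Dictionary<string, List<string>>();
    for (int i = 0; i < words.Length - 1; i++)
    {
        List<string> next;
        if (!transitions.TryGetValue(words[i], out next))
            transitions[words[i]] = next = new List<string>();
        next.Add(words[i + 1]);
    }

    var output = new StringBuilder();
    string current = words[rng.Next(words.Length)];
    for (int i = 0; i < wordCount; i++)
    {
        output.Append(current).Append(' ');
        List<string> candidates;
        if (!transitions.TryGetValue(current, out candidates) || candidates.Count == 0)
            current = words[rng.Next(words.Length)];          // dead end: restart anywhere
        else
            current = candidates[rng.Next(candidates.Count)]; // successors are weighted by observed frequency
    }
    return output.ToString();
}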
I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them copying them to your output file as many times as needed to get the right size file.
You could then use a similar approach for binary files by looking for .exes or .dlls.
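A quick sketch of that approach for the text case (the output path and target size are arbitrary, unreadable folders and locked files are skipped, and it assumes at least one readable .txt file exists):

static IEnumerable<string> SafeEnumerateFiles(string dir, string pattern)
{
    string[] files = new string[0], subDirs = new string[0];
    try
    {
        files = Directory.GetFiles(dir, pattern);
        subDirs = Directory.GetDirectories(dir);
    }
    catch (UnauthorizedAccessException) { }   // skip protected folders

    foreach (var file in files) yield return file;
    foreach (var sub in subDirs)
        foreach (var file in SafeEnumerateFiles(sub, pattern)) yield return file;
}

static void FillFromWindowsTxtFiles(string outputPath, long targetBytes)
{
    string windowsDir = Environment.GetFolderPath(Environment.SpecialFolder.Windows);
    long written = 0;
    using (var output = File.Create(outputPath))
    {
        while (written < targetBytes)        // keep looping over the sources until the file is big enough
        {
            foreach (var path in SafeEnumerateFiles(windowsDir, "*.txt"))
            {
                byte[] bytes;
                try { bytes = File.ReadAllBytes(path); }
                catch (IOException) { continue; }     // skip locked files
                output.Write(bytes, 0, bytes.Length);
                written += bytes.Length;
                if (written >= targetBytes) break;
            }
        }
    }
}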
For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.
For a more structured approach you could use a Markov chain trained on some large free english text.
Why don't you just take Lorem Ipsum and create a long string in memory before your output? If you double the amount of text you have each time, you only need O(log n) appends to reach a length of n. You can even calculate the total length of the data beforehand, which lets you avoid having to copy the contents to a new string/array.
Since your buffer is only 512k, or whatever you set it to be, you only need to generate that much data before writing it, since that is the most you can push to the file at one time. You are going to be writing the same text over and over again, so just reuse the original 512k that you created the first time.
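One way to read that advice in code: build a single buffer-sized block of Lorem Ipsum once, then write that same block repeatedly until the target size is reached (loremIpsum, bufferChars, targetBytes, and the output path are placeholders):

var block = new StringBuilder();
while (block.Length < bufferChars)
    block.Append(loremIpsum);                      // or double the text each pass, as suggested above
byte[] chunk = Encoding.UTF8.GetBytes(block.ToString(0, bufferChars));

using (var stream = File.Create(@"C:\temp\lorem.txt"))
{
    long written = 0;
    while (written < targetBytes)
    {
        stream.Write(chunk, 0, chunk.Length);      // reuse the same block every iteration
        written += chunk.Length;
    }
}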
Wikipedia is excellent for compression testing for mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high water mark for the first 100mb of Wikipedia. The current record is a 6.26 ratio, 16 mb.
Thanks for all the quick input.
I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple of ideas.
To generate text, I start with a few text files from the Project Gutenberg catalog, as suggested by Mark Rushakoff.
I randomly select and download one document of that subset.
I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
I wrote a new Markov chain in C# using Pike's economical Perl implementation as an example. It generates text one word at a time.
For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together.
UPDATE: As for the second problem, speed - I took the approach of eliminating as much IO as possible, since this is being done on my poor laptop with a 5400rpm mini-spindle. Which led me to redefine the problem entirely - rather than generating a FILE with random content, what I really want is the random content itself. Using a Stream wrapped around a Markov chain, I can generate text in memory and stream it to the compressor, eliminating 8 GB of writes and 8 GB of reads. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. The streaming approach sped things up massively; it cut about 80% of the time required.
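For anyone wanting to try the same trick, the wrapper can be roughly this shape: a read-only Stream that pulls text from a generator delegate on demand (this is a reconstruction of the idea with made-up names, not the actual code used; the generator is assumed to return non-empty chunks):

class GeneratedTextStream : Stream
{
    private readonly Func<string> _nextChunk;
    private long _remaining;
    private byte[] _pending = new byte[0];
    private int _offset;

    public GeneratedTextStream(Func<string> nextChunk, long totalBytes)
    {
        _nextChunk = nextChunk;
        _remaining = totalBytes;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_remaining == 0) return 0;                        // pretend we hit end of file
        if (_offset == _pending.Length)
        {
            _pending = Encoding.UTF8.GetBytes(_nextChunk()); // ask the generator for more text
            _offset = 0;
        }
        int n = (int)Math.Min(Math.Min(count, _pending.Length - _offset), _remaining);
        Array.Copy(_pending, _offset, buffer, offset, n);
        _offset += n;
        _remaining -= n;
        return n;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

You then hand an instance of this stream straight to the compressor's input instead of a FileStream.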
I haven't yet figured out how to do the binary generation, but it will likely be something analogous.
Thank you all, again, for all the helpful ideas.
