Large Text Files versus Small Text Files - C#

This is just a general question. I have an application that uses File.ReadAllLines, and I was wondering: is it faster to load one big file or multiple small files?
My current file is a dictionary for my speech recognition. It currently has about 40,000 lines in it. When I start the program it takes about 2 minutes to load the grammars.
When the file is less than 1,000 lines it processes and reads all lines instantly. When I added phase two, implementing the entire King James Bible, it slowed down tremendously.
So for better performance, would it be better to have multiple smaller files instead of one big file? I will be adding 16 more txt documents, each with roughly 2,000 to 3,000 lines.
If it's easier and faster to have several smaller files, then I won't bother adding them to one big file.
I am building a custom speech dictation engine, so it will consist of the KJV Bible, names, numbers and different NLP variations. Eventually I will be adding Strong's Exhaustive Concordance. So far the program works perfectly; I am just trying to get faster load times on the initial start-up.
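For what it's worth, here is a minimal sketch of the two ways of reading the dictionary file; the file name is a placeholder and the grammar-building step is reduced to a counter. File.ReadAllLines allocates one array holding every line up front, while File.ReadLines streams the file so processing can start immediately and memory use stays flat:

    using System;
    using System.Diagnostics;
    using System.IO;

    class LoadComparison
    {
        static void Main()
        {
            const string path = "grammar.txt"; // placeholder dictionary file

            // Eager: one big array holding every line before anything is processed.
            var sw = Stopwatch.StartNew();
            string[] allLines = File.ReadAllLines(path);
            Console.WriteLine($"ReadAllLines: {allLines.Length} lines in {sw.ElapsedMilliseconds} ms");

            // Lazy: streams one line at a time; memory stays flat regardless of file size.
            sw.Restart();
            int count = 0;
            foreach (string line in File.ReadLines(path))
            {
                count++; // feed each entry to the grammar builder here
            }
            Console.WriteLine($"ReadLines:    {count} lines in {sw.ElapsedMilliseconds} ms");
        }
    }

If most of the two minutes turns out to be spent building the grammars rather than reading the text, splitting the data into several files won't change much; timing the read on its own, as above, is a quick way to find out.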

Related

Most efficient and quickest way to search numerous 1Gb text files for a certain string or instances of that string?

For work I am tasked with finding out how many times a certain set of sequential characters is used in a string. These files are all 1 GB+ in size, and there are anywhere from 4 to 15 files like these. For example: find "cat" in "catastrophe", counting every instance where "cat" is part of the word, in every file.
In theory (at least to me) I would load one text file into memory and then look for the match line by line. At the end of the text file I would remove it from memory and load the next text file, until all files have been searched.
I have been doing mostly script files to automate tasks these last few years, and I have been out of the coding game for so long that I don't remember, or maybe I never knew, the most efficient and fastest way to do this.
When I say speed I mean the elapsed time of the program, not how long it will take me to write it.
I would like to do this in C# because I am trying to get more comfortable with the language, but really I could do it in any language. Preferably not assembly... that was a joke.
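A minimal sketch of the streaming approach described above, assuming all the files sit in one folder; the folder path and search term are placeholders, and matches are counted line by line so no file is ever held in memory in full:

    using System;
    using System.IO;

    class SubstringCounter
    {
        // Counts occurrences of needle in line, including overlapping ones.
        static int CountOccurrences(string line, string needle)
        {
            int count = 0;
            int index = 0;
            while ((index = line.IndexOf(needle, index, StringComparison.Ordinal)) >= 0)
            {
                count++;
                index++; // step one character forward so overlaps are counted
            }
            return count;
        }

        static void Main()
        {
            string needle = "cat";                                    // sequence to look for
            string[] files = Directory.GetFiles(@"C:\data", "*.txt"); // placeholder folder

            long total = 0;
            foreach (string file in files)
            {
                long fileTotal = 0;
                // File.ReadLines streams the file, so a 1 GB file is never loaded at once.
                foreach (string line in File.ReadLines(file))
                {
                    fileTotal += CountOccurrences(line, needle);
                }
                Console.WriteLine($"{file}: {fileTotal}");
                total += fileTotal;
            }
            Console.WriteLine($"Total: {total}");
        }
    }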

Best Approach Towards Extracting Large Quantities of PDF files?

I'm attempting to improve the performance of an automation workflow that extracts several zip files (10 to 15), each containing roughly 5,000 PDF files.
My question is: if I want to extract all these zip files quickly into one directory, what is your recommended approach? What if I wanted to archive these files first and then copy the output to a separate directory for post-processing?
My initial thought was to run each zip file through 7za.exe (7-Zip) in parallel, which I have done manually in the past. However, it is still extremely slow and can take up to two hours. Whether I run the extraction process one at a time or all at once does not appear to affect performance.
I will be recreating this workflow in C# and T-SQL. The goal is to take SSIS out of the picture for better input validation and logic before the data reaches our database. Here is a screenshot of the zip extraction section of the current automation.
Any suggestions or help are greatly appreciated. Thank you.
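Not the existing workflow, but a minimal sketch of doing the extraction in-process with System.IO.Compression instead of shelling out to 7za.exe; the paths and the parallelism cap are assumptions, and since extraction is usually disk-bound it is worth measuring whether parallelism helps at all on your hardware:

    using System;
    using System.IO;
    using System.IO.Compression;   // System.IO.Compression + System.IO.Compression.FileSystem
    using System.Threading.Tasks;

    class ParallelUnzip
    {
        static void Main()
        {
            string sourceDir = @"C:\incoming";  // placeholder: folder holding the 10-15 zips
            string targetDir = @"C:\extracted"; // placeholder: single output directory
            Directory.CreateDirectory(targetDir);

            var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // cap, not one per core

            Parallel.ForEach(Directory.GetFiles(sourceDir, "*.zip"), options, zipPath =>
            {
                using (ZipArchive archive = ZipFile.OpenRead(zipPath))
                {
                    foreach (ZipArchiveEntry entry in archive.Entries)
                    {
                        if (string.IsNullOrEmpty(entry.Name))
                            continue; // skip directory entries

                        string destination = Path.Combine(targetDir, entry.FullName);
                        string dir = Path.GetDirectoryName(destination);
                        if (!string.IsNullOrEmpty(dir))
                            Directory.CreateDirectory(dir);

                        entry.ExtractToFile(destination, overwrite: true);
                    }
                }
                Console.WriteLine($"Extracted {Path.GetFileName(zipPath)}");
            });
        }
    }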

C# How to iterate through a list of files without using Directory.GetFiles() method

I have several security cameras that upload pictures to my FTP server. Some of these cams conveniently create subfolders at the start of a new day in the format "yyyymmdd". This is great and makes it easy to maintain/delete older pictures by a particular day. Other cams aren't so nice and just dump pictures into one giant folder, making deletion difficult.
So I am writing a C# Windows Forms program to go to a specific folder (the source folder) chosen with FolderBrowserDialog, and I name a target folder also using FBD. I was using the standard process of iterating through a file list with a string array filled via the Directory.GetFiles() method. I use each file's creation date to create a subfolder if it doesn't exist, and in either case I move the file to that date-based subfolder. It works great while testing with small numbers of files.
Now I'm ready to test against real data, and I'm concerned that with some folders having thousands of files, I'm going to have many issues with memory and other problems. How well can a string array handle such huge volumes of data? Note that one folder has over 28,000 pictures. Can a string array handle such a large number of file names?
My question, then, is how can I iterate through the files in a given folder without having to use a string array and the Directory.GetFiles() method? I'm open to any thoughts, though I do want to use C# in a Windows Forms environment. I have an added feature that lets me delete pictures older than a given date instead of moving them.
Many thanks!
You'll be just fine with thousands of file names. You might have a problem with millions, but thousands isn't a big deal for C#. You may have a performance issue just because of how NTFS works, but if so there's nothing you can do about that in C#; it's a problem inherent in the file system.
However, if you really want to pick at this, you can do a little better by using DirectoryInfo.EnumerateFileSystemInfos(). This method has two benefits over GetFiles():
It loads the file name and creation date in one disk access, instead of two.
It allows you to work with an IEnumerable instead of an array, such that you only need memory for one file record at a time.
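A small sketch of that idea using DirectoryInfo.EnumerateFiles, a sibling method that yields FileInfo objects one at a time with the creation date already populated; the paths are placeholders standing in for the FolderBrowserDialog selections:

    using System;
    using System.IO;

    class PhotoSorter
    {
        // Moves each file in sourceDir into a yyyyMMdd subfolder of targetDir,
        // streaming the directory listing instead of filling a string array first.
        static void SortByCreationDate(string sourceDir, string targetDir)
        {
            var source = new DirectoryInfo(sourceDir);

            foreach (FileInfo file in source.EnumerateFiles())
            {
                string dayFolder = Path.Combine(targetDir, file.CreationTime.ToString("yyyyMMdd"));
                Directory.CreateDirectory(dayFolder);   // no-op if it already exists

                string destination = Path.Combine(dayFolder, file.Name);
                if (!File.Exists(destination))
                    file.MoveTo(destination);
            }
        }

        static void Main()
        {
            // Placeholder paths standing in for the source/target folder selections.
            SortByCreationDate(@"C:\ftp\cam1", @"C:\ftp\sorted");
        }
    }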

Zip folder to SVN?

This may sound like a silly question, but I just wanted to clear something up. I've zipped up a folder and added it to my SVN repository. Is doing this OK, or should I commit the unzipped folder instead?
I just need to be sure!
If you are going to change the contents of the directory, then you should store it unzipped. Having it in a zip file will exhaust storage on the server much faster, as if you were storing every version of your zip as a separate file on the server.
The zip format has one cool property: every file inside the archive occupies its own segment of bytes and is compressed/decompressed independently of all the other files. As a result, if you have a 100 MB zip and modify two files inside it, each 1 MB in size, the new zip will contain at most 2 MB of new data; the remaining 98 MB will most likely be byte-exact copies of pieces of the old zip. So it is in theory possible to represent small in-zip changes as small deltas. But there are many problems in practice.
First of all, you must be sure that you don't recompress the unchanged files. If you build both the first zip and the second zip from scratch using different programs, program versions, compression settings, etc., you can get slightly different compression on the unchanged files. As a result, the actual bytes in the zip file will differ greatly, and any hope for a small delta will be lost. The better approach is to take the first zip and add/remove files in it.
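For illustration, a C# sketch (to match the rest of this page) of the "modify the existing archive" approach using System.IO.Compression; the paths are placeholders, and it is worth diffing the result against the old zip to confirm your tooling really leaves untouched entries byte-identical before relying on small deltas:

    using System.IO.Compression;   // System.IO.Compression + System.IO.Compression.FileSystem

    class ZipUpdate
    {
        static void Main()
        {
            string zipPath = "project.zip";       // placeholder: the archive already in SVN
            string entryName = "docs/readme.txt"; // placeholder: the one file that changed
            string changedFile = @"docs\readme.txt";

            // Update mode rewrites only the entries we touch; the intent is that every
            // other entry's compressed bytes are carried over from the old archive.
            using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Update))
            {
                archive.GetEntry(entryName)?.Delete();               // drop the stale entry
                archive.CreateEntryFromFile(changedFile, entryName); // add the new version
            }
        }
    }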
The main problem, however, is how SVN stores deltas. As far as I know, SVN uses the xdelta algorithm for computing deltas. This algorithm is perfectly capable of detecting equal blocks inside a zip file, if given unlimited memory. The problem is that SVN uses a memory-limited version with a window size of 100 KB. If you simply remove a segment longer than 100 KB from a file, SVN's delta computation will break on it, and the rest of the file will simply be copied into the delta. Most likely, the delta will take as much space as the whole file.

Text editing LinkedList vs List of lines

I'm writing a text editor in C#. I know I'm just reinventing the wheel, but it's a learning experience and something I want to do.
Right now I have a basic text document using something that resembles a gap buffer; however, I have to update my line buffer, which holds the start of each line, every time an edit is made to the buffer.
I am looking at creating another text document implementation for testing, using a list of lines and editing each line instead.
Now the question I have is: what would the benefits be of using a LinkedList vs a standard List?
A linked list of lines will be fast at inserting or deleting lines, fast at moving down (and up, if it is doubly linked) from a specific line by a small number of lines, and fast at moving to the start of the document. To move quickly to the end of the document you will also need to keep a reference to the last node. It is relatively slow to go to a specific line number, though, as you have to start from the beginning and iterate over the lines, although this shouldn't be a problem unless your documents have a great many lines.
An ordinary list is fast at moving to a specific line number but slow at adding or removing any line except at the end, as every element after the insertion or deletion point has to be shifted each time.
I would prefer a linked list over an array-based list for the purposes of editing large documents. In either case you may have problems if the document contains any extremely long lines, since strings are immutable and changing individual characters will be slow.
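A small sketch comparing the two for inserts near the top of a large document; the line count and insert position are arbitrary, but it shows the trade-off: the List pays for the shifting on every insert, while the LinkedList pays once to walk to the node and then inserts in constant time:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    class LineBufferComparison
    {
        static void Main()
        {
            const int lines = 100000;
            var listBuffer = new List<string>(Enumerable.Repeat("line", lines));
            var linkedBuffer = new LinkedList<string>(listBuffer);

            // List<string>: every element after index 10 is shifted on each insert.
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1000; i++)
                listBuffer.Insert(10, "new line");
            Console.WriteLine($"List inserts:       {sw.ElapsedMilliseconds} ms");

            // LinkedList<string>: walk to line 10 once, then each insert is O(1).
            LinkedListNode<string> node = linkedBuffer.First;
            for (int i = 0; i < 10; i++)
                node = node.Next;
            sw.Restart();
            for (int i = 0; i < 1000; i++)
                linkedBuffer.AddAfter(node, "new line");
            Console.WriteLine($"LinkedList inserts: {sw.ElapsedMilliseconds} ms");
        }
    }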
I'm using a one-string-per-line array. Arrays are often as fast as or faster than linked lists to update, because they have much better cache locality, and in the time of a single first-level cache miss you can already move a few dozen pointer-sized items in an array. I would say for anything less than 10,000 items, just use an array of pointers.
If the texts you edit are small (like hand-written source code files), it is the way to go. A lot of the metadata you need for state-of-the-art text editing splits very well along "beginning of a line" points, most importantly syntax highlighting.
Error lines, breakpoints, profiling info and code coverage markers also all work per line. It's the native structure of source code text, and in some cases also of literary text (if you are writing a word processor rather than a source code editor), although in that case you need to take a paragraph as the unit.
If I ever find the time to redesign my editor, I will add different buffer implementations, because on larger texts the overhead of all the per-line info (about 80 bytes per line on a 32-bit system) is significant. Then a gap buffer model is better; it is also much better if you don't have lines at all, for example when displaying binary files in hex mode.
And finally a third buffer model is required if you allow the user to open large files. It's funny to see the marketing bullshit (free open source is surprisingly worse here) about unlimited file size editing, and then once you open a 400 MB log file the whole system becomes unresponsive. You need a buffer model here that does not load the whole file into the buffer first.
