Most efficient and quickest way to search numerous 1GB text files for a certain string or instances of that string? - C#

For work I am tasked with finding out how many times a certain sequence of characters occurs across a set of text files. The files are all 1GB+ in size, and there are anywhere from 4 to 15 of them. For example: count every occurrence of "cat", including where it is part of a longer word like "catastrophe", in every file.
In theory (at least to me) I would load one text file into memory, then look for the match line by line. At the end of the text file I would release it from memory and load the next one... until all files have been searched.
I have mostly been writing scripts to automate tasks these last few years, and I have been out of the coding game for so long that I don't remember, or maybe never knew, the most efficient and fastest way to do this.
When I say speed I mean in elapsed time of the program, not how long it will take me to write this.
I would like to do this in C# because I am trying to get more comfortable with the language, but really I could do it in any language. Preferably not assembly... that was a joke...
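As a rough illustration of the approach described above, here is a minimal C# sketch that streams each file line by line (so no 1GB file is ever held in memory at once) and counts every occurrence of the term, including inside longer words; the file names and search term are placeholders:

    using System;
    using System.IO;

    class SearchFiles
    {
        // Counts every occurrence of 'needle' in 'line', including where it
        // is embedded in a longer word ("cat" inside "catastrophe").
        static int CountOccurrences(string line, string needle)
        {
            int count = 0, index = 0;
            while ((index = line.IndexOf(needle, index, StringComparison.Ordinal)) != -1)
            {
                count++;
                index++; // advance one char so overlapping matches are also counted
            }
            return count;
        }

        static void Main()
        {
            string needle = "cat";                         // placeholder search term
            string[] files = { "file1.txt", "file2.txt" }; // placeholder file names

            long total = 0;
            foreach (string path in files)
            {
                // File.ReadLines streams lazily; it never loads a whole 1GB file.
                foreach (string line in File.ReadLines(path))
                    total += CountOccurrences(line, needle);
            }
            Console.WriteLine($"Found {total} occurrences of \"{needle}\".");
        }
    }

For files this size, the bottleneck is usually disk I/O rather than the string search, so streaming like this tends to be close to as fast as anything more elaborate.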

Related

Large Text Files versus Small Text Files

This is just a general question. I have an application that uses File.ReadAllLines, and I was wondering: is it faster to load one big file or multiple small files?
My current file is a dictionary for my speech recognition. Currently it has about 40,000 lines in it. When I start the program it takes about 2 minutes to load the grammars.
When the file was less than 1,000 lines it processed and read all lines instantly. When I added phase two, implementing the entire King James Bible, it slowed down tremendously.
So for better performance, would it be better to have multiple smaller files instead of one big file? I will be adding 16 more txt documents, each roughly 2,000 to 3,000 lines long.
If it's easier and faster to have several smaller files, then I won't bother adding them to one big file.
I am building a custom speech dictation engine, so it will consist of the KJV Bible, names, numbers, and different NLP variations. Eventually I will be adding Strong's Exhaustive Concordance. So far the program works perfectly. I am just trying to get faster load times on the initial start-up.
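No answer is quoted here, but a quick way to check whether the file layout is actually the bottleneck is to time both layouts. The paths below are made up, and with a speech engine the slow part may well be compiling the grammars rather than reading the file:

    using System;
    using System.Diagnostics;
    using System.IO;

    class LoadTimeTest
    {
        static void Main()
        {
            // Hypothetical paths: one combined dictionary vs. several split parts.
            string bigFile = "dictionary.txt";
            string[] smallFiles = { "kjv.txt", "names.txt", "numbers.txt" };

            var sw = Stopwatch.StartNew();
            string[] all = File.ReadAllLines(bigFile);
            sw.Stop();
            Console.WriteLine($"One big file: {all.Length} lines in {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            int count = 0;
            foreach (string file in smallFiles)
                count += File.ReadAllLines(file).Length;
            sw.Stop();
            Console.WriteLine($"Split files:  {count} lines in {sw.ElapsedMilliseconds} ms");
        }
    }

Reading 40,000 lines with File.ReadAllLines should take milliseconds, not minutes, so a measurement like this quickly shows whether splitting the file can help at all.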

Write Text to a file at a specific point

I am creating a program that sorts through 500k+ lines of text, and pulls certain strings out to be written in a clean version.
When I get my finished array of new clean lines to be written to file, I am curious whether there is a way to use threads and tell the code exactly what line number or index to start writing at in the text file.
Effectively, multiple threads would simultaneously write sections of my text while maintaining the original order.
A simple example would be, say, I wanted to start writing text at the 125923rd line of the text file regardless of what already exists, if anything.
Thank you
You cannot rewrite a single line without rewriting the rest of the file, unless the new line is exactly the same length as the original. And if I understood you correctly, you want to use multiple threads to write to a single file, but unfortunately that is not possible in your case.
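To illustrate the one exception mentioned in the answer: if the replacement has exactly the same byte length as the original text, you can seek to a byte offset and overwrite in place. A hypothetical sketch (path, offset, and text are made up):

    using System.IO;
    using System.Text;

    class OverwriteInPlace
    {
        static void Main()
        {
            // Works only because the new text has exactly the same byte length
            // as what it replaces; otherwise the rest of the file must be rewritten.
            byte[] replacement = Encoding.ASCII.GetBytes("NEW TEXT");

            using (var fs = new FileStream("output.txt", FileMode.Open, FileAccess.Write))
            {
                fs.Seek(1024, SeekOrigin.Begin); // a byte offset, not a line number
                fs.Write(replacement, 0, replacement.Length);
            }
        }
    }

Note that the offset is in bytes; to start writing at "the 125923rd line" you would first have to scan the file to find where that line begins.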

How to match and erase a (potentially) large portion of text between certain points in C#?

I'm trying to find a way to clear out links in a .txt document loaded into the project as a string via StreamReader.
Firstly I need to identify that there is a link (it could be inside of tags, or it could just be out by itself in the middle of a sentence, like http://www.somesite.com )
I found a neat class online called GetStringInBetween which allows me to find all the links in the document. However, I'm struggling to use the same class to match both the found link(s) AND another point. I was trying to go for a line break, so that I could replace everything between a line break and the end of the URL, effectively erasing the chunk of text surrounding the URL; it typically says something like "you can visit our site at http:/", etc.
What is the best way to a) identify links in an extremely long string and b) how to erase them AND some text around them?
I'd also like to note that unless I specify to use Encoding.UTF7 the text comes out all garbled when it's read from the text files. I don't know if this might be a source of the matching issues.
Thanks ladies and gents :)
First of all - how big is the file that you're trying to parse? If it's just on the order of a few hundred MB, then you can load it in RAM entirely which simplifies things.
The UTF-7 encoding should not bother you: all .NET strings are internally UTF-16, and .NET converts from UTF-7 to UTF-16 when reading the file, so once it's loaded you don't have to worry about encodings anymore.
After you have it in one big string, your best bet is to proceed with using regexps on it. They allow replacing text as well, so you might be able to "clean" your file in one line of code! Of course, regexps for matching URLs will never be perfect (and even less so for parsing HTML), so you can expect that some of the more exotic URLs might slip through now and then. But if you want perfection, then it might get REALLY tricky.
Alternatively, if the file is large and you only care about removing whole lines, you might try reading the file line by line and processing each line separately. If you find a URL in it, discard the line; if there is no URL, write it to the target file. That's also very simple to write. You'd still be dependent on regexps for finding URLs, though.
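A rough sketch of that line-by-line variant (the pattern is deliberately simple and, as noted above, no URL regex is perfect; the file names are placeholders):

    using System.IO;
    using System.Text.RegularExpressions;

    class StripUrlLines
    {
        static void Main()
        {
            // Deliberately naive URL pattern; exotic URLs will slip through.
            var urlPattern = new Regex(@"https?://\S+", RegexOptions.IgnoreCase);

            using (var writer = new StreamWriter("clean.txt"))
            {
                foreach (string line in File.ReadLines("input.txt"))
                {
                    // Drop the whole line if it contains a URL; keep it otherwise.
                    if (!urlPattern.IsMatch(line))
                        writer.WriteLine(line);
                }
            }
        }
    }

The whole-string approach from the previous paragraph would instead be a single Regex.Replace call over File.ReadAllText, with a pattern that also swallows the surrounding text up to the nearest line break.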

Handling large text files in C#

Hey there! I need to read large text files up to 100MB in size. I need to read each line, search for a string, and write the results to a log. What would be the best way of doing this? Should I read each line individually, search it, then move on to the next one?
Allocating a string of up to 200MB (a 100MB file roughly doubles in memory, since .NET strings are UTF-16) isn't that much at all these days. Just read it all at once and process it.
One option is to use memory mapped files.
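A minimal sketch of the memory-mapped option (the file names and search string are placeholders); the view stream lets you scan the file without ever allocating it as one big string:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    class MappedSearch
    {
        static void Main()
        {
            using (var mmf = MemoryMappedFile.CreateFromFile("big.txt", FileMode.Open))
            using (var stream = mmf.CreateViewStream())
            using (var reader = new StreamReader(stream))
            using (var log = new StreamWriter("results.log"))
            {
                string line;
                int lineNumber = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    lineNumber++;
                    if (line.Contains("ERROR")) // placeholder search string
                        log.WriteLine($"line {lineNumber}: {line}");
                }
            }
        }
    }

For a 100MB file either approach works; memory mapping mainly pays off when several processes read the same file or when you need random access.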

Text editing LinkedList vs List of lines

I'm writing a text editor in C#. I know I'm just reinventing the wheel, but it's a learning experience and something I want to do.
Right now I have a basic text document using something that resembles a gap buffer; however, I have to update my line buffer to hold the start of each line every time an edit is made to the buffer.
I am looking at creating another text document implementation for testing, using a list of lines and editing each line instead.
Now the question that I have is what would be the benefits of using a LinkedList vs a standard List?
A linked list of lines will be fast at inserting or deleting lines, fast at moving down (and up, if it's doubly linked) a small number of lines from a given line, and fast at jumping to the start of the document. To jump quickly to the end of the document you will also need to store a reference to the tail of the list. Going to a specific line number is relatively slow, though, as you have to start from the beginning and iterate over the lines; this shouldn't be a problem unless your documents have a great many lines.
An ordinary list is fast to move to a specific line number but slow to add or remove any line except at the end, as the elements after the insertion point must be shifted (and the whole buffer reallocated as it grows) each time a line is inserted or deleted.
I would prefer a linked list over an array-based list for the purposes of editing large documents. In either case you may have problems if the document contains any extremely long lines: strings are immutable, so every edit within a line builds a new string, which will be slow.
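A tiny illustration of the trade-off described above: List<string> gives O(1) access by line number but shifts every following element on insert, while LinkedList<string> inserts in O(1) once you hold a node reference but must be walked to reach a given line:

    using System;
    using System.Collections.Generic;

    class LineBuffers
    {
        static void Main()
        {
            // List<string>: fast indexing, but Insert shifts the tail elements.
            var list = new List<string> { "line 1", "line 2", "line 4" };
            list.Insert(2, "line 3"); // "line 4" and everything after it move

            // LinkedList<string>: no shifting, but finding a node is a walk.
            var linked = new LinkedList<string>();
            linked.AddLast("line 1");
            var node = linked.AddLast("line 2");
            linked.AddLast("line 4");
            linked.AddAfter(node, "line 3"); // O(1) given the node reference

            Console.WriteLine(string.Join(" | ", list));
            Console.WriteLine(string.Join(" | ", linked));
        }
    }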
I'm using a one-string-per-line array. Arrays are often as fast as, or faster, to update than linked lists because they have much better cache locality; in the time of a single first-level cache miss you can already move a few dozen pointer-sized items in an array. I would say for anything under 10,000 items, just use an array of pointers.
If your edited texts are small (like hand-written source code files), it is the way to go. A lot of the meta information you need for state-of-the-art text editing attaches very naturally to "beginning of a line" points, most importantly syntax highlighting.
Error lines, breakpoints, profiling info, and code coverage markers all work on lines too. The line is the native structure of source code, and in some cases also of literary text (if you are writing a word processor rather than a source code editor), though in that case you need to take a paragraph as the unit.
If I ever find the time to redesign my editor, I will add different buffer implementations, because on larger texts the overhead of all the per-line info (about 80 bytes per line on a 32-bit system) is significant. Then a gap buffer model is better; it is also much better if you don't have lines at all, for example when displaying binary files in hex mode.
And finally, a third buffer model is required when you allow a user to open large files. It's funny to see the marketing bullshit (free open source is surprisingly worse here) about unlimited file size editing, and then once you open a 400 MB log file the whole system becomes unresponsive. You need a buffer model here that does not load the whole file into the buffer first.
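For readers unfamiliar with the structure both answers mention, here is a stripped-down gap buffer sketch (an illustration, not production code): the text sits in one array with a movable gap at the cursor, so repeated inserts at the cursor are cheap and only moving the cursor costs copying:

    using System;

    class GapBuffer
    {
        private char[] buffer = new char[16];
        private int gapStart = 0;  // cursor position
        private int gapEnd = 16;   // end of the gap (exclusive)

        public void Insert(char c)
        {
            if (gapStart == gapEnd) Grow();
            buffer[gapStart++] = c; // O(1): just fill the gap
        }

        public void MoveCursor(int position)
        {
            // Shift characters across the gap until it sits at 'position'.
            while (gapStart > position)
                buffer[--gapEnd] = buffer[--gapStart];
            while (gapStart < position)
                buffer[gapStart++] = buffer[gapEnd++];
        }

        private void Grow()
        {
            var bigger = new char[buffer.Length * 2];
            Array.Copy(buffer, 0, bigger, 0, gapStart);
            int tail = buffer.Length - gapEnd;
            Array.Copy(buffer, gapEnd, bigger, bigger.Length - tail, tail);
            gapEnd = bigger.Length - tail;
            buffer = bigger;
        }

        public override string ToString() =>
            new string(buffer, 0, gapStart) +
            new string(buffer, gapEnd, buffer.Length - gapEnd);
    }

    class Demo
    {
        static void Main()
        {
            var gb = new GapBuffer();
            foreach (char c in "helo world") gb.Insert(c);
            gb.MoveCursor(3); // place the cursor after "hel"
            gb.Insert('l');   // "hello world"
            Console.WriteLine(gb);
        }
    }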
