I have a big text file (about 20k lines) that I need to use to replace some lines of text in other text files (about 60-70 of them).
The other files can be thought of as templates. The lines in these templates need to be replaced based on some conditions.
Sample content of the file:
ISA*00* *00* *01*000123456 *ZZ*PARTNERID~ *090827*0936*U*00401*000000055*0*T*>~
GS*PO*000123456*PARTNERID*20090827*1041*2*X*004010~
ST*850*0003~
BEG*00*SA*1000012**20090827~
REF*SR*N~
CSH*Y~
TD5*****UPSG~
N1*ST*John Doe~
N3*126 Any St*~
N4*Hauppauge*NY*11788-1234*US~
PO1*1*1*EA*19.95**VN*0054321~
CTT*1*1~
SE*11*0003~
GE*1*2~
IEA*1*000000001~
I am loading a FileStream from the content file as below and reading it with a StreamReader.
FileStream baseFileStream= new FileStream("C:\\Content.txt", FileMode.Open);
Then I need to loop through the template files in a folder one by one. Once I pick a template file I will load it into another FileStream (templates will have at most 300 lines).
While reading a file I will have to go back to previous lines numerous times. But if I read the file using the ReadToEnd() or ReadLine() methods, going back to previous lines is not possible.
To overcome this I am reading the template into a collection of lines. But is it a good idea to read the content file into a collection as well, given that it's very large? There will be a lot of searching involved in this file. Would a BufferedStream be of any use here?
Or is there a better approach for this?
Thanks
In my opinion, you're almost in a catch-22 situation. Either you load the large file into memory (via your collection), which, depending on its size and the memory available on the server, might be the best approach; or you iterate through the template files and, for each iteration, open a new file stream over the large file (slower due to file I/O, but low memory consumption) so that you can perform your "search", since, as we all know, a file stream is forward-only.
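If the condition you search on can be turned into a key (for the EDI-style sample above, say the segment ID before the first '*'), a middle ground is to index the 20k-line content file into a dictionary once and then do lookups instead of repeated scans. A minimal sketch; the one-record-per-line layout and the key choice are assumptions to adapt to your real conditions:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class ContentIndex
{
    // Build an in-memory index of the content file, keyed by the token
    // before the first '*' (a stand-in for whatever your real match
    // condition is). 20k short lines is only a few megabytes, so holding
    // them in memory is cheap compared with re-scanning the file per template.
    public static Dictionary<string, List<string>> BuildIndex(string path)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string line in File.ReadLines(path))
        {
            string key = line.Split('*')[0];
            List<string> bucket;
            if (!index.TryGetValue(key, out bucket))
            {
                bucket = new List<string>();
                index[key] = bucket;
            }
            bucket.Add(line);
        }
        return index;
    }
}
```

Each template line can then be matched with a dictionary lookup instead of another pass over the content file.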
Related
I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record, as I have to do custom mapping of the records?
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
// do your magic
}
Create a class that exactly represents/mirrors your CSV file, then read all the contents into a list of that class. The following snippet is from CsvHelper's documentation.
var csv = new CsvReader(textReader);
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List via ToHashSet(). See HashSet vs List Performance
To answer your question directly: you can load the file fully into a MemoryStream and then re-read from that stream using your CsvReader. Similarly, you can create a bigger read buffer for your FileStream, e.g. 15 MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10 MB files.
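The bigger read buffer mentioned above is just a constructor argument on FileStream. A sketch with a placeholder path, and plain ReadLine standing in for your CsvReader:

```csharp
using System;
using System.IO;

class BufferedRead
{
    // Open the file with a 15 MB internal buffer (the default is around
    // 4 KB), so a 10 MB file is pulled from disk in a single read and all
    // subsequent parsing runs against memory.
    public static int CountLines(string path)
    {
        const int bufferSize = 15 * 1024 * 1024;
        int count = 0;
        using (var stream = new FileStream(path, FileMode.Open,
            FileAccess.Read, FileShare.Read, bufferSize))
        using (var reader = new StreamReader(stream))
        {
            while (reader.ReadLine() != null)
                count++;
        }
        return count;
    }
}
```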
Find your real performance bottleneck: the time to read file content from disk, the time to parse the CSV into fields, or the time to process a record? A 10 MB file looks really small; I'm processing sets of 250 MB+ CSV files with a custom CSV reader with no complaints.
If processing is the bottleneck and you have several threads available and your csv file format does not need to support escaped line breaks, then you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line using a different Task. For example:
System.IO.File.ReadLines(path)
.Skip(1) // header line. Assume trusted to be correct.
.AsParallel()
.Select(ParseRecord) // RecordClass ParseRecord(string line)
.ForAll(ProcessRecord); // void ProcessRecord(RecordClass)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk then your mileage will vary and may even get worse than a single-threaded approach.
More advanced:
If you know your files to contain 8-bit characters only, then you can operate on byte arrays and skip the StreamReader overheads to cast bytes into chars. This way you can read the entire file into a byte array in a single call and scan for line breaks assuming no line break escapes need to be supported. In that case scanning for line breaks can be done by multiple threads, each looking at a part of the byte array.
If you don't need to support field escapes (a,"b,c",d), then you can write a faster parser, simply looking for field separators (typically comma). You can also split field-demarcation parsing and field content parsing into threads if that's a bottleneck, though memory access locality may negate any benefits.
Under certain circumstances you may not need to parse fields into intermediate data structures (eg doubles, strings) and can process directly off references to the start/end of fields and save yourself some intermediate data structure creation.
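The line-break scan from the notes above can be sketched single-threaded like this; splitting the byte array into slices per thread works the same way, with some care at slice boundaries:

```csharp
using System.Collections.Generic;

class LineScanner
{
    // Scan a raw byte array for '\n' offsets, assuming an 8-bit encoding
    // and no escaped line breaks inside fields. The span between two
    // consecutive offsets is one record, parsed without any StreamReader
    // byte-to-char overhead.
    public static List<int> FindLineBreaks(byte[] data)
    {
        var breaks = new List<int>();
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == (byte)'\n')
                breaks.Add(i);
        }
        return breaks;
    }
}
```

Feed it File.ReadAllBytes(path) and hand the slices between consecutive offsets to your field parser.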
I am searching for a way to read text file content very fast, for example a 2 GB file in under 20 seconds (without loading the entire file into memory, thus keeping memory usage low).
I have read here about Memory Mapped Files. The CreateFromFile method can create a file mapping in memory. I have tested this code:
using (var mmf1 = MemoryMappedFile.CreateFromFile(file, FileMode.Open))
{
    using (var reader = mmf1.CreateViewAccessor())
    {
        // ...
    }
}
My question is: is this the correct way to index the file into memory so that it will be available for fast search?
Once the file has been mapped/indexed into memory, I hope to use a parallel loop to search the file for content. Right now I am testing this implementation. If I understand it correctly, it uses one thread to read a line from the file and a second thread to check/process that line.
Is it possible to use one thread to read and process the file content from top to bottom and a second thread to read the file content from the bottom up, thus increasing the search speed by a factor of two?
Because as soon as the needed content has been found, all threads must stop searching.
Environment: any .NET Framework version is welcome.
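On the mapping part of the question: CreateViewStream (rather than CreateViewAccessor) gives you a Stream over the mapping that a StreamReader can consume, so the OS pages the file in on demand instead of you loading it up front. A sketch, with the caveat that the mapped view is rounded up to a page boundary, so the tail can contain NUL padding that needs trimming:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfSearch
{
    // Return the zero-based line number of the first line containing
    // 'needle', or -1 if it is not found. Only the pages actually touched
    // are brought into memory by the mapping.
    public static long FindLine(string path, string needle)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var stream = mmf.CreateViewStream())
        using (var reader = new StreamReader(stream))
        {
            long lineNumber = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // the view is page-aligned, so the final "line" may be NUL padding
                line = line.TrimEnd('\0');
                if (line.Contains(needle))
                    return lineNumber;
                lineNumber++;
            }
        }
        return -1;
    }
}
```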
I have a log file that gets written to 24/7.
I am trying to create an application that will read the log file and process the data.
What's the best way to read the log file efficiently? I imagine monitoring the file with something like FileSystemWatcher. But how do I make sure I don't read the same data once it's been processed by my application? Or say the application aborts for some unknown reason, how would it pick up where it left off last?
There's usually a header and footer around the payload that's in the log file. Maybe an id field in the content as well. Not sure yet though about the id field being there.
I also imagined maybe saving the lines read count somewhere to maybe use that as bookmark.
For obvious reasons reading the whole content of the file, as well as removing lines from the log files (after loading them into your application) is out of question.
What I can think of as a partial solution is having a small database (probably something much smaller than a full-blown MySQL/MS SQL/PostgreSQL instance) and populating a table with what has been read from the log file. I am pretty sure that even if there is a power cut and the machine is booted again, most relational databases will be able to restore their state with ease. This solution requires some data that can be used to identify a row from the log file (for example: the exact time of the action logged, the machine on which the action took place, etc.)
Well, you will have to figure out your magic for your particular case yourself. If you are going to use a well-known text encoding it may be pretty simple, though. Look toward System.IO.StreamReader, its ReadLine() and DiscardBufferedData() methods, and its BaseStream property. You should be able to remember your last position in the file and rewind to that position later to start reading again, given that you are sure the file is only appended to. There are other things to consider, though, and there is no single universal answer to this.
Just as a naive example (you may still need to adjust a lot to make it work):
static void Main(string[] args)
{
    string filePath = @"c:\log.txt";
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        using (var streamReader = new StreamReader(stream, Encoding.Unicode))
        {
            long pos = 0;
            if (File.Exists(@"c:\log.txt.lastposition"))
            {
                string strPos = File.ReadAllText(@"c:\log.txt.lastposition");
                pos = Convert.ToInt64(strPos);
            }
            streamReader.BaseStream.Seek(pos, SeekOrigin.Begin); // rewind to the last saved position
            streamReader.DiscardBufferedData(); // clear the buffer
            for (;;)
            {
                string line = streamReader.ReadLine();
                if (line == null) break;
                ProcessLine(line);
            }
            // pretty sure when everything is read the position is at the end of file.
            File.WriteAllText(@"c:\log.txt.lastposition", streamReader.BaseStream.Position.ToString());
        }
    }
}
I think you will find the File.ReadLines(filename) function in conjunction with LINQ very handy for something like this. ReadAllLines() loads the entire text file into memory as a string[] array, but ReadLines lets you begin enumerating the lines immediately as it traverses the file. This not only saves you time but keeps memory usage very low, as it processes one line at a time. The using statements are important because if this program is interrupted they will close the file streams, flushing the writer and saving unwritten content to the file. Then, when it starts up, it will skip all the lines that have already been read.
int readCount = File.ReadLines("readLogs.txt").Count();
using (FileStream readLogs = new FileStream("readLogs.txt", FileMode.Append))
using (StreamWriter writer = new StreamWriter(readLogs))
{
IEnumerable<string> lines = File.ReadLines("bigLogFile.txt").Skip(readCount);
foreach (string line in lines)
{
// do something with line or batch them if you need more than one
writer.WriteLine(line);
}
}
As MaciekTalaska mentioned, I would strongly recommend using a database if this is something that is written to 24/7 and will get quite large. File systems are simply not equipped to handle such volume, and you will spend a lot of time inventing solutions where a database could do it in a breeze.
Is there a reason why it logs to a file? Files are great because they are simple to use and, being the lowest common denominator, there is relatively little that can go wrong. However, files are limited. As you say, there's no guarantee a write to the file will be complete when you read the file. Multiple applications writing to the log can interfere with each other. There is no easy sorting or filtering mechanism. Log files can grow very big very quickly and there's no easy way to move old events (say those more than 24 hours old) into separate files for backup and retention.
Instead, I would consider writing the logs to a database. The table structure can be very simple, but you get the advantage of transactions (so you can extract or back up with ease) and can search, sort and filter using an almost universally understood syntax. If you are worried about load spikes, use a message queue, like http://msdn.microsoft.com/en-us/library/ms190495.aspx for SQL Server.
To make the transition easier, consider using a logging framework like log4net. It abstracts much of this away from your code.
Another alternative is to use a system like syslog or, if you have multiple servers and a large volume of logs, flume. By moving the log files away from the source computer, you can store them or inspect them on a different machine far more effectively. However, these are probably overkill for your current problem.
Is there some way I can combine two XmlDocuments without holding the first in memory?
I have to cycle through a list of up to a hundred large (~300MB) XML files, appending to each up to 1000 nodes, repeating the whole process several times (as the new node list is cleared to save memory). Currently I load the whole XmlDocument into memory before appending new nodes, which is currently not tenable.
What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:
Never load the whole XMLDocument, instead using XmlReader and XmlWriter simultaneously to write to a temp file which is subsequently renamed.
Make a XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine( "<node>\n" )
Something else?
Any help will be much appreciated.
Edit Some more details in answer to some of the comments:
The program parses several large logs into XML, grouping them into different files by source. It only needs to run once a day, and once the XML is written a lightweight proprietary reader program gives reports on the data. Since it runs only once a day it can be slow, but it runs on a server which performs other actions, mainly file compression and transfer, which cannot be affected too much.
A database would probably be easier, but the company isn't going to do this any time soon!
As is, the program runs on the dev machine using a few GB of memory at the most, but throws out-of-memory exceptions when run on the server.
Final Edit
The task is quite low-priority, which is why it would only cost extra to get a database (though I will look into mongo).
The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.
I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.
If you wish to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.
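A sketch of that streaming approach, copying the existing document to a temp file and injecting the new nodes just before the root element closes, then swapping the temp file in. It assumes the new nodes arrive as an already-serialized fragment string, and only elements and text are copied here (comments, CDATA and whitespace would need extra cases):

```csharp
using System.IO;
using System.Xml;

class XmlAppender
{
    // Stream 'path' node by node to a temp file, write 'newNodesXml'
    // before the root element closes, then replace the original file.
    // Memory use stays flat regardless of document size.
    public static void AppendToRoot(string path, string newNodesXml)
    {
        string tempPath = path + ".tmp";
        using (var reader = XmlReader.Create(path))
        using (var writer = XmlWriter.Create(tempPath))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        reader.MoveToElement();
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.Depth == 0)
                            writer.WriteRaw(newNodesXml); // inject before the root closes
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```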
However, for the absolutely highest possible performance, you may be able to recreate this code quickly using direct string functions. You could do this, although you'd lose the ability to verify the XML structure; if one file had an error you wouldn't be able to correct it:
using (StreamWriter sw = new StreamWriter("out.xml")) {
    foreach (string filename in files) {
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        using (StreamReader sr = new StreamReader(filename)) {
            // Using .NET 4's CopyTo(); alternatively try http://bit.ly/RiovFX
            if (max_performance) {
                sr.CopyTo(sw);
            } else {
                string line;
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}
Depending on the way your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other unnecessary structures. You could do that by parsing the file line by line.
I have a method which uses a BinaryWriter to write a record consisting of a few uints and a byte array to a file. This method executes about a dozen times a second as part of my program. The code is below:
iLogFileMutex.WaitOne();
using (BinaryWriter iBinaryWriter = new BinaryWriter(File.Open(iMainLogFilename, FileMode.OpenOrCreate, FileAccess.Write)))
{
iBinaryWriter.Seek(0, SeekOrigin.End);
foreach (ViewerRecord vR in aViewerRecords)
{
iBinaryWriter.Write(vR.Id);
iBinaryWriter.Write(vR.Timestamp);
iBinaryWriter.Write(vR.PayloadLength);
iBinaryWriter.Write(vR.Payload);
}
}
iLogFileMutex.ReleaseMutex();
The above code works fine, but if I remove the line with the Seek call, the resulting binary file is corrupted: certain records are completely missing, or parts of them are not present, although the vast majority of records are written just fine. So I imagine the cause of the bug is that when I repeatedly open and close the file, the current position in the file isn't always at the end and things get overwritten.
So my question is: Why isn't C# ensuring that the current position is at the end when I open the file?
PS: I have ruled out threading issues as the cause of this bug.
If you want to append to the file, you must use FileMode.Append in your Open call, otherwise the file will open with its position set to the start of the file, not the end.
The problem is a combination of FileMode.OpenOrCreate and the type of the ViewerRecord members. One or more of them isn't of a fixed size type, probably a string.
Things go wrong when the file already exists. You'll start writing data at the start of the file, overwriting existing data, and what you write will only line up with an existing record by chance; the string would have to be exactly the same size. If you don't write enough records you won't overwrite all of the old ones, and you'll get into trouble when you read the file back: you'll read part of an old record after you've read the last written record, and you'll get junk for a while.
Making the record a fixed size doesn't really solve the problem, you'll read a good record but it will be an old one. Which particular set of old records you'll get depends on how much new data you wrote. This should be just as bad as reading garbled data.
If you really do need to preserve the old records then you should append to the file, FileMode.Append. If you don't then you should rewrite the file, FileMode.Create.
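A minimal version of the original write loop using FileMode.Append, reusing the question's identifiers. The explicit Seek disappears because an append-mode stream always starts positioned at the end, and the mutex release moves into a finally block so it runs even if a write throws:

```csharp
iLogFileMutex.WaitOne();
try
{
    // Append mode: the stream position starts at the end of the file,
    // so existing records are never overwritten.
    using (var iBinaryWriter = new BinaryWriter(
        File.Open(iMainLogFilename, FileMode.Append, FileAccess.Write)))
    {
        foreach (ViewerRecord vR in aViewerRecords)
        {
            iBinaryWriter.Write(vR.Id);
            iBinaryWriter.Write(vR.Timestamp);
            iBinaryWriter.Write(vR.PayloadLength);
            iBinaryWriter.Write(vR.Payload);
        }
    }
}
finally
{
    iLogFileMutex.ReleaseMutex(); // release even if the write throws
}
```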