Environment: Any .Net Framework welcomed.
I have a log file that gets written to 24/7.
I am trying to create an application that will read the log file and process the data.
What's the best way to read the log file efficiently? I imagine monitoring the file with something like FileSystemWatcher. But how do I make sure I don't read the same data once it's been processed by my application? Or say the application aborts for some unknown reason, how would it pick up where it left off last?
There's usually a header and footer around the payload that's in the log file. Maybe an id field in the content as well. Not sure yet though about the id field being there.
I also imagined maybe saving the lines read count somewhere to maybe use that as bookmark.
For obvious reasons reading the whole content of the file, as well as removing lines from the log files (after loading them into your application) is out of question.
What I can think of as a partial solution is having a small database (probable something much smaller than a full-blown MySQL/MS SQL/PostgreSQL instance) and populating table with what has been read from the log file. I am pretty sure that even if there is power cut off and then the machine is booted again, most of the relational databases should be able to restore it's state with ease. This solution requires some data that could be used to identify the row from the log file (for example: exact time of the action logged, machine on which the action has taken place etc.)
Well, you will have to figure out your magic for your particular case yourself. If you are going to use well-known text encoding it may be pretty simple thoght. Look toward System.IO.StreamReader and it's ReadLine(), DiscardBufferedData() methods and BaseStream property. You should be able to remember your last position in the file and rewind to that position later and start reading again, given that you are sure that file is only appended. There are other things to consider though and there is no single universal answer to this.
Just as a naive example (you may still need to adjust a lot to make it work):
static void Main(string[] args)
{
string filePath = #"c:\log.txt";
using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
using (var streamReader = new StreamReader(stream,Encoding.Unicode))
{
long pos = 0;
if (File.Exists(#"c:\log.txt.lastposition"))
{
string strPos = File.ReadAllText(#"c:\log.txt.lastposition");
pos = Convert.ToInt64(strPos);
}
streamReader.BaseStream.Seek(pos, SeekOrigin.Begin); // rewind to last set position.
streamReader.DiscardBufferedData(); // clearing buffer
for(;;)
{
string line = streamReader.ReadLine();
if( line==null) break;
ProcessLine(line);
}
// pretty sure when everything is read position is at the end of file.
File.WriteAllText(#"c:\log.txt.lastposition",streamReader.BaseStream.Position.ToString());
}
}
}
I think you will find the File.ReadLines(filename) function in conjuction with LINQ will be very handy for something like this. ReadAllLines() will load the entire text file into memory as a string[] array, but ReadLines will allow you to begin enumerating the lines immediately as it traverses through the file. This not only saves you time but keeps the memory usage very low as it is processing each line one at a time. Using statements are important because if this program is interrupted it will close the filestreams flushing the writer and saving unwritten content to the file. Then when it starts up it will skip all the files that are already read.
int readCount = File.ReadLines("readLogs.txt").Count();
using (FileStream readLogs = new FileStream("readLogs.txt", FileMode.Append))
using (StreamWriter writer = new StreamWriter(readLogs))
{
IEnumerable<string> lines = File.ReadLines(bigLogFile.txt).Skip(readCount);
foreach (string line in lines)
{
// do something with line or batch them if you need more than one
writer.WriteLine(line);
}
}
As MaciekTalaska mentioned, I would strongly recommend using a database if this is something written to 24/7 and will get quite large. File systems are simply not equipped to handle such volume and you will spend a lot of time trying to invent solutions where a database could do it in a breeze.
Is there a reason why it logs to a file? Files are great because they are simple to use and, being the lowest common denominator, there is relatively little that can go wrong. However, files are limited. As you say, there's no guarantee a write to the file will be complete when you read the file. Multiple applications writing to the log can interfere with each other. There is no easy sorting or filtering mechanism. Log files can grow very big very quickly and there's no easy way to move old events (say those more than 24 hours old) into separate files for backup and retention.
Instead, I would considering writing the logs to a database. The table structure can be very simple but you get the advantage of transactions (so you can extract or backup with ease) and search, sort and filter using an almost universally understood syntax. If you are worried about load spikes, use a message queue, like http://msdn.microsoft.com/en-us/library/ms190495.aspx for SQL Server.
To make the transition easier, consider using a logging framework like log4net. It abstracts much of this away from your code.
Another alternative is to use a system like syslog or, if you have multiple servers and a large volume of logs, flume. By moving the log files away the source computer, you can store them or inspect them on a different machine far more effectively. However, these are probably overkill for your current problem.
Related
I did check to see if any existing questions matched mine but I didn't see any, if I did, my mistake.
I have two text files to compare against each other, one is a temporary log file that is overwritten sometimes, and the other is a permanent log, which will collect and append all of the contents of the temp log into one file (it will collect new lines in the log since when it last checked and append the new lines to the end of the complete log). However after a point this may lead to the complete log becoming quite large and therefore not so efficient to compare against so i have been thinking about different methods to approach this.
my first idea is to "buffer" the temp log (being that it will normally be the smaller of the two) strings into a list and simply loop through the archive log and do something like:
List<String> bufferedlines = new List<string>();
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
if (bufferedlines.Contains(ArchiveStream.ReadLine()))
{
}
}
Now there is a couple of ways I could proceed from here, I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time, if you can that might make things easier for my options) then open a write stream in append mode and write the list to the file. alternatively, cutting out the buffering the inconsistencies, i could open a write stream while the files are being compared and on the spot write the lines that aren't matched.
The other method i could think of was limited by my knowledge of whether it could be done or not, which was rather than buffer either file, compare the streams side by side as they are read and append the lines on the fly. Something like:
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
using (StreamReader templogStream = new StreamReader(tempPath))
{
if (!(ArchiveStream.ReadAllLines.Contains(TemplogStream.ReadLine())))
{
//write the line to the file
}
}
}
As I said I'm not sure whether that would work or that it may be more efficient than the first method, so i figured i'd ask, see if anyone had insight into how this might properly be implemented, and whether it was the most efficient way or there was a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(TempPath)
.Except(File.ReadLines(ArchivePath))
.ToList();//can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing the all of the lines in the temp file into memory, because that's just how Except is implemented. It creates a HashSet of all of the items so that it can efficiently find matches from the other sequence.
Presumably here the number of lines that need to be added here is pretty small, so the fact that the lines that we find here all need to be stored in memory isn't a problem. If there will potentially be a lot the, you'd want to write them out to another file besides the first one (possibly concatting the two files together when done, if needed).
Is there some way I can combine two XmlDocuments without holding the first in memory?
I have to cycle through a list of up to a hundred large (~300MB) XML files, appending to each up to 1000 nodes, repeating the whole process several times (as the new node list is cleared to save memory). Currently I load the whole XmlDocument into memory before appending new nodes, which is currently not tenable.
What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:
Never load the whole XMLDocument, instead using XmlReader and XmlWriter simultaneously to write to a temp file which is subsequently renamed.
Make a XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine( "<node>\n" )
Something else?
Any help will be much appreciated.
Edit Some more details in answer to some of the comments:
The program parses several large logs into XML, grouping into different files by source. It only needs to run once a day, and once the XML is written there is a lightweight proprietary reader program which gives reports on the data. The program only needs to run once a day so can be slow, but runs on a server which performs other actions, mainly file compression and transfer, which cannot be effected too much.
A database would probably be easier, but the company isn't going to do this any time soon!
As is, the program runs on the dev machine using a few GB of memory at the most, but throws out of memory exceptions when run on the sever.
Final Edit
The task is quite low-prority, which is why it would only cost extra to get a database (though I will look into mongo).
The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.
I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.
If you wish to be completely certain of the XML structure, using XMLWriter and XMLReader are the best way to go.
However, for absolutely highest possible performance, you may be able to recreate this code quickly using direct string functions. You could do this, although you'd lose the ability to verify the XML structure - if one file had an error you wouldn't be able to correct it:
using (StreamWriter sw = new StreamWriter("out.xml")) {
foreach (string filename in files) {
sw.Write(String.Format(#"<inputfile name=""{0}"">", filename));
using (StreamReader sr = new StreamReader(filename)) {
// Using .NET 4's CopyTo(); alternatively try http://bit.ly/RiovFX
if (max_performance) {
sr.CopyTo(sw);
} else {
string line = sr.ReadLine();
// parse the line and make any modifications you want
sw.Write(line);
sw.Write("\n");
}
}
sw.Write("</inputfile>");
}
}
Depending on the way your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other un-necessary structures. You could do that by parsing the file line by line
I am developing a program to log data from a incoming serial communication. I have to invoke the serial box by sending a command, to recieve something. All this works fine, but i have a problem.
The program have to be run from a netbook ( approx: 1,5 gHZ, 2 gig ram ), and it can't keep up when i ask it to save these information to a XML file.
I am only getting communication every 5 second, i am not reading the file anywhere.
I use xml.save(string filename) to save the file.
Is there another, better way, to save the information to my XML, or should i use an alternative?
If i should use an alternative, which should it be?
Edit:
Added some code:
XmlDocument xml = new XmlDocument();
xml.Load(logFile);
XmlNode p = xml.GetElementsByTagName("records")[0];
for (int i = 0; i < newDat.Length; i++)
{
XmlNode q = xml.CreateElement("record");
XmlNode a = xml.CreateElement("time");
XmlNode b = xml.CreateElement("temp");
XmlNode c = xml.CreateElement("addr");
a.AppendChild(xml.CreateTextNode(outDat[i, 0]));
b.AppendChild(xml.CreateTextNode(outDat[i, 1]));
c.AppendChild(xml.CreateTextNode(outDat[i, 2]));
sendTime = outDat[i, 0];
points.Add(outDat[i, 2], outDat[i, 1]);
q.AppendChild(a);
q.AppendChild(b);
q.AppendChild(c);
p.AppendChild(q);
}
xml.AppendChild(p);
xml.Save(this.logFile);
This is the XML related code, running once every 5 seconds. I am reading (I get no error), adding some childs, and then saving it again. It is when I save that I get the error.
You may want to look at using an XMLWriter and building the XML file by hand. That would allow you to open a file and keep it open for the duration of the logging, appending one XML fragment at a time, as you read in data. The XMLReader class is optimized for forward-only writing to an XMLStream.
The above approach should be much faster when compared to using the Save method to serialize (save) a full XML document each time you read data and when you really only want to append a new fragment at the end.
EDIT
Based on the code sample you posted, it's the Load and Save that's causing the unnecessary performance bottleneck. Every time you're adding a log entry you're essentially loading the full XML document and behind the scenes parsing it into a full-blown XML tree. Then you modify the tree (by adding nodes) and then serialize it all to disk again. This is very very counter productive.
My proposed solution is really the way to go: create and open the log file only once; then use an XMLWriter to write out the XML elements one by one, each time you read new data; this way you're not holding the full contents of the XML log in memory and you're only appending small chunks of data at the end of a file - which should be unnoticeable in terms of overhead; at the end, simply close the root XML tag, close the XMLWriter and close the file. That's it! This is guaranteed to not slow down your UI even if you implement it synchronously, on the UI thread.
While not a direct answer to your question, it sounds like you're doing everything in a very linear way:
Receive command
Modify in memory XML
Save in memory XML to disk
GoTo 1
I would suggest you look into using some threading, or possibly Task's to make this more asynchronous. This would certainly be more difficult, and you would have to wrestle with the task synchronization, but in the long run it's going to perform a lot better.
I would look at having a thread (possibly the main thread, not sure if you're using WinForms, a console app or what) that receives the command, and posts the "changes" to a holding class. Then have a second thread, which periodically polls this holding class and checks it for a "Dirty" state. When it detects this state, it grabs a copy of the XML and saves it to disk.
This allows your serial communication to continue uninterrupted, regardless of how poorly the hardware you're running on performs.
Normally for log files one picks append-friendly format, otherwise you have to re-parse whole file every time you need to append new record and save the result. Plain text CSV is likely the simplest option.
One other option if you need to have XML-like file is to store list of XML fragments instead of full XML. This way you still can use XML API (XmlReader can read fragments when specifying ConformanceLevel.Frament in XmlReaderSettings of XmlReader.Create call), but you don't need to re-read whole document to append new entry - simple file-level append is enough. I.e. WCF logs are written this way.
The answer from #Miky Dinescu is one technique for doing this if your output must be an XML formatted file. The reason why is that you are asking it to completed load and reparse the entire XML file every single time you add another entry. Loading and parsing the XML file becomes more and more IO, memory, and CPU intensive the bigger the file gets. So it doesn't take long before the amount of overhead that has will overwhelm any hardware when it must run within a very limited time frame. Otherwise you need to re-think your whole process and could simply buffer all the data into an in memory buffer which you could write out (flush) at a much more leisurely pace.
I made this work, however I do not believe that it is the "best practice" method.
I have another class, where I have my XmlDocument running at all times, and then trying to save every time data is added. If it fails, it simply waits to save the next time.
I will suggest others to look at Miky Dinescu's suggestion. I just felt that I was in to deep to change how to save data.
I have XML files stored in BLOB storage, and I am trying to figure out what is the most efficient way to update them ( and/or add some elements to them). In a WebRole, I came up with this :
using (MemoryStream ms = new MemoryStream())
{
var blob = container.GetBlobReference("file.xml");
blob.DownloadToStream(msOriginal);
XDocument xDoc= XDocument.Load(ms);
// Do some updates/inserts using LINQ to XML.
blob.Delete();//Details about this later on.
using(MemoryStream msNew = new MemoryStream())
{
xDoc.Save(msNew);
msNew.Seek(0,SeekOrigin.Begin);
blob.UploadFromStream(msNew);
}
}
I am looking at these parameters considering the efficiency:
BLOB Transactions.
Bandwidth. (Not sure if it's counted, because the code runs in the data-center)
Memory consumption on the instance.
Some things to mention:
My xml files are around 150-200 KB.
I am aware of the fact that XDocument loads the whole file into
memory, and working in streams ( XmlWriter and XmlReader ) could
solve this. But I Assume this will require working with BlobStream
which could lead to less efficient transaction-wise (I think).
About blob.Delete(), without it, the uploaded xml in the blob storage
seems to be missing some closing tags at the end of it. I assumed
this is caused by a collision with the old data. I could be
completely wrong here, but using the delete solved it ( costing one
more transaction though ).
Is the code I provided is a good practice or maybe a more efficient way exists considering the parameters I mentioned ?
I believe the problem with the stream based method is that the storage client doesn't know how long the stream is before it starts to send the data. This is probably causing the content-length to not be updated, giving the appearance of missing data at the end of the file.
Working with the content of the blob in text format will help. You can download the blob contents as text and then upload as text. Doing this, you should be able to both avoid the delete (saving you 1/3rd the transactions) and have simpler code.
var blob = container.GetBlobReference("file.xml");
var xml = blob.DownloadText(); // transaction 1
var xDoc= XDocument.Parse(xml);
// Do some updates/inserts using LINQ to XML.
blob.UploadText(xDoc.ToString()); // transaction 2
Additionally, if you can recreate the file without downloading it in the first place (we can do this sometimes), then you can just upload it and overwrite the old one using one storage transaction.
var blob = container.GetBlobReference("file.xml");
var xDoc= new XDocument(/* generate file */);
blob.UploadText(xDoc.ToString()); // transaction 1
I am aware of the fact that XDocument loads the whole file into memory, and working in streams ( XmlWriter and XmlReader ) could solve this.
Not sure it would solve too much. Think about it. How do you add koolaid to the water while it is flying through the hose. That is what a stream is. Better to wait until it is in a container.
Outside of that, what is the reason for the focus on efficiency (a technical problem) rather than editing (the business problem)? Are the documents changed often enough to warrant a serious look at performance? Or are you just falling prey to the normal developer tendency to do more than what is necessary? (NOTE: I am often guilty in this area too)
Without a concept of a Flush(), the Delete is an acceptable option, at first glance. I am not sure if moving to the asynch methods might facilitate the same end with less overhead.
I am creating a downloading application and I wish to preallocate room on the harddrive for the files before they are actually downloaded as they could potentially be rather large, and noone likes to see "This drive is full, please delete some files and try again." So, in that light, I wrote this.
// Quick, and very dirty
System.IO.File.WriteAllBytes(filename, new byte[f.Length]);
It works, atleast until you download a file that is several hundred MB's, or potentially even GB's and you throw Windows into a thrashing frenzy if not totally wipe out the pagefile and kill your systems memory altogether. Oops.
So, with a little more enlightenment, I set out with the following algorithm.
using (FileStream outFile = System.IO.File.Create(filename))
{
// 4194304 = 4MB; loops from 1 block in so that we leave the loop one
// block short
byte[] buff = new byte[4194304];
for (int i = buff.Length; i < f.Length; i += buff.Length)
{
outFile.Write(buff, 0, buff.Length);
}
outFile.Write(buff, 0, f.Length % buff.Length);
}
This works, well even, and doesn't suffer the crippling memory problem of the last solution. It's still slow though, especially on older hardware since it writes out (potentially GB's worth of) data out to the disk.
The question is this: Is there a better way of accomplishing the same thing? Is there a way of telling Windows to create a file of x size and simply allocate the space on the filesystem rather than actually write out a tonne of data. I don't care about initialising the data in the file at all (the protocol I'm using - bittorrent - provides hashes for the files it sends, hence worst case for random uninitialised data is I get a lucky coincidence and part of the file is correct).
FileStream.SetLength is the one you want. The syntax:
public override void SetLength(
long value
)
If you have to create the file, I think that you can probably do something like this:
using (FileStream outFile = System.IO.File.Create(filename))
{
outFile.Seek(<length_to_write>-1, SeekOrigin.Begin);
OutFile.WriteByte(0);
}
Where length_to_write would be the size in bytes of the file to write. I'm not sure that I have the C# syntax correct (not on a computer to test), but I've done similar things in C++ in the past and it's worked.
Unfortunately, you can't really do this just by seeking to the end. That will set the file length to something huge, but may not actually allocate disk blocks for storage. So when you go to write the file, it will still fail.