Let's say I have the following file format (key-value pairs):
Object1Key: Object1Value
Object2Key: Object2Value
Object3Key: Object3Value
Object4Key: Object4Value1
Object4Value2
Object4Value3
Object5Key: Object5Value
Object6Key: Object6Value
I'm reading this line by line with StreamReader. For objects 1, 2, 3, 5 and 6 it wouldn't be a problem, because the whole object is on one line, so it's possible to process the object.
But for object 4 I need to process multiple lines. Can I use Peek for this? (MSDN for Peek: "Returns the next available character but does not consume it.") Is there a method like Peek which returns the next line rather than the next character?
If I can use Peek, then my question is: can I use Peek two times so I can read the next two lines (or three), until I know there is a new object (object 5) to be processed?
I would strongly recommend that you separate the IO from the line handling entirely.
Instead of making your processing code use a StreamReader, pass it either an IList<string> or an IEnumerable<string>... if you use IList<string> that will make it really easy to just access the lines by index (so you can easily keep track of "the key I'm processing started at line 5" or whatever), but it would mean either doing something clever or reading the whole file in one go.
If it's not a big file, then just using File.ReadAllLines is going to be the very simplest way of reading a file as a list of lines.
If it is a big file, use File.ReadLines to obtain an IEnumerable<string>, and then your processing code needs to be a bit smarter... for example, it might want to create a List<string> for each key that it processes, containing all the lines for that key - and let that list be garbage collected when you read the next key.
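For illustration, here is a minimal sketch of that approach, using File.ReadLines and grouping the lines that belong to each key. The ReadObjects name and the "a key line contains a colon" rule are assumptions for the example, not something taken from the answer above:

    // Streams the file lazily and yields one (key, lines) pair per object,
    // so only the lines for the current key are held in memory.
    static IEnumerable<KeyValuePair<string, List<string>>> ReadObjects(string path)
    {
        string currentKey = null;
        List<string> currentValues = null;

        foreach (string line in File.ReadLines(path))
        {
            int colon = line.IndexOf(':');
            if (colon >= 0)
            {
                // A new key starts: hand the previous object to the caller first.
                if (currentKey != null)
                    yield return new KeyValuePair<string, List<string>>(currentKey, currentValues);
                currentKey = line.Substring(0, colon);
                currentValues = new List<string> { line.Substring(colon + 1).Trim() };
            }
            else if (currentValues != null)
            {
                // Continuation line (like Object4Value2): belongs to the current key.
                currentValues.Add(line.Trim());
            }
        }

        if (currentKey != null)
            yield return new KeyValuePair<string, List<string>>(currentKey, currentValues);
    }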
There is no way to use Peek multiple times the way you intend, because it will always return only the "top" character in the stream. It reads that character, but it does not tell the stream that the character was consumed.
To sum up: the stream's position stays in the same place after Peek.
If you use, for example, a FileStream, you can use Seek to go back, but you didn't specify what type of stream you are using.
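As a minimal illustration of that behaviour (the file name is made up for the example): Peek does not advance the reader, and if you reposition the underlying FileStream with Seek, the StreamReader's internal buffer has to be discarded as well:

    using (var fs = File.OpenRead("data.txt"))
    using (var reader = new StreamReader(fs))
    {
        int peeked = reader.Peek();  // looks at the next character...
        int first = reader.Read();   // ...and Read() returns that same character,
                                     // because Peek() did not consume it.

        // Going back via the underlying stream: StreamReader buffers internally,
        // so its buffered data must be thrown away after seeking.
        fs.Seek(0, SeekOrigin.Begin);
        reader.DiscardBufferedData();
    }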
You could do something like this:
List<MyObject> objects = new List<MyObject>();
using (StreamReader sr = new StreamReader(aPath))
{
    MyObject curObj = null;   // null until the first key line has been seen
    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();
        if (line.IndexOf(':') >= 0) // or whatever identifies the beginning of a new object
        {
            curObj = new MyObject(line);
            objects.Add(curObj);
        }
        else if (curObj != null)
        {
            curObj.AddAttribute(line);  // continuation line for the current object
        }
    }
}
I did check to see if any existing questions matched mine, but I didn't see any; if I missed one, my mistake.
I have two text files to compare against each other: one is a temporary log file that is overwritten sometimes, and the other is a permanent log, which collects and appends the contents of the temp log into one file (it collects the lines written to the temp log since it last checked and appends them to the end of the complete log). However, after a point the complete log may become quite large and therefore inefficient to compare against, so I have been thinking about different ways to approach this.
My first idea is to "buffer" the temp log's lines into a list (it will normally be the smaller of the two), then simply loop through the archive log and do something like:
List<String> bufferedlines = new List<string>();
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
    string line;
    while ((line = ArchiveStream.ReadLine()) != null)
    {
        if (bufferedlines.Contains(line))
        {
            // this temp-log line is already in the archive
        }
    }
}
Now there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time; if you can, that might make things easier), then open a write stream in append mode and write the list to the file. Alternatively, cutting out buffering the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of, though I don't know whether it can be done, was rather than buffering either file, to compare the streams side by side as they are read and append the lines on the fly. Something like:
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
    using (StreamReader templogStream = new StreamReader(tempPath))
    {
        if (!(ArchiveStream.ReadAllLines.Contains(templogStream.ReadLine())))
        {
            //write the line to the file
        }
    }
}
As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask and see if anyone had insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(TempPath)
    .Except(File.ReadLines(ArchivePath))
    .ToList(); // can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines of the archive file into memory, because that's just how Except is implemented: it creates a HashSet of all of the items from the second sequence so that it can efficiently find matches from the first.
Presumably the number of lines that need to be added is pretty small, so the fact that the lines we find all need to be stored in memory isn't a problem. If there could potentially be a lot of them, you'd want to write them out to another file besides the first one (possibly concatenating the two files together when done, if needed).
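A minimal sketch of that variant, assuming the archive lines fit into a HashSet and with the NewLinesPath name made up for the example:

    // Build a set of the lines already archived, then stream the temp log
    // and write anything new to a separate output file.
    var archived = new HashSet<string>(File.ReadLines(ArchivePath));
    using (var writer = new StreamWriter(NewLinesPath, append: false))
    {
        foreach (string line in File.ReadLines(TempPath))
        {
            if (!archived.Contains(line))
                writer.WriteLine(line);
        }
    }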
I have a huge text file which I need to read. Currently I am reading the text file like this:
string[] lines = File.ReadAllLines(FileToCopy);
But here all the lines are being stored in the lines array, and only afterwards are they processed programmatically according to the condition. That is not an efficient way to do it, as it first reads irrelevant rows (lines) of the text file into the array, and the same goes for the processing.
So my question is: can I specify the line number from which to start reading the text file? Suppose last time it read 10001 lines; the next time it should start from 10002.
How can I achieve this?
Well you don't have to store all those lines - but you definitely have to read them. Unless the lines are of a fixed length (in bytes, not characters) how would you expect to be able to skip to a particular part of the file?
To store only the lines you want in memory though, use:
List<string> lines = File.ReadLines(FileToCopy).Skip(linesToSkip).ToList();
Note that File.ReadLines() was introduced in .NET 4, and reads the lines on-demand with an iterator instead of reading the entire file into memory.
If you only want to process a certain number of lines, you can use Take as well:
List<string> lines = File.ReadLines(FileToCopy)
.Skip(linesToSkip)
.Take(linesToRead)
.ToList();
So for example, linesToSkip=10000 and linesToRead=1000 would give you lines 10001-11000.
Ignore the line numbers, they're useless: if every line isn't the same length, you're going to have to read them one by one again, and that's a huge waste.
Instead, use the position of the file stream. This way, you can skip right to that position on the second attempt, with no need to read the data all over again. After that, you'll just use ReadLine in a loop until you get to the end, and record the new end position.
Please don't use ReadLines().Skip(). If you have a 10 GB file, it will read all 10 GB, create the corresponding strings, throw them away, and then, finally, read the 100 bytes you want to read. That's just crazy :) Of course, it's better than using File.ReadAllLines, but only because it doesn't need to keep the whole file in memory at once. Other than that, you're still reading every single byte of the file (you have to find out where the lines end).
Sample code of a method to read from last known location:
string[] ReadAllLinesFromBookmark(string fileName, ref long lastPosition)
{
    using (var fs = File.OpenRead(fileName))
    {
        fs.Position = lastPosition;   // jump straight to where we stopped last time
        using (var sr = new StreamReader(fs))
        {
            string line = null;
            List<string> lines = new List<string>();
            while ((line = sr.ReadLine()) != null)
            {
                lines.Add(line);
            }
            lastPosition = fs.Position;   // at end of stream, this is the new bookmark
            return lines.ToArray();
        }
    }
}
Well, you do have line numbers, in the form of the array index. Keep a note of the previously read line's array index and you can start reading from the next array index.
Use the FileStream.Position property to get the current position in the file, and set it later to resume from there.
I have a text file which contains 200000 rows. I want to read the first 50000 rows, process them, and then read the second part, say rows 50001 to 100000, and so on. When I read the second block I don't want to loop over the first 1 to 50000 rows again. I want the reader pointer to go directly to row number 50001 and start reading.
How can this be done? Which reader should I use for that?
You need the StreamReader class.
With this you can do line by line reading with the ReadLine() method. You will need to keep track of the line count yourself and call a method to process your data every 50000 lines, but so long as you keep the reader open you should not need to restart the reading.
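A minimal sketch of that idea (the file name and the ProcessBlock method are made up for the example):

    // Read line by line, collect a block of 50000 lines, process it, and carry on
    // from where the reader already is; nothing is re-read.
    using (var reader = new StreamReader("input.txt"))
    {
        var block = new List<string>(50000);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            block.Add(line);
            if (block.Count == 50000)
            {
                ProcessBlock(block);   // hypothetical processing method
                block.Clear();
            }
        }
        if (block.Count > 0)
            ProcessBlock(block);       // the final, partial block
    }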
No, unfortunately there is no way you can skip counting the lines. At the raw level, files do not work on a line-number basis; they work on a position/offset basis. The underlying file system has no concept of lines; that is a concept added by higher-level components.
So there is no way to tell the operating system "please open the file at the specified line". Instead you have to open the file and count newlines until you've passed the specified number, then store the next set of bytes into an array until you hit the next newline.
However, if each line has an equal number of bytes, then you can try the following:
using (Stream stream = File.Open(fileName, FileMode.Open))
{
    // Jump directly to the start of the requested line; this only works
    // when every line occupies exactly bytesPerLine bytes.
    stream.Seek(bytesPerLine * (myLine - 1), SeekOrigin.Begin);
    using (StreamReader reader = new StreamReader(stream))
    {
        string line = reader.ReadLine();
    }
}
I believe the best way would be to use StreamReader.
Here are two questions related to yours from which you can get answers. Ultimately, if you want to read blocks of text it is very hard to do unless each block is a set amount.
However, I believe these would be a good read for you:
Reading Block of text file
This one shows you how to separate the text into blocks to read. Its answer would be the best fit: you can keep track of how many lines you have read, check whether the line count == 50000 and so on, and then do something.
As you can see, that answer makes use of the keyword continue, which I believe will be useful for what you are intending to do.
Reading text file block by block
This one has a more readable answer, but it doesn't really answer what you are looking for in reading blocks.
For your question, I believe what you want to do has confused you a little: it seems like you want to select 50000 lines and then read them as one, but that is not the way StreamReader works. And yes, reading line by line makes the process longer, but unfortunately that's the case.
Unless the rows are exactly the same length, you can't start directly at row 50001.
What you can do, however, is when reading the first 50000 rows, remember where the last row ends. You can then seek directly to that offset and continue reading from there.
Where the row length is fixed, you do something like this:
myfile.Seek(50000 * (rowCharacters + 2), SeekOrigin.Begin);
Seek goes to a specific offset in bytes, so you just need to tell it how many bytes 50000 rows occupy. Given an ASCII encoding, that's the number of characters in the line, plus 2 for the newline sequence.
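For variable-length rows, here is a rough sketch of the "remember where the last row ends" idea. It assumes ASCII text where every row ends with a two-byte "\r\n" sequence, matching the calculation above, and the Process method is made up for the example:

    // Read one batch of rows while tracking how many bytes they occupied,
    // so the next run can Seek straight past them.
    long offset = 0;   // persist this value between runs
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        using (var reader = new StreamReader(fs, Encoding.ASCII))
        {
            int rows = 0;
            string line;
            while (rows < 50000 && (line = reader.ReadLine()) != null)
            {
                offset += line.Length + 2;   // +2 bytes for the "\r\n" newline
                Process(line);               // hypothetical per-row processing
                rows++;
            }
        }
    }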
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
For a file that consists of ~20-30 big lines (every line ~20 MB) I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results; in your case, reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort, then skipping characters until the next line and reading the next 'enough'. You probably want to create a line-index lookup so that when you reach an ambiguous sorting order you can go back and get more data from the line (Seek to a file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding; don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side Note:
If you call GC.*, you've probably done something wrong.
Setting contents = null does not help you.
If you are using a foreach and maintaining the index, then you may be better off with a for (int i...) loop, for readability.
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw data away as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T>, it allows that line to be sorted against other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
    line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the amount of lines can be higher, you would even need a more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
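As a rough sketch of the shape such a class could take (the member names are illustrative, and the actual reading logic is deliberately left out, since implementing it is the exercise):

    // A line is represented only by where it starts and ends in the file,
    // so its text never has to be held in memory all at once.
    class FileLine : IComparable<FileLine>
    {
        private readonly string path;
        public long Start { get; }   // byte offset of the first character of the line
        public long End { get; }     // byte offset just past the last character

        public FileLine(string path, long start, long end)
        {
            this.path = path;
            Start = start;
            End = end;
        }

        public int CompareTo(FileLine other)
        {
            // Compare the two lines by reading characters from the file
            // (File.Open + Seek) one at a time; implementing this is the task.
            throw new NotImplementedException();
        }

        public void AppendToFile(string target)
        {
            // Stream the characters between Start and End to the target file.
            throw new NotImplementedException();
        }
    }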
Basically your approach is bull. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the whole file in? ;) This constraint is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.
I wrote a custom XML reader because I needed something that would not read ahead from the source stream. I wanted the ability to have an object read its data from the stream without negatively affecting the stream for the parent object. That way, the stream can be passed down the object tree.
It's a minimal implementation, meant only to serve the purpose of the project that uses it (right now). It works well enough, except for one method -- ReadString. That method is used to read the current element's content as a string, stopping when the end element is reached. It determines this by counting nesting levels. Meanwhile, it's reading from the stream, character by character, adding to a StringBuilder for the resulting string.
For a collection element, this can take a long time. I'm sure there is much that can be done to better implement this, so this is where my continuing education begins once again. I could really use some help/guidance. Some notes about methods it calls:
Read - returns the next byte in the stream or -1.
ReadUntilChar - calls Read until the specified character or -1 is reached, appending to a string with StringBuilder.
Without further ado, here is my two-legged turtle. Constants have been replaced with the actual values.
public string ReadString() {
    int level = 0;
    long originalPosition = m_stream.Position;
    StringBuilder sb = new StringBuilder();
    sbyte read;

    try {
        // We are already within the element that contains the string.
        // Read until we reach an end element when the level == 0.
        // We want to leave the reader positioned at the end element.
        do {
            sb.Append(ReadUntilChar('<'));

            if((read = Read()) == '/') {
                // End element
                if(level == 0) {
                    // End element for the element in context, the string is complete.
                    // Replace the two bytes of the end element read.
                    m_stream.Seek(-2, System.IO.SeekOrigin.Current);
                    break;
                } else {
                    // End element for a child element.
                    // Add the two bytes read to the resulting string and continue.
                    sb.Append('<');
                    sb.Append('/');
                    level--;
                }
            } else {
                // Start element
                level++;
                sb.Append('<');
                sb.Append((char)read);
            }
        } while(read != -1);

        return sb.ToString().Trim();
    } catch {
        // Return to the original position that we started at.
        m_stream.Seek(originalPosition - m_stream.Position, System.IO.SeekOrigin.Current);
        throw;
    }
}
Right off the bat, you should be using a profiler for performance optimizations if you aren't already (I'd recommend SlimTune if you're on a budget). Without one you're just taking slightly educated stabs in the dark.
Once you've profiled the parser you should have a good idea of where the ReadString() method is spending all its time, which will make your optimizing much easier.
One suggestion I'd make at the algorithm level is to scan the stream first and then build the contents out: instead of consuming each character as you see it, mark where you find the <, >, and </ characters. Once you have those positions, you can pull the data out of the stream in blocks rather than throwing characters into a StringBuilder one at a time. This will optimize away a significant number of StringBuilder.Append calls, which may increase your performance (this is where profiling would help).
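As a very rough sketch of that block idea, assuming a seekable stream and a single-byte encoding (the helper name, buffer size and Encoding.ASCII choice are assumptions, and this is not a drop-in replacement for ReadString):

    // Reads in blocks and appends whole slices up to the next '<' instead of
    // appending one character at a time; the stream ends up just after the '<'.
    static string ReadUntilTagBlockwise(Stream stream)
    {
        var sb = new StringBuilder();
        byte[] buffer = new byte[4096];
        int count;
        while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            int lt = Array.IndexOf(buffer, (byte)'<', 0, count);
            if (lt < 0)
            {
                // No tag in this block: keep all of it and read the next block.
                sb.Append(Encoding.ASCII.GetString(buffer, 0, count));
            }
            else
            {
                // Keep the slice before '<' and seek back so the caller
                // continues reading right after the '<'.
                sb.Append(Encoding.ASCII.GetString(buffer, 0, lt));
                stream.Seek(lt + 1 - count, SeekOrigin.Current);
                break;
            }
        }
        return sb.ToString();
    }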
You may find this analysis useful for optimizing string operations, if they prove to be the source of the slowness.
But really, profile.
Your implementation assumes the Stream is seekable. If it is known to be seekable, why do anything? Just create an XmlReader at your position; consume the data; ditch the reader; and seek the Stream back to where you started?
How large is the XML? You may find that throwing the data into a DOM (XmlDocument / XDocument / etc.) is a viable way of getting a reader that does what you need without requiring lots of rework. In the case of XmlDocument, an XmlNodeReader would suffice, for example (it would also provide XPath support if you want to use non-trivial queries).
I wrote a custom XML reader because I needed something that would not read ahead from the source stream. I wanted the ability to have an object read its data from the stream without negatively affecting the stream for the parent object. That way, the stream can be passed down the object tree.
That sounds more like a job for XmlReader.ReadSubTree(), which lets you create a new XmlReader to pass to another object to initialise itself from the reader without it being able to read beyond the bounds of the current element.
The ReadSubtree method is not intended to create a copy of the XML data that you can work with independently. Rather, it can be used to create a boundary around an XML element. This is useful if you need to pass data to another component for processing and you wish to limit how much of your data the component can access. When you pass an XmlReader returned by the ReadSubtree method to another application, the application can access only that XML element, rather than the entire XML document.
It does say that after reading the subtree the parent reader is re-positioned to the "EndElement" of the current element rather than remaining at the beginning, but is that likely to be a problem?
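A minimal usage sketch of that pattern (the file name and the "Item" element name are made up for the example):

    using (XmlReader reader = XmlReader.Create("objects.xml"))
    {
        while (reader.ReadToFollowing("Item"))
        {
            // The child object only ever sees this one element.
            using (XmlReader itemReader = reader.ReadSubtree())
            {
                itemReader.Read();                       // move onto the <Item> element
                string xml = itemReader.ReadOuterXml();  // bounded to the subtree
            }
            // Once the subtree reader is closed, 'reader' sits on the </Item> end element.
        }
    }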
Why not use an existing one, like this one?