My custom XML reader is a two-legged turtle. Suggestions? - c#

I wrote a custom XML reader because I needed something that would not read ahead from the source stream. I wanted the ability to have an object read its data from the stream without negatively affecting the stream for the parent object. That way, the stream can be passed down the object tree.
It's a minimal implementation, meant only to serve the purpose of the project that uses it (right now). It works well enough, except for one method -- ReadString. That method is used to read the current element's content as a string, stopping when the end element is reached. It determines this by counting nesting levels. Meanwhile, it's reading from the stream, character by character, adding to a StringBuilder for the resulting string.
For a collection element, this can take a long time. I'm sure there is much that can be done to better implement this, so this is where my continuing education begins once again. I could really use some help/guidance. Some notes about methods it calls:
Read - returns the next byte in the stream or -1.
ReadUntilChar - calls Read until the specified character or -1 is reached, appending to a string with StringBuilder.
Without further ado, here is my two-legged turtle. Constants have been replaced with the actual values.
public string ReadString() {
    int level = 0;
    long originalPosition = m_stream.Position;
    StringBuilder sb = new StringBuilder();
    int read; // Read() returns -1 at end of stream, so this must be an int (sbyte would also mangle bytes above 127)
    try {
        // We are already within the element that contains the string.
        // Read until we reach an end element when the level == 0.
        // We want to leave the reader positioned at the end element.
        do {
            sb.Append(ReadUntilChar('<'));
            if((read = Read()) == '/') {
                // End element
                if(level == 0) {
                    // End element for the element in context; the string is complete.
                    // Replace the two bytes of the end element read.
                    m_stream.Seek(-2, System.IO.SeekOrigin.Current);
                    break;
                } else {
                    // End element for a child element.
                    // Add the two bytes read to the resulting string and continue.
                    sb.Append('<');
                    sb.Append('/');
                    level--;
                }
            } else if(read != -1) { // guard so we don't append (char)-1 at end of stream
                // Start element
                level++;
                sb.Append('<');
                sb.Append((char)read);
            }
        } while(read != -1);
        return sb.ToString().Trim();
    } catch {
        // Return to the original position that we started at.
        m_stream.Seek(originalPosition - m_stream.Position, System.IO.SeekOrigin.Current);
        throw;
    }
}

Right off the bat, you should be using a profiler for performance optimization if you haven't already (I'd recommend SlimTune if you're on a budget). Without one, you're just taking slightly-educated stabs in the dark.
Once you've profiled the parser you should have a good idea of where the ReadString() method is spending all its time, which will make your optimizing much easier.
One suggestion I'd make at the algorithm level is to scan the stream first and then build the contents out: instead of consuming each character as you see it, mark the positions where you find <, >, and </. Once you have those positions, you can pull the data out of the stream in blocks rather than throwing characters into a StringBuilder one at a time. This will optimize away a significant number of StringBuilder.Append calls, which may improve your performance (this is where profiling would help).
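For instance, here is a minimal sketch of the block-copy idea, assuming the element content has already been buffered into a char[] (the names are hypothetical, your nesting-level bookkeeping is omitted, and tags are skipped rather than preserved, for brevity):

static string ExtractText(char[] buffer)
{
    var sb = new StringBuilder(buffer.Length);
    int blockStart = 0;
    for (int i = 0; i < buffer.Length; i++)
    {
        if (buffer[i] == '<')
        {
            // Copy everything since the last tag in a single call,
            // rather than one Append per character.
            sb.Append(buffer, blockStart, i - blockStart);
            while (i < buffer.Length && buffer[i] != '>') i++; // skip over the tag
            blockStart = i + 1;
        }
    }
    if (blockStart < buffer.Length)
        sb.Append(buffer, blockStart, buffer.Length - blockStart);
    return sb.ToString();
}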
You may find this analysis useful for optimizing string operations, if they prove to be the source of the slowness.
But really, profile.

Your implementation assumes the Stream is seekable. If it is known to be seekable, why do anything special? Just create an XmlReader at your position, consume the data, ditch the reader, and seek the Stream back to where you started.
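Something like this rough sketch, assuming m_stream is the seekable source:

// using System.Xml;
long start = m_stream.Position;
string content;
var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var reader = XmlReader.Create(m_stream, settings))
{
    reader.MoveToContent();
    content = reader.ReadOuterXml(); // or whatever consumption you need
}
m_stream.Position = start; // put the stream back where the parent expects it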
How large is the XML? You may find that throwing the data into a DOM (XmlDocument / XDocument / etc.) is a viable way of getting a reader that does what you need without requiring lots of rework. In the case of XmlDocument, an XmlNodeReader would suffice, for example (it would also provide XPath support if you want to use non-trivial queries).
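For example (a sketch, assuming it is acceptable to load the whole document once up front):

// using System.Xml;
var doc = new XmlDocument();
doc.Load(m_stream); // consumes the stream once

// Each child object gets a reader scoped to its own node;
// the original stream is no longer involved at all.
using (var nodeReader = new XmlNodeReader(doc.DocumentElement))
{
    nodeReader.MoveToContent();
    // hand nodeReader to the child object here
}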

I wrote a custom XML reader because I needed something that would not read ahead from the source stream. I wanted the ability to have an object read its data from the stream without negatively affecting the stream for the parent object. That way, the stream can be passed down the object tree.
That sounds more like a job for XmlReader.ReadSubtree(), which lets you create a new XmlReader to pass to another object so it can initialise itself from the reader without being able to read beyond the bounds of the current element.
The ReadSubtree method is not intended to create a copy of the XML data that you can work with independently. Rather, it can be used to create a boundary around an XML element. This is useful if you need to pass data to another component for processing and you wish to limit how much of your data the component can access. When you pass an XmlReader returned by the ReadSubtree method to another application, the application can access only that XML element, rather than the entire XML document.
It does say that after reading the subtree the parent reader is re-positioned to the "EndElement" of the current element rather than remaining at the beginning, but is that likely to be a problem?
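A minimal sketch of how that looks (the element name and the ReadFrom method are made up):

// using System.Xml;
while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "Child")
    {
        using (XmlReader sub = reader.ReadSubtree())
        {
            childObject.ReadFrom(sub); // sub cannot read past </Child>
        }
        // reader is now positioned on the EndElement of "Child"
    }
}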

Why not use an existing one, like this one?

Related

What does "forward-only access" mean exactly?

I'm looking at classes to use to read a large XML file. A fast implementation of the C# XmlReader class, XmlTextReader, provides "forward-only access." What does this mean?
"forward-only" means just that - you can only go forward through data. The main benefits of such approach are no need to store previous information (leading to low memory usage) and ability to read from non-seekable sources like TCP stream (where you can't seek back unlike with file stream that allow random access).
"Forward-only" is very easy to see for table-based structures (like reading from database) - "forward-only" reader will let you only check "current" record or move to the next row. There will be no way to access data from already seen rows via such reader (you have to save data outside of reader to be able to access it).
For XmlReader it is slightly more confusing as it produces tree structure out of stream of text. From stream reading point of view "forward-only" means you will not be able to get any data that reader already looked at (like root node that is basically first line of the file or parent node of current one as it had to be earlier in the file).
But from XML tree generation point of view "forward-only" may be confusing - it produces elements in depth-first order (because that how they are present in the text of the XML) meaning that "next" element is not necessary the one you'd like to see in the tree (especially if you expect breadth-first access like "names of all authors of this book").
Note that XmlReader allows you to access all attributes of current node at any time as it considers them part of the "current element".
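In practice, a forward-only pass looks something like this (a sketch; the file and element names are hypothetical):

// using System.Xml;
using (var reader = XmlReader.Create("books.xml"))
{
    while (reader.Read()) // forward only: once past a node, you cannot revisit it
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "book")
        {
            // Attributes of the current node are always available:
            string isbn = reader.GetAttribute("isbn");
        }
    }
}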

Read multiple lines with StreamReader using StreamReader.Peek

Let's say I have the following file format (key-value pairs):
Object1Key: Object1Value
Object2Key: Object2Value
Object3Key: Object3Value
Object4Key: Object4Value1
Object4Value2
Object4Value3
Object5Key: Object5Value
Object6Key: Object6Value
I'm reading this line by line with StreamReader. For objects 1, 2, 3, 5 and 6 it wouldn't be a problem, because the whole object is on one line, so it's possible to process the object.
But for object 4 I need to process multiple lines. Can I use Peek for this? (MSDN for Peek: Returns the next available character but does not consume it.) Is there a method like Peek which returns the next line and not the next character?
If I can use Peek, my question is: can I use Peek two times so I can read the next two (or 3) lines, until I know there is a new object (object 5) to be processed?
I would strongly recommend that you separate the IO from the line handling entirely.
Instead of making your processing code use a StreamReader, pass it either an IList<string> or an IEnumerable<string>... if you use IList<string> that will make it really easy to just access the lines by index (so you can easily keep track of "the key I'm processing started at line 5" or whatever), but it would mean either doing something clever or reading the whole file in one go.
If it's not a big file, then just using File.ReadAllLines is going to be the very simplest way of reading a file as a list of lines.
If it is a big file, use File.ReadLines to obtain an IEnumerable<string>, and then your processing code needs to be a bit smarter... for example, it might want to create a List<string> for each key that it processes, containing all the lines for that key - and let that list be garbage collected when you read the next key.
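For example, a rough sketch of that shape, assuming (as in your sample data) that a line containing ':' starts a new key:

// using System.IO; using System.Collections.Generic;
static IEnumerable<List<string>> ReadKeyGroups(string path)
{
    List<string> group = null;
    foreach (string line in File.ReadLines(path))
    {
        if (line.Contains(":")) // a new key starts a new group
        {
            if (group != null)
                yield return group; // the previous group can be collected once processed
            group = new List<string>();
        }
        group?.Add(line.Trim()); // continuation lines before the first key are dropped
    }
    if (group != null)
        yield return group;
}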
There is no way to use Peek multiple times as you intend, because it will always return only the "top" character in the stream. It reads the character but does not tell the stream that it was consumed.
To sum up, the stream position stays in the same place after a Peek.
If you are using, for example, a FileStream, you can use Seek to go back, but you didn't specify what type of stream you are using.
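To illustrate (aPath being whatever file you are reading):

using (var sr = new StreamReader(aPath))
{
    int a = sr.Peek(); // looks at the next character...
    int b = sr.Peek(); // ...the same character again; nothing was consumed
    int c = sr.Read(); // still the same character, but now it has been consumed
}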
You could do something like this:
List<MyObject> objects = new List<MyObject>();
using (StreamReader sr = new StreamReader(aPath))
{
    MyObject curObj = null; // must be initialized, or the compiler rejects its use below
    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();
        if (line.IndexOf(':') >= 0) // or whatever identifies the beginning of a new object
        {
            curObj = new MyObject(line);
            objects.Add(curObj);
        }
        else if (curObj != null)
        {
            curObj.AddAttribute(line);
        }
    }
}

Best way to read large numbers of XML files

What is the best approach to reading large numbers of XML files (I need to read 8,000 of them) and doing some computations on them, with the best speed? Is it OK to use an XmlReader and return the nodes I'm interested in in a list? Or is it faster, when reading a node, to also do some computations on it? I tried the second approach (returning the nodes in a list, as values), because I tried to write my application with as many modules as possible. I am using C#, but this is not relevant.
Thank you.
Is it OK to use an XmlReader and return the nodes I'm interested in in a list? Or is it faster, when reading a node, to also do some computations on it?
I can't say whether returning a list is OK or not, because I don't know how large each file is, which matters more here than the number of XML documents.
However, it certainly could be very expensive if an XML document, and hence the list produced, were very large.
Conversely, reading each node and calculating as you go will certainly start producing results sooner and use less memory, and hence be faster, to a degree ranging from negligible to so considerable that other approaches become infeasible, depending on just how large the source data is. It's the approach I take if I either have a strong concern about performance or a good reason to expect such a large dataset.
Somewhere between the two is the approach of an IEnumerable<T> implementation that yields objects as it reads, along the lines of:
public IEnumerable<SomeObject> ExtractFromXml(XmlReader rdr)
{
    using (rdr)
    {
        while (rdr.Read())
        {
            // Note: this must be ==, not =, or the condition won't compile.
            if (rdr.NodeType == XmlNodeType.Element && rdr.LocalName == "thatElementYouReallyCareAbout")
            {
                var current = /* Code to create a SomeObject from the XML goes here */
                yield return current;
            }
        }
    }
}
As with producing a list, this separates the code doing the calculation from the code parsing the XML, but because you can start enumerating through it with a foreach before the parsing has finished, the memory use can be lower, as can the time before the calculation starts. This makes little difference with small documents, but a lot if they are large.
The best solution I have personally come up with for dealing with XML files is to take advantage of .NET's XmlSerializer class. You can define a model for your XML and create a List of that model where you keep your XML data, then:
using (StreamWriter sw = new StreamWriter("OutPutPath")) {
    new XmlSerializer(typeof(List<Model>)).Serialize(sw, Models);
    sw.WriteLine();
}
You can then read the file back and deserialize the data into the model by calling the Deserialize method.
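The read side is the mirror image (a sketch, reusing the Model type from above):

using (StreamReader sr = new StreamReader("InputPath")) {
    var models = (List<Model>)new XmlSerializer(typeof(List<Model>)).Deserialize(sr);
    // models now holds the file's data, ready for your computations
}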

Stitching together multiple streams in one Stream class

I want to make a class (let's call the class HugeStream) that takes an IEnumerable<Stream> in its constructor. This HugeStream should implement the Stream abstract class.
Basically, I have one to many UTF-8 streams coming from a DB that, when put together, make a gigantic XML document. The HugeStream needs to be file-backed so that I can seek back to position 0 of the whole stitched-together stream at any time.
Anyone know how to make a speedy implementation of this?
I saw something similar created at this page but it does not seem optimal for handling large numbers of large streams. Efficiency is the key.
On a side note, I'm having trouble visualizing Streams and am a little confused now that I need to implement my own Stream. If there's a good tutorial on implementing the Stream class that anyone knows of, please let me know; I haven't found any good articles browsing around. I just see a lot of articles on using already-existing FileStreams and MemoryStreams. I'm a very visual learner and for some reason can't find anything useful to study the concept.
Thanks,
Matt
If you only read data sequentially from the HugeStream, then it simply needs to read each child stream (and append it into a local file as well as returning the read data to the caller) until the child-stream is exhausted, then move on to the next child-stream. If a Seek operation is used to jump "backwards" in the data, you must start reading from the local cache file; when you reach the end of the cache file, you must resume reading the current child stream where you left off.
So far, this is all pretty straightforward to implement - you just need to forward the Read calls to the appropriate stream, and switch streams as each one runs out of data.
The inefficiency of the quoted article is that it runs through all the streams every time you read to work out where to continue reading from. To improve on this, you need to open the child streams only as you need them, and keep track of the currently-open stream so you can just keep reading more data from that current stream until it is exhausted. Then open the next stream as your "current" stream and carry on. This is pretty straightforward, as you have a linear sequence of streams, so you just step through them one by one, i.e. something like:
int currentStreamIndex = 0;
Stream currentStream = childStreams[currentStreamIndex++];
...

public override int Read(byte[] buffer, int offset, int count)
{
    int totalBytesRead = 0;
    while (count > 0)
    {
        // Read what we can from the current stream
        int numBytesRead = currentStream.Read(buffer, offset, count);
        totalBytesRead += numBytesRead;
        count -= numBytesRead;
        offset += numBytesRead;

        // If we haven't satisfied the read request, we have exhausted the child stream.
        // Move on to the next stream and loop around to read more data.
        if (count > 0)
        {
            // If we run out of child streams to read from, we're at the end of the
            // HugeStream, and there is no more data to read
            if (currentStreamIndex >= numberOfChildStreams)
                break;

            // Otherwise, close the current child-stream and open the next one
            currentStream.Close();
            currentStream = childStreams[currentStreamIndex++];
        }
    }

    // Here, you'd write the data you've just read (into buffer) to your local cache stream

    return totalBytesRead; // Read must report how many bytes it actually delivered
}
To allow seeking backwards, you just have to introduce a new local file stream that you copy all the data into as you read (see the comment in my pseudocode above). You need to introduce some state so you know that you are reading from the cache file rather than the current child stream, and then simply access the cache directly (seeking etc. is easy, because the cache represents the entire history of the data read from the HugeStream, so seek offsets are identical between the HugeStream and the cache - you simply redirect any Read calls to get the data out of the cache stream).
If you then read or seek beyond the end of the cache stream, you need to resume reading data from the current child stream. Just go back to the logic above and continue appending data to your cache stream.
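In sketch form (the names here are hypothetical: _cache is the local cache FileStream, _position is the HugeStream's logical position, and ReadFromChildStreams wraps the child-stream logic shown earlier):

public override int Read(byte[] buffer, int offset, int count)
{
    if (_position < _cache.Length)
    {
        // We are inside already-cached history: serve the read from the cache file.
        _cache.Position = _position;
        int n = _cache.Read(buffer, offset, count);
        _position += n;
        return n;
    }
    // Past the end of the cache: read from the current child stream,
    // appending whatever is read to _cache as we go.
    return ReadFromChildStreams(buffer, offset, count);
}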
If you wish to be able to support full random access within the HugeStream you will need to support seeking "forwards" (beyond the current end of the cache stream). If you don't know the lengths of the child streams beforehand, you have no choice but to simply keep reading data into your cache until you reach the seek offset. If you know the sizes of all the streams, then you could seek directly and more efficiently to the right place, but you will then have to devise an efficient means for storing the data you read to the cache file and recording which parts of the cache file contain valid data and which have not actually been read from the DB yet - this is a bit more advanced.
I hope that makes sense to you and gives you a better idea of how to proceed...
(You shouldn't need to implement much more than the Read and Seek methods to get this working.)

What is a good method to handle line based network I/O streams?

Note: Let me apologize for the length of this question; I had to put a lot of information into it. I hope that doesn't cause too many people to simply skim it and make assumptions. Please read it in its entirety. Thanks.
I have a stream of data coming in over a socket. This data is line oriented.
I am using the APM (Asynchronous Programming Model) of .NET (BeginRead, etc.). This precludes using stream-based I/O because async I/O is buffer-based. It is possible to repackage the data and send it to a stream, such as a MemoryStream, but there are issues there as well.
The problem is that my input stream (which I have no control over) doesn't give me any information on how long the stream is. It is simply a stream of newline-terminated lines looking like this:
COMMAND\n
...Unpredictable number of lines of data...\n
END COMMAND\n
....repeat....
So, using APM, and since i don't know how long any given data set will be, it is likely that blocks of data will cross buffer boundaries requiring multiple reads, but those multiple reads will also span multiple blocks of data.
Example:
Byte buffer[1024] = ".................blah\nThis is another l"
[another read]
"ine\n.............................More Lines..."
My first thought was to use a StringBuilder and simply append the buffer contents to the SB. This works to some extent, but I found it difficult to extract blocks of data. I tried using a StringReader to read the newlined data, but there was no way to know whether you were getting a complete line or not, as StringReader returns a partial line at the end of the last block added, followed by null afterwards. There isn't a way to know if what was returned was a complete newline-terminated line of data.
Example:
// Note: no newline at the end
StringBuilder sb = new StringBuilder("This is a line\nThis is incomp..");
StringReader sr = new StringReader(sb.ToString()); // StringReader takes a string, not a StringBuilder

string s = sr.ReadLine(); // returns "This is a line"
s = sr.ReadLine();        // returns "This is incomp.."
What's worse is that if I just keep appending the data, the buffers get bigger and bigger; since this could run for weeks or months at a time, that's not a good solution.
My next thought was to remove blocks of data from the SB as I read them. This required writing my own ReadLine function, but then I'm stuck locking the data during reads and writes. Also, the larger blocks of data (which can consist of hundreds of reads and megabytes of data) require scanning the entire buffer looking for newlines. It's not efficient and pretty ugly.
I'm looking for something that has the simplicity of a StreamReader/Writer with the convenience of async I/O.
My next thought was to use a MemoryStream and write the blocks of data to the memory stream, then attach a StreamReader to the stream and use ReadLine, but again I have issues with knowing whether the last read in the buffer is a complete line or not, plus it's even harder to remove the "stale" data from the stream.
I also thought about using a thread with synchronous reads. This has the advantage that, using a StreamReader, ReadLine() will always return a full line, except in broken-connection situations. However, this has issues with canceling the connection, and certain kinds of network problems can result in hung blocking sockets for an extended period of time. I'm using async I/O because I don't want to tie up a thread for the life of the program blocking on data receive.
The connection is long-lasting, and data will continue to flow over time. During the initial connection, there is a large flow of data, and once that flow is done the socket remains open waiting for real-time updates. I don't know precisely when the initial flow has "finished", since the only way to know is that no more data is sent right away. This means I can't wait for the initial data load to finish before processing; I'm pretty much stuck processing "in real time" as it comes in.
So, can anyone suggest a good method to handle this situation in a way that isn't overly complicated? I really want this to be as simple and elegant as possible, but I keep coming up with more and more complicated solutions due to all the edge cases. I guess what I want is some kind of FIFO into which I can easily keep appending more data while at the same time popping out data that matches certain criteria (i.e., newline-terminated strings).
That's quite an interesting question. The solution for me in the past has been to use a separate thread with synchronous operations, as you propose. (I managed to get around most of the problems with blocking sockets using locks and lots of exception handlers.) Still, using the in-built asynchronous operations is typically advisable as it allows for true OS-level async I/O, so I understand your point.
Well I've gone and written a class for accomplishing what I believe you need (in a relatively clean manner I would say). Let me know what you think.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class AsyncStreamProcessor : IDisposable
{
    protected StringBuilder _buffer;  // Buffer for unprocessed data.
    private bool _isDisposed = false; // True if object has been disposed

    public AsyncStreamProcessor()
    {
        _buffer = null;
    }

    public IEnumerable<string> Process(byte[] newData)
    {
        // Note: replace the following encoding method with whatever you are reading.
        // The trick here is to add an extra line break to the new data so that the algorithm recognises
        // a single line break at the end of the new data.
        using (var newDataReader = new StringReader(Encoding.ASCII.GetString(newData) + Environment.NewLine))
        {
            // Read all lines from new data, returning all but the last.
            // The last line is guaranteed to be incomplete (or possibly complete except for the line break,
            // which will be processed with the next packet of data).
            string line, prevLine = null;
            while ((line = newDataReader.ReadLine()) != null)
            {
                if (prevLine != null)
                {
                    yield return (_buffer == null ? string.Empty : _buffer.ToString()) + prevLine;
                    _buffer = null;
                }
                prevLine = line;
            }

            // Store last incomplete line in buffer.
            if (_buffer == null)
                // Note: the (* 2) gives you the prediction of the length of the incomplete line,
                // so that the buffer does not have to be expanded in most/all situations.
                // Change it to whatever seems appropriate.
                _buffer = new StringBuilder(prevLine, prevLine.Length * 2);
            else
                _buffer.Append(prevLine);
        }
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    private void Dispose(bool disposing)
    {
        if (!_isDisposed)
        {
            if (disposing)
            {
                // Dispose managed resources.
                _buffer = null;
                GC.Collect();
            }

            // Dispose native resources.

            // Remember that object has been disposed.
            _isDisposed = true;
        }
    }
}
An instance of this class should be created for each NetworkStream and the Process function should be called whenever new data is received (in the callback method for BeginRead, before you call the next BeginRead I would imagine).
Note: I have only verified this code with test data, not actual data transmitted over the network. However, I wouldn't anticipate any differences...
Also, a warning that the class is of course not thread-safe, but as long as BeginRead isn't executed again until after the current data has been processed (as I presume you are doing), there shouldn't be any problems.
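For example, wiring it into the read callback might look something like this (the names here are hypothetical placeholders for your own code):

private void OnDataReceived(IAsyncResult ar)
{
    int bytesRead = _networkStream.EndRead(ar);
    byte[] chunk = new byte[bytesRead];
    Array.Copy(_readBuffer, chunk, bytesRead);

    // Process yields only complete lines; any trailing fragment
    // stays buffered inside the processor until the next packet.
    foreach (string line in _processor.Process(chunk))
        HandleLine(line);

    _networkStream.BeginRead(_readBuffer, 0, _readBuffer.Length, OnDataReceived, null);
}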
Hope this works for you. Let me know if there are remaining issues and I will try to modify the solution to deal with them. (There could well be some subtlety of the question I missed, despite reading it carefully!)
What you're describing in your question reminds me very much of ASCIZ strings (link text). That may be a helpful start.
I had to write something similar to this in college for a project I was working on. Unfortunately, I had control over the sending socket, so I inserted a message-length field as part of the protocol. However, I think that a similar approach may benefit you.
How I approached my solution was that I would send something like 5HELLO, so first I'd see 5 and know the message length was 5, and therefore the message I needed was 5 characters. However, if on my async read I only got 5HE, I would see that I have a message length of 5 but was only able to read 3 bytes off the wire (let's assume ASCII characters). Because of this, I knew I was missing some bytes and stored what I had in a fragment buffer. I had one fragment buffer per socket, thereby avoiding any synchronization problems. The rough process is below (a sketch in code follows the list):
1. Read from the socket into a byte array, recording how many bytes were read.
2. Scan through byte by byte until you find a newline character (this becomes very complex if you're not receiving ASCII characters but characters that could span multiple bytes; you're on your own for that).
3. Turn your frag buffer into a string, and append your read buffer up to the newline to it. Drop this string as a completed message onto a queue or its own delegate to be processed. (You can optimize these buffers by having your read socket write into the same byte array as your fragment, but that's harder to explain.)
4. Continue looping through; every time you find a newline, create a string from the byte array using the recorded start/end positions and drop it on the queue/delegate for processing.
5. Once you hit the end of the read buffer, copy anything that's left into the frag buffer.
6. Call BeginRead on the socket, which jumps back to step 1 when data is available on the socket.
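A rough sketch of steps 2 through 5, assuming single-byte (ASCII) characters (readBuffer, bytesRead, _frag and messageQueue are hypothetical names; _frag is a per-socket List<byte>):

int start = 0;
for (int i = 0; i < bytesRead; i++)
{
    if (readBuffer[i] == (byte)'\n')
    {
        // Complete message: any leftover fragment plus this buffer up to the newline.
        string message = Encoding.ASCII.GetString(_frag.ToArray())
                       + Encoding.ASCII.GetString(readBuffer, start, i - start);
        _frag.Clear();
        messageQueue.Enqueue(message); // or hand it to a delegate
        start = i + 1;
    }
}
// Whatever follows the last newline is an incomplete fragment; keep it for the next read.
for (int i = start; i < bytesRead; i++)
    _frag.Add(readBuffer[i]);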
Then you use another thread to read your queue of incoming messages, or just let the ThreadPool handle it using delegates, and do whatever data processing you have to do. Someone will correct me if I'm wrong, but there are very few thread synchronization issues with this, since you can only be reading or waiting to read from the socket at any one time, so there is no worry about locks (except if you're populating a queue; I used delegates in my implementation). There are a few details you will need to work out on your own, like how big a frag buffer to keep: if you receive no newlines when you do a read, the entire message must be appended to the fragment buffer without overwriting anything. I think it ran me about 700-800 lines of code in the end, but that included the connection setup, negotiation for encryption, and a few other things.
This setup performed very well for me; I was able to achieve up to 80Mbps on a 100Mbps Ethernet LAN using this implementation on a 1.8GHz Opteron, including encryption processing. And since you're tied to the socket, the server will scale, since multiple sockets can be worked on at the same time. If you need items processed in order, you'll need to use a queue, but if order doesn't matter, then delegates will give you very scalable performance out of the ThreadPool.
Hope this helps; it's not meant to be a complete solution, but a direction in which to start looking.
*Just a note: my implementation was done purely at the byte level and supported encryption; I used characters in my example to make it easier to visualize.
