I have a very large XML file so I am using XmlReader in C#. Problem is some of the content contains XML-like markers that should not be processed by XmlReader.
<Narf name="DOH">Mark's test of <newline> like stuff</Narf>
This is legacy data, so it cannot be refactored... (of course)
I have tried ReadInnerXml but get the whole node.
I have tried ReadElementContentAsString but get an exception saying 'newline' is not closed.
// Does not deal with markup in the content (Both lines)
ms.mText = reader.ReadElementContentAsString();
XElement el = XNode.ReadFrom(reader) as XElement; ms.mText = el.ToString();
What I want is ms.mText to equal "Mark's test of <newline> like stuff" and not an exception.
System.Xml.XmlException was unhandled
HResult=-2146232000
LineNumber=56
LinePosition=63
Message=The 'newline' start tag on line 56 position 42 does not match the end tag of 'Narf'. Line 56, position 63.
Source=System.Xml
The question flagged as a duplicate did not solve the problem, because it requires changing the input to remove the problem before using the data. As stated above, this is legacy data.
I figured it out based on responses here! Not elegant, but works...
public class TextWedge : TextReader
{
private StreamReader mSr = null;
private string mBuffer = "";
public TextWedge(string filename)
{
mSr = File.OpenText(filename);
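// Pre-fill a 50-character look-ahead buffer
for (int i = 0; i < 50; i++)
{
int ic = mSr.Read();
if (ic == -1) break; // file is shorter than the look-ahead window
mBuffer += (char)ic;
}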
}
public override int Peek()
{
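// Peek must return the next character (or -1), which is the head of the look-ahead buffer
return mBuffer.Length > 0 ? mBuffer[0] : -1;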
}
public override int Read()
{
int iRet = -1;
if (mBuffer.Length > 0)
{
iRet = mBuffer[0];
int ic = mSr.Read();
char c = (char)ic;
mBuffer = mBuffer.Remove(0, 1);
if (ic != -1)
{
mBuffer += c;
// Run through the battery of non-xml tags
mBuffer = mBuffer.Replace("<newline>", "[br]");
}
}
return iRet;
}
}
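A variation on the same idea, as a minimal sketch rather than a drop-in replacement: instead of swapping <newline> for [br], the wrapper can escape the known pseudo-tags as entities, so that XmlReader hands back the original text. The names EscapingReader and EscapingReaderDemo are made up for this illustration, and the sketch assumes the list of non-XML tags is known up front and contains no empty strings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper (not part of the framework): escapes known pseudo-tags
// so that XmlReader sees them as plain text instead of markup.
public sealed class EscapingReader : TextReader
{
    private readonly TextReader _inner;
    private readonly string[] _tags;
    private readonly int _maxTagLength = 1;
    private readonly StringBuilder _lookahead = new StringBuilder();
    private readonly Queue<char> _pending = new Queue<char>();

    public EscapingReader(TextReader inner, params string[] tags)
    {
        _inner = inner;
        _tags = tags;
        foreach (string tag in tags)
            _maxTagLength = Math.Max(_maxTagLength, tag.Length);
    }

    // Ensures _pending holds the next output character; returns false at end of input.
    private bool Prepare()
    {
        if (_pending.Count > 0) return true;

        // Read ahead far enough to recognise the longest tag.
        while (_lookahead.Length < _maxTagLength)
        {
            int c = _inner.Read();
            if (c < 0) break;
            _lookahead.Append((char)c);
        }
        if (_lookahead.Length == 0) return false;

        string head = _lookahead.ToString();
        foreach (string tag in _tags)
        {
            if (head.StartsWith(tag, StringComparison.Ordinal))
            {
                // Queue the escaped form, e.g. "&lt;newline&gt;".
                string escaped = tag.Replace("<", "&lt;").Replace(">", "&gt;");
                foreach (char ch in escaped) _pending.Enqueue(ch);
                _lookahead.Remove(0, tag.Length);
                return true;
            }
        }

        // No tag starts here: pass one character through unchanged.
        _pending.Enqueue(_lookahead[0]);
        _lookahead.Remove(0, 1);
        return true;
    }

    public override int Peek() { return Prepare() ? _pending.Peek() : -1; }
    public override int Read() { return Prepare() ? _pending.Dequeue() : -1; }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class EscapingReaderDemo
{
    public static string ReadNarf(string xml)
    {
        using (var reader = XmlReader.Create(new EscapingReader(new StringReader(xml), "<newline>")))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    public static void Main()
    {
        // Prints: Mark's test of <newline> like stuff
        Console.WriteLine(ReadNarf("<Narf name=\"DOH\">Mark's test of <newline> like stuff</Narf>"));
    }
}
```

For a real file, the StringReader would be replaced by File.OpenText(filename). Reading one character at a time is slow on very large inputs, so this only shows the shape of the approach.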
I'm reading string data from inside a file. When I search the string data I read, the value I want does not seem to exist. Can you help with this topic?
The word I'm trying to search is: GTA:SA:MP
The code I use is:
static byte[] ReadFile(string filePath)
{
byte[] buffer;
FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
try
{
int length = (int)fileStream.Length; // get file length
buffer = new byte[length]; // create buffer
int count; // actual number of bytes read
int sum = 0; // total number of bytes read
// read until Read method returns 0 (end of the stream has been reached)
while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
sum += count; // sum is a buffer offset for next reading
}
finally
{
fileStream.Close();
}
return buffer;
}
static void Main(string[] args)
{
byte[] data = ReadFile(@"FILE.exe");
string result = Encoding.ASCII.GetString(data);
if (result.Contains("GTA:SA:MP"))
{
Console.WriteLine("Found");
}
else
{
Console.WriteLine("Not found");
}
Console.ReadLine();
}
The output I get: Not found
You've got a couple of problems. As others have pointed out, if your source is bytes then you should compare bytes, not strings; otherwise you run into encoding issues. The second issue is that you're using a buffer but not checking for any boundary conditions, where the pattern you're searching for is split across the buffer size boundary. One simple way to handle this is to treat the source as a stream and check it byte by byte. I'll include an example using a simple state machine made from local functions.
I used local functions just because it seemed fun; you can do this in myriad ways.
static void Main(string[] _)
{
byte[] target = Encoding.UTF8.GetBytes("2:30pm");
long offsetInSource = 0;
int indexOfTarget = 0;
long current = 0;
bool found = false;
Func<byte, byte, bool> match = CheckStart;
using (BinaryReader reader = new BinaryReader(File.Open("foo.txt", FileMode.Open)))
{
while (current < reader.BaseStream.Length)
{
var b = reader.ReadByte();
var t = target[indexOfTarget];
if (match(t, b))
{
found = true;
break;
}
++current;
}
}
if (found)
{
Console.WriteLine($"Found matching pattern at: {offsetInSource}");
}
else
{
Console.WriteLine("Did not find pattern");
}
bool CheckStart(byte t, byte b)
{
if (t == b)
{
offsetInSource = current;
if (++indexOfTarget == target.Length)
return true;
match = CheckRest;
}
return false;
}
bool CheckRest(byte t, byte b)
{
if (t == b)
{
if (++indexOfTarget == target.Length)
return true;
}
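else
{
// Mismatch: restart, and re-test this byte as a possible start of the pattern
indexOfTarget = 0;
match = CheckStart;
return CheckStart(target[0], b);
}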
return false;
}
}
}
If your file is huge, you can read it as text in chunks of, say, 500 characters, store each chunk in a string variable, and search for your phrase in it. If the phrase is not found, read the next 500 characters starting at an offset of 450 (500 - 50), so that consecutive chunks overlap by 50 characters and a phrase split across a chunk boundary is still found. The overlap has to be at least the length of the phrase minus one. Repeat until the phrase is found or EOF is reached.
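The overlapping-chunk idea can be sketched as follows, working on bytes rather than text to sidestep the encoding issue mentioned earlier. This is an illustration under assumptions, not tested against the asker's file: the ChunkSearch class, its IndexOf method, and the chunk size are all hypothetical names and values.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkSearch
{
    // Returns the byte offset of the first occurrence of pattern in stream, or -1.
    public static long IndexOf(Stream stream, byte[] pattern, int chunkSize = 4096)
    {
        if (pattern.Length == 0) return 0;

        int overlap = pattern.Length - 1;          // tail carried over between chunks
        byte[] buffer = new byte[chunkSize + overlap];
        int carried = 0;                           // bytes kept from the previous chunk
        long bufferStart = 0;                      // stream offset of buffer[0]
        int read;

        while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
        {
            int valid = carried + read;

            // Naive scan of the valid part of the buffer.
            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return bufferStart + i;
            }

            // Keep the last (pattern.Length - 1) bytes so a match that straddles
            // the chunk boundary is still seen on the next pass.
            int keep = Math.Min(overlap, valid);
            Array.Copy(buffer, valid - keep, buffer, 0, keep);
            bufferStart += valid - keep;
            carried = keep;
        }
        return -1;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("xxxxGTA:SA:MPyy");
        byte[] pattern = Encoding.ASCII.GetBytes("GTA:SA:MP");
        using (var stream = new MemoryStream(data))
        {
            // A tiny chunk size forces the pattern to straddle chunk boundaries.
            Console.WriteLine(IndexOf(stream, pattern, chunkSize: 4)); // prints 4
        }
    }
}
```

For a real file the stream would come from File.OpenRead(path). If the text happens to be stored as UTF-16 inside the executable, which is only a guess here, the pattern bytes would need to come from Encoding.Unicode.GetBytes instead of Encoding.ASCII.GetBytes.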
I hope all of you are having a nice day. So I fixed one error of my program but there's another :/
So here's the code where I create my and read the data from a file:
void ReadData(string fileName, Branch[] branches)
{
string shopsName = null;
using (StreamReader reader = new StreamReader(@fileName))
{
string line = null;
line = reader.ReadLine();
if (line != null)
{
shopsName = line;
}
Branch tempBranches = TempBranch(branches, shopsName);
string address = reader.ReadLine();
string phoneNumber = reader.ReadLine();
while (null != (line = reader.ReadLine()))
{
string[] values = line.Split(';');
string facturer = values[0];
string model = values[1];
double capacity = double.Parse(values[2]);
string energyClass = values[3];
string assemblyType = values[4];
string color = values[5];
string attribute = values[6];
double cost = double.Parse(values[7]);
Fridges fridge = new Fridges(facturer, model, capacity, energyClass, assemblyType, color, attribute, cost);
tempBranches.fridges.AddFridge(fridge);
}
}
}
And there's the code where I use the TempBranch method. The error is in this line: if (branches[i].ShopsName == shopsName). Hopefully you can help me, cuz I was trying to fix this yesterday for 30 minutes and it still wasn't working :D
private static Branch TempBranch(Branch[] branches, string shopsName)
{
for (int i = 0; i < MaxNumberOfFridges; i++)
{
if (branches[i].ShopsName == shopsName)
{
return branches[i];
}
}
return null;
}
If you replace MaxNumberOfFridges with branches.Length it will only try to find a Branch that's within the range of the branches array. The reason it's not working is because you're trying to access an index which is greater than the Length of the array.
Try this one. Use foreach if you don't know the length of the array.
private static Branch TempBranch(Branch[] branches, string shopsName)
{
foreach(var branch in branches)
{
if (branch.ShopsName == shopsName)
{
return branch;
}
}
return null;
}
You can also make use of a LINQ query:
return branches.Where(b => b.ShopsName == shopsName).FirstOrDefault();
EDIT:
The NullReferenceException in your new post occurs because the function that looks up your shop returns null when it cannot find the given shop name.
The code then tries to add a fridge to a shop that does not exist, which is not possible. You will have to add a null check so that this does not occur.
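A minimal sketch of such a check, assuming a stripped-down Branch class with only the ShopsName property (the real class has more members) and a hypothetical BranchLookup.Find helper standing in for TempBranch:

```csharp
using System;
using System.Linq;

// Stand-in for the asker's Branch class; only the property used here.
public class Branch
{
    public string ShopsName { get; set; }
}

public static class BranchLookup
{
    // Returns the matching branch, or null when no branch has that name.
    public static Branch Find(Branch[] branches, string shopsName)
    {
        // Skip empty slots so a partially filled array does not throw.
        return branches.FirstOrDefault(b => b != null && b.ShopsName == shopsName);
    }

    public static void Main()
    {
        var branches = new Branch[] { new Branch { ShopsName = "North" }, null };

        foreach (string name in new[] { "North", "South" })
        {
            Branch branch = Find(branches, name);
            if (branch == null)
            {
                // Nothing to add fridges to, so bail out instead of dereferencing null.
                Console.WriteLine("No branch named " + name + "; skipping its fridges.");
                continue;
            }
            Console.WriteLine("Found branch " + branch.ShopsName);
        }
    }
}
```

In ReadData the same test would go right after the TempBranch call, before the loop that adds fridges.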
This error is raised because MaxNumberOfFridges is bigger than the length of branches. To simplify, assume MaxNumberOfFridges is 20 but the array length is 10: the loop eventually tries to access the 11th element, which is outside the array.
To fix it:
for (int i = 0; i < branches.Length; i++)
{
if (branches[i].ShopsName == shopsName)
{
return branches[i];
}
}
The other option is to use a foreach loop:
foreach(var b in branches)
{
if (b.ShopsName == shopsName)
{
return b;
}
}
I have created an XmlReader object from a Stream object which was written to earlier by an XmlWriter object.
I know an XmlReader is forward-only, and therefore I want to be able to save the current reading position, so that I can continue reading from the place where I stopped.
Is it possible?
I know it may be tricky, as XmlReader reads memory in chunks, so restoring the current XML element reading point may be a problem.
Please advise only if you know for sure that it will work, from your own experience with this specific issue.
Note :
1. I thought of simply saving the whole XMLReader object reference for that scenario.
2. XMLReader Position = current pointer to reading element not Stream.Position as it is something else.
I work in a project where an external system writes xmls (without a defined namespace) and we need to read them to find nodes with some special values:
When the value is not ready, we read again after a few minutes.
In other case, we process the node (attributes, values, etc.)
So, I think this code can help you:
var input1 = @"<root>
<ta>
<XGLi6id90>774825484.1418393</XGLi6id90>
<VAfrBVB>
<EG60sk>1030847709.7303829</EG60sk>
<XR>NOT_READY</XR>
</VAfrBVB>
</ta>
<DxshpR>1123</DxshpR>
</root>";
var input2 = @"<root>
<ta>
<XGLi6id90>774825484.1418393</XGLi6id90>
<VAfrBVB>
<EG60sk>1030847709.7303829</EG60sk>
<XR>99999999</XR>
</VAfrBVB>
</ta>
<DxshpR>1123</DxshpR>
</root>";
var stream1 = new MemoryStream(Encoding.UTF8.GetBytes(input1));
var stream2 = new MemoryStream(Encoding.UTF8.GetBytes(input2));
stream1.Position = 0;
stream2.Position = 0;
var position1 = DoWork(stream1, new Position());
var position2 = DoWork(stream2, position1);
public static Position DoWork(Stream stream, Position position)
{
using (XmlTextReader xmlTextReader = new XmlTextReader(stream))
{
using (XmlReader xmlReader = XmlReader.Create(xmlTextReader, xmlReaderSettings))
{
// restores the last position
xmlTextReader.SetPosition(position);
System.Diagnostics.Debug.WriteLine(xmlReader.Value); // Second time prints 99999999
while (xmlReader.Value != "NOT_READY" && xmlReader.Read())
{
// a custom logic to process nodes....
}
// saves the position to process later ...
position = xmlTextReader.GetPosition();
System.Diagnostics.Debug.WriteLine(xmlReader.Value); // First time prints NOT_READY
}
}
return position;
}
}
public class Position
{
public int LinePosition { get; set; }
public int LineNumber { get; set; }
}
public static class XmlReaderExtensions
{
public static void SetPosition(this XmlTextReader xmlTextReader, Position position)
{
if (position != null)
{
while (xmlTextReader.LineNumber < position.LineNumber && xmlTextReader.Read())
{
}
while (xmlTextReader.LinePosition < position.LinePosition && xmlTextReader.Read())
{
}
}
}
public static Position GetPosition(this XmlTextReader xmlTextReader)
{
Position output;
if (xmlTextReader.EOF)
{
output = new Position();
}
else
{
output = new Position { LineNumber = xmlTextReader.LineNumber, LinePosition = xmlTextReader.LinePosition };
}
return output;
}
}
Important: obviously, this works only when the structure of the XML (line breaks, nodes, etc.) is always the same. Otherwise it will not work.
I'm trying to display the modules names from the array to the listBox but I'm getting a "NullReferenceException was unhandled" error.
modules.xml
<?xml version="1.0" encoding="utf-8" ?>
<Modules>
<Module>
<MCode>3SFE504</MCode>
<MName>Algorithms and Data Structures</MName>
<MCapacity>5</MCapacity>
<MSemester>1</MSemester>
<MPrerequisite>None</MPrerequisite>
<MLectureSlot>0</MLectureSlot>
<MTutorialSlot>1</MTutorialSlot>
</Module>
</Modules>
form1.cs
Modules[] modules = new Modules[16];
Modules[] pickedModules = new Modules[8];
int modulecounter = 0, moduleDetailCounter = 0;
while (textReader.Read())
{
XmlNodeType nType1 = textReader.NodeType;
if ((nType1 != XmlNodeType.EndElement) && (textReader.Name == "ModuleList"))
{
// ls_modules_list.Items.Add("MODULE");
Modules m = new Modules();
while (textReader2.Read()) //While reader 2 reads the next 7 TEXT items
{
XmlNodeType nType2 = textReader2.NodeType;
if (nType2 == XmlNodeType.Text)
{
if (moduleDetailCounter == 0)
m.MCode = textReader2.Value;
if (moduleDetailCounter == 1)
m.MName = textReader2.Value;
if (moduleDetailCounter == 2)
m.MCapacity = textReader2.Value;
if (moduleDetailCounter == 3)
m.MSemester = textReader2.Value;
if (moduleDetailCounter == 4)
m.MPrerequisite = textReader2.Value;
if (moduleDetailCounter == 5)
m.MLectureSlot = textReader2.Value;
if (moduleDetailCounter == 6)
m.MTutorialSlot = textReader2.Value;
// ls_modules_list.Items.Add(reader2.Value);
moduleDetailCounter++;
}
if (moduleDetailCounter == 7) { moduleDetailCounter = 0; break; }
}
modules[modulecounter] = m;
modulecounter++;
}
}
for (int i = 0; i < modules.Length; i++)
{
ModulesListBox.Items.Add(modules[i].MName); // THE ERROR APPEARS HERE
}
}
I'm getting that error on the line which is marked with // THE ERROR APPEARS HERE.
Either ModulesListBox is null because you're accessing it before it is initialized or the modules array contains empty elements.
Like one of the commenters said, you're probably better off using XmlSerializer to handle loading the XML into the collection of modules. If that's not possible, change modules to a List<Modules> instead.
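To illustrate why a list helps, here is a hedged sketch that loads the modules into a List so there are never unused null slots to trip over. It swaps in LINQ to XML (XDocument) for the reading part, which is not what the question uses, and ModuleInfo and ModuleLoader are made-up names carrying only two of the seven fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical, trimmed-down module type for the illustration.
public class ModuleInfo
{
    public string MCode { get; set; }
    public string MName { get; set; }
}

public static class ModuleLoader
{
    // Builds one list entry per <Module> element; the list is exactly as long
    // as the number of modules found, so iterating it never hits a null entry.
    public static List<ModuleInfo> Load(string xml)
    {
        return XDocument.Parse(xml).Root
            .Elements("Module")
            .Select(m => new ModuleInfo
            {
                MCode = (string)m.Element("MCode"),
                MName = (string)m.Element("MName")
            })
            .ToList();
    }

    public static void Main()
    {
        string xml =
            "<Modules><Module>" +
            "<MCode>3SFE504</MCode>" +
            "<MName>Algorithms and Data Structures</MName>" +
            "</Module></Modules>";

        foreach (ModuleInfo module in Load(xml))
        {
            Console.WriteLine(module.MCode + ": " + module.MName);
        }
    }
}
```

With a list, the display loop becomes a plain foreach over the loaded items, for example adding module.MName to ModulesListBox.Items for each entry.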
You initialize your modules array with a length of 16 and fill it using modulecounter, but the loop uses the array length. Instead, use the modulecounter variable to limit the loop, like this:
for (int i = 0; i < modulecounter; i++)
{
ModulesListBox.Items.Add(modules[i].MName);
}
Every array element from index modulecounter upward is null, which is why you get the error.
the for loop runs from 0 to 16 but modules is only 0 to 15, change modules.length to (modules.length -1)
I'm almost positive the issue is somewhere in your deserialization logic. One could debug it, but why reinvent the wheel?
var serializer = new XmlSerializer(typeof(List<Module>), new XmlRootAttribute("Modules"));
List<Module> modules;
using (var reader = new StreamReader(workingDir + @"\ModuleList.xml"))
{
modules = (List<Module>)serializer.Deserialize(reader);
}
This would give a nice, complete collection of modules, assuming Module is defined as
public class Module
{
public string MCode;
public string MName;
public int MCapacity;
public int MSemester;
public string MPrerequisite;
public int MLectureSlot;
public int MTutorialSlot;
}
If you have no problems with memory (i.e: the file is usually not too large), then I suggest not to use XmlTextReader and using XmlDocument instead:
XmlDocument d = new XmlDocument();
d.Load(@"FileNameAndDirectory");
XmlNodeList list = d.SelectNodes("/Modules/Module/MName");
foreach (XmlNode node in list)
{
// Whatsoever
}
The code above should extract every MName node for you and put them all in list, use it for good :)
an example (that might not be real life, but to make my point) :
public void StreamInfo(StreamReader p)
{
string info = string.Format(
"The supplied streamreader read : {0}\n at line {1}",
p.ReadLine(),
p.GetLinePosition()-1);
}
GetLinePosition here is an imaginary extension method of streamreader.
Is this possible?
Of course I could keep count myself but that's not the question.
I came across this post while looking for a solution to a similar problem where I needed to seek the StreamReader to particular lines. I ended up creating two extension methods to get and set the position on a StreamReader. It doesn't actually provide a line number count, but in practice, I just grab the position before each ReadLine() and if the line is of interest, then I keep the start position for setting later to get back to the line like so:
var index = streamReader.GetPosition();
var line1 = streamReader.ReadLine();
streamReader.SetPosition(index);
var line2 = streamReader.ReadLine();
Assert.AreEqual(line1, line2);
and the important part:
public static class StreamReaderExtensions
{
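// These are private field names from the .NET Framework implementation of StreamReader.
// Other runtimes may name them differently (for example with a leading underscore),
// in which case GetField returns null.
readonly static FieldInfo charPosField = typeof(StreamReader).GetField("charPos", BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.DeclaredOnly);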
readonly static FieldInfo byteLenField = typeof(StreamReader).GetField("byteLen", BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.DeclaredOnly);
readonly static FieldInfo charBufferField = typeof(StreamReader).GetField("charBuffer", BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.DeclaredOnly);
public static long GetPosition(this StreamReader reader)
{
// shift position back from BaseStream.Position by the number of bytes read
// into internal buffer.
int byteLen = (int)byteLenField.GetValue(reader);
var position = reader.BaseStream.Position - byteLen;
// if we have consumed chars from the buffer we need to calculate how many
// bytes they represent in the current encoding and add that to the position.
int charPos = (int)charPosField.GetValue(reader);
if (charPos > 0)
{
var charBuffer = (char[])charBufferField.GetValue(reader);
var encoding = reader.CurrentEncoding;
var bytesConsumed = encoding.GetBytes(charBuffer, 0, charPos).Length;
position += bytesConsumed;
}
return position;
}
public static void SetPosition(this StreamReader reader, long position)
{
reader.DiscardBufferedData();
reader.BaseStream.Seek(position, SeekOrigin.Begin);
}
}
This works quite well for me and, depending on your tolerance for using reflection, I think it is a fairly simple solution.
Caveats:
While I have done some simple testing using various System.Text.Encoding options, pretty much all of the data I consume with this is simple text files (ASCII).
I only ever use the StreamReader.ReadLine() method and while a brief review of the source for StreamReader seems to indicate this will still work when using the other read methods, I have not really tested that scenario.
No, not really possible. The concept of a "line number" is based upon the actual data that's already been read, not just the position. For instance, if you were to Seek() the reader to an arbitrary position, it's not actually going to read that data, so it wouldn't be able to determine the line number.
The only way to do this is to keep track of it yourself.
It is extremely easy to provide a line-counting wrapper for any TextReader:
public class PositioningReader : TextReader {
private TextReader _inner;
public PositioningReader(TextReader inner) {
_inner = inner;
}
public override void Close() {
_inner.Close();
}
public override int Peek() {
return _inner.Peek();
}
public override int Read() {
var c = _inner.Read();
if (c >= 0)
AdvancePosition((Char)c);
return c;
}
private int _linePos = 0;
public int LinePos { get { return _linePos; } }
private int _charPos = 0;
public int CharPos { get { return _charPos; } }
private int _matched = 0;
private void AdvancePosition(Char c) {
if (Environment.NewLine[_matched] == c) {
_matched++;
if (_matched == Environment.NewLine.Length) {
_linePos++;
_charPos = 0;
_matched = 0;
}
}
else {
_matched = 0;
_charPos++;
}
}
}
Drawbacks (for the sake of brevity):
Does not check constructor argument for null
Does not recognize alternate ways to terminate the lines. Will be inconsistent with ReadLine() behavior when reading files separated by raw \r or \n.
Does not override "block"-level methods like Read(char[], int, int), ReadBlock, ReadLine, ReadToEnd. The TextReader implementation works correctly since it routes everything else to Read(); however, better performance could be achieved by overriding those methods so that they route calls to _inner instead of base and pass the characters read to AdvancePosition. See the sample ReadBlock implementation:
public override int ReadBlock(char[] buffer, int index, int count) {
var readCount = _inner.ReadBlock(buffer, index, count);
for (int i = 0; i < readCount; i++)
AdvancePosition(buffer[index + i]);
return readCount;
}
No.
Consider that it's possible to seek to any position using the underlying stream object (which could be at any point in any line).
Now consider what that would do to any count kept by the StreamReader.
Should the StreamReader go and figure out which line it's now on?
Should it just keep a number of lines read, regardless of position within the file?
There are more questions than just these that would make this a nightmare to implement, imho.
Here is a guy that implemented a StreamReader with ReadLine() method that registers file position.
http://www.daniweb.com/forums/thread35078.html
I guess one should inherit from StreamReader, and then add the extra method to the special class along with some properties (_lineLength + _bytesRead):
// Reads a line. A line is defined as a sequence of characters followed by
// a carriage return ('\r'), a line feed ('\n'), or a carriage return
// immediately followed by a line feed. The resulting string does not
// contain the terminating carriage return and/or line feed. The returned
// value is null if the end of the input stream has been reached.
//
/// <include file='doc\myStreamReader.uex' path='docs/doc[@for="myStreamReader.ReadLine"]/*' />
public override String ReadLine()
{
_lineLength = 0;
//if (stream == null)
// __Error.ReaderClosed();
if (charPos == charLen)
{
if (ReadBuffer() == 0) return null;
}
StringBuilder sb = null;
do
{
int i = charPos;
do
{
char ch = charBuffer[i];
int EolChars = 0;
if (ch == '\r' || ch == '\n')
{
EolChars = 1;
String s;
if (sb != null)
{
sb.Append(charBuffer, charPos, i - charPos);
s = sb.ToString();
}
else
{
s = new String(charBuffer, charPos, i - charPos);
}
charPos = i + 1;
if (ch == '\r' && (charPos < charLen || ReadBuffer() > 0))
{
if (charBuffer[charPos] == '\n')
{
charPos++;
EolChars = 2;
}
}
_lineLength = s.Length + EolChars;
_bytesRead = _bytesRead + _lineLength;
return s;
}
i++;
} while (i < charLen);
i = charLen - charPos;
if (sb == null) sb = new StringBuilder(i + 80);
sb.Append(charBuffer, charPos, i);
} while (ReadBuffer() > 0);
string ss = sb.ToString();
_lineLength = ss.Length;
_bytesRead = _bytesRead + _lineLength;
return ss;
}
I think there is a minor bug in the code: the length of the string is used to calculate the file position instead of the actual number of bytes read, so UTF-8 and UTF-16 encoded files are not handled correctly.
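The difference between character count and byte count is easy to see in a small sketch. The sample string is arbitrary and ByteCountDemo is just a name for the illustration; Encoding.GetByteCount is the standard call for measuring how many bytes a string occupies in a given encoding.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    public static void Main()
    {
        // "café": four characters, the last one outside ASCII.
        string line = "caf\u00e9";

        Console.WriteLine(line.Length);                         // 4 characters
        Console.WriteLine(Encoding.UTF8.GetByteCount(line));    // 5 bytes in UTF-8
        Console.WriteLine(Encoding.Unicode.GetByteCount(line)); // 8 bytes in UTF-16
    }
}
```

So a position tracker that adds s.Length per line drifts away from the real file offset as soon as a line contains non-ASCII text. Adding the byte count of the line plus its terminator in the reader's encoding would be one way to correct that, assuming no byte order mark or encoding switch gets in the way.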
I came here looking for something simple. If you're just using ReadLine() and don't care about using Seek() or anything, just make a simple subclass of StreamReader
class CountingReader : StreamReader {
private int _lineNumber = 0;
public int LineNumber { get { return _lineNumber; } }
public CountingReader(Stream stream) : base(stream) { }
public override string ReadLine() {
string line = base.ReadLine();
if (line != null) _lineNumber++; // do not count the null returned at end of stream
return line;
}
}
and then you make it the normal way, say from a FileInfo object named file
CountingReader reader = new CountingReader(file.OpenRead());
and you just read the reader.LineNumber property.
The points already made with respect to the BaseStream are valid and important. However, there are situations in which you want to read a text and know where in the text you are. It can still be useful to write that up as a class to make it easy to reuse.
I tried to write such a class now. It seems to work correctly, but it's rather slow. It should be fine when performance isn't crucial (it isn't that slow, see below).
I use the same logic to track position in the text regardless if you read a char at a time, one buffer at a time, or one line at a time. While I'm sure this can be made to perform rather better by abandoning this, it made it much easier to implement... and, I hope, to follow the code.
I did a very basic performance comparison of the ReadLine method (which I believe is the weakest point of this implementation) to StreamReader, and the difference is almost an order of magnitude. I got 22 MB/s using my class StreamReaderEx, but nearly 9 times as much using StreamReader directly (on my SSD-equipped laptop). While it could be interesting, I don't know how to make a proper reading test; maybe using 2 identical files, each larger than the disk buffer, and reading them alternately..? At least my simple test produces consistent results when I run it several times, and regardless of which class reads the test file first.
The NewLine symbol defaults to Environment.NewLine but can be set to any string of length 1 or 2. The reader considers only this symbol as a newline, which may be a drawback. At least I know Visual Studio has prompted me a fair number of times that a file I open "has inconsistent newlines".
Please note that I haven't included the Guard class; this is a simple utility class and it should be obvious from the context how to replace it. You can even remove it, but you'd lose some argument checking and thus the resulting code would be farther from "correct". For example, Guard.NotNull(s, "s") simply checks that s is not null, throwing an ArgumentNullException (with argument name "s", hence the second parameter) if it is.
Enough babble, here's the code:
public class StreamReaderEx : StreamReader
{
// NewLine characters (magic value -1: "not used").
int newLine1, newLine2;
// The last character read was the first character of the NewLine symbol AND we are using a two-character symbol.
bool insideNewLine;
// StringBuilder used for ReadLine implementation.
StringBuilder lineBuilder = new StringBuilder();
public StreamReaderEx(string path, string newLine = "\r\n") : base(path)
{
init(newLine);
}
public StreamReaderEx(Stream s, string newLine = "\r\n") : base(s)
{
init(newLine);
}
public string NewLine
{
get { return "" + (char)newLine1 + (char)newLine2; }
private set
{
Guard.NotNull(value, "value");
Guard.Range(value.Length, 1, 2, "Only 1 to 2 character NewLine symbols are supported.");
newLine1 = value[0];
newLine2 = (value.Length == 2 ? value[1] : -1);
}
}
public int LineNumber { get; private set; }
public int LinePosition { get; private set; }
public override int Read()
{
int next = base.Read();
trackTextPosition(next);
return next;
}
public override int Read(char[] buffer, int index, int count)
{
int n = base.Read(buffer, index, count);
for (int i = 0; i