System.XML and Encoding trouble - c#

I have an application that creates XML documents based on existing ones, but that's beside the point. Today I noticed that an error occurs if the opened file is ANSI-encoded. Until now I had only worked with UTF-8 files, and the problem never came up. What should I do to fix this, and how?
Fragments of code:
string filepath;
XmlDocument xdoc = new XmlDocument();
XmlElement root;
...............
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
filepath = openFileDialog1.FileName;
textBox1.Text = filepath;
load();
}
...............
public void load()
{
xdoc.Load(filepath);
root = xdoc.DocumentElement;
...............
Error:
An unhandled exception of type 'System.Xml.XmlException' occurred in
System.Xml.dll Additional information: An invalid character for the
specified encoding., Line 35, position 16.
That line contains Cyrillic characters (Russian). If I convert the document to UTF-8 with Notepad++, it loads correctly.

You could use a StreamReader to read the file with the correct encoding and then pass that reader to the XmlDocument.Load overload that accepts a TextReader.
using(var sr = new StreamReader(filepath, myEncoding))
{
xdoc.Load(sr);
}
You can obtain myEncoding via the Encoding.GetEncoding method.
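As a minimal sketch of the approach above, assuming the "ANSI" file uses the Windows-1251 (Cyrillic) code page: the helper name AnsiXmlLoader and the code page are assumptions, not something the question confirms. On .NET Core / .NET 5+ legacy code pages must be registered through CodePagesEncodingProvider before GetEncoding can find them; on .NET Framework that line should be unnecessary.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper: loads an XML file stored in a legacy ("ANSI") code page.
public static class AnsiXmlLoader
{
    public static XmlDocument Load(string path, int codePage)
    {
        // Makes legacy code pages such as 1251 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(codePage);

        var doc = new XmlDocument();
        // The StreamReader decodes the bytes, so the XML parser only sees characters.
        using (var reader = new StreamReader(path, encoding))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

In the question's load() method this would be used as xdoc = AnsiXmlLoader.Load(filepath, 1251); if the files come from a different locale, the code page has to change accordingly.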

Related

XmlException in C#: Data at the root level is invalid

I'm trying to read an XML file in a C# WinRt app when the app resumes:
Windows.Storage.StorageFile File = await Windows.Storage.ApplicationData.Current.TemporaryFolder.GetFileAsync("PreviousSession.xml");
if (File != null)
{
var File2 = await Windows.Storage.ApplicationData.Current.TemporaryFolder.GetFileAsync("PreviousSession.xml");
string Document = File2.ToString();
System.Xml.Linq.XDocument.Parse(Document);
}
But I get a System.Xml.XmlException:
Data at the root level is invalid. Line 1, position 1.
How can I fix this and read the file properly?
My XML document is being constructed like this:
Windows.Data.Xml.Dom.XmlDocument Document = new Windows.Data.Xml.Dom.XmlDocument();
Windows.Data.Xml.Dom.XmlElement Element = (Windows.Data.Xml.Dom.XmlElement)Document.AppendChild(Document.CreateElement("PreviousSessionData"));
...
Windows.Storage.IStorageFile TempFile = await Windows.Storage.ApplicationData.Current.TemporaryFolder.CreateFileAsync("PreviousSession.xml", Windows.Storage.CreationCollisionOption.ReplaceExisting);
await Document.SaveToFileAsync(TempFile);
For a file like this:
<PreviousSessionData>...</PreviousSessionData>
System.Xml.Linq.XDocument.Parse expects an XML string, not an XML File name.
This code is wrong (see comments):
string Document = File2.ToString(); // Returns a description of the File2 object, not the content of the file!
System.Xml.Linq.XDocument.Parse(Document); // Parse error: the string being parsed is not XML!
What you want is to put the content of the file in a string:
string Document = await Windows.Storage.FileIO.ReadTextAsync(File2);
System.Xml.Linq.XDocument.Parse(Document);
Or you can use XDocument.Load, which expects a file path (or a stream) rather than a string of XML.
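The difference between the two calls can be sketched as follows. This assumes plain .NET with System.IO available (not the WinRT storage API from the question), and the class name ParseVersusLoad is hypothetical.

```csharp
using System.IO;
using System.Xml.Linq;

// Hypothetical helper contrasting the two entry points of XDocument.
public static class ParseVersusLoad
{
    // Parse takes the XML text itself, so the file has to be read first.
    public static XDocument FromContent(string path)
    {
        string xml = File.ReadAllText(path);
        return XDocument.Parse(xml);
    }

    // Load takes the location of the XML and reads it on its own.
    public static XDocument FromPath(string path)
    {
        return XDocument.Load(path);
    }
}
```

Both methods return the same document for the same file; passing a file name (or any other non-XML string) to Parse is what produces "Data at the root level is invalid".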

getting "Stream Was Not Writable" while saving an XML document that I read via a stream from a zip file

I have a zip file that contains an XML file. I load this XML file into an XmlDocument through a stream, without extracting the file, and then modify the inner text of some nodes.
The problem is that I get the exception mentioned above when I try to save to the stream. Here's the code (I'm using DotNetZip):
ZipFile zipFile = ZipFile.Read(zipPath); // the path is my desktop
foreach (ZipEntry entry in zipFile)
{
if (entry.FileName == "myXML.xml")
{
//creating the stream and loading the xml doc from the zip file:
Stream stream = zipFile[entry.FileName].OpenReader();
XmlReader xReader = XmlReader.Create(stream);
XmlDocument xDoc = new XmlDocument();
xDoc.Load(xReader);
//changing the inner text of the doc nodes:
xDoc.DocumentElement.SelectSingleNode("Account/Name").InnerText = "VeXe";
xDoc.DocumentElement.SelectSingleNode("Account/Money").InnerText = "Million$";
xDoc.Save(stream); // here's where I got the exception.
break;
}
}
I'm not a pro coder, but I noticed that instead of xDoc.Save(stream) the method can also take an XmlWriter as a parameter, so I tried creating an XmlWriter immediately after instantiating the XmlReader and calling xDoc.Save(XmlWriter).
I got an exception saying something like: "Cannot Write After Reading"
How can I successfully save the xDoc?
ADDED:
I had the idea of saving the XML file somewhere else, such as a temp folder, then adding that saved file to the zip (overwriting the old one) and deleting the temporary file.
But that's not what I want. I want to deal directly with the zip file, in and out, with nothing in between.
You're attempting to write to the same Stream you opened for reading. You cannot do that.
Perhaps try something like this:
ZipFile zipFile = ZipFile.Read(zipPath); // the path is my desktop
foreach (ZipEntry entry in zipFile)
{
if (entry.FileName == "myXML.xml")
{
//declared outside the using block so it is still in scope below:
XmlDocument xDoc = new XmlDocument();
//creating the stream and loading the xml doc from the zip file:
using (Stream stream = zipFile[entry.FileName].OpenReader()) {
XmlReader xReader = XmlReader.Create(stream);
xDoc.Load(xReader);
}
//changing the inner text of the doc nodes:
xDoc.DocumentElement.SelectSingleNode("Account/Name").InnerText = "VeXe";
xDoc.DocumentElement.SelectSingleNode("Account/Money").InnerText = "Million$";
using (StreamWriter streamWriter = new StreamWriter(pathToSaveTo)) {
xDoc.Save(streamWriter);
break;
}
}
}
A quick look at the docs leads me to believe that you should do something like this:
using(ZipFile zipFile = ZipFile.Read(zipPath))
{
    foreach (ZipEntry entry in zipFile)
    {
        if (entry.FileName == "myXML.xml")
        {
            XmlDocument xDoc = new XmlDocument();
            //creating the stream and loading the xml doc from the zip file:
            using(Stream stream = zipFile[entry.FileName].OpenReader())
            using(XmlReader xReader = XmlReader.Create(stream))
            {
                xDoc.Load(xReader);
            }
            //changing the inner text of the doc nodes:
            xDoc.DocumentElement.SelectSingleNode("Account/Name").InnerText = "VeXe";
            xDoc.DocumentElement.SelectSingleNode("Account/Money").InnerText = "Million$";
            using(var ms=new MemoryStream())
            {
                xDoc.Save(ms);
                //pass a byte array so nothing depends on ms still being open when Save() runs:
                zipFile.UpdateEntry(entry.FileName, ms.ToArray());
            }
            break;
        }
    }
    //write the modified archive back to disk:
    zipFile.Save();
}
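For comparison, here is a sketch of the same read, modify, write-back cycle using the framework's own System.IO.Compression types instead of DotNetZip. This is a swapped-in library, not what the answers above use, and the class name ZipXmlEditor is hypothetical. Note that System.IO.Compression.ZipFile clashes by name with DotNetZip's ZipFile if both namespaces are imported.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

// Hypothetical helper: edits one XML entry inside a zip archive in place.
public static class ZipXmlEditor
{
    public static void EditEntry(string zipPath, string entryName, Action<XmlDocument> edit)
    {
        // Update mode allows both reading and writing entries of an existing archive.
        using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Update))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            if (entry == null)
                throw new FileNotFoundException("Entry not found in archive.", entryName);

            // Read the entry and close its stream before writing anything.
            var doc = new XmlDocument();
            using (Stream input = entry.Open())
            {
                doc.Load(input);
            }

            edit(doc);

            // Replace the old entry with a fresh one holding the modified document.
            entry.Delete();
            ZipArchiveEntry replacement = archive.CreateEntry(entryName);
            using (Stream output = replacement.Open())
            {
                doc.Save(output);
            }
        }
    }
}
```

A call such as ZipXmlEditor.EditEntry(zipPath, "myXML.xml", d => d.DocumentElement.SelectSingleNode("Account/Name").InnerText = "VeXe"); keeps everything inside the archive, with no temporary file on disk.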

Error: The XML declaration must be the first node in the document

I am getting the error "Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it" while trying to load an XML file. Both my C# code and the contents of the XML file are given below. The XML declaration is on line 6 of the XML file, hence the error.
I cannot control what is in the XML file, so how can I edit or rewrite it in C# so that the XML declaration comes first, followed by the comments, and the file loads without any error?
//xmlFilepath is the path/name of the xml file passed to this function
static void function(string xmlFilepath)
{
XmlReaderSettings readerSettings = new XmlReaderSettings();
readerSettings.IgnoreComments = true;
readerSettings.IgnoreWhitespace = true;
XmlReader reader = XmlReader.Create(xmlFilepath, readerSettings);
XmlDocument xml = new XmlDocument();
xml.Load(reader);
}
XmlDoc.xml
<!-- Customer ID: 1 -->
<!-- Import file: XmlDoc.xml -->
<!-- Start time: 8/14/12 3:15 AM -->
<!-- End time: 8/14/12 3:18 AM -->
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
-----
As the error states, if an XML document has an XML declaration, the document must begin with it: the first five characters must be <?xml, with nothing before them, not even white space. The comments above the declaration are therefore illegal where they are. Comments are allowed after the declaration, either before the root element or inside it.
EDIT: Something like this should be able to rearrange the lines, given the file format from the OP:
// the OP's file declares ISO-8859-1, so read and write it with that encoding
var latin1 = Encoding.GetEncoding("ISO-8859-1");
var lines = new List<string>();
using (var fileStream = File.Open(xmlFilePath, FileMode.Open, FileAccess.Read))
using (var reader = new StreamReader(fileStream, latin1))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        lines.Add(line);
}
// move the declaration to the top; do nothing if it is missing or already first
var i = lines.FindIndex(s => s.StartsWith("<?xml"));
if (i > 0)
{
    var xmlLine = lines[i];
    lines.RemoveAt(i);
    lines.Insert(0, xmlLine);
}
using (var fileStream = File.Open(xmlFilePath, FileMode.Truncate, FileAccess.Write))
using (var writer = new StreamWriter(fileStream, latin1))
{
    foreach (var line in lines)
        writer.WriteLine(line);
}
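If rewriting the third-party file on disk is undesirable, an alternative is to read the text and start parsing at the declaration. The sketch below assumes that everything before <?xml can be discarded (so the leading comments are dropped rather than moved) and that the caller knows the file's encoding; the class name LeadingCommentSkipper is hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper: loads a file whose XML declaration is preceded by other content.
public static class LeadingCommentSkipper
{
    public static XmlDocument Load(string path, Encoding encoding)
    {
        string text = File.ReadAllText(path, encoding);

        // Drop whatever precedes the XML declaration, if there is one.
        int start = text.IndexOf("<?xml", StringComparison.Ordinal);
        if (start > 0)
        {
            text = text.Substring(start);
        }

        var doc = new XmlDocument();
        doc.LoadXml(text);
        return doc;
    }
}
```

For the file shown in the question this would be called as LeadingCommentSkipper.Load(xmlFilepath, Encoding.GetEncoding("ISO-8859-1")). A comment that itself contains the text <?xml would confuse this simple search.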
That is not valid XML.
As the error clearly states, the XML declaration (<?xml ... ?>) must come first.
I'm using the following function to remove whitespace from xml:
public static void DoRemovespace(string strFile)
{
string str = System.IO.File.ReadAllText(strFile);
str = str.Replace("\n", "");
str = str.Replace("\r", "");
Regex regex = new Regex(@">\s*<");
string cleanedXml = regex.Replace(str, "><");
System.IO.File.WriteAllText(strFile, cleanedXml);
}
Don't put any comments in the beginning of your file!

Invalid character in the given encoding

XmlDocument oXmlDoc = new XmlDocument();
try
{
oXmlDoc.Load(filePath);
}
catch (Exception ex)
{
// Log Error Here
try
{
Encoding enc = Encoding.GetEncoding("iso-8859-1");
StreamReader sr = new StreamReader(filePath, enc);
String response = sr.ReadToEnd();
oXmlDoc.LoadXml(response);
}
catch (Exception innerException)
{
// Log Error Here
return false;
}
}
I receive an XML file from a third party that references a Document Type Definition (DTD) file right after the XML declaration.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SoccerMatchPlus SYSTEM "SoccerMatchPlus.dtd">
<SoccerMatchPlus matchid="33226">
<Booking id="13642055" time="47">
<Player id="370927">
<Name firstName="Lasse" initials="L" lastName="Nielsen">L Nielsen</Name>
</Player>
<Team id="26415" name="AæB" homeOrAway="Home"/>
</Booking>
</SoccerMatchPlus>
If I parse the file with Load, I get the error: Invalid character in the given encoding. Line 102, position 56. If I catch the exception and retry parsing, the file parses, but then I get another error: Could not find file 'C:\Windows\system32\SoccerMatchPlus.dtd'.
The reference to the DTD file SoccerMatchPlus.dtd is added before the root element by the third party.
With the Load method, the parser looks for the DTD in the location where the XML file resides.
I have put SoccerMatchPlus.dtd in a different location from the XML file. Can I load SoccerMatchPlus.dtd from a specified location at runtime? Or can you tell me a better way to load an XML file that contains data with invalid characters?
Set the XmlResolver property of the XmlDocument to null so that the external DTD is not resolved.
XmlDocument oXmlDoc = new XmlDocument();
oXmlDoc.XmlResolver = null;
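Both parts of the question can be handled in a single load, sketched below: the resolver is cleared before loading, and the file is decoded with an explicit encoding so no retry is needed. The class name DtdFreeLoader is hypothetical, and the choice of iso-8859-1 in the usage note is taken from the question's fallback code, not verified against the real feed.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper: loads XML with a known encoding and without fetching an external DTD.
public static class DtdFreeLoader
{
    public static XmlDocument Load(string path, Encoding encoding)
    {
        var doc = new XmlDocument();
        // Must be set before Load so the DOCTYPE's SYSTEM file is never looked up.
        doc.XmlResolver = null;

        // The reader decodes the bytes; the encoding named in the XML declaration is not used.
        using (var reader = new StreamReader(path, encoding))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

Usage would look like DtdFreeLoader.Load(filePath, Encoding.GetEncoding("iso-8859-1")). Skipping the DTD also means that any entities it defines are unavailable, which matters only if the document actually uses them.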

Extracting text from a PDF file

I'm using PDFBox for a C# .NET project, and I'm getting a "TypeInitializationException" (The type initializer for 'java.lang.Throwable' threw an exception.) when executing the following block of code:
FileStream stream = new FileStream(@"C:\1.pdf",FileMode.Open);
//retrieve the pdf bytes from the stream.
byte[] pdfbytes=new byte[65000];
stream.Read(pdfbytes, 0, 65000);
//get the pdf file bytes.
allbytes = pdfbytes;
//create a stream from the file bytes.
java.io.InputStream ins = new java.io.ByteArrayInputStream(allbytes);
string txt;
//load the doc
PDDocument doc = PDDocument.load(ins);
PDFTextStripper stripper = new PDFTextStripper();
//retrieve the pdf doc's text
txt = stripper.getText(doc);
doc.close();
The exception occurs at this statement:
PDDocument doc = PDDocument.load(ins);
What can I do to solve this ?
This is the stack trace :
at java.lang.Throwable.__<map>(Exception , Boolean )
at org.pdfbox.pdfparser.PDFParser.parse()
at org.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
at org.pdfbox.pdmodel.PDDocument.load(InputStream input)
at At.At.ExtractTextFromPDF(InputStream fileStream) in
C:\Users\Administrator\Documents\Visual Studio 2008\Projects\AtProject\Att\At.cs:line 61
Inner Exception of the InnerException :
InnerException {"Could not load file or assembly 'IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.":"IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58"} System.Exception {System.IO.FileNotFoundException}
OK, I solved the previous problem by copying some of the PDFBox .dll files to my bin folder, but now I'm getting this error: expected='/' actual='.'--1 org.pdfbox.io.PushBackInputStream@283d742
Are there any alternatives to PDFBox? Is there any other reliable library I can use to extract text from PDF files?
It looks like you are missing some libraries for PDFBox. You need:
IKVM.GNU.Classpath.dll
PDFBox-X.X.X.dll
FontBox-X.X.X-dev.dll
IKVM.Runtime.dll
See the topic "Read from a PDF file using C#". A similar problem is discussed in its comments.
I found the versions of the DLL files were the culprits.
Go to http://www.netlikon.de/docs/PDFBox-0.7.2/bin/?C=M;O=A and download the following files:
IKVM.Runtime.dll
IKVM.GNU.Classpath.dll
PDFBox-0.7.2.dll
Then copy them into the root of your Visual Studio project. Right click the project and add reference, find all 3 and add them.
Finally here's the code I used to parse the PDF into Text
C#
private static string TransformPdfToText(string SourceFile)
{
    string content = "";
    PDFTextStripper stripper = new PDFTextStripper();
    // load the document once and close it exactly once, in the finally block
    PDDocument doc = PDDocument.load(SourceFile);
    try
    {
        content = stripper.getText(doc);
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
    finally
    {
        doc.close();
    }
    return content;
}
Visual Basic
Private Function parseUsingPDFBox(ByVal filename As String) As String
    LogFile(" Attempting to parse file: " & filename)
    Dim stripper As PDFTextStripper = New PDFTextStripper()
    ' load the document once and close it exactly once, in the Finally block
    Dim doc As PDDocument = PDDocument.load(filename)
    Dim content As String = "empty"
    Try
        content = stripper.getText(doc)
    Catch ex As Exception
        LogFile(" Error parsing file: " & filename & vbCrLf & ex.Message)
    Finally
        doc.close()
    End Try
    Return content
End Function
I had a similar problem, not with C# but with Visual Basic/Visual Studio. The missing DLL was "commons-logging.dll"; after adding this DLL to the bin directory, everything worked fine.
