On my website, an admin uploads a .docx file. I convert the file into XML using the OpenXmlPowerTools API.
The issue is that the document has some bullets in it:
• This is my bullet 1 in the document.
• This is my bullet 2 in the document.
XElement html = OpenXmlPowerTools.HtmlConverter.ConvertToHtml(wDoc, settings);
var htmlString = html.ToString();
File.WriteAllText(destFileName.FullName, htmlString, Encoding.UTF8);
Now when I open the XML file, it renders the bullets as shown below:
I need to read each node of the XML, save it in the database, and reconstruct the HTML from the nodes.
Please don't ask me why, as I am not the boss of the system.
How do I get the bullets to render correctly in the XML so that I can save the right HTML in the database?
I fixed the same issue for my own requirement, and it has worked without problems so far.
In a case like this you will always need a workaround: copy the offending character, compare it against the strings you read, and if it is found, replace it with the equivalent HTML entity. In your case that is the bullet entity "&bull;" or "&#8226;".
The code should look like this:
listItem == "Compare with your copied character like one in your pic" ? "&bull;" : listItem
You can find more equivalent characters at this link:
http://www.zytrax.com/tech/web/entities.html
I don't think XML can read those bullets as they are, so I'd advise handling them programmatically. Debug the conversion to see which character the square actually represents, then add an if statement that finds it and replaces it with a code of your own choosing. When you read the data back, convert that code to a bullet wherever it is found.
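The find-and-replace approach both answers describe can be sketched as below. This is only an illustration: the class and method names are made up, and the code point U+F0B7 used in the usage note is an assumption (check in the debugger which character your converted output really contains).

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of OpenXmlPowerTools.
public static class BulletFixer
{
    // Replaces every occurrence of badChar in the text nodes under root.
    public static void ReplaceInTextNodes(XElement root, char badChar, string replacement)
    {
        // ToList() snapshots the nodes so the tree can be modified safely.
        foreach (XText text in root.DescendantNodes().OfType<XText>().ToList())
        {
            if (text.Value.IndexOf(badChar) >= 0)
            {
                text.Value = text.Value.Replace(badChar.ToString(), replacement);
            }
        }
    }
}
```

It could be called on the converted element before writing it out, for example `BulletFixer.ReplaceInTextNodes(html, '\uF0B7', "\u2022");`, where `'\uF0B7'` stands in for whatever character you find.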
I am using a BBCode-to-HTML parser, and I run into a problem when the input contains tags that are not in my parser's tag set.
For example:
My parser permits just the [b], [center] and [i] tags.
If I try to parse [u] or [color={anyColor}] tags, it throws an exception.
I would like to remove any other tag that is not permitted.
At first I thought about not allowing such tags in my textarea, but when I use Ctrl+C/Ctrl+V to fill the textarea, those tags come along, and I only notice once the data is already in my database.
What I have in mind:
The user enters a string with wrong tags
I call some method to remove the tags that are not permitted (this is my problem)
I save the data in my database
Can anyone help me with this, or suggest something else?
After taking a quick look at the parser source at the link you provided, it seems that if the parser runs into a tag it does not know (meaning one not in the list of tags provided during instantiation), it errors out in some manner.
As it stands, it looks like you have a few options:
Change your ErrorMode to ErrorFree.
This will no longer produce any exceptions and will instead treat unknown tags as text.
Go with your original idea and restrict the input on the front end.
If you can, instead of going straight to HTML, add all possible tags to the parser, check whether you can get a C# object out of the parser, and eliminate the unwanted tags before outputting HTML.
Or, at the other end, after the HTML is produced, prohibit the use of the generated HTML tags you do not want.
Send the authors of the parser an email or (if you know German) open a ticket/issue on CodePlex and ask them to add support for stripping unwanted tags.
Or, since you have the source, add the functionality to strip unwanted tags yourself.
This shouldn't be too hard, I think: follow the pattern they use for the current Tags list in BBCodeParser.cs, make a TagsToIgnore list, and add a check before the rest of the tag parsing that strips the tag and continues on to the next token.
EDIT:
You may be able to make the parser interpret the unwanted tags so that they display nothing, at the point where you initialize the BBCodeParser:
var parser = new BBCodeParser(new[]
{
// keep these tags
new BBTag("b", "<b>", "</b>"),
new BBTag("i", "<span style=\"font-style:italic;\">", "</span>"),
new BBTag("u", "<span style=\"text-decoration:underline;\">", "</span>"),
// remove these (or at least their markup)
new BBTag("code", "", ""),
new BBTag("img", "", ""),
});
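The "remove tags that are not permitted before saving" step from the question could also be done independently of the parser. The following is a minimal sketch under assumptions: the class name and the whitelist are illustrative, and the regular expression only covers simple tags of the form [tag], [/tag] and [tag=value].

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical pre-processing helper; not part of the BBCode parser library.
public static class BbCodeFilter
{
    // Tags the question says the parser permits.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "center", "i" };

    // Matches [tag], [/tag] and [tag=value]; group 1 captures the tag name.
    private static readonly Regex TagPattern =
        new Regex(@"\[/?([A-Za-z]+)(?:=[^\]]*)?\]");

    // Removes the markup of every tag that is not whitelisted, keeping the text.
    public static string StripDisallowedTags(string input)
    {
        return TagPattern.Replace(
            input,
            m => Allowed.Contains(m.Groups[1].Value) ? m.Value : string.Empty);
    }
}
```

Calling `BbCodeFilter.StripDisallowedTags(text)` before handing the text to the parser (and before saving it) keeps the inner text of the removed tags and drops only their markup.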
I am inserting text into an Open XML document. The text I retrieve and insert contains HTML formatting, e.g. <p>some text</p> <p>More text</p>, so the tags show up inside Word as literal text. Can text with HTML be converted into something Open XML documents will understand?
New answer:
There is actually a project on CodePlex that does exactly what you are looking for.
See the project here:
Html to OpenXml on codeplex
However, if the formatting (headings, paragraphs, etc.) is not important, you can just strip the HTML tags entirely.
Here is a tutorial on how to do that:
C# Remove HTML Tags
Old answer (the OP worded the question a bit oddly, and I misunderstood):
What you need to do is encode your HTML somehow; you could use base64 or whatever floats your boat. "Simple" HTML encoding would probably be the best course of action here.
This way the HTML will not break your XML.
ASP.NET has support for this, but you can do it in any application by importing the required namespace.
Here's an example.
HtmlEncode from Class Library
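As a small illustration of the HTML-encoding idea in the old answer, here is a sketch that uses System.Net.WebUtility, which is available outside ASP.NET. Choosing WebUtility is an assumption on my part; the linked example may use a different class, and the wrapper class name below is made up.

```csharp
using System.Net;

// Hypothetical wrapper, shown only to make the call explicit.
public static class HtmlText
{
    // Turns markup characters such as < and > into entities so the
    // string can sit inside XML as plain text.
    public static string Encode(string html)
    {
        return WebUtility.HtmlEncode(html);
    }
}
```

For the input `<p>some text</p>` this yields `&lt;p&gt;some text&lt;/p&gt;`. Note that the encoded string will still appear as literal text in Word; it only keeps the markup from breaking the XML.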
I am using DataSet.ReadXml() to read an XML string. I get an error because the XML string contains the invalid character 0x1F, which is 'US' (unit separator). It appears inside otherwise well-formed tags.
The data is extracted from an Oracle DB using a Perl script. What would be the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
The US separator sits between the C and the h in the bold part; when pasted here it shows up as a space. How can I ignore it in an XML string?
If you look at section 2.2 of the XML recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters serve in the middle of a (presumably) human-readable data field? Why isn't the Perl script that produces this string failing when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
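The preprocessing step suggested above (discarding characters that are not legal in well-formed XML) could look roughly like this. It is a sketch, not the answerer's code: the class and method names are invented, and the ranges follow the Char production in section 2.2 of the XML 1.0 recommendation.

```csharp
using System.Text;

// Hypothetical helper for cleaning a string before DataSet.ReadXml.
public static class XmlSanitizer
{
    // Returns input with every character that XML 1.0 does not allow removed.
    public static string StripInvalidXmlChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A valid surrogate pair encodes #x10000-#x10FFFF, which is allowed.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (c == '\t' || c == '\n' || c == '\r'
                     || (c >= 0x20 && c <= 0xD7FF)
                     || (c >= 0xE000 && c <= 0xFFFD))
            {
                sb.Append(c);
            }
            // Anything else (such as 0x1F) is dropped.
        }
        return sb.ToString();
    }
}
```

The cleaned string can then be wrapped in a StringReader and passed to ReadXml. This silently drops data, so it treats the symptom only; the larger question of why the character is there still stands.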
Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);
I am writing a Word plugin to read all text in a document and saving it to a text file.
The generated text file will be used by another application of mine, so I need to mark the end of every page's text with a '\f' character. My current logic merely saves the file through Word as a plain text file, using
object format = WdSaveFormat.WdFormatText;
...
Application.ActiveDocument.SaveAs( ..., ref format, ... );
The best method I found to insert a break was using ActiveDocument.Selection.InsertBreak().
Is there some way to determine the positions of page breaks in the original Word document so that I know where to insert the '\f' character?
This is one of the hard ways to do it:
Use ComputeStatistics() to get the number of lines.
Use GoTo to go to the last line in the document and insert a hard EOF.
For example:
Selection.GoTo(wdGoToLine,wdGoToAbsolute,4)
The only thing I can think of right now is to save the document as HTML, which will give you a tag for every paragraph. Then you can take the beginning of each paragraph's text and use it to find the starting position of each paragraph in the original document.
Also, you can do a Selection.Find and search for "^p" which is a paragraph mark.
I am trying to use LINQ to XML to save and retrieve some HTML between an XML file and a Windows Forms application. When the HTML is saved to the XML file, the tags get XML-encoded instead of being saved as straight HTML.
Example HTML:
<P><FONT color=#004080><U>Sample HTML</U></FONT></P>
Saved in XML File:
&lt;P&gt;&lt;FONT color=#004080&gt;&lt;U&gt;Sample HTML&lt;/U&gt;&lt;/FONT&gt;&lt;/P&gt;
When I manually edit the XML file and put in the desired HTML, the LINQ query pulls in the HTML and displays it properly.
Here is the code that saves the HTML to the XML file:
XElement currentReport = (from item in callReports.Descendants("callReport")
where (int)item.Element("localId") == myCallreports.LocalId
select item).FirstOrDefault();
currentReport.Element("studio").Value = myCallreports.Studio;
currentReport.Element("visitDate").Value = myCallreports.Visitdate.ToShortDateString();
// *** The next two XElements store the HTML
currentReport.Element("recomendations").Value = myCallreports.Comments;
currentReport.Element("reactions").Value = myCallreports.Ownerreaction;
I assume this is happening because of XML encoding, but I am not sure how to deal with it. This question gave me some clues, but no answer (for me, at least).
Thanks for the help,
Oran
Setting the Value property automatically encodes the HTML string. The following should do the trick, but you'll need to make sure that your HTML is valid XML (XHTML).
currentReport.Element("recomendations").ReplaceNodes(XElement.Parse(myCallreports.Comments));
Edit: You may need to wrap the user-entered HTML in <div> </div> tags. XElement.Parse expects a string with a single root element, that is, at least a start and an end XML tag. So, this might work better:
currentReport.Element("recomendations").ReplaceNodes(XElement.Parse("<div>" + myCallreports.Comments + "</div>"));
Then you just have to make sure that tags like <br> are being sent in as <br />.
Edit 2: The other option would be to use an XML CDATA section, which wraps the HTML in <![CDATA[ and ]]>. XElement.Parse cannot parse a bare CDATA section, so create an XCData node instead. I've never actually used this, and I'm not sure how it affects reading the XML.
currentReport.Element("recomendations").ReplaceNodes(new XCData(myCallreports.Comments));
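A minimal sketch of the CDATA option, assuming the XCData class from System.Xml.Linq. The class and method names are illustrative only; the element name is borrowed from the question. It shows that the HTML is stored unescaped and comes back unchanged through Value.

```csharp
using System.Xml.Linq;

// Hypothetical round-trip helper for storing raw HTML in an element.
public static class HtmlInXml
{
    // Builds an element whose only child is a CDATA section holding the HTML.
    public static XElement Store(string html)
    {
        var element = new XElement("recomendations");
        element.ReplaceNodes(new XCData(html));
        return element;
    }

    // Value concatenates text and CDATA nodes, so this returns the raw HTML.
    public static string Load(XElement element)
    {
        return element.Value;
    }
}
```

Because the content sits inside CDATA, the HTML does not have to be valid XML, so unquoted attributes such as color=#004080 are fine. The one restriction is that the HTML itself must not contain the sequence ]]>.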
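Try using currentReport.Element("studio").InnerXml instead of currentReport.Element("studio").Value. Note that InnerXml belongs to the older System.Xml classes (XmlNode/XmlElement); XElement does not expose it, so this only applies if you load the file with XmlDocument.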