I have a contenteditable div in which the user enters data. When they enter a line break, each browser stores the data differently. When I export this data to Word using HtmlToOpenXml, it adds a blank line to the content, and I want to avoid that so the HTML page and the Word document look the same.
One option is to replace the tags <br>, <div> and <p> with an empty string and then replace </div> and </p> with <br/> in the C# code using RegEx. But I do not know all the formatting that different browsers use for a contenteditable div, so this implementation may not help.
I would like to know the best way to address this. Is there an open source tool or DLL that helps with this issue?
For example, the actual contents of the contenteditable div look like this in each browser:
Chrome -
line1<div>line2</div><div>line3</div>
IE Edge-
<div>line1</div><div>line22</div><div>line3<br></div>
FireFox - I read it uses <p> </p> instead of <div> </div>
Safari - ????
A Solution I found:
You could use RegEx, which I highly recommend in C# for parsing information.
Based on the formatting, you could then narrow down which browser produced the markup and go on to parse its output and work out what it means universally. This will not be easy, but nothing cross-platform ever truly is. I would give an example of how this could be done, but RegEx in all honesty takes a good amount of work, and it would take quite a bit of code to show how to parse the markup and find out what the browser is.
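As a rough illustration of the RegEx route, here is a minimal sketch that normalises the two shapes shown in the question (Chrome and Edge) to lines separated by <br/>, without detecting the browser at all. The class and method names are made up for this example, and it assumes flat, un-nested <div>/<p> blocks like the ones above; other browsers may well produce markup it does not handle.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakNormalizer
{
    // Hypothetical helper: turns browser-specific line markup into "a<br/>b<br/>c".
    public static string Normalize(string html)
    {
        const RegexOptions Opts = RegexOptions.IgnoreCase;

        // A <br> right before a closing block tag is only a placeholder
        // (for example Edge's "line3<br></div>"), so drop it.
        string s = Regex.Replace(html, @"<br\s*/?>(?=\s*</(?:div|p)>)", "", Opts);

        // A closing block tag followed by an opening one is a single line boundary.
        s = Regex.Replace(s, @"</(?:div|p)>\s*<(?:div|p)\b[^>]*>", "<br/>", Opts);

        // Any remaining opening or closing block tag also marks a boundary.
        s = Regex.Replace(s, @"</?(?:div|p)\b[^>]*>", "<br/>", Opts);

        // Normalise the spelling of explicit breaks (<br>, <br />, <BR>).
        s = Regex.Replace(s, @"<br\s*/?>", "<br/>", Opts);

        // Drop boundaries left over at the very start or end.
        return Regex.Replace(s, @"^(?:<br/>)+|(?:<br/>)+$", "");
    }
}
```

With this sketch, both line1<div>line2</div><div>line3</div> and <div>line1</div><div>line2</div><div>line3<br></div> should come out as line1<br/>line2<br/>line3, which can then be passed to the Word export.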
Here's a simulation of the HTML I am trying to use my XPath on:
<div class="stream-links">
<div>
<a ...>value I need</a>
</div>
<div>
<a ...>value I need</a>
</div>
<div>
<a ...>value I need</a>
</div>
</div>
Now, when I use the XPath pattern //div[@class='stream-links']/div/a in my browser it selects the <a ...> node. Every time I press Enter it selects the next one, but when I use the pattern //div[@class='stream-links']/div/a/text() it gets stuck on the text of the first <a ...> node, so when I press Enter it does not move to the next. (I'm using the Firebug plugin on Firefox to inspect elements, by the way.)
I'm coding a program in C# and the number of divs under the parent div varies, so I can't use //div[@class='stream-links']/div[number here]/a/text() because I need to get all of them.
My code for using the XPath is HtmlNodeCollection NODECOL1 = MEDOC.DocumentNode.SelectNodes("//div[@class='stream-links']/div/a[1]");
So my questions are:
1) Is there a particular reason Firebug doesn't jump to the next <a...> or is it a 'bug' on the plugin's side?
2) Will my code work nevertheless or do I need to approach it in another way?
There're a few things not right with the rest of my code so I can't see if that part of my code actually works or not, wouldn't ask question 2 if I could test it myself right now.
For your HTML, this XPath selects three a elements:
//div[@class='stream-links']/div/a
This XPath selects three text nodes:
//div[@class='stream-links']/div/a/text()
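This XPath selects the first a child of each inner div. With your HTML that is again three a elements, because each inner div contains exactly one:
//div[@class='stream-links']/div/a[1]
To select only the first match overall, parenthesise the path before applying the predicate: (//div[@class='stream-links']/div/a)[1]
My code for using the Xpath is HtmlNodeCollection NODECOL1 =
MEDOC.DocumentNode.SelectNodes("//div[@class='stream-links']/div/a[1]");
1) Is there a particular reason Firebug doesn't jump to the next <a...>
or is it a 'bug' on the plugins side?
The text() XPath matches three text nodes, so the expression itself does not explain Firebug stopping at the first one; that looks like behaviour of the tool rather than of XPath.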
2) Will my code work nevertheless or do I need to approach it in
another way?
There's a few things not right with the rest of my code so I can't see
if that part of my code actually works or not, wouldn't ask question 2
if I could test it myself right now.
That's not a reasonable question to ask given what you've shown us. Perhaps knowing what the above XPaths return will help you answer it for yourself.
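To check what these XPaths return without Firebug, here is a small sketch. It deliberately uses System.Xml.Linq from the base class library rather than HtmlAgilityPack, so it only works on well-formed markup; the sample document, the href values and the helper name are invented for illustration. HtmlAgilityPack's SelectNodes evaluates XPath 1.0 as well, so it should behave the same way, but that is worth confirming against your real page.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class StreamLinks
{
    // Hypothetical, well-formed stand-in for the page; the href values are made up.
    public const string Sample = @"<html><body><div class='stream-links'>
<div><a href='#1'>value 1</a></div>
<div><a href='#2'>value 2</a></div>
<div><a href='#3'>value 3</a></div>
</div></body></html>";

    // Returns the trimmed text of every element matched by the XPath.
    public static string[] LinkTexts(string xml, string xpath)
    {
        return XDocument.Parse(xml)
            .XPathSelectElements(xpath)
            .Select(a => a.Value.Trim())
            .ToArray();
    }
}
```

Calling StreamLinks.LinkTexts(StreamLinks.Sample, "//div[@class='stream-links']/div/a") returns all three values regardless of how many inner divs there are, which is what the question needs; the same path with a[1] also returns three here, and only the parenthesised form (…/a)[1] narrows it to one.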
I have some HTML code that I need to show on an HTML page, so it must not be interpreted as HTML.
Also, I'd like to preserve spaces, empty lines and so on.
I'm on C#/.NET 3.5: what can I use?
Just use HtmlEncode.
Encodes a string to be displayed in a browser.
And documented in the overloads:
HTML encoding makes sure that text is displayed correctly in the browser and not interpreted by the browser as HTML. For example, if a text string contains a less than sign (<) or greater than sign (>), the browser would interpret these characters as the opening or closing bracket of an HTML tag. When the characters are HTML encoded, they are converted to the strings &lt; and &gt;, which causes the browser to display the less than sign and greater than sign correctly.
It is not clear for what purpose you want to display this, but you may want to pretty print it before HTML encoding (the HTML Agility Pack may do this, not sure), and to show it in a fixed-width font you can enclose it in a <pre> element.
Since you're not actually saying which technology within .Net you are using to render your Html page (Asp.Net WebForms or MVC or whatever) the answer falls back to how you would do it in HTML, regardless of your server technology. After that, how you actually achieve this output is entirely up to you.
Render it in a <pre> block:
<pre>
&lt;p&gt;hello world!&lt;/p&gt;
</pre>
Here the text will appear as <p>hello world!</p> and, by default, will be shown in a fixed-width font with all whitespace retained.
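Putting the two answers together, a minimal sketch of encoding the markup and wrapping it in <pre> could look like this. The class and method names are made up. Note one assumption: the sketch uses System.Net.WebUtility.HtmlEncode, which needs .NET 4.0 or later; on .NET 3.5 the equivalent call would be HttpUtility.HtmlEncode from System.Web.

```csharp
using System;
using System.Net;

public static class HtmlAsText
{
    // Hypothetical helper: makes raw markup safe to display and keeps its whitespace.
    public static string ToPreBlock(string rawHtml)
    {
        // Encode first so the browser shows the tags instead of interpreting them,
        // then wrap in <pre> so spaces and line breaks are preserved.
        return "<pre>" + WebUtility.HtmlEncode(rawHtml) + "</pre>";
    }
}
```

For the input <p>hello world!</p> this produces the same markup as the hand-written <pre> example above.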
I am using version 1.4 of the HtmlAgilityPack and as I understand it, the MixedCodeDocument and related classes are there to help you parse asp.net markup as found in aspx and ascx files. I've found zero documentation or examples for the MixedCodeDocument class. From what I've tried, it seems that the MixedCodeDocument breaks a file's text into chunks separating asp.net fragments from non-asp.net fragments. For example, the following snippet:
<asp:Label ID="lbl_xyz" runat="server" Text='<%=Name%>'></asp:Label>
<a href='#'>blah</a>
would be broken up into:
// Text fragment 1
<asp:Label ID="lbl_xyz" runat="server" Text="
// Code fragment 1
<%=Name%>
// Text fragment 2 (two lines)
></asp:Label>
<a href='#'>blah</a>
But there is no parsing done any deeper than that, i.e. the a tag is not parsed into its own node with attributes or anything like that.
So my best guess is that the MixedCodeDocument is expected to be used to strip out the code fragments so that the remaining text fragments can be pieced together and then parsed using the HtmlDocument class.
Does anybody know if that's correct? Or even better, does anybody have any tips for ways to successfully parse and manipulate an aspx or ascx file using the HAP or other?
Your guess is 100% correct.
The MixedCodeDocument class was designed to parse text that contains two languages, such as classic ASP, ASP.NET, etc., hence the name :-)
Originally the Html Agility Pack was used in a tool capable of processing and transforming a whole tree of various files, including HTML and other file types. If you needed to replace only the HTML parts in those other files, this class helped you split code from markup. The separated code and markup blocks can then be parsed by other means.
I don't think anyone's using it today :)
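The splitting described above can be approximated with a plain regular expression. To be clear, this is not how MixedCodeDocument is implemented and it does not use the Html Agility Pack API at all; it is a stand-in sketch with invented names that only treats <% ... %> blocks as code, so that the remaining markup can be handed to HtmlDocument.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MixedCodeSketch
{
    // Matches <% ... %> blocks, the ASP / ASP.NET code delimiters.
    private static readonly Regex CodeBlock =
        new Regex(@"<%.*?%>", RegexOptions.Singleline);

    // Returns fragments in document order; Key is true for code, false for text.
    public static List<KeyValuePair<bool, string>> Split(string source)
    {
        var fragments = new List<KeyValuePair<bool, string>>();
        int position = 0;
        foreach (Match m in CodeBlock.Matches(source))
        {
            if (m.Index > position)
            {
                // Text between the previous code block and this one.
                fragments.Add(new KeyValuePair<bool, string>(
                    false, source.Substring(position, m.Index - position)));
            }
            fragments.Add(new KeyValuePair<bool, string>(true, m.Value));
            position = m.Index + m.Length;
        }
        if (position < source.Length)
        {
            // Trailing text after the last code block.
            fragments.Add(new KeyValuePair<bool, string>(false, source.Substring(position)));
        }
        return fragments;
    }

    // Markup with the code fragments stripped out, ready for an HTML parser.
    public static string MarkupOnly(string source)
    {
        return CodeBlock.Replace(source, "");
    }
}
```

Stripping the code this way loses the attribute value that contained it (Text='<%=Name%>' becomes Text=''), so if the code needs to be put back after editing, a placeholder token per fragment would be a safer replacement than an empty string.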
I have been using the .NET WebBrowser control in edit mode as part of an interface for end users to create sections of HTML content for insertion into various websites. They have had a very cut-down list of tags available, such as <p>, <br>, <a href>, <strong>, <ul> and <li>; they could not apply any formatting on top of the tags, as that was determined by the particular web page's CSS. This system has been working well up until now.
Unfortunately I now need XHTML that can go into a larger XML document, which various other websites will aggregate. The WebBrowser's main problem seems to be lists, where it produces:
<UL><LI>Item1
<LI>item2
<LI>item3</LI></UL>
Is there a good converter library to fix this or could I force the WebBrowser control to create XHTML? I have tried the HTMLAgilityPack but it converted to XHTML by doing something like:
<UL><LI>Item1
<LI>item2
<LI>item3</LI></LI></LI></UL>
I don't think this is nested appropriately, as surely the closing tags should be at the end of each item, although it would pass XHTML validation. If it is OK, will I end up with rendering issues in certain browsers when the XML is eventually put into a website?
Try this.
http://tidy.sourceforge.net/
You must be using Internet Explorer, which is the only browser I can think of that doesn't close list-item tags in a content-editable section. Also, the tags ought to be lower case, which is the other give-away.
It is worth checking that you are sending the correct document-type to the browser as this may solve your problem (i.e. make sure the editable bit is definitely an XHTML page). Other than this, you could manage it by having a plain-text editable area with some custom(ish) mark-up and a preview area below. Erm... a bit like Stack Overflow. That way, you can create the exact mark-up you want, rather than relying on what a browser generates.
I have a requirement that users can enter HTML tags in an ASP.NET TextBox. The value of the textbox will be saved in the database, and then we need to show what the user entered on some other page. To do so, I set ValidateRequest="false" on the Page directive.
Now the problem is when the user inputs something like:
<script> window.location = 'http://www.xyz.com'; </script>
Its value is saved in the database, but when I show the value on some other page it redirects me to "http://www.xyz.com", which is expected, as the browser runs the JavaScript. But I need to find a solution, as I need to show exactly what the user entered.
I am thinking of Server.HtmlEncode. Can you point me in the right direction for my requirement?
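Always, always, always encode the input from the user before it reaches a page. One way is to encode it and then, and only then, persist it in your database. You can achieve this easily by doing
Server.HtmlEncode(userinput)
Now, when it comes time to display the content to the user, write the stored, encoded value to the page as it is. Only decode it when you need the raw text back, for example to edit it again, because decoding it and writing the result straight into the page would let the script run:
Server.HtmlDecode(userinput)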
You need to encode all of the input before you output it back to the user and you could consider implementing a whitelist based approach to what kind of HTML you allow a user to submit.
I suggest a whitelist approach because it's much easier to write rules to allow p,br,em,strong,a (for example) rather than to try and identify every kind of malicious input and blacklist them.
Possibly consider using something like MarkDown (as used on StackOverflow) instead of allowing plain HTML?
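The whitelist idea above can be sketched as "encode everything, then re-enable only the tags you trust". This is an illustrative sketch, not a complete sanitizer: the names and the tag list (p, br, em, strong) are assumptions, tags with attributes such as a href are deliberately left encoded, and System.Net.WebUtility needs .NET 4.0 or later (Server.HtmlEncode would play the same role in a page).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class WhitelistSketch
{
    // Encoded forms of the attribute-free tags we choose to allow (illustrative list).
    private static readonly Regex AllowedTag = new Regex(
        @"&lt;(/?)(p|br|em|strong)\s*(/?)&gt;", RegexOptions.IgnoreCase);

    public static string Sanitize(string userInput)
    {
        // 1. Neutralise everything, so unknown markup is displayed as text.
        string encoded = WebUtility.HtmlEncode(userInput);

        // 2. Turn only the whitelisted tags back into real markup.
        return AllowedTag.Replace(encoded, "<$1$2$3>");
    }
}
```

Because the check runs on the already-encoded text, a tag such as <script> or <p onclick="..."> never matches the pattern and stays visible as harmless text instead of being executed.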
You need to escape some characters when generating the HTML: '<' -> &lt;, '>' -> &gt;, '&' -> &amp;. This way exactly what the user entered gets displayed; otherwise the HTML parser might recognize HTML tags and execute them.
Have you tried using HTMLEncode on all of your inputs? I personally use the Telerik RadEditor that escapes the characters before submitting them... that way the system doesn't barf on exceptions.
Here's an SO question along the same lines.
You should have a look at the HTML tags you do not want to support because of vulnerabilities such as the one you described, for example
script
img
iframe
applet
object
embed
form, button, input
and replace the leading "<" with "&lt;".
Also replace </body> and </html>
HTML editors such as CKEditor allow you to require well-formed XHTML, and define tags to be excluded from input.