Why isn't MarkdownSharp encoding my HTML?

Why isn't MarkdownSharp encoding my HTML? - c#

In my mind, one of the bigger goals of Markdown is to prevent the user from typing potentially malformed HTML directly.
Well that isn't exactly working for me in MarkdownSharp.
This example works properly when you have the extra line break immediately after "abc"...
But when that line break isn't there, I think it should still be HtmlEncoded, but that isn't happening here...
Behind the scenes, the rendered markup is coming from an iframe. And this is the code behind it...
<%
var md = new MarkdownSharp.Markdown();
%>
<%= md.Transform(Request.Form[0]) %>
Surely I must be missing something. Oh, and I am using v1.13 (the latest version as of this writing).
EDIT (this is a test for StackOverflow's implementation)
abc
this shouldn't be red

For those not wanting to use Steve Wortham's customized solution, I have submitted an issue and a proposed fix to the MarkdownSharp guys: http://code.google.com/p/markdownsharp/issues/detail?id=43
If you download my attached Markdown.cs file you will find a new option that you can set. It will stop MarkdownSharp from re-encoding text within the code blocks.
Just don't forget to HTML encode your input BEFORE you pass it into markdown, NOT after.
Another solution is to white-list HTML tags like Stack Overflow does. You would do this AFTER you pass your content to markdown.
See this for more information: http://www.CodeTunnel.com/blog/post/24/mardownsharp-and-encoded-html

Since it became clear that the StackOverflow implementation contains quite a few customizations that could be time consuming to test and figure out, I decided to go another direction.
I created my own simplified markup language that's a subset of Markdown. The open-source project is at http://ultralight.codeplex.com/ and you can see a working example at http://www.bucketsoft.com/ultralight/
The project is a complete ASP.NET MVC solution with a Javascript editor. And unlike MarkdownSharp, safe HTML is guaranteed. The Javascript parser is used both client-side and server-side to guarantee consistent markup (special thanks to the Jurassic Javascript compiler). It's a beautiful thing to only have to maintain one codebase for that parser.
Although the project is still in beta, I'm using it on my own site already and it seems to be working well so far.

Maybe I'm not understanding? If you are starting a new code block in Markdown, in all its varieties, you do need a double linebreak and four-space indentation -- a single newline won't do in any of the renderers I have to hand.
abc -- Here comes a code block:
<div style="background-color: red"> This is code</div>
yielding:
abc -- Here comes a code block:
<div style="background-color: red"> This is code</div>
From what you are saying it seems that MarkdownSharp does fine with this rule, so with just one newline (but indentation):
abc -- Here comes a code block:
<div style="background-color: red"> This should be code</div>
we get a mess not a code block:
abc -- Here comes a code block:
This should be code
I assume StackOverflow is stripping the <div> tags, because they think comments shouldn't have divisions and suchlike things. (?) (In general they have to do a lot of other processing don't they, e.g. to get syntax highlighting and so on?)
EDIT: I think people are expecting the wrong thing of a Markdown implementation. For example, as I say below, there is no such thing as 'invalid markdown'. It isn't a programming language or anything like one. I have verified that all three markdown implementations I have available from the command line indifferently 'convert' random .js and .c files, or those inserted into otherwise sensible markdown -- and also interpolated zip files and other nonsense -- into valid html that browsers don't mind displaying at all -- chicken scratches though it is. If you want to exclude something, e.g. in a wiki program, you do something further, of course, as most markdown-employing wiki programs do.

Related

HtmlAgilityPack Drops Paragraph End Tags [duplicate]

I am using HtmlAgilityPack. I create an HtmlDocument and LoadHtml with the following string:
<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One</option><option value="2">Two</option></select>
This does some unexpected things. First, it gives two parser errors, EndTagNotRequired. Second, the select node has 4 children - two for the option tags and two more for the inner text of the option tags. Last, the OuterHtml is like this:
<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One<option value="2">Two</select>
So basically it is deciding for me to drop the closing tags on the options. Let's leave aside for a moment whether it is proper and desirable to do that. I am using HtmlAgilityPack to test HTML generation code, so I don't want it to make any decision for me or give any errors unless the HTML is truly malformed. Is there some way to make it behave how I want? I tried setting some of the options for HtmlDocument, specifically:
doc.OptionAutoCloseOnEnd = false;
doc.OptionCheckSyntax = false;
doc.OptionFixNestedTags = false;
This is not working. If HtmlAgilityPack cannot do what I want, can you recommend something that can?

The exact same error is reported on the HAP home page's discussion, but it looks like no meaningful fixes have been made to the project in a few years. Not encouraging.
A quick browse of the source suggests the error might be fixable by commenting out line 92 of HtmlNode.cs:
// they sometimes contain, and sometimes they don 't...
ElementsFlags.Add("option", HtmlElementFlag.Empty);
(Actually no, they always contain label text, although a blank string would also be valid text. A careless author might omit the end-tag, but then that's true of any element.)
ADD
An equivalent solution is calling HtmlNode.ElementsFlags.Remove("option"); before any use of liberary (without need to modify the liberary source code)

It seems that there is some reason not to parse the Option tag as a "generic" tag, for XHTML compliance, however this can be a real pain in the neck.
My suggestion is to do a whole-string-replace and change all "option" tags to "my_option" tags, that way you:
Don't have to modify the source of the library (and can upgrade it later).
Can parse as you usually would.
The original post on HtmlAgilityPack forum can be found at:
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=14982

Detect Razor/C# code?

Is there a way to detect if an HTML page contains any razor/C# code? Essentially I want users to be able to provide custom layouts, with tags that I will replace with RenderSection. I want to validate that prior to making this replacement, that none of the HTML contains anything like for example, <a href="#(some C# code)".
All discussions about alternative ways to do this, should/could/would aside, just simply:
Is there a way to programmatically detect if a file contains C#/Razor code?

I don't know a lot about the Razor markup -- but I am thinking that when you grab the layout string they are passing in you will want to parse the text out and grab everything that starts with an # and toss those words into an array. Then, when you republish it to you website use razor code to access the data in the array...
Alternately, and easier, would be to go through all the passed in code and replace all the # signs with a different symbol say & that way it wont get interpreted by the Razor processor:
layoutString = layoutString.Replace('#', '&');

In the browser? No, because unless the programmer made a mistake, there is no Razor/C# code in teh rendered HTML, only HTML that was the result of that.
What you ask is like asking what type of oven was used to bake a pizza from the pizza. Bad news - you never will know.
If you provie sensible tags from those, you could parse them in javascript, but you have to output that metadata yourself as part of the generated html.

After reading your comment to TomTom; the answer is:
No. Razor does not come with any public syntax parser.

Asp.net Mvc, Razor and Localization

I know this matter has already been brought on these pages many times, but still I haven't found the "good solution" I am required to find. Let's start the explanation.
Localization in .net, and in mvc, is made in 2 ways that can even be mixed together:
Resource files (both local or global)
Localized views with a viewengine to call the appropriate view based on culture
I'll explain the solutions I tried and all the problems I got with every one of them.
Text in resource files, all tags in the view
This solution would have me put every text in resources, and every tag in the view, even the inline tags such as [strong] or [span].
Pros:
Clean separation, no structure whatsoever in localization.
Easy encoding: everything that is returned from the resource gets html encoded.
Cons:
If I have a paragraph with some strongs, a couple of link etc I have to split it in many resource keys. This is considered to make the view too unreadable and also takes too much time to create it.
For the same reason as above, if in two different languages the [strong] text is in different places (like "Il cane di Marco" and "Marcos's dog"), I can't achieve it, since all my tags are in the view.
Text and inline tags in resource files, through parameters
This method will have the resources contain some placeholders for string.Format, and those placeholders will be filled with inline tags post-encoding.
Pros:
Clean separation, with just placeholders in the text, so if I am ever to replace [strong] with [em] I do it in the view where I pass it as parameter and it gets changed in every language
Cons:
Encoding is a bit harder, I have to pre-encode the value from the resource, then use string.Format, and finally return it as MvcHtmlString to tell the view engine to not re-encode it when displaying.
For the same reason as above, including, for instance, an ActionLink as parameter would be troublesome. Let's say I get the text for my actionlink from a resource. My method already encodes it. But then, the ActionLink method would re-encode it again. I would need a distinct method to get resources without encoding them, or new helper methods that get an MvcHtmlString instead of a string as text parameter, but both are rather unpractical.
Still takes a whole lot of time to build views, having to create all the resource keys and then fill them.
Localized views
Pros:
All views are plain html. No resources to read.
Cons:
Duplicated html everywhere. I don't even need to explain how this is totally evil.
Have to manually encode all troublesome characters like grave vowels, quotes and such.
Conclusions
A mix of the above techinques inherits pros and cons, but it's still no good. I am challenged to find a proper productive solution, while all of the above are considered "unpractical" and "time consuming".
To make things worse, I found out that there isn't a single tool that refactors "text" from aspx or cshtml (or even html) views/pages into resources. All the tools out there can refactor System.String instances in code files (.cs or .vb) into resources only (resharper for instance, and a couple of others I can't remember now).
So I'm stuck, can't find anything appropriate on my own, and can't find anything on the web either. Is it possible noone else got challenged with this problem before and found a solution?

I personally like the idea of storing inline tags in the resource file. However I do it a little differently. I store very plain tags like <span class='emphasis'>dog</span> and then I use CSS to style the tags appropriately. Now, instead of "passing in" a tag as a parameter, I simply style the span.emphasis rule in my CSS appropriately. Change carries over to all languages.
The Sexier Option:
Another option I thought of and quite enjoy is to use a "readable markup" language like StackOverflow's very own MarkdownSharp. This way you aren't storing any HTML in the resource file, only markdown text. So in your resource you would have **dog** and then it gets shunted through markdown in the view (I created a helper for this, (Usage: Html.Markdown(string text)). Now you're not storing tags, you're storing a common human readable markup language. The markdownsharp source is one .CS file and it's easy to modify. So you could always change the way it renders the ending HTML. This gives you total control over all your resources without storing HTML, and without duplicating views or chunks of HTML.
EDIT
This also gives you control over the encoding. You could easily make sure the content of your resource files contain no valid HTML. Markdown syntax (as you know from using stack overflow) does not contain HTML tags and thus can be encoded without harm. Then you just use your helper to convert the Markdown syntax to valid HTML.
EDIT #2
There is one bug in markdown that I had to fix myself. Anything markdown detects is to be rendered as a "code" block will be HTML encoded. This is a problem if you have already HTML encoded all content being passed to markdown as anything in the code blocks will be essentially re-encoded which turns > into &gt; and completely screws up the text within code blocks. To fix this I modified the markdown.cs file to include a boolean option that stops markdown from encoding text within code blocks. See this issue for the fixed .cs file that I added to the MarkdownSharp project issues.
EDIT #3 - Html Helper Sample
public static class HtmlHelpers
{
public static MvcHtmlString Markdown(this HtmlHelper helper, string text)
{
var markdown = new MarkdownSharp.Markdown
{
AutoHyperlink = true,
EncodeCodeBlocks = false, // This option is my custom option to stop the code block encoding problem.
LinkEmails = true,
EncodeProblemUrlCharacters = true
};
string html = markdown.Transform(markdownText);
return MvcHtmlString.Create(html);
}
}

Nothing stops you from storing HTML in resource files, then calling #Html.Raw(MyResources.Resource).

Have you thought about using localized models, have your view be strongly types to IMyModel and then pass in the appropriately decorated model then you can use/change how your doing your localization fairly easy by modifying the appropriated model.
it's clean, very flexible, and very easy to maintain.
you could start out with Recourse file based localization and then for paces you need to update more often switch that model to a cached DB based localization model.

Displaying C# code in Wordpress.com

I have researched this for a few hours and I am kind of frustrated. Maybe I am just missing something as I am new to blogging.
I am not hosting my own blog, I am just using WordPress.com.
I want to include snippets of c# code and have them look like they do in Visual Studio, or at least make them look nice, certainly with line numbers and color.
The solutions I have seen for this all seem to assume you are hosting your own blog.
I cannot figure out how to install plugins.
Is there a widget that will make code snippets look nice, or some other solution I can easily use?
Thank you
EDIT: Sarfraz has outlined one way to solve my problem (thank you!), and I have tried it but there is an issue I have, namely that it does not colorize most of my code (newer keywords like var, from, where, select, etc). Is there a fix to this or is there some other solution?

Just edit your aricles in html mode and enclose your code within these tags.
[sourcecode language="css"]
[/sourcecode]
Example:
[sourcecode language="javascript"]
// javascript hello world program
alert('Hello, World !!');
[/sourcecode]
Note: You need to specify correct language identifier for the language attribute as shown above.
More Information Here :)

The [sourcecode] tag usually works fine for C#, but for me it often breaks when I post XAML code.
Instead I use this page to format my code. The result looks nice (you can see it on my blog), but it requires the "Custom CSS" option ($15/year).
EDIT: actually the [sourcecode] tag works fine, and I'm now using it in all my posts

[code language="csharp"]
//Your code here
[/code]

looks like this has been updated, now you can use
[code language="[the lang you are posting]"]
your code here
[/code]
note: you can shorthand language as lang
[code lang="[the lang you are posting]"]
your code here
[/code]
here is the list of supported languages

Regex with not match at end

I'm trying to write a regex to match patterns like this:
<td style="alskdjf" />
i.e. a self terminating <td>
but not this:
<td style=alsdkjf"><br /></td>
I initially came up with:
<td\s+.*?/>
but that obviously fails on the second example and I thought that something like this might work:
<td\s+.*?[^>]/>
but it doesn't. I'm using C#.NET.
Only looking for <td>'s that have an attribute. e.g. looking for <td style="alsdfkj" /> but not <td>.

This will match what you're looking for, and not match the problematic case you had with your first few tries:
<td[^>]*?/>
Note, however, that if you need to allow > characters in attribute values, you'd need something like this:
<td(?:[^>]|"[^"]*?")*?/>
Which allows > only within matching double-quotes (you could similarly expand it to allow single-quotes).
You can add whatever specific attribute you're looking for into the regex; for instance for your example:
<td[^>]*? style="alskdjf"[^>]*?/>

You're going to have problems using regexps with HTML since HTML is not regular. I'd recommend using an HTML parser for all but the very simplest cases.

Regex will have serious trouble interpreting messy HTML, as is the sort browsers often have to deal with. There are all sorts of horrible obfuscations that can be done to the markup that you just don't want to have to think about!
The HTML Agility Pack is what you really want to be using, and has had very good reviews everywhere I've seen. It is a robust library for reading any kind of mangled HTML into a DOM model. I have personally found it to be an superb library, as surely have others, many using the library in the context of business applications.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.