Is there a way to detect whether an HTML page contains any Razor/C# code? Essentially, I want users to be able to provide custom layouts, with tags that I will replace with RenderSection. Before making this replacement, I want to validate that none of the HTML contains anything like, for example, <a href="@(some C# code)".
All discussions about alternative ways to do this, should/could/would aside, just simply:
Is there a way to programmatically detect if a file contains C#/Razor code?
I don't know a lot about the Razor markup -- but I am thinking that when you grab the layout string they are passing in, you will want to parse the text and grab everything that starts with an @ and toss those words into an array. Then, when you republish it to your website, use Razor code to access the data in the array...
Alternatively, and easier, would be to go through all the passed-in code and replace all the @ signs with a different symbol, say &, so that it won't get interpreted by the Razor processor:
layoutString = layoutString.Replace('@', '&');
In the browser? No, because unless the programmer made a mistake, there is no Razor/C# code in the rendered HTML, only the HTML that resulted from it.
What you ask is like asking what type of oven was used to bake a pizza from the pizza. Bad news - you never will know.
If you provide sensible tags for those, you could parse them in JavaScript, but you have to output that metadata yourself as part of the generated HTML.
After reading your comment to TomTom; the answer is:
No. Razor does not come with any public syntax parser.
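Short of a real parser, a crude heuristic is still possible. The sketch below is only an assumption about what "looks like Razor": it flags any @ that is neither escaped as @@ nor preceded by a letter or digit (as in an email address). The class and method names are hypothetical, and this is not how Razor itself decides what is code, so expect false positives and negatives.

```csharp
using System.Text.RegularExpressions;

public static class RazorHeuristics
{
    // Matches an '@' that is not preceded by a letter, digit, or another '@'
    // (email-like text or the second half of an escaped "@@"),
    // and not followed by '@' (the first half of an escaped "@@").
    private static readonly Regex SuspiciousAt =
        new Regex(@"(?<![A-Za-z0-9@])@(?!@)");

    // Returns true when the markup contains something that may be Razor code.
    public static bool MayContainRazor(string html)
    {
        return SuspiciousAt.IsMatch(html ?? string.Empty);
    }
}
```

A layout that trips this check could be rejected, or shown back to the user for review, before any tags are replaced.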
I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression to extract the content you want from a string, such as your HTML string.
Or you can use a DOM parser such as Html Agility Pack.
Hope this helps!
You could use something like this -
// Requires: using System.Linq; using System.Text.RegularExpressions;
var text = "12 hello 45 yes 890 bye 999";
var matches = Regex.Matches(text, @"\d+").Cast<Match>().Select(m => m.Value).ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
If the page is well-formed XML, you could use LINQ to XML: load the page into an XDocument, use XPath or another way of traversing to reach the element(s) you want, and load what you need into the array (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution that could break at any time, since a subtle change can make the XML no longer well-formed. If that's the case, you're probably better off using regular expressions. Either way, though, the page could be changed under you and your code will suddenly stop working.
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.
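As a rough illustration of the LINQ to XML route, the sketch below collects the text of every element with a given name, in document order. It assumes the page really is well-formed XML (XDocument.Parse throws otherwise); the class, method, and parameter names are made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TitleExtractor
{
    // Returns the text content of every element whose local name matches,
    // in document order. Entities such as &amp; are decoded by the parser,
    // so special characters survive intact.
    public static List<string> ExtractTitles(string xhtml, string elementName)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == elementName)
                  .Select(e => e.Value)
                  .ToList();
    }
}
```

Matching on the local name sidesteps the XHTML namespace, which would otherwise have to be included in every element name.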
I know this matter has already been brought on these pages many times, but still I haven't found the "good solution" I am required to find. Let's start the explanation.
Localization in .net, and in mvc, is made in 2 ways that can even be mixed together:
Resource files (both local or global)
Localized views with a viewengine to call the appropriate view based on culture
I'll explain the solutions I tried and all the problems I got with every one of them.
Text in resource files, all tags in the view
This solution would have me put every text in resources, and every tag in the view, even the inline tags such as [strong] or [span].
Pros:
Clean separation, no structure whatsoever in localization.
Easy encoding: everything that is returned from the resource gets html encoded.
Cons:
If I have a paragraph with some strongs, a couple of links, etc., I have to split it into many resource keys. This is considered to make the view too unreadable, and it also takes too much time to create.
For the same reason as above, if in two different languages the [strong] text is in different places (like "Il cane di Marco" and "Marco's dog"), I can't achieve it, since all my tags are in the view.
Text and inline tags in resource files, through parameters
This method will have the resources contain some placeholders for string.Format, and those placeholders will be filled with inline tags post-encoding.
Pros:
Clean separation, with just placeholders in the text, so if I ever need to replace [strong] with [em], I do it in the view where I pass it as a parameter and it gets changed in every language.
Cons:
Encoding is a bit harder, I have to pre-encode the value from the resource, then use string.Format, and finally return it as MvcHtmlString to tell the view engine to not re-encode it when displaying.
For the same reason as above, including, for instance, an ActionLink as a parameter would be troublesome. Let's say I get the text for my ActionLink from a resource. My method already encodes it. But then the ActionLink method would encode it again. I would need a distinct method to get resources without encoding them, or new helper methods that take an MvcHtmlString instead of a string as the text parameter, but both are rather impractical.
Still takes a whole lot of time to build views, having to create all the resource keys and then fill them.
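The encode-then-format step described for this method could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, and it uses System.Net.WebUtility so that it runs outside MVC. In a view you would presumably wrap the result with MvcHtmlString.Create so it is not encoded again.

```csharp
using System.Net;

public static class LocalizedFormat
{
    // Encodes the resource text first, then fills the {0}, {1}, ... placeholders
    // with trusted inline tags supplied by the view. HtmlEncode leaves the
    // curly braces alone, so the placeholders survive the encoding step.
    public static string FormatEncoded(string resourceText, params string[] rawHtmlArgs)
    {
        string encoded = WebUtility.HtmlEncode(resourceText);
        return string.Format(encoded, rawHtmlArgs);
    }
}
```

For example, FormatEncoded("Il cane di {0}Marco{1}", "<strong>", "</strong>") keeps the tags chosen by the view while anything HTML-like in the resource itself is neutralised.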
Localized views
Pros:
All views are plain html. No resources to read.
Cons:
Duplicated html everywhere. I don't even need to explain how this is totally evil.
Have to manually encode all troublesome characters like grave vowels, quotes and such.
Conclusions
A mix of the above techniques inherits their pros and cons, but it's still no good. I am challenged to find a proper, productive solution, while all of the above are considered "impractical" and "time consuming".
To make things worse, I found out that there isn't a single tool that refactors "text" from aspx or cshtml (or even html) views/pages into resources. All the tools out there can refactor System.String instances in code files (.cs or .vb) into resources only (resharper for instance, and a couple of others I can't remember now).
So I'm stuck, can't find anything appropriate on my own, and can't find anything on the web either. Is it possible that no one else has faced this problem before and found a solution?
I personally like the idea of storing inline tags in the resource file. However I do it a little differently. I store very plain tags like <span class='emphasis'>dog</span> and then I use CSS to style the tags appropriately. Now, instead of "passing in" a tag as a parameter, I simply style the span.emphasis rule in my CSS appropriately. Change carries over to all languages.
The Sexier Option:
Another option I thought of and quite enjoy is to use a "readable markup" language like StackOverflow's very own MarkdownSharp. This way you aren't storing any HTML in the resource file, only Markdown text. So in your resource you would have **dog**, and then it gets shunted through Markdown in the view (I created a helper for this; usage: Html.Markdown(string text)). Now you're not storing tags, you're storing a common, human-readable markup language. The MarkdownSharp source is one .cs file and it's easy to modify, so you could always change the way it renders the final HTML. This gives you total control over all your resources without storing HTML, and without duplicating views or chunks of HTML.
EDIT
This also gives you control over the encoding. You could easily make sure the content of your resource files contain no valid HTML. Markdown syntax (as you know from using stack overflow) does not contain HTML tags and thus can be encoded without harm. Then you just use your helper to convert the Markdown syntax to valid HTML.
EDIT #2
There is one bug in Markdown that I had to fix myself. Anything Markdown detects as a "code" block will be HTML encoded. This is a problem if you have already HTML encoded all content being passed to Markdown, as anything in the code blocks will essentially be re-encoded, which turns &gt; into &amp;gt; and completely screws up the text within code blocks. To fix this I modified the markdown.cs file to include a boolean option that stops Markdown from encoding text within code blocks. See this issue for the fixed .cs file that I added to the MarkdownSharp project issues.
EDIT #3 - Html Helper Sample
// Requires: using System.Web.Mvc;
public static class HtmlHelpers
{
    public static MvcHtmlString Markdown(this HtmlHelper helper, string text)
    {
        var markdown = new MarkdownSharp.Markdown
        {
            AutoHyperlink = true,
            EncodeCodeBlocks = false, // This option is my custom option to stop the code block encoding problem.
            LinkEmails = true,
            EncodeProblemUrlCharacters = true
        };
        // Convert the Markdown text to HTML and mark it as already encoded.
        string html = markdown.Transform(text);
        return MvcHtmlString.Create(html);
    }
}
Nothing stops you from storing HTML in resource files, then calling @Html.Raw(MyResources.Resource).
Have you thought about using localized models? Have your view be strongly typed to IMyModel and then pass in the appropriately decorated model. Then you can change how you're doing your localization fairly easily by modifying the appropriate model.
It's clean, very flexible, and very easy to maintain.
You could start out with resource-file-based localization and then, for places you need to update more often, switch that model to a cached, DB-based localization model.
I have something of a hairy problem: I'd like to generate a couple of paragraphs of "description" of a given URL, normally the start of an article. The meta description field is one way to go, but it isn't always good or set properly.
It's fair to say it's a bit problematic to accomplish this from the screenscraped HTML. I had a general idea that perhaps one could scan the HTML for the first "appropriate" segment but it's hard to say what that is, perhaps something like the first paragraph containing a certain amount of text...
Anyone have any good ideas? :) It doesn't have to be foolproof
So, you wanna become a new Google, heh? :-)
Many sites are "SEO friendly" these days. This enables you to go for the headings and then look for the paragraphs below.
Also, look for lists. There is a lot of content in some sort of tab-like (tabs, accordions...) interfaces that is done using ordered or unordered lists.
If that fails, maybe look for a div with class "content" or "main" or a combination and start from there.
If you use different approaches, make sure you keep statistics of what worked and what didn't (maybe even save a full page), so you can review and tweak your parsing and searching methods.
As a side note, I've used Html Agility Pack to parse and search through HTML with success. Well, at least it beats parsing with regex :-)
Perhaps look for the div element that contains the most p elements, and then grab the first p child. If no div, get the first p from the body element.
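A minimal sketch of that heuristic follows, under the loud assumption that the markup is well-formed enough to load with LINQ to XML (real-world pages often are not, in which case a tolerant HTML parser would be needed instead). The class and method names are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class DescriptionFinder
{
    // Picks the div with the most direct p children and returns its first p.
    // Falls back to the first p anywhere in the document, or null if none exists.
    public static string FirstParagraph(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // OrderByDescending is stable, so ties go to the earliest div.
        XElement bestDiv = doc.Descendants()
            .Where(e => e.Name.LocalName == "div")
            .OrderByDescending(d => d.Elements().Count(c => c.Name.LocalName == "p"))
            .FirstOrDefault();

        XElement paragraph = null;
        if (bestDiv != null)
        {
            paragraph = bestDiv.Elements().FirstOrDefault(c => c.Name.LocalName == "p");
        }
        if (paragraph == null)
        {
            paragraph = doc.Descendants().FirstOrDefault(e => e.Name.LocalName == "p");
        }

        return paragraph == null ? null : paragraph.Value.Trim();
    }
}
```

A minimum text length for the chosen paragraph would be an easy refinement, in line with the "certain amount of text" idea from the question.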
This will always have its problems.
You can strip the HTML tags using this regular expression
string stripped = Regex.Replace(textBox1.Text, @"<(.|\n)*?>", string.Empty);
You will then get the content text you can use to generate your paragraphs.
I'm currently playing around with a CMS idea I've got. It's based on a MonoRail, NHibernate stack. I know there are already a million CMS solutions out there. This is more for my benefit for trying some new stuff out.
Anyway, the admin side of things is going well with a plugin architecture in full flow, however I've hit a bit of a road block with the front end template management side of things.
What I'm wanting to do is allow developers to write their own custom tags e.g.
<cms:news>
<h1><cms:news:title /></h1>
<p><cms:news:date /></p>
<cms:news:story />
</cms:news>
I believe this will give developers a great deal of flexibility.
The part I'm struggling with is the parsing of these tags. I could use reflection, however I'm worried that this may be quite intensive for every page. Has anyone else done something like this, that has a better solution?
Sorry for the lack of info guys. Here is a bit more info for you.
The above code would sit in a "page" in the CMS. The complete page markup would simply be a DB record.
Once the parser hits these tags, it would then need to process them to convert them to content. In the example above, the parser would hit the cms:news tag and make a call to a function like this
public void news()
{
// Get all of the news articles from the database
}
The cms:news:title (or cms:news.title) tag would call a function like this
public string newstitle()
{
// Return the news title for the current news element we are rendering
}
Hope this makes more sense now
Thanks
John
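On the reflection concern: one way to avoid reflecting on every page is to resolve the handlers once into a dictionary of delegates and look tags up by name at render time. The sketch below is only an illustration under assumptions: it handles self-closing tags only, uses a regular expression rather than a real parser, and all names in it are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagRenderer
{
    // Matches self-closing tags such as <cms:news.title /> and captures the name.
    private static readonly Regex SelfClosingTag =
        new Regex(@"<cms:([A-Za-z][\w.]*)\s*/>");

    // Replaces each known tag with the output of its handler.
    // Unknown tags are left in place untouched.
    public static string Render(string markup, IDictionary<string, Func<string>> handlers)
    {
        return SelfClosingTag.Replace(markup, m =>
        {
            Func<string> handler;
            return handlers.TryGetValue(m.Groups[1].Value, out handler)
                ? handler()
                : m.Value;
        });
    }
}
```

The dictionary could be filled once at startup (by reflection over plugins, if desired), so that the per-page cost is a lookup rather than a reflective call.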
I think I've been looking at this all wrong.
I could basically do this by using something like the Spark View Engine's InMemoryViewFolder and using ViewComponents for the custom tags.
The tags you're considering are not valid XML: you can't have multiple colons in an element name (only one, to separate the namespace prefix from the local name).
Consider this instead:
<cms:news>
<h1><cms:news.title /></h1>
<p><cms:news.date /></p>
<cms:news.story />
</cms:news>
To parse this XML, there are a number of options available to you :
XmlReader
XmlDocument
XDocument (Linq to XML)
I don't think XML serialization is an option if the tags are customizable...
Anyway, I'm not sure what you're trying to achieve exactly... What would you do with those tags ? Could you be more specific in your question ?
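Of the options listed, XDocument is probably the shortest to sketch. One detail to watch: the cms prefix must be bound to a namespace or the parser rejects the markup, so the example wraps the template in a root element that declares it. The namespace URI and all names here are assumptions made up for illustration.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class CmsTagParser
{
    // Hypothetical namespace URI; any stable URI would do.
    private const string CmsNamespace = "urn:example:cms";

    // Returns the local names of all cms-prefixed elements, in document order.
    public static string[] FindCmsTags(string templateMarkup)
    {
        // Wrap the fragment so the "cms" prefix is declared and there is a single root.
        string wrapped = "<root xmlns:cms=\"" + CmsNamespace + "\">" + templateMarkup + "</root>";
        XDocument doc = XDocument.Parse(wrapped);
        XNamespace cms = CmsNamespace;

        return doc.Descendants()
                  .Where(e => e.Name.Namespace == cms)
                  .Select(e => e.Name.LocalName)
                  .ToArray();
    }
}
```

For the dotted markup suggested above, this would yield news, news.title, news.date and news.story, which could then be mapped to whatever produces the content.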
I've made a small program in C#.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news lol. It takes an RSS feed from the BBC website on load and then looks for keywords which either increment or decrease the percentage chance of DOOM.
Crazy little project, but maybe one day the classes will come in handy for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
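// Lazy match up to the first closing tag (a negated character class would stop at any of the characters < / d i v >). Does not handle nested divs.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);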
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
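Since the question also mentions formatting characters, here is a slightly fuller sketch that strips tags, decodes entities such as &amp;, and collapses whitespace before the text goes to keyword matching. The class and method names are hypothetical, and a regular expression is only a rough tool for this: it assumes no > characters inside attribute values, comments, or scripts.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    public static string ToPlainText(string html)
    {
        // Replace each tag with a space so adjacent words do not run together.
        string noTags = Regex.Replace(html, @"<[^>]*>", " ");

        // Turn entities such as &amp; back into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse runs of whitespace (including newlines) into single spaces.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Tags are removed before decoding so that an encoded fragment such as &lt;b&gt; ends up as literal text rather than being stripped as markup.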
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
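would match all paired HTML tags together with their content (use RegexOptions.IgnoreCase, since the pattern is written with uppercase letters).
Use this to remove them from your data.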