I'm interested in how a parser like the Razor view engine can parse two distinct languages like C# and JavaScript.
It's very cool that the following works, for instance:
$("#fm_duedate").val('@DateTime.Now.AddMonths(1).ToString("MM/dd/yyyy")');
I'm going to try and look at the source, but I'm curious: is there some kind of theoretical foundation for a parser like this, or is it more brute force, like taking the union of the two languages and parsing that?
Trying to reason it out for myself, I say "you start with a parser for each language, then you add to each one a set of productions that switch it to the other", but I doubt it's so simple.
I guess the perfect answer would be a pointer to discussion on how the Razor engine is implemented or a walk-through of the source (I haven't actually Google'd this for fear of going down a rabbit hole). Alternately, just some insight on how the problem of parsing two languages is approached would be great.
As Corey points out, Razor and similar frameworks do not do anything particularly fancy.
However there are some more theoretically sound models for building parsers for languages where one language is embedded in another. My erstwhile colleague Luke Hoban has a great introductory article on parser combinators, which afford a very nice way to build a parser for one-language-embedded-in-another-language scenarios:
http://blogs.msdn.com/b/lukeh/archive/2007/08/19/monadic-parser-combinators-using-c-3-0.aspx
The Wikipedia page is pretty straightforward as well:
http://en.wikipedia.org/wiki/Parser_combinator
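To make the combinator idea concrete, here is a minimal sketch in Python (the article linked above uses C#). It is not taken from that article or from Razor. Every name in it (`literal`, `take_while`, `seq`, `alt`, `many`, `fmap`, `parse_template`) and the toy `@( ... )` syntax are assumptions made for illustration. The point is only that the outer-language parser and the inner-language parser are ordinary values, so embedding one in the other is a single `alt`.

```python
# A parser is a function (text, pos) -> (value, new_pos), or None on failure.

def literal(s):
    """Match the exact string s."""
    def parse(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return parse

def take_while(pred):
    """Match one or more characters that satisfy pred."""
    def parse(text, pos):
        end = pos
        while end < len(text) and pred(text[end]):
            end += 1
        return (text[pos:end], end) if end > pos else None
    return parse

def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    """Return the result of the first parser that succeeds."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def many(p):
    """Apply p zero or more times and collect the values."""
    def parse(text, pos):
        values = []
        while True:
            result = p(text, pos)
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parse

def fmap(p, f):
    """Transform the value produced by p."""
    def parse(text, pos):
        result = p(text, pos)
        return None if result is None else (f(result[0]), result[1])
    return parse

# Inner language (hypothetical): a dotted name such as user.name
inner = take_while(lambda c: c.isalnum() or c in "._")
# Outer language: plain text, with @( ... ) handing over to the inner parser
code = fmap(seq(literal("@("), inner, literal(")")), lambda v: ("code", v[1]))
text_run = fmap(take_while(lambda c: c != "@"), lambda v: ("text", v))
template = many(alt(code, text_run))

def parse_template(src):
    values, pos = template(src, 0)
    if pos != len(src):
        raise ValueError(f"could not parse at offset {pos}")
    return values
```

Swapping `inner` for a richer expression parser changes nothing in `template`, which is what makes this style attractive for one-language-inside-another problems.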
Razor (and the other view engines) do not parse the HTML or JavaScript of a view. Instead they parse the text to detect specific tokens, with no real concern about the surrounding text.
In the case of Razor, every @ character in the source file is processed as a code block of some sort. Razor is quite smart about detecting the expression that follows the @ character, including handling things like @foreach (var x in collection) { and locating the closing } while not trying to parse the HTML (or JavaScript) inside. It also lets you use @{ } and @( ) to override the processing to a degree.
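A rough sketch of that "look for the transition character, ignore everything else" approach, written in Python rather than C#. This is not Razor's implementation. The function names and the simplified rules are assumptions: `@@` is an escaped `@`, `@( ... )` is an explicit expression, and `@name` starts an implicit expression that extends over `.member` accesses and balanced `( )` / `[ ]` groups. Statement blocks such as `@foreach` and `@{ }` are left out.

```python
def _skip_balanced(src, i):
    """src[i] is '(' or '['; return the index just past its matching closer."""
    pairs = {"(": ")", "[": "]"}
    stack = [pairs[src[i]]]
    i += 1
    while i < len(src) and stack:
        c = src[i]
        if c == '"':
            # Skip a double-quoted string literal so brackets inside it are ignored.
            i += 1
            while i < len(src) and src[i] != '"':
                i += 2 if src[i] == "\\" else 1
        elif c in pairs:
            stack.append(pairs[c])
        elif c == stack[-1]:
            stack.pop()
        i += 1
    if stack:
        raise ValueError("unbalanced brackets in code expression")
    return i

def _ident_end(src, i):
    while i < len(src) and (src[i].isalnum() or src[i] == "_"):
        i += 1
    return i

def _implicit_end(src, i):
    """Return the index just past an implicit expression starting at src[i]."""
    i = _ident_end(src, i)
    while i < len(src):
        if src[i] in "([":
            i = _skip_balanced(src, i)
        elif (src[i] == "." and i + 1 < len(src)
              and (src[i + 1].isalpha() or src[i + 1] == "_")):
            i = _ident_end(src, i + 1)
        else:
            break
    return i

def split_template(src):
    """Split src into ("text", ...) and ("code", ...) segments."""
    segments, text, i = [], [], 0

    def flush():
        if text:
            segments.append(("text", "".join(text)))
            text.clear()

    while i < len(src):
        c = src[i]
        nxt = src[i + 1] if i + 1 < len(src) else ""
        if c != "@":
            text.append(c)
            i += 1
        elif nxt == "@":                      # escaped @
            text.append("@")
            i += 2
        elif nxt == "(":                      # explicit expression @( ... )
            end = _skip_balanced(src, i + 1)
            flush()
            segments.append(("code", src[i + 2:end - 1]))
            i = end
        elif nxt.isalpha() or nxt == "_":     # implicit expression @name.member(...)
            end = _implicit_end(src, i + 1)
            flush()
            segments.append(("code", src[i + 1:end]))
            i = end
        else:                                 # a lone @ is ordinary text
            text.append(c)
            i += 1
    flush()
    return segments
```

The surrounding JavaScript is never inspected: the quote, the `)` and the `;` after the expression end up in a text segment simply because none of them can continue the expression.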
I find the ASPX <%...%> format simpler to read, since I've used that format more and I've got some established pattern recognition going on for those. Having explicit start/finish tokens is simpler to process and simpler to read in-place.
I'm writing a C# web crawler, and when I profile it I can see that HTMLAgilityPack's LoadHTML method accounts for 10% of the program's overall CPU usage. I'd like to try to lower this.
I'm sure a regular expression would be faster, but as I look at link-extraction examples on SO I see everyone saying this method should be avoided in favour of an HTML parser like HTMLAgilityPack.
As all I need to do is extract links from HTML, is using HTMLAgilityPack overkill?
Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?
Downloaded HTML with WebClient then compared.
Using href\\s*=\\s*(?:[\"'](?<1>[^\"']*)[\"']|(?<1>\\S+)) (then trimming and adding to a list) is way faster than HTMLAgilityPack.
Consistently 43 milliseconds for HTMLAgilityPack compared to 3 for the regex.
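For readers who want to try the pattern outside C#, here is a rough Python port. It differs from the pattern above in three ways, all of them assumptions made for this sketch: Python does not allow the same group to be defined twice, so the quoted and bare alternatives get separate groups; the bare alternative stops at `>` as well as at whitespace; and matching is case-insensitive. The function name `extract_links` is hypothetical.

```python
import re

# Group 1: quoted value. Group 2: bare (unquoted) value.
HREF = re.compile(r"""href\s*=\s*(?:["']([^"']*)["']|([^\s>]+))""", re.IGNORECASE)

def extract_links(html):
    """Return every href value found in html, trimmed of whitespace."""
    links = []
    for match in HREF.finditer(html):
        quoted, bare = match.group(1), match.group(2)
        links.append((quoted if quoted is not None else bare).strip())
    return links
```

Note that this matches `href=` anywhere in the text, not only inside anchor tags, which is the trade-off the regex approach accepts in exchange for speed.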
See my code on pastebin
Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?
In your case the HTML parser is overkill as your tests have shown.
People who answer on SO use that as a rote answer to all regex questions. One should use the tool if one actually needs to parse the domain of the HTML in a more robust fashion.
Bias against regular expressions comes from people who feel that they are too slow or too cumbersome to learn. There is some merit to that view for certain operations, in that specialized, optimized text-search utilities do perform better. I agree with that, but dismissing regex out of hand is par for the course on StackOverflow.
Why is that? Sometimes the analysis is simply flawed because the pattern provided introduces a lot of unnecessary backtracking and is not optimized. That handicaps regex out of the gate. One does have to learn the regex language and understand what the engine is doing in order to tune a pattern so that it does not do wasted work.
For example I took your same C# code test, but I used an optimized pattern of yours and my own and was able to get it down to 1 millisecond consistently!
Most people learn basic pattern matching by doing searches with a *. When they first learn regex they combine * with the . as in .*. That step, along with indiscriminate use of the *, will most likely doom any non-beginner pattern to the hell of backtracking and slow responses.
Unless you know that zero occurrences are genuinely possible, use + instead.
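The answer does not show its optimized pattern, so the following is only a small illustration of the general point, in Python, with made-up input. A greedy `.*` first runs to the end of the input and then backtracks, and it may settle on a later delimiter than intended. A negated character class with `+` stops at the first delimiter and cannot match an empty value.

```python
import re

line = 'a href="one.html" title="two"'

# Greedy .* grabs everything up to the end of the line, then backtracks
# until it finds a closing quote: the LAST one on the line.
greedy = re.search(r'href="(.*)"', line).group(1)

# [^"]+ can never cross a quote, so no backtracking over the rest of the line
# is needed, and at least one character is required.
tight = re.search(r'href="([^"]+)"', line).group(1)

# * accepts "nothing at all", + does not.
star_matches_empty = re.fullmatch(r"\d*", "") is not None
plus_matches_empty = re.fullmatch(r"\d+", "") is not None
```

Here `greedy` ends up spanning both attributes, while `tight` holds just the href value.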
Back in 2009 I wrote about this subject on my blog Are C# .Net Regular Expressions Fast Enough for You?
I need to develop an application that will read and understand a text file containing a custom language that describes a list of operations (e.g. a cooking recipe). This language has not been defined yet, but it will probably take one of the following shapes:
C++ like code
(This code is randomly generated, just for example purpose) :
begin
    repeat(10)
    {
        bar(toto, 10, 1999, xxx);
    }
    result = foo(xxxx, 10);
    if(foo == ok)
    {
        ...
    }
    else
    {
        ...
    }
end
XML code
(This code is randomly generated, just for example purpose) :
<recipe>
    <action name="foo" argument="bar, toto, xxx" repeat="10"/>
    <action name="bar" argument="xxxxx;10" condition="foo == ok">
        <true>...</true>
        <false>...</false>
    </action>
</recipe>
No matter which language is chosen, it will have to handle simple conditions and loops.
I have never done such a thing, but at first sight it occurs to me that describing those operations in XML would be simpler yet less powerful.
After browsing StackOverflow, I found some discussions of a tool called "ANTLR"... I started reading "The Definitive ANTLR Reference", but since I have never done that kind of thing, I find it hard to know whether it's really the kind of tool I need...
In other words, what do I need in order to read a text file, interpret it properly and perform actions in my C# code? Those operations will interact with each other through simple conditions like:
If operation1 failed, I do operation2 else operation3.
Repeat the operation4 10 times.
What would be the best language to describe those text files (XML, my own)? What are the key points in such a development?
I hope I'm being clear :)
Thanks a lot for your help and advice!
XML is great for storing relational data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own? E.g:
Lua
Python
In one of my projects I actually started with an XML like language as I already had an XML parser and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely for getting past the problem of tokenizing/parsing text files, so you can concentrate instead on your 'language' and the logic of its operations. The downside is that writing the text files is a little strange and very wordy. It's also very unnatural for a programmer used to C/C++ syntax.
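A minimal sketch of that XML-first approach, in Python with the standard library's ElementTree (the original project's code is not shown, so none of this is the author's implementation). The element and attribute names follow the question's example, but the semantics are assumptions: an action runs `repeat` times, and the truthiness of its last result selects the `<true>` or `<false>` branch. `Action`, `build`, `load`, `run` and the sample recipe are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Action:
    """One node of the in-memory expression tree."""
    name: str
    argument: str
    repeat: int = 1
    on_true: list = field(default_factory=list)
    on_false: list = field(default_factory=list)

def build(elem):
    """Turn an <action> element (and its branches) into an Action node."""
    def branch(tag):
        node = elem.find(tag)
        return [] if node is None else [build(a) for a in node.findall("action")]
    return Action(
        name=elem.get("name"),
        argument=elem.get("argument", ""),
        repeat=int(elem.get("repeat", "1")),
        on_true=branch("true"),
        on_false=branch("false"),
    )

def load(xml_text):
    """Let the XML parser do the tokenizing; keep only our own tree."""
    root = ET.fromstring(xml_text)
    return [build(a) for a in root.findall("action")]

def run(actions, operations):
    """Interpret the tree; operations maps an action name to a callable."""
    for action in actions:
        result = None
        for _ in range(action.repeat):
            result = operations[action.name](action.argument)
        run(action.on_true if result else action.on_false, operations)

RECIPE = """
<recipe>
    <action name="foo" argument="bar" repeat="2"/>
    <action name="check" argument="x">
        <true><action name="bar" argument="yes"/></true>
        <false><action name="bar" argument="no"/></false>
    </action>
</recipe>
"""
```

Because `run` only ever sees `Action` nodes, a later text front end can replace `load` without touching the interpreter.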
Eventually you could easily replace your XML with a full-blown lexer & parser that reads a more 'natural', C++-like text format into your expression tree.
As for writing a lexer & parser, I found it easier to write these by hand, using simple logic flow/loops for the lexer (scanner) and recursive descent for the parser.
That said, ANTLR is great at letting you write out rules for your language and generating the lexer & parser for you. This allows for a much more dynamic language, which can easily change without having to refactor everything again when new things are added. So it might be worth learning, as it would save you much of the time you would otherwise spend on rewrites if you hand-wrote your own.
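To give a feel for the hand-written route, here is a small sketch in Python of a scanner plus recursive-descent parser for a subset of the question's C++-like recipe syntax. The grammar is an assumption made for this example (calls terminated by `;` and `repeat(n) { ... }` blocks only), the scanner is regex-driven for brevity, and all names are hypothetical.

```python
import re

# Grammar (assumed):
#   program   := statement*
#   statement := "repeat" "(" NUMBER ")" block | call ";"
#   block     := "{" statement* "}"
#   call      := IDENT "(" [ arg ("," arg)* ] ")"
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(\S))")

def tokenize(src):
    """Scanner: turn text into (kind, value) pairs."""
    tokens = []
    for number, ident, punct in TOKEN.findall(src):
        if number:
            tokens.append(("num", int(number)))
        elif ident:
            tokens.append(("id", ident))
        else:
            tokens.append(("punct", punct))
    return tokens

class Parser:
    """Recursive descent: one method per grammar rule."""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("eof", None)

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, kind, value=None):
        tok = self.next()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        return tok[1]

    def program(self):
        statements = []
        while self.peek()[0] != "eof":
            statements.append(self.statement())
        return statements

    def statement(self):
        if self.peek() == ("id", "repeat"):
            self.next()
            self.expect("punct", "(")
            count = self.expect("num")
            self.expect("punct", ")")
            return ("repeat", count, self.block())
        call = self.call()
        self.expect("punct", ";")
        return call

    def block(self):
        self.expect("punct", "{")
        statements = []
        while self.peek() != ("punct", "}"):
            statements.append(self.statement())
        self.expect("punct", "}")
        return statements

    def call(self):
        name = self.expect("id")
        self.expect("punct", "(")
        args = []
        if self.peek() != ("punct", ")"):
            args.append(self.arg())
            while self.peek() == ("punct", ","):
                self.next()
                args.append(self.arg())
        self.expect("punct", ")")
        return ("call", name, args)

    def arg(self):
        kind, value = self.next()
        if kind not in ("num", "id"):
            raise SyntaxError(f"unexpected argument {value!r}")
        return value

def parse(src):
    return Parser(tokenize(src)).program()
```

Adding `if`/`else` would mean one more branch in `statement`, which is the kind of change a generator such as ANTLR would instead derive from an updated grammar file.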
I'd recommend writing the app in F#. It has many useful features for parsing strings and XML, like pattern matching and active patterns.
For parsing C-like code I would recommend F# (I just wrote an interpreter in F#, and it works like a charm).
For parsing XML I would recommend C#/F# + the XmlDocument class.
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.
The best prefab answer: S-expressions
C and XML are good first steps. They have sort of opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to stick language features - for example, is the number of times to repeat a loop an attribute, element or text?
S-expressions are more terse than XML for sure, maybe even C. At the same time, they're specific to the task of applying operations to data. They don't admit ambiguity. Parsers are simple and easy to find example code for.
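As a rough indication of how little code an S-expression front end needs, here is a sketch in Python. The recipe vocabulary (`repeat`, `if`, and operation names looked up in a dictionary) is an assumption for this example, as are the function names. Atoms are either integers or bare words, and there is no string or quoting support.

```python
def tokenize(src):
    """Parentheses are the only punctuation, so splitting is enough."""
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Consume tokens from the front of the list and return one expression."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing )")
        tokens.pop(0)  # drop the closing parenthesis
        return items
    if tok == ")":
        raise SyntaxError("unexpected )")
    try:
        return int(tok)
    except ValueError:
        return tok

def run(expr, ops):
    """Evaluate one expression; ops maps operation names to callables."""
    head, *args = expr
    if head == "repeat":
        count, *body = args
        result = None
        for _ in range(count):
            for statement in body:
                result = run(statement, ops)
        return result
    if head == "if":
        condition, then_branch, else_branch = args
        return run(then_branch if run(condition, ops) else else_branch, ops)
    return ops[head](*args)
```

For example, `(repeat 10 (bar toto 10))` parses to the nested list `["repeat", 10, ["bar", "toto", 10]]`, which is already the tree the interpreter walks.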
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.
I'm building a small specialized search engine for price info. The engine will only collect specific segments of data on each site. My plan is to split the process into two steps.
Simple screen scraping based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object and get the full HTML?
Once the HTML is pulled and saved, analyse it via some script and pull out just the segment and values I need (for example the price of a product). My problem is that this script somehow has to be unique for each site I pull, it has to be able to handle really ugly HTML (so I don't think XSLT will do ...) and I need to be able to change it on the fly as the target sites update and change. Finally, I will take the specific values and write them to a database to make them searchable.
Could you please give me some hints on the best way to architect this? Would you do anything differently from what is described above?
Well, I would go with the approach you describe.
1.
How much data is it going to handle? Fetching the full HTML via WebClient / HttpWebRequest should not be a problem.
2.
I would go for HtmlAgilityPack for HTML parsing. It's very forgiving and can handle pretty ugly markup. As HtmlAgilityPack supports XPath, it's pretty easy to have specific XPath selections for individual sites.
I'm on the run and going to expand on this answer asap.
Yes, a WebClient can work well for this. The WebBrowser control will work as well depending on your requirements. If you are going to load the document into a HtmlDocument (the IE HTML DOM) then it might be easier to use the web browser control.
The HtmlDocument object that is now built into .NET can be used to parse the HTML. It is designed to be used with the WebBrowser control, but you can use the implementation from the mshtml DLL as well. I have not used the HtmlAgilityPack, but I hear that it can do a similar job.
The HTML DOM objects will typically handle, and fix up, most ugly HTML that you throw at them. They also give you a nicer way to query the HTML, for example document.GetElementsByTagName to get a collection of tag objects.
As for handling the changing requirements of the site, it sounds like a good candidate for the strategy pattern. You could load the strategies for each site using reflection or something of that sort.
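A minimal sketch of the strategy idea in Python. Instead of reflection it uses a decorator that registers one extraction function per host, which is a stand-in chosen for this example. The host names, the HTML snippets they expect, and every function name are hypothetical.

```python
import re
from urllib.parse import urlparse

# host name -> extraction strategy (a function taking HTML, returning a price or None)
_EXTRACTORS = {}

def extractor_for(host):
    """Decorator that registers a strategy for one site."""
    def register(fn):
        _EXTRACTORS[host] = fn
        return fn
    return register

@extractor_for("shop-a.example")
def shop_a(html):
    match = re.search(r'<span class="price">\s*([\d.]+)', html)
    return match.group(1) if match else None

@extractor_for("shop-b.example")
def shop_b(html):
    match = re.search(r'data-price="([\d.]+)"', html)
    return match.group(1) if match else None

def extract_price(url, html):
    """Pick the strategy by host name and run it."""
    host = urlparse(url).hostname
    strategy = _EXTRACTORS.get(host)
    if strategy is None:
        raise LookupError(f"no extractor registered for {host}")
    return strategy(html)
```

When a target site changes its markup, only that site's function needs to be edited or replaced; the crawling and storage code around `extract_price` stays the same.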
I have worked on a system that uses XML to define a generic set of parameters for extracting text from HTML pages. Basically it would define start and end elements to begin and end extraction. I have found this technique to work well enough for a small sample, but it gets rather cumbersome and difficult to customize as the collection of sites gets larger and larger. Keeping the XML up to date and trying to keep a generic set of XML and code to handle any type of site is difficult. But if the type and number of sites is small then this might work.
One last thing to mention is that you might want to add a cleaning step to your approach. A flexible way to clean up HTML as it comes into the process was invaluable on the code I have worked on in the past. Perhaps implementing a type of pipeline would be a good approach if you think the domain is complex enough to warrant it. But even just a method that runs some regexes over the HTML before you parse it would be valuable: getting rid of images, replacing particular misused tags with nicer HTML, etc. The amount of really dodgy HTML that is out there continues to amaze me...
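A small sketch of such a cleaning pipeline in Python. The individual steps and their regexes are assumptions picked for illustration, not the ones used in the system described above. Each step is a plain function from HTML text to HTML text, so steps can be added, removed or reordered per site.

```python
import re

def strip_comments(html):
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

def strip_scripts_and_styles(html):
    return re.sub(r"<(script|style)\b.*?</\1\s*>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)

def strip_images(html):
    return re.sub(r"<img\b[^>]*>", "", html, flags=re.IGNORECASE)

def collapse_whitespace(html):
    return re.sub(r"\s+", " ", html).strip()

DEFAULT_STEPS = [
    strip_comments,
    strip_scripts_and_styles,
    strip_images,
    collapse_whitespace,
]

def clean(html, steps=DEFAULT_STEPS):
    """Run the HTML through each cleaning step in order."""
    for step in steps:
        html = step(html)
    return html
```

The cleaned text is then handed to whatever parser or extractor comes next.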
What is the best way to build a parser in C# to parse my own language?
Ideally I'd like to provide a grammar, and get Abstract Syntax Trees as an output.
Many thanks,
Nestor
I've had good experience with ANTLR v3. By far the biggest benefit is that it lets you write LL(*) parsers with infinite lookahead - these can be quite suboptimal, but the grammar can be written in the most straightforward and natural way with no need to refactor to work around parser limitations, and parser performance is often not a big deal (I hope you aren't writing a C++ compiler), especially in learning projects.
It also provides pretty good means of constructing meaningful ASTs without need to write any code - for every grammar production, you indicate the "crucial" token or sub-production, and that becomes a tree node. Or you can write a tree production.
Have a look at the following ANTLR grammars (listed here in order of increasing complexity) to get a gist of how it looks and feels
JSON grammar - with tree productions
Lua grammar
C grammar
I've played with Irony. It looks simple and useful.
You could study the source code for the Mono C# compiler.
While it is still in early beta the Oslo Modeling language and MGrammar tools from Microsoft are showing some promise.
I would also take a look at SableCC. It's very easy to create the EBNF grammar. Here is a simple C# calculator example.
There's a short paper here on constructing an LL(1) parser; of course you could use a generator too.
Lex and yacc are still my favorites. Obscure if you're just starting out, but extremely simple, fast, and easy once you've got the lingo down.
You can make it do whatever you want; generate C# code, build other grammars, emulate instructions, whatever.
It's not pretty: it's a text-based format and LALR(1), so your syntax has to accommodate that.
On the plus side, it's everywhere. There are great O'Reilly books about it, lots of sample code, lots of premade grammars, and lots of native language libraries.
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
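HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(html); // html: the page source string you already have
// SelectNodes returns null when nothing matches, so guard before reading Count
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath")?.Count ?? 0;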
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 up and 9 downvotes. I think that maybe people aren't reading the question nor the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex as I stated at the beginning, it's problematic at best.
For dealing with HTML of all shapes and sizes I prefer to use the HTML Agility Pack (http://www.codeplex.com/htmlagilitypack). It lets you write XPath queries against the nodes you want and get them returned in a collection.
Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.
It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML, or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java and SAX - based tool, developed by John Cowan. This is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built in support for HTML). A command line utility is also provided which outputs the well formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for every one of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to
d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())
Parses the string as HTML and/or XML using some inbuilt heuristics to
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace) unless the input has explicit namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
Hope this helped.
Cheers,
Dimitre Novatchev.
I agree with Chris Lively, because HTML is often not very well formed you probably are best off with a regular expression for this.
href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
From here on RegExLib should get you started
You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
Is it possible to work with a document where the content describes html features (i.e. href= could also appear in the document without belonging to an anchor tag)?
What else can you tell us about the document?
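If the answers to those questions are mostly "I don't know", a forgiving tokenizing parser sidesteps them. The sketch below uses Python's standard `html.parser` module purely as an illustration (it is not the .NET approach discussed in this thread). It lowercases tag and attribute names, accepts double-quoted, single-quoted and bare values, and only looks at real anchor tags. The class and function names, and the decision to drop `#` and `javascript:` targets, are assumptions for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags only."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased, whatever the source used
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            if value.startswith("#") or value.lower().startswith("javascript:"):
                continue  # in-page anchors and script pseudo-links
            self.links.append(value)

def anchor_hrefs(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Text that merely mentions `href="..."` is reported to the parser as character data, so it never reaches `handle_starttag`.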
I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser