For all you compiler gurus, I wanna write a recursive descent parser and I wanna do it with just code. No generating lexers and parsers from some other grammar and don't tell me to read the dragon book, i'll come around to that eventually.
I wanna get into the gritty details about implementing a lexer and parser for a reasonable simple language, say CSS. And I wanna do this right.
This will probably end up being a series of questions but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.
I find my self writing code like this (hopefully you can infer the rest from this snippet):
public CssToken ReadNext()
{
int val;
while ((val = _reader.Read()) != -1)
{
var c = (char)val;
switch (_stack.Top)
{
case ParserState.Init:
if (c == ' ')
{
continue; // ignore
}
else if (c == '.')
{
_stack.Transition(ParserState.SubIdent, ParserState.Init);
}
break;
case ParserState.SubIdent:
if (c == '-')
{
_token.Append(c);
}
_stack.Transition(ParserState.SubNMBegin);
break;
What is this called? and how far off am I from something reasonable well understood? I'm trying to balance something which is fair in terms of efficiency and easy to work with, using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.
What I have is an input stream, from which I can read 1 character at a time. I don't do any look a head right now, I just read the character then depending on the current state try to do something with that.
I'd really like to get into the mind set of writing reusable snippets of code. This Transition method is currently means to do that, it will pop the current state of the stack and then push the arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init) it will "call" a sub routine SubIdent which will, when complete, return to the Init state.
The parser will be implemented in much the same way, currently, having everything in a single big method like this allows me to easily return a token when I found one, but it also forces me to keep everything in one single big method. Is there a nice way to split these tokenization rules into separate methods?
What you're writing is called a pushdown automaton. This is usually more power than you need to write a lexer, it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and to understand. Most parser generators generate pushdown automatons.
Lexers are almost always written as finite state machines, i.e., like your code except get rid of the "stack" object. Finite state machines are closely related to regular expressions (actually, they're provably equivalent to one another). When designing such a parser, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.
There are tools to do this. The lex tool and its descendants are well known and have been translated into many languages. The ANTLR toolchain also has a lexer component. My preferred tool is ragel on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.
If you do want to write your own lexer by hand, good ones often look something like this:
function readToken() // note: returns only one token each time
while !eof
c = peekChar()
if c in A-Za-z
return readIdentifier()
else if c in 0-9
return readInteger()
else if c in ' \n\r\t\v\f'
nextChar()
...
return EOF
function readIdentifier()
ident = ""
while !eof
c = nextChar()
if c in A-Za-z0-9
ident.append(c)
else
return Token(Identifier, ident)
// or maybe...
return Identifier(ident)
Then you can write your parser as a recursive descent parser. Don't try to combine lexer / parser stages into one, it leads to a total mess of code. (According to the Parsec author, it's slower, too).
You need to write your own Recursive Descent Parser from your BNF/EBNF. I had to write my own recently and this page was a lot of help. I'm not sure what you mean by "with just code". Do you mean you want to know how to write your own recursive parser?
If you want to do that, you need to have your grammar in place first. Once you have your EBNF/BNF in place, the parser can be written quite easily from it.
The first thing I did when I wrote my parser, was to read everything in and then tokenize the text. So I essentially ended up with an array of tokens that I treated as a stack. To reduce the verbosity/overhead of pulling a value off a stack and then pushing it back on if you don't require it, you can have a peek method that simply returns the top value on the stack without popping it.
UPDATE
Based on your comment, I had to write a recursive-descent parser in Javascript from scratch. You can take a look at the parser here. Just search for the constraints function. I wrote my own tokenize function to tokenize the input as well. I also wrote another convenience function (peek, that I mentioned before). The parser parses according to the EBNF here.
This took me a little while to figure out because it's been years since I wrote a parser (last time I wrote it was in school!), but trust me, once you get it, you get it. I hope my example gets your further along on your way.
ANOTHER UPDATE
I also realized that my example may not be what you want because you might be going towards using a shift-reduce parser. You mentioned that right now you are trying to write a tokenizer. In my case, I did write my own tokenizer in Javascript. It's probably not robust, but it was sufficient for my needs.
function tokenize(options) {
var str = options.str;
var delimiters = options.delimiters.split("");
var returnDelimiters = options.returnDelimiters || false;
var returnEmptyTokens = options.returnEmptyTokens || false;
var tokens = new Array();
var lastTokenIndex = 0;
for(var i = 0; i < str.length; i++) {
if(exists(delimiters, str[i])) {
var token = str.substring(lastTokenIndex, i);
if(token.length == 0) {
if(returnEmptyTokens) {
tokens.push(token);
}
}
else {
tokens.push(token);
}
if(returnDelimiters) {
tokens.push(str[i]);
}
lastTokenIndex = i + 1;
}
}
if(lastTokenIndex < str.length) {
var token = str.substring(lastTokenIndex, str.length);
token = token.replace(/^\s+/, "").replace(/\s+$/, "");
if(token.length == 0) {
if(returnEmptyTokens) {
tokens.push(token);
}
}
else {
tokens.push(token);
}
}
return tokens;
}
Based on your code, it looks like you are reading, tokenizing, and parsing at the same time - I'm assuming that's what a shift-reduce parser does? The flow for what I have is tokenize first to build the stack of tokens, and then send the tokens through the recursive-descent parser.
If you are going to hand code everything from scratch I would definately consider going with a recursive decent parser. In your post you are not really saying what you will be doing with the token stream once you have parsed the source.
Some things I would recommend getting a handle on
1. Good design for your scanner/lexer, this is what will be tokenizing your source code for your parser.
2. The next thing is the parser, if you have a good ebnf for the source language the parser can usually translate quite nicely into a recursive decent parser.
3. Another data structure you will really need to get your head around is the symbol table. This can be as simple as a hashtable or as complex as a tree structure that can represent complex record structures etc. I think for CSS you might be somewhere between the two.
4. And finally you want to deal with code generation. You have many options here. For an interpreter, you might simply interpret on the fly as you parse the code. A better approach might be to generate a for of i-code that you can then write an iterpreter for, and later even a compiler. Of course for the .NET platform you could directly generate IL (probably not applicable for CSS :))
For references, I gather you are not heavy into the deep theory and I do not blame you. A really good starting point for getting the basics without complex, code if you do not mind the Pascal that is, is Jack Crenshaw's 'Let's build a compiler'
http://compilers.iecc.com/crenshaw/
Good luck I am sure you are going to enjoy this project.
It looks like you want to implement a "shift-reduce" parser, where you explicitly build a token stack. The usual alternative is a "recursive descent" parser, in which depth of procedure calls build the same token stack with their own local variables, on the actual hardware stack.
In shift-reduce, the term "reduce" refers to the operation performed on the explicitly-maintained token stack. For example, if the top of the stack has become Term, Operator, Term then a reduction rule can be applied resulting in Expression as a replacement for the pattern. The reduction rules are explicitly encoded in a data structure used by the shift-reduce parser; as a result, all reduction rules can be found in the same spot of the source code.
The shift-reduce approach brings a few benefits compared to recursive-descent. On a subjective level, my opinion is that shift-reduce is easier to read and maintain than recursive-descent. More objectively, shift-reduce allows for more informative error messages from the parser when an unexpected token occurs.
Specifically, because the shift-reduce parser has an explicit encoding of rules for making "reductions," the parser is easily extended to articulate what sorts of tokens could legally have followed. (e.g., "; expected"). A recursive descent implementation cannot easily be extended to do the same thing.
A great book on both kinds of parser, and the trade-offs in implementing different kinds of shift-reduce is "Introduction to Compiler Construction", by Thomas W. Parsons.
Shift-reduce is sometimes called "bottom-up" parsing and recursive-descent is sometimes called "top-down" parsing. In the analogy used, nodes composed with highest precedence (e.g., "factors" in multiplication expression) are considered to be "at the bottom" of the parsing. This is in accord with the same analogy used in "descent" of "recursive descent".
If you want to use the parser to also handle not-well-formed expressions, you really want a recursive descent parser. Much easier to get the error handling and reporting usable.
For literature, I'd recommend some of the old work of Niklaus Wirth. He knows how to write. Algorithms + Data Structures = Programs is what I used, but you can find his Compiler Construction online.
Related
So right now in my lexer I'm trying to skip certain tokens like comments and whitespace, except I need to add them to my "skipped list", rather than hiding them altogether.
In my Scanner frame I have
public int Skip(int sym) {
Token t = _InitToken();
t.SymbolId=sym;
t.Line = Current.Line;
t.Column =Current.Column;
t.Position=Current.Position;
t.Value = yytext;
t.Skipped = null;
_skipped.Add(t);
return yylex();
}
keep in mind this is c# but the interface isn't much different than the C one in lex/flex
I then use this function above in my scanner like so:
"/*" { if(!_TryReadUntilBlockEnd("*/")) return -1; return Skip(478); }
\/\/[^\n]* { return Skip(477); }
where 477 is my symbol id (the lex file is generated hence the lack of constants)
All _TryReadUntilBlockEnd("*/") does is read until it finds a trailing */, consuming it. It's a well tested method and can be ignored for the purposes of this question, except as explanation for how i match the end of a comment. This takes over the underlying input from gplex and handles advancing the underlying input stream itself (like fget() or whatever in C i forget). Basically it's neutral here other than reading the entire comment. Skip(478) is the relevant bit, not this.
It works fine in many cases. The only problem is I'm using it in a recursive descent parser that's parsing C#, and so the stack gets heavy, and when i have a huge stream of line comments it stack overflows.
I can solve it by finding some way to run a match without invoking a lex action again instead of calling yylex() if it's possible - that way i can rewrite it to be iterative, but i have no idea how, and what I've seen from the generated code suggests it's not possible.
The other way I can solve it - and this is my preferred way - is to match multiple C# line comments in one match. That way I only recurse once.
But this is multiline match expression which is disabled by default i think?
How do i enable multiline matching in either flex, lex or *gplex? Or is there another solution to the above problem? *gplex 1.2.2 preferred but it's completely undocumented
I'll take anything at this point. Thanks in advance!
I shouldn't have been calling yylex() at all. Thanks Jonathan and rici in the comments.
I'm trying to implement syntax highlighting in C# on Android, using Xamarin. I'm using the ANTLR v4 library for C# to achieve this. My code, which is currently syntax highlighting Java with this grammar, does not attempt to build a parse tree and use the visitor pattern. Instead, I simply convert the input into a list of tokens:
private static IList<IToken> Tokenize(string text)
{
var inputStream = new AntlrInputStream(text);
var lexer = new JavaLexer(inputStream);
var tokenStream = new CommonTokenStream(lexer);
tokenStream.Fill();
return tokenStream.GetTokens();
}
Then I loop through all of the tokens in the highlighter and assign a color to them based on their kind.
public void HighlightAll(IList<IToken> tokens)
{
int tokenCount = tokens.Count;
for (int i = 0; i < tokenCount; i++)
{
var token = tokens[i];
var kind = GetSyntaxKind(token);
HighlightNext(token, kind);
if (kind == SyntaxKind.Annotation)
{
var nextToken = tokens[++i];
Debug.Assert(token.Text == "#" && nextToken.Type == Identifier);
HighlightNext(nextToken, SyntaxKind.Annotation);
}
}
}
public void HighlightNext(IToken token, SyntaxKind tokenKind)
{
int count = token.Text.Length;
if (token.Type != -1)
{
_text.SetSpan(_styler.GetSpan(tokenKind), _index, _index + count, SpanTypes.InclusiveExclusive);
_index += count;
}
}
Initially, I figured this was wise because syntax highlighting is largely context-independent. However, I have already found myself needing to special-case identifiers in front of #, since I want those to get highlighted as annotations just as on GitHub (example). GitHub has further examples of coloring identifiers in certain contexts: here, List and ArrayList are colored, while mItems is not. I will likely have to add further code to highlight identifiers in those scenarios.
My question is, is it a good idea to examine tokens rather than a parse tree here? On one hand, I'm worried that I might have to end up doing a lot of special-casing for when a token's neighbors alter how it should be highlighted. On the other, parsing will add additional overhead for memory-constrained mobile devices, and make it more complicated to implement efficient syntax highlighting (e.g. not re-tokenizing/parsing everything) when the user edits text in the code editor. I also found it significantly less complicated to handle all of the token types rather than the parser rule types, because you just switch on token.Type rather than overriding a bunch of Visit* methods.
For reference, the full code of the syntax highlighter is available here.
It depends on what you are syntax highlighting.
If you use a naive parser, then any syntax error in the text will cause highlighting to fail. That makes it quite a fragile solution since a lot of the texts you might want to syntax highlight are not guaranteed to be correct (particularly​ user input, which at best will not be correct until it is fully typed). Since syntax highlighting can help make syntax errors visible and is often used for that purpose, failing completely on syntax errors is counter-productive.
Text with errors does not readily fit into a syntax tree. But it does have more structure than a stream of tokens. Probably the most accurate representation would be a forest of subtree fragments, but that is an even more awkward data structure to work with than a tree.
Whatever the solution you choose, you will end up negotiating between conflicting goals: complexity vs. accuracy vs. speed vs. usability. A parser may be part of the solution, but so may ad hoc pattern matching.
Your approach is totally fine and pretty much what everybody's using. And it's totally normal to fine tune type matching by looking around (and it's cheap since the token types are cached). So you can always just look back or ahead in the token stream if you need to adjust actually used SyntaxKind. Don't start parsing your input. It won't help you.
I ended up choosing to use a parser because there were too many ad hoc rules. For example, although I wanted to color regular identifiers white, I wanted types in type declarations (e.g. C in class C) to be green. There ended up being about 20 of these special rules in total. Also, the added overhead of parsing turned out to be miniscule compared to other bottlenecks in my app.
For those interested, you can view my code here: https://github.com/jamesqo/Repository/blob/e5d5653093861bc35f4c0ac71ad6e27265e656f3/Repository.EditorServices/Internal/Java/Highlighting/JavaSyntaxHighlighter.VisitMethods.cs#L19-L76. I've highlighted all of the ~20 special rules I've had to make.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
How do I go about writing a Parser (Recursive Descent?) in C#? For now I just want a simple parser that parses arithmetic expressions (and reads variables?). Though later I intend to write an xml and html parser (for learning purposes). I am doing this because of the wide range of stuff in which parsers are useful: Web development, Programming Language Interpreters, Inhouse Tools, Gaming Engines, Map and Tile Editors, etc. So what is the basic theory of writing parsers and how do I implement one in C#? Is C# the right language for parsers (I once wrote a simple arithmetic parser in C++ and it was efficient. Will JIT compilation prove equally good?). Any helpful resources and articles. And best of all, code examples (or links to code examples).
Note: Out of curiosity, has anyone answering this question ever implemented a parser in C#?
I have implemented several parsers in C# - hand-written and tool generated.
A very good introductory tutorial on parsing in general is Let's Build a Compiler - it demonstrates how to build a recursive descent parser; and the concepts are easily translated from his language (I think it was Pascal) to C# for any competent developer. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.
You should look into some tools to generate the code for you - if you are determined to write a classical recursive descent parser (TinyPG, Coco/R, Irony). Keep in mind that there are other ways to write parsers now, that usually perform better - and have easier definitions (e.g. TDOP parsing or Monadic Parsing).
On the topic of whether C# is up for the task - C# has some of the best text libraries out there. A lot of the parsers today (in other languages) have an obscene amount of code to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious - however you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though its written in F#) and its performance is just shy of Google V8.
Side Note: Markup parsers are completely different beasts when compared to language parsers - they are, in the majority of the cases, written by hand - and at the scanner/parser level very simple; they are not usually recursive descent - and especially in the case of XML it is better if you don't write a recursive descent parser (to avoid stack overflows, and because a 'flat' parser can be used in SAX/push mode).
Sprache is a powerful yet lightweight framework for writing parsers in .NET. There is also a Sprache NuGet package. To give you an idea of the framework here is one of the samples that can parse a simple arithmetic expression into an .NET expression tree. Pretty amazing I would say.
using System;
using System.Linq.Expressions;
using Sprache;
namespace LinqyCalculator
{
static class ExpressionParser
{
public static Expression<Func<decimal>> ParseExpression(string text)
{
return Lambda.Parse(text);
}
static Parser<ExpressionType> Operator(string op, ExpressionType opType)
{
return Parse.String(op).Token().Return(opType);
}
static readonly Parser<ExpressionType> Add = Operator("+", ExpressionType.AddChecked);
static readonly Parser<ExpressionType> Subtract = Operator("-", ExpressionType.SubtractChecked);
static readonly Parser<ExpressionType> Multiply = Operator("*", ExpressionType.MultiplyChecked);
static readonly Parser<ExpressionType> Divide = Operator("/", ExpressionType.Divide);
static readonly Parser<Expression> Constant =
(from d in Parse.Decimal.Token()
select (Expression)Expression.Constant(decimal.Parse(d))).Named("number");
static readonly Parser<Expression> Factor =
((from lparen in Parse.Char('(')
from expr in Parse.Ref(() => Expr)
from rparen in Parse.Char(')')
select expr).Named("expression")
.XOr(Constant)).Token();
static readonly Parser<Expression> Term = Parse.ChainOperator(Multiply.Or(Divide), Factor, Expression.MakeBinary);
static readonly Parser<Expression> Expr = Parse.ChainOperator(Add.Or(Subtract), Term, Expression.MakeBinary);
static readonly Parser<Expression<Func<decimal>>> Lambda =
Expr.End().Select(body => Expression.Lambda<Func<decimal>>(body));
}
}
C# is almost a decent functional language, so it is not such a big deal to implement something like Parsec in it. Here is one of the examples of how to do it: http://jparsec.codehaus.org/NParsec+Tutorial
It is also possible to implement a combinator-based Packrat, in a very similar way, but this time keeping a global parsing state somewhere instead of doing a pure functional stuff. In my (very basic and ad hoc) implementation it was reasonably fast, but of course a code generator like this must perform better.
I know that I am a little late, but I just published a parser/grammar/AST generator library named Ve Parser. you can find it at http://veparser.codeplex.com or add to your project by typing 'Install-Package veparser' in Package Manager Console. This library is kind of Recursive Descent Parser that is intended to be easy to use and flexible. As its source is available to you, you can learn from its source codes. I hope it helps.
In my opinion, there is a better way to implement parsers than the traditional methods that results in simpler and easier to understand code, and especially makes it easier to extend whatever language you are parsing by just plugging in a new class in a very object-oriented way. One article of a larger series that I wrote focuses on this parsing method, and full source code is included for a C# 2.0 parser:
http://www.codeproject.com/Articles/492466/Object-Oriented-Parsing-Breaking-With-Tradition-Pa
Well... where to start with this one....
First off, writing a parser, well that's a very broad statement especially with the question your asking.
Your opening statement was that you wanted a simple arithmatic "parser" , well technically that's not a parser, it's a lexical analyzer, similar to what you may use for creating a new language. ( http://en.wikipedia.org/wiki/Lexical_analysis ) I understand however exactly where the confusion of them being the same thing may come from. It's important to note, that Lexical analysis is ALSO what you'll want to understand if your going to write language/script parsers too, this is strictly not parsing because you are interpreting the instructions as opposed to making use of them.
Back to the parsing question....
This is what you'll be doing if your taking a rigidly defined file structure to extract information from it.
In general you really don't have to write a parser for XML / HTML, beacuse there are already a ton of them around, and more so if your parsing XML produced by the .NET run time, then you don't even need to parse, you just need to "serialise" and "de-serialise".
In the interests of learning however, parsing XML (Or anything similar like html) is very straight forward in most cases.
if we start with the following XML:
<movies>
<movie id="1">
<name>Tron</name>
</movie>
<movie id="2">
<name>Tron Legacy</name>
</movie>
<movies>
we can load the data into an XElement as follows:
XElement myXML = XElement.Load("mymovies.xml");
you can then get at the 'movies' root element using 'myXML.Root'
MOre interesting however, you can use Linq easily to get the nested tags:
var myElements = from p in myXML.Root.Elements("movie")
select p;
Will give you a var of XElements each containing one '...' which you can get at using somthing like:
foreach(var v in myElements)
{
Console.WriteLine(string.Format("ID {0} = {1}",(int)v.Attributes["id"],(string)v.Element("movie"));
}
For anything else other than XML like data structures, then I'm afraid your going to have to start learning the art of regular expressions, a tool like "Regular Expression Coach" will help you imensly ( http://weitz.de/regex-coach/ ) or one of the more uptodate similar tools.
You'll also need to become familiar with the .NET regular expression objects, ( http://www.codeproject.com/KB/dotnet/regextutorial.aspx ) should give you a good head start.
Once you know how your reg-ex stuff works then in most cases it's a simple case case of reading in the files one line at a time and making sense of them using which ever method you feel comfortable with.
A good free source of file formats for almost anything you can imagine can be found at ( http://www.wotsit.org/ )
For the record I implemented parser generator in C# just because I couldn't find any working properly or similar to YACC (see: http://sourceforge.net/projects/naivelangtools/).
However after some experience with ANTLR I decided to go with LALR instead of LL. I know that theoretically LL is easier to implement (generator or parser) but I simply cannot live with stack of expressions just to express priorities of operators (like * goes before + in "2+5*3"). In LL you say that mult_expr is embedded inside add_expr which does not seem natural for me.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
I need a fast runtime expression parser
How do I make it that when someone types in x*y^z in a textbox on my page to calculate that equation in the code behind and get the result?
.NET does not have a built-in function for evaluating arbitrary strings. However, an open source .NET library named NCalc does.
NCalc is a mathematical expressions evaluator in .NET. NCalc can parse
any expression and evaluate the result, including static or dynamic
parameters and custom functions.
Answer from operators as strings by user https://stackoverflow.com/users/1670022/matt-crouch, using built-in .NET functionality:
"If all you need is simple arithmetic, do this.
DataTable temp = new DataTable();
Console.WriteLine(temp.Compute("15 / 3",string.Empty));
EDIT: a little more information. Check out the MSDN documentation for the Expression property of the System.Data.DataColumn class. The stuff on "Expression Syntax" outlines a list of commands you can use in addition to the arithmetic operators. (ex. IIF, LEN, etc.)."
EDIT 2: For convenience, you can put this into a little function like:
public string Eval(string expr)
{
var temp = new System.Data.DataTable();
string result = null;
try
{
result = $"{temp.Compute(expr, string.Empty)}";
}
catch (System.Data.EvaluateException ex)
{
if (ex.Message.ToLower().Contains("cannot find column"))
throw new System.Data.SyntaxErrorException($"Syntax error: Invalid expression: '{expr}'."
+ " Variables as operands are not supported.");
else
throw;
}
return result;
}
So you can use it like:
Console.WriteLine(Eval("15 * (3 + 5) / (7 - 2)"));
giving the expected output:
24
Note that the error handler helps to handle exceptions caused by using variables which are not allowed here. Example: Eval("a") - Instead of returning "Cannot find column [a]", which doesn't make much sense in this context (we're not using it in a database context) it is returning "Syntax error: Invalid expression: 'a'. Variables as operands are not supported."
Run it on DotNetFiddle
There are two main approaches to this problem, each with some variations, as illustrated in the variety of answers.
Option A: Find an existing mathematical expresssion evaluator
Option B: Write your own parser and the logic to compute the result
Before going into some details about this, it is appropriate to stress that interpreting arbitrary mathematical expressions is not a trivial task, for any expression grammar other than "toy" grammars such as these that only accept one or two arithmetic operations and do not allow parenthesis etc.
Understanding that such task is deceivingly trivial, and acknowledging that, after all, interpreting arithmetic expressions of average complexity is a relatively recurrent need for various applications [hence one for which mature solutions should be available], it is probably wise to try and make do with "Option A".
I'd therefore second Jed's recommendation of a ready-make expression evaluator such as NCalc.
It may be useful however to take the time and understand the various concepts and methods associated with parsing and interpreting arithmetic expressions, as if one were going to whip-up one's own implementation.
The key concept is that of a formal grammar. The arithmetic expressions which the evaluator will accept must follow a set of rules such as the list of arithmetic operations allowed. For example will the evaluator support, say, trigonometric functions, or if it does, will this also include say atan2(). The rules also indicate what consitutes an operand, for example will it be allowed to input numerical values as big as say 45 digits. etc. The point is that all these rules are formalized in a grammar.
Typically a grammar works on tokens which have previously been extracted from the raw input text. Essentially at some time in the process, some logic needs to analyze the input string, character by character, and determine which sequences of characters go together. For example in the 123 + 45 / 9.3 expression, the tokens are the integer value 123, the plus operator, the integer value 45, the division operator and finally the 9.3 real value. The task of identifying the tokens and associating them with a token type is the job a lexer. Lexers can be build themselves on a grammar (a grammar which "tokens" are single characters, as opposed to the grammar for the arithmetic expression parser which tokens are short strings produced by the lexer.)
BTW, grammars are used to define many other things beyond arithmetic expressions. Computer languages follow [rather sophiticated] grammars, but it is relatively common to introduce Domain Specific Languages aka DSLs in support of various features of computer applications.
For very simple grammars, one may be able to write the corresponding lexer and parser from scratch. But sooner than later the grammars may get complicated to the point that hand-writing these modules becomes fastidious, bug-prone and maybe more importantly difficult to read. Hence the existence of Lexer and Parser Generators which are stand-alone programs that produce the code of lexers and parsers (in a particular programming language such as C, Java or C#) from a list of rules (expressed in a syntax particular to the generator, though many generators tend to use similar syntaxes, loosely base on BNF).
When using such a lexer/parser generator, work in done in multiple steps:
- first one writes a definition of the grammar (in the generator-specific language/syntax)
- one runs this grammar through the generator.
- one often repeats the above two steps multiple times, because writing a grammar is an exacting exercise: the generator will complain of many possible ambiguities one may write into the grammar.
- eventually the generator produces a source file (in the desired target language such as C# etc.)
- this source is included in the overall project
- other source files in the project may invoke the functions exposed in the source files produced by the generator and/or some logic corresponding to various patterns identified during parsing may readily be may imbedded in the generator produced code.
- the project can then be build as usual, i.e. as if the parser and lexer had be hand-written.
And that's about it for a 20,000 feet high presentation of the process of working with formal grammars and code generators.
A list of parser-generators (aka compiler-compilers) can be found at this link. For simple work in C# I also want to mention Irony. It may be very insightful to peruse these sites, to get a better feel for these concept, even without the intent of becoming a practitioner at this time.
As said, I wish to stress that for this particular application, a ready-made arithmetic evaluator is likely the better approach. The main downside of these would be
some limitations as to what the allowed expression syntax is (either the grammar allowed is too restrictive: you also need say stddev() or is too broad: you don't want your users to use trig functions. With the more mature evaluators, there will be some form of configuration/extension feature which allows dealing with this problem.
the learning curve of such a 3rd party module. Hopefully many of them should be relatively "plug-and-play".
solved with this library http://www.codeproject.com/Articles/21137/Inside-the-Mathematical-Expressions-Evaluator
my final code
Calculator Cal = new Calculator();
txt_LambdaNoot.Text = (Cal.Evaluate(txt_C.Text) / fo).ToString();
now when some one type 3*10^11 he will get 300000000000
You will need to implement (or find a third-party source) an expression parser. This is not a trivial thing to do.
What you need - if you want to do it yourself - is a Scanner (also known as Lexer) + Parser in the code behind which interprets the expression. Alternatively, you can find a 3rd party library which does the job and works similar as the JavaScript eval(string) function does.
Please take a look here, it describes an recursive descent parser. The example is written in C, but you should be able to adapt it to C# easily once you got the idea described in the article.
It is less complicated than it sounds, especially if you have a limited amount of operators to support.
The advantage is that you keep full control on what expressions will be executed (to prevent malicious code injections by the end-user of your website).
Below is a code in C# for rss reader, why this code is bad? this class generates a list of the 5 most recent posts, sorted by title. What do you use to analyze code in C#?
static Story[] Parse(string content)
{
var items = new List<string>();
int start = 0;
while (true)
{
var nextItemStart = content.IndexOf("<item>", start);
var nextItemEnd = content.IndexOf("</item>", nextItemStart);
if (nextItemStart < 0 || nextItemEnd < 0) break;
String nextItem = content.Substring(nextItemStart, nextItemEnd + 7 - nextItemStart);
items.Add(nextItem);
start = nextItemEnd;
}
var stories = new List<Story>();
for (byte i = 0; i < items.Count; i++)
{
stories.Add(new Story()
{
title = Regex.Match(items[i], "(?<=<title>).*(?=</title>)").Value,
link = Regex.Match(items[i], "(?<=<link>).*(?=</link>)").Value,
date = Regex.Match(items[i], "(?<=<pubDate>).*(?=</pubdate>)").Value
});
}
return stories.ToArray();
}
why don't use XmlReader or XmlDocument or LINQ to Xml?
It is bad because it's using string parsing when there are excellent classes in the framework for parsing XML. Even better, there are classes to deal with RSS feeds.
ETA:
Sorry to not have answered your second question earlier. There are a great number of tools to analyze correctness and quality of C# code. There's probably a huge list compiled somewhere, but here's a few I use on a daily basis to help ensure quality code:
StyleCop (code formatting standards)
Resharper (idiomatic programming, gotcha catching)
FxCop (code correctness, standards adherence, idiomatic programming)
Pex (white box testing)
Nitriq (code quality metrics)
NUnit (unit testing)
You shouldn't parse XML with string functions and regular expressions. XML can get very complicated and be formatted many ways that a real XML parser like XmlReader can handle, but will break your simple string parsing code.
Basically: don't try and reinvent the wheel (an xml parser), especially when you don't realize how complicated that wheel actually is.
I think the worst thing for the code is the performance issue. You should parse the xml string into a XDocument(or similar structure) instead of parse it again and again using regex.
simply because you are reinventing a xml parser , use Linq to xml instead, it is very simple and clean.I am sure that you can do all the above in three lines of code if you use Linq to XML , your code uses a lot of magic numbers (ex: 7-n ..), the thing that make it unstable and non generic
For starters, it uses byte as a indexer instead of int (what if items has more items in it than a byte can represent?). It doesn't use idiomatic C# (see user1645569's response). It also unnecessarily uses var instead of specific data types (that's more stylistic though, but for me I prefer not to, hence by my metric it's not ideal (and you've given no other metric)).
Let me clarify what I am saying about "unnecessarily using var": var in and of itself is not bad, and I am not suggesting that. I am (mostly) suggesting the usage here isn't very consistent. For example, explicitly declaring start as an int, but then declaring nextItemEnd as var (which will deduce to int) and assigning nextItemEnd to start seems (to me) like a weird mixture between wanting to automatically deduce a variable's type and explicitly declaring it. I think it's good that var was not used in start's declaration (because then it's not exactly clear if the intent is an integer or a floating point number), but I (personally) don't think it helps any to declare nextItemStart and nextItemEnd as var. I tend to prefer to use var for more complex/longer data types (similar to how I use auto in C++ for iterators, but not for "simpler" data types).