Why is this code bad? - C#

Below is some C# code for an RSS reader. Why is this code bad? This class generates a list of the 5 most recent posts, sorted by title. What do you use to analyze code in C#?
static Story[] Parse(string content)
{
    var items = new List<string>();
    int start = 0;
    while (true)
    {
        var nextItemStart = content.IndexOf("<item>", start);
        var nextItemEnd = content.IndexOf("</item>", nextItemStart);
        if (nextItemStart < 0 || nextItemEnd < 0) break;
        String nextItem = content.Substring(nextItemStart, nextItemEnd + 7 - nextItemStart);
        items.Add(nextItem);
        start = nextItemEnd;
    }
    var stories = new List<Story>();
    for (byte i = 0; i < items.Count; i++)
    {
        stories.Add(new Story()
        {
            title = Regex.Match(items[i], "(?<=<title>).*(?=</title>)").Value,
            link = Regex.Match(items[i], "(?<=<link>).*(?=</link>)").Value,
            date = Regex.Match(items[i], "(?<=<pubDate>).*(?=</pubdate>)").Value
        });
    }
    return stories.ToArray();
}

Why not use XmlReader, XmlDocument, or LINQ to XML?

It is bad because it's using string parsing when there are excellent classes in the framework for parsing XML. Even better, there are classes to deal with RSS feeds.
ETA:
Sorry not to have answered your second question earlier. There are a great number of tools to analyze correctness and quality of C# code. There's probably a huge list compiled somewhere, but here are a few I use on a daily basis to help ensure quality code:
StyleCop (code formatting standards)
Resharper (idiomatic programming, gotcha catching)
FxCop (code correctness, standards adherence, idiomatic programming)
Pex (white box testing)
Nitriq (code quality metrics)
NUnit (unit testing)

You shouldn't parse XML with string functions and regular expressions. XML can get very complicated and be formatted in many ways that a real XML parser like XmlReader will handle, but that will break your simple string-parsing code.
Basically: don't try to reinvent the wheel (an XML parser), especially when you don't realize how complicated that wheel actually is.

I think the worst thing about the code is the performance issue. You should parse the XML string into an XDocument (or similar structure) instead of parsing it again and again with regular expressions.

Simply put, you are reinventing an XML parser. Use LINQ to XML instead; it is very simple and clean, and I am sure you can do all of the above in three lines of code with it. Your code also uses magic numbers (e.g. the 7 added to nextItemEnd), which makes it fragile and non-generic.
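For example, a minimal LINQ to XML version might look something like this (a sketch only; it assumes the same Story class with the lowercase title, link, and date string fields from the question, and that content is well-formed XML):

using System.Linq;
using System.Xml.Linq;

static Story[] Parse(string content)
{
    var doc = XDocument.Parse(content);
    return doc.Descendants("item")
        .Select(item => new Story
        {
            // The cast to string returns null if the element is missing,
            // instead of silently producing an empty match like the regex version.
            title = (string)item.Element("title"),
            link = (string)item.Element("link"),
            date = (string)item.Element("pubDate")
        })
        .ToArray();
}

And as mentioned above, if you only need to consume RSS, the framework's syndication classes can do the feed-specific work for you.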

For starters, it uses byte as an indexer instead of int (what if items has more items in it than a byte can represent?). It doesn't use idiomatic C# (see user1645569's response). It also unnecessarily uses var instead of specific data types (that's more stylistic, though; I prefer not to, hence by my metric it's not ideal (and you've given no other metric)).
Let me clarify what I am saying about "unnecessarily using var": var in and of itself is not bad, and I am not suggesting that. I am (mostly) suggesting the usage here isn't very consistent. For example, explicitly declaring start as an int, but then declaring nextItemEnd as var (which will deduce to int) and assigning nextItemEnd to start seems (to me) like a weird mixture between wanting to automatically deduce a variable's type and explicitly declaring it. I think it's good that var was not used in start's declaration (because then it's not exactly clear if the intent is an integer or a floating point number), but I (personally) don't think it helps any to declare nextItemStart and nextItemEnd as var. I tend to prefer to use var for more complex/longer data types (similar to how I use auto in C++ for iterators, but not for "simpler" data types).

Related

Is it advisable to use tokens for the purpose of syntax highlighting?

I'm trying to implement syntax highlighting in C# on Android, using Xamarin. I'm using the ANTLR v4 library for C# to achieve this. My code, which is currently syntax highlighting Java with this grammar, does not attempt to build a parse tree and use the visitor pattern. Instead, I simply convert the input into a list of tokens:
private static IList<IToken> Tokenize(string text)
{
    var inputStream = new AntlrInputStream(text);
    var lexer = new JavaLexer(inputStream);
    var tokenStream = new CommonTokenStream(lexer);
    tokenStream.Fill();
    return tokenStream.GetTokens();
}
Then I loop through all of the tokens in the highlighter and assign a color to them based on their kind.
public void HighlightAll(IList<IToken> tokens)
{
    int tokenCount = tokens.Count;
    for (int i = 0; i < tokenCount; i++)
    {
        var token = tokens[i];
        var kind = GetSyntaxKind(token);
        HighlightNext(token, kind);
        if (kind == SyntaxKind.Annotation)
        {
            var nextToken = tokens[++i];
            Debug.Assert(token.Text == "@" && nextToken.Type == Identifier);
            HighlightNext(nextToken, SyntaxKind.Annotation);
        }
    }
}

public void HighlightNext(IToken token, SyntaxKind tokenKind)
{
    int count = token.Text.Length;
    if (token.Type != -1)
    {
        _text.SetSpan(_styler.GetSpan(tokenKind), _index, _index + count, SpanTypes.InclusiveExclusive);
        _index += count;
    }
}
Initially, I figured this was wise because syntax highlighting is largely context-independent. However, I have already found myself needing to special-case identifiers preceded by @, since I want those to get highlighted as annotations just as on GitHub (example). GitHub has further examples of coloring identifiers in certain contexts: here, List and ArrayList are colored, while mItems is not. I will likely have to add further code to highlight identifiers in those scenarios.
My question is, is it a good idea to examine tokens rather than a parse tree here? On one hand, I'm worried that I might have to end up doing a lot of special-casing for when a token's neighbors alter how it should be highlighted. On the other, parsing will add additional overhead for memory-constrained mobile devices, and make it more complicated to implement efficient syntax highlighting (e.g. not re-tokenizing/parsing everything) when the user edits text in the code editor. I also found it significantly less complicated to handle all of the token types rather than the parser rule types, because you just switch on token.Type rather than overriding a bunch of Visit* methods.
For reference, the full code of the syntax highlighter is available here.
It depends on what you are syntax highlighting.
If you use a naive parser, then any syntax error in the text will cause highlighting to fail. That makes it quite a fragile solution since a lot of the texts you might want to syntax highlight are not guaranteed to be correct (particularly user input, which at best will not be correct until it is fully typed). Since syntax highlighting can help make syntax errors visible and is often used for that purpose, failing completely on syntax errors is counter-productive.
Text with errors does not readily fit into a syntax tree. But it does have more structure than a stream of tokens. Probably the most accurate representation would be a forest of subtree fragments, but that is an even more awkward data structure to work with than a tree.
Whatever the solution you choose, you will end up negotiating between conflicting goals: complexity vs. accuracy vs. speed vs. usability. A parser may be part of the solution, but so may ad hoc pattern matching.
Your approach is totally fine and pretty much what everybody's using. And it's totally normal to fine-tune type matching by looking around (and it's cheap since the token types are cached). So you can always just look back or ahead in the token stream if you need to adjust the SyntaxKind you actually use. Don't start parsing your input. It won't help you.
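For instance, a look-back adjustment could be as small as this (a sketch only; AdjustKind is a hypothetical helper, and it assumes your SyntaxKind enum has an Identifier member alongside Annotation):

// Sketch: reclassify an identifier that directly follows "@" as an annotation name,
// by peeking at the previous token instead of building a parse tree.
private SyntaxKind AdjustKind(IList<IToken> tokens, int i, SyntaxKind kind)
{
    if (kind == SyntaxKind.Identifier && i > 0 && tokens[i - 1].Text == "@")
        return SyntaxKind.Annotation;
    return kind;
}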
I ended up choosing to use a parser because there were too many ad hoc rules. For example, although I wanted to color regular identifiers white, I wanted types in type declarations (e.g. C in class C) to be green. There ended up being about 20 of these special rules in total. Also, the added overhead of parsing turned out to be minuscule compared to other bottlenecks in my app.
For those interested, you can view my code here: https://github.com/jamesqo/Repository/blob/e5d5653093861bc35f4c0ac71ad6e27265e656f3/Repository.EditorServices/Internal/Java/Highlighting/JavaSyntaxHighlighter.VisitMethods.cs#L19-L76. I've highlighted all of the ~20 special rules I've had to make.

How can I create a parser according to a given grammar and check whether given input code is correct according to that grammar? [duplicate]

How do I go about writing a parser (recursive descent?) in C#? For now I just want a simple parser that parses arithmetic expressions (and reads variables?). Though later I intend to write an XML and HTML parser (for learning purposes). I am doing this because of the wide range of things parsers are useful for: web development, programming language interpreters, in-house tools, game engines, map and tile editors, etc. So what is the basic theory of writing parsers, and how do I implement one in C#? Is C# the right language for parsers? (I once wrote a simple arithmetic parser in C++ and it was efficient; will JIT compilation prove equally good?) Any helpful resources and articles would be appreciated, and best of all, code examples (or links to code examples).
Note: Out of curiosity, has anyone answering this question ever implemented a parser in C#?
I have implemented several parsers in C# - hand-written and tool generated.
A very good introductory tutorial on parsing in general is Let's Build a Compiler - it demonstrates how to build a recursive descent parser; and the concepts are easily translated from his language (I think it was Pascal) to C# for any competent developer. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.
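For the simple arithmetic case you mention, though, a hand-written recursive descent parser is short enough to sketch out directly (this is my own illustration, not from the tutorial; ExprParser and its method names are made up, and it only handles +, -, *, /, parentheses, and non-negative integers):

using System;

class ExprParser
{
    private readonly string _s;
    private int _pos;

    public ExprParser(string s) { _s = s; }

    public double Parse()
    {
        double value = ParseExpr();
        SkipSpaces();
        if (_pos != _s.Length) throw new FormatException("Unexpected input at " + _pos);
        return value;
    }

    // expr := term (('+' | '-') term)*
    private double ParseExpr()
    {
        double value = ParseTerm();
        while (true)
        {
            if (TryConsume('+')) value += ParseTerm();
            else if (TryConsume('-')) value -= ParseTerm();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double ParseTerm()
    {
        double value = ParseFactor();
        while (true)
        {
            if (TryConsume('*')) value *= ParseFactor();
            else if (TryConsume('/')) value /= ParseFactor();
            else return value;
        }
    }

    // factor := number | '(' expr ')'
    private double ParseFactor()
    {
        if (TryConsume('('))
        {
            double value = ParseExpr();
            if (!TryConsume(')')) throw new FormatException("Expected ')'");
            return value;
        }
        SkipSpaces();
        int start = _pos;
        while (_pos < _s.Length && char.IsDigit(_s[_pos])) _pos++;
        if (_pos == start) throw new FormatException("Expected a number at " + start);
        return double.Parse(_s.Substring(start, _pos - start));
    }

    private bool TryConsume(char c)
    {
        SkipSpaces();
        if (_pos < _s.Length && _s[_pos] == c) { _pos++; return true; }
        return false;
    }

    private void SkipSpaces()
    {
        while (_pos < _s.Length && _s[_pos] == ' ') _pos++;
    }
}

// Usage: new ExprParser("2 + 5 * 3").Parse() evaluates to 17.

Each grammar rule becomes one method, and operator precedence falls out of which method calls which. That said: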
You should look into some tools to generate the code for you - if you are determined to write a classical recursive descent parser (TinyPG, Coco/R, Irony). Keep in mind that there are other ways to write parsers now, that usually perform better - and have easier definitions (e.g. TDOP parsing or Monadic Parsing).
On the topic of whether C# is up for the task - C# has some of the best text libraries out there. A lot of the parsers today (in other languages) have an obscene amount of code to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious - however you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though it's written in F#) and its performance is just shy of Google V8.
Side Note: Markup parsers are completely different beasts when compared to language parsers - they are, in the majority of the cases, written by hand - and at the scanner/parser level very simple; they are not usually recursive descent - and especially in the case of XML it is better if you don't write a recursive descent parser (to avoid stack overflows, and because a 'flat' parser can be used in SAX/push mode).
Sprache is a powerful yet lightweight framework for writing parsers in .NET. There is also a Sprache NuGet package. To give you an idea of the framework, here is one of the samples that can parse a simple arithmetic expression into a .NET expression tree. Pretty amazing, I would say.
using System;
using System.Linq.Expressions;
using Sprache;

namespace LinqyCalculator
{
    static class ExpressionParser
    {
        public static Expression<Func<decimal>> ParseExpression(string text)
        {
            return Lambda.Parse(text);
        }

        static Parser<ExpressionType> Operator(string op, ExpressionType opType)
        {
            return Parse.String(op).Token().Return(opType);
        }

        static readonly Parser<ExpressionType> Add = Operator("+", ExpressionType.AddChecked);
        static readonly Parser<ExpressionType> Subtract = Operator("-", ExpressionType.SubtractChecked);
        static readonly Parser<ExpressionType> Multiply = Operator("*", ExpressionType.MultiplyChecked);
        static readonly Parser<ExpressionType> Divide = Operator("/", ExpressionType.Divide);

        static readonly Parser<Expression> Constant =
            (from d in Parse.Decimal.Token()
             select (Expression)Expression.Constant(decimal.Parse(d))).Named("number");

        static readonly Parser<Expression> Factor =
            ((from lparen in Parse.Char('(')
              from expr in Parse.Ref(() => Expr)
              from rparen in Parse.Char(')')
              select expr).Named("expression")
             .XOr(Constant)).Token();

        static readonly Parser<Expression> Term = Parse.ChainOperator(Multiply.Or(Divide), Factor, Expression.MakeBinary);

        static readonly Parser<Expression> Expr = Parse.ChainOperator(Add.Or(Subtract), Term, Expression.MakeBinary);

        static readonly Parser<Expression<Func<decimal>>> Lambda =
            Expr.End().Select(body => Expression.Lambda<Func<decimal>>(body));
    }
}
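To give a sense of how the sample is used (this usage snippet is mine, not part of the original sample), you get back a compilable expression tree; malformed input raises a Sprache parse error:

// Hypothetical usage of the ExpressionParser class above.
Expression<Func<decimal>> expr = ExpressionParser.ParseExpression("1 + 2 * 3");
Func<decimal> compiled = expr.Compile();
Console.WriteLine(compiled());   // prints 7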
C# is almost a decent functional language, so it is not such a big deal to implement something like Parsec in it. Here is one of the examples of how to do it: http://jparsec.codehaus.org/NParsec+Tutorial
It is also possible to implement a combinator-based Packrat parser in a very similar way, but this time keeping a global parsing state somewhere instead of doing pure functional stuff. In my (very basic and ad hoc) implementation it was reasonably fast, but of course a code generator like this must perform better.
I know that I am a little late, but I just published a parser/grammar/AST generator library named Ve Parser. You can find it at http://veparser.codeplex.com or add it to your project by typing 'Install-Package veparser' in the Package Manager Console. The library is a kind of recursive descent parser that is intended to be easy to use and flexible. As its source is available, you can learn from the code. I hope it helps.
In my opinion, there is a better way to implement parsers than the traditional methods, one that results in simpler and easier-to-understand code, and especially makes it easier to extend whatever language you are parsing by just plugging in a new class in a very object-oriented way. One article of a larger series that I wrote focuses on this parsing method, and full source code is included for a C# 2.0 parser:
http://www.codeproject.com/Articles/492466/Object-Oriented-Parsing-Breaking-With-Tradition-Pa
Well... where to start with this one....
First off, "writing a parser" is a very broad statement, especially given the question you're asking.
Your opening statement was that you wanted a simple arithmetic "parser"; well, technically that's not a parser, it's a lexical analyzer, similar to what you may use for creating a new language (http://en.wikipedia.org/wiki/Lexical_analysis). I understand, however, exactly where the confusion about them being the same thing may come from. It's important to note that lexical analysis is ALSO what you'll want to understand if you're going to write language/script parsers; it is strictly not parsing, because you are interpreting the instructions as opposed to making use of them.
Back to the parsing question....
This is what you'll be doing if you're taking a rigidly defined file structure and extracting information from it.
In general you really don't have to write a parser for XML/HTML, because there are already a ton of them around; more so, if you're parsing XML produced by the .NET runtime, you don't even need to parse, you just need to "serialise" and "de-serialise".
In the interest of learning, however, parsing XML (or anything similar, like HTML) is very straightforward in most cases.
If we start with the following XML:
<movies>
  <movie id="1">
    <name>Tron</name>
  </movie>
  <movie id="2">
    <name>Tron Legacy</name>
  </movie>
</movies>
We can load the data into an XDocument as follows:
XDocument myXML = XDocument.Load("mymovies.xml");
You can then get at the 'movies' root element using 'myXML.Root'.
More interestingly, however, you can easily use LINQ to get at the nested tags:
var myElements = from p in myXML.Root.Elements("movie")
                 select p;
This will give you a collection of XElements, each containing one <movie> element, which you can get at using something like:
foreach (var v in myElements)
{
    Console.WriteLine(string.Format("ID {0} = {1}", (int)v.Attribute("id"), (string)v.Element("name")));
}
For anything other than XML-like data structures, I'm afraid you're going to have to start learning the art of regular expressions. A tool like the "Regular Expression Coach" will help you immensely (http://weitz.de/regex-coach/), as will one of the more up-to-date similar tools.
You'll also need to become familiar with the .NET regular expression objects; http://www.codeproject.com/KB/dotnet/regextutorial.aspx should give you a good head start.
Once you know how your regex stuff works, in most cases it's a simple matter of reading in the files one line at a time and making sense of them using whichever method you feel comfortable with.
A good free source of file formats for almost anything you can imagine can be found at ( http://www.wotsit.org/ )
For the record, I implemented a parser generator in C# just because I couldn't find one working properly or similar to YACC (see: http://sourceforge.net/projects/naivelangtools/).
However, after some experience with ANTLR I decided to go with LALR instead of LL. I know that theoretically LL is easier to implement (generator or parser), but I simply cannot live with a stack of expression rules just to express the priorities of operators (like * going before + in "2+5*3"). In LL you say that mult_expr is embedded inside add_expr, which does not seem natural to me.

In C# can you split a string into variables?

I recently came across some PHP code where a CSV string was split into two variables:
list($this->field_one, $this->field_two) = explode(",", $fields);
I turned this into:
string[] tmp = s_fields.Split(',');
field_one = tmp[0];
field_two = tmp[1];
Is there a C# equivalent without creating a temporary array?
Jon Skeet said the right thing. GC will do the thing, don't you worry.
But if you like the syntax so much (which is pretty considerable), you can use this, I guess.
public static class MyPhpStyleExtension
{
    public static void SplitInTwo(this string str, char splitBy, out string first, out string second)
    {
        var tempArray = str.Split(splitBy);
        if (tempArray.Length != 2)
        {
            throw new NotSoPhpResultAsIExpectedException(tempArray.Length);
        }
        first = tempArray[0];
        second = tempArray[1];
    }
}
I almost feel guilty by writing this code. Hope this will do the thing.
The answer to your quite narrow question is no. C# does not provide a 'multi-assignment' capability, so you cannot extract an arbitrary set of values from anything (such as Split()) and break them out into individual named variables.
There is a workaround for a specific number of variables: write a method with out parameters. See #vlad's answer for an approach based on that.
But why would you want to? C# provides an impressive range of features that will allow you to take apart strings and deal with the parts in such a wide range of different ways that the lack of 'multi-assignment' should barely be noticed.
Parsing strings usually involves other operations such as dealing with formatting errors, trimming white space, and case folding. There could be fewer than 2 strings, or more. Requirements could change over time. When you are ready for a more capable string parser, C# will be waiting.
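For example, a slightly more defensive version of the same split might look like this (a sketch only; it reuses the variable names from the question):

// Validate the shape of the input and trim whitespace before assigning the fields.
string[] parts = s_fields.Split(',');
if (parts.Length < 2)
    throw new FormatException("Expected at least two comma-separated fields.");

field_one = parts[0].Trim();
field_two = parts[1].Trim();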
You may use the following approach:
- Create a class that inherits from DynamicObject.
- Create a local dictionary in the derived class that will store your variables' values.
- Override the TryGetMember and TrySetMember methods to get and set values from and into the dictionary.
- Now you can split your string, put the parts in the dictionary, and then access your variables like:
dynamicObject.var1
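A rough sketch of that approach (the class and member names here are made up purely for illustration):

using System;
using System.Collections.Generic;
using System.Dynamic;

class SplitBag : DynamicObject
{
    private readonly Dictionary<string, object> _values = new Dictionary<string, object>();

    public SplitBag(string csv)
    {
        string[] parts = csv.Split(',');
        for (int i = 0; i < parts.Length; i++)
            _values["var" + (i + 1)] = parts[i];   // var1, var2, ...
    }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        return _values.TryGetValue(binder.Name, out result);
    }

    public override bool TrySetMember(SetMemberBinder binder, object value)
    {
        _values[binder.Name] = value;
        return true;
    }
}

// Usage:
// dynamic dynamicObject = new SplitBag("a,b");
// Console.WriteLine(dynamicObject.var1);   // "a"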

ANTLR: Get token name?

I've got a grammar rule,
OR
: '|';
But when I print the AST using,
public static void Preorder(ITree tree, int depth)
{
    if (tree == null)
    {
        return;
    }
    for (int i = 0; i < depth; i++)
    {
        Console.Write(" ");
    }
    Console.WriteLine(tree);
    for (int i = 0; i < tree.ChildCount; ++i)
        Preorder(tree.GetChild(i), depth + 1);
}
(Thanks Bart) it displays the actual | character. Is there a way I can get it to say "OR" instead?
robert inspired this answer.
if (ExpressionParser.tokenNames[tree.Type] == tree.Text)
    Console.WriteLine(tree.Text);
else
    Console.WriteLine("{0} '{1}'", ExpressionParser.tokenNames[tree.Type], tree.Text);
I had to do this a couple of weeks ago, but with the Python ANTLR. It doesn't help you much, but it might help somebody else searching for an answer.
With Python ANTLR, token types are integers. The token text is included in the token object. Here's the solution I used:
import antlrGeneratedLexer

token_names = {}
for name, value in antlrGeneratedLexer.__dict__.iteritems():
    if isinstance(value, int) and name == name.upper():
        token_names[value] = name
There's no apparent logic to the numbering of tokens (at least, with Python ANTLR), and the token names are not stored as strings except in the module __dict__, so this is the only way of getting to them.
I would guess that in C# token types are in an enumeration, and I believe enumerations can be printed as strings. But that's just a guess.
Boy, I spent way too much time banging my head against a wall trying to figure this out. Mark's answer gave me the hint I needed, and it looks like the following will get the token name from a TerminalNode in Antlr 4.5:
myLexer.getVocabulary().getSymbolicName(myTerminalNode.getSymbol().getType())
or, in C#:
myLexer.Vocabulary.GetSymbolicName(myTerminalNode.Symbol.Type)
(Looks like you can actually get the vocabulary from either the parser or the lexer.)
Those vocabulary methods seem to be the preferred way to get at the tokens in Antlr 4.5, and tokenNames appears to be deprecated.
It does seem needlessly complicated for what I think is a pretty basic operation, so maybe there's an easier way.
I'm new to Antlr, but it seems ITree has no direct obligation to be related to the Parser (in .NET). Instead there is a derived interface IParseTree, returned from the Parser (in Antlr4), and it contains a few additional methods, including this override:
string ToStringTree(Parser parser);
It converts the whole node subtree into a text representation. In some cases this is useful. If you would like to see just the name of some concrete node without its children, then use the static method in the Trees class:
public static string GetNodeText(ITree t, Parser recog);
This method does basically the same as Mark and Robert suggested, but in more general and flexible way.
In addition to robert's pythonic answer (and hopefully this will be useful for other languages):
If using the nextToken() method of your generated lexer, you can use the 'type' property of the lexer (not the token, unintuitively enough) to get the numeric code given to the token type by the lexer. In the lexer itself you can see which type got which number. Hope this is helpful.

hand coding a parser

For all you compiler gurus, I wanna write a recursive descent parser and I wanna do it with just code. No generating lexers and parsers from some other grammar, and don't tell me to read the dragon book; I'll come around to that eventually.
I wanna get into the gritty details of implementing a lexer and parser for a reasonably simple language, say CSS. And I wanna do this right.
This will probably end up being a series of questions but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.
I find myself writing code like this (hopefully you can infer the rest from this snippet):
public CssToken ReadNext()
{
    int val;
    while ((val = _reader.Read()) != -1)
    {
        var c = (char)val;
        switch (_stack.Top)
        {
            case ParserState.Init:
                if (c == ' ')
                {
                    continue; // ignore
                }
                else if (c == '.')
                {
                    _stack.Transition(ParserState.SubIdent, ParserState.Init);
                }
                break;

            case ParserState.SubIdent:
                if (c == '-')
                {
                    _token.Append(c);
                }
                _stack.Transition(ParserState.SubNMBegin);
                break;
What is this called, and how far off am I from something reasonably well understood? I'm trying to balance something which is fair in terms of efficiency and easy to work with; using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.
What I have is an input stream, from which I can read one character at a time. I don't do any look-ahead right now; I just read the character and then, depending on the current state, try to do something with it.
I'd really like to get into the mindset of writing reusable snippets of code. This Transition method is currently meant to do that: it will pop the current state off the stack and then push the arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init), it will "call" a subroutine SubIdent which will, when complete, return to the Init state.
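In rough terms, Transition does something like this (a sketch only; the real stack type isn't shown above, so assume it has Push and Pop):

void Transition(params ParserState[] states)
{
    _stack.Pop();                               // leave the current state
    for (int i = states.Length - 1; i >= 0; i--)
        _stack.Push(states[i]);                 // push in reverse so states[0] runs next
}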
The parser will be implemented in much the same way. Currently, having everything in a single big method like this allows me to easily return a token when I've found one, but it also forces me to keep everything in one single big method. Is there a nice way to split these tokenization rules into separate methods?
What you're writing is called a pushdown automaton. This is usually more power than you need to write a lexer, it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and to understand. Most parser generators generate pushdown automatons.
Lexers are almost always written as finite state machines, i.e., like your code except get rid of the "stack" object. Finite state machines are closely related to regular expressions (actually, they're provably equivalent to one another). When designing such a parser, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.
There are tools to do this. The lex tool and its descendants are well known and have been translated into many languages. The ANTLR toolchain also has a lexer component. My preferred tool is ragel on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.
If you do want to write your own lexer by hand, good ones often look something like this:
function readToken() // note: returns only one token each time
    while !eof
        c = peekChar()
        if c in A-Za-z
            return readIdentifier()
        else if c in 0-9
            return readInteger()
        else if c in ' \n\r\t\v\f'
            nextChar()
        ...
    return EOF

function readIdentifier()
    ident = ""
    while !eof
        c = nextChar()
        if c in A-Za-z0-9
            ident.append(c)
        else
            return Token(Identifier, ident)
            // or maybe...
            return Identifier(ident)
Then you can write your parser as a recursive descent parser. Don't try to combine lexer / parser stages into one, it leads to a total mess of code. (According to the Parsec author, it's slower, too).
You need to write your own Recursive Descent Parser from your BNF/EBNF. I had to write my own recently and this page was a lot of help. I'm not sure what you mean by "with just code". Do you mean you want to know how to write your own recursive parser?
If you want to do that, you need to have your grammar in place first. Once you have your EBNF/BNF in place, the parser can be written quite easily from it.
The first thing I did when I wrote my parser, was to read everything in and then tokenize the text. So I essentially ended up with an array of tokens that I treated as a stack. To reduce the verbosity/overhead of pulling a value off a stack and then pushing it back on if you don't require it, you can have a peek method that simply returns the top value on the stack without popping it.
UPDATE
Based on your comment, I had to write a recursive-descent parser in Javascript from scratch. You can take a look at the parser here. Just search for the constraints function. I wrote my own tokenize function to tokenize the input as well. I also wrote another convenience function (peek, that I mentioned before). The parser parses according to the EBNF here.
This took me a little while to figure out because it's been years since I wrote a parser (the last time was in school!), but trust me, once you get it, you get it. I hope my example gets you further along on your way.
ANOTHER UPDATE
I also realized that my example may not be what you want because you might be going towards using a shift-reduce parser. You mentioned that right now you are trying to write a tokenizer. In my case, I did write my own tokenizer in Javascript. It's probably not robust, but it was sufficient for my needs.
function tokenize(options) {
    var str = options.str;
    var delimiters = options.delimiters.split("");
    var returnDelimiters = options.returnDelimiters || false;
    var returnEmptyTokens = options.returnEmptyTokens || false;
    var tokens = new Array();
    var lastTokenIndex = 0;

    for(var i = 0; i < str.length; i++) {
        if(exists(delimiters, str[i])) {
            var token = str.substring(lastTokenIndex, i);
            if(token.length == 0) {
                if(returnEmptyTokens) {
                    tokens.push(token);
                }
            }
            else {
                tokens.push(token);
            }
            if(returnDelimiters) {
                tokens.push(str[i]);
            }
            lastTokenIndex = i + 1;
        }
    }

    if(lastTokenIndex < str.length) {
        var token = str.substring(lastTokenIndex, str.length);
        token = token.replace(/^\s+/, "").replace(/\s+$/, "");
        if(token.length == 0) {
            if(returnEmptyTokens) {
                tokens.push(token);
            }
        }
        else {
            tokens.push(token);
        }
    }

    return tokens;
}
Based on your code, it looks like you are reading, tokenizing, and parsing at the same time - I'm assuming that's what a shift-reduce parser does? The flow for what I have is tokenize first to build the stack of tokens, and then send the tokens through the recursive-descent parser.
If you are going to hand code everything from scratch I would definitely consider going with a recursive descent parser. In your post you are not really saying what you will be doing with the token stream once you have parsed the source.
Some things I would recommend getting a handle on:
1. Good design for your scanner/lexer. This is what will be tokenizing your source code for your parser.
2. The next thing is the parser. If you have a good EBNF for the source language, the parser can usually be translated quite nicely into a recursive descent parser.
3. Another data structure you will really need to get your head around is the symbol table. This can be as simple as a hashtable or as complex as a tree structure that can represent complex record structures etc. I think for CSS you might be somewhere between the two.
4. And finally you want to deal with code generation. You have many options here. For an interpreter, you might simply interpret on the fly as you parse the code. A better approach might be to generate a form of i-code that you can then write an interpreter for, and later even a compiler. Of course for the .NET platform you could directly generate IL (probably not applicable for CSS :))
For references, I gather you are not heavily into the deep theory, and I do not blame you. A really good starting point for getting the basics without complex code, if you do not mind the Pascal that is, is Jack Crenshaw's 'Let's Build a Compiler'
http://compilers.iecc.com/crenshaw/
Good luck I am sure you are going to enjoy this project.
It looks like you want to implement a "shift-reduce" parser, where you explicitly build a token stack. The usual alternative is a "recursive descent" parser, in which depth of procedure calls build the same token stack with their own local variables, on the actual hardware stack.
In shift-reduce, the term "reduce" refers to the operation performed on the explicitly-maintained token stack. For example, if the top of the stack has become Term, Operator, Term then a reduction rule can be applied resulting in Expression as a replacement for the pattern. The reduction rules are explicitly encoded in a data structure used by the shift-reduce parser; as a result, all reduction rules can be found in the same spot of the source code.
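A sketch of what such an explicitly encoded reduction rule might look like in C# (the names and the single rule here are purely illustrative, not a complete parser):

using System.Collections.Generic;
using System.Linq;

static class Reducer
{
    enum Sym { Term, Operator, Expression }

    // Each rule: a top-of-stack pattern and the symbol that replaces it.
    static readonly (Sym[] Pattern, Sym Result)[] Rules =
    {
        (new[] { Sym.Term, Sym.Operator, Sym.Term }, Sym.Expression),
    };

    // Try to apply one reduction to the top of the symbol stack.
    static bool TryReduce(List<Sym> stack)
    {
        foreach (var (pattern, result) in Rules)
        {
            if (stack.Count >= pattern.Length &&
                stack.Skip(stack.Count - pattern.Length).SequenceEqual(pattern))
            {
                stack.RemoveRange(stack.Count - pattern.Length, pattern.Length);
                stack.Add(result);   // replace "Term Operator Term" with "Expression"
                return true;
            }
        }
        return false;
    }
}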
The shift-reduce approach brings a few benefits compared to recursive-descent. On a subjective level, my opinion is that shift-reduce is easier to read and maintain than recursive-descent. More objectively, shift-reduce allows for more informative error messages from the parser when an unexpected token occurs.
Specifically, because the shift-reduce parser has an explicit encoding of rules for making "reductions," the parser is easily extended to articulate what sorts of tokens could legally have followed. (e.g., "; expected"). A recursive descent implementation cannot easily be extended to do the same thing.
A great book on both kinds of parser, and the trade-offs in implementing different kinds of shift-reduce is "Introduction to Compiler Construction", by Thomas W. Parsons.
Shift-reduce is sometimes called "bottom-up" parsing and recursive-descent is sometimes called "top-down" parsing. In the analogy used, nodes composed with highest precedence (e.g., "factors" in multiplication expression) are considered to be "at the bottom" of the parsing. This is in accord with the same analogy used in "descent" of "recursive descent".
If you want to use the parser to also handle not-well-formed expressions, you really want a recursive descent parser. Much easier to get the error handling and reporting usable.
For literature, I'd recommend some of the old work of Niklaus Wirth. He knows how to write. Algorithms + Data Structures = Programs is what I used, but you can find his Compiler Construction online.
