Why is the following LINQ syntax (sometimes called "query" syntax) called "comprehension" syntax? What's being comprehended (surely that's wrong)? Or, what is comprehensively represented (maybe I'm on the right track, now)?
It comes from the more language-agnostic term List Comprehension which many languages follow. The history apparently is:
The SETL programming language (late 1960s) had a set formation construct, and the computer algebra system AXIOM (1973) has a similar construct that processes streams, but the first use of the term "comprehension" for such constructs was in Rod Burstall and John Darlington's description of their functional programming language NPL from 1977.
FOLDOC mostly echoes this as well:
According to a note by Rishiyur Nikhil (August 1992), the term itself seems to have been coined by Phil Wadler circa 1983-5, although the programming construct itself goes back much further (most likely Jack Schwartz and the SETL language).
The term "list comprehension" appears in the references below.
The earliest reference to the notation is in Rod Burstall and John Darlington's description of their language, NPL.
["The OL Manual" Philip Wadler, Quentin Miller and Martin Raskovsky, probably 1983-1985].
["How to Replace Failure by a List of Successes" FPCA September 1985, Nancy, France, pp. 113-146].
I suspect this is related to the second meaning of Comprehend:
to take in or embrace; include; comprise
This syntax has to do with defining what should be included in a set.
I think this paper can shed light on it: http://dl.acm.org/citation.cfm?id=181564
That is, they argue for and define (I think) what comprehension syntax is. It was published in 1994, and it may have influenced the design concepts of LINQ.
My understanding of the term LINQ comprehension syntax, as a .NET developer, is that it allows you to write LINQ in a familiar, query-language-like style. As a person's understanding of LINQ improves, they may move to what is known in .NET as extension method syntax, which is also how the compiler interprets LINQ query syntax at compile time.
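For example, these two forms are equivalent, and the compiler rewrites the query ("comprehension") form into the extension-method form below it. A minimal sketch (the array and variable names are made up for illustration):

using System;
using System.Linq;

class ComprehensionDemo
{
    static void Main()
    {
        int[] numbers = { 1, 2, 3, 4, 5, 6 };

        // Query ("comprehension") syntax: reads like a description of
        // what the result set comprises.
        var evens = from n in numbers
                    where n % 2 == 0
                    select n;

        // Extension method syntax: what the compiler translates the query into.
        var evens2 = numbers.Where(n => n % 2 == 0);

        Console.WriteLine(string.Join(", ", evens));  // 2, 4, 6
        Console.WriteLine(string.Join(", ", evens2)); // 2, 4, 6
    }
}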
Since the term "comprehension" and "comprehensive" is very often used in English language to indicate the "whole" and "completeness", one meaning of comprehension syntax could be a sintax that allows to build expressions which are able to generate sets of values that "include all values"(comprehend) that respect the rules expressed by those expressions.
Another meaning could be more related to the generation of subsets of values (lists) starting from some specified set, and therefore the subset of values that belongs to the starting set and it is "comprised" in the original set. For that reason, the comprehension sintax could be the sintax for the programming languages's constructs that can generate subset values comprised into a specified original set.
How do I go about writing a parser (recursive descent?) in C#? For now I just want a simple parser that parses arithmetic expressions (and reads variables?). Though later I intend to write an XML and HTML parser (for learning purposes). I am doing this because of the wide range of places in which parsers are useful: web development, programming language interpreters, in-house tools, game engines, map and tile editors, etc. So what is the basic theory of writing parsers, and how do I implement one in C#? Is C# the right language for parsers? (I once wrote a simple arithmetic parser in C++ and it was efficient. Will JIT-compiled code prove equally good?) Any helpful resources and articles, and best of all, code examples (or links to code examples).
Note: Out of curiosity, has anyone answering this question ever implemented a parser in C#?
I have implemented several parsers in C# - hand-written and tool generated.
A very good introductory tutorial on parsing in general is Let's Build a Compiler - it demonstrates how to build a recursive descent parser, and the concepts are easily translated from the tutorial's language (I think it was Pascal) to C# by any competent developer. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.
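To make the mechanics concrete, here is a minimal hand-rolled recursive descent evaluator for arithmetic expressions in that spirit (a learning sketch, not a production parser; all names are my own). Note how the Expr/Term/Factor layering encodes operator precedence:

using System;
using System.Globalization;

// A minimal recursive descent evaluator, one method per grammar rule:
//   Expr   := Term (('+' | '-') Term)*
//   Term   := Factor (('*' | '/') Factor)*
//   Factor := number | '(' Expr ')'
class RecursiveDescent
{
    private readonly string _text;
    private int _pos;

    private RecursiveDescent(string text) { _text = text; }

    public static double Evaluate(string text)
    {
        var parser = new RecursiveDescent(text);
        double result = parser.ParseExpr();
        if (parser.Peek() != '\0') throw new FormatException("unexpected trailing input");
        return result;
    }

    private double ParseExpr()
    {
        double value = ParseTerm();
        while (Peek() == '+' || Peek() == '-')
        {
            char op = Next();
            double rhs = ParseTerm();
            value = op == '+' ? value + rhs : value - rhs;
        }
        return value;
    }

    private double ParseTerm()
    {
        double value = ParseFactor();
        while (Peek() == '*' || Peek() == '/')
        {
            char op = Next();
            double rhs = ParseFactor();
            value = op == '*' ? value * rhs : value / rhs;
        }
        return value;
    }

    private double ParseFactor()
    {
        if (Peek() == '(')
        {
            Next();                     // consume '('
            double value = ParseExpr(); // recurse: the "descent"
            if (Next() != ')') throw new FormatException("expected ')'");
            return value;
        }
        int start = _pos;               // Peek() above already skipped whitespace
        while (_pos < _text.Length && (char.IsDigit(_text[_pos]) || _text[_pos] == '.')) _pos++;
        if (start == _pos) throw new FormatException("expected a number at position " + start);
        return double.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    // Skip whitespace, then return the next significant character without consuming it.
    private char Peek()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
        return _pos < _text.Length ? _text[_pos] : '\0';
    }

    private char Next() { char c = Peek(); _pos++; return c; }
}

For example, RecursiveDescent.Evaluate("2+5*3") returns 17, because Term sits below Expr in the grammar and therefore binds tighter.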
If you are determined to write a classical recursive descent parser, you should look into some tools to generate the code for you (TinyPG, Coco/R, Irony). Keep in mind that there are other ways to write parsers now that usually perform better and have easier definitions (e.g. TDOP parsing or monadic parsing).
On the topic of whether C# is up for the task: C# has some of the best text libraries out there. A lot of parsers today (in other languages) contain an obscene amount of code to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious, but you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though it's written in F#), and its performance is just shy of Google V8.
Side Note: Markup parsers are completely different beasts when compared to language parsers - they are, in the majority of the cases, written by hand - and at the scanner/parser level very simple; they are not usually recursive descent - and especially in the case of XML it is better if you don't write a recursive descent parser (to avoid stack overflows, and because a 'flat' parser can be used in SAX/push mode).
Sprache is a powerful yet lightweight framework for writing parsers in .NET. There is also a Sprache NuGet package. To give you an idea of the framework, here is one of the samples that can parse a simple arithmetic expression into a .NET expression tree. Pretty amazing, I would say.
using System;
using System.Linq.Expressions;
using Sprache;

namespace LinqyCalculator
{
    static class ExpressionParser
    {
        public static Expression<Func<decimal>> ParseExpression(string text)
        {
            return Lambda.Parse(text);
        }

        static Parser<ExpressionType> Operator(string op, ExpressionType opType)
        {
            return Parse.String(op).Token().Return(opType);
        }

        static readonly Parser<ExpressionType> Add = Operator("+", ExpressionType.AddChecked);
        static readonly Parser<ExpressionType> Subtract = Operator("-", ExpressionType.SubtractChecked);
        static readonly Parser<ExpressionType> Multiply = Operator("*", ExpressionType.MultiplyChecked);
        static readonly Parser<ExpressionType> Divide = Operator("/", ExpressionType.Divide);

        static readonly Parser<Expression> Constant =
            (from d in Parse.Decimal.Token()
             select (Expression)Expression.Constant(decimal.Parse(d))).Named("number");

        static readonly Parser<Expression> Factor =
            ((from lparen in Parse.Char('(')
              from expr in Parse.Ref(() => Expr)
              from rparen in Parse.Char(')')
              select expr).Named("expression")
             .XOr(Constant)).Token();

        static readonly Parser<Expression> Term = Parse.ChainOperator(Multiply.Or(Divide), Factor, Expression.MakeBinary);

        static readonly Parser<Expression> Expr = Parse.ChainOperator(Add.Or(Subtract), Term, Expression.MakeBinary);

        static readonly Parser<Expression<Func<decimal>>> Lambda =
            Expr.End().Select(body => Expression.Lambda<Func<decimal>>(body));
    }
}
C# is almost a decent functional language, so it is not such a big deal to implement something like Parsec in it. Here is one of the examples of how to do it: http://jparsec.codehaus.org/NParsec+Tutorial
It is also possible to implement a combinator-based Packrat parser in a very similar way, but this time keeping a global parsing state somewhere instead of keeping everything purely functional. In my (very basic and ad hoc) implementation it was reasonably fast, but of course a generated parser should perform better.
I know that I am a little late, but I just published a parser/grammar/AST generator library named Ve Parser. You can find it at http://veparser.codeplex.com, or add it to your project by typing 'Install-Package veparser' in the Package Manager Console. This library is a kind of recursive descent parser that is intended to be easy to use and flexible. As the source is available to you, you can learn from it. I hope it helps.
In my opinion, there is a better way to implement parsers than the traditional methods, one that results in simpler and easier-to-understand code, and that especially makes it easier to extend whatever language you are parsing by just plugging in a new class in a very object-oriented way. One article of a larger series that I wrote focuses on this parsing method, and full source code is included for a C# 2.0 parser:
http://www.codeproject.com/Articles/492466/Object-Oriented-Parsing-Breaking-With-Tradition-Pa
Well... where to start with this one....
First off, "writing a parser" is a very broad statement, especially given the question you're asking.
Your opening statement was that you wanted a simple arithmetic "parser"; well, technically that's not a parser, it's a lexical analyzer, similar to what you may use for creating a new language (http://en.wikipedia.org/wiki/Lexical_analysis). I understand, however, exactly where the confusion that the two are the same thing may come from. It's important to note that lexical analysis is ALSO what you'll want to understand if you're going to write language/script parsers too; this is strictly not parsing, because you are interpreting the instructions as opposed to making use of them.
Back to the parsing question....
This is what you'll be doing if you're taking a rigidly defined file structure and extracting information from it.
In general you really don't have to write a parser for XML/HTML, because there are already a ton of them around; and what's more, if you're parsing XML produced by the .NET runtime, then you don't even need to parse, you just need to "serialise" and "de-serialise".
In the interests of learning, however, parsing XML (or anything similar, like HTML) is very straightforward in most cases.
If we start with the following XML:
<movies>
  <movie id="1">
    <name>Tron</name>
  </movie>
  <movie id="2">
    <name>Tron Legacy</name>
  </movie>
</movies>
we can load the data into an XElement as follows:
XElement myXML = XElement.Load("mymovies.xml");
you can then get at the 'movies' root element using 'myXML.Root'
More interestingly, however, you can easily use LINQ to get the nested tags:
var myElements = from p in myXML.Root.Elements("movie")
                 select p;
This will give you a sequence of XElements, each containing one <movie> element, which you can get at using something like:
foreach (var v in myElements)
{
    Console.WriteLine(string.Format("ID {0} = {1}", (int)v.Attribute("id"), (string)v.Element("name")));
}
For anything else other than XML-like data structures, I'm afraid you're going to have to start learning the art of regular expressions. A tool like the "Regular Expression Coach" will help you immensely (http://weitz.de/regex-coach/), as will one of the more up-to-date similar tools.
You'll also need to become familiar with the .NET regular expression classes (http://www.codeproject.com/KB/dotnet/regextutorial.aspx should give you a good head start).
Once you know how your regex stuff works, then in most cases it's a simple matter of reading in the files one line at a time and making sense of them using whichever method you feel comfortable with.
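As a quick illustration of the line-at-a-time approach, here is a minimal sketch using the .NET Regex class (the "key = value" format and all names here are invented for the example):

using System;
using System.Text.RegularExpressions;

class RegexLineParser
{
    static void Main()
    {
        // A made-up "key = value" file format, handled one line at a time.
        string[] lines = { "width = 800", "height = 600", "# a comment" };
        var keyValue = new Regex(@"^\s*(?<key>\w+)\s*=\s*(?<value>.+?)\s*$");

        foreach (string line in lines)
        {
            Match m = keyValue.Match(line);
            if (m.Success)
                Console.WriteLine("{0} -> {1}", m.Groups["key"].Value, m.Groups["value"].Value);
        }
    }
}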
A good free source of file formats for almost anything you can imagine can be found at ( http://www.wotsit.org/ )
For the record, I implemented a parser generator in C#, just because I couldn't find anything that worked properly or was similar to YACC (see: http://sourceforge.net/projects/naivelangtools/).
However, after some experience with ANTLR, I decided to go with LALR instead of LL. I know that theoretically LL is easier to implement (generator or parser), but I simply cannot live with a stack of expression rules whose only purpose is to express operator precedence (like * binding before + in "2+5*3"). In LL you say that mult_expr is embedded inside add_expr, which does not seem natural to me.
How do I make it so that when someone types x*y^z into a textbox on my page, I can evaluate that equation in the code-behind and get the result?
.NET does not have a built-in function for evaluating arbitrary expression strings. However, an open-source .NET library named NCalc does.
NCalc is a mathematical expressions evaluator in .NET. NCalc can parse
any expression and evaluate the result, including static or dynamic
parameters and custom functions.
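For example, a minimal sketch of how NCalc is typically used (Pow is one of NCalc's built-in functions; the parameter values are made up to match the x*y^z case from the question):

using System;
using NCalc; // from the NCalc NuGet package

class NCalcDemo
{
    static void Main()
    {
        // A fixed expression:
        var e = new Expression("2 + 3 * 5");
        Console.WriteLine(e.Evaluate()); // 17

        // With parameters, covering the x*y^z case via the built-in Pow function:
        var f = new Expression("x * Pow(y, z)");
        f.Parameters["x"] = 3;
        f.Parameters["y"] = 10;
        f.Parameters["z"] = 2;
        Console.WriteLine(f.Evaluate()); // 300
    }
}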
Answer from the question "operators as strings", by user Matt Crouch (https://stackoverflow.com/users/1670022/matt-crouch), using built-in .NET functionality:
"If all you need is simple arithmetic, do this.
DataTable temp = new DataTable();
Console.WriteLine(temp.Compute("15 / 3",string.Empty));
EDIT: a little more information. Check out the MSDN documentation for the Expression property of the System.Data.DataColumn class. The stuff on "Expression Syntax" outlines a list of commands you can use in addition to the arithmetic operators. (ex. IIF, LEN, etc.)."
EDIT 2: For convenience, you can put this into a little function like:
public string Eval(string expr)
{
    var temp = new System.Data.DataTable();
    string result = null;
    try
    {
        result = $"{temp.Compute(expr, string.Empty)}";
    }
    catch (System.Data.EvaluateException ex)
    {
        if (ex.Message.ToLower().Contains("cannot find column"))
            throw new System.Data.SyntaxErrorException($"Syntax error: Invalid expression: '{expr}'."
                + " Variables as operands are not supported.");
        else
            throw;
    }
    return result;
}
So you can use it like:
Console.WriteLine(Eval("15 * (3 + 5) / (7 - 2)"));
giving the expected output:
24
Note that the error handler helps to handle exceptions caused by using variables, which are not allowed here. Example: for Eval("a"), instead of returning "Cannot find column [a]", which doesn't make much sense in this context (we're not using it in a database context), it returns "Syntax error: Invalid expression: 'a'. Variables as operands are not supported."
Run it on DotNetFiddle
There are two main approaches to this problem, each with some variations, as illustrated in the variety of answers.
Option A: Find an existing mathematical expression evaluator
Option B: Write your own parser and the logic to compute the result
Before going into some details about this, it is appropriate to stress that interpreting arbitrary mathematical expressions is not a trivial task, for any expression grammar other than "toy" grammars that only accept one or two arithmetic operations and do not allow parentheses, etc.
Understanding that the task only looks trivial, and acknowledging that interpreting arithmetic expressions of average complexity is a relatively recurrent need for various applications (hence one for which mature solutions should be available), it is probably wise to try and make do with "Option A".
I'd therefore second Jed's recommendation of a ready-made expression evaluator such as NCalc.
It may be useful, however, to take the time to understand the various concepts and methods associated with parsing and interpreting arithmetic expressions, as if one were going to whip up one's own implementation.
The key concept is that of a formal grammar. The arithmetic expressions which the evaluator will accept must follow a set of rules, such as the list of arithmetic operations allowed. For example, will the evaluator support, say, trigonometric functions, and if it does, will that also include, say, atan2()? The rules also indicate what constitutes an operand, for example whether it will be allowed to input numerical values as big as, say, 45 digits, etc. The point is that all these rules are formalized in a grammar.
Typically a grammar works on tokens which have previously been extracted from the raw input text. Essentially, at some point in the process, some logic needs to analyze the input string, character by character, and determine which sequences of characters go together. For example, in the expression 123 + 45 / 9.3, the tokens are the integer value 123, the plus operator, the integer value 45, the division operator and finally the real value 9.3. Identifying the tokens and associating them with a token type is the job of a lexer. Lexers can themselves be built on a grammar (one whose "tokens" are single characters, as opposed to the grammar for the arithmetic expression parser, whose tokens are the short strings produced by the lexer).
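To make that concrete, here is a simplified C# lexer sketch that tokenizes exactly that example with a single regular expression (illustration only; a real lexer would also track positions and report errors):

using System;
using System.Text.RegularExpressions;

class LexerSketch
{
    static void Main()
    {
        // One alternative per token type; putting "real" before "int"
        // makes 9.3 match as a single real rather than two ints and a dot.
        var token = new Regex(@"\s*(?:(?<real>\d+\.\d+)|(?<int>\d+)|(?<op>[+\-*/()]))");

        foreach (Match m in token.Matches("123 + 45 / 9.3"))
        {
            if (m.Groups["real"].Success) Console.WriteLine("REAL " + m.Groups["real"].Value);
            else if (m.Groups["int"].Success) Console.WriteLine("INT  " + m.Groups["int"].Value);
            else Console.WriteLine("OP   " + m.Groups["op"].Value);
        }
    }
}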
BTW, grammars are used to define many other things beyond arithmetic expressions. Computer languages follow [rather sophisticated] grammars, and it is relatively common to introduce Domain Specific Languages (aka DSLs) in support of various features of computer applications.
For very simple grammars, one may be able to write the corresponding lexer and parser from scratch. But sooner rather than later the grammars may get complicated, to the point that hand-writing these modules becomes tedious, bug-prone and, maybe more importantly, difficult to read. Hence the existence of lexer and parser generators, which are stand-alone programs that produce the code of lexers and parsers (in a particular programming language such as C, Java or C#) from a list of rules (expressed in a syntax particular to the generator, though many generators tend to use similar syntaxes, loosely based on BNF).
When using such a lexer/parser generator, work is done in multiple steps:
- First, one writes a definition of the grammar (in the generator-specific language/syntax).
- One runs this grammar through the generator.
- One often repeats the above two steps multiple times, because writing a grammar is an exacting exercise: the generator will complain about the many possible ambiguities one may write into the grammar.
- Eventually the generator produces a source file (in the desired target language such as C#).
- This source is included in the overall project.
- Other source files in the project may invoke the functions exposed in the source files produced by the generator, and/or some logic corresponding to various patterns identified during parsing may be embedded in the generator-produced code.
- The project can then be built as usual, i.e. as if the parser and lexer had been hand-written.
And that's about it for a 20,000-foot view of the process of working with formal grammars and code generators.
A list of parser generators (aka compiler-compilers) can be found at this link. For simple work in C# I also want to mention Irony. It may be very insightful to peruse these sites to get a better feel for these concepts, even without the intent of becoming a practitioner at this time.
As said, I wish to stress that for this particular application, a ready-made arithmetic evaluator is likely the better approach. The main downsides of these would be:
some limitations on the allowed expression syntax (either the grammar is too restrictive: you also need, say, stddev(); or it is too broad: you don't want your users to use trig functions). With the more mature evaluators, there will be some form of configuration/extension feature which allows dealing with this problem.
the learning curve of such a 3rd-party module. Hopefully many of them are relatively "plug-and-play".
Solved with this library: http://www.codeproject.com/Articles/21137/Inside-the-Mathematical-Expressions-Evaluator
My final code:
Calculator Cal = new Calculator();
txt_LambdaNoot.Text = (Cal.Evaluate(txt_C.Text) / fo).ToString();
Now when someone types 3*10^11, they will get 300000000000.
You will need to implement (or find a third-party source) an expression parser. This is not a trivial thing to do.
What you need - if you want to do it yourself - is a Scanner (also known as Lexer) + Parser in the code behind which interprets the expression. Alternatively, you can find a 3rd party library which does the job and works similar as the JavaScript eval(string) function does.
Please take a look here; it describes a recursive descent parser. The example is written in C, but you should be able to adapt it to C# easily once you've got the idea described in the article.
It is less complicated than it sounds, especially if you have a limited amount of operators to support.
The advantage is that you keep full control on what expressions will be executed (to prevent malicious code injections by the end-user of your website).
I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching and nothing else.
The biggest challenge is probably the company-name suffix part and abbreviated names.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
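If you would rather study the algorithm itself, here is a compact C# sketch of Jaro-Winkler written from the published definition (illustrative; prefer a tested library implementation in production):

using System;

static class JaroWinklerSketch
{
    public static double Jaro(string s1, string s2)
    {
        if (s1.Length == 0 && s2.Length == 0) return 1.0;
        // Characters match if equal and within this window of each other.
        int window = Math.Max(Math.Max(s1.Length, s2.Length) / 2 - 1, 0);
        bool[] m1 = new bool[s1.Length], m2 = new bool[s2.Length];
        int matches = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            int lo = Math.Max(0, i - window), hi = Math.Min(s2.Length - 1, i + window);
            for (int j = lo; j <= hi; j++)
                if (!m2[j] && s1[i] == s2[j]) { m1[i] = m2[j] = true; matches++; break; }
        }
        if (matches == 0) return 0.0;

        // Transpositions: matched characters that appear in a different order.
        int k = 0, halfTranspositions = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1[i] != s2[k]) halfTranspositions++;
            k++;
        }
        double m = matches, t = halfTranspositions / 2.0;
        return (m / s1.Length + m / s2.Length + (m - t) / m) / 3.0;
    }

    public static double JaroWinkler(string s1, string s2, double prefixScale = 0.1)
    {
        double j = Jaro(s1, s2);
        int prefix = 0; // common prefix, capped at 4 characters per the definition
        int max = Math.Min(4, Math.Min(s1.Length, s2.Length));
        while (prefix < max && s1[prefix] == s2[prefix]) prefix++;
        return j + prefix * prefixScale * (1 - j);
    }

    static void Main()
    {
        Console.WriteLine(JaroWinkler("MARTHA", "MARHTA"));             // ~0.961
        Console.WriteLine(JaroWinkler("companya pty ltd", "companya")); // high score
    }
}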
If you want to do a bit more sophisticated matching, you can try some custom normalization of word forms commonly occurring in company names, such as ltd/limited, inc/incorporated, corp/corporation, to account for case insensitivity, abbreviations etc. This way, if you compute
distance(normalize("foo corp."), normalize("FOO CORPORATION"))
you should get the result to be 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance).
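A sketch of what such a normalize() function could look like in C# (the suffix map is purely illustrative; in practice you would build it from the variants in your own data):

using System;
using System.Text.RegularExpressions;

static class CompanyNames
{
    // Illustrative canonicalizations of common company-name suffixes.
    static readonly (string Pattern, string Replacement)[] Suffixes =
    {
        (@"\blimited\b", "ltd"),
        (@"\bincorporated\b", "inc"),
        (@"\bcorporation\b", "corp"),
    };

    public static string Normalize(string name)
    {
        // Lowercase, strip punctuation, collapse whitespace, canonicalize suffixes.
        string s = name.ToLowerInvariant();
        s = Regex.Replace(s, @"[^\w\s]", " ");
        s = Regex.Replace(s, @"\s+", " ").Trim();
        foreach (var (pattern, replacement) in Suffixes)
            s = Regex.Replace(s, pattern, replacement);
        return s;
    }

    static void Main()
    {
        // Both normalize to "foo corp", so their distance is 0.
        Console.WriteLine(Normalize("foo corp."));
        Console.WriteLine(Normalize("FOO CORPORATION"));
    }
}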
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach: you can precompute the stripped form on each side, then do a straight equality check, which will be a lot faster than cross-comparing every pair and calculating the edit distance.
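A sketch of that precompute-then-compare idea (the Key() helper is just an illustration):

using System;
using System.Text.RegularExpressions;

class ExactAfterStripping
{
    // Compute this key once per name and store it alongside the record.
    static string Key(string name) =>
        Regex.Replace(name, "[^A-Za-z0-9]", "").ToUpperInvariant();

    static void Main()
    {
        Console.WriteLine(Key("W.E.S. Engineering") == Key("WES Engineering"));  // True
        Console.WriteLine(Key("companyA pty. ltd.") == Key("companyA pty ltd")); // True
    }
}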
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with similar name matching requirements to the ones you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, then there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple Google search will turn up all the details.
You can implement all of them in C#.
The irony is that they work when you are matching two given input strings. That is fine in theory, and for demonstrating the way fuzzy or approximate string matching works.
However, the grossly understated point is: how do we use them in production settings? Not everybody I know of who was scouting for an approximate string matching algorithm knew how to make it work in a production environment.
I might as well mention Lucene; it is specific to Java, but there is Lucene for .NET also.
https://lucenenet.apache.org/
I was looking at some code length metrics other than Lines of Code. Something that Source Monitor reports is statements. This seemed like a valuable thing to know, but the way Source Monitor counted some things seemed unintuitive. For example, a for statement is one statement, even though it contains a variable definition, a condition, and an increment statement. And if a method call is nested in an argument list to another method, the whole thing is considered one statement.
Is there a standard way that statements are counted, and are there rules governing such a thing?
The first rule of metrics is "be careful what you measure". You ask for a count of statements, that's what you're going to get. As you note, that figure is perhaps not actually relevant.
If you're interested in other measures, like how "complex" code is, consider looking into other code metrics, like cyclomatic complexity.
http://en.wikipedia.org/wiki/Cyclomatic_complexity
UPDATE: Re: your comment
I agree that "doing too much" is an interesting metric. My rule of thumb is that one statement should have one side effect (usually a "local" side effect like mutating a local variable, but sometimes a visible side effect, like writing to a file) and therefore "number of statements" should be roughly correlated with how much the method is "doing" in terms of its number of side effects.
In practice, of course no one's code, my own included, actually meets that bar all the time. You might consider a metric for "how much the method is doing" to count not just statements but also, say, method calls.
To actually answer your question: I'm not aware of any industry standard that regulates what "number of statements" is. The C# specification certainly defines what a "statement" is lexically, but then of course you have to do some interpretation to do a count. For example:
void M()
{
    try
    {
        if (blah)
        {
            Frob();
            Blob();
        }
    }
    catch (Exception ex)
    { /* eat it */ }
    finally
    {
        Grob();
    }
}
How many statements are there in M? Well, the body of M consists of one statement, a try-catch-finally. So is the answer one? The body of the try contains one statement, an "if" statement. The consequence of the "if" contains one statement -- remember, a block is a statement. The block contains two statements. The finally contains one statement. The catch block contains no statements -- a catch block is not a statement, lexically -- but it certainly is highly relevant to the operation of the method!
So how many statements is that altogether? One could make a reasonable case for any number from one to six, depending on whether you count blocks as "real" statements, whether you consider child statements as in addition to their parent statement or not, and so on. There is no standards body which regulates the answer to this question that I'm aware of.
The closest you might get to a formal definition of "what is a statement" would be the C# specification itself. Good luck working out whether a particular tool's measurement agrees with your reading of the specification.
Given that metrics are best used as a guide to better/worse code, and not a strict formula, does the exact definition used by the tool make much difference?
If I have three methods, with "statement lengths" of 2500, 1500 and 150, I know which method I'll be examining first; that another tool might report 2480, 1620 and 174 isn't too important.
One of the best tools I've seen for measuring metrics is NDepend, though again I'm not 100% sure what definitions it is using. According to the website, NDepend has 82 separate metrics, including Number of instructions and Cyclomatic Complexity.
The C# Metrics Tool defines the things being counted ("statements", "operands", etc.) by using a precise C# BNF language definition. (In fact, it precisely parses the code according to a full C# grammar and then computes structural metrics by walking over the parse tree; the SLOC count it gets by counting lines, as you'd expect.)
You might still argue that such a definition is unintuitive (grammars rarely are), but it is precise. I agree with other posters here, however, that the precise measure isn't as important as the relative value of one block of code with respect to another. A complexity value of "173.92" just isn't very helpful by itself; compared to another complexity value of "81.02", we can say there's a good indication that the first one is more complex than the second, and that's enough to provide a focus of attention.
I think that metrics are also useful for trending: if last week this code was "81.02" complex, and this week it is "173.92", I should wonder why all that is happening in this part of the code.
You might also consider the ratio of a structural metric (e.g., cyclomatic complexity) to SLOC as an indication of "doing too much", or at least an indication of writing code that is way too dense to understand.
One simple metric is to just count the punctuation marks (';', ',', '.') between tokens (so as to avoid those in strings, comments, or numbers). Thus, for (x = 0, y = 1; x < foo.Count; x++, y++) bar[y] = foo[x]; would count as 6.
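A naive C# sketch of such a counter (it skips double-quoted strings, line comments, and decimal points in the simplest possible way; a real implementation would also need to handle block comments, verbatim strings, char literals, etc.):

using System;

class PunctuationMetric
{
    static int Count(string code)
    {
        int count = 0;
        bool inString = false, inLineComment = false;
        for (int i = 0; i < code.Length; i++)
        {
            char c = code[i];
            if (inLineComment) { if (c == '\n') inLineComment = false; continue; }
            if (inString) { if (c == '"') inString = false; continue; }
            if (c == '"') { inString = true; continue; }
            if (c == '/' && i + 1 < code.Length && code[i + 1] == '/') { inLineComment = true; continue; }
            if (c == '.' && i > 0 && char.IsDigit(code[i - 1])) continue; // decimal point in a number
            if (c == ';' || c == ',' || c == '.') count++;
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(Count("for (x = 0, y = 1; x < foo.Count; x++, y++) bar[y] = foo[x];")); // 6
    }
}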
Of these two options:
var result = from c in coll where c % 2 == 0 select c;
var result = coll.Where ( c => c % 2 == 0 );
Which is preferable?
Is there any advantage to using one over the other? To me the second one looks better, but I would like to hear other people's opinions.
If you've only got one or two clauses, I'd go for "dot notation". When you start doing joins, groupings, or anything else that introduces transparent identifiers, query syntax starts to appeal a lot more.
It's often worth trying it both ways and seeing what's the most readable for that particular situation.
In terms of the generated code, they'll be exactly the same in most cases. Occasionally there'll be an overload you can use in dot notation which makes it simpler than the query expression syntax, but value readability over everything else in most cases.
I also have a blog post on this topic. I would definitely recommend that developers should be comfortable with both options - I'd be quite concerned if a colleague were using LINQ but didn't understand the fundamentals of what query expressions were about, for example. (They don't need to know every translation involved, but some idea of what's going on will make their lives a lot easier.)
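To illustrate the point about joins, here is a sketch comparing the two styles on made-up customers/orders collections (all names invented for the example):

using System;
using System.Linq;

class JoinComparison
{
    record Customer(int Id, string Name);
    record Order(int CustomerId, decimal Total);

    static void Main()
    {
        var customers = new[] { new Customer(1, "Ada"), new Customer(2, "Bob") };
        var orders = new[] { new Order(1, 10m), new Order(1, 20m), new Order(2, 5m) };

        // Query syntax: the compiler carries the (c, o) pair for you
        // through a transparent identifier.
        var q1 = from c in customers
                 join o in orders on c.Id equals o.CustomerId
                 select new { c.Name, o.Total };

        // Dot notation: you thread the pair through the lambdas yourself.
        var q2 = customers.Join(orders, c => c.Id, o => o.CustomerId,
                                (c, o) => new { c.Name, o.Total });

        foreach (var row in q1) Console.WriteLine($"{row.Name}: {row.Total}");
    }
}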
I always use the lambda syntax because to me it's clearer what's actually happening, and it looks cool to boot. But we have some devs here that always do the opposite (SQL nerds, I guess :)). Fortunately, tools like ReSharper can transform between the two with a click.