Every time I write a simple lexer and parser, I stumble upon the same question: how should the lexer and the parser communicate? I see four different approaches:
The lexer eagerly converts the entire input string into a vector of tokens. Once this is done, the vector is fed to the parser which converts it into a tree. This is by far the simplest solution to implement, but since all tokens are stored in memory, it wastes a lot of space.
Each time the lexer finds a token, it invokes a function on the parser, passing the current token. In my experience, this only works if the parser can naturally be implemented as a state machine like LALR parsers. By contrast, I don't think it would work at all for recursive descent parsers.
Each time the parser needs a token, it asks the lexer for the next one. This is very easy to implement in C# due to the yield keyword, but quite hard in C++ which doesn't have it.
The lexer and parser communicate through an asynchronous queue. This is commonly known under the title "producer/consumer", and it should simplify the communication between the lexer and the parser a lot. Does it also outperform the other solutions on multicores? Or is lexing too trivial?
Is my analysis sound? Are there other approaches I haven't thought of? What is used in real-world compilers? It would be really cool if compiler writers like Eric Lippert could shed some light on this issue.
While I wouldn't classify much of the above as incorrect, I do believe several items are misleading.
Lexing an entire input before running a parser has many advantages over other options. Implementations vary, but in general the memory required for this operation is not a problem, especially when you consider the type of information that you'd like to have available for reporting compilation errors.
Benefits
Potentially more information available during error reporting.
Languages written in a way that allows lexing to occur before parsing are easier to specify and write compilers for.
Drawbacks
Some languages require context-sensitive lexers that simply cannot operate before the parsing phase.
Language implementation note: This is my preferred strategy, as it results in separable code and is best suited for translation to implementing an IDE for the language.
Parser implementation note: I experimented with ANTLR v3 regarding memory overhead with this strategy. The C target uses over 130 bytes per token, and the Java target uses around 44 bytes per token. With a modified C# target, I showed it's possible to fully represent the tokenized input with only 8 bytes per token, making this strategy practical for even quite large source files.
Language design note: I encourage people designing a new language to do so in a way that allows this parsing strategy, whether or not they end up choosing it for their reference compiler.
It appears you've described a "push" version of what I generally see described as a "pull" parser like you have in #3. My work emphasis has always been on LL parsing, so this wasn't really an option for me. I would be surprised if there are benefits to this over #3, but cannot rule them out.
The most misleading part of this is the statement about C++. Proper use of iterators in C++ makes it exceptionally well suited to this type of behavior.
A queue seems like a rehash of #3 with a middleman. While abstracting independent operations has many advantages in areas like modular software development, a lexer/parser pair for a distributable product offering is highly performance-sensitive, and this type of abstraction removes the ability to do certain types of optimization regarding data structure and memory layout. I would encourage the use of option #3 over this.
As an additional note on multi-core parsing: The initial lexer/parser phases of compilation for a single compilation unit generally cannot be parallelized, nor do they need to be considering how easy it is to simply run parallel compilation tasks on different compilation units (e.g. one lexer/parser operation on each source file, parallelizing across the source files but only using a single thread for any given file).
Regarding other options: For a compiler intended for widespread use (commercial or otherwise), generally implementers choose a parsing strategy and implementation which provides the best performance under the constraints of the target language. Some languages (e.g. Go) can be parsed exceptionally quickly with a simple LR parsing strategy, and using a "more powerful" parsing strategy (read: unnecessary features) would only serve to slow things down. Other languages (e.g. C++) are extremely challenging or impossible to parse with typical algorithms, so slower but more powerful/flexible parsers are employed.
I think there is no golden rule here. Requirements may vary from one case to another. So, reasonable solutions can be different also. Let me comment on your options from my own experience.
"Vector of tokens". This solution may have big memory footprint. Imagine compiling source file with a lot of headers. Storing the token itself is not enough. Error message should contain context with the file name and the line number. It may happen that lexer depends on the parser. Reasonable example: ">>" - is this a shift operator or this is closing of 2 layers of template instantiations? I would not recommend this option.
(2,3). "One part calls another". My impression is that more complex system should call less complex one. I consider lexer to be more simple. This means parser should call lexer. I do not see why C# is better than C++. I implemented C/C++ lexer as a subroutine (in reality this is a complex class) that is called from the grammar based parser. There were no problems in this implementation.
"Communicating processes". This seems to me an overkill. There is nothing wrong in this approach, but maybe it is better to keep the things simple? Multicore aspect. Compiling single file is a relatively rare case. I would recommend to load each core with its own file.
I do not see other reasonable options of combiming lexer and parser together.
I wrote these notes thinking about compiling sources of the software project. Parsing a short query request is completely different thing, and reasons can significantly differ.
My answer is based on my own experience. Other people may see this differently.
The lexer-parser relationship is simpler than the most general case of coroutines, because in general the communication is one-way; the parser does not need to send information back to the lexer. This is why the method of eager generation works (with some penalty, although it does mean that you can discard the input earlier).
As you've observed, if either the lexer or the parser can be written in a reinvocable style then the other can be treated as a simple subroutine. This can always be implemented as a source code transformation, with local variables translated to object slots.
Although C++ doesn't have language support for coroutines, it is possible to make use of library support, in particular fibers. The Unix setcontext family is one option; another is to use multithreading but with a synchronous queue (essentially single-threading but switching between two threads of control).
Also consider for #1 that you lex tokens that don't need it, for example if there is an error, and in addition, you may run low on memory or I/O bandwidth. I believe that the best solution is that employed by parsers generated by tools like Bison, where the parser calls the lexer to get the next token. Minimizes space requirements and memory bandwidth requirements.
#4 is just not going to be worth it. Lexing and parsing are inherently synchronous- there's just not enough processing going on to justify the costs of communication. In addition, typically you parse/lex multiple files concurrently- this can already max out all your cores at once.
The way I handle it in my toy buildsystem project in progress is by having a "file reader" class, with a function bool next_token(std::string&,const std::set<char>&). This class contains one line of input (for error reporting purposes with line number). The function accepts a std::string reference to put the token in, and a std::set<char> which contains the "token-ending" characters. My input class is both parser and lexer, but you could easily split it up if you need more fanciness. So the parsing functions just call next_token and can do their thing, including very detailed error output.
If you need to keep the verbatim input, you'll need to store each line that's read in a vector<string> or something, but not store each token seperately and/or double-y.
The code I'm talking about is located here:
https://github.com/rubenvb/Ambrosia/blob/master/libAmbrosia/Source/nectar_loader.cpp
(search for ::next_token and the extract_nectar function is where is all begins)
Related
I need to develop an application that will read and understand text file in which I'll find a custom language that describe a list of operations (ie cooking recipe). This language has not been defined yet, but it will probably take one of the following shape :
C++ like code
(This code is randomly generated, just for example purpose) :
begin
repeat(10)
{
bar(toto, 10, 1999, xxx);
}
result = foo(xxxx, 10);
if(foo == ok)
{
...
}
else
{
...
}
end
XML code
(This code is randomly generated, just for example purpose) :
<recipe>
<action name="foo" argument"bar, toto, xxx" repeat=10/>
<action name="bar" argument"xxxxx;10" condition="foo == ok">
<true>...</true>
<false>...</false>
</action>
</recipe>
No matter which language will be chosen, there will have to handle simple conditions, loops.
I never did such a thing but at first sight, it occurs to me that describing those operations into XML would be simplier yet less powerful.
After browsing StackOverFlow, I've found some chats on a tool called "ANTLR"... I started reading "The Definitive ANTLR Reference" but since I never done that kind of stuff, I find it hard to know if it's really the kind of tool I need...
In other words, what do I need to read a text file, interpret it properly and perform actions in my C# code. Those operations will interact between themselves by simple conditions like :
If operation1 failed, I do operation2 else operation3.
Repeat the operation4 10 times.
What would be the best language to do describe those text file (XML, my own) ? What are the key points during such developments ?
I hope I'm being clear :)
Thanks a lot for your help and advices !
XML is great for storing relational data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own? E.g:
LUA
Python
In one of my projects I actually started with an XML like language as I already had an XML parser and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely to get passed the problem of figuring out tokenizing/parsing of text files and concentrate instead on your 'language' and the logic of the operations in your language. The down side is writing the text files is a little strange and very wordy. Its also very unnatural for a programmer use to C/C++ syntax.
Eventually you could easily replace your XML with a full blown scanner & lexer to parse a more 'natural C++' like text format into your expression tree.
As for writing a scanner & lexer, I found it easier to write these by hand using simple logic flow/loops for the scanner and recursive decent parser for the lexer.
That said, ANTLR is great at letting you write out rules for your language and generating your scanner & lexer for you. This allows for much more dynamic language which can easily change without having to refactor everything again when new things are added. So, it might be worth looking into as learning this as it would save you much time in rewrites as things change if you hand wrote your own.
I'd recommend writing the app in F#. It has many useful features for parsing strings and xmls like Pattern Matching and Active Patterns.
For parsing C-like code I would recommend F# (just did one interpreter with F#, works like a charm)
For parsing XML's I would recommend C#/F# + XmlDocument class.
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.
The best prefab answer: S-expressions
C and XML are good first steps. They have sort of opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to stick language features - for example, is the number of times to repeat a loop an attribute, element or text?
S-expressions are more terse than XML for sure, maybe even C. At the same time, they're specific to the task of applying operations to data. They don't admit ambiguity. Parsers are simple and easy to find example code for.
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.
I'm evaluating using Coco/R vs. ANTLR for use in a C# project as part of what's essentially a scriptable mail-merge functionality. To parse the (simple) scripts, I'll need a parser.
I've focussed on Coco/R and ANTLR because both seem fairly mature and well-maintained and capable of generating decent C# parsers.
Neither seem to be trivial to use either, however, and simplicity is something I'd appreciate - particularly maintainability by others.
Does anyone have any recommendations to make? What are the pros/cons of either for a parsing a small language - or am I looking into the wrong things entirely? How well do these integrate into a typical continuous integration setup? What are the pitfalls?
Related: Well, many questions, such as 1, 2, 3, 4, 5.
We have used Coco for 2 years, having replaced Antler we were formerly using. For a typical big-data query (our application), our experience has been this. Caveat: We are dependent upon full Utf-8 handling, with the parser implemented in C++. These numbers are for a language that has some 200 EBNF productions.
Antler: 260 usecs/query and a 108 MEGABYTE memory footprint for the generated parser/lexer
Coco: 220 usecs/query and a 70 KBYTE memory footprint for the parser/scanner
Initially, Coco had a 1.2 msecs startup time and generated several 60 KBYTE tables for mapping Utf-8. We have made many local enhancements to Coco, such as to eliminate the big tables, eliminated the 1.2 msec startup time, hugely enhanced internal documentation (as well as documentation in the generated code).
Our version of (open source) Coco has a tiny footprint compared to Antlr and is very measurably faster, has no startup delay and just... works. It does not have Antler's nice UI but that never entered our mind to be an issue once we started using Coco.
ANTLR is LL(*), which is as powerful as PEG, though usually much more efficient and flexible. LL(*) degenerates to LL(k) for k>1 one arbitrary lookahead is not necessary.
If you're simply merging data into a complicated template, consider Terence Parr's StringTemplate engine. He's the man behind ANTLR. StringTemplate may be better suited and easier to use than a full parser generator. It's a very feature-rich template engine.
There is a C# port available in the downloads.
Basically, coco/r generates recursive descent parsers and only supports LL(1) grammars whereas ANTLR uses back-tracking (among other techniques), which allows it to handle more complex grammars. coco/r parsers are much more light-weight and easier to understand and deploy but sometimes it's a struggle getting the grammar into a form that coco/r understands given its one look-ahead constraint - for many common programming language grammars (e.g. C++, SQL), it's not possible at all.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
For someone who’s been happily programming in C# for quite some time now and planning to learn a new language I find the Python community more closely knit than many others.
Personally dynamic typing puts me off, but I am fascinated by the way the Python community rallies around it. There are a lot of other things I expect I would miss in Python (LINQ, expression trees, etc.)
What are the good things about Python that developers love? Stuff that’ll excite me more than C#.
For me its the flexibility and elegance, but there are a handful of things I wish could be pulled in from other languages though (better threading, more robust expressions).
In typical I can write a little bit of code in python and do a lot more than the same amount of lines in many other languages. Also, in python code form is of utmost importance and the syntax lends its self to highly readable, clean looking code. That of course helps out with maintenance.
I love having a command line interpreter that I can quickly prototype an algorithm in rather than having to start up a new project, code, compile, test, repeat. Not to mention the fact I can use it to help me automate my server maintenance as well (I double as a SA for my company).
The last thing that comes to mind immediately is the vast amounts of libraries. There are a lot of things already solved out there, the built-in library has a lot to offer, and the third party ones are many times very good (not always though).
Being able to type in some code and get the result back immediately.
(Disclaimer: I use both C# and Python regularly, and I think both have their good and bad points.)
I'm primarily .NET developer and using Python for me personal projects.
What are the good things about python that developers love?
I can say for myself - Python is like a breath of fresh air.
1) It's simple to learn, took about a week for me in the evenings. I'm saying about Python + Django. Python syntax is quite simple.
2) It's simple to use. No troubles installing Python + Django on Windows at all.
3) It can be run on Windows and UNIX.
4) I need it for web, so I get cheaper hosting than ASP.NET.
5) All the advantages of Python language over C#. Like tuples - so useful!
The only thing I don't like is that my favorite IDE Visual Studio doesn't support it (I know about IronPython, don't you worry).
I'm a very heavy user of both C# and Python; I've built very complicated applications in both languages, and I've also embedded Python scripting in my major C# application. I'm not using either to do much in the way of web work right now, but other than that I feel like I'm pretty qualified to answer the question.
The things about Python that excite me, in particular:
The deep integration of generators into the language. This was the first thing that made me realize that I needed to take a long, serious look at Python. My appreciation for this has deepened considerably since I've become conversant with the itertools module, which looks like a nifty set of tools but is in fact a new way of life.
The coupling of dynamic typing and the fact that everything's an object makes pretty sophisticated techniques extremely simple to implement. It's so easy to replace logic with tables in Python (e.g. o = class_map[k]() instead of if k='foo': o = Foo()) that it becomes a basic technique. It's so normal in Python to write methods that take methods as parameters that you don't raise an eyebrow when you see d = defaultdict(list).
zip, and the methods that are designed with it in mind. It takes a while before you can intuitively grasp what dict(zip(k, v)) and d.update(zip(k, v)) are doing, but it's a paradigm-shifting moment when you get there. An entire universe of uninteresting and potentially error-laden code eliminated, just by using one function. Then you start designing functions and classes with the expectation that they'll be used in conjunction with zip, and suddenly your code gets simpler and easier. (Protip: Or itertools.izip. Or itertools.izip_longest.)
Speaking of dictionaries, the way that they're deeply integrated into the language. Understanding what a line of code like self.__dict__.update(**kwargs) does is another one of those paradigm-shifting moments.
List comprehensions and generator expressions, of course.
Inexpensive exceptions.
An interactive intepreter.
Function decorators.
IronPython, which is so much simpler to use than we have any right to expect.
And that's without even getting into the remarkable array of functionality in the standard modules, or the ridiculous bounty of third-party tools like BeautifulSoup or SQL Alchemy or Pylons.
One of the most direct benefits that I've gotten from getting deeply into Python is that it has greatly improved my C# code. I could generally understand code that had a variable of type Dictionary<string, Action<Foo>> in it, but it didn't seem natural to write it. (I use static dictionaries to replace hard-coded logic far more frequently today than I did a year ago.) I have no difficulty understanding what LINQ is doing now, or how IEnumerable<T> and return yield work.
So what don't I like about Python?
Dynamic typing really limits what you can do with static code analysis. Not only isn't there a tool like Resharper for Python, in a language where it's possible to write getattr(x, y)() there really can't be.
It has a bunch of inelegant conventions. How I would love to be able to go back in time and try to talk GVR out of the idea that lambda expressions should be introduced with the word lambda - it's pretty damning that something as fundamental as lambda expressions should be more concise in C# than they are in Python. The leading and trailing double-underscore convention is horrible, and the fact that people mutely acquiesce to it is testimony to Dostoevsky's observation that man is the animal who can get used to anything. And don't get me started on the fact that a module with the name of StringIO was allowed to get out the door.
Some of the features that make Python work on multiple platforms also make it kind of baffling. It's easy to use import, but it's really not easy to understand what the hell it's actually doing. (Where is it looking? What does __init__.py do? Etc.)
The amazingly rich library of standard modules is so amazingly rich that it's hard to know what's in it. It's often easier to write a function than it is to find out whether or not there's something in the standard library that does the same thing - I'm looking at you, itertools.chain.
Your question is kind of like a plumber asking why carpenters are always going on and on about hammers. After all the plumber doesn't have a hammer and has never missed it. Python (even IronPython) and C# target different types of developers and different types of programs. I am very comfortable in Python and enjoy the freedom to focus on the business rules without being distracted by the syntax requirements of the language. On the other hand I have written some fairly substantial code in C# and would be very concerned about the lack of type safety had I taken on the same task in Python. This is not to say that Python is a "toy" language. You can (and people have) write a complete medium or large application in Python. You have the freedom of dynamic typing, but you also have the responsibility to keep it all straight (frameworks help here). Similarly you can write a small application in C#, but you will bring along some overhead you do not likely need.
So if the problem is a nail use a hammer, if the problem is a screw use a screw driver. In other words spend some time to learn Python, get to know it's streangths (text processing, quick coding cycles, simple clean code, etc) and then when you are looking at tackling a new problem ask whether you would be better off in Python or C#. One thing is certain. So long as C# is the only programming language you know, it is the only one you will ever use.
Pat O
My language of choice is C#, and I didn't quite see the point for me to learn Python so far. This talk from PDC09 really piqued my interest: the guy demonstrates how you can use IronPython (or IronRuby) to make a C# app scriptable (in his demo, drop a Python script in a text box, and it works with/extends your C# code). I found this really fascinating: I don't even know where I would start to do something similar in C#, and this made me at least appreciate that it brings something different to the table, which could really enrich what I can develop!
I'm an asymmetrical user of both languages, in a sense that I use C# mostly professionally and Python for all my "fun" projects (not that work is never fun, but... you know...)
This difference of context may skew my perspective, including my opinion that they are two distinct types (pun intended) of languages for, generally, distinct purposes.
This said, it may not be a coincidence that Python is, at this point in time, [one of?] the languages of choice for all kinds of cutting edge, somewhat scholarly, technology/science oriented projects. (And BTW, this "scholarly" keyword here doesNOT imply, that Python is a university toy, plenty of "serious" applications in plenty of domains/industry are proof to the contrary). This may be due to several factors:
(I don't develop most points, readily well expressed in other responses)
the openness and quasi universal availability of Python (unlike C# !)
the lightweight / ease of use / low learning curve
the extensive, high quality, "standard" library and the extensiver (and occasionally bum quality, but on the whole available, open-sourced, etc.) additional library.
the wide array of open source projects in Python language
the relative ease to bind with C/C++ for reusing legacy code, but also for placing performance-critical portions of a project
the generally higher level of abstraction of may constructs of the language
the multi-paradigms (imperative, object oriented and functional)
the availability of practitioners in so many domain of science and technology
and, yes, the
"herd mentality effect" mentioned in a remark, possibly in a [self?] deriding way. The fact that a language attracts a broad, "closely knit" community, makes it attractive too, beyond the superficial ("look cool" and such) traits of herd mentality. Put in broader context, sometimes the best technology/language to use is not measured on the its intrinsic merits but on the overall "picture", including the user community.
I like all stuff with [] and {}. Selectors like this [-1:1]. Possibility to write less code, but more something meaningfull, that gives to write Models and other declarative things very DRY.
Like any programming language, it is just a tool in the box or a brush by which you may paint your creation. Any creative endeavour requires that the artist loves the tools he uses; otherwise, the outcome suffers. Some people like Python for the same reason others love Perl. Incidentally, I have found that most Python lovers loathe Perl's flexible and expressive syntax. As a Perl lover, I don't hate Python, but consider it to be overly structured and restrictive.
If you ask me, all of these throngs of people who seem to love Python were silently suffering under the tool choices before Python came into being. Some suffered under Perl, others under something else. In other words, I believe that when Python came along, it found a large group of silent sufferers longing for a tool like Python.
I can't program in Python because I can't "think" in Python. I can "think" in Perl, therefore, it is the tool I prefer. The silently suffering mass of, now, Python users seem to have found some long lost salvation. Now if they could only keep their evangelism to themselves :).
If you are familiar with the .NET CLR and prefer a statically-typed language, but you like Python's lightweight syntax, then perhaps Boo is the language for you.
Don't get me wrong, I am and will always be a devoted fan of C#.
But sometimes there are things I can't do in C#. lthough C# keeps reducing those gaps, Python is still the language I go to to fill them.
It's dynamic, flexible, powerful, and clean. Lovely language. Whenever I need to script or build dynamic or functional (as in functional programming) software, I go Python.
For me Python is the most elegant language I've used. The syntax is minimalist (significantly less punctuation than most) and intentionally modeled after the psuedo-code conventions which are ubiquitously used by programmer to outline their intentions.
Python's if __name__ == '__main__': suite encourages re-use and test driven development.
For example, the night before last I hacked together to run thousands of ssh jobs (with about 100 concurrently) and gather up all the results (output, error messages, exit values) ... and record the time take on each. It also handles timeouts (An ssh command can stall indefinitely on connection to a thrashing system --- it's connection timeouts and retry options don't apply after the socket connection is made, not matter if the authentication stalls). This only takes a few dozen lines of Python and it's really is easiest to create it as a class (defined above the __main__ suite) and do my command line parsing in a simple wrapper down inside __main__. That's sufficient to do the job at hand (I ran the script on 25,000 hosts the next day, in about two hours). It I can now use this code in other scripts as easily as:
from sshwrap import SSHJobMan
cmd = '/etc/init.d/foo restart'
targets = queryDB(some_criteria)
job = SSHJobMan(cmd, targets)
job.start()
while not job.done():
completed = job.poll()
# ...
# Deal with incremental disposition of of completed jobs
for each in sorted(job.results):
# ...
# Summarize results
... and so on.
So my script can be used for simple jobs ... and it can be imported as a module for more specialized work that couldn't be described on my wrapper's command line. (For example I could start up "consumer" subprocesses for handling other work on each host where the job was successful while spitting out service tickets or automated reboot requests for all hosts reporting timeouts or failures, etc).
For modules which have no standalone usage I can use the __main__ suite to contain unit-tests. Thus every module can contain its own tests ... which, in fact, can be integrated into the "doc strings" using the doctest module from the standard libraries. (Which, incidentally, means that properly formatted examples in the documentary comments can be kept in sync with the implementation ... since they are parts of the unit-test suite).
The main thing I like about Python is its very concise, readable syntax. Though using indentation as a block delimiter can seem strange at first, once you begin to code a lot in the language I find it begins to make sense. Though the core language is quite simple, its more advanced features, e.g. list comprehension, decorators and generators, are rather useful too.
In addition, the Python standard library is just fantastic; its documentation is very well written, and it contains a lot of very useful packages. I also find that there are plenty of good bindings for C libraries, such as PyGTK, Webkit and Qt, to name but a few.
One caveat is that Python, like most dynamic languages, is quite slow in comparison with compiled, statically-typed languages. However, you can easily extend it with C, allowing you to write code requiring better performance in C and the rest in Python.
It's a great language overall, and (for me at least) makes coding more productive and enjoyable.
I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actual generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approch at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?
Its pretty unclear what you are doing, but what does seem clear is that you have some base line code, and based on some its properties, you want to generate more code.
So the key issue here are, given the base line code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running (well, at least loaded) into the same execution enviroment as the reflection user code. The problem with reflection is it only provides a very limited set of properties, typically lists of classes, methods, or perhaps names of arguments. IF all the code generation you want to do can be done with just that, well, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
In fact, the only artifact from which truly arbitrary code properties can be extracted is the the source code as a character string (how else could you answer, is the number of characters between the add operator and T in middle of the variable name is a prime number?). As a practical matter, properties you can get from character strings are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 60 years figuring out how to extract interesting program properties and you'd be a complete idiot to ignore what they've learned in that half century.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, ponter analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of question about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible and what-type. If you have CFGs, you can answer questions about "this-before-that", "what conditions does statement X depend upon". If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support).
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generation of machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, and to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and provide "rewrite" rules that say in effect, "when you see this pattern, then replace it by that pattern under this condition".
By connecting the condition to various property-extracting mechanisms from the compiler guys, you get relatively easy way to say what you want backed up by that 50 years of experience. Such program transformation systems have the ability to read in source code,
carry out analysis and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read in the base line code into ASTs, apply analyses to determine properties of interesting, use transformations to generate new ASTs, and then spit out the answer.
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source code langauges, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the
DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other langauges.
(I'm the architect of DMS, so I have a rather biased view. YMMV).
Have you considered using T4 templates for performing the code generation? It looks like it's getting much more publicity and attention now and more support in VS2010.
This tutorial seems database centric but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ in addition there was a recent Hanselminutes on T4 here: http://www.hanselminutes.com/default.aspx?showID=170.
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
EDIT: (By asker, new developments)
As of VS2012, T4 now supports reflection over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version, without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.
You may wish to use CodeDom, so that you only have to build once.
First, I would read this CodeProject article to make sure there are not language-specific features you'd be unable to support without using Reflection.
From what I understand, you could use something like Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programatically analyze your existing c# source.
This looks pretty involved to me though, and CCI apparently only has full support for C# language spec 2. A better strategy may be to streamline your existing method instead.
I'm not sure of the best way to do this, but you could do this
As a post-build step on your base dll, run the code generator
As another post-build step, run csc or msbuild to build the generated dll
Other things which depend on the generated dll will also need to depend on the base dll, so the build order remains correct
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I've recently purchased a Mac and use it primarily for C# development under VMWare Fusion. With all the nice Mac applications around I've started thinking about Xcode lurking just an install click away, and learning Objective-C.
The syntax between the two languages looks very different, presumably because Objective-C has its origins in C and C# has its origins in Java/C++. But different syntaxes can be learnt so that should be OK.
My main concern is working with the language and if it will help to produce well-structured, readable and elegant code. I really enjoy features such as LINQ and var in C# and wonder if there are equivalents or better/different features in Objective-C.
What language features will I miss developing with Objective-C? What features will I gain?
Edit: The framework comparisons are useful and interesting but a language comparison are what this question is really asking (partly my fault for originally tagging with .net). Presumably both Cocoa and .NET are very rich frameworks in their own right and both have their purpose, one targeting Mac OS X and the other Windows.
Thank you for the well thought out and reasonably balanced viewpoints so far!
No language is perfect for all tasks, and Objective-C is no exception, but there are some very specific niceties. Like using LINQ and var (for which I'm not aware of a direct replacement), some of these are strictly language-related, and others are framework-related.
(NOTE: Just as C# is tightly coupled with .NET, Objective-C is tightly coupled with Cocoa. Hence, some of my points may seem unrelated to Objective-C, but Objective-C without Cocoa is akin to C# without .NET / WPF / LINQ, running under Mono, etc. It's just not the way things are usually done.)
I won't pretend to fully elaborate the differences, pros, and cons, but here are some that jump to mind.
One of the best parts of Objective-C is the dynamic nature — rather than calling methods, you send messages, which the runtime routes dynamically. Combined (judiciously) with dynamic typing, this can make a lot of powerful patterns simpler or even trivial to implement.
As a strict superset of C, Objective-C trusts that you know what you're doing. Unlike the managed and/or typesafe approach of languages like C# and Java, Objective-C lets you do what you want and experience the consequences. Obviously this can be dangerous at times, but the fact that the language doesn't actively prevent you from doing most things is quite powerful. (EDIT: I should clarify that C# also has "unsafe" features and functionality, but they default behavior is managed code, which you have to explicitly opt out of. By comparison, Java only allows for typesafe code, and never exposes raw pointers in the way that C and others do.)
Categories (adding/modifying methods on a class without subclassing or having access to source) is an awesome double-edged sword. It can vastly simplify inheritance hierarchies and eliminate code, but if you do something strange, the results can sometimes be baffling.
Cocoa makes creating GUI apps much simpler in many ways, but you do have to wrap your head around the paradigm. MVC design is pervasive in Cocoa, and patterns such as delegates, notifications, and multi-threaded GUI apps are well-suited to Objective-C.
Cocoa bindings and key-value observing can eliminate tons of glue code, and the Cocoa frameworks leverage this extensively. Objective-C's dynamic dispatch works hand-in-hand with this, so the type of the object doesn't matter as long as it's key-value compliant.
You will likely miss generics and namespaces, and they have their benefits, but in the Objective-C mindset and paradigm, they would be niceties rather than necessities. (Generics are all about type safety and avoiding casting, but dynamic typing in Objective-C makes this essentially a non-issue. Namespaces would be nice if done well, but it's simple enough to avoid conflicts that the cost arguably outweighs the benefits, especially for legacy code.)
For concurrency, Blocks (a new language feature in Snow Leopard, and implemented in scores of Cocoa APIs) are extremely useful. A few lines (frequently coupled with Grand Central Dispatch, which is part of libsystem on 10.6) can eliminates significant boilerplate of callback functions, context, etc. (Blocks can also be used in C and C++, and could certainly be added to C#, which would be awesome.) NSOperationQueue is also a very convenient way to add concurrency to your own code, by dispatching either custom NSOperation subclasses or anonymous blocks which GCD automatically executes on one or more different threads for you.
I've been programming in C, C++ and C# now for over 20 years, first started in 1990. I have just decided to have a look at the iPhone development and Xcode and Objective-C. Oh my goodness... all the complaints about Microsoft I take back, I realise now how bad things code have been. Objective-C is over complex compared to what C# does. I have been spoilt with C# and now I appreciate all the hard work Microsoft have put in. Just reading Objective-C with method invokes is difficult to read. C# is elegant in this. That is just my opinion, I hoped that the Apple development language was a good as the Apple products, but dear me, they have a lot to learn from Microsoft. There is no question C#.NET application I can get an application up and running many times faster than XCode Objective-C. Apple should certainly take a leaf out of Microsoft's book here and then we'd have the perfect environment. :-)
No technical review here, but I just find Objective-C much less readable.
Given the example Cinder6 gave you:
C#
List<string> strings = new List<string>();
strings.Add("xyzzy"); // takes only strings
strings.Add(15); // compiler error
string x = strings[0]; // guaranteed to be a string
strings.RemoveAt(0); // or non-existant (yielding an exception)
Objective-C
NSMutableArray *strings = [NSMutableArray array];
[strings addObject:#"xyzzy"];
[strings addObject:#15];
NSString *x = strings[0];
[strings removeObjectAtIndex:0];
It looks awful. I even tried reading 2 books on it, they lost me early on,
and normally I don't get that with programming books / languages.
I'm glad we have Mono for Mac OS, because if I'd had to rely on Apple
to give me a good development environment...
Manual memory management is something beginners to Objective-C seems to have most problem with, mostly because they think it is more complex than it is.
Objective-C and Cocoa by extension relies on conventions over enforcement; know and follow a very small set of rules and you get a lot for free by the dynamic run-time in return.
The not 100% true rule, but good enough for everyday is:
Every call to alloc should be matched with a release at the end of the current scope.
If the return value for your method has been obtained by alloc then it should be returned by return [value autorelease]; instead of being matched by a release.
Use properties, and there is no rule three.
The longer explanation follows.
Memory management is based on ownership; only the owner of an object instance should ever release the object, everybody else should always do nothing. This mean that in 95% of all code you treat Objective-C as if it was garbage collected.
So what about the other 5%? You have three methods to look out for, any object instance received from these method are owned by the current method scope:
alloc
Any method beginning with the word new, such as new or newService.
Any method containing the word copy, such as copy and mutableCopy.
The method have three possible options as of what to do with it's owned object instances before it exits:
Release it using release if it is no longer needed.
Give ownership to the a field (instance variable), or a global variable by simply assigning it.
Relinquish ownership but give someone else a chance to take ownership before the instance goes away by calling autorelease.
So when should you pro-actively take ownership by calling retain? Two cases:
When assigning fields in your initializers.
When manually implementing setter method.
Sure, if everything you saw in your life is Objective C, then its syntax looks like the only possible. We could call you a "programming virgin".
But since lots of code is written in C, C++, Java, JavaScript, Pascal and other languages, you'll see that ObjectiveC is different from all of them, but not in a good way. Did they have a reason for this? Let's see other popular languages:
C++ added a lot extras to C, but it changed the original syntax only as much as needed.
C# added a lot extras compared to C++ but it changed only things that were ugly in C++ (like removing the "::" from the interface).
Java changed a lot of things, but it kept the familiar syntax except in parts where the change was needed.
JavaScript is a completely dynamic language that can do many things ObjectiveC can't. Still, its creators didn't invent a new way of calling methods and passing parameters just to be different from the rest of the world.
Visual Basic can pass parameters out of order, just like ObjectiveC. You can name the parameters, but you can also pass them the regular way. Whatever you use, it's normal comma-delimited way that everyone understands. Comma is the usual delimiter, not just in programming languages, but in books, newspapers, and written language in general.
Object Pascal has a different syntax than C, but its syntax is actually EASIER to read for the programmer (maybe not to the computer, but who cares what computer thinks). So maybe they digressed, but at least their result is better.
Python has a different syntax, which is even easier to read (for humans) than Pascal. So when they changed it, making it different, at least they made it better for us programmers.
And then we have ObjectiveC. Adding some improvements to C, but inventing its own interface syntax, method calling, parameter passing and what not. I wonder why didn't they swap + and - so that plus subtracts two numbers. It would have been even cooler.
Steve Jobs screwed up by supporting ObjectiveC. Of course he can't support C#, which is better, but belongs to his worst competitor. So this is a political decision, not a practical one. Technology always suffers when tech decisions are made for political reasons. He should lead the company, which he does good, and leave programming matters to real experts.
I'm sure there would be even more apps for iPhone if he decided to write iOS and support libraries in any other language than ObjectiveC. To everyone except die-hard fans, virgin programmers and Steve Jobs, ObjectiveC looks ridiculous, ugly and repulsive.
One thing I love about objective-c is that the object system is based on messages, it lets you do really nice things you couldn't do in C# (at least not until they support the dynamic keyword!).
Another great thing about writing cocoa apps is Interface Builder, it's a lot nicer than the forms designer in Visual Studio.
The things about obj-c that annoy me (as a C# developer) are the fact that you have to manage your own memory (there's garbage collection, but that doesn't work on the iPhone) and that it can be very verbose because of the selector syntax and all the [ ].
As a programmer just getting started with Objective-C for iPhone, coming from C# 4.0, I'm missing lambda expressions, and in particular, Linq-to-XML. The lambda expressions are C#-specific, while the Linq-to-XML is really more of a .NET vs. Cocoa contrast. In a sample app I was writing, I had some XML in a string. I wanted to parse the elements of that XML into a collection of objects.
To accomplish this in Objective-C/Cocoa, I had to use the NSXmlParser class. This class relies on another object which implements the NSXMLParserDelegate protocol with methods that are called (read: messages sent) when an element open tag is read, when some data is read (usually inside the element), and when some element end tag is read. You have to keep track of the parsing status and state. And I honestly have no idea what happens if the XML is invalid. It's great for getting down to the details and optimize performance, but oh man, that's a whole lot of code.
By contrast, here's the code in C#:
using System.Linq.Xml;
XDocument doc = XDocument.Load(xmlString);
IEnumerable<MyCustomObject> objects = doc.Descendants().Select(
d => new MyCustomObject{ Name = d.Value});
And that's it, you've got a collection of custom objects drawn from XML. If you wanted to filter those elements by value, or only to those that contain a specific attribute, or if you just wanted the first 5, or to skip the first 1 and get the next 3, or just find out if any elements were returned... BAM, all right there in the same line of code.
There are many open-source classes that make this processing a lot easier in Objective-C, so that does much of the heavy lifting. It's just not this built in.
*NOTE: I didn't actually compile the code above, it's just meant as an example to illustrate the relative lack of verbosity required by C#.
Probably most important difference is memory management. With C# you get garbage collection, by virtue of it being a CLR based language. With Objective-C you need to manage memory yourself.
If you're coming from a C# background (or any modern language for that matter), moving to a language without automatic memory management will be really painful, as you will spend a lot of your coding time on properly managing memory (and debugging as well).
Here's a pretty good article comparing the two languages:
http://www.coderetard.com/2008/03/16/c-vs-objective-c/
Other than the paradigm difference between the 2 languages, there's not a lot of difference. As much as I hate to say it, you can do the same kind of things (probably not as easily) with .NET and C# as you can with Objective-C and Cocoa. As of Leopard, Objective-C 2.0 has garbage collection, so you don't have to manage memory yourself unless you want to (code compatibility with older Macs and iPhone apps are 2 reasons to want to).
As far as structured, readable code is concerned, much of the burden there lies with the programmer, as with any other language. However, I find that the message passing paradigm lends itself well to readable code provided you name your functions/methods appropriately (again, just like any other language).
I'll be the first to admit that I'm not very familiar with C# or .NET. But the reasons Quinn listed above are quite a few reasons that I don't care to become so.
The method calls used in obj-c make for easily read code, in my opinion much more elegant than c# and obj-c is built on top of c so all c code should work fine in obj-c. The big seller for me though is that obj-c is an open standard so you can find compilers for any system.