I'm working on a hobby project to port an existing markup library into a C# / .NET Class Library. If you're familiar with Markdown, it's a similar concept.
One early question: the language has a syntax for marking a section of text so that it is not processed by any of the other syntax rules, and I'd like some advice on handling this.
One method that occurs to me is to search for these sections first, remove and replace them with some sort of meaningful token, run the rest of the processing rules, and then as the last step, replace the tokens with the text they represent.
Is that what makes the most sense to others? Also, how would you generate the tokens such that you don't face the possibility that you might accidentally create a token that matches existing text?
Any help / advice appreciated!
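To make this concrete, here's roughly the shape I have in mind; the {{nowiki}} delimiters and all the names are placeholders I made up, and a fresh GUID (wrapped in a control character the markup can't contain) makes an accidental collision with existing text effectively impossible:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch of the extract/replace/restore idea. {{nowiki}}...{{/nowiki}} is a
// made-up delimiter pair standing in for the real "don't process this" syntax.
class NoProcessBlocks
{
    private readonly Dictionary<string, string> _saved = new Dictionary<string, string>();

    // Step 1: replace each protected section with a unique placeholder token.
    public string Extract(string input) =>
        Regex.Replace(input, @"\{\{nowiki\}\}(.*?)\{\{/nowiki\}\}", match =>
        {
            string token = "\u0001" + Guid.NewGuid().ToString("N") + "\u0001";
            _saved[token] = match.Groups[1].Value;
            return token;
        }, RegexOptions.Singleline);

    // Step 3: after all other rules have run, put the original text back.
    public string Restore(string processed)
    {
        foreach (var pair in _saved)
            processed = processed.Replace(pair.Key, pair.Value);
        return processed;
    }
}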
Why not use a proper parser generator to create your tokenizer?
You could probably knock something together with ANTLR in a few hours.
I'm looking for a command-line tool where I can specify regex patterns (or similar) for certain file extensions (e.g. cs files, js files, xaml files) that can provide errors/warnings when run, like during a build. These would scan plain-text source code of all types.
I know there are tools for specific languages... I plan on using those too. This tool is for quick patterns we want to flag where we don't want to invest in writing a Roslyn rule, for example. I'd like to flag certain patterns or API usages in an easy way, where anyone can add a new rule without thinking too hard. Often we don't add rules because it is hard.
Features like source tokenization are a bonus. Open-source / free is a mega bonus.
Is there such a tool?
If you want to go old-skool, you can dust off Awk for this one.
It scans files line by line (for some configurable definition of line, with a sane default), cuts each line into fields (on whitespace, if memory serves), applies a set of regexes, and fires the code behind each matching regex. There are also conditions for matching the beginning and end of a file, to print headers/footers.
It seems to be what you want, but IMHO a Perl or Ruby script is easier; one of those replaced Awk for me a long time ago. Still, Awk IS simple and straightforward for your use case, AFAICT.
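If you do end up scripting it yourself, the same scan-every-line-with-a-rule-table idea is only a screenful of C#. A minimal sketch (the rules, paths, and MSBuild-style message format are all illustrative):

using System;
using System.IO;
using System.Text.RegularExpressions;

// Scan every file under the given root; for each line, fire every rule whose
// extension matches and whose pattern hits, printing a build-style warning.
class LintScan
{
    static void Main(string[] args)
    {
        var rules = new (string Extension, Regex Pattern, string Message)[]
        {
            (".cs", new Regex(@"Console\.WriteLine"), "use the logger, not Console"),
            (".js", new Regex(@"\beval\s*\("),        "avoid eval()"),
        };

        foreach (string file in Directory.EnumerateFiles(args[0], "*.*", SearchOption.AllDirectories))
        {
            string ext = Path.GetExtension(file);
            int lineNo = 0;
            foreach (string line in File.ReadLines(file))
            {
                lineNo++;
                foreach (var rule in rules)
                    if (rule.Extension == ext && rule.Pattern.IsMatch(line))
                        Console.WriteLine($"{file}({lineNo}): warning: {rule.Message}");
            }
        }
    }
}

Anyone can add a rule by appending one line to the table, which addresses the "without thinking too hard" requirement.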
Our Source Code Search Engine (SCSE) can do this.
SCSE lexes a set of source files (using language-accurate tokenization that skips language-specific whitespace but retains comments) and then builds a token index for each token type. One can provide SCSE with token-based search queries such as:
'if' '(' I '='
to search for patterns in the source code; this example "lints" C-like code for the common mistake of assigning a variable (I for "identifier") in an IF statement caused by accidental use of '=' instead of the intended '=='.
The search uses the token indexes to speed things up. Typically SCSE can search millions of lines of code in a few seconds, far faster than grep or any other scheme that insists on re-reading the file contents for each query. It also produces fewer false positives, because the token checks are exact, and the queries are much easier to write because one does not have to worry about whitespace, line breaks, or comments.
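For flavor, the core of a token index is a small data structure. A minimal C# sketch (purely illustrative, not SCSE's actual design, and omitting token-class wildcards like I):

using System;
using System.Collections.Generic;
using System.Linq;

// Map each token to every position where it occurs; a sequence query then
// only inspects the recorded positions of its first token.
class TokenIndex
{
    private readonly List<string> _tokens = new List<string>();
    private readonly Dictionary<string, List<int>> _positions = new Dictionary<string, List<int>>();

    public void Add(string token)
    {
        if (!_positions.TryGetValue(token, out var list))
            _positions[token] = list = new List<int>();
        list.Add(_tokens.Count);
        _tokens.Add(token);
    }

    // Yield every position where the query tokens occur consecutively,
    // e.g. Find("if", "(", "x", "=") for the lint example above.
    public IEnumerable<int> Find(params string[] query) =>
        _positions.TryGetValue(query[0], out var starts)
            ? starts.Where(p => query.Select((q, k) => (q, k))
                                     .All(t => p + t.k < _tokens.Count && _tokens[p + t.k] == t.q))
            : Enumerable.Empty<int>();
}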
A list of hits on the pattern can be logged or merely counted.
Normally SCSE is used interactively; queries produce a list of hits, and clicking on a hit produces a view of a page of the source text with the hit superimposed. However, one can also script calls on the SCSE.
SCSE can be obtained with language-accurate lexers for some 40 languages.
I have a generated file containing multiple classes that I want to split into multiple files, each containing just one class.
The code is in C#.
Is there a program that can do this (preferably with source code available)? Is there a simple Regex that can extract the classes/interfaces?
I don't believe regex would be the correct strategy for parsing C# code. It would probably work in some simple cases, but you will run into situations that trick it. Think, for example, of a commented-out, unbalanced '{' in the code.
I suggest you investigate this other SO question about how to parse C# code: Parser for C#.
If that's a one-off, do the least thing that (quickly) works. So if it is just one or a few files and you're not after a very generic solution, I would identify all the headers (public class Foo and the like), for example with the help of Notepad++ and a recorded macro, manually correcting the results, and then I'd write a little C# program to split the file at those headers, as sketched below.
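A minimal sketch of that splitter, assuming each header sits on its own line and classes are not nested (anything before the first header, such as using directives, is dropped and would need the manual pass mentioned above):

using System;
using System.IO;
using System.Text.RegularExpressions;

// Start a new output file each time a class/interface header is seen,
// then copy every subsequent line into the current file.
class SplitClasses
{
    static void Main(string[] args)
    {
        var header = new Regex(@"^\s*(public|internal)?\s*(class|interface)\s+(\w+)");
        StreamWriter current = null;

        foreach (string line in File.ReadLines(args[0]))
        {
            Match m = header.Match(line);
            if (m.Success)
            {
                current?.Dispose();
                current = new StreamWriter(m.Groups[3].Value + ".cs");
            }
            current?.WriteLine(line);
        }
        current?.Dispose();
    }
}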
I think you should try Visual Studio macro programming.
I'm interested in how a parser like the Razor view engine can parse two distinct languages like C# and JavaScript.
It's very cool that the following works, for instance:
$("#fm_duedate").val('#DateTime.Now.AddMonths(1).ToString("MM/dd/yyyy")');
I'm going to try to look at the source, but I'm curious whether there's some kind of theoretical foundation for a parser like this, or whether it's more brute force, like taking the union of the two languages and parsing that.
Trying to reason it out for myself, I'd say "you start with a parser for each language, then you add to each one a set of productions that switch it to the other", but I doubt it's so simple.
I guess the perfect answer would be a pointer to a discussion of how the Razor engine is implemented, or a walk-through of the source (I haven't actually Googled this for fear of going down a rabbit hole). Alternatively, just some insight into how the problem of parsing two languages is approached would be great.
As Corey points out, Razor and similar frameworks do not do anything particularly fancy.
However there are some more theoretically sound models for building parsers for languages where one language is embedded in another. My erstwhile colleague Luke Hoban has a great introductory article on parser combinators, which afford a very nice way to build a parser for one-language-embedded-in-another-language scenarios:
http://blogs.msdn.com/b/lukeh/archive/2007/08/19/monadic-parser-combinators-using-c-3-0.aspx
The wikipedia page is pretty straightforward as well:
http://en.wikipedia.org/wiki/Parser_combinator
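To give a feel for the style in C#, here's a minimal sketch of the core idea (the names Sat, Or, and Many are mine, not from the article). The payoff for the embedded-language scenario is that an "outer language" parser and an "inner language" parser are just two values you can combine:

using System;
using System.Collections.Generic;

// A parser is a function: given the input and a position, it returns the
// parsed value plus the next position, or null on failure.
delegate (T Value, int Next)? Parser<T>(string s, int pos);

static class Combinators
{
    // Match one character satisfying a predicate.
    public static Parser<char> Sat(Func<char, bool> pred) => (s, i) =>
        i < s.Length && pred(s[i]) ? ((char, int)?)(s[i], i + 1) : null;

    // Try 'first'; if it fails at this position, try 'second' instead.
    public static Parser<T> Or<T>(Parser<T> first, Parser<T> second) => (s, i) =>
        first(s, i) ?? second(s, i);

    // Apply 'p' zero or more times, collecting the results;
    // e.g. Many(Sat(char.IsDigit)) consumes a run of digits.
    public static Parser<List<T>> Many<T>(Parser<T> p) => (s, i) =>
    {
        var items = new List<T>();
        while (p(s, i) is (var v, var next)) { items.Add(v); i = next; }
        return (items, i);
    };
}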
Razor (and the other view engines) do not parse the HTML or JavaScript of a view. Instead they parse the text to detect specific tokens, with no real concern about the surrounding text.
In the case of Razor, every @ character in the source file is processed as a code block of some sort. Razor is quite smart about detecting the expression that follows the @ character, including handling things like @foreach (var x in collection) { and locating the closing } while not trying to parse the HTML (or JavaScript) inside. It also lets you use @{ } and @( ) to override the processing to a degree.
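To illustrate the flavor of that scan, here's a toy version in C#: it hunts for @ and consumes a dotted member chain with balanced parentheses, ignoring everything around it. Razor's real scanner is far more sophisticated; this is only the core trick.

using System;

class TransitionScanner
{
    static void Main()
    {
        string view = "$(\"#fm_duedate\").val('@DateTime.Now.AddMonths(1).ToString(\"MM/dd/yyyy\")');";

        for (int i = 0; i < view.Length; i++)
        {
            if (view[i] != '@') continue;

            int j = i + 1, depth = 0;
            while (j < view.Length &&
                   (depth > 0 || char.IsLetterOrDigit(view[j]) || view[j] == '.' ||
                    view[j] == '(' || view[j] == ')'))
            {
                if (view[j] == '(') depth++;
                if (view[j] == ')') { if (depth == 0) break; depth--; }
                j++;
            }

            // Everything between '@' and here is treated as a code expression.
            Console.WriteLine(view.Substring(i + 1, j - i - 1));
            i = j;
        }
    }
}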
I find the ASPX <%...%> format simpler to read, since I've used that format more and I've got some established pattern recognition going on for those. Having explicit start/finish tokens is simpler to process and simpler to read in-place.
I am about to write a parser for OpenEdge (a 4GL database language) and I would like to use ANTLR (or similar).
There are two reasons I think this may be a problem:
OpenEdge is a 4GL database language which allows constructs like:
assign
    customer.name = 'Customer name'
    customer.age = 20
    .
Here the . at the end is the statement terminator, and this single statement combines the assignment of two database fields. OpenEdge has many more constructs like this;
I need to preserve all details of the source files, so I cannot expand preprocessor statements before parsing the file. For example:
// file myinc.i
7 * 14
// source.p
assign customer.age = {myinc.i}.
In the above example, I need to preserve the fact that customer.age was assigned using {myinc.i} instead of 7 * 14.
Can I use ANTLR to achieve this, or do I need to write my own parser?
UPDATE:
I need this parser not for generating an executable, but for code analysis. This is why I need the AST to record the fact that the include was used.
Just to clarify: ANTLR isn't a parser, but a parser generator.
You either write your own parser for the language, or you write a (ANTLR) grammar for it, and let ANTLR generate the lexer and parser for you. You can mix custom code in your grammar to keep track of your assignments.
So, the answer is: yes, you can use ANTLR.
Note that I am unfamiliar with OpenEdge, but languages like SQL are usually tough to write parsers or grammars for. Have a look at the ANTLR wiki to see that it's no trivial task to write one from the ground up. You didn't mention it, but I assume you've looked at existing parsers that can handle your language?
FYI: you might already have it, but here's a link to the documentation including a BNF grammar for the OpenEdge SQL dialect: http://www.progress.com/progress/products/documentation/docs/dmsrf/dmsrf.pdf
The solution lies within OpenEdge Architect itself. Check out the OpenEdge Architect jar files (C:\Progress\OpenEdge\oeide\eclipse\plugins\com.openedge.pdt.core_10.2.1.01\lib\progressparser.jar).
There you will find the parser classes. They are tied into Eclipse, but I did the separation from the Eclipse framework, and it works.
The progressparser uses ANTLR, and the ANTLR grammar can be found in the following file:
C:\Progress\OpenEdge\oeide\eclipse\plugins\com.openedge.pdt.core_10.2.1.01\oe_common_services.jar.
Inside that file you will find the ANTLR definition (look for openedge.g).
Good luck. If you want the separated eclipse environment just drop me a mail.
Are you aware that there is already an open source parser for OpenEdge / Progress 4GL? It is called Proparse, written using ANTLR (originally it was hand-coded in OpenEdge itself, but eventually converted to ANTLR). It is written in Java, but I think you can run it in C# by using IKVM.
The license is the Eclipse license, so it is business-friendly.
You can do the same thing the C preprocessor does: extend your grammar with some sort of pragma that records a source location, and let your preprocessor generate code stuffed with those pragmas.
The issue with multiple assignments is easy enough to handle in a grammar. Just allow multiple assignments:
assign_stmt = 'assign' assignments '.' ;
assignments = ;
assignments = assignments target '=' expression ;
One method you can use is to augment the grammar to allow preprocessor token sequences wherever a nonterminal can be allowed, and simply not do preprocessor expansion. For your example, you have some grammar rule:
expression = ... ;
just add the rule:
expression = '{' include_reference '}' ;
This works to the extent that the preprocessor isn't used abusively to generate several language elements that span nonterminal boundaries.
What kind of code analysis do you intend to do? To do much of anything, you'll need name and type resolution, which will require expanding the preprocessor directives. In that case you'll need a more sophisticated scheme, because you need the expanded tree to do the name resolution, with the include information associated off to the side.
Our DMS Software Reengineering Toolkit has an OpenEdge parser, in which we do precisely this "keep the include file references" trick. DMS's C parser adds a "macro node" to the tree where the macro occurs (OpenEdge's "include" is just a funny way to write a macro definition); its child nodes contain the expanded tree as you'd expect it, plus the reference information pointing back to the macro definition. This requires some careful organization, and lots of special handling of macro nodes where they occur.
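In code terms, the shape of such a node might look like this (a hypothetical C# sketch, not DMS's actual data structure):

// The macro node keeps both views: the expanded subtree, which name/type
// resolution needs, and the original include reference, which preserves
// source fidelity for analysis and reporting.
abstract class AstNode { }

class MacroNode : AstNode
{
    public string IncludeReference;  // as written in the source, e.g. "{myinc.i}"
    public AstNode Expanded;         // the parsed tree of the include body, e.g. 7 * 14
}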
I need to develop an application that will read and understand a text file containing a custom language that describes a list of operations (e.g. a cooking recipe). This language has not been defined yet, but it will probably take one of the following shapes:
C++ like code
(This code is randomly generated, just for example purpose) :
begin
    repeat(10)
    {
        bar(toto, 10, 1999, xxx);
    }
    result = foo(xxxx, 10);
    if(foo == ok)
    {
        ...
    }
    else
    {
        ...
    }
end
XML code
(This code is randomly generated, just for example purpose) :
<recipe>
    <action name="foo" argument="bar, toto, xxx" repeat="10"/>
    <action name="bar" argument="xxxxx;10" condition="foo == ok">
        <true>...</true>
        <false>...</false>
    </action>
</recipe>
Whichever language is chosen, it will have to handle simple conditions and loops.
I have never done such a thing, but at first sight it occurs to me that describing those operations in XML would be simpler, yet less powerful.
After browsing Stack Overflow, I've found some threads about a tool called "ANTLR"... I started reading "The Definitive ANTLR Reference", but since I've never done this kind of thing, I find it hard to know whether it's really the kind of tool I need...
In other words, what do I need in order to read a text file, interpret it properly, and perform actions in my C# code? Those operations will interact with each other through simple conditions like:
If operation1 failed, I do operation2 else operation3.
Repeat the operation4 10 times.
What would be the best language for describing those text files (XML, my own)? What are the key points in such a development?
I hope I'm being clear :)
Thanks a lot for your help and advices !
XML is great for storing relational data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own? E.g:
Lua
Python
In one of my projects I actually started with an XML-like language, as I already had an XML parser, and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely to get past the problem of tokenizing/parsing text files, and lets you concentrate instead on your 'language' and the logic of its operations. The downside is that writing the text files is a little strange and very wordy. It's also very unnatural for a programmer used to C/C++ syntax.
Eventually you could easily replace your XML with a full-blown scanner & lexer to parse a more 'natural C++-like' text format into your expression tree.
As for writing the scanner & lexer, I found it easier to write these by hand, using simple logic flow/loops for the scanner and a recursive-descent parser for the grammar.
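To show how little machinery that takes, here's a minimal hand-written recursive-descent parser in C# for arithmetic expressions (grammar and names invented for illustration):

using System;

// Grammar: expr = term { '+' term } ; term = factor { '*' factor } ;
//          factor = number | '(' expr ')'
// e.g. new TinyParser("2*(3+4)").Parse() returns 14.
class TinyParser
{
    private readonly string _s;
    private int _pos;

    public TinyParser(string s) => _s = s;

    public double Parse()
    {
        double value = Expr();
        if (_pos != _s.Length) throw new FormatException($"Unexpected '{_s[_pos]}' at {_pos}");
        return value;
    }

    private double Expr()
    {
        double left = Term();
        while (Peek() == '+') { _pos++; left += Term(); }
        return left;
    }

    private double Term()
    {
        double left = Factor();
        while (Peek() == '*') { _pos++; left *= Factor(); }
        return left;
    }

    private double Factor()
    {
        if (Peek() == '(')
        {
            _pos++;                       // consume '('
            double value = Expr();
            if (Peek() != ')') throw new FormatException("Expected ')'");
            _pos++;                       // consume ')'
            return value;
        }
        int start = _pos;
        while (_pos < _s.Length && char.IsDigit(_s[_pos])) _pos++;
        return double.Parse(_s.Substring(start, _pos - start));
    }

    private char Peek() => _pos < _s.Length ? _s[_pos] : '\0';
}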
That said, ANTLR is great at letting you write out rules for your language and generating your scanner & lexer for you. This allows for a much more dynamic language that can easily change without you having to refactor everything again when new things are added. So it might be worth learning, as it would save you a lot of time in rewrites as things change, compared to hand-writing your own.
I'd recommend writing the app in F#. It has many useful features for parsing strings and XML, like pattern matching and active patterns.
For parsing C-like code I would recommend F# (I just did an interpreter in F#; it works like a charm).
For parsing XML I would recommend C#/F# plus the XmlDocument class.
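For instance, a minimal C# sketch of walking the question's XML shape with XmlDocument (the file name and the dispatch are illustrative, and the condition/true/false branch is left out):

using System;
using System.Xml;

class RecipeRunner
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.Load("recipe.xml");  // hypothetical file containing the <recipe> example

        foreach (XmlNode action in doc.SelectNodes("/recipe/action"))
        {
            string name = action.Attributes["name"].Value;
            string[] args = action.Attributes["argument"].Value.Split(',');
            int repeat = int.Parse(action.Attributes["repeat"]?.Value ?? "1");

            // Stand-in for dispatching to the real operation implementations.
            for (int i = 0; i < repeat; i++)
                Console.WriteLine($"calling {name}({string.Join(", ", args)})");
        }
    }
}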
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.
The best prefab answer: S-expressions
C and XML are good first steps. They have sort of opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to stick language features - for example, is the number of times to repeat a loop an attribute, element or text?
S-expressions are terser than XML for sure, maybe even terser than the C-like syntax. At the same time, they're specific to the task of applying operations to data. They don't admit ambiguity. Parsers are simple, and it's easy to find example code for them.
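For instance, the earlier C-like example might render as something like this (hypothetical syntax):

(begin
  (repeat 10
    (bar toto 10 1999 xxx))
  (set result (foo xxxx 10))
  (if (eq result ok)
    (...)
    (...)))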
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.