What are the methods for tokenizing strings in .Net? - c#

This must be a classic .NET question for anyone migrating from Java.
.NET does not seem to have a direct equivalent to java.io.StreamTokenizer; however, the JLCA provides a SupportClass that attempts to implement it. I believe the JLCA also provides a Tokenizer SupportClass that takes a String as the source, which I thought a StreamTokenizer would be derived from, but it isn't.
What is the preferred way to tokenize both a Stream and a String, if there is one? How are streams tokenized in .NET? I'd like to have the flexibility that java.io.StreamTokenizer provides. Any thoughts?

There isn't anything in .NET that is completely equivalent to StreamTokenizer. For simple cases, you can use String.Split(), but for more advanced token parsing, you'll probably end up using System.Text.RegularExpressions.Regex.

Use System.String.Split if you need to split a string based on a collection of specific characters.
Use System.Text.RegularExpressions.Regex.Split to split based on matching patterns.
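As a rough illustration of the difference between the two calls, here is a minimal sketch. The class and method names (SplitExamples, SplitOnChars, SplitOnPattern) and the chosen separators are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitExamples
{
    // Split on a fixed set of separator characters, dropping empty entries
    // that appear when two separators are adjacent (e.g. ", ").
    public static string[] SplitOnChars(string input)
    {
        return input.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Split wherever a pattern matches: here, any run of whitespace and/or commas.
    // Note that Regex.Split keeps empty strings if the input starts or ends
    // with a separator.
    public static string[] SplitOnPattern(string input)
    {
        return Regex.Split(input, @"[\s,]+");
    }
}
```

String.Split is usually enough when the separators are single characters; the Regex version earns its cost when the separator itself has structure.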

There's a tokenizer in the Nextem library -- you can see an example here: http://trac.assembla.com/nextem/browser/trunk/Examples/Parsing.n
It's implemented as a Nemerle macro, but you can write this and then use it from C# easily.

I don't think so. For very simple tokenizing, have a look at System.String.Split().
More complex tokenizing can be achieved by System.Text.RegularExpressions.Regex.

We had the same problem of finding a StreamTokenizer equivalent when porting tuProlog from Java to C#. We ended up writing what as far as I know is a straight conversion of StreamTokenizer which takes a TextReader as a "stream" for input purposes. You will find the code in the download for tuProlog.NET 2.1 (LGPL-licensed) so feel free to reuse and adapt it to your needs.
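To give an idea of what a tokenizer over a TextReader can look like, here is a minimal sketch. It is not the tuProlog.NET code and it covers far less than java.io.StreamTokenizer; the type names (TokenKind, Token, SimpleTokenizer) and the three token classes are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public enum TokenKind { Word, Number, Symbol }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Text { get; }
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
}

public static class SimpleTokenizer
{
    // Lazily yields tokens from any TextReader (StringReader, StreamReader, ...).
    public static IEnumerable<Token> Tokenize(TextReader reader)
    {
        int c;
        while ((c = reader.Peek()) != -1)
        {
            char ch = (char)c;
            if (char.IsWhiteSpace(ch))
            {
                reader.Read(); // skip whitespace between tokens
            }
            else if (char.IsDigit(ch))
            {
                yield return new Token(TokenKind.Number, ReadWhile(reader, char.IsDigit));
            }
            else if (char.IsLetter(ch) || ch == '_')
            {
                yield return new Token(TokenKind.Word,
                    ReadWhile(reader, x => char.IsLetterOrDigit(x) || x == '_'));
            }
            else
            {
                reader.Read(); // any other character is a one-character symbol
                yield return new Token(TokenKind.Symbol, ch.ToString());
            }
        }
    }

    // Consumes characters for as long as the predicate accepts them.
    private static string ReadWhile(TextReader reader, Func<char, bool> accept)
    {
        var sb = new StringBuilder();
        int c;
        while ((c = reader.Peek()) != -1 && accept((char)c))
        {
            sb.Append((char)reader.Read());
        }
        return sb.ToString();
    }
}
```

Because the input is a TextReader, the same code handles a string (via StringReader) and a file (via StreamReader). The sketch relies on Peek, so it assumes a reader for which Peek is reliable.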

To tokenize a string, use string.Split(...).

Related

Parsing a function like string

I need to parse a string like this in C#:
GET_VALUE("USER_NAME")
... and return something like:
USER 101
And USER_NAME is one of several properties supported in the system.
I have already done some string parsing to do this.
My question is if there is some good pattern/implementation I should follow that parses the above with better/maintainable code, as well as addressing some future potential enhancements (like supporting operators like +, etc)?
Thanks
If you're looking for eval()-like functionality, C#'s designers avoided this on purpose. It was a deliberate design decision, because they had many use cases in mind for the language where this would be a huge, gaping security problem. That doesn't mean it's impossible; there are several ways to accomplish this: Roslyn, Reflection.Emit, and System.CodeDom all have some capability in this area, but none of them make it quick or simple.
If you really need to parse code, be aware it's a very slippery slope. Today, they want simple function calls; tomorrow, they'll want full expressions and grammar. You don't want to start from scratch for this. Take a look at lexer/parser libraries like ANTLR.
The other option is a DSL (Domain Specific Language), which lets you define a grammar for your program to parse input into executable code.
If your problem is to get the name of the user who is logged in, try this:
string vUserID = (Request.ServerVariables["LOGON_USER"]);

Converting CSV records to custom datatype IEnumerable (C#)

I've got some data in plain CSV format, say:
a1,b1
a2,b2
a3,b3
I've created a class (CSVRecord) which can consume one element each from a single CSV line. Using LINQ, I've been able to convert the CSV data to an IEnumerable<CSVRecord> using this line of code:
IEnumerable<CSVRecord> list = csvList.Select(a => new CSVRecord(a.Split(new char[]{','})[0], a.Split(new char[]{','})[1]));
This does what I want, but just by looking at it, it doesn't seem like a good way of achieving this. Could you please suggest how I could improve on this?
You absolutely do not want to write your own CSV parser (as you've identified) unless you are also prepared to implement the RFC 4180 standard fully. That is probably not what you want to do. There are so many non-obvious edge cases. Fortunately, implementations already exist.
I would use CsvHelper. I've used it for reading and writing on several occasions and it's always been a good library to use in my experience.
There are several good CSVReader libraries around.
A good way to improve is to take a look at the code out there; if you want to reinvent the wheel as a learning experience, then use these as examples.
However, if you simply want their functionality, then use a pre-existing library.
Check out:
http://www.csvreader.com/ (costs, but is good. You can even buy the source code if needs be).
http://www.codeproject.com/KB/database/CsvReader.aspx
I'm sure there are plenty more out there also.
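If a library is not an option and the data really is as plain as in the question (no quoted fields, no embedded commas or line breaks), the original LINQ expression can at least avoid splitting every line twice. This is only a sketch under that assumption: the CSVRecord class below is a stand-in, since the question does not show the real one, and CsvConversion and ToRecords are made-up names.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in for the CSVRecord class from the question (its real shape is not shown).
public sealed class CSVRecord
{
    public string First { get; }
    public string Second { get; }
    public CSVRecord(string first, string second) { First = first; Second = second; }
}

public static class CsvConversion
{
    // Assumes every line has at least two comma-separated fields and no quoting.
    public static IEnumerable<CSVRecord> ToRecords(IEnumerable<string> csvList)
    {
        return csvList
            .Select(line => line.Split(','))                      // split each line exactly once
            .Select(parts => new CSVRecord(parts[0], parts[1]));  // then pick the fields
    }
}
```

Anything beyond this shape, such as quoted fields, is where the edge cases mentioned above start to bite and a library becomes the safer choice.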

Creating a simple scripting language in C#

I need to create a very simple scripting language, as an evolution of a macro language (where placeholders were present and were exchanged for the real data), which is based essentially on statements that need to be executed in order. I need to support nesting of statements and possibly if conditions.
I think I need a parser to properly detect the statements.
For example one statement could be:
Input("Message"=#Clipboard())
In this case, I would need to execute the #Clipboard() statement first and then the #Input.
Any suggestions on the approach for this? I guess I need to construct a tree and execute it.
Thanks.
See my answer to a similar question here:
Basically, you parse your string using Postfix Notation.
Also, if you are going to use something more complex, look into building a Recursive Descent Parser. Eric White's blog has a great set of articles on the topic.
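A recursive descent parser for nested calls can be quite small. The sketch below is an assumption-laden illustration, not the approach from the linked articles: it accepts only string literals and calls of the form Name(arg, ...) with an optional leading #, leaves out the "Message"= named-argument form from the question, and evaluates while parsing instead of building a tree. ScriptEvaluator and the function table are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public sealed class ScriptEvaluator
{
    private readonly string _text;
    private readonly IDictionary<string, Func<string[], string>> _functions;
    private int _pos;

    private ScriptEvaluator(string text, IDictionary<string, Func<string[], string>> functions)
    {
        _text = text;
        _functions = functions;
    }

    public static string Evaluate(string text, IDictionary<string, Func<string[], string>> functions)
    {
        var evaluator = new ScriptEvaluator(text, functions);
        string result = evaluator.ParseExpression();
        evaluator.SkipWhitespace();
        if (evaluator._pos != text.Length)
            throw new FormatException("Unexpected input at position " + evaluator._pos);
        return result;
    }

    // expr := string | call
    private string ParseExpression()
    {
        SkipWhitespace();
        return Peek() == '"' ? ParseString() : ParseCall();
    }

    // call := ['#'] name '(' [ expr { ',' expr } ] ')'
    private string ParseCall()
    {
        if (Peek() == '#') _pos++;
        int start = _pos;
        while (_pos < _text.Length && (char.IsLetterOrDigit(_text[_pos]) || _text[_pos] == '_')) _pos++;
        string name = _text.Substring(start, _pos - start);
        if (name.Length == 0) throw new FormatException("Expected a name at position " + start);

        Expect('(');
        var args = new List<string>();
        SkipWhitespace();
        if (Peek() != ')')
        {
            do
            {
                args.Add(ParseExpression()); // inner calls are evaluated here, before the outer one
                SkipWhitespace();
            } while (TryConsume(','));
        }
        Expect(')');

        Func<string[], string> function;
        if (!_functions.TryGetValue(name, out function))
            throw new FormatException("Unknown function: " + name);
        return function(args.ToArray());
    }

    // string := '"' characters '"' (no escape sequences in this sketch)
    private string ParseString()
    {
        int end = _text.IndexOf('"', _pos + 1);
        if (end < 0) throw new FormatException("Unterminated string at position " + _pos);
        string value = _text.Substring(_pos + 1, end - _pos - 1);
        _pos = end + 1;
        return value;
    }

    private char Peek() { return _pos < _text.Length ? _text[_pos] : '\0'; }

    private void SkipWhitespace()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
    }

    private bool TryConsume(char c)
    {
        if (Peek() != c) return false;
        _pos++;
        return true;
    }

    private void Expect(char c)
    {
        SkipWhitespace();
        if (!TryConsume(c)) throw new FormatException("Expected '" + c + "' at position " + _pos);
    }
}
```

Evaluating during the parse is enough for nested calls, because arguments are always resolved before the call that uses them. Conditional statements would need the parser to return a tree instead, so that a branch can be skipped without being executed.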

Sharing a character buffer between C# string objects

Is this possible? Given that C# uses immutable strings, one could expect that there would be a method along the lines of:
var expensive = ReadHugeStringFromAFile();
var cheap = expensive.SharedSubstring(1);
If there is no such function, why bother with making strings immutable?
Or, alternatively, if strings are already immutable for other reasons, why not provide this method?
The specific reason I'm looking into this is doing some file parsing. Simple recursive descent parsers (such as the one generated by TinyPG, or ones easily written by hand) use Substring all over the place. This means if you give them a large file to parse, memory churn is unbelievable. Sure there are workarounds - basically roll your own SubString class, and then of course forget about being able to use String methods such as StartsWith or String libraries such as Regex, so you need to roll your own version of these as well. I assume parser generators such as ANTLR basically do that, but my format is simple enough not to justify using such a monster tool. Even TinyPG is probably an overkill.
Somebody please tell me I am missing some obvious or not-so-obvious standard C# method call somewhere...
No, there's nothing like that.
.NET strings contain their text data directly, unlike Java strings which have a reference to a char array, an offset and a length.
Both solutions have "wins" in some situations, and losses in others.
If you're absolutely sure this will be a killer for you, you could implement a Java-style string for use in your own internal APIs.
As far as I know, all larger parsers use streams to parse from. Isn't that suitable for your situation?
The .NET Framework supports string interning. This is a partial solution, but it does not offer the possibility of reusing parts of a string. I think reusing substrings would cause some problems that are not obvious at first glance. If you have to do a lot of string manipulation, using StringBuilder is the way to go.
Nothing in C# provides you the out-of-the-box functionality you're looking for.
What you want is a rope data structure, an immutable data structure which supports O(1) concats and O(log n) substrings. I can't find any C# implementations of a rope, but here is a Java one.
Barring that, there's nothing wrong with using TinyPG or ANTLR if that's the easiest way to get things done.
Well you could use "unsafe" to do the memory management yourself, which might allow you to do what you are looking for. Also the StringBuilder class is great for situations where a string needs to be manipulated numerous times, since it doesn't make a new string with each manipulation.
You could easily write a trivial class to represent "cheap". It would just hold the index of the start of the substring and the length of the substring. A couple of methods would allow you to read the substring out when needed - a string cast operator would be ideal as you could use
string text = myCheapObject;
and it would work seamlessly as if it were an actual string. Adding support for a few handy methods like StartsWith would be quick and easy (they'd all be one liners).
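Such a class might look roughly like the following sketch. StringSegment, Slice and the exact member set are assumptions for illustration; the point is only that a segment stores a reference to the original string plus an offset and a length, and copies characters only when converted back to a string.

```csharp
using System;

public struct StringSegment
{
    private readonly string _source;
    public int Offset { get; }
    public int Length { get; }

    public StringSegment(string source, int offset, int length)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (offset < 0 || length < 0 || offset > source.Length - length)
            throw new ArgumentOutOfRangeException(nameof(offset));
        _source = source;
        Offset = offset;
        Length = length;
    }

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= Length) throw new ArgumentOutOfRangeException(nameof(index));
            return _source[Offset + index];
        }
    }

    // A sub-segment shares the same underlying string; no characters are copied.
    public StringSegment Slice(int start, int length)
    {
        if (start < 0 || length < 0 || start > Length - length)
            throw new ArgumentOutOfRangeException(nameof(start));
        return new StringSegment(_source, Offset + start, length);
    }

    // Ordinal comparison against the shared buffer, without allocating.
    public bool StartsWith(string prefix)
    {
        return prefix.Length <= Length &&
               string.CompareOrdinal(_source, Offset, prefix, 0, prefix.Length) == 0;
    }

    // The only place where characters are actually copied.
    public override string ToString()
    {
        return _source == null ? string.Empty : _source.Substring(Offset, Length);
    }

    public static implicit operator string(StringSegment segment)
    {
        return segment.ToString();
    }
}
```

With this in place, string text = segment; works through the implicit conversion. On newer .NET versions, ReadOnlySpan<char> and ReadOnlyMemory<char> (obtained with AsSpan() and AsMemory()) cover much of the same ground and may make a hand-written type unnecessary.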
The other option is to write a regular parser and store your tokens in a Dictionary from which you share references to the tokens rather than keeping multiple copies.

Lex/Yacc for C#?

Actually, maybe not full-blown Lex/Yacc. I'm implementing a command-interpreter front-end to administer a webapp. I'm looking for something that'll take a grammar definition and turn it into a parser that directly invokes methods on my object. Similar to how ASP.NET MVC can figure out which controller method to invoke, and how to pony up the arguments.
So, if the user types "create foo" at my command-prompt, it should transparently call a method:
private void Create(string id) { /* ... */ }
Oh, and if it could generate help text from (e.g.) attributes on those controller methods, that'd be awesome, too.
I've done a couple of small projects with GPLEX/GPPG, which are pretty straightforward reimplementations of LEX/YACC in C#. I've not used any of the other tools above, so I can't really compare them, but these worked fine.
GPPG can be found here and GPLEX here.
That being said, I agree, a full LEX/YACC solution probably is overkill for your problem. I would suggest generating a set of bindings using IronPython: it interfaces easily with .NET code, non-programmers seem to find the basic syntax fairly usable, and it gives you a lot of flexibility/power if you choose to use it.
I'm not sure Lex/Yacc will be of any help. You'll just need a basic tokenizer and an interpreter, which are faster to write by hand. If you still want to go the parser route, see Irony.
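A hand-written version for commands like "create foo" can lean on reflection instead of a grammar. The sketch below is one possible shape, not an established pattern from any library: CommandDispatcher and AdminCommands are hypothetical, commands are assumed to take only string parameters, and the standard DescriptionAttribute is used both to mark command methods and to supply help text.

```csharp
using System;
using System.ComponentModel;
using System.Linq;
using System.Reflection;
using System.Text;

public sealed class CommandDispatcher
{
    private readonly object _target;

    public CommandDispatcher(object target) { _target = target; }

    // "create foo" -> invokes a method named Create (case-insensitive) with "foo".
    public object Execute(string commandLine)
    {
        string[] parts = commandLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0) throw new ArgumentException("Empty command.");

        object[] args = parts.Skip(1).Cast<object>().ToArray();
        MethodInfo method = FindCommands().FirstOrDefault(m =>
            string.Equals(m.Name, parts[0], StringComparison.OrdinalIgnoreCase) &&
            m.GetParameters().Length == args.Length &&
            m.GetParameters().All(p => p.ParameterType == typeof(string)));

        if (method == null) throw new InvalidOperationException("Unknown command: " + parts[0]);
        return method.Invoke(_target, args);
    }

    // Builds one help line per command from the method signature and its description.
    public string Help()
    {
        var sb = new StringBuilder();
        foreach (MethodInfo m in FindCommands())
        {
            string parameters = string.Join(" ", m.GetParameters().Select(p => "<" + p.Name + ">"));
            string usage = (m.Name.ToLowerInvariant() + " " + parameters).TrimEnd();
            sb.AppendLine(usage + "  " + m.GetCustomAttribute<DescriptionAttribute>().Description);
        }
        return sb.ToString();
    }

    // Only methods carrying [Description] are treated as commands; private ones are included.
    private MethodInfo[] FindCommands()
    {
        return _target.GetType()
            .GetMethods(BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic)
            .Where(m => m.GetCustomAttribute<DescriptionAttribute>() != null)
            .ToArray();
    }
}

// Hypothetical command class, mirroring the Create(string id) method from the question.
public sealed class AdminCommands
{
    public string LastCreated;

    [Description("Creates an item with the given id.")]
    private void Create(string id) { LastCreated = id; }
}
```

Calling new CommandDispatcher(new AdminCommands()).Execute("create foo") ends up in Create("foo"). Splitting on spaces means arguments cannot contain spaces; quoted arguments would need a slightly smarter tokenizer.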
As a sidenote: have you considered PowerShell and its commandlets?
Also look at Antlr, which has C# support.
Still early CTP so can't be used in production apps but you may be interested in Oslo/MGrammar:
http://msdn.microsoft.com/en-us/oslo/
Jison is getting a lot of traction recently. It is a Bison port to JavaScript. Because of its extremely simple nature, I've ported the Jison parsing/lexing template to PHP, and now to C#. It is still very new, but if you get a chance, take a look at it here: https://github.com/robertleeplummerjr/jison/tree/master/ports/csharp/Jison
If you don't fear alpha software and want an alternative to Lex/Yacc for creating your own languages, you might look into Oslo. I would recommend sitting through the recordings of sessions TL27 and TL31 from last year's PDC. TL31 directly addresses the creation of Domain Specific Languages using Oslo.
Coco/R is a compiler generator with a .NET implementation. You could try that out, but I'm not sure if getting such a library to work would be faster than writing your own tokenizer.
http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/
I would suggest csflex, a C# port of flex, the best-known Unix scanner generator.
I believe that lex/yacc are in one of the SDKs already (i.e. RTM). Either Windows or .NET Framework SDK.
Gardens Point Parser Generator here provides Yacc/Bison functionality for C#. It can be downloaded here. A useful example using GPPG is provided here.
As Anton said, PowerShell is probably the way to go. If you do want a lex/yacc implementation, then Malcolm Crowe has a good set.
Edit: Direct Link to the Compiler Tools
Just for the record, implementation of lexer and LALR parser in C# for C#:
http://code.google.com/p/naive-language-tools/
It should be similar in use to Lex/Yacc, however those tools (NLT) are not generators! Thus, forget about speed.
