Porting a very Pythonesque library over to .NET


I'm investigating the possibility of porting the Python library Beautiful Soup over to .NET, mainly because I really love the parser and there are simply no good HTML parsers on the .NET Framework (Html Agility Pack is outdated, buggy, undocumented and doesn't work well unless the exact schema is known).
One of my primary goals is to get the basic DOM selection functionality to really parallel the beauty and simplicity of BeautifulSoup, allowing developers to easily craft expressions to find elements they're looking for.
BeautifulSoup takes advantage of loose-binding and named parameters to make this happen. For example, to find all a tags with an id of test and a title that contains the word foo, I could do:
soup.find_all('a', id='test', title=re.compile('foo'))
However, C# has no concept of an arbitrary number of named arguments. C# 4 does have named parameters, but they have to match an existing method prototype.
My Question: What is the C# design pattern that most parallels this Pythonic construct?
Some Ideas:
I'd like to go after this based on how I, as a developer, would like to code. Implementing this is out of the scope of this post. One idea I had would be to use anonymous types. Something like:
soup.FindAll("a", new { Id = "Test", Title = new Regex("foo") });
Though this syntax loosely matches the Python implementation, it still has some disadvantages.
The FindAll implementation would have to use reflection to parse the anonymous type, and handle any arbitrary metadata in a reasonable manner.
The FindAll prototype would need to take an Object, which makes it fairly unclear how to use the method unless you're well familiar with the documented behavior. I don't believe there's a way to declare a method that must take an anonymous type.
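To make the reflection cost concrete, here is a minimal sketch of how the anonymous-type argument could be unpacked. AttributeFilter and FromAnonymous are hypothetical names invented for illustration; they are not part of Beautiful Soup or any existing .NET library. It assumes each property is either a Regex or something compared as a string.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any existing library.
public static class AttributeFilter
{
    // Turns new { Id = "Test", Title = new Regex("foo") } into
    // attribute-name -> predicate pairs, using reflection.
    public static Dictionary<string, Func<string, bool>> FromAnonymous(object filter)
    {
        // Attribute names are matched case-insensitively, so "Id" finds "id".
        var result = new Dictionary<string, Func<string, bool>>(StringComparer.OrdinalIgnoreCase);
        foreach (PropertyInfo prop in filter.GetType().GetProperties())
        {
            object expected = prop.GetValue(filter, null);
            Regex pattern = expected as Regex;
            if (pattern != null)
            {
                // Regex values match when the pattern is found in the attribute.
                result[prop.Name] = actual => actual != null && pattern.IsMatch(actual);
            }
            else
            {
                // Anything else is compared by its string form (an assumption).
                string text = Convert.ToString(expected);
                result[prop.Name] = actual => actual == text;
            }
        }
        return result;
    }
}
```

A FindAll(string tag, object filter) overload could call this once per query and then test each candidate node against the resulting predicates. The parameter still has to be typed as object, so the discoverability problem described above remains.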
Another idea I had is perhaps a more .NET way of handling this but strays further away from the library's Python roots. That would be to use a fluent pattern. Something like:
soup.FindAll("a")
    .Attr("id", "Test")
    .Attr("title", new Regex("foo"));
This would require building an expression tree and locating the appropriate nodes in the DOM.
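A full expression tree may not be strictly necessary for the fluent style. As a rough sketch, each call can simply wrap the previous sequence in another lazy filter. Node and NodeQuery below are hypothetical types made up for this illustration, and the Tag method stands in for the FindAll("a") entry point.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a parsed DOM element.
public sealed class Node
{
    public string Tag;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
}

// Each call returns a new query with one more filter; nothing runs until enumeration.
public sealed class NodeQuery : IEnumerable<Node>
{
    private readonly IEnumerable<Node> source;

    public NodeQuery(IEnumerable<Node> source) { this.source = source; }

    public NodeQuery Tag(string tag)
    {
        return new NodeQuery(source.Where(n => n.Tag == tag));
    }

    // Exact string match on an attribute value.
    public NodeQuery Attr(string name, string value)
    {
        return Attr(name, actual => actual == value);
    }

    // Regex match on an attribute value.
    public NodeQuery Attr(string name, Regex pattern)
    {
        return Attr(name, actual => pattern.IsMatch(actual));
    }

    private NodeQuery Attr(string name, Func<string, bool> test)
    {
        return new NodeQuery(source.Where(n =>
        {
            string actual;
            return n.Attributes.TryGetValue(name, out actual) && test(actual);
        }));
    }

    public IEnumerator<Node> GetEnumerator() { return source.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because NodeQuery implements IEnumerable<Node>, the result can be used in a foreach loop or combined with ordinary LINQ operators.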
The third and last idea I have would be to use LINQ. Something like:
var nodes = (from n in soup
             where n.Tag == "a" &&
                   n["id"] == "Test" &&
                   Regex.Match(n["title"], "foo").Success
             select n);
I'd appreciate any insight from anyone with experience porting Python code to C#, or just overall recommendations on the best way to handle this situation.

Have you tried running your code inside the IronPython engine? As far as I know it performs really well, and you don't have to touch your Python code.

Related

Does something like a VB "Variant" implementation exist which uses C#'s dynamic dispatch?

I realize that it goes against the strongly typed nature of C#, but I find that when working with dynamic objects in the language, some of the more useful features typically found in JavaScript or PowerShell are simply not practical.
For example, the following C# code will fail at runtime and it's obvious why.
dynamic x = 1.0;
int y = x;
But that makes the dynamic features of C# pretty limited when dealing with loosely typed data such as that produced by JSON payloads or CSV where subtle variations in the representation can result in very different behavior at runtime.
What I'm looking for is something that will behave much like the VBA / VBScript era Variant type. I imagine it's possible to derive such a type from DynamicObject that would wrap primitive values like Int32, String, etc and perform the appropriate casts at runtime. I've done something similar with "null" values in dynamic contexts.
But is there anything like this already established? I've looked around GitHub or Codeplex to no avail but it's kind of a hard thing to search for. Before I get started on what I imagine is going to be quite a complicated class, I want to make sure I'm not wasting my time.
About the practicality of all of this
I should note that I resisted the concept of dynamic dispatch in C# for a long time because it was not intended to be a dynamic language. Quite honestly, I wish it wasn't added to the language at all, or at least restricted to COM interop scenarios.
But having said that, I am always curious about ways to "hack" language features in such a way to make them do things that they were never intended to do. Sometimes something useful comes out of it. For example, there have been plenty of examples of people using the IDisposable interface and using statement to do things that have nothing to do with releasing resources.
I doubt I would use this in production applications or anything that needed to be handed off to another developer.

The Visual Basic languages hide a lot of the glue; that just isn't the C# way. The Variant type has a raft of conversion functions that are invoked automatically by the VB runtime. .NET has conversion functions too; you just have to call them explicitly:
dynamic x = 1.0;
int y = Convert.ToInt32(x);
The C# justification for having to write code like that is that it is not a language that hides cost.
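If the automatic behaviour is still wanted, the DynamicObject route the question mentions can be sketched fairly briefly. The Variant class below is hypothetical, not an established library type; it assumes that delegating to Convert.ChangeType gives acceptable conversion rules, which will not match VB's Variant semantics in every case.

```csharp
using System;
using System.Dynamic;
using System.Globalization;

// Hypothetical Variant-like wrapper; a sketch, not an established library type.
public sealed class Variant : DynamicObject
{
    private readonly object value;

    public Variant(object value) { this.value = value; }

    // The C# runtime binder calls this when a Variant held in a variable
    // typed as dynamic is converted to another type.
    public override bool TryConvert(ConvertBinder binder, out object result)
    {
        try
        {
            result = Convert.ChangeType(value, binder.Type, CultureInfo.InvariantCulture);
            return true;
        }
        catch (FormatException) { }
        catch (InvalidCastException) { }
        catch (OverflowException) { }

        // Returning false lets the binder raise its usual conversion error.
        result = null;
        return false;
    }
}
```

With this in place, dynamic x = new Variant(1.0); int y = x; goes through TryConvert instead of failing in the binder. Operators and member access would need further overrides such as TryBinaryOperation.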

C#: Creating a Methodheader Parser

I would like to write a parser to tell me what part of a string is a methodheader. What is the best way to do this in C#?
The language grammar specification can be found here. I don't think this is proper BNF/EBNF, so perhaps there is a way to transform it into such (like an html parser that puts it into proper BNF.)
Should I use regular expressions or a custom built parser somehow? I am restricted in that I need to build it myself without the help of outside tools.

I found the NRefactory library, part of the open-source SharpDevelop tool, to be very good at parsing C# modules into an abstract syntax tree. Once you have that you can scan through very easily to find the method headers, the locations, and so on.
Though its primary use is within SharpDevelop (a GUI tool), it is a standalone DLL, and it can be used within any .NET app. The documentation isn't very thorough, as far as I could tell, but Reflector let me examine it and figure things out pretty easily.
Some code:
internal static string CreateAstSexpression(string filename)
{
    using (var fs = File.OpenRead(filename))
    {
        using (var parser = ParserFactory.CreateParser(SupportedLanguage.CSharp,
                                                       new StreamReader(fs)))
        {
            parser.Parse();
            // RetrieveSpecials() returns an IList<ISpecial>
            // parser.Lexer.SpecialTracker.RetrieveSpecials()...
            // "specials" == comments, preprocessor directives, etc.
            // parser.CompilationUnit retrieves the root node of the result AST
            return SexpressionGenerator.Generate(parser.CompilationUnit).ToString();
        }
    }
}
The ParserFactory class is part of NRefactory.
In my case I wanted a lisp s-expression describing the C# buffer, so I wrote an S-expression generator that walked through the "CompilationUnit". It's just a tree of nodes, starting with namespace, then class/struct/enum. Within the class/struct node, there are method nodes (as well as field, property, etc).
If that finished DLL is not of interest, then maybe this is.
Before finding and embracing NRefactory, I tried to produce a wisent grammar for C#. This was for use within Emacs, which has a wisent engine.
I never could get it to work properly.
Maybe it's of use to you.
You said that you didn't want to use "outside tools". I'm not sure of the motivation for that restriction; if it is homework, then I guess it makes sense, but for other purposes it really would be a shame not to use the well-tested and well-understood tools that are already out there.
If you take either of the suggestions I've made here, you're building on something that is an outside tool. But some of the options are a little better than others.

Templates in C#

I know generics are in C# to fulfill a role similar to C++ templates but I really need a way to generate some code at compile time - in this particular situation it would be very easy to solve the problem with C++ templates.
Does anyone know of any alternatives? Perhaps a VS plug-in that preprocesses the code or something like that? It doesn't need to be very sophisticated, I just need to generate some methods at compile time.
Here's a very simplified example in C++ (note that this method would be called inside a tight loop with various conditions instead of just "Advanced" and those conditions would change only once per frame - using if's would be too slow and writing all the alternative methods by hand would be impossible to maintain). Also note that performance is very important and that's why I need this to be generated at compile time.
template <bool Advanced>
int TraceRay( Ray r )
{
    do
    {
        if ( WalkAndTestCollision( r ) )
        {
            if ( Advanced )
                return AdvancedShade( collision );
            else
                return SimpleShade( collision );
        }
    }
    while ( InsideScene( r ) );
}

You can use T4.
EDIT: In your example, you can use a simple bool parameter.

Not really, as far as I know. You can do this type of thing at runtime, of course; a few meta-programming options, none of them trivial:
reflection (the simplest option if you don't need "fastest possible")
CSharpCodeProvider and some code-generation
the same with CodeDom
ILGenerator if you want hardcore

Generics do work like templates, if that is the case.
There is also a way to create code at runtime.
Check this CodeProject example:
CodeProject

In addition to Marc's excellent suggestions, you may want to have a look at PostSharp.

I've done some meta-programming-style tricks using static generics that use reflection (I'm now using dynamic code generation with System.Linq.Expressions, and I have also used ILGenerator for some more insane stuff). See http://www.lordzoltan.org/blog/post/Pseudo-Template-Meta-Programming-in-C-Sharp-Part-2.aspx for an example I put together that might be of use (sorry about the lack of code formatting; it's a very old post!).
I've also used T4 (link goes to a series of tutorials by my favourite authority on T4 - Oleg Sych), as suggested by SLaks - which is a really nice way to generate code, especially if you're also comfortable with Asp.Net-style syntax. If you generate partial classes using the T4 output, then the developer can then embellish and add to the class however they see fit.
If it absolutely has to be compile-time - then I'd go for T4 (or write your own custom tool, but that's a bit heavy). If not, then a static generic could help, probably in partnership with the kind of solutions mentioned by Marc.
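One way to approximate the template <bool Advanced> example with generics alone is to pass the flag as a struct type argument. This is only a sketch under assumptions: IFlag, On, Off and Tracer are hypothetical names, the shading methods are placeholders, and whether the JIT actually removes the dead branch depends on the runtime and on inlining, so it should be measured rather than assumed. What the CLR does guarantee is that each value-type instantiation of a generic method gets its own native code.

```csharp
using System;

// Hypothetical compile-time flag types.
public interface IFlag { bool Value { get; } }
public struct On : IFlag { public bool Value { get { return true; } } }
public struct Off : IFlag { public bool Value { get { return false; } } }

public static class Tracer
{
    // Placeholder stand-ins for the shading routines in the question.
    static int AdvancedShade(int collision) { return collision * 2; }
    static int SimpleShade(int collision) { return collision; }

    public static int Shade<TAdvanced>(int collision) where TAdvanced : struct, IFlag
    {
        // For a given TAdvanced this condition never changes, so the JIT
        // has the opportunity to fold it away in that instantiation.
        if (default(TAdvanced).Value)
            return AdvancedShade(collision);
        return SimpleShade(collision);
    }
}
```

The caller picks the variant once per frame, for example by storing Tracer.Shade<On> or Tracer.Shade<Off> in a delegate, or by branching once outside the tight loop.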

If you want true code generation, you could use CodeSmith http://www.codesmithtools.com which isn't free/included like T4, but has some more features, and can function as a VS.NET plug-in.

Here's an older article that uses genetic programming to generate and compile code on the fly:
http://msdn.microsoft.com/en-us/magazine/cc163934.aspx
"The Generator class is the kernel of the genetic programming application. It discovers available base class terminals and functions. It generates, compiles, and executes C# code to search for a good solution to the problem it is given. The constructor is passed a System.Type which is the root class for .NET reflection operations."
Might be overkill for your situation, but does show what C# can do. (Note this article is from the 1.0 days)

C# 4: Real-World Example of Dynamic Types

I think I have my brain halfway wrapped around the Dynamic Types concept in C# 4, but can't for the life of me figure out a scenario where I'd actually want to use it.
I'm sure there are many, but I'm just having trouble making the connection as to how I could engineer a solution that is better solved with dynamics as opposed to interfaces, dependency injection, etc.
So, what's a real-world application scenario where dynamic type usage is appropriate?

There are lots of cases where you are already using dynamic typing and dynamic binding today. You just don't realize it, because it is all hidden behind strings or System.Object, since until C# 4, the necessary support wasn't there.
One example is COM interop: COM is actually a semi-dynamic object system. When you do COM interop, a lot of methods actually return a dynamic object, but because C# didn't support them, they were returned as System.Object and you had to cast them yourself, possibly catching exceptions on the way.
Another example is interacting with dynamically typed (or even untyped) data, such as JSON, CSV, HTML, schemaless XML, schemaless web services, schemaless databases (which are, after all, the new hotness). Today, you use strings for those. An XML API would look like
var doc = new XmlDocument("/path/to/file.xml");
var baz = doc.GetElement("foo").GetElement("qux");
and so on. But how about:
dynamic doc = new XmlDocument("/path/to/file.xml");
var baz = doc.foo.qux;
Doesn't that look nice?
A third example is reflection. Today, invocation of a method via reflection is done by passing a string to InvokeMember (or whatever the thing is called). Wouldn't it be nicer to, you know, just invoke the damn thing?
Then, there is generation of dynamic data (basically the opposite of the second example). Here's an example how to generate some dynamic XML:
dynamic doc = new XmlBuilder();
doc.articles(id=42, type="List", () => {
    article(() => {
        number(42);
        title("blahblubb");
    });
});
This is not nearly as beautiful as the equivalent Ruby, but it is the best I could come up with at such short notice :-)
And last but certainly not least, integration with a dynamically typed language. Whether that is JavaScript in a Silverlight application, a custom rules engine embedded in your business app or a DLR instance that you host in your CAD program/IDE/text editor.

There's one example on MSDN:
Many COM methods allow for variation in argument types and return type by designating the types as object. This has necessitated explicit casting of the values to coordinate with strongly typed variables in C#. If you compile by using the /link (C# Compiler Options) option, the introduction of the dynamic type enables you to treat the occurrences of object in COM signatures as if they were of type dynamic, and thereby to avoid much of the casting.
Another example is if you have to interop with dynamic languages.
Also there are some occasions where you want to make some code generic but you can't because, even though the objects implement the same method, they don't share a suitable base class or interface that declares the methods you need. An example of this is trying to make something generic with ints and shorts. It's a bit of a hack, but dynamic allows you to call the same methods on these different types, allowing more code reuse.
Update: A bit of searching on here found this related post.
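The ints-and-shorts case mentioned in this answer might look something like the following. DynamicMath.Sum is a hypothetical helper written for illustration; it gives up compile-time checking and pays the dynamic dispatch cost on every addition.

```csharp
using System;

public static class DynamicMath
{
    // Hypothetical helper: sums values of any numeric type that supports '+'.
    public static T Sum<T>(params T[] values)
    {
        dynamic total = default(T);
        foreach (T v in values)
        {
            // The '+' operator is resolved at run time from the actual types.
            total = total + v;
        }
        // short + short yields int under C# rules, so an explicit cast back
        // to T is needed for the narrower types.
        return (T)total;
    }
}
```

Calling DynamicMath.Sum<short>(1, 2, 3) and DynamicMath.Sum(1, 2, 3) runs the same method body for both types, which a generic constraint cannot express because int and short share no interface that declares addition.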

From Walter Almeida's Blog: a scenario of use of the dynamic keyword in C# to enhance object orientation:
http://blog.walteralmeida.com/2010/05/using-the-dynamic-keyword-in-c-to-improve-objectorientation.html

Scott Watermasysk wrote an article about using dynamics for dictionary key-property mapping on the MongoDB C# driver.
http://simpable.com/code/mongodb-dynamics/

I think others have given some great answers so far, so I just want to add this example by David Hanson. His post shows the most practical application I've found so far for dynamic types in C#, where he uses them to create proxy objects. In this example he creates a proxy which allows raising of exceptions on WPF binding errors. I'm not sure if this could also be achieved in the case of WPF bindings by using CustomTypeDescriptors and property descriptor concepts in general, but regardless I think the use of the new C# 4.0 dynamic type is a great demonstration of its capabilities.
Raising binding exceptions in WPF & Silverlight with .net 4.0 Dynamics
One other use that I can think of for Dynamic types is to create proxies that similarly can be plugged in as a DataContext in WPF or in other places where a generic object type is expected and reflection methods are normally used to interrogate the type. In these cases especially when building tests a dynamic type can be used which would then allow property accessors to be called and logged accordingly by the proxy object in a dynamic fashion without having to hardcode properties within a test-only class.

I read an interesting article about this (attached) by Scott Hanselman. He points out that as opposed to using object you can use dynamic to reference methods from older COM objects where the compiler doesn't know a method exists. I found the link useful.
Scott Hanselman - C#4 and the dynamic keyword

What's the best source of information on the DLR (.NET 4.0 beta 1)?

I'm currently researching the 2nd edition of C# in Depth, and trying to implement "dynamic protocol buffers" - i.e. a level of dynamic support on top of my existing protocol buffer library. As such, I have a DlrMessage type derived from DynamicObject. After a little bit of playing around I've managed to get it to respond to simple properties with remarkably little code, but I want to go a lot further - and to really understand what's going on.
So far I haven't found any good explanations of the DLR - and a lot of the blog posts are effectively out of date now, as things have changed (I believe) between the previous CTP and .NET 4.0 beta 1. The MSDN documentation for DynamicObject is pretty minimal at the moment.
My most immediate query is whether there's a simple way of saying, "Use reflection to bind any calls I can't handle, using this particular object." (In other words, I want to augment the existing reflection binding rather than doing everything myself, if possible.) Unfortunately I'm not getting very far by guesswork.
Are there any definitive and recent sources of documentation I should know about? I'm aware that part of writing about a new technology is exploration, but a helping hand would be appreciated :)

The best source I've found, and one I read frequently, is the last year's worth of posts on Chris Burrows' blog.
There's also the official DLR documentation page which is off the main DLR site.

I too am researching this at the moment and there is not too much info yet. I can't help with your query, but below is some information I have found:
There is a fair amount within the PDC videos.
http://channel9.msdn.com/pdc2008/TL44/
http://channel9.msdn.com/pdc2008/TL10/
This article talks about how the DLR works with IronPython:
http://msdn.microsoft.com/en-us/magazine/cc163344.aspx
There is a very small amount in the training kit preview at: http://www.microsoft.com/downloads/details.aspx?FamilyID=752cb725-969b-4732-a383-ed5740f02e93&displayLang=en
Hope this helps
Alex

By default DynamicObject will say "fallback to reflection" if your Try* functions return false. So you already can inherit and add properties/fields/methods to your subclass that will all be handled by reflection if the dynamic path doesn't do the lookup.
Going more in depth, you might want to look at IDynamicMetaObjectProvider. At this lower level, the way you say "fall back to reflection" is to call the Fallback* method on the incoming DynamicMetaObjectBinder. This then lets the calling language provide the resolution. You can then return that AST or compose it into a larger AST which you return. Basically, Fallback* lets you get the AST that the calling language would produce, including the correct error (exception, undefined in JS, etc.).
The way DynamicObject does the fallback to reflection is that it actually calls the binder's Fallback* method twice. The first time it falls back without an "errorSuggestion" parameter. This gets either the error or the AST which was built using reflection. It then produces an AST which is something like:
if (TryGetMember("name", out value)) {
    return value;
} else {
    return resultOfFallback;
}
It then takes this combined AST and hands it in as the error suggestion for the binder on a second fallback. The binder should then respect this errorSuggestion if the binding is unsuccessful. But if the .NET member is present, then errorSuggestion is thrown away and the .NET binding takes precedence. And finally, if the language doesn't know whether the binding was successful (e.g. the language has a "method missing" type feature), it can again combine the ASTs with its dynamic checks. So using Fallback you can not only say "do reflection" but also choose whether dynamic or static members take precedence.
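The precedence described above can be observed with a small DynamicObject subclass. Bag is a hypothetical type, loosely modelled on the DlrMessage idea in the question; the sketch assumes the C# runtime binder is the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

// Hypothetical dynamic property bag.
public sealed class Bag : DynamicObject
{
    private readonly Dictionary<string, object> fields = new Dictionary<string, object>();

    // An ordinary .NET member: the language binder finds it first,
    // so TryGetMember is never asked about "Count".
    public int Count { get { return fields.Count; } }

    public override bool TrySetMember(SetMemberBinder binder, object value)
    {
        fields[binder.Name] = value;
        return true;
    }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        // Returning false lets the binder's own error surface to the caller.
        return fields.TryGetValue(binder.Name, out result);
    }
}
```

With dynamic bag = new Bag(), the assignment bag.Name = "soup" is handled by TrySetMember, bag.Count is bound to the real property, and reading an unknown name such as bag.Missing ends in the C# binder's RuntimeBinderException.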
