While browsing the .NET Core source I noticed that, even in source form, iterator classes are implemented manually instead of relying on the yield statement and the compiler-generated IEnumerable implementation.
You can see at this line, for example, the declaration and implementation of the Where iterator: https://github.com/dotnet/corefx/blob/master/src/System.Linq/src/System/Linq/Enumerable.cs#L168
I'm assuming that if they went through the trouble of doing this instead of using a simple yield statement there must be some benefit, but I can't immediately see what it is. It seems pretty similar to what I remember the compiler doing automatically, from reading Eric Lippert's blog a few years back, and I remember that when I naively reimplemented LINQ with yield statements in its early days to understand it better, the performance profile was similar to that of the .NET version.
It piqued my curiosity, but it's also a genuinely important question for me, as I'm in the middle of a fairly big in-memory data project, and if I'm missing something obvious that makes this approach better I would love to know the tradeoffs.
Edit: to clarify, I do understand why they can't just yield in the Where method (different enumeration for different container types). What I don't understand is why they implement the iterator itself by hand; that is, instead of forking to different iterators, why not fork to different methods, iterate differently based on type, and yield to get the auto-implemented state machine instead of the manual "case 1, goto, case 2" one.
One possible reason is that the specialized iterators perform a few optimizations, like combining selectors and predicates and taking advantage of indexed collections.
The usefulness of these is that some sources can be iterated more efficiently than the state machine the compiler generates for yield would allow. And by creating these custom iterators, they can pass this extra information about the source along to subsequent LINQ operations in the chain (instead of making that information available only to the first operation in the chain). Thus, all Where and Select operations that don't have anything else between them can be executed as one.
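For illustration, here is a simplified sketch (not the actual corefx code; the names and details are made up) of how a single hand-written iterator can fuse a Where followed by a Select into one object, which is the kind of combining the real Where/Select iterators perform:

using System;
using System.Collections;
using System.Collections.Generic;

// Illustrative sketch of a hand-written iterator that fuses a predicate and a
// selector. The real Enumerable implementation also handles cloning for
// repeated enumeration, thread ownership, and further Where/Select combining.
sealed class WhereSelectIterator<TSource, TResult> : IEnumerable<TResult>, IEnumerator<TResult>
{
    private readonly IEnumerable<TSource> _source;
    private readonly Func<TSource, bool> _predicate;
    private readonly Func<TSource, TResult> _selector;
    private IEnumerator<TSource> _enumerator;

    public WhereSelectIterator(IEnumerable<TSource> source,
                               Func<TSource, bool> predicate,
                               Func<TSource, TResult> selector)
    {
        _source = source;
        _predicate = predicate;
        _selector = selector;
    }

    public TResult Current { get; private set; }
    object IEnumerator.Current { get { return Current; } }

    public IEnumerator<TResult> GetEnumerator()
    {
        // The real code hands out a clone when the same instance is enumerated
        // twice; that detail is omitted here for brevity.
        _enumerator = _source.GetEnumerator();
        return this;
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    public bool MoveNext()
    {
        // One loop applies both the filter and the projection, so a
        // Where(...).Select(...) chain costs a single iterator object
        // instead of two nested state machines.
        while (_enumerator.MoveNext())
        {
            TSource item = _enumerator.Current;
            if (_predicate(item))
            {
                Current = _selector(item);
                return true;
            }
        }
        return false;
    }

    public void Reset() { throw new NotSupportedException(); }
    public void Dispose() { if (_enumerator != null) _enumerator.Dispose(); }
}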
Related
A web service I use (I have no control over it) returns an XML string, which I convert to an XDocument and then use to create a list of objects of a particular type:
private static List<ProductDetail> productList(XmlDocument _xDoc) {
    XDocument xdoc = XDocument.Load(new XmlNodeReader(_xDoc));
    var pList = from p in xdoc.Root.Elements("DataRow")
                select new ProductDetail
                {
                    Product = (string)p.Element("Product"),
                    ProductDesc = (string)p.Element("ProductDesc"),
                    ExtraKey = (string)p.Element("ExtraKey"),
                    SalesGroup = (string)p.Element("SalesGroup"),
                    Donation = (string)p.Element("Donation"),
                    Subscription = (string)p.Element("Subscription"),
                    StockItem = (string)p.Element("StockItem"),
                    MinimumQuantity = (string)p.Element("MinimumQuantity"),
                    MaximumQuantity = (string)p.Element("MaximumQuantity"),
                    ProductVATCategory = (string)p.Element("ProductVATCategory"),
                    DespatchMethod = (string)p.Element("DespatchMethod"),
                    SalesDescription = (string)p.Element("SalesDescription"),
                    HistoryOnly = (string)p.Element("HistoryOnly"),
                    Warehouse = (string)p.Element("Warehouse"),
                    LastStockCount = (string)p.Element("LastStockCount"),
                    UsesProductNumbers = (string)p.Element("UsesProductNumbers"),
                    SalesQuantity = (string)p.Element("SalesQuantity"),
                    AmendedBy = (string)p.Element("AmendedBy")
                };
    return pList.ToList();
}
This works fine and is very fast. However, it means I have to maintain this code separately from the model if it changes, and I was wondering whether there is a shortcut to avoid having to specify each individual field as I'm doing. I already have a class for ProductDetail, so is there some way of using that at the object level? I have a feeling the answer may be "yes, but using reflection", which would probably have a negative impact on speed, so it's not something I'd be keen on.
There is another option I can think of, beyond the two you already talked about in your question (manual mapping and a reflection-based approach):
Dynamic Methods
It is DynamicMethod (the MSDN documentation for it includes an example as well).
This approach can give you the best of both worlds:
Performance
Dynamic
But the catch is that you trade these off against:
Increased code complexity
Reduced debuggability.
It can be thought of as a hybrid of the two, in the sense that it can be as flexible/dynamic as you'd like it to be (the effort will vary accordingly), yet it holds performance benefits similar to manually mapped objects (your sample code above).
With this approach there is a one-time cost of initializing your DynamicMethod at an appropriate time (application startup, first use, etc.), after which it is cached for subsequent use. If you need the mapper only a handful of times, it is much less useful, but I am assuming that is not the case for your scenario.
Technique I'd recommend
You'll notice from the example that creating a DynamicMethod involves emitting IL op-codes at runtime (and reflection as well), and that can look like a very complex and difficult task, because it is more complex code and harder to debug. However, what I tend to do in this situation is write the method I'd like to emit with DynamicMethod by hand (you already have that), and then study the IL the compiler generates for that method.
In other words, you don't have to be a master at writing IL by hand. If you are not already familiar with how to write IL for a given scenario, just write it up in plain C# as you imagine it, compile it, and then use ILDASM to figure out how to emit similar IL for your DynamicMethod.
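To make the idea concrete, here is a hedged sketch of what such a mapper could look like for a single property; the ProductDetail class and Product property come from your question, while the mapper class, method names, and the decision to emit one setter per property are illustrative assumptions, not a recommended final design:

using System;
using System.Reflection;
using System.Reflection.Emit;

// Illustrative only: emit a setter for one property once, cache the delegate,
// and call it per row with no per-call reflection.
public class ProductDetail { public string Product { get; set; } /* ...other properties... */ }

public static class ProductDetailMapper
{
    // Built once and cached; reused for every row that is mapped.
    public static readonly Action<ProductDetail, string> SetProduct = BuildSetter("Product");

    private static Action<ProductDetail, string> BuildSetter(string propertyName)
    {
        MethodInfo setter = typeof(ProductDetail).GetProperty(propertyName).GetSetMethod();

        var dm = new DynamicMethod(
            "Set_" + propertyName,
            null,                                            // no return value
            new[] { typeof(ProductDetail), typeof(string) }, // (target, value)
            typeof(ProductDetail).Module,
            true);

        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);          // load the target instance
        il.Emit(OpCodes.Ldarg_1);          // load the string value
        il.Emit(OpCodes.Callvirt, setter); // call the property setter
        il.Emit(OpCodes.Ret);

        return (Action<ProductDetail, string>)dm.CreateDelegate(typeof(Action<ProductDetail, string>));
    }
}

// Usage (hypothetical): ProductDetailMapper.SetProduct(detail, (string)p.Element("Product"));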
Other Options - Deserializers
For the sake of completeness, I'd say the problem you are trying to solve is, in general, that of deserializing an XML payload into plain objects (POCOs). It is not clear from the question whether you considered deserializers and excluded them as an option, or whether they weren't considered at all.
If you didn't look in that direction, you can start with what is already available in .NET: DataContractSerializer. There may be other implementations you can find on the internet which suit your needs.
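If that direction is of interest, a minimal sketch of the idea looks roughly like the following; the contract name, namespace, and member order are assumptions and would need to match your actual payload (DataContractSerializer also expects members in a defined order, alphabetical by default or as set via Order):

using System.IO;
using System.Runtime.Serialization;
using System.Xml;

// Illustrative only: element and namespace names must match the real XML.
[DataContract(Name = "DataRow", Namespace = "")]
public class ProductDetailDto
{
    [DataMember(Order = 0)] public string Product { get; set; }
    [DataMember(Order = 1)] public string ProductDesc { get; set; }
    // ...remaining fields...
}

public static class ProductDetailReader
{
    public static ProductDetailDto ReadOne(string dataRowXml)
    {
        var serializer = new DataContractSerializer(typeof(ProductDetailDto));
        using (var reader = XmlReader.Create(new StringReader(dataRowXml)))
        {
            return (ProductDetailDto)serializer.ReadObject(reader);
        }
    }
}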
The reasons why they may not be a good fit for your scenario (if I understand it right) could be:
Deserializers tend to be generic in functionality, and hence will not reach the same level of performance as the code you have above.
Maybe there is a deserializer out there which uses DynamicMethod for performance, but I have never seen one. Also note that different implementations can obviously have different performance characteristics.
The data may not lend itself to easy use with common/well-known deserializers.
For example, if the XML you have is deeply nested and you want to map properties/elements at different levels without creating a complex object hierarchy. (One might argue that such problems can be solved with XSL transforms.)
The implementation may not have the features you really need.
For example, what if the object to which the data needs to be mapped is of a generic type?
Conclusion
Manual mapping is fastest, but least flexible.
Reflection will certainly be slow, but can provide higher flexibility.
DynamicMethod (part of the System.Reflection.Emit namespace) can give you the most flexibility and performance (assuming heavy use with a cached instance), if you are willing to pay the price of even higher complexity and development effort.
Deserializers give you varying degrees of flexibility and performance, but are not always suitable.
EDIT: I realized that, for completeness, some more options could be added to the list.
T4 templates - Also hard to develop (IMHO) and debug/troubleshoot (it depends on the complexity of what you are trying to achieve with them; I had to debug them by installing a VS add-in and attaching one Visual Studio instance as debuggee to another Visual Studio instance acting as debugger - ugly. Maybe there are better ways, but I'd avoid them). Manual action may still be required to refresh the generated code. (There is a good chance this can be automated during the build, but I have never tried it.)
Code generation - You write a custom code generation tool and hook it up as a pre-build step in the appropriate projects. (You could do this code generation as part of the build or, in theory, after your application is deployed and runs for the first time, but for this scenario build-time generation should be more suitable.)
Possible Duplicate:
How costly is Reflection?
For the sake of making my code generic, I have started to use reflection to obtain some properties of DTO objects and to set them.
Does using reflection to get and set properties affect performance that badly compared to hard-coded setters?
Yes, there is a cost involved in using Reflection.
However, the actual impact on the overall application performance varies. One rule of thumb is to never use Reflection in code that gets executed many times, such as inside loops; the per-call overhead adds up quickly there.
In many cases you can write generic code using delegates instead of Reflection, as described in this blog post.
Yes, reflection is slow.
You can try to decrease its impact by caching the xxxInfo objects (MethodInfo, PropertyInfo, etc.) you retrieve via reflection per reflected Type, e.g. by keeping them in a dictionary. Subsequent lookups in the dictionary are faster than retrieving the information every time.
You can also search here at SO for some questions about Reflection performance. For certain edge-cases there are pretty performant workarounds like using CreateDelegate to call methods instead of using MethodInfo.Invoke().
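As a hedged illustration of both of those suggestions (the class and method names here are made up), caching the reflected metadata and wrapping a getter in a typed delegate might look like this:

using System;
using System.Collections.Generic;
using System.Reflection;

// Illustrative sketch: cache PropertyInfo per type, and turn a getter into a
// typed delegate once so later calls avoid MethodInfo.Invoke().
// Note: the plain Dictionary is not thread-safe; use ConcurrentDictionary
// if multiple threads populate the cache.
static class PropertyCache
{
    private static readonly Dictionary<Type, PropertyInfo[]> _cache =
        new Dictionary<Type, PropertyInfo[]>();

    public static PropertyInfo[] GetProperties(Type type)
    {
        PropertyInfo[] props;
        if (!_cache.TryGetValue(type, out props))
        {
            props = type.GetProperties(BindingFlags.Public | BindingFlags.Instance);
            _cache[type] = props; // subsequent lookups skip reflection entirely
        }
        return props;
    }

    // Wrap a string property's getter in a delegate once, then call it cheaply.
    public static Func<T, string> GetStringGetter<T>(string propertyName)
    {
        MethodInfo getter = typeof(T).GetProperty(propertyName).GetGetMethod();
        return (Func<T, string>)Delegate.CreateDelegate(typeof(Func<T, string>), getter);
    }
}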
Aside from the fact that it's slower, having to set properties through reflection is a design issue: you have apparently separated concerns or encapsulated properties through object-oriented design, and that design is now preventing you from setting them directly. I would say take a look at your design (there can be edge cases, though) instead of reaching for reflection.
Another downside, aside from the performance impact, is that you're using a statically typed language, where the compiler checks your code as it compiles it. Normally, at compile time you have the certainty that all the properties you're using exist and are spelled correctly. When you start to use reflection, you push that check to runtime, which is a real shame, as you're (in my opinion) giving up one of the biggest benefits of a statically typed language. It will also limit your refactoring opportunities in the (near) future, since you can no longer be sure you have replaced all occurrences of an assignment when, for example, renaming a property.
I still use Wintellect's PowerCollections library, even though it is aging and no longer maintained, because it did a good job of covering holes left by the standard MS collections libraries. But LINQ and C# 4.0 are poised to replace PowerCollections...
I was very happy to discover System.Linq.Lookup because it should replace Wintellect.PowerCollections.MultiDictionary in my toolkit. But Lookup seems to be immutable! Is that true? Can you only create a populated Lookup by calling ToLookup?
Yes, you can only create a Lookup by calling ToLookup. The immutable nature of it means that it's easy to share across threads etc, of course.
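For illustration, a small made-up example of building and reading one (there is no Add or Remove; to "change" it you re-run ToLookup on a modified source sequence):

using System;
using System.Linq;

var words = new[] { "apple", "avocado", "banana", "blueberry" };

// The only way to obtain a lookup is to project an existing sequence.
ILookup<char, string> byFirstLetter = words.ToLookup(w => w[0]);

foreach (string w in byFirstLetter['a'])   // "apple", "avocado"
    Console.WriteLine(w);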
If you want a mutable version, you could always use the Edulinq implementation as a starting point. It's internally mutable, but externally immutable - and I wouldn't be surprised if the Microsoft implementation worked in a similar way.
Personally I'm rarely in a situation where I want to mutate the lookup - I would prefer to perform appropriate transformations on the input first. I would encourage you to think in this way too - I find myself wishing for better immutability support from other collections (e.g. Dictionary) more often than I wish that Lookup were mutable :)
That is correct. Lookup is immutable; you can create an instance by using the LINQ ToLookup() extension method. Technically, even that fact is an implementation detail, since the method returns the ILookup interface, which in the future might be implemented by some other concrete class.
Is there a practical difference between .All() and .TrueForAll() when operating on a List? I know that .All() is part of IEnumerable, so why add .TrueForAll()?
From the docs for List<T>.TrueForAll:
Supported in: 4, 3.5, 3.0, 2.0
So it was added before Enumerable.All.
The same is true for a bunch of other List<T> methods which work in a similar way to their LINQ counterparts. Note that ConvertAll is somewhat different, in that it has the advantage of knowing that it's working on a List<T> and creating a List<TResult>, so it gets to preallocate whatever it needs.
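As a small made-up illustration of that last point, the two calls below produce the same result, but ConvertAll knows the source count up front and can size the result list exactly, while the LINQ pipeline has to grow its list as it goes:

using System.Collections.Generic;
using System.Linq;

List<int> numbers = new List<int> { 1, 2, 3 };

// List<T> instance method, available since .NET 2.0; preallocates the result.
List<string> a = numbers.ConvertAll(n => n.ToString());

// LINQ equivalent; Select returns a lazy sequence, so ToList grows dynamically.
List<string> b = numbers.Select(n => n.ToString()).ToList();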
TrueForAll existed in .NET 2.0, before LINQ arrived in .NET 3.5.
See: http://msdn.microsoft.com/en-us/library/kdxe4x4w(v=VS.80).aspx
TrueForAll appears to be specific to List, while All is part of LINQ.
My guess is that the former dates back to the .NET 2 days, while the latter is new in .NET 3.5.
Sorry for digging this out, but I came across this question and saw that the actual question about the differences has not been properly answered.
The differences are:
The All() extension method (Enumerable.All) does an additional check on the extended object: if it is null, it throws an exception.
Enumerable.All() is not guaranteed to check elements in order. In theory, the specification would allow the items to be checked in a different order on every call. The documentation for List<T>.TrueForAll() specifies that it always processes the items in list order.
The second point is because Enumerable.All() must use foreach or the MoveNext() method to iterate over the items, while List<T>.TrueForAll() internally uses a for loop over the list indices.
However, you can be pretty sure that the foreach/MoveNext() approach will also return the elements in list order, because a lot of programs expect that and would break if it were ever changed.
From a performance point of view, List<T>.TrueForAll() should be faster because it has one check fewer and a for loop over a list is cheaper than foreach. However, the compiler usually does a good job of optimizing here, so in practice there will probably be (almost) no measurable difference.
Conclusion: List<T>.TrueForAll() is the better choice in theory, but in practice it makes no difference.
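For illustration, here is a small made-up snippet showing the two calls side by side, including the null-handling difference described above:

using System;
using System.Collections.Generic;
using System.Linq;

List<int> numbers = new List<int> { 2, 4, 6 };

bool allEven1 = numbers.All(n => n % 2 == 0);        // LINQ, works on any IEnumerable<T>
bool allEven2 = numbers.TrueForAll(n => n % 2 == 0); // List<T> instance method, .NET 2.0+
Console.WriteLine(allEven1 == allEven2);             // True

List<int> missing = null;
// missing.All(n => true);        // throws ArgumentNullException ("source")
// missing.TrueForAll(n => true); // throws NullReferenceException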
Basically, because this method existed before LINQ did. TrueForAll on List<T> originated in .NET Framework 2.0.
TrueForAll is not an extension method; it has been in the framework since version 2.0.
Every time I see a search function, the code behind it is a mess: several hundred lines, spaghetti code, and almost ALWAYS one huge method. A programming language (Java/C#/PHP/etc.) is used to construct one big fat SQL query. Many, many if/elses.
There must be a more elegant way to do this? Or is this what you get when you use an RDBMS instead of a flat data structure?
I'd be willing to learn more about this topic, perhaps even buy a book. /Adam
Use the query object pattern. If you can, also use an ORM; it will make things easier.
The implementation details depend on your platform and architecture, but here are some samples:
http://www.theserverside.com/patterns/thread.tss?thread_id=29319
http://www.lostechies.com/blogs/chad_myers/archive/2008/08/01/query-objects-with-the-repository-pattern.aspx
In my current project we use a simplified version of the query object pattern that mausch mentions. In our case we have a search criteria object that consists of a field and a value, and several such objects can be added to a list. We did have an operator property from the start, but it was never used so we removed it. Whether the criteria are treated as AND or OR depends on the search method that is used (I would say that it is AND in 95% of the cases in that project).
The find methods themselves do not do a lot with this information; they invoke stored procs in the DB, passing the criteria as parameters. Most of those procs are fairly straightforward, even though we have a couple that involve some string handling to unpack lists of criteria for certain fields.
The code from a caller's perspective might look something like this (the Controller classes wrap repetitive stuff such as instantiating a search object with a configurable implementation*, populating it with search criteria, and so on):
CustomerCollection customers = CustomerController.Find(new SearchCriterion("Name", "<the customer name>"));
If more than one search criterion is needed, a collection can be passed instead. Inside the finder function, the code loops over the collection and maps the values present to the appropriate parameters on an SqlCommand object.
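A hedged sketch of that criteria-to-parameters mapping (SearchCriterion, the proc name, and the method shape are illustrative; the real project wraps this in controller and data-layer classes):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public class SearchCriterion
{
    public string Field { get; private set; }
    public object Value { get; private set; }

    public SearchCriterion(string field, object value)
    {
        Field = field;
        Value = value;
    }
}

public static class CustomerFinder
{
    public static SqlCommand BuildFindCommand(SqlConnection connection,
                                               IEnumerable<SearchCriterion> criteria)
    {
        var command = new SqlCommand("dbo.FindCustomers", connection)
        {
            CommandType = CommandType.StoredProcedure
        };

        // Each criterion becomes a named parameter; the stored proc decides
        // how (and whether) to use it.
        foreach (SearchCriterion criterion in criteria)
        {
            command.Parameters.AddWithValue("@" + criterion.Field, criterion.Value);
        }

        return command;
    }
}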
This approach has worked out rather well for us.
*) The "configurable implementation" means that we have created an architecture wher the search objects are defined as abstract classes that merely will define the interface and contain some generic pre- and post validation. The actual search code is implemented in separate decendent classes; which amongst other things allowed us to quickly create a "fake data layer" that could be used for mocking away the database for some of the unit tests.
Have you looked at the Lucene project (http://lucene.apache.org)? It's designed exactly for this purpose. The idea is that you build and then maintain a set of indexes that are then easily searchable. The lifecycle works like this:
Write a bunch of SQL statements that index all of the searchable areas of your database
Run them against the full database to create an initial index of your data
Every time the data changes, update these indexes.
The query language is much simpler then; your queries become much more targeted.
There is a great project in the Hibernate tool suite called Hibernate Search (http://search.hibernate.org) that maintains your indexes for you if you are using Hibernate as your ORM.
I've been tinkering with this thought a bit (since I actually had to implement something like this a while ago), and I've come to the conclusion that there are two ways I'd do it to make it both work and, especially, stay maintainable. But before going into those, here's some history first.
1. Why does the problem even exist
Most search functions are based on algorithms and technologies derived from the ones in databases. SQL was originally developed in the early 1970s (Wikipedia says 1974), and back then programming was a whole different kind of beast than it is today: every byte counted, every extra function call could make the difference between excellent performance and bankruptcy, and code was written by people who thought in assembly... well, you get the point.
The problem is that those technologies have mostly been carried over to the modern world unchanged (and why should they change; don't fix something that isn't broken), which means the old paradigms creep along too. And then there are cases where the original algorithm is misinterpreted for some reason and you end up with what you have now, like slow regular expressions. One thing needs underlining here, though: the technologies themselves aren't bad, it's usually just the legacy paradigms that are!
2. Solutions to the problem
The solution I ended up using was a mix of the builder pattern and the query object pattern (already linked by mausch). As an example, if I were to make a pragmatic system for building SQL queries, it would look something like this:
SQL.select("column1", "column2")
   .from("relation")
   .where().valueEquals("column1", "hello")
   .and().valueIsLargerThan("column2", 3)
   .toSQL();
The obvious downside of this is that the builder pattern tends to be a bit too verbose. The upsides are that each of the build steps (= methods) is quite small by nature; for example, .valueIsLargerThan("a", x) may be nothing more than return columnName + ">=" + x;. This means they're easily unit-testable, and one of the biggest upsides is that they can easily be generated from external sources such as XML, and, most notably, it's rather easy to create a converter from, say, an SQL query to a Lucene query (Lucene has automation for this already, afaik; this is just an example).
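To make that concrete, here is a rough, hedged sketch in C# of what such a builder could look like (the class and method names are made up; a real implementation would also track parameters instead of concatenating values, to avoid SQL injection):

using System.Collections.Generic;

// Illustrative only: a tiny fluent builder that assembles the steps above into a string.
public class Sql
{
    private readonly List<string> _columns = new List<string>();
    private readonly List<string> _conditions = new List<string>();
    private string _table;

    public static Sql Select(params string[] columns)
    {
        var sql = new Sql();
        sql._columns.AddRange(columns);
        return sql;
    }

    public Sql From(string table) { _table = table; return this; }

    // Each build step is tiny and trivially unit-testable.
    public Sql WhereValueEquals(string column, object value)
    {
        _conditions.Add(column + " = '" + value + "'");
        return this;
    }

    public Sql AndValueIsLargerThan(string column, object value)
    {
        _conditions.Add(column + " >= " + value);
        return this;
    }

    public string ToSql()
    {
        string sql = "SELECT " + string.Join(", ", _columns) + " FROM " + _table;
        if (_conditions.Count > 0)
            sql += " WHERE " + string.Join(" AND ", _conditions);
        return sql;
    }
}

// Usage:
// string q = Sql.Select("column1", "column2")
//               .From("relation")
//               .WhereValueEquals("column1", "hello")
//               .AndValueIsLargerThan("column2", 3)
//               .ToSql();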
The second approach is one I'd rather use, but I really avoid it because it's not order-safe (unless you spend a good amount of time creating metadata helper classes), while builders are. It's easier to write an example than to explain in more detail what I mean, so:
import static com.org.whatever.SQL.*;

query(select("column1", "column2"),
      from("relation"),
      where(valueEquals("column1", "hello"),
            valueIsLargerThan("column2", 3)));
I do count static imports as a downside, but other than that, it looks like something I'd really want to use.