Suppose I get data from a service (that I can't control) as:
public class Data
{
// an array of column names
public string[] ColumnNames { get; set; }
// an array of rows that contain arrays of strings as column values
public string[][] Rows { get; set; }
}
and on the middle tier I would like to map/translate this to an IEnumerable<Entity>, where column names in Data may be represented as properties on my Entity class. I say may because I may not need all the data returned by the service, just some of it.
Transformation
This is an abstraction of an algorithm that would do the translation:
1. Create an IDictionary<string, int> of ColumnNames so I can easily map individual column names to array indices in individual rows.
2. Use reflection to examine my Entity properties' names so I'm able to match them with the column names.
3. Iterate through Data.Rows, create my Entity objects, and populate their properties according to the mapping from step 1, most likely using reflection and SetValue on the properties (a baseline sketch follows).
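A minimal sketch of that baseline, assuming the Data and Entity classes above and using Convert.ChangeType as a naive stand-in for whatever string-to-property-type conversion is needed:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

static IEnumerable<Entity> Map(Data data)
{
    // 1. column name -> array index
    var indices = new Dictionary<string, int>();
    for (int i = 0; i < data.ColumnNames.Length; i++)
        indices[data.ColumnNames[i]] = i;

    // 2. match writable Entity properties to columns by name
    var matches = typeof(Entity).GetProperties()
        .Where(p => p.CanWrite && indices.ContainsKey(p.Name))
        .Select(p => new { Property = p, Index = indices[p.Name] })
        .ToList();

    // 3. one Entity per row, populated via reflection
    foreach (string[] row in data.Rows)
    {
        var entity = new Entity();
        foreach (var m in matches)
            m.Property.SetValue(entity,
                Convert.ChangeType(row[m.Index], m.Property.PropertyType), null);
        yield return entity;
    }
}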
Optimisation
The above algorithm would of course work, but I think that because it uses reflection it should do some caching and possibly some on-the-fly compilation, which could speed things up considerably.
Once steps 1 and 2 are done, we could actually generate a method that takes an array of strings and instantiates my entities using the indices directly, then compile it and cache it for future reuse.
I'm usually getting a page of results, so subsequent requests would reuse the same compiled method.
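To illustrate the idea, here is a sketch of how such a method could be generated with expression trees and cached per entity type (it assumes string-typed properties so no conversion is shown, and assumes the column layout is stable across pages; MapperCache is my name, not an existing library):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

static class MapperCache
{
    // one compiled delegate per entity type, reused for subsequent pages
    static readonly ConcurrentDictionary<Type, Delegate> cache =
        new ConcurrentDictionary<Type, Delegate>();

    public static Func<string[], T> Get<T>(IDictionary<string, int> indices) where T : new()
    {
        return (Func<string[], T>)cache.GetOrAdd(typeof(T), _ => Build<T>(indices));
    }

    static Func<string[], T> Build<T>(IDictionary<string, int> indices) where T : new()
    {
        // row => new T { Prop1 = row[i1], Prop2 = row[i2], ... }
        ParameterExpression row = Expression.Parameter(typeof(string[]), "row");
        var bindings = new List<MemberBinding>();
        foreach (var property in typeof(T).GetProperties().Where(p => p.CanWrite))
        {
            int index;
            if (!indices.TryGetValue(property.Name, out index))
                continue; // column not needed / not returned by the service
            bindings.Add(Expression.Bind(property,
                Expression.ArrayIndex(row, Expression.Constant(index))));
        }
        var body = Expression.MemberInit(Expression.New(typeof(T)), bindings);
        return Expression.Lambda<Func<string[], T>>(body, row).Compile();
    }
}

With that in place, mapping a page is just var mapper = MapperCache.Get<Entity>(indices); followed by data.Rows.Select(mapper).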
Additional fact
This is not imperative to the question (and answers), but I also created two attributes that help with column-to-property mapping when the names don't match. I created the obvious MapNameAttribute (which takes a string and can optionally enable case sensitivity) and an IgnoreMappingAttribute for properties on my Entity that shouldn't map to any data. These attributes are read in step 2 of the algorithm above, so property names are collected and renamed according to this declarative metadata to match the column names. Both are sketched below.
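For reference, those two attributes could be as simple as the following sketch (the property names here are mine; only the behaviour described above is assumed):

using System;

// maps a property to a differently named column
[AttributeUsage(AttributeTargets.Property)]
public sealed class MapNameAttribute : Attribute
{
    public MapNameAttribute(string name, bool caseSensitive = false)
    {
        Name = name;
        CaseSensitive = caseSensitive;
    }
    public string Name { get; private set; }
    public bool CaseSensitive { get; private set; }
}

// excludes a property from mapping altogether
[AttributeUsage(AttributeTargets.Property)]
public sealed class IgnoreMappingAttribute : Attribute { }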
Question
What is the best and easiest way to generate and compile such a method? Lambda expressions? CSharpCodeProvider class?
Do you maybe have an example of generated and compiled code that does a similar thing? I guess that mappings are a rather common scenario.
Note: in the meantime I will be examining PetaPoco (and maybe also Massive) because, as far as I know, they both do on-the-fly compilation and caching exactly for mapping purposes.
Suggestion: obtain FastMember from NuGet
Then just use:
var accessor = TypeAccessor.Create(typeof(Entity));
Then, in your loop, once you have found the memberName and newValue for the current iteration:
accessor[obj, memberName] = newValue;
This is designed to do what you are asking; internally, it maintains a set of types it has seen before. When a new type is seen, it creates a new subclass of TypeAccessor on the fly (via TypeBuilder) and caches it. Each unique TypeAccessor is aware of the properties for that type, and basically just acts like a:
switch(memberName) {
    case "Foo": obj.Foo = (int)newValue; break;
    case "Bar": obj.Bar = (string)newValue; break;
    // etc
}
Because this is cached, you only pay any cost (and not really a big cost) the first time it ever sees your type; the rest of the time, it is free. Because it uses ILGenerator directly, it also avoids any unnecessary abstraction, for example via Expression or CodeDom, so it is about as fast as it can be.
(I should also clarify that for dynamic types, i.e. types that implement IDynamicMetaObjectProvider, it can use a single instance to support every object).
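Tying it back to the Data shape in the question, the mapping loop might look roughly like this (a sketch; only TypeAccessor.Create and the accessor indexer are FastMember calls, everything else is assumed from the question):

var accessor = TypeAccessor.Create(typeof(Entity));

// column name -> index in each row, built once per result page
var indices = new Dictionary<string, int>();
for (int i = 0; i < data.ColumnNames.Length; i++)
    indices[data.ColumnNames[i]] = i;

var entities = new List<Entity>();
foreach (string[] row in data.Rows)
{
    var obj = new Entity();
    // assumes every column maps to a property; otherwise filter the
    // dictionary against the entity's members first
    foreach (var pair in indices)
        accessor[obj, pair.Key] = row[pair.Value];
    entities.Add(obj);
}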
Additional:
What you could do is: take the existing FastMember code, and edit it to process MapNameAttribute and IgnoreMappingAttribute during WriteGetter and WriteSetter; then all the voodoo happens on your data names, rather than the member names.
This would mean changing the lines:
il.Emit(OpCodes.Ldstr, prop.Name);
and
il.Emit(OpCodes.Ldstr, field.Name);
in both WriteGetter and WriteSetter, and doing a continue at the start of the foreach loops if it should be ignored.
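A hypothetical helper for that edit might look like this (MapNameAttribute.Name is assumed to expose the string passed to the attribute; it is not part of FastMember):

// Resolve the name to emit for a member, honouring the mapping attributes
// from the question; null means the member should be skipped ('continue').
static string ResolveName(MemberInfo member)
{
    if (Attribute.IsDefined(member, typeof(IgnoreMappingAttribute)))
        return null;
    var map = (MapNameAttribute)Attribute.GetCustomAttribute(member, typeof(MapNameAttribute));
    return map != null ? map.Name : member.Name;
}

// ...and inside WriteGetter/WriteSetter:
// il.Emit(OpCodes.Ldstr, ResolveName(prop));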
Related
I have a (not quite valid) CSV file that contains rows of multiple types. Any record could be one of about 6 different types and each type has a different number of properties. The first part of any row contains the timestamp and the type of record, followed by a standard CSV of the data.
Example
1456057920 PERSON, Ted Danson, 123 Fake Street, 555-123-3214, blah
1476195120 PLACE, Detroit, Michigan, 12345
1440581532 THING, Bucket, Has holes, Not a good bucket
And to make matters more complex, I need to be able to do different things with the records depending on certain criteria. So a PERSON type can be automatically inserted into a DB without user input, but a THING type would be displayed on screen for the user to review and approve before adding to DB and continuing the parse, etc.
Normally, I would use a library like CsvHelper to map the records to a type, but in this case, since the types can differ and the first part uses a space instead of a comma, I don't know how to do that with a standard CSV library. So currently, what I am doing in each loop is:
Split the string on the comma.
Split the first array item on the space.
Use a switch statement to determine the type and create the object.
Put that object into a List of type object.
Get confused as to where to go now, because I now have a list of various types and will have to use yet another switch or if to determine the next parts.
I don't really know for sure if I will actually need that List but I have a feeling the user will want the ability to manually flip through records in the file.
By this point, this is starting to make for very long, confusing code, and my gut feeling tells me there has to be a cleaner way to do this. I thought maybe using Type.GetType(string) would help simplify the code some, but this seems like it might be terribly inefficient in a loop with 10k+ records and might make things even more confusing. I then thought maybe making some interfaces might help, but I'm not the greatest at using interfaces in this context and I seem to end up in about this same situation.
So what would be a more manageable way to parse this file? Are there any C# parsing libraries out there that would be able to handle something like this?
You can implement an IRecord interface that has a Timestamp property and a Process method (perhaps others as well).
Then, implement concrete types for each type of record.
Use a switch statement to determine the type and create and populate the correct concrete type.
Place each object in a List<IRecord>.
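A sketch of that shape (only IRecord, Timestamp and Process come from the description above; everything else is illustrative):

using System;

public interface IRecord
{
    DateTime Timestamp { get; }
    void Process();   // insert into DB, queue for user review, etc.
}

public class PersonRecord : IRecord
{
    public DateTime Timestamp { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
    public string Phone { get; set; }
    public void Process() { /* safe to insert straight into the DB */ }
}

// In the parsing loop:
//   string[] parts = line.Split(',');
//   string[] header = parts[0].Split(' ');   // e.g. "1456057920 PERSON"
//   switch (header[1]) { case "PERSON": records.Add(new PersonRecord { ... }); break; /* ... */ }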
After that you can do whatever you need. Some examples:
Loop through each item and call Process() to handle it.
Use LINQ's .OfType<ConcreteType>() to segment the list. (Warning: with 10k records this would be slow, since it traverses the entire list for each concrete type.)
Use an overridden ToString method to give a single text representation of the IRecord
If using WPF, you can define a datatype template for each concrete type, bind an ItemsControl derivative to a collection of IRecords and your "detail" display (e.g. ListItem or separate ContentControl) will automagically display the item using the correct DataTemplate
Continuing from my comment - well, that depends. What you described is actually pretty good for starters. You can of course expand it into a series of factories, one per object type, so that you move from an explicit switch to searching for the first factory that can parse a line (a sketch follows below). That might prove useful if you're looking to add more object types in the future: you just add another factory for the new kind of object. It's up to you whether these objects should share a common interface; an interface is generally used to define behaviour, and that doesn't really seem to be the case here. Maybe you should instead ask yourself whether you actually need strongly typed objects at all. Perhaps what you need is a simple class with an ObjectType property and a Dictionary of properties, plus some helper methods for typed access such as GetBool, GetInt or a generic Get.
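A sketch of that factory idea, reusing the IRecord/PersonRecord shapes from the earlier answer (all names are illustrative):

using System;
using System.Linq;

public interface IRecordFactory
{
    bool CanParse(string line);   // e.g. check the type token after the timestamp
    IRecord Parse(string line);
}

public class PersonFactory : IRecordFactory
{
    public bool CanParse(string line)
    {
        return line.Split(',')[0].EndsWith(" PERSON");
    }

    public IRecord Parse(string line)
    {
        string[] parts = line.Split(',');
        string[] header = parts[0].Split(' ');
        return new PersonRecord
        {
            Timestamp = new DateTime(1970, 1, 1).AddSeconds(long.Parse(header[0])),
            Name = parts[1].Trim(),
            Address = parts[2].Trim(),
            Phone = parts[3].Trim()
        };
    }
}

// Replaces the explicit switch: the first factory that recognises a line parses it.
// IRecordFactory[] factories = { new PersonFactory(), new PlaceFactory(), new ThingFactory() };
// IRecord record = factories.First(f => f.CanParse(line)).Parse(line);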
For the purpose of XML serialisation I had to disband a Dictionary collection I was using. I wrote a very straightforward alternative which consists of 2 classes:
NameValueItem: contains Name (Key) and Value
NameValueCollection: derived from CollectionBase and maintains a collection of NameValueItem objects.
I've included some standard methods to help maintain the collection (Add, Contains and Remove). So just like most Dictionary types, the Name (or Key) is unique:
public bool Contains(NameValueItem item)
{
foreach (NameValueItem lItem in List)
if(lItem.Name.Equals(item.Name))
return true;
return false;
}
Add uses this Contains method to determine whether to include a given item into the collection:
public void Add(NameValueItem item)
{
if (!Contains(item))
List.Add(item);
}
As bog-standard, straightforward and easy as this code appears, it's proving to be a little sluggish. Is there anything that can be done to improve its performance? Or are there alternatives I could use?
I was considering creating a NameValueHashSet, which is derived from HashSet.
Optional...:
I had a question which I was going to ask in a separate thread, but I'll leave it up to you as to whether you'd like to address it or not.
I wanted to add 2 properties to the NameValueCollection, Names and Values, which return a List of strings from the collection of NameValueItem objects. Instead I built them as methods, GetNames() and GetValues(), since I have to build the result each time (i.e. create a List, iterate over the collection adding each name/value to it, and return the List).
Is this the better alternative, in terms of good coding practice, performance, etc.? My view of properties has always been that they should be as stripped back as possible: only references, simple arithmetic and so on, with no layers of processing. If that holds, then this belongs in a method. Thoughts?
Perhaps you shouldn't try to rebuild what the framework already provides? Your implementation of a dictionary is going to perform poorly because it does not scale: Contains is a linear scan, so every Add is O(n). The built-in Dictionary<TKey, TValue> has O(1) access, and O(1) for most insert and delete operations (unless there are collisions or the internal storage must be expanded).
You can extend the existing dictionary to provide XML serialization support; see this question and answers: Serialize Class containing Dictionary member
As for your second question - Dictionary already provides methods for getting an IEnumerable of the keys and values. This enumerates the keys and/or values as requested by the caller; that is delayed execution and is likely the preferred method over building a full List every time (which requires iterating through all the elements in the dictionary). If the caller wants a list then they just do dictionary.Values.ToList().
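For completeness, the usual shape of such an extension looks roughly like this (a sketch of the approach described in the linked question, not its exact code):

using System.Collections.Generic;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

// XML-serializable dictionary: keeps Dictionary's O(1) behaviour and adds
// IXmlSerializable so XmlSerializer can round-trip it.
public class SerializableDictionary<TKey, TValue>
    : Dictionary<TKey, TValue>, IXmlSerializable
{
    public XmlSchema GetSchema() { return null; }

    public void ReadXml(XmlReader reader)
    {
        var keySerializer = new XmlSerializer(typeof(TKey));
        var valueSerializer = new XmlSerializer(typeof(TValue));
        bool wasEmpty = reader.IsEmptyElement;
        reader.Read();
        if (wasEmpty) return;
        while (reader.NodeType != XmlNodeType.EndElement)
        {
            reader.ReadStartElement("item");
            reader.ReadStartElement("key");
            TKey key = (TKey)keySerializer.Deserialize(reader);
            reader.ReadEndElement();
            reader.ReadStartElement("value");
            TValue value = (TValue)valueSerializer.Deserialize(reader);
            reader.ReadEndElement();
            this.Add(key, value);
            reader.ReadEndElement();
            reader.MoveToContent();
        }
        reader.ReadEndElement();
    }

    public void WriteXml(XmlWriter writer)
    {
        var keySerializer = new XmlSerializer(typeof(TKey));
        var valueSerializer = new XmlSerializer(typeof(TValue));
        foreach (KeyValuePair<TKey, TValue> pair in this)
        {
            writer.WriteStartElement("item");
            writer.WriteStartElement("key");
            keySerializer.Serialize(writer, pair.Key);
            writer.WriteEndElement();
            writer.WriteStartElement("value");
            valueSerializer.Serialize(writer, pair.Value);
            writer.WriteEndElement();
            writer.WriteEndElement();
        }
    }
}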
I am writing a class that encapsulates the complexity of retrieving data from a database using ADO.NET. Its core method is
private void Read<T>(Action<T> action) where T : class, new() {
var matches = new LinkedList<KeyValuePair<int, PropertyInfo>>();
// Read the current result set's metadata.
using (DataTable schema = this.reader.GetSchemaTable()) {
DataRowCollection fields = schema.Rows;
// Retrieve the target type's properties.
// This is functionally equivalent to typeof(T).GetProperties(), but
// previously retrieved PropertyInfo[]s are memoized for efficiency.
var properties = ReflectionHelper.GetProperties(typeof(T));
// Attempt to match the target type's columns...
foreach (PropertyInfo property in properties) {
string name = property.Name;
Type type = property.PropertyType;
// ... with the current result set's fields...
foreach (DataRow field in fields) {
// ... according to their names and types.
if ((string)field["ColumnName"] == name && field["DataType"] == type) {
// Store all successful matches in memory.
matches.AddLast(new KeyValuePair<int, PropertyInfo>((int)field["ColumnOrdinal"], property));
fields.Remove(field);
break;
}
}
}
}
// For each row, create an instance of the target type and set its
// properties to the row's values for their matched fields.
while (this.reader.Read()) {
T result = new T();
foreach (var match in matches)
match.Value.SetValue(result, this.reader[match.Key], null);
action(result);
}
// Go to the next result set.
this.reader.NextResult();
}
Regarding the method's correctness, which unfortunately I cannot test right now, I have the following questions:
When a single IDataReader is used to retrieve data from two or more result sets, does IDataReader.GetSchemaTable return the metadata of all result sets, or just the metadata corresponding to the current result set?
Are the column ordinals retrieved by IDataReader.GetSchemaTable equal to the ordinals used by the indexer IDataReader[int]? If not, is there any way to map the former into the latter?
Regarding the method's efficiency, I have the following question:
What is DataRowCollection's underlying data structure? Even if that question cannot be answered, at least, what is the asymptotic computational complexity of removing a DataRow from a DataRowCollection using DataRowCollection.Remove()?
And, regarding the method's evident ugliness, I have the following questions:
Is there any way to retrieve specific metadata (e.g., just the columns' ordinals, names and types), not the full blown schema table, from an IDataReader?
Is the cast to string in (string)field["ColumnName"] == name necessary? How does .NET compare an object variable that happens to contain a reference to a string to a string variable: by reference value or by internal data value? (When in doubt, I prefer to err on the side of correctness, thus the cast; but, when able to remove all doubt, I prefer to do so.)
Even though I am using KeyValuePair<int, PropertyInfo>s to represent pairs of matched fields and properties, those pairs are not actual key-value pairs. They are just plain-old ordinary 2-tuples. However, version 2.0 of the .NET Framework does not provide a tuple data type, and, if I were to create my own special purpose tuple, I still would not know where to declare it. In C++, the most natural place would be inside the method. But this is C# and in-method type definitions are illegal. What should I do? Cope with the inelegance of using a type that, by definition, is not the most appropriate (KeyValuePair<int, PropertyInfo>) or cope with the inability to declare a type where it fits best?
As far as A1, I believe that until IDataReader.NextResult() is invoked, GetSchemaTable will only return the information for the current result set.
Then, when NextResult() is invoked, you would have to call GetSchemaTable again to get the information about the new current result set.
HTH.
I can answer a couple of these:
A2) Yes, the column ordinals that come out of GetSchemaTable are the same column ordinals that are used for the indexer.
B1) I'm not sure, but it won't matter, because you'll throw if you remove from the DataRowCollection while you're enumerating it in the foreach. If I were you, I'd make a hash table of the fields or the properties to help match them up instead of worrying about this linear-search-with-removal (see the sketch below).
EDIT: I was wrong, this is a lie -- as Eduardo points out below, it won't throw. But it's still sort of slow if you think you might ever have a type with more than a few dozen properties.
C2) Yes, it's necessary, or else it would compare by reference.
C3) I would be inclined to use KeyValuePair anyway.
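For B1, that hash-table matching could look roughly like this inside the method from the question (assumes System.Linq is imported; ReflectionHelper.GetProperties is the memoized helper already used there):

// Index the target type's properties by name once, then make a single pass
// over the schema rows; no removal from the DataRowCollection is needed.
var byName = ReflectionHelper.GetProperties(typeof(T)).ToDictionary(p => p.Name);

foreach (DataRow field in schema.Rows)
{
    PropertyInfo property;
    if (byName.TryGetValue((string)field["ColumnName"], out property)
        && (Type)field["DataType"] == property.PropertyType)
    {
        matches.AddLast(new KeyValuePair<int, PropertyInfo>(
            (int)field["ColumnOrdinal"], property));
    }
}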
I have an object with many properties of many datatypes, which holds the settings for a search of a large cache. I would like to pass only the values that have changed on this object relative to the base settings, and I would like to pass this information as a very short string.
So what I need is a technique for doing this in C# .NET 4 (pseudo code follows):
var changes = Diff(changedobject, baseobject);
return changes.ToShortString()
and later on a remote machine which only knows the object
var changedobject = CreateObject(diffstring)
Any ideas would be much appreciated.
Thanks
I do not have code for you, but this should be pretty clear:
Create a metadata class which, using reflection, gets the properties of the object and caches the getter method of each. Then for each property it does the same if it is a complex property, etc., so you get an object graph similar to your class. Then, when passed two objects of the same type, it recursively loops through the metadata and calls the getters in order to compare, and returns the result. You can also make it generically typed.
Implementation would be something like this:
public class Metadata<T>
{
private Dictionary<string, Metadata<T>> _properties = new Dictionary<string, Metadata<T>>();
private MethodInfo _getter;
private void BuildMetadata()
{
Type t = typeof (T);
foreach (PropertyInfo propertyInfo in t.GetProperties())
{
// ....
}
}
}
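A flat (non-recursive) version of the same idea, returning only the changed top-level properties, might look like this; nested or complex properties would need the recursion described above:

using System;
using System.Collections.Generic;
using System.Reflection;

static Dictionary<string, object> Diff<T>(T changed, T baseline)
{
    var result = new Dictionary<string, object>();
    foreach (PropertyInfo property in typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance))
    {
        if (!property.CanRead)
            continue;
        object a = property.GetValue(changed, null);
        object b = property.GetValue(baseline, null);
        if (!object.Equals(a, b))
            result[property.Name] = a;   // only changed values travel
    }
    return result;
}

The dictionary can then be serialised into a short string and applied on the remote machine through the matching setters.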
The CSLA.NET framework uses reflection to walk the properties and fields of a type and write them to a hash table. This hash table is then serialized and stored.
The type is called UndoableBase, in the CSLA.NET project.
I can't remember if it records diffs, but the premise is that you need a copy of the state before (in the CSLA case, this would be the previously serialized item).
Assuming you have a copy of an item (an actual copying, not a reference) as a source of originals, then you can use reflection to check each property and add it to the hash table if it has changed.
Send this hash table over the wire.
An alternative, and one that I would look at more closely, is to prefix your serialized item with bit flags denoting which fields are present in the forthcoming stream. This will likely be more compact than a hash table of name/value pairs. You can include this in your reflection solution by first sorting the fields/properties alphabetically (or by some other means). This won't be version tolerant, however, if you store the serialized data across versions of the type.
So WPF doesn't support standard sorting or filtering behavior for views of CompositeCollections, so what would be a best practice for solving this problem?
There are two or more object collections of different types. You want to combine them into a single sortable and filterable collection (without having to manually implement sort or filter).
One of the approaches I've considered is to create a new object collection with only a few core properties, including the ones that I would want the collection sorted on, and an object instance of each type.
class MyCompositeObject
{
    ObjectType Type;        // an enum indicating which wrapped object is populated
    DateTime CreatedDate;
    string SomeAttribute;
    myObjectType1 Obj1;
    myObjectType2 Obj2;
}
class MyCompositeObjects : List<MyCompositeObject> { }
And then loop through my two object collections to build the new composite collection. Obviously this is a bit of a brute force method, but it would work. I'd get all the default view sorting and filtering behavior on my new composite object collection, and I'd be able to put a data template on it to display my list items properly depending on which type is actually stored in that composite item.
What suggestions are there for doing this in a more elegant way?
I'm not yet very familiar with WPF but I see this as a question about sorting and filtering List<T> collections.
(without having to manually implement sort or filter)
Would you reconsider implementing your own sort or filter functions? In my experience it is easy to use. The examples below use an anonymous delegate but you could easily define your own method or a class to implement a complex sort or filter. Such a class could even have properties to configure and change the sort and filter dynamically.
Use List<T>.Sort(Comparison<T> comparison) with your custom compare function:
// Sort according to the value of SomeAttribute
List<MyCompositeObject> myList = ...;
myList.Sort(delegate(MyCompositeObject a, MyCompositeObject b)
{
// return -1 if a < b
// return 0 if a == b
// return 1 if a > b
return a.SomeAttribute.CompareTo(b.SomeAttribute);
});
A similar approach for getting a sub-collection of items from the list.
Use List<T>.FindAll(Predicate<T> match) with your custom filter function:
// Select all objects where Obj1 and Obj2 are not null
List<MyCompositeObject> subList = myList.FindAll(delegate(MyCompositeObject a)
{
    // return true to include 'a' in the sub-collection
    return (a.Obj1 != null) && (a.Obj2 != null);
});
The "brute force" method you mention is actually the ideal solution. Mind you, all the objects are in RAM and there is no I/O bottleneck, so you can pretty much sort and filter millions of objects in less than a second on any modern computer.
The most elegant way to work with collections is the System.Linq namespace in .NET 3.5.
Thanks - I also considered LINQ to objects, but my concern there is loss of flexibility for typed data templates, which I need to display the objects in my list.
If you can't predict at this point how people will sort and filter your object collection, then you should look at the System.Linq.Expressions namespace to build your lambda expressions on demand at runtime: first you let the user build an expression, then you compile and run it, and at the end you use the reflection namespace to enumerate the results. It's trickier to wrap your head around, but it's an invaluable feature, probably (to me, definitely) an even more ground-breaking feature than LINQ itself. A small sketch follows.
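Here is a small sketch of a filter built at runtime from a member name and value supplied by the user (illustrative code, not tied to a particular UI):

using System;
using System.Linq.Expressions;

// Builds and compiles: item => item.<memberName> == value
static Func<T, bool> BuildEqualsFilter<T>(string memberName, object value)
{
    ParameterExpression item = Expression.Parameter(typeof(T), "item");
    Expression member = Expression.PropertyOrField(item, memberName);
    Expression constant = Expression.Constant(value, member.Type);
    Expression body = Expression.Equal(member, constant);
    return Expression.Lambda<Func<T, bool>>(body, item).Compile();
}

// e.g. var filter = BuildEqualsFilter<MyCompositeObject>("SomeAttribute", "foo");
//      var subset = myList.Where(filter).ToList();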
Update: I found a much more elegant solution:
class MyCompositeObject
{
    DateTime CreatedDate;
    string SomeAttribute;
    Object Obj1;
}
class MyCompositeObjects : List<MyCompositeObject> { }
I found that, thanks to reflection, the specific type stored in Obj1 is resolved at runtime and the type-specific DataTemplate is applied as expected!