Generating Clean Business Object Classes from a horrible data source - C#

I'm starting with a blank slate for a layer of business/entity classes, but with an existing data source. Normally this would be cake, fire up Entity Framework, point it at the DB, and call it a day, but for the time being, I need to get the actual data from a third party vendor data source that...
Can only be accessed from a generic ODBC provider (not SQL)
Has very bad syntax, naming, and in some cases, duplicate data everywhere
Has well over 100 tables, which, when combined, would be in the neighborhood of 1,000 columns/properties of data
I only need read-only from this DB, so I don't need to support updates / inserts / deletes, just need to extract the "dirty" data into clean entities once per application run.
I'd like to isolate the database as much as possible from my nice, clean, well-named entity classes.
Is there a good way to:
Generate the initial entity classes from the DB tables? (That way I'm just mostly renaming and fixing class properties to sanitize it instead of starting from scratch.)
Map the data from the DB into my nice, clean classes without writing 1000 property sets?
Edit: The main goal here is not so much to come up with a pseudo-ORM as it is to generate as much of the code as possible based on what's already there, and then tweak as needed, eliminating a lot of the manual, labor-intensive class-writing tasks.

I like to avoid auto-generating classes from database schemas, just to keep more control over the structure of the entities - that, and what makes sense for a database structure doesn't always make sense for a class structure. For 3rd party or legacy systems, we use an adapter pattern for our business objects, converting the old or poorly structured format - be it in a database, flat files, other components, etc. - into something more suitable.
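If it helps, a minimal sketch of that adapter idea might look like the following - the vendor column names (CUST_NO, CUST_NM, EML_ADDR) and the clean Customer entity are hypothetical, not taken from your schema:

using System;
using System.Data;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
}

public class VendorCustomerAdapter
{
    // Converts one raw vendor row (e.g. from an OdbcDataReader) into a clean, well-named entity.
    public Customer ToCustomer(IDataRecord row)
    {
        return new Customer
        {
            Id    = Convert.ToInt32(row["CUST_NO"]),
            Name  = (row["CUST_NM"] as string)?.Trim(),
            Email = (row["EML_ADDR"] as string)?.Trim()
        };
    }
}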
That being said, you could create views or stored procedures to present the data in a manner more suited to your needs than the database's current structure. This assumes that you are allowed to touch the database.

Dump the database. I mean, redesign the schema, migrate your data and you should be good.

Related

How to build dynamic integration adapter?

We have a scenario where we have multiple sources of data coming in from various external systems through API calls, SQL tables, and physical files, and we now have to map it against a number of transaction templates. I want to build an integration adapter and UI where I can choose any entity data class and map its fields to a class or action that will be used to create a transaction in our financial system.
I want to have an object type or class that can be modified dynamically, set up links between these objects, and possibly create a set of rules that defines the interaction between these objects. I have seen versions of this type of software that use a drag-and-drop UI to do the mappings, so that would be the ideal end goal.
I'm coming from a C# .NET background, so I need some advice or tips on where to start and what to look at.
I am currently doing something similar. I wrote some code to turn data from our legacy system into JSON objects written out to flat files (one file per data record in a table), and then wrote some code to clean up that data and push it into Acumatica via the REST API.
I like flat-file JSON objects because they can easily be hashed, and the hashes used to compare them against new data that comes in. Only data where the hash has changed needs to be merged or overwritten and then pushed into the target system. The file names are usually primary key values from whatever table you're working with. Our legacy system has a hierarchical (non-tabular, non-SQL-like) data structure, so my mileage may be greater than doing this with a well-normalized SQL database.
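A rough sketch of that hash-and-compare step, assuming one JSON file per record (the class and file-path parameters here are illustrative, not from my actual code):

using System;
using System.IO;
using System.Security.Cryptography;

public static class ChangeDetector
{
    // Hashes one exported JSON record file.
    public static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
        }
    }

    // True when the newly exported record differs from the previous export,
    // i.e. it needs to be merged/overwritten and pushed to the target system.
    public static bool HasChanged(string newExportPath, string previousExportPath)
    {
        if (!File.Exists(previousExportPath)) return true;
        return HashFile(newExportPath) != HashFile(previousExportPath);
    }
}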
There are also products like Alteryx that are built for doing data pipelines the way you have proposed.
I would caution you to be practical in building these types of things. For us, for example, only a limited set of data needs to come over, so we don't need the perfect abstraction for every data type. We inevitably had to do cleanup on legacy/3rd party data as well, and those problems aren't always so easy to abstract. I had previously built a system using closures for function passing in order to write custom cleanup routines for any abstract data problem I might encounter (which sort of sounds like what you're talking about), but realized in the end that just writing simpler code that deals with specific data problems was cleaner and easier to maintain. In the end, there is probably only a limited amount of data that needs to be synced.

DDD: Is it ok to generate/update my entities classes from changes in database schema?

Some time ago, at work, we had to change our main system to be "cross-RDBMS". I'm not sure if this is the correct term, but basically the system worked only with MS SQL Server, and in order to accommodate a new client we had to make it possible for the system to work with both MS SQL Server and Oracle.
We don't use an ORM, for reasons. Instead, we use a custom ADO-based data access layer.
Before this change, we relied heavily on stored procedures, database functions, triggers, etc. A substantial amount of business logic was located in the database itself.
We decided to get rid of all stored procedures, triggers, and the like, and basically reduce the database to a mere storage layer.
To handle migrations, we created a .json file which contains a representation of our database schema: tables, columns, indexes, constraints, etc. A simple application was created to edit this file. By using it, we're able to edit existing tables and columns and add new ones.
This JSON file is versioned in our repository. When the application is executed, a routine reads the file, constructing a representation of the database in memory. It then reads the metadata from the actual database, compares it to the in-memory representation, and generates scripts based on the differences found.
Finally, the scripts are executed, updating the database schema.
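A very condensed sketch of that diff step might look like the following - the ColumnDef type is hypothetical, real metadata would come from INFORMATION_SCHEMA (or the Oracle data dictionary) rather than hard-coded lists, and SQL Server-style quoting is shown for brevity:

using System;
using System.Collections.Generic;
using System.Linq;

public class ColumnDef
{
    public string Table { get; set; }
    public string Name { get; set; }
    public string SqlType { get; set; }
}

public static class SchemaDiff
{
    // Emits ALTER TABLE ... ADD statements for columns that exist in the
    // JSON-defined schema but are missing from the live database.
    public static IEnumerable<string> MissingColumnScripts(
        IEnumerable<ColumnDef> desired, IEnumerable<ColumnDef> actual)
    {
        var existing = new HashSet<string>(
            actual.Select(c => $"{c.Table}.{c.Name}"),
            StringComparer.OrdinalIgnoreCase);

        return desired
            .Where(c => !existing.Contains($"{c.Table}.{c.Name}"))
            .Select(c => $"ALTER TABLE [{c.Table}] ADD [{c.Name}] {c.SqlType} NULL;");
    }
}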
So, now comes my real problem. When a new column is added, the developer needs to:
- add a new property to the POCO class that represents a row in that table;
- edit the method which maps the table columns to the class properties, adding the new column/property mapping;
- edit the class which handles database commands, creating a new parameter for the new column.
When this approach was initially implemented, I thought about auto-generating and updating the POCO classes based on changes in the JSON file. This would keep the classes in sync with the database schema, and we wouldn't have to worry about developers forgetting to update the classes after creating new columns.
This feature wasn't implemented, though, and now I'm having serious doubts about it, mostly because I've been studying Clean Architecture/Onion Architecture and Domain-Driven Design.
From a DDD perspective, everything should be about the domain, which in turn should be totally ignorant of its persistence.
So, my question is basically: how can I maintain my domain model and my database schema in sync, without violating DRY and without using a "database-centric" approach?
DDD puts the focus on the domain language and its representation in domain classes. DB issues are not the primary concern of DDD.
Therefore, generating domain classes from the database schema is the wrong direction if the intention is to apply DDD.
This question is more about finding a decent way to manage DB upgrades, which has little to do with DDD. Unit/integration tests for basic read/write DB operations can go a long way in helping developers remember to edit the required files when DB columns are altered.
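For instance, a test along these lines fails whenever a POCO drifts out of sync with its table (NUnit is shown; the Customer class, table name, and GetColumnNames helper are hypothetical stand-ins for your own code):

using System.Collections.Generic;
using System.Linq;
using NUnit.Framework;

// Hypothetical POCO representing a row in the Customer table.
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

[TestFixture]
public class SchemaSyncTests
{
    [Test]
    public void Customer_poco_matches_customer_table()
    {
        // e.g. SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'Customer',
        // executed through your ADO-based data access layer.
        List<string> dbColumns = GetColumnNames("Customer");

        var pocoProperties = typeof(Customer).GetProperties().Select(p => p.Name).ToList();

        CollectionAssert.AreEquivalent(dbColumns, pocoProperties);
    }

    private static List<string> GetColumnNames(string table)
    {
        // Placeholder; wire this up to your own metadata query.
        throw new System.NotImplementedException();
    }
}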

Fast Loading/Storing of large tree/graph-ish data structures using conventional database (long)

Requirements:
Let's have a conventional MySQL database server.
Let's have a C# .NET app using the MySQL connector.
Let's have a set of tables (designed to fit the complex tree/graph-like data structures and their relations).
The data structures can be fairly large and can contain many (hundreds to thousands) blob items (10 to 100 KiB per item).
Allow loading/storing of data structures from/to the database as fast as possible.
Let's have three freely convertible representations of any data structure (sub)item: XML, C# object in memory, SQL.
Current solution:
It emerged via some sort of evolution, which means no great knowledge of applicable/typical methods was present (also - as usual - the original requirements were not so demanding).
Each data structure (sub)item implements a (custom) ISqlSerializable interface having (among others) ReadSql(...) and WriteSql(...) methods. (Inspired by IXmlSerializable, which also has to be implemented due to XML serialization requirements.)
Custom (De)Serializer calls such methods for each (root) data structure.
These calls emit SQL commands for reading/writing the data structure itself followed by the serialization requests for children (if any) - the same approach used by the IXmlSerializable.
This sequence emits the SQL code for the whole tree/graph regardless of what the root is (you can fully de/serialize any data structure by calling ISqlSerializable - I mean, the root might be almost any structure implementing ISqlSerializable).
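For readers: the interface described above presumably has roughly this shape (a sketch only, since the question doesn't show the real signatures):

using MySql.Data.MySqlClient;

public interface ISqlSerializable
{
    // Emits/executes the SQL needed to load this (sub)item, then delegates to its children.
    void ReadSql(MySqlConnection connection);

    // Emits/executes the SQL needed to persist this (sub)item, then delegates to its children.
    void WriteSql(MySqlConnection connection);
}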
Problems:
This approach is terribly slow, because, e.g., reading/writing N adjacent objects from/to a single table means N SELECT/INSERT commands instead of a possible single, efficient one.
A sort of caching has been introduced (for loading only, so far) to speed up the process:
The preselected root structure prefetches all the tables/rows/columns via DataAdapters/DataSets and complex SQL commands.
All the (sub)structures then read themselves from the cache built this way and do not emit SQL code to the server.
It dramatically improved the speed, but it's still a lot of hardwired SQL-emitting code for a few preselected root structures.
The responsibility to load/store the children's data is now up to the parent, which needs to know a wider context ("what exactly is the full rest of me") than in the previous case ("handle myself, then let the children handle themselves").
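As a concrete illustration of the prefetch idea (a sketch with illustrative names only): one SELECT per table fills a DataTable, the rows are indexed by primary key, and the (sub)items then read from the cache instead of issuing their own SELECTs.

using System;
using System.Collections.Generic;
using System.Data;
using MySql.Data.MySqlClient;

public class TableCache
{
    private readonly Dictionary<long, DataRow> _rowsById = new Dictionary<long, DataRow>();

    public TableCache(MySqlConnection connection, string table, string idColumn)
    {
        var data = new DataTable();
        using (var adapter = new MySqlDataAdapter($"SELECT * FROM `{table}`", connection))
        {
            adapter.Fill(data);   // one round trip for N rows instead of N SELECTs
        }

        foreach (DataRow row in data.Rows)
        {
            _rowsById[Convert.ToInt64(row[idColumn])] = row;
        }
    }

    // (Sub)items look up their row here instead of emitting SQL to the server.
    public DataRow Get(long id) => _rowsById[id];
}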
Questions:
What is the typical method to solve such a task? I mean, what would people who do this kind of task "every day" use? I don't suspect this is a particularly special scenario...
A nice candidate seems to be the stored procedure approach:
SQL/app code isolation.
Possibility to update/tune the SQL code with little or no impact on the app code.
Probably better efficiency when everything SQL-related runs on the server side.
Is that the best way (before we rebuild the whole app)?
Is there a standard way to create a SQL-stored-procedure-driven conversion accepting/generating multi-table DataSets, writing/reading multiple tables to/from a MySQL database (something more practical than a simple 'hello world' example which will perform superfast even on a 386)?
Please note: I'm not expecting fully working copy/paste source code in the first answer, just general ideas/thoughts/pointers from developers more experienced in this particular area, considering the current state and possible future improvements. I hope I'll manage the rest. Thanks!

Serializing complex EF model over JSON

I have done a lot of searching and experimenting and have been unable to find a workable resolution to this problem.
Environment/Tools
Visual Studio 2013
C#
Three tier web application:
Database tier: SQL Server 2012
Middle tier: Entity Framework 6.* using Database First, Web API 2.*
Presentation tier: MVC 5 w/Razor, Bootstrap, jQuery, etc.
Background
I am building a web application for a client that requires a strict three-tier architecture. Specifically, the presentation layer must perform all data access through a web service. The presentation layer cannot access a database directly. The application allows a small group of paid staff members to manage people, waiting lists, and the resources they are waiting for. Based on the requirements the data model/database design is entirely centered around the people (User table).
Problem
When the presentation layer requests something, say a Resource, it is related to at least one User, which in turn is related to some other table, say Roles, which are related to many more Users, which are related to many more Roles and other things. The point being that, when I query for just about anything, EF wants to bring in almost the entire database.
Normally this would be okay because of EF's default lazy-load behavior, but when serializing just about any object to JSON to return it to the presentation layer, the Newtonsoft.Json serializer hangs for a long time and then blows the stack.
What I Have Tried
Here is what I have attempted so far:
Set Newtonsoft's JSON serializer ReferenceLoopHandling setting to Ignore. No luck. This is not a cyclic graph issue; it is just the sheer volume of data that gets brought in (there are over 20,000 Users).
Clear/reset unneeded collections and set reference properties to null. This showed some promise, but I could not get around Entity Framework's desire to track everything.
Just setting nav properties to null/clear causes those changes to be saved back to the database on the next .SaveChanges() (NOTE: This is an assumption here, but seemed pretty sound. If anyone knows different, please speak up).
Detaching the entities causes EF to automatically clear ALL collections and set ALL reference properties to null, whether I wanted it to or not.
Using .AsNoTracking() on everything threw some exception about not allowing non-tracked entities to have navigation properties (I don't recall the exact details).
Use AutoMapper to make copies of the object graph, only including related objects I specify. This approach is basically working, but in the process of (I believe) performing the auto-mapping, all of the navigation properties are accessed, causing EF to query and resolve them. In one case this leads to almost 300,000 database calls during a single request to the web service.
What I am Looking For
In short, has anyone had to tackle this problem before and come up with a working and performant solution?
Lacking that, any pointers for at least where to look for how to handle this would be greatly appreciated.
Additional Note: It occurred to me as I wrote this that I could possibly combine the second and third items above. In other words, set/clear nav properties, then automap the graph to new objects, then detach everything so it won't get saved (or perhaps wrap it in a transaction and roll it back at the end). However, if there is a more elegant solution I would rather use that.
Thanks,
Dave
It is true that doing what you are asking for is very difficult and it's an architectural trap I see a lot of projects get stuck in.
Even if this problem were solvable, you'd basically end up just having a data layer which wraps the database and destroys performance because you can't leverage SQL properly.
Instead, consider building your data access service in such a way that it returns meaningful objects containing meaningful data; that is, only the data required to perform a specific task outlined in the requirements documentation. It is true that a post is related to an account, which has many achievements, etc., etc. But usually all I want is the text and the name of the poster. And I don't want it for one post; I want it for each post in a page. Instead, write data services and methods which do things that are relevant to your application.
To clarify, it's the difference between returning a Page object containing a list of Posts which contain only a poster name and message, and returning entire EF objects containing large amounts of irrelevant data such as IDs and auditing data such as creation time.
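As a sketch of that difference (the Post, Account, and DTO types below are hypothetical): projecting into a small DTO inside the query means EF translates the Select into SQL, so only the needed columns come back and nothing is lazily loaded afterwards.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entities standing in for the real EF model.
public class Account
{
    public string DisplayName { get; set; }
}

public class Post
{
    public string Text { get; set; }
    public DateTime CreatedOn { get; set; }
    public virtual Account Account { get; set; }
}

public class PostSummaryDto
{
    public string PosterName { get; set; }
    public string Message { get; set; }
}

public static class PostQueries
{
    // Accepts the EF DbSet (an IQueryable) so the projection is translated to SQL.
    public static List<PostSummaryDto> GetPage(IQueryable<Post> posts, int pageIndex, int pageSize)
    {
        return posts
            .OrderByDescending(p => p.CreatedOn)
            .Skip(pageIndex * pageSize)
            .Take(pageSize)
            .Select(p => new PostSummaryDto
            {
                PosterName = p.Account.DisplayName, // resolved inside the query, not lazily afterwards
                Message = p.Text
            })
            .ToList();
    }
}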
Consider the Twitter API. If it were implemented as above, performance would be abysmal with the amount of traffic Twitter gets. And most of the information returned (costing CPU time, disk activity, DB connections as they're held open longer, network bandwidth) would be completely irrelevant to what developers want to do.
Instead, the API exposes what would be useful to a developer looking to make a Twitter app. Get me the posts by this user. Get me the bio for this user. This is probably implemented as very nicely tuned SQL queries for someone as big as Twitter, but for a smaller client, EF is great as long as you don't attempt to defeat its performance features.
This additionally makes testing much easier as the smaller, more relevant data objects are far easier to mock.
For three-tier applications, especially if you are going to expose your entities "raw" in services, I would recommend that you disable lazy loading and proxy generation in EF. Your alternative would be to use DTOs instead of entities, so that the web services return a model object tailored to the service instead of the entity (as suggested by jameswilddev).
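In EF6 both switches live on DbContext.Configuration; for example (MyEntities stands in for your generated Database First context):

using (var context = new MyEntities())
{
    // Entities come back as plain POCOs with no on-demand loading,
    // which keeps the JSON serializer from walking the whole graph.
    context.Configuration.LazyLoadingEnabled = false;
    context.Configuration.ProxyCreationEnabled = false;

    // ... run queries and return DTOs/entities from here ...
}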
Either way will work, and has a variety of trade-offs.
If you are using EF in a multi-tier environment, I would highly recommend Julia Lerman's DbContext book (I have no affiliation): http://www.amazon.com/Programming-Entity-Framework-Julia-Lerman-ebook/dp/B007ECU7IC
There is a chapter in the book dedicated to working with DbContext in multi-tier environments (you will see the same recommendations about Lazy Load and Proxy). It also talks about how to manage inserts and updates in a multi-tier environment.
I had such a project, and it was a stressful one... I also needed to load a large amount of data, process it from different angles, and pass it to a complex dashboard for charts and tables.
My optimizations were:
1- Instead of using EF to load data, I called old-school stored procedures (and, for further optimization, grouped things to reduce the result table as much as possible for the charts, e.g. a query returns one table from which multiple chart datasets can be extracted).
2- More importantly, instead of Newtonsoft's JSON I used fastJSON, whose performance was notable (it is really fast, but not compatible with complex objects; a simple example would be view models that have lists of models inside them, and so on).
It's better to read the pros and cons of fastJSON first:
https://www.codeproject.com/Articles/159450/fastJSON
3- In the relational database design, which is the prime suspect in this problem, it might be good to create the tables that hold raw data to be processed (most probably for analytics) in a denormalized schema, which improves query performance.
Also beware of using the model classes generated by the EF designer from the database for reading or selecting data, especially when you want to serialize them (sometimes I think about separating the same schema model into two sets of otherwise identical classes/models for writing and reading data, such that the write models keep the benefit of the virtual collections coming from foreign keys and the read models ignore them... I am not sure about this).
NOTE: In the case of very, very large data it's better to go deeper and set up In-Memory OLTP for the particular tables that contain facts or raw data; however, in that case your table acts like a non-relational table, much like NoSQL.
NOTE: For example, in MS SQL you can use the benefits of SQLCLR, which lets you write code in C#, VB, etc. and call it from T-SQL - in other words, handle data processing at the database level.
4- For an interactive view which needs to load data, I think it's better to consider which information should be processed server-side and which can be handled client-side (sometimes it's better to query data from the client side... however, you should consider that data on the client side can be accessed by the user); it is situation-dependent.
5- In the case of a large raw data table in the view, using datatables.min.js is a good idea, and everyone suggests using server-side paging on tables.
6- In the case of importing and exporting data from big files, OLE DB is the best choice, I think.
However, I still doubt these are exact solutions. If anybody has practical solutions, please mention them ;).
I have fiddled with a similar problem using EF model first, and found the following solution satisfying for "One to Many" relations:
Include "Foreign key properties" in the sub-entities and use this for later look-up.
Set the get/set modifiers of any "Navigation Properties" (sub-collections) in your EF entities to private.
This will give you an object that does not expose the sub-collections, and you will only get the main properties serialized. This workaround will require some restructuring of your LINQ queries, querying directly against your table of sub-items with the foreign key property as your filter, like this:
var myFitnessClubs = context.FitnessClubs
    ?.Where(f => f.FitnessClubChainID == myFitnessClubChain.ID);
Note 1:
You may of course choose to implement this solution only partially, affecting just the sub-collections that you definitely do not want to serialize.
Note 2:
For "Many to Many" relations, at least one of the entities needs to have a public representation of the collection. Since the relation cannot be retrieved using a single ID property.

What is more convenient resource-wise: generate values at runtime or save generated values to the database?

So I have a design decision to make. I'm building a website, so speed is the most important thing. I have values that depend on other values. I have two options:
1- Retrieve my objects from the database, and then generate the dependent values/objects.
2- Retrieve the objects with the dependent values already stored in the database.
I'm using ASP.NET MVC with Entity Framework.
What considerations should I have in making that choice?
You will almost certainly see no performance benefit in storing the derived values. Obviously this can change if the dependency is incredibly complex or relies on a huge amount of data, but you don't mention anything specific about the data so I can only speak in generalities.
In other words, don't store values that are completely derived, as they introduce update anomalies (that is, someone has to know about and code for these dependencies when updating your data, rather than it being as self-explanatory and clear as possible).
Ask yourself this question:
Are the dependent values based on business rules?
If so, then don't store them in the database - not because you can't or shouldn't, but because it is good practice - you should only have business rules in the database if that is the best or only place to have them, not just because you can.
Serializing your objects to the database will usually be slower than creating the objects in normal compiled code. Database access is normally pretty quick; it is the act of serialization that is slow. However, if you have a complicated object-creation process that is time-consuming, then serialization could end up quicker, especially if you use a custom serialization method.
So... if your 'objects' are relatively normal data objects with some calculated/derived values, then I would suggest that you store the base values of the 'objects' in the database, read those values from the database and map them to data objects created in the compiled code*, then calculate your dependent values.
*Note that this is standard data retrieval - some people use an ORM, some manually map the values to objects.
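A small sketch of that shape (OrderLine and its columns are made up for illustration): only the base values are stored and mapped, and the dependent value is computed on the object, so it can never go stale.

public class OrderLine
{
    // Stored in the database and mapped during data retrieval.
    public int Quantity { get; set; }
    public decimal UnitPrice { get; set; }

    // Derived on demand in compiled code; never persisted.
    public decimal LineTotal => Quantity * UnitPrice;
}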
