I'm making a complex app (planning that involves articles, sales, clients, manufacturing, machines...) that uses the information provided by the SQL Server database of an ERP.
I use about 30 different related objects, each of which has its info stored in a table/view. Some of these tables have from 20k to 100k records.
I need to convert all these tables into C# objects for further processing that cannot be handled in SQL. I do not need all the rows, but there isn't a way to determine exactly which ones I will need, as that depends on runtime events.
The question is about the best way to do this. I have tried the following approaches:
Retrieve all data and store it in a DataSet using a SqlDataAdapter, which occupies about 300 MB in RAM.
First problem here: keeping it in sync, but that's acceptable since the data isn't going to change much during execution.
Then I run through every row and convert it to C# objects, stored in static Dictionaries for fast access by key. The problem with this is that creating so many objects (millions) takes memory usage up to 1.4 GB, which is too much. Memory aside, data access is very fast.
So, since loading everything takes too much memory, I figured I needed some kind of lazy loading, so I tried:
Another option I have considered is to query the database directly through a SqlDataReader, filtering by the item I need only the first time it's required, then storing it in the static dictionary. This keeps memory usage to a minimum, but it's slow (on the order of minutes), as it means making around a million separate queries, which the server doesn't seem to like (low performance).
Lastly, I tried an intermediate approach that kind of works, but I'm not sure it's optimal; I suspect it's not:
A third option would be to fill a DataSet containing all the info and keep a local static copy, but not convert all the rows to objects up front; instead, convert them on demand (lazily), something like this:
public class ProductoTerminado : Articulo
{
    private static Dictionary<string, ProductoTerminado> productosTerminados = new Dictionary<string, ProductoTerminado>();

    public PinturaTipo pinturaTipo { get; set; }

    public ProductoTerminado(string id)
        : base(id) { }

    public static ProductoTerminado Obtener(string idArticulo)
    {
        idArticulo = idArticulo.ToUpper();
        if (productosTerminados.ContainsKey(idArticulo))
        {
            return productosTerminados[idArticulo];
        }
        else
        {
            ProductoTerminado productoTerminado = new ProductoTerminado(idArticulo);
            // This is where I get new data from that static DataSet.
            var fila = Datos.bd.Tables["articulos"].Select("IdArticulo = '" + idArticulo + "'").First();
            // Then I fill the object and add it to the dictionary.
            productoTerminado.descripcion = fila["Descripcion"].ToString();
            productoTerminado.paletizacion = Convert.ToInt32(fila["CantidadBulto"]);
            productoTerminado.pinturaTipo = PinturaTipo.Obtener(fila["PT"].ToString());
            productosTerminados.Add(idArticulo, productoTerminado);
            return productoTerminado;
        }
    }
}
So, is this a good way to proceed or should I look into Entity Framework or something like a strongly typed DataSet?
I use relations between about 30 different objects, each of which has its info stored in a table/view. Some of these tables have from 20k to 100k records.
I suggest making a different decision for different types of objects. Usually, the tables that have thousands of records are more likely to change; tables that have fewer records are less likely. In a project I was working on, the decision was to cache the objects that don't change in a List<T> at start-up. For a few hundred instances this should take well under a second.
If you are using LINQ to SQL, keep the object local in a List<T> and have the FK constraints correctly defined; then you can do obj.Items to access the Items table filtered by obj's ID. (In this example, obj holds the PK and Items is the FK table.)
This design will also give users the performance they expect. When working on small sets everything is instantaneous (cached). When working on larger sets but making small selects or inserts, performance is good (quick queries that use the PK). You only really suffer when you start doing queries that join multiple big tables, and in those cases users will probably expect some delay (though I can't be certain without knowing more about the use case).
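To illustrate the idea, here is a minimal sketch, assuming hypothetical LINQ to SQL generated types MyDataContext, Order and Item with the Order-to-Items foreign-key association mapped; the small Orders table is cached whole, while the big Items table is only hit through the association:

using System;
using System.Collections.Generic;
using System.Linq;

public class Example
{
    public static void Run()
    {
        using (var db = new MyDataContext())
        {
            // Small, rarely-changing table: cache it whole at start-up.
            List<Order> orders = db.Orders.ToList();

            // Large child table: accessed lazily through the mapped association.
            // obj.Items issues one quick PK-filtered query when first enumerated.
            Order obj = orders.First(o => o.Id == 42);
            foreach (Item line in obj.Items)
                Console.WriteLine(line.Description);
        }
    }
}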
My Scenario
I have three warehouse databases (Firebird) numbered 1, 2 and 3, each sharing the same schema and the same DbContext class. The following is the model of the Products table:
public class Product
{
    public string Sku { get; }
    public string Barcode { get; }
    public int Quantity { get; }
}
I also have a local "Warehouse Cache" database (MySQL) where I want to periodically download the contents of all three warehouses for caching reasons. The data model of a cached product is similar, with the addition of a number denoting the source warehouse index. This table should contain all product information from all three warehouses. If a product appears in both warehouses 1 and 3 (same Sku), then I want to have two entries in the local Cache table, each with the corresponding warehouse ID:
public class CachedProduct
{
    public int WarehouseId { get; set; } // Can be either 1, 2 or 3
    public string Sku { get; }
    public string Barcode { get; }
    public int Quantity { get; }
}
There are multiple possible solutions to this problem, but given the size of my datasets (~20k entries per warehouse), none of them seem viable or efficient, and I'm hoping that someone could give me a better solution.
The problem
If the local cache database is empty, then it's easy. Just download all products from all three warehouses, and dump them into the cache DB. However on subsequent synchronizations, the cache DB will no longer be empty. In this case, I don't want to add all 60k products again, because that would be a tremendous waste of storage space. Instead, I would like to "upsert" the incoming data into the cache, so new products would be inserted normally, but if a product already exists in the cache (matching Sku and WarehouseId), then I just want to update the corresponding record (e.g. the Quantity could have changed in one of the warehouses since the last sync). This way, the no. of records in the cache DB will always be exactly the sum of the three warehouses; never more and never less.
Things I've tried so far
The greedy method: This one is probably the simplest. For each product in each warehouse, check if a matching record exists in the cache table. If it does then update, otherwise insert. The obvious problem is that there is no way to batch/optimize this, and it would result in tens of thousands of select, insert and update calls being executed on each synchronization.
Clearing the cache: Clear the local cache DB before every synchronization, and re-download all the data. My problem with this one is that it leaves a small window of time when no cache data will be available, which might cause problems with other parts of the application.
Using an EF-Core "Upsert" library: This one seemed the most promising with the FlexLabs.Upsert library, since it seemed to support batched operations. Unfortunately the library seems to be broken, as I couldn't even get their own minimal example to work properly. A new row is inserted on every "upsert", regardless of the matching rule.
Avoiding EF Core completely: I have found a library called Dotmim.Sync that seems to be a DB-to-DB synchronization library. The main issue with this is that the warehouses are running FirebirdDB which doesn't seem to be supported by this library. Also, I'm not sure if I could even do data transformation, since I have to add the WarehouseId column before a row is added to the cache DB.
Is there a way to do this as efficiently as possible in EF Core?
There are a couple of options here. Which ones are viable depends on your staleness constraints for the cache. Must the cache always reflect the warehouse state 100%, or can it get out of sync for a period of time?
First, you absolutely should not use EF Core for this, except possibly as a client lib to run raw SQL. EF Core is optimized for many small transactions. It doesn't do great with batch workloads.
The 'best' option is probably an event based system. Firebird supports emitting events to an event listener, which would then update the cache based on the events. The risk here is if event processing fails, you could get out of sync. You could mitigate that risk by using an event bus of some sort (Rabbit, Kafka), but Firebird event handling itself would be the weak link.
If the cache can handle some inconsistency, you could attach an expiry timestamp to each cache entry. Your application hits the cache, and if the expiry date has passed, it rechecks the warehouse DBs. Depending on the business processes that update the source-of-truth databases, you may also be able to bust cache entries (e.g. if there's an order management system, it can bust the cache for a line item when someone makes an order).
If you have to batch sync, do a swap table. Set up a table with the live cache data, a separate table you load the new cache data into, and a flag in your application that says which one to read from. You read from table A while you load into B; when the load is done, you swap to read from table B.
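A minimal sketch of that swap, assuming two hypothetical cache tables CacheA and CacheB and a single sync job (the bulk-load helpers in the comments are placeholders, not real APIs):

using System.Threading;

public static class CacheTables
{
    private static int live; // 0 => CacheA is live, 1 => CacheB is live

    public static string ReadTable => live == 0 ? "CacheA" : "CacheB";
    public static string LoadTable => live == 0 ? "CacheB" : "CacheA";

    // Called by the (single) sync job once the bulk load into LoadTable has finished;
    // readers switch to the freshly loaded table instantly.
    public static void Swap() => Interlocked.Exchange(ref live, 1 - live);
}

// Sync job outline:
//   TruncateTable(CacheTables.LoadTable);              // clear the standby table
//   BulkInsertAllWarehouses(CacheTables.LoadTable);    // 3 x ~20k rows
//   CacheTables.Swap();                                // flip the flag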
For now I ended up going with a simple, yet effective solution that is fully within EF Core.
For each cache entry, I also maintain a SyncIndex column. During synchronization, I download all products from all three warehouses, set SyncIndex to max(cache.SyncIndex) + 1, and dump them into the cache database. Then I delete all entries from the cache with an older SyncIndex. This way I always have some cache data available, I don't waste a lot of space, and the speed is pretty acceptable too.
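A rough sketch of that synchronization in EF Core; the context and entity names, and the settable SyncIndex and product properties on CachedProduct, are assumptions based on the description above rather than the exact production code:

using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public static class CacheSync
{
    // Assumed: CacheContext exposes DbSet<CachedProduct> CachedProducts, and each
    // warehouse context exposes DbSet<Product> Products.
    public static void Synchronize(CacheContext cache, IReadOnlyList<WarehouseContext> warehouses)
    {
        // Next generation number: one past whatever is currently in the cache.
        int nextIndex = (cache.CachedProducts.Max(p => (int?)p.SyncIndex) ?? 0) + 1;

        for (int warehouseId = 1; warehouseId <= warehouses.Count; warehouseId++)
        {
            // ~20k rows per warehouse; read without change tracking.
            var snapshot = warehouses[warehouseId - 1].Products.AsNoTracking().ToList();

            cache.CachedProducts.AddRange(snapshot.Select(p => new CachedProduct
            {
                WarehouseId = warehouseId,
                Sku = p.Sku,
                Barcode = p.Barcode,
                Quantity = p.Quantity,
                SyncIndex = nextIndex
            }));
        }
        cache.SaveChanges(); // the new generation now sits alongside the old one

        // Drop the previous generation; readers always see a complete data set.
        var stale = cache.CachedProducts.Where(p => p.SyncIndex < nextIndex).ToList();
        cache.CachedProducts.RemoveRange(stale);
        cache.SaveChanges();
    }
}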
I am new to Entity Framework.
And I have one concern:
I need to walk through quite a big amount of data that is gathered via a LINQ to Entities query that combines a couple of properties from different entities in an anonymous type.
If I need to read the returned items of this query one by one until the end, am I at risk of an OutOfMemory exception because the collection is BIG, or does EF use a SqlDataReader implicitly?
(Or should I use EntityDataReader to ensure that I am reading the DB sequentially? But then I have to generate my query as a string, I guess.)
As I see it there are two things you can do. Firstly, turn off tracking by using .AsNoTracking(); this will in most cases cut your memory footprint in half, which may be enough.
If your set is still too big, use Skip and Take to pull down the result set in chunks. You should also use this in conjunction with AsNoTracking to ensure no memory is consumed by tracking.
EDIT:
For example, you could use something like the following to loop through all items in chunks of 1000. The code below should only hold 1000 items in memory at a time.
int numberOfItems = ctx.MySet.Count();
for (int i = 0; i < numberOfItems; i += 1000)
{
    // A stable ordering is required for Skip/Take paging (an Id key is assumed here).
    foreach (var item in ctx.MySet.AsNoTracking()
                                  .OrderBy(x => x.Id)
                                  .Skip(i)
                                  .Take(1000)
                                  .AsEnumerable())
    {
        // do stuff with your entity
    }
}
If the amount of data is as big as you say, I would recommend that you don't use EF for such a case. EF is great, but you sometimes need to fall back to standard SQL to get better performance.
Take a look at Dapper.NET https://github.com/SamSaffron/dapper-dot-net
If you really want to use EF in every case, I would recommend that you use Bounded Contexts (multiple DbContexts).
Splitting your model into multiple smaller contexts will improve performance, as you will use fewer resources when EF creates an in-memory model of the context.
The larger the context is, the more resources are expended to generate and maintain that in-memory model.
Also, you can Detach and Attach entities when sharing instances between multiple contexts, so you still don't load the whole model.
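For illustration, a bounded-context split might look something like this (EF 6-style DbContext; the entity and context names are made up):

using System.Data.Entity;

// Each context maps only the entities one area of the application needs, so EF
// builds a much smaller in-memory model than a single all-encompassing context.
public class Order     { public int Id { get; set; } }
public class OrderLine { public int Id { get; set; } }
public class Product   { public int Id { get; set; } }

public class OrderingContext : DbContext
{
    public DbSet<Order> Orders { get; set; }
    public DbSet<OrderLine> OrderLines { get; set; }
}

public class CatalogContext : DbContext
{
    public DbSet<Product> Products { get; set; }
}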
I have a "location" class. This class basically holds addresses, hence the term "location". I have a datatable that returns multiple records which I want to be "locations".
Right now, I have a "load" method in the "location" class, that gets a single "location" by ID. But what do I do when I want to have a collection of "location" objects from multiple datatable rows? Each row would be a "location".
I don't want to go to the database for each record, for obvious reasons. Do I simply create a new instance of a location class, assigning values to the properties, while looping through the rows in the datatable bypassing the "load" method? It seems logical to me, but I am not sure if this is the correct/most efficient method.
That (your description) is pretty much how a row (or a collection of rows) of data gets mapped to C# business objects. But to save yourself a lot of work you should consider one of a number of existing ORM (object-relational mapper) frameworks such as NHibernate, Entity Framework, Castle ActiveRecord, etc.
Most ORMs will generate all the boilerplate code where rows and fields are mapped to your .NET object properties and vice versa. (Yes, ORMs allow you to add, update and delete DB data just as easily as retrieving and mapping it.) Do give the ORMs a look. The small amount of learning (there is some learning curve with each) will pay off very shortly. ORMs have become quite standard and are indeed expected in any application that touches an RDBMS.
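For a flavour of what that buys you, here is a sketch using Entity Framework 6-style code; the Location entity and MyContext are hypothetical:

using System.Data.Entity;

public class Location
{
    public int Id { get; set; }
    public string Street { get; set; }
    public string City { get; set; }
}

public class MyContext : DbContext
{
    public DbSet<Location> Locations { get; set; }
}

public static class Demo
{
    // Retrieve, update and insert without writing any row/field mapping by hand.
    public static void Run()
    {
        using (var ctx = new MyContext())
        {
            var first = ctx.Locations.Find(1);           // SELECT mapped to a Location
            if (first != null) first.City = "Chicago";   // becomes an UPDATE on SaveChanges
            ctx.Locations.Add(new Location { Street = "1 Main St", City = "Springfield" });
            ctx.SaveChanges();                           // INSERT + UPDATE in one unit of work
        }
    }
}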
Additionally these links may be of interest (ORM-related):
Wikipedia article on ORMs
SO Discussion on different ORMs
Many different .NET ORMs listed
You're on the right track, getting all the locations you need with one trip to the database would be best in terms of performance.
To make your code cleaner/shorter, make a constructor of your Location class that takes a DataRow, which will then set your properties accordingly. By doing this, you'll centralize your mapping from columns to properties in one place in your code base, which will be easy to maintain.
Then, it's totally acceptable to loop through the rows in your data table and call your constructor.
You could also use an object-relational mapper, like Entity Framework, to do your database interaction.
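A minimal sketch of that constructor approach; the column names are assumptions, so use whatever your DataTable actually returns:

using System;
using System.Collections.Generic;
using System.Data;

public class Location
{
    public int Id { get; set; }
    public string Street { get; set; }
    public string City { get; set; }

    // All column-to-property mapping lives in this one constructor.
    public Location(DataRow row)
    {
        Id = Convert.ToInt32(row["Id"]);
        Street = row["Street"].ToString();
        City = row["City"].ToString();
    }

    // One trip to the database, then map every row in memory.
    public static List<Location> FromTable(DataTable table)
    {
        var locations = new List<Location>();
        foreach (DataRow row in table.Rows)
            locations.Add(new Location(row));
        return locations;
    }
}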
Create a method that returns an IEnumerable<Location>. In this method do your database work; I often pass the SqlDataReader into the Location constructor. So I would have something like this:
// Requires: using System.Collections.Generic; using System.Data; using System.Data.SqlClient;
public static IEnumerable<Location> GetLocations()
{
    List<Location> retval = new List<Location>();
    using (SqlConnection conn = new SqlConnection("connection string here"))
    {
        conn.Open();
        SqlCommand command = new SqlCommand("spLoadData", conn);
        command.CommandType = CommandType.StoredProcedure;
        using (SqlDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                retval.Add(new Location(reader));
            }
        }
    }
    return retval;
}
That should give you the general idea; adapt it to your own schema and stored procedure.
An ORM could save you loads of time if you have a lot of this to do, however!
I'm creating some kind of auction application, and I have to decide on the most efficient approach to this problem. I'm using BLToolkit as my OR mapper (it has nice LINQ support) and ASP.NET MVC 2.
Background
I've got multiple Category objects that are created dynamically and that are saved in my database as a representation of this class:
class Category
{
    public int Id { get; set; }
    public int ParentId { get; set; }
    public string Name { get; set; }
}
Now every Category object can have multiple associated InformationClass objects, each of which represents a single piece of information in that category, for example a price or a colour. Those classes are also created dynamically by the administrator and stored in the database. They are specific to a group of categories. The class that represents them looks like this:
class InformationClass
{
    public int Id { get; set; }
    public InformationDataType InformationDataType { get; set; }
    public string Name { get; set; }
    public string Label { get; set; }
}
Now I've got a third table that represents the join between them, like this:
class CategoryInformation
{
    public int InformationClassId { get; set; }
    public int AuctionCategoryId { get; set; }
}
Problem
Now the problem is that child categories need to inherit all of their ancestors' InformationClass associations. For example, every product will have a price, so I need to add this InformationClass only to my root category. A frequency attribute can be added to the base CPU category, and it should be available in the AMD and Intel categories that derive from the CPU category.
In my application I very often have to know which InformationClass objects are related to a specified Category.
So here is my question: what would be the most efficient solution to this problem? I've got some ideas, but I can't decide.
Load all categories from the database into an application-level cache and take them from there every time. Since the categories will not change very often, this will reduce the number of database requests, but it will still require a tree search using LINQ to Objects.
Invent (I don't know if it's possible) some fancy LINQ query that can do the tree search and get all InformationClass ids without stressing the database too much.
Some other nice ideas?
I will be grateful for any answers and ideas. Thank you all in advance.
Sounds like a case for an idea I once had which I blogged about:
Tree structures and DAGs in SQL with efficient querying using transitive closures
The basic idea is this: in addition to the Category table, you also have a CategoryTC table which contains the transitive closure of the parent-child relationship. It allows you to quickly and efficiently retrieve a list of all ancestor or descendant categories of a particular category. The blog post explains how you can keep the transitive closure up to date every time a category is created or deleted, or a parent-child relationship is changed (it's at most two queries each time).
The post uses SQL to express the idea, but I’m sure you can translate it to LINQ.
You didn’t specify in your question how the InformationClass table is linked to the Category table, so I have to assume that you have a CategoryInformation table that looks something like this:
class CategoryInformation
{
    public int CategoryId { get; set; }
    public int InformationClassId { get; set; }
}
Then you can get all the InformationClasses associated with a specific category by using something like this:
var categoryId = ...;
var infoClasses = db.CategoryInformation
    .Where(cinf => db.CategoryTC.Where(tc => tc.Descendant == categoryId)
                                .Any(tc => tc.Ancestor == cinf.CategoryId))
    .Select(cinf => db.InformationClass
                      .FirstOrDefault(ic => ic.Id == cinf.InformationClassId));
Does this make sense? Any questions, please ask.
In the past (pre SQL Server 2005 and pre LINQ), when dealing with this sort of structure (or the more general case of a directed acyclic graph, implemented with a junction table so that items can have more than one "parent"), I've either done this by loading the entire graph into memory, or by creating a trigger-updated lookup table in the database that cached the ancestor-to-descendant relationships.
There are advantages to either, and which wins out depends on update frequency and the complexity of the objects beyond the parent-child relationship itself. In general, loading into memory allows for faster individual look-ups, but with a large graph it doesn't natively scale as well due to the amount of memory used on each webserver ("each" here, because webfarm situations are one where having items cached in memory brings extra issues), meaning that you will have to be very careful about how things are kept in sync to counteract that effect.
A third option now available is to do ancestor lookup with a recursive CTE:
CREATE VIEW [dbo].[vwCategoryAncestry]
AS
WITH recurseCategoryParentage (ancestorID, descendantID)
AS
(
    SELECT parentID, id
    FROM Categories
    WHERE parentID IS NOT NULL

    UNION ALL

    SELECT ancestorID, id
    FROM recurseCategoryParentage
        INNER JOIN Categories ON parentID = descendantID
)
SELECT DISTINCT ancestorID, descendantID
FROM recurseCategoryParentage
Assuming that root categories are indicated by having a null parentID.
(We use UNION ALL since we're going to SELECT DISTINCT afterwards anyway, and this way we have a single DISTINCT operation rather than repeating it).
This allows us to take the look-up table approach without the redundancy of that denormalised table. The efficiency trade-off is obviously different and generally poorer than with a table, but not by much (slight hit on select, slight gain on insert and delete, negligible space gain), while the guarantee of correctness is greater.
I've ignored the question of where LINQ fits into this, as the trade-offs are much the same whichever way this is queried. LINQ plays nicer with "tables" that have individual primary keys, so we can change the select clause to SELECT DISTINCT (cast(ancestorID as bigint) * 0x100000000 + descendantID) as id, ancestorID, descendantID and define that as the primary key in the [Column] attribute. Of course, all columns should be indicated as DB-generated.
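For example, a LINQ to SQL mapping over that view could look something like this (a sketch; the attributes are from System.Data.Linq.Mapping, and the class name is made up):

using System.Data.Linq.Mapping;

[Table(Name = "dbo.vwCategoryAncestry")]
public class CategoryAncestry
{
    // Synthetic single-column key built from the two IDs, as described above.
    [Column(Name = "id", IsPrimaryKey = true, IsDbGenerated = true)]
    public long Id { get; set; }

    [Column(Name = "ancestorID", IsDbGenerated = true)]
    public int AncestorId { get; set; }

    [Column(Name = "descendantID", IsDbGenerated = true)]
    public int DescendantId { get; set; }
}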
Edit. Some more on the trade-offs involved.
Comparing the CTE approach with look-up maintained in database:
Pro CTE:
The CTE code is simple, the above view is all the extra DB code you need, and the C# needed is identical.
The DB code is all in one place, rather than there being both a table and a trigger on a different table.
Inserts and deletes are faster; the view doesn't affect them, while the trigger does.
While semantically recursive, the query is recursive in a way the query planner understands and can deal with, so it's typically (for any depth) implemented in just two index scans (likely clustered), two light-weight spools, a concatenation and a distinct sort, rather than in the many, many scans that you might imagine. So while it is certainly heavier than a simple table lookup, it's nowhere near as bad as one might imagine at first. Indeed, even the nature of those two index scans (same table, different rows) makes it less expensive than you might think when reading that.
It is very very easy to replace this with the table look-up if later experience proves that to be the way to go.
A lookup table will, by its very nature, denormalise the database. Purity issues aside, the "bad smell" involved means that this will have to be explained and justified to any new dev, as until then it may simply "look wrong" and their instincts will send them on a wild-goose chase trying to remove it.
Pro Lookup-Table:
While the CTE is faster to select from than one might imagine, the lookup is still faster, especially when used as part of a more complicated query.
While CTEs (and the WITH keyword used to create them) are part of the SQL 99 standard, they are relatively new and some devs don't know them (though I think this particular CTE is so straightforward to read that it counts as a good learning example anyway, so maybe this is actually pro CTE!)
While CTEs are part of the SQL 99 standard, they aren't implemented by some SQL databases, including older versions of SQL Server (which are still in live use), which may affect any porting efforts. (They are supported by Oracle and Postgres, among others, so at this point this may not really be an issue.)
It's reasonably easy to replace this with the CTE version later, if later experience suggests you should.
Comparing (both) the db-heavy options with in-memory caching.
Pro In-Memory:
Unless your implementation really sucks, it is going to be much faster than DB lookups.
It makes some secondary optimisations possible on the back of this change.
It is reasonably difficult to change from DB to in-memory if later profiling shows that in-memory is the way to go.
Pro Querying DB:
Start-up time can be very slow with in-memory.
Changes to the data are much much simpler. Most of the points are aspects of this. Really, if you go the in-memory route then the question of how to handle changes invalidating the cached information becomes a whole new ongoing concern for the lifetime of the project, and not a trivial one at all.
If you use in-memory, you are probably going to have to use this in-memory store even for operations where it is not relevant, which may complicate where it fits with the rest of your data-access code.
It is not necessary to track changes and cache freshness.
It is not necessary to ensure that every webserver in a web-farm and/or web-garden solution (a certain level of success will necessitate this) has precisely the same degree of freshness.
Similarly, the degree of scalability across machines (how close to 100% extra performance you get by doubling the number of webservers and DB slaves) is higher.
With in-memory, memory use can become very high if either (a) the number of objects is high or (b) the objects are large (many fields, esp. strings, collections and objects which themselves have a string or collection). Possibly "we need a bigger webserver" amounts of memory, and that goes for every machine in the farm.
That heavy memory use is particularly likely to continue to grow as the project evolves.
Unless changes cause an immediate refresh of the in-memory store, the in-memory solution will mean that the view used by the people in charge of administering these categories will differ from what is seen by customers until they are re-synchronised.
In-memory resyncing can be very expensive. Unless you're very clever with it, it can cause random (to the user) massive performance spikes. If you are clever with it, it can exacerbate the other issues (esp. in terms of keeping different machines at an equivalent level of freshness).
Unless you're clever with in-memory, those spikes can accumulate, putting the machine into a long-term hang. If you are clever about avoiding this, you may exacerbate other issues.
It is very difficult to move from in-memory to hitting the db should that prove the way to go.
None of this points with 100% certainty to one solution or the other, and I certainly am not going to give a clear answer, as doing so would be premature optimisation. What you can do a priori is make a reasonable decision about which is likely to be the optimal solution. Whichever you go for, you should profile afterwards, especially if the code does turn out to be a bottleneck, and possibly change approach. You should also do so over the lifetime of the product, as both changes to the code (fixes and new features) and changes to the dataset can certainly change which option is optimal (indeed, it can change from one to another and then back again over the course of the product's lifetime). This is why I included the ease of moving from one approach to another in the above lists of pros and cons.
We're developing an application using Nhibernate as the data access layer.
One of the things I'm struggling with is finding a way to map 2 objects to the same table.
We have an object which is suited to data entry, and another which is used in more of a batch process.
The table contains all the columns for the data entry and some additional info for the batch processes.
When it's in a batch process I don't want to load all the data just a subset, but I want to be able to update the values in the table.
Does NHibernate support multiple objects mapped to the same table? And what is the feature that allows this?
I tried it a while ago and I remember that if you do a query for one of the objects it loads double the amount, but I'm not so sure I didn't miss something.
e.g. 10 data entry objects + 10 batch objects, so 20 objects instead of 10.
Can anyone shed any light on this?
I should clarify that these are 2 different objects which, in my mind, should not be polymorphic in behaviour. However, they do point at the same database record; it's more that the record has a dual purpose within the application, and for the sake of logical partitioning they should be kept separate. (A change to one domain object should not blow up numerous screens in other modules, etc.)
Thanks
Pete
An easy way to map multiple objects to the same table is by using a discriminator column. Add an extra column to the table and have it contain a value declaring it as type "Data Entry" or "Batch Process".
You'd create two objects - one for Data Entry and one for Batch Process. I'm not entirely sure how you enact that in regular NHibernate XML mapping - I use Castle ActiveRecord for annotating, so you'd mark up your objects like so:
[ActiveRecord("[Big Honking Table]",
    DiscriminatorColumn = "Type",
    DiscriminatorType = "String",
    DiscriminatorValue = "Data Entry")]
public class DataEntry : ActiveRecordBase
{
    // Your stuff here!
}

[ActiveRecord("[Big Honking Table]",
    DiscriminatorColumn = "Type",
    DiscriminatorType = "String",
    DiscriminatorValue = "Batch Process")]
public class BatchProcess : ActiveRecordBase
{
    // Also your stuff!
}
Here's the way to do it with NHibernate + Castle ActiveRecord: http://www.castleproject.org/activerecord/documentation/trunk/usersguide/typehierarchy.html
Note that they use a parent object - I don't think that's necessary but I haven't implemented a discriminator column exactly the way you're describing, so it might be.
And here's the mapping in XML: https://www.hibernate.org/hib_docs/nhibernate/html/inheritance.html
You can also, through the mapping, let NHibernate know which columns to load / update - if you end up just making one big object.
I suppose you might just be overengineering it a little bit:
If you're worried about performance, that's premature optimization (besides, retrieving fewer columns is not much faster; as for saving, you can enable dynamic updates to only update the columns that changed).
If you're trying to protect the programmer from himself by locking down his choices, you're complicating your design for a not-so-noble cause.
In short, based on my 10+ years of experience and my somewhat limited understanding of your problem, I recommend you think again about doing what you want to do.