How to populate objects with relationships from a DataTable? - c#

I am having trouble designing an approach for taking data from a CSV into business objects. I'm starting by parsing the CSV and getting each row into a DataTable and that is where my mental block starts.
I've got the following classes where APDistribution is considered a child of Voucher with a 1:Many relationship:
public class Voucher
{
    public string GPVoucherNumber { get; set; }
    public string VendorID { get; set; }
    public string TransactionDescription { get; set; }
    public string Title { get; set; }
    public string DocNumber { get; set; }
    public DateTime DocDate { get; set; }
    public decimal PurchaseAmount { get; set; }
    public IEnumerable<APDistribution> Distributions { get; set; }
}

public class APDistribution
{
    public string AccountNumber { get; set; }
    public decimal Debit { get; set; }
    public decimal Credit { get; set; }
    public string DistributionReference { get; set; }
}
My CSV looks like this. Several fields can repeat, representing the Voucher transaction (Vendor, Title, Invoice Number, Invoice Amount, etc.), and some fields are the Distribution detail (Journal Account Code, Journal Amount).
I began by thinking I could use Linq to project onto my business objects but I'm failing to see how I can structure the query to do that in one pass. I find myself wondering if I can do one query to project into a Voucher collection, one to project into an APDistribution collection, and then some sort of code to properly associate them.
I started with the following where I am grouping by the fields that should uniquely define a Voucher, but that doesn't work because the projection is dealing with an anonymous type instead of the DataRow.
var vouchers =
    from row in invoicesTable.AsEnumerable()
    group row by new { vendor = row.Field<string>("Vendor Code"), invoice = row.Field<string>("Vendor Invoice Number") } into rowGroup
    select new Voucher
    {
        VendorID = rowGroup.Field<string>("Vendor Code")
    };
Is this achievable without introducing complex Linq that a future developer (myself included) could have difficulty understanding/maintaining? Is there a simpler approach without Linq that I'm overlooking?

The general idea is:
invoicesTable
    .AsEnumerable()
    .GroupBy(row => new { Vendor = row.Field<string>("Vendor Code"), Invoice = row.Field<string>("Vendor Invoice Number") })
    .Select(grouping => new Voucher
    {
        VendorID = grouping.First().Field<string>("Vendor Code"), /* and so on */
        Distributions = grouping.Select(someRow => new APDistribution { AccountNumber = someRow.Field<string>("Journal Account Code") /* and so on */ })
    });
But this is not the most elegant way.

You are looking for a Linq join. See the documentation here for greater depth.
Where it appears that you are running into trouble, however, is that the query needs something on your two objects to compare against: for example, adding public string VendorID { get; set; } to the APDistribution class, if possible. I would assume the CSV has something that ties an APDistribution back to a Voucher; whatever it is, make sure it's in both classes so you can relate one to the other. The names don't have to match across the two classes, but it helps. More importantly, you now have something that an equality comparer can use for the join operation.
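For example, the distribution class might grow the key like this (a minimal sketch; reusing VendorID from the question's Voucher class is an assumption, and any shared CSV field would do):

public class APDistribution
{
    // Join key back to the parent Voucher (assumed; pick whatever CSV field
    // ties a distribution row to its voucher).
    public string VendorID { get; set; }

    public string AccountNumber { get; set; }
    public decimal Debit { get; set; }
    public decimal Credit { get; set; }
    public string DistributionReference { get; set; }
}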
Now personally, I don't like big gnarly queries if I can break them apart and make things easier. Too much to reason about all at once, and you've indicated that you agree. So my approach is to divide and conquer as follows.
First, run queries to project the CSV data into discrete objects, like so:
var voucherRows =
    from row in invoicesTable.AsEnumerable()
    select new Voucher
    {
        VendorID = row.Field<string>("Vendor Code")
        // other properties to populate
    };
and
var distributionRows =
    from row in distributionsTable.AsEnumerable()
    select new APDistribution
    {
        VendorID = row.Field<string>("Vendor Code")
        // other properties to populate
    };
At this point you have 2 data sets that are related in domain terms but not yet associated in code. Now you can compose the two queries together in a join, and it starts to look a lot easier, maybe something like:
var vouchers =
    from row in voucherRows
    join dist in distributionRows
        on row.VendorID equals dist.VendorID
        into distGroup
    select new Voucher
    {
        VendorID = row.VendorID,
        // other properties to populate
        Distributions = distGroup.ToList()
    };
You'll have to modify the queries to your needs, but this breaks them down into 3 distinct operations that are all designed to do 1 thing, thus easier to read, reason about, debug, and modify later. If you need to group the vouchers you can at this point, but this should get you moving. Also, if needed, you can add a validation step or other processing in between the initial CSV queries and the join and you don't have to rewrite your queries, with the exception of changing some input variable names on the join.
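As a sketch of such an in-between step (the validation rule here is just an assumption), you could filter out rows that can't participate in the join before composing it:

// Drop projected rows with no usable join key (assumed validation rule).
voucherRows = voucherRows.Where(v => !string.IsNullOrWhiteSpace(v.VendorID));
distributionRows = distributionRows.Where(d => !string.IsNullOrWhiteSpace(d.VendorID));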
Also, disclaimer that I did NOT build these queries in an IDE before posting so you may have some typos or missed symbols to deal with, but I'm pretty sure I have it right. Sorry in advance if you find anything aggravating.

While Linq can be cool and add efficiencies, it doesn't add value if you can't be sure the code is correct today, and can't understand it tomorrow. Maybe using Linq in this case is Premature Optimization.
Start with a non-Linq solution that is verifiably accurate without being needlessly inefficient, and then optimize later if performance becomes a problem.
Here's how I might tackle this:
var vouchers = new Dictionary<string, Voucher>();
foreach (DataRow row in invoicesTable.Rows)
{
    string vendor = row.Field<string>("Vendor Code");
    string invoice = row.Field<string>("Vendor Invoice Number");
    string voucherKey = vendor + "|" + invoice;
    if (!vouchers.ContainsKey(voucherKey))
    {
        // Declare Distributions as List<APDistribution> (rather than IEnumerable)
        // so distributions can be added as rows are read.
        vouchers.Add(voucherKey, new Voucher
        {
            VendorID = vendor,
            DocNumber = invoice,
            Distributions = new List<APDistribution>()
        });
    }
    vouchers[voucherKey].Distributions.Add(new APDistribution { AccountNumber = row.Field<string>("Journal Account Code") });
}
If this will be processing a large number of rows, you can tune it a bit by preallocating the Dictionary to an estimate of the number of unique vouchers:
var vouchers = new Dictionary<string, Voucher>((int)(invoicesTable.Rows.Count * 0.8));
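Once the loop has run, the dictionary's values are the fully populated vouchers, so producing a final list (if the rest of the code wants one) is a single call:

List<Voucher> result = vouchers.Values.ToList();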


How to make an SQL query with custom columns in Entity Framework Core 5

I have a table like this:
CREATE TABLE names (
    score INTEGER NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);
And I want to get some statistics from it. In sqlite I can use LEAD, but not here. I know about linq2db, but I would rather not use it because of how it works. As I understand it, that package does not add a LEAD template to EF's LINQ-to-SQL conversion; it executes the LEAD algorithm on its own side (not on the database side, which would be more efficient). If I'm wrong, correct me.
For example, I want to execute query:
var lst = db.table_names.FromSqlRaw("SELECT score, LEAD(cid, 1) OVER (ORDER BY score) as next, LEAD(score, 1) OVER (ORDER BY score) - score AS diff FROM names ORDER BY diff DESC LIMIT 1");
This SQL expression returns the two scores with the largest gap between them. The query executes and returns a single row (known from lst.Count() and the debugger).
The result is there, but how do I get at it? Perhaps there is some feature of EF that allows me to legally read data from a custom SQL-shaped result?
I would not like to resort to crutches such as stuffing the data I need into an existing entity's fields when that contradicts the purpose of those fields.
Maybe there are unofficial, but still less hacky, ways than the one I gave above?
You have two ways to approach this issue.
Create a view at the database level with the query you have and map it in Entity Framework; then you will be able to simply do the following:
var lst = db.vw_name.OrderBy(d => d.diff).ToList();
Use LINQ query syntax instead, but you will need to write multiple queries and join them together, as well as create a new class that the query can use to instantiate a list of objects. Here is a simplified example that does not contain SQL functions:
public class Scores {
    public int Score { get; set; }
    public int Next { get; set; }
    public int Max { get; set; }
}
and
var lst = (from x in db.table_names
           orderby x.diff
           select new Scores {
               Score = x.score,
               Next = x.next,
               Max = x.Max
           }).ToList();
The former approach is much better for many reasons in my opinion.
Addition to answer from Bassel Ashi:
Create a view on the database level with the query you have and use it in the entity framework
Create a view on the database level:
db.Database.ExecuteSqlRaw(@"CREATE VIEW View_scoreDiff AS SELECT score, LEAD(cid, 1) OVER (ORDER BY score) as next, LEAD(score, 1) OVER (ORDER BY score) - score AS diff FROM names ORDER BY diff DESC LIMIT 1");
Then you need to create a class:
public class View_scoreDiffClass {
    public int Score { get; private set; }
    public int Next { get; private set; }
    public int Diff { get; private set; }
}
Add the following DbSet to your context:
public DbSet<View_scoreDiffClass> View_scoreDiff { get; private set; }
Add the following line to OnModelCreating:
modelBuilder.Entity<View_scoreDiffClass>().ToView("View_scoreDiff").HasNoKey();
After all this, you can execute db.View_scoreDiff.FirstOrDefault() and get the desired columns.
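Tying the steps together, the context might look like this (a sketch; the context name is an assumption):

public class MyDbContext : DbContext
{
    public DbSet<View_scoreDiffClass> View_scoreDiff { get; private set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Keyless entity mapped onto the view created above.
        modelBuilder.Entity<View_scoreDiffClass>().ToView("View_scoreDiff").HasNoKey();
    }
}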

Delete newer duplicate value based on two columns .NET Core

I have a huge transactions table in an Azure database, where we import files with over 1 million objects.
public class Transaction
{
    [Key]
    public int Id { get; set; }
    public int TransactionId { get; set; }
    public DateTime Date { get; set; }
    public decimal Price { get; set; }
    public int UserId { get; set; }
    public string Product { get; set; }
    public int ClientId { get; set; }
    public int Uploaed { get; set; }
    public string UniqueId { get; set; }
    public string Custom1 { get; set; }
    public string Custom2 { get; set; }
    public string Custom3 { get; set; }
}
After importing all the new data, I take all the new transaction IDs, and take all the transaction IDs for that client from the database.
// ids from import
string transactionsString = string.Join(",", transactionIdsCsv);
var result = await _transactionsDataRepository.GetByTransactionIdsAndClientId(transactionIdsCsv.ToArray(), clientId);
// ids from repository
string transactionsDBString = string.Join(",", result.ToList());
// remove rows in db where duplicate transactions ids and clientId=ClientId
but I am struggling to find the most effective way. I wanted to do something like
delete from transactions where transactionId IN (transactionsDBString) and clientId = ClientID
but that would delete both values, and I only want the new value to be deleted (and the old value to stay).
But would that be a good way? Even fetching var result = await _transactionsDataRepository... can take a lot of time, since there are millions of rows.
I only want new value to be deleted (and old value to stay)
Since you already know how to identify the transaction IDs you want to delete you could delete the necessary rows while keeping the latest like so (you didn't mention it but I'm assuming you're using Entity Framework - given your use of the [Key] attribute - correct me if I'm wrong):
var transToRemove = dbContext.Transactions
    .Where(t => t.ClientId == clientId && transIds.Contains(t.TransactionId))
    .GroupBy(t => t.TransactionId, t => t)           // Group transactions with the same TransactionId
    .SelectMany(
        group => group.OrderBy(t => t.Date)          // Order the oldest first
                      .Skip(1)                       // Skip the oldest (we want to keep it)
    );
dbContext.Transactions.RemoveRange(transToRemove);
dbContext.SaveChanges();
Edit: Included an example that should work for Dapper...
var cn = // Create your DbConnection
// This query should select all transactions you want to delete excluding
// those with the oldest Date. This is just like 'transToRemove' above
var selectQuery = @"
    SELECT t1.Id FROM Transactions t1
    INNER JOIN (
        SELECT
            MIN(tInner.Date) AS FirstTransDate,
            tInner.TransactionId,
            tInner.ClientId
        FROM Transactions tInner
        WHERE tInner.ClientId = @clientId
            AND tInner.TransactionId IN @transIds
        GROUP BY tInner.TransactionId, tInner.ClientId
    ) t2 ON t2.ClientId = t1.ClientId AND t2.TransactionId = t1.TransactionId
    WHERE t1.Date != t2.FirstTransDate";

var idsToDelete = cn.Query<int>(
    selectQuery,
    new { clientId, transIds }).ToList();

// Delete the whole list in one go
cn.Execute("DELETE FROM Transactions WHERE Id IN @idsToDelete", new { idsToDelete });
(inspiration from here and here)
I haven't tested this using Dapper but the list of idsToDelete should be correct according to this fiddle I made. A couple things to note:
Depending on how long your list of transIds is (I believe those IDs are in result in your own example), you might want to repeat this in smaller batches instead of trying to delete the whole list in one go.
The SQL query above doesn't account for two "duplicate" transactions sharing the same "oldest" Date. If that can happen in your table, this query will remove all the other "duplicate" rows but leave both of those behind; see the sketch below for one way to break the tie.
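One way to break that tie (a sketch, not tested against your schema) is to group on the smallest Id instead of the earliest Date; with an identity key the lowest Id is typically the oldest row, and it is always unique:

// Tie-breaker variant: keep the row with the lowest Id per (ClientId, TransactionId).
var selectQueryById = @"
    SELECT t1.Id FROM Transactions t1
    INNER JOIN (
        SELECT
            MIN(tInner.Id) AS FirstId,
            tInner.TransactionId,
            tInner.ClientId
        FROM Transactions tInner
        WHERE tInner.ClientId = @clientId
            AND tInner.TransactionId IN @transIds
        GROUP BY tInner.TransactionId, tInner.ClientId
    ) t2 ON t2.ClientId = t1.ClientId AND t2.TransactionId = t1.TransactionId
    WHERE t1.Id != t2.FirstId";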
Improvements
There are a couple of things that seem a little out of place with your setup that I think you should consider:
even fetching var result = await _transactionsDataRepository... can take a lot of time since there are millions of rows
Millions of rows should not be an issue for any decent database server to handle. It sounds like you are missing some indexes on your table. With proper indexes your queries should be pretty swift as long as you can keep them simple.
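For example (a sketch assuming EF Core with fluent configuration; the property names match the question's model), a composite index covering the columns these queries filter on could be declared like this:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Composite index matching the WHERE clause: ClientId first, then TransactionId.
    modelBuilder.Entity<Transaction>()
        .HasIndex(t => new { t.ClientId, t.TransactionId });
}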
but would that be a good way?
Not quite sure what you are referring to as good or bad here, but I'll interpret a little... Right now you are writing tons of rows to a table that seems to contain duplicate data. When I think of a transaction-based system, no two transactions should share the same ID. That means for two different ClientIds there should never be a case where t1.TransactionId == t2.TransactionId, and you could then avoid checking ClientId in my code snippet above.
Since you only want to keep one transaction for each TransactionId, will you ever need two transactions with the same TransactionId? If not, you can go even further: make the TransactionId column unique and avoid inserting two rows with the same TransactionId in the first place. You can use Entity Framework's [Index(IsUnique = true)] attribute to also create an index that speeds up queries on that column/property.
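The exact attribute shape depends on your EF version; as a sketch, with EF Core 5+ the index is declared at the class level like this:

using Microsoft.EntityFrameworkCore;

[Index(nameof(TransactionId), IsUnique = true)]
public class Transaction
{
    [Key]
    public int Id { get; set; }

    // The unique index makes the database reject duplicate TransactionIds outright.
    public int TransactionId { get; set; }

    // ... remaining properties as in the question
}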

LINQ - AND, ANY and NOT Query

I am fairly new to LINQ. I can do some basic stuff with it, but I am in need of an expert.
I am using Entity Framework and I have a table that has 3 columns.
public class Aspect
{
    [Key, Column(Order = 0)]
    public int AspectID { get; set; }
    [Key, Column(Order = 1)]
    public int AspectFieldID { get; set; }
    public string Value { get; set; }
}
I have 3 lists of words from a user's input. One contains phrases or words that must all be in the Value field (ALL), another contains phrases or words of which at least one must be in the Value field (ANY), and the last list contains phrases or words that cannot be found in the Value field (NOT).
I need to get every record that has all of the ALL words, any of the ANY words and none of the NOT words.
Here are my objects.
public class SearchAllWord
{
    public string Word { get; set; }
    public bool includeSynonoyms { get; set; }
}

public class SearchAnyWord
{
    public string Word { get; set; }
    public bool includeSynonoyms { get; set; }
}

public class SearchNotWord
{
    public string Word { get; set; }
}
What I have so far is this,
// retrieves a list of AspectFieldIDs that match user input
var aspectFields = getAspectFieldIDs().Where(fieldID => fieldID > 0).ToList();
var result = db.Aspects
    .Where(p => aspectFields.Contains(p.AspectFieldID))
    .ToList();
Any and all help is appreciated.
First let me say, if this is your requirement... your query will read every record in the database. This is going to be a slow operation.
IQueryable<Aspect> query = db.Aspects.AsQueryable();

// Note: if AllWords is empty, the query is not modified.
foreach (SearchAllWord x in AllWords)
{
    // Important: the lambda should capture a local variable instead of the loop variable.
    string word = x.Word;
    query = query.Where(aspect => aspect.Value.Contains(word));
}
foreach (SearchNotWord x in NotWords)
{
    string word = x.Word;
    query = query.Where(aspect => !aspect.Value.Contains(word));
}
if (AnyWords.Any()) //haha!
{
    List<string> words = AnyWords.Select(x => x.Word).ToList();
    query =
        from aspect in query
        from word in words // does this work in EF?
        where aspect.Value.Contains(word)
        group aspect by aspect into g
        select g.Key;
}
If you're sending this query into Sql Server, be aware of the ~2100 parameter limit. Each word is going to be sent as a parameter.
What you need are the set operators, specifically
Intersect
Any
Bundle up your "all" words into a string array (or some other enumerable) and then you can use intersect and count to check they are all present.
Here are two sets
var A = new string[] { "first", "second", "third" };
var B = new string[] { "second", "third" };
A is a superset of B?
var isSuperset = A.Intersect(B).Count() == B.Count();
A is disjoint with B?
var isDisjoint1 = !A.Intersect(B).Any();
var isDisjoint2 = !A.Any(a => B.Any(b => a == b)); //faster
Your objects are not strings so you will want the overload that allows you to supply a comparator function.
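As a sketch with the question's objects, simply projecting them down to their strings rather than writing a comparer (splitting Value on spaces is an assumption that ignores multi-word phrases):

string[] allWords = AllWords.Select(x => x.Word).ToArray();
string[] anyWords = AnyWords.Select(x => x.Word).ToArray();
string[] notWords = NotWords.Select(x => x.Word).ToArray();
string[] valueWords = aspect.Value.Split(' '); // the words of one Aspect's Value

bool matches =
    valueWords.Intersect(allWords).Count() == allWords.Length         // superset of ALL
    && (anyWords.Length == 0 || valueWords.Intersect(anyWords).Any()) // overlaps ANY, if any given
    && !valueWords.Intersect(notWords).Any();                         // disjoint with NOT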
And now some soapboxing.
Much as I love Linq2sql it is not available in ASP.NET Core and the EF team wants to keep it that way, probably because jerks like me keep saying "gross inefficiency X of EF doesn't apply to Linq2Sql".
Core is the future. Small, fast and cross platform, it lets you serve a Web API from a Raspberry Pi running Windows IOT or Linux -- or get ridiculously high performance on big hardware.
EF is not and probably never will be a high performance proposition because it takes control away from you while insisting on being platform agnostic, which prevents it from exploiting the platform.
In the absence of Linq2sql, the solution seems to be libraries like Dapper, which handle parameters when sending the query and map results into object graphs when the result arrives, but otherwise don't do much. This makes them more or less platform agnostic but still lets you exploit the platform - apart from parameter substitution your SQL is passthrough.

Should I introduce redundancy into model design

I am trying to design a new system for tracking sales. A simplified version of my data models is:
public class Sale
{
    public int SaleId { get; set; }
    public DateTime CompletedDateTime { get; set; }
    public virtual List<SaleItem> SaleItems { get; set; }
    public decimal Total
    {
        get { return SaleItems.Sum(i => i.Price); }
    }
}

public class SaleItem
{
    public int SaleItemId { get; set; }
    public decimal Price { get; set; }
    public int SaleId { get; set; }
    public virtual Sale Sale { get; set; }
}
I am now writing some reports which total the sales value for a specified period. I have the following code to do that:
List<Sale> dailySales = db.Sales
    .Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) >= fromParam)
    .Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) <= toParam)
    .ToList();
decimal total = dailySales.Sum(x => x.Total);
This is working OK and giving me the expected result. I feel like this might give me problems further down the line, though, once large datasets get involved. I assume having to load all the Sales into a list will become resource intensive; plus, my actual implementation has tax, costs, etc. associated with each SaleItem, so it again becomes more complex.
The following would allow me to do all the processing on the database, however it is not possible to do this as the DB does not have a representation for Total, so EF throws an error:
Decimal total = db.Sales.Sum(x=>x.Total);
Which leads me to my question. I could set my model as the following and, each time I add a SaleItem, make sure I update the Total:
public class Sale
{
    ...
    public decimal Total { get; set; }
}
This would then allow me to query the database as required, and I assume it will be less resource intensive. The flip side, though, is that I have introduced redundancy into the database. Is the latter the better method of dealing with this, or is there an alternative method I haven't even considered that is better?
It depends on many factors. For instance, how often will you require the "Total" amount to be available? And how many SaleItems are there usually in a Sale?
If we're talking about, say, a supermarket kind of sale where you have... say... maximum of maximums 200 items. It's quite okay to just quickly calculate it on the fly. Then again, if this ever gets mapped to a RDBMS and if you have all the SaleItems in one single table, having an index on the foreign key (which links each individual SaleItem to its Sale) is a must, otherwise performance will take a huge hit once you start to have millions of transactions to sift through.
Answering the second half of your question, having redundancy is not always a bad thing... you just need to make sure that if a Sale's item list is ever modified, the Total is recalculated at the end of it. It's slightly dangerous (redundancy always has this attached burden), but you just need to ensure that whatever has the potential to change the Sale does so in a way (maybe even with a trigger in the RDBMS) that the total is automatically recalculated.
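A minimal sketch of that invariant in code, assuming the stored-Total variant of Sale from the question: route every mutation through methods that keep the total in step with the items.

public class Sale
{
    private readonly List<SaleItem> _saleItems = new List<SaleItem>();

    public int SaleId { get; set; }
    public DateTime CompletedDateTime { get; set; }
    public IReadOnlyList<SaleItem> SaleItems => _saleItems;

    // Stored (redundant) total; only mutable through the methods below.
    public decimal Total { get; private set; }

    public void AddItem(SaleItem item)
    {
        _saleItems.Add(item);
        Total += item.Price; // keep the redundant column consistent
    }

    public void RemoveItem(SaleItem item)
    {
        if (_saleItems.Remove(item))
            Total -= item.Price;
    }
}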
Hope it helps!
You're right that it's much more effective to calculate totals on the DB side instead of loading the whole list and calculating them in the application.
I think you're missing that you can make a LINQ query that gets the SUM of related child entities.
using (var ctx = new MyDbContext())
{
    var totalSales = ctx.Sales
        .Select(s => s.SaleItems.Sum(si => si.Price)) // Total of each Sale
        .Sum(tsi => tsi);                             // Sum of the total of each sale
}
You can of course shape the query to bring additional information, projecting the result in an anonymous class or in a class created ad-hoc for this purpose.
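For instance, a projection into an anonymous type might look like this (a sketch; the chosen properties and the reuse of fromParam/toParam from the question are assumptions), keeping the summing on the server while returning extra columns:

var dailyTotals = ctx.Sales
    .Where(s => s.CompletedDateTime >= fromParam && s.CompletedDateTime <= toParam)
    .Select(s => new
    {
        s.SaleId,
        s.CompletedDateTime,
        Total = s.SaleItems.Sum(si => si.Price) // computed database-side
    })
    .ToList();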
Of course, this EF query will be translated into a SQL query and executed on the server side.
When you start using LINQ to EF it's not very obvious how to get what you want, but on most occasions you can do it.

Linq to compare 2 different lists and select the outer join

I have 2 different classes that represent 2 types of data. The first is the unposted raw format. The second is the posted format.
public class SalesRecords
{
    public long? RecordId { get; set; }
    public DateTime RecordDate { get; set; }
    public string RecordDesc { get; set; }
    // Other non-related properties and methods
}

public class PostedSalesRecords
{
    public long? CorrelationId { get; set; }
    public DateTime RecordDate { get; set; }
    public DateTime? PostedDate { get; set; }
    public string RecordDesc { get; set; }
    // Other non-related properties and methods
}
Our system has a list of sales records. These sales records are posted to a different system at a time determined by the users. I am creating a screen that will show all of the posted sales records along with the unposted sales records as a reconciliation. The datasource for my grid will be a list of PostedSalesRecords. What I need to do is find out which records out of the List<SalesRecords> that are not in List<PostedSalesRecords> and then map those unposted SalesRecords to a PostedSalesRecords. I am having trouble finding a way to quickly compare. Basically I tried this, and it was EXTREMELY slow:
private List<SalesRecords> GetUnpostedSalesRecords(
    List<SalesRecords> allSalesRecords,
    List<PostedSalesRecords> postedSalesRecords)
{
    return allSalesRecords
        .Where(x => !postedSalesRecords
            .Select(y => y.CorrelationId.Value)
            .Contains(x.RecordId.Value))
        .ToList();
}
My biggest issue is that I am filtering through a lot of data. I am comparing ~55,000 total sales records to about 17,000 posted records. It takes about 2 minutes for me to process this. Any possible way to speed this up? Thanks!
You can try an outer join, please see if this helps with the performance:
var test = (from s in allSalesRecords
            join p in postedSalesRecords on s.RecordId equals p.CorrelationId into joined
            from j in joined.DefaultIfEmpty()
            where j == null
            select s).ToList();
Or, in your implementation, you can create a dictionary of only the IDs of postedSalesRecords and then use that collection in your query; it'll definitely help with performance because the lookup time will be O(1) instead of traversing the whole collection for each record.
var postedIds = postedSalesRecords
    .Select(y => y.CorrelationId.Value)
    .Distinct()
    .ToDictionary(d => d);
return allSalesRecords.Where(x => !postedIds.ContainsKey(x.RecordId.Value)).ToList();
Using a left outer join as described on MSDN should work much more efficiently:
private List<SalesRecords> GetUnpostedSalesRecords(
    List<SalesRecords> allSalesRecords,
    List<PostedSalesRecords> postedSalesRecords)
{
    return (from x in allSalesRecords
            join y in postedSalesRecords on x.RecordId.Value
                equals y.CorrelationId.Value into joined
            from z in joined.DefaultIfEmpty()
            where z == null
            select x).ToList();
}
This will probably be implemented with a hash set. You could implement this yourself (arguably clearer that way): build a HashSet<long> of the ID values in one or both lists to ensure that you don't need repetitive O(N) lookups each time you go through the outer list.
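A sketch of that explicit version (same result as the join above, assuming RecordId and CorrelationId are never null here):

private List<SalesRecords> GetUnpostedSalesRecords(
    List<SalesRecords> allSalesRecords,
    List<PostedSalesRecords> postedSalesRecords)
{
    // One pass to build the set, then an O(1) membership test per record.
    var postedIds = new HashSet<long>(
        postedSalesRecords.Select(p => p.CorrelationId.Value));

    return allSalesRecords
        .Where(s => !postedIds.Contains(s.RecordId.Value))
        .ToList();
}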
