Efficient query involving count in subquery

Say I have this hypothetical many-to-many relationship:
public class Paper
{
public int Id { get; set; }
public string Title { get; set; }
public virtual ICollection<Author> Authors { get; set; }
}
public class Author
{
public int Id { get; set; }
public string Name { get; set; }
public virtual ICollection<Paper> Papers { get; set; }
}
I want to use LINQ to build a query that will give me the "popularity" of each author compared to other authors, which is the number of papers the author contributed to divided by the total number of author contributions in general across all papers. I've come up with a couple queries to achieve this.
Option 1:
var query1 = from author in db.Authors
let sum = (double)db.Authors.Sum(a => a.Papers.Count)
select new
{
Author = author,
Popularity = author.Papers.Count / sum
};
Option 2:
var temp = db.Authors.Select(a => new
{
Auth = a,
Contribs = a.Papers.Count
});
var query2 = temp.Select(a => new
{
Author = a.Auth,
Popularity = a.Contribs / (double)temp.Sum(a2 => a2.Contribs)
});
Basically, my question is this: which of these is more efficient, and are there other single queries that are more efficient? How do any of those compare to two separate queries, like this:
double sum = db.Authors.Sum(a => a.Papers.Count);
var query3 = from author in db.Authors
select new
{
Author = author,
Popularity = author.Papers.Count / sum
};

Well, first of all, you can try them out yourself and see which one takes the longest.
The first thing to check is that they translate cleanly into SQL, or as close to it as possible, so that the data doesn't all get loaded into memory just to apply those computations.
But I feel that option 2 might be your best shot, with one more optimization: cache the total number of papers contributed. This way you make only one call to the db to get the authors, which you need anyway. The rest runs in your code, where you can parallelize and do whatever you need to make it fast.
So something like this (sorry, I prefer the Fluent style of writing Linq):
// Here you can even load only the needed info if you don't need the whole entity.
// I imagine you might only need the name and Papers.Count, which you can use below; this would be another optimization.
// ToList() runs the query once. Without eager loading or a projection, Papers.Count may lazy-load per author.
var allAuthors = db.Authors.ToList();
var totalPaperCount = allAuthors.Sum(x => x.Papers.Count);
var theEndResult = allAuthors.Select(a => new
{
    Author = a,
    Popularity = a.Papers.Count / (double)totalPaperCount
});

Option 1 and 2 should generate the same SQL code. For readability I would go with option 1.
Option 3 will generate two SQL statements and be a little slower.
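To make the arithmetic concrete, here is a minimal in-memory sketch of the two-step approach (total first, then each author's share). It uses LINQ to Objects rather than Entity Framework, so it says nothing about the SQL that gets generated. The class name PopularitySketch and the dictionary input are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PopularitySketch
{
    // paperCounts maps an author name to the number of papers that author contributed to.
    // (Hypothetical input shape; the question uses Author entities instead.)
    public static Dictionary<string, double> Compute(IDictionary<string, int> paperCounts)
    {
        // Step 1: total contributions across all authors, computed once.
        double total = paperCounts.Values.Sum();

        // Guard against dividing by zero when nobody has any papers.
        if (total == 0)
        {
            return paperCounts.ToDictionary(p => p.Key, p => 0.0);
        }

        // Step 2: each author's share of that total.
        return paperCounts.ToDictionary(p => p.Key, p => p.Value / total);
    }
}
```

With three papers for one author and one for another, the shares come out as 0.75 and 0.25. Computing the total once is the same idea as the cached sum in the two-query version.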

Related

How to populate objects with relationship from datatable?

I am having trouble designing an approach for taking data from a CSV into business objects. I'm starting by parsing the CSV and getting each row into a DataTable and that is where my mental block starts.
I've got the following classes where APDistribution is considered a child of Voucher with a 1:Many relationship:
public class Voucher
{
public string GPVoucherNumber { get; set; }
public string VendorID { get; set; }
public string TransactionDescription { get; set; }
public string Title { get; set; }
public string DocNumber { get; set; }
public DateTime DocDate { get; set; }
public decimal PurchaseAmount { get; set; }
public IEnumerable<APDistribution> Distributions { get; set; }
}
public class APDistribution
{
public string AccountNumber { get; set; }
public decimal Debit { get; set; }
public decimal Credit { get; set; }
public string DistributionReference { get; set; }
}
My CSV looks like this. Several fields can repeat representing the Voucher transaction (Vendor, Title Invoice Number, Invoice Amount, etc), and some fields are the Distribution detail (Journal Account Code, Journal Amount).
I began by thinking I could use Linq to project onto my business objects but I'm failing to see how I can structure the query to do that in one pass. I find myself wondering if I can do one query to project into a Voucher collection, one to project into an APDistribution collection, and then some sort of code to properly associate them.
I started with the following where I am grouping by the fields that should uniquely define a Voucher, but that doesn't work because the projection is dealing with an anonymous type instead of the DataRow.
var vouchers =
from row in invoicesTable.AsEnumerable()
group row by new { vendor = row.Field<string>("Vendor Code"), invoice = row.Field<string>("Vendor Invoice Number") } into rowGroup
select new Voucher
{ VendorID = rowGroup.Field<string>("Vendor Code") };
Is this achievable without introducing complex Linq that a future developer (myself included) could have difficulty understanding/maintaining? Is there a simpler approach without Linq that I'm overlooking?

The general idea is:
invoicesTable
    .AsEnumerable()
    .GroupBy(row => new
    {
        Vendor = row.Field<string>("Vendor Code"),
        Invoice = row.Field<string>("Vendor Invoice Number")
    })
    .Select(grouping => new Voucher
    {
        VendorID = grouping.Key.Vendor, /* and so on */
        Distributions = grouping
            .Select(somerow => new APDistribution
            {
                AccountNumber = somerow.Field<string>("Journal Account Code") /* and so on */
            })
            .ToList()
    });
But this is not the most elegant way.
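For anyone who wants to try the grouping idea end to end, here is a self-contained sketch that builds a small DataTable in memory and groups it into vouchers. The sample rows, the trimmed-down Voucher and APDistribution classes, and the VoucherGroupingSketch name are assumptions for illustration; only the column names come from the question. AsEnumerable and Field<T> come from System.Data.DataSetExtensions.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Trimmed-down versions of the question's classes (illustration only).
public class APDistribution
{
    public string AccountNumber { get; set; }
}

public class Voucher
{
    public string VendorID { get; set; }
    public string DocNumber { get; set; }
    public List<APDistribution> Distributions { get; set; }
}

public static class VoucherGroupingSketch
{
    // Hypothetical sample data: two rows for one voucher, one row for another.
    public static DataTable BuildSampleTable()
    {
        var table = new DataTable();
        table.Columns.Add("Vendor Code", typeof(string));
        table.Columns.Add("Vendor Invoice Number", typeof(string));
        table.Columns.Add("Journal Account Code", typeof(string));
        table.Rows.Add("V1", "INV-1", "6000");
        table.Rows.Add("V1", "INV-1", "6100");
        table.Rows.Add("V2", "INV-7", "6000");
        return table;
    }

    public static List<Voucher> Group(DataTable invoicesTable)
    {
        return invoicesTable.AsEnumerable()
            // One group per (vendor, invoice) pair, i.e. one group per voucher.
            .GroupBy(row => new
            {
                Vendor = row.Field<string>("Vendor Code"),
                Invoice = row.Field<string>("Vendor Invoice Number")
            })
            .Select(g => new Voucher
            {
                // Header fields come from the group key.
                VendorID = g.Key.Vendor,
                DocNumber = g.Key.Invoice,
                // Every row in the group becomes one distribution line.
                Distributions = g.Select(r => new APDistribution
                {
                    AccountNumber = r.Field<string>("Journal Account Code")
                }).ToList()
            })
            .ToList();
    }
}
```

Taking the header fields from the group key avoids the First() call, and the ToList() calls force the projection to run once instead of on every enumeration.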

You are looking for a Linq join. See the documentation for greater depth.
Where it appears that you are running into trouble, however, is that your 2 objects need something for the query to compare against, like maybe adding public string VendorID { get; set; } to the APDistribution class, if possible. I would assume that the CSV files have something that ties an APDistribution back to a Voucher, so whatever it is, make sure it's in both classes so you can relate one to the other. The name doesn't need to be the same in both classes, but it should be. More importantly, you now have something that an equality comparer can use for the join operation.
Now personally, I don't like big gnarly queries if I can break them apart and make things easier. Too much to reason about all at once, and you've indicated that you agree. So my approach is to divide and conquer as follows.
First, run queries to project the CSV data into discrete objects, like so:
var voucherRows =
from row in invoicesTable.AsEnumerable()
select new Voucher {
VendorID = row.Field<string>("Vendor Code")
// other properties to populate
};
and
var distributionRows =
from row in distributionsTable.AsEnumerable()
select new APDistribution {
VendorID = row.Field<string>("Vendor Code"),
// other properties to populate
};
At this point you have 2 data sets that are related in domain terms but not yet associated in code. Now you can compose the queries together in the Join query and the join starts to look a lot easier, maybe something like:
var vouchers =
from row in voucherRows
join dist in distributionRows
on row.VendorID equals dist.VendorID
into distGroup
select new Voucher
{ VendorID = row.VendorID,
// other properties to populate
Distributions = distGroup.ToList()
};
You'll have to modify the queries to your needs, but this breaks them down into 3 distinct operations that are all designed to do 1 thing, thus easier to read, reason about, debug, and modify later. If you need to group the vouchers you can at this point, but this should get you moving. Also, if needed, you can add a validation step or other processing in between the initial CSV queries and the join and you don't have to rewrite your queries, with the exception of changing some input variable names on the join.
Also, disclaimer that I did NOT build these queries in an IDE before posting so you may have some typos or missed symbols to deal with, but I'm pretty sure I have it right. Sorry in advance if you find anything aggravating.

While Linq can be cool and add efficiencies, it doesn't add value if you can't be sure the code is correct today, and can't understand it tomorrow. Maybe using Linq in this case is Premature Optimization.
Start with a non-Linq solution that is verifiably accurate without being needlessly inefficient, and then optimize later if performance becomes a problem.
Here's how I might tackle this:
var vouchers = new Dictionary<string, Voucher>();
foreach (DataRow row in invoicesTable.Rows)
{
    string vendor = row.Field<string>("Vendor Code");
    string invoice = row.Field<string>("Vendor Invoice Number");
    string voucherKey = vendor + "|" + invoice;
    if (!vouchers.TryGetValue(voucherKey, out Voucher voucher))
    {
        // Distributions is declared as IEnumerable<APDistribution>, so back it with a List we can add to.
        voucher = new Voucher { VendorID = vendor, DocNumber = invoice, Distributions = new List<APDistribution>() };
        vouchers.Add(voucherKey, voucher);
    }
    ((List<APDistribution>)voucher.Distributions).Add(new APDistribution { AccountNumber = row.Field<string>("Journal Account Code") });
}
If this will be processing a large number of rows, you can tune this a bit by preallocating the Dictionary to an estimate of the number of unique vouchers:
var vouchers = new Dictionary<string, Voucher>((int)(invoicesTable.Rows.Count * 0.8));

LINQ - AND, ANY and NOT Query

I am fairly rookie with LINQ. I can do some basic stuff with it but I am in need of an expert.
I am using Entity Framework and I have a table that has 3 columns.
public class Aspect
{
[Key, Column(Order = 0)]
public int AspectID { get; set; }
[Key, Column(Order = 1)]
public int AspectFieldID { get; set; }
public string Value { get; set; }
}
I have 3 lists of words from a user's input. One contains phrases or words that must be in the Value field (AND), another contains phrases or words that don't have to be in the Value field (ANY) and the last list contains phrases or words that can not be found in the Value field (NOT).
I need to get every record that has all of the ALL words, any of the ANY words and none of the NOT words.
Here are my objects.
public class SearchAllWord
{
public string Word { get; set; }
public bool includeSynonoyms { get; set; }
}
public class SearchAnyWord
{
public string Word { get; set; }
public bool includeSynonoyms { get; set; }
}
public class SearchNotWord
{
public string Word { get; set; }
}
What I have so far is this,
var aspectFields = getAspectFieldIDs().Where(fieldID => fieldID > 0).ToList();//retrieves a list of AspectFieldID's that match user input
var result = db.Aspects
.Where(p => aspectFields.Contains(p.AspectFieldID))
.ToList();
Any and all help is appreciated.

First let me say, if this is your requirement... your query will read every record in the database. This is going to be a slow operation.
IQueryable<Aspect> query = db.Aspects.AsQueryable();
//note, if AllWords is empty, query is not modified.
foreach(SearchAllWord x in AllWords)
{
//important, lambda should capture local variable instead of loop variable.
string word = x.Word;
query = query.Where(aspect => aspect.Value.Contains(word));
}
foreach(SearchNotWord x in NotWords)
{
string word = x.Word;
query = query.Where(aspect => !aspect.Value.Contains(word));
}
if (AnyWords.Any()) //haha!
{
List<string> words = AnyWords.Select(x => x.Word).ToList();
query =
from aspect in query
from word in words //does this work in EF?
where aspect.Value.Contains(word)
group aspect by aspect into g
select g.Key;
}
If you're sending this query into Sql Server, be aware of the ~2100 parameter limit. Each word is going to be sent as a parameter.
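To pin down the intended semantics before worrying about SQL translation, here is a small in-memory sketch of the same three rules applied to a single string. It runs in LINQ to Objects only, uses case-sensitive ordinal matching (a database collation may behave differently), and treats an empty ANY list as "no constraint", mirroring the if (AnyWords.Any()) check above. The WordFilterSketch name and the plain string lists are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilterSketch
{
    // Returns true when value contains every ALL word, at least one ANY word
    // (if any were supplied), and none of the NOT words.
    public static bool Matches(
        string value,
        IEnumerable<string> allWords,
        IEnumerable<string> anyWords,
        IEnumerable<string> notWords)
    {
        // Materialize once so we can check for emptiness without re-enumerating.
        List<string> any = anyWords.ToList();

        bool hasAll = allWords.All(w => value.Contains(w));
        bool hasAny = any.Count == 0 || any.Any(w => value.Contains(w));
        bool hasNone = !notWords.Any(w => value.Contains(w));

        return hasAll && hasAny && hasNone;
    }
}
```

Such a predicate can serve as a reference when checking that the database query returns the same rows on a small test set.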

What you need are the set operators, specifically
Intersect
Any
Bundle up your "all" words into a string array (or some other enumerable) and then you can use intersect and count to check they are all present.
Here are two sets
var A = new string[] { "first", "second", "third" };
var B = new string[] { "second", "third" };
A is a superset of B?
var isSuperset = A.Intersect(B).Count() == B.Count();
A is disjoint with B?
var isDisjoint1 = !A.Intersect(B).Any();
var isDisjoint2 = !A.Any(a => B.Any(b => a == b)); //faster
Your objects are not strings, so you will want the overload that takes an IEqualityComparer<T>.
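As a sketch of the comparer overload, the checks above can be written with StringComparer.OrdinalIgnoreCase standing in for whatever comparer fits your own types. The SetChecksSketch name is an assumption. One detail worth knowing: Intersect returns distinct elements, so the superset check removes duplicates from B before comparing counts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SetChecksSketch
{
    // Case-insensitive comparer used as a stand-in for a custom IEqualityComparer<T>.
    private static readonly IEqualityComparer<string> Comparer = StringComparer.OrdinalIgnoreCase;

    // True when every element of b also appears in a.
    public static bool IsSuperset(IEnumerable<string> a, IEnumerable<string> b)
    {
        // Intersect yields distinct matches, so de-duplicate b before counting.
        List<string> distinctB = b.Distinct(Comparer).ToList();
        return a.Intersect(distinctB, Comparer).Count() == distinctB.Count;
    }

    // True when a and b share no elements.
    public static bool IsDisjoint(IEnumerable<string> a, IEnumerable<string> b)
    {
        return !a.Intersect(b, Comparer).Any();
    }
}
```

For your own classes you would pass a comparer that compares the Word property, or project to strings first with Select(x => x.Word).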
And now some soapboxing.
Much as I love Linq2sql it is not available in ASP.NET Core and the EF team wants to keep it that way, probably because jerks like me keep saying "gross inefficiency X of EF doesn't apply to Linq2Sql".
Core is the future. Small, fast and cross platform, it lets you serve a Web API from a Raspberry Pi running Windows IOT or Linux -- or get ridiculously high performance on big hardware.
EF is not and probably never will be a high performance proposition because it takes control away from you while insisting on being platform agnostic, which prevents it from exploiting the platform.
In the absence of Linq2sql, the solution seems to be libraries like Dapper, which handle parameters when sending the query and map results into object graphs when the result arrives, but otherwise don't do much. This makes them more or less platform agnostic but still lets you exploit the platform - apart from parameter substitution your SQL is passthrough.

Updating a single element in a nested list with official C# MongoDB driver

I have a simple game with multiple rounds and I want to update the most recent round:
class Game
{
public ObjectId Id { get; set; }
public List<Round> Rounds { get; set; }
}
class Round
{
public int A { get; set; }
public int B { get; set; }
}
How can I do the equivalent of games.Rounds.Last().A = x using the official MongoDB C# driver?
Edit: Added Round.B. Note that in this case, both A and B may be updated concurrently so I cannot save back the entire document. I only want to update the A field.

If you're using the drivers with LINQ support, then I suppose you could do this:
var last = collection.AsQueryable<Game>().Last();
last.Rounds.Last().A = x;
collection.Save(last);
I imagine it wouldn't be as efficient as a hand-coded update statement, but this does functionally mirror your javascript version for the most part.
Edit: Without LINQ, and doing a subset update
var query = Query.EQ("_id", MongoDB.Bson.BsonValue.Create(games.Id));
var update = Update.Set("Rounds." + (games.Rounds.Count - 1) + ".A", MongoDB.Bson.BsonValue.Create(x));
collection.Update(query, update);
Not so pretty, but Mongo lets you index into an array by position. The catch is that you'd miss the case where a new Round was added to the game before your update ran.

You could do something like this:
var query = Query<Entity>.EQ(e => e.Id, id);
var update = Update<Entity>.Set(e => e.Name, "Harry"); // update modifiers
collection.Update(query, update);
I hope you find this useful. Thanks.

Selecting related Article based on common Tags in Entity Framework

I have these entities:
public class Article {
public int Id { get; set; }
public virtual IList<Tag> Tags { get; set; }
}
public class Tag {
public int Id { get; set; }
public virtual IList<Article> Articles { get; set; }
}
I load an Article by its Tags like this:
var articleByTags = context.Articles.Include(a => a.Tags).FirstOrDefault(a => a.Id == someId);
Now, how can I get a list of articles that have the most tags in common with the selected article? Can you help me please?

Good question. Here is a solution:
// You need a list of primitive values to use with the SQL IN keyword.
var ids = articleByTags.Tags.Select(t => t.Id).ToList();
var query = (from article in context.Articles
             // Exclude the selected article itself.
             where article.Id != articleByTags.Id
             // This line produces an IN clause in the SQL.
             // If you don't need to load the common tags, you can merge this line
             // with the next one and just produce a COUNT().
             let commonTags = article.Tags.Where(tag => ids.Contains(tag.Id))
             let commonCount = commonTags.Count()
             // Are there any common tags at all?
             where commonCount > 0
             // Descending, so the articles with the most common tags come first.
             orderby commonCount descending
             // Create your projection.
             select new {
                 Id = article.Id,
                 Title = article.Title,
                 Tags = article.Tags,
                 Commons = commonTags,
                 CommonCount = commonCount
                 // or any property you want...
             })
            // How many do you want to take? For example, 5.
            .Take(5)
            .ToList();

I think you want something like this:
var listOfMustHaveTags = new List<Tag>(); // Should be filled with Tags
var articleByCommonTags = context.Articles.Where(n => n.Tags.All(listOfMustHaveTags.Contains));
If the requirement says that at least one tag must match, then replace .All() with .Any().
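Here is a small in-memory sketch of the "most tags in common" ranking, in case it helps to see the ordering in isolation. It works on plain ids with LINQ to Objects, not on the Entity Framework entities, and the RelatedArticlesSketch name and dictionary input are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RelatedArticlesSketch
{
    // articleTags maps an article id to the ids of its tags (hypothetical input shape).
    // Returns up to `take` article ids, ordered by how many tags they share with selectedId.
    public static List<int> MostRelated(IDictionary<int, int[]> articleTags, int selectedId, int take)
    {
        var selectedTags = new HashSet<int>(articleTags[selectedId]);

        return articleTags
            // Never suggest the selected article itself.
            .Where(a => a.Key != selectedId)
            // Count the tags each candidate shares with the selected article.
            .Select(a => new { Id = a.Key, CommonCount = a.Value.Count(t => selectedTags.Contains(t)) })
            // Drop candidates with nothing in common.
            .Where(a => a.CommonCount > 0)
            // Most shared tags first; ties broken by id so the order is stable.
            .OrderByDescending(a => a.CommonCount)
            .ThenBy(a => a.Id)
            .Take(take)
            .Select(a => a.Id)
            .ToList();
    }
}
```

The ordering is descending on the shared-tag count, which is what makes Take return the closest matches rather than the weakest ones.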

How do I construct a LINQ with multiple GroupBys?

I have an entity that looks like this:
public partial class MemberTank
{
public int Id { get; set; }
public int AccountId { get; set; }
public int Tier { get; set; }
public string Class { get; set; }
public string TankName { get; set; }
public int Battles { get; set; }
public int Victories { get; set; }
public System.DateTime LastUpdated { get; set; }
}
A tiny sample of the data:
Id AccountId Tier Class TankName Battles Victories
--- --------- ---- ----- --------- ------- ----------
1 432423 5 Heavy KV 105 58
2 432423 6 Heavy IS 70 39
3 544327 5 Heavy KV 200 102
4 325432 7 Medium KV-13 154 110
5 432423 7 Medium KV-13 191 101
Ultimately I am trying to get a result that is a list of tiers, within the tiers is a list of classes, and within the class is a distinct grouping of the TankName with the sums of Battles and Victories.
Is it possible to do all this in a single LINQ statement? Or is there another way to easily get the result? (I know I can easily loop through the DbSet several times to produce the list I want; I am hoping for a more efficient way of getting the same result with LINQ.)

This should do it:
var output = from mt in MemberTanks
             group mt by new { mt.Tier, mt.Class, mt.TankName } into g
             select new
             {
                 g.Key.Tier,
                 g.Key.Class,
                 g.Key.TankName,
                 Fights = g.Sum(mt => mt.Battles),
                 Wins = g.Sum(mt => mt.Victories)
             };

You could also use method syntax. This should give you the same result as @TheEvilGreebo's answer:
var result = memberTanks.GroupBy(x => new {x.Tier, x.Class, x.TankName})
.Select(g => new { g.Key.Tier,
g.Key.Class,
g.Key.TankName,
Fights = g.Sum(mt => mt.Battles),
Wins = g.Sum(mt=> mt.Victories)
});
Which syntax you use comes down to preference.
Remove the .Select to return the IGrouping, which lets you enumerate the groups:
var result = memberTanks.GroupBy(x => new {x.Tier, x.Class, x.TankName});

I kept trying to get useful results out of The Evil Greebo's answer. While the answer does yield results (after fixing the compilation issues mentioned in responses), it doesn't give me what I was really looking for (meaning I didn't explain myself well enough in the question).
Feanz left a comment in my question to check out the MS site with LINQ examples and, even though I thought I had looked there before, this time I found their example of nested group bys and I tried it their way. The following code gives me exactly what I was looking for:
var result = from mt in db.MemberTanks
group mt by mt.Tier into tg
select new
{
Tier = tg.Key,
Classes = from mt in tg
group mt by mt.Class into cg
select new
{
Class = cg.Key,
TankTypes = from mt in cg
group mt by mt.TankName into tng
select new
{
TankName = tng.Key,
Battles = tng.Sum(mt => mt.Battles),
Victories = tng.Sum(mt => mt.Victories),
Count = tng.Count()
}
}
};
I'll leave the answer by Mr. Greebo checked as most people will likely get the best results from that.
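For reference, here is a self-contained sketch of the same nested grouping run in memory against the five sample rows from the question, then walked with nested foreach loops. It uses LINQ to Objects rather than a DbSet, and the TankRow class, the NestedGroupingSketch name, and the output line format are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for MemberTank (illustration only).
public class TankRow
{
    public int Tier { get; set; }
    public string Class { get; set; }
    public string TankName { get; set; }
    public int Battles { get; set; }
    public int Victories { get; set; }
}

public static class NestedGroupingSketch
{
    // The five sample rows from the question.
    public static List<TankRow> SampleRows()
    {
        return new List<TankRow>
        {
            new TankRow { Tier = 5, Class = "Heavy", TankName = "KV", Battles = 105, Victories = 58 },
            new TankRow { Tier = 6, Class = "Heavy", TankName = "IS", Battles = 70, Victories = 39 },
            new TankRow { Tier = 5, Class = "Heavy", TankName = "KV", Battles = 200, Victories = 102 },
            new TankRow { Tier = 7, Class = "Medium", TankName = "KV-13", Battles = 154, Victories = 110 },
            new TankRow { Tier = 7, Class = "Medium", TankName = "KV-13", Battles = 191, Victories = 101 }
        };
    }

    // Groups by tier, then class, then tank name, and flattens the result
    // into "tier/class/tank: battles/victories" lines.
    public static List<string> Summarize(IEnumerable<TankRow> rows)
    {
        var tiers = from r in rows
                    group r by r.Tier into tg
                    orderby tg.Key
                    select new
                    {
                        Tier = tg.Key,
                        Classes = from c in tg
                                  group c by c.Class into cg
                                  select new
                                  {
                                      Class = cg.Key,
                                      Tanks = from t in cg
                                              group t by t.TankName into ng
                                              select new
                                              {
                                                  TankName = ng.Key,
                                                  Battles = ng.Sum(x => x.Battles),
                                                  Victories = ng.Sum(x => x.Victories)
                                              }
                                  }
                    };

        var lines = new List<string>();
        foreach (var tier in tiers)
            foreach (var cls in tier.Classes)
                foreach (var tank in cls.Tanks)
                    lines.Add(tier.Tier + "/" + cls.Class + "/" + tank.TankName + ": " + tank.Battles + "/" + tank.Victories);
        return lines;
    }
}
```

On the sample data the two tier 5 KV rows collapse into one entry with 305 battles and 160 victories, and the two tier 7 KV-13 rows into one entry with 345 battles and 211 victories.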
