Removing duplicate rows in database with primary key using Distinct()

I have some duplicate values in my database, so I am using LINQ to Entities to remove them with the code below. The problem is that RosterSummaryData_Subject_Local has an autonumber primary key, which defeats the line var distinctRows = allRows.Distinct();
Even if all the other columns of two rows are the same, Distinct() won't treat them as equal because the primary key differs. Is there any way to ignore the primary key in the Distinct() call, or to leave it out of the query so it becomes a non-issue? Note that I want the query to return an IQueryable of my entity type so I can use the RemoveRange() method on the entity set to remove the duplicates.
var allRows = (from subjLocal in customerContext.RosterSummaryData_Subject_Local
select subjLocal);
var distinctRows = allRows.Distinct();
if (allRows.Count() == distinctRows.Count())
{
return;
}
else
{
var rowsToDelete = allRows.Where(a => a != distinctRows);
customerContext.RosterSummaryData_Subject_Local.RemoveRange(rowsToDelete);
}
EDIT
I realized that to properly bring back distinct rows, all I have to do is select all the items except primary key:
var distinctRows = allRows
.Select(a => new {a.fkRosterSetID, a.fkTestInstanceID, a.fkTestTypeID,
a.fkSchoolYearID, a.fkRosterTypeID, a.fkDistrictID,
a.fkSchoolID, a.fkGradeID, a.fkDepartmentID,
a.fkCourseID, a.fkPeriodID, a.fkDemoCommonCodeID,
a.fkDemoCommonCategoryID, a.fkTest_SubjectID})
.Distinct();
The problem is that I cannot fetch the duplicate rows with the code below, because the != operator cannot compare an entity with the anonymous type (distinctRows is a sequence of anonymous-type objects because I didn't select all the columns):
var rowsToDelete = allRows.Where(a => a != distinctRows);
Any help?

You can try this:
var allRows = (from subjLocal in customerContext.RosterSummaryData_Subject_Local
select subjLocal).ToList();
var distinctRows = allRows.Distinct().ToList();
Since you will be dealing with list objects, then in your original else statement you can do this:
else
{
var rowsToDelete = allRows.Where(a => !distinctRows.Contains(a));
customerContext.RosterSummaryData_Subject_Local.RemoveRange(rowsToDelete);
}
To handle your issue with Distinct() and the autonumber ID in the database, there are two solutions I can think of.
One is to bring in the MoreLinq library, which is available as a NuGet package. Then you can use the MoreLinq method DistinctBy():
allRows.DistinctBy(a => a.SomePropertyToUse);
The other route is to pass an IEqualityComparer to the regular .Distinct() LINQ method. See the SO question "using distinct with IEqualityComparer" for more information on that approach.
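As a rough illustration of the idea discussed in this thread (compare rows on everything except the primary key), here is a minimal in-memory sketch. It uses plain GroupBy rather than MoreLinq or a custom comparer, and the Row class with its three properties is a hypothetical stand-in for the real entity. With EF, the returned list is the kind of thing that would be passed to RemoveRange().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the entity: Id plays the autonumber primary key.
public class Row
{
    public int Id { get; set; }
    public int FkSchoolId { get; set; }
    public int FkGradeId { get; set; }
}

public static class Dedup
{
    // Returns every row except the lowest-Id row in each group of identical
    // non-key values, i.e. the rows that could be deleted as duplicates.
    public static List<Row> FindDuplicates(IEnumerable<Row> rows)
    {
        return rows
            .GroupBy(r => new { r.FkSchoolId, r.FkGradeId }) // grouping key excludes Id
            .SelectMany(g => g.OrderBy(r => r.Id).Skip(1))   // keep the first, flag the rest
            .ToList();
    }

    public static void Main()
    {
        var rows = new List<Row>
        {
            new Row { Id = 1, FkSchoolId = 10, FkGradeId = 3 },
            new Row { Id = 2, FkSchoolId = 10, FkGradeId = 3 }, // same values as Id 1
            new Row { Id = 3, FkSchoolId = 11, FkGradeId = 3 },
        };

        foreach (var dup in FindDuplicates(rows))
            Console.WriteLine("Delete row " + dup.Id);
    }
}
```

Anonymous types compare by value, which is why they work as a grouping key here. Whether to keep the lowest Id or some other row of each group is a design choice.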

Maybe you need to compare each of the fields in customerContext.RosterSummaryData_Subject_Local to see which one is different.

Related

How to Select subset of columns from SingleOrDefault LINQ query in EF

How can I select subset of all columns while using SingleOrDefault query? For example, following LINQ expression
var personid = ctx.persons.SingleOrDefault(p => p.login == currentLogin)?.personid;
will compile into SELECT TOP 1 * FROM ... type of query. I would like to Select() only the columns I am interested in, e.g. statement producing SELECT TOP 1 personid, myColumn FROM ... under-hood.
Please note that this question is not a duplicate of the linked question: I am interested specifically in Single/SingleOrDefault, not in a generic LINQ solution. Chaining .SingleOrDefault() with .Select() is not possible for an obvious reason: Single<T> returns a single object of type T (or throws), which does not implement IEnumerable<T> and so cannot be Select()ed upon.

var personid = ctx.persons
.Where(p => p.login == currentLogin)
.Select(p => new {Prop = p.Column, personid = p.id})
.SingleOrDefault()?.personid;
Would probably work.

Why is linq reversing order in group by

I have a linq query which seems to be reversing one column of several in some rows of an earlier query:
var dataSet = from fb in ds.Feedback_Answers
where fb.Feedback_Questions.Feedback_Questionnaires.QuestionnaireID == criteriaType
&& fb.UpdatedDate >= dateFeedbackFrom && fb.UpdatedDate <= dateFeedbackTo
select new
{
fb.Feedback_Questions.Feedback_Questionnaires.QuestionnaireID,
fb.QuestionID,
fb.Feedback_Questions.Text,
fb.Answer,
fb.UpdatedBy
};
Gets the first dataset and is confirmed working.
This is then grouped like this:
var groupedSet = from row in dataSet
group row by row.UpdatedBy
into grp
select new
{
Survey = grp.Key,
QuestionID = grp.Select(i => i.QuestionID),
Question = grp.Select(q => q.Text),
Answer = grp.Select(a => a.Answer)
};
While grouping, the resulting returnset (of type: string, list int, list string, list int) sometimes, but not always, turns the question order back to front, without inverting answer or questionID, which throws it off.
i.e. if the set is questionID 1,2,3 and question A,B,C it sometimes returns 1,2,3 and C,B,A
Can anyone advise why it may be doing this? Why only on the one column? Thanks!
edit: Got it thanks all! In case it helps anyone in future, here is the solution used:
var groupedSet = from row in dataSet
group row by row.UpdatedBy
into grp
select new
{
Survey = grp.Key,
QuestionID = grp.OrderBy(x=>x.QuestionID).Select(i => i.QuestionID),
Question = grp.OrderBy(x=>x.QuestionID).Select(q => q.Text),
Answer = grp.OrderBy(x=>x.QuestionID).Select(a => a.Answer)
};

Reversal of a grouped order is a coincidence: IQueryable<T>'s GroupBy returns groups in no particular order. Unlike in-memory GroupBy, which specifies the order of its groups, the behavior of queries performed in an RDBMS depends on the implementation. From the documentation:
The query behavior that occurs as a result of executing an expression tree that represents calling GroupBy<TSource,TKey,TElement>(IQueryable<TSource>, Expression<Func<TSource,TKey>>, Expression<Func<TSource,TElement>>) depends on the implementation of the type of the source parameter.
If you would like to have your rows in a specific order, you need to add OrderBy to your query to force it.
How do I do it and maintain the relative list order, rather than apply an order to the resulting set?
One approach is to apply the grouping after bringing the data into memory. Call ToList() on dataSet to materialize it; after that, the order produced by a subsequent GroupBy query will be consistent with dataSet. A drawback is that the grouping is no longer done in the RDBMS.
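A small sketch of that last approach, using only in-memory data. The AnswerRow class is a hypothetical stand-in for the projected rows, and the ToList() call stands in for materializing the database query. LINQ to Objects yields the elements of each group in the order they appear in the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one projected row of the first query.
public class AnswerRow
{
    public string UpdatedBy { get; set; }
    public int QuestionID { get; set; }
    public string Text { get; set; }
}

public static class GroupOrderDemo
{
    // Groups rows per user in memory; within each group the elements keep
    // the order they had in the source sequence.
    public static Dictionary<string, List<string>> QuestionsByUser(IEnumerable<AnswerRow> rows)
    {
        return rows
            .ToList() // stands in for materializing the database query
            .GroupBy(r => r.UpdatedBy)
            .ToDictionary(g => g.Key, g => g.Select(r => r.Text).ToList());
    }

    public static void Main()
    {
        var rows = new List<AnswerRow>
        {
            new AnswerRow { UpdatedBy = "alice", QuestionID = 1, Text = "A" },
            new AnswerRow { UpdatedBy = "bob",   QuestionID = 1, Text = "A" },
            new AnswerRow { UpdatedBy = "alice", QuestionID = 2, Text = "B" },
            new AnswerRow { UpdatedBy = "alice", QuestionID = 3, Text = "C" },
        };

        var grouped = QuestionsByUser(rows);
        Console.WriteLine(string.Join(",", grouped["alice"])); // prints A,B,C
    }
}
```

The source order itself is only meaningful if the original query has an explicit OrderBy; otherwise the database is still free to return rows in any order before they reach memory.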

LINQ: Is there a way to combine these queries into one?

I have a database that contains 3 tables:
Phones
PhoneListings
PhoneConditions
PhoneListings has a FK to the Phones table (PhoneID) and a FK to the PhoneConditions table (conditionID).
I am working on a function that adds a phone listing to the user's cart and returns all of the necessary information for the user. The phone make and model are contained in the Phones table, and the details about the condition are contained in the PhoneConditions table.
Currently I am using 3 queries to obtain all the necessary information. Is there a way to combine all of this into one query?
public ActionResult phoneAdd(int listingID, int qty)
{
ShoppingBasket myBasket = new ShoppingBasket();
string BasketID = myBasket.GetBasketID(this.HttpContext);
var PhoneListingQuery = (from x in myDB.phoneListings
where x.phonelistingID == listingID
select x).Single();
var PhoneCondition = myDB.phoneConditions
.Where(x => x.conditionID == PhoneListingQuery.phonelistingID).Single();
var PhoneDataQuery = (from ph in myDB.Phones
where ph.PhoneID == PhoneListingQuery.phonePageID
select ph).SingleOrDefault();
}

You could project the result into an anonymous class, or a Tuple, or even a custom shaped entity in a single line, however the overall database performance might not be any better:
var phoneObjects = myDB.phoneListings
.Where(pl => pl.phonelistingID == listingID)
.Select(pl => new
{
PhoneListingQuery = pl,
PhoneCondition = myDB.phoneConditions
.Single(pc => pc.conditionID == pl.phonelistingID),
PhoneDataQuery = myDB.Phones
.SingleOrDefault(ph => ph.PhoneID == pl.phonePageID)
})
.Single();
// Access phoneObjects.PhoneListingQuery / PhoneCondition / PhoneDataQuery as needed
There are also slightly more compact overloads of the LINQ Single and SingleOrDefault extensions which take a predicate as a parameter, which will help reduce the code slightly.
Edit
As an alternative to multiple retrievals from the ORM DbContext, or doing explicit manual Joins, if you set up navigation relationships between entities in your model via the navigable join keys (usually the Foreign Keys in the underlying tables), you can specify the depth of fetch with an eager load, using Include:
var phoneListingWithAssociations = myDB.phoneListings
.Include(pl => pl.PhoneConditions)
.Include(pl => pl.Phones)
.Single(pl => pl.phonelistingID == listingID);
Which will return the entity graph in phoneListingWithAssociations
(Assuming foreign keys PhoneListing.phonePageID => Phones.phoneId and
PhoneCondition.conditionID => PhoneListing.phonelistingID)

You should be able to pull it all in one query with a join, I think.
But as pointed out, you might not gain a lot of speed from this, as you are just picking the first match and then moving on, not really doing any inner comparisons.
If you know that at least one data point exists in each table, then you might as well pull everything at the same time. If not, deferring the "sub queries" as StuartLC does is a good option.
var Phone = (from a in myDB.phoneListings
join b in myDB.phoneConditions on a.phonelistingID equals b.conditionID
join c in myDB.Phones on a.phonePageID equals c.PhoneID
where
a.phonelistingID == listingID
select new {
Listing = a,
Condition = b,
Data = c
}).FirstOrDefault();
FirstOrDefault is used because Single throws an exception if more than one element exists.
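The difference between the two operators can be seen on plain in-memory sequences. This is only a sketch of the LINQ to Objects behavior; the helper name SingleThrows is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SingleVsFirst
{
    // Returns true when Single() rejects the sequence, which happens for an
    // empty sequence and for one with more than one element.
    public static bool SingleThrows(IEnumerable<int> source)
    {
        try
        {
            source.Single();
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var many = new[] { 1, 2 };
        Console.WriteLine(SingleThrows(many));          // True: two elements
        Console.WriteLine(many.FirstOrDefault());       // 1: first element, no exception
        Console.WriteLine(new int[0].FirstOrDefault()); // 0: default(int) for an empty sequence
    }
}
```

Single is the stricter choice when more than one match would indicate a data problem; FirstOrDefault silently takes whichever match comes first.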

Join to an in-memory list efficiently

In EF, if I have a list of primitives (for example a List<int> or an int[]), "joining" that against a table is easy:
var ids = new[] {1, 4, 6}; //some random values
var rows = context.SomeTable.Where(r => ids.Contains(r.id));
This gets much more complicated the instant you want to join on multiple columns:
var keys = something.Select(s => new { s.Field1, s.Field2 });
var rows = context.SomeTable.Where(r => keys.Contains(new { r.Field1, r.Field2 })); // this won't work
I've found two ways to join it, but neither is great:
Suck in the entire table, and filtering it based on the other data. (this gets slow if the table is really large)
For each key, query the table (this gets slow if you have a decent number of rows to pull in)
Sometimes, the compromise I've been able to make is a modified #1: pulling in a subset of the table based on a fairly unique key
var keys = something.Select(s => s.Field1);
var rows = context.SomeTable.Where(r => keys.Contains(r.Field1)).ToList();
foreach (var sRow in something)
{
var joinResult = rows.Where(r => r.Field1 == sRow.Field1 && r.Field2 == sRow.Field2);
//do stuff
}
But even this could pull back too much data.
I know there are ways to coax table valued parameters into ADO.Net, and ways I can build a series of .Where() clauses that are OR'd together. Does anyone have any magic bullets?

Instead of a .Contains(), how about using an inner join and "filtering" that way:
from s in context.SomeTable
join k in keys on new {s.Field1, s.Field2} equals new {k.Field1, k.Field2}
select s;
There may be a typo in the above, but you get the idea...

I got exactly the same problem, and the solutions I came up with were:
Naive: do a separate query for each local record
Smarter: create 2 lists, one of the unique Field1 values and one of the unique Field2 values, and query using 2 Contains expressions. You then have to filter the result a second time, because it can contain rows whose (Field1, Field2) combination was not actually requested.
Looks like this:
var unique1 = something.Select(x => x.Field1).Distinct().ToList();
var unique2 = something.Select(x => x.Field2).Distinct().ToList();
var priceData = rows.Where(x => unique1.Contains(x.Field1) && unique2.Contains(x.Field2));
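The "double filter" step can be sketched as follows, using in-memory lists in place of the database table. The Item class and the Match helper are hypothetical; with EF, the first stage is the part that would be translated to SQL, and the second stage runs on the rows that come back.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type with the two key columns from the question.
public class Item
{
    public int Field1 { get; set; }
    public int Field2 { get; set; }
}

public static class TwoStageFilter
{
    public static List<Item> Match(IEnumerable<Item> table, IEnumerable<Item> something)
    {
        var wanted = something.ToList();
        var unique1 = wanted.Select(x => x.Field1).Distinct().ToList();
        var unique2 = wanted.Select(x => x.Field2).Distinct().ToList();

        // Stage 1: coarse filter. It can let through combinations such as
        // (1, 2) when only (1, 1) and (2, 2) were requested.
        var candidates = table
            .Where(x => unique1.Contains(x.Field1) && unique2.Contains(x.Field2))
            .ToList();

        // Stage 2: exact filter in memory on the real (Field1, Field2) pairs.
        var pairs = new HashSet<Tuple<int, int>>(
            wanted.Select(x => Tuple.Create(x.Field1, x.Field2)));
        return candidates
            .Where(x => pairs.Contains(Tuple.Create(x.Field1, x.Field2)))
            .ToList();
    }

    public static void Main()
    {
        var table = new List<Item>
        {
            new Item { Field1 = 1, Field2 = 1 },
            new Item { Field1 = 1, Field2 = 2 },
            new Item { Field1 = 2, Field2 = 1 },
            new Item { Field1 = 2, Field2 = 2 },
        };
        var something = new List<Item>
        {
            new Item { Field1 = 1, Field2 = 1 },
            new Item { Field1 = 2, Field2 = 2 },
        };

        Console.WriteLine(Match(table, something).Count); // prints 2
    }
}
```

In this example the coarse filter returns all four rows, so how much the first stage saves depends on how selective the individual columns are.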
Next one is my own solution which I called BulkSelect, the idea behind it is like this:
Create temp table using direct SQL command
Upload data for SELECT command to that temp table
Intercept and modify SQL which was generated by EF.
I did it for Postgres, but it could be ported to MSSQL if needed. This is nicely described here and the source code is here

You can try flattening your keys and then using the same Contains pattern. This will probably not perform great on large queries, although you could use function indexes to store the flattened key in the database...
I have table Test with columns K1 int, K2 int, Name varchar(50)
var l = new List<Tuple<int, int>>();
l.Add(new Tuple<int, int>(1, 1));
l.Add(new Tuple<int, int>(1, 2));
var s = l.Select(k => k.Item1.ToString() + "," + k.Item2.ToString());
var q = Tests.Where(t => s.Contains(t.K1.ToString() + "," + t.K2.ToString()));
foreach (var y in q) {
Console.WriteLine(y.Name);
}
I've tested this in LinqPad with Linq to SQL
First attempt that didn't work:
I think the way to write it as a single query is something like this
var keys = something.Select(s => new { s.Field1, s.Field2 })
var rows = context.SomeTable.Where(r => keys.Any(k => r.Field1 == k.Field1 && r.Field2 == k.Field2));
Unfortunately I don't have EF on this laptop and can't even test if this is syntactically correct.
I've also no idea how performant it is if it works at all...

var rows =
from key in keys
join thingy in context.SomeTable
on 1 equals 1
where thingy.Field1 == key.Field1 && thingy.Field2 == key.Field2
select thingy;
should work, and generate reasonable SQL

wpf and ef select part of a table with include

I have a table with many fields and I want to get only a few of them. I work with EF, and I include another table in the query as follows:
var Test= ve.Folders.Include("Hosting")
.Where(a => a.Collateral!= true)
.AsEnumerable()
.Select(p => new
{
id = p.Folder_Id,
name = p.Full_Name,
add = p.Address,
date1 = p.Collateral_Date,
sName = p.Hosting._Name
})
.ToArray();
But when the field from the second table (sName = p.Hosting._Name) has no value, the query does not work.
I have tried many things without success. Interestingly, when I run the query without the Select, everything works well.
Thanks in advance for any help

One thing to note is that, in this case, there's little benefit to the Select after the call to AsEnumerable, since all the data in the table is still queried from the database (not just the fields you specify).
If you want to avoid that, and only query those five fields, you can remove the AsEnumerable call. That means the Select will execute as part of the SQL query. This also means the Include is unnecessary, since the Select will query all of the data you want.
var Test= ve.Folders
.Where(a => a.Collateral!= true)
.Select(p => new
{
id = p.Folder_Id,
name = p.Full_Name,
add = p.Address,
date1 = p.Collateral_Date,
sName = p.Hosting._Name
})
.ToArray();
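If the AsEnumerable() version is kept instead, the projection runs in memory, where dereferencing a null p.Hosting throws a NullReferenceException. The sketch below shows a null-safe projection, assuming that a missing Hosting row is what makes the original query fail. The Folder and Hosting classes are simplified stand-ins for the real entities.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-ins for the two entities in the question.
public class Hosting
{
    public string _Name { get; set; }
}

public class Folder
{
    public int Folder_Id { get; set; }
    public Hosting Hosting { get; set; } // null when no related row exists
}

public static class NullSafeProjection
{
    // Projects the hosting name for each folder, yielding null instead of
    // throwing when a folder has no Hosting.
    public static List<string> HostingNames(IEnumerable<Folder> folders)
    {
        return folders.Select(p => p.Hosting?._Name).ToList();
    }

    public static void Main()
    {
        var folders = new List<Folder>
        {
            new Folder { Folder_Id = 1, Hosting = new Hosting { _Name = "H1" } },
            new Folder { Folder_Id = 2, Hosting = null },
        };

        foreach (var name in HostingNames(folders))
            Console.WriteLine(name ?? "(none)");
    }
}
```

The ?. operator is only available in in-memory lambdas; it cannot be used inside an expression tree, so it applies to the AsEnumerable() variant and not to the query that runs as SQL.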
