Compare two datatables and draw those that differ into new DataTable efficiently - c#

In essence I have two datatables
DataTable 1
PirateShipID PirateShipPreference
123 1
122 2
121 3
And DataTable 2 (which has differently named columns, but the data types are the same):
RGPirateShipID PirateShipPreferenceType
123 1
122 1
121 3
I want to grab all records where
PirateShipID == RGPirateShipID && PirateShipPreference != PirateShipPreferenceType
Ideally using LINQ, as I believe that would be the quickest way of accomplishing this:
var idsNotinPirates = from r in DataTable1.AsEnumerable()
                      // Get all records that don't match on preference
                      where DataTable2.AsEnumerable().Any(r2 => r["PirateShipID"] == r2["RGPirateShipID"] && r["PirateShipPreference"] != r2["PirateShipPreferenceType"])
                      select r;
However, DataTable 1 has about 10k pirate ships and DataTable 2 has 1 million.
It takes the application a long time to complete the above.
How can I make this more efficient?

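I believe you should probably be using a join, which builds a hash lookup over DataTable2 once instead of rescanning it for every row of DataTable1. Something like this:
var query = from r in DataTable1.AsEnumerable()
            join r2 in DataTable2.AsEnumerable() on r["PirateShipID"] equals r2["RGPirateShipID"]
            // Compare the boxed values with Equals; != on object compares references
            where !object.Equals(r["PirateShipPreference"], r2["PirateShipPreferenceType"])
            select r;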
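If the goal is a new DataTable holding only the mismatched rows, the same idea can be written with an explicit dictionary. This is only a minimal sketch: the class and method names (PirateShipDiff, FindMismatches) are made up for illustration, and it assumes that all four columns are of type int and that RGPirateShipID is unique in the second table.

```csharp
using System.Collections.Generic;
using System.Data;

public static class PirateShipDiff
{
    // Returns a new table (same schema as table1) holding the rows of table1
    // whose preference differs from the row with the same ID in table2.
    // Assumption: all four columns are int columns.
    public static DataTable FindMismatches(DataTable table1, DataTable table2)
    {
        // One pass over the large table: ship ID -> preference type.
        // If an ID occurs more than once, the last row wins.
        var preferenceById = new Dictionary<int, int>();
        foreach (DataRow row in table2.Rows)
        {
            preferenceById[(int)row["RGPirateShipID"]] = (int)row["PirateShipPreferenceType"];
        }

        // Clone copies the schema only, not the data.
        DataTable result = table1.Clone();
        foreach (DataRow row in table1.Rows)
        {
            if (preferenceById.TryGetValue((int)row["PirateShipID"], out int other)
                && other != (int)row["PirateShipPreference"])
            {
                result.ImportRow(row);
            }
        }
        return result;
    }
}
```

Each table is walked once, so the cost grows with the sum of the two row counts rather than their product.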

Related

Simplifying LINQ query in C#

Table1
Table1ID Name Graduation Version Hobbies
1 A Degree 1 B
2 A Degree 2 C
3 A Degree 3 D
Table2
Table2ID Table1ID Name Graduation Version Address Surname Date
1 1 A Degree 1 A A 08-10-2019
2 2 A Degree 2 A A 08-10-2019
3 3 A Degree 3 A A
I want to check whether Table1 contains any version greater than the highest version in Table2 whose Date column is not null.
Suppose that, for the combination of Name and Graduation, the highest version in Table2 with a non-null Date is 2 (Date is null for version 3). I want to check whether any record with a version greater than 2 exists in Table1 and, if so, add it to a new List.
Here is what I am doing.
List<Table2> groupByTable2 = //Operations on Table2 and get highest Version record from db
List<Table1> check = new List<Table1>();
List<Table1> check2 = await _table1.GetAll().ToListAsync();
foreach (var a in groupByTable2)
{
    List<Table1> check4 = check2.Where(x => x.Name == a.Name && x.Graduation == a.Graduation).ToList();
    if (check4.Any(x => x.Version > a.Version))
    {
        check.Add(check2.Where(x => x.Table1ID == a.Table1ID).First());
    }
}
Now my check contains a record where ID is 3. But is there a simpler way to achieve this, with better readability and performance?
I hope I understood what you are trying to achieve. You could try the following.
var result = table2.Where(x => x.Date != null)
                   .GroupBy(x => new { x.Name, x.Graduation })
                   .SelectMany(x => x.OrderByDescending(c => c.Version).Take(1))
                   .Join(table1, t2 => t2.Table1ID, t1 => t1.Table1ID, (t2, t1) => t1)
                   .ToList();
result.AddRange(table1.Where(x => result.Any(c => c.Name.Equals(x.Name)
                                               && c.Graduation.Equals(x.Graduation)
                                               && c.Version < x.Version)));
The idea is to first use GroupBy and Join to get the List of Items with highest Version number in Table1 that has a valid date in Table2. Then, use List.AddRange to add remaining higher versions from Table1.

LINQ join to return dynamic column list

I'm looking for a way to return a dynamic column list from a LINQ join of two datatables.
First, this is not a duplicate. I have already studied and discarded:
C# LINQ list select columns dynamically from a joined dataset
Creating a LINQ select from multiple tables
How to do a LINQ join that behaves exactly like a physical database inner join?
(and many others)
Here is my starting point:
public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
DataTable result = ( from dataRows1 in dt1.AsEnumerable()
join dataRows2 in dt2.AsEnumerable()
on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
[...I NEED HELP HERE with the SELECT....]).CopyToDataTable();
return result;
}
A few notes and requirements:
There is no database engine. The data sources are large CSV files (500K+ records) being read into c# DataTables.
Because the CSVs are large, looping through each record in the join is a bad solution for performance reasons. I've already tried record looping and it's just too slow. I get great performance on the join above, but I can't find a way to have it return just the columns I want (specified by the caller) without looping records.
If I need to loop over columns in the join, that is perfectly fine, I just don't want to loop rows.
I want to be able to pass in an array of column names and return just those columns in the resulting DataTable. If both datatables being passed in happen to have a column named the same, and if that column is in my array of column names, just pass back either column because the data will be the same between the 2 columns in that case.
If I need to pass in 2 arrays (1 for each datatable's desired columns) that's fine, but 1 array of column names would be ideal.
The column list cannot be static and hardcoded into the function. The reason is because my JoinDataTables() is called from many different places in my system in order to join a wide variety of CSVs-turned-datatables, and each CSV file has very different columns.
I don't want all columns returned in the resulting DataTable -- just the columns I specify in the columns array.
So suppose, before calling JoinDataTables(), I have the following 2 datatables:
Table: T1
T1A T1B T1C T1D
==================
10 AA H1 Foo1
11 AB H1 Foo2
12 AA H2 Foo1
13 AB H2 Foo2
Table: T2
T2A T2X T2Y T2Z
==================
12 N1 O1 Yeah1
17 N2 O2 Yeah2
18 N3 O1 Yeah1
19 N4 O2 Yeah2
Now suppose we join these 2 tables like so:
ON T1.T1A = T2.T2A
select * from [join]
and that yields this resultset:
T1A T1B T1C T1D T2A T2X T2Y T2Z
====================================
12 AA H2 Foo1 12 N1 O1 Yeah1
Notice that only 1 row is yielded by the join.
Now to the crux of my question. Suppose that for a given use case, I want to return only 4 columns from this join: T1A, T1D, T2A, and T2Y. So my resultset would then look like this:
T1A T1D T2A T2Y
==================
12 Foo1 12 O1
I'd like to be able to call my JoinDataTables function like so:
DataTable dt = JoinDataTables(dt1, dt2, "T1A", "T2A", new string[] {"T1A", "T1D", "T2A", "T2Y"});
Keeping in mind performance and the fact that I don't want to loop through records (because it's slow for large sets of data), how can this be accomplished? (The join is already working well, now I just need a correct select segment (whether via new{..} or whatever you think)).
I cannot accept a solution with a hardcoded column list inside the function. I have found examples of that approach all over SO.
Any ideas?
EDIT: I'd be ok getting ALL columns back every time, but every attempt I've made to include all columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join.
I'm not sure of the performance with 500k records, but here is an attempted solution.
Since you are combining two subsets of DataRows from different tables, there are no easy operations that will create the subset or create a new DataTable from the subsets (though I have an extension method for flattening an IEnumerable<anon> where anon = new { DataRow1, DataRow2, ... } from a join, it would probably be slow for you).
Instead, I pre-create an answer DataTable with the columns requested and then use LINQ to build the value arrays to be added as the rows.
public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
    var rtnCols1 = dt1.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName)).ToList();
    var rc1 = rtnCols1.Select(dc => dc.ColumnName).ToList();
    var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1.Contains(dc.ColumnName)).ToList();
    var rc2 = rtnCols2.Select(dc => dc.ColumnName).ToList();
    var work = from dataRows1 in dt1.AsEnumerable()
               join dataRows2 in dt2.AsEnumerable()
               on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
               select (from c1 in rc1 select dataRows1[c1]).Concat(from c2 in rc2 select dataRows2[c2]).ToArray();
    var result = new DataTable();
    foreach (var rc in rtnCols1)
        result.Columns.Add(rc.ColumnName, rc.DataType);
    foreach (var rc in rtnCols2)
        result.Columns.Add(rc.ColumnName, rc.DataType);
    foreach (var rowVals in work)
        result.Rows.Add(rowVals);
    return result;
}
Since you were using query syntax, I did as well, but normally I would probably do the select like so:
select rc1.Select(c1 => dataRows1[c1]).Concat(rc2.Select(c2 => dataRows2[c2])).ToArray();
Updated: It is probably worthwhile to use the column ordinals instead of the names to index into each DataRow by replacing the definitions of rc1 and rc2:
var rc1 = rtnCols1.Select(dc => dc.Ordinal).ToList();
var rc1Names = rtnCols1.Select(dc => dc.ColumnName).ToHashSet();
var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1Names.Contains(dc.ColumnName)).ToList();
var rc2 = rtnCols2.Select(dc => dc.Ordinal).ToList();
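For reference, here is roughly how the function might look once the ordinal-based update is folded in. Treat it as a sketch rather than a tuned implementation: the wrapper class name DataTableJoiner is made up, the key columns are assumed to be string columns (as in the original join), and the result columns come back in table order rather than in the order of the columns array.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DataTableJoiner
{
    public static DataTable JoinDataTables(DataTable dt1, DataTable dt2,
        string table1KeyField, string table2KeyField, string[] columns)
    {
        var wanted = new HashSet<string>(columns);

        // Requested columns found in dt1, then requested columns found in dt2
        // that were not already taken from dt1.
        var rtnCols1 = dt1.Columns.Cast<DataColumn>()
            .Where(dc => wanted.Contains(dc.ColumnName)).ToList();
        var taken = new HashSet<string>(rtnCols1.Select(dc => dc.ColumnName));
        var rtnCols2 = dt2.Columns.Cast<DataColumn>()
            .Where(dc => wanted.Contains(dc.ColumnName) && !taken.Contains(dc.ColumnName)).ToList();

        // Ordinals avoid a column-name lookup on every cell access.
        int[] ord1 = rtnCols1.Select(dc => dc.Ordinal).ToArray();
        int[] ord2 = rtnCols2.Select(dc => dc.Ordinal).ToArray();

        // Pre-create the result table with just the requested columns.
        var result = new DataTable();
        foreach (DataColumn dc in rtnCols1.Concat(rtnCols2))
            result.Columns.Add(dc.ColumnName, dc.DataType);

        // Inner join on the key fields; each match yields one value array.
        var work = from r1 in dt1.AsEnumerable()
                   join r2 in dt2.AsEnumerable()
                       on r1.Field<string>(table1KeyField) equals r2.Field<string>(table2KeyField)
                   select ord1.Select(i => r1[i]).Concat(ord2.Select(i => r2[i])).ToArray();

        foreach (object[] rowVals in work)
            result.Rows.Add(rowVals);
        return result;
    }
}
```

Called with the T1 and T2 tables from the question and the columns T1A, T1D, T2A and T2Y, this should produce the single joined row with those four columns.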

Split datatable into multiple arrays based on column value in C#

I have a DataTable with multiple records that have different key values. For example, key 34 has multiple rows and key 35 has multiple rows. I need to split the rows into separate arrays based on the key column's value.
// Rows for a single key value:
var rows34 = (from r in myDataTable.AsEnumerable()
              where r.Field<int>("KeyColumn") == 34
              select r).ToArray();
// Or group all rows by key value in one pass:
var KeyGroups = from r in myDataTable.AsEnumerable()
                group r by r.Field<int>("KeyColumn") into g
                select g;
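To end up with actual arrays, one per key, the groups can be materialized into a dictionary. A small sketch, assuming the key column is an int column; the names DataTableSplitter and SplitByKey are invented for the example.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DataTableSplitter
{
    // Splits the rows into one DataRow[] per distinct value of the key column.
    // Assumption: the key column holds non-null int values.
    public static Dictionary<int, DataRow[]> SplitByKey(DataTable table, string keyColumn)
    {
        return table.AsEnumerable()
                    .GroupBy(r => r.Field<int>(keyColumn))
                    .ToDictionary(g => g.Key, g => g.ToArray());
    }
}
```

With this, SplitByKey(myDataTable, "KeyColumn")[34] would give the same rows as rows34 above.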

How to concat with linq?

I have these tables in SQL Server:
id_service name
1 ejemplo 1
2 ejemplo 2
id_service id_quality
1 1
1 2
id_quality quality
1 simple
2 full
and I'm trying to get this:
id_service qualities
1 Simple \n Full
I'm working with LINQ. Is there any way to do it?
I have this so far, but it returns two rows instead of the one I need:
var services = from service in dc.service
join s_quality in dc.service_quality
on service.id_service equals s_quality.id_service
join qualityObj in dc.quality
on s_quality.id_quality equals qualityObj.id_quality
select new {service.id_service, qualityObj.quality1};
GridView1.DataSource = services;
GridView1.DataBind();
The final grid should look like this:
id_service  name       quality available
1           Company 1  1 - simple
                       2 - full
--------------------------------------------------------
2           Company 2  1 - simple
(the last column should hold the grouped or concatenated results)
var servicios = from servicio in dc.servicio
                join s_calidad in dc.servicio_calidad
                on servicio.id_servicio equals s_calidad.id_servicio
                join calidadObj in dc.calidad
                on s_calidad.id_calidad equals calidadObj.id_calidad
                group new { servicio.id_servicio, calidadObj.calidad1 } by servicio.id_servicio into grouping
                select new { grouping.Key, Concat = string.Join(",", grouping.Select(g => g.calidad1)) };
You can write your own concatenation logic for the Concat property in the anonymous type.
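The grouping idea can be tried out on in-memory data first. The sketch below uses plain tuples in place of the LINQ to SQL entities, so every type and name in it (QualityConcat, QualitiesByService, the tuple fields) is hypothetical, and it says nothing about how a given LINQ provider translates string.Join to SQL.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class QualityConcat
{
    // Joins the link table to the quality table and returns one concatenated
    // string of quality names per service ID.
    public static Dictionary<int, string> QualitiesByService(
        IEnumerable<(int IdService, int IdQuality)> serviceQualities,
        IEnumerable<(int IdQuality, string Quality)> qualities,
        string separator)
    {
        return (from sq in serviceQualities
                join q in qualities on sq.IdQuality equals q.IdQuality
                group q.Quality by sq.IdService into g
                select new { g.Key, Text = string.Join(separator, g) })
               .ToDictionary(x => x.Key, x => x.Text);
    }
}
```

Passing a different separator (a comma, a line break, and so on) is the only change needed to alter how the qualities are concatenated.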

LINQ inner join condition

Suppose I have the following tables:
**Members**
Code Name
001 Sue
002 Peter
003 John
**Sales Info**
MemCode Date Type (A/B) Values
001 17/12/2013 A 100
001 17/11/2013 B 100
002 16/12/2013 A 100
I want to have the following result table
**Member Sales in 2013**
MemCode Jan(A) Jan(B) Feb(A) ... Nov(B) Dec(A) Dec(B)
001 0 0 0 100 100 0
002 0 0 0 0 100 0
I tried to extract some data (Nov(A) and Nov(B)) first, using this query:
var query = from tb in Members
            join tb2 in SalesInfo on tb.MemCode equals tb2.MemCode
            join tb3 in SalesInfo on tb.MemCode equals tb3.MemCode
            where tb2.Type.Equals("A") &&
                  tb2.Date.Month.Equals(11) &&
                  tb3.Type.Equals("B") &&
                  tb3.Date.Month.Equals(11)
            select ...
However, it returns no data: no type A record is found for November, so the whole row is filtered out. Is there any suggestion to solve the problem?
The problem is that you are asking for (Type == A && Type == B), which is impossible. You can select A first and get the B value for the same date in a subquery.
Indeed, if you want type A and type B, you should select on type == A OR type == B. No row will ever satisfy both conditions :)
Something like this is simpler and probably more effective:
var query = from tb in Members
            join tb2 in SalesInfo on tb.MemCode equals tb2.MemCode
            where (tb2.Type.Equals("A") ||
                   tb2.Type.Equals("B")) &&
                  tb2.Date.Month.Equals(11)
            select ...
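From there, the A and B columns for a month can be built by grouping per member and summing each type separately, so a missing type simply becomes 0. This is a rough in-memory sketch: the tuple shape and the names SalesPivot and MonthTotals are assumptions made for the example, not part of the original tables.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SalesPivot
{
    // Sums the sale values per member for one month, split by type.
    // A member with only one of the two types in that month still appears,
    // with 0 for the missing type. Members with no sales in the month are absent.
    public static Dictionary<string, (int A, int B)> MonthTotals(
        IEnumerable<(string MemCode, DateTime Date, string Type, int Value)> sales,
        int month)
    {
        return sales
            .Where(s => s.Date.Month == month && (s.Type == "A" || s.Type == "B"))
            .GroupBy(s => s.MemCode)
            .ToDictionary(
                g => g.Key,
                g => (A: g.Where(s => s.Type == "A").Sum(s => s.Value),
                      B: g.Where(s => s.Type == "B").Sum(s => s.Value)));
    }
}
```

Running it once per month (or grouping by month as well) would give the Jan(A) through Dec(B) columns of the result table; members missing from a month's dictionary can be filled in with zeros from the Members list.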
