I have a DataTable with multiple records that have different key values. For example, key 34 has multiple rows and key 35 has multiple rows. I need to split these rows into separate arrays based on the key column value.
var rows34 = (from r in myDataTable.AsEnumerable()
where r.Field<int>("KeyColumn") == 34
select r).ToArray();
var KeyGroups = from r in myDataTable.AsEnumerable()
group r by r.Field<int>("KeyColumn") into g
select g;
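If the goal is one array per key, the grouped query can be materialized into a dictionary. The following is a minimal sketch, not a definitive implementation: the sample table and its "Value" column are made up for illustration, and only "KeyColumn" comes from the question. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Data;
using System.Linq;

// Hypothetical sample table; only "KeyColumn" comes from the question.
var myDataTable = new DataTable();
myDataTable.Columns.Add("KeyColumn", typeof(int));
myDataTable.Columns.Add("Value", typeof(string));
myDataTable.Rows.Add(34, "a");
myDataTable.Rows.Add(35, "b");
myDataTable.Rows.Add(34, "c");

// Group once, then keep one DataRow[] per distinct key value.
var rowsByKey = myDataTable.AsEnumerable()
    .GroupBy(r => r.Field<int>("KeyColumn"))
    .ToDictionary(g => g.Key, g => g.ToArray());

foreach (var pair in rowsByKey)
    Console.WriteLine($"{pair.Key}: {pair.Value.Length} row(s)");
```

This walks the table once, whereas a separate where-filter per key would scan the table once for every key.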
I'm looking for a way to return a dynamic column list from a LINQ join of two datatables.
First, this is not a duplicate. I have already studied and discarded:
C# LINQ list select columns dynamically from a joined dataset
Creating a LINQ select from multiple tables
How to do a LINQ join that behaves exactly like a physical database inner join?
(and many others)
Here is my starting point:
public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
DataTable result = ( from dataRows1 in dt1.AsEnumerable()
join dataRows2 in dt2.AsEnumerable()
on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
[...I NEED HELP HERE with the SELECT....]).CopyToDataTable();
return result;
}
A few notes and requirements:
There is no database engine. The data sources are large CSV files (500K+ records) being read into C# DataTables.
Because the CSVs are large, looping through each record in the join is a bad solution for performance reasons. I've already tried record looping and it's just too slow. I get great performance on the join above, but I can't find a way to have it return just the columns I want (specified by the caller) without looping records.
If I need to loop over columns in the join, that is perfectly fine, I just don't want to loop rows.
I want to be able to pass in an array of column names and return just those columns in the resulting DataTable. If both datatables being passed in happen to have a column named the same, and if that column is in my array of column names, just pass back either column because the data will be the same between the 2 columns in that case.
If I need to pass in 2 arrays (1 for each datatable's desired columns) that's fine, but 1 array of column names would be ideal.
The column list cannot be static and hardcoded into the function. The reason is because my JoinDataTables() is called from many different places in my system in order to join a wide variety of CSVs-turned-datatables, and each CSV file has very different columns.
I don't want all columns returned in the resulting DataTable -- just the columns I specify in the columns array.
So suppose, before calling JoinDataTables(), I have the following 2 datatables:
Table: T1
T1A T1B T1C T1D
==================
10 AA H1 Foo1
11 AB H1 Foo2
12 AA H2 Foo1
13 AB H2 Foo2
Table: T2
T2A T2X T2Y T2Z
==================
12 N1 O1 Yeah1
17 N2 O2 Yeah2
18 N3 O1 Yeah1
19 N4 O2 Yeah2
Now suppose we join these 2 tables like so:
ON T1.T1A = T2.T2A
select * from [join]
and that yields this resultset:
T1A T1B T1C T1D T2A T2X T2Y T2Z
====================================
12 AA H2 Foo1 12 N1 O1 Yeah1
Notice that only 1 row is yielded by the join.
Now to the crux of my question. Suppose that for a given use case, I want to return only 4 columns from this join: T1A, T1D, T2A, and T2Y. So my resultset would then look like this:
T1A T1D T2A T2Y
==================
12 Foo1 12 O1
I'd like to be able to call my JoinDataTables function like so:
DataTable dt = JoinDataTables(dt1, dt2, "T1A", "T2A", new string[] {"T1A", "T1D", "T2A", "T2Y"});
Keeping in mind performance and the fact that I don't want to loop through records (because it's slow for large sets of data), how can this be accomplished? (The join is already working well, now I just need a correct select segment (whether via new{..} or whatever you think)).
I cannot accept a solution with a hardcoded column list inside the function. I have found examples of that approach all over SO.
Any ideas?
EDIT: I'd be ok getting ALL columns back every time, but every attempt I've made to include all columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join.
I'm not sure of the performance with 500k records, but here is an attempted solution.
Since you are combining two subsets of DataRows from different tables, there are no easy operations that will create the subset or create a new DataTable from the subsets (though I have an extension method for flattening an IEnumerable<anon> where anon = new { DataRow1, DataRow2, ... } from a join, it would probably be slow for you).
Instead, I pre-create an answer DataTable with the columns requested and then use LINQ to build the value arrays to be added as the rows.
public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
var rtnCols1 = dt1.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName)).ToList();
var rc1 = rtnCols1.Select(dc => dc.ColumnName).ToList();
var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1.Contains(dc.ColumnName)).ToList();
var rc2 = rtnCols2.Select(dc => dc.ColumnName).ToList();
var work = from dataRows1 in dt1.AsEnumerable()
join dataRows2 in dt2.AsEnumerable()
on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
select (from c1 in rc1 select dataRows1[c1]).Concat(from c2 in rc2 select dataRows2[c2]).ToArray();
var result = new DataTable();
foreach (var rc in rtnCols1)
result.Columns.Add(rc.ColumnName, rc.DataType);
foreach (var rc in rtnCols2)
result.Columns.Add(rc.ColumnName, rc.DataType);
foreach (var rowVals in work)
result.Rows.Add(rowVals);
return result;
}
Since you were using query syntax, I did as well, but normally I would probably do the select like so:
select rc1.Select(c1 => dataRows1[c1]).Concat(rc2.Select(c2 => dataRows2[c2])).ToArray();
Updated: It is probably worthwhile to use the column ordinals instead of the names to index into each DataRow by replacing the definitions of rc1 and rc2:
var rc1 = rtnCols1.Select(dc => dc.Ordinal).ToList();
var rc1Names = rtnCols1.Select(dc => dc.ColumnName).ToHashSet();
var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1Names.Contains(dc.ColumnName)).ToList();
var rc2 = rtnCols2.Select(dc => dc.Ordinal).ToList();
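Putting the pieces above together, here is a sketch of the ordinal-based variant run against the T1/T2 sample data from the question. It assumes every column is stored as a string (the join uses Field<string>), and the names t1, t2, wanted and joined are illustrative only. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Sample tables from the question; all columns are assumed to be strings.
var t1 = new DataTable();
foreach (var c in new[] { "T1A", "T1B", "T1C", "T1D" }) t1.Columns.Add(c, typeof(string));
t1.Rows.Add("10", "AA", "H1", "Foo1");
t1.Rows.Add("11", "AB", "H1", "Foo2");
t1.Rows.Add("12", "AA", "H2", "Foo1");
t1.Rows.Add("13", "AB", "H2", "Foo2");

var t2 = new DataTable();
foreach (var c in new[] { "T2A", "T2X", "T2Y", "T2Z" }) t2.Columns.Add(c, typeof(string));
t2.Rows.Add("12", "N1", "O1", "Yeah1");
t2.Rows.Add("17", "N2", "O2", "Yeah2");
t2.Rows.Add("18", "N3", "O1", "Yeah1");
t2.Rows.Add("19", "N4", "O2", "Yeah2");

var joined = JoinDataTables(t1, t2, "T1A", "T2A", new[] { "T1A", "T1D", "T2A", "T2Y" });
foreach (DataRow row in joined.Rows)
    Console.WriteLine(string.Join(" ", row.ItemArray));

static DataTable JoinDataTables(DataTable dt1, DataTable dt2,
    string table1KeyField, string table2KeyField, string[] columns)
{
    var wanted = new HashSet<string>(columns);

    // Requested columns from dt1, then requested columns from dt2 not already taken.
    var rtnCols1 = dt1.Columns.Cast<DataColumn>()
        .Where(dc => wanted.Contains(dc.ColumnName)).ToList();
    var rc1Names = new HashSet<string>(rtnCols1.Select(dc => dc.ColumnName));
    var rtnCols2 = dt2.Columns.Cast<DataColumn>()
        .Where(dc => wanted.Contains(dc.ColumnName) && !rc1Names.Contains(dc.ColumnName)).ToList();

    // Index rows by ordinal rather than by column name.
    var rc1 = rtnCols1.Select(dc => dc.Ordinal).ToList();
    var rc2 = rtnCols2.Select(dc => dc.Ordinal).ToList();

    var result = new DataTable();
    foreach (var dc in rtnCols1.Concat(rtnCols2))
        result.Columns.Add(dc.ColumnName, dc.DataType);

    var work = from r1 in dt1.AsEnumerable()
               join r2 in dt2.AsEnumerable()
                 on r1.Field<string>(table1KeyField) equals r2.Field<string>(table2KeyField)
               select rc1.Select(i => r1[i]).Concat(rc2.Select(i => r2[i])).ToArray();

    foreach (var rowVals in work)
        result.Rows.Add(rowVals);
    return result;
}
```

Note that the output columns follow the order in which they appear in each source table, not the order of the columns array; here the two happen to coincide.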
In essence I have two datatables
DataTable 1
PirateShipID PirateShipPreference
123 1
122 2
121 3
And DataTable 2 (which has differently named columns, but the data types are the same):
RGPirateShipID PirateShipPreferenceType
123 1
122 1
121 3
I want to grab all records where
PirateShipID == RGPirateShipID && PirateShipPreference != PirateShipPreferenceType
Ideally using LINQ, as I believe that would be the quickest way of accomplishing this.
var idsNotinPirates = from r in DataTable1.AsEnumerable()
//Get all records that don't match on preference
where DataTable2.AsEnumerable().Any(r2 => r["PirateShipID"] == r2["RGPirateShipID"] && r["PirateShipPreference"] != r2["PirateShipPreferenceType"])
select r;
However, DataTable 1 has about 10k pirate ships and DataTable 2 has 1 million.
The application takes a long time to complete the query above.
How can I make this more efficient?
I believe you should probably be doing something like this:
var query = from r in DataTable1.AsEnumerable()
// A plain join (no "into") pairs each row with its matching rows from DataTable2
join r2 in DataTable2.AsEnumerable() on r["PirateShipID"] equals r2["RGPirateShipID"]
// Use Equals here: != on boxed object values compares references, not values
where !r["PirateShipPreference"].Equals(r2["PirateShipPreferenceType"])
select r;
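To see why a join helps: Enumerable.Join builds a hash lookup over the inner sequence once, so the work is roughly proportional to the combined size of both tables, instead of scanning DataTable 2 again for every row of DataTable 1. Below is a minimal runnable sketch using the sample data from the question. It assumes all four columns are int, and the variable names ships, rg and mismatched are illustrative only. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Data;
using System.Linq;

// Sample data from the question; int columns are an assumption.
var ships = new DataTable();
ships.Columns.Add("PirateShipID", typeof(int));
ships.Columns.Add("PirateShipPreference", typeof(int));
ships.Rows.Add(123, 1);
ships.Rows.Add(122, 2);
ships.Rows.Add(121, 3);

var rg = new DataTable();
rg.Columns.Add("RGPirateShipID", typeof(int));
rg.Columns.Add("PirateShipPreferenceType", typeof(int));
rg.Rows.Add(123, 1);
rg.Rows.Add(122, 1);
rg.Rows.Add(121, 3);

// Typed Field<int> access compares values, not boxed references.
var mismatched = (from r in ships.AsEnumerable()
                  join r2 in rg.AsEnumerable()
                    on r.Field<int>("PirateShipID") equals r2.Field<int>("RGPirateShipID")
                  where r.Field<int>("PirateShipPreference") != r2.Field<int>("PirateShipPreferenceType")
                  select r.Field<int>("PirateShipID")).ToList();

Console.WriteLine(string.Join(", ", mismatched));
```

One difference from the Any-based query: a join yields one result per matching pair, so a ship with several mismatching rows in the second table appears several times unless Distinct() is applied.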
I have the following scenario:
Table A that has 50 records and Table B that has 2 records.
I need to define a new table, say TableDiff, which should contain the 48 records from Table A that don't exist in Table B.
My problem is that Table A and Table B are not identical, but both contain the field rowId, which I need to use for the comparison.
One way using Enumerable.Except and Enumerable.Join:
var aIDs = TableA.AsEnumerable().Select(r => r.Field<int>("RowID"));
var bIDs = TableB.AsEnumerable().Select(r => r.Field<int>("RowID"));
var diff = aIDs.Except(bIDs);
DataTable tblDiff = (from r in TableA.AsEnumerable()
join dId in diff on r.Field<int>("RowID") equals dId
select r).CopyToDataTable();
Here's the linq-to-objects "left-join"-approach:
DataTable tblDiff = (from rA in TableA.AsEnumerable()
join rB in TableB.AsEnumerable()
on rA.Field<int>("RowID") equals rB.Field<int>("RowID") into joinedRows
from ab in joinedRows.DefaultIfEmpty()
where ab == null
select rA).CopyToDataTable();
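One caveat with both variants: CopyToDataTable throws an InvalidOperationException when the sequence contains no rows, which happens here if every RowID in Table A also exists in Table B. The following sketch shows one way to guard against that, using a HashSet for the ID lookup. The sample rows and the "Name" column are made up for illustration. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical sample data: five rows in A, two of which also exist in B.
var tableA = new DataTable();
tableA.Columns.Add("RowID", typeof(int));
tableA.Columns.Add("Name", typeof(string));
for (int n = 1; n <= 5; n++) tableA.Rows.Add(n, "item" + n);

var tableB = new DataTable();
tableB.Columns.Add("RowID", typeof(int));
tableB.Rows.Add(2);
tableB.Rows.Add(4);

// Collect B's IDs once, then keep the rows of A whose ID is not among them.
var bIds = new HashSet<int>(tableB.AsEnumerable().Select(r => r.Field<int>("RowID")));
var missing = tableA.AsEnumerable()
    .Where(r => !bIds.Contains(r.Field<int>("RowID")))
    .ToList();

// CopyToDataTable throws on an empty sequence, so fall back to an
// empty table with the same schema (DataTable.Clone copies structure only).
DataTable tblDiff = missing.Count > 0 ? missing.CopyToDataTable() : tableA.Clone();

Console.WriteLine(tblDiff.Rows.Count);
```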
This needs a reference to the System.Data.DataSetExtensions assembly (for AsEnumerable):
var tableAIds = tableA.AsEnumerable().Select(row => (int)row["rowId"]);
var tableBIds = tableB.AsEnumerable().Select(row => (int)row["rowId"]);
var resultantIds = tableAIds.Except(tableBIds);
Now create the DataTable again:
DataTable diff = (from myRow in tableA.AsEnumerable()
join rId in resultantIds on myRow.Field<int>("rowId") equals rId
select myRow).CopyToDataTable();
I want to filter the data in a DataTable using LINQ.
My scenario is that I have a dynamically created array of dates, and the DataTable has columns such as id, date, etc.
I have to retrieve the IDs that have all of the dates in the array.
ex:
string[] arr = {"10/10/2012", "11/11/2012", "9/9/2012"};
Table :
ID date
1 10/10/2012
2 11/11/2012
1 9/9/2012
6 9/9/2012
3 9/9/2012
6 11/11/2012
1 11/11/2012
The output would be 1, because only ID 1 has all the array elements.
To accomplish this I am using the LINQ query shown below, but it does not work.
Dim volunteers As DataTable =
(From leftTable In dtavailableVolunteers.AsEnumerable()
Join rightTable In dtavailableVolunteers.AsEnumerable()
On leftTable.VolunteerId Equals rightTable.VolunteerId
Where SelectedDatesArray.All(Function(i) rightTable.Field(Of String)("SelectedDate").Equals(i.ToString()))
Select rightTable).CopyToDataTable()
Let's say your DataTable is dt:
// Quote each value so the filter expression treats it as a string literal
DataRow[] dr = dt.Select("date in ('" + string.Join("','", arr) + "')");
string[] st = dr.Select(ss => ss["id"].ToString()).ToArray();
OR
DataTable newdt = dr.CopyToDataTable();
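The second line uses LINQ. Note that the filter matches rows with any of the dates, so the resulting IDs still have to be checked for having all of them.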
You could group the rows by ID and then keep only the groups for which no element of arr is missing from the group's dates. Something like:
var result = from item in list
group item by item.ID into grouping
where !arr.Any(date =>
!grouping.Select(x => x.Date).Contains(date))
select grouping.Key;
Here is another version:
var ids = from volunteer in dtavailableVolunteers.AsEnumerable()
group volunteer by volunteer["VolunteerId"] into g
let volunteerDates = g.Select(row => row.Field<string>("SelectedDate")).ToList()
where arr.All(date => volunteerDates.Contains(date))
select g.Key;
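The group-then-check idea can be tried end to end with the sample data from the question. This is a sketch under assumptions: the dates are stored as strings in exactly the same format as the array, and the column names "ID" and "date" follow the sample table rather than the VB query. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

var arr = new[] { "10/10/2012", "11/11/2012", "9/9/2012" };

// Sample table from the question; dates are kept as strings (assumption).
var dt = new DataTable();
dt.Columns.Add("ID", typeof(int));
dt.Columns.Add("date", typeof(string));
dt.Rows.Add(1, "10/10/2012");
dt.Rows.Add(2, "11/11/2012");
dt.Rows.Add(1, "9/9/2012");
dt.Rows.Add(6, "9/9/2012");
dt.Rows.Add(3, "9/9/2012");
dt.Rows.Add(6, "11/11/2012");
dt.Rows.Add(1, "11/11/2012");

// An ID qualifies when the set of required dates is a subset of its dates.
var wanted = new HashSet<string>(arr);
var ids = dt.AsEnumerable()
    .GroupBy(r => r.Field<int>("ID"))
    .Where(g => wanted.IsSubsetOf(g.Select(r => r.Field<string>("date"))))
    .Select(g => g.Key)
    .ToList();

Console.WriteLine(string.Join(", ", ids));
```

If the column actually holds DateTime values, both sides would need to be compared as DateTime instead of as strings.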
I have two tables.
[Table.Game]
Columns are "PK_id" "username" and "couponID"
[Table.Coupons]
Columns are "PK_id" "CouponID" and "Points"
The two "CouponID" columns are associated with each other. Let's say I have two rows for the user "harry" in [Table.Game], each with a different couponID.
In [Table.Coupons], these are "CouponID" 1 and 2, with "Points" of 10 and 20.
Now to the question: how do you sum these two point values that belong to different "CouponID"s? It works if the user has only one "CouponID", but when the user has two different ones, the result is 0.
var points = (from p in linq.Coupons
join g in linq.games on p.couponID equals g.couponID
where g.username == username && g.couponID == p.couponID
select (int)p.win).Sum();
You already join the tables on couponID, so you don't need to repeat that condition in the where clause:
var points = (
from p in linq.Coupons
join g in linq.games on p.couponID equals g.couponID
where g.username == username
select (int)p.win).Sum();
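The shape of that query can be checked in memory with LINQ to Objects. This is only a sketch: the anonymous-typed arrays stand in for the two tables, the property names Username, CouponID and Points are illustrative, and the "sally" row is made up. It says nothing about how a LINQ to SQL provider would translate the query. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;

// In-memory stand-ins for the two tables (hypothetical data).
var games = new[]
{
    new { Username = "harry", CouponID = 1 },
    new { Username = "harry", CouponID = 2 },
    new { Username = "sally", CouponID = 3 },
};
var coupons = new[]
{
    new { CouponID = 1, Points = 10 },
    new { CouponID = 2, Points = 20 },
    new { CouponID = 3, Points = 5 },
};

string username = "harry";

// The join already matches coupons to games; the where only filters by user.
int points = (from c in coupons
              join g in games on c.CouponID equals g.CouponID
              where g.Username == username
              select c.Points).Sum();

Console.WriteLine(points);
```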
I solved the problem by using a one-to-many relationship instead of a many-to-many relationship in my database.