Compare two DataTables to determine rows in one but not the other - c#

I have two DataTables, A and B, produced from CSV files. I need to be able to check which rows exist in B that do not exist in A.
Is there a way to do some sort of query to show the different rows or would I have to iterate through each row on each DataTable to check if they are the same? The latter option seems to be very intensive if the tables become large.

Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality) - string in this example, which is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)
IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);

would I have to iterate through each row on each DataTable to check if they are the same.
Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.
Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:
1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:
Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
Step through both tables at once, skipping over any gaps in ID's in either table. Matched ID's mean duplicated records.
This allows you to do it in (sort time * 2 ) + one pass, so if my big-O-notation is correct, it'd be (whatever-sort-time) + O(m+n) which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes )
2: An alternative approach, which may be more or less efficient depending on how big your data is:
Run through table 1, and for each row, stick it's ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
Run through table 2, and for each row, see if the ID (or hashcode etc) is present in the dictionary. You're exploiting the fact that dictionaries have really fast - O(1) I think? lookup. This step will be really fast, but you'll have paid the price doing all those dictionary inserts.
I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)

You can use the Merge and GetChanges methods on the DataTable to do this:
A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B

The answers so far assume that you're simply looking for duplicate primary keys. That's a pretty easy problem - you can use the Merge() method, for instance.
But I understand your question to mean that you're looking for duplicate DataRows. (From your description of the problem, with both tables being imported from CSV files, I'd even assume that the original rows didn't have primary key values, and that any primary keys are being assigned via AutoNumber during the import.)
The naive implementation (for each row in A, compare its ItemArray with that of each row in B) is indeed going to be computationally expensive.
A much less expensive way to do this is with a hashing algorithm. For each DataRow, concatenate the string values of its columns into a single string, and then call GetHashCode() on that string to get an int value. Create a Dictionary<int, DataRow> that contains an entry, keyed on the hash code, for each DataRow in DataTable B. Then, for each DataRow in DataTable A, calculate the hash code, and see if it's contained in the dictionary. If it's not, you know that the DataRow doesn't exist in DataTable B.
This approach has two weaknesses that both emerge from the fact that two strings can be unequal but produce the same hash code. If you find a row in A whose hash is in the dictionary, you then need to check the DataRow in the dictionary to verify that the two rows are really equal.
The second weakness is more serious: it's unlikely, but possible, that two different DataRows in B could hash to the same key value. For this reason, the dictionary should really be a Dictionary<int, List<DataRow>>, and you should perform the check described in the previous paragraph against each DataRow in the list.
It takes a fair amount of work to get this working, but it's an O(m+n) algorithm, which I think is going to be as good as it gets.

Just FYI:
Generally speaking about algorithms, comparing two sets of sortable (as ids typically are) is not an O(M*N/2) operation, but O(M+N) if the two sets are ordered. So you scan one table with a pointer to the start of the other, and:
other_item= A.first()
only_in_B= empty_list()
for item in B:
while other_item > item:
other_item= A.next()
if A.eof():
only_in_B.add( all the remaining B items)
return only_in_B
if item < other_item:
empty_list.append(item)
return only_in_B
The code above is obviously pseudocode, but should give you the general gist if you decide to code it yourself.

Thanks for all the feedback.
I do not have any index's unfortunately. I will give a little more information about my situation.
We have a reporting program (replaced Crystal reports) that is installed in 7 Servers across EU. These servers have many reports on them (not all the same for each country). They are invoked by a commandline application that uses XML files for their configuration. So One XML file can call multiple reports.
The commandline application is scheduled and controlled by our overnight process. So the XML file could be called from multiple places.
The goal of the CSV is to produce a list of all the reports that are being used and where they are being called from.
I am going through the XML files for all references, querying the scheduling program and producing a list of all the reports. (this is not too bad).
The problem I have is I have to keep a list of all the reports that might have been removed from production. So I need to compare the old CSV with the new data. For this I thought it best to put it into DataTables and compare the information, (this could be the wrong approach. I suppose I could create an object that holds it and compares the difference then create iterate through them).
The data I have about each report is as follows:
String - Task Name
String - Action Name
Int - ActionID (the Action ID can be in multiple records as a single action can call many reports, i.e. an XML file).
String - XML File called
String - Report Name
I will try the Merge idea given by MusiGenesis (thanks). (rereading some of the posts not sure if the Merge will work, but worth trying as I have not heard about it before so something new to learn).
The HashCode Idea sounds interesting as well.
Thanks for all the advice.

I found an easy way to solve this. Unlike previous "except method" answers, I use the except method twice. This not only tells you what rows were deleted but what rows were added. If you only use one except method - it will only tell you one difference and not both. This code is tested and works. See below
//Pass in your two datatables into your method
//build the queries based on id.
var qry1 = datatable1.AsEnumerable().Select(a => new { ID = a["ID"].ToString() });
var qry2 = datatable2.AsEnumerable().Select(b => new { ID = b["ID"].ToString() });
//detect row deletes - a row is in datatable1 except missing from datatable2
var exceptAB = qry1.Except(qry2);
//detect row inserts - a row is in datatable2 except missing from datatable1
var exceptAB2 = qry2.Except(qry1);
then execute your code against the results
if (exceptAB.Any())
{
foreach (var id in exceptAB)
{
//execute code here
}
}
if (exceptAB2.Any())
{
foreach (var id in exceptAB2)
{
//execute code here
}
}

Could you not simply compare the CSV files before loading them into DataTables?
string[] a = System.IO.File.ReadAllLines(#"cvs_a.txt");
string[] b = System.IO.File.ReadAllLines(#"csv_b.txt");
// get the lines from b that are not in a
IEnumerable<string> diff = b.Except(a);
//... parse b into DataTable ...

public DataTable compareDataTables(DataTable First, DataTable Second)
{
First.TableName = "FirstTable";
Second.TableName = "SecondTable";
//Create Empty Table
DataTable table = new DataTable("Difference");
DataTable table1 = new DataTable();
try
{
//Must use a Dataset to make use of a DataRelation object
using (DataSet ds4 = new DataSet())
{
//Add tables
ds4.Tables.AddRange(new DataTable[] { First.Copy(), Second.Copy() });
//Get Columns for DataRelation
DataColumn[] firstcolumns = new DataColumn[ds4.Tables[0].Columns.Count];
for (int i = 0; i < firstcolumns.Length; i++)
{
firstcolumns[i] = ds4.Tables[0].Columns[i];
}
DataColumn[] secondcolumns = new DataColumn[ds4.Tables[1].Columns.Count];
for (int i = 0; i < secondcolumns.Length; i++)
{
secondcolumns[i] = ds4.Tables[1].Columns[i];
}
//Create DataRelation
DataRelation r = new DataRelation(string.Empty, firstcolumns, secondcolumns, false);
ds4.Relations.Add(r);
//Create columns for return table
for (int i = 0; i < First.Columns.Count; i++)
{
table.Columns.Add(First.Columns[i].ColumnName, First.Columns[i].DataType);
}
//If First Row not in Second, Add to return table.
table.BeginLoadData();
foreach (DataRow parentrow in ds4.Tables[0].Rows)
{
DataRow[] childrows = parentrow.GetChildRows(r);
if (childrows == null || childrows.Length == 0)
table.LoadDataRow(parentrow.ItemArray, true);
table1.LoadDataRow(childrows, false);
}
table.EndLoadData();
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
return table;
}

try
{
if (ds.Tables[0].Columns.Count == ds1.Tables[0].Columns.Count)
{
for (int i = 0; i < ds.Tables[0].Rows.Count; i++)
{
for (int j = 0; j < ds.Tables[0].Columns.Count; j++)
{
if (ds.Tables[0].Rows[i][j].ToString() == ds1.Tables[0].Rows[i][j].ToString())
{
}
else
{
MessageBox.Show(i.ToString() + "," + j.ToString());
}
}
}
}
else
{
MessageBox.Show("Table has different columns ");
}
}
catch (Exception)
{
MessageBox.Show("Please select The Table");
}

I'm continuing tzot's idea ...
If you have two sortable sets, then you can just use:
List<string> diffList = new List<string>(sortedListA.Except(sortedListB));
If you need more complicated objects, you can define a comparator yourself and still use it.

The usual usage scenario considers a user that has a DataTable in hand and changes it by Adding, Deleting or Modifying some of the DataRows.
After the changes are performed, the DataTable is aware of the proper DataRowState for each row, and also keeps track of the Original DataRowVersion for any rows that were changed.
In this usual scenario, one can Merge the changes back into a source table (in which all rows are Unchanged). After merging, one can get a nice summary of only the changed rows with a call to GetChanges().
In a more unusual scenario, a user has two DataTables with the same schema (or perhaps only the same columns and lacking primary keys). These two DataTables consist of only Unchanged rows. The user may want to find out what changes does he need to apply to one of the two tables in order to get to the other one. That is, which rows need to be Added, Deleted, or Modified.
We define here a function called GetDelta() which does the job:
using System;
using System.Data;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
using System.Data.DataSetExtensions;
public class Program
{
private static DataTable GetDelta(DataTable table1, DataTable table2)
{
// Modified2 : row1 keys match rowOther keys AND row1 does not match row2:
IEnumerable<DataRow> modified2 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row2);
// Modified1 :
IEnumerable<DataRow> modified1 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row1);
// Added : row2 not in table1 AND row2 not in modified2
IEnumerable<DataRow> added = table2.AsEnumerable().Except(modified2, DataRowComparer.Default).Except(table1.AsEnumerable(), DataRowComparer.Default);
// Deleted : row1 not in row2 AND row1 not in modified1
IEnumerable<DataRow> deleted = table1.AsEnumerable().Except(modified1, DataRowComparer.Default).Except(table2.AsEnumerable(), DataRowComparer.Default);
Console.WriteLine();
Console.WriteLine("modified count =" + modified1.Count());
Console.WriteLine("added count =" + added.Count());
Console.WriteLine("deleted count =" + deleted.Count());
DataTable deltas = table1.Clone();
foreach (DataRow row in modified2)
{
// Match the unmodified version of the row via the PrimaryKey
DataRow matchIn1 = modified1.Where(row1 => table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row[keycol.Ordinal]))).First();
DataRow newRow = deltas.NewRow();
// Set the row with the original values
foreach(DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = matchIn1[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
// Set the modified values
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
// At this point newRow.DataRowState should be : Modified
}
foreach (DataRow row in added)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
// At this point newRow.DataRowState should be : Added
}
foreach (DataRow row in deleted)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
newRow.Delete();
// At this point newRow.DataRowState should be : Deleted
}
return deltas;
}
private static void DemonstrateGetDelta()
{
DataTable table1 = new DataTable("Items");
// Add columns
DataColumn column1 = new DataColumn("id1", typeof(System.Int32));
DataColumn column2 = new DataColumn("id2", typeof(System.Int32));
DataColumn column3 = new DataColumn("item", typeof(System.Int32));
table1.Columns.Add(column1);
table1.Columns.Add(column2);
table1.Columns.Add(column3);
// Set the primary key column.
table1.PrimaryKey = new DataColumn[] { column1, column2 };
// Add some rows.
DataRow row;
for (int i = 0; i <= 4; i++)
{
row = table1.NewRow();
row["id1"] = i;
row["id2"] = i*i;
row["item"] = i;
table1.Rows.Add(row);
}
// Accept changes.
table1.AcceptChanges();
PrintValues(table1, "table1:");
// Create a second DataTable identical to the first.
DataTable table2 = table1.Clone();
// Add a row that exists in table1:
row = table2.NewRow();
row["id1"] = 0;
row["id2"] = 0;
row["item"] = 0;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 1;
row["id2"] = 1;
row["item"] = 455;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 2;
row["id2"] = 4;
row["item"] = 555;
table2.Rows.Add(row);
// Add a row that does not exist in table1:
row = table2.NewRow();
row["id1"] = 13;
row["id2"] = 169;
row["item"] = 655;
table2.Rows.Add(row);
table2.AcceptChanges();
Console.WriteLine();
PrintValues(table2, "table2:");
DataTable delta = GetDelta(table1,table2);
Console.WriteLine();
PrintValues(delta,"delta:");
// Verify that the deltas DataTable contains the adequate Original DataRowVersions:
DataTable originals = table1.Clone();
foreach (DataRow drow in delta.Rows)
{
if (drow.RowState != DataRowState.Added)
{
DataRow originalRow = originals.NewRow();
foreach (DataColumn dc in originals.Columns)
originalRow[dc.ColumnName] = drow[dc.ColumnName, DataRowVersion.Original];
originals.Rows.Add(originalRow);
}
}
originals.AcceptChanges();
Console.WriteLine();
PrintValues(originals,"delta original values:");
}
private static void Row_Changed(object sender,
DataRowChangeEventArgs e)
{
Console.WriteLine("Row changed {0}\t{1}",
e.Action, e.Row.ItemArray[0]);
}
private static void PrintValues(DataTable table, string label)
{
// Display the values in the supplied DataTable:
Console.WriteLine(label);
foreach (DataRow row in table.Rows)
{
foreach (DataColumn col in table.Columns)
{
Console.Write("\t " + row[col, row.RowState == DataRowState.Deleted ? DataRowVersion.Original : DataRowVersion.Current].ToString());
}
Console.Write("\t DataRowState =" + row.RowState);
Console.WriteLine();
}
}
public static void Main()
{
DemonstrateGetDelta();
}
}
The code above can be tested in https://dotnetfiddle.net/. The resulting output is shown below:
table1:
0 0 0 DataRowState =Unchanged
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
table2:
0 0 0 DataRowState =Unchanged
1 1 455 DataRowState =Unchanged
2 4 555 DataRowState =Unchanged
13 169 655 DataRowState =Unchanged
modified count =2
added count =1
deleted count =2
delta:
1 1 455 DataRowState =Modified
2 4 555 DataRowState =Modified
13 169 655 DataRowState =Added
3 9 3 DataRowState =Deleted
4 16 4 DataRowState =Deleted
delta original values:
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
Note that if your tables don't have a PrimaryKey, the where clause in the LINQ queries gets simplified a little bit. I'll let you figure that out on your own.

Achieve it simply using linq.
private DataTable CompareDT(DataTable TableA, DataTable TableB)
{
DataTable TableC = new DataTable();
try
{
var idsNotInB = TableA.AsEnumerable().Select(r => r.Field<string>(Keyfield))
.Except(TableB.AsEnumerable().Select(r => r.Field<string>(Keyfield)));
TableC = (from row in TableA.AsEnumerable()
join id in idsNotInB
on row.Field<string>(ddlColumn.SelectedItem.ToString()) equals id
select row).CopyToDataTable();
}
catch (Exception ex)
{
lblresult.Text = ex.Message;
ex = null;
}
return TableC;
}

Related

Update :Result is not transfered to the orginal dataset. Sorting Rows in Dataset using C#

Update: even though I have got the required result but when the the second function access the data table the value is still the same
It a sequential program with two functions in different classes. First sort and second replace function. So it should sort the value and other function should be able to retrieve the sorted table but when it retrieve the datatable it gives the unsorted table.
I have used acceptchanges() but it also give the same result.
The program is trying to sort the table according to the required field and the result is stored in Sorted table variable. I am trying to copy this to the original i-e sourceTables but it is not working and is adding another row instead of updating [As shown in below dig]. I have tried to copy whole table but it does not work and by adding rows it is not giving the required result. I have used different methods but I am not getting the required result.
List<DataTable> sourceTables = context.GetDataByTable(sourceTable.StringValue);
List<DataTable> targetTables = context.GetDataByTable(targetTable.StringValue, sourceTables.Count);
string orderDesc= orderField.StringValue + " DESC";
for (int i = 0; i < sourceTables.Count; i++)
{
DataView dv = sourceTables[i].DefaultView;
if (orderDirection.StringValue == OrderDirectionAsc)
{
// for Sorting in Ascending Order
dv.Sort = orderField.StringValue;
}
else
{
// for Sorting in Descending Order
dv.Sort = orderDesc;
}
DataTable sortedTable = dv.ToTable();
DataTable dttableNew = sortedTable.Clone();
//sourceTables[i] = sortedTable.Copy();
//targetTables[i] = dv.ToTable();
//targetTables[i] = sortedTable.Copy();
// foreach (DataRow dr in sortedTable.Rows)
//// targetTables[i].Rows.Add(dr.ItemArray);
//}
for (int j = 0; j < sourceTables[i].Rows.Count; j++)
{
if (sourceTable.GetValue().ToString() == targetTable.GetValue().ToString())
{
foreach (DataRow dr in sortedTable.Rows)
{
targetTables[i].Rows.Add(dr.ItemArray);
}
else
{
foreach (DataRow dr in sortedTable.Rows)
{
targetTables[i].Rows.Add(dr.ItemArray);
}
// targetTables[i] = sortedTable.Copy(); does not work
//foreach (DataRow drtableOld in sortedTable.Rows)
//{
// targetTables[i].ImportRow(drtableOld);
//}
Instead of replacing the first values it is adding more rows
any help would be appreciated
If any one have problem with duplicate data or the changes are only local and is not effecting the original data table. Remember to always use .ImportRow(dr) function to add rows to the table and if you use Tables[i].Rows.Add(dr.ItemArray); the changes will affect only the local table and not the original one. Use .clear to remove the old rows from the orginal table. The action done directly on the original function will only effect the rows. If it is done on the clone copy changes will nor affect the original table.
Here is the complete code
DataTable sortTable = dv.ToTable();
if (sTable.GetValue().ToString() == tTable.GetValue().ToString())
{
sTables[i].Clear();
foreach (DataRow dr in sortTable.Rows)
{
sTables[i].ImportRow(dr);
}
sTables[i].AcceptChanges();
}

Change order of rows in DataTable in C#

Is it possible to change order of rows in DataTable so for example the one with current index of 5 moves to place with index of 3, etc.?
I have this legacy, messy code where dropdown menu get it's values from DataTable, which get it's values from database. It is impossible to make changes in database, since it has too many columns and entries. My original though was to add new column in db and order by it's values, but this is going to be hard.
So since this is only matter of presentation to user I was thinking to just switch order of rows in that DataTable. Does someone knows the best way to do this in C#?
This is my current code:
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
foreach (DataRow row in result.Rows)
{
m_cboReasonCode.Properties.Items.Add(row["FLOKKUR"].ToString().Trim() + " - " + row["SKYRING"]);
}
For example I want to push row 2011 - Credit previously issued to the top of the DataTable.
SOLUTION:
For those who might have problems with ordering rows in DataTable and working with obsolete technology that doesn't supports Linq this might help:
DataRow firstSelectedRow = result.Rows[6];
DataRow firstNewRow = result.NewRow();
firstNewRow.ItemArray = firstSelectedRow.ItemArray; // copy data
result.Rows.Remove(firstSelectedRow);
result.Rows.InsertAt(firstNewRow, 0);
You have to clone row, remove it and insert it again with a new index. This code moves row with index 6 to first place in the DataTable.
If you really want randomness you could use Guid.NewGuid in LINQ's OrderBy:
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
var randomOrder = result.AsEnumerable().OrderBy(r => Guid.NewGuid());
foreach (DataRow row in randomOrder)
{
// ...
}
If you actually don't want randomness but you want specific values at the top, you can use:
var orderFlokkur2011 = result.AsEnumerable()
.OrderBy(r => r.Field<int>("FLOKKUR") == 2011 ? 0 : 1);
You can use linq to order rows:
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
foreach (DataRow row in result.Rows.OrderBy(x => x.ColumnName))
{
m_cboReasonCode.Properties.Items.Add(row["FLOKKUR"].ToString().Trim() + " - " + row["SKYRING"]);
}
To order by multiple columns:
result.Rows.OrderBy(x => x.ColumnName).ThenBy(x => x.OtherColumnName).ThenBy(x.YetAnotherOne)
To order by a specific value:
result.Rows.OrderBy(x => (x.ColumnName == 2001 or x.ColumnName == 2002) ? 0 : 1).ThenBy(x => x.ColumName)
You can use the above code to "pin" certain rows to the top, if you want more granular than that you can use a switch for example to sort specific values into sorted values of 1, 2, 3, 4 and use a higher number for the rest.
You can not change the order or delete a row in a foreach loop, you should create a new datatable and randomly add the rows to new datatable, you should also track the inserted rows not to duplicate
Use a DataView
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
DateView view = new DateView(result);
view.Sort = "FLOKKUR";
view.Filter = "... you can even apply an in memory filter here ..."
foreach (DataRowView row in view.Rows)
{
....
Every data table comes with a view DefaultView which you can use, this way you can apply the default sorting / filtering in your datalayer.
public DataTable GetMCCHABAKflokka(string tableName, string sort, string filter)
{
var result = GetMCCHABAKflokka(tableName);
result.DefaultView.Sort = sort;
result.DefaultView.Filter = filter;
return result;
}
// use like this
foreach (DataRowView row in result.DefaultView)

DataTable throwing exception on RejectChanges

I found this bug while working with a DataTable.
I added a primary key column to a DataTable, than added one row to that table, removed that row, and added row with the same key to the table. This works. When I tried to call RejectChanges() on it, I got ConstraintException saying that value is already present.
Here is the example:
var dataTable = new DataTable();
var column = new DataColumn("ID", typeof(decimal));
dataTable.Columns.Add(column);
dataTable.PrimaryKey = new [] {column };
decimal id = 1;
var oldRow = dataTable.NewRow();
oldRow[column] = id;
dataTable.Rows.Add(oldRow);
dataTable.AcceptChanges();
oldRow.Delete();
var newRow = dataTable.NewRow();
newRow[column] = id;
dataTable.Rows.Add(newRow);
dataTable.RejectChanges(); // This is where it crashes
I think since the row is deleted, exception should not be thrown (constraint is not violated because row is in deleted state). Is there something I can do about this? Any help is appreciated.
I assume that this has the same cause than following bug issue since the first that will be rejected is your delete action:
DataTable.RejectChanges() should rollback rows in reverse order
Two possible workarounds:
Cycles through the DataRows rolling them back in reverse order. So the
new records are removed before the previous ones are brought back to
life.
DataRowCollection rows = dataTable.Rows;
for (int i = rows.Count - 1; i >= 0; i--)
{
rows[i].RejectChanges();
}
Disables constrains so the rollback can be done. Reenables constrains after that.
You could use LINQ-to-DataSet to define your own "rollback-order":
var rollbackPlan = (from r in dataTable.AsEnumerable()
where r.RowState != DataRowState.Unchanged
let firstOrder = r.RowState==DataRowState.Deleted? 1 : 0
let secondOrder = r.RowState==DataRowState.Added? 1 : 0
orderby firstOrder ascending, secondOrder ascending
select r).ToList();
foreach (DataRow r in rollbackPlan)
{
r.RejectChanges(); // Does not crash anymore
}
Here's the way you "disable" constraints on a DataTable temporarily:
var constraintBackup = dataTable.Constraints.Cast<System.Data.Constraint>().ToList();
dataTable.Constraints.Clear();
dataTable.RejectChanges(); // Does not crash anymore
foreach (System.Data.Constraint c in constraintBackup)
{
dataTable.Constraints.Add(c);
}
You can avoid this by using unique property of column tor true.
i.e. column.Unique = true;
As soon as this property is changed to true, a unique constraint will be created on this column to make sure that values are unique.

Delete row from datatable where first column does not contain (some string)

I have DataTable with the following columns:
ClientID date numberOfTransactions price
ClientID is of type string and I need to ensure that its contents include "A-" and "N6" for every value in the table.
I need to delete all rows from the DataTable where this first column (ClientID) does not contain both "A-" and "N6" (some totals and other unnecessary data). How can I select and delete these rows specifically from the DataTable?
I know this:
foreach (DataRow row in table.Rows) // Loop over the rows.
{
//Here should come part "if first column contains mentioned values
}
I also know this
If (string.Contains("A-") == true && string.Contains("N6") == true)
{
//Do something
}
I need help how to implement this for first column of each row.
Try this:
EDIT: Totally messed up that last line, so if you tried it, try it now that I made it not stupid. =)
List<int> IndicesToRemove = new List<int>();
DataTable table = new DataTable(); //Obviously, your table will already exist at this point
foreach (DataRow row in table.Rows)
{
if (!(row["ClientID"].ToString().Contains("A-") && row["ClientID"].ToString().Contains("N6")))
IndicesToRemove.Add(table.Rows.IndexOf(row));
}
IndicesToRemove.Sort();
for (int i = IndicesToRemove.Count - 1; i >= 0; i--) table.Rows.RemoveAt(IndicesToRemove[i]);
try using this,
assuming dt as your Datatabe object and ClientID as your first column (hence using ItemArray[0])
for(int i=0; i<dt.Rows.Count; i++)
{
temp = dt.Rows[i].ItemArray[0].ToString();
if (System.Text.RegularExpressions.Regex.IsMatch(temp, "A-", System.Text.RegularExpressions.RegexOptions.IgnoreCase) || System.Text.RegularExpressions.Regex.IsMatch(temp, "N6", System.Text.RegularExpressions.RegexOptions.IgnoreCase))
{
dt.Rows.RemoveAt(i);
i--;
}
}
Simple and straight forward solution... hope it helps
this should be more efficient, both in lines of Code and Time, try this :)
for(int x=0; x<table.Rows.Count;)
{
if (!table.Rows[x].ItemArray[0].contains("A-") && !table.Rows[x].ItemArray[0].contains("N6"))
table.Rows.RemoveAt(x);
else x++;
}
Happy Coding
Preface: C.Barlow's existing answer is awesome, this is just another route someone could take.
This is one way to do it where you never have to loop all the way through the original table (by taking advantage of the DataTable.Select() method):
DataTable table = new DataTable(); // This would be your existing DataTable
// Grab only the rows that meet your criteria using the .Select() method
DataRow[] newRows = table.Select("ClientID LIKE '%A-%' AND ClientID LIKE '%N6%'");
// Create a new table with the same schema as your existing one.
DataTable newTable = table.Clone();
foreach (DataRow r in newRows)
{
// Dump the selected rows into the table.
newTable.LoadDataRow(r.ItemArray, true);
}
And now you have a DataTable with only the rows you want. If necessary, at this point you could clear out the original table and replace it with the contents of the new one:
table.Clear();
table = newTable.Copy();
Edit: I thought of a memory optimization last night, you can just overwrite the existing table once you have the rows you need, which avoids the need for the temporary table.
DataTable table = new DataTable(); // This would be your existing DataTable
// Grab only the rows that meet your criteria using the .Select() method
DataRow[] newRows = table.Select("ClientID LIKE '%A-%' AND ClientID LIKE '%N6%'");
// Clear out the old table
table.Clear();
foreach (DataRow r in newRows)
{
// Dump the selected rows into the table.
table.LoadDataRow(r.ItemArray, true);
}

DataTable.Select and Performance Issue in C#

I'm importing the data from three Tab delimited files in the DataTables and after that I need to go thru every row of master table and find all the rows in two child tables. Against each DataRow[] array I found from the child tables, I have to again go thru individually each row and check the values based upon different paramenters and at the end I need to create a final record which will be merger of master and two child table columns.
Now I have done that and its working but the problem is its Performance. I'm using the DataTable.Select to find all child rows from child table which I believe making it very slow.
Please remember the None of the table has any Primary key as the duplicate rows are acceptable.
At the moment I have 1200 rows in master table and aroun 8000 rows in child table and the total time it takes to do that is 8 minutes.
Any idea how can I increase the Performance.
Thanks in advance
The code is below ***************
DataTable rawMasterdt = importMasterFile();
DataTable rawDespdt = importDescriptionFile();
dsHelper = new DataSetHelper();
DataTable distinctdt = new DataTable();
distinctdt = dsHelper.SelectDistinct("DistinctOffers", rawMasterdt, "C1");
if (distinctdt.Rows.Count > 0)
{
int count = 0;
foreach (DataRow offer in distinctdt.Rows)
{
string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
DataRow masterRow = rawMasterdt.Select(exp)[0];
count++;
txtBlock1.Text = "Importing Offer " + count.ToString() + " of " + distinctdt.Rows.Count.ToString();
if (masterRow != null )
{
Product newProduct = new Product();
newProduct.Code = masterRow["C4"].ToString();
newProduct.Name = masterRow["C5"].ToString();
// -----
newProduct.Description = getProductDescription(offer[0].ToString(), rawDespdt);
newProduct.Weight = getProductWeight(offer[0].ToString(), rawDespdt);
newProduct.Price = getProductRetailPrice(offer[0].ToString(), rawDespdt);
newProduct.UnitPrice = getProductUnitPrice(offer[0].ToString(), rawDespdt);
// ------- more functions similar to above here
productList.Add(newProduct);
}
}
txtBlock1.Text = "Import Completed";
public string getProductDescription(string offercode, DataTable dsp)
{
string exp = "((C1 = " + "'" + offercode + "')" + " AND ( C6 = 'c' ))";
DataRow[] dRows = dsp.Select( exp);
string descrip = "";
if (dRows.Length > 0)
{
for (int i = 0; i < dRows.Length - 1; i++)
{
descrip = descrip + " " + dRows[i]["C12"];
}
}
return descrip;
}
.Net 4.5 and the issue is still there.
Here are the results of a simple benchmark where DataTable.Select and different dictionary implementations are compared for CPU time (results are in milliseconds)
#Rows Table.Select Hashtable[] SortedList[] Dictionary[]
1000 43,31 0,01 0,06 0,00
6000 291,73 0,07 0,13 0,01
11000 604,79 0,04 0,16 0,02
16000 914,04 0,05 0,19 0,02
21000 1279,67 0,05 0,19 0,02
26000 1501,90 0,05 0,17 0,02
31000 1738,31 0,07 0,20 0,03
Problem:
The DataTable.Select method creates a "System.Data.Select" class instance internally, and this "Select" class creates indexes based on the fields (columns) specified in the query. The Select class makes re-use of the indexes it had created but the DataTable implementation does not re-use the Select class instance hence the indexes are re-created every time DataTable.Select is invoked. (This behaviour can be observed by decompiling System.Data)
Solution:
Assume the following query
DataRow[] rows = data.Select("COL1 = 'VAL1' AND (COL2 = 'VAL2' OR COL2 IS NULL)");
Instead, create and fill a Dictionary with keys corresponding to the different value combinations of the values of the columns used as the filter. (This relatively expensive operation must be done only once and the dictionary instance must then be re-used)
Dictionary<string, List<DataRow>> di = new Dictionary<string, List<DataRow>>();
foreach (DataRow dr in data.Rows)
{
string key = (dr["COL1"] == DBNull.Value ? "<NULL>" : dr["COL1"]) + "//" + (dr["COL2"] == DBNull.Value ? "<NULL>" : dr["COL2"]);
if (di.ContainsKey(key))
{
di[key].Add(dr);
}
else
{
di.Add(key, new List<DataRow>());
di[key].Add(dr);
}
}
Query the Dictionary (multiple queries may be required) to filter the rows and combine the results into a List
string key1 = "VAL1//VAL2";
string key2 = "VAL1//<NULL>";
List<DataRow>() results = new List<DataRow>();
if (di.ContainsKey(key1))
{
results.AddRange(di[key1]);
}
if (di.ContainsKey(key2))
{
results.AddRange(di[key2]);
}
I know this is an old question, and code underpinning this issue may have changed, but I've recently encountered (and gain some insight into) this very issue.
For anyone coming along at a later date ... here's what I found.
Performance of the DataTable.Select(condition) is quite sensitive to the nature and structure of the 'condition' you provide. This looks like a bug to me (where would I report it to Microsoft?) but it may merely be a quirk.
I've written a set of tests to demonstrate the issue that are structured as follows:
Define a datatable with a few simple columns,like this:
var dataTable = new DataTable();
var idCol = dataTable.Columns.Add("Id", typeof(Int32));
dataTable.Columns.Add("Code", typeof(string));
dataTable.Columns.Add("Name", typeof(string));
dataTable.Columns.Add("FormationDate", typeof(DateTime));
dataTable.Columns.Add("Income", typeof(Decimal));
dataTable.Columns.Add("ChildCount", typeof(Int32));
dataTable.Columns.Add("Foreign", typeof(Boolean));
dataTable.PrimaryKey = new DataColumn[1] { idCol };
Populate the table with 40000 records, each with a unique 'Code' field.
Perform a batch of 'selects' (each with different parameters) against the datatable using two similar, but differently formatted, queries and record and compare the total time taken by each of the two formats.
You get remarkable results. Testing, for example, the below two conditions side-by-side:
Q1: [Code] = 'XX'
Q2: ([Code] = 'XX')
[ I do multiple Select calls using the above two queries, each iteration I replace the XX with a valid code that exists in the datatable ]
The result?
Time comparison for 320 lookups against 40000 records: 180 msec total search time with no brackets, 6871 msec total search time for search WITH brackets
Yes - 38 times slower if you just have the extra brackets surrounding the condition.
There are other scenarios which react differently.
For example,
[Code] = '{searchCode}' OR 1=0 vs ([Code] = '{searchCode}' OR 1=0) take similar (slow) times to execute, but:
[Code] = '{searchCode}' AND 1=1 vs ([Code] = '{searchCode}' AND 1=1) again shows the non-bracketed version to be close to 40 times faster.
I've not investigated all scenarios, but it seems that the introduction of brackets - either redundantly around a simple comparison check, or as required to specify sub-expression precedence - or the presence of an 'OR' slows the query down considerably.
I could speculate that the issue is caused by how the datatable parses the condition you use and how it creates and uses internal indexes ... but I won't.
You can speed it up a lot by using a dictionary. For example:
if (distinctdt.Rows.Count > 0)
{
// build index of C1 values to speed inner loop
Dictionary<string, DataRow> masterIndex = new Dictionary<string, DataRow>();
foreach (DataRow row in rawMasterdt.Rows)
masterIndex[row["C1"].ToString()] = row;
int count = 0;
foreach (DataRow offer in distinctdt.Rows)
{
Then in place of
string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
DataRow masterRow = rawMasterdt.Select(exp)[0];
You would do this
DataRow masterRow;
if (masterIndex.ContainsKey(offer[0].ToString())
masterRow = masterIndex[offer[0].ToString()];
else
masterRow = null;
If you create a DataRelation between your parent and child DataTables, you can look up child rows by invoking DataRow.GetChildRows(DataRelation) on the parent row (resp. DataRow.GetChildRelName in case of typed DataSets). The search will apply a TreeMap lookup, and performance should be fine even with a lot of child rows.
In case you have to search for rows based on other criteria than on a DataRelation's foreign keys, I recommend to use DataView.Sort / DataView.FindRows() instead of DataTable.Select(), as soon as you have to query the data more than once. DataView.FindRows() will be based on TreeMap lookup (O(log(N)), where as DataTable.Select() has to scan all rows (O(N)). This article contains more details: http://arnosoftwaredev.blogspot.com/2011/02/when-datatableselect-is-slow-use.html
DataTables can be made to have Relationships with other DataTables in a DataSet. See http://msdn.microsoft.com/en-us/library/ay82azad%28VS.71%29.aspx for a bit of discussion and as a start point to browsing. I've not much experience of using them but as I understand it they will do what you want (assuming your tables are in a suitable format). I would assume that these have greater efficiency than a manual process of doing the same but I may be wrong. Might be worth seeing if they work for you and benchmarking to see if they are an improvement or not...
Have you ran it through a profiler? That should be the first step. Anyhow, this might help:
Read the master text file into memory line by line. Put the master record into a dictionary as the key. Add it to the dataset (1 pass through master).
Read child text file line by line, add this as a value for the appropriate master record in the dictionary created above
Now you have everything in the dictionary in memory, only doing 1 pass through each file.
Do a final pass through the dictionary/children and process each column and perform final calcs.

Categories

Resources