DataTable.Select and Performance Issue in C#

I'm importing data from three tab-delimited files into DataTables. After that I need to go through every row of the master table and find all the matching rows in the two child tables. For each DataRow[] array I get back from the child tables, I have to go through each row individually, check the values against different parameters, and at the end create a final record that is a merge of the master and the two child tables' columns.
I have this working, but the problem is its performance. I'm using DataTable.Select to find the child rows in the child table, which I believe is what makes it so slow.
Please note that none of the tables has a primary key, as duplicate rows are acceptable.
At the moment I have 1200 rows in the master table and around 8000 rows in the child table, and the total time it takes is 8 minutes.
Any idea how I can improve the performance?
Thanks in advance.
The code is below:
DataTable rawMasterdt = importMasterFile();
DataTable rawDespdt = importDescriptionFile();
dsHelper = new DataSetHelper();
DataTable distinctdt = new DataTable();
distinctdt = dsHelper.SelectDistinct("DistinctOffers", rawMasterdt, "C1");
if (distinctdt.Rows.Count > 0)
{
    int count = 0;
    foreach (DataRow offer in distinctdt.Rows)
    {
        string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
        DataRow masterRow = rawMasterdt.Select(exp)[0];
        count++;
        txtBlock1.Text = "Importing Offer " + count.ToString() + " of " + distinctdt.Rows.Count.ToString();
        if (masterRow != null)
        {
            Product newProduct = new Product();
            newProduct.Code = masterRow["C4"].ToString();
            newProduct.Name = masterRow["C5"].ToString();
            // -----
            newProduct.Description = getProductDescription(offer[0].ToString(), rawDespdt);
            newProduct.Weight = getProductWeight(offer[0].ToString(), rawDespdt);
            newProduct.Price = getProductRetailPrice(offer[0].ToString(), rawDespdt);
            newProduct.UnitPrice = getProductUnitPrice(offer[0].ToString(), rawDespdt);
            // ------- more functions similar to above here
            productList.Add(newProduct);
        }
    }
    txtBlock1.Text = "Import Completed";
}

public string getProductDescription(string offercode, DataTable dsp)
{
    string exp = "((C1 = " + "'" + offercode + "')" + " AND ( C6 = 'c' ))";
    DataRow[] dRows = dsp.Select(exp);
    string descrip = "";
    if (dRows.Length > 0)
    {
        for (int i = 0; i < dRows.Length - 1; i++)
        {
            descrip = descrip + " " + dRows[i]["C12"];
        }
    }
    return descrip;
}

.NET 4.5 and the issue is still there.
Here are the results of a simple benchmark where DataTable.Select and different dictionary implementations are compared for CPU time (results are in milliseconds)
#Rows Table.Select Hashtable[] SortedList[] Dictionary[]
1000 43,31 0,01 0,06 0,00
6000 291,73 0,07 0,13 0,01
11000 604,79 0,04 0,16 0,02
16000 914,04 0,05 0,19 0,02
21000 1279,67 0,05 0,19 0,02
26000 1501,90 0,05 0,17 0,02
31000 1738,31 0,07 0,20 0,03
Problem:
The DataTable.Select method creates a "System.Data.Select" class instance internally, and this "Select" class creates indexes based on the fields (columns) specified in the query. The Select class makes re-use of the indexes it had created but the DataTable implementation does not re-use the Select class instance hence the indexes are re-created every time DataTable.Select is invoked. (This behaviour can be observed by decompiling System.Data)
Solution:
Assume the following query
DataRow[] rows = data.Select("COL1 = 'VAL1' AND (COL2 = 'VAL2' OR COL2 IS NULL)");
Instead, create and fill a Dictionary with keys corresponding to the different combinations of values of the columns used in the filter. (This relatively expensive operation must be done only once, and the dictionary instance must then be re-used.)
Dictionary<string, List<DataRow>> di = new Dictionary<string, List<DataRow>>();
foreach (DataRow dr in data.Rows)
{
    string key = (dr["COL1"] == DBNull.Value ? "<NULL>" : dr["COL1"]) + "//" +
                 (dr["COL2"] == DBNull.Value ? "<NULL>" : dr["COL2"]);
    if (di.ContainsKey(key))
    {
        di[key].Add(dr);
    }
    else
    {
        di.Add(key, new List<DataRow>());
        di[key].Add(dr);
    }
}
Query the Dictionary (multiple queries may be required) to filter the rows and combine the results into a List
string key1 = "VAL1//VAL2";
string key2 = "VAL1//<NULL>";
List<DataRow> results = new List<DataRow>();
if (di.ContainsKey(key1))
{
    results.AddRange(di[key1]);
}
if (di.ContainsKey(key2))
{
    results.AddRange(di[key2]);
}

I know this is an old question, and the code underpinning this issue may have changed, but I've recently encountered (and gained some insight into) this very issue.
For anyone coming along at a later date ... here's what I found.
Performance of the DataTable.Select(condition) is quite sensitive to the nature and structure of the 'condition' you provide. This looks like a bug to me (where would I report it to Microsoft?) but it may merely be a quirk.
I've written a set of tests to demonstrate the issue that are structured as follows:
Define a DataTable with a few simple columns, like this:
var dataTable = new DataTable();
var idCol = dataTable.Columns.Add("Id", typeof(Int32));
dataTable.Columns.Add("Code", typeof(string));
dataTable.Columns.Add("Name", typeof(string));
dataTable.Columns.Add("FormationDate", typeof(DateTime));
dataTable.Columns.Add("Income", typeof(Decimal));
dataTable.Columns.Add("ChildCount", typeof(Int32));
dataTable.Columns.Add("Foreign", typeof(Boolean));
dataTable.PrimaryKey = new DataColumn[1] { idCol };
Populate the table with 40000 records, each with a unique 'Code' field.
Perform a batch of 'selects' (each with different parameters) against the datatable using two similar, but differently formatted, queries and record and compare the total time taken by each of the two formats.
You get remarkable results. Testing, for example, the below two conditions side-by-side:
Q1: [Code] = 'XX'
Q2: ([Code] = 'XX')
[ I do multiple Select calls using the above two queries, each iteration I replace the XX with a valid code that exists in the datatable ]
The result?
Time comparison for 320 lookups against 40000 records: 180 msec total search time with no brackets, 6871 msec total search time for search WITH brackets
Yes - 38 times slower if you just have the extra brackets surrounding the condition.
There are other scenarios which react differently.
For example,
[Code] = '{searchCode}' OR 1=0 vs ([Code] = '{searchCode}' OR 1=0) take similar (slow) times to execute, but:
[Code] = '{searchCode}' AND 1=1 vs ([Code] = '{searchCode}' AND 1=1) again shows the non-bracketed version to be close to 40 times faster.
I've not investigated all scenarios, but it seems that the introduction of brackets - either redundantly around a simple comparison check, or as required to specify sub-expression precedence - or the presence of an 'OR' slows the query down considerably.
I could speculate that the issue is caused by how the datatable parses the condition you use and how it creates and uses internal indexes ... but I won't.
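For reference, here is a minimal sketch of the kind of timing loop described above (the code values, iteration count, and column name are illustrative assumptions; the table is the one defined earlier):
// 320 Code values known to exist in the table (illustrative)
string[] codes = { "C100", "C200", /* ... */ };
var sw = System.Diagnostics.Stopwatch.StartNew();
foreach (string code in codes)
{
    DataRow[] hits = dataTable.Select("[Code] = '" + code + "'");        // no brackets
    // DataRow[] hits = dataTable.Select("([Code] = '" + code + "')");   // same filter WITH brackets
}
sw.Stop();
Console.WriteLine("Total search time: {0} ms", sw.ElapsedMilliseconds);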

You can speed it up a lot by using a dictionary. For example:
if (distinctdt.Rows.Count > 0)
{
    // build index of C1 values to speed the inner loop
    Dictionary<string, DataRow> masterIndex = new Dictionary<string, DataRow>();
    foreach (DataRow row in rawMasterdt.Rows)
        masterIndex[row["C1"].ToString()] = row;

    int count = 0;
    foreach (DataRow offer in distinctdt.Rows)
    {
Then in place of
string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
DataRow masterRow = rawMasterdt.Select(exp)[0];
You would do this
DataRow masterRow;
if (masterIndex.ContainsKey(offer[0].ToString()))
    masterRow = masterIndex[offer[0].ToString()];
else
    masterRow = null;
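As a small refinement with the same behaviour, TryGetValue looks the key up only once:
DataRow masterRow;
masterIndex.TryGetValue(offer[0].ToString(), out masterRow);   // masterRow is left null when the key is missing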

If you create a DataRelation between your parent and child DataTables, you can look up child rows by invoking DataRow.GetChildRows(DataRelation) on the parent row (or the generated GetXxxRows() methods in the case of typed DataSets). The search uses a tree-based index lookup, and performance should be fine even with a lot of child rows.
If you have to search for rows based on criteria other than a DataRelation's foreign keys, I recommend using DataView.Sort / DataView.FindRows() instead of DataTable.Select() as soon as you have to query the data more than once. DataView.FindRows() is based on a tree index lookup (O(log N)), whereas DataTable.Select() has to scan all rows (O(N)). This article contains more details: http://arnosoftwaredev.blogspot.com/2011/02/when-datatableselect-is-slow-use.html
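A minimal sketch of both approaches applied to the tables from the question (C1 and C6 come from the question; the relation name and the offerCode variable are illustrative assumptions):
// Relate master and description tables on C1; createConstraints = false so duplicate keys are allowed
DataSet ds = new DataSet();
ds.Tables.Add(rawMasterdt);
ds.Tables.Add(rawDespdt);
DataRelation rel = new DataRelation("MasterToDesc",
    rawMasterdt.Columns["C1"], rawDespdt.Columns["C1"], false);
ds.Relations.Add(rel);

foreach (DataRow masterRow in rawMasterdt.Rows)
{
    DataRow[] childRows = masterRow.GetChildRows(rel);   // index-backed lookup, no full table scan
    // ... merge masterRow with its childRows ...
}

// For repeated ad-hoc lookups on other columns, sort a DataView once and reuse FindRows():
DataView dv = new DataView(rawDespdt);
dv.Sort = "C1, C6";
DataRowView[] matches = dv.FindRows(new object[] { offerCode, "c" });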

DataTables can be made to have Relationships with other DataTables in a DataSet. See http://msdn.microsoft.com/en-us/library/ay82azad%28VS.71%29.aspx for a bit of discussion and as a start point to browsing. I've not much experience of using them but as I understand it they will do what you want (assuming your tables are in a suitable format). I would assume that these have greater efficiency than a manual process of doing the same but I may be wrong. Might be worth seeing if they work for you and benchmarking to see if they are an improvement or not...

Have you run it through a profiler? That should be the first step. Anyhow, this might help:
Read the master text file into memory line by line. Put the master record into a dictionary as the key. Add it to the dataset (1 pass through master).
Read child text file line by line, add this as a value for the appropriate master record in the dictionary created above
Now you have everything in the dictionary in memory, only doing 1 pass through each file.
Do a final pass through the dictionary/children and process each column and perform final calcs.
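A rough sketch of that idea, assuming tab-delimited files where the offer code (C1) is the first field on each line (the file paths and field position are illustrative assumptions):
var masters = new Dictionary<string, string[]>();
var children = new Dictionary<string, List<string[]>>();

// Pass 1: master file, keyed on the offer code
foreach (string line in System.IO.File.ReadLines(masterPath))
{
    string[] fields = line.Split('\t');
    masters[fields[0]] = fields;   // last duplicate wins in this sketch
}

// Pass 2: child file, attach each line to its master key
foreach (string line in System.IO.File.ReadLines(childPath))
{
    string[] fields = line.Split('\t');
    List<string[]> list;
    if (!children.TryGetValue(fields[0], out list))
        children[fields[0]] = list = new List<string[]>();
    list.Add(fields);
}

// Pass 3: walk the dictionaries once, building each Product from the master fields plus its child lines.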

Related

Checking a DataTable Column for Uniqueness

The method below for determining whether a column in a DataTable is unique seems very slow. Is there a quicker way of checking a particular column for uniqueness, perhaps without having to create a separate view?
var view = new DataView(LoadedDataTable);
var distinctValues = view.ToTable(true, ColID);
if (distinctValues.Rows.Count != LoadedDataTable.Rows.Count)
    MessageBox.Show("Column " + ColID + " is not unique.");
Update:
One thing that may be useful to note is that I don't care about finding the duplicates, just whether the column is unique or not. I.e. if it finds a duplicate it could stop searching. If that makes any difference!
A quick time test for me had this solution as significantly faster (about 10 ms vs 1000 ms) than using a DataView and ToTable with about 15000 rows in the table.
var values = LoadedDataTable.Rows.Cast<DataRow>()
.Select(r => r[ColID])
.Distinct()
.ToList();
var unique = values.Count == LoadedDataTable.Rows.Count;
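If you want the early exit mentioned in the update, a HashSet makes the scan stop at the first duplicate (a sketch along the same lines):
var seen = new HashSet<object>();
bool unique = LoadedDataTable.Rows.Cast<DataRow>()
    .All(r => seen.Add(r[ColID]));   // Add returns false at the first repeated value, so All() stops there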

Change order of rows in DataTable in C#

Is it possible to change the order of rows in a DataTable, so that, for example, the row currently at index 5 moves to index 3, etc.?
I have this legacy, messy code where a dropdown menu gets its values from a DataTable, which gets its values from the database. It is impossible to make changes in the database, since it has too many columns and entries. My original thought was to add a new column in the db and order by its values, but this is going to be hard.
So, since this is only a matter of presentation to the user, I was thinking of just switching the order of rows in that DataTable. Does someone know the best way to do this in C#?
This is my current code:
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
foreach (DataRow row in result.Rows)
{
m_cboReasonCode.Properties.Items.Add(row["FLOKKUR"].ToString().Trim() + " - " + row["SKYRING"]);
}
For example, I want to push the row "2011 - Credit previously issued" to the top of the DataTable.
SOLUTION:
For those who might have problems with ordering rows in a DataTable while working with obsolete technology that doesn't support LINQ, this might help:
DataRow firstSelectedRow = result.Rows[6];
DataRow firstNewRow = result.NewRow();
firstNewRow.ItemArray = firstSelectedRow.ItemArray; // copy data
result.Rows.Remove(firstSelectedRow);
result.Rows.InsertAt(firstNewRow, 0);
You have to clone row, remove it and insert it again with a new index. This code moves row with index 6 to first place in the DataTable.
If you really want randomness you could use Guid.NewGuid in LINQ's OrderBy:
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
var randomOrder = result.AsEnumerable().OrderBy(r => Guid.NewGuid());
foreach (DataRow row in randomOrder)
{
// ...
}
If you actually don't want randomness but you want specific values at the top, you can use:
var orderFlokkur2011 = result.AsEnumerable()
.OrderBy(r => r.Field<int>("FLOKKUR") == 2011 ? 0 : 1);
You can use LINQ to order the rows (this needs AsEnumerable() and a reference to System.Data.DataSetExtensions, since DataRowCollection itself is not directly queryable):
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
foreach (DataRow row in result.AsEnumerable().OrderBy(x => x.Field<string>("ColumnName")))
{
    m_cboReasonCode.Properties.Items.Add(row["FLOKKUR"].ToString().Trim() + " - " + row["SKYRING"]);
}
To order by multiple columns:
result.AsEnumerable()
    .OrderBy(x => x.Field<string>("ColumnName"))
    .ThenBy(x => x.Field<string>("OtherColumnName"))
    .ThenBy(x => x.Field<string>("YetAnotherOne"))
To order by a specific value:
result.AsEnumerable()
    .OrderBy(x => (x.Field<int>("ColumnName") == 2001 || x.Field<int>("ColumnName") == 2002) ? 0 : 1)
    .ThenBy(x => x.Field<int>("ColumnName"))
You can use the above code to "pin" certain rows to the top; if you want something more granular than that, you can use a switch, for example, to map specific values to sort keys 1, 2, 3, 4 and use a higher number for the rest.
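A sketch of the switch idea (FLOKKUR and the pinned values are assumptions based on the question):
var ordered = result.AsEnumerable().OrderBy(r =>
{
    switch (r.Field<int>("FLOKKUR"))
    {
        case 2011: return 0;   // pinned to the top
        case 2012: return 1;
        default:   return 99;  // everything else sorts after the pinned rows
    }
}).ThenBy(r => r.Field<int>("FLOKKUR"));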
You cannot change the order of rows or delete a row while inside a foreach loop over the table. You would have to create a new DataTable and add the rows to it in the desired (or random) order, keeping track of the rows already inserted so you don't duplicate any.
Use a DataView
DataTable result = flokkurDao.GetMCCHABAKflokka("MSCODE");
DataView view = new DataView(result);
view.Sort = "FLOKKUR";
view.RowFilter = "... you can even apply an in-memory filter here ...";
foreach (DataRowView row in view)
{
    ....
Every DataTable comes with a DefaultView, which you can use; this way you can apply the default sorting/filtering in your data layer.
public DataTable GetMCCHABAKflokka(string tableName, string sort, string filter)
{
    var result = GetMCCHABAKflokka(tableName);
    result.DefaultView.Sort = sort;
    result.DefaultView.RowFilter = filter;
    return result;
}
// use like this
foreach (DataRowView row in result.DefaultView)

Filling table takes a lot of time in MS Word

I wrote the following code to add an external data table to another table in an MS Word document. It works fine, but it takes a lot of time when the number of rows is more than 100, and when adding a table with more than 500 rows it fills the Word table really slowly and can't complete the task.
I tried hiding the document and disabling screen updating for the document, but there is still no solution for the slow performance.
// Get the required external data into the DT data table
DataTable DT = XDt.GetData();
Word.Table TB;
int X = 1;
foreach (DataRow Rw in DT.Rows)
{
    Word.Row Rn = TB.Rows.Add(TB.Rows[X + 1]);
    for (int i = 0; i <= DT.Columns.Count - 1; i++)
    {
        Rn.Cells[i + 1].Range.Text = Rw[i].ToString();
    }
    X++;
}
So is there a way to make this process go faster?
The most efficient way to add a table to Word is to first concatenate the data into a delimited text string, where "\n" is the end-of-row (record) separator. The end-of-cell (field) separator can be any character you like that does not occur in the string content that makes up the table.
Assign this string to a Range object, then use the ConvertToTable() method to create the table.
You're retrieving the last row of the current table for the BeforeRow parameter of TB.Rows.Add. This is significantly slower than simply adding the row. You should replace this:
Word.Row Rn = TB.Rows.Add(TB.Rows[X + 1]);
With this:
Word.Row Rn = TB.Rows.Add();
Utilizing parallelization as suggested in the comments might help slightly, but I'm afraid it's not going to do much good, seeing as the table-add code runs on the main thread, as mentioned in this link.
EDIT:
If performance is still an issue, I'd look into creating the Word table independently of the Word object model by using OpenXML. It's orders of magnitude faster.
ConvertToTable method is orders of magnitude faster than adding Rows/Cells one at a time.
while (reader.Read())
{
    values = new object[reader.FieldCount];
    var cols = reader.GetValues(values);
    var item = String.Join("\t", values);
    items.Add(item);
}
data = String.Join("\n", items.ToArray());
var tempDocument = application.Documents.Add();
var range = tempDocument.Range();
range.Text = data;
var tempTable = range.ConvertToTable(Separator: Microsoft.Office.Interop.Word.WdTableFieldSeparator.wdSeparateByTabs,
    NumColumns: reader.FieldCount,
    NumRows: rows,
    DefaultTableBehavior: WdDefaultTableBehavior.wdWord9TableBehavior,
    AutoFitBehavior: WdAutoFitBehavior.wdAutoFitWindow);

Copy DataTable.Rows to DataRow[]. Faster way then Select() with the same functionality?

I have a problem with a very old system of ours. (It is more than 7 years old and I have no budget or resources to make bigger changes to its structure, so the decision is to improve the old logic as much as we can.)
We have our own hand-written grid control. Basically it is like a normal ASP.NET grid: you can add, change, and delete elements.
The problem is that the grid has a BindGrid() method where, for further use, the rows of the data source table are copied into a DataRow[]. I need to keep the DataRow[], but I would like to implement the best way to copy the rows from the table into the array.
The current solution:
DataRow[] rows = DataSource.Select("1=1", SortOrderString);
From what I've seen so far, if I need a specific sort order this could be the best way (though I'm also interested in whether there is a quicker way).
BUT there are some simplified pages where the sort order is not needed.
So I could make two methods, one with the sort order and one without.
The real problem is the second one:
DataRow[] rows = DataSource.Select("1=1");
Because it is very slow. I made some tests and it is roughly 15 times slower than the CopyTo() solution:
DataRow[] rows = new DataRow[DataSource.Rows.Count];
DataSource.Rows.CopyTo(rows,0);
I would like to use the faster way, BUT when I ran the tests some old functions simply crashed. It seems there is another difference, which I only noticed now:
Select() returns the rows as if the row changes had already been accepted.
So if I delete a row and do not call AcceptChanges() (which I unfortunately can't do), then with Select("1=1") the row is still in the DataTable but not in the DataRow[].
With a simple CopyTo() the row is in the array, and that is bad news for me.
My questions are:
1) Is Select("1=1") the best way to get the rows with the row changes applied? (I doubt it a bit, because it is a roughly six-year-old piece of code.)
2) If not, is there a faster way to get the same result as .Select("1=1")?
UPDATE:
Here is a very basic test app that I used for speed testing:
DataTable dt = new DataTable("Test");
dt.Columns.Add("Id", typeof(int));
dt.Columns.Add("Name", typeof(string));
for (int i = 0; i < 10000; i++)
{
    DataRow row = dt.NewRow();
    row["Id"] = i;
    row["Name"] = "Name" + i;
    dt.Rows.Add(row);
}
dt.AcceptChanges();

DateTime start = DateTime.Now;
DataRow[] rows = dt.Select();
/*DataRow[] rows = new DataRow[dt.Rows.Count];
dt.Rows.CopyTo(rows, 0);*/
Console.WriteLine(DateTime.Now - start);
You can call Select without an argument: DataRow[] allRows = DataSource.Select(); That would be for sure more efficient than "1=1" since that applies a pointless RowFilter.
Another way is to use LINQ to DataSet to order and filter the DataTable. That isn't more efficient, but it is more readable and maintainable.
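For example, a sketch (the sort column name is an assumption):
// AsEnumerable() walks the Rows collection, so deleted-but-unaccepted rows are included,
// unlike the default DataTable.Select(); filter on RowState if that matters for you.
DataRow[] rows = DataSource.AsEnumerable()
    .Where(r => r.RowState != DataRowState.Deleted)
    .OrderBy(r => r.Field<string>("Name"))
    .ToArray();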
I have no example or measurement yet, but it is obvious that a RowFilter of "1=1" is more expensive than none. Select is implemented in this way:
public Select(DataTable table, string filterExpression, string sort, DataViewRowState recordStates)
{
    this.table = table;
    this.IndexFields = table.ParseSortString(sort);
    this.indexDesc = Select.ConvertIndexFieldtoIndexDesc(this.IndexFields);
    // the following would be omitted if you used DataSource.Select() without "1=1"
    if (filterExpression != null && filterExpression.Length > 0)
    {
        this.rowFilter = new DataExpression(this.table, filterExpression);
        this.expression = this.rowFilter.ExpressionNode;
    }
    this.recordStates = recordStates;
}
If you also want to select rows whose changes have not yet been accepted (such as deleted rows), you can use this overload of Select:
DataRow[] allRows = DataSource.Select("", "", DataViewRowState.CurrentRows | DataViewRowState.Deleted);
This will select all rows, including rows that have been deleted, even if AcceptChanges has not been called yet.

Compare two DataTables to determine rows in one but not the other

I have two DataTables, A and B, produced from CSV files. I need to be able to check which rows exist in B that do not exist in A.
Is there a way to do some sort of query to show the different rows or would I have to iterate through each row on each DataTable to check if they are the same? The latter option seems to be very intensive if the tables become large.
Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality) - string in this example, which is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)
IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);
would I have to iterate through each row on each DataTable to check if they are the same.
Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.
Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:
1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:
Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
Step through both tables at once, skipping over any gaps in ID's in either table. Matched ID's mean duplicated records.
This allows you to do it in (sort time * 2 ) + one pass, so if my big-O-notation is correct, it'd be (whatever-sort-time) + O(m+n) which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes.)
2: An alternative approach, which may be more or less efficient depending on how big your data is:
Run through table 1 and, for each row, stick its ID (or computed hash code, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
Run through table 2 and, for each row, see if the ID (or hash code etc.) is present in the dictionary. You're exploiting the fact that dictionaries have really fast, O(1), lookup. This step will be really fast, but you'll have paid the price doing all those dictionary inserts.
I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)
You can use the Merge and GetChanges methods on the DataTable to do this:
A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B
The answers so far assume that you're simply looking for duplicate primary keys. That's a pretty easy problem - you can use the Merge() method, for instance.
But I understand your question to mean that you're looking for duplicate DataRows. (From your description of the problem, with both tables being imported from CSV files, I'd even assume that the original rows didn't have primary key values, and that any primary keys are being assigned via AutoNumber during the import.)
The naive implementation (for each row in A, compare its ItemArray with that of each row in B) is indeed going to be computationally expensive.
A much less expensive way to do this is with a hashing algorithm. For each DataRow, concatenate the string values of its columns into a single string, and then call GetHashCode() on that string to get an int value. Create a Dictionary<int, DataRow> that contains an entry, keyed on the hash code, for each DataRow in DataTable B. Then, for each DataRow in DataTable A, calculate the hash code, and see if it's contained in the dictionary. If it's not, you know that the DataRow doesn't exist in DataTable B.
This approach has two weaknesses that both emerge from the fact that two strings can be unequal but produce the same hash code. If you find a row in A whose hash is in the dictionary, you then need to check the DataRow in the dictionary to verify that the two rows are really equal.
The second weakness is more serious: it's unlikely, but possible, that two different DataRows in B could hash to the same key value. For this reason, the dictionary should really be a Dictionary<int, List<DataRow>>, and you should perform the check described in the previous paragraph against each DataRow in the list.
It takes a fair amount of work to get this working, but it's an O(m+n) algorithm, which I think is going to be as good as it gets.
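A sketch of that hashing approach (the key format and delimiter are arbitrary illustrative choices):
// Build a key string from all column values of a row.
Func<DataRow, string> rowKey = r =>
    string.Join("\u0001", r.ItemArray.Select(v => v == DBNull.Value ? "" : v.ToString()));

// Index table B by hash code, keeping a list per hash to cope with collisions.
var index = new Dictionary<int, List<DataRow>>();
foreach (DataRow rowB in tableB.Rows)
{
    int hash = rowKey(rowB).GetHashCode();
    List<DataRow> bucket;
    if (!index.TryGetValue(hash, out bucket))
        index[hash] = bucket = new List<DataRow>();
    bucket.Add(rowB);
}

// A row of A is missing from B when no row in its hash bucket has the same full key.
var onlyInA = new List<DataRow>();
foreach (DataRow rowA in tableA.Rows)
{
    string key = rowKey(rowA);
    List<DataRow> bucket;
    if (!index.TryGetValue(key.GetHashCode(), out bucket) || !bucket.Any(b => rowKey(b) == key))
        onlyInA.Add(rowA);
}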
Just FYI:
Generally speaking about algorithms, comparing two sets of sortable items (as IDs typically are) is not an O(M*N/2) operation but O(M+N) if the two sets are ordered. So you scan one table with a pointer to the start of the other, and:
other_item = A.first()
only_in_B = empty_list()
for item in B:
    # advance A until it catches up with the current B item
    while other_item < item:
        other_item = A.next()
        if A.eof():
            only_in_B.add(all the remaining B items)
            return only_in_B
    # A has skipped past item, so it exists only in B
    if item < other_item:
        only_in_B.append(item)
return only_in_B
The code above is obviously pseudocode, but should give you the general gist if you decide to code it yourself.
Thanks for all the feedback.
I do not have any indexes, unfortunately. I will give a little more information about my situation.
We have a reporting program (it replaced Crystal Reports) that is installed on 7 servers across the EU. These servers have many reports on them (not all the same for each country). They are invoked by a command-line application that uses XML files for its configuration, so one XML file can call multiple reports.
The commandline application is scheduled and controlled by our overnight process. So the XML file could be called from multiple places.
The goal of the CSV is to produce a list of all the reports that are being used and where they are being called from.
I am going through the XML files for all references, querying the scheduling program and producing a list of all the reports. (this is not too bad).
The problem I have is that I have to keep a list of all the reports that might have been removed from production, so I need to compare the old CSV with the new data. For this I thought it best to put it into DataTables and compare the information (this could be the wrong approach; I suppose I could create an object that holds the data, compute the differences, and then iterate through them).
The data I have about each report is as follows:
String - Task Name
String - Action Name
Int - ActionID (the Action ID can be in multiple records as a single action can call many reports, i.e. an XML file).
String - XML File called
String - Report Name
I will try the Merge idea given by MusiGenesis (thanks). (Rereading some of the posts, I'm not sure the Merge will work, but it's worth trying as I had not heard about it before, so it's something new to learn.)
The HashCode Idea sounds interesting as well.
Thanks for all the advice.
I found an easy way to solve this. Unlike the previous "Except method" answers, I use the Except method twice. This not only tells you what rows were deleted but also what rows were added. If you only use one Except call, it will only tell you one difference and not both. This code is tested and works. See below.
//Pass in your two datatables into your method
//build the queries based on id.
var qry1 = datatable1.AsEnumerable().Select(a => new { ID = a["ID"].ToString() });
var qry2 = datatable2.AsEnumerable().Select(b => new { ID = b["ID"].ToString() });
//detect row deletes - a row is in datatable1 except missing from datatable2
var exceptAB = qry1.Except(qry2);
//detect row inserts - a row is in datatable2 except missing from datatable1
var exceptAB2 = qry2.Except(qry1);
then execute your code against the results
if (exceptAB.Any())
{
    foreach (var id in exceptAB)
    {
        // execute code here
    }
}
if (exceptAB2.Any())
{
    foreach (var id in exceptAB2)
    {
        // execute code here
    }
}
Could you not simply compare the CSV files before loading them into DataTables?
string[] a = System.IO.File.ReadAllLines(@"csv_a.txt");
string[] b = System.IO.File.ReadAllLines(@"csv_b.txt");
// get the lines from b that are not in a
IEnumerable<string> diff = b.Except(a);
//... parse b into DataTable ...
public DataTable compareDataTables(DataTable First, DataTable Second)
{
    First.TableName = "FirstTable";
    Second.TableName = "SecondTable";
    // Create empty result table
    DataTable table = new DataTable("Difference");
    try
    {
        // Must use a DataSet to make use of a DataRelation object
        using (DataSet ds4 = new DataSet())
        {
            // Add tables
            ds4.Tables.AddRange(new DataTable[] { First.Copy(), Second.Copy() });

            // Get columns for the DataRelation
            DataColumn[] firstcolumns = new DataColumn[ds4.Tables[0].Columns.Count];
            for (int i = 0; i < firstcolumns.Length; i++)
            {
                firstcolumns[i] = ds4.Tables[0].Columns[i];
            }
            DataColumn[] secondcolumns = new DataColumn[ds4.Tables[1].Columns.Count];
            for (int i = 0; i < secondcolumns.Length; i++)
            {
                secondcolumns[i] = ds4.Tables[1].Columns[i];
            }

            // Create the DataRelation
            DataRelation r = new DataRelation(string.Empty, firstcolumns, secondcolumns, false);
            ds4.Relations.Add(r);

            // Create columns for the return table
            for (int i = 0; i < First.Columns.Count; i++)
            {
                table.Columns.Add(First.Columns[i].ColumnName, First.Columns[i].DataType);
            }

            // If a row of First has no child rows in Second, add it to the return table.
            table.BeginLoadData();
            foreach (DataRow parentrow in ds4.Tables[0].Rows)
            {
                DataRow[] childrows = parentrow.GetChildRows(r);
                if (childrows == null || childrows.Length == 0)
                    table.LoadDataRow(parentrow.ItemArray, true);
            }
            table.EndLoadData();
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
    return table;
}
try
{
    if (ds.Tables[0].Columns.Count == ds1.Tables[0].Columns.Count)
    {
        for (int i = 0; i < ds.Tables[0].Rows.Count; i++)
        {
            for (int j = 0; j < ds.Tables[0].Columns.Count; j++)
            {
                if (ds.Tables[0].Rows[i][j].ToString() != ds1.Tables[0].Rows[i][j].ToString())
                {
                    MessageBox.Show(i.ToString() + "," + j.ToString());
                }
            }
        }
    }
    else
    {
        MessageBox.Show("Table has different columns");
    }
}
catch (Exception)
{
    MessageBox.Show("Please select the table");
}
I'm continuing tzot's idea ...
If you have two sortable sets, then you can just use:
List<string> diffList = new List<string>(sortedListA.Except(sortedListB));
If you need more complicated objects, you can define an equality comparer yourself and still use Except.
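For example, a minimal comparer sketch, assuming the rows should be matched on an "ID" column:
class RowByIdComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y) { return object.Equals(x["ID"], y["ID"]); }
    public int GetHashCode(DataRow r) { return r["ID"] == DBNull.Value ? 0 : r["ID"].GetHashCode(); }
}

// Rows of A whose ID does not appear in B:
var onlyInA = tableA.AsEnumerable()
    .Except(tableB.AsEnumerable(), new RowByIdComparer())
    .ToList();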
The usual usage scenario considers a user that has a DataTable in hand and changes it by Adding, Deleting or Modifying some of the DataRows.
After the changes are performed, the DataTable is aware of the proper DataRowState for each row, and also keeps track of the Original DataRowVersion for any rows that were changed.
In this usual scenario, one can Merge the changes back into a source table (in which all rows are Unchanged). After merging, one can get a nice summary of only the changed rows with a call to GetChanges().
In a more unusual scenario, a user has two DataTables with the same schema (or perhaps only the same columns and lacking primary keys). These two DataTables consist of only Unchanged rows. The user may want to find out what changes he needs to apply to one of the two tables in order to get to the other one; that is, which rows need to be Added, Deleted, or Modified.
We define here a function called GetDelta() which does the job:
using System;
using System.Data;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
// (the LINQ to DataSet extension methods used below live in the System.Data namespace; reference System.Data.DataSetExtensions.dll)
public class Program
{
private static DataTable GetDelta(DataTable table1, DataTable table2)
{
// Modified2 : row1 keys match rowOther keys AND row1 does not match row2:
IEnumerable<DataRow> modified2 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row2);
// Modified1 :
IEnumerable<DataRow> modified1 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row1);
// Added : row2 not in table1 AND row2 not in modified2
IEnumerable<DataRow> added = table2.AsEnumerable().Except(modified2, DataRowComparer.Default).Except(table1.AsEnumerable(), DataRowComparer.Default);
// Deleted : row1 not in row2 AND row1 not in modified1
IEnumerable<DataRow> deleted = table1.AsEnumerable().Except(modified1, DataRowComparer.Default).Except(table2.AsEnumerable(), DataRowComparer.Default);
Console.WriteLine();
Console.WriteLine("modified count =" + modified1.Count());
Console.WriteLine("added count =" + added.Count());
Console.WriteLine("deleted count =" + deleted.Count());
DataTable deltas = table1.Clone();
foreach (DataRow row in modified2)
{
// Match the unmodified version of the row via the PrimaryKey
DataRow matchIn1 = modified1.Where(row1 => table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row[keycol.Ordinal]))).First();
DataRow newRow = deltas.NewRow();
// Set the row with the original values
foreach(DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = matchIn1[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
// Set the modified values
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
// At this point newRow.DataRowState should be : Modified
}
foreach (DataRow row in added)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
// At this point newRow.DataRowState should be : Added
}
foreach (DataRow row in deleted)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
newRow.Delete();
// At this point newRow.DataRowState should be : Deleted
}
return deltas;
}
private static void DemonstrateGetDelta()
{
DataTable table1 = new DataTable("Items");
// Add columns
DataColumn column1 = new DataColumn("id1", typeof(System.Int32));
DataColumn column2 = new DataColumn("id2", typeof(System.Int32));
DataColumn column3 = new DataColumn("item", typeof(System.Int32));
table1.Columns.Add(column1);
table1.Columns.Add(column2);
table1.Columns.Add(column3);
// Set the primary key column.
table1.PrimaryKey = new DataColumn[] { column1, column2 };
// Add some rows.
DataRow row;
for (int i = 0; i <= 4; i++)
{
row = table1.NewRow();
row["id1"] = i;
row["id2"] = i*i;
row["item"] = i;
table1.Rows.Add(row);
}
// Accept changes.
table1.AcceptChanges();
PrintValues(table1, "table1:");
// Create a second DataTable identical to the first.
DataTable table2 = table1.Clone();
// Add a row that exists in table1:
row = table2.NewRow();
row["id1"] = 0;
row["id2"] = 0;
row["item"] = 0;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 1;
row["id2"] = 1;
row["item"] = 455;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 2;
row["id2"] = 4;
row["item"] = 555;
table2.Rows.Add(row);
// Add a row that does not exist in table1:
row = table2.NewRow();
row["id1"] = 13;
row["id2"] = 169;
row["item"] = 655;
table2.Rows.Add(row);
table2.AcceptChanges();
Console.WriteLine();
PrintValues(table2, "table2:");
DataTable delta = GetDelta(table1,table2);
Console.WriteLine();
PrintValues(delta,"delta:");
// Verify that the deltas DataTable contains the adequate Original DataRowVersions:
DataTable originals = table1.Clone();
foreach (DataRow drow in delta.Rows)
{
if (drow.RowState != DataRowState.Added)
{
DataRow originalRow = originals.NewRow();
foreach (DataColumn dc in originals.Columns)
originalRow[dc.ColumnName] = drow[dc.ColumnName, DataRowVersion.Original];
originals.Rows.Add(originalRow);
}
}
originals.AcceptChanges();
Console.WriteLine();
PrintValues(originals,"delta original values:");
}
private static void Row_Changed(object sender,
DataRowChangeEventArgs e)
{
Console.WriteLine("Row changed {0}\t{1}",
e.Action, e.Row.ItemArray[0]);
}
private static void PrintValues(DataTable table, string label)
{
// Display the values in the supplied DataTable:
Console.WriteLine(label);
foreach (DataRow row in table.Rows)
{
foreach (DataColumn col in table.Columns)
{
Console.Write("\t " + row[col, row.RowState == DataRowState.Deleted ? DataRowVersion.Original : DataRowVersion.Current].ToString());
}
Console.Write("\t DataRowState =" + row.RowState);
Console.WriteLine();
}
}
public static void Main()
{
DemonstrateGetDelta();
}
}
The code above can be tested in https://dotnetfiddle.net/. The resulting output is shown below:
table1:
0 0 0 DataRowState =Unchanged
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
table2:
0 0 0 DataRowState =Unchanged
1 1 455 DataRowState =Unchanged
2 4 555 DataRowState =Unchanged
13 169 655 DataRowState =Unchanged
modified count =2
added count =1
deleted count =2
delta:
1 1 455 DataRowState =Modified
2 4 555 DataRowState =Modified
13 169 655 DataRowState =Added
3 9 3 DataRowState =Deleted
4 16 4 DataRowState =Deleted
delta original values:
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
Note that if your tables don't have a PrimaryKey, the where clause in the LINQ queries gets simplified a little bit. I'll let you figure that out on your own.
You can achieve it simply using LINQ:
private DataTable CompareDT(DataTable TableA, DataTable TableB)
{
    DataTable TableC = new DataTable();
    try
    {
        var idsNotInB = TableA.AsEnumerable().Select(r => r.Field<string>(Keyfield))
            .Except(TableB.AsEnumerable().Select(r => r.Field<string>(Keyfield)));

        TableC = (from row in TableA.AsEnumerable()
                  join id in idsNotInB
                      on row.Field<string>(ddlColumn.SelectedItem.ToString()) equals id
                  select row).CopyToDataTable();
    }
    catch (Exception ex)
    {
        lblresult.Text = ex.Message;
        ex = null;
    }
    return TableC;
}
