DataTable 1 :-
Caption |ID
------------
Caption1|1
Caption2|2
Caption3|3
DataTable 2 :-
Name |ID
------------
Name1|1
Name2|2
I want to compare the above two data tables and fetch the value "Caption3" so I can display a message on screen saying "No name for "Caption3" exists!"
I have tried merging as follows, but it just fetches DataTable 2 as-is into dtTemp!
datatable1.Merge(datatable2);
DataTable dtTemp = datatable2.GetChanges();
I also tried the following logic, which removes rows with the same ID from both tables so that only the rows without duplicated IDs remain. This didn't work either. :(
if (datatable2.Rows.Count != datatable1.Rows.Count)
{
    if (datatable2.Rows.Count != 0)
    {
        for (int k = 0; k < datatable2.Rows.Count; k++)
        {
            for (int j = 0; j < datatable1.Rows.Count; j++)
            {
                if (datatable1.Rows[j]["ID"].ToString() == datatable2.Rows[k]["ID"].ToString())
                {
                    datatable1.Rows.Remove(datatable1.Rows[j]);
                    datatable1.AcceptChanges();
                }
                // string test = datatable1.Rows[0]["ID"].ToString();
            }
        }
    }
}
How do I fetch those "CAPTIONS" whose corresponding "NAMES" do not exist? Please help, thanks.
Note:- Rows in both datatables will vary based on some logic. What I want is to fetch the CAPTION from datatable1 whose KCID doesn't exist in datatable2.
Edit:-
How else can I loop through datatable1's rows, check which IDs (from datatable1) don't exist in datatable2, and then print those captions on my page?
@CodeInChaos: I have not worked with LINQ-to-Objects at all, so I am not able to understand your code :/
Is there any other way to loop through the datatable and fetch the "Caption" whose corresponding "Name" doesn't exist in datatable2?
Someone please help me out with this. I am clueless how else to loop through the datatable rows if not like above.
Note that this code uses LINQ-to-Objects and thus needs the whole tables in your application. There might be a better solution that works on the database server. But unlike your code, it is linear in the size of the tables.
HashSet<int> ids2 = new HashSet<int>(Table2.Select(e => e.ID));
var entriesOnlyInTable1 = Table1.Where(e => !ids2.Contains(e.ID));
IEnumerable<string> captionsOnlyInTable1 = entriesOnlyInTable1.Select(e => e.Caption);
Without LINQ:
HashSet<int> ids2 = new HashSet<int>();
foreach (var e in Table2)
{
    ids2.Add(e.ID);
}

List<string> captionsOnlyInTable1 = new List<string>();
foreach (var e in Table1)
{
    if (!ids2.Contains(e.ID))
        captionsOnlyInTable1.Add(e.Caption);
}
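Since the question works with untyped DataTables rather than typed collections, here is a sketch of the same HashSet idea written directly against DataRow objects (assuming the Caption and ID column names from the sample tables; IDs are compared as strings to avoid type assumptions):
// Collect every ID that exists in datatable2 (the Name table)
HashSet<string> idsInTable2 = new HashSet<string>();
foreach (DataRow row in datatable2.Rows)
{
    idsInTable2.Add(row["ID"].ToString());
}

// Any caption whose ID is missing from datatable2 has no corresponding name
List<string> captionsWithoutNames = new List<string>();
foreach (DataRow row in datatable1.Rows)
{
    if (!idsInTable2.Contains(row["ID"].ToString()))
    {
        captionsWithoutNames.Add(row["Caption"].ToString());
    }
}
// For the sample data, captionsWithoutNames now contains "Caption3"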
Related
I have C# DataTables with very large numbers of rows, and in my importer app I must query these hundreds of thousands of times in a given import. So I'm trying to find the fastest possible way to search. Thus far I am puzzling over very strange results. First, here are 2 different approaches I have been experimenting with:
APPROACH #1
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
        return dt.Select($"{keyColumn} = '{SafeTrim(keyValue)}'").Count() > 0;
    else
        return false;
}
APPROACH #2
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
    {
        int counter = dt.AsEnumerable().Where(r => string.Equals(SafeTrim(r[keyColumn]), keyValue, StringComparison.CurrentCultureIgnoreCase)).Count();
        return counter > 0;
    }
    else
        return false;
}
In a mock test I run each method 15,000 times, handing in hardcoded data. This is apples-to-apples, a fair test. Approach #1 is dramatically faster. But in actual app execution, Approach #1 is dramatically slower.
Why the counterintuitive results? Is there some other faster way to query datatables that I haven't tried?
EDIT: The reason I use datatables as opposed to other types of collections is because all my datasources are either MySQL tables or CSV files, so datatables seemed like a logical choice. Some of these tables contain 10+ columns, so different types of collections seemed an awkward match.
If you want faster access and still want to stick to DataTables, use a dictionary to store the row numbers for given keys. Here I assume that each key is unique in the DataTable. If not, you would have to use a Dictionary<string, List<int>> or Dictionary<string, HashSet<int>> to store the indexes.
var indexes = new Dictionary<string, int>();
for (int i = 0; i < dt.Rows.Count; i++)
{
    indexes.Add((string)dt.Rows[i][keyColumn], i);
}
Now you can access a row in a super fast way with
var row = dt.Rows[indexes[theKey]];
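As a usage sketch (assuming the indexes dictionary above): indexes[theKey] throws a KeyNotFoundException when the key is absent, so an existence check in the spirit of the question's DoesRecordExist is safer with TryGetValue:
if (indexes.TryGetValue(keyValue, out int rowIndex))
{
    DataRow row = dt.Rows[rowIndex]; // O(1) lookup instead of scanning the table
    // work with row...
}
else
{
    // no row with that key exists
}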
I have a very similar issue except that I need the actual First Occurrence of a matching row.
Using the .Select.FirstOrDefault (Approach 1) takes 38 minutes to run.
Using the .Where.FirstOrDefault (Approach 2) takes 6 minutes to run.
In a similar situation where I didn't need the FirstOrDefault, but just needed to find and work with the uniquely matching record, what I found to be the fastest by far is to use a Hashtable where the key is the combined values of the columns you are trying to match, and the value is the DataRow itself. Finding a match is near instant.
The Function is
public Hashtable ConvertToLookup(DataTable myDataTable, params string[] pKeyFieldNames)
{
    Hashtable myLookup = new Hashtable(StringComparer.InvariantCultureIgnoreCase); //Makes the key case-insensitive
    foreach (DataRow myRecord in myDataTable.Rows)
    {
        string myHashKey = "";
        foreach (string strKeyFieldName in pKeyFieldNames)
        {
            myHashKey += Convert.ToString(myRecord[strKeyFieldName]).Trim();
        }
        if (myLookup.ContainsKey(myHashKey) == false)
        {
            myLookup.Add(myHashKey, myRecord);
        }
    }
    return myLookup;
}
The usage is...
//Build the lookup table
Hashtable myLookUp = ConvertToLookup(myDataTable, "Col1Name", "Col2Name");

//Use it - note that mySearchForValue must be the trimmed key column values concatenated together
if (myLookUp.ContainsKey(mySearchForValue) == true)
{
    DataRow myRecord = (DataRow)myLookUp[mySearchForValue];
}
BINGO! I wanted to share this as a different answer because my previous one might suit a slightly different approach. In this scenario, I was able to go from 8 MINUTES down to 6 SECONDS, using neither of the two approaches...
Again, the key is a Hashtable, or in my case a Dictionary because I had multiple records. To recap: I needed to delete one row from my DataTable for every matching record I found in another DataTable, with the goal that in the end my first DataTable only contained the "missing" records.
This uses a different function...
// -----------------------------------------------------------
// Creates a Dictionary with Grouping Counts from a DataTable
public Dictionary<string, Int32> GroupBy(DataTable myDataTable, params string[] pGroupByFieldNames)
{
    Dictionary<string, Int32> myGroupBy = new Dictionary<string, Int32>(StringComparer.InvariantCultureIgnoreCase); //Makes the key case-insensitive
    foreach (DataRow myRecord in myDataTable.Rows)
    {
        string myKey = "";
        foreach (string strGroupFieldName in pGroupByFieldNames)
        {
            myKey += Convert.ToString(myRecord[strGroupFieldName]).Trim();
        }
        if (myGroupBy.ContainsKey(myKey) == false)
        {
            myGroupBy.Add(myKey, 1);
        }
        else
        {
            myGroupBy[myKey] += 1;
        }
    }
    return myGroupBy;
}
Now... say you have a table of records that you want to use as the "match values", based on Col1 and Col2:
Dictionary<string, Int32> myQuickLookUpCount = GroupBy(myMatchTable, "Col1", "Col2");
And now the magic. We loop through your primary table and remove one instance of a record for each instance in the matching table. This is the part that took 8 minutes with Approach #2, or 38 minutes with Approach #1... but now takes only seconds.
myDataTable.AcceptChanges(); //Trick that allows us to delete during a foreach!
foreach (DataRow myDataRow in myDataTable.Rows)
{
    //Grab the key values
    string strKey1Value = Convert.ToString(myDataRow["Col1"]);
    string strKey2Value = Convert.ToString(myDataRow["Col2"]);
    if (myQuickLookUpCount.TryGetValue(strKey1Value + strKey2Value, out Int32 intTotalCount) == true && intTotalCount > 0)
    {
        myDataRow.Delete(); //Mark our row for deletion
        myQuickLookUpCount[strKey1Value + strKey2Value] -= 1; //Decrement our counter
    }
}
myDataTable.AcceptChanges(); //Commits our changes and actually deletes the rows.
I have a small program where you can select some database tables and create an Excel file with all values for each table, and this is my solution for creating the Excel file.
foreach (var selectedDatabase in this.lstSourceDatabaseTables.SelectedItems)
{
    //creates a new worksheet for each selected table
    foreach (TableRetrieverItem databaseTable in tableItems.FindAll(e => e.TableName.Equals(selectedDatabase)))
    {
        _xlWorksheet = (Excel.Worksheet)xlApp.Worksheets.Add();
        _xlWorksheet.Name = databaseTable.TableName.Length > 31 ? databaseTable.TableName.Substring(0, 31) : databaseTable.TableName;
        _xlWorksheet.Cells[1, 1] = string.Format("{0}.{1}", databaseTable.TableOwner, databaseTable.TableName);
        ColumnRetriever retrieveColumn = new ColumnRetriever(SourceConnectionString);
        IEnumerable<ColumnRetrieverItem> dbColumns = retrieveColumn.RetrieveColumns(databaseTable.TableName);
        var results = retrieveColumn.GetValues(databaseTable.TableName);
        int i = 1;
        //results.Item3 is a List<List<string>> that contains all values from the table;
        //each row is inserted as a new inner list
        for (int j = 0; j < results.Item3.Count(); j++)
        {
            int tmp = 1;
            foreach (var value in results.Item3[j])
            {
                _xlWorksheet.Cells[j + 3, tmp] = value;
                tmp++;
            }
        }
    }
}
It works, but when a table has 5,000 or more values it takes a very long time.
Does anyone know a better solution for writing the List<List<string>> row by row than my for/foreach approach?
I use the GetExcelColumnName function in my code sample to convert from a column count to the Excel column name.
The whole idea is that it's very slow to write Excel cells one by one. So instead, precompute the whole table of values and then assign the result in a single operation. To assign values to a two-dimensional range, use a two-dimensional array of values:
var rows = results.Item3.Count;
var cols = results.Item3.Max(x => x.Count);
object[,] values = new object[rows, cols];
// TODO: initialize values from results content

// get the appropriate range (w is the target worksheet, _xlWorksheet in your code)
Range range = w.Range["A3", GetExcelColumnName(cols) + (rows + 2)];
// assign all values at once
range.Value = values;
Maybe you need to change some details about the index ranges used; I can't test my code right now.
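GetExcelColumnName is referenced above but not shown; a minimal sketch of that helper (the standard base-26 conversion, my own version rather than the answerer's) together with one way to fill in the TODO from results.Item3 might look like this:
static string GetExcelColumnName(int columnNumber)
{
    // 1 -> "A", 26 -> "Z", 27 -> "AA", ...
    string columnName = "";
    while (columnNumber > 0)
    {
        int remainder = (columnNumber - 1) % 26;
        columnName = (char)('A' + remainder) + columnName;
        columnNumber = (columnNumber - 1) / 26;
    }
    return columnName;
}

// Fill the values array from results.Item3 (the List<List<string>> of rows)
for (int r = 0; r < rows; r++)
{
    var rowValues = results.Item3[r];
    for (int c = 0; c < rowValues.Count; c++)
    {
        values[r, c] = rowValues[c];
    }
}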
As I see it, you didn't do any profiling. I recommend profiling first (for example with dotTrace) to see which parts of your code actually cause the performance issues.
In my practice there are rare cases (almost none) where code executes slower than database requests, even if the code is really awful in algorithmic terms.
First, I recommend filling your Excel sheet not by columns but by rows. If your table has many columns, going column by column causes multiple round trips to the database, which has a large impact on performance.
Second, write to Excel in batches, by rows. Think of Excel files as mini-databases, with the same "a batch is faster than one by one" principle.
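To illustrate the row-batch idea with the question's code: instead of one COM call per cell, each row can be written with a single Range assignment. A sketch, reusing _xlWorksheet and results from the question and the GetExcelColumnName helper sketched above:
for (int j = 0; j < results.Item3.Count; j++)
{
    var rowValues = results.Item3[j];
    object[,] rowArray = new object[1, rowValues.Count];
    for (int c = 0; c < rowValues.Count; c++)
    {
        rowArray[0, c] = rowValues[c];
    }
    // One COM call writes the whole row starting at column A, row j + 3
    Excel.Range rowRange = _xlWorksheet.Range["A" + (j + 3), GetExcelColumnName(rowValues.Count) + (j + 3)];
    rowRange.Value = rowArray;
}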
I'm back to haunt your dreams! I'm working on comparing some values in a complex loop. List 1 is a list of questions/answers; List 2 is also a list of questions/answers. I want to compare List 1 to List 2 and have duplicates removed from List 1 before merging it with List 2. My problem is that with my current seed data, two items in List 1 match against List 2, but only one is removed instead of both.
I've been at this a couple of days and my head is ready to explode, so I hope I can find some help!
Here's code for you:
//Fetching questions/answers which do not have an attempt
//Get questions, which automatically pull associated answers thanks to the model
List<QuizQuestions> notTriedQuestions = await db.QuizQuestions.Where(x => x.QuizID == report.QuizHeader.QuizID).ToListAsync();

//Compare to existing attempt data and remove duplicate questions
int i = 0;
while (i < notTriedQuestions.Count)
{
    var originalAnswersCount = notTriedQuestions.ElementAt(i).QuizAnswers.Count;
    int j = 0;
    while (j < originalAnswersCount)
    {
        var comparedID = notTriedQuestions.ElementAt(i).QuizAnswers.ElementAt(j).AnswerID;
        if (report.QuizHeader.QuizQuestions.Any(item => item.QuizAnswers.Any(x => x.AnswerID == comparedID)))
        {
            notTriedQuestions.RemoveAt(i);
            //Trip the while condition to break out of the inner loop, otherwise we end up in a catch
            j = originalAnswersCount;
        }
        else
        {
            j++;
        }
    }
    i++;
}

//Add filtered list to master list
foreach (var item in notTriedQuestions)
{
    report.QuizQuestions.Add(item);
}
Try List.Union; it is meant for exactly this sort of thing.
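For what it's worth, Union compares elements with the type's default equality (reference equality for a class like QuizQuestions), so a custom comparer is needed for it to actually remove the duplicates. A sketch, where QuestionID is a hypothetical identifying property of the question's model:
class QuizQuestionComparer : IEqualityComparer<QuizQuestions>
{
    // QuestionID is assumed here; substitute whatever uniquely identifies a question
    public bool Equals(QuizQuestions a, QuizQuestions b) => a.QuestionID == b.QuestionID;
    public int GetHashCode(QuizQuestions q) => q.QuestionID.GetHashCode();
}

// Union keeps the first occurrence of each question, so the duplicates
// in notTriedQuestions are dropped in favor of the attempted ones
var merged = report.QuizHeader.QuizQuestions
    .Union(notTriedQuestions, new QuizQuestionComparer())
    .ToList();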
I want to find all rows in a DataTable where each of a group of columns is a duplicate. My current idea is to get a list of indexes of all rows that appear more than once as follows:
public List<int> findDuplicates_New()
{
    string[] duplicateCheckFields = { "Name", "City" };
    List<int> duplicates = new List<int>();
    List<string> rowStrs = new List<string>();
    string rowStr;

    //convert each datarow to a delimited string and add it to list rowStrs
    foreach (DataRow dr in submissionsList.Rows)
    {
        rowStr = string.Empty;
        foreach (DataColumn dc in submissionsList.Columns)
        {
            //only use the duplicateCheckFields in the string
            if (duplicateCheckFields.Contains(dc.ColumnName))
            {
                rowStr += dr[dc].ToString() + "|";
            }
        }
        rowStrs.Add(rowStr);
    }

    //count how many of each row string are in the list
    //add the string's index (which will match the row's index)
    //to the duplicates list if more than 1
    for (int c = 0; c < rowStrs.Count; c++)
    {
        if (rowStrs.Count(str => str == rowStrs[c]) > 1)
        {
            duplicates.Add(c);
        }
    }
    return duplicates;
}
However, this isn't very efficient: it's O(n^2) to go through the list of strings and get the count of each string. I looked at this solution but couldn't figure out how to use it with more than 1 field. I'm looking for a less expensive way to handle this problem.
Try this:
How can I check for an exact match in a table where each row has 70+ columns?
The essence is to build a set where you store hashes for the rows and only do comparisons between rows with colliding hashes; the complexity will be O(n).
...
If you have a large number of rows and storing the hashes themselves is an issue (an unlikely case, but still...) you can use a Bloom filter. The core idea of a Bloom filter is to calculate several different hashes of each row and use them as an address in a bitmap. As you're scanning through the rows you can double-check the rows that already have all the bits in the bitmap previously set.
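Applied to the question's findDuplicates_New, the hashing idea boils down to grouping the composite key strings in a dictionary in a single pass instead of re-counting the whole list for every row. A sketch, reusing the rowStrs list the question already builds:
// One pass: group row indexes by their composite key string
var indexesByKey = new Dictionary<string, List<int>>();
for (int c = 0; c < rowStrs.Count; c++)
{
    if (!indexesByKey.TryGetValue(rowStrs[c], out List<int> indexes))
    {
        indexes = new List<int>();
        indexesByKey.Add(rowStrs[c], indexes);
    }
    indexes.Add(c);
}

// Every key that occurs more than once contributes all of its row indexes
List<int> duplicates = new List<int>();
foreach (List<int> group in indexesByKey.Values)
{
    if (group.Count > 1)
    {
        duplicates.AddRange(group);
    }
}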
When I use the line below, it reads all tables of that particular document:
foreach (Microsoft.Office.Interop.Word.Table tableContent in document.Tables)
But I want to read only the tables within a particular range of content, for example from one identifier to another.
An identifier can be in the form [SRS oraganisation_123] up to another identifier [SRS Oraganisation_456].
I want to read only the tables between the two identifiers mentioned above.
Suppose the 34th page contains my first identifier; then I want to read all tables from that point until I come across my second identifier. I don't want to read the remaining tables.
Please ask me for any clarification on the question.
Say the start and end identifiers are stored in variables called myStartIdentifier and myEndIdentifier:
Range myRange = doc.Range();
int iTagStartIdx = 0;
int iTagEndIdx = 0;

if (myRange.Find.Execute(myStartIdentifier))
    iTagStartIdx = myRange.Start;

myRange = doc.Range();
if (myRange.Find.Execute(myEndIdentifier))
    iTagEndIdx = myRange.Start;

foreach (Table tbl in doc.Range(iTagStartIdx, iTagEndIdx).Tables)
{
    // Your code goes here
}
Not sure how your program is structured... but if you can access the identifier in tableContent then you should be able to write a LINQ query.
var identifiers = new List<string>();
identifiers.Add("myIdentifier");

// Cast is needed because the Interop Tables collection is not generic;
// Where filters to only the tables whose (assumed) Identifier matches
var tablesWithOnlyTheIdentifiersIWant = document.Tables.Cast<Table>()
    .Where(tableContent => identifiers.Contains(tableContent.Identifier));

foreach (var tableContent in tablesWithOnlyTheIdentifiersIWant)
{
    //Do something
}
Go through the following code and see if it helps you.
System.Data.DataTable dt = new System.Data.DataTable();

// r is assumed to be the (header) row whose cells define the columns
foreach (Microsoft.Office.Interop.Word.Cell c in r.Cells)
{
    if (c.Range.Text == "Content you want to compare")
        dt.Columns.Add(c.Range.Text);
}

foreach (Microsoft.Office.Interop.Word.Row row in newTable.Rows)
{
    System.Data.DataRow dr = dt.NewRow();
    int i = 0;
    foreach (Cell cell in row.Cells)
    {
        if (!string.IsNullOrEmpty(cell.Range.Text) && (cell.Range.Text == "Text you want to compare with"))
        {
            dr[i] = cell.Range.Text;
        }
        i++; // advance to the next column for each cell
    }
    dt.Rows.Add(dr);
}
Also go through the third answer at the following link:
Replace bookmark text in Word file using Open XML SDK