I have C# DataTables with very large numbers of rows, and in my importer app I must query these hundreds of thousands of times in a given import. So I'm trying to find the fastest possible way to search. Thus far I am puzzling over very strange results. First, here are 2 different approaches I have been experimenting with:
APPROACH #1
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
        return dt.Select($"{keyColumn} = '{SafeTrim(keyValue)}'").Count() > 0;
    else
        return false;
}
APPROACH #2
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
    {
        int counter = dt.AsEnumerable()
            .Where(r => string.Equals(SafeTrim(r[keyColumn]), keyValue, StringComparison.CurrentCultureIgnoreCase))
            .Count();
        return counter > 0;
    }
    else
        return false;
}
In a mock test I run each method 15,000 times, handing in hardcoded data. This is apples-to-apples, a fair test. Approach #1 is dramatically faster. But in actual app execution, Approach #1 is dramatically slower.
Why the counterintuitive results? Is there some other faster way to query datatables that I haven't tried?
EDIT: The reason I use DataTables as opposed to other types of collections is that all my data sources are either MySQL tables or CSV files, so DataTables seemed like a logical choice. Some of these tables contain 10+ columns, so other collection types seemed an awkward match.
If you want faster access and still want to stick to DataTables, use a dictionary that stores the row number for each key. Here I assume that each key is unique in the DataTable. If not, you would have to use a Dictionary<string, List<int>> or Dictionary<string, HashSet<int>> to store the indexes.
var indexes = new Dictionary<string, int>();
for (int i = 0; i < dt.Rows.Count; i++)
{
    indexes.Add((string)dt.Rows[i][keyColumn], i);
}
Now you can access a row in a super fast way with
var row = dt.Rows[indexes[theKey]];
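If the keys are not unique, a minimal sketch of the same idea with a Dictionary<string, List<int>> could look like this (reusing the dt, keyColumn and theKey names from above):

var indexes = new Dictionary<string, List<int>>();
for (int i = 0; i < dt.Rows.Count; i++)
{
    string key = (string)dt.Rows[i][keyColumn];
    if (!indexes.TryGetValue(key, out var rowNumbers))
    {
        rowNumbers = new List<int>();
        indexes[key] = rowNumbers;
    }
    rowNumbers.Add(i); // remember every row number that carries this key
}

// existence check, analogous to DoesRecordExist
bool exists = indexes.ContainsKey(theKey);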
I have a very similar issue except that I need the actual First Occurrence of a matching row.
Using the .Select.FirstOrDefault (Approach 1) takes 38 minutes to run.
Using the .Where.FirstOrDefault (Approach 2) takes 6 minutes to run.
In a similar situation where I didn't need FirstOrDefault, but just needed to find and work with the uniquely matching record, what I found to be fastest by far is a Hashtable where the key is the combined values of the columns you are trying to match and the value is the DataRow itself. Finding a match is near-instant.
The function is:
public Hashtable ConvertToLookup(DataTable myDataTable, params string[] pKeyFieldNames)
{
    Hashtable myLookup = new Hashtable(StringComparer.InvariantCultureIgnoreCase); // makes the key case-insensitive
    foreach (DataRow myRecord in myDataTable.Rows)
    {
        string myHashKey = "";
        foreach (string strKeyFieldName in pKeyFieldNames)
        {
            myHashKey += Convert.ToString(myRecord[strKeyFieldName]).Trim();
        }
        if (myLookup.ContainsKey(myHashKey) == false)
        {
            myLookup.Add(myHashKey, myRecord);
        }
    }
    return myLookup;
}
The usage is...
// build the lookup table
Hashtable myLookUp = ConvertToLookup(myDataTable, "Col1Name", "Col2Name");

// use it
if (myLookUp.ContainsKey(mySearchForValue) == true)
{
    DataRow myRecord = (DataRow)myLookUp[mySearchForValue];
}
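Note that the search key has to be built exactly the way ConvertToLookup builds it: the trimmed column values concatenated in the same order. For example (col1Value and col2Value are assumed locals holding the values you are searching for):

// must mirror ConvertToLookup: trimmed values of Col1Name and Col2Name, concatenated in that order
string mySearchForValue = col1Value.Trim() + col2Value.Trim();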
BINGO! I wanted to share this as a separate answer because my previous one suits a slightly different approach. In this scenario I went from 8 minutes down to 6 seconds, without using either of the approaches above...
Again, the key is a Hashtable, or in my case a Dictionary because I had multiple matching records. To recap: I needed to delete one row from my DataTable for every matching record found in another DataTable, so that in the end my first DataTable contained only the "missing" records.
This uses a different function...
// -----------------------------------------------------------
// Creates a Dictionary with grouping counts from a DataTable
public Dictionary<string, Int32> GroupBy(DataTable myDataTable, params string[] pGroupByFieldNames)
{
    Dictionary<string, Int32> myGroupBy = new Dictionary<string, Int32>(StringComparer.InvariantCultureIgnoreCase); // makes the key case-insensitive
    foreach (DataRow myRecord in myDataTable.Rows)
    {
        string myKey = "";
        foreach (string strGroupFieldName in pGroupByFieldNames)
        {
            myKey += Convert.ToString(myRecord[strGroupFieldName]).Trim();
        }
        if (myGroupBy.ContainsKey(myKey) == false)
        {
            myGroupBy.Add(myKey, 1);
        }
        else
        {
            myGroupBy[myKey] += 1;
        }
    }
    return myGroupBy;
}
Now, say you have a table of records that you want to use as the "match values", based on Col1 and Col2:
Dictionary<string, Int32> myQuickLookUpCount = GroupBy(myMatchTable, "Col1", "Col2");
And now the magic. We loop through the primary table and remove one instance of a record for each instance in the matching table. This is the part that took 8 minutes with Approach #2, or 38 minutes with Approach #1, but now takes only seconds.
myDataTable.AcceptChanges(); // trick that allows us to delete during a foreach!
foreach (DataRow myDataRow in myDataTable.Rows)
{
    // grab the key values
    string strKey1Value = Convert.ToString(myDataRow["Col1"]);
    string strKey2Value = Convert.ToString(myDataRow["Col2"]);
    if (myQuickLookUpCount.TryGetValue(strKey1Value + strKey2Value, out Int32 intTotalCount) == true && intTotalCount > 0)
    {
        myDataRow.Delete();                                    // mark our row for deletion
        myQuickLookUpCount[strKey1Value + strKey2Value] -= 1;  // decrement our counter
    }
}
myDataTable.AcceptChanges(); // commits our changes and actually deletes the rows
I have the following method that returns a trimmed-down copy of a DataTable based on the user selecting which 6 columns to keep. My problem is that the DataTable can be quite large and takes up quite a bit of memory; creating the initial copy forces the system to start writing to the page file and slows the application down considerably.
I'm wondering whether it is possible to create a copy of only the specified columns (identified by name or index, it doesn't matter) rather than creating the full copy and then removing the unnecessary columns.
This question appears to be asking the same thing but in VB.net.
private DataTable CreateCleanData()
{
    var cleanedDataTable = _loadedDataData.Copy();
    var columnsToKeep = new List<string>();
    // comboBox1..comboBox6 are the controls where the user picks the columns to keep
    columnsToKeep.Add(comboBox1.SelectedValue.ToString());
    columnsToKeep.Add(comboBox2.SelectedValue.ToString());
    columnsToKeep.Add(comboBox3.SelectedValue.ToString());
    columnsToKeep.Add(comboBox4.SelectedValue.ToString());
    columnsToKeep.Add(comboBox5.SelectedValue.ToString());
    columnsToKeep.Add(comboBox6.SelectedValue.ToString());
    for (var i = cleanedDataTable.Columns.Count - 1; i >= 0; i--)
        if (!columnsToKeep.Contains(cleanedDataTable.Columns[i].ColumnName))
            cleanedDataTable.Columns.Remove(cleanedDataTable.Columns[i]);
    cleanedDataTable.AcceptChanges();
    GC.Collect();
    return cleanedDataTable;
}
You could use this method, basically just use Clone instead of Copy:
public static DataTable CreateCleanData(DataTable source, params int[] keepColumns)
{
    var cleanedDataTable = source.Clone(); // empty table but same columns
    for (int i = cleanedDataTable.Columns.Count - 1; i >= 0; i--)
    {
        if (!keepColumns.Contains(i))
            cleanedDataTable.Columns.RemoveAt(i);
    }
    cleanedDataTable.BeginLoadData();
    foreach (DataRow sourceRow in source.Rows)
    {
        DataRow newRow = cleanedDataTable.Rows.Add();
        foreach (DataColumn c in cleanedDataTable.Columns)
        {
            newRow.SetField(c, sourceRow[c.ColumnName]);
        }
    }
    cleanedDataTable.EndLoadData();
    return cleanedDataTable;
}
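A possible call site, reusing the question's _loadedDataData field; the keepIndexes values here are just an assumed example of the six user-selected column indexes:

// indexes of the six columns the user chose to keep (example values)
int[] keepIndexes = { 0, 2, 3, 5, 7, 9 };
DataTable cleaned = CreateCleanData(_loadedDataData, keepIndexes);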
I have a small program where you can select some database tables and create an Excel file with all the values of each table; this is my solution for creating the Excel file:
foreach (var selectedDatabase in this.lstSourceDatabaseTables.SelectedItems)
{
    // creates a new worksheet for each selected table
    foreach (TableRetrieverItem databaseTable in tableItems.FindAll(e => e.TableName.Equals(selectedDatabase)))
    {
        _xlWorksheet = (Excel.Worksheet)xlApp.Worksheets.Add();
        _xlWorksheet.Name = databaseTable.TableName.Length > 31 ? databaseTable.TableName.Substring(0, 31) : databaseTable.TableName;
        _xlWorksheet.Cells[1, 1] = string.Format("{0}.{1}", databaseTable.TableOwner, databaseTable.TableName);
        ColumnRetriever retrieveColumn = new ColumnRetriever(SourceConnectionString);
        IEnumerable<ColumnRetrieverItem> dbColumns = retrieveColumn.RetrieveColumns(databaseTable.TableName);
        var results = retrieveColumn.GetValues(databaseTable.TableName);
        int i = 1;
        // results.Item3 is a List<List<string>> that contains all values of the table; each row is one inner list
        for (int j = 0; j < results.Item3.Count(); j++)
        {
            int tmp = 1;
            foreach (var value in results.Item3[j])
            {
                _xlWorksheet.Cells[j + 3, tmp] = value;
                tmp++;
            }
        }
    }
}
It works, but when a table has 5,000 or more values it takes a very long time.
Does someone know a better way to write the List<List<string>> row by row than my for/foreach solution?
I use the GetExcelColumnName function in my code sample to convert a column count to the Excel column name.
The whole idea is that writing Excel cells one by one is very slow. So instead, precompute the whole table of values and then assign the result in a single operation. To assign values to a two-dimensional range, use a two-dimensional array of values:
var rows = results.Item3.Count;
var cols = results.Item3.Max(x => x.Count);
object[,] values = new object[rows, cols];
// TODO: initialize values from results content
// get the appropriate range (w is the target worksheet, _xlWorksheet in the question's code)
Range range = w.Range["A3", GetExcelColumnName(cols) + (rows + 2)];
// assign all values at once
range.Value = values;
You may need to adjust some details of the index ranges; I can't test the code right now.
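A minimal sketch of the TODO step, filling values from results.Item3 with the same rows and cols variables as above:

for (int r = 0; r < rows; r++)
{
    var rowValues = results.Item3[r];
    for (int c = 0; c < rowValues.Count; c++)
    {
        values[r, c] = rowValues[c]; // slots beyond a short row stay null and show up as empty cells
    }
}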
As far as I can see, you didn't do any profiling. I recommend profiling first (for example with dotTrace) to see which parts of your code actually cause the performance issues.
In my experience it is rare (almost unheard of) for code to execute slower than the database requests, even when the code is really awful in algorithmic terms.
First, I recommend filling your Excel sheet by rows, not by columns. If your table has many columns, going column by column causes multiple round trips to the database, which has a big impact on performance.
Second, write to Excel in batches of rows. Think of Excel files as mini-databases, where the same "a batch is faster than one by one" principle applies.
I want to find all rows in a DataTable that are duplicates with respect to a group of columns. My current idea is to get a list of the indexes of all rows that appear more than once, as follows:
public List<int> findDuplicates_New()
{
    string[] duplicateCheckFields = { "Name", "City" };
    List<int> duplicates = new List<int>();
    List<string> rowStrs = new List<string>();
    string rowStr;

    // convert each DataRow to a delimited string and add it to the list rowStrs
    foreach (DataRow dr in submissionsList.Rows)
    {
        rowStr = string.Empty;
        foreach (DataColumn dc in submissionsList.Columns)
        {
            // only use the duplicateCheckFields in the string
            if (duplicateCheckFields.Contains(dc.ColumnName))
            {
                rowStr += dr[dc].ToString() + "|";
            }
        }
        rowStrs.Add(rowStr);
    }

    // count how many of each row string are in the list;
    // add the string's index (which matches the row's index)
    // to the duplicates list if there is more than one
    for (int c = 0; c < rowStrs.Count; c++)
    {
        if (rowStrs.Count(str => str == rowStrs[c]) > 1)
        {
            duplicates.Add(c);
        }
    }
    return duplicates;
}
However, this isn't very efficient: it's O(n^2) to go through the list of strings and get the count of each string. I looked at this solution but couldn't figure out how to use it with more than 1 field. I'm looking for a less expensive way to handle this problem.
Try this:
How can I check for an exact match in a table where each row has 70+ columns?
The essence is to build a set that stores a hash for each row and only compare rows whose hashes collide; the complexity will be O(n).
...
If you have a large number of rows and storing the hashes themselves is an issue (an unlikely case, but still...) you can use a Bloom filter. The core idea of a Bloom filter is to calculate several different hashes of each row and use them as addresses into a bitmap. As you scan through the rows, you only need to double-check the rows whose bits in the bitmap are already all set.
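A minimal sketch of the bucketing idea applied to the question's code, using the combined column string itself as the dictionary key so the Dictionary's own hashing does the grouping (the method and variable names are mine; requires System.Linq and System.Data):

public List<int> FindDuplicateRowIndexes(DataTable table, params string[] checkFields)
{
    // bucket row indexes by the combined value of the checked columns
    var buckets = new Dictionary<string, List<int>>(StringComparer.Ordinal);
    for (int i = 0; i < table.Rows.Count; i++)
    {
        string key = string.Join("|", checkFields.Select(f => table.Rows[i][f].ToString()));
        if (!buckets.TryGetValue(key, out var indexes))
        {
            indexes = new List<int>();
            buckets[key] = indexes;
        }
        indexes.Add(i);
    }

    // every bucket with more than one entry holds duplicate rows
    return buckets.Values.Where(v => v.Count > 1).SelectMany(v => v).ToList();
}

// usage, mirroring the question: var duplicates = FindDuplicateRowIndexes(submissionsList, "Name", "City");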
I have a flat file with an unfortunately dynamic column structure. There is a value that is in a hierarchy of values, and each tier in the hierarchy gets its own column. For example, my flat file might resemble this:
StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
1234|7890|abcd|efgh|ijkl|mnop|Pending
...
The same feed the next day may resemble this:
StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
1234|7890|abcd|efgh|ijkl|Complete
...
The thing is, I don't care much about all the tiers; I only care about the id of the last (bottom) tier, plus all the other row data that is not part of the tier columns. I need to normalize the feed to something resembling this to inject into a relational database:
StatisticID|FileId|ObjectId|Status
1234|7890|ijkl|Complete
...
What would be an efficient, easy-to-read mechanism for determining the last tier object id, and organizing the data as described? Every attempt I've made feels kludgy to me.
Some things I've done:
I have tried to examine the column names for regular expression patterns, identify the columns that are tiered, order them by name descending, and select the first record... but I lose the ordinal column number this way, so that didn't look good.
I have placed the columns I want into an IDictionary<string, int> object to reference, but again reliably collecting the ordinal of the dynamic columns is an issue, and it seems this would be rather non-performant.
I ran into a similar problem a few years ago. I used a Dictionary to map the columns; it was not pretty, but it worked.
First make a Dictionary:
private Dictionary<int, int> GetColumnDictionary(string headerLine)
{
    Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
    List<string> columnNames = headerLine.Split('|').ToList();
    string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);
    for (int index = 0; index < columnNames.Count; index++)
    {
        if (columnNames[index] == "StatisticID")
        {
            columnDictionary.Add(0, index);
        }
        if (columnNames[index] == "FileId")
        {
            columnDictionary.Add(1, index);
        }
        if (columnNames[index] == maxTierObjectColumnName)
        {
            columnDictionary.Add(2, index);
        }
        if (columnNames[index] == "Status")
        {
            columnDictionary.Add(3, index);
        }
    }
    return columnDictionary;
}
private string GetMaxTierObjectColumnName(List<string> columnNames)
{
    // edit this function if the tier number can be greater than 9 (plain string ordering breaks at Tier10)
    var maxTierObjectColumnName = columnNames.Where(c => c.Contains("Tier") && c.Contains("Object")).OrderBy(c => c).Last();
    return maxTierObjectColumnName;
}
And after that it's simply a matter of running through the file:
private List<DataObject> ParseFile(string fileName)
{
    using (StreamReader streamReader = new StreamReader(fileName))
    {
        string headerLine = streamReader.ReadLine();
        Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);

        string line;
        List<DataObject> dataObjects = new List<DataObject>();
        while ((line = streamReader.ReadLine()) != null)
        {
            var lineValues = line.Split('|');
            dataObjects.Add(
                new DataObject()
                {
                    StatisticId = lineValues[columnDictionary[0]],
                    FileId = lineValues[columnDictionary[1]],
                    ObjectId = lineValues[columnDictionary[2]],
                    Status = lineValues[columnDictionary[3]]
                }
            );
        }
        return dataObjects;
    }
}
I hope this helps (even a little bit).
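The answer uses a DataObject class without showing it; a minimal definition consistent with the code above might be (this class is an assumption, not part of the original answer):

public class DataObject
{
    public string StatisticId { get; set; }
    public string FileId { get; set; }
    public string ObjectId { get; set; }
    public string Status { get; set; }
}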
Personally I would not try to reformat your file. I think the easiest approach would be to parse each row from the front and the back. For example:
itemArray = getMyItems();
statisticId = itemArray[0];
fileId = itemArray[1];
// and so on for the rest of your pre-tier columns

// then get the second-to-last column, which will be the last tier
lastTierId = itemArray[itemArray.length - 2];
Since you know the last tier will always be second from the end you can just start at the end and work your way forwards. This seems like it would be much easier than trying to reformat the datafile.
If you really want to create a new file, you could use this approach to get the data you want to write out.
I don't know C# syntax, but something along these lines:
split line in parts with | as separator
get parts [0], [1], [length - 2] and [length - 1]
pass the parts to the database handling code
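In C#, those steps might look roughly like this (line is assumed to hold one pipe-delimited data row from the feed):

// split a raw line and keep only the columns the normalized feed needs
string[] parts = line.Split('|');
string statisticId = parts[0];
string fileId = parts[1];
string objectId = parts[parts.Length - 2]; // the bottom tier is always second from the end
string status = parts[parts.Length - 1];
// pass statisticId, fileId, objectId and status on to the database handling code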
Say I have two grids, each containing 190,000+ records, named grid_A and grid_B.
For every record in grid_A, I want to find whether there is an identical record in grid_B.
grid_A and grid_B have the same columns; in my case the columns are
col1 col2 col3 col4
and their data types may be
string, datetime, double
What I do now is: for each row in grid_A, loop through all rows in grid_B and compare the four columns one by one.
The code is shown below:
// loop over grid_A
foreach (UltraGridRow row in ultraGrid1.Rows)
{
    List<object> lo = new List<object>();
    for (int i = 0; i < 4; i++) // add the column values to list A
    {
        lo.Add(row.Cells[i].Value);
    }
    // loop over grid_B
    foreach (UltraGridRow rowDist in ultraGrid2.Rows)
    {
        List<object> loDist = new List<object>();
        for (int ii = 0; ii < 4; ii++) // add the column values to list B
        {
            loDist.Add(rowDist.Cells[ii].Value);
        }
        if (CompareList(lo, loDist) == true) // compare the two lists
        {
            break;
        }
    }
}
// compares two lists element by element
private bool CompareList(List<object> a, List<object> b)
{
    // assert a.Count == b.Count
    for (int i = 0; i < a.Count; i++)
    {
        if (!CompareObject(a[i], b[i]))
            return false;
    }
    return true;
}
// compares two values that may be string, DateTime or double
private bool CompareObject(object oa, object ob)
{
    // object is string
    if (oa.GetType() == typeof(System.String))
    {
        try
        {
            string strOb = Convert.ToString(ob);
            return oa.ToString() == strOb;
        }
        catch
        {
            return false;
        }
    }
    // object is DateTime
    if (oa.GetType() == typeof(System.DateTime))
    {
        try
        {
            DateTime dtOb = Convert.ToDateTime(ob);
            return (DateTime)oa == dtOb;
        }
        catch
        {
            return false;
        }
    }
    // object is double
    if (oa.GetType() == typeof(System.Double))
    {
        try
        {
            double ddOb = Convert.ToDouble(ob);
            return (double)oa == ddOb;
        }
        catch
        {
            return false;
        }
    }
    return true;
}
I know my comparison approach is very naive. Each outer-loop iteration costs about 2.4 seconds, so 190,000 iterations would take roughly 130 hours, which is unworkable.
I have heard that a hash table can speed up searching, but I don't know how to use one here.
In any case, scanning all of grid_B for every record in grid_A is unacceptable, so any help is appreciated.
My grid's data is imported from Excel, so there is no SQL database or table behind it.
So you want to find out whether grid_B contains a record identical to one in grid_A. Instead of a nested foreach (which gives O(n^2) complexity, huge for 190k rows) you could change your algorithm.
First iterate over grid_B and calculate a hash for each row (e.g. by combining the data into string form and then taking its hash value; there may be better ways of generating the hash). Put those hashes into a dictionary as keys, with the value being a reference to the row in grid_B or whatever else you need.
Then iterate over grid_A, calculate the hash of each row and check whether the key is present in the dictionary. If it is present, you have the same row in grid_B (and the value stored in the dictionary can lead you to that row); if it is not present, there is no row with the same values.
This approach gives you O(2n) complexity, usually simplified to O(n): there are always two passes over the data. That is a significant improvement, but at the cost of higher memory consumption.
Whether this approach is faster depends on the number of records in the grids, but with 190k records it should be (though I cannot test it right now; I will try in the evening).
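A minimal sketch of that idea against the question's ultraGrid1/ultraGrid2; the four-column assumption, the "|" separator and the names are mine, and the string conversion of DateTime/double values must of course be identical on both sides:

// builds a composite key from the first four cells of a row
private string BuildKey(UltraGridRow row)
{
    return Convert.ToString(row.Cells[0].Value) + "|" + Convert.ToString(row.Cells[1].Value) + "|" +
           Convert.ToString(row.Cells[2].Value) + "|" + Convert.ToString(row.Cells[3].Value);
}

// index grid_B once: key -> row
var gridBIndex = new Dictionary<string, UltraGridRow>();
foreach (UltraGridRow rowB in ultraGrid2.Rows)
{
    gridBIndex[BuildKey(rowB)] = rowB; // last one wins if grid_B contains duplicates
}

// single pass over grid_A: one O(1) lookup per row instead of scanning grid_B
foreach (UltraGridRow rowA in ultraGrid1.Rows)
{
    bool existsInGridB = gridBIndex.ContainsKey(BuildKey(rowA));
    // ... use existsInGridB as needed
}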
If you can sort both of the grids beforehand, you can greatly reduce the number of comparisons. You would maintain a current row index for each grid (row_A for grid_A, row_B for grid_B) and walk your way through the grids. For each comparison (pseudocode):
if (grid_A[row_A] < grid_B[row_B])
    row_A++;
else if (grid_A[row_A] == grid_B[row_B])
{
    // matching record
    row_A++;
    row_B++;
}
else // grid_A[row_A] > grid_B[row_B]
    row_B++;
If you are unable to sort the grids then I'm not aware of a way to avoid your current algorithm. You'll just have to optimize the individual row comparisons as much as possible.
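For illustration, the same walk in C#, assuming both grids have already been exported to sorted lists of composite string keys (rowsA and rowsB are my assumed names):

// rowsA and rowsB are pre-sorted List<string> of composite keys, one per grid row
int row_A = 0, row_B = 0;
while (row_A < rowsA.Count && row_B < rowsB.Count)
{
    int cmp = string.CompareOrdinal(rowsA[row_A], rowsB[row_B]);
    if (cmp < 0)
        row_A++;            // no match for this grid_A row
    else if (cmp == 0)
    {
        // matching record
        row_A++;
        row_B++;
    }
    else
        row_B++;            // advance grid_B until it catches up
}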
Work on the raw data rather than through the data grid; that way you avoid the overhead of the UltraGrid and speed things up a bit.
Another idea would be to create a string out of each row and compare those strings. E.g. if the row is 1, 'text', true, 1/1/2001, build an ordered list containing the string '1-text-true-1/1/2001'; this way you get much faster indexing.