less expensive way to find duplicate rows in a datatable? - c#

I want to find all rows in a DataTable that are duplicates with respect to a given group of columns. My current idea is to get a list of the indexes of all rows that appear more than once, as follows:
public List<int> findDuplicates_New()
{
    string[] duplicateCheckFields = { "Name", "City" };
    List<int> duplicates = new List<int>();
    List<string> rowStrs = new List<string>();
    string rowStr;

    //convert each DataRow to a delimited string and add it to the list rowStrs
    foreach (DataRow dr in submissionsList.Rows)
    {
        rowStr = string.Empty;
        foreach (DataColumn dc in submissionsList.Columns)
        {
            //only use the duplicateCheckFields in the string
            if (duplicateCheckFields.Contains(dc.ColumnName))
            {
                rowStr += dr[dc].ToString() + "|";
            }
        }
        rowStrs.Add(rowStr);
    }

    //count how many of each row string are in the list;
    //add the string's index (which matches the row's index)
    //to the duplicates list if there is more than one
    for (int c = 0; c < rowStrs.Count; c++)
    {
        if (rowStrs.Count(str => str == rowStrs[c]) > 1)
        {
            duplicates.Add(c);
        }
    }
    return duplicates;
}
However, this isn't very efficient: it's O(n^2) to go through the list of strings and get the count of each string. I looked at this solution but couldn't figure out how to use it with more than 1 field. I'm looking for a less expensive way to handle this problem.

Try this:
How can I check for an exact match in a table where each row has 70+ columns?
The essence is to build a set that stores a hash for each row and only compare rows whose hashes collide; the complexity is O(n).
...
If you have a large number of rows and storing the hashes themselves is an issue (an unlikely case, but still...) you can use a Bloom filter. The core idea of a Bloom filter is to calculate several different hashes of each row and use them as addresses in a bitmap. As you scan through the rows, you only need to double-check rows whose bits in the bitmap are already all set.
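To make the hash/set idea concrete for the original question, here is a minimal sketch that groups row indexes by a composite key built from the duplicateCheckFields columns (assuming the submissionsList DataTable from the question); it runs in O(n) on average instead of O(n^2):
public List<int> FindDuplicatesByKey(DataTable table, params string[] keyColumns)
{
    //group row indexes by a composite key built from the key columns
    var groups = new Dictionary<string, List<int>>();
    for (int i = 0; i < table.Rows.Count; i++)
    {
        DataRow row = table.Rows[i];
        string key = string.Join("|", keyColumns.Select(c => row[c].ToString()).ToArray());
        List<int> indexes;
        if (!groups.TryGetValue(key, out indexes))
        {
            indexes = new List<int>();
            groups[key] = indexes;
        }
        indexes.Add(i);
    }

    //every index that belongs to a group with more than one member is a duplicate
    return groups.Values.Where(g => g.Count > 1).SelectMany(g => g).ToList();
}
Called as FindDuplicatesByKey(submissionsList, "Name", "City") it returns the same kind of index list as findDuplicates_New, just without the quadratic counting step.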

Related

Find matching records in DataTable as fast as possible

I have C# DataTables with very large numbers of rows, and in my importer app I must query them hundreds of thousands of times in a given import. So I'm trying to find the fastest possible way to search, and so far I'm puzzled by some very strange results. Here are the two approaches I have been experimenting with:
APPROACH #1
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
        return dt.Select($"{keyColumn} = '{SafeTrim(keyValue)}'").Count() > 0;
    else
        return false;
}
APPROACH #2
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0)
    {
        int counter = dt.AsEnumerable()
            .Where(r => string.Equals(SafeTrim(r[keyColumn]), keyValue, StringComparison.CurrentCultureIgnoreCase))
            .Count();
        return counter > 0;
    }
    else
        return false;
}
In a mock test I run each method 15,000 times, handing in hardcoded data. This is apples-to-apples, a fair test. Approach #1 is dramatically faster. But in actual app execution, Approach #1 is dramatically slower.
Why the counterintuitive results? Is there some other faster way to query datatables that I haven't tried?
EDIT: The reason I use DataTables as opposed to other types of collections is that all my data sources are either MySQL tables or CSV files, so DataTables seemed like a logical choice. Some of these tables contain 10+ columns, so other types of collections seemed an awkward match.
If you want faster access and still want to stick with DataTables, use a dictionary that stores the row index for a given key. Here I assume that each key is unique in the DataTable; if not, you would have to use a Dictionary<string, List<int>> or Dictionary<string, HashSet<int>> to store the indexes.
var indexes = new Dictionary<string, int>();
for (int i = 0; i < dt.Rows.Count; i++)
{
    indexes.Add((string)dt.Rows[i][keyColumn], i);
}
Now you can access a row in a super fast way with
var row = dt.Rows[indexes[theKey]];
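If the keys are not unique, a sketch of the Dictionary<string, List<int>> variant mentioned above might look like this (illustrative only, reusing dt and keyColumn from the answer):
var indexes = new Dictionary<string, List<int>>();
for (int i = 0; i < dt.Rows.Count; i++)
{
    string key = (string)dt.Rows[i][keyColumn];
    List<int> rowIndexes;
    if (!indexes.TryGetValue(key, out rowIndexes))
    {
        rowIndexes = new List<int>();
        indexes[key] = rowIndexes;
    }
    rowIndexes.Add(i); //remember every row that has this key
}

//existence check for a key; indexes[theKey] then holds the row numbers of all matches
bool exists = indexes.ContainsKey(theKey);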
I have a very similar issue except that I need the actual First Occurrence of a matching row.
Using the .Select.FirstOrDefault (Approach 1) takes 38 minutes to run.
Using the .Where.FirstOrDefault (Approach 2) takes 6 minutes to run.
In a similar situation where I didn't need FirstOrDefault, but just needed to find and work with the uniquely matching record, what I found to be the fastest by far is to use a Hashtable where the key is the combined values of the columns you are trying to match and the value is the DataRow itself. Finding a match is near instant.
The function is:
public Hashtable ConvertToLookup(DataTable myDataTable, params string[] pKeyFieldNames)
{
    Hashtable myLookup = new Hashtable(StringComparer.InvariantCultureIgnoreCase); //makes the key case-insensitive
    foreach (DataRow myRecord in myDataTable.Rows)
    {
        string myHashKey = "";
        foreach (string strKeyFieldName in pKeyFieldNames)
        {
            myHashKey += Convert.ToString(myRecord[strKeyFieldName]).Trim();
        }
        if (myLookup.ContainsKey(myHashKey) == false)
        {
            myLookup.Add(myHashKey, myRecord);
        }
    }
    return myLookup;
}
The usage is...
//Build the lookup table
Hashtable myLookUp = ConvertToLookup(myDataTable, "Col1Name", "Col2Name");

//Use it
if (myLookUp.ContainsKey(mySearchForValue) == true)
{
    DataRow myRecord = (DataRow)myLookUp[mySearchForValue];
}
BINGO! I wanted to share this as a different answer because my previous one might suit a slightly different approach. In this scenario I was able to go from 8 MINUTES down to 6 SECONDS, without using either of the two approaches.
Again, the key is a Hashtable, or in my case a Dictionary because I had multiple records. To recap: I needed to delete one row from my DataTable for every matching record I found in another DataTable, with the goal that in the end my first DataTable only contained the "missing" records.
This uses a different function...
// -----------------------------------------------------------
// Creates a Dictionary with grouping counts from a DataTable
public Dictionary<string, Int32> GroupBy(DataTable myDataTable, params string[] pGroupByFieldNames)
{
    Dictionary<string, Int32> myGroupBy = new Dictionary<string, Int32>(StringComparer.InvariantCultureIgnoreCase); //makes the key case-insensitive
    foreach (DataRow myRecord in myDataTable.Rows)
    {
        string myKey = "";
        foreach (string strGroupFieldName in pGroupByFieldNames)
        {
            myKey += Convert.ToString(myRecord[strGroupFieldName]).Trim();
        }
        if (myGroupBy.ContainsKey(myKey) == false)
        {
            myGroupBy.Add(myKey, 1);
        }
        else
        {
            myGroupBy[myKey] += 1;
        }
    }
    return myGroupBy;
}
Now, say you have a table of records that you want to use as the "match values", based on Col1 and Col2:
Dictionary<string, Int32> myQuickLookUpCount = GroupBy(myMatchTable, "Col1", "Col2");
And now the magic: we loop through the primary table and remove one instance of a record for each instance found in the matching table. This is the part that took 8 minutes with Approach #2 (or 38 minutes with Approach #1), but now takes only seconds.
myDataTable.AcceptChanges(); //trick that lets us mark rows for deletion during a foreach
foreach (DataRow myDataRow in myDataTable.Rows)
{
    //grab the key values
    string strKey1Value = Convert.ToString(myDataRow["Col1"]);
    string strKey2Value = Convert.ToString(myDataRow["Col2"]);
    if (myQuickLookUpCount.TryGetValue(strKey1Value + strKey2Value, out Int32 intTotalCount) == true && intTotalCount > 0)
    {
        myDataRow.Delete();                                   //mark our row for deletion
        myQuickLookUpCount[strKey1Value + strKey2Value] -= 1; //decrement our counter
    }
}
myDataTable.AcceptChanges(); //commits our changes and actually deletes the rows

Excel List<List<string>> per each row

I have a small program where you can select some database tables and create an Excel file with all the values of each table. This is my current solution for creating the Excel file:
foreach (var selectedDatabase in this.lstSourceDatabaseTables.SelectedItems)
{
    //creates a new worksheet for each selected table
    foreach (TableRetrieverItem databaseTable in tableItems.FindAll(e => e.TableName.Equals(selectedDatabase)))
    {
        _xlWorksheet = (Excel.Worksheet)xlApp.Worksheets.Add();
        _xlWorksheet.Name = databaseTable.TableName.Length > 31 ? databaseTable.TableName.Substring(0, 31) : databaseTable.TableName;
        _xlWorksheet.Cells[1, 1] = string.Format("{0}.{1}", databaseTable.TableOwner, databaseTable.TableName);
        ColumnRetriever retrieveColumn = new ColumnRetriever(SourceConnectionString);
        IEnumerable<ColumnRetrieverItem> dbColumns = retrieveColumn.RetrieveColumns(databaseTable.TableName);
        var results = retrieveColumn.GetValues(databaseTable.TableName);
        int i = 1;
        //results.Item3 is a List<List<string>> that contains all values of the table; each row is one inner list
        for (int j = 0; j < results.Item3.Count(); j++)
        {
            int tmp = 1;
            foreach (var value in results.Item3[j])
            {
                _xlWorksheet.Cells[j + 3, tmp] = value;
                tmp++;
            }
        }
    }
}
It works, but when a table has 5,000 or more values it takes a very long time.
Does anyone know a better way to write the List<List<string>> row by row than my nested for/foreach solution?
I utilize the GetExcelColumnName function in my code sample to convert from a column count to the Excel column name.
The whole idea is that it's very slow to write Excel cells one by one. Instead, precompute the whole table of values and then assign the result in a single operation. To assign values to a two-dimensional range, use a two-dimensional array of values:
var rows = results.Item3.Count;
var cols = results.Item3.Max(x => x.Count);
object[,] values = new object[rows, cols];
// TODO: initialize values from results content
// get the appropriate range
Range range = w.Range["A3", GetExcelColumnName(cols) + (rows + 2)];
// assign all values at once
range.Value = values;
You may need to adjust some details of the index ranges - I can't test the code right now.
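For example, the TODO above could be filled in roughly like this (a sketch assuming results.Item3 is the List<List<string>> described in the question):
// copy the jagged List<List<string>> into the rectangular object[,] array
for (int r = 0; r < rows; r++)
{
    var rowValues = results.Item3[r];
    for (int c = 0; c < rowValues.Count; c++)
    {
        values[r, c] = rowValues[c];
    }
}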
As far as I can see, you didn't do any profiling. I recommend profiling first (for example with dotTrace) to see which parts of your code actually cause the performance issues.
In my experience there are very few cases (almost none) where code executes slower than the database requests, even if the code is really awful in algorithmic terms.
First, I recommend filling your Excel sheet by rows, not by columns. If your table has many columns, going column by column causes multiple round trips to the database, which has a big impact on performance.
Second, write to Excel in batches - by rows. Think of Excel files as mini-databases, with the same 'a batch is faster than one-by-one' principle.
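A rough sketch of the 'one batch per row' idea, reusing the question's _xlWorksheet and results objects and the GetExcelColumnName helper mentioned in the first answer (all assumed to be available):
// write one entire row per range assignment instead of one cell at a time
for (int j = 0; j < results.Item3.Count; j++)
{
    var rowValues = results.Item3[j];
    object[,] rowData = new object[1, rowValues.Count];
    for (int c = 0; c < rowValues.Count; c++)
    {
        rowData[0, c] = rowValues[c];
    }
    Excel.Range rowRange = _xlWorksheet.Range["A" + (j + 3), GetExcelColumnName(rowValues.Count) + (j + 3)];
    rowRange.Value = rowData;
}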

How do I delete a datarow from a datarow array?

I am looping through an array of DataRows, and when a particular random item is not valid I want to remove that item and get the new total so I can pick another random item.
But when I delete a DataRow, the row does not go away... And yes, there is probably a much better way to do this, but I am not smart enough to figure it out.
Instead of the row being removed, I see this in the debugger:
ItemArray = podLps[1].ItemArray threw an exception of type System.Data.RowNotInTableException
//PHASE 1: Get all LPs in the pod and add them to a collection
List<DataRow> allLps = dtLp.AsEnumerable().ToList();
DataRow[] podLps = allLps.Where(x => x.ItemArray[0].ToString() == loPod).ToArray();

//PHASE 2: Pick a random LP from the collection that has a valid WAVE1
for (int i = podLps.Count(); i > 0; i--)
{
    //Recount items in the collection since one may have been removed
    int randomIndex = random.Next(podLps.Count());
    var randomLpUserId = podLps[randomIndex].ItemArray[1].ToString();
    var randomLpWave1 = int.Parse(podLps[randomIndex].ItemArray[2].ToString());

    //Get the WAVE1 # for the selected LP
    lpNumberOfLoans = GetNumberOfLoans(session, randomLpUserId);

    //If the LP has a valid WAVE1, use this person
    if (randomLpWave1 > lpNumberOfLoans)
    {
        return randomLpUserId;
    }
    else
    {
        podLps[randomIndex].Delete();
    }
}
Look at this example; it should point you in the right direction for removing rows. I just tested it and it works:
for (int i = myDataTable.Rows.Count - 1; i >= 0; i--)
{
    DataRow row = myDataTable.Rows[i];
    if (myDataTable.Rows[i][0].ToString() == string.Empty)
    {
        myDataTable.Rows.Remove(row);
    }
}
I would suggest using a List for podLps instead of an array. Then you can use .RemoveAt as Jaco mentioned (it doesn't work for arrays).
DataRow.Delete() only flags the row to be deleted on the next update of your DataTable.
The easiest method is to convert your DataRow[] array to a List<DataRow>, call RemoveAt, and then convert the list back to an array:
var dest = new List<DataRow>(podLps);
dest.RemoveAt(randomIndex);
podLps = dest.ToArray();
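Applied to the loop from the question, a List<DataRow>-based version might look roughly like this (a sketch reusing the dtLp, loPod, random, session and GetNumberOfLoans names from the question):
List<DataRow> podLps = dtLp.AsEnumerable()
                           .Where(x => x.ItemArray[0].ToString() == loPod)
                           .ToList();

while (podLps.Count > 0)
{
    int randomIndex = random.Next(podLps.Count);
    var randomLpUserId = podLps[randomIndex].ItemArray[1].ToString();
    var randomLpWave1 = int.Parse(podLps[randomIndex].ItemArray[2].ToString());

    int lpNumberOfLoans = GetNumberOfLoans(session, randomLpUserId);
    if (randomLpWave1 > lpNumberOfLoans)
    {
        return randomLpUserId;        //found a valid LP
    }

    podLps.RemoveAt(randomIndex);     //drop the invalid LP and pick again from the smaller list
}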

Is there a way to dynamically create an object at run time in .NET 3.5?

I'm working on an importer that takes tab delimited text files. The first line of each file contains 'columns' like ItemCode, Language, ImportMode etc and there can be varying numbers of columns.
I'm able to get the names of each column, whether there's one or 10 and so on. I use a method to achieve this that returns List<string>:
private List<string> GetColumnNames(string saveLocation, int numColumns)
{
    var data = (File.ReadAllLines(saveLocation));
    var columnNames = new List<string>();
    for (int i = 0; i < numColumns; i++)
    {
        var cols = from lines in data
                       .Take(1)
                       .Where(l => !string.IsNullOrEmpty(l))
                       .Select(l => l.Split(delimiter.ToCharArray(), StringSplitOptions.None))
                       .Select(value => string.Join(" ", value))
                   let split = lines.Split(' ')
                   select new
                   {
                       Temp = split[i].Trim()
                   };
        foreach (var x in cols)
        {
            columnNames.Add(x.Temp);
        }
    }
    return columnNames;
}
If I always knew what columns to be expecting, I could just create a new object, but since I don't, I'm wondering is there a way I can dynamically create an object with properties that correspond to whatever GetColumnNames() returns?
Any suggestions?
For what it's worth, here's how I used DataTables to achieve what I wanted.
// saveLocation is the file location
// numColumns comes from another method that gets the number of columns in the file
var columnNames = GetColumnNames(saveLocation, numColumns);
var table = new DataTable();
foreach (var header in columnNames)
{
    table.Columns.Add(header);
}
// itemAttributeData is the file split into lines
foreach (var row in itemAttributeData)
{
    table.Rows.Add(row);
}
Although there was a bit more work involved to be able to manipulate the data in the way I wanted, Karthik's suggestion got me on the right track.
You could create a dictionary of strings where the key is the "property" name and the value is its content.
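A minimal sketch of that idea, using a Dictionary<string, string> per record and assuming the GetColumnNames method and the tab-delimited files described in the question (ReadRecords is a hypothetical helper name):
private List<Dictionary<string, string>> ReadRecords(string saveLocation, int numColumns)
{
    var columnNames = GetColumnNames(saveLocation, numColumns);
    var records = new List<Dictionary<string, string>>();

    foreach (var line in File.ReadAllLines(saveLocation).Skip(1)) //skip the header line
    {
        var values = line.Split('\t'); //assumes tab-delimited, as described in the question
        var record = new Dictionary<string, string>();
        for (int i = 0; i < columnNames.Count && i < values.Length; i++)
        {
            record[columnNames[i]] = values[i];
        }
        records.Add(record);
    }
    return records;
}
Each record can then be read as records[0]["ItemCode"], records[0]["Language"], and so on.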

How to speed up this code, which finds matching records between two grids that both have 190,000+ records

Say I have two grids, each containing 190,000+ records, named grid_A and grid_B.
For every record in grid_A, I want to find whether there is an identical record in grid_B.
grid_A and grid_B have the same columns; in my case the columns are
col1 col2 col3 col4
and their data types are string, datetime, and double.
Currently what I do is: for each row in grid_A, loop through all rows in grid_B and compare the four columns one by one.
The code is shown below:
//loop over grid_A
foreach (UltraGridRow row in ultraGrid1.Rows)
{
    List<object> lo = new List<object>();
    for (int i = 0; i < 4; i++) //add the column values to list A
    {
        lo.Add(row.Cells[i].Value);
    }
    //loop over grid_B
    foreach (UltraGridRow rowDist in ultraGrid2.Rows)
    {
        List<object> loDist = new List<object>();
        for (int ii = 0; ii < 4; ii++) //add the column values to list B
        {
            loDist.Add(rowDist.Cells[ii].Value);
        }
        if (CompareList(lo, loDist) == true) //compare the two lists
        {
            break;
        }
    }
}
// compares two lists element by element
private bool CompareList(List<object> a, List<object> b)
{
    //Assert a.Count == b.Count
    for (int i = 0; i < a.Count; i++)
    {
        if (!CompareObject(a[i], b[i]))
            return false;
    }
    return true;
}
// compares two values, handling string, DateTime and double
private bool CompareObject(object oa, object ob)
{
    // object is a string
    if (oa.GetType() == typeof(System.String))
    {
        try
        {
            string strOb = Convert.ToString(ob);
            return oa.ToString() == strOb;
        }
        catch
        {
            return false;
        }
    }
    // object is a DateTime
    if (oa.GetType() == typeof(System.DateTime))
    {
        try
        {
            DateTime dtOb = Convert.ToDateTime(ob);
            return (DateTime)oa == dtOb;
        }
        catch
        {
            return false;
        }
    }
    // object is a double
    if (oa.GetType() == typeof(System.Double))
    {
        try
        {
            double ddOb = Convert.ToDouble(ob);
            return (double)oa == ddOb;
        }
        catch
        {
            return false;
        }
    }
    return true;
}
I know my comparison approach is very naive. Each outer-loop iteration costs about 2.4 seconds, so 190,000 iterations would cost about 130 hours, which is terrible. I have heard that a hash table can be used to speed up the search, but I do not know how to use one here.
In any case, searching every record of grid_B for each record of grid_A is unacceptable, so any help is appreciated.
My grids' data is imported from Excel, so there is no SQL database or table behind them.
So you want to find out whether grid_B contains a record identical to one in grid_A. Instead of doing nested foreach loops (resulting in O(n^2) complexity, which is huge) you could change your algorithm.
First iterate over grid_B and calculate a hash of each row (e.g. by combining the column values into a string and hashing that, though there may be better ways of generating the hash). Put those hashes into a dictionary as keys; the value can be a reference to the row in grid_B or whatever else you need.
Then iterate over grid_A, calculate the hash value of each row, and check whether that key is present in the dictionary. If it is present, you have the same row in grid_B (and the value stored in the dictionary can lead you to that row); if it is not present, there is no row with the same values.
This approach has a complexity of O(2n), usually simplified to O(n): we always make two passes over the data. That is a significant improvement, but at the cost of higher memory consumption.
Whether this approach is faster in practice may depend on the number of records in the grids, but with 190k records it should be faster (though I cannot test that now; I will try to do so in the evening).
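A minimal sketch of that dictionary/set idea, assuming the ultraGrid1/ultraGrid2 controls and the four compared columns from the question; the composite string key is just one simple way to combine the values, and note that DateTime/double formatting must be consistent for the keys to match:
//build a set of composite keys for every row in grid_B (one pass)
var keysB = new HashSet<string>();
foreach (UltraGridRow rowB in ultraGrid2.Rows)
{
    string keyB = string.Join("|", Enumerable.Range(0, 4)
                                             .Select(i => Convert.ToString(rowB.Cells[i].Value))
                                             .ToArray());
    keysB.Add(keyB);
}

//a single pass over grid_A now answers "is there an identical row in grid_B?"
foreach (UltraGridRow rowA in ultraGrid1.Rows)
{
    string keyA = string.Join("|", Enumerable.Range(0, 4)
                                             .Select(i => Convert.ToString(rowA.Cells[i].Value))
                                             .ToArray());
    bool existsInB = keysB.Contains(keyA);
    //... use existsInB as needed
}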
If you can sort both of the grids beforehand, you can greatly reduce the number of comparisons. You would maintain a current row index for each grid (row_A for grid_A, row_B for grid_B) and walk your way through the grids. For each comparison (pseudocode):
if (grid_A[row_A] < grid_B[row_B])
    row_A++;
else if (grid_A[row_A] == grid_B[row_B])
{
    // matching record
    row_A++;
    row_B++;
}
else // grid_A[row_A] > grid_B[row_B]
    row_B++;
If you are unable to sort the grids then I'm not aware of a way to avoid your current algorithm. You'll just have to optimize the individual row comparisons as much as possible.
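Sketched in C# over two pre-sorted lists of rows, in the spirit of the pseudocode above (sortedA, sortedB and the CompareRows comparer over the four columns are hypothetical names, not from the question):
// walk two pre-sorted row lists in lockstep and collect matches
int rowA = 0, rowB = 0;
var matches = new List<int>(); //indexes in sortedA that also occur in sortedB
while (rowA < sortedA.Count && rowB < sortedB.Count)
{
    int cmp = CompareRows(sortedA[rowA], sortedB[rowB]); //hypothetical comparer over the 4 columns
    if (cmp < 0)
    {
        rowA++;
    }
    else if (cmp == 0)
    {
        matches.Add(rowA); //matching record
        rowA++;
        rowB++;
    }
    else
    {
        rowB++;
    }
}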
Work on the raw data rather than going through the data grid; that way you avoid the overhead of the UltraGrid and it will speed things up a bit.
Another idea would be to create a string out of each and every row and compare those strings. E.g. if the row is 1, 'text', true, 1/1/2001 then you build the string '1-text-true-1/1/2001' and keep the strings in an ordered list; this way you get much faster indexing.
