Best way to remove duplicates from DataTable depending on column values

I have a DataSet which contains just one Table, so you could say I'm working with a DataTable here.
The code you see below works, but I want to have the best and most efficient way to perform the task because I work with some data here.
Basically, the data from the Table should later be in a Database, where the primary key - of course - must be unique.
The primary key of the data I work with is in a column called Computer Name. For each entry we also have a date in another column, date.
I wrote a function which searches for duplicates in the Computer Name column and then compares the dates of these duplicates to delete all but the newest.
The function I wrote looks like this:
private void mergeduplicate(DataSet importedData)
{
Dictionary<String, List<DataRow>> systems = new Dictionary<String, List<DataRow>>();
DataSet importedDataCopy = importedData.Copy();
importedData.Tables[0].Clear();
foreach (DataRow dr in importedDataCopy.Tables[0].Rows)
{
String systemName = dr["Computer Name"].ToString();
if (!systems.ContainsKey(systemName))
{
systems.Add(systemName, new List<DataRow>());
}
systems[systemName].Add(dr);
}
foreach (KeyValuePair<String,List<DataRow>> entry in systems) {
if (entry.Value.Count > 1) {
int firstDataRowIndex = 0;
int secondDataRowIndex = 1;
while (entry.Value.Count > 1) {
DateTime time1 = Validation.ConvertStringIntoDateTime(entry.Value[firstDataRowIndex]["date"].ToString());
DateTime time2 = Validation.ConvertStringIntoDateTime(entry.Value[secondDataRowIndex]["date"].ToString());
// Delete the older entry: time1 earlier than time2 means the first row is the older one.
if (DateTime.Compare(time1, time2) < 0) {
entry.Value.RemoveAt(firstDataRowIndex);
} else {
entry.Value.RemoveAt(secondDataRowIndex);
}
}
}
importedData.Tables[0].ImportRow(entry.Value[0]);
}
}
My Question is, since this code works - what is the best and fastest/most efficient way to perform the task?
I appreciate any answers!

I think this can be done more efficiently. You copy the DataSet once with DataSet importedDataCopy = importedData.Copy(); and then you copy it again into a dictionary and then you delete the unnecessary data from the dictionary. I would rather just remove the unnecessary information in one pass. What about something like this:
private void mergeduplicate(DataSet importedData)
{
Dictionary<String, DataRow> systems = new Dictionary<String, DataRow>();
int i = 0;
while (i < importedData.Tables[0].Rows.Count)
{
DataRow dr = importedData.Tables[0].Rows[i];
String systemName = dr["Computer Name"].ToString();
if (!systems.ContainsKey(systemName))
{
systems.Add(systemName, dr);
// Nothing was removed, so move on to the next row.
i++;
}
else
{
// Existing date is the date in the dictionary.
DateTime existing = Validation.ConvertStringIntoDateTime(systems[systemName]["date"].ToString());
// Candidate date is the date of the current DataRow.
DateTime candidate = Validation.ConvertStringIntoDateTime(dr["date"].ToString());
// If the candidate date is greater than the existing date then replace the existing DataRow
// with the candidate DataRow and delete the existing DataRow from the table.
if (DateTime.Compare(existing, candidate) < 0)
{
importedData.Tables[0].Rows.Remove(systems[systemName]);
systems[systemName] = dr;
}
else
{
importedData.Tables[0].Rows.Remove(dr);
}
// A row at or before index i was removed, so the next unvisited row
// has shifted into index i. Do not increment here.
}
}
}
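If removing rows while walking the table feels fragile, a variation is to remember only the newest row per key and then build a fresh table from those rows. The following is a minimal sketch, not the asker's code: the class and method names are made up for illustration, and ParseDate is a stand-in for Validation.ConvertStringIntoDateTime, assumed here to accept whatever DateTime.Parse accepts under the invariant culture.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Globalization;

public static class DataTableDeduplication
{
    // Returns a new table holding only the newest row for each key value.
    // The source table is left untouched.
    public static DataTable KeepNewestPerKey(DataTable source, string keyColumn, string dateColumn)
    {
        var newestRows = new Dictionary<string, DataRow>();
        var newestDates = new Dictionary<string, DateTime>();

        foreach (DataRow row in source.Rows)
        {
            string key = row[keyColumn].ToString();
            DateTime date = ParseDate(row[dateColumn].ToString());

            DateTime known;
            if (!newestDates.TryGetValue(key, out known) || date > known)
            {
                // First row for this key, or a newer one than seen so far.
                newestRows[key] = row;
                newestDates[key] = date;
            }
        }

        DataTable result = source.Clone(); // same schema, no rows
        foreach (DataRow row in newestRows.Values)
        {
            result.ImportRow(row);
        }
        return result;
    }

    // Hypothetical stand-in for Validation.ConvertStringIntoDateTime.
    private static DateTime ParseDate(string text)
    {
        return DateTime.Parse(text, CultureInfo.InvariantCulture);
    }
}
```

Each date is parsed once, and no row is removed from a collection that is being iterated. The cost is one extra table, so this fits best when the deduplicated result is what gets written to the database anyway.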

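Maybe not the most efficient way, but you said you appreciate any answers. Group the rows by computer name, sort each group newest first, and delete everything after the first row:
List<DataRow> toDelete = dt.Rows.Cast<DataRow>()
.GroupBy(s => s["Computer Name"].ToString())
.SelectMany(grp => grp
.OrderByDescending(x => Validation.ConvertStringIntoDateTime(x["date"].ToString()))
.Skip(1)).ToList();
toDelete.ForEach(x => dt.Rows.Remove(x));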

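You could try to use CopyToDataTable. Group by the computer name only, keep the newest row of each group, and copy the result into a new table (Tables[0] has no setter, so the result cannot be assigned back to it directly):
DataTable deduplicated = importedData.Tables[0].AsEnumerable()
.GroupBy(r => r["Computer Name"].ToString())
.Select(g => g.OrderByDescending(r => Validation.ConvertStringIntoDateTime(r["date"].ToString())).First())
.CopyToDataTable();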

Related

How to get data as IEnumerable<MODEL> from DataTable by using LINQ?

I have a DataTable with attribute data loaded from a database. Each row in that DataTable belongs to a product, and each product can have more than one attribute.
I also have another DataTable which holds all the products. In a foreach over its rows I add each product to a List<Plu> like this:
var productAttr = new List<Plu.Attributi>();
foreach (DataRow rowPlu in dt.Rows)
{
try
{
int id = (int)rowPlu["ID_PLUREP"];
plu.Add(new Plu(
id,
(string)rowPlu["CODICE_PRP"],
(string)rowPlu[ESTESA],
(string)rowPlu[DESCR], (float)rowPlu["PRE_PRP"],
rowPlu.IsNull("IMG_IMG") ? null : (string)rowPlu["IMG_IMG"],
productAttr,
null,
(int)rowPlu["ID_MENU_PRP"]
));
}
catch
{
return plu;
}
}
For now productAttr is empty, but I need to add each product's attributes. With the following function I get a DataTable filled with all product attributes from the database:
var attributi = Attributi(connection, idNegozio);
and then I was trying to do something like this inside the foreach:
foreach (DataRow rowPlu in dt.Rows)
{
try
{
int id = (int)rowPlu["ID_PLUREP"];
plu.Add(new Plu(
id,
(string)rowPlu["CODICE_PRP"],
(string)rowPlu[ESTESA],
(string)rowPlu[DESCR], (float)rowPlu["PRE_PRP"],
rowPlu.IsNull("IMG_IMG") ? null : (string)rowPlu["IMG_IMG"],
from row in attributi.AsEnumerable() where row.Field<int>("ID_PLUREP_VAT") == id select row,
null,
(int)rowPlu["ID_MENU_PRP"]
));
}
catch
{
return plu;
}
}
But the LINQ query returns an EnumerableRowCollection<DataRow> while I need an IEnumerable<Plu.Attributi>, so I was wondering if there is a lazy way to convert the result of .AsEnumerable() to IEnumerable<Plu.Attributi>...

The problem is, that the DataTable only knows which values are in the cells. It does not know what these values stand for. It does not know that the number in column 0 is in fact an Id. It doesn't know that the string in column 1 is the Name of a Customer, and the DateTime in column 2 is the Birthday of the Customer.
If you will be using the contents of this Datatable (or similar DataTables) for other queries in the future, you need some translation from DataRow to the items that they stand for.
Once you've got the translation from DataRow to Plu, you can convert your DataTable to an IEnumerable<Plu>, and do other LINQ processing on it.
Usage will be like:
DataTable table = ...
var mySelectedData = table.AsEnumerable().ToPlus()
.Where(plu => ...)
.Select(plu => new {...})
.ToList();
You need two extension methods: one that converts a DataRow to a Plu and one that converts a sequence of DataRows to a sequence of Plus. See extension methods demystified
public static Plu ToPlu(this DataRow row)
{
// TODO implement
}
public static IEnumerable<Plu> ToPlus(this IEnumerable<DataRow> dataRows)
{
// TODO: exception if null dataRows
return dataRows.Select(row => row.ToPlu());
}
If desired, create an extension method from DataTable to extract the Plus:
public static IEnumerable<Plu> ExtractPlus(this DataTable table)
{
// TODO: exception if table null
return table.AsEnumerable().ToPlus();
}
Usage:
DataTable table = ...
IEnumerable<Plu> plus = table.ExtractPlus();
I haven't got the faintest idea what a Plu is, and you forgot to mention the relevant properties of the Plu, so I'll give you an example of a table that contains Customers:
class Customer
{
public int Id {get; set;} // Id will be in column 0
public string Name {get; set;} // Name will be in column 1
...
}
public static Customer ToCustomer(this DataRow row)
{
return new Customer
{
Id = (int)row[0],
Name = (string)row[1],
};
}
If desired, instead of columnIndex you can use the name of the column.
So by only creating a ToPlu, and a one-liner method to convert sequences of DataRows to a sequence of Plus, you've extended LINQ with your methods to read your tables.
To be on the safe side, consider creating an extension method that converts a sequence of Plus to a DataTable. This way, the layout of the table is in one location: ToPlu(DataRow) and ToDataRow(Plu). Future changes in the table layout will be easier to manage, users of your DataTable will only think in sequences of Plus.
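As a rough illustration of the last two points, here is the Customer example again, this time reading by column name and with the reverse conversion added. It is only a sketch: the column names "Id" and "Name" and the method names ToCustomers and ToDataTable are assumptions made for this example, not something given in the question.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class CustomerTableExtensions
{
    // Reads by column name; Field<string> returns null when the cell is DBNull.
    public static Customer ToCustomer(this DataRow row)
    {
        return new Customer
        {
            Id = row.Field<int>("Id"),
            Name = row.Field<string>("Name"),
        };
    }

    public static IEnumerable<Customer> ToCustomers(this IEnumerable<DataRow> rows)
    {
        return rows.Select(row => row.ToCustomer());
    }

    // Reverse direction: builds a table with the layout that ToCustomer expects.
    public static DataTable ToDataTable(this IEnumerable<Customer> customers)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Name", typeof(string));
        foreach (Customer customer in customers)
        {
            table.Rows.Add(customer.Id, customer.Name);
        }
        return table;
    }
}
```

With both directions in one class, a change to the table layout only touches ToCustomer and ToDataTable.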

You can do something like below. If you want IEnumerable<Plu> you can remove the .ToList() from the end.
dt.AsEnumerable().Select(x => new Plu {
Id = x.Field<int>("ID_PLUREP"),
CodicePrep = x.Field<string>("CODICE_PRP"),
....
Attributes = attributi.AsEnumerable()
.Where(y => y.Field<int>("ID_PLUREP_VAT") == x.Field<int>("ID_PLUREP"))
.Select(z => new Attributi
{
....
}).ToList(),
....
}).ToList();
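One thing to keep in mind with the query above is that it scans the whole attributi table once per product. If that becomes slow, the attribute rows can be grouped by product id once up front. This is a sketch under assumptions: the class and method names are invented for illustration, and the map delegate stands for whatever code turns an attribute row into a Plu.Attributi.

```csharp
using System;
using System.Data;
using System.Linq;

public static class AttributeGrouping
{
    // Groups the attribute rows by product id in a single pass.
    // "ID_PLUREP_VAT" is the foreign-key column used in the question.
    public static ILookup<int, T> GroupByProduct<T>(DataTable attributi, Func<DataRow, T> map)
    {
        return attributi.AsEnumerable()
            .ToLookup(row => row.Field<int>("ID_PLUREP_VAT"), map);
    }
}

// Usage inside the product loop:
//   var byProduct = AttributeGrouping.GroupByProduct(attributi, row => /* build the attribute */);
//   var attributesForThisProduct = byProduct[id]; // empty sequence when the product has none
```

Indexing the lookup with an id that has no attributes yields an empty sequence rather than throwing, so the product loop needs no special case.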

Find matching records in DataTable as fast as possible

I have C# DataTables with very large numbers of rows, and in my importer app I must query these hundreds of thousands of times in a given import. So I'm trying to find the fastest possible way to search. Thus far I am puzzling over very strange results. First, here are 2 different approaches I have been experimenting with:
APPROACH #1
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
if (dt != null && dt.Rows.Count > 0)
return dt.Select($"{keyColumn} = '{SafeTrim(keyValue)}'").Count() > 0;
else
return false;
}
APPROACH #2
public static bool DoesRecordExist(string keyColumn, string keyValue, DataTable dt)
{
if (dt != null && dt.Rows.Count > 0)
{
int counter = dt.AsEnumerable().Where(r => string.Equals(SafeTrim(r[keyColumn]), keyValue, StringComparison.CurrentCultureIgnoreCase)).Count();
return counter > 0;
}
else
return false;
}
In a mock test I run each method 15,000 times, handing in hardcoded data. This is apples-to-apples, a fair test. Approach #1 is dramatically faster. But in actual app execution, Approach #1 is dramatically slower.
Why the counterintuitive results? Is there some other faster way to query datatables that I haven't tried?
EDIT: The reason I use datatables as opposed to other types of collections is because all my datasources are either MySQL tables or CSV files. So datatables seemed like a logical choice. Some of these tables contain 10+ columns, so different types of collections seemed an awkward match.

If you want a faster access and still want to stick to the DataTables, use a dictionary to store the row numbers for given keys. Here I assume that each key is unique in the DataTable. If not, you would have to use a Dictionary<string, List<int>> or Dictionary<string, HashSet<int>> to store the indexes.
var indexes = new Dictionary<string, int>();
for (int i = 0; i < dt.Rows.Count; i++) {
indexes.Add((string)dt.Rows[i][keyColumn], i);
}
Now you can access a row in a super fast way with
var row = dt.Rows[indexes[theKey]];
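For the non-unique case mentioned above, the index could look roughly like this. It is a sketch only; the RowIndex class and its method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Data;

public static class RowIndex
{
    // Maps each key value to the positions of all rows that carry it.
    public static Dictionary<string, List<int>> Build(DataTable dt, string keyColumn)
    {
        var indexes = new Dictionary<string, List<int>>();
        for (int i = 0; i < dt.Rows.Count; i++)
        {
            string key = dt.Rows[i][keyColumn].ToString();
            List<int> positions;
            if (!indexes.TryGetValue(key, out positions))
            {
                positions = new List<int>();
                indexes.Add(key, positions);
            }
            positions.Add(i);
        }
        return indexes;
    }

    // Existence check against the prebuilt index instead of scanning the table.
    public static bool DoesRecordExist(Dictionary<string, List<int>> indexes, string keyValue)
    {
        return indexes.ContainsKey(keyValue);
    }
}
```

The stored positions are only valid as long as no rows are inserted or removed, so the index has to be rebuilt after such changes.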

I have a very similar issue except that I need the actual First Occurrence of a matching row.
Using the .Select.FirstOrDefault (Approach 1) takes 38 minutes to run.
Using the .Where.FirstOrDefault (Approach 2) takes 6 minutes to run.
In a similar situation where I didn't need the FirstOrDefault, but just needed to find and work with the uniquely matching record, what I found to be the fastest by far is to use a HashTable where the Key is the Combined Values of any Columns you are trying to match, and the Value is the Data Row itself. Finding a Match is near instant.
The Function is
public Hashtable ConvertToLookup(DataTable myDataTable, params string[] pKeyFieldNames)
{
Hashtable myLookup = new Hashtable(StringComparer.InvariantCultureIgnoreCase); //Makes the Key Case Insensitive
foreach (DataRow myRecord in myDataTable.Rows)
{
string myHashKey = "";
foreach (string strKeyFieldName in pKeyFieldNames)
{
myHashKey += Convert.ToString(myRecord[strKeyFieldName]).Trim();
}
if (myLookup.ContainsKey(myHashKey) == false)
{
myLookup.Add(myHashKey, myRecord);
}
}
return myLookup;
}
The usage is...
//Build the Lookup Table
Hashtable myLookUp = ConvertToLookup(myDataTable, "Col1Name", "Col2Name");
//Use it
if (myLookUp.ContainsKey(mySearchForValue) == true)
{
DataRow myRecord = (DataRow)myLookUp[mySearchForValue];
}
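Two details of this approach are easy to trip over: the search value has to be built exactly like the stored key, and plain concatenation makes ("ab", "c") and ("a", "bc") collide. A possible variation, sketched under assumptions, uses a typed Dictionary and a separator between the parts. The RowLookup class, the BuildKey helper and the choice of separator are illustrative, and StringComparer.OrdinalIgnoreCase is swapped in for the culture-sensitive comparer so the separator character is never ignored during comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class RowLookup
{
    // U+001F (unit separator) is an arbitrary choice, assumed not to occur in the data.
    private const string Separator = "\u001F";

    // Builds the key the same way for both storing and searching.
    public static string BuildKey(params string[] values)
    {
        var trimmed = new string[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            trimmed[i] = (values[i] ?? string.Empty).Trim();
        }
        return string.Join(Separator, trimmed);
    }

    public static Dictionary<string, DataRow> Build(DataTable table, params string[] keyColumns)
    {
        var lookup = new Dictionary<string, DataRow>(StringComparer.OrdinalIgnoreCase);
        foreach (DataRow row in table.Rows)
        {
            var values = new string[keyColumns.Length];
            for (int i = 0; i < keyColumns.Length; i++)
            {
                values[i] = Convert.ToString(row[keyColumns[i]]);
            }
            string key = BuildKey(values);
            if (!lookup.ContainsKey(key))
            {
                lookup.Add(key, row); // keep the first occurrence, as in the function above
            }
        }
        return lookup;
    }
}
```

A search then becomes lookup.TryGetValue(RowLookup.BuildKey(value1, value2), out row), which also avoids the cast that Hashtable requires.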

All. BINGO! Wanted to share as a different answer just because my previous might be suited for a bit of a different approach. In this scenario, I was able to go from 8 MINUTES, down to 6 SECONDS, not using either approaches...
Again, the key is a HashTable, or in my case a dictionary because I had multiple records. To recap, for me, I needed to delete 1 row from my DataTable for every matching record I found in another DataTable. With the goal that in the end, my First Datatable only contained the "Missing" records.
This uses a different function...
// -----------------------------------------------------------
// Creates a Dictionary with Grouping Counts from a DataTable
public Dictionary<string, Int32> GroupBy(DataTable myDataTable, params string[] pGroupByFieldNames)
{
Dictionary<string, Int32> myGroupBy = new Dictionary<string, Int32>(StringComparer.InvariantCultureIgnoreCase); //Makes the Key Case Insensitive
foreach (DataRow myRecord in myDataTable.Rows)
{
string myKey = "";
foreach (string strGroupFieldName in pGroupByFieldNames)
{
myKey += Convert.ToString(myRecord[strGroupFieldName]).Trim();
}
if (myGroupBy.ContainsKey(myKey) == false)
{
myGroupBy.Add(myKey, 1);
}
else
{
myGroupBy[myKey] += 1;
}
}
return myGroupBy;
}
Now.. say you have a Table of Records that you want to use as the "Match Values" based on Col1 and Col2
Dictionary<string, Int32> myQuickLookUpCount = GroupBy(myMatchTable, "Col1", "Col2");
And now the magic. We are looping through your Primary Table, and removing 1 instance of a record for each instance in the Matching Table. This is the part that took 8 minutes with Approach #2, or 38 minutes with Approach #1.. but now only takes seconds.
myDataTable.AcceptChanges(); //Trick that allows us to delete during a ForEach!
foreach (DataRow myDataRow in myDataTable.Rows)
{
//Grab the Key Values
string strKey1Value = Convert.ToString(myDataRow ["Col1"]);
string strKey2Value = Convert.ToString(myDataRow ["Col2"]);
if (myQuickLookUpCount.TryGetValue(strKey1Value + strKey2Value, out Int32 intTotalCount) == true && intTotalCount > 0)
{
myDataRow.Delete(); //Mark our Row to Delete
myQuickLookUpCount [strKey1Value + strKey2Value ] -= 1; //Decrement our Counter
}
}
myDataTable.AcceptChanges(); //Commits our changes and actually deletes the rows.

dynamic datatable sorting in ascending or descending

I have created a dynamic table:
DataTable date = new DataTable();
date.Columns.Add("date1");
and filled the column "date1" with dates:
date1(Column name)
05-07-2013
10-07-2013
09-07-2013
02-07-2013
Now I want the data in this dynamic table to be sorted in ascending or descending order.
For example:
date1(Column name)
02-07-2013
05-07-2013
09-07-2013
10-07-2013

This cannot be done with the original data table. However you can create a new, sorted one:
DataView view = date.DefaultView;
view.Sort = "date1 ASC";
DataTable sortedDate = view.ToTable();
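One caveat: the question adds the column without a type, so it holds strings, and string sorting is alphabetical. That happens to give the right order for the sample values because they all fall in the same month, but "10-06-2013" would sort after "05-07-2013". A possible workaround, sketched here under the assumption that the values are always in dd-MM-yyyy format (the class and method names are made up), is to copy the values into a DateTime column before sorting:

```csharp
using System;
using System.Data;
using System.Globalization;

public static class DateSorting
{
    // Copies string dates into a DateTime-typed column so the sort is chronological.
    // The "dd-MM-yyyy" format is an assumption based on the sample values.
    public static DataTable SortByDate(DataTable source, string column, bool ascending)
    {
        var typed = new DataTable();
        typed.Columns.Add(column, typeof(DateTime));
        foreach (DataRow row in source.Rows)
        {
            typed.Rows.Add(DateTime.ParseExact(
                row[column].ToString(), "dd-MM-yyyy", CultureInfo.InvariantCulture));
        }

        DataView view = typed.DefaultView;
        view.Sort = "[" + column + "] " + (ascending ? "ASC" : "DESC");
        return view.ToTable();
    }
}
```

Declaring the column as typeof(DateTime) when the table is first created would avoid the extra copy altogether.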

You can use DataTable.Select(filterExpression, sortExpression) method.
Gets an array of all DataRow objects that match the filter criteria,
in the specified sort order.
date.Select("", "YourColumn ASC");
or
date.Select("", "YourColumn DESC");
As an alternative, you can use DataView like;
DataView view = date.DefaultView;
view.Sort = "YourColumn ASC";
DataTable dt = view.ToTable();

Thought I would give in my two cents here. Instead of using a sorting algorithm which takes time and computational performance, why not instead reverse the way in which you are adding data to your data object.
This won't work for everyone's scenario - but for my own it worked out perfectly.
I had a database which listed items in ascending order, but for ease of use I needed to reverse the order in which people see the data (DESC), so that the newest input shows at the top rather than the bottom of the list.
So I just changed my for loop: instead of counting up from 0, it starts at the last index (Rows.Count - 1, to stay within bounds) and stops after index 0.
private Dictionary<string, string> GetComboData(string table, int column, bool id, int idField = 0)
{
SqlClass sql = new SqlClass(database);
Dictionary<string, string> comboBoxData = new Dictionary<string, string>();
if (sql.connectedToServer)
{
sql.SelectResults(SQLCommands.Commands.SelectAll(table));
for (int i = sql.table.Rows.Count-1; i >= 0; i--)
{
string tool = sql.table.Rows[i].ItemArray.Select(x => x.ToString()).ToArray()[column];
string ID = sql.table.Rows[i].ItemArray.Select(x => x.ToString()).ToArray()[idField];
comboBoxData.Add(ID, tool);
}
}
return comboBoxData;
}

Using OrderByDescending():
@foreach (var rca in Model.OrderByDescending(x=>x.Id))
{
<tr class="heading">
<td>@rca.PBINo</td>
<td>@rca.Title</td>
<td>@rca.Introduction</td>
<td>@rca.CustomerImpact</td>
<td>@rca.RootCauseAnalysis</td>
</tr>
}

Is there a way to dynamically create an object at run time in .NET 3.5?

I'm working on an importer that takes tab delimited text files. The first line of each file contains 'columns' like ItemCode, Language, ImportMode etc and there can be varying numbers of columns.
I'm able to get the names of each column, whether there's one or 10 and so on. I use a method to achieve this that returns List<string>:
private List<string> GetColumnNames(string saveLocation, int numColumns)
{
var data = (File.ReadAllLines(saveLocation));
var columnNames = new List<string>();
for (int i = 0; i < numColumns; i++)
{
var cols = from lines in data
.Take(1)
.Where(l => !string.IsNullOrEmpty(l))
.Select(l => l.Split(delimiter.ToCharArray(), StringSplitOptions.None))
.Select(value => string.Join(" ", value))
let split = lines.Split(' ')
select new
{
Temp = split[i].Trim()
};
foreach (var x in cols)
{
columnNames.Add(x.Temp);
}
}
return columnNames;
}
If I always knew what columns to be expecting, I could just create a new object, but since I don't, I'm wondering is there a way I can dynamically create an object with properties that correspond to whatever GetColumnNames() returns?
Any suggestions?

For what it's worth, here's how I used DataTables to achieve what I wanted.
// saveLocation is file location
// numColumns comes from another method that gets number of columns in file
var columnNames = GetColumnNames(saveLocation, numColumns);
var table = new DataTable();
foreach (var header in columnNames)
{
table.Columns.Add(header);
}
// itemAttributeData is the file split into lines
foreach (var row in itemAttributeData)
{
table.Rows.Add(row);
}
Although there was a bit more work involved to be able to manipulate the data in the way I wanted, Karthik's suggestion got me on the right track.

You could create a dictionary of strings where the first string references the "properties" name and the second string its characteristic.
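A minimal sketch of that idea, using only what is available in .NET 3.5: each data line becomes a dictionary keyed by the column names from the header line. The DelimitedRows class and the ToRecord method are names invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class DelimitedRows
{
    // Turns a header line and a data line into a column-name -> value dictionary.
    public static Dictionary<string, string> ToRecord(string headerLine, string dataLine, char delimiter)
    {
        string[] names = headerLine.Split(delimiter);
        string[] values = dataLine.Split(delimiter);

        var record = new Dictionary<string, string>();
        for (int i = 0; i < names.Length; i++)
        {
            // Missing trailing values become empty strings.
            record[names[i].Trim()] = i < values.Length ? values[i].Trim() : string.Empty;
        }
        return record;
    }
}
```

A value is then read with record["ItemCode"], whatever columns the file happened to contain.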

Ensure a datatable is ordered by a value

I have a complex algorithm which I am not going to explain here. The code pasted below is doing some processing for each row, but I need to ensure that the table is ordered by a field different than the Primary Key.
I need to do this in this code, not in SQL, or in stored procedures; it needs to be done in .net just before the foreach.
No LINQ is allowed; it's .NET 2.0.
Thanks, your help is appreciated.
List<int> distinctREFMDossierIds = GetREFMDossierIdsFromBookings();
foreach (int refmDossierId in distinctREFMDossierIds)
{
bool errorsFoundInDetails = false;
bool errorsFoundInHeaders = false;
wingsBookingInterfaceIdswithErrors.Clear();
dicRows.Clear();
sbWingsBookingInterfaceIds= new StringBuilder();
YBooking booking = new YBooking();
foreach (UC090_WingsIntegrationDataSet.WingsBookingInterfaceRow row in _uc090_WingsIntegrationDataSet.WingsBookingInterface.Rows)
{
//code
}

You can use LINQ:
foreach(var row in _uc090_WingsIntegrationDataSet.WingsBookingInterface
.OrderBy(r => r.Something))

You can sort a DataTable like this:
DataTable dt = new DataTable();
dt.DefaultView.Sort = <Sort expression>;
dt = dt.DefaultView.ToTable();
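Since the question rules out LINQ, another option that should work on .NET 2.0 is DataTable.Select with an empty filter and a sort expression, which returns the existing rows in sorted order without copying the table. A small sketch follows; the SortedRows class and the column name in the usage comment are placeholders.

```csharp
using System.Data;

public static class SortedRows
{
    // Returns the table's rows ordered by the given sort expression, e.g. "SomeColumn ASC".
    // Uses only DataTable.Select, so no LINQ is involved.
    public static DataRow[] InOrder(DataTable table, string sortExpression)
    {
        return table.Select(string.Empty, sortExpression);
    }
}

// Usage just before the loop:
//   foreach (DataRow row in SortedRows.InOrder(table, "SomeColumn ASC"))
//   {
//       // process the row
//   }
```

The returned array holds the table's own DataRow objects, so with a typed DataSet each element can presumably be cast back to the generated row type inside the loop.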

WingsBookingInterface.Rows.OrderBy(item => item.columnName);

You can sort a collection ( for example a List<> ) with the OrderBy extension method.
