Compare two DataTable Rows remove identical rows besides one. - c#

I want to compare the rows of two DataTables. Where rows match exactly, I want to remove all but one.
I've figured out how to compare two rows of data, but I'm not sure of the best way to return the cleaned-up version without duplicates.
The tables in my program are pulled from a database, so I simplified them for the example.
Here's what I've worked out so far.
var table1 = new DataTable();
var table2 = new DataTable();
foreach (DataRow row1 in table1.Rows)
foreach (DataRow row2 in table2.Rows)
{
var array1 = row1.ItemArray;
var array2 = row2.ItemArray;
if (array1.SequenceEqual(array2))
{
// store the unique elements within a new list?
// remove duplicates and return the remainder?
}
}
I thought using the Intersect() method might be an option as well.

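Put the rows into a HashSet<T> of your desired type. A HashSet removes duplicates automatically, because by definition it cannot contain duplicate entries. Note that for DataRow the default equality is reference equality, so pass a value-based comparer such as DataRowComparer.Default.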
More info:
https://www.dotnetperls.com/hashset
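As a rough sketch of the HashSet idea applied to DataRows: the helper name DistinctRows and the sample column "Name" are made up for illustration, and the code assumes both tables share the same column layout. On .NET Framework, DataRowComparer needs a reference to System.Data.DataSetExtensions.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class RowDeduper
{
    // Returns the rows of both tables with exact duplicates collapsed to one.
    // Hypothetical helper; assumes both tables have the same columns.
    public static List<DataRow> DistinctRows(DataTable first, DataTable second)
    {
        // DataRowComparer.Default compares rows by column values, not by reference.
        var unique = new HashSet<DataRow>(DataRowComparer.Default);
        foreach (DataRow row in first.Rows) unique.Add(row);
        foreach (DataRow row in second.Rows) unique.Add(row);
        return unique.ToList();
    }

    public static void Main()
    {
        var table1 = new DataTable();
        table1.Columns.Add("Name", typeof(string));
        table1.Rows.Add("Alice");
        table1.Rows.Add("Bob");

        var table2 = table1.Clone(); // same schema, no rows
        table2.Rows.Add("Bob");
        table2.Rows.Add("Carol");

        // "Bob" appears in both tables but is returned only once.
        foreach (DataRow row in DistinctRows(table1, table2))
            Console.WriteLine(row["Name"]);
    }
}
```

The same result should be reachable in one expression with table1.AsEnumerable().Union(table2.AsEnumerable(), DataRowComparer.Default), if you prefer LINQ.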

Related

Populate List<string> from DataTable column

I have a datatable dt with a single column and a list of strings. If I want to change this to a list (for example, to make it easier to compare with another list) I would default to something like:
var ls = new List<string>();
foreach (DataRow dr in dt.Rows)
{
ls.Add(dr["COLUMN_NAME"].ToString());
}
I don't think it's possible to get more efficient than that, but is there a shorter way to write this, maybe using LINQ?
As mentioned in the comments, the LINQ way might be shorter in code, but it is not more efficient.
Anyway, here is the one-liner LINQ version.
var list = new List<string>(dt.Rows.Cast<DataRow>().Select(r => r["COLUMN_NAME"].ToString()));
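A possible variation, sketched under the assumption that the column really is typed as string: Field<string> from System.Data.DataSetExtensions returns null for DBNull cells, whereas ToString() turns them into empty strings. The helper name ReadColumn is invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class ColumnReader
{
    // Reads one string-typed column into a list; DBNull cells come back as null.
    // Hypothetical helper, not part of the original answer.
    public static List<string> ReadColumn(DataTable dt, string columnName)
    {
        return dt.AsEnumerable()
                 .Select(row => row.Field<string>(columnName))
                 .ToList();
    }

    public static void Main()
    {
        var dt = new DataTable();
        dt.Columns.Add("COLUMN_NAME", typeof(string));
        dt.Rows.Add("a");
        dt.Rows.Add("b");

        Console.WriteLine(string.Join(",", ReadColumn(dt, "COLUMN_NAME"))); // prints a,b
    }
}
```

If the column is not a string column, Field<string> throws an InvalidCastException, so the ToString() version above is the safer choice for mixed types.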

Save row # to list when conditions are met

I have a datatable imported from a csv. What I'm trying to do is compare all of the rows to each other to find duplicates. In the case of duplicates I am going to add the row # to a list, then write the list to an array and deal with the duplicates after that.
//find duplicate rows and merge them.
foreach (DataRow dr in dt.Rows)
{
//loop again to compare rows
foreach (DataRow dx in dt.Rows)
{
if (dx[0]==dr[0] && dx[1]==dr[1] && dx[2] == dr[2] && dx[3] == dr[3] && dx[4] == dr[4] && dx[5] == dr[5] && dx[7] == dr[7])
{
dupeRows.Add(dx.ToString());
}
}
}
For testing I have added:
listBox1.Items.AddRange(dupeRows.ToArray());
which simply outputs System.Data.DataRow.
How do I store the duplicate row index ids?
The basic problem is that you saved a string describing the type of the row (which is what DataRow.ToString() returns by default) at the point where you decided the row was a duplicate.
Assuming you've read your CSV straight in with some library or driver rather than line by line (which would have been a good time to dedupe), let's use a dictionary to dedupe:
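var d = new Dictionary<string, DataRow>();
foreach (DataRow ro in dataTable.Rows) // DataRow, not var: Rows is a non-generic collection
{
    // Form a key for the dictionary from the compared columns (column 6 is skipped, as in the question)
    string key = string.Format("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{7}", ro.ItemArray);
    d[key] = ro; // a later duplicate overwrites the earlier row stored under the same key
}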
That's it; at the end of this operation d.Values will be a deduplicated collection of DataRow. 1000 rows require 1000 operations, so this will likely be orders of magnitude faster than comparing every row to every other row, which needs a million comparisons for a thousand rows.
I've used tabs to separate the values when forming the key, assuming your data contains no tabs. For best reliability, use a separator character that does not appear in the data.
If you've read your CSV line by line and done a manual string split on commas (i.e. a primitive way of reading a CSV), you could do this operation at that point instead; after the split you have an array that can be used in place of ro.ItemArray. Process the entire file, creating a row (and adding it to the dictionary) only if d.ContainsKey returns false. If the dictionary already contains that key, skip the line rather than creating a row.
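The line-by-line variant described above might look something like this. It is only a sketch: LoadDistinct, the generated column names and the keyColumns parameter are invented for the example, a HashSet<string> stands in for the dictionary since only the keys are needed, and the naive Split(',') does not handle quoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class CsvLoader
{
    // Builds a DataTable from CSV lines, skipping any line whose compared
    // fields were already seen. Hypothetical helper for illustration.
    public static DataTable LoadDistinct(IEnumerable<string> lines, int columnCount, int[] keyColumns)
    {
        var table = new DataTable();
        for (int i = 0; i < columnCount; i++)
            table.Columns.Add("Column" + i, typeof(string));

        var seen = new HashSet<string>();
        foreach (string line in lines)
        {
            string[] fields = line.Split(',');
            if (fields.Length != columnCount) continue; // skip malformed lines

            // Tab-separated key, assuming the data itself contains no tabs.
            var keyParts = new string[keyColumns.Length];
            for (int i = 0; i < keyColumns.Length; i++)
                keyParts[i] = fields[keyColumns[i]];
            string key = string.Join("\t", keyParts);

            if (seen.Add(key))          // Add returns false when the key is already present
                table.Rows.Add(fields); // one value per column, in column order
        }
        return table;
    }

    public static void Main()
    {
        var lines = new[] { "1,apple,red", "2,pear,green", "1,apple,red" };
        DataTable table = LoadDistinct(lines, 3, new[] { 0, 1, 2 });
        Console.WriteLine(table.Rows.Count); // prints 2
    }
}
```

In a real program the lines would presumably come from File.ReadLines or a proper CSV reader rather than an in-memory array.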
The output you are seeing (System.Data.DataRow) is expected. DataRow does not override ToString(), so the framework calls the base class implementation, System.Object.ToString(), which by default returns the fully qualified type name of the object.
I see three solutions here:
First, if possible, read the DataTable into custom objects (such as MyDataTable and MyDataRow) so that you can write your own ToString(), like below:
public class MyDataRow
{
public override string ToString()
{
return "This is my custom data row formatted string";
}
}
Second, in the loop, when you find a duplicate row, add the index or id (some sort of primary key) of dx to the list, then use another loop to retrieve the duplicates.
Third, use the dictionary approach described by Caius Jard.
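For the index-based option, a minimal sketch could be the following. FindDuplicateIndexes is a made-up name, the tab separator assumes the data contains no tabs, and a row counts as a duplicate only if an earlier row has the same values in the compared columns.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DuplicateFinder
{
    // Returns the zero-based indexes of rows that repeat an earlier row,
    // comparing only the given column ordinals. Hypothetical helper.
    public static List<int> FindDuplicateIndexes(DataTable dt, params int[] columns)
    {
        var seen = new HashSet<string>();
        var dupeRows = new List<int>();
        for (int i = 0; i < dt.Rows.Count; i++)
        {
            DataRow row = dt.Rows[i];
            // Compare by value: build a string key from the selected cells.
            string key = string.Join("\t", columns.Select(c => Convert.ToString(row[c])));
            if (!seen.Add(key)) // Add returns false when the key was seen before
                dupeRows.Add(i);
        }
        return dupeRows;
    }

    public static void Main()
    {
        var dt = new DataTable();
        dt.Columns.Add("Letter", typeof(string));
        dt.Columns.Add("Number", typeof(string));
        dt.Rows.Add("a", "1");
        dt.Rows.Add("b", "2");
        dt.Rows.Add("a", "1");

        Console.WriteLine(string.Join(",", FindDuplicateIndexes(dt, 0, 1))); // prints 2
    }
}
```

A single pass with an index also avoids the nested loop from the question, where every row would match itself.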

Cannot copy data from query to datatable

I have a small problem, which I just cannot find how to fix it.
For my data table, Dictionar.DtDomenii, I need to copy the unique data from my other data table, Dictionar.Dt.
I wrote a query, but when using query.CopyToDataTable() to copy the data into my DtDomenii table, the "CopyToDataTable" function does not show...
Am I doing something wrong? Is there an easier way to copy distinct data (categories from my example) from one data table to another?
PS: I've already read the information from MSDN https://msdn.microsoft.com/en-us/library/bb386921%28v=vs.110%29.aspx
void LoadCategories()
{
var query = (from cat in Dictionar.dt.AsEnumerable()
select new
{
categorie = cat.Field<string>("Categoria")
}).Distinct();
// This statement does not work:
Dictionar.dtDomenii = query.CopyToDataTable();
}
CopyToDataTable is an extension method on IEnumerable<T> where T is DataRow, so only collections of DataRows can use it. For example:
DataTable table = new DataTable();
table.AsEnumerable().CopyToDataTable(); // this compiles (it throws at run time if the table has no rows)
List<DataRow> dataRows = new List<DataRow>();
dataRows.CopyToDataTable(); // this also compiles
List<string> strings = new List<string>();
strings.CopyToDataTable(); // this does not compile
The select new... part of your query converts the DataRows into anonymous objects. You need to convert those objects back into DataRows before you can call CopyToDataTable.
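You might have better luck doing something like this. Note that Distinct() needs DataRowComparer.Default here; without a comparer it compares DataRows by reference and removes nothing:
DataTable copy = Dictionar.dt
    .AsEnumerable()                    // now an enumerable of DataRow
    .Distinct(DataRowComparer.Default) // remove duplicates by value, still an enumerable of DataRow
    .CopyToDataTable();                // done!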
You can also make a complete copy of the table with Dictionar.dt.Copy(), then remove the duplicate rows manually.
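If all you need is the distinct values of one column (the categories in the question), DataView.ToTable with distinct set to true may be the shortest route. This is a sketch only: the helper DistinctColumn and the second sample column "Cuvant" are assumptions made for the example.

```csharp
using System;
using System.Data;

public static class CategoryLoader
{
    // Returns a new table holding one row per distinct value of the given column.
    // Hypothetical helper for illustration.
    public static DataTable DistinctColumn(DataTable source, string columnName)
    {
        // First argument: distinct = true. Remaining arguments: the columns to keep.
        return source.DefaultView.ToTable(true, columnName);
    }

    public static void Main()
    {
        var dt = new DataTable();
        dt.Columns.Add("Categoria", typeof(string));
        dt.Columns.Add("Cuvant", typeof(string)); // assumed extra column
        dt.Rows.Add("Animale", "caine");
        dt.Rows.Add("Animale", "pisica");
        dt.Rows.Add("Plante", "brad");

        DataTable categories = DistinctColumn(dt, "Categoria");
        Console.WriteLine(categories.Rows.Count); // prints 2
    }
}
```

Keep in mind that DefaultView applies whatever sort or filter has been set on it, so this assumes the view is in its default state.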

replace List.foreach to LINQ

I'm new to LINQ and doing some experiments with it.
Sorry if it is a duplicate, but I can't seem to find a proper guide (for me) to it.
I want to replace this code :
DataTable table
List<string> header = new List<string>();
table.Columns.Cast<DataColumn>().ToList().ForEach(col => header.Add(col.ColumnName));
with something LINQ like:
var LINQheader = from mycol in table.Columns select mycol.ColumnName;
LINQheader.tolist();
but it doesn't even compile.
What I want is not a one-line solution; I would like some logic to understand how to construct it in more complicated environments (like choosing many nodes in XML with some logic).
Here is the original code:
table.Columns.Cast<DataColumn>().ToList().ForEach(col => header.Add(col.ColumnName));
Why is Cast used?
Because it lets you treat the items of the DataColumnCollection as DataColumn rather than object.
Why is ToList used?
Because it converts your IEnumerable to a List, which lets you call ForEach; that method exists only on the List class.
Why is ForEach used?
Because it lets you run an action for each element of the list (in your case, adding each column's name to another list, header).
Simplified version:
Now assume you want to add column names to header only where they start with "Id".
You can write something like this:
DataTable table = new DataTable();
List<string> header = new List<string>();
foreach (DataColumn col in table.Columns)
{
if (col.ColumnName.StartsWith("Id")) // you can remove this line if you want to add all of them
header.Add(col.ColumnName);
}
You can also use this:
table.Columns.Cast<DataColumn>()
    .ToList()
    .ForEach(col =>
    {
        if (col.ColumnName.StartsWith("Id"))
            header.Add(col.ColumnName);
    });
or
var headers = table.Columns.Cast<DataColumn>()
.Where(col => col.ColumnName.StartsWith("Id"))
.Select(col => col.ColumnName);
header.AddRange(headers);
You can use Enumerable.Aggregate() for this:
var header = table.Columns.Cast<DataColumn>().Aggregate(new List<string>(), (list, col) => { list.Add(col.ColumnName); return list; });
In general, Linq allows for retrieval and transformation of sequences of data from data sources. What you want to do in this question is to iterate over a sequence and return an immediate result. That isn't the primary focus of Linq, but there are methods that perform tasks like this, including Aggregate(), Average(), Count(), Max() and so on.
var LINQheader = from mycol in table.column select mycol.ColumnName;
LINQheader.tolist();
This will not compile, because DataTable has no property named column, only Columns, and you have to use the .Cast() method because DataColumnCollection does not implement IEnumerable<T> (see @Uriil's answer).
Try this:
var LINQheader = from mycol in table.Columns.Cast<DataColumn>()
                 select mycol.ColumnName;
var headerList = LINQheader.ToList();
If you want to wrap it in an extension method, you can do it like this:
public static IEnumerable<string> GetHeaderColumns (this DataTable dataTable)
{
if (dataTable == null || dataTable.Columns.Count == 0) // Columns has no LINQ Any() without Cast
{
yield break;
}
foreach (var col in dataTable.Columns.Cast<DataColumn>())
{
yield return col.ColumnName;
}
}
static void Main(string[] args)
{
DataTable tbl = new DataTable();
tbl.Columns.Add("A");
tbl.Columns.Add("B");
var p = from DataColumn col in tbl.Columns select col.ColumnName;
foreach(string a in p)
{
Console.WriteLine(a);
}
}
Here is a little code example. If you want a List<string>, use ToList().
EDIT:
As @Grundy says, you are missing the type of the range variable, which is DataColumn.
List<string> columnList = (from DataColumn mycol in table.Columns select mycol.ColumnName).ToList();
This will be your one-liner.
Why not a simple select, like this:
DataTable table;
IEnumerable<string> header = from mycol in table.Columns.Cast<DataColumn>()
select mycol.ColumnName;
You have some problems that require workarounds:
ForEach is a List-specific method, so it cannot be translated into LINQ.
LINQ is for data selection and aggregation, not for data modification.
table.Columns returns a DataColumnCollection, which does not implement IEnumerable<T>, so you will have to cast it anyway:
var LINQheader = from mycol in table.Columns.Cast<DataColumn>()
                 select mycol.ColumnName;

Order DataTable rows with Dictionary

I'm trying to pass an ArrayList into a DataRow object, the idea being to import data into a database from a CSV.
Previously in the file, a Dictionary<string,int> has been created, with the column name as the Key, and the position index as the corresponding value.
I was planning on using this to create a temporary DataTable for each record to aid importing into the DB. My original idea was something along the lines of:
private DataRow ArrayListToDataRow(ArrayList data, Dictionary<string,int> columnPositions)
{
DataTable dt = new DataTable();
DataColumn dc = new DataColumn();
for (i=0;i<=data.Count;i++)
{
dc.ColumnName = columnPositions.Keys[i];
dt.Columns.Add(dc);
dt.Columns[columnPositions.Keys[i]].SetOrdinal(columnPositions(columnPositions.Keys[i]);
}
//TODO Add data to row
}
But of course, the keys aren't indexable.
Does anybody have an idea on how this could be achieved?
Since the size of data should be the same as the size of your columnPositions, you could try using a foreach over your dictionary instead of a for loop.
If you want to access your dictionary values based on a sortable index, you would need to change it to
Dictionary<int, string>
Which seems to make more sense, as you seem to want to read them in that order.
If you cannot change the dictionary, you can do something like this
var orderedPositions = columnPositions.OrderBy(x => x.Value);
foreach(var position in orderedPositions)
{
// do your stuff using position.Key and position.Value
}
OrderBy comes from LINQ, so you will need to add
using System.Linq;
to your file.
By ordering columnPositions by value (the column index) instead of relying on the dictionary's enumeration order (which is not guaranteed), you can loop through them in the order you presumably want (seeing as you were using a for loop and trying to get the next columnPosition each time).
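Putting that together with the method from the question, one possible shape is sketched below. It assumes that data has exactly one value per dictionary entry, that the positions run from 0 to n-1 without gaps, and that plain string-typed columns are acceptable; the sample keys and values are invented.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class RowBuilder
{
    // Builds a one-row table whose columns follow the positions in the dictionary.
    // Assumes data.Count == columnPositions.Count and positions 0..n-1.
    public static DataRow ArrayListToDataRow(ArrayList data, Dictionary<string, int> columnPositions)
    {
        var dt = new DataTable();

        // Adding columns in ascending position order makes each ordinal match its position,
        // so no SetOrdinal call is needed.
        foreach (var position in columnPositions.OrderBy(x => x.Value))
            dt.Columns.Add(position.Key);

        DataRow row = dt.NewRow();
        row.ItemArray = data.ToArray(); // values are in the same order as the columns
        dt.Rows.Add(row);
        return row;
    }

    public static void Main()
    {
        var positions = new Dictionary<string, int> { { "Name", 1 }, { "Id", 0 } };
        var data = new ArrayList { "42", "Alice" };

        DataRow row = ArrayListToDataRow(data, positions);
        Console.WriteLine(row["Name"]); // prints Alice
    }
}
```

Creating a new DataTable per record works but is fairly wasteful; building the table once and adding a row per record would likely be the better design for a full import.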
