Common and new rows in a DataTable - C#

I have two DataTables, as follows:
1) dtExistingZipCodeInDB (data from database)
2) dtCSVExcelSource (data from csv source which is to be processed)
I have two requirements
1) List all the retired zip codes (zip codes that are present in dtExistingZipCodeInDB but not in dtCSVExcelSource)
2) UnChanged zip codes (zip codes that are present both in dtExistingZipCodeInDB and dtCSVExcelSource)
I can use the Merge to get the retired zip codes. How do I get the unchanged zip codes?
Framework: .NET 3.0
// Note: dtExistingZipCodeInDB and dtCSVExcelSource have the same columns
dtCSVExcelSource.Merge(dtExistingZipCodeInDB);
DataTable dtRetiredZipDataTable = dtCSVExcelSource.GetChanges();
string retiredZipCodes = GetStringFromDataTable(dtRetiredZipDataTable, "ZipCode");
Thanks
Lijo

With the .NET 3.0 requirement, the Intersect LINQ extension method is not available, but you can provide your own extension method.
All you need is the MatchingRows extension method (see below in the demo code) and then do:
IEnumerable<DataRow> unchangedZipCodes = dtExistingZipCodeInDB.MatchingRows(dtCSVExcelSource, "ZipCode");
Then you can loop over unchangedZipCodes, which will contain only those rows with ZipCodes in common between dtExistingZipCodeInDB and dtCSVExcelSource.
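For example, to build a comma-separated string like the asker's GetStringFromDataTable produces (that helper isn't shown in the question, so this is only a sketch, and it works on .NET 3.0 without LINQ):
List<string> zipList = new List<string>();
foreach (DataRow row in unchangedZipCodes)
    zipList.Add(row["ZipCode"].ToString());
// String.Join(string, string[]) has been available since .NET 1.0.
string unchangedZipCodesCsv = String.Join(", ", zipList.ToArray());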
Below is demo code I wrote using LINQPad. I love LINQPad -- it's great for proof of concept or scratchpadding/sandboxing some code quickly. But it is not required for the solution to this question.
void Main()
{
string colname = "ZipCode";
var dt = new DataTable();
dt.Columns.Add(colname, typeof(string));
dt.Rows.Add(new [] { "12345" } );
dt.Rows.Add(new [] { "67890" } );
dt.Rows.Add(new [] { "40291" } );
var dt2 = new DataTable();
dt2.Columns.Add(colname, typeof(string));
dt2.Rows.Add(new [] { "12345" } );
dt2.Rows.Add(new [] { "83791" } );
dt2.Rows.Add(new [] { "24520" } );
dt2.Rows.Add(new [] { "48023" } );
dt2.Rows.Add(new [] { "67890" } );
/// With .NET 3.5 LINQ extensions, it can be done inline.
// var results = dt.AsEnumerable()
// .Select(r => r.Field<string>(colname))
// .Intersect(dt2.AsEnumerable()
// .Select(r => r.Field<string>(colname)));
// Console.Write(String.Join(", ", results.ToArray()));
var results = dt.MatchingRows(dt2, colname);
foreach (DataRow r in results)
Console.WriteLine(r[colname]);
}
public static class Extensions
{
/// With .NET 3.0 and no LINQ, create an extension method using yield.
public static IEnumerable<DataRow> MatchingRows(this DataTable dt, DataTable dtCompare, string colName)
{
foreach (DataRow r in dt.Rows)
{
// Quote the value so string columns (e.g. zip codes with leading zeros) compare correctly.
if (dtCompare.Select(String.Format("{0} = '{1}'", colName, r[colName])).Length > 0)
yield return r;
}
}
}
Outputs:
12345
67890
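For requirement 1 (the retired zip codes), the same idea can be inverted instead of relying on Merge/GetChanges. Below is a sketch, not part of the demo above, of a complementary NonMatchingRows extension that you could add alongside MatchingRows in the same static class; it is also .NET 3.0 friendly:
/// Sketch: the complement of MatchingRows; yields rows whose colName value
/// has no match in dtCompare (i.e. the retired zip codes).
public static IEnumerable<DataRow> NonMatchingRows(this DataTable dt, DataTable dtCompare, string colName)
{
    foreach (DataRow r in dt.Rows)
    {
        if (dtCompare.Select(String.Format("{0} = '{1}'", colName, r[colName])).Length == 0)
            yield return r;
    }
}
Called as dtExistingZipCodeInDB.NonMatchingRows(dtCSVExcelSource, "ZipCode"), it yields exactly the rows whose zip codes are in the database but not in the CSV source.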

Related

How to melt a DataTable in C# .NET (wide to long format)?

How to melt a DataTable in C# (wide to long format) as Python Pandas.melt does? https://pandas.pydata.org/docs/reference/api/pandas.melt.html
Is there any method already implemented? If not, how would the code for melting a DataTable look?
For example:
I have one DataTable in wide format, that is, it has one row per id and as many columns as variables. I would like to transform this DataTable to long format, with as many rows as there are combinations of id with each variable column.
If this is not clear enough, please visit the Pandas documentation, where it is explained more clearly (https://pandas.pydata.org/docs/reference/api/pandas.melt.html).
Note: I would like a solution that is DataTable independent, that is, that the solution is able to take parameters as id_vars, value_vars, etc... like Pandas.melt does
Any help is appreciated.
I don't know that Melt method, but according to the docs it seems to be an unpivot method:
public static DataTable MeltTable(DataTable inputTable, string outputColumn, params string[] unpivotColumns)
{
DataTable resultTable = new DataTable();
DataColumn col = new DataColumn(outputColumn, inputTable.Columns[outputColumn].DataType);
resultTable.Columns.Add(col);
resultTable.Columns.Add("Variable");
resultTable.Columns.Add("Value");
foreach(string unpivotColumn in unpivotColumns)
{
foreach (DataRow row in inputTable.Rows)
{
resultTable.Rows.Add(row[outputColumn], unpivotColumn, row[unpivotColumn]);
}
}
return resultTable;
}
You use it in this way:
DataTable table = new DataTable();
table.Columns.Add("Name");
table.Columns.Add("Course");
table.Columns.Add("Age", typeof(int));
table.Rows.Add("Tim", "Masters", 47);
table.Rows.Add("Bob", "Graduate", 19);
table.Rows.Add("Sheila", "Graduate", 20);
DataTable resultTable = MeltTable(table, "Name", "Course", "Age");
Result:
Name Variable Value
Tim Course Masters
Bob Course Graduate
Sheila Course Graduate
Tim Age 47
Bob Age 19
Sheila Age 20
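One caveat with the sketch above: columns added without an explicit type default to string, so the Age value 47 ends up stored as the text "47". If you want to keep the original cell types, a hypothetical tweak (not in the original answer) is to type the Value column as object:
// Optional tweak: preserve original cell types instead of implicit string conversion.
resultTable.Columns.Add("Variable", typeof(string));
resultTable.Columns.Add("Value", typeof(object));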
@TimSchmelter gave me the answer, but I modified it a little bit to make it a more general solution. Here's the code:
public static List<string> GetDifferenceColumns(DataTable dt, List<string> diffCols)
{
string[] columns = GetColumnsList(dt).ToArray();
IEnumerable<string> differenceColumns = from column in columns.Except(diffCols.ToArray()) select column;
return differenceColumns.ToList();
}
public static DataTable Melt(DataTable dt, List<string> idCols = null, List<string> varCols = null)
{
string errorPrefixString = "Error in DataProcessing Melt Method:\n";
bool varsColsIsNull = (varCols == null || varCols.Count == 0);
bool idColsIsNull = (idCols == null || idCols.Count == 0);
string varsName = "Variable";
string valueName = "Value";
if (dt.Rows.Count == 0)
{
throw new Exception(errorPrefixString + "DataTable is empty");
}
if (varsColsIsNull && idColsIsNull)
{
throw new Exception(errorPrefixString + "You should pass at least varCols or idCols");
}
if (varsColsIsNull)
{
varCols = GetDifferenceColumns(dt, idCols);
}
if (idColsIsNull)
{
idCols = GetDifferenceColumns(dt, varCols);
}
DataTable resultTable = new DataTable();
// Creating final columns of resultTable
foreach (string id in idCols)
{
resultTable.Columns.Add(id);
}
resultTable.Columns.Add(varsName);
resultTable.Columns.Add(valueName);
// Populating resultTable with the new rows
// generated by unpivoting varCols
foreach (string varCol in varCols)
{
foreach (DataRow row in dt.Rows)
{
DataRow resultRow = resultTable.NewRow();
foreach(string id in idCols)
{
resultRow[id] = row[id]; // create id cols
}
resultRow[varsName] = varCol;
resultRow[valueName] = row[varCol];
resultTable.Rows.Add(resultRow);
}
}
return resultTable;
}
How to use it:
DataTable dt = new DataTable();
dt.Columns.Add("Name");
dt.Columns.Add("Course");
dt.Columns.Add("Age");
dt.Rows.Add("Tim", "Masters", 47);
dt.Rows.Add("Bob", "Graduate", 19);
dt.Rows.Add("Sheila", "Graduate", 20);
List<string> varCols = new List<string> { "Course", "Age" };
DataTable finalDataTable = Melt(dt, varCols: varCols);
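Note that GetColumnsList is referenced by GetDifferenceColumns but not shown in the answer above; presumably it just returns the column names of the DataTable. A minimal sketch of such a helper might be:
// Hypothetical helper, assumed by GetDifferenceColumns above:
// returns the column names of a DataTable as a list of strings.
public static List<string> GetColumnsList(DataTable dt)
{
    List<string> columns = new List<string>();
    foreach (DataColumn col in dt.Columns)
    {
        columns.Add(col.ColumnName);
    }
    return columns;
}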

How do I find and list duplicate rows based on columns in a CSV file using C#? (Matching/grouping rows)

I converted an Excel file into a CSV file. The file contains over 100k records. I want to search for and return duplicate rows by searching the full name column. If the full names match up, I want the program to return the entire rows of the duplicates. I started with code that returns a list of full names, but that's about it.
I've listed the code that I have now below:
public static void readCells()
{
var dictionary = new Dictionary<string, int>();
Console.WriteLine("started");
var counter = 1;
var readText = File.ReadAllLines(path);
var duplicatedValues = dictionary.GroupBy(fullName => fullName.Value).Where(fullName => fullName.Count() > 1);
foreach (var s in readText)
{
var values = s.Split(new Char[] { ',' });
var fullName = values[3];
if (!dictionary.ContainsKey(fullName))
{
dictionary.Add(fullName, 1);
}
else
{
dictionary[fullName] += 1;
}
Console.WriteLine("Full Name Is: " + values[3]);
counter++;
}
}
}
I changed the dictionary to use the full name as key:
public static void readCells()
{
var dictionary = new Dictionary<string, List<List<string>>>();
Console.WriteLine("started");
var counter = 1;
var readText = File.ReadAllLines(path);
var duplicatedValues = dictionary.GroupBy(fullName => fullName.Value).Where(fullName => fullName.Count() > 1);
foreach (var s in readText)
{
List<string> values = s.Split(new Char[] { ',' }).ToList();
string fullName = values[3];
if (!dictionary.ContainsKey(fullName))
{
List<List<string>> newList = new List<List<string>>();
newList.Add(values);
dictionary.Add(fullName, newList);
}
else
{
dictionary[fullName].Add(values);
}
Console.WriteLine("Full Name Is: " + values[3]);
counter++;
}
}
I've found that using Microsoft's built-in TextFieldParser (which you can use in C# despite it being in the Microsoft.VisualBasic.FileIO namespace) can simplify reading and parsing of CSV files.
Using this type, your method ReadCells() can be modified into the following extension method:
using Microsoft.VisualBasic.FileIO;
public static class TextFieldParserExtensions
{
public static List<IGrouping<string, string[]>> ReadCellsWithDuplicatedCellValues(string path, int keyCellIndex, int nRowsToSkip /* = 0 */)
{
using (var stream = File.OpenRead(path))
using (var parser = new TextFieldParser(stream))
{
parser.SetDelimiters(new string[] { "," });
var values = parser.ReadAllFields()
// If your CSV file contains header row(s) you can skip them by passing a value for nRowsToSkip
.Skip(nRowsToSkip)
.GroupBy(row => row.ElementAtOrDefault(keyCellIndex))
.Where(g => g.Count() > 1)
.ToList();
return values;
}
}
public static IEnumerable<string[]> ReadAllFields(this TextFieldParser parser)
{
if (parser == null)
throw new ArgumentNullException();
while (!parser.EndOfData)
yield return parser.ReadFields();
}
}
Which you would call like:
var groups = TextFieldParserExtensions.ReadCellsWithDuplicatedCellValues(path, 3, 0); // keyCellIndex = 3 (full name), nRowsToSkip = 0
Notes:
TextFieldParser correctly handles cells with escaped, embedded commas which s.Split(new Char[] { ',' }) will not.
Since your CSV file has over 100k records I adopted a streaming strategy to avoid the intermediate string[] readText memory allocation.
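To actually print the duplicated rows from the returned groups, a short follow-up sketch might look like this (the column layout is assumed from the question, where index 3 is the full name):
// Each group's key is the duplicated full name; each element is one parsed CSV row.
foreach (IGrouping<string, string[]> group in groups)
{
    Console.WriteLine("Duplicated full name: " + group.Key);
    foreach (string[] row in group)
        Console.WriteLine("  " + String.Join(",", row));
}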
You can try out Cinchoo ETL - an open source library - to parse the CSV file and identify the duplicates with a few lines of code.
Sample CSV file (EmpDuplicates.csv) below
Id,Name
1,Tom
2,Mark
3,Lou
3,Lou
4,Austin
4,Austin
4,Austin
Here is how you can parse and identify the duplicate records
using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).Select(g => g.FirstOrDefault()))
Console.WriteLine(c.DumpAsJson());
}
Output:
{
"Id": 3,
"Name": "Lou"
}
{
"Id": 4,
"Name": "Austin"
}
Hope this helps.
For more detailed usage of this library, visit CodeProject article at https://www.codeproject.com/Articles/1145337/Cinchoo-ETL-CSV-Reader
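The snippet above prints only the first record of each duplicate group. If you want every duplicated row (as the question asks for entire rows), a small variation, sketched here on top of the same ChoCSVReader usage, flattens the groups with SelectMany:
using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
    // SelectMany keeps every row belonging to a duplicated Id, not just the first one.
    foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).SelectMany(g => g))
        Console.WriteLine(c.DumpAsJson());
}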

Read CSV file in DataGridView

I want to read a CSV file into a DataGridView. I would like to have a class and a function which reads the CSV, like this one:
class Import
{
public DataTable readCSV(string filePath)
{
DataTable dt = new DataTable();
using (StreamReader sr = new StreamReader(filePath))
{
string strLine = sr.ReadLine();
string[] strArray = strLine.Split(';');
foreach (string value in strArray)
{
dt.Columns.Add(value.Trim());
}
DataRow dr = dt.NewRow();
while (sr.Peek() >= 0)
{
strLine = sr.ReadLine();
strArray = strLine.Split(';');
dt.Rows.Add(strArray);
}
}
return dt;
}
}
and call it:
Import imp = new Import();
DataTable table = imp.readCSV(filePath);
foreach(DataRow row in table.Rows)
{
dataGridView.Rows.Add(row);
}
The result of this is: rows are created but there is no data in the cells!
First solution, using a little bit of LINQ
public DataTable readCSV(string filePath)
{
var dt = new DataTable();
// Creating the columns
File.ReadLines(filePath).Take(1)
.SelectMany(x => x.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
.ToList()
.ForEach(x => dt.Columns.Add(x.Trim()));
// Adding the rows
File.ReadLines(filePath).Skip(1)
.Select(x => x.Split(';'))
.ToList()
.ForEach(line => dt.Rows.Add(line));
return dt;
}
Below another version using foreach loop
public DataTable readCSV(string filePath)
{
var dt = new DataTable();
// Creating the columns
foreach(var headerLine in File.ReadLines(filePath).Take(1))
{
foreach(var headerItem in headerLine.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
{
dt.Columns.Add(headerItem.Trim());
}
}
// Adding the rows
foreach(var line in File.ReadLines(filePath).Skip(1))
{
dt.Rows.Add(line.Split(';'));
}
return dt;
}
First we use File.ReadLines, which returns an IEnumerable<string>, a collection of lines. We use Take(1) to get just the first row, which should be the header, and then we use SelectMany, which flattens the arrays of strings returned by the Split method into a single sequence; we then call ToList so we can use the ForEach method to add the columns to the DataTable.
To add the rows, we still use File.ReadLines, but now we Skip(1) to skip the header line; then we use Select to create a collection of string arrays (one per line), call ToList again, and finally call ForEach to add each row to the DataTable. File.ReadLines is available in .NET 4.0.
Note: File.ReadLines doesn't read all the lines up front; it returns an IEnumerable<string> and the lines are lazily evaluated, so only the first line is read twice (once for the header, once when it is skipped).
See the MSDN remarks
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
You can use the ReadLines method to do the following:
Perform LINQ to Objects queries on a file to obtain a filtered set of its lines.
Write the returned collection of lines to a file with the File.WriteAllLines(String, IEnumerable<String>) method, or append them to an existing file with the File.AppendAllLines(String, IEnumerable<String>) method.
Create an immediately populated instance of a collection that takes an IEnumerable<string> collection of strings for its constructor, such as an IList<T> or a Queue<T>.
This method uses UTF8 for the encoding value.
If you still have any doubt look this answer: What is the difference between File.ReadLines() and File.ReadAllLines()?
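As a quick illustration of that difference (a sketch, not from the original answer, assuming the filePath from the question and using System.Linq):
// ReadLines streams: only the lines actually enumerated are read from disk.
string header = File.ReadLines(filePath).First();

// ReadAllLines loads the whole file into memory before returning anything.
string[] allLines = File.ReadAllLines(filePath);
string headerAgain = allLines[0];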
Second solution using CsvHelper package
First, install this nuget package
PM> Install-Package CsvHelper
For a given CSV, we should create a class to represent it
CSV File
Name;Age;Birthdate;Working
Alberto Monteiro;25;01/01/1990;true
Other Person;5;01/01/2010;false
The class model is
public class Person
{
public string Name { get; set; }
public int Age { get; set; }
public DateTime Birthdate { get; set; }
public bool Working { get; set; }
}
Now let's use CsvReader to build the DataTable:
public DataTable readCSV(string filePath)
{
var dt = new DataTable();
var csv = new CsvReader(new StreamReader(filePath));
// Creating the columns
typeof(Person).GetProperties().Select(p => p.Name).ToList().ForEach(x => dt.Columns.Add(x));
// Adding the rows
csv.GetRecords<Person>().ToList().ForEach(line => dt.Rows.Add(line.Name, line.Age, line.Birthdate, line.Working));
return dt;
}
To create the columns in the DataTable we use a bit of reflection, and then we use the GetRecords method to add the rows to the DataTable.
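A usage sketch (assuming the dataGridView from the question): the empty cells reported in the question come from adding DataRow objects directly to dataGridView.Rows, so bind the returned DataTable instead:
// Bind the whole table; the grid generates its columns and rows from it.
DataTable table = readCSV(filePath);
dataGridView.DataSource = table;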
using Microsoft.VisualBasic.FileIO;
I would suggest the following. It should at least have the advantage that a ';' inside a field will be handled correctly, and it is not constrained to a particular CSV format.
public class CsvImport
{
public static DataTable NewDataTable(string fileName, string delimiters, bool firstRowContainsFieldNames = true)
{
DataTable result = new DataTable();
using (TextFieldParser tfp = new TextFieldParser(fileName))
{
tfp.SetDelimiters(delimiters);
// Get Some Column Names
if (!tfp.EndOfData)
{
string[] fields = tfp.ReadFields();
for (int i = 0; i < fields.Count(); i++)
{
if (firstRowContainsFieldNames)
result.Columns.Add(fields[i]);
else
result.Columns.Add("Col" + i);
}
// If first line is data then add it
if (!firstRowContainsFieldNames)
result.Rows.Add(fields);
}
// Get Remaining Rows
while (!tfp.EndOfData)
result.Rows.Add(tfp.ReadFields());
}
return result;
}
}
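A hypothetical call site, assuming a semicolon-delimited file with a header row like the one in the question:
// The first row of the CSV is used for the column names.
DataTable table = CsvImport.NewDataTable(filePath, ";", true);
dataGridView.DataSource = table;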
CsvHelper's author built this functionality into the library.
The code becomes simply:
using (var reader = new StreamReader("path\\to\\file.csv"))
using (var csv = new CsvReader(reader, CultureInfo.CurrentCulture))
{
// Do any configuration to `CsvReader` before creating CsvDataReader.
using (var dr = new CsvDataReader(csv))
{
var dt = new DataTable();
dt.Load(dr);
}
}
CultureInfo.CurrentCulture is used to determine the default delimiter, and it is needed if you want to read a CSV saved by Excel.
I had the same problem, but I found a way to use @Alberto Monteiro's answer in my own way...
My CSV file does not have a first-line column header (I personally didn't put one there, for my own reasons), so this is the file sample:
1,john doe,j.doe,john.doe@company.net
2,jane doe,j.doe,jane.doe@company.net
So you get the idea, right?
Now I am going to add the columns manually to the DataTable. I am also going to use a Task to do it asynchronously, and then simply use a foreach loop to add the values into DataTable.Rows, using the following function:
public Task<DataTable> ImportFromCSVFileAsync(string filePath)
{
return Task.Run(() =>
{
DataTable dt = new DataTable();
dt.Columns.Add("Index");
dt.Columns.Add("Full Name");
dt.Columns.Add("User Name");
dt.Columns.Add("Email Address");
// splitting the values using Split() command
foreach(var srLine in File.ReadAllLines(filePath))
{
dt.Rows.Add(srLine.Split(','));
}
return dt;
});
}
Now, to call the function, I simply use a button click to do the job:
private async void ImportToGrid_STRBTN_Click(object sender, EventArgs e)
{
// Handling UI objects
// Best idea for me was to put everything a Panel and Disable it while waiting
// and after the job is done Enabling it
// and using a toolstrip docked to bottom outside of the panel to show progress using a
// progressBar and setting its style to Marquee
panel1.Enabled = false;
progressbar1.Visible = true;
try
{
DataTable dt = await ImportFromCSVFileAsync(@"c:\myfile.txt");
if (dt.Rows.Count > 0)
{
Datagridview1.DataSource = null; // To clear the previous data before adding the new ones
Datagridview1.DataSource = dt;
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message, "Error");
}
progressbar1.Visible = false;
panel1.Enabled = true;
}

Find a row in a DataTable

I've a table in a DataSet and I want to search for a row in this Table using a unique key.
My question is: is there any method that allows me to find this row without using loops?
This is the code I wrote using the foreach loop:
foreach (var myRow in myClass.ds.Tables["Editeur"].AsEnumerable())
{
if (newKeyWordAEditeurName == myRow[1] as String)
id_Editeur_Editeur = (int)myRow[0];
}
Sure. You have the Select method off of a DataTable. Get the table from your DataSet, and use Select to snag the row.
void Demo(DataSet ds)
{
DataTable dt = ds.Tables[0]; // refer to your table of interest within the DataSet
dt.Select("Field = 1"); // replace with your criteria as appropriate
}
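Applied to the question's table, that could be a sketch like the following; "NomEditeur" is just a placeholder, since the question only shows the name column by index (myRow[1]):
// Select returns all rows matching the filter expression; take the first match.
DataTable editeurs = myClass.ds.Tables["Editeur"];
DataRow[] matches = editeurs.Select(String.Format("NomEditeur = '{0}'", newKeyWordAEditeurName));
if (matches.Length > 0)
    id_Editeur_Editeur = (int)matches[0][0];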
To find a particular row, you might want to search by key which can uniquely identify each row.
But if you want to find a group of rows, then you want to use filter.
Key can contain different types of objects simultaneously. So can filter!
Following is a concrete example which covers searching with a key method or a filter method as locateOneRow() and locateRows() respectively.
using System.Data;
namespace namespace_A {
public static class cData {
public static DataTable srchTBL = new DataTable(tableName: "AnyTable");
public static DataColumn[] theColumns = {
new DataColumn("IDnum", typeof(int))
, new DataColumn("IDString", typeof(string))
, new DataColumn("DataString", typeof(string))
};
public static void DataInit(){
if (srchTBL.Columns.Count == 0) {
srchTBL.Columns.AddRange(theColumns);
srchTBL.PrimaryKey = new DataColumn[2] { srchTBL.Columns["IDnum"], srchTBL.Columns["IDstring"] };
//Data
srchTBL.Rows.Add(0, "me", "Homemaker");
srchTBL.Rows.Add(1, "me2", "Breadwinner2");
srchTBL.Rows.Add(1, "you", "Breadwinner1");
srchTBL.Rows.Add(2, "kid", "Learner");
}
}//DataInit
public static DataRow locateOneRow(){
DataInit();
object[] keyVals = new object[] {0, "me" };
return srchTBL.Rows.Find(keyVals);
}//locateOneRow - Result: the "Homemaker" row
public static DataRow[] locateRows(){ //no Primary key needed
DataInit();
return srchTBL.Select("IDnum = 1 OR IDstring = 'me'");
}//locateRows - Result: the "Homemaker" row plus both "Breadwinner" rows
}//class
class Program {
static void Main(string[] args) {
try
{
DataRow row1 =cData.locateOneRow();
DataRow[] rows = cData.locateRows();
}catch(Exception ex){
}
} // Main
} // Program
}

Convert datarow(only single column) to a string list

Please look at what is wrong. I want to convert a DataRow (only a single column) to a string list.
public List<string> GetEmailList()
{
// get DataTable dt from somewhere.
List<DataRow> drlist = dt.AsEnumerable().ToList();
List<string> sEmail = new List<string>();
foreach (object str in drlist)
{
sEmail.Add(str.ToString()); // exception
}
return sEmail; // Ultimately to get a string list.
}
Thanks for help.
There are several problems here, but the biggest one is that you're trying to turn an entire row into a string, when really you should be trying to turn just a single cell into a string. You need to reference the first column of that DataRow, which you can do with brackets (like an array).
Try something like this instead:
public List<string> GetEmailList()
{
// get DataTable dt from somewhere.
List<string> sEmail = new List<string>();
foreach (DataRow row in dt.Rows)
{
sEmail.Add(row[0].ToString());
}
return sEmail; // Ultimately to get a string list.
}
Here's a one-liner I got from ReSharper for how to do this; I'm not sure of the performance implications, just thought I'd share it.
List<string> companyProxies = Enumerable.Select(
vendors.Tables[0].AsEnumerable(), vendor => vendor["CompanyName"].ToString()).ToList();
Here is how it would look with all the additional syntax noise stripped, as a one-liner:
public List<string> GetEmailList()
{
return dt.AsEnumerable().Select(r => r[0].ToString()).ToList();
}
The LINQ way...
private static void Main(string[] args)
{
var dt = getdt();
var output = dt
.Rows
.Cast<DataRow>()
.Select(dr => dr[0].ToString())
.ToList();
// or a CSV line
var csv = dt
.Rows
.Cast<DataRow>()
.Aggregate(new StringBuilder(), (sb, dr) => sb.AppendFormat(",{0}", dr[0]))
.Remove(0, 1);
Console.WriteLine(csv);
Console.ReadLine();
}
private static DataTable getdt()
{
var dc = new DataColumn("column1");
var dt = new DataTable("table1");
dt.Columns.Add(dc);
Enumerable.Range(0, 10)
.AsParallel()
.Select(i => string.Format("row {0}", i))
.ToList()
.ForEach(s =>
{
var dr = dt.NewRow();
dr[dc] = s;
dt.Rows.Add(dr);
});
return dt;
}
