I have a datatable imported from a csv. What I'm trying to do is compare all of the rows to each other to find duplicates. In the case of duplicates I am going to add the row # to a list, then write the list to an array and deal with the duplicates after that.
//find duplicate rows and merge them.
foreach (DataRow dr in dt.Rows)
{
//loop again to compare rows
foreach (DataRow dx in dt.Rows)
{
if (dx[0]==dr[0] && dx[1]==dr[1] && dx[2] == dr[2] && dx[3] == dr[3] && dx[4] == dr[4] && dx[5] == dr[5] && dx[7] == dr[7])
{
dupeRows.Add(dx.ToString());
}
}
}
For testing I have added:
listBox1.Items.AddRange(dupeRows.ToArray());
which simply outputs System.Data.DataRow.
How do I store the duplicate row index ids?
The basic problem is that, at the time you decided the row was a duplicate, you saved a string describing the type of the row (which is what DataRow.ToString() returns by default) rather than anything that identifies the row.
Assuming you've read your CSV straight in with some library/driver rather than line by line (which would have been a good time to dedupe) let's use a dictionary to dedupe:
Dictionary<string, DataRow> d = new Dictionary<string, DataRow>();
foreach (DataRow ro in dataTable.Rows){ // DataRow, not var: Rows is a non-generic collection, so var would give object
//form a key for the dictionary
string key = string.Format("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{7}", ro.ItemArray);
d[key] = ro;
}
That's it; at the end of this operation d.Values will be a deduped collection of DataRow (when a key repeats, the later row overwrites the earlier one). 1000 rows require 1000 dictionary operations, so this will likely be orders of magnitude faster than comparing every row to every other row, which needs a million comparisons for a thousand rows.
I've used tabs to separate the values when I formed the key - assuming your data contains no tabs. Best reliability will be achieved if you use a character that does not appear in the data
If you've read your CSV line by line and done a manual string split on comma (i.e. a primitive way of reading a CSV) you could do this operation then instead; after you split you have an array that can be used in place of ro.ItemArray. Process the entire file, creating rows (and adding to the dictionary) only if d.ContainsKey returns false. If the dictionary already contains that row, skip on rather than creating a row
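As a rough sketch of the line-by-line variant, the following builds the table while reading and skips any line whose key has already been seen. It swaps the dictionary for a HashSet<string>, since only the keys are needed here. The names CsvDedupe, Load and columnCount are made up for illustration, and the naive comma split assumes the data contains no quoted commas and that every line has exactly columnCount fields (at least 8).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class CsvDedupe
{
    // Key columns taken from the question: 0-5 and 7 (column 6 is ignored).
    private static readonly int[] KeyColumns = { 0, 1, 2, 3, 4, 5, 7 };

    // Hypothetical helper: builds a DataTable from CSV lines, keeping only
    // the first line seen for each key.
    public static DataTable Load(IEnumerable<string> lines, int columnCount)
    {
        var table = new DataTable();
        for (int c = 0; c < columnCount; c++)
            table.Columns.Add("Column" + c, typeof(string));

        var seen = new HashSet<string>();
        foreach (string line in lines)
        {
            string[] parts = line.Split(',');

            // Tab-separated key, assuming the data itself contains no tabs.
            string key = string.Join("\t", KeyColumns.Select(i => parts[i]));

            // HashSet.Add returns false when the key is already present.
            if (!seen.Add(key))
                continue;

            table.Rows.Add((object[])parts);
        }
        return table;
    }
}
```

Unlike the d[key] = ro loop, this keeps the first occurrence of each key rather than the last.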
The output (System.Data.DataRow) that you are seeing is expected. DataRow does not override ToString(), so the framework calls the base class implementation (System.Object.ToString()), which by default returns the fully qualified type name of the object.
I see three solutions here:
If possible, read the DataTable into custom objects (like MyDataTable, MyDataRow) so that you can write your own ToString(), like below:
public class MyDataRow
{
public override string ToString()
{
return "This is my custom data row formatted string";
}
}
In the for loop, when you find a duplicated row, just add the index/id (some sort of primary key) of dx to the list, and then have another for loop retrieve the dupes.
The third is the same as the one mentioned by Caius Jard.
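A minimal sketch of the index-based option might look like the following. DuplicateFinder and FindDuplicateIndexes are hypothetical names, and the key columns (0-5 and 7) are taken from the question. Note that it compares cell values with object.Equals: the == operator on two object values, as used in the question, compares references rather than contents, so it can report distinct boxed values as different even when they hold the same data.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DuplicateFinder
{
    // Hypothetical helper: returns the index of every row that repeats
    // an earlier row on the key columns.
    public static List<int> FindDuplicateIndexes(DataTable dt)
    {
        int[] keyColumns = { 0, 1, 2, 3, 4, 5, 7 };
        var dupeIndexes = new List<int>();

        for (int i = 0; i < dt.Rows.Count; i++)
        {
            // Only look at earlier rows, so a row is never compared to itself.
            for (int j = 0; j < i; j++)
            {
                bool same = keyColumns.All(
                    c => object.Equals(dt.Rows[i][c], dt.Rows[j][c]));
                if (same)
                {
                    dupeIndexes.Add(i);
                    break; // one match is enough to flag row i
                }
            }
        }
        return dupeIndexes;
    }
}
```

The resulting list holds plain integers, so for display each one can be converted with ToString() before being added to the list box.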
I have a problem with my code. I want to replace the foreach loops with LINQ here. Is there any way to do that? My code is given below.
static public string table2Json(DataSet ds, int table_no)
{
try
{
object[][] tb = new object[ds.Tables[table_no].Rows.Count][];
int r = 0;
foreach (DataRow dr in ds.Tables[table_no].Rows)
{
tb[r] = new object[ds.Tables[table_no].Columns.Count];
int col = 0;
foreach (DataColumn column in ds.Tables[table_no].Columns)
{
tb[r][col] = dr[col];
if ((tb[r][col]).Equals(System.DBNull.Value))
{
tb[r][col] = "";
}
col++;
}
r++;
}
string table = JsonConvert.SerializeObject(tb, Formatting.Indented);
return table;
}
catch (Exception ex)
{
tools.log(ex.Message);
throw ex;
}
}
This question really asks 3 different things:
how to serialize a DataTable
how to change the DataTable serialization format and finally
how to replace nulls with empty strings, even though an empty string isn't a NULL.
JSON.NET already handles DataSet and DataTable serialization with its built-in DataTableConverter. You could just write:
var str = JsonConvert.SerializeObject(data);
Given this DataTable :
var dataTable=new DataTable();
dataTable.Columns.Add("Name",typeof(string));
dataTable.Columns.Add("SurName",typeof(string));
dataTable.Rows.Add("Moo",null);
dataTable.Rows.Add("AAA","BBB");
You get :
[{"Name":"Moo","SurName":null},{"Name":"AAA","SurName":"BBB"}]
DataTables aren't 2D arrays, and the column names and types matter. Generating a separate row object with named fields is far better than generating an object[] array. It also makes it far easier for clients to handle the JSON string without knowing its schema in advance. With an object[] for each row, the clients have to know in advance what's stored in each position.
If you want to use a different serialization format, you could customize the DataTableConverter. Another option though, is to use DataRow.ItemArray to get the values as an object[] and LINQ to get the rows, eg :
object[][] values=dataTable.Rows.Cast<DataRow>()
.Select(row=>row.ItemArray)
.ToArray();
Serializing this produces :
[["Moo",null],["AAA","BBB"]]
And there's no way to tell which item is the name and which is the surname any more.
Replacing DBNulls with strings in this last form needs an extra Select() to replace DBNull.Value with "" :
object[][] values=dataTable.Rows.Cast<DataRow>()
.Select(row=>row.ItemArray
.Select(x=>x==DBNull.Value?"":x)
.ToArray())
.ToArray();
Serializing this produces :
[["Moo",""],["AAA","BBB"]]
That's what was asked, but now we have no way to tell whether the Surname is an empty string, or just doesn't exist.
This may sound strange, but Arabic names may be one long name without surname. Makes things interesting for airlines or travel agents that try to issue tickets (ask me how I know).
We can get rid of ToArray() if we use var :
var values=dataTable.Rows.Cast<DataRow>()
.Select(row=>row.ItemArray
.Select(x=>x==DBNull.Value?"":x));
JSON serialization will work the same.
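Put together, the LINQ form could be wrapped in a small helper along these lines. This is only a sketch: TableRows and ToJaggedArray are made-up names, and the serialization call itself is left out so the block needs nothing beyond System.Data and System.Linq.

```csharp
using System;
using System.Data;
using System.Linq;

public static class TableRows
{
    // Hypothetical helper: converts a DataTable to object[][],
    // replacing DBNull with an empty string as the question asked.
    public static object[][] ToJaggedArray(DataTable table)
    {
        return table.Rows.Cast<DataRow>()
            .Select(row => row.ItemArray
                .Select(x => (x is DBNull) ? (object)"" : x)
                .ToArray())
            .ToArray();
    }
}
```

The result can then be passed to JsonConvert.SerializeObject exactly as tb is in the question, for example ToJaggedArray(ds.Tables[table_no]).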
LINQ is not a nice fit for this sort of thing because you are using explicit indexes r and col into multiple "array structures" (and there is no easy/tidy way to achieve multiple, parallel enumeration).
Other issues
tb[r] is newed for each row and then filled by the inner loop. Because each new array is stored in its own slot of tb, all rows do end up in the JSON string, but keeping r and col in step by hand is easy to get wrong.
The inner foreach loop declares but does not use the iteration variable column - that's not going to break anything but it is redundant.
You will get more mileage out of using JSON.Net properly (or coding the foreach loops as for loops instead if you want to navigate the structures yourself).
In my program I have a varying number of columns, so I've created a universal input window which returns an array of strings.
Now I want to add the entered data to a DataGrid, but I don't know how.
The default DataGrid Add method only supports adding an object, so if I add an array it just adds empty cells.
InputWindow iw = new InputWindow(inputs.ToArray());
if (iw.ShowDialog() == true)
{
try
{
var strings = iw.GetInputs();
ActiveDataGrid.Items.Add(strings);
}
catch (ArgumentException ex)
{
Debug.WriteLine($"{ex.Message} from InputWindow");
}
}
The strings from InputWindow are returned correctly.
How can I add these values corresponding to my varying number of columns?
I assume you mean you want to add one value from your array to each column. The simplest way to do this would be to create a DataRow for each input array, with as many columns as there are items in the array.
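// Assumption: "table" is a DataTable that backs the grid.
// NewRow() and Rows belong to DataTable, not to the DataGrid control.
foreach (var item in array)
{
    // add a column named after the item if it does not exist yet
    if (!table.Columns.Contains(item))
        table.Columns.Add(item);
}
DataRow row = table.NewRow();
foreach (var item in array)
{
    row[item] = item;
}
// adding the row once is enough; a further ImportRow call would duplicate it
table.Rows.Add(row);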
This would work if you plan to process the array items one at a time and clear the grid after each import. If that is not the case, you will need to create a fixed number of generic DataColumns, then enumerate through the array and the columns together to place one item in each column.
You might consider creating a generic class that only contains string fields. Then feed your array to the class, and then feed the class object to the data grid.
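The generic-columns idea could be sketched as follows. GridRows and AddRow are hypothetical names, and the sketch assumes the grid is bound to a DataTable (for instance through its DefaultView); the binding itself is not shown and may need refreshing when columns are added.

```csharp
using System;
using System.Data;

public static class GridRows
{
    // Hypothetical helper: appends one row of strings, growing the table
    // with generic columns ("Column1", "Column2", ...) when needed.
    public static void AddRow(DataTable table, string[] values)
    {
        while (table.Columns.Count < values.Length)
        {
            string name = "Column" + (table.Columns.Count + 1);
            table.Columns.Add(name, typeof(string));
        }

        // Values are assigned to columns by position; columns beyond
        // values.Length are left at their default (DBNull).
        table.Rows.Add((object[])values);
    }
}
```

Because values are placed by position rather than by column name, the same table can take arrays of different lengths without being cleared in between.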
I want to check the rows of two data tables. If there is an exact match I want to remove all but one.
I've figured out how to compare two rows of data. I'm not sure the best way to return the cleaned up version without duplicates.
The tables in my program are pulled from a database, so I simplified them for the example.
Here's what I've worked out so far.
var table1 = new DataTable(); // filled from the database in the real program
var table2 = new DataTable();
foreach (DataRow row1 in table1.Rows)
foreach (DataRow row2 in table2.Rows)
{
var array1 = row1.ItemArray;
var array2 = row2.ItemArray;
if (array1.SequenceEqual(array2))
{
// store the unique elements within a new list?
// remove duplicates and return the remainder?
}
}
I thought using the Intersect() method might be an option as well.
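Put the rows into a HashSet of your desired type. Duplicates are removed automatically, since by definition a HashSet cannot hold duplicate entries. For DataRow you need to pass a value-based comparer such as DataRowComparer.Default, because the default comparison is by reference.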
More info:
https://www.dotnetperls.com/hashset
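A minimal sketch of the HashSet approach for two tables is shown below. TableDedupe and DistinctRows are made-up names. DataRowComparer.Default compares rows by their column values; on .NET Framework it lives in the System.Data.DataSetExtensions assembly, which is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class TableDedupe
{
    // Hypothetical helper: returns the rows of both tables with exact
    // duplicates (same values in every column) collapsed to one row.
    public static List<DataRow> DistinctRows(DataTable table1, DataTable table2)
    {
        // Without the comparer, HashSet<DataRow> would compare references
        // and every row would count as unique.
        var unique = new HashSet<DataRow>(DataRowComparer.Default);

        foreach (DataRow row in table1.Rows)
            unique.Add(row); // Add returns false and ignores duplicates
        foreach (DataRow row in table2.Rows)
            unique.Add(row);

        return unique.ToList();
    }
}
```

The returned rows still belong to their original tables; to build a new table from them, something like DataTable.ImportRow could be used on each one.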
'cannot implicitly convert type string to data row[]'.
Is it possible to store a string in a DataRow[]? I need to store the value of a particular column in that DataRow array. Please suggest an answer.
DataRow[] drprocess = objds.Tables[0].Rows[i]["ProcessName"].ToString();
You have declared a variable of type DataRow[] called drprocess but have not yet created an array of DataRows in which to put any values. Instead you've tried to tell the compiler that the string you're retrieving is actually a DataRow[], which it isn't.
It's possible that what you want to do is to create your array of DataRows, then create a DataRow object and assign it into the array. However, I'm suspicious that this isn't actually what you're trying to achieve. Note that objds.Tables[0].Rows is already a collection of DataRows. You can actually edit or use this collection yourself if you need.
Or, if you want to create a new collection of process names, you might be better off creating a var processes = new List<string>() and then calling processes.Add(objds.Tables[0].Rows[i]["ProcessName"].ToString()).
It all depends what you want to do with this collection of process names afterwards.
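For the list-of-names option, a short sketch could look like this. ProcessNames and Collect are hypothetical names, and the only thing taken from the question is the "ProcessName" column.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ProcessNames
{
    // Hypothetical helper: collects the ProcessName value of every row
    // into a list of strings.
    public static List<string> Collect(DataTable table)
    {
        var processes = new List<string>();
        foreach (DataRow row in table.Rows)
        {
            // DBNull.ToString() yields an empty string, so missing
            // values end up as "" rather than throwing.
            processes.Add(row["ProcessName"].ToString());
        }
        return processes;
    }
}
```

With the question's DataSet this would be called as ProcessNames.Collect(objds.Tables[0]).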
First, a DataRow always belongs to a DataTable. To which table should these new DataRows belong? I will presume objds.Tables[0].
I also assume that you have a string column and you want to split a field in it into a DataRow[]. For that we need to know the delimiter.
Presuming it is a comma:
DataRow[] drprocess = objds.Tables[0].Rows[i].Field<string>("ProcessName").Split(',')
.Select(name => {
DataRow row = objds.Tables[0].NewRow();
row.SetField("ProcessName", name);
return row;
})
.ToArray();
I'm trying to pass an ArrayList into a DataRow object, the idea being to import data into a database from a CSV.
Previously in the file, a Dictionary<string,int> has been created, with the column name as the Key, and the position index as the corresponding value.
I was planning on using this to create a temporary DataTable for each record to aid importing into the DB. My original idea was something along the lines of:
private DataRow ArrayListToDataRow(ArrayList data, Dictionary<string,int> columnPositions)
{
DataTable dt = new DataTable();
DataColumn dc = new DataColumn();
for (i=0;i<=data.Count;i++)
{
dc.ColumnName = columnPositions.Keys[i];
dt.Columns.Add(dc);
dt.Columns[columnPositions.Keys[i]].SetOrdinal(columnPositions(columnPositions.Keys[i]);
}
//TODO Add data to row
}
But of course, the keys aren't indexable.
Does anybody have an idea on how this could be achieved?
Since the size of data should be the same as the size of your columnPositions, you could try using a foreach over your dictionary instead of a for loop.
If you want to access your dictionary values based on a sortable index, you would need to change it to
Dictionary<int, string>
Which seems to make more sense, as you seem to want to read them in that order.
If you cannot change the dictionary, you can do something like this
var orderedPositions = columnPositions.OrderBy(x => x.Value);
foreach(var position in orderedPositions)
{
// do your stuff using position.Key and position.Value
}
.OrderBy comes from LINQ, so you will need to add
using System.Linq;
to your class.
By ordering the columnPositions on their value (the column index) instead of relying on the dictionary's own enumeration order (which is not guaranteed), you can loop through them in the order you presumably want (seeing as you were going with a for loop and trying to get the next columnPosition each time).
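Applied to the method from the question, the ordered loop might be sketched as follows. RowBuilder is a made-up class name. The sketch assumes the positions in columnPositions are valid indexes into data and that data contains no null entries.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class RowBuilder
{
    public static DataRow ArrayListToDataRow(
        ArrayList data, Dictionary<string, int> columnPositions)
    {
        var dt = new DataTable();

        // Add the columns in index order; Columns.Add(name) creates a
        // fresh DataColumn each time, so no shared instance is reused.
        foreach (var position in columnPositions.OrderBy(x => x.Value))
            dt.Columns.Add(position.Key);

        // Fill the row by column name, reading each value from the
        // position recorded in the dictionary.
        DataRow row = dt.NewRow();
        foreach (var position in columnPositions)
            row[position.Key] = data[position.Value];

        return row; // detached row; its schema comes from dt
    }
}
```

Because the columns are added in ascending position order, no SetOrdinal call is needed afterwards.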