I have a DataTable with at least 36,000 columns and at least 20,000 rows. I have a hardcoded list of column names which I need to find in the DataTable and then perform some calculation on the values of each column, e.g. the sum of all rows for a particular column.
I was thinking of using a simple foreach to find each column name and the DataTable.Compute method to perform my calculation.
Is there a better way of achieving this? Any help is greatly appreciated.
Thank you.
I am using .NET 4.6 and VS2015.
You can compare the performance difference between DataTable.Compute and a foreach loop.
// DataTable.Compute sample
DataTable table = dataSet.Tables["Orders"];

// Declare an object variable to hold the aggregate result.
object sumObject = table.Compute("Sum(Total)", "EmpID = 5");
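Applied to your scenario, a rough sketch could look like the following (columnNames and dt here stand for your hardcoded list and your DataTable, and the columns are assumed to hold numeric values):

// Sum each hardcoded column via DataTable.Compute, with no row filter.
// Column names are wrapped in [brackets] in case they contain spaces.
var sums = new Dictionary<string, object>();
foreach (string columnName in columnNames)
{
    if (dt.Columns.Contains(columnName))
        sums[columnName] = dt.Compute("Sum([" + columnName + "])", string.Empty);
}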
Below is a sample using foreach:
// Foreach through the DataTable
double sum = 0;
DataTable dt = ds.Tables["Orders"];
foreach (DataRow dr in dt.Rows)
{
    sum += System.Convert.ToDouble(dr["Total"]);
}
For more performance, it is good to use parallel methods. The linked sample shows how to use them and how much they can improve your speed; it is worth using them wherever possible. A rough sketch for this question's case follows below.
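As a minimal sketch (not from the linked sample), assuming the question's hardcoded list is called columnNames, the table is dt, and nothing writes to the table while the sums run:

// Sum each named column in parallel; DataTable reads are thread-safe as long as nothing modifies the table.
// Requires System.Threading.Tasks, System.Collections.Concurrent and a reference to System.Data.DataSetExtensions.
var sums = new ConcurrentDictionary<string, double>();

Parallel.ForEach(columnNames, columnName =>
{
    if (!dt.Columns.Contains(columnName))
        return; // skip names that are not present in the table

    double sum = dt.AsEnumerable()
                   .Sum(row => row.IsNull(columnName) ? 0d : Convert.ToDouble(row[columnName]));

    sums[columnName] = sum;
});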
Related
I have a DataSet in C# with DataTables, and a PostgreSQL database with the same tables. I fill the DataTables in my code and want to INSERT them into the PostgreSQL database. I tried to insert with simple SQL queries (INSERT INTO ...), but it is very slow when I have hundreds of tables with thousands of rows. I guess using a DataAdapter would improve performance, but I can't understand how it works. Can you explain it to me with two example cases?
case1:
Inserting the DataSet's tables into PostgreSQL with a DataAdapter
case2:
Inserting only unique values from the DataSet into PostgreSQL (when a table in the database has rows with unique keys and the DataTable contains the same)
Or maybe you can suggest what to read to learn about DataAdapters... Anyway, thanks.
With the exception of trivially small datasets, you're going to have a hard time beating the performance of Npgsql's implementation of COPY, which can be accessed via the BeginTextImport method of your NpgsqlConnection object.
So, regardless of how your data exists in your application, if you dump the output via the text import (COPY), it should be very zippy. Here is an example of how you would do that with a DataTable. Bear in mind the columns in the DataTable and the columns in the target table have to line up; if they don't, you need to manage that one way or the other.
This presupposes Npgsql 3.1.9 or higher.
object[] outRow = new object[dt.Columns.Count];

using (var writer = conn.BeginTextImport("copy <table> from STDIN WITH NULL AS '' CSV"))
{
    foreach (DataRow rw in dt.Rows)
    {
        for (int col = 0; col < dt.Columns.Count; col++)
            outRow[col] = rw[col];

        writer.WriteLine(string.Join(",", outRow));
    }
}
As far as duplicates... wow, that really depends. Define "duplicates." If it's just a "select distinct," then it also depends on how many duplicates you expect. If it's a small amount, a List<T>.Exists lookup would probably be adequate, but if you have a large number of dupes a Dictionary would make each lookup a lot more efficient. A typical List lookup is O(n), while a Dictionary lookup is O(1).
Here's a pretty brute-force example of a dictionary distinct insert for the above example:
object[] outRow = new object[dt.Columns.Count];
Dictionary<string, bool> already = new Dictionary<string, bool>();
bool test;

using (var writer = conn.BeginTextImport("copy <table> from STDIN WITH NULL AS '' CSV"))
{
    foreach (DataRow rw in dt.Rows)
    {
        for (int col = 0; col < dt.Columns.Count; col++)
            outRow[col] = rw[col];

        string output = string.Join(",", outRow);
        if (!already.TryGetValue(output, out test))
        {
            writer.WriteLine(output);
            already.Add(output, true);
        }
    }
}
Disclaimer: This is a memory pig. If you can manage dupes any other way, or guarantee the ordering of the data, there are numerous other options.
If you can't (or won't) use a bulk COPY insert, something that would help performance is wrapping your inserts in a transaction (NpgsqlTransaction), but for hundreds of thousands of rows, I can't see why you would. A minimal sketch of that fallback is below.
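This sketch assumes a plain parameterized INSERT per row; the table and column names are placeholders, not taken from your schema:

// Wrap per-row INSERTs in a single transaction so there is not one commit per row.
using (var tx = conn.BeginTransaction())
using (var cmd = new NpgsqlCommand("INSERT INTO my_table (col1, col2) VALUES (@p1, @p2)", conn, tx))
{
    var p1 = cmd.Parameters.Add("p1", NpgsqlTypes.NpgsqlDbType.Integer);
    var p2 = cmd.Parameters.Add("p2", NpgsqlTypes.NpgsqlDbType.Text);

    foreach (DataRow rw in dt.Rows)
    {
        p1.Value = rw[0];
        p2.Value = rw[1];
        cmd.ExecuteNonQuery();
    }

    tx.Commit();
}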
I now have a problem with a very old system of ours. (It is more than 7 years old, and I have no budget or resources to make bigger changes to its structure, so the decision is to improve the old logic as much as we can.)
We have a grid control we wrote ourselves. Basically it is like a normal ASP.NET grid: you can add, change, and delete elements.
The problem is that the grid has a BindGrid() method where, for further usage, the rows of the data source table are copied into a DataRow[]. I need to keep the DataRow[], but I would like to implement the best way to copy the rows from the table into the array.
The current solution:
DataRow[] rows = DataSource.Select("1=1", SortOrderString);
As far as I have experienced, if I need a specific sort order, that could be the best way (I'm also interested in whether there is a quicker way or not).
BUT there are some simplified pages where the sort order is not needed. So I could make two methods: one with the sort order and one without. The real problem is the second one:
DataRow[] rows = DataSource.Select("1=1");
Because it is very slow. I made some tests and it is about 15 times slower than the CopyTo() solution:
DataRow[] rows = new DataRow[DataSource.Rows.Count];
DataSource.Rows.CopyTo(rows,0);
I would like to use the faster way, BUT when I ran the tests, some old functions simply crashed. It seems there is another difference, which I have only noticed now:
Select() returns the rows as if the row changes had already been accepted.
So if I delete a row and do not call AcceptChanges() (unfortunately I can't do that), then with Select("1=1") the row is still in the DataSource but not in the DataRow[]. With a simple CopyTo() the row is there, and that is bad news for me.
My questions are:
1) Is the Select("1=1") the best way to get the rows by the RowChanges? (I doubt a bit, because it is like 6 year old part)
2) And if 1) is not, is it possible to achieve a faster way with the same result than the .Select("1=1") ?
UPDATE:
Here is a very basic test app that I used for speed testing:
DataTable dt = new DataTable("Test");
dt.Columns.Add("Id", typeof(int));
dt.Columns.Add("Name", typeof(string));

for (int i = 0; i < 10000; i++)
{
    DataRow row = dt.NewRow();
    row["Id"] = i;
    row["Name"] = "Name" + i;
    dt.Rows.Add(row);
}
dt.AcceptChanges();

DateTime start = DateTime.Now;

DataRow[] rows = dt.Select();
/*DataRow[] rows = new DataRow[dt.Rows.Count];
dt.Rows.CopyTo(rows, 0);*/

Console.WriteLine(DateTime.Now - start);
You can call Select without an argument: DataRow[] allRows = DataSource.Select(); That is certainly more efficient than "1=1", which applies a pointless RowFilter.
Another way is to use LINQ to DataSet to order and filter the DataTable. That isn't more efficient, but it is more readable and maintainable; a short sketch follows.
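For example (the sort column name is a placeholder, and this needs a reference to System.Data.DataSetExtensions):

// Order with LINQ to DataSet and materialize the DataRow[] the grid needs.
DataRow[] rows = DataSource.AsEnumerable()
                           .OrderBy(r => r.Field<string>("SortColumn"))
                           .ToArray();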
I have no example or measurement yet, but it is obvious that a RowFilter of "1=1" is more expensive than none. Select is implemented this way:
public Select(DataTable table, string filterExpression, string sort, DataViewRowState recordStates)
{
    this.table = table;
    this.IndexFields = table.ParseSortString(sort);
    this.indexDesc = Select.ConvertIndexFieldtoIndexDesc(this.IndexFields);
    // the following is skipped if you use DataSource.Select() without "1=1"
    if (filterExpression != null && filterExpression.Length > 0)
    {
        this.rowFilter = new DataExpression(this.table, filterExpression);
        this.expression = this.rowFilter.ExpressionNode;
    }
    this.recordStates = recordStates;
}
If you also want to select the rows that are currently not accepted, you can use this overload of Select:
DataRow[] allRows = DataSource.Select("", "", DataViewRowState.CurrentRows | DataViewRowState.Deleted);
This will select all rows, including rows that are deleted, even if AcceptChanges has not been called yet.
I have a table in a DataAdapter. I want to get the count and the sum of a specific column of it. How is that possible?
This is the code to reach the column; what comes after that?
DataColumn buy_count = myDataSet.Tables["all_saled"].Columns["how_much_buy"];
I know that we have SUM, COUNT, ... in SQL, but how can I do it in C#?
You can use LINQ to DataSet:
var sales = myDataSet.Tables["all_saled"].AsEnumerable();
var buy_total = sales.Sum(datarow => datarow.Field<int>("how_much_buy"));
Check the LINQ to DataSet 101 Samples.
P.S. You might need the System.Data.DataSetExtensions assembly referenced.
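The question also asks for the count; a small sketch along the same lines, reusing the sales variable from above:

// Number of rows in the table.
int buy_count = sales.Count();

// If the column can contain DBNull and you only want rows that have a value:
int non_null_count = sales.Count(datarow => !datarow.IsNull("how_much_buy"));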
Use the DataTable.Compute method:
int total = (int)myDataSet.Tables["all_saled"].Compute("SUM(how_much_buy)", null);
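Compute can return the count as well; a small sketch (Convert.ToInt32 is used here rather than a direct cast, to avoid depending on the exact boxed type Compute returns):

// Count of non-null values in the column via the Count aggregate.
int buyCount = Convert.ToInt32(
    myDataSet.Tables["all_saled"].Compute("COUNT(how_much_buy)", string.Empty));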
I have the following (simplified) code which I'd like to optimise for speed:
long inputLen = 50000000; // 50 million
DataTable dataTable = new DataTable();
DataRow dataRow;
object[] objectRow = new object[4];

while (inputLen-- > 0)
{
    objectRow[0] = ...
    objectRow[1] = ...
    objectRow[2] = ...

    // Generate output for this input
    output = ...

    for (int i = 0; i < outputLen; i++) // outputLen can range from 1 to 20,000
    {
        objectRow[3] = output[i];
        dataRow = dataTable.NewRow();
        dataRow.ItemArray = objectRow;
        dataTable.Rows.Add(dataRow);
    }
}
// Bulk copy
SqlBulkCopy bulkTask = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
bulkTask.DestinationTableName = "newTable";
bulkTask.BatchSize = dataTable.Rows.Count;
bulkTask.WriteToServer(dataTable);
bulkTask.Close();
I'm already using SqlBulkCopy in an attempt to speed things up, but it appears that assigning values to the DataTable itself is slow.
I don't know how DataTables work, so I'm wondering if I'm creating unnecessary overhead by first creating a reusable array, then assigning it to a DataRow, then adding the DataRow to the DataTable. Or is using a DataTable not optimal in the first place? The input comes from a database.
I don't care much about LOC, just speed. Can anyone give some advice on this?
For such a big table, you should instead use the
public void WriteToServer(IDataReader reader)
method.
It may mean you'll have to implement a "fake" IDataReader yourself (if you don't get the data from an existing IDataReader), but this way you'll get "streaming" from end to end and will avoid a 200 million iteration loop.
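For the case where the rows can be produced by a query on the source database (the question says the input comes from a database), a minimal sketch of end-to-end streaming could look like this; the connection strings, the query, and the table names are placeholders:

using (var source = new SqlConnection(sourceConnectionString))
using (var destination = new SqlConnection(destinationConnectionString))
{
    source.Open();
    destination.Open();

    using (var command = new SqlCommand("SELECT Col1, Col2, Col3, Col4 FROM SourceTable", source))
    using (var reader = command.ExecuteReader())
    using (var bulk = new SqlBulkCopy(destination, SqlBulkCopyOptions.TableLock, null))
    {
        bulk.DestinationTableName = "newTable";
        bulk.BulkCopyTimeout = 0;    // no timeout for a very large copy
        bulk.WriteToServer(reader);  // streams rows; nothing is buffered in a DataTable
    }
}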
Instead of holding a huge DataTable in memory, I would suggest implementing an IDataReader which serves up the data as the bulk copy goes. This will reduce the need to keep everything in memory up front, and should thus improve performance.
You should not construct the entire DataTable in memory. Use the overload of WriteToServer that takes an array of DataRow, and just split your data into chunks; a sketch of that follows.
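As a rough sketch (the chunk size and the GenerateRowValues helper are illustrative placeholders for your existing row-building loop, not part of the original answer):

const int chunkSize = 10000;   // illustrative batch size

using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
{
    bulk.DestinationTableName = "newTable";

    foreach (object[] values in GenerateRowValues())    // your existing row-generating logic
    {
        DataRow row = dataTable.NewRow();
        row.ItemArray = values;
        dataTable.Rows.Add(row);

        if (dataTable.Rows.Count == chunkSize)
        {
            // Overload that takes an array of DataRow, as suggested above.
            bulk.WriteToServer(dataTable.Rows.Cast<DataRow>().ToArray());
            dataTable.Clear();                           // free the rows that were just written
        }
    }

    if (dataTable.Rows.Count > 0)
        bulk.WriteToServer(dataTable.Rows.Cast<DataRow>().ToArray());   // flush the final chunk
}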
Hi, is there any way to select the top 5 rows from a DataTable without iteration?
I think you can use LINQ:
datatable.AsEnumerable().Take(5);
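If the result is needed back as a DataTable rather than an IEnumerable<DataRow>, a small follow-up sketch (needs a reference to System.Data.DataSetExtensions):

// Materialize the first 5 rows into a new DataTable with the same schema.
// Note: CopyToDataTable throws if the source sequence is empty.
DataTable top5 = datatable.AsEnumerable().Take(5).CopyToDataTable();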
Using 2 of the above posts, the following works for me:
foreach (DataRow _dr in DataSet.Tables[<tblname>].Select("", "Timestamp DESC").AsEnumerable().OfType<DataRow>().Take(5))
{
    // work with _dr here
}
So you can filter if you want, order if you want, then take only the number of records you want, and then iterate through them, whether it is 1 or 100.
Hope that helps someone.
This is what worked for me:
datatable.Rows.Cast<System.Data.DataRow>().Take(5);
This works for my needs.
public static DataTable TopRows(this DataTable dTable, int rowCount)
{
    DataTable dtNew = dTable.Clone();
    dtNew.BeginLoadData();
    if (rowCount > dTable.Rows.Count) { rowCount = dTable.Rows.Count; }
    for (int i = 0; i < rowCount; i++)
    {
        DataRow drNew = dtNew.NewRow();
        drNew.ItemArray = dTable.Rows[i].ItemArray;
        dtNew.Rows.Add(drNew);
    }
    dtNew.EndLoadData();
    return dtNew;
}
To use it (the extension method must be declared in a static class), you then do this:
DataTable top5 = dataTable.TopRows(5);
If you use a LINQ statement, you could use the Take() method.
This post may be of some assistance as well.
EDIT
As you are using VS2005, use the Select() method on the DataTable like so:
DataRow[] rows = datatable.Select("TOP 5");