I have a problem with my ASP.NET program. I am doing a DataTable.Compute on a very large DataTable with a LIKE condition in it. The result takes something like 4 minutes to show, or the request times out. If I do the same request with an = and a fixed string, it takes nearly 1 minute to show, which for my use is acceptable.
Here is the line that is so slow:
float test = (int)Datatbl.Tables["test"].Compute("COUNT(pn)", "pn like '9800%' and mois=" + i + " and annee=" + j);
I have been searching for a solution for 2 days.
Please help me.
Are you retrieving the data in your DataTable from a database? Do you have access to the database?
If so, one option is to look into moving this lookup and aggregation into the database instead of doing it in your C# code. Once it is in the database, you could, if required, add indexes on the 'mois' and 'annee' columns, which may speed up the lookup considerably. If '9800' is a hardcoded value, you could even add a denormalisation consisting of a boolean column indicating whether the 'pn' column begins with '9800', and put an index on that column. This may make the lookup very fast indeed.
There are lots of options available.
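For example, a hedged sketch of what the database-side count might look like (the connection string, table name and exact schema are assumptions based on the question's filter):
// using System.Data.SqlClient;
int CountParts(string connectionString, int i, int j)
{
    const string sql =
        "SELECT COUNT(pn) FROM test " +
        "WHERE pn LIKE '9800%' AND mois = @mois AND annee = @annee";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@mois", i);
        cmd.Parameters.AddWithValue("@annee", j);
        conn.Open();
        // COUNT() comes back as a single scalar value
        return (int)cmd.ExecuteScalar();
    }
}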
I found it.
I use a DataView and send the result to a DataTable. This sped up the process 10 times.
Here is an example:
DataView dv = new DataView(Datatbl.Tables["test"], filter, "pn", DataViewRowState.CurrentRows);
DataTable test = dv.ToTable();
and then you can use the "test" DataTable.
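For completeness, a sketch of the full replacement using the filter from the question above:
// The same condition that was passed to Compute, now used as a DataView row filter.
string filter = "pn LIKE '9800%' AND mois = " + i + " AND annee = " + j;
DataView dv = new DataView(Datatbl.Tables["test"], filter, "pn", DataViewRowState.CurrentRows);

// The filtered row count is available directly from the view...
int count = dv.Count;

// ...or the filtered rows can be copied into a new DataTable, as above.
DataTable test = dv.ToTable();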
I have a dropdown list in my aspx page. The dropdown list's data source is a DataTable. The backend is MySQL and records get into the DataTable by using a stored procedure.
I want to display the records in the dropdown menu in ascending order.
I can achieve this in two ways.
1) dt is the DataTable and I am using a DataView to sort the records.
dt = objTest_BLL.Get_Names();
dataView = dt.DefaultView;
dataView.Sort = "name ASC";
dt = dataView.ToTable();
ddown.DataSource = dt;
ddown.DataTextField = dt.Columns[1].ToString();
ddown.DataValueField = dt.Columns[0].ToString();
ddown.DataBind();
2) Or in the SELECT query I can simply say:
SELECT
`id`,
`name`
FROM `test`.`type_names`
ORDER BY `name` ASC ;
If I use the 2nd method I can simply eliminate the DataView part. Assume this type_names table has 50 records, and my page is viewed by 100,000 users a minute. What is the best method considering efficiency and memory handling? Get unsorted records into the DataTable and sort them in the code-behind, or sort them inside the database?
Note - only real performance tests can tell you real numbers. Theoretical options are below (which is why I use the word 'guess' a lot in this answer).
You have at least 3 (instead of 2) options -
Sort in the database - If the column being sorted on is indexed, this may make the most sense, because the overhead of sorting on your database server may be negligible. The SQL server's own data caches may make this a very fast operation. But at 100k queries per minute, measure whether SQL gives noticeably faster results without the sort.
Sort in the code-behind / middle layer - You likely won't have your own equivalent of an index, so you'd be sorting a list of 50 records, 100k times per minute. That would be slower than SQL, I would guess.
The big benefit would apply only if the data is relatively static, or very slowly changing, and the sorted values can be cached in memory for a few seconds, minutes or hours.
The option not in your list - send the data unsorted all the way to the client, and sort it on the client side using JavaScript. This solution may scale the most; sorting 50 records in the browser should not have a noticeable impact on your UX.
The SQL purists will no doubt tell you that it’s better to let SQL do the sorting rather than C#. That said, unless you are dealing with massive record sets or doing many queries per second it’s unlikely you’d notice any real difference.
For my own projects, these days I tend to do the sorting in C# unless I'm running some sort of aggregate in the statement. The reason is that it's quick, and if you are running any sort of stored proc or function on the SQL server, it means you don't need to find ways of passing ORDER BY clauses into the stored proc.
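As a hedged illustration of the code-behind approach (assuming a reference to System.Data.DataSetExtensions; column names are taken from the question above):
dt = objTest_BLL.Get_Names();

// Sort in memory with LINQ instead of going through a DataView.
ddown.DataSource = dt.AsEnumerable()
                     .OrderBy(r => r.Field<string>("name"))
                     .CopyToDataTable();
ddown.DataTextField = "name";
ddown.DataValueField = "id";
ddown.DataBind();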
I have N rows (which could be no less than 1000) on an Excel spreadsheet, and in this sheet our project has 150 columns.
Now, our application needs the data to be copied (using a normal Ctrl+C) and pasted (using Ctrl+V) from the Excel sheet onto our GUI sheet. I want to speed this up, perhaps using divide and conquer or some other mechanism, but currently I am not really sure how to go about this. Here is what part of my code looks like:
The above code gets called row-wise like this:
Please note that my question needs more of an algorithmic solution than code optimization; however, any answers containing code-related optimizations will be appreciated as well. (Tagged LINQ because, although not shown, I have been using LINQ in some parts of my code.)
1. IIRC dRow["Condition"] is much slower than dRow[index] as it has to do a lookup every time. Find out which indexes the columns have before the call.
public virtual void ValidateAndFormatOnCopyPaste(DataTable DtCopied, int CurRow, int conditionIndex, int valueIndex)
{
    foreach (DataRow dRow in dtValidateAndFormatConditions.Rows)
    {
        string Condition = (string)dRow[conditionIndex];
        string FormatValue = (string)dRow[valueIndex];
        GetValidatedFormattedData(DtCopied, ref Condition, ref FormatValue, CurRow);
        Condition = Parse(Condition);
        dRow[conditionIndex] = Condition;
        FormatValue = Parse(FormatValue);
        dRow[valueIndex] = FormatValue;
    }
}
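For point 1, a small sketch of resolving the column ordinals once, before the per-row calls (the column names are assumed from the indexer lookups in the original code):
int conditionIndex = dtValidateAndFormatConditions.Columns["Condition"].Ordinal;
int valueIndex     = dtValidateAndFormatConditions.Columns["Value"].Ordinal;

// Pass the pre-computed ordinals instead of looking columns up by name per row.
ValidateAndFormatOnCopyPaste(DtCopied, CurRow, conditionIndex, valueIndex);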
2. If you are updating an Excel document live, you should also lock the sheet updates during the process, so the document isn't redrawn on every cell change.
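For point 2, if the sheet is driven through Excel interop, something along these lines should help (excelApp is assumed to be the Microsoft.Office.Interop.Excel Application object):
excelApp.ScreenUpdating = false;
excelApp.Calculation = XlCalculation.xlCalculationManual;
try
{
    // ... bulk cell updates ...
}
finally
{
    excelApp.Calculation = XlCalculation.xlCalculationAutomatic;
    excelApp.ScreenUpdating = true;
}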
3. Virtual methods also have a performance penalty.
The general answer to a problem like this is you want to move as much as possible of the heavier processing out of the row loop, so it only needs to execute once instead of for every row.
It's difficult to provide more detail without knowing exactly how your validation/formatting system works, but I can offer some "pointers":
Is it possible to build some kind of cached data structure from this condition table of yours? That would eliminate all the heavy DataTable operations from your inner loop.
The most efficient solution to mass validation/formatting I can think of is to construct a C# script based upon your set of conditions, compile it into a delegate and then just evaluate it for each row. I don't know if this is possible for your problem ...
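As a rough sketch of the cached-structure idea (column names are assumed from the earlier snippet), the condition table can be read once into a plain list, and the inner loop can work off that instead of hitting the DataTable per row:
var conditions = dtValidateAndFormatConditions.Rows
    .Cast<DataRow>()
    .Select(r => new
    {
        Condition = (string)r["Condition"],
        Format    = (string)r["Value"]
    })
    .ToList();   // built once, outside the row loop

foreach (DataRow copiedRow in DtCopied.Rows)
{
    foreach (var c in conditions)
    {
        // apply c.Condition / c.Format to copiedRow here
    }
}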
There are two proposed improvements to the algorithm:
a. You can use multithreading, if possible, to speed up the process by a constant factor (you would need to test to get the actual value). You can use multithreading to evaluate the rows in parallel.
b. If it is possible to stop processing a row as soon as even one column is invalid, then you can stop processing that row. Further, you can analyse a large sample of the input data, arrange the columns in decreasing probability of being invalid, and then check the columns in that calculated order. Furthermore, you can also arrange the predicates within each validation condition the way you did for the columns when checking validations.
Proposed algorithm which might improve performance:
for cond in conditions:
    probability(cond) = 0
for record in largeDataSet:
    for col in record:
        for cond in conditions:
            if invalid(cond, col):
                probability(cond)++
condorder = sort(conditions, key = probability, order = decreasing)
check conditions in the order given by condorder
This is a learning algorithm which can be used to calculate the order in which to evaluate the predicates for efficient short-circuit evaluation of the conditions, but it would take the same time for valid inputs. You can evaluate this order offline on a large dataset of sample inputs and just store it in an array for live usage.
Edit: Another improvement that I missed is the use of a hash table for columns which have a small range of valid values; instead of evaluating the conditions on such a column we just check whether the value is in the hash table. Similarly, if the range of invalid values is small, we check for those in a hash table. The hash table can be filled before the start of evaluation using a file.
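A minimal sketch of the hash-table idea, assuming the valid values are seeded from a text file (the file name and format are illustrative):
static readonly HashSet<string> ValidValues =
    new HashSet<string>(File.ReadAllLines("valid_values.txt"));

static bool IsValid(string cellValue)
{
    // O(1) containment check instead of re-evaluating the condition per row
    return ValidValues.Contains(cellValue);
}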
Operations like 'string Condition = dRow["Condition"]' are rather heavy, so I would recommend moving the row enumeration (the for loop) from the ValidateAndFormat method to the ValidateAndFormatOnCopyPaste method, immediately around the call to GetValidatedFormattedData.
Pre steps:
create a task class that accepts 1 dataRow and processes it
create a task queue
create a thread pool with Y workers
Algorithm
when you get your N rows, add all of them as tasks to your task queue
get the workers to start taking tasks from the task queue
as the task responses come back, update your data table (probably initially a clone)
once all the tasks are done return your new data table
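A rough sketch of this queue/worker idea using the built-in thread pool (sourceTable and ProcessRow are stand-ins for your table and per-row validation/format work, not names from the original code):
DataTable resultTable = sourceTable.Clone();
var pending = new CountdownEvent(sourceTable.Rows.Count);

foreach (DataRow row in sourceTable.Rows)
{
    object[] values = row.ItemArray;
    ThreadPool.QueueUserWorkItem(_ =>
    {
        object[] processed = ProcessRow(values);   // the actual per-row work
        lock (resultTable)
        {
            resultTable.Rows.Add(processed);
        }
        pending.Signal();
    });
}

pending.Wait();   // all tasks done -> resultTable holds the processed rows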
Possible Improvements
as Vikram said, you could probably short-circuit your conditions: if at value 10 you already know it's an error, don't bother checking the rest of the 140 conditions. But that only works if it fits your requirements; if they require a checked condition for all 150, then you can't escape that one
change the task classes to take in a list of rows instead of one row each, which could help by reducing context switching between threads if the individual tasks finish very quickly
Other ideas that I haven't really thought through
sort the data first; maybe there is a speed benefit from short-circuiting certain known conditions
checksum the whole row and store it in a db with its result, essentially caching the parameters/results, so that the next time something with the exact same values/fields runs you can pull it from the cache
the other thing a checksum of the whole row buys you is change detection: let's say you have some sort of key and you are looking for changed data; the checksum of everything but the key will tell you whether some value changed and whether it's even worth looking at the conditions of all the other columns
You can use the DataTable's Select method.
Try this:
datatable.Select(string.Format("[Col1] = '{0}'", id)).ToList().ForEach(r => r["Col1"] = Data);
I know this is kind of brute force, but you can combine it with what others have suggested:
Parallel.ForEach(dtValidateAndFormatConditions.Rows.Cast<DataRow>(), dRow =>
{
    string Condition = (string)dRow[conditionIndex];
    string FormatValue = (string)dRow[valueIndex];
    GetValidatedFormattedData(DtCopied, ref Condition, ref FormatValue, CurRow);
    Condition = Parse(Condition);
    FormatValue = Parse(FormatValue);
    lock (dRow)
    {
        dRow[conditionIndex] = Condition;
        dRow[valueIndex] = FormatValue;
    }
});
Currently I have a large set of data stored inside a database.
This set of data (3,000 records) is retrieved by users frequently.
The method that I'm using right now is:
Retrieve this set of records from database
Convert into datatable
Store into a cache object
Search results from this cache object based on the query
CachePostData.Select(string.Format("Name LIKE '%{0}%'", txtItemName.Text));
Bind the result to a repeater (with paging that displays 40 records per page)
But I notice that the performance is not good (about 4 seconds on every request). So I am wondering, is there any better way to do this? Or should I retrieve the result straight from the database for every query?
DataTable.Select is probably not the most efficient way to search an in-memory cache, but it certainly shouldn't take 4 seconds or anything like that to search 3,000 rows.
First step is to find out where your performance bottlenecks are. I'm betting it's nothing to do with searching the cache, but you can easily find out, e.g. with code something like:
var stopwatch = Stopwatch.StartNew();
var result = CachePostData.Select(string.Format("Name LIKE '%{0}%'", txtItemName.Text));
WriteToLog("Read from cache took {0} ms", stopwatch.Elapsed.TotalMilliseconds);
where WriteToLog traces somewhere (e.g. System.Diagnostics.Trace, System.Diagnostics.Debug, or a logging framework such as log4net).
If you are looking for alternatives for caching, you could simply cache a generic list of entity objects, and use Linq to search the list:
var result = CachePostData.Where(x => x.Name.Contains(txtItemName.Text));
This is probably slightly more efficient (for example, it doesn't need to parse the "NAME LIKE ..." filter expression), but again, I don't expect this is your bottleneck.
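For illustration, a hedged sketch of the cached-list version (the Post class and LoadPostsFromDatabase are placeholders for your own entity and data-access code, not names from the original):
List<Post> posts = HttpContext.Current.Cache["posts"] as List<Post>;
if (posts == null)
{
    posts = LoadPostsFromDatabase();                    // your existing data access
    HttpContext.Current.Cache["posts"] = posts;
}

// Search the cached list with LINQ instead of DataTable.Select.
var result = posts.Where(p => p.Name.Contains(txtItemName.Text)).ToList();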
I think using a cached DataTable would be more efficient, as by doing that you will reduce the hits on your database server. You can store the DataTable in the cache and then reuse it. Something like this:-
public DataTable myDatatable()
{
    DataTable dt = HttpContext.Current.Cache["key"] as DataTable;
    if(dt == null)
    {
        // On a cache miss, load from the database (LoadFromDatabase is a
        // placeholder for your existing data-access call).
        dt = LoadFromDatabase();
        HttpContext.Current.Cache["key"] = dt;
    }
    return dt;
}
Also check SqlCacheDependency
You may also expire the cache after a particular time interval, like this:-
HttpContext.Current.Cache.Insert("key", dt, null, DateTime.Now.AddHours(2),
System.Web.Caching.Cache.NoSlidingExpiration);
Also check DataTable caching performance
It is a bit hard to prescribe a correct solution for your problem without knowing how many hits on the database you actually expect. For most cases, I would not cache the data in the ASP.NET cache for filtering, because searching with DataTable.Select basically performs a table scan and cannot take advantage of database indexing. Unless you run into really heavy load, most database servers should be capable of performing this task with less delay than filtering the DataTable in .NET.
If your database supports full-text search (e.g. MSSQL or MySQL), you could create a full-text index on the name column and query that. Full-text search should give you even faster responses for these types of LIKE queries.
Generally, caching data for faster access is good, but in this case, the DataTable is most likely inferior to the database server in terms of searching for data. You could still use the cache to display unfiltered data faster and without hitting the database.
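As a hedged example, a SQL Server full-text query replacing the LIKE scan could look roughly like this (it assumes a full-text index already exists on the Name column; the table and variable names are illustrative):
const string sql = "SELECT Id, Name FROM Posts WHERE CONTAINS(Name, @term)";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    // Prefix search, e.g. "abc*" matches words starting with abc.
    cmd.Parameters.AddWithValue("@term", "\"" + txtItemName.Text + "*\"");
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // bind the results to the repeater here
    }
}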
Given an expression like so:
DataTable1.Columns.Add("value", typeof(double), "rate * loan_amt");
in a DataTable with 10,000 rows, where rate is the same for all rows and loan_amt varies.
When the rate changes, it changes for all rows.
Currently that means iterating through all the rows, like so:
foreach(DataRow dr in DataTable1.Rows) dr["rate"] = new_rate;
I am wondering if there's a better way, using a ReferenceTable (with only 1 row) in the same DataSet and linking it somehow, like so:
DataTable1.Columns.Add("value", typeof(double), "RefTable.Row0.rate * loan_amt");
so that changing the rate would be as simple as:
RefTable.Rows[0]["rate"] = new_rate;
Or any other way?
That is a good idea, but you would have to rewrite every place that data is accessed in legacy code. It would certainly make updates to the rate more efficient, but you may run into issues with backward compatibility.
If there isn't much code accessing that table then it isn't such a big deal, but if this is a production system with multiple processes calling that data, you might end up with a runaway train of null value exceptions when trying to access the "rate" column of the original table, or inconsistencies in your "value" depending on which code accessed which table to retrieve the rate.
If this is not the case, then it's no big deal. Go for it.
Found the answer; adding it for others who might land here.
The key is to add a DataRelation between the two tables/columns, and the expression becomes
Parent.rate * loan_amt
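For completeness, a minimal sketch of how the relation can be wired up (table, column and relation names are illustrative, not from the original post):
var ds = new DataSet();

var refTable = new DataTable("RefTable");
refTable.Columns.Add("rate_id", typeof(int));
refTable.Columns.Add("rate", typeof(double));
refTable.Rows.Add(1, 0.05);

var loans = new DataTable("Loans");
loans.Columns.Add("rate_id", typeof(int));      // every row points at the single rate row
loans.Columns.Add("loan_amt", typeof(double));
loans.Rows.Add(1, 100000.0);

ds.Tables.Add(refTable);
ds.Tables.Add(loans);

// The DataRelation lets the child expression reference Parent.rate.
ds.Relations.Add("RateRel", refTable.Columns["rate_id"], loans.Columns["rate_id"]);
loans.Columns.Add("value", typeof(double), "Parent.rate * loan_amt");

// Changing the rate in one place now updates every computed value.
refTable.Rows[0]["rate"] = 0.06;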
I am writing this in C# using .NET 3.5. I have a System.Data.DataSet object with a single DataTable that uses the following schema:
Id : uint
AddressA: string
AddressB: string
Bytes : uint
When I run my application, let's say the DataTable gets filled with the following:
1 192.168.0.1 192.168.0.10 300
2 192.168.0.1 192.168.0.20 400
3 192.168.0.1 192.168.0.30 300
4 10.152.0.13 167.10.2.187 80
I'd like to be able to query this DataTable where AddressA is unique and the Bytes column is summed together (I'm not sure I'm saying that correctly). In essence, I'd like to get the following result:
1 192.168.0.1 1000
2 10.152.0.13 80
I ultimately want this result in a DataTable that can be bound to a DataGrid, and I need to update/regenerate this result every 5 seconds or so.
How do I do this? DataTable.Select() method? If so, what does the query look like? Is there an alternate/better way to achieve my goal?
EDIT: I do not have a database. I'm simply using an in-memory DataSet to store the data, so a pure SQL solution won't work here. I'm trying to figure out how to do it within the DataSet itself.
For readability (and because I love it) I would try to use LINQ:
var aggregatedAddresses = from DataRow row in dt.Rows
group row by row["AddressA"] into g
select new {
Address = g.Key,
Byte = g.Sum(row => (uint)row["Bytes"])
};
// "result" is assumed to be a DataTable with three columns matching the desired output (Id, Address, Bytes).
int i = 1;
foreach(var row in aggregatedAddresses)
{
result.Rows.Add(i++, row.Address, row.Byte);
}
If a performance issue is discovered with the LINQ solution, I would go with a manual solution: summing up the rows in a loop over the original table and inserting them into the result table.
You can also bind the aggregatedAddresses directly to the grid instead of putting it into a DataTable.
The most efficient solution would be to do the sum in SQL directly:
select AddressA, SUM(bytes) from ... group by AddressA
I agree with Steven as well that doing this on the server side is the best option. If you are using .NET 3.5 though, you don't have to go through what Rune suggests. Rather, use the extension methods for datasets to help query and sum the values.
Then, you can map it easily to an anonymous type which you can set as the data source for your grid (assuming you don't allow edits to this, which I don't see how you can, since you are aggregating the data).
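A hedged sketch of what that might look like with the DataSet extension methods (dataSet and grid are the poster's DataSet and DataGrid; column names follow the question's schema):
var summary = dataSet.Tables[0].AsEnumerable()
    .GroupBy(r => r.Field<string>("AddressA"))
    .Select(g => new
    {
        AddressA = g.Key,
        Bytes = g.Sum(r => (long)r.Field<uint>("Bytes"))   // widen to long for Sum
    })
    .ToList();

grid.DataSource = summary;
grid.DataBind();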
I agree with Steven that the best way to do this is to do it in the database. But if that isn't an option you can try the following:
Make a new datatable and add the columns you need manually using DataTable.Columns.Add(name, datatype)
Step through the first DataTable's Rows collection and for each row create a new row in your new DataTable using DataTable.NewRow()
Copy the values of the columns found in the first table into the new row
Find the matching row in the other data table using Select() and copy out the final value into the new data row
Add the row to your new data table using DataTable.Rows.Add(newRow)
This will give you a new data table containing the combined data from the two tables. It won't be very fast, but unless you have huge amounts of data it will probably be fast enough. Try to avoid doing a LIKE query in the Select, though, as that is slow (a rough sketch of these steps is given below).
One possible optimization would be if both tables contain rows with identical primary keys. You could then sort both tables and step through them, fetching both data rows by their array index. This would rid you of the Select() call.
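A rough sketch of the manual approach described above, without the sorted-merge optimization (firstTable and otherTable are assumptions, loosely following the question's schema):
DataTable combined = new DataTable();
combined.Columns.Add("AddressA", typeof(string));
combined.Columns.Add("Bytes", typeof(uint));

foreach (DataRow src in firstTable.Rows)
{
    DataRow newRow = combined.NewRow();
    newRow["AddressA"] = src["AddressA"];

    // Find the matching row in the other table with an exact match
    // (avoid LIKE filters here, they are slow).
    DataRow[] matches = otherTable.Select(
        string.Format("AddressA = '{0}'", src["AddressA"]));

    newRow["Bytes"] = matches.Length > 0 ? matches[0]["Bytes"] : src["Bytes"];
    combined.Rows.Add(newRow);
}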