I have N rows (never fewer than 1000) in an Excel spreadsheet, and in this sheet our project has 150 columns like this:
Now, our application needs data to be copied (using normal Ctrl+C) and pasted (using Ctrl+V) from the Excel sheet onto our GUI sheet. I need to speed this up, perhaps using divide and conquer or some other mechanism, but currently I am not really sure how to go about it. Here is what part of my code looks like:
The above code gets called row-wise like this:
Please note that my question calls for an algorithmic solution more than code optimization; however, any answers containing code-related optimizations will be appreciated as well. (Tagged LINQ because, although not shown, I have been using LINQ in some parts of my code.)
1. IIRC, dRow["Condition"] is much slower than dRow[index] because it has to do a column-name lookup every time. Find out which indexes the columns have once, before the call.
public virtual void ValidateAndFormatOnCopyPaste(DataTable DtCopied, int CurRow, int conditionIndex, int valueIndex)
{
    foreach (DataRow dRow in dtValidateAndFormatConditions.Rows)
    {
        // indexer access avoids the per-call column-name lookup
        string Condition = dRow[conditionIndex].ToString();
        string FormatValue = dRow[valueIndex].ToString();
        GetValidatedFormattedData(DtCopied, ref Condition, ref FormatValue, CurRow);
        Condition = Parse(Condition);
        dRow[conditionIndex] = Condition;
        FormatValue = Parse(FormatValue);
        dRow[valueIndex] = FormatValue;
    }
}
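A minimal sketch of resolving those ordinals once, up front ("Condition" comes from the original code; "Value" is an assumed column name, so adjust to your schema):

// Hypothetical column names; resolve the ordinals once, outside the per-row work.
int conditionIndex = dtValidateAndFormatConditions.Columns["Condition"].Ordinal;
int valueIndex = dtValidateAndFormatConditions.Columns["Value"].Ordinal;
ValidateAndFormatOnCopyPaste(DtCopied, CurRow, conditionIndex, valueIndex);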
2. If you are updating an Excel document live, you should also suspend sheet updates during the process, so the document isn't redrawn on every cell change.
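If you are driving Excel through COM interop, something along these lines would do it (a sketch, assuming an existing Microsoft.Office.Interop.Excel.Application instance named xlApp):

// Sketch assuming Excel COM interop and an already-open Application object named xlApp.
xlApp.ScreenUpdating = false;
xlApp.Calculation = Microsoft.Office.Interop.Excel.XlCalculation.xlCalculationManual;
try
{
    // ... bulk cell updates ...
}
finally
{
    xlApp.Calculation = Microsoft.Office.Interop.Excel.XlCalculation.xlCalculationAutomatic;
    xlApp.ScreenUpdating = true;
}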
3. Virtual methods also have a performance penalty.
The general answer to a problem like this is that you want to move as much of the heavier processing as possible out of the row loop, so it executes once instead of for every row.
It's difficult to provide more detail without knowing exactly how your validation/formatting system works, but I can offer some "pointers":
Is it possible to build some kind of cached data structure from this condition table of yours? This would eliminate all the heavy DataTable operations from your inner loop.
The most efficient solution to mass validation/formatting I can think of is to construct a C# script based upon your set of conditions, compile it into a delegate, and then just evaluate it for each row. I don't know if this is possible for your problem ...
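For example, a hedged sketch of that idea (TranslateCondition is a hypothetical method standing in for whatever turns your condition strings into predicates) precompiles everything once and keeps only cheap delegate calls inside the row loop:

// Sketch: parse/compile each condition once, then evaluate only compiled delegates per row.
var compiledChecks = new List<Func<DataRow, bool>>();
foreach (DataRow condRow in dtValidateAndFormatConditions.Rows)
{
    string condition = condRow[conditionIndex].ToString();
    compiledChecks.Add(TranslateCondition(condition));   // hypothetical: heavy parsing happens here, once
}

foreach (DataRow dataRow in DtCopied.Rows)
{
    bool rowIsValid = compiledChecks.All(check => check(dataRow));   // cheap per-row evaluation
    // ... apply formatting based on rowIsValid ...
}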
There are two proposed improvements to the algorithm:
a. You can use multithreading, if possible, to speed up the process by a constant factor (this needs testing to get the actual value); use it to evaluate the rows in parallel.
b. If it is possible to stop processing a row as soon as even one column is invalid, then stop processing that row. Further, you can analyse a large sample of the input data, arrange the columns in decreasing order of the probability of their being invalid, and then check the columns in that calculated order. You can also order the predicates within each validation condition in the same way you ordered the columns.
Proposed algorithm which might improve performance:
for cond in conditions:
    probability(cond) = 0
for record in largeDataSet:
    for col in record:
        for cond in conditions:
            if invalid(cond, col):
                probability(cond)++
condorder = sort(conditions, key = probability, order = decreasing)
check conditions in the order given by condorder
This is a learning algorithm that can be used to calculate the order in which to evaluate predicates for efficient short-circuit evaluation of the conditions; it still takes the same time for fully valid inputs. You can compute this order offline on a large dataset of sample inputs and just store it in an array for live usage.
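A C# sketch of that offline step (conditions, sampleData and IsInvalid are hypothetical placeholders for your own condition list, sample inputs and validity check):

// Sketch: count how often each condition fails on sample data, then order by failure count.
var failureCounts = new int[conditions.Count];

foreach (DataRow record in sampleData.Rows)
{
    for (int c = 0; c < conditions.Count; c++)
    {
        if (IsInvalid(conditions[c], record))   // hypothetical validity check
            failureCounts[c]++;
    }
}

// Evaluate the most frequently failing conditions first at run time.
int[] conditionOrder = Enumerable.Range(0, conditions.Count)
    .OrderByDescending(c => failureCounts[c])
    .ToArray();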
Edit: Another improvement that I missed is the use of a hash table for columns that have a small range of valid values; instead of evaluating the conditions on such a column, we just check whether the value is in the hash table. Similarly, if the range of invalid values is small, we check for those in a hash table instead. The hash table can be filled before evaluation starts, for example from a file.
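A small sketch of that idea, with a hypothetical "Status" column whose valid values are known up front:

// Sketch: precomputed set of valid values for one column (hypothetical column and values).
var validStatusValues = new HashSet<string> { "A", "B", "C" };

// Inside the row loop, a set lookup replaces the condition evaluation for that column.
bool statusIsValid = validStatusValues.Contains(dataRow["Status"].ToString());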
Operations like string Condition = dRow["Condition"] are rather heavy, so I would recommend moving the row enumeration (the for loop) from the ValidateAndFormat method into the ValidateAndFormatOnCopyPaste method, immediately around the call to GetValidatedFormattedData.
Pre steps:
create a task class that accepts 1 dataRow and processes it
create a task queue
create a thread pool with Y workers
Algorithm
when you get your N rows, add all of them as tasks to your task queue
get the workers to start taking tasks from the task queue
as the task responses get returned, update your data table (probably initially a clone)
once all the tasks are done, return your new data table (a sketch of this flow follows below)
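A minimal sketch of that queue/worker flow (workerCount and ProcessRow are hypothetical; a real task class would carry more state):

// Sketch: queue the rows, let a fixed number of workers drain the queue, collect results in a clone.
var taskQueue = new BlockingCollection<DataRow>();
foreach (DataRow row in sourceTable.Rows) taskQueue.Add(row);
taskQueue.CompleteAdding();

DataTable result = sourceTable.Clone();                // updated as task responses come back
var workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
{
    foreach (DataRow row in taskQueue.GetConsumingEnumerable())
    {
        object[] processed = ProcessRow(row);          // hypothetical per-row work
        lock (result) result.Rows.Add(processed);      // DataTable writes are not thread-safe
    }
})).ToArray();

Task.WaitAll(workers);                                 // once all tasks are done, result is complete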
Possible Improvements
as Vikram said, you could probably short-circuit your conditions: if at value 10 you already know the row is an error, don't bother checking the remaining 140 conditions. That only works if it fits your requirements; if you need a checked result for all 150, then you can't escape that one
change the task classes to take in a list of rows instead of one; this could help by reducing context switching between threads if individual rows finish quickly (see the sketch after this list)
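A rough sketch of the batching idea (ProcessRow is a hypothetical stand-in for the per-row validation/formatting; the range size of 250 is arbitrary):

// Sketch: hand each worker a contiguous range of rows instead of a single row.
// Requires System.Linq, System.Data.DataSetExtensions and System.Collections.Concurrent.
DataRow[] rows = DtCopied.AsEnumerable().ToArray();
Parallel.ForEach(Partitioner.Create(0, rows.Length, rangeSize: 250), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        ProcessRow(rows[i]);   // hypothetical per-row validation/formatting
});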
Other ideas that I haven't really thought through
sort the data first, maybe there is a speed benefit with short circuiting certain known conditions
checksum the whole row and store it in a DB together with its result; essentially cache the parameters/results, so that the next time something with the exact same values/fields runs, you can pull it from the cache
the other thing a checksum of the whole row buys you is change detection: let's say you have some sort of key and you are looking for changed data; the checksum of everything but the key will tell you whether some value changed and whether it's even worth looking at the conditions of all the other columns
You can use DataTable.Select to update the matching rows in one statement. Try this:
datatable.Select(string.Format("[Col1] = '{0}'", id)).ToList().ForEach(r => r["Col1"] = Data);
I know this is kind of brute force, but you can combine it with what others have suggested:
// Cast<DataRow>() gives Parallel.ForEach a typed enumerable over the rows
Parallel.ForEach(dtValidateAndFormatConditions.Rows.Cast<DataRow>(), dRow =>
{
    string Condition = dRow[conditionIndex].ToString();
    string FormatValue = dRow[valueIndex].ToString();
    GetValidatedFormattedData(DtCopied, ref Condition, ref FormatValue, CurRow);
    Condition = Parse(Condition);
    FormatValue = Parse(FormatValue);
    lock (dRow)
    {
        dRow[conditionIndex] = Condition;
        dRow[valueIndex] = FormatValue;
    }
});
Related
Just out of curiosity, how exactly does SELECT * FROM table WHERE column = "something" work?
Is the underlying principle the same as that of a for/foreach loop with an if condition, like:
for (iterator)
{
    if (condition)
        // print results
}
If I am dealing with, say, 100 records, will there be any considerable performance difference between the two approaches in getting the data I want?
SQL is a 4th generation language, which makes it very different from programming languages. Instead of telling the computer how to do something (loop through rows, compare columns), you tell the computer what to do (get the rows matching a condition).
The DBMS may or may not use a loop. It could as well use hashes and buckets, pre-sort a data set, whatever. It is free to choose.
On the technical side, you can provide an index in the database, so the DBMS can look up the keys quickly to access the rows (like quickly finding names in a telephone book). This gives the DBMS an option for how to access the data, but it is still free to use a completely different approach, e.g. to read the whole table sequentially.
First I build a list (by reading existing files) of approximately 12,000 objects that look like this:
public class Operator
{
string identifier; //i.e "7/1/2017 MN01 Day"
string name1;
string name2;
string id1;
string id2;
}
The identifier will be unique within the list.
Next I run a large query (currently about 4 million rows but it could be as large as 10 million, and about 20 columns). Then I write all of this to a CSV line by line using a write stream. For each line I loop over the Operator list to find a match and add those columns.
The problem I am having is with performance. I expect this report to take a long time to run but I've determined that the file writing step is taking especially long (about 4 hours). I suspect that it has to do with looping over the Operator list 4 million times.
Is there any way I can improve the speed of this? Perhaps by doing something when I build the list initially (indexing or sorting, maybe) that will allow searching to be done much faster.
You should be able to greatly speed up your code by building a Dictionary (hash table):
var items = list.ToDictionary(i => i.identifier, i => i);
You can then index in on this dictionary:
var item = items["7/1/2017 MN01 Day"];
Building the dictionary is an O(n) operation, and a lookup into the dictionary is an O(1) operation. This means the matching work becomes linear in the number of query rows rather than proportional to rows × operators.
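In the write loop, the lookup might look roughly like this (a sketch; it assumes the Operator fields are accessible and that the query rows and CSV writer are already in place):

// Sketch: one O(1) dictionary lookup per query row instead of scanning the 12,000-item list.
var operatorsById = operators.ToDictionary(o => o.identifier);

foreach (var row in queryRows)                          // hypothetical source of the query rows
{
    if (operatorsById.TryGetValue(row.Identifier, out var op))
    {
        // hypothetical column names; append the matched operator's fields to the CSV line
        writer.WriteLine($"{row.SomeColumn},{op.name1},{op.name2}");
    }
}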
... but also, "couldn't you somehow put those operators into a database table, so that you could use some kind of JOIN operation in your SQL?"
Another possibility that comes to mind is ... "twenty different queries, one for each symbol." Or, a UNION query with twenty branches. If there is any way for the SQL engine to use indexes, on its side, to speed up that process, you would still come out ahead.
Right now, vast amounts of time might be being wasted packaging up every one of those millions of lines and squirting them through the network wires to your machine, only to have to discard most of them, say, because they don't match any symbol.
If you control the database and can afford the space, and if, say, most of the rows don't match any symbol, consider a symbols table and a symbols_matched table, the second being a many-to-many join table that pre-identifies which rows match which symbol(s). It might well be worth the space, to save the time. (The process of populating this table could be put to a stored procedure which is TRIGGERed by appropriate insert, update, and delete events ...)
It is difficult to tell you how to speed up your file write without seeing any code.
But in general it could be worth considering writing using multiple threads. This SO post has some helpful info, and you could of course Google for more.
How can I find out if a DynamoDB table contains any items using the .NET SDK?
One option is to do a Scan operation, and check the returned Count. But Scans can be costly for large tables and should be avoided.
The DescribeTable item count is not a real-time value; it is updated only about every six hours.
The best way is to Scan once, without any filter expression, and check the returned Count. This need not be costly: you scan only once, and you don't have to page through the whole table just to find out whether it contains any item at all.
A single Scan call reads at most 1 MB of data.
If the use case requires a real-time answer, this is the best and only option available.
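A minimal sketch with the .NET SDK (the table name is taken from the DescribeTable example below; setting Limit = 1 is an extra assumption to keep the read as small as possible):

// Sketch: scan a single page, at most one item, and check whether anything came back.
// Requires the AWSSDK.DynamoDBv2 package (Amazon.DynamoDBv2 and Amazon.DynamoDBv2.Model namespaces).
var client = new AmazonDynamoDBClient();
var response = await client.ScanAsync(new ScanRequest
{
    TableName = "FooTable",   // table name as in the example below
    Limit = 1                 // we only care whether at least one item exists
});
bool tableHasItems = response.Count > 0;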
Edit: While the below appears to work fine with small tables on localhost, the docs state
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
so only use DescribeTable if you don't need an accurate, up to date figure.
Original:
It looks like the best way to do this is to use the DescribeTable method on AmazonDynamoDBClient:
AmazonDynamoDBClient client = ...
if (client.DescribeTable("FooTable").Table.ItemCount == 0)
// do stuff
I'd like some input on keeping the order of a list during heavy-duty operations that I decided to try to do in a parallel manner to see if it boosts performance. (It did!)
I came up with a solution, but since this was my first attempt at anything parallel, I'd need someone to slap my hands if I did something very stupid.
There's a query that returns a list of card owners, sorted by name, then by date of birth. This needs to be rendered in a table on a web page (ASP.NET WebForms). The original coder decided he would construct the table cell by cell (TableCell), add the cells to rows (TableRow), then add each row to the table. So no GridView; allegedly its performance is bad, but performance was very poor regardless :).
The database query returns in no time, the most time is spent on looping through the results and adding table cells etc.
I made the following method to maintain the original order of the list:
private TableRow[] ComposeRows(List<CardHolder> queryResult)
{
int queryElementsCount = queryResult.Count();
// array with the query's size
var rowArray = new TableRow[queryElementsCount];
Parallel.For(0, queryElementsCount, i =>
{
var row = new TableRow();
var cell = new TableCell();
// various operations, including simple ones such as:
cell.Text = queryResult[i].Name;
row.Cells.Add(cell);
// here I'm adding the current item at its original index
// to maintain order in the output list
rowArray[i] = row;
});
return rowArray;
}
So as you can see, because I'm returning a very different type of data (List<CardHolder> -> TableRow[]), I can't just simply omit the ordering from the original query to do it after the operations.
I also thought it would be a good idea to Dispose() the objects at the end of each loop iteration, because the query can return a huge list and letting cell and row objects pile up on the heap could impact performance. (?)
How badly did I do? Does anyone have a better solution in case mine is flawed?
After testing with a 2000-element array, StopWatch says the parallel composition is done in 63 ms (165,082 ticks) and the serial composition in 267 ms (698,222 ticks), both including adding the rows to the table (table.Rows.AddRange()) and then rendering the Table. Without adding and rendering, the times are 33 ms / 87,541 ticks for parallel and 178 ms / 467,068 ticks for serial.
A little code-review:
the 2 calls to Dispose() should be removed. At best they are harmless.
take a critical look at how you obtain originalIndex. It should always turn out to be equal to i so you don't need it.
But aside from this bit of streamlining there is not much to improve. It is doubtful that doing this in parallel will help much, or that this is the real bottleneck in your code. I suspect most time will be spent processing the results in the TableRow[].
Given an expression like so:
DataTable1.Columns.Add("value", typeof(double), "rate * loan_amt");
in a DataTable with 10,000 rows, where rate is the same for all rows and loan_amt varies.
When the rate changes, it changes for all rows.
Currently that means iterating through all rows, like so:
foreach(DataRow dr in DataTable1.Rows) dr["rate"] = new_rate;
I'm wondering if there's a better way, using a reference table (with only one row) in the same DataSet and linking it somehow, like so:
DataTable1.Columns.Add("value", typeof(double), "RefTable.Row0.rate * loan_amt");
so that changing the rate would be as simple as:
RefTable.Rows[0]["rate"] = new_rate;
Or any other way?
That is a good idea, but you would have to rewrite any legacy code that accesses that data. It would certainly make updates to the rate more efficient, but you may run into issues with backward compatibility.
If there isn't much code accessing that table, then it isn't such a big deal; but if this is a production system with multiple processes reading that data, you might end up with a runaway train of null-value exceptions when trying to access the "rate" column of the original table, or inconsistencies in your "value" depending on which code accessed which table to retrieve the rate.
If this is not the case, then it's no big deal. Go for it.
Found the answer; adding it for others who might land here.
The key is to add a DataRelation between the two tables/columns, and the expression would be:
Parent.rate * loan_amt
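A minimal sketch of that setup (the id/ref_id key columns and sample values are assumptions; names are illustrative):

// Sketch: one-row reference table holding the rate, related to the loan table.
var ds = new DataSet();

var refTable = ds.Tables.Add("RefTable");
refTable.Columns.Add("id", typeof(int));
refTable.Columns.Add("rate", typeof(double));
refTable.Rows.Add(1, 0.05);

var loans = ds.Tables.Add("DataTable1");
loans.Columns.Add("ref_id", typeof(int));          // every row points at the single RefTable row
loans.Columns.Add("loan_amt", typeof(double));

// The relation is what makes the Parent.* syntax available in the expression column.
ds.Relations.Add("RateRel", refTable.Columns["id"], loans.Columns["ref_id"]);
loans.Columns.Add("value", typeof(double), "Parent.rate * loan_amt");

loans.Rows.Add(1, 100000.0);

// Changing the rate in one place now recalculates "value" for every row.
refTable.Rows[0]["rate"] = 0.06;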