My SSIS script task generates 5 rows of records into an output of the script component.
if (conditionMeets)
{
WOProductBuffer.AddRow();
WOProductBuffer.WorkOrderId = workOrderId;
WOProductBuffer.WorkOrderProductId = workOrderProductId;
//other fields
}
Objective: Count the number of rows, group by WorkOrderId and WorkOrderProductId, and set this count value to WopCount.
I realized that the PostExecute() method is unable to read the Output object WOProduct, so it is likely not possible there.
Based on all the rows, is there a way to implement this within the same script task?
Or is the only way to create a new script task and loop through all the records in its PreExecute() method to generate the count value?
What I have tried:
Adding WOProductBuffer to a list and looping through it in the PostExecute() method. This didn't work, as each row didn't seem to contain any values.
Currently trying:
How to loop through Input rows in a new script task
From an earlier question, you have a class something like this
public class WorkOrderProduct
{
public Guid workOrderId;
public Guid workOrderProductId;
}
In a script task, you'll want to update an SSIS Variable with the final count, and we know that can only take place in the PostExecute method.
Similar to how we moved the declaration of wopList to the class level in How to Access Parameters of User Variables in SSIS Script Task, we would create a similar list, just with a different type.
There are two ways of doing this. You can either implement the distinct logic in your code and only add unique items to the list, or you can use a tiny bit of LINQ and let it do the logic for you.
The decision points become:
What do I understand and want to maintain?
What's the expected cardinality relative to the size of the list - aka how many total rows would we expect versus how many uniques? If it's under millions, eh, I can't imagine it making a difference which you choose. Above a million, I'll start pulling out my consulting "It Depends" card but I suspect you'll be fine. Billions? yeah, I bet things will start to get interesting. If nothing else, you'll probably need LongCount instead of the Count method.
Create a class member - pick one (or both and try them out)
List<KeyValuePair<Guid, Guid>> option1;
List<KeyValuePair<Guid, Guid>> option2;
In your PreExecute method, instantiate the List(s)
this.option1 = new List<KeyValuePair<Guid, Guid>>();
this.option2 = new List<KeyValuePair<Guid, Guid>>();
In your existing logic, as a final step, implement option 1 or 2. We will create a KeyValuePair of our two Guids.
We will then ask the existing List whether it already contains that pair. If it does not, we'll add it to our option1 list.
Finally, we'll just add it to the option2 list, as we'll figure out the uniques later.
if (conditionMeets)
{
// Doing our business process here thing
KeyValuePair<Guid, Guid> newItem = new KeyValuePair<Guid, Guid>(workOrderId, workOrderProductId);
if (!option1.Contains(newItem))
{
option1.Add(newItem);
}
// Just add it and we'll figure it out later
option2.Add(newItem);
}
In your PostExecute method, you can use the Count property on the option1 List as you've already done the heavy lifting to only add distinct values.
For option2, we'll invoke the Distinct method and then chain a call to the Count() method. Do note the difference in when we use parentheses here: Count on the List is a property, while the LINQ Count() is a method, and leaving the parentheses off the latter will give you a "Cannot convert from 'method group' to int" error in your code.
Finally, Console.WriteLine doesn't do you any good here. Instead you'll assign the value back to your Variable and use FireInformation to write to the run log, as shown below.
bool pbFireAgain = false;
int uniqueCount = 0;
// Option 1 approach
uniqueCount = option1.Count;
// Pop the value into the run log so we can trace what was generated
this.ComponentMetaData.FireInformation(0, "SCR PostExecute Counts", string.Format("option1 count is {0}", option1.Count), "", 0, ref pbFireAgain);
// Option 2 logic
uniqueCount = option2.Distinct().Count();
// Push the option 2 values into the output log
this.ComponentMetaData.FireInformation(0, "SCR PostExecute Counts", string.Format("option2 Distinct Count is {0}, total Count is {1}", option2.Distinct().Count(), option2.Count()), "", 0, ref pbFireAgain);
this.Variables.MySSISVariable = uniqueCount;
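Not from the original answer, but worth noting given the cardinality discussion above: a HashSet<KeyValuePair<Guid, Guid>> is a third option that keeps the option 1 approach while making the duplicate check cheap, since Add simply ignores pairs it has already seen. A minimal sketch, assuming the same class-level member / PreExecute / PostExecute pattern and the same MySSISVariable name:
// Hypothetical option 3: the set itself enforces uniqueness.
HashSet<KeyValuePair<Guid, Guid>> option3;

public override void PreExecute()
{
    base.PreExecute();
    this.option3 = new HashSet<KeyValuePair<Guid, Guid>>();
}

// In the existing conditionMeets block, instead of the Contains check:
// option3.Add(new KeyValuePair<Guid, Guid>(workOrderId, workOrderProductId));

public override void PostExecute()
{
    base.PostExecute();
    // The set only ever holds unique pairs, so Count is already the distinct count.
    this.Variables.MySSISVariable = option3.Count;
}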
Related
I am working on a small expense tracking program. The idea is to have a list that holds Expense objects that can be manipulated and used to perform calculations.
I was able to create the List without issue and populate it with several dummy expenses. My expenses are grouped by category, Expense.expenseType, to allow me to do calculations for analysis so I am trying to make another List that will store category names and relevant calculations values. The list of category names is meant to remove duplicates but so far I've been unsuccessful at populating it.
My approach for creating the List has been to define a Category class that holds only a string parameter for categoryName and a float for categoryTotal, although the latter is initialized to 0.00. I then have a For loop that copies the names into the List and a second For loop that removes indexes based on the name once they've been alphabetized. I've tried different variations of this, but ultimately I get either an index that is out of bounds or a reduced, but still duplicated, list of categoryName.
Really hoping to get some advice so I could move forward with the code. I didn't add the actual code since I'm new to C#/VS and figure I may be approaching the problem all wrong.
Edit 1: Based on the feedback I got, the function I am using is below:
public void getCategories(List<Category> passedCategories)
{
passedCategories = passedCategories.GroupBy(Category =>Category.strName)
.Select(gr => new Category
{
strName = gr.Key,
fltTotal = gr.Sum(ex => ex.Value)
});
}
This function is not working. I have a few points I wanted to clarify, and I am sure there are others I missed.
Passed categories is a List of Categories that have three parameters - strName, fltTotal and fltPercent. The latter two are currently set to zero when the whole list is populated via a temp Category. The strName is being copied from an Expense List with many more parameters. Since the Category name will repeat in the Expense List, I am trying to remove all duplicates so I can have just the distinct categories. I took out var since I am passing the List in; should I not have done that? What am I missing?
Thanks again for the help,
Yusif Nurizade
What you need is something like the following. I say "something" because I can't see your code and have to imagine it. For instance, I don't know the name of the property for the amount of the expense. I assumed that it is called Value.
// This would be the list of expenses. You have to populate it with data.
var expenses = new List<Expense>();
// Using LINQ you can achieve that you want in a few lines.
// First you group your data by their categories.
// Then you calculate the total expense for each category.
var statistics = expenses.GroupBy(expense => expense.Type)
                         .Select(gr => new Category
                         {
                             Name = gr.Key,
                             Total = gr.Sum(ex => ex.Value)
                         });
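To make that concrete, here is a hedged, self-contained usage sketch; the Expense and Category shapes (Type, Value, Name, Total) are the assumed ones from the snippet above, not necessarily what your real classes look like:
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shapes, mirroring the snippet above.
public class Expense { public string Type; public float Value; }
public class Category { public string Name; public float Total; }

class Demo
{
    static void Main()
    {
        var expenses = new List<Expense>
        {
            new Expense { Type = "Food",   Value = 12.50f },
            new Expense { Type = "Food",   Value =  7.25f },
            new Expense { Type = "Travel", Value = 30.00f },
        };

        // One Category per distinct Type, with the summed Value.
        var statistics = expenses.GroupBy(expense => expense.Type)
                                 .Select(gr => new Category
                                 {
                                     Name = gr.Key,
                                     Total = gr.Sum(ex => ex.Value)
                                 })
                                 .ToList();

        foreach (var category in statistics)
            Console.WriteLine("{0}: {1}", category.Name, category.Total);
        // Food: 19.75, Travel: 30
    }
}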
How can you determine the current item's position whilst looping through the collection?
I'm working through decision data, grouped by each client, but I have some business logic which depends on the "position" in the set, i.e. 1st, 2nd, 3rd, etc. in conjunction with other properties of the record, e.g. if it's the 3rd decision about a client and their rating in the instance is A then ...
var multiples = from d in context.Decision_Data
group d by d.Client_No
into c
where c.Count() > 1
select c;
foreach (var grouping in multiples)
{
foreach (var item in grouping)
{
// business logic here for processing each decision for a Client_No
// BUT depends on item position ... 1st, 2nd, etc.
}
}
UPDATE: I appreciate I could put a counter in and manually increment it, but it feels wrong and I'd have thought there was something in .NET to handle this?
Something like this:
foreach (var grouping in multiples)
{
foreach (var x in grouping.Select((item, index) => new { index, item }))
{
// x.index is the position of the item in this group
// x.item is the item itself
}
}
Side note: you can make the implementation of your LINQ query a bit more efficient. Count() > 1 will enumerate each group completely, which you are likely to do in the foreach anyway. Instead you can use Skip(1).Any(), which will stop iterating the group as soon as it finds two items. Obviously this will only make a real difference for (very) large input lists.
var multiples = from d in context.Decision_Data
group d by d.Client_No
into c
where c.Skip(1).Any()
select c;
There isn't anything offered by the standard foreach. Simply maintain an external count.
There is an overload on the Enumerable.Select extension method that provides the index of the current item:
http://msdn.microsoft.com/en-us/library/bb534869
But without knowing what your code is trying to do in the foreach I cannot really offer an example of using it. In theory you could project an anonymous type that has the index stored and use that later on with the foreach. It appears that jeroenh's answer went down this route.
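That said, purely as a hedged illustration (the real business logic isn't known, so Rating here is a stand-in property name), such a projection could look like this:
foreach (var grouping in multiples)
{
    // Pair each decision with its zero-based position within the client's group.
    foreach (var x in grouping.Select((item, index) => new { item, index }))
    {
        // e.g. "the 3rd decision about a client with rating A" becomes:
        // if (x.index == 2 && x.item.Rating == "A") { ... }
    }
}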
As Adam stated, you could either go with Adam's solution, or do a ToList() on the query to be able to do
multiples.IndexOf(grouping)
I fail to see how you can have any certainty about your decisions' order.
I'm guessing your data comes from a long-term data source (e.g. a database or such) and that you don't have any control over the order in which the decisions are fetched from the data source, especially after applying a "group by".
I would add an "order" field (or column) to the Decision entity to track the order in which the decision were made that would be set while adding the Decision to the data source.
That way, you could directly use this field in your business logic.
There must be many ways to achieve the tracking of decision order, but without one you can't even be sure in what order they were made.
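A hedged sketch of that idea; DecisionOrder is a made-up column name standing in for whatever field records when each decision was made:
var multiples = from d in context.Decision_Data
                group d by d.Client_No into c
                where c.Skip(1).Any()
                select c;

foreach (var grouping in multiples)
{
    // Order within the group by the new column before applying position-dependent logic.
    foreach (var x in grouping.OrderBy(d => d.DecisionOrder)
                              .Select((item, index) => new { item, index }))
    {
        // x.index is now a position you can rely on.
    }
}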
Say you have a List of objects. The user uses nearly all of the objects while working.
How can you order the list of objects so that it adapts to the order in which the user mostly uses them? What algorithm can you use for that?
EDIT: Many answers suggested counting the number of times an object was used. This does not work, because all objects are used the same amount, just in different orders.
Inside your object, keep a usedCount. Whenever the object is used, increase this count.
Then you can simply do this:
objects = objects.OrderByDescending(o => o.UsedCount).ToList();
I would keep a running count of how many times the object was used, and in what order it was used.
So if object X was used 3rd, average it with the running count and use the result as its position in the list.
For example:
Item Uses Order of Use
---------------------------------------
Object X 10 1,2,3,1,2,1,3,1,2,2 (18)
Object Y 10 3,1,2,3,3,3,1,3,3,1 (23)
Object Z 10 2,3,1,2,1,2,2,2,2,3 (20)
Uses would be how many times the user used the object, the order of use would be a list (or sum) of where the item is used in the order.
Using a list of the each order individually could have some performance issues, so you may just want to keep a sum of the positions. If you keep a sum, just add the order to that sum every time the object is used.
To calculate the position, you would then just use the sum of the positions, divided by the number of uses and you'd have your average. All you would have to do at that point is order the list by the average.
In the example above, you'd get the following averages (and order):
Object X 1.8
Object Z 2.0
Object Y 2.3
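A minimal sketch of that bookkeeping (the class and property names are made up for illustration): keep the use count and a running sum of positions, and order the list by the average.
using System;
using System.Collections.Generic;
using System.Linq;

public class TrackedItem
{
    public string Name;
    public int Uses;          // how many times the item has been used
    public int PositionSum;   // running sum of the positions it was used in

    public void RecordUse(int position)
    {
        Uses++;
        PositionSum += position;
    }

    // Average position; items that were never used sink to the bottom.
    public double AveragePosition
    {
        get { return Uses == 0 ? double.MaxValue : (double)PositionSum / Uses; }
    }
}

class Demo
{
    static void Main()
    {
        var items = new List<TrackedItem>
        {
            new TrackedItem { Name = "X", Uses = 10, PositionSum = 18 },
            new TrackedItem { Name = "Y", Uses = 10, PositionSum = 23 },
            new TrackedItem { Name = "Z", Uses = 10, PositionSum = 20 },
        };

        foreach (var item in items.OrderBy(i => i.AveragePosition))
            Console.WriteLine("{0}: {1:0.0}", item.Name, item.AveragePosition);
        // X: 1.8, Z: 2.0, Y: 2.3
    }
}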
Add a list of datetimes of when a user accesses an object. Each time a user uses an object, add a datetime.
Now just count the number of datetime entries in your list that are within (now - x days) and sort by that. You can delete the datetimes that are older than (now - x days).
It's possible that a user uses different items over a month; this will reflect those changes.
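A rough sketch of that approach, with made-up names and the window passed in as a TimeSpan:
using System;
using System.Collections.Generic;

public class TrackedObject
{
    public string Name;
    public List<DateTime> Accesses = new List<DateTime>();

    public void RecordAccess()
    {
        Accesses.Add(DateTime.Now);
    }

    public int RecentUses(TimeSpan window)
    {
        DateTime cutoff = DateTime.Now - window;
        // Drop entries older than the window, then count what is left.
        Accesses.RemoveAll(a => a < cutoff);
        return Accesses.Count;
    }
}

// Ordering by recent use, most used first (x days = 30 here):
// objects = objects.OrderByDescending(o => o.RecentUses(TimeSpan.FromDays(30))).ToList();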
You can add a number_of_views field to your object class, increment it every time the object is used, and sort the list by that field. You should also reset this field to 0 on all objects once the number_of_views is the same, but non-zero, for every object.
I would also use a counter for each object to monitor its use, but instead of reordering the whole list after each use, I would recommend to just sort the list "locally".
Like in a bubble sort, I would just compare the object whose counter was just increased with the upper object, and swap them if needed. If swapped, I would then compare the object and its new upper object and so on.
However, it is not very different from the previous methods if the sort is properly implemented.
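A hedged sketch of that local swap; UsedCount and the list are illustrative names, and the list is assumed to already be ordered by UsedCount, descending:
using System.Collections.Generic;

public class RankedItem
{
    public string Name;
    public int UsedCount;
}

public static class UsageSort
{
    // Call this right after items[index].UsedCount has been incremented.
    public static void BubbleUp(List<RankedItem> items, int index)
    {
        while (index > 0 && items[index].UsedCount > items[index - 1].UsedCount)
        {
            // Swap the item with the one above it, then keep checking upwards.
            RankedItem tmp = items[index - 1];
            items[index - 1] = items[index];
            items[index] = tmp;
            index--;
        }
    }
}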
If your User class looks like so:
class User
{
Collection<Algo> algosUsed = new List<Algo>(); //Won't compile, used for explanation
...
}
And your Algo class looks like so:
class Algo
{
int usedCount;
...
}
You should be able to bind specific instances of the Algo object to the User object that allow for the recording of how often it is used. At the most basic level you would serialize the information to a file or a stream. Most likely you want a database to keep track of what is being used. Then, when you grab your User and invoke a sort function, you order the algosUsed collection of User by the usedCount field of Algo, as sketched below.
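A hedged one-liner for that sort, assuming the members above are exposed as properties (AlgosUsed on User, UsedCount on Algo) and System.Linq is in scope:
// Most frequently used algorithms first.
user.AlgosUsed = user.AlgosUsed.OrderByDescending(algo => algo.UsedCount).ToList();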
Sounds like you want a cache. I suppose you could look at the algorithms a cache uses and then take out the whole business about context switching... there is an algorithm called "clock sweep"... but meh, that might all be too complex for what you are looking for. To go the lazy way, I'd say just make a hash of "used thing" : num_of_uses or, in your class, have a var you ++ each time the object is used.
Every once in a while, sort the hash by num_of_uses or the objects by the value of their ++'d variable.
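The lazy-way hash could look something like this (all names made up):
using System.Collections.Generic;
using System.Linq;

static class UsageTally
{
    // "used thing" -> num_of_uses
    static readonly Dictionary<string, int> useCounts = new Dictionary<string, int>();

    public static void RecordUse(string thing)
    {
        int count;
        useCounts.TryGetValue(thing, out count);
        useCounts[thing] = count + 1;
    }

    // Every once in a while, pull the things out sorted by num_of_uses.
    public static List<string> MostUsedFirst()
    {
        return useCounts.OrderByDescending(kv => kv.Value)
                        .Select(kv => kv.Key)
                        .ToList();
    }
}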
From https://stackoverflow.com/a/2619065/1429439 :
maybe use OrderedMultiDictionary with the usedCount as the keys and the object as the value.
EDIT: Added an order preference! Look in the code.
I don't like the last-used method, as Carra said, because it inflicts many sort changes, which is confusing.
The count_accessed field is much better, though I think it should be levelled to
how many times the user accessed this item in the last XX minutes/hours/days, etc...
The best data structure for that is surely:
static TimeSpan TIME_TO_LIVE;
static int userOrderFactor = 0;
LinkedList<KeyValuePair<DateTime, int>> myAccessList = new LinkedList<KeyValuePair<DateTime, int>>();
private void Access_Detected()
{
userOrderFactor++;
myAccessList.AddLast(new KeyValuePair<DateTime, int>(DateTime.Now, userOrderFactor));
myPriority += userOrderFactor; // take total count differential, so we don't waste time summing the list
}
private int myPriority = 0;
public int MyPriority
{
get
{
DateTime expiry = DateTime.Now.Subtract(TIME_TO_LIVE);
while (myAccessList.First != null && myAccessList.First.Value.Key < expiry)
{
myPriority -= myAccessList.First.Value.Value; // subtract the expired entry's contribution from the running total
myAccessList.RemoveFirst();
}
return myPriority;
}
}
Hope this helps...
it is almost always O(1) BTW...
reminds me somewhat of the Sleep mechanism of Operating Systems
When a user interacts with an object, save the ID of the previous object acted upon on that second object so that you always have a pointer to the object used before any given object.
Additionally, store the ID of the most frequently first-used object so you know where to start.
When you are building your list of objects to display, start with the one you've stored as the most frequently first-used object, then search for the object that has that first-used object's ID stored on it to display next.
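A hedged sketch of that chain; the Item shape, the nullable previous-ID field, and the stored starting ID are all assumptions for illustration:
using System;
using System.Collections.Generic;
using System.Linq;

public class Item
{
    public Guid Id;
    public Guid? PreviousUsedId;   // ID of the object the user acted on just before this one
}

public static class UsageChain
{
    // startId is the stored "most frequently first-used" object's ID.
    public static List<Item> BuildDisplayOrder(List<Item> items, Guid startId)
    {
        var ordered = new List<Item>();
        var current = items.FirstOrDefault(i => i.Id == startId);
        while (current != null && !ordered.Contains(current))
        {
            ordered.Add(current);
            // The next item to show is the one whose "previous" pointer is the current item.
            var currentId = current.Id;
            current = items.FirstOrDefault(i => i.PreviousUsedId == currentId);
        }
        // Anything never reached goes at the end.
        ordered.AddRange(items.Except(ordered));
        return ordered;
    }
}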
I've got the following list
List<int> deletedRecords = new List<int>();
When I hit the delete button in my gridview, I add the Id for that record to this List.
When the user clicks the save button, all records that exist in the List are deleted from the database before I proceed.
However, when I get to this point, the List is always empty.
The List is only referenced in three places, those being its declaration, its .Add method, and a foreach to cycle through all values it contains.
When I do a debug, I can see the List.Count go to 1, but then when I hit the Save button, my Debug shows the List has gone to a count of 0. I'm really confused by this.
Can anyone help?
The list variable / field only exists for the duration of a single request; any button-click (such as Save) is a separate request, with an entirely different set of objects. In many cases it won't even be served by the same server.
If you need state between requests, you need to manage that state, perhaps via session-state.
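A minimal sketch of the session-state route (the key name is arbitrary); the ViewState-based answer below follows the same shape:
// Lives in the page class; the list survives across postbacks for this user's session.
private List<int> DeletedRecords
{
    get
    {
        var result = Session["deletedRecords"] as List<int>;
        if (result == null)
        {
            result = new List<int>();
            Session["deletedRecords"] = result;
        }
        return result;
    }
}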
Without seeing any code, I'd guess the following:
You've forgotten web apps have no state.
As Marc and Esteban said, you have to persist your items.
So instead of writing
List<int> deletedRecords = new List<int>();
you could write
private List<int> deletedRecords
{
get
{
var result = ViewState["deletedRecords"] as List<int>;
if ( result == null )
{
result = new List<int>();
ViewState["deletedRecords"] = result;
}
return result;
}
}
and use this property of your page class instead.
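Usage then matches what the question describes (the handler names are hypothetical): add the Id in the grid's delete handler, then loop over the property when saving.
// In the grid row's delete handler:
deletedRecords.Add(recordId);

// In the Save button's click handler:
foreach (int id in deletedRecords)
{
    // delete the record with this id from the database
}
deletedRecords.Clear();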
I want to compare an array of modified records against a list of records pulled from the database, and delete those records from the database that do not exist in the incoming array. The modified array comes from a client app that maintains the database, and this code runs in a WCF service app, so if the client deletes a record from the array, that record should be deleted from the database. Here's the sample code snippet:
public void UpdateRecords(Record[] recs)
{
// look for deleted records
foreach (Record rec in UnitOfWork.Records.ToList())
{
var copy = rec;
if (!recs.Contains(rec)) // use this one?
if (0 == recs.Count(p => p.Id == copy.Id)) // or this one?
{
// if not in the new collection, remove from database
Record deleted = UnitOfWork.Records.Single(p => p.Id == copy.Id);
UnitOfWork.Remove(deleted);
}
}
// rest of method code deleted
}
My question: is there a speed advantage (or other advantage) to using the Count method over the Contains method? the Id property is guaranteed to be unique and to identify that particular record, so you don't need to do a bitwise compare, as I assume Contains might do.
Anyone?
Thanks, Dave
This would be faster:
if (!recs.Any(p => p.Id == copy.Id))
This has the same advantages as using Count() - but it also stops after it finds the first match, unlike Count().
You should not even consider Count since you are only checking for the existence of a record. You should use Any instead.
Using Count forces the entire enumerable to be iterated to get the correct count; Any stops enumerating as soon as it finds the first element.
As for the use of Contains, you need to take into consideration whether, for the specified type, reference equality is equivalent to the Id comparison you are performing. By default it is not.
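If you do want Contains to match on Id, Record needs equality members along these lines (a sketch; the Id type and the rest of the class are assumptions):
public class Record
{
    public int Id { get; set; }
    // ... other fields ...

    public override bool Equals(object obj)
    {
        var other = obj as Record;
        return other != null && other.Id == Id;
    }

    public override int GetHashCode()
    {
        return Id.GetHashCode();
    }
}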
Assuming Record implements both GetHashCode and Equals properly, I'd use a different approach altogether:
// I'm assuming it's appropriate to pull down all the records from the database
// to start with, as you're already doing it.
foreach (Record recordToDelete in UnitOfWork.Records.ToList().Except(recs))
{
UnitOfWork.Remove(recordToDelete);
}
Basically there's no need to have an N * M lookup time - the above code will end up building a set of records from recs based on their hash code, and find non-matches rather more efficiently than the original code.
If you've actually got more to do, you could use:
HashSet<Record> recordSet = new HashSet<Record>(recs);
foreach (Record recordFromDb in UnitOfWork.Records.ToList())
{
if (!recordSet.Contains(recordFromDb))
{
UnitOfWork.Remove(recordFromDb);
}
else
{
// Do other stuff
}
}
(I'm not quite sure why your original code is refetching the record from the database using Single when you've already got it as rec...)
Contains() is going to use Equals() against your objects. If you have not overridden this method, it's even possible Contains() is returning incorrect results. If you have overridden it to use the object's Id to determine identity, then in that case Count() and Contains() are almost doing the exact same thing, except Contains() will short-circuit as soon as it hits a match, whereas Count() will keep on counting. Any() might be a better choice than both of them.
Do you know for certain this is a bottleneck in your app? It feels like premature optimization to me. Which is the root of all evil, you know :)
Since you're guaranteed that there will be 1 and only 1, Any might be faster, because as soon as it finds a record that matches it will return true.
Count will traverse the entire list counting each occurrence. So if the item is #1 in the list of 1000 items, it's going to check each of the 1000.
EDIT
Also, this might be a time to mention not doing a premature optimization.
Wire up both your methods, put a stopwatch before and after each one.
Create a sufficiently large list (1000 items or more, depending on your domain.) And see which one is faster.
My guess is that we're talking on the order of ms here.
I'm all for writing efficient code, just make sure you're not taking hours to save 5 ms on a method that gets called twice a day.
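A rough sketch of that measurement; the array contents and size are stand-ins, so treat the numbers as relative only:
using System;
using System.Diagnostics;
using System.Linq;

class Benchmark
{
    static void Main()
    {
        int[] recs = Enumerable.Range(0, 1000000).ToArray(); // stand-in for the Record[] array
        int idToFind = 1;                                    // near the front: worst case for Count, best for Any

        var sw = Stopwatch.StartNew();
        bool foundViaCount = recs.Count(r => r == idToFind) > 0;
        sw.Stop();
        Console.WriteLine("Count(): {0} ms ({1})", sw.ElapsedMilliseconds, foundViaCount);

        sw.Restart();
        bool foundViaAny = recs.Any(r => r == idToFind);
        sw.Stop();
        Console.WriteLine("Any():   {0} ms ({1})", sw.ElapsedMilliseconds, foundViaAny);
    }
}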
It would be something like this:
UnitOfWork.Records.RemoveAll(r => !recs.Any(rec => rec.Id == r.Id));
May I suggest an alternative approach, which I believe should be faster, since Count would continue even after the first match.
public void UpdateRecords(Record[] recs)
{
// look for deleted records
foreach (Record rec in UnitOfWork.Records.ToList())
{
var copy = rec;
if (!recs.Any(x => x.Id == copy.Id))
{
// if not in the new collection, remove from database
Record deleted = UnitOfWork.Records.Single(p => p.Id == copy.Id);
UnitOfWork.Remove(deleted);
}
}
// rest of method code deleted
}
That way you are sure to break on the first match instead of continuing to count.
If you need to know the actual number of elements, use Count(); it's the only way. If you are checking for the existence of a matching record, use Any() or Contains(). Both are MUCH faster than Count(), and both will perform about the same, but Contains will do an equality check on the entire object while Any() will evaluate a lambda predicate based on the object.