Let's say I work at the Dept. of Health. I've processed food poisoning complaints and stored the complaints data into a multi-dimensional array like so:
ID - 5 digit ID number for the restaurant victim ate at
Date - Date of Food Poisoning
Name - Name of Victim
Age - Age of Victim
Phone - Victim's Phone Number
Array[0] contains the first complaint's data. Array[0].ID contains the restaurant ID of the first complaint and so forth.
Within my array how do I extract a list of unique 5 digit IDs?
Some restaurants might have 50 complaints and some might have just 1. I want to create a list of all of the unique restaurant IDs that show up in my complaints data.
var Unique = array.ID.Distinct();
does not work. What am I doing wrong?
Select() first...
var ids = array.Select(o => o.ID).Distinct();
Edit:
Hi, can you please explain why?
First, let's talk about what you did wrong:
var ids = array.ID.Distinct();
You tried to refer to ID, a non-existent member of the array. What you're looking for is the ID of an item within the array.
You tried to call Distinct() on that non-existent member rather than the collection.
Now let's look at what the new code does:
var ids = array.Select(o => o.ID).Distinct();
That Select() generates a new enumerable yielding only the ID values. The Distinct() generates another enumerable, yielding only the unique values from the Select().
Use a HashSet if you plan to do lookups going forward:
var hashSet = new HashSet<int>(array.Select(i => i.ID));
This will automatically remove duplicates and also allow near O(1) lookups.
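Putting the answer together, here's a minimal runnable sketch. The Complaint class and its values are hypothetical stand-ins for the question's array elements:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's complaint records.
class Complaint
{
    public int ID;       // 5-digit restaurant ID
    public string Name;  // victim's name
}

class Program
{
    static void Main()
    {
        var array = new[]
        {
            new Complaint { ID = 10001, Name = "Alice" },
            new Complaint { ID = 10002, Name = "Bob" },
            new Complaint { ID = 10001, Name = "Carol" }, // duplicate restaurant
        };

        // Project each complaint to its ID, then deduplicate.
        var ids = array.Select(o => o.ID).Distinct().ToList();
        Console.WriteLine(string.Join(",", ids)); // 10001,10002

        // Or build a HashSet for fast membership tests later on.
        var hashSet = new HashSet<int>(array.Select(o => o.ID));
        Console.WriteLine(hashSet.Contains(10001)); // True
    }
}
```

In LINQ-to-Objects, Distinct() yields values in order of first occurrence, so the list keeps the original complaint ordering.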
Related
This one is all about performance. I have two large lists of objects (here I'll use PEOPLE/PERSON as the stand-in). First, I need to filter one list by the First_Name property. Then I need to create two filtered lists from each master list based on shared dates: one list containing only one name, the other containing every name, with both lists restricted to matching date entries (no date in one list that doesn't exist in the other). I've written pseudo-code below to reduce the issue to the core question. Please understand while reading that BIRTHDAY wasn't the best choice, as there are multiple date entries per person, so pretend each person has about 5,000 "birthdays" when reading the code below:
public class Person
{
public string first_Name;
public string last_Name;
public DateTime birthday;
}
public class filter_People
{
List<Person> Group_1 = new List<Person>();// filled from DB Table "1982 Graduates" Group_1 contains all names and all dates
List<Person> Group_2 = new List<Person>();// filled from DB Table "1983 Graduates" Group_2 contains all names and all dates
public void filter(List<Person> group_One, List<Person> group_Two)
{
Group_1 = group_One;
Group_2 = group_Two;
//create a list of distinct first names from Group_1
List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();
//Compare each first name in Group_1 to EVERY first name in Group 2, using only records with matching birthdays
Parallel.For(0, distinct_Group_1_Name.Count, dI => {
//Step 1 - create a list of person out of group_1 that match the first name being iterated
List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();
//first_Name_List_1 now contains a list of everyone named X (Tom). We need to find people from group 2 who match Tom's birthday - regardless of name
//step 2 - find matching birthdays by JOINing the filtered name list against Group_2
DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();
//Step 3 - create filtered lists where Filtered_Group_1 contains ONLY people named Tom, and Filtered_Group_2 contains people with ANY name sharing Tom's birthday. No duplicates, no missing dates.
List<Person> Filtered_Group_1 = first_Name_List_1.Where(p => p.birthday.In(merged_Dates)).ToList();
List<Person> Filtered_Group_2 = Group_2.Where(p => p.birthday.In(merged_Dates)).ToList();
//Step 4 -- move on and process the two filtered lists (outside scope of question)
//each name in Group_1 will then be compared to EVERY name in Group_2 sharing the same birthday
//compare_Groups(Filtered_Group_1,Filtered_Group_2)
});
}
}
public static class Extension
{
public static bool In<T>(this T source, params T[] list)
{
return list.Contains(source);
}
}
Here, the idea is to take two different master name lists from the DB and create sub-lists where dates match (one with only one name, and the other with all names) allowing for a one-to-many comparison based on datasets of the same length with matching date indices. Originally, the idea was to simply load the lists from the DB, but the lists are long and loading all name data and using SELECT/WHERE/JOIN is much faster. I say "much faster" but that's relative.
I've tried converting Group_1 and Group_2 to Dictionaries and matching dates by using keys, without much improvement. Group_1 has about 12 million records (about 4,800 distinct names with multiple dates each), and Group_2 has about the same, so the input here is 12 million records and the output is a bazillion records. Even though I'm running this method as a separate Task and queuing the results for another thread to process, it's taking forever to split these lists and keep up.
Also, I realize this code doesn't make much sense using class Person, but it's only a representative of the problem essentially using pseudocode. In reality, this method sorts multiple datasets on date and compares one to many for correlation.
Any help on how to accomplish filtering this one to many comparison in a more productive way would be greatly appreciated.
Thanks!
With the code in its current form, I see too many issues for it to become performance-oriented with the kind of data you've described. Parallelism is no magic pill for a poor choice of algorithm and data structures.
Currently every comparison is a linear search, O(N), so M operations cost M*O(N). If we make those operations O(log N), or better still O(1), there will be a drastic improvement in execution time.
Instead of taking Distinct and then searching inside the Parallel loop with a Where clause, use GroupBy to aggregate the records by name and create a Dictionary in the same operation, which makes looking up the records for a given name trivial:
var nameGroupList = Group_1.GroupBy(p => p.first_Name).ToDictionary(p => p.Key, p => p);
This gets rid of the following two operations in the original code (one of them, inside the Parallel loop, is repeated for every name, which hurts performance badly):
List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();
List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();
The Dictionary will be of type Dictionary<string, IEnumerable<Person>>, so you get the list of Person objects for a given name in O(1) time with no repeated search. This also avoids recreating a list every time the original data is scanned.
The next thing hurting performance is code like this:
p.birthday.In(merged_Dates)
since the extension method runs list.Contains, an O(N) operation, every single time, which kills the performance. The possible options follow.
Take the following operation too out of the Parallel loop:
DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();
Instead, create another Dictionary of type Dictionary<string, HashSet<DateTime>> by intersecting the data from the Dictionary<string, IEnumerable<Person>> created earlier with the data from Group_2, using an appropriate IEqualityComparer that compares by date. That way a ready-made date set is available per name and needn't be recreated every time:
personDictionary["PersonCode"].Intersect(Group2,IEqualityComparer(using Date))
For the final result, note that you should store the dates in a HashSet instead of a List. The benefit is that Contains becomes an O(1) operation on average instead of O(N), which makes it much faster. A structure like Dictionary<string, HashSet<DateTime>> therefore gives you O(1) lookups end to end.
Try these points and report back whether the working of the code improves.
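To make the suggestion concrete, here is a minimal sketch of the idea. The sample data is invented; in the real code the HashSet of Group_2 dates could also be built per name via the intersection described above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Person
{
    public string first_Name;
    public DateTime birthday;
}

class Program
{
    static void Main()
    {
        var Group_1 = new List<Person>
        {
            new Person { first_Name = "Tom", birthday = new DateTime(1982, 5, 1) },
            new Person { first_Name = "Tom", birthday = new DateTime(1982, 6, 2) },
            new Person { first_Name = "Sue", birthday = new DateTime(1982, 5, 1) },
        };
        var Group_2 = new List<Person>
        {
            new Person { first_Name = "Ann", birthday = new DateTime(1982, 5, 1) },
            new Person { first_Name = "Bob", birthday = new DateTime(1982, 7, 3) },
        };

        // One O(N) pass each, instead of a Where() scan per distinct name.
        var byName = Group_1.GroupBy(p => p.first_Name)
                            .ToDictionary(g => g.Key, g => g.ToList());
        var group2Dates = new HashSet<DateTime>(Group_2.Select(p => p.birthday));

        foreach (var kv in byName)
        {
            // O(1) HashSet membership test replaces the In()/Contains() linear scan.
            var filtered1 = kv.Value.Where(p => group2Dates.Contains(p.birthday)).ToList();
            Console.WriteLine($"{kv.Key}: {filtered1.Count} matching record(s)");
        }
    }
}
```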
I am working on a small expense tracking program. The idea is to have a list that holds Expense objects that can be manipulated and used to perform calculations.
I was able to create the List without issue and populate it with several dummy expenses. My expenses are grouped by category (Expense.expenseType) to allow me to do calculations for analysis, so I am trying to make another List that will store category names and the relevant calculated values. This list of category names is meant to be free of duplicates, but so far I've been unsuccessful at populating it.
My approach has been to define a Category class that holds only a string for categoryName and a float for categoryTotal, the latter initialized to 0.00. I then have a for loop that copies the names into the List and a second for loop that removes indexes by name once they've been alphabetized. I've tried different variations of this, but I ultimately get either an index that is out of bounds or a reduced, but still duplicated, list of categoryName values.
Really hoping to get some advice so I could move forward with the code. I didn't add the actual code since I'm new to C#/VS and figure I may be approaching the problem all wrong.
Edit 1: Based on the feedback I got, the function I am using is below:
public void getCategories(List<Category> passedCategories)
{
    passedCategories = passedCategories.GroupBy(Category => Category.strName)
        .Select(gr => new Category
        {
            strName = gr.Key,
            fltTotal = gr.Sum(ex => ex.Value)
        });
}
This function is not working. I have a few points I wanted to clarify, and I am sure there are others I missed.
passedCategories is a List of Category objects with three members - strName, fltTotal and fltPercent. The latter two are currently set to zero when the whole list is populated via a temp Category. The strName is being copied from an Expense List with many more members. Since the category name repeats in the Expense List, I am trying to remove all duplicates so I have just the distinct categories. I took out var since I am passing the List in - should I not have done that? What am I missing?
Thanks again for the help,
Yusif Nurizade
What you need is something like the following. I say "something" because I can't see your code and have to imagine it. For instance, I don't know the name of the property that holds the expense amount; I've assumed it's called Value.
// This would be the list of expenses. You have to populate it with data.
var expenses = new List<Expense>();
// Using LINQ you can achieve that you want in a few lines.
// First you group by your data by their categories.
// Then you calculate the total expense for each category.
var statistics = expenses.GroupBy(expense => expense.Type)
                         .Select(gr => new Category
                         {
                             Name = gr.Key,
                             Total = gr.Sum(ex => ex.Value)
                         });
Background: In my program I have a list of nodes (a class I have defined). They each have a unique id number and a non-unique "region" number. I want to randomly select a node, record its id number, then remove all nodes of the same region from the list.
Problem: Someone pointed out to me that using a hashset instead of a list would be much faster, as a hashset's "order" is effectively random for my purposes and removing elements from it would be much faster. How would I do this (i.e. how do I access a random element in a hashset? I only know how to check to see if a hashset contains an element I already have)?
Also, I'm not quite sure how to remove all the nodes of a certain region. Do I have to override/define a comparison function to compare node regions? Again, I know how to remove a known element from a hashset, but here I don't know how to remove all nodes of a certain region.
I can post specifics about my code if that would help.
To be clear, the order of items in a HashSet isn't random, it's just not easily determinable. Meaning if you iterate a hash set multiple times the items will be in the same order each time, but you have no control over what order they're in.
That said, HashSet&lt;T&gt; implements IEnumerable&lt;T&gt;, so you could just pick a random number n and remove the nth item:
// assuming a Random object named rand is defined somewhere
// (do not declare it here, or repeated calls may reuse the same seed)
int n = rand.Next(hashSet.Count);
var item = hashSet.ElementAt(n);
hashSet.Remove(item);
Also, I'm not quite sure how to remove all the nodes of a certain region. Do I have to override/define a comparison function to compare node regions?
Not necessarily - you'll need to scan the hashSet to find matching items (easily done with Linq) and remove each one individually. Whether you do that by just comparing properties or defining an equality comparer is up to you.
foreach (var dupe in hashSet.Where(x => x.Region == item.Region).ToList())
hashSet.Remove(dupe);
Note the ToList which is necessary since you can't modify a collection while iterating over it, so the items to remove need to be stored in a different collection.
Note that you can't override Equals in the Node class for this purpose or you won't be able to put multiple nodes from one region in the hash set.
If you haven't noticed, both of these requirements defeat the purpose of using a HashSet - A HashSet is faster only when looking for a known item; iterating or looking for items based on properties is no faster than a regular collection. It would be like looking through the phone book to find all people whose phone number start with 5.
If you always want the items organized by region, then perhaps a Dictionary<int, List<Node>> is a better structure.
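For illustration, a minimal sketch of that Dictionary&lt;int, List&lt;Node&gt;&gt; approach. The Node fields follow the question; the region assignment is made deterministic here just to keep the sample small:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Node
{
    public int id;
    public int region;
}

class Program
{
    static void Main()
    {
        var rand = new Random();
        var nodes = Enumerable.Range(0, 10)
            .Select(n => new Node { id = n, region = n % 3 })
            .ToList();

        // Group nodes by region up front; removing a whole region is then
        // a single dictionary removal instead of a scan-and-remove loop.
        var byRegion = nodes.GroupBy(n => n.region)
                            .ToDictionary(g => g.Key, g => g.ToList());

        while (byRegion.Count > 0)
        {
            // Pick a random remaining region, then a random node within it.
            var regions = byRegion.Keys.ToList();
            int region = regions[rand.Next(regions.Count)];
            var group = byRegion[region];
            var picked = group[rand.Next(group.Count)];
            Console.WriteLine($"picked id {picked.id} from region {region}");

            byRegion.Remove(region); // drop the whole region at once
        }
    }
}
```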
There's another alternative approach that you could take that could end up being faster than removals from hash sets, and that's creating a structure that does your job for you in one go.
First up, to give me some sample data I'm running this code:
var rnd = new Random();
var nodes =
Enumerable
.Range(0, 10)
.Select(n => new Node() { id = n, region = rnd.Next(0, 3) })
.ToList();
That gives me ten nodes with ids 0 through 9, each assigned a random region of 0, 1, or 2.
Now I build up my structure like this:
var pickable =
nodes
.OrderBy(n => rnd.Next())
.ToLookup(n => n.region, n => n.id);
Which gives me a lookup of node ids keyed by region, in shuffled order.
Notice how the regions and individual ids are randomized in the lookup. Now it's possible to iterate over the lookup and take just the first element of each group to get both a random region and random node id without the need to remove any items from a hash set.
I wouldn't expect performance to be too much of an issue as I just tried this with 1,000,000 nodes with 1,000 regions and got a result back in just over 600ms.
On a HashSet you can use ElementAt:
notreallrandomObj nrrbase = HS.ElementAt(0);
int region = nrrbase.region;
List<notreallrandomObj> removeItems = new List<notreallrandomObj>();
foreach (notreallrandomObj nrr in HS.Where(x => x.region == region))
removeItems.Add(nrr);
foreach (notreallrandomObj nrr in removeItems)
HS.Remove(nrr);
You can't remove items from the set while iterating over it, which is why the code builds up a remove list first.
Yes, Remove is O(1) on a HashSet, but that does not mean it will be faster than a List. You don't even have a working solution yet and you're already optimizing - that is premature optimization.
With a List you can just use RemoveAll
ll.RemoveAll(x => x.region == region);
Say I have an IQueryable that will return a datatype with an ID property (column).
I want to further filter my query (I don't want to evaluate the query) as follows:
For each unique ID from the main query, I want to Take(n), where n is some arbitrary number.
That is, I want to only keep the first n rows for each unique ID.
I can get the distinct IDs...
var ids = query.Select(q => q.ID).Distinct();
and I can Take(n) with the rest of them, but I'm stumped on connecting the two:
query = query.<FOR EACH DISTINCT ID>.Take(n);
The accepted answer works, but is slow for a large table. I wrote this question as a follow-up.
You can do it like this:
query = query.GroupBy(q => q.ID).SelectMany(g => g.Take(n));
The GroupBy brings together the records with identical IDs, letting you process them as a group; SelectMany takes each group, limits the number of its members to n, and puts the results back into one flat list.
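As a self-contained illustration (with invented rows): in LINQ-to-Objects this behaves exactly as described, while translation to SQL depends on the query provider, which is likely why the follow-up question found it slow on a large table.

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        // Hypothetical rows standing in for the IQueryable's results.
        var rows = new[]
        {
            new { ID = 1, Val = "a" },
            new { ID = 1, Val = "b" },
            new { ID = 1, Val = "c" },
            new { ID = 2, Val = "d" },
            new { ID = 2, Val = "e" },
        };
        int n = 2;

        // Keep at most n rows per distinct ID.
        var limited = rows.GroupBy(r => r.ID)
                          .SelectMany(g => g.Take(n))
                          .ToList();
        Console.WriteLine(limited.Count); // 4 (two rows for ID 1, two for ID 2)
    }
}
```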
I have a table that represents a matrix:
CustType DiscountGroup1 DiscountGroup2 DiscountGroup3
Wholesale 32 10 15
Retail 10 15 0
All my stock items have a corresponding discount group code 1, 2 or 3.
At the time of invoicing I want to lookup the discount the customer type gets on the item(s) being invoiced.
The table needs to be able to grow to include new customer types and new discount groups so nothing can be hardcoded.
I figured I would pull the data into an array so I could select the column by index but I am getting stumped by my entities being too intelligent...
var disc = (from d in context.CustDiscountGroups
where d.CustType == Wholesale
select d).ToArray();
I can only access the columns by name ie: disc[0].DiscountGroup1
if I try disc[0,1] I get an error saying wrong number of indices inside.
What am I missing? I feel like it is something ridiculously fundamental. My only other thought was naming the columns as 1, 2, 3 etc and building a sql select string where I can use a variable to denote a column name.
The database is in design stages as well so the table(s) can be remade in any way needed, I'm struggling to get my head wrapped round which way to approach the problem.
Your entity CustDiscountGroups has the properties CustType, DiscountGroup1, DiscountGroup2 and DiscountGroup3, and your query returns an array of CustDiscountGroups, so you can't access it like [0,1] - there is no 2D array.
If you need to access the first item, you can get it as disc[0] and then get any of its discount group values by property name, like:
disc[0].CustType, disc[0].DiscountGroup1, disc[0].DiscountGroup2, disc[0].DiscountGroup3
If you want an array of arrays, get the values using reflection, as below. The first version uses GetFields (note that GetFields needs BindingFlags.Instance as well as BindingFlags.Public, or it returns nothing); if your entity exposes properties rather than fields, use the GetProperties version instead. Also, a query provider can't translate reflection calls to SQL, so you may need to switch to LINQ-to-Objects first (e.g. with AsEnumerable()).
var disc = context.CustDiscountGroups.Where(c => c.CustType == Wholesale)
    .Select(v => typeof(CustDiscountGroups)
        .GetFields(System.Reflection.BindingFlags.Public | System.Reflection.BindingFlags.Instance)
        .Select(f => f.GetValue(v)).ToArray())
    .ToArray();
var disc = context.CustDiscountGroups.Where(c => c.CustType == Wholesale)
    .Select(v => typeof(CustDiscountGroups)
        .GetProperties()
        .Select(p => p.GetValue(v, null)).ToArray())
    .ToArray();
Now you can access values like disc[0][1].
Please note: I haven't compiled and tested the above code; take the idea and adapt it as needed.
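For reference, here is a compilable sketch of the GetProperties variant run against in-memory rows. The sample data mirrors the question's table; note that reflection does not formally guarantee property order, though GetProperties typically returns declaration order:

```csharp
using System;
using System.Linq;

// Stand-in for the generated entity from the question.
class CustDiscountGroups
{
    public string CustType { get; set; }
    public int DiscountGroup1 { get; set; }
    public int DiscountGroup2 { get; set; }
    public int DiscountGroup3 { get; set; }
}

class Program
{
    static void Main()
    {
        var rows = new[]
        {
            new CustDiscountGroups { CustType = "Wholesale", DiscountGroup1 = 32, DiscountGroup2 = 10, DiscountGroup3 = 15 },
            new CustDiscountGroups { CustType = "Retail",    DiscountGroup1 = 10, DiscountGroup2 = 15, DiscountGroup3 = 0 },
        };

        // Project each row's property values into an object[] so that
        // cells can be addressed by index: disc[row][col].
        var disc = rows
            .Select(v => typeof(CustDiscountGroups)
                .GetProperties()
                .Select(p => p.GetValue(v, null))
                .ToArray())
            .ToArray();

        // disc[0][1] is DiscountGroup1 for Wholesale, assuming
        // declaration order (not guaranteed by the reflection API).
        Console.WriteLine(disc[0].Length);
    }
}
```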