I wanted to generate a unique identifier for the results of a LINQ query I did on some data.
Initially I thought of using a Guid for that, but stumbling upon this problem I had to improvise.
However, I'd like to see if anyone has a solution using Guid, so here we go.
Imagine we have:
class Query
{
    public class Entry
    {
        public string Id { get; set; }
        public int Value { get; set; }
    }

    public static IEnumerable<Entry> GetEntries(IEnumerable<int> list)
    {
        var result =
            from i in list
            select new Entry
            {
                Id = System.Guid.NewGuid().ToString("N"),
                Value = i
            };
        return result;
    }
}
Now we want Id to be unique for each entry, but we need this value to stay the same on each traversal of the IEnumerable we get from GetEntries. This means that we want calling the following code:
List<int> list = new List<int> { 1, 2, 3, 4, 5 };
IEnumerable<Query.Entry> entries = Query.GetEntries(list);
Console.WriteLine("first pass");
foreach (var e in entries) { Console.WriteLine("{0} {1}", e.Value, e.Id); }
Console.WriteLine("second pass");
foreach (var e in entries) { Console.WriteLine("{0} {1}", e.Value, e.Id); }
to give us something like:
first pass
1 47f4a21a037c4ac98a336903ca9df15b
2 f339409bde22487e921e9063e016b717
3 8f41e0da06d84a58a61226a05e12e519
4 013cddf287da46cc919bab224eae9ee0
5 6df157da4e404b3a8309a55de8a95740
second pass
1 47f4a21a037c4ac98a336903ca9df15b
2 f339409bde22487e921e9063e016b717
3 8f41e0da06d84a58a61226a05e12e519
4 013cddf287da46cc919bab224eae9ee0
5 6df157da4e404b3a8309a55de8a95740
However we get:
first pass
1 47f4a21a037c4ac98a336903ca9df15b
2 f339409bde22487e921e9063e016b717
3 8f41e0da06d84a58a61226a05e12e519
4 013cddf287da46cc919bab224eae9ee0
5 6df157da4e404b3a8309a55de8a95740
second pass
1 a9433568e75f4f209c688962ee4da577
2 2d643f4b58b946ba9d02b7ba81064274
3 2ffbcca569fb450b9a8a38872a9fce5f
4 04000e5dfad340c1887ede0119faa16b
5 73a11e06e087408fbe1909f509f08d03
Now, taking a second look at my code above, I realized where my error was:
The assignment of Id to Guid.NewGuid().ToString("N") gets called every time we traverse the collection, and thus is different every time.
So what should I do then?
Is there a way I can ensure that I end up with only one copy of the collection every time?
Is there a way to be sure that I won't be getting new instances of the query's results?
Thank you for your time in advance :)
This is inherent to all LINQ queries. Being repeatable is coincidental, not guaranteed.
You can solve it with a .ToList(), like:
IEnumerable<Query.Entry> entries = Query.GetEntries(list).ToList();
Or better, move the .ToList() inside GetEntries()
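For example, a minimal sketch of GetEntries with the materialization moved inside (same Entry type as in the question):
public static IEnumerable<Entry> GetEntries(IEnumerable<int> list)
{
    var result =
        from i in list
        select new Entry
        {
            Id = System.Guid.NewGuid().ToString("N"),
            Value = i
        };
    // Materialize once: every later traversal of the returned collection
    // sees the same Entry instances, and therefore the same Ids.
    return result.ToList();
}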
Perhaps you need to produce the list of entries once, and return the same list each time from GetEntries.
Edit:
Ah no, you get a different list each time! Well, then it depends on what you want. If you want the same Id for each specific Value, possibly across different lists, you need to cache the Ids: keep a Dictionary<int, Guid> where you store the already allocated GUIDs. If you want your GUIDs to be unique per source list, you would perhaps need to cache the input and the returned IEnumerables, and always check whether an input list has already been handled or not.
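A rough sketch of that first idea, caching the allocated GUIDs per value (the idCache field is purely illustrative; a string cache is used here to match the Entry.Id type rather than the Dictionary<int, Guid> named above):
// Hypothetical cache: one GUID string per distinct value, reused across calls.
private static readonly Dictionary<int, string> idCache = new Dictionary<int, string>();

public static IEnumerable<Entry> GetEntries(IEnumerable<int> list)
{
    var result = new List<Entry>();
    foreach (int i in list)
    {
        string id;
        if (!idCache.TryGetValue(i, out id))
        {
            // First time we see this value: allocate a GUID and remember it.
            id = Guid.NewGuid().ToString("N");
            idCache[i] = id;
        }
        result.Add(new Entry { Id = id, Value = i });
    }
    return result;
}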
Edit:
If you don't want to share the same GUIDs across different runs of GetEntries, you should just "materialize" the query (for example, replacing return result; with return result.ToList();), as was suggested in the comments to your question.
Otherwise the query will run each time you traverse your list. This is what is called lazy evaluation. Lazy evaluation is usually not a problem, but in your case it leads to recalculating the GUIDs on every run of the query (i.e., on every loop over the result sequence).
Any reason you have to use LINQ? The following seems to work for me:
public static IEnumerable<Entry> GetEntries(IEnumerable<int> list)
{
    List<Entry> results = new List<Entry>();
    foreach (int i in list)
    {
        results.Add(new Entry() { Id = Guid.NewGuid().ToString("N"), Value = i });
    }
    return results;
}
That's because of the way LINQ works. When you return just the LINQ query, it is executed every time you enumerate over it. Therefore, Guid.NewGuid will be executed for each list item as many times as you enumerate over the query.
Try adding an item to the list after you have iterated over the query once, and you will see that when iterating a second time the newly added item is also in the result set. That's because the LINQ query holds a reference to your list, not an independent copy.
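A small sketch of that experiment, reusing the Query class from the question:
List<int> list = new List<int> { 1, 2, 3 };
IEnumerable<Query.Entry> entries = Query.GetEntries(list);

// First pass: three entries.
foreach (var e in entries) { Console.WriteLine("{0} {1}", e.Value, e.Id); }

// Modify the source list after the query has been created.
list.Add(4);

// Second pass: four entries, and every Id has changed,
// because the query re-runs against the live list.
foreach (var e in entries) { Console.WriteLine("{0} {1}", e.Value, e.Id); }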
To always get the same result, return an array or list instead of the LINQ query; change the return line of the GetEntries method to something like this:
return result.ToArray();
This forces immediate execution, which also happens only once.
Best Regards,
Oliver Hanappi
You might think about not using Guid, at least not via Guid.NewGuid().
Using GetHashCode() gives you per-item values that don't change when you traverse the list multiple times.
The problem is that your list is an IEnumerable<int>, so the hash code of each item coincides with its value.
You should re-evaluate your approach and use a different strategy. One thing that comes to mind is to use a pseudo-random number generator seeded with the hash code of the collection. It will always return the same sequence of numbers as long as it's initialized with the same seed. But, again, forget Guid.
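A sketch of that idea follows; the seed here is derived from the list contents purely for illustration, any deterministic seed would do:
public static IEnumerable<Entry> GetEntries(IEnumerable<int> list)
{
    // Deterministic seed: the same input values produce the same id sequence.
    int seed = list.Aggregate(17, (acc, i) => unchecked(acc * 31 + i));
    var rng = new Random(seed);

    return list.Select(i => new Entry
    {
        Id = rng.Next().ToString("x8"),
        Value = i
    }).ToList();   // materialize so the ids are generated exactly once
}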
One suggestion (I don't know whether that's your case or not, though):
If you want to save the entries in a database, try assigning your entry's primary key a Guid at the database level. This way, each entry will have a unique and persisted Guid as its primary key. Check out this link for more info.
Related
In the foreach loop, I want to add the Products to a List, but I want this List to not contain duplicate Products. Currently I have two ideas:
1/ In the loop, before adding the Product to the List, check whether the Product already exists in the List; if it doesn't, add it.
foreach (var product in products)
{
// code logic
if(!listProduct.Any(x => x.Id == product.Id))
{
listProduct.Add(product);
}
}
2/ In the loop, add all the Products to the List even if there are duplicates. Then, outside of the loop, use Distinct to remove duplicate records.
foreach (var product in products)
{
// code logic
listProduct.Add(product);
}
listProduct = listProduct.Distinct().ToList();
I wonder which of these two ways is the more efficient one. Or is there any other idea for adding records to the List while avoiding duplication?
I'd go for a third approach: the HashSet. It has a constructor overload that accepts an IEnumerable. This constructor removes duplicates:
If the input collection contains duplicates, the set will contain one
of each unique element. No exception will be thrown.
Source: HashSet<T> Constructor
usage:
List<Product> myProducts = ...;
var setOfProducts = new HashSet<Product>(myProducts);
After removing duplicates there is no proper meaning of setOfProducts[4].
Therefore a HashSet<T> is not an IList<Product> but an ICollection<Product>: you can Count / Add / Remove, etc., everything you can do with a List. The only thing you can't do is fetch by index.
First take the elements that are not already in the collection:
var newProducts = products.Where(x => !listProduct.Any(y => x.Id == y.Id));
And then just add them using AddRange:
listProduct.AddRange(newProducts);
Or you can use a foreach loop too:
foreach (var product in newProducts)
{
listProduct.Add(product);
}
One more easy solution, with no need to use Distinct:
var newProductList = products.Union(listProduct).ToList();
But Union does not have good performance.
From what you have included, you are storing everything in memory. If this is the case, or you are persisting only after you have everything ready, you can consider using BinarySearch:
https://msdn.microsoft.com/en-us/library/w4e7fxsh(v=vs.110).aspx, and you also get an ordered list at the end. If ordering is not important, you can use a HashSet, which is very fast and meant specifically for this purpose.
Check also: https://www.dotnetperls.com/hashset
This should be pretty fast and take care of any ordering:
// build a HashSet of your primary keys type (I'm assuming integers here) containing all your list elements' keys
var hashSet = new HashSet<int>(listProduct.Select(p => p.Id));
// add all items from the products list whose Id can be added to the hashSet (so it's not a duplicate)
listProduct.AddRange(products.Where(p => hashSet.Add(p.Id)));
What you might want to consider doing instead, though, is implementing IEquatable<Product> and overriding GetHashCode() on your Product type which would make the above code a little easier and put the equality checks where they should be (inside the respective type):
var hashSet = new HashSet<Product>(listProduct);
listProduct.AddRange(products.Where(hashSet.Add));
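For instance, a sketch of what that might look like on Product, assuming Id is the natural key (adjust to your actual model):
public class Product : IEquatable<Product>
{
    public int Id { get; set; }
    public string Name { get; set; }

    public bool Equals(Product other)
    {
        // Two products are considered equal when their Ids match.
        return other != null && Id == other.Id;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Product);
    }

    public override int GetHashCode()
    {
        return Id.GetHashCode();
    }
}
With that in place, HashSet<Product> and Distinct() both treat two products with the same Id as duplicates.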
I have about 100 items (allRights) in the database and about 10 ids to be searched (inputRightsIds). Which one is better: first to get all the rights and then search through them in memory (Variant 1), or to make 10 separate checking requests to the database (Variant 2)?
Here is some example code:
DbContext db = new DbContext();
int[] inputRightsIds = new int[10]{...};
Variant 1
var allRights = db.Rights.ToList();
foreach (var right in allRights)
{
    for (int i = 0; i < inputRightsIds.Length; i++)
    {
        if (inputRightsIds[i] == right.Id)
        {
            // Do something
        }
    }
}
Variant 2
for (int i = 0; i < inputRightsIds.Length; i++)
{
    if (db.Rights.Any(r => r.Id == inputRightsIds[i]))
    {
        // Do something
    }
}
Thanks in advance!
As others have already stated, you should do the following.
var matchingIds = from r in db.Rights
where inputRightIds.Contains(r.Id)
select r.Id;
foreach(var id in matchingIds)
{
// Do something
}
But this is different from both of your approaches. In your first approach you are making one SQL call to the DB that returns more results than you are interested in. The second makes multiple SQL calls, each returning part of the information you want. The query above will make one SQL call to the DB and return only the data you are interested in. This is the best approach, as it avoids the two bottlenecks of making multiple calls to the DB and having too much data returned.
You can use the following:
db.Rights.Where(right => inputRightsIds.Contains(right.Id));
They should be very similar speeds since both must enumerate the arrays the same number of times. There might be subtle differences in speed between the two depending on the input data but in general I would go with Variant 2. I think you should almost always prefer LINQ over manual enumeration when possible. Also consider using the following LINQ statement to simplify the whole search to a single line.
var matches = db.Rights.Where(r=> inputRightIds.Contains(r.Id));
...//Do stuff with matches
Don't forget to pull all your items into memory if you want to process the list further:
var itemsFromDatabase = db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList();
Or you could even enumerate through the collection and do some work on each item:
db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList().ForEach(item => {
    // your code here
});
So I have a couple of different lists that I'm trying to process and merge into 1 list.
Below is a snippet of code that I want to see if there is a better way of writing.
The reason why I'm asking is that some of these lists are rather large. I want to see if there is a more efficient way of doing this.
As you can see, I'm looping through a list, and the first thing I'm doing is checking whether the CompanyId exists in the list. If it does, then I find the item in the list that I'm going to process.
pList is my processing list. I'm adding the values from my different lists into this list.
I'm wondering if there is a "better way" of accomplishing the Exists and Find.
bool tstFind = false;
foreach (parseAC item in pACList)
{
    tstFind = pList.Exists(x => (x.CompanyId == item.key.ToString()));
    if (tstFind == true)
    {
        pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
        // Processing done here. pItem gets updated here
        ...
    }
}
Just as a side note, I'm going to be researching a way to use joins to see if that is faster. But I haven't gotten there yet. The above code is my first cut at solving this issue and it appears to work. However, since I have the time I want to see if there is a better way still.
Any input is greatly appreciated.
Time Findings:
My current Find and Exists code takes about 84 minutes to loop through the 5.5M items in the pACList.
Using pList.FirstOrDefault(x => x.CompanyId == item.key.ToString()) takes 54 minutes to loop through the 5.5M items in the pACList.
You can retrieve the item with FirstOrDefault instead of searching for it twice (the first time to determine whether the item exists, and the second time to get the existing item):
var tstFind = pList.FirstOrDefault(x => x.CompanyId == item.key.ToString());
if (tstFind != null)
{
//Processing done here. pItem gets updated here
}
Yes, use a hashtable so that your algorithm is O(n) instead of O(n*m) which it is right now.
var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    if (pListByCompanyId.ContainsKey(item.key.ToString()))
    {
        pItem = pListByCompanyId[item.key.ToString()];
        // Processing done here. pItem gets updated here
        ...
    }
}
You can iterate through the filtered list using LINQ:
foreach (parseAC item in pACList.Where(i=>pList.Any(x => (x.CompanyId == i.key.ToString()))))
{
pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
//Processing done here. pItem gets updated here
...
}
Using lists for this type of operation is O(M*N) (M is the count of pACList, N is the count of pList). Additionally, you are searching pList twice. To avoid that, use pList.FirstOrDefault as recommended by @lazyberezovsky.
However, if possible I would avoid using lists. A Dictionary indexed by the key you're searching on would greatly improve the lookup time.
Doing a linear search on the list for each item in another list is not efficient for large data sets. What is preferable is to put the keys into a Table or Dictionary that can be searched much more efficiently, allowing you to join the two tables. You don't even need to code this yourself; what you want is a Join operation. You want to get all of the pairs of items from the two sequences that map to the same key.
Either pull out the implementation of the method below, or change Foo and Bar to the appropriate types and use it as a method.
public static IEnumerable<Tuple<Bar, Foo>> Merge(IEnumerable<Bar> pACList,
                                                 IEnumerable<Foo> pList)
{
    return pACList.Join(pList,
        item => item.Key.ToString(),
        item => item.CompanyID.ToString(),
        (a, b) => Tuple.Create(a, b));
}
You can use the results of this call to merge the two items together, as they will have the same key.
Internally the method will create a lookup table that allows for efficient searching before actually doing the searching.
Convert pList to a HashSet, then query pHashSet.Contains(). Complexity: O(N) + O(n).
Or sort pList on CompanyId and do Array.BinarySearch(): O(N log N) + O(n log N).
If the maximum company id is not prohibitively large, simply create an array where the item with company id i sits at the i-th position. Nothing can be faster.
(Here N is the size of pList and n is the size of pACList.)
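For example, the first option might look roughly like this (names follow the question's code; note it only answers the existence question, so if you also need the matching pList element, the Dictionary approach above is the better fit):
// One pass over pList to collect the known company ids.
var companyIds = new HashSet<string>(pList.Select(x => x.CompanyId));

foreach (parseAC item in pACList)
{
    if (companyIds.Contains(item.key.ToString()))
    {
        // Processing done here
    }
}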
Given the following LINQ Statement(s), which will be more efficient?
ONE:
public List<Log> GetLatestLogEntries()
{
var logEntries = from entry in db.Logs
select entry;
return logEntries.ToList().Take(10);
}
TWO:
public List<Log> GetLatestLogEntries()
{
var logEntries = from entry in db.Logs
select entry;
return logEntries.Take(10).ToList();
}
I am aware that .ToList() executes the query immediately.
The first version wouldn't even compile - because the return value of Take is an IEnumerable<T>, not a List<T>. So you'd need it to be:
public List<Log> GetLatestLogEntries()
{
var logEntries = from entry in db.Logs
select entry;
return logEntries.ToList().Take(10).ToList();
}
That would fetch all the data from the database and convert it to a list, then take the first 10 entries, then convert it to a list again.
Getting the Take(10) to occur in the database (i.e. the second form) certainly looks a heck of a lot cheaper to me...
Note that there's no Queryable.ToList() method - you'll end up calling Enumerable.ToList() which will fetch all the entries. In other words, the call to ToList doesn't participate in SQL translation, whereas Take does.
Also note that using a query expression here doesn't make much sense either. I'd write it as:
public List<Log> GetLatestLogEntries()
{
return db.Logs.Take(10).ToList();
}
Mind you, you may want an OrderBy call - otherwise it'll just take the first 10 entries it finds, which may not be the latest ones...
Your first option won't compile, because .Take(10) returns an IEnumerable<Log>. Your return type is List<Log>, so you would have to do return logEntries.ToList().Take(10).ToList(), which is even less efficient.
By doing .ToList().Take(10), you are forcing the .Take(10) to be LINQ to objects, while the other way the filter could be passed on to the database or other underlying data source. In other words, if you first do .ToList(), ALL the objects have to be transferred from the database and allocated in memory. THEN you filter to the first 10. If you're talking about millions of database rows (and objects) you can imagine how this is VERY inefficient and not scalable.
The second one will also run immediately because you have .ToList(), so no difference there.
The second version will be more efficient (in both time and memory usage). For example, imagine that you have a sequence containing 1,000,000 items:
The first version iterates through all 1,000,000 items, adding them to a list as it goes. Then, finally, it will take the first 10 items from that large list.
The second version only needs to iterate the first 10 items, adding them to a list as it goes. (The remaining 999,990 items don't even need to be considered.)
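You can see this with an in-memory sequence that counts how many items get pulled (an illustrative sketch only, not the database case):
int pulled = 0;
IEnumerable<int> source = Enumerable.Range(1, 1000000)
    .Select(i => { pulled++; return i; });

var fast = source.Take(10).ToList();          // pulled is now 10
pulled = 0;
var slow = source.ToList().Take(10).ToList(); // pulled is now 1000000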
How about this?
I have 5000 records in "items"
version 1:
IQueryable<T> items = Items; // my items
items = ApplyFilteringCriteria(items, filter); // my filter BL
items = ApplySortingCriteria(items, sortBy, sortDir); // my sorting BL
items = items.Skip(0);
items = items.Take(25);
return items.ToList();
this took: 20 sec on the server
version 2:
IQueryable<T> items = Items; // my items
items = ApplyFilteringCriteria(items, filter); // my filter BL
items = ApplySortingCriteria(items, sortBy, sortDir); // my sorting BL
List<T> x = items.ToList();
var page = x.Skip(0).ToList();
page = page.Take(25).ToList();
return x;
this took: 1 sec on the server
What do you think now? Any idea why?
The second option.
The first will evaluate the entire enumerable, slurping it into a List(); then you set up the iterator that will iterate through the first ten objects and then exit.
The second sets up the Take() iterator first, so whatever happens after that, only 10 objects will be evaluated and sent to the "downstream" processing (in this case the ToList(), which will take those ten elements and return them as the concrete List).
I have a class that has multiple List<> objects contained within it. It's basically a table, stored with each column as a List<>. The columns do not all contain the same type. Each list is also the same length (has the same number of elements).
For example:
I have 3 List<> objects: one is a List<DateTime>, two is a List<double>, and three is a List<string>.
List<DateTime> one = new List<DateTime> { new DateTime(2010, 4, 12), new DateTime(2006, 4, 9), new DateTime(2008, 4, 13) };
List<double> two = new List<double> { 24.5, 56.2, 47.4 };
List<string> three = new List<string> { "B", "K", "Z" };
I want to be able to sort list one from oldest to newest:
one = {4/9/2006, 4/13/2008, 4/12/2010};
So to do this I moved element 0 to the end.
I then want to sort lists two and three the same way, moving the first element to the last position.
So when I sort the one list, I want the data at the corresponding index in the other lists to also change in accordance with how the one list is sorted.
I'm guessing I have to implement IComparer somehow, but I feel like there's a shortcut I haven't realized.
I've handled this design in the past by keeping or creating a separate index list. You first sort the index list, and then use it to sort (or just access) the other lists. You can do this by creating a custom IComparer for the index list. What you do inside that IComparer is to compare based on indexes into the key list. In other words, you are sorting the index list indirectly. Something like:
// This is the compare function for the separate *index* list.
int Compare(object x, object y)
{
    return KeyList[(int)x].CompareTo(KeyList[(int)y]);
}
So you are sorting the index list based on the values in the key list. Then you can use that sorted index list to re-order the other lists. If this is unclear, I'll try to add a more complete example when I'm in a position to post one.
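For instance, a fuller sketch of that idea against the question's three lists, using a comparison delegate instead of a full IComparer to keep it short:
// Build an index list 0..n-1 and sort it by the values in the key list (one).
List<int> indexes = Enumerable.Range(0, one.Count).ToList();
indexes.Sort((x, y) => one[x].CompareTo(one[y]));

// Use the sorted indexes to re-order every list the same way.
List<DateTime> sortedOne = indexes.Select(i => one[i]).ToList();
List<double> sortedTwo = indexes.Select(i => two[i]).ToList();
List<string> sortedThree = indexes.Select(i => three[i]).ToList();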
Here's a way to do it using LINQ and projections. The first query generates an array with the original indexes reordered by the datetime values; in your example, the newOrdering array would have members:
{ 4/9/2006, 1 }, { 4/13/2008, 2 }, { 4/12/2010, 0 }
The second set of statements generate new lists by picking items using the reordered indexes (in other words, items 1, 2, and 0, in that order).
var newOrdering = one
.Select((dateTime, index) => new { dateTime, index })
.OrderBy(item => item.dateTime)
.ToArray();
// now, order each list
one = newOrdering.Select(item => one[item.index]).ToList();
two = newOrdering.Select(item => two[item.index]).ToList();
three = newOrdering.Select(item => three[item.index]).ToList();
I am sorry to say, but this feels like a bad design. Especially because List<T> does not guarantee element order before you have called one of the sorting operations (so you have a problem when inserting):
From MSDN:
The List<T> is not guaranteed to be sorted. You must sort the List<T> before performing operations (such as BinarySearch) that require the List<T> to be sorted.
In many cases you won't run into trouble based on this, but you might, and if you do, it could be a very hard bug to track down. For example, I think the current framework implementation of List<T> maintains insert order until sort is called, but it could change in the future.
I would seriously consider refactoring to use another data structure. If you still want to implement sorting based on this data structure, I would create a temporary object (maybe using an anonymous type), sort this, and re-create the lists (see this excellent answer for an explanation of how).
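As a sketch of that last suggestion, applied to the question's lists:
// Project the parallel lists into one sequence of anonymous objects,
// sort that, then rebuild the individual lists.
var combined = one
    .Select((date, i) => new { Date = date, Value = two[i], Label = three[i] })
    .OrderBy(x => x.Date)
    .ToList();

one = combined.Select(x => x.Date).ToList();
two = combined.Select(x => x.Value).ToList();
three = combined.Select(x => x.Label).ToList();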
First you should create a Data object to hold everything.
private class Data
{
    public DateTime DateTime { get; set; }
    public int Int32 { get; set; }
    public string String { get; set; }
}
Then you can sort like this.
var l = new List<Data>();
l.Sort(
    (a, b) =>
    {
        var r = a.DateTime.CompareTo(b.DateTime);
        if (r == 0)
        {
            r = a.Int32.CompareTo(b.Int32);
            if (r == 0)
            {
                r = a.String.CompareTo(b.String);
            }
        }
        return r;
    }
);
I wrote a sort algorithm that does this for Nito.LINQ (not yet released). It uses a simple-minded QuickSort to sort the lists, and keeps any number of related lists in sync. Source code starts here, in the IList<T>.Sort extension method.
Alternatively, if copying the data isn't a huge concern, you could project it into a LINQ query using the Zip operator (requires .NET 4.0 or Rx), order it, and then pull each result out:
List<DateTime> one = ...;
List<double> two = ...;
List<string> three = ...;
var combined = one.Zip(two, (first, second) => new { first, second })
.Zip(three, (pair, third) => new { pair.first, pair.second, third });
var ordered = combined.OrderBy(x => x.first);
var orderedOne = ordered.Select(x => x.first);
var orderedTwo = ordered.Select(x => x.second);
var orderedThree = ordered.Select(x => x.third);
Naturally, the best solution is to not separate related data in the first place.
Using generic arrays, this can get a bit cumbersome.
One alternative is using the Array.Sort() method that takes an array of keys and an array of values to sort. It first sorts the key array into ascending order and makes sure the array of values is reorganized to match this sort order.
If you're willing to incur the cost of converting your List<T>s to arrays (and then back), you could take advantage of this method.
Alternatively, you could use LINQ to combine the values from multiple arrays into a single anonymous type using Zip(), sort the list of anonymous types using the key field, and then split that apart into separate arrays.
If you want to do this in-place, you would have to write a custom comparer and create a separate index array to maintain the new ordering of items.
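A sketch of the keys-and-values overload with the question's lists (note each Array.Sort call keeps only one values array in sync with the keys, so a fresh copy of the key array is used per values array; with duplicate dates the two calls could break ties differently):
DateTime[] keys1 = one.ToArray();
DateTime[] keys2 = one.ToArray();   // separate copy, since Sort reorders the keys too
double[] values2 = two.ToArray();
string[] values3 = three.ToArray();

Array.Sort(keys1, values2);   // sorts keys1 ascending, reordering values2 to match
Array.Sort(keys2, values3);   // same ordering applied to values3

one = keys1.ToList();
two = values2.ToList();
three = values3.ToList();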
I hope this helps:
one.Sort(delegate(DateTime d1, DateTime d2)
{
    return d1.CompareTo(d2);
});