I was sitting this cloudy Saturday morning thinking to myself:
IEnumerable<SomeType>
    someThings = ...,
    conjunctedThings = someThings.Where(thing => thing.Big && thing.Tall),
    multiWhereThings = someThings
        .Where(thing => thing.Big)
        .Where(thing => thing.Tall);
Intuitively, I'd say that conjunctedThings will be computed no slower than multiWhereThings, but is there really a difference in the general case?
I can imagine that, depending on the share of big things and tall things, the computations might take different amounts of time, but I'd like to disregard that aspect.
Are there any other properties I need to take into consideration? E.g. the type of the enumerable or anything else?
In general, the multi-Where version will be slower. It needs to process more items and call more lambdas.
If someThings contains n items, m of which are Big, then the lambda for conjunctedThings is called n times, while the lambdas for multiWhereThings are called n + m times. Of course, this holds if the consumer of the two sequences intends to iterate all of their contents. Since the Where method performs a yield return internally, the number of iterations might be less, depending on how the collections are consumed. In other words, the numbers above are a worst-case estimate.
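As a rough way to check this yourself, you can count the predicate calls with side effects (a sketch of mine, not from the original post; it assumes both queries are fully enumerated and SomeType has Big and Tall flags):
int conjunctedCalls = 0, multiWhereCalls = 0;
var conjunctedThings = someThings
    .Where(thing => { conjunctedCalls++; return thing.Big && thing.Tall; });
var multiWhereThings = someThings
    .Where(thing => { multiWhereCalls++; return thing.Big; })
    .Where(thing => { multiWhereCalls++; return thing.Tall; });
// Force full enumeration of both queries.
conjunctedThings.Count();
multiWhereThings.Count();
// For n items of which m are Big: conjunctedCalls == n, multiWhereCalls == n + m.
Console.WriteLine("{0} vs {1}", conjunctedCalls, multiWhereCalls);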
Related
Let's say, I have 3 lists (no more than 10 in my case).
List 1 has m elements
List 2 has n elements
List 3 has p elements
It's possible to have duplicates. I need to find the first 10 distinct elements that match a request (I know how to do that; it's not the question).
Is it faster to concatenate the 3 lists and then filter?
Or is it faster to filter the 3 lists first (3 × 10 elements), concatenate the results, and then filter again to get the final 10 elements I want?
I would go for the second option, but I am not 100% sure because I don't know the cost of a concatenation versus the cost of filtering.
Thanks for any input.
Edit:
I can have up to 10 lists of 100-1000 elements each, so between 1,000 and 10,000 elements in the merged list.
In my case, this request can be made 3 to 5 times per second per user (but only once in a while). The lists contain contacts, and sometimes the user searches for a contact. I have an AJAX request that sends each character typed and refreshes a table.
Edited answer: I was previously having a thinko, because for some reason I was thinking of "concatenate" as actually creating a full new list. (Actually, I know part of the reason: the cost of concatenating strings came to mind, though why that was the case I don't know.)
Of course, concatenating in Linq does no such thing, so the choice is between:
list1.Concat(list2).Concat(list3) // ...and so on
     .Where(yourFilter)
     .Distinct()
     .Take(10)
And:
list1.Where(yourFilter)
     .Concat(list2.Where(yourFilter))
     .Concat(list3.Where(yourFilter))
     .Distinct()
     .Take(10)
And the difference between them is quite interesting.
From just looking at the code here, we wouldn't expect much difference. We'd expect the latter to have a slight disadvantage in that it involves more calls, but the former to have the disadvantage of more interface steps being involved, since the Where implementation is more complicated than the Concat implementation, so the two roughly balance out. The latter comes out slightly faster, though how much depends on whether the second and/or third Where are ever used at all (they might not be if the Take is satisfied before hitting them).
With lists as the sources though, the latter comes out quite a bit faster, because Where is optimised for the case where the source is a List<T>, and only the latter benefits from that optimisation behind the scenes.
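If you want to measure it in your own scenario, a rough Stopwatch sketch (my illustration; list1–list3 and yourFilter stand in for your real lists and predicate) could look like this:
var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
    list1.Concat(list2).Concat(list3)
         .Where(yourFilter)
         .Distinct()
         .Take(10)
         .ToList(); // force execution
}
sw.Stop();
Console.WriteLine("Concat then Where: {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
for (int i = 0; i < 100000; i++)
{
    list1.Where(yourFilter)
         .Concat(list2.Where(yourFilter))
         .Concat(list3.Where(yourFilter))
         .Distinct()
         .Take(10)
         .ToList();
}
sw.Stop();
Console.WriteLine("Where then Concat: {0} ms", sw.ElapsedMilliseconds);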
Because I do not have 50 reputation yet, I can't comment. Sorry, dudes.
But, on to the question.
In your first case, you will allocate a list as big as your 3 lists combined.
If you have memory constraints, this might be a bad idea.
So you concat the 3 lists, then filter through this big list: 2 operations.
In the second case, you just have to detect the distinct elements in your 3 lists; access doesn't cost that much.
I mean, what is the difference between searching in 3 lists or in 1 list?
My question is based on this question; I had posted an answer on that question here.
This is the code:
var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Minimum = lines.First(); // default: start with the first line
var Maximum = "";
foreach (string line in lines)
{
    if (Maximum.Length < line.Length)
    {
        Maximum = line;
    }
    if (Minimum.Length > line.Length)
    {
        Minimum = line;
    }
}
and an alternative to this code using LINQ (my approach):
var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Maximum = lines.OrderByDescending(a => a.Length).First();
var Minimum = lines.OrderBy(a => a.Length).First();
LINQ is easy to read and implement.
I want to know which one is better for performance.
And how does LINQ work internally for OrderByDescending and OrderBy when ordering by length?
You can read the source code for OrderBy.
Stop micro-optimizing or prematurely optimizing your code. Try to write code that performs correctly; if you face a performance problem later, profile your application and see where the problem is. If you have a piece of code which has a performance problem due to finding the shortest and longest string, then start to optimize that part.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
File.ReadLines returns an IEnumerable<string>, which means that if you foreach over it, it will return the data to you one line at a time. I think the best performance improvement you can make here is to improve the reading of the file from disk. If it is small enough to load the whole file into memory, use File.ReadAllLines; if it is not, try reading the file in big chunks that fit in memory. Reading a file line by line causes performance degradation due to the I/O operations against the disk. So the problem here is not how LINQ or the loop performs; the problem is the number of disk reads.
With the second method, you are not only sorting the lines twice... you are reading the file twice. This is because File.ReadLines returns an IEnumerable<string>. This clearly shows why you shouldn't ever enumerate an IEnumerable<> twice unless you know how it was built. If you really want to do it, add a .ToList() or a .ToArray() that will materialize the IEnumerable<> to a collection.
And while the first method has a memory footprint of a single line of text (because it reads the file one line at a time), the second method will load the whole file in memory to sort it, so it has a much bigger memory footprint; if the file is some hundred MB, the difference is big. (Note that technically you could have a file with a single line of text 1 GB long, so this rule isn't absolute... it holds for reasonable files whose lines are up to some hundred characters long :-) )
Now... someone will tell you that premature optimization is evil, but I'll tell you that ignorance is twice as evil.
If you know the difference between the two blocks of code, then you can make an informed choice between the two... Otherwise you are simply randomly throwing rocks until it seems to work. "Seems to work" is the key phrase here.
In my opinion, you need to understand a few points before deciding what the best way is.
First, let's say we want to solve the problem with LINQ. To write the most optimized code, you must understand deferred execution. Most LINQ methods, such as Select, Where, OrderBy, Skip, Take and some others, use deferred execution. So, what is deferred execution? It means that these methods are not executed until the user actually needs the results; they just create an iterator, and that iterator is ready to be executed whenever we need it. So, how does the user make them execute? With the help of foreach, which calls GetEnumerator, or with other LINQ methods such as ToList(), First(), FirstOrDefault(), Max() and some others.
This helps us gain some performance.
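A tiny sketch of what deferred execution means in practice (my illustration): the side effect inside Where only runs once the query is actually enumerated.
var numbers = new List<int> { 1, 2, 3 };
var query = numbers.Where(n =>
{
    Console.WriteLine("Filtering {0}", n); // not printed yet
    return n > 1;
});
Console.WriteLine("Query created, nothing has executed so far.");
var materialized = query.ToList(); // only now is "Filtering 1..3" printed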
Now, let's come back to your problem. File.ReadLines returns an IEnumerable<string>, which means that it will not read the lines until we need them. In your example, you call a sorting method on this object twice, which means it will sort the collection twice. Instead, you can sort the collection once, then call ToList(), which executes the OrderedEnumerable iterator, and then take the first and last elements of the collection that is now physically in our hands.
var orderedList = lines
    .OrderBy(a => a.Length) // Deferred execution: this is not executed yet
    .ToList();              // But ToList() forces it to execute.
var Maximum = orderedList.Last();
var Minimum = orderedList.First();
BTW, you can find the OrderBy source code here.
It returns an OrderedEnumerable instance, and the sorting algorithm is here:
public IEnumerator<TElement> GetEnumerator()
{
    Buffer<TElement> buffer = new Buffer<TElement>(source);
    if (buffer.count > 0)
    {
        EnumerableSorter<TElement> sorter = GetEnumerableSorter(null);
        int[] map = sorter.Sort(buffer.items, buffer.count);
        sorter = null;
        for (int i = 0; i < buffer.count; i++) yield return buffer.items[map[i]];
    }
}
And now, let's come back to another aspect which affects performance. As you can see, LINQ uses another buffer to store the sorted collection. Of course, that takes some memory, which tells us it is not the most efficient way.
I just tried to explain to you how LINQ works. But I very much agree with @Dotctor's answer overall. Just don't forget that you can use File.ReadAllLines, which returns not an IEnumerable<string> but a string[].
What does that mean? As I tried to explain at the beginning, the difference is that if it is an IEnumerable, .NET will read the lines one by one as the enumerator iterates; but if it is a string[], then all lines are already in our application's memory.
The most efficient approach is to avoid LINQ here; the approach using foreach needs only one enumeration.
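For completeness, if you want to keep a LINQ flavour but still make only one pass, an Aggregate sketch like the one below would also work (this is just my illustration of the one-enumeration point; the plain foreach from the question is simpler):
var result = System.IO.File.ReadLines(@"C:\test.txt")
    .Aggregate(
        new { Min = (string)null, Max = (string)null },
        (acc, line) => new
        {
            Min = acc.Min == null || line.Length < acc.Min.Length ? line : acc.Min,
            Max = acc.Max == null || line.Length > acc.Max.Length ? line : acc.Max
        });
// result.Min is the shortest line, result.Max the longest (both null if the file is empty).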
If you want to put the whole file into a collection anyway you could use this:
List<string> orderedLines = System.IO.File.ReadLines(@"C:\test.txt")
    .OrderBy(l => l.Length)
    .ToList();
string shortest = orderedLines.First();
string longest = orderedLines.Last();
Apart from that you should read about LINQ's deferred execution.
Also note that your LINQ approach not only orders all lines twice to get the longest and the shortest, it also needs to read the whole file twice, since File.ReadLines uses a StreamReader (as opposed to ReadAllLines, which reads all lines into an array first).
MSDN:
When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array.
In general that can help to make your LINQ queries more efficient, e.g. if you filter out lines with Where, but in this case it's making things worse.
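For example (a sketch of my own with a hypothetical filter), streaming pays off when only a fraction of the lines survive the filter, because only the matching lines ever get materialized:
var errorLines = System.IO.File.ReadLines(@"C:\test.txt")
    .Where(l => l.Contains("ERROR")) // hypothetical filter
    .ToList();                       // only the matching lines are kept in memory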
As Jeppe Stig Nielsen has mentioned in a comment, since OrderBy needs to create another buffer collection internally (and ToList a second one), there is another approach that might be more efficient:
string[] allLines = System.IO.File.ReadAllLines(@"C:\test.txt");
Array.Sort(allLines, (x, y) => x.Length.CompareTo(y.Length));
string shortest = allLines.First();
string longest = allLines.Last();
The only drawback of Array.Sort is that it performs an unstable sort as opposed to OrderBy. So if two lines have the same length the order might not be maintained.
I am attempting to check, for each element X in a list ListA, whether two properties of X, X.Code and X.Rate, match the Code and Rate of any element Y in ListB. The current solution uses LINQ and AsParallel to execute these comparisons (time is a factor, and each list can contain anywhere from 0 to a couple hundred elements).
So far the AsParallel method seems much faster; however, I am not sure that these operations are thread-safe. My understanding is that because this comparison will only be reading values and not modifying them, it should be safe, but I am not 100% confident. How can I determine whether this operation is thread-safe before unleashing it on my production environment?
Here is the code I am working with:
var s1 = System.Diagnostics.Stopwatch.StartNew();
ListA.AsParallel().ForAll(x => x.IsMatching = ListB.AsParallel().Any(y => x.Code == y.Code && x.Rate == y.Rate));
s1.Stop();

var s2 = System.Diagnostics.Stopwatch.StartNew();
ListA.ForEach(x => x.IsMatching = ListB.Any(y => x.Code == y.Code && x.Rate == y.Rate));
s2.Stop();
Currently each method returns the same result, however the AsParallel() executes in ~1/3 the time as the plain ForEach, so I hope to benefit from that if there is a way to perform this operation safely.
The code you have is thread-safe. The lists are being accessed as read-only, and the implicit synchronization required to implement the parallelized version is sufficient to ensure any writes have been committed. You do modify the elements within the list, but again, the synchronization implicit in the parallel operation, which the current thread necessarily has to wait on, will ensure that any writes to the element objects are visible in the current thread.
That said, the thread safety is irrelevant, because you are doing the whole thing wrong. You are applying a brute force, O(N^2) algorithm to a need that can be addressed using a more elegant and efficient solution, the LINQ join:
var join = from x in list1
           join y in list2 on new { x.Code, x.Rate } equals new { y.Code, y.Rate }
           select x;

foreach (A a in join)
{
    a.IsMatching = true;
}
Your code example didn't include any initialization of sample data. So I can't reproduce your results with any reliability. Indeed, in my test set, where I initialized list1 and list2 identically, with each having the same 1000 elements (I simply set Code and Rate to the element's index in the list, i.e. 0 through 999), I found the AsParallel() version slower than the serial version, by a little more than 25% (i.e. 250 iterations of the parallel version took around 2.7 seconds, while 250 iterations of the serial version took about 1.9 seconds).
But neither came close to the join version, which completed 250 iterations of that particular test data in about 60 milliseconds, almost 20 times faster than the faster of the other two implementations.
I'm reasonably confident that, in spite of my lack of a data set comparable to your scenario, the basic result will still stand, and that you will find the join approach far superior to either of the options you've tried so far.
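If you're curious why join wins by so much: in LINQ to Objects it builds a hash-based lookup of the keys from one list and then probes it once per element of the other, rather than rescanning the second list every time. A hand-rolled sketch of the same idea (my illustration; it assumes Code is a string and Rate a decimal, so adjust the tuple types to your actual properties):
// Build a set of (Code, Rate) keys from list2 once: O(M).
var keys = new HashSet<Tuple<string, decimal>>(
    list2.Select(y => Tuple.Create(y.Code, y.Rate)));

// Each element of list1 is then checked with an O(1) hash lookup: O(N + M) overall instead of O(N * M).
foreach (var x in list1)
{
    x.IsMatching = keys.Contains(Tuple.Create(x.Code, x.Rate));
}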
With regards to this solution.
Is there a way to limit the number of keywords taken into consideration? For example, I'd like only the first 1000 words of the text to be considered. There's a Take method in LINQ, but it serves a different purpose: all words would be processed, and N records returned. What's the right alternative to do this correctly?
Simply apply Take earlier - straight after the call to Split:
var results = src.Split()
                 .Take(1000)
                 .GroupBy(...) // etc
Well, strictly speaking LINQ is not necessarily going to read everything; Take will stop as soon as it can. The problem is that in the related question you look at Count, and it is hard to get a Count without consuming all the data. Likewise, string.Split will look at everything.
But if you wrote a lazy, non-buffering Split function (using yield return) and you wanted the first 1000 unique words, then
var words = LazySplit(text).Distinct().Take(1000);
would work.
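A minimal sketch of such a lazy split, splitting on whitespace (the name LazySplit comes from the text above; the implementation below is just one possible illustration):
static IEnumerable<string> LazySplit(string text)
{
    int start = -1; // index where the current word began, or -1 if we're between words
    for (int i = 0; i < text.Length; i++)
    {
        if (char.IsWhiteSpace(text[i]))
        {
            if (start >= 0) yield return text.Substring(start, i - start);
            start = -1;
        }
        else if (start < 0)
        {
            start = i;
        }
    }
    if (start >= 0) yield return text.Substring(start); // trailing word, if any
}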
Enumerable.Take does in fact stream results out; it doesn't buffer up its source entirely and then return only the first N. Looking at your original solution, though, the problem is that the input to which you would want to apply Take is the result of String.Split. Unfortunately, that method doesn't use any sort of deferred execution; it eagerly creates an array of all the 'splits' and then returns it.
Consequently, the technique to get a streaming sequence of words from some text would be something like:
var words = src.StreamingSplit() // you'll have to implement that
               .Take(1000);
However, I do note that the rest of your query is:
...
.GroupBy(str => str) // group words by the value
.Select(g => new
{
    str = g.Key,      // the value
    count = g.Count() // the count of that value
});
Do note that GroupBy is a buffering operation - you can expect that all of the 1,000 words from its source will end up getting stored somewhere in the process of the groups being piped out.
As I see it, the options are:
1. If you don't mind going through all of the text for splitting purposes, then src.Split().Take(1000) is fine. The downsides are wasted time (continuing to split after it is no longer necessary) and wasted space (storing all of the words in an array even though only the first 1,000 will be needed). However, the rest of the query will not operate on any more words than necessary.
2. If you can't afford to do (1) because of time / memory constraints, go with src.StreamingSplit().Take(1000) or equivalent. In this case, none of the original text will be processed after 1,000 words have been found.
Do note that those 1,000 words themselves will end up getting buffered by the GroupBy clause in both cases.
I am currently doing some Project Euler problems and the earlier ones often involve things like Fibonacci numbers or primes. Iterating over them seems to be a natural fit for LINQ, at least in readability and perceived "elegance" of the code (I'm trying to use language-specific features where possible and applicable to get a feel for the languages).
My problem now is: if I only need a set of numbers up to a certain limit, how should I best express this? Currently I have hard-coded the respective limit in the iterator, but I'd really like the enumerator to keep returning values until something outside decides not to query it anymore because it's over a certain limit. So basically I want a potentially infinite iterator from which I only take a finite set of numbers. I know such things are trivial in functional languages, but I wonder whether C# allows for that too. The only other idea I had would be to have an iterator Primes(long) that returns primes up to a certain limit, and likewise for other sequences.
Any ideas?
Most of the LINQ methods (Enumerable class) are lazy. So for instance, there's nothing wrong with:
var squares = Enumerable.Range(0, Int32.MaxValue).Select(x => x * x);
You can use the Take method to limit the results:
var tenSquares = squares.Take(10);
var smallSquares = squares.TakeWhile(x => x < 10000);
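And to tie this back to the question: an unbounded sequence written with yield return combines naturally with Take/TakeWhile, so the iterator itself never needs a limit. A sketch (my illustration):
// A potentially infinite Fibonacci sequence; the caller decides when to stop.
static IEnumerable<long> Fibonacci()
{
    long a = 0, b = 1;
    while (true)
    {
        yield return a;
        long next = a + b; // will eventually overflow if enumerated far enough
        a = b;
        b = next;
    }
}

var firstTwenty = Fibonacci().Take(20);
var belowOneMillion = Fibonacci().TakeWhile(f => f < 1000000);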
Edit: The things you need to avoid are functions that return "lazily" but have to consume the entire enumerable to produce a result. For example, grouping or sorting:
var oddsAndEvens = Enumerable.Range(0, Int32.MaxValue)
                             .GroupBy(x => x % 2 == 0);

foreach (var item in oddsAndEvens) {
    Console.WriteLine(item.Key);
}
(That'll probably give you an OutOfMemoryException on 32-bit.)