LINQ evaluates clauses from right to left? Is that why so many articles explaining "lazy evaluation" put a Take operation at the end?
In the following example, Code Snippet 2 is a lot faster than Code Snippet 1 because it doesn't call "ToList":
Code Snippet 1 (Takes about 13000 msec)
var lotsOfNums = Enumerable.Range(0, 10000000).ToList();
Stopwatch sw = new Stopwatch();
sw.Start();
// Get all the even numbers
var a = lotsOfNums.Where(num => num % 2 == 0).ToList();
// Multiply each even number by 100.
var b = a.Select(num => num * 100).ToList();
var c = b.Select(num => new Random(num).NextDouble()).ToList();
// Get the top 10
var d = c.Take(10);
// a, b, c and d have executed on each step.
foreach (var num in d)
{
Console.WriteLine(num);
}
sw.Stop();
Console.WriteLine("Elapsed milliseconds: " + sw.ElapsedMilliseconds);
Code Snippet 2 (3 msec)
sw.Reset();
sw.Start();
var e = lotsOfNums.Where(num => num % 2 == 0).Select(num => num * 100).Select(num => new Random(num).NextDouble()).Take(10);
foreach (var num in e)
{
Console.WriteLine(num);
}
sw.Stop();
Console.WriteLine("Elapsed milliseconds: " + sw.ElapsedMilliseconds);
Console.Read();
However, for Code Snippet 2, I find that the relative position of "Take" doesn't seem to matter.
To be specific, I changed from:
var e = lotsOfNums.Where(num => num % 2 == 0).Select(num => num * 100).Select(num => new Random(num).NextDouble()).Take(10);
To:
var e = lotsOfNums.Take(10).Where(num => num % 2 == 0).Select(num => num * 100).Select(num => new Random(num).NextDouble());
There's no difference in performance?
Also worth noting: if you move the Select(NextDouble) to the start of the chain (as below), then, since LINQ evaluates left to right, the Where clause now tests doubles, which are essentially never exactly even, so the result is empty, Take(10) never fills up, and every clause has to loop through the whole list, which takes much longer to evaluate.
var e = lotsOfNums.Select(num => new Random(num).NextDouble()).Where(num => num % 2 == 0).Select(num => num * 100).Take(10);
LINQ evaluates clauses from right to left?
No, clauses are evaluated left to right. Everything is evaluated left to right in C#.
Is that why so many articles explaining "lazy evaluation" put a Take operation at the end?
I don't understand the question.
UPDATE: I understand the question. The original poster believes incorrectly that Take has the semantics of ToList; that it executes the query, and therefore goes at the end. This belief is incorrect. A Take clause just appends a Take operation to the query; it does not execute the query.
You must put the Take operation where it needs to be. Remember, x.Take(y).Where(z) and x.Where(z).Take(y) are very different queries. You can't just move a Take around without changing the meaning of the query, so put it in the right place: as early as possible, but not so early that it changes the meaning of the query.
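To make that concrete, here is a small illustration (values chosen arbitrarily) of how the two shapes differ:
var nums = Enumerable.Range(0, 100);

// Filter first, then take: the first 10 even numbers -> 0, 2, 4, ..., 18
var filterThenTake = nums.Where(x => x % 2 == 0).Take(10);

// Take first, then filter: the even numbers among 0..9 -> 0, 2, 4, 6, 8 (only 5 items)
var takeThenFilter = nums.Take(10).Where(x => x % 2 == 0);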
Position of "NextDouble" select clause matters?
Matters to who? Again, I don't understand the question. Can you clarify it?
Why do Code Snippet 1 and Code Snippet 2 have the same performance stats?
Since you have not given us your measurements, we have no basis upon which to make a comparison. But your two code samples do completely different things; one executes a query, one just builds a query. Building a query that is never executed is faster than executing it!
I thought "ToList" force early evaluation thus make things slower?
That's correct.
There's no difference in performance? (between my two query constructions)
You've constructed two queries; you have not executed them. Construction of queries is fast, and not typically worth measuring. Measure the performance of the execution of the query, not the construction of the query, if you want to know how fast the query executes!
You seem to have the impression that .Take() forces evaluation, which it does not. You're seeing similar performance regardless of the position of Take() because your query isn't actually being evaluated at all. You have to add a .ToList() at the end (or iterate over the result) to test the performance of the query you've built.
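A minimal sketch of how to measure execution rather than construction, reusing the lotsOfNums list from the question; the first Stopwatch measures only building the query, the second measures actually running it (ToList here is just the trigger for enumeration):
var sw = Stopwatch.StartNew();
var query = lotsOfNums.Where(num => num % 2 == 0)
                      .Select(num => num * 100)
                      .Select(num => new Random(num).NextDouble())
                      .Take(10);
sw.Stop();
Console.WriteLine("Construction: " + sw.ElapsedMilliseconds + " ms"); // effectively 0

sw.Restart();
var materialized = query.ToList(); // the query executes here, during enumeration
sw.Stop();
Console.WriteLine("Execution: " + sw.ElapsedMilliseconds + " ms");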
Related
While writing a solution for a coding problem I discovered an interesting behavior of my LINQ statements. I had two scenarios:
First:
arr.Select(x => x + 5).OrderBy(x => x)
Second:
arr.OrderBy(x => x).Select(x => x + 5)
After a little bit of testing with System.Diagnostics.Stopwatch I got the following results for an integer array of length 100_000.
For the first approach:
00:00:00.0000152
For the second:
00:00:00.0073650
Now I'm interested in why it takes more time if I do the ordering first. I wasn't able to find anything on Google, so I thought about it myself.
I ended up with two ideas:
1. The second scenario has to convert to IOrderedEnumerable and then back to IEnumerable, while the first scenario only has to convert to IOrderedEnumerable and not back.
2. You end up with two loops: the first for sorting and the second for selecting, while approach 1 does everything in one loop.
So my question is: why does it take so much more time to do the ordering before the select?
Let's have a look at the sequences:
private static void UnderTestOrderBySelect(int[] arr) {
var query = arr.OrderBy(x => x).Select(x => x + 5);
foreach (var item in query)
;
}
private static void UnderTestSelectOrderBy(int[] arr) {
var query = arr.Select(x => x + 5).OrderBy(x => x);
foreach (var item in query)
;
}
// See Marc Gravell's comment; let's compare Linq and inplace Array.Sort
private static void UnderTestInPlaceSort(int[] arr) {
var tmp = arr;
var x = new int[tmp.Length];
for (int i = 0; i < tmp.Length; i++)
x[i] = tmp[i] + 5;
Array.Sort(x);
}
In order to benchmark, let's run 10 times and average the 6 middle results:
private static string Benchmark(Action<int[]> methodUnderTest) {
List<long> results = new List<long>();
int n = 10;
for (int i = 0; i < n; ++i) {
Random random = new Random(1);
int[] arr = Enumerable
.Range(0, 10000000)
.Select(x => random.Next(1000000000))
.ToArray();
Stopwatch sw = new Stopwatch();
sw.Start();
methodUnderTest(arr);
sw.Stop();
results.Add(sw.ElapsedMilliseconds);
}
var valid = results
.OrderBy(x => x)
.Skip(2) // drop the 2 fastest runs
.Take(results.Count - 4) // drop the 2 slowest runs
.ToArray();
return $"{string.Join(", ", valid)} average : {(long) (valid.Average() + 0.5)}";
}
Time to run and have a look at the results:
string report = string.Join(Environment.NewLine,
$"OrderBy + Select: {Benchmark(UnderTestOrderBySelect)}",
$"Select + OrderBy: {Benchmark(UnderSelectOrderBy)}",
$"Inplace Sort: {Benchmark(UnderTestInPlaceSort)}");
Console.WriteLine(report);
Outcome: (Core i7 3.8GHz, .Net 4.8 IA64)
OrderBy + Select: 4869, 4870, 4872, 4874, 4878, 4895 average : 4876
Select + OrderBy: 4763, 4763, 4793, 4802, 4827, 4849 average : 4800
Inplace Sort: 888, 889, 890, 893, 896, 904 average : 893
I don't see any significant difference; Select + OrderBy seems to be slightly more efficient (about a 2% gain) than OrderBy + Select. The in-place Array.Sort, however, has far better performance (about 5 times faster) than either Linq version.
Depending on which Linq provider you have, some optimization may happen on the query. E.g. if you used some kind of database, chances are high your provider would create the exact same query for both statements, similar to this one:
select myColumn from myTable order by myColumn;
Thus performance should be identical, no matter whether you order first in Linq or select first.
As this does not seem to happen here, you are probably using Linq2Objects, which has no such optimization at all. So the order of your statements may have an effect, in particular if you have some kind of filter using Where which filters many objects out, so that later statements won't operate on the entire collection.
To keep a long story short: the difference most probably comes from some internal initialization logic. As a dataset of 100,000 numbers is not really big - at least not big enough - even some fast initialization has a big impact.
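As a hedged illustration of the Where point above with Linq-to-Objects (numbers chosen arbitrarily): if a Where clause comes before the OrderBy, the sort only has to deal with the elements that survive the filter.
var data = Enumerable.Range(0, 100_000).Reverse().ToArray();

// Sorts all 100,000 elements, then throws most of them away.
var sortThenFilter = data.OrderBy(x => x).Where(x => x % 100 == 0).ToList();

// Filters down to 1,000 elements first, then sorts only those.
var filterThenSort = data.Where(x => x % 100 == 0).OrderBy(x => x).ToList();

// Both produce the same result here, because the filter does not depend on the order.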
int [] n=new int[10]{2,3,33,33,55,55,123,33,88,234};
Expected output: 2, 3, 123, 88, 234 - i.e. every value that occurs more than once is removed entirely, using LINQ.
I can do it using two for loops by continuously checking, but I need a simpler way using LINQ.
This is not about removing duplicates: Distinct would give 2, 3, 33, 55, 123, 88, 234, whereas my output should be 2, 3, 123, 88, 234.
I combined your grouping idea and matiash's count. Not sure about its speed.
var result = n.GroupBy(s => s).Where(g => g.Count() == 1).Select(g => g.Key);
Update: I have measured the speed and it seems the time is linear, so you can use it on large collections.
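For reference, a minimal end-to-end run of that query against the sample array (output order follows first occurrence):
int[] n = { 2, 3, 33, 33, 55, 55, 123, 33, 88, 234 };
var result = n.GroupBy(s => s).Where(g => g.Count() == 1).Select(g => g.Key);
Console.WriteLine(string.Join(", ", result)); // 2, 3, 123, 88, 234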
var result = n.Where(d => n.Count(d1 => d1 == d) <= 1);
This reads: only take those elements that are present at most once in n.
It's quadratic though. Doesn't matter for short collections, but could possibly be improved.
EDIT: Dmitry's solution is linear, and hence far better.
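If you want the linear behaviour to be explicit, the same idea can be written as a two-pass count with a dictionary (my own sketch; it produces the same result as the GroupBy query):
var counts = new Dictionary<int, int>();
foreach (var value in n)
    counts[value] = counts.TryGetValue(value, out var c) ? c + 1 : 1;

var result = n.Where(value => counts[value] == 1); // 2, 3, 123, 88, 234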
I have a string consisting of 0, 1 and * (wildcard character) and this is called a binary schema, e.g. 0*10*.
Suppose I have a list of schemas, e.g. [11010, 0010*, 0*11*, 1*100]; only 0010* is a sub-schema of 0*10*.
All schemas being compared are guaranteed to be of the same length, though the length can be set initially in the program.
Edit:
So far, here's the step-by-step solution I can think of:
Find the wildcard indices in both schemas and check whether one index set is a superset of the other.
If so, remove the characters at the superset's indices from both schema strings.
Return true if both trimmed schemas are the same string.
Is there a more efficient way to do this? By efficient I mean execution speed / as few iterations as possible, because the checker will be invoked very frequently.
If I understand the question correctly this should do what you want and it performs as little work as possible assuming mismatching positions are unbiased. It might however be faster not to use LINQ. If you need the resulting list only once you can probably get away without turning the result into a list.
var s = "0*10*";
var sscs = new[] { "11010", "0010*", "0*11*", "1*100" };
var sss = sscs.Where(ssc => s.Zip(ssc, (a, b) => (a == b) || (a == '*')).All(_ => _)).ToList();
Every subschema candidate is compared symbol by symbol with the specified schema. If all symbols match, or the schema has a wildcard in case of a mismatch, the subschema candidate is a subschema. The comparison is aborted immediately if there is a mismatch and the schema has no wildcard.
I heavily abbreviated the variable names to make it (almost) fit.
s schema
sscs subschema candidates
ssc subschema candidate
sss subschemas
a symbol in schema
b symbol in subschema candidate
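For readability, here is the same query again with descriptive names (purely a renaming, not a different algorithm):
var schema = "0*10*";
var candidates = new[] { "11010", "0010*", "0*11*", "1*100" };

var subschemas = candidates
    .Where(candidate => schema
        .Zip(candidate, (schemaSymbol, candidateSymbol) =>
            schemaSymbol == candidateSymbol || schemaSymbol == '*')
        .All(positionMatches => positionMatches))
    .ToList(); // ["0010*"]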
Not exactly sure what you are asking, but I assume that you have a starting list of schemas that you are working off of, and that list is unique (no subsets, etc).
Define a simple IsSubsetOf() function, and then call that as part of a Linq 'any' call, or you can do it in a for-loop:
var startingSchemas = new [] { "100**101", "110*101*", "1111*000" };
startingSchemas.Any(x => IsSubsetOf(x, "11****11")); // false
startingSchemas.Any(x => IsSubsetOf(x, "11011010")); // true
public bool IsSubsetOf(string main, string sub)
{
for (var i = 0; i < main.Length; i++)
{
if (main[i] == '*') continue; // main is '*', so anything is ok
if (main[i] == sub[i]) continue; // equal is ok, both 1/0/*
return false; // if not equal, sub[i] could be *, or the opposite of main[i]
}
return true;
}
One issue that I think you might need to clarify is what you want to do when you find something that is NOT a subset, but when combined with another schema then it is.
1*1 => 101 or 111
0*1 => 001 or 011 (not a subset of 1*1)
But these two combined = the **1 schema or {001, 011, 101, 111}
Do you want to take a string list and then reduce it to the minimal set of schemas that will still match the same inputs? IsSubset(x,y) = false, but IsSubset(y,x) = true
Addition:
Making the starting data unique is pretty easy, if it isn't already:
var uniqueSchemas = startingSchemas.Distinct().ToList();
uniqueSchemas.RemoveAll(x => uniqueSchemas.Any(y => y != x && IsSubsetOf(y, x)));
Compiled (release, no pdb, optimize):
for (int index = 0; index < main.Length; ++index)
{
if ((int) main[index] != 42 && (int) main[index] != (int) sub[index])
return false;
}
Performance
Very crude performance check. Run in a Parallels VM on an i7 / 4 GB with 1 core allocated to the VM, other processes running, etc.
200 schemas (randomly generated, 800 characters long)
1 test string (800 characters, randomly generated the same way)
1 million runs each
Output: (all runs were +/- 500ms at the outside, usually in unison)
// unsafe = the 'public static unsafe bool' IsSubsetOfUnsafe variant below, using pointers
// safe = standard method
Any() warmup : elapsed = 11965 (.0120 ms)
Any() safe : elapsed = 11300 (.0113 ms)
Any() unsafe : elapsed = 10754 (.0108 ms)
for() safe : elapsed = 11480 (.0115 ms)
for() unsafe : elapsed = 7964 (.008 ms)
So, that's what I get from this. If there is a clever data structure for this, I've no clue.
Unsafe Version
This isn't guaranteed to be 100% correct. I don't normally do this, and I don't know if the discrepancy I saw was because of the test harness or the code. Also, disclaimer: it's been a good 6 years since I last wrote even a tiny bit of unsafe code. And I don't push .NET for performance this way; there is usually a bigger bottleneck... If you do use unsafe code, my only advice would be to NOT modify anything. If you just read, you should be pretty safe. Check all your bounds!
private unsafe static bool IsSubsetOfUnsafe(String main, String sub)
{
if (main == null && sub == null) return true; // is this what you want? decide
if (main == null || sub == null) return false; // this too? maybe if null then say "true" and discard?
if (main.Length != sub.Length) return false;
fixed (char* m = main)
fixed (char* s = sub)
{
var m1 = m;
var s1 = s;
int len = main.Length;
for (int i = 0; i < len; ++i)
{
if ((int)*m1 != 42 && *m1 != *s1) return false;
m1++;
s1++;
}
return true;
}
}
Unfortunately I still don't fully understand what you are doing but I will present my idea anyway, maybe it is useful.
The central idea is to replace your string representation with a more compact bit representation - your string 1*10 for example gets turned into 11001110 or 0xCE. Because one symbol takes up 2 bits you can pack 32 symbols into one UInt64, longer strings become arrays of UInt64s.
0 => 10
1 => 11
* => 00
01 => unused
Now you can find subschemas with the following LINQ expression
var sss = sscs.Where(ssc => s.Zip(ssc, (a, b) => (a ^ b) & ((a & 0xAAAAAAAAAAAAAAAA) | ((a & 0xAAAAAAAAAAAAAAAA) >> 1))).All(x => x == 0)).ToList();
This is structured like my previous answer to make the comparison more meaningful. The obvious advantage is that it processes 32 symbols in parallel, and indeed it is 30 times faster than my previous answer. But I am actually a bit disappointed, because I hoped for maybe a 100 times speedup, since the more compact representation also means less memory traffic; maybe the overhead from using LINQ is the actual bottleneck. So I turned it into plain for loops, and this made it 130 times faster than the LINQ string version. But this is only really useful if it can be deeply integrated into your application, because the conversion between the string representation and this representation is quite expensive.
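For completeness, here is a minimal sketch of the packing step under the encoding above (my own illustration, not the answerer's code; the bit order within a word does not matter for the test as long as schema and candidates are packed the same way, and the full check simply runs BlockMatches over every block pair):
// '0' -> 10, '1' -> 11, '*' -> 00; 32 symbols per UInt64, symbol i in bits 2i..2i+1.
static ulong[] Pack(string schema)
{
    var blocks = new ulong[(schema.Length + 31) / 32]; // unused tail bits stay 00, i.e. wildcards
    for (int i = 0; i < schema.Length; i++)
    {
        ulong bits = schema[i] switch
        {
            '0' => 0b10UL,
            '1' => 0b11UL,
            '*' => 0b00UL,
            _   => throw new ArgumentException("unexpected symbol")
        };
        blocks[i / 32] |= bits << (2 * (i % 32));
    }
    return blocks;
}

// Per-block subschema test: zero iff every fixed position of the schema block a
// is matched exactly by the candidate block b.
static bool BlockMatches(ulong a, ulong b)
{
    ulong fixedMask = (a & 0xAAAAAAAAAAAAAAAAUL) | ((a & 0xAAAAAAAAAAAAAAAAUL) >> 1);
    return ((a ^ b) & fixedMask) == 0;
}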
In the complex event processing workshop on the Rx site, challenge 5 is done using a Buffer. I have a solution using LINQ dot/lambda notation. Out of interest I'd like to convert it to the LINQ language-integrated query notation.
The challenge and my code follow. For some reason result2 doesn't work properly: it makes the UI unresponsive and the output looks truncated. Is this something funky, is it my query, can you fix it?
The Challenge (Download here)
IObservable<object> Query(IObservable<StockQuote> quotes)
{
// TODO: Change the query below to compute the average high and average low over
// the past five trading days as well as the current close and date.
// HINT: Try using Buffer.
return from quote in quotes
where quote.Symbol == "MSFT"
select new { quote.Close, quote.Date };
}
My solution
IObservable<object> Query(IObservable<StockQuote> quotes)
{
// TODO: Change the query below to compute the average high and average low over
// the past five trading days as well as the current close and date.
// HINT: Try using Buffer.
var result1 = quotes.Where(qt => qt.Symbol == "MSFT").Buffer(5, 1).Select(quoteList =>
{
var avg = quoteList.Average(qt => qt.Close);
return new { avg, quoteList.Last().Close, quoteList.Last().Date };
});
var result2 = from quote in quotes
where quote.Symbol == "MSFT"
from quoteList in quotes.Buffer(5, 1)
let avg = quoteList.Average(qt => qt.Close)
select new { avg, quoteList.Last().Close, quoteList.Last().Date };
return result2;
}
Both solutions subscribe to quotes multiple times (even more than two times - remember multiple from clauses result in a SelectMany call under the hood), so there's already something wrong there :-). Try again.
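Roughly speaking, the second query desugars into something like the following (transparent identifiers simplified); this makes it visible that quotes is subscribed once for the Where chain and then again, via Buffer, for every matching quote:
var result2Desugared = quotes
    .Where(quote => quote.Symbol == "MSFT")
    .SelectMany(
        quote => quotes.Buffer(5, 1),                   // a fresh subscription per matching quote
        (quote, quoteList) => new { quote, quoteList })
    .Select(t => new
    {
        avg = t.quoteList.Average(qt => qt.Close),
        t.quoteList.Last().Close,
        t.quoteList.Last().Date
    });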
I think that the query should look more like this:
var result =
from quote in quotes
where quote.Symbol == "MSFT"
from quoteList in quotes.Buffer(5, 1)
let avgHigh = quoteList.Average(qt => qt.High)
let avgLow = quoteList.Average(qt => qt.Low)
select new { avgHigh, avgLow, quote.Close, quote.Date };
But this has only minor differences from yours: there is no need to do quoteList.Last() when quote would do, and the question asked for the average of High & Low, not the Close.
From what I can tell the issue is to do with the chart and not with the Rx component. I think the chart is redrawing so often that it is blocking.
The problem requires that both graphs (current, average) be drawn simultaneously.
I don't believe it's possible to use the bind monad (SelectMany) to independently compose the latest values of those two streams which is specified by the problem.
Comprehensions downward of quotes.Buffer would be bound to the rate of the buffered stream.
Alternatively:
quotes = quotes.Publish().RefCount();
return Observable.CombineLatest
(
first: quotes,
second: quotes.Buffer(5, 1).Select(
buffer => new { High = buffer.Average(q => q.High), Low = buffer.Average(q => q.Low) }),
resultSelector: (l, r) => new { l.Close, l.Date, r.High, r.Low }
);
This moves at the same rate as the source, resulting in a smooth graph.
I was hoping to figure out a way to write the below in a functional style with extension functions. Ideally this functional style would perform well compared to the iterative/loop version. I'm guessing that there isn't a way. Probably because of the many additional function calls and stack allocations, etc.
Fundamentally I think the pattern which is making it troublesome is that it both calculates a value to use for the Predicate and then needs that calculated value again as part of the resulting collection.
// This is what is passed to each function.
// Do not assume the array is in order.
var a = (0).To(999999).ToArray().Shuffle(); // To and Shuffle are presumably custom extension methods
// Approx times in release mode (on my machine):
// Functional is avg 20ms per call
// Iterative is avg 5ms per call
// Linq is avg 14ms per call
private static List<int> Iterative(int[] a)
{
var squares = new List<int>(a.Length);
for (int i = 0; i < a.Length; i++)
{
var n = a[i];
if (n % 2 == 0)
{
int square = n * n;
if (square < 1000000)
{
squares.Add(square);
}
}
}
return squares;
}
private static List<int> Functional(int[] a)
{
return
a
.Where(x => x % 2 == 0 && x * x < 1000000)
.Select(x => x * x)
.ToList();
}
private static List<int> Linq(int[] a)
{
var squares =
from num in a
where num % 2 == 0 && num * num < 1000000
select num * num;
return squares.ToList();
}
An alternative to Konrad's suggestion. This avoids the double calculation, but also avoids even calculating the square when it doesn't have to:
return a.Where(x => x % 2 == 0)
.Select(x => x * x)
.Where(square => square < 1000000)
.ToList();
Personally, I wouldn't sweat the difference in performance until I'd seen it be significant in a larger context.
(I'm assuming that this is just an example, by the way. Normally you'd possibly compute the square root of 1000000 once and then just compare n with that, to shave off a few milliseconds. It does require two comparisons or an Abs operation though, of course.)
EDIT: Note that a more functional version would avoid using ToList at all. Return IEnumerable<int> instead, and let the caller transform it into a List<T> if they want to. If they don't, they don't take the hit. If they only want the first 5 values, they can call Take(5). That laziness could be a big performance win over the original version, depending on the context.
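A sketch of that lazier shape (the method name FunctionalLazy is just for illustration; same filtering logic, materialization left to the caller):
private static IEnumerable<int> FunctionalLazy(int[] a)
{
    return a.Where(x => x % 2 == 0)
            .Select(x => x * x)
            .Where(square => square < 1000000);
}

// The caller decides how much work to pay for:
var firstFive = FunctionalLazy(a).Take(5).ToList();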
Just solving your problem of the double calculation:
return (from x in a
let sq = x * x
where x % 2 == 0 && sq < 1000000
select sq).ToList();
That said, I’m not sure that this will lead to much performance improvement. Is the functional variant actually noticeably faster than the iterative one? The code offers quite a lot of potential for automated optimisation.
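For reference, the let clause compiles down to an intermediate projection, so the query above is roughly equivalent to:
return a.Select(x => new { x, sq = x * x })
        .Where(t => t.x % 2 == 0 && t.sq < 1000000)
        .Select(t => t.sq)
        .ToList();
Note that this computes the square for every element, including the odd ones, which is part of why it is unlikely to buy much here.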
How about some parallel processing? Or does the solution have to be LINQ (which I consider to be slow)?
var squares = new List<int>(a.Length);
Parallel.ForEach(a, n =>
{
// Note: List<int>.Add is not thread-safe; real code would need a lock or a concurrent collection here.
if (n < 1000 && n % 2 == 0) squares.Add(n * n);
});
The Linq version would be:
return a.AsParallel()
.Where(n => n < 1000 && n % 2 == 0)
.Select(n => n * n)
.ToList();
I don't think there's a functional solution that will be completely on-par with the iterative solution performance-wise. In my timings (see below) the 'functional' implementation from the OP appears to be around twice as slow as the iterative implementation.
Micro-benchmarks like this one are prone to all manner of issues. A common tactic in dealing with variability problems is to repeatedly call the method being timed and compute an average time per call - like this:
// from main
Time(Functional, "Functional", a);
Time(Linq, "Linq", a);
Time(Iterative, "Iterative", a);
// ...
static int reps = 1000;
private static List<int> Time(Func<int[],List<int>> func, string name, int[] a)
{
var sw = System.Diagnostics.Stopwatch.StartNew();
List<int> ret = null;
for(int i = 0; i < reps; ++i)
{
ret = func(a);
}
sw.Stop();
Console.WriteLine(
"{0} per call timings - {1} ticks, {2} ms",
name,
sw.ElapsedTicks/(double)reps,
sw.ElapsedMilliseconds/(double)reps);
return ret;
}
Here are the timings from one session:
Functional per call timings - 46493.541 ticks, 16.945 ms
Linq per call timings - 46526.734 ticks, 16.958 ms
Iterative per call timings - 21971.274 ticks, 8.008 ms
There are a host of other challenges as well: strobe-effects with the timer use, how and when the just-in-time compiler does its thing, the garbage collector running its collections, the order that competing algorithms are run, the type of cpu, the OS swapping other processes in and out, etc.
I tried my hand at a little optimization. I removed the square from the test (num * num < 1000000) - changing it to (num < 1000) - which seemed safe since there are no negatives in the input - that is, I took the square root of both sides of the inequality. Surprisingly, I got different results compared to the methods in the OP - there were only 500 items in my optimized output, as compared to the 241,849 from the three implementations in the OP. So why the difference? Much of the input, when squared, overflows 32-bit integers, so those extra 241,349 items came from numbers that when squared overflowed to either negative numbers or numbers under 1 million, while still passing our evenness test.
optimized (functional) timing:
Optimized per call timings - 16849.529 ticks, 6.141 ms
This was one of the functional implementations altered as suggested. It output the 500 items passing the criteria as expected. It is deceptively "faster" only because it output fewer items than the iterative solution.
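To see the wrap-around concretely (illustrative values of my own choosing):
int n = 65536;                        // even, and within the 0..999999 input range
int square = unchecked(n * n);        // 65536 * 65536 = 4,294,967,296 wraps to 0
Console.WriteLine(square < 1000000);  // True - the overflowed value slips through the filter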
We can make the original implementations blow up with an OverflowException by adding a checked block around their implementations. Here is a checked block added to the "Iterative" method:
private static List<int> Iterative(int[] a)
{
checked
{
var squares = new List<int>(a.Length);
// rest of method omitted for brevity...
return squares;
}
}