LINQ queries on possibly infinite lists - C#

I am currently doing some Project Euler problems and the earlier ones often involve things like Fibonacci numbers or primes. Iterating over them seems to be a natural fit for LINQ, at least in readability and perceived "elegance" of the code (I'm trying to use language-specific features where possible and applicable to get a feel for the languages).
My problem now is: if I only need numbers up to a certain limit, how should I best express this? Currently I have hard-coded the respective limit into the iterator, but I'd really like the enumerator to keep returning values until something outside decides to stop querying it because a certain limit has been exceeded. So basically I want a potentially infinite iterator from which I only take a finite set of numbers. I know such things are trivial in functional languages, but I wonder whether C# allows for that, too. The only other idea I had was an iterator Primes(long) that returns primes up to a certain limit, and likewise for other sequences.
Any ideas?

Most of the LINQ methods (Enumerable class) are lazy. So for instance, there's nothing wrong with:
var squares = Enumerable.Range(0, Int32.MaxValue).Select(x => x * x);
You can use the Take method to limit the results:
var tenSquares = squares.Take(10);
var smallSquares = squares.TakeWhile(x => x < 10000);
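To make the question's "potentially infinite iterator" concrete, here is a minimal sketch using an iterator block (the Fibonacci generator below is illustrative and not part of the original post):

static IEnumerable<long> Fibonacci()
{
    long a = 0, b = 1;
    while (true) // conceptually infinite; the caller decides when to stop
    {
        yield return a;
        long next = a + b;
        a = b;
        b = next;
    }
}

// Only the terms below one million are ever produced.
var smallFibs = Fibonacci().TakeWhile(f => f < 1000000).ToList();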
Edit: The things you need to avoid are functions that return "lazily" but have to consume the entire enumerable to produce a result. For example, grouping or sorting:
var oddsAndEvens = Enumerable.Range(0, Int32.MaxValue)
    .GroupBy(x => x % 2 == 0);
foreach (var item in oddsAndEvens)
{
    Console.WriteLine(item.Key);
}
(That'll probably give you an OutOfMemoryException on 32-bit.)
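If you do need one of these operators on such a sequence, one option (a sketch, not from the original answer) is to bound the source first, so the grouping only has to buffer what it actually receives:

var oddsAndEvens = Enumerable.Range(0, Int32.MaxValue)
    .Take(1000)                 // bound the source before grouping
    .GroupBy(x => x % 2 == 0);

foreach (var group in oddsAndEvens)
{
    Console.WriteLine("{0}: {1} items", group.Key, group.Count());
}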

Related

Loop - Calculated last element different

Hi everyone (sorry for the bad title),
I have a loop in which a rounding difference can occur on each pass. I would like to accumulate these differences and add them to the last record of my result.
var cumulatedRoundDifference = 0m;
var resultSet = Enumerable.Range(0, periods)
    .Select(currentPeriod => {
        var value = this.CalculateValue(currentPeriod);
        var valueRounded = this.CommercialRound(value);
        // Bad part :(
        cumulatedRoundDifference += value - valueRounded;
        if (currentPeriod == periods - 1)
            valueRounded = this.CommercialRound(value + valueRounded);
        return valueRounded;
    });
In my opinion this code isn't very nice at the moment.
Is there a pattern/algorithm for this kind of thing, or can it be done cleverly with LINQ, without a variable outside the loop?
Many greetings
It seems like you are doing two things - rounding everything, and calculating the total rounding error.
You could remove the variable outside the lambda, but then you would need 2 queries.
var baseQuery = Enumerable.Range(0, periods)
    .Select(x => new { Value = CalculateValue(x), ValueRounded = CommercialRound(CalculateValue(x)) });
var cumulatedRoundDifference = baseQuery.Select(x => x.Value - x.ValueRounded).Sum();
// LINQ isn't really good at doing something different to the last element
var resultSet = baseQuery
    .Select(x => x.ValueRounded)
    .Take(periods - 1)
    .Concat(new[] { CommercialRound(CalculateValue(periods - 1) + cumulatedRoundDifference) });
Is there a pattern/algorithm for this kind of thing, or can it be done cleverly with LINQ, without a variable outside the loop?
I don't quite agree with what you're trying to accomplish. You're trying to accomplish two very different tasks, so why are you trying to merge them into the same iteration block? The latter (handling the last item) isn't even supposed to be an iteration.
For readability's sake, I suggest splitting the two off. It makes more sense and doesn't require you to check if you're on the last loop of the iteration (which saves you some code and nesting).
While I don't quite understand the calculation in and of itself, I can answer the algorithm you're directly asking for (though I'm not sure this is the best way to do it, which I'll address later in the answer).
var allItemsExceptTheLastOne = allItems.Take(allItems.Count() - 1);
foreach (var item in allItemsExceptTheLastOne)
{
    // Your logic for all items except the last one
}

var theLastItem = allItems.Last();
// Your logic for the last item
This is in my opinion a cleaner and more readable approach. I'm not a fan of using lambdas as mini-methods when it makes them less than trivial to read. This may be subjective and a matter of personal style.
On rereading, I think I understand the calculation better, so I've added an attempt at implementing it, while still maximizing readability as best I can:
// First we make a list of the rounding differences (without the sum)
var myValues = Enumerable
    .Range(0, periods)
    .Select(period => this.CalculateValue(period))
    .Select(value => value - this.CommercialRound(value))
    .ToList();

// myValues = [ 0.1, 0.2, 0.3 ]

myValues.Add(myValues.Sum());

// myValues = [ 0.1, 0.2, 0.3, 0.6 ]
This follows the same approach as the algorithm I first suggested: iterate over the iterable items, and then separately handle the last value of your intended result list.
Note that I separated the logic into two subsequent Select statements as I consider it the most readable (no excessive lambda bodies) and efficient (no duplicate CalculateValue calls) way of doing this. If, however, you are more concerned about performance, e.g. when you are expecting to process massive lists, you may want to merge these again.
I suggest that you always try to default to writing code that favors readability over (excessive) optimization; and only deviate from that path when there is a clear need for additional optimization (which I cannot decide based on your question).
On a second reread, I'm not sure you've explained the actual calculation well enough, as cumulatedRoundDifference is not actually used in your calculations, but the code seems to suggest that its value should be important to the end result.

How do the LINQ functions OrderByDescending and OrderBy work internally when ordering by string length? Is it faster than doing it with a loop?

My question is based on this question; I had posted an answer on that question here.
This is the code.
var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Minimum = lines.First(); // default length set
var Maximum = "";
foreach (string line in lines)
{
    if (Maximum.Length < line.Length)
    {
        Maximum = line;
    }
    if (Minimum.Length > line.Length)
    {
        Minimum = line;
    }
}
and an alternative to this code using LINQ (my approach):
var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Maximum = lines.OrderByDescending(a => a.Length).First().ToString();
var Minimum = lines.OrderBy(a => a.Length).First().ToString();
LINQ is easy to read and implement.
I want to know which one is better for performance.
And how does LINQ work internally for OrderByDescending and OrderBy when ordering by length?
You can read the source code for OrderBy.
Stop micro-optimizing or prematurely optimizing your code. Try to write code that performs correctly; if you then face a performance problem, profile your application and see where the problem is. If you have a piece of code which has a performance problem due to finding the shortest and longest string, then start to optimize that part.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
File.ReadLines returns an IEnumerable<string>. That means that if you foreach over it, it returns the data to you one line at a time. I think the best performance improvement you can make here is to improve the reading of the file from disk. If it is small enough to load the whole file into memory, use File.ReadAllLines; if it is not, try reading the file in big chunks that fit in memory. Reading a file line by line will cause performance degradation due to the I/O operations against the disk. So the problem here is not how LINQ or the loop performs; the problem is the number of disk reads.
With the second method you are not only sorting the lines twice... you are reading the file twice. That is because File.ReadLines returns an IEnumerable<string>. This clearly shows why you shouldn't ever enumerate an IEnumerable<> twice unless you know how it was built. If you really want to do it, add a .ToList() or a .ToArray() that will materialize the IEnumerable<> into a collection... And while the first method has a memory footprint of a single line of text (because it reads the file one line at a time), the second method will load the whole file into memory to sort it, so it will have a much bigger memory footprint, and if the file is some hundred MB, the difference is big (note that technically you could have a file with a single line of text 1 GB long, so this rule isn't absolute... it holds for reasonable files whose lines are up to a few hundred characters long :-) )
Now... Someone will tell you that premature optimization is evil, but I'll tell you that ignorance is twice evil.
If you know the difference between the two blocks of code then you can make an informed choice between the two... otherwise you are simply randomly throwing rocks until it seems to work. "Seems to work" is the key phrase here.
In my opinion, you need to understand a few points to decide what the best way is.
First, let's assume we want to solve the problem with LINQ. To write the most optimized code, you must understand deferred execution. Most LINQ methods, such as Select, Where, OrderBy, Skip, Take and some others, use deferred execution. That means these methods are not executed until the results are actually needed; they just build an iterator, which is ready to be executed when we need it. So, how does the user make them execute? With the help of foreach (which calls GetEnumerator) or of other LINQ methods such as ToList(), First(), FirstOrDefault(), Max() and some others.
This behavior can help us gain some performance.
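A minimal sketch of deferred execution (not from the original answer; the console output only illustrates when the lambda actually runs):

var source = new[] { 1, 2, 3 };

// The lambda is not executed here; Select merely wraps the source in an iterator.
var query = source.Select(x =>
{
    Console.WriteLine("projecting " + x);
    return x * x;
});

Console.WriteLine("before enumeration");  // printed first
var squares = query.ToList();             // only now do the "projecting" lines appear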
Now, let's come back to your problem. File.ReadLines returns an IEnumerable<string>, which means it will not read the lines until we need them. In your example, you have called a sorting method on this object twice, which means it will sort the collection all over again twice. Instead, you can sort the collection once, call ToList() to execute the OrderedEnumerable iterator, and then take the first and last elements of the collection that is now physically in our hands.
var orderedList = lines
    .OrderBy(a => a.Length) // This method uses deferred execution, so it is not executed yet
    .ToList();              // But ToList() makes it execute.
var Maximum = orderedList.Last();
var Minimum = orderedList.First();
BTW, you can find the OrderBy source code here.
It returns an OrderedEnumerable instance, and the sorting algorithm is here:
public IEnumerator<TElement> GetEnumerator()
{
    Buffer<TElement> buffer = new Buffer<TElement>(source);
    if (buffer.count > 0)
    {
        EnumerableSorter<TElement> sorter = GetEnumerableSorter(null);
        int[] map = sorter.Sort(buffer.items, buffer.count);
        sorter = null;
        for (int i = 0; i < buffer.count; i++) yield return buffer.items[map[i]];
    }
}
And now, let's come back to another aspect which affects performance. As you can see, LINQ uses another buffer to store the sorted collection. Of course, this takes some memory, which tells us it is not the most efficient way.
I just tried to explain to you how LINQ works. But I very much agree with @Dotctor's overall answer. Just don't forget that you can use File.ReadAllLines, which will not return an IEnumerable<string>, but a string[].
What does it mean? As I tried to explain at the beginning, the difference is that if it is an IEnumerable, .NET reads the lines one by one as the enumerator walks the iterator; but if it is a string[], all the lines are already in our application's memory.
The most efficient approach is to avoid LINQ here; the foreach approach needs only one enumeration.
If you want to put the whole file into a collection anyway you could use this:
List<string> orderedLines = System.IO.File.ReadLines(@"C:\test.txt")
    .OrderBy(l => l.Length)
    .ToList();
string shortest = orderedLines.First();
string longest = orderedLines.Last();
Apart from that you should read about LINQ's deferred execution.
Also note that your LINQ approach does not only order all lines twice to get the longest and the shortest, it also needs to read the whole file twice, since File.ReadLines uses a StreamReader (as opposed to ReadAllLines, which reads all lines into an array first).
MSDN:
When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array.
In general that can help make your LINQ queries more efficient, e.g. if you filter out lines with Where, but in this case it's making things worse.
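For example, a sketch of the case where streaming does pay off (the path and the length filter are illustrative):

// ReadLines streams the file, so Where sees one line at a time and only the
// matching lines are ever materialized.
var longLines = System.IO.File.ReadLines(@"C:\test.txt")
    .Where(line => line.Length > 100)
    .ToList();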
As Jeppe Stig Nielsen has mentioned in a comment, since OrderBy needs to create another buffer collection internally (with ToList a second one), there is another approach that might be more efficient:
string[] allLines = System.IO.File.ReadAllLines(@"C:\test.txt");
Array.Sort(allLines, (x, y) => x.Length.CompareTo(y.Length));
string shortest = allLines.First();
string longest = allLines.Last();
The only drawback of Array.Sort is that it performs an unstable sort as opposed to OrderBy. So if two lines have the same length the order might not be maintained.
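For completeness, you can also stay with LINQ and still enumerate only once by folding the sequence with Aggregate; a sketch (not from the answers above):

var lines = System.IO.File.ReadLines(@"C:\test.txt");

// Fold over the lines once, carrying the current shortest and longest line along.
var result = lines.Aggregate(
    new { Shortest = (string)null, Longest = (string)null },
    (acc, line) => new
    {
        Shortest = acc.Shortest == null || line.Length < acc.Shortest.Length ? line : acc.Shortest,
        Longest = acc.Longest == null || line.Length > acc.Longest.Length ? line : acc.Longest
    });

string shortest = result.Shortest;
string longest = result.Longest;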

Is there a difference between a conjuncted condition and multiple Where method calls?

I was sitting this cloudy Saturday morning thinking to myself:
IEnumerable<SomeType>
    someThings = ...,
    conjunctedThings = someThings.Where(thing => thing.Big && thing.Tall),
    multiWhereThings = someThings
        .Where(thing => thing.Big)
        .Where(thing => thing.Tall);
Intuitively, I'd say that conjunctedThings will be computed no slower than multiWhereThings, but is there really a difference in the general case?
I can imagine that depending on the share of big things and tall things the computations might take different amounts of time, but I'd like to disregard that aspect.
Are there any other properties I need to take into consideration? E.g. the type of the enumerable or anything else?
In general the multi-Where version will be slower. It needs to process more items and call more lambdas.
If someThings contains n items, m of which are Big, then the lambda for conjunctedThings is called n times, while the lambdas for multiWhereThings are called n + m times. Of course, this holds if the consumer of the two sequences intends to iterate over all the contents. Since the Where method uses yield return internally, the number of iterations might be less, depending on how the collections are consumed. In other words, the numbers above are a worst-case estimate.
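A sketch that makes the n versus n + m call counts visible (the Big/Tall sample data and the counters are illustrative):

int conjunctedCalls = 0, multiWhereCalls = 0;

var someThings = Enumerable.Range(0, 1000)
    .Select(i => new { Big = i % 2 == 0, Tall = i % 3 == 0 })
    .ToList();

var conjunctedCount = someThings
    .Where(t => { conjunctedCalls++; return t.Big && t.Tall; })
    .Count();

var multiWhereCount = someThings
    .Where(t => { multiWhereCalls++; return t.Big; })
    .Where(t => { multiWhereCalls++; return t.Tall; })
    .Count();

Console.WriteLine(conjunctedCalls); // n (1000)
Console.WriteLine(multiWhereCalls); // n + m (1000 + number of Big items)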

Does List<T>.Sort suffer worst case performance on sorted lists?

According to the docs, List<T>.Sort uses the QuickSort algorithm. I've heard that this can exhibit worst-case performance when called on a pre-sorted list if the pivot is not chosen wisely.
Does the .NET implementation of QuickSort experience worst case behaviour on pre-sorted lists?
In my case I'm writing a method that's going to do some processing on a list. The list needs to be sorted in order for the method to work. In most usage cases the list will be passed already sorted, but it's not impossible that there will be some small changes to the order. I'm wondering whether it's a good idea to re-sort the list on every method call. Clearly though, I am falling into the premature optimization trap.
Edit: I've edited the question.
My question was badly asked I guess. It should really have been:
Does List<T>.Sort suffer worst case performance on sorted lists?
To which the answer appears to be "No".
I did some testing and it seems that sorted lists require fewer comparisons to sort than randomized lists: https://gist.github.com/3749646
const int listSize = 1000;
const int sampleSize = 10000;

var sortedList = Enumerable.Range(0, listSize).ToList();
var unsortedList = new List<int>(sortedList);

var sortedCount = 0;
sortedList.Sort((l, r) => { sortedCount++; return l - r; });
//sortedCount.Dump("Sorted");   // Dump is a LINQPad helper
// Returns: 10519

var totalUnsortedComparisons = 0;
for (var i = 0; i < sampleSize; i++)
{
    var unsortedCount = 0;
    unsortedList.Shuffle();     // Shuffle is an extension method (see the linked gist)
    unsortedList.Sort((l, r) => { unsortedCount++; return l - r; });
    totalUnsortedComparisons += unsortedCount;
}
//(totalUnsortedComparisons / sampleSize).Dump("Unsorted");
// Returns: 13547
Of course, @dlev raises a valid point. I should never have allowed myself to get into a situation where I was not sure whether my list was sorted.
I've switched to using a SortedList instead to avoid this issue.
Until you have hard metrics to compare against, you are falling into the premature optimization trap. Run your code in a loop 1000+ times and measure the execution time of the two different methods to see which is faster and whether it makes a noticeable difference.
Choosing the right algorithm is not premature optimization.
When your list is already sorted or nearly so, it makes sense to use a stable sort. .NET ships with one, LINQ's OrderBy implementation. Unfortunately, it will copy your entire list several times, but copying is still O(N), so for a non-trivial list, that will still be faster.
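If the list is usually passed in already sorted, a cheap O(n) pre-check before sorting can sidestep the question entirely; a minimal sketch (IsSorted is a hypothetical helper, not a BCL method):

// Returns true if the list is already in non-descending order.
static bool IsSorted<T>(IList<T> list) where T : IComparable<T>
{
    for (int i = 1; i < list.Count; i++)
        if (list[i - 1].CompareTo(list[i]) > 0)
            return false;
    return true;
}

// Usage inside the method that requires sorted input (items is a List<T>):
// if (!IsSorted(items)) items.Sort();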

In LINQ, does OrderBy() execute the key selector function only once per item, or whenever it is needed?

I found a method to shuffle an array on the internet.
Random rand = new Random();
shuffledArray = myArray.OrderBy(x => rand.Next()).ToArray();
However, I am a little concerned about the correctness of this method. If OrderBy executes x => rand.Next() many times for the same item, the results may be inconsistent and lead to weird behavior (possibly exceptions).
I tried it and everything was fine, but I still want to know whether this is absolutely safe and always works as expected, and I can't find the answer on Google.
Could anyone give me some explanations?
Thanks in advance.
Your approach should work but it is slow.
It works because OrderBy first calculates the keys for every item using the key selector, then it sorts the keys. So the key selector is only called once per item.
In .NET Reflector, see the ComputeKeys method in the EnumerableSorter class:
this.keys = new TKey[count];
for (int i = 0; i < count; i++)
{
    this.keys[i] = this.keySelector(elements[i]);
}
// etc...
whether this is absolutely safe and always works as expected
It is undocumented, so in theory it could change in the future.
For shuffling randomly you can use the Fisher-Yates shuffle. This is also more efficient - using only O(n) time and shuffling in-place instead of O(n log(n)) time and O(n) extra memory.
Related question
C#: Is using Random and OrderBy a good shuffle algorithm?
I assume that you're talking about LINQ-to-Objects, in which case the key used for comparison is only generated once per element. (Note that this is just a detail of the current implementation and could change, although it's very unlikely to because such a change would introduce the bugs that you mention.)
To answer your more general question: your approach should work, but there are better ways to do it. Using OrderBy will typically be O(n log n) performance, whereas a Fisher-Yates-Durstenfeld shuffle will be O(n).
(There's an example Shuffle extension for IEnumerable<T> here, or an in-place equivalent for IList<T> here, if you prefer.)
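For reference, a minimal sketch of an in-place Fisher-Yates (Durstenfeld) shuffle, not taken from the linked answers:

static void Shuffle<T>(IList<T> list, Random rand)
{
    // Walk backwards, swapping each element with a random earlier (or same) position.
    for (int i = list.Count - 1; i > 0; i--)
    {
        int j = rand.Next(i + 1); // 0 <= j <= i
        T tmp = list[i];
        list[i] = list[j];
        list[j] = tmp;
    }
}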
Using a shufflebag will definitely work.
As for your OrderBy method, I think it's not completely random, as the order of equal elements is kept (the sort is stable). So if you have a random array [5 6 7 2 6], the two sixes will always end up in the same relative order.
I'd have to run a frequency test to be sure.
