I'm reading a file and turning each line into an instance of a class, let's call it Record, returning each Record as it is read using IEnumerable<Record> and yield return.
Because of this, the reads only actually happen when I perform an operation on the enumeration, such as summing it or iterating through it with a foreach.
I do need to go through each record and translate it into a database, but due to database design that predates me I need the totals on each record in the database, so I need these totals before I start translating the records into my database.
At the moment I have five separate .Count() or .Sum() operations on my enumeration before I start iterating it, for example int i = records.Sum(r => r.SomeField) or int j = records.Count(r => r.IsSomethingTrue). Each of those counts or sums loops through the entire file to calculate its result separately. I'm not happy with this behaviour and would like to find a more efficient way of doing it.
I am using .NET 3.5 if that makes any difference.
You could use your own struct to calculate several values in a single pass through an enumerable object.
public struct ComplexAccumulator
{
public int TotalSumField { get; set; }
public int CountSomethingTrue { get; set; }
}
Now you can use the Aggregate extension method to accumulate the values:
var totals = records.Aggregate(default(ComplexAccumulator), (a, r) => new ComplexAccumulator
{
    TotalSumField = a.TotalSumField + r.SomeField,
    // The parentheses matter: + binds more tightly than ?:
    CountSomethingTrue = a.CountSomethingTrue + (r.IsSomethingTrue ? 1 : 0),
});
Instead of the struct you could use a suitable Tuple instance, e.g. something like Tuple<int, int, int> (though note that Tuple only arrived in .NET 4, so on .NET 3.5 you would stick with the struct).
Efficiency is not a strength of LINQ... You need to replace some LINQ things with manual loops here.
You seem to need two passes over the data. One for aggregation:
var sum = 0;
var count = 0; // one local per aggregate
foreach (var item in items) {
    // compute all 5 aggregates here, e.g. using the fields from the question:
    sum += item.SomeField;
    if (item.IsSomethingTrue) count++;
}
And then one to translate the data:
items.Select(item => Translate(item, aggregates))
Whether you should buffer items (for example using ToList) or not depends on whether available memory can hold those items or not.
You can use Aggregate to perform all 5 aggregations in one pass, but that's no better than a loop in any way: it's slower, it's far more code, and the code is arguably illegible.
I need to iterate over all dictionary values in a dictionary of type:
Dictionary<Vector3, bool>
and get the number of entries whose bool value is true.
Currently, I am doing this:
int walkableTiles = 0;
foreach (bool walkable in region.RegionPositions.Values)
{
if (walkable) walkableTiles++;
}
Debug.Log("walkable tiles new " + walkableTiles);
Is there a more efficient way of doing this?
Whatever you do, with the data structure you're using, you will have to iterate every element. You can do it in less code but not more efficiently.
var walkableTiles = region.RegionPositions.Count(x => x.Value);
You might consider having a data structure that keeps a running count of true values if this is something you do often, for example:
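A minimal sketch of that idea (the wrapper type and member names are invented here; Vector3 is Unity's, as in the question). The count is maintained as entries are set, so reading it is O(1):

public class WalkabilityMap
{
    private readonly Dictionary<Vector3, bool> _map = new Dictionary<Vector3, bool>();

    // Always current; no iteration needed to read it.
    public int WalkableCount { get; private set; }

    public void Set(Vector3 position, bool walkable)
    {
        bool previous;
        if (_map.TryGetValue(position, out previous) && previous)
            WalkableCount--; // replacing an entry that was previously true
        if (walkable)
            WalkableCount++;
        _map[position] = walkable;
    }
}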
If you want shorter code, you can use Count:
int walkableTiles = region.RegionPositions.Values.Count(x => x);
In terms of speed, though, there really isn't a faster way to find out how many elements of a collection are true than to count them one by one. That is what your code does, and what Count does under the hood too. Assuming you actually have a performance problem, there are probably better places to optimise than the counting of trues.
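For illustration, this is roughly what the predicate overload of Count does internally (simplified; the real implementation also validates its arguments, and the method is renamed here to avoid clashing with Enumerable.Count):

public static class MyEnumerable
{
    public static int CountWhere<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        int count = 0;
        foreach (T item in source)
            if (predicate(item))
                count++; // one check per element, unavoidably O(n)
        return count;
    }
}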
So in my attempt to start learning C#, one challenge I've come across is to create a recursive function that will calculate the sum of a list. I'm wondering if it's possible to do this using the list as the only argument of the function? Or would I need to pass an index as well to work through the list?
int addRecursively(List<int> numList)
{
int total = numList[0];
if (numList.Count > 1)
{
numList.RemoveAt(0);
return total += addRecursively(numList);
}
Console.WriteLine(total);
return total;
}
List<int> numbers = new List<int> {1,2,3,4,5,6,7,8};
addRecursively(numbers); //returns only the last element of whichever list I enter.
I was hoping that, by assigning the first element to total before deleting that element, the index of every remaining element would move down one on each call, letting me pick up each value in turn and total them. However, the function only ever returns the last element of whichever list of integers I enter.
My thought process came from arrays and the idea of the shift method on an array in JS, removing the first element and bringing the whole thing down.
Am I attempting something stupid here? Is there another similar method I should be using or would I be better off simply including a list size as another parameter?
Thanks for your time
That's a great exercise for a beginner. However, you would never, ever do this with a List<int> in a realistic program. First, because you'd simply call .Sum() on it. But that's a cop-out; someone had to write Sum, and that person could be you.
The reason you would never do this recursively is that List<T> is not a recursive data structure. As you note, every time you recurse, something has to be different. If nothing is different, then you have an unbounded recursion!
That means you have to change one of the arguments, either by mutating it, if it is a reference type, or passing a different argument. Neither is correct in this case where the argument is a list.
For a list, you never want to mutate the list, by removing items, say. You don't own that list. The caller owns the list and it is rude to mutate it on them. When I call your method to sum a list, I don't want the list to be emptied; I might want to use it for something else.
And for a list, you never want to pass a different list in a recursion because constructing the new list from the old list is very expensive.
(There is also the issue of deep recursion; presumably we wish to sum lists of more than a thousand numbers, but that will eat up all the stack space if you go with a recursive solution; C# is not a guaranteed-tail-recursive language like F# is. However, for learning purposes let's ignore this issue and assume we are dealing with only small lists.)
Since both of the techniques for avoiding unbounded recursions are inapplicable, you must not write recursive algorithms on List<T> (or, as you note, you must pass an auxiliary parameter such as an index, and that's the thing you change). But your exercise is still valid; we just have to make it a better exercise by asking "what would we have to change to make a list that is amenable to recursion?"
We need to change two things: (1) make the list immutable, and (2) make it a recursively defined data structure. If it is immutable then you cannot change the caller's data by accident; it's unchangeable. And if it is a recursively defined data structure then there is a natural way to do recursion on it that is cheap.
So this is your new exercise:
An ImmutableList is either (1) empty, or (2) a single integer, called the "head", and an immutable list, called the "tail". Implement these in the manner of your choosing. (Abstract base class, interface implemented by multiple classes, single class that does the whole thing, whatever you think is best. Pay particular attention to the constructors.)
ImmutableList has three public read-only properties: bool IsEmpty, int Head and ImmutableList Tail. Implement them.
Now we can define int Sum(ImmutableList) as a recursive method: the base case is the sum of an empty list is zero; the inductive case is the sum of a non-empty list is the head plus the sum of the tail. Implement it; can you do it as a single line of code?
You will learn much more about C# and programming in a functional style with this exercise. Use iterative algorithms on List<T>, always; that is what it was designed for. Use recursion on data structures that are designed for recursion.
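In case it helps to check your work, here is a minimal sketch of one of the designs the exercise allows (a single class with a shared Empty instance; treat it as one possible answer, not the answer):

public sealed class ImmutableList
{
    // The one empty list; every list eventually ends in this instance.
    public static readonly ImmutableList Empty = new ImmutableList();

    public bool IsEmpty { get; private set; }
    public int Head { get; private set; }
    public ImmutableList Tail { get; private set; }

    private ImmutableList() { IsEmpty = true; }

    public ImmutableList(int head, ImmutableList tail)
    {
        Head = head;
        Tail = tail;
    }

    // The recursive Sum in a single line: an empty list sums to zero;
    // otherwise it is the head plus the sum of the tail.
    public static int Sum(ImmutableList items) =>
        items.IsEmpty ? 0 : items.Head + Sum(items.Tail);
}

Building new ImmutableList(1, new ImmutableList(2, ImmutableList.Empty)) and passing it to Sum then yields 3, with nothing mutated anywhere.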
Bonus exercises:
Write Sum as an extension method, so that you can call myImmutableList.Sum().
Sum is a special case of an operation called Aggregate. It returns an integer, and takes three parameters: an immutable list, an integer called the accumulator, and a Func<int, int, int>. If the list is empty, the result is the accumulator. Otherwise, the result is the recursion on the tail and calling the function on the head and the accumulator. Write a recursive Aggregate; if you've done it correctly then int Sum(ImmutableList items) => Aggregate(items, 0, (acc, item) => acc + item); should be a correct implementation of Sum (a sketch follows after these exercises).
Genericize ImmutableList to ImmutableList<T>; genericize Aggregate to Aggregate<T, R> where T is the list element type and R is the accumulator type.
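For the Aggregate exercise, a sketch of the recursive definition given above, using the same hypothetical ImmutableList as the earlier sketch:

public static int Aggregate(ImmutableList items, int accumulator, Func<int, int, int> f) =>
    items.IsEmpty
        ? accumulator // base case: an empty list yields the accumulator
        : Aggregate(items.Tail, f(accumulator, items.Head), f); // recurse on the tail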
Try this way:
int addRecursively(List<int> lst)
{
    if (lst.Count == 0) return 0;
    // Recurse on a copy of the tail so the caller's list is not mutated.
    return lst[0] + addRecursively(lst.Skip(1).ToList());
}
One more example:
static public int RecursiveSum(List<int> ints)
{
    if (ints.Count == 0)
        return 0;
    // GetRange(1, Count - 1) copies the tail, leaving the caller's list intact.
    return ints[0] + RecursiveSum(ints.GetRange(1, ints.Count - 1));
}
These are some ways to get the sum of the integers in a list.
You don't need a recursive method here; it uses more system resources than a simple loop when it isn't needed.
class Program
{
static void Main(string[] args)
{
List<int> numbers = new List<int>() { 1, 2, 3, 4, 5 };
int sum1 = numbers.Sum();
int sum2 = GetSum2(numbers);
int sum3 = GetSum3(numbers);
int sum4 = GetSum4(numbers);
}
private static int GetSum2(List<int> numbers)
{
int total = 0;
foreach (int number in numbers)
{
total += number;
}
return total;
}
private static int GetSum3(List<int> numbers)
{
int total = 0;
for (int i = 0; i < numbers.Count; i++)
{
total += numbers[i];
}
return total;
}
private static int GetSum4(List<int> numbers)
{
int total = 0;
numbers.ForEach((number) =>
{
total += number;
});
return total;
}
}
Let's assume you have a function that returns a lazily-enumerated object:
struct AnimalCount
{
    public int Chickens;
    public int Goats;
}
IEnumerable<AnimalCount> FarmsInEachPen()
{
    ....
    yield return new AnimalCount { Chickens = x, Goats = y };
    ....
}
You also have two functions that consume two separate IEnumerables, for example:
ConsumeChicken(IEnumerable<int>);
ConsumeGoat(IEnumerable<int>);
How can you call ConsumeChicken and ConsumeGoat a) without converting FarmsInEachPen() with ToList() beforehand, because it might have two zillion records, and b) without multi-threading?
Basically:
ConsumeChicken(FarmsInEachPen().Select(x => x.Chickens));
ConsumeGoats(FarmsInEachPen().Select(x => x.Goats));
But without forcing the double enumeration.
I can solve it with multithreading, but it gets unnecessarily complicated, with a buffer queue for each list.
So I'm looking for a way to split the AnimalCount enumerator into two int enumerators without fully evaluating AnimalCount. There is no problem running ConsumeGoat and ConsumeChicken together in lock-step.
I can feel the solution just out of my grasp but I'm not quite there. I'm thinking along the lines of a helper function that returns an IEnumerable to be fed into ConsumeChicken, where each time the iterator is advanced it internally calls ConsumeGoat, thus executing the two functions in lock-step. Except, of course, I don't want to call ConsumeGoat more than once.
I don't think there is a way to do what you want: ConsumeChickens(IEnumerable<int>) and ConsumeGoats(IEnumerable<int>) are called sequentially, and each of them enumerates the list separately. How would that work without two separate enumerations of the list?
Depending on the situation, a better solution is to have ConsumeChicken(int) and ConsumeGoat(int) methods (which each consume a single item), and call them in alternation. Like this:
foreach(var animal in animals)
{
    ConsumeChicken(animal.Chickens);
    ConsumeGoat(animal.Goats);
}
This will enumerate the animals collection only once.
Also, a note: depending on your LINQ provider and what exactly you're trying to do, there may be better options. For example, if you're trying to get the total sum of both chickens and goats from a database using LINQ to SQL or LINQ to Entities, the following query..
from a in animals
group a by 0 into g
select new
{
TotalChickens = g.Sum(x => x.Chickens),
TotalGoats = g.Sum(x => x.Goats)
}
will result in a single query, and do the summation on the database-end, which is greatly preferable to pulling the entire table over and doing the summation on the client end.
The way you have posed your problem, there is no way to do this. IEnumerable<T> is a pull enumerable - that is, you can GetEnumerator to the front of the sequence and then repeatedly ask "Give me the next item" (MoveNext/Current). You can't, on one thread, have two different things pulling from the animals.Select(a => a.Chickens) and animals.Select(a => a.Goats) at the same time. You would have to do one then the other (which would require materializing the second).
The suggestion BlueRaja made is one way to change the problem slightly. I would suggest going that route.
The other alternative is to utilize IObservable<T> from Microsoft's reactive extensions (Rx), a push enumerable. I won't go into the details of how you would do that, but it's something you could look into.
Edit:
The above assumes that ConsumeChickens and ConsumeGoats both return void, or at least do not themselves return IEnumerable<T>, which seems like an obvious assumption.
Actually, the simplest way to achieve what you want is to convert the FarmsInEachPen return value into a push collection, i.e. an IObservable, and use Reactive Extensions to work with it:
var observable = new Subject<AnimalCount>();
// Subscribe both consumers up front; Do() on its own is lazy and would never run.
observable.Subscribe(x => DoSomethingWithChicken(x.Chickens));
observable.Subscribe(x => DoSomethingWithGoat(x.Goats));
foreach (var item in FarmsInEachPen())
{
    observable.OnNext(item);
}
observable.OnCompleted();
I figured it out, thanks in large part to the path that @Lee put me on.
You need to share a single enumerator between the two zips, and use an adapter function to project the correct element into the sequence.
private static IEnumerable<object> ConsumeChickens(IEnumerable<int> xList)
{
foreach (var x in xList)
{
Console.WriteLine("X: " + x);
yield return null;
}
}
private static IEnumerable<object> ConsumeGoats(IEnumerable<int> yList)
{
foreach (var y in yList)
{
Console.WriteLine("Y: " + y);
yield return null;
}
}
private static IEnumerable<int> SelectHelper(IEnumerator<AnimalCount> enumerator, int i)
{
    // The chicken side (i == 0) drives the shared enumerator forward;
    // the goat side (i == 1) only reads Current after each advance.
    bool c = i != 0 || enumerator.MoveNext();
while (c)
{
if (i == 0)
{
yield return enumerator.Current.Chickens;
c = enumerator.MoveNext();
}
else
{
yield return enumerator.Current.Goats;
}
}
}
private static void Main(string[] args)
{
    // GetAnimals() is the lazy IEnumerable<AnimalCount> source (not shown);
    // both helper sequences share this single enumerator.
    var enumerator = GetAnimals().GetEnumerator();
    var chickensList = ConsumeChickens(SelectHelper(enumerator, 0));
    var goatsList = ConsumeGoats(SelectHelper(enumerator, 1));
    // Zip drives both consumers in lock-step; ToList() forces the evaluation.
    var temp = chickensList.Zip(goatsList, (i, i1) => (object) null);
    temp.ToList();
    // "iterations" is a counter kept by the (not shown) GetAnimals().
    Console.WriteLine("Total iterations: " + iterations);
}
I have a csv file with 30,000 lines. I have to select many values based on many conditions, so instead of many loops and "ifs" I decided to use LINQ. I have written a class to read the csv. It implements IEnumerable so it can be used with LINQ. This is my enumerator:
class CSVEnumerator : IEnumerator
{
private CSVReader _csv;
private int _index;
public CSVEnumerator(CSVReader csv)
{
_csv = csv;
_index = -1;
}
public void Reset(){_index = -1;}
public object Current
{
get
{
return new CSVRow(_index,_csv);
}
}
public bool MoveNext()
{
return ++_index < _csv.TotalRows;
}
}
It's working, but it's slow. Let's say I want to select the max value in column A over the row range 100..150:
max = (from CSVRow r in csv where r.ID > 100 && r.ID < 150 select r).Max(y=>y["A"]);
This will work, but LINQ searches for the max value across all 30,000 rows instead of 48.
As I said, I could use a loop, but only in this example case; the real conditions are "brutal" :)
Is there any way to override the LINQ collection search? Something like: inspect the query used on my enumerator and, if any of the LINQ conditions in the where clause contain a row-ID filter, serve different data based on that.
I don't want to copy part of the data into another array/collection, and the problem is not in my csv reader: accessing any single row by ID is fast; the only problem is accessing all 30,000 of them.
Any help appreciated :-)
If you wanted to be able to use LINQ for this efficiently, you would need to use expression trees, in a similar (but much simpler) way than what various LINQ providers for SQL databases do. While doable, I think it would be quite a lot of code for such a simple task.
Because of that, I think a better solution would be to use a separate method to select the rows you want (and then possibly use LINQ to work with the result).
Also, many operations that return collections (including your original code and my modification) can be simplified by using iterator methods.
So, your code could look something like this:
public static IEnumerable<CSVRow> GetRows(
this CSVReader reader, int idGreaterThan, int idLessThan)
{
for (int i = idGreaterThan + 1; i < idLessThan; i++)
{
yield return new CSVRow(i, reader);
}
}
Here, it's an extension method for CSVReader, but another solution (e.g. actual method on that class) might be more appropriate for you.
Your example would then look something like:
max = csvReader.GetRows(100, 150).Max(y => y["A"]);
(Also, I find it weird that when you have limits 100 and 150, you actually want rows between 101 and 149. But I'm assuming you have a reason for that, so I did the same.)
As far as LINQ is concerned, r.ID is simply a value that is being filtered and so all 30k lines are considered for use in the Max operation. If this is a row index, which seems to be the case here, you can use Skip and Take to avoid comparing all 30k rows.
max = csv.Skip(100).Take(50).Max(y => y["A"]);
@DougM is right about the order of evaluation, but in this case what I would do is take a one-time hit on initialization and generate lookups for any "index" fields: basically, pre-calculate a map (dictionary) from row index to row. That said, this is only useful if you have many repeated queries for a given index field. For example:
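A minimal sketch of that idea, reusing the ID property and column indexer from the question (the 101..149 range mirrors the example filter):

// Build the lookup once, up front: one full pass over the file.
var rowById = new Dictionary<int, CSVRow>();
foreach (CSVRow row in csv)
    rowById[row.ID] = row;

// Repeated queries are then O(1) per row, e.g. the max of column A over IDs 101..149:
var max = Enumerable.Range(101, 49).Max(id => rowById[id]["A"]);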
I have a large collection of strings (up to 1M), alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictionary. I am statically caching the collection, which is up to 50 MB in size, and I always run the LINQ query against the cached collection. My problem is as follows:
Regardless of collection type, performance is much poorer than SQL (up to 200 ms). A similar query against the underlying SQL tables is much quicker (5-10 ms). I have implemented my LINQ query as follows:
public static string ReturnSomething(string query, int limit)
{
StringBuilder sb = new StringBuilder();
foreach (var stringitem in MyCollection.Where(
    x => x.StartsWith(query) && x.Length > query.Length).Take(limit))
{
sb.Append(stringitem);
}
return sb.ToString();
}
It is my understanding that the HashSet, Dictionary, etc. implement lookups using binary tree search instead of the standard enumeration. What are my options for high performance LINQ queries into the advanced collection types?
In your current code you don't make use of any of the special features of the Dictionary / SortedDictionary / HashSet collections; you are using them the same way you would use a List, which is why you see no difference in performance.
If you use a dictionary as an index, where the first few characters of the string are the key and a list of strings is the value, you can use the search string to pick out the small part of the entire collection that could possibly contain matches.
I wrote the class below to test this. If I populate it with a million strings and search with an eight character string it rips through all possible matches in about 3 ms. Searching with a one character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one character strings takes about 25 ms.
The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to select what indexes to create to optimise it for your conditions.
public class IndexedList {
private class Index : Dictionary<string, List<string>> {
private int _indexLength;
public Index(int indexLength) {
_indexLength = indexLength;
}
public void Add(string value) {
if (value.Length >= _indexLength) {
string key = value.Substring(0, _indexLength);
List<string> list;
if (!this.TryGetValue(key, out list)) {
Add(key, list = new List<string>());
}
list.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
    List<string> list;
    // Guard against prefixes that have no entries at all.
    if (!TryGetValue(query.Substring(0, _indexLength), out list))
        return Enumerable.Empty<string>();
    return list
        .Where(s => s.Length > query.Length && s.StartsWith(query))
        .Take(limit);
}
}
private Index _index1;
private Index _index2;
private Index _index4;
private Index _index8;
public IndexedList(IEnumerable<string> values) {
_index1 = new Index(1);
_index2 = new Index(2);
_index4 = new Index(4);
_index8 = new Index(8);
foreach (string value in values) {
_index1.Add(value);
_index2.Add(value);
_index4.Add(value);
_index8.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
if (query.Length >= 8) return _index8.Find(query, limit);
if (query.Length >= 4) return _index4.Find(query,limit);
if (query.Length >= 2) return _index2.Find(query,limit);
return _index1.Find(query, limit);
}
}
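Usage would look something like this (variable names are illustrative):

// Build the four indexes once, then query repeatedly.
var indexed = new IndexedList(myMillionStrings);
foreach (string match in indexed.Find("abc", 1000))
    Console.WriteLine(match);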
I bet you have an index on the column, so SQL Server can do the comparison in O(log n) operations rather than O(n). To imitate the SQL Server behaviour, use a sorted collection and find all strings s such that s >= query, then look at values until you find one that does not start with query, and finally apply the additional length filter. This is what is called a range scan (Oracle) or an index seek (SQL Server).
This is some example code; I haven't tested it heavily, but it should get the idea across.
// Note: list must be sorted (ordinally) before being passed to this function
IEnumerable<string> FindStringsThatStartWith(List<string> list, string query) {
    // Binary search for the first element >= query (a lower bound).
    int low = 0, high = list.Count;
    while (low < high) {
        int mid = (low + high) / 2;
        if (string.CompareOrdinal(list[mid], query) < 0)
            low = mid + 1;
        else
            high = mid;
    }
    // Scan forward while the prefix still matches, applying the length filter.
    while (low < list.Count && list[low].StartsWith(query)) {
        if (list[low].Length > query.Length)
            yield return list[low];
        low++;
    }
}
If you're doing a "starts with", only care about ordinal comparisons, and can keep the collection sorted (again in ordinal order), then I would suggest keeping the values in a list. You can then binary search to find the first value which starts with the right prefix, then go down the list linearly yielding results until the first value which doesn't start with the right prefix.
In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".
None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.
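To make that concrete, here is a minimal sketch of the two-bound approach (the names are mine; it assumes an ordinally sorted List<string> and a non-empty prefix whose last character is below char.MaxValue):

// Lower bound: index of the first element >= value, using BinarySearch's
// complemented insertion point when the value itself is absent.
static int LowerBound(List<string> sorted, string value)
{
    int index = sorted.BinarySearch(value, StringComparer.Ordinal);
    return index < 0 ? ~index : index;
}

static IEnumerable<string> WithPrefix(List<string> sorted, string prefix)
{
    // The lexicographically-first string that no longer has the prefix:
    // bump the last character, e.g. "ABC" -> "ABD".
    string successor = prefix.Substring(0, prefix.Length - 1)
                     + (char)(prefix[prefix.Length - 1] + 1);
    int start = LowerBound(sorted, prefix);
    int end = LowerBound(sorted, successor);
    for (int i = start; i < end; i++)
        // Apply the length criterion; the StartsWith guard covers duplicate edge cases.
        if (sorted[i].Length > prefix.Length && sorted[i].StartsWith(prefix))
            yield return sorted[i];
}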
If you are trying to optimize looking up a list of strings with a given prefix, you might want to take a look at implementing a trie (not to be confused with a regular tree) data structure in C#.
Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.
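A minimal trie sketch (illustrative only; a production version would also want removal and enumeration of the matching strings):

public class TrieNode
{
    private readonly Dictionary<char, TrieNode> _children = new Dictionary<char, TrieNode>();
    public bool IsWord;

    // One node per character, one level per position in the string.
    public void Insert(string word)
    {
        var node = this;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node._children.TryGetValue(c, out child))
                node._children[c] = child = new TrieNode();
            node = child;
        }
        node.IsWord = true;
    }

    // Prefix lookup costs O(prefix length), independent of how many strings are stored.
    public bool StartsWith(string prefix)
    {
        var node = this;
        foreach (char c in prefix)
        {
            if (!node._children.TryGetValue(c, out node))
                return false;
        }
        return true;
    }
}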
About LINQ to Objects in general: it's not unusual to see a speed reduction compared to SQL. The net is littered with articles analyzing its performance.
Just looking at your code, I would say that you should reorder the comparison to take advantage of short-circuiting when using boolean operators:
foreach (var stringitem in MyCollection.Where(
    x => x.Length > query.Length && x.StartsWith(query)).Take(limit))
The comparison of length is always going to be an O(1) operation (as the length is being stored as part of the string, it doesn't count each character every time), whereas the call to StartsWith is going to be an O(N) operation, where N is the length of query (or the length of the string, whichever is smaller).
By placing the comparison of length before the call to StartsWith, if that comparison fails, you save yourself some extra cycles which could add up when processing large numbers of items.
I don't think that a lookup table is going to help you here, as lookup tables are good when you are comparing the entire key, not parts of the key, like you are doing with the call to StartsWith.
Rather, you might be better off using a tree structure which is split based on the letters in the words in the list.
However, at that point, you are really just recreating what SQL Server is doing (in the case of indexes) and that would just be a duplication of effort on your part.
I think the problem is that LINQ has no way to use the fact that your sequence is already sorted. In particular, it cannot know that applying the StartsWith filter preserves the order.
I would suggest using the List.BinarySearch method together with an IComparer<string> that compares only the first query-length characters (this might be tricky, since it's not clear whether the query string will always be the first or the second parameter to Compare).
You could even use the standard string comparison, since BinarySearch returns a negative number which you can complement (using ~) in order to get the index of the first element that is larger than your query.
You then have to scan from the returned index (in both directions!) to find all elements matching your query string.
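A sketch of the comparer idea (untested; PrefixComparer is a name invented here). Truncating both arguments sidesteps the question of which side the query lands on:

class PrefixComparer : IComparer<string>
{
    private readonly int _length;
    public PrefixComparer(int length) { _length = length; }

    public int Compare(string x, string y)
    {
        // Truncate both sides to the prefix length, so any string that
        // starts with the query compares equal to it.
        string a = x.Length > _length ? x.Substring(0, _length) : x;
        string b = y.Length > _length ? y.Substring(0, _length) : y;
        return string.CompareOrdinal(a, b);
    }
}

BinarySearch with new PrefixComparer(query.Length) then lands somewhere inside the block of matches, and you scan outwards in both directions from the returned index, as described above.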