Performance regarding cached thread-safe IEnumerable<T> implementation


I created the ThreadSafeCachedEnumerable<T> class to improve performance where long-running queries were being reused. The idea was to get an enumerator from an IEnumerable<T> and add items to a cache on each call to MoveNext(). The following is my current implementation:
/// <summary>
/// Wraps an IEnumerable<T> and provides a thread-safe means of caching the values.
/// </summary>
/// <typeparam name="T"></typeparam>
class ThreadSafeCachedEnumerable<T> : IEnumerable<T>
{
// An enumerator from the original IEnumerable<T>
private IEnumerator<T> enumerator;
// The items we have already cached (from this.enumerator)
private IList<T> cachedItems = new List<T>();
public ThreadSafeCachedEnumerable(IEnumerable<T> enumerable)
{
this.enumerator = enumerable.GetEnumerator();
}
public IEnumerator<T> GetEnumerator()
{
// The index into the sequence
int currentIndex = 0;
// We will break with yield break
while (true)
{
// The currentIndex will never be decremented,
// so we can check without locking first
if (currentIndex < this.cachedItems.Count)
{
var current = this.cachedItems[currentIndex];
currentIndex += 1;
yield return current;
}
else
{
// If !(currentIndex < this.cachedItems.Count),
// we need to synchronize access to this.enumerator
lock (enumerator)
{
// See if we have more cached items ...
if (currentIndex < this.cachedItems.Count)
{
var current = this.cachedItems[currentIndex];
currentIndex += 1;
yield return current;
}
else
{
// ... otherwise, we'll need to get the next item from this.enumerator.MoveNext()
if (this.enumerator.MoveNext())
{
// capture the current item and cache it, then increment the currentIndex
var current = this.enumerator.Current;
this.cachedItems.Add(current);
currentIndex += 1;
yield return current;
}
else
{
// We reached the end of the enumerator - we're done
yield break;
}
}
}
}
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return this.GetEnumerator();
}
}
I simply lock (this.enumerator) when no more items appear to be in the cache, just in case another thread is about to add another item (I assume that calling MoveNext() on this.enumerator from two threads is a bad idea).
The performance is great when retrieving previously cached items, but it starts to suffer when getting many items for the first time (due to the constant locking). Any suggestions for increasing the performance?
Edit: The new Reactive Framework solves the problem outlined above, using the System.Linq.EnumerableEx.MemoizeAll() extension method.
Internally, MemoizeAll() uses a System.Linq.EnumerableEx.MemoizeAllEnumerable<T> (found in the System.Interactive assembly), which is similar to my ThreadSafeCachedEnumerable<T> (sorta).
Here's an awfully contrived example that prints the contents of an Enumerable (numbers 1-10) very slowly, then quickly prints the contents a second time (because it cached the values):
// Create an Enumerable<int> containing numbers 1-10, using Thread.Sleep() to simulate work
var slowEnum = EnumerableEx.Generate(1, currentNum => (currentNum <= 10), currentNum => currentNum, previousNum => { Thread.Sleep(250); return previousNum + 1; });
// This decorates the slow enumerable with one that will cache each value.
var cachedEnum = slowEnum.MemoizeAll();
// Print the numbers
foreach (var num in cachedEnum.Repeat(2))
{
Console.WriteLine(num);
}

A couple of recommendations:
1. It is now generally accepted practice not to make container classes responsible for locking. Someone calling your cached enumerator, for instance, might also want to prevent new entries from being added to the container while enumerating, which means that locking would occur twice. Therefore, it's best to defer that responsibility to the caller.
2. Your caching depends on the enumerator always returning items in the same order, which is not guaranteed. It's better to use a Dictionary or HashSet. Similarly, items may be removed between calls, invalidating the cache.
3. It is generally not recommended to lock on publicly accessible objects. That includes the wrapped enumerator. Exceptions are conceivable, for example when you're absolutely certain you're the only instance holding a reference to the container class you're enumerating over. This would also largely invalidate my objections under #2.

Locking in .NET is normally very quick (if there is no contention). Has profiling identified locking as the source of the performance problem? How long does it take to call MoveNext on the underlying enumerator?
Additionally, the code as it stands is not thread-safe. You cannot safely call this.cachedItems[currentIndex] on one thread (in if (currentIndex < this.cachedItems.Count)) while invoking this.cachedItems.Add(current) on another. From the List(T) documentation: "A List(T) can support multiple readers concurrently, as long as the collection is not modified." To be thread-safe, you would need to protect all access to this.cachedItems with a lock (if there's any chance that one or more threads could modify it).
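To make the point about protecting all access to the cache concrete, here is a minimal sketch (not the asker's code and not a library type). The class name LockedCachedEnumerable<T>, the private gate object and the TryGetItem helper are my own assumptions. Every read and write of the list goes through one lock on a private object, and items are yielded outside the lock so a slow consumer does not block other threads.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical name; a sketch only, not part of any framework.
public sealed class LockedCachedEnumerable<T> : IEnumerable<T>
{
    private readonly object gate = new object();      // private lock object, never exposed
    private readonly IEnumerator<T> source;           // note: never disposed in this sketch
    private readonly List<T> cache = new List<T>();
    private bool exhausted;

    public LockedCachedEnumerable(IEnumerable<T> enumerable)
    {
        source = enumerable.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (TryGetItem(index, out T item))
        {
            index++;
            yield return item;   // yielded after the lock has been released
        }
    }

    // All reads and writes of the cache happen under the same lock.
    private bool TryGetItem(int index, out T item)
    {
        lock (gate)
        {
            if (index < cache.Count)
            {
                item = cache[index];
                return true;
            }
            if (!exhausted && source.MoveNext())
            {
                item = source.Current;
                cache.Add(item);
                return true;
            }
            exhausted = true;
            item = default(T);
            return false;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class CachedDemo
{
    public static void Main()
    {
        int pulls = 0;
        IEnumerable<int> Slow()
        {
            for (int i = 1; i <= 3; i++) { pulls++; yield return i; }
        }

        var cached = new LockedCachedEnumerable<int>(Slow());
        foreach (int n in cached) Console.WriteLine(n);   // pulls from the source
        foreach (int n in cached) Console.WriteLine(n);   // served from the cache
        Console.WriteLine("source pulled " + pulls + " times");
    }
}
```

This takes the lock even for cached items, so it trades a little speed for correctness. Whether that cost matters is something only profiling can tell you.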

Related

Expensive IEnumerable: Any way to prevent multiple enumerations without forcing an immediate enumeration? [duplicate]

This question already has answers here:
Is there an IEnumerable implementation that only iterates over it's source (e.g. LINQ) once?
I have a very large enumeration and am preparing an expensive deferred operation on it (e.g. sorting it). I'm then passing this into a function which may or may not consume the IEnumerable, depending on some logic of its own.
Here's an illustration:
IEnumerable<Order> expensiveEnumerable = fullCatalog.OrderBy(c => Prioritize(c));
MaybeFullFillSomeOrders(expensiveEnumerable);
// Elsewhere... (example use-case for multiple enumerations, not real code)
void MaybeFullFillSomeOrders(IEnumerable<Order> nextUpOrders){
if(notAGoodTime())
return;
foreach(var order in nextUpOrders)
collectSomeInfo(order);
processInfo();
foreach(var order in nextUpOrders) {
maybeFulfill(order);
if(atCapacity())
break;
}
}
I would like to prepare my input to the other function such that:
1. If they do not consume the enumerable, the performance price of sorting is not paid. (This already precludes calling e.g. ToList() or ToArray() on it.)
2. If they choose to enumerate multiple times (perhaps not realizing how expensive it would be in this case), I want some defence in place to prevent the multiple enumeration.
3. Ideally, the result is still an IEnumerable<T>.
The best solution I've come up with is to use Lazy<>
var expensive = new Lazy<List<Order>>(
() => fullCatalog.OrderBy(c => Prioritize(c)).ToList());
This appears to satisfy criteria 1 and 2, but has a couple of drawbacks:
1. I have to change the interface of all downstream usages to expect a Lazy.
2. The full list (which in this case was built up from a SelectMany() on several smaller partitions) would need to be allocated as a new single contiguous list in memory. I'm not sure there's an easy way around this if I want to "cache" the sort result, but if you know of one I'm all ears.
One idea I had to solve the first problem was to wrap Lazy<> in some custom class that either implements or can implicitly be converted to an IEnumerable<T>, but I'm hoping someone knows of a more elegant approach.
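A minimal sketch of that wrapper idea, assuming a hypothetical class name LazyList<T> (it is not a framework type). It hides the Lazy<List<T>> behind IEnumerable<T>, so downstream code keeps its signature. It does not address the second drawback: the whole result is still materialized into one list on first use.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: defers the expensive query until the first enumeration,
// then serves every later enumeration from the materialized list.
public sealed class LazyList<T> : IEnumerable<T>
{
    private readonly Lazy<List<T>> items;

    public LazyList(IEnumerable<T> source)
    {
        // Lazy<T> with a factory defaults to ExecutionAndPublication,
        // so the factory runs at most once even with concurrent callers.
        items = new Lazy<List<T>>(() => source.ToList());
    }

    public IEnumerator<T> GetEnumerator() => items.Value.GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class LazyListDemo
{
    public static void Main()
    {
        int keyCalls = 0;                       // counts how much sorting work was done
        var catalog = new[] { 3, 1, 2 };

        IEnumerable<int> expensive = new LazyList<int>(
            catalog.OrderBy(c => { keyCalls++; return c; }));

        Console.WriteLine("before enumeration: " + keyCalls);

        foreach (int c in expensive) { }        // first pass performs the sort
        foreach (int c in expensive) { }        // second pass reuses the list

        Console.WriteLine("after two passes: " + keyCalls);
    }
}
```

If the consumer never enumerates, the sort never runs. If it enumerates several times, the sort still runs only once.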

You certainly could write your own IEnumerable<T> implementation that wraps another one, remembering all the elements it's already seen (and whether it's exhausted or not). If you need it to be thread-safe that becomes trickier, and you'd need to remember that at any time there may be multiple iterators working against the same IEnumerable<T>.
Fundamentally I think it would come down to working out what to do when asked for the next element (which is somewhat-annoyingly split into MoveNext() and Current, but that can probably be handled...):
If you've already read the next element within another iterator, you can yield it from your buffer
If you've already discovered that there is no next element, you can return that immediately
Otherwise, you need to ask the original iterator for the next element, and remember it for all the other wrapped iterators.
The other aspect that's tricky is knowing when to dispose of the underlying IEnumerator<T> - if you don't need to do that, it makes things simpler.
As a very sketchy attempt that I haven't even attempted to compile, and which is definitely not thread-safe, you could try something like this:
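using System.Collections;
using System.Collections.Generic;
public class LazyEnumerable<T> : IEnumerable<T>
{
private readonly IEnumerator<T> iterator;
// Must be initialized, otherwise buffer.Count throws a NullReferenceException
private readonly List<T> buffer = new List<T>();
private bool completed = false;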
public LazyEnumerable(IEnumerable<T> original)
{
// TODO: You could be even lazier, only calling
// GetEnumerator when you first need an element
iterator = original.GetEnumerator();
}
// Explicit interface implementation for the non-generic IEnumerable
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
public IEnumerator<T> GetEnumerator()
{
int index = 0;
while (true)
{
// If we already have the element, yield it
if (index < buffer.Count)
{
yield return buffer[index];
}
// If we've yielded everything in the buffer and some
// other iterator has come to the end of the original,
// we're done.
else if (completed)
{
yield break;
}
// Otherwise, see if there's anything left in the original
// iterator.
else
{
bool hasNext = iterator.MoveNext();
if (hasNext)
{
var current = iterator.Current;
buffer.Add(current);
yield return current;
}
else
{
completed = true;
yield break;
}
}
index++;
}
}
}

Concurrent collection enumerator

I'm currently writing my own implementation of a priority queue / sorted list, and I would like it to be concurrent.
To make it thread-safe I'm using lock(someObject), and I would like to verify some behavior of locks in C#.
The internal representation of my sorted list is basically a linked list: a head and slots linked together.
Something like:
internal class Slot
{
internal T Value;
internal Slot Next;
public Slot(T value, Slot next = null)
{
Value = value;
Next = next;
}
}
Every time I manipulate head I have to use lock(someObject) for thread safety.
In order to implement the ICollection interface I have to implement public IEnumerator<T> GetEnumerator(). In this method I have to take my head and read from it, so I should use the lock.
public IEnumerator<T> GetEnumerator()
{
lock (syncLock)
{
var curr = head;
while (curr != null)
{
yield return curr.Value;
curr = curr.Next;
}
}
}
My question is: is syncLock held for the whole lifetime of the enumerator (so it is released only after reaching the end of the method), or is it automatically released after yielding a value?

Thanks to everyone in the comments; here's a summary.
Answer: yes, syncLock stays locked for the whole enumeration, not just until the first yield. It is released only when the iterator runs to the end of the method or the enumerator is disposed, so this is a really bad idea.
Possible solutions:
- make the collection not thread-safe
- obtain the lock, copy the whole collection, and return an enumerator over the copy (@Evk)
- use some kind of boolean flag, set it to true while enumerating over the collection, and throw an exception when Add, Clear or Remove is called; this is similar to what List does by default (@ManfredRadlwimmer)
- make the collection immutable (@InBetween)
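As a rough sketch of the "copy under the lock" option: the class name SnapshotList<T> and its AddFirst method are made up for illustration and are not the asker's actual collection. GetEnumerator is deliberately not an iterator block here, so the lock is taken and released inside that one call instead of being held across yields.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical minimal collection, only to illustrate snapshot enumeration.
public sealed class SnapshotList<T> : IEnumerable<T>
{
    private sealed class Slot
    {
        internal readonly T Value;
        internal readonly Slot Next;
        internal Slot(T value, Slot next) { Value = value; Next = next; }
    }

    private readonly object syncLock = new object();
    private Slot head;

    public void AddFirst(T value)
    {
        lock (syncLock) { head = new Slot(value, head); }
    }

    public IEnumerator<T> GetEnumerator()
    {
        var snapshot = new List<T>();
        lock (syncLock)
        {
            // Copy while holding the lock; the lock is released before returning.
            for (Slot curr = head; curr != null; curr = curr.Next)
                snapshot.Add(curr.Value);
        }
        return snapshot.GetEnumerator();   // enumerates the copy, lock-free
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class SnapshotDemo
{
    public static void Main()
    {
        var list = new SnapshotList<int>();
        list.AddFirst(2);
        list.AddFirst(1);

        foreach (int x in list)
        {
            list.AddFirst(0);              // does not affect the running enumeration
            Console.WriteLine(x);
        }
    }
}
```

The cost is one O(n) copy per enumeration, and the enumerator sees the collection as it was at the moment GetEnumerator was called.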

C# List<T> indexer thread safety

Until recently, I had been under the assumption that setting an element of a List<T> via indexer is thread safe in the following context.
// Assumes destination.Count >= source.Count
static void Function<T,U>(List<T> source, Func<T,U> converter, List<U> destination)
{
Parallel.ForEach(Partitioner.Create(0, source.Count), range =>
{
for(int i = range.Item1; i < range.Item2; i++)
{
destination[i] = converter(source[i]);
}
});
}
Since List<T> stores its elements in an array internally and setting one by index shouldn't necessitate resizing, this seemed like a reasonable leap of faith. Looking at the implementation of List<T> in .NET Core however, it appears that the indexer's setter modifies some internal state (see below).
// Sets or Gets the element at the given index.
public T this[int index]
{
get
{
// Following trick can reduce the range check by one
if ((uint)index >= (uint)_size)
{
ThrowHelper.ThrowArgumentOutOfRange_IndexException();
}
Contract.EndContractBlock();
return _items[index];
}
set
{
if ((uint)index >= (uint)_size)
{
ThrowHelper.ThrowArgumentOutOfRange_IndexException();
}
Contract.EndContractBlock();
_items[index] = value;
_version++;
}
}
So should I assume that List<T> is not thread-safe even when each thread is only getting/setting elements from its own portion of the collection?

Have a read here:
https://msdn.microsoft.com/en-us/library/6sh2ey19.aspx#Anchor_10
To answer your question, no - as per the documentation, it's not guaranteed to be thread safe.
Even if the current implementation appeared to be thread safe (which it doesn't, anyway), it would still be a bad idea to make that assumption. Since the documentation explicitly says it's not thread safe - future versions may legally change the underlying implementation to no longer be thread safe and break any assumption you previously relied on.
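One way to sidestep the question, sketched here under my own assumptions (the ParallelConvert class and its Convert method are hypothetical names), is to write into a plain array instead of a List<T>. An array element write touches only that element and there is no shared version counter, so threads writing disjoint index ranges do not interfere. The source is only read, which List<T> documents as safe for concurrent readers as long as nobody modifies it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelConvert
{
    // Hypothetical helper: converts source into a new array in parallel.
    public static U[] Convert<T, U>(IReadOnlyList<T> source, Func<T, U> converter)
    {
        var destination = new U[source.Count];

        // Partitioner.Create(0, 0) throws, so handle the empty case up front.
        if (source.Count == 0) return destination;

        Parallel.ForEach(Partitioner.Create(0, source.Count), range =>
        {
            // Each range is disjoint, so no two threads write the same element.
            for (int i = range.Item1; i < range.Item2; i++)
                destination[i] = converter(source[i]);
        });

        return destination;
    }

    public static void Main()
    {
        var input = new List<int> { 1, 2, 3, 4, 5 };
        int[] doubled = Convert(input, x => x * 2);
        Console.WriteLine(string.Join(", ", doubled));
    }
}
```

If the caller really needs a List<U>, it can wrap the finished array afterwards on a single thread.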

Help me understand the code snippet in c#

I am reading this blog: Pipes and filters pattern
I am confused by this code snippet:
public class Pipeline<T>
{
private readonly List<IOperation<T>> operations = new List<IOperation<T>>();
public Pipeline<T> Register(IOperation<T> operation)
{
operations.Add(operation);
return this;
}
public void Execute()
{
IEnumerable<T> current = new List<T>();
foreach (IOperation<T> operation in operations)
{
current = operation.Execute(current);
}
IEnumerator<T> enumerator = current.GetEnumerator();
while (enumerator.MoveNext());
}
}
What is the purpose of this statement: while (enumerator.MoveNext());? It seems to be a no-op.

First consider this:
IEnumerable<T> current = new List<T>();
foreach (IOperation<T> operation in operations)
{
current = operation.Execute(current);
}
This code appears to be creating nested enumerables, each of which takes elements from the previous, applies some operation to them, and passes the result to the next. But it only constructs the enumerables. Nothing actually happens yet. It's just ready to go, stored in the variable current. There are lots of ways to implement IOperation.Execute but it could be something like this.
IEnumerable<T> Execute(IEnumerable<T> ts)
{
foreach (T t in ts)
yield return this.operation(t); // Perform some operation on t.
}
Another option suggested in the article is a sort:
IEnumerable<T> Execute(IEnumerable<T> ts)
{
// Thank-you LINQ!
// This was 10 lines of non-LINQ code in the original article.
return ts.OrderBy(t => t.Foo);
}
Now look at this:
IEnumerator<T> enumerator = current.GetEnumerator();
while (enumerator.MoveNext());
This actually causes the chain of operations to be performed. When the elements are requested from the enumeration, it causes elements from the original enumerable to be passed through the chain of IOperations, each of which performs some operation on them. The end result is discarded so only the side-effect of the operation is interesting - such as writing to the console or logging to a file. This would have been a simpler way to write the last two lines:
foreach (T t in current) {}
Another thing to observe is that the initial list that starts the process is an empty list so for this to make sense some instances of T have to be created inside the first operation. In the article this is done by asking the user for input from the console.
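To see the deferred execution for yourself, here is a small self-contained sketch. The two operations (ProduceNumbers and PrintDoubled) are invented for this example and are not from the article; IOperation<T> is declared the way the question's snippet implies.

```csharp
using System;
using System.Collections.Generic;

public interface IOperation<T>
{
    IEnumerable<T> Execute(IEnumerable<T> input);
}

// Hypothetical first stage: ignores its (empty) input and produces values.
public sealed class ProduceNumbers : IOperation<int>
{
    public IEnumerable<int> Execute(IEnumerable<int> input)
    {
        for (int i = 1; i <= 3; i++)
        {
            Console.WriteLine("produce " + i);
            yield return i;
        }
    }
}

// Hypothetical second stage: doubles each value and logs it as a side effect.
public sealed class PrintDoubled : IOperation<int>
{
    public IEnumerable<int> Execute(IEnumerable<int> input)
    {
        foreach (int i in input)
        {
            Console.WriteLine("doubled " + (i * 2));
            yield return i * 2;
        }
    }
}

public static class PipelineDemo
{
    public static void Main()
    {
        IEnumerable<int> current = new List<int>();
        var operations = new IOperation<int>[] { new ProduceNumbers(), new PrintDoubled() };
        foreach (IOperation<int> operation in operations)
            current = operation.Execute(current);

        // Only the chain of enumerables exists at this point.
        Console.WriteLine("pipeline built, nothing has run yet");

        // Draining the enumerator is what pulls items through every stage.
        using (IEnumerator<int> enumerator = current.GetEnumerator())
        {
            while (enumerator.MoveNext()) { }
        }
    }
}
```

The "pipeline built" line is printed before any "produce" or "doubled" line, and the stages then interleave item by item rather than running one stage to completion first.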

In this case, the while (enumerator.MoveNext()); is simply evaluating all the items that are returned by the final IOperation<T>. It looks a little confusing, but the empty List<T> is only created in order to supply a value to the first IOperation<T>.
In many collections this would do exactly nothing, as you suggest, but given that we are talking about the pipes and filters pattern it is likely that the final value is some sort of iterator that will cause code to be executed. It could be something like this, for example (assuming that T is an integer):
public class WriteToConsoleOperation : IOperation<int>
{
public IEnumerable<int> Execute(IEnumerable<int> ints)
{
foreach (var i in ints)
{
Console.WriteLine(i);
yield return i;
}
}
}
So calling MoveNext() for each item on the IEnumerator<int> returned by this iterator will return each of the values (which are ignored in the while loop) but also output each of the values to the console.
Does that make sense?

while (enumerator.MoveNext());
Inside the current block of code, there is no visible effect (it just moves through all the items in the enumeration). The displayed code doesn't act on the current element in the enumeration. What might be happening is that the MoveNext() method, while moving to the next element, does something to the objects in the collection (updating an internal value, pulling the next item from the database, etc.). Since the type is List<T> this is probably not the case, but in other instances it could be.

Yield keyword value added?

I'm still trying to find where I would use the "yield" keyword in a real situation.
I have seen this thread on the subject:
What is the yield keyword used for in C#?
But in the accepted answer, they have this as an example, where someone is iterating over Integers():
public IEnumerable<int> Integers()
{
yield return 1;
yield return 2;
yield return 4;
yield return 8;
yield return 16;
yield return 16777216;
}
But why not just use List<int> here instead? It seems more straightforward.

If you build and return a List (say it has 1 million elements), that's a big chunk of memory, and also of work to create it.
Sometimes the caller may only want to know what the first element is. Or they might want to write them to a file as they get them, rather than building the whole list in memory and then writing it to a file.
That's why it makes more sense to use yield return. It doesn't look that different to building the whole list and returning it, but it's very different because the whole list doesn't have to be created in memory before the caller can look at the first item on it.
When the caller says:
foreach (int i in Integers())
{
// do something with i
}
Each time the loop requires a new i, it runs a bit more of the code in Integers(). The code in that function is "paused" when it hits a yield return statement.
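A small sketch that makes the "pausing" visible. The Console.WriteLine calls and the early break are my additions to the Integers() example, purely for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class YieldPauseDemo
{
    public static IEnumerable<int> Integers()
    {
        Console.WriteLine("  producing 1");
        yield return 1;
        Console.WriteLine("  producing 2");
        yield return 2;
        Console.WriteLine("  producing 4");   // never reached by Main below
        yield return 4;
    }

    public static void Main()
    {
        foreach (int i in Integers())
        {
            Console.WriteLine("got " + i);
            if (i == 2) break;   // stop early; the rest of Integers() never runs
        }
    }
}
```

The "producing" and "got" lines alternate, and because the loop breaks after 2, the code that would produce 4 is never executed. With a List<int>, all the values would have been built before the loop saw the first one.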

Yield allows you to build methods that produce data without having to gather everything up before returning. Think of it as returning multiple values along the way.
Here are a couple of methods that illustrate the point:
public IEnumerable<String> LinesFromFile(String fileName)
{
using (StreamReader reader = new StreamReader(fileName))
{
String line;
while ((line = reader.ReadLine()) != null)
yield return line;
}
}
public IEnumerable<String> LinesWithEmails(IEnumerable<String> lines)
{
foreach (String line in lines)
{
if (line.Contains("@"))
yield return line;
}
}
Neither of these two methods will read the whole contents of the file into memory, yet you can use them like this:
foreach (String lineWithEmail in LinesWithEmails(LinesFromFile("test.txt")))
Console.Out.WriteLine(lineWithEmail);

You can use yield to build any iterator. That could be a lazily evaluated series (reading lines from a file or database, for example, without reading everything at once, which could be too much to hold in memory), or could be iterating over existing data such as a List<T>.
C# in Depth has a free chapter (6) all about iterator blocks.
I also blogged very recently about using yield for smart brute-force algorithms.
For an example of the lazy file reader:
static IEnumerable<string> ReadLines(string path) {
using (StreamReader reader = File.OpenText(path)) {
string line;
while ((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
This is entirely "lazy"; nothing is read until you start enumerating, and only a single line is ever held in memory.
Note that LINQ-to-Objects makes extensive use of iterator blocks (yield). For example, the Where extension is essentially:
static IEnumerable<T> Where<T>(this IEnumerable<T> data, Func<T, bool> predicate) {
foreach (T item in data) {
if (predicate(item)) yield return item;
}
}
And again, fully lazy - allowing you to chain together multiple operations without forcing everything to be loaded into memory.

yield allows you to process collections that are potentially infinite in size because the entire collection is never loaded into memory in one go, unlike a List-based approach. For instance, an IEnumerable<> of all the prime numbers could be backed by an appropriate algorithm for finding primes, whereas a List approach would always be finite in size and therefore incomplete. In this example, using yield also allows the computation of the next element to be deferred until it is required.
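A minimal sketch of such a prime sequence, using plain trial division (chosen only for brevity; any prime-finding algorithm would do). The PrimesDemo class name is made up, and in practice the sequence is bounded by the range of int rather than being truly infinite.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrimesDemo
{
    // Conceptually endless: the loop has no exit, the caller decides when to stop.
    public static IEnumerable<int> Primes()
    {
        var found = new List<int>();   // primes discovered so far, used as divisors
        for (int candidate = 2; ; candidate++)
        {
            bool isPrime = true;
            foreach (int p in found)
            {
                if ((long)p * p > candidate) break;              // no divisor up to sqrt(candidate)
                if (candidate % p == 0) { isPrime = false; break; }
            }
            if (isPrime)
            {
                found.Add(candidate);
                yield return candidate;   // pause here until the caller asks for more
            }
        }
    }

    public static void Main()
    {
        // Take(10) stops pulling after ten items, so the endless loop is harmless.
        Console.WriteLine(string.Join(", ", Primes().Take(10)));
        // prints: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29
    }
}
```

Calling ToList() on Primes() without a Take or similar limit would never finish, which is exactly the difference from a List-based approach.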

A real situation for me is when I want to process a collection that takes a while to populate more smoothly.
Imagine something along these lines (pseudocode):
public IEnumerable<VerboseUserInfo> GetAllUsers()
{
foreach (var userId in userLookupList)
{
VerboseUserInfo info = new VerboseUserInfo();
info.Load(ActiveDirectory.GetLotsOfUserData(userId));
info.Load(WebService.GetSomeMoreInfo(userId));
yield return info;
}
}
Instead of having to wait a minute for the collection to populate before I can start processing items in it, I can start immediately and report back to the user interface as it happens.

You may not always want to use yield instead of returning a list, and in your example you use yield to actually return a list of integers. Depending on whether you want a mutable list or an immutable sequence, you could use a list, an iterator, or some other mutable/immutable collection.
But there are benefits to using yield.
Yield provides an easy way to build lazily evaluated iterators (meaning that only the code needed to get the next element in the sequence is executed when MoveNext() is called; the iterator then returns and does no more computation until the method is called again).
Yield builds a state machine under the covers, which saves you a lot of work by not having to code the states of your generator by hand, giving more concise, simpler code.
Yield generates the iterator plumbing for you, sparing you the details of how to build it. Each call to GetEnumerator() gets its own independent state, but a single enumerator instance is still not safe to share between threads without synchronization.
Yield is much more powerful than it seems at first sight and can be used for much more than just building simple iterators; check out this video to see Jeffrey Richter and his AsyncEnumerator and how yield is used to make coding using the async pattern easy.

You might want to iterate through various collections:
public IEnumerable<ICustomer> Customers()
{
foreach( ICustomer customer in m_maleCustomers )
{
yield return customer;
}
foreach( ICustomer customer in m_femaleCustomers )
{
yield return customer;
}
// or add some constraints...
foreach( ICustomer customer in m_customers )
{
if( customer.Age < 16 )
{
yield return customer;
}
}
// Or....
if( DateTime.Today.Day == 1 ) // e.g. on the first day of the month
{
yield return m_superCustomer;
}
}

I agree with everything everyone has said here about lazy evaluation and memory usage and wanted to add another scenario where I have found the iterators using the yield keyword useful. I have run into some cases where I have to do a sequence of potentially expensive processing on some data where it is extremely useful to use iterators. Rather than processing the entire file immediately, or rolling my own processing pipeline, I can simply use iterators something like this:
IEnumerable<double> GetListFromFile(int idxItem)
{
// read data from file
return dataReadFromFile;
}
IEnumerable<double> ConvertUnits(IEnumerable<double> items)
{
foreach(double item in items)
yield return convertUnits(item);
}
IEnumerable<double> DoExpensiveProcessing(IEnumerable<double> items)
{
foreach(double item in items)
yield return expensiveProcessing(item);
}
IEnumerable<double> GetNextList()
{
return DoExpensiveProcessing(ConvertUnits(GetListFromFile(curIdx++)));
}
The advantage here is that by keeping the input and output to all of the functions IEnumerable<double>, my processing pipeline is completely composable, easy to read, and lazy evaluated so I only have to do the processing I really need to do. This lets me put almost all of my processing in the GUI thread without impacting responsiveness so I don't have to worry about any threading issues.

I came up with this to work around a .NET shortcoming: having to deep-copy a List manually.
I use this:
static public IEnumerable<SpotPlacement> CloneList(List<SpotPlacement> spotPlacements)
{
foreach (SpotPlacement sp in spotPlacements)
{
yield return (SpotPlacement)sp.Clone();
}
}
And at another place:
public object Clone()
{
OrderItem newOrderItem = new OrderItem();
...
newOrderItem._exactPlacements.AddRange(SpotPlacement.CloneList(_exactPlacements));
...
return newOrderItem;
}
I tried to come up with a one-liner that does this, but it's not possible, because yield does not work inside anonymous method blocks.
EDIT:
Better still, use a generic List cloner:
class Utility<T> where T : ICloneable
{
static public IEnumerable<T> CloneList(List<T> tl)
{
foreach (T t in tl)
{
yield return (T)t.Clone();
}
}
}

The method used by yield of saving memory by processing items on-the-fly is nice, but really it's just syntactic sugar. It's been around for a long time. In any language that has function or interface pointers (even C and assembly) you can get the same effect using a callback function / interface.
This fancy stuff:
static IEnumerable<string> GetItems()
{
yield return "apple";
yield return "orange";
yield return "pear";
}
foreach(string item in GetItems())
{
Console.WriteLine(item);
}
is basically equivalent to old-fashioned:
interface ItemProcessor
{
void ProcessItem(string s);
};
class MyItemProcessor : ItemProcessor
{
public void ProcessItem(string s)
{
Console.WriteLine(s);
}
};
static void ProcessItems(ItemProcessor processor)
{
processor.ProcessItem("apple");
processor.ProcessItem("orange");
processor.ProcessItem("pear");
}
ProcessItems(new MyItemProcessor());
