Why does LINQ OrderBy consume more memory? - C#

I want to know why OrderBy consumes more memory than simply copying the list and sorting it.
void printMemoryUsage()
{
    long memory = GC.GetTotalMemory(true);
    long mb = 1024 * 1024;
    Console.WriteLine("memory: " + memory/mb + " MB");
}

var r = new Random();
var list = Enumerable.Range(0, 20*1024*1024).OrderBy(x => r.Next()).ToList();
printMemoryUsage();
var listCopy = list.OrderBy(x => x);
foreach(var v in listCopy)
{
    printMemoryUsage();
    break;
}
Console.ReadKey();
The result I got is:
memory: 128 MB
memory: 288 MB
But copying the list and then sorting it consumes less memory.
void printMemoryUsage()
{
    long memory = GC.GetTotalMemory(true);
    long mb = 1024 * 1024;
    Console.WriteLine("memory: " + memory/mb + " MB");
}

var r = new Random();
var list = Enumerable.Range(0, 20*1024*1024).OrderBy(x => r.Next()).ToList();
printMemoryUsage();
var listCopy = list.ToList();
printMemoryUsage();
listCopy.Sort();
printMemoryUsage();
Console.ReadKey();
Results are:
memory: 128 MB
memory: 208 MB
memory: 208 MB
More testing shows that the memory consumed by OrderBy is twice the list size.

It's not that surprising once you dive into how the two approaches are implemented internally. Take a look at the Reference Source for .NET.
In your second approach, where you call the Sort() method on the list, the internal array of the List object is passed to the TrySZSort method, which is written in native code outside of C#. That means no work for the garbage collector.
private static extern bool TrySZSort(Array keys, Array items, int left, int right);
Now, in your first approach you're using LINQ to sort the enumerable. What really happens when you call .OrderBy() is that an OrderedEnumerable&lt;T&gt; object is constructed. Just calling OrderBy doesn't sort the list; it is only sorted when it is enumerated, i.e. when its GetEnumerator method is called. GetEnumerator is called implicitly behind the scenes when you call ToList or when you enumerate it with a construct like foreach.
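A minimal sketch of that deferred execution (this snippet is illustrative, not from the question):

var data = new List<int> { 3, 1, 2 };
var ordered = data.OrderBy(x => x); // nothing is sorted yet
data.Add(0);                        // mutating the source still affects the result
foreach (var x in ordered)          // the sort happens here, inside GetEnumerator
    Console.Write(x + " ");         // prints: 0 1 2 3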
You're actually sorting the list twice since you're enumerating the list once on this line:
var list = Enumerable.Range(0, 20*1024*1024).OrderBy(x => r.Next()).ToList();
and again when you enumerate via foreach on this line:
var listCopy = list.OrderBy(x => x);
foreach(var v in listCopy)
Since these LINQ methods are not using native code, they rely on the garbage collector to pick up after them. Each of the classes is also creating a bunch of objects (e.g. OrderedEnumerable creates a Buffer<TElement> with another copy of the array). All of these objects consume RAM.

I had to do some research on this one, and found some interesting information. The default List.Sort function performs an in-place sort (no second copy), but does so via a call to Array.Sort, which ultimately calls through to TrySZSort, a heavily optimized, native, unmanaged CLR function that selects the specific sort algorithm based on the input type. In most cases it performs what's called an introspective sort, which combines the strengths of QuickSort, HeapSort, and insertion sort for maximum efficiency. Being unmanaged code, it's generally faster and more efficient.
If you're interested in going down the rabbit hole, the Array Sort source is here and the TrySZSort implementation is here. Ultimately though, the use of unmanaged code means the garbage collector doesn't get involved, and thus less memory is used.
The implementation used by OrderBy is a standard quicksort, and OrderedEnumerable actually creates a second copy of the keys used in the sort. In your case the key is the element's only field, but that needn't be so: consider a larger class with just one or two properties used as the sort key. This leads to exactly what you observed: additional usage equal to the size of the collection, for the second copy. Assuming you then write the result out to a List or Array (rather than keeping the OrderedEnumerable) and wait for or force a garbage collection, you should recover most of that memory. The Enumerable.OrderBy method source is here if you want to dig into it.

The source of the extra memory can be found in the implementation of OrderedEnumerable, which is created on the line
IOrderedEnumerable<int> listCopy = list.OrderBy(x => x);
OrderedEnumerable is a generic implementation that sorts by any criteria you provide it, which is distinctly different from the implementation of List.Sort, which sorts elements only by value. If you follow the code of OrderedEnumerable you will find it creates a buffer into which your values are copied, accounting for an extra 80MB (4 * 20 * 1024 * 1024 bytes) of memory. The additional 40MB (2 * 20 * 1024 * 1024 bytes) is associated with the structures created to sort the list by its keys.
Another thing to note: not only does OrderBy(x => x) use more memory, it also uses a lot more processing power. In my testing, calling Sort is about six times faster than using OrderBy(x => x).
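A rough sketch of how such a timing comparison might be run (illustrative only; it assumes using System.Diagnostics and System.Linq, and numbers will vary by machine and runtime):

var r = new Random();
var data = Enumerable.Range(0, 1000000).Select(_ => r.Next()).ToList();

var copy1 = new List<int>(data);
var sw = Stopwatch.StartNew();
copy1.Sort();
Console.WriteLine("List.Sort: " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
var copy2 = data.OrderBy(x => x).ToList(); // ToList forces the deferred sort to run
Console.WriteLine("OrderBy:   " + sw.ElapsedMilliseconds + " ms");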
The List.Sort() method is backed by a heavily optimised native implementation for sorting elements by their value, whereas the LINQ OrderBy method is far more versatile and consequently less optimised for simply sorting a list by value:
IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
P.S. I would suggest using the actual variable types instead of var, as var hides valuable information about how the code actually functions from the reader. I recommend developers use the var keyword only for anonymous types.

Connor's answer gave a clue to what is happening here; the implementation of OrderedEnumerable makes it clearer. The GetEnumerator of OrderedEnumerable is:
public IEnumerator<TElement> GetEnumerator() {
    Buffer<TElement> buffer = new Buffer<TElement>(source);
    if (buffer.count > 0) {
        EnumerableSorter<TElement> sorter = GetEnumerableSorter(null);
        int[] map = sorter.Sort(buffer.items, buffer.count);
        sorter = null;
        for (int i = 0; i < buffer.count; i++) yield return buffer.items[map[i]];
    }
}
The buffer is another copy of the original data, and the map keeps the mapping of the sorted order. So, if the code is
// memory_foot_print_1
var sortedList = originalList.OrderBy(v => v);
foreach(var v in sortedList)
{
    // memory_foot_print_2
    ...
}
Here, memory_foot_print_2 will be equal to memory_foot_print_1 + size_of(originalList) + size_of(new int[count_of(originalList)]) (assuming no GC).
Thus, if originalList is a list of ints of size 80MB, memory_foot_print_2 - memory_foot_print_1 = 80 + 80 = 160MB. And if originalList is a list of longs of size 80MB, memory_foot_print_2 - memory_foot_print_1 = 80 + 40 (size of the map) = 120MB (assuming ints are 4 bytes and longs 8 bytes), which is what I was observing.
This leads to another question: whether it makes sense to use OrderBy for larger objects at all.
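As a quick check of that arithmetic for longs, a sketch along the lines of the question's own measuring code (the exact delta will vary with allocator and GC overhead):

var r = new Random();
// ~80MB of longs (10M elements * 8 bytes)
var longs = Enumerable.Range(0, 10*1024*1024).Select(_ => (long)r.Next()).ToList();
long before = GC.GetTotalMemory(true);
foreach (var v in longs.OrderBy(x => x))
{
    long after = GC.GetTotalMemory(true); // forces a collection, like printMemoryUsage above
    Console.WriteLine("delta: " + (after - before) / (1024 * 1024) + " MB"); // roughly 120 MB expected
    break;
}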

Related

When are immutable collections preferable to concurrent collections?

I recently read about immutable collections.
They are recommended as a thread-safe option for reading, when read operations are performed more often than writes.
I then wanted to test the read performance of ImmutableDictionary vs. ConcurrentDictionary. Here is a very simple test (in .NET Core 2.1):
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

namespace ImmutableSpeedTests
{
    class Program
    {
        public class ConcurrentVsImmutable
        {
            public int ValuesCount;
            public int ThreadsCount;
            private ImmutableDictionary<int, int> immutable = ImmutableDictionary<int, int>.Empty;
            private ConcurrentDictionary<int, int> concurrent = new ConcurrentDictionary<int, int>();

            public ConcurrentVsImmutable(int valuesCount, int threadsCount)
            {
                ValuesCount = valuesCount;
                ThreadsCount = threadsCount;
            }

            public void Setup()
            {
                // Fill both collections. I don't measure this time because immutable obviously fills much more slowly.
                for (var i = 0; i < ValuesCount; i++)
                {
                    concurrent[i] = i;
                    immutable = immutable.Add(i, i);
                }
            }

            public async Task<long> ImmutableSum() => await Sum(immutable);

            public async Task<long> ConcurrentSum() => await Sum(concurrent);

            private async Task<long> Sum(IReadOnlyDictionary<int, int> dic)
            {
                var tasks = new List<Task<long>>();
                // Main job: run multiple tasks to sum all values.
                for (var i = 0; i < ThreadsCount; i++)
                    tasks.Add(Task.Run(() =>
                    {
                        long x = 0;
                        foreach (var key in dic.Keys)
                        {
                            x += dic[key];
                        }
                        return x;
                    }));
                var result = await Task.WhenAll(tasks.ToArray());
                return result.Sum();
            }
        }

        static void Main(string[] args)
        {
            var test = new ConcurrentVsImmutable(1000000, 4);
            test.Setup();

            var sw = new Stopwatch();
            sw.Start();
            var result = test.ConcurrentSum().Result;
            sw.Stop();

            // Confirm that the result of the work is the same.
            Console.WriteLine($"Concurrent. Result: {result}. Elapsed: {sw.ElapsedTicks}.");

            sw.Reset();
            sw.Start();
            result = test.ImmutableSum().Result;
            sw.Stop();

            Console.WriteLine($" Immutable. Result: {result}. Elapsed: {sw.ElapsedTicks}.");
            Console.ReadLine();
        }
    }
}
You can run this code. Elapsed time in ticks will differ from run to run, but the time spent by ConcurrentDictionary is several times less than by ImmutableDictionary.
This experiment puzzles me. Did I do it wrong? What is the reason to use immutable collections if we have concurrent ones? When are they preferable?
Immutable collections are not an alternative to concurrent collections. Because they are designed to reduce memory consumption, they are bound to be slower; the trade-off is using less memory, avoiding the n operations of a full copy on every change.
We usually copy collections to other collections to achieve immutability and persist state. Let's see what that means:
var s1 = ImmutableStack<int>.Empty;
var s2 = s1.Push(1);
// s2 = [1]
var s3 = s2.Push(2);
// s2 = [1]
// s3 = [1,2]
// notice that s2 still has only one item; it is not modified
var s4 = s3.Pop(out var i);
// s2 = [1]
// s2 still has one item...
Notice that s2 always has exactly one item, even after items are popped elsewhere.
All the data is stored internally as a huge tree, and your collection points to a branch of that tree whose descendants represent a particular state of the tree.
I don't think its performance can match a concurrent collection's; the goals are totally different.
In a concurrent collection, you have a single copy of the collection accessed by all threads.
In an immutable collection you have a virtually isolated copy of a tree, and navigating that tree is always costly.
It is useful in transactional systems: if a transaction has to be rolled back, the state of the collection can be retained at commit points.
This is a criticism that's been made before.
As Akash already said, ImmutableDictionary works with an internal tree instead of a hash table.
One aspect of this is that you can improve the performance slightly if you build the dictionary in one step instead of iteratively adding all the keys:
immutable = concurrent.ToImmutableDictionary();
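In the same spirit, a builder avoids creating an intermediate immutable dictionary per Add during Setup (a sketch reusing the fields from the question's code):

var builder = ImmutableDictionary.CreateBuilder<int, int>();
for (var i = 0; i < ValuesCount; i++)
    builder.Add(i, i);             // mutates the builder; no new dictionary per Add
immutable = builder.ToImmutable(); // one conversion at the end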
Enumerating a hash table and a balanced tree are both O(n) operations. I took the average of a few runs on a single thread for varying container sizes and got results consistent with that.
I don't know why the immutable slope is 6x steeper; for now I'll just assume it's doing tricky nonblocking tree stuff. I assume this class would be optimized for random stores and reads rather than enumeration.
To identify exactly which scenarios ImmutableDictionary wins at, we'd need to wrap a concurrent dictionary to provide some level of immutability, and test both classes under varying levels of read/write contention.
Not a serious suggestion, but a counterpoint to your test is to use immutability to "cheat" over multiple iterations by comparing:
private ConcurrentDictionary<object, long> cache = new ConcurrentDictionary<object, long>();

public long ImmutableSum()
{
    return cache.GetOrAdd(immutable, (obj) => (obj as ImmutableDictionary<int, int>).Sum(kvp => (long)kvp.Value));
}

public long ConcurrentSum() => concurrent.Sum(kvp => (long)kvp.Value);
This makes quite a difference on subsequent calls to sum an unchanged collection!
The two are not mutually exclusive. I use both.
If your dictionary is small, the read performance of ImmutableDictionary will be superior to ConcurrentDictionary, since K1 * log(N) < K2 whenever log(N) < K2 / K1 (that is, when the hash table's constant overhead is worse than the tree traversal).
I personally find the write semantics of the Immutable collections easier to understand than those of the concurrent collections as they tend to be more consistent, especially when dealing with AddOrUpdate() and GetOrAdd().
In practice, I find that there are many cases in which I have a good number of small (or empty) dictionaries that are more appropriate as ImmutableDictionary, and some larger ones that warrant the use of ConcurrentDictionary.
Having said that, if they are small then it doesn't make much of a difference what you use.
Regarding Peter Wishart's answer, the enumeration cost of ImmutableDictionary is higher than ConcurrentDictionary's (for reasonable N) because tree traversal is brutal in terms of memory latency on modern cache architectures.

LINQ Why is "Enumerable = Enumerable.Skip(N)" slow?

I am having an issue with the performance of a LINQ query, so I created a small simplified example to demonstrate the issue below. The code takes a random list of small integers and returns the list partitioned into several smaller lists, each of which totals 10 or less.
The problem is that (as I've written this) the code takes exponentially longer with N, even though this is only an O(N) problem. With N = 2500, the code takes over 10 seconds to run on my PC.
I would appreciate it greatly if someone could explain what is going on. Thanks, Mark.
int N = 250;
Random r = new Random();
var work = Enumerable.Range(1, N).Select(x => r.Next(0, 6)).ToList();
var chunks = new List<List<int>>();

// work.Dump("All the work."); // LINQPad Print

var workEnumerable = work.AsEnumerable();
Stopwatch sw = Stopwatch.StartNew();
while(workEnumerable.Any()) // or .FirstOrDefault() != null
{
    int soFar = 0;
    var chunk = workEnumerable.TakeWhile(x =>
    {
        soFar += x;
        return (soFar <= 10);
    }).ToList();

    chunks.Add(chunk);                                 // Commented out makes no difference.
    workEnumerable = workEnumerable.Skip(chunk.Count); // <== SUSPECT
}
sw.Stop();

// chunks.Dump("Work Chunks."); // LINQPad Print
sw.Elapsed.Dump("Time elapsed.");
What .Skip() does is create a new IEnumerable that loops over the source and only begins yielding results after the first N elements. You chain who knows how many of these after each other. Every time you call .Any(), you need to loop over all the previously skipped elements again.
Generally speaking, it's a bad idea to set up very complicated operator chains in LINQ and enumerate them repeatedly. Also, since LINQ is a querying API, methods like Skip() are a bad choice when what you're trying to achieve amounts to modifying a data structure.
You effectively keep chaining Skip() onto the same enumerable. In a list of 250, the last chunk will be created from a lazy enumerable with ~25 'Skip' enumerator classes on the front.
You would find things become a lot faster already if you did
workEnumerable = workEnumerable.Skip(chunk.Count).ToList();
However, I think the whole approach could be altered.
How about using standard LINQ to achieve the same:
See it live on http://ideone.com/JIzpml
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    private readonly static Random r = new Random();

    public static void Main(string[] args)
    {
        int N = 250;
        var work = Enumerable.Range(1, N).Select(x => r.Next(0, 6)).ToList();

        var chunks = work.Select((o, i) => new { Index = i, Obj = o })
                         .GroupBy(e => e.Index / 10)
                         .Select(group => group.Select(e => e.Obj).ToList())
                         .ToList();

        foreach(var chunk in chunks)
            Console.WriteLine("Chunk: {0}", string.Join(", ", chunk.Select(i => i.ToString()).ToArray()));
    }
}
The Skip() method and others like it basically create a placeholder object, implementing IEnumerable, that references its parent enumerable and contains the logic to perform the skipping. Skips in loops, therefore, are non-performant, because instead of throwing away elements of the enumerable, like you think they are, they add a new layer of logic that's lazily executed when you actually need the first element after all the ones you've skipped.
You can get around this by calling ToList() or ToArray(). This forces "eager" evaluation of the Skip() method, and really does get rid of the elements you're skipping from the new collection you will be enumerating. That comes at an increased memory cost, and requires all of the elements to be known (so if you're running this on an IEnumerable that represents an infinite series, good luck).
The second option is to not use LINQ, and instead use the IEnumerable implementation itself to get and control an IEnumerator. Then, instead of Skip(), simply call MoveNext() the necessary number of times.
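For illustration, here is a sketch of that enumerator-based approach applied to the question's chunking problem (it assumes, as the question's data guarantees, that values are small enough for every chunk to hold at least one element):

var chunks = new List<List<int>>();
using (var e = work.GetEnumerator())
{
    var chunk = new List<int>();
    int soFar = 0;
    while (e.MoveNext())
    {
        if (soFar + e.Current > 10 && chunk.Count > 0)
        {
            chunks.Add(chunk); // close the current chunk and start a new one
            chunk = new List<int>();
            soFar = 0;
        }
        chunk.Add(e.Current);
        soFar += e.Current;
    }
    if (chunk.Count > 0) chunks.Add(chunk); // keep the final partial chunk
}

This is a single O(N) pass over the list, with no chained Skip enumerators.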

C#: ToArray performance [duplicate]

This question already has answers here:
Is it better to call ToList() or ToArray() in LINQ queries?
(16 answers)
Closed 9 years ago.
Background:
I admit I did not attempt to benchmark this, but I'm curious...
What are the CPU/memory characteristics of the Enumerable.ToArray<T> (and its cousin Enumerable.ToList<T>)?
Since IEnumerable does not advertise in advance how many elements it has, I (perhaps naively) presume ToArray would have to "guess" an initial array size, then resize/reallocate the array if the first guess appears to be too small, then resize it yet again if the second guess appears to be too small, etc., which would give worse-than-linear performance.
I can imagine better approaches involving (hybrid) lists, but this would still require more than one allocation (though not reallocation) and quite a bit of copying, though it could be linear overall despite the overhead.
Question:
Is there any "magic" taking place behind the scenes, that avoids the need for this repetitive resizing, and makes ToArray linear in space and time?
More generally, is there any "official" documentation on BCL performance characteristics?
No magic. Resizing happens if required.
Note that it is not always required. If the IEnumerable<T> being .ToArrayed also implements ICollection<T>, then the .Count property is used to pre-allocate the array (making the algorithm linear in space and time.) If not, however, the following (rough) code is executed:
foreach (TElement current in source)
{
    if (array == null)
    {
        array = new TElement[4];
    }
    else if (array.Length == num)
    {
        // Doubling happens *here*
        TElement[] array2 = new TElement[checked(num * 2)];
        Array.Copy(array, 0, array2, 0, num);
        array = array2;
    }
    array[num] = current;
    num++;
}
Note the doubling when the array fills.
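A quick way to see the two paths side by side (a sketch, assuming using System, System.Diagnostics, and System.Linq; exact behavior and timings vary by runtime version):

var list = Enumerable.Range(0, 10000000).ToList();

var sw = Stopwatch.StartNew();
var a = list.ToArray();                  // List<T> is ICollection<T>: one allocation, one copy
Console.WriteLine("ICollection path: " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
var b = list.Where(x => true).ToArray(); // count unknown up front: buffered path with doubling
Console.WriteLine("Buffered path:    " + sw.ElapsedMilliseconds + " ms");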
Regardless, it's generally good practice to avoid calling .ToArray() and .ToList() unless you absolutely require it. Interrogating the query directly when needed is often a better choice.
I extracted the code behind .ToArray() method using .NET Reflector:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw Error.ArgumentNull("source");
    }
    Buffer<TSource> buffer = new Buffer<TSource>(source);
    return buffer.ToArray();
}
and Buffer.ToArray:
internal TElement[] ToArray()
{
    if (this.count == 0)
    {
        return new TElement[0];
    }
    if (this.items.Length == this.count)
    {
        return this.items;
    }
    TElement[] destinationArray = new TElement[this.count];
    Array.Copy(this.items, 0, destinationArray, 0, this.count);
    return destinationArray;
}
Inside the Buffer constructor, it loops through all the elements to build the array of elements and compute the real count.
IIRC, it uses a doubling algorithm.
Remember that for most types, all you need to store are references. It's not like you're allocating enough memory to copy the entire object (unless of course you're using a lot of structs... tsk tsk).
It's still a good idea to avoid using .ToArray() or .ToList() until the last possible moment. Most of the time you can just keep using IEnumerable<T> all the way up until you either run a foreach loop or assign it to a data source.

Calling a list of methods in a random sequence?

I have a list of 10 methods. Now I want to call these methods in a random sequence. The sequence should be generated at runtime. What's the best way to do this?
It is always astonishing to me the number of incorrect and inefficient answers one sees whenever anyone asks how to shuffle a list of things on StackOverflow. Here we have several examples of code which is brittle (because it assumes that key collisions are impossible when in fact they are merely rare) or slow for large lists. (In this case the problem is stated to be only ten elements, but when possible surely it is better to give a solution that scales to thousands of elements if doing so is not difficult.)
This is not a hard problem to solve correctly. The correct, fast way to do this is to create an array of actions and then shuffle that array in place using a Fisher-Yates shuffle (a minimal sketch appears after the list of pitfalls below).
http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
Some things not to do:
Do not implement the Fisher-Yates shuffle incorrectly. One sees more incorrect than correct implementations of this trivial algorithm. In particular, make sure you are choosing the random number from the correct range. Choosing it from the wrong range produces a biased shuffle.
If the shuffle algorithm must actually be unpredictable then use a source of randomness other than Random, which is only pseudo-random. Remember, Random only has 2^32 possible seeds, and therefore there are fewer than that many possible shuffles.
If you are going to be producing many shuffles in a short amount of time, do not create a new instance of Random every time. Save and re-use the old one, or use a different source of randomness entirely. Random chooses its seed based on the time; many Randoms created in close succession will produce the same sequence of "random" numbers.
Do not sort on a "random" GUID as your key. GUIDs are guaranteed to be unique. They are not guaranteed to be randomly ordered. It is perfectly legal for an implementation to spit out consecutive GUIDs.
Do not use a random function as a comparator and feed that to a sorting algorithm. Sort algorithms are permitted to do anything they please if the comparator is bad, including crashing, and including producing non-random results. As Microsoft recently found out, it is extremely embarrassing to get a simple algorithm like this wrong.
Do not use the input to random as the key to a dictionary, and then sort the dictionary. There is nothing stopping the randomness source from choosing the same key twice, and therefore either crashing your application with a duplicate key exception, or silently losing one of your methods.
Do not use the algorithm "Create two lists. Add the elements to the first list. Repeatedly move a random element from the first list to the second list, removing the element from the first list". If the list is O(n) to remove an item then this is an O(n2) algorithm.
Do not use the algorithm "Create two lists. Add the elements to the first list. Repeatedly move a random non-null element from the first list to the second list, setting the element in the first list to null." Also do not do the crazy equivalent of that algorithm. If there are lots of items in the list then this gets slower and slower as you start hitting more and more nulls.
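For concreteness, a minimal in-place Fisher-Yates sketch (illustrative; it assumes the actions to run are in an Action[] named actions, and note that the swap index is drawn from the shrinking range, which is what keeps the shuffle unbiased):

var rng = new Random();
for (int i = actions.Length - 1; i > 0; i--)
{
    int j = rng.Next(i + 1); // 0 <= j <= i: the correct range
    var temp = actions[i];
    actions[i] = actions[j];
    actions[j] = temp;
}
foreach (var action in actions)
    action();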
New, short answer
Starting from where Ilya Kogan left off, totally correct after we had Eric Lippert find the bug:
var methods = new Action[10];
var rng = new Random();
var shuffled = methods.Select(m => Tuple.Create(rng.Next(), m))
                      .OrderBy(t => t.Item1)
                      .Select(t => t.Item2);
foreach (var action in shuffled) {
action();
}
Of course this is doing a lot behind the scenes. The method below should be much faster. But if LINQ is fast enough...
Old answer (much longer)
After stealing this code from here:
public static T[] RandomPermutation<T>(T[] array)
{
    T[] retArray = new T[array.Length];
    array.CopyTo(retArray, 0);

    Random random = new Random();
    for (int i = 0; i < array.Length; i += 1)
    {
        int swapIndex = random.Next(i, array.Length);
        if (swapIndex != i)
        {
            T temp = retArray[i];
            retArray[i] = retArray[swapIndex];
            retArray[swapIndex] = temp;
        }
    }
    return retArray;
}
the rest is easy:
var methods = new Action[10];
var perm = RandomPermutation(methods);
foreach (var method in perm)
{
    // call the method
}
Have an array of delegates. Suppose you have this:
class YourClass {
    public int YourFunction1(int x) { return x + 1; } // placeholder bodies so the sample compiles
    public int YourFunction2(int x) { return x + 2; }
    public int YourFunction3(int x) { return x + 3; }
}
Now declare a delegate:
public delegate int MyDelegate(int x);
Now create an array of delegates:
var obj = new YourClass();
MyDelegate[] delegates = new MyDelegate[10];
delegates[0] = new MyDelegate(obj.YourFunction1);
delegates[1] = new MyDelegate(obj.YourFunction2);
delegates[2] = new MyDelegate(obj.YourFunction3);
and now call it like this:
int result = delegates[randomIndex](48); // randomIndex: an index you chose at random
You can create a shuffled collection of delegates, and then call all methods in the collection.
Here is an easy way of doing so using a dictionary. The keys of the dictionary are random numbers, and the values are delegates to your methods. When you iterate through the dictionary, it has the effect of shuffling.
var shuffledActions = actions.ToDictionary(
    action => random.Next(),
    action => action);

foreach (var pair in shuffledActions.OrderBy(item => item.Key))
{
    pair.Value();
}
actions is an enumerable of your methods.
random is of type Random.
Think of this as a list of objects from which you want to extract objects randomly. You can get a random index using the Random.Next method (always pass the current List.Count as the parameter) and then remove the object from the list so it will not be drawn again.
When processing a list in a random order, the natural inclination is to shuffle a list.
Another approach is to just keep the list order, but randomly select and remove each item.
var actionList = new[]
{
    new Action( () => CallMethodOne() ),
    new Action( () => CallMethodTwo() ),
    new Action( () => CallMethodThree() )
}.ToList();

var r = new Random();
while(actionList.Count() > 0)
{
    var index = r.Next(actionList.Count());
    var action = actionList[index];
    actionList.RemoveAt(index);
    action();
}
I think:
1. get the Method objects via reflection;
2. create an array of those Method objects;
3. generate a random index (normalized to the array's range);
4. invoke the method (a minimal sketch follows below).
You can remove a method from the array to execute each method only once.
Bye
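A minimal sketch of that reflection-based idea (the Worker class and its Step methods are made up for illustration):

using System;
using System.Linq;
using System.Reflection;

class Worker
{
    public void StepA() => Console.WriteLine("A");
    public void StepB() => Console.WriteLine("B");
    public void StepC() => Console.WriteLine("C");
}

class Program
{
    static void Main()
    {
        var target = new Worker();
        var rng = new Random();

        // Collect the methods to call via reflection.
        MethodInfo[] methods = typeof(Worker)
            .GetMethods(BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly)
            .Where(m => m.Name.StartsWith("Step"))
            .ToArray();

        // Shuffle the array in place (Fisher-Yates), then invoke each method once.
        for (int i = methods.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            var tmp = methods[i];
            methods[i] = methods[j];
            methods[j] = tmp;
        }
        foreach (var m in methods)
            m.Invoke(target, null);
    }
}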

How to initialize a List<T> to a given size (as opposed to capacity)?

.NET offers a generic list container whose performance is almost identical to that of arrays (see the Performance of Arrays vs. Lists question). However, they are quite different to initialize.
Arrays are very easy to initialize with a default value, and by definition they already have a certain size:
string[] Ar = new string[10];
This allows one to safely assign random items, say:
Ar[5]="hello";
With a list, things are trickier. I can see two ways of doing the same initialization, neither of which is what you would call elegant:
List<string> L = new List<string>(10);
for (int i = 0; i < 10; i++) L.Add(null);
or
string[] Ar = new string[10];
List<string> L = new List<string>(Ar);
What would be a cleaner way?
EDIT: The answers so far refer to capacity, which is something other than pre-populating a list. For example, on a list just created with a capacity of 10, one cannot do L[2] = "somevalue".
EDIT 2: People wonder why I want to use lists this way, as it is not the way they are intended to be used. I can see two reasons:
One could quite convincingly argue that lists are the "next generation" of arrays, adding flexibility with almost no penalty, so one should use them by default. I'm pointing out they might not be as easy to initialize.
What I'm currently writing is a base class offering default functionality as part of a bigger framework. In the default functionality I offer, the size of the List is known in advance and therefore I could have used an array. However, I want to offer any derived class the chance to dynamically extend it, and therefore I opted for a list.
List<string> L = new List<string> ( new string[10] );
I can't say I need this very often - could you give more details as to why you want this? I'd probably put it as a static method in a helper class:
public static class Lists
{
    public static List<T> RepeatedDefault<T>(int count)
    {
        return Repeated(default(T), count);
    }

    public static List<T> Repeated<T>(T value, int count)
    {
        List<T> ret = new List<T>(count);
        ret.AddRange(Enumerable.Repeat(value, count));
        return ret;
    }
}
You could use Enumerable.Repeat(default(T), count).ToList() but that would be inefficient due to buffer resizing.
Note that if T is a reference type, it will store count copies of the reference passed for the value parameter - so they will all refer to the same object. That may or may not be what you want, depending on your use case.
EDIT: As noted in comments, you could make Repeated use a loop to populate the list if you wanted to. That would be slightly faster too. Personally I find the code using Repeat more descriptive, and suspect that in the real world the performance difference would be irrelevant, but your mileage may vary.
Use the constructor which takes an int ("capacity") as an argument:
List<string> L = new List<string>(10);
EDIT: I should add that I agree with Frederik. You are using the List in a way that goes against the entire reasoning behind using it in the first place.
EDIT2:
EDIT 2: What I'm currently writing is a base class offering default functionality as part of a bigger framework. In the default functionality I offer, the size of the List is known in advance and therefore I could have used an array. However, I want to offer any derived class the chance to dynamically extend it, and therefore I opted for a list.
Why would anyone need to know the size of a List with all null values? If there are no real values in the list, I would expect the length to be 0. Anyhow, the fact that this is kludgy demonstrates that it goes against the intended use of the class.
Create an array with the number of items you want first, and then convert the array into a List.
int[] fakeArray = new int[10];
List<int> list = fakeArray.ToList();
If you want to initialize the list with N elements of some fixed value:
public List<T> InitList<T>(int count, T initValue)
{
    return Enumerable.Repeat(initValue, count).ToList();
}
Why are you using a List if you want to initialize it with a fixed value?
I can understand that, for the sake of performance, you want to give it an initial capacity, but isn't one of the advantages of a list over a regular array that it can grow when needed?
When you do this:
List<int> L = new List<int>(100);
you create a list whose capacity is 100 integers. This means that your List won't need to 'grow' until you add the 101st item.
The underlying array of the list will be initialized with a length of 100.
This is an old question, but I have two solutions. One is fast and dirty reflection; the other is a solution that actually answers the question (sets the size, not the capacity) while still being performant, which none of the answers here do.
Reflection
This is quick and dirty, and it should be pretty obvious what the code does. If you want to speed it up, cache the result of GetField, or create a DynamicMethod to do it:
public static void SetSize<T>(this List<T> l, int newSize) =>
l.GetType().GetField("_size", BindingFlags.NonPublic | BindingFlags.Instance).SetValue(l, newSize);
Obviously a lot of people will be hesitant to put such code into production.
ICollection<T>
This solution is based on the fact that the constructor List(IEnumerable<T> collection) optimizes for ICollection<T> and immediately adjusts the size to the correct amount without iterating. It then calls the collection's CopyTo to do the copy.
The code for the List<T> constructor is as follows:
public List(IEnumerable<T> collection) {
    ....
    if (collection is ICollection<T> c)
    {
        int count = c.Count;
        if (count == 0)
        {
            _items = s_emptyArray;
        }
        else {
            _items = new T[count];
            c.CopyTo(_items, 0);
            _size = count;
        }
    }
}
So we can completely optimally pre-initialize the List to the correct size, without any extra copying.
How so? By creating an ICollection<T> object that does nothing other than return a Count. Specifically, we will not implement anything in CopyTo, which is the only other function called.
private struct SizeCollection<T> : ICollection<T>
{
    public SizeCollection(int size) =>
        Count = size;

    public void Add(T i) { }
    public void Clear() { }
    public bool Contains(T i) => true;
    public void CopyTo(T[] a, int i) { }
    public bool Remove(T i) => true;
    public int Count { get; }
    public bool IsReadOnly => true;
    public IEnumerator<T> GetEnumerator() => null;
    IEnumerator IEnumerable.GetEnumerator() => null;
}
public List<T> InitializedList<T>(int size) =>
    new List<T>(new SizeCollection<T>(size));
We could in theory do the same thing for AddRange/InsertRange for an existing array, which also accounts for ICollection<T>, but the code there creates a new array for the supposed items, then copies them in. In such case, it would be faster to just empty-loop Add:
public static void SetSize<T>(this List<T> l, int size) // extension method: must live in a static class
{
    if (size < l.Count)
        l.RemoveRange(size, l.Count - size);
    else
        for (size -= l.Count; size > 0; size--)
            l.Add(default(T));
}
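Usage might look like this (a sketch; InitializedList and SetSize are the helpers defined above, assumed to be in scope):

List<int> list = InitializedList<int>(1000); // size 1000, every element default(int)
list[500] = 42;                              // safe: the list really has 1000 elements

list.SetSize(1500); // grow: appends 500 default values
list.SetSize(10);   // shrink: removes everything past index 9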
Initializing the contents of a list like that isn't really what lists are for. Lists are designed to hold objects. If you want to map particular numbers to particular objects, consider using a key-value pair structure like a hash table or dictionary instead of a list.
You seem to be emphasizing the need for a positional association with your data, so wouldn't an associative array be more fitting?
Dictionary<int, string> foo = new Dictionary<int, string>();
foo[2] = "string";
The accepted answer (the one with the green check mark) has an issue.
The problem:
var result = Lists.Repeated(new MyType(), sizeOfList);
// each item in the list references the same MyType() object
// if you edit item 1 in the list, you are also editing item 2 in the list
I recommend changing the line above to perform a copy of the object. There are many different articles about that:
String.MemberwiseClone() method called through reflection doesn't work, why?
https://code.msdn.microsoft.com/windowsdesktop/CSDeepCloneObject-8a53311e
If you want to initialize every item in your list with the default constructor, rather than NULL, then add the following method:
public static List<T> RepeatedDefaultInstance<T>(int count)
{
    List<T> ret = new List<T>(count);
    for (var i = 0; i < count; i++)
    {
        ret.Add((T)Activator.CreateInstance(typeof(T)));
    }
    return ret;
}
You can use Linq to cleverly initialize your list with a default value. (Similar to David B's answer.)
var defaultStrings = (new int[10]).Select(x => "my value").ToList();
Go one step farther and initialize each string with distinct values "string 1", "string 2", "string 3", etc.:
int x = 1;
var numberedStrings = (new int[10]).Select(_ => "string " + x++).ToList();
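A variant of the same idea (a sketch) uses Select's index overload and avoids the mutable counter entirely:

var numberedStrings = (new int[10]).Select((_, i) => "string " + (i + 1)).ToList();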
string [] temp = new string[] {"1","2","3"};
List<string> temp2 = temp.ToList();
After thinking about it again, I found the non-reflection answer to the OP's question, but Charlieface beat me to it. So I believe that the correct and complete answer is https://stackoverflow.com/a/65766955/4572240
My old answer:
If I understand correctly, you want the List<T> version of new T[size], without the overhead of adding values to it.
If you are not afraid the implementation of List<T> will change dramatically in the future (and in this case I believe the probability is close to 0), you can use reflection:
public static List<T> NewOfSize<T>(int size)
{
    var list = new List<T>(size);
    var sizeField = list.GetType().GetField("_size", BindingFlags.Instance | BindingFlags.NonPublic);
    sizeField.SetValue(list, size);
    return list;
}
Note that this relies on the underlying array's default behavior of prefilling with the item type's default value: all int arrays will have values of 0, and all reference-type arrays will have values of null. Also note that for a list of reference types, only the space for the pointer to each item is allocated.
If you, for some reason, decide against using reflection, I would have liked to offer AddRange with a generator method as an option, but underneath List<T> just calls Insert a zillion times, which doesn't serve.
I would also like to point out that the Array class has a static method called Array.Resize, if you want to go the other way around and start from an Array.
To end, I really hate when I ask a question and everybody points out that it's the wrong question. Maybe it is, and thanks for the info, but I would still like an answer, because you have no idea why I am asking it. That being said, if you want to create a framework that has an optimal use of resources, List<T> is a pretty inefficient class for anything than holding and adding stuff to the end of a collection.
A notice about IList:
MSDN IList Remarks:
"IList implementations fall into three categories: read-only, fixed-size, and variable-size. (...). For the generic version of this interface, see
System.Collections.Generic.IList<T>."
IList<T> does NOT inherit from IList (but List<T> implements both IList<T> and IList), and it is always variable-size.
Since .NET 4.5 we also have IReadOnlyList<T>, but AFAIK there is no fixed-size generic list, which would be what you are looking for.
This is a sample I used for my unit tests. I created a list of class objects, then used a for loop to add the number of objects that I am expecting from the service.
This way you can add/initialize a List for any given size.
public void TestMethod1()
{
    var expected = new List<DotaViewer.Interface.DotaHero>();
    for (int i = 0; i < 22; i++) // You add empty initialization here
    {
        var temp = new DotaViewer.Interface.DotaHero();
        expected.Add(temp);
    }
    var nw = new DotaHeroCsvService();
    var items = nw.GetHero();

    CollectionAssert.AreEqual(expected, items);
}
Hope I was of help to you guys.
A bit late, but the first solution you proposed seems far cleaner to me: you don't allocate memory twice. Even the List constructor needs to loop through the array in order to copy it; it doesn't know in advance that there are only null elements inside.

1. Allocate N, then loop N times.
   Cost: 1 * allocate(N) + N * loop_iteration

2. Allocate N, then allocate N again while looping.
   Cost: 2 * allocate(N) + N * loop_iteration

However, the List's allocation and loops might be faster, since List is a built-in class, but C# is JIT-compiled, sooo...
