Should I use ToList() Deep Clone IList? - c#

Assume I have below code to deep clone a to b
IList<int> a = new List<int>();
a.Add(5);
IList<int> b = a.ToList();
bad or good?
It's seems work as ToList return a new List. But when I google it, others always use things like
listToClone.Select(item => (T)item.Clone()).ToList();
what the diff?

It can be explained if you understand how data got stored. There are two storage types of data, value type and reference type. Here below is an example of a declaration of a primitive type and an object
int i = 0;
MyInt myInt = new MyInt(0);
The MyInt class is then
public class MyInt {
private int myint;
public MyInt(int i) {
myint = int;
}
public void SetMyInt(int i) {
myint = i;
}
public int GetMyInt() {
return myint;
}
}
How would that be stored in memory ? Here below is an example. Please note that all memory examples here below are simplified !
_______ __________
| i | 0 |
| | |
| myInt | 0xadfdf0 |
|_______|__________|
For each object that you are creating in your code, you will create a reference to the said object. The object will be grouped in a heap. For a difference between stack and heap memory, please refer to this explanation.
Now, back to your question, cloning a list. Here below is an example of creating lists of integers and MyInt objects
List<int> ints = new List<int>();
List<MyInt> myInts = new List<MyInt>();
// assign 1 to 5 in both collections
for(int i = 1; i <= 5; i++) {
ints.Add(i);
myInts.Add(new MyInt(i));
}
Then we look to the memory,
_______ __________
| ints | 0x856980 |
| | |
| myInts| 0xa02490 |
|_______|__________|
Since lists comes from a collection, each field contains a reference address, which leads to the next
___________ _________
| 0x856980 | 1 |
| 0x856981 | 2 |
| 0x856982 | 3 |
| 0x856983 | 4 |
| 0x856984 | 5 |
| | |
| | |
| | |
| 0xa02490 | 0x12340 |
| 0xa02491 | 0x15631 |
| 0xa02492 | 0x59531 |
| 0xa02493 | 0x59421 |
| 0xa02494 | 0x59921 |
|___________|_________|
Now you can see that the list myInts contains again references while ints contains values. When we want to clone a list using ToList(),
List<int> cloned_ints = ints.ToList();
List<MyInt> cloned_myInts = myInts.ToList();
we get a result like here below.
original cloned
___________ _________ ___________ _________
| 0x856980 | 1 | | 0x652310 | 1 |
| 0x856981 | 2 | | 0x652311 | 2 |
| 0x856982 | 3 | | 0x652312 | 3 |
| 0x856983 | 4 | | 0x652313 | 4 |
| 0x856984 | 5 | | 0x652314 | 5 |
| | | | | |
| | | | | |
| | | | | |
| 0xa02490 | 0x12340 | | 0xa48920 | 0x12340 |
| 0xa02491 | 0x15631 | | 0xa48921 | 0x12340 |
| 0xa02492 | 0x59531 | | 0xa48922 | 0x59531 |
| 0xa02493 | 0x59421 | | 0xa48923 | 0x59421 |
| 0xa02494 | 0x59921 | | 0xa48924 | 0x59921 |
| | | | | |
| | | | | |
| 0x12340 | 0 | | | |
|___________|_________| |___________|_________|
The 0x12340 is then the reference of the first MyInt object, holding variable 0. It's shown here simplified to explain it well.
You can see that the list appears as cloned. But when we want to change a variable of the cloned list, the first one will be set to 7.
cloned_ints[0] = 7;
cloned_myInts[0].SetMyInt(7);
Then we get the next result
original cloned
___________ _________ ___________ _________
| 0x856980 | 1 | | 0x652310 | 7 |
| 0x856981 | 2 | | 0x652311 | 2 |
| 0x856982 | 3 | | 0x652312 | 3 |
| 0x856983 | 4 | | 0x652313 | 4 |
| 0x856984 | 5 | | 0x652314 | 5 |
| | | | | |
| | | | | |
| | | | | |
| 0xa02490 | 0x12340 | | 0xa48920 | 0x12340 |
| 0xa02491 | 0x15631 | | 0xa48921 | 0x12340 |
| 0xa02492 | 0x59531 | | 0xa48922 | 0x59531 |
| 0xa02493 | 0x59421 | | 0xa48923 | 0x59421 |
| 0xa02494 | 0x59921 | | 0xa48924 | 0x59921 |
| | | | | |
| | | | | |
| 0x12340 | 7 | | | |
|___________|_________| |___________|_________|
Did you see the changes ? The first value in 0x652310 got changed to 7. But at the MyInt object, the reference address didn't got changed. However, the value will be assigned to the 0x12340 address.
When we want to display the result, then we have the next
ints[0] -------------------> 1
cloned_ints[0] -------------> 7
myInts[0].GetMyInt() -------> 7
cloned_myInts[0].GetMyInt() -> 7
As you can see, the original ints has kept its values while the original myInts has a different value, it got changed. That's because both pointers points to the same object. If you edit that object, both pointers will call that object.
That's why there are two types of clonings, deep and shallowed one. The example below here is a deep clone
listToClone.Select(item => (T)item.Clone()).ToList();
This selects each item in the original list, and clones each found object in the list. The Clone() comes from the Object class, which will create a new object with same variables.
However, please notice that it's not secure if you have an object or any reference types in your class, you have to implement the cloning mechanism yourself. Or you will face same issues as described here above, that the original and cloned object is just holding a reference. You can do that by implmenting the ICloneable interface, and this example of implementing it.
I hope that it's now clear for you.

It depends. If you have a collection of value types it will copy them. If you have a list of reference types then it will only copy references to real objects and not the real values. Here's a small example
void Main()
{
var a = new List<int> { 1, 2 };
var b = a.ToList();
b[0] = 2;
a.Dump();
b.Dump();
var c = new List<Person> { new Person { Age = 5 } };
var d = c.ToList();
d[0].Age = 10;
c.Dump();
d.Dump();
}
class Person
{
public int Age { get; set; }
}
The previous code results in
a - 1, 2
b - 2, 2
c - Age = 10
d - Age = 10
As you can see the first number in the new collection changed and did not affect the other one. But that was not the case with the age of the person I created.

If the contents an IList<T> either encapsulate values directly, identify immutable objects for the purpose of encapsulating the values therein, or encapsulate the identities of shared mutable objects, then calling ToList will create a new list, detached from the original, which encapsulates the same data.
If the contents of an IList<T> encapsulate values or states in mutable objects, however, such an approach would not work. References to mutable objects can only values or if the objects in question are unshared. If references to mutable objects are shared, the whereabouts of all such references will become an part of the state encapsulated thereby. To avoid this, copying a list of such objects requires producing a new list containing references to other (probably newly-created) objects that encapsulate the same value or states as those in the original. If the objects in question include a usable Clone method, that might be usable for the purpose, but if the objects in question are collections themselves, the correct and efficient behavior of their Clone method would rely upon them knowing the whether they contain objects which must not be exposed to the recipient of the copied list--something which the .NET Framework has no way of telling them.

Related

Why is my IEnumerable variable also getting updated?

I am a little confused about why the logic here isn't working, and I feel like I have been staring at this bit of code for so long that I am missing something here. I have this method that gets called by another method:
private async Task<bool> Method1(int start, int end, int increment, IEnumerable<ModelExample> examples)
{
for (int i = start; i<= end; i++)
{
ModelExample example = examples.Where(x => x.id == i).Select(i => i).First();
example.id = example.id + increment; //Line X
// do stuff
}
return true;
}
I debuged the code above and it seems like when "Line X" gets executed not only is example.id changed individually but now that example in the List "examples" gets updated with the new id value.
I am not sure why? I want the list "examples" to remain the same for the entirety of the for loop, I am confused why updating the value for example.id updates it in the list as well?
(i.e. if the list before "Line X" had an entry with id = 0, after "Line X" that same entry has its id updated to 1, how can I keep the variable "examples" constant here?)
Any help appreciated, thanks.
This is what your list looks like:
+---------------+
| List examples | +-----------+
+---------------+ | Example |
| | +-----------+
| [0] ---------------> | Id: 1 |
| | | ... |
+---------------+ +-----------+
| |
| [1] ---------------> +-----------+
| | | Example |
+---------------+ +-----------+
| | | Id: 2 |
| ... | | ... |
| | +-----------+
+---------------+
In other words, your list just contains references to your examples, not copies. Thus, your variable example refers to one of the entities on the right-hand side and modifies it in-place.
If you need a copy, you need to create one yourself.

Modifying an array "in-place" in C#?

There's a leetCode problem with a prompt like
Given an array 'nums' and a value 'val', remove all instances of that
value in-place and return the new length.
Do not allocate extra space for another array, you must do this by
modifying the input array in-place with O(1) extra memory.
The order of elements can be changed. It doesn't matter what you leave
beyond the new length.
Example 1:
Given nums = [3,2,2,3], val = 3,
Your function should return length = 2, with the first two elements of
nums being 2.
Does this problem make sense in C# (one of the supported languages for solving this problem on the site)? You can't modify the length of an array, only can allocate a new array.
Does this problem make sense in C#?
Yes.
[In C#] You can't modify the length of an array, only can allocate a new array.
Absolutely right. So re-read the question, and see what you can do. I don't want to give the answer, as clearly you will learn from working this out yourself. So stop reading here; read on if you want a hint.
Hint: focus on this part...
The order of elements can be changed. It doesn't matter what you leave beyond the new length.
A totally nonsensical approach, using unsafe, fixed, and Pointers... Why? well why not. And besides, any leet code problem ought to use leet pointers in its solve.
Given
public static unsafe int Filter(this int[] array, int value, int length)
{
fixed (int* pArray = array)
{
var pI = pArray;
var pLen = pArray + length;
for (var p = pArray; p < pLen; p++)
if (*p != value)
{
*pI = *p;
pI++;
}
return (int)(pI - pArray);
}
}
Usage
var input = new[] {23,4,4,3,32};
var length = input.Filter(4, input.GetLength(0));
for (var i = 0; i < length; i++)
Console.WriteLine(input[i]);
Output
23
3
32
Just FYI, a normal safe solution would look like this
var result = 0;
for (var i = 0; i < length; i++)
if (array[i] != value) array[result++] = array[i];
return result;
Benchmarks
And just because i had my benchmarker open and morbidly curious, its always good to test optimizations as you just never know. Each test is run 10000 times on each scale and Garbage collected each run.
----------------------------------------------------------------------------
Mode : Release (64Bit)
Test Framework : .NET Framework 4.7.1 (CLR 4.0.30319.42000)
----------------------------------------------------------------------------
Operating System : Microsoft Windows 10 Pro
Version : 10.0.17134
----------------------------------------------------------------------------
CPU Name : Intel(R) Core(TM) i7-2600 CPU # 3.40GHz
Description : Intel64 Family 6 Model 42 Stepping 7
Cores (Threads) : 4 (8) : Architecture : x64
Clock Speed : 3401 MHz : Bus Speed : 100 MHz
L2Cache : 1 MB : L3Cache : 8 MB
----------------------------------------------------------------------------
Test 1
--- Standard input ------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 -------------------------------------------------- Time 9.902 ---
| Array | 434.091 ns | 300.000 ns | 3.731 K | 0.000 B | Base | 0.00 % |
| UnsafeOpti | 445.116 ns | 300.000 ns | 3.662 K | 0.000 B | N/A | -2.54 % |
| Unsafe | 448.286 ns | 300.000 ns | 3.755 K | 0.000 B | N/A | -3.27 % |
--- Scale 1,000 ----------------------------------------------- Time 10.161 ---
| UnsafeOpti | 1.421 µs | 900.000 ns | 7.221 K | 0.000 B | N/A | 17.70 % |
| Array | 1.727 µs | 1.200 µs | 8.230 K | 0.000 B | Base | 0.00 % |
| Unsafe | 1.740 µs | 1.200 µs | 8.302 K | 0.000 B | N/A | -0.78 % |
--- Scale 10,000 ---------------------------------------------- Time 10.571 ---
| UnsafeOpti | 10.910 µs | 9.306 µs | 39.518 K | 0.000 B | N/A | 20.03 % |
| Array | 13.643 µs | 12.007 µs | 48.849 K | 0.000 B | Base | 0.00 % |
| Unsafe | 13.657 µs | 12.007 µs | 48.907 K | 0.000 B | N/A | -0.10 % |
--- Scale 100,000 --------------------------------------------- Time 15.443 ---
| UnsafeOpti | 105.386 µs | 84.954 µs | 362.969 K | 0.000 B | N/A | 19.93 % |
| Unsafe | 130.150 µs | 110.771 µs | 447.383 K | 0.000 B | N/A | 1.12 % |
| Array | 131.621 µs | 113.773 µs | 452.262 K | 0.000 B | Base | 0.00 % |
--- Scale 1,000,000 ------------------------------------------- Time 22.183 ---
| UnsafeOpti | 1.556 ms | 1.029 ms | 5.209 M | 0.000 B | N/A | 23.13 % |
| Unsafe | 2.007 ms | 1.303 ms | 6.780 M | 0.000 B | N/A | 0.85 % |
| Array | 2.024 ms | 1.354 ms | 6.841 M | 0.000 B | Base | 0.00 % |
-------------------------------------------------------------------------------
Looks like you're looking for an explanation of the algorithm so I'll try to explain it.
You're "rewriting" the array in place. Think of having a cursor traversing the array reading the data, while simultaneously having a cursor to write values behind it. The "read" cursor in this case could be a foreach loop or a for loop with an indexer.
Test data:
[12,34,56,34,78]
and we want to remove 34, we start with both cursors at position 0 (i.e. values[0]) with the "newLength = 0" representing the length of the "new" array and thus where the "write" cursor currently resides:
[12,34,56,34,78]
^r
^w
newLength: 0
The first element read, 12, does not match, so we include that element in the "new" array by writing it to the position in the array signified by the current length of the "new" array (to start, the length is 0, so that's values[0]. In this case, we're writing the same value to that element, so nothing changes and we move the write cursor to the next position by incrementing the length of the "new" array
[12,34,56,34,78]
^r
^w
newLength: 1
Now the next element does match the value to remove so we don't want to include it in the new array. We do this by not increasing the length of the "new" array and leaving that write cursor where it is while the read cursor moves on to the next element:
[12,34,56,34,78]
^r
^w
newLength: 1
If this array was only two elements, we're done, and no values in the array have changed but the returned length is only 1. As we have more elements, let's continue to see what happens. We read 56, which does not match the value to remove, so we "write" it at the position specified by the new "length", after which we increment the length:
[12,56,56,34,78]
^r
^w
newLength: 2
Now we read a value that matches, so we skip writing it:
[12,56,56,34,78]
^r
^w
newLength: 2
And the final value doesn't match, so we write it at the position specified by "length" and increment the length:
[12,56,78,34,78]
^r
^w
newLength: 3
The "read" cursor now exceeds the length of the array so we're done. The new "length" value is now 3 since we "wrote" three values to the array. Using the array in conjunction with the newLength value functionally results in the "new" array [12,56,78].
Here's a safe implementation of #TheGeneral's functionally correct but unsafe implementation using pointers:
public static int DoStuffSafely( int[] values, int valueToRemove, int length )
{
var newLength = 0;
// ~13.5% slower than #TheGeneral's unsafe implementation
foreach( var v in values )
{
// if the value matches the value to remove, skip it
if( valueToRemove != v )
{
// newLength tracks the length of the "new" array
// we'll set the value at that index
values[ newLength ] = v;
// then increase the length of the "new" array
++newLength;
// I never use post-increment/decrement operators
// it requires a temp memory alloc that we toss immediately
// which is expensive for this simple loop, adds about 11% in execution time
}
}
return newLength;
}
Edit: for #Richardissimo
values[ newLength ] = v;
00007FFBC1AC8993 cmp eax,r10d
00007FFBC1AC8996 jae 00007FFBC1AC89B0
00007FFBC1AC8998 movsxd rsi,eax
00007FFBC1AC899B mov dword ptr [r8+rsi*4+10h],r11d
++newLength;
00007FFBC1AC89A0 inc eax
values[ newLength++ ] = v;
00007FFBC1AD8993 lea esi,[rax+1]
00007FFBC1AD8996 cmp eax,r10d
00007FFBC1AD8999 jae 00007FFBC1AD89B3
00007FFBC1AD899B movsxd rax,eax
00007FFBC1AD899E mov dword ptr [r8+rax*4+10h],r11d
00007FFBC1AD89A3 mov eax,esi

With two very large lists/collections - how to detect and/or remove duplicates efficiently

Why
When comparing and deduplicating across two lists coders don't often find the most runtime-efficient implementation while under time-pressure. Two nested for-loops is a common goto solution for many coders. One might try a CROSS JOIN with LINQ, but this is clearly inefficient. Coders need a memorable and code-efficient approach for this that is also relatively runtime-efficient.
This question was created after seeing a more specific one: Delete duplicates in a single dataset relative to another one in C# - it's more specialised with the use of Datasets. The term "dataset" would not help people in the future. No other generalised question was found.
What
I have used the term List/Collection to help with this more general coding problem.
var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M
var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M
var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);
By implementing the deduplicationFunction function the input data and output should be clear. The output can be IEnumerable. The expected output in this input example would be the even numbers from 1-1M {2,4,6,8,...}
Note: There may be duplicates within the referenceSet. The values in both sets are indicative only, so I'm not looking for a mathematical solution - this should also work for random number inputs in both sets.
If this is approached with simple LINQ functions it will be too slow O(1M*0.5M). There needs to be a faster approach for such large sets.
Speed is important, but incremental improvements with a large bloat of code will be of less value. Also, ideally it would work for other datatypes including data model objects, but answering this specific question should be enough. Other datatypes would simply involve some more pre-processing or slight change to the answer.
Solution Summary
Here's the test code, for results which follow:
using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Test
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Preparing...");
List<int> set1 = new List<int>();
List<int> set2 = new List<int>();
Random r = new Random();
var max = 10000;
for (int i = 0; i < max; i++)
{
set1.Add(r.Next(0, max));
set2.Add(r.Next(0, max/2) * 2);
}
Console.WriteLine("First run...");
Stopwatch sw = new Stopwatch();
IEnumerable<int> result;
int count;
while (true)
{
sw.Start();
result = deduplicationFunction(set1, set2);
var results1 = result.ToList();
count = results1.Count;
sw.Stop();
Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
sw.Start();
result = deduplicationFunction2(set1, set2);
var results2 = result.ToList();
count = results2.Count;
sw.Stop();
Console.WriteLine(" HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
sw.Start();
result = deduplicationFunction3(set1, set2);
var results3 = result.ToList();
count = results3.Count;
sw.Stop();
Console.WriteLine(" Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
sw.Start();
result = deduplicationFunction4(set1, set2);
var results4 = result.ToList();
count = results3.Count;
sw.Stop();
Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test
sw.Start();
result = deduplicationFunction5(set1, set2);
var results5 = result.ToList();
count = results5.Count;
sw.Stop();
Console.WriteLine(" Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
Console.ReadLine();
Console.WriteLine("");
Console.WriteLine("Next Run");
Console.WriteLine("");
}
}
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = Reference
.Distinct() //Inserting duplicate keys in a dictionary will cause an exception
.ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
int throwAway;
return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
}
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var SetAsHash = new HashSet<int>();
Set.ForEach(x =>
{
if (SetAsHash.Contains(x))
return;
SetAsHash.Add(x);
}); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.
SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing
return SetAsHash.AsEnumerable();
}
static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
{
Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.
return deduplicationFunction4(Set, Reference);
}
static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
int i1 = 0;
int i2 = 0;
int thisValue = Set[i1];
int thisReference = Reference[i2];
while (true)
{
var difference = thisReference - thisValue;
if (difference < 0)
{
i2++; //Compare side is too low, there might be an equal value to be found
if (i2 == Reference.Count)
break;
thisReference = Reference[i2];
continue;
}
if (difference > 0) //Duplicate
yield return thisValue;
GoFurther:
i1++;
if (i1 == Set.Count)
break;
if (Set[i1] == thisValue) //Eliminates duplicates
goto GoFurther; //I rarely use goto statements, but this is a good situation
thisValue = Set[i1];
}
}
static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
{
var found = false;
var lastValue = 0;
var thisValue = 0;
for (int i = 0; i < Set.Count; i++)
{
thisValue = Set[i];
if (thisValue == lastValue)
continue;
lastValue = thisValue;
found = false;
for (int x = 0; x < Reference.Count; x++)
{
if (thisValue != Reference[x])
continue;
found = true;
break;
}
if (found)
continue;
yield return thisValue;
}
}
}
}
I'll use this to compare performance of multiple approaches. (I'm particularly interested in Hash-approach vs dual-index-on-sorted-approach at this stage, although ExceptWith enables a terse solution)
Results so far on 10k items in set (Good Run):
First Run
Dictionary and Where - Count: 3565, Milliseconds: 16.38.
HashSet ExceptWith - Count: 3565, Milliseconds: 5.33.
Sort Dual Index - Count: 3565, Milliseconds: 6.34.
Presorted Dual Index - Count: 3565, Milliseconds: 1.14.
Nested Index - Count: 3565, Milliseconds: 964.16.
Good Run
Dictionary and Where - Count: 3565, Milliseconds: 1.21.
HashSet ExceptWith - Count: 3565, Milliseconds: 0.94.
Sort Dual Index - Count: 3565, Milliseconds: 1.09.
Presorted Dual Index - Count: 3565, Milliseconds: 0.76.
Nested Index - Count: 3565, Milliseconds: 628.60.
Chosen answer:
#backs HashSet.ExceptWith approach - is marginally faster with minimal code, uses an interesting function ExceptWith, however it is weakened due to lack of versatility, and the fact the interesting function is less commonly known.
One of my answers: HashSet > Where(..Contains..) - is only a tiny bit slower than #backs, but uses a code pattern that uses LINQ and is very versitile beyond lists of primative elements. I believe this is a more common scenario I find myself with when coding, and trust this is the case for many other coders.
Special thanks to #TheGeneral for his benchmarking of some of the answers and also some interesting unsafe versions, and for helping to make #Backs answer more efficient for a followup test.
Use HashSet for your initial list and ExceptWith method to get result sett:
var setToDeduplicate = new HashSet<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M
var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M
setToDeduplicate.ExceptWith(referenceSet);
Here are some more, basically i wanted to test both distinct and not distinct input against a variety of solutions. In the non distinct version i had to call distinct where needed on the final output.
Mode : Release (64Bit)
Test Framework : .NET Framework 4.7.1
Operating System : Microsoft Windows 10 Pro
Version : 10.0.17134
CPU Name : Intel(R) Core(TM) i7-3770K CPU # 3.50GHz
Description : Intel64 Family 6 Model 58 Stepping 9
Cores (Threads) : 4 (8) : Architecture : x64
Clock Speed : 3901 MHz : Bus Speed : 100 MHz
L2Cache : 1 MB : L3Cache : 8 MB
Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000
Results Distinct input
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs | 0.008 ms | 0.007 ms | 31,362 | 8.000 KB | Pass | 68.34 % |
| ListUnsafe | 0.009 ms | 0.008 ms | 35,487 | 8.000 KB | Pass | 63.45 % |
| HasSet | 0.012 ms | 0.011 ms | 46,840 | 8.000 KB | Pass | 50.03 % |
| ArrayUnsafe | 0.013 ms | 0.011 ms | 49,388 | 8.000 KB | Pass | 47.75 % |
| HashSetUnsafe | 0.018 ms | 0.013 ms | 66,866 | 16.000 KB | Pass | 26.62 % |
| Todd | 0.024 ms | 0.019 ms | 90,763 | 16.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs | 0.070 ms | 0.060 ms | 249,374 | 28.977 KB | Pass | 57.56 % |
| ListUnsafe | 0.078 ms | 0.067 ms | 277,080 | 28.977 KB | Pass | 52.67 % |
| HasSet | 0.093 ms | 0.083 ms | 329,686 | 28.977 KB | Pass | 43.61 % |
| ArrayUnsafe | 0.096 ms | 0.082 ms | 340,154 | 36.977 KB | Pass | 41.72 % |
| HashSetUnsafe | 0.103 ms | 0.085 ms | 367,681 | 55.797 KB | Pass | 37.07 % |
| Todd | 0.164 ms | 0.151 ms | 578,933 | 112.664 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe | 0.706 ms | 0.611 ms | 2,467,327 | 258.516 KB | Pass | 48.60 % |
| Backs | 0.758 ms | 0.654 ms | 2,656,610 | 180.297 KB | Pass | 44.81 % |
| ArrayUnsafe | 0.783 ms | 0.696 ms | 2,739,156 | 276.281 KB | Pass | 43.02 % |
| HasSet | 0.859 ms | 0.752 ms | 2,999,230 | 198.063 KB | Pass | 37.47 % |
| HashSetUnsafe | 0.864 ms | 0.783 ms | 3,029,086 | 332.273 KB | Pass | 37.07 % |
| Todd | 1.373 ms | 1.251 ms | 4,795,929 | 604.742 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe | 5.624 ms | 4.874 ms | 19,658,154 | 2.926 MB | Pass | 40.36 % |
| HasSet | 7.574 ms | 6.548 ms | 26,446,193 | 2.820 MB | Pass | 19.68 % |
| Backs | 7.585 ms | 5.634 ms | 26,303,794 | 2.009 MB | Pass | 19.57 % |
| ArrayUnsafe | 8.287 ms | 6.219 ms | 28,923,797 | 3.583 MB | Pass | 12.12 % |
| Todd | 9.430 ms | 7.326 ms | 32,880,985 | 2.144 MB | Base | 0.00 % |
| HashSetUnsafe | 9.601 ms | 7.859 ms | 32,845,228 | 5.197 MB | Pass | -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe | 57.751 ms | 44.734 ms | 201,477,028 | 29.309 MB | Pass | 22.14 % |
| Backs | 65.567 ms | 49.023 ms | 228,772,283 | 21.526 MB | Pass | 11.61 % |
| HasSet | 73.163 ms | 56.799 ms | 254,703,994 | 25.904 MB | Pass | 1.36 % |
| Todd | 74.175 ms | 53.739 ms | 258,760,390 | 9.144 MB | Base | 0.00 % |
| ArrayUnsafe | 86.530 ms | 67.803 ms | 300,374,535 | 13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 | 39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------
Results Random List using Distinct on results where needed
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs | 0.007 ms | 0.006 ms | 28,449 | 8.000 KB | Pass | 72.96 % |
| HasSet | 0.010 ms | 0.009 ms | 38,222 | 8.000 KB | Pass | 62.05 % |
| HashSetUnsafe | 0.014 ms | 0.010 ms | 51,816 | 16.000 KB | Pass | 47.52 % |
| ListUnsafe | 0.017 ms | 0.014 ms | 64,333 | 16.000 KB | Pass | 33.84 % |
| ArrayUnsafe | 0.020 ms | 0.015 ms | 72,468 | 16.000 KB | Pass | 24.70 % |
| Todd | 0.026 ms | 0.021 ms | 95,500 | 24.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs | 0.061 ms | 0.053 ms | 219,141 | 28.977 KB | Pass | 70.46 % |
| HasSet | 0.092 ms | 0.080 ms | 325,353 | 28.977 KB | Pass | 55.78 % |
| HashSetUnsafe | 0.093 ms | 0.079 ms | 331,390 | 55.797 KB | Pass | 55.03 % |
| ListUnsafe | 0.122 ms | 0.101 ms | 432,029 | 73.016 KB | Pass | 41.19 % |
| ArrayUnsafe | 0.133 ms | 0.113 ms | 469,560 | 73.016 KB | Pass | 35.88 % |
| Todd | 0.208 ms | 0.173 ms | 730,661 | 148.703 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs | 0.620 ms | 0.579 ms | 2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet | 0.696 ms | 0.635 ms | 2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe | 0.731 ms | 0.679 ms | 2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe | 0.804 ms | 0.761 ms | 2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe | 0.810 ms | 0.751 ms | 2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd | 1.388 ms | 1.271 ms | 4,863,651 | 736.953 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs | 5.604 ms | 4.710 ms | 19,600,934 | 2.009 MB | Pass | 62.92 % |
| HasSet | 6.607 ms | 5.847 ms | 23,093,963 | 2.820 MB | Pass | 56.29 % |
| HashSetUnsafe | 8.565 ms | 7.465 ms | 29,239,067 | 5.197 MB | Pass | 43.34 % |
| ListUnsafe | 11.447 ms | 9.543 ms | 39,452,865 | 5.101 MB | Pass | 24.28 % |
| ArrayUnsafe | 11.517 ms | 9.841 ms | 39,731,502 | 5.483 MB | Pass | 23.81 % |
| Todd | 15.116 ms | 11.369 ms | 51,963,309 | 3.427 MB | Base | 0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs | 53.766 ms | 44.321 ms | 187,905,335 | 21.526 MB | Pass | 51.32 % |
| HasSet | 60.759 ms | 50.742 ms | 212,409,649 | 25.904 MB | Pass | 44.99 % |
| HashSetUnsafe | 79.248 ms | 67.130 ms | 275,455,545 | 39.527 MB | Pass | 28.25 % |
| ListUnsafe | 106.527 ms | 90.159 ms | 370,838,650 | 39.153 MB | Pass | 3.55 % |
| Todd | 110.444 ms | 93.225 ms | 384,636,081 | 22.676 MB | Base | 0.00 % |
| ArrayUnsafe | 114.548 ms | 98.033 ms | 398,219,513 | 38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------
Data
private Tuple<List<int>, List<int>> GenerateData(int scale)
{
return new Tuple<List<int>, List<int>>(
Enumerable.Range(0, scale)
.Select(x => x)
.ToList(),
Enumerable.Range(0, scale)
.Select(x => Rand.Next(10000))
.ToList());
}
Code
public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var hashSet = new HashSet<int>(Input.Item1);
hashSet.ExceptWith(Input.Item2);
return hashSet.ToList();
}
}
public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var hashSet = new HashSet<int>(Input.Item2);
return Input.Item1.Where(y => !hashSet.Contains(y)).ToList();
}
}
public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var referenceHashSet = Input.Item2.Distinct()
.ToDictionary(x => x, x => x);
return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
}
}
public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new HashSet<int>();
fixed (int* pAry = Input.Item1.ToArray())
{
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
result.Add(*p);
}
}
return result.ToList();
}
}
public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new List<int>(Input.Item2.Count);
fixed (int* pAry = Input.Item1.ToArray())
{
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
result.Add(*p);
}
}
return result.ToList();
}
}
public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new int[Input.Item1.Count];
fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
{
var j = 0;
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
*(pRes+j++) = *p;
}
return result.Take(j).ToList();
}
}
}
Summary
No surprises here really, if you have a distinct list to start with its better for some solutions, If not the simplest hashset version is the best
Single loop Dual-Index
As recommended by #PepitoSh in the Question comments:
I think HashSet is a very generic solution to a rather specific
problem. If your lists are ordered, scanning them parallel and compare
the current items is the fastest
This is very different to having two nested loops. Instead there is a single general loop and the indexes are incremented ascending in parallel, depending on the relative value difference. The difference is basically the output of any normal Comparison function: { negative, 0, positive }
static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
int i1 = 0;
int i2 = 0;
int thisValue = Set[i1];
int thisReference = Reference[i2];
while (true)
{
var difference = thisReference - thisValue;
if (difference < 0)
{
i2++; //Compare side is too low, there might be an equal value to be found
if (i2 == Reference.Count)
break;
thisReference = Reference[i2];
continue;
}
if (difference > 0) //Duplicate
yield return thisValue;
GoFurther:
i1++;
if (i1 == Set.Count)
break;
if (Set[i1] == thisValue) //Eliminates duplicates
goto GoFurther; //I rarely use goto statements, but this is a good situation
thisValue = Set[i1];
}
}
How to call this function, if the lists aren't yet sorted:
Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.
return deduplicationFunction4(Set, Reference);
This gave me the best performance in my benchmarking. This could probably also be tried with unsafe code for more of a speedup in some scenarios. In scenarios where data is already sorted, this is by far the best. A faster sorting algorithm might also be selected, but not the subject of this question.
Note: This method deduplicates as it goes.
I have actually coded such a single loop pattern before when finalising text-search results, except I had N arrays to check for "closeness". So I had an array of indexes - array[index[i]]. So I'm sure having a single loop with controlled index incrementing isn't a new concept, but it's certainly a great solution here.
HashSet and Where
You must use a HashSet (or Dictionary) for speed:
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = Reference
.Distinct() //Inserting duplicate keys in a dictionary will cause an exception
.ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
int throwAway;
return Set.Where(y => ReferenceHashSet.TryGetValue(y, out throwAway));
}
That's a lambda expression version. It uses Dictionary which provides adaptability for varying the value if needed. Literal for-loops could be used and perhaps some more incremental performance improvement gained, but relative to having two-nested-loops, this is already an amazing improvement.
Learning a few things while looking at other answers, here is a faster implementation:
static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = new HashSet<int>(Reference);
return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}
Importantly, this approach (while a tiny bit slower than #backs answer) is still versatile enough to use for database entities, AND other types can easily be used on the duplicate check field.
Here's an example how the code is easily adjusted for use with a Person kind of database entity list.
static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
.GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}

Obtain data from SQL and loop in different order

in my application I need to operate with data obtained from SQL SERVER. At first, I grab 5 elements from table1:
example 1
example 2
example 3
example 4
example 5
And now, for each element in array1 I grab 3 elements from table2
foo 1
foo 2
foo 3
foo 4
foo 5
foo 6
foo 7
foo 8
foo 9
foo 10
foo 11
foo 12
foo 13
foo 14
foo 15
Allright. What I would like to do now is to loop them in that order:
(I cant post third image because Im new member, but you know the idea.)
To do this I adeed data to arrays:
arrayExample[5]
+----------+
| example0 |
+----------+
| example1 |
+----------+
| example2 |
+----------+
| example3 |
+----------+
| example4 |
+----------+
arrayFoo[5,3]
+--------+--------+--------+
| foo 0 | foo 1 | foo 2 |
+--------+--------+--------+
| foo 3 | foo 4 | foo 5 |
+--------+--------+--------+
| foo 6 | foo 7 | foo 8 |
+--------+--------+--------+
| foo 9 | foo 10 | foo 11 |
+--------+--------+--------+
| foo 12 | foo 13 | foo 14 |
+--------+--------+--------+
And I loop them with:
for (int i=0; i<3; i++)
{
for (int j=0; j<5; j++)
{
DoActions(arrayExample[j], arrayFoo[i,j]);
}
}
And It works. But now I would like to get more data from SQL server. I would need to create more and more dimensions of my arrays. I think its too complicated. There must be a better solution to do this.
tl;dr
I want to grab 5 elements EXAMPLE and each of wchich should have assigned 3 FOO elements.
Now loop every elements with first FOO asigned, then second and then third.
DO you have any better idea than mine?

Packing items vertically (with fixed length and horizontal position)

Let's say i have some items, that have a defined length and horizontal position (both are constant) :
1 : A
2 : B
3 : CC
4 : DDD (item 4 start at position 1, length = 3)
5 : EE
6 : F
I'd like to pack them vertically, resulting in a rectangle having smallest height as possible.
Until now, I have some very simple algorithm that loops over the items and that check row by row if placing them in that row is possible (that means without colliding with something else). Sometimes, it works perfectly (by chance) but sometimes, it results in non-optimal solution.
Here is what it would give for the above example (step by step) :
A | A B | ACC B | ACC B | ACC B | ACC B |
DDD | DDD | FDDD |
EE | EE |
While optimal solution would be :
ADDDB
FCCEE
Note : I have found that sorting items by their length (descending order) first, before applying algorithm, give better results (but it is still not perfect).
Is there any algorithm that would give me optimal solution in reasonable time ? (trying all possibilities is not feasible)
EDIT : here is an example that would not work using sorting trick and that would not work using what TylerOhlsen suggested (unless i dont understand his answer) :
1 : AA
2 : BBB
3 : CCC
4 : DD
Would give :
AA BBB
CCC
DD
Optimal solution :
DDBBB
AACCC
Just spitballing (off the top of my head and just pseudocode ). This algorithm is looping through positions of the current row and attempts to find the best item to place at the position and then moves on to the next row when this row completes. The algorithm completes when all items are used.
The key to the performance of this algorithm is creating an efficient method which finds the longest item at a specific position. This could be done by creating a dictionary (or hash table) of: key=positions, value=sorted list of items at that position (sorted by length descending). Then finding the longest item at a position is as simple as looking up the list of items by position from that hash table and popping the top item off that list.
int cursorRow = 0;
int cursorPosition = 0;
int maxRowLength = 5;
List<Item> items = //fill with item list
Item[][] result = new Item[][];
while (items.Count() > 0)
(
Item item = FindLongestItemAtPosition(cursorPosition);
if (item != null)
{
result[cursorRow][cursorPosition] = item;
items.Remove(item);
cursorPosition += item.Length;
}
else //No items remain with this position
{
cursorPosition++;
}
if (cursorPosition == maxRowLength)
{
cursorPosition = 0;
cursorRow++;
}
}
This should result in the following steps for Example 1 (at the beginning of each loop)...
Row=0 | Row=0 | Row=0 | Row=1 | Row=1 | Row=1 | Row=2 |
Pos=0 | Pos=1 | Pos=4 | Pos=0 | Pos=1 | Pos=3 | Pos=0 |
| A | ADDD | ADDDB | ADDDB | ADDDB | ADDDB |
F | FCC | FCCEE |
This should result in the following steps for Example 2 (at the beginning of each loop)...
Row=0 | Row=0 | Row=0 | Row=1 | Row=1 | Row=1 | Row=2 |
Pos=0 | Pos=2 | Pos=4 | Pos=0 | Pos=1 | Pos=3 | Pos=0 |
| AA | AACCC | AACCC | AACCC | AACCC | AACCC |
DD | DDBBB |
This is a classic Knapsack Problem. As #amit said, it is NP-Complete. The most efficient solution makes use of Dynamic Programming to solve.
The Wikipedia page is a very good start. I've never implemented any algorithm to solve this problem, but I've studied it relation with the minesweeper game, which is also NP-Complete.
Wikipedia: Knapsack Problem

Categories

Resources