string phoneNumber1 = "01234567899";
string phoneNumber2 = "+441234567899";
How do I compare only the last 10 digits of these two strings in C#? They are two different formats of the same UK phone number. I want to compare them to find out whether they match.
Thanks in advance
Approach with Reverse() and Take()
string number1 = "01234567899";
string number2 = "+441234567899";
bool result = number1.Reverse().Take(10).SequenceEqual(number2.Reverse().Take(10));
Use .TakeLast(10) to get the last 10 elements and then do the list comparison with .SequenceEqual:
string phoneNumber1 = "01234567899";
string phoneNumber2 = "+441234567899";
var phoneNumber1_last10 = phoneNumber1.TakeLast(10).ToList();
var phoneNumber2_last10 = phoneNumber2.TakeLast(10).ToList();
// Check the result
Console.WriteLine(phoneNumber1_last10.SequenceEqual(phoneNumber2_last10));
ReadOnlySpan<char> - Version:
public static bool IsMatch(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
{
var a10 = a[^10..];
var b10 = b[^10..];
return a10.Equals(b10, StringComparison.Ordinal);
}
Usable as var isMatch = IsMatch(phoneNumber1, phoneNumber2);
=> https://dotnetfiddle.net/43wNR9
I would also recommend considering a "PhoneNumber" type that parses the country code, if present, and the rest of the number. You can then create an EqualityComparer, override Equals, and so on.
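A minimal sketch of that idea, assuming UK numbers only. The type name UkPhoneNumber, the Parse method, and the normalisation rules (ignore spaces and dashes, drop a leading +44 or a leading 0) are illustrative assumptions, not a complete phone-number parser:

```csharp
using System;

// Hypothetical value type: it stores only the national part of the number,
// so "+441234567899" and "01234567899" compare equal.
public readonly struct UkPhoneNumber : IEquatable<UkPhoneNumber>
{
    private readonly string _national;

    private UkPhoneNumber(string national) => _national = national;

    public static UkPhoneNumber Parse(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        // Assumption: spaces and dashes are the only separators to ignore.
        string s = input.Replace(" ", "").Replace("-", "");

        // Assumption: the number is either international (+44...) or national (0...).
        if (s.StartsWith("+44", StringComparison.Ordinal)) s = s.Substring(3);
        else if (s.StartsWith("0", StringComparison.Ordinal)) s = s.Substring(1);

        if (s.Length == 0) throw new FormatException("Phone number is empty.");
        foreach (char c in s)
        {
            if (c < '0' || c > '9')
                throw new FormatException("Unexpected character in phone number.");
        }
        return new UkPhoneNumber(s);
    }

    public bool Equals(UkPhoneNumber other) =>
        string.Equals(_national, other._national, StringComparison.Ordinal);

    public override bool Equals(object obj) => obj is UkPhoneNumber other && Equals(other);

    public override int GetHashCode() =>
        _national == null ? 0 : StringComparer.Ordinal.GetHashCode(_national);

    public override string ToString() => "+44" + _national;
}
```

With this in place the comparison becomes UkPhoneNumber.Parse(phoneNumber1).Equals(UkPhoneNumber.Parse(phoneNumber2)), and the type can also be used as a Dictionary or HashSet key.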
If you need to call this very often, you should consider this:
BenchmarkDotNet=v0.13.3, OS=Windows 10 (10.0.19044.2364/21H2/November2021Update)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.101
[Host] : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
DefaultJob : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|------------------ |-----------:|----------:|----------:|-------:|----------:|
| MemoryCompareSpan | 5.564 ns | 0.1313 ns | 0.1228 ns | - | - |
| ReverseTake | 467.598 ns | 8.5469 ns | 7.9948 ns | 0.0629 | 528 B |
| TakeLast | 629.914 ns | 4.7967 ns | 4.4868 ns | 0.1068 | 896 B |
// * Hints *
Outliers
Benchmark.MemoryCompareSpan: Default -> 1 outlier was removed, 3 outliers were detected (6.72 ns, 6.76 ns, 7.31 ns)
Benchmark.TakeLast: Default -> 1 outlier was detected (620.55 ns)
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Gen0 : GC Generation 0 collects per 1000 operations
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 ns : 1 Nanosecond (0.000000001 sec)
For when they match and
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|------------------ |-----------:|----------:|----------:|-------:|----------:|
| MemoryCompareSpan | 4.708 ns | 0.1113 ns | 0.1041 ns | - | - |
| ReverseTake | 277.765 ns | 2.3049 ns | 2.1560 ns | 0.0629 | 528 B |
| TakeLast | 583.637 ns | 7.0153 ns | 6.5621 ns | 0.1068 | 896 B |
when they don't.
Why
When comparing and deduplicating across two lists, coders under time pressure often miss the most runtime-efficient implementation. Two nested for-loops are a common go-to solution. One might try a CROSS JOIN with LINQ, but this is clearly inefficient. Coders need a memorable, code-efficient approach that is also relatively runtime-efficient.
This question was created after seeing a more specific one: Delete duplicates in a single dataset relative to another one in C# - it's more specialised with the use of Datasets. The term "dataset" would not help people in the future. No other generalised question was found.
What
I have used the term List/Collection to help with this more general coding problem.
var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M
var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M
var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);
By implementing the deduplicationFunction function the input data and output should be clear. The output can be IEnumerable. The expected output in this input example would be the even numbers from 1-1M {2,4,6,8,...}
Note: There may be duplicates within the referenceSet. The values in both sets are indicative only, so I'm not looking for a mathematical solution - this should also work for random number inputs in both sets.
If this is approached with simple LINQ functions, it will be too slow: on the order of 1M * 0.5M comparisons. There needs to be a faster approach for such large sets.
Speed is important, but incremental improvements with a large bloat of code will be of less value. Also, ideally it would work for other datatypes including data model objects, but answering this specific question should be enough. Other datatypes would simply involve some more pre-processing or slight change to the answer.
Solution Summary
Here's the test code, for results which follow:
using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Test
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Preparing...");
List<int> set1 = new List<int>();
List<int> set2 = new List<int>();
Random r = new Random();
var max = 10000;
for (int i = 0; i < max; i++)
{
set1.Add(r.Next(0, max));
set2.Add(r.Next(0, max/2) * 2);
}
Console.WriteLine("First run...");
Stopwatch sw = new Stopwatch();
IEnumerable<int> result;
int count;
while (true)
{
sw.Start();
result = deduplicationFunction(set1, set2);
var results1 = result.ToList();
count = results1.Count;
sw.Stop();
Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.Elapsed.TotalMilliseconds); //Elapsed does not depend on Stopwatch.Frequency
sw.Reset();
sw.Start();
result = deduplicationFunction2(set1, set2);
var results2 = result.ToList();
count = results2.Count;
sw.Stop();
Console.WriteLine(" HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.Elapsed.TotalMilliseconds);
sw.Reset();
sw.Start();
result = deduplicationFunction3(set1, set2);
var results3 = result.ToList();
count = results3.Count;
sw.Stop();
Console.WriteLine(" Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.Elapsed.TotalMilliseconds);
sw.Reset();
sw.Start();
result = deduplicationFunction4(set1, set2);
var results4 = result.ToList();
count = results4.Count;
sw.Stop();
Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.Elapsed.TotalMilliseconds);
sw.Reset();
set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test
sw.Start();
result = deduplicationFunction5(set1, set2);
var results5 = result.ToList();
count = results5.Count;
sw.Stop();
Console.WriteLine(" Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.Elapsed.TotalMilliseconds);
sw.Reset();
Console.ReadLine();
Console.WriteLine("");
Console.WriteLine("Next Run");
Console.WriteLine("");
}
}
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = Reference
.Distinct() //Inserting duplicate keys in a dictionary will cause an exception
.ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
int throwAway;
return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
}
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var SetAsHash = new HashSet<int>();
Set.ForEach(x =>
{
if (SetAsHash.Contains(x))
return;
SetAsHash.Add(x);
}); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.
SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing
return SetAsHash.AsEnumerable();
}
static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
{
Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.
return deduplicationFunction4(Set, Reference);
}
static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
int i1 = 0;
int i2 = 0;
int thisValue = Set[i1];
int thisReference = Reference[i2];
while (true)
{
var difference = thisReference - thisValue;
if (difference < 0)
{
i2++; //Compare side is too low, there might be an equal value to be found
if (i2 == Reference.Count)
break;
thisReference = Reference[i2];
continue;
}
if (difference > 0) //Reference has moved past this value, so it is not a duplicate
yield return thisValue;
GoFurther:
i1++;
if (i1 == Set.Count)
break;
if (Set[i1] == thisValue) //Eliminates duplicates
goto GoFurther; //I rarely use goto statements, but this is a good situation
thisValue = Set[i1];
}
}
static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
{
var found = false;
var lastValue = 0;
var thisValue = 0;
for (int i = 0; i < Set.Count; i++)
{
thisValue = Set[i];
if (thisValue == lastValue)
continue;
lastValue = thisValue;
found = false;
for (int x = 0; x < Reference.Count; x++)
{
if (thisValue != Reference[x])
continue;
found = true;
break;
}
if (found)
continue;
yield return thisValue;
}
}
}
}
I'll use this to compare performance of multiple approaches. (I'm particularly interested in Hash-approach vs dual-index-on-sorted-approach at this stage, although ExceptWith enables a terse solution)
Results so far on 10k items in set (Good Run):
First Run
Dictionary and Where - Count: 3565, Milliseconds: 16.38.
HashSet ExceptWith - Count: 3565, Milliseconds: 5.33.
Sort Dual Index - Count: 3565, Milliseconds: 6.34.
Presorted Dual Index - Count: 3565, Milliseconds: 1.14.
Nested Index - Count: 3565, Milliseconds: 964.16.
Good Run
Dictionary and Where - Count: 3565, Milliseconds: 1.21.
HashSet ExceptWith - Count: 3565, Milliseconds: 0.94.
Sort Dual Index - Count: 3565, Milliseconds: 1.09.
Presorted Dual Index - Count: 3565, Milliseconds: 0.76.
Nested Index - Count: 3565, Milliseconds: 628.60.
Chosen answer:
@Backs's HashSet.ExceptWith approach is marginally faster with minimal code and uses an interesting function, ExceptWith. However, it is weakened by its lack of versatility and by the fact that ExceptWith is less commonly known.
One of my answers, HashSet > Where(..Contains..), is only a tiny bit slower than @Backs's, but it uses a LINQ code pattern that is very versatile beyond lists of primitive elements. I believe this is the more common scenario I find myself in when coding, and I trust this is the case for many other coders.
Special thanks to @TheGeneral for benchmarking some of the answers, for some interesting unsafe versions, and for helping to make @Backs's answer more efficient for a follow-up test.
Use a HashSet for your initial list and the ExceptWith method to get the result set:
var setToDeduplicate = new HashSet<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M
var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M
setToDeduplicate.ExceptWith(referenceSet);
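As a small runnable sketch of the same idea (the ExceptWithDemo class and Deduplicate method names are made up for illustration, and the final OrderBy is only there to give a stable order, since HashSet does not guarantee one):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptWithDemo
{
    // Copies the input into a HashSet (which also removes duplicates within it),
    // then removes every value that appears in the reference collection.
    public static List<int> Deduplicate(IEnumerable<int> set, IEnumerable<int> reference)
    {
        var result = new HashSet<int>(set);
        result.ExceptWith(reference);               // duplicates in reference are harmless
        return result.OrderBy(x => x).ToList();     // sorted only for predictable output
    }
}
```

For example, ExceptWithDemo.Deduplicate(Enumerable.Range(1, 10), new[] { 1, 3, 5, 7, 9, 9 }) returns 2, 4, 6, 8, 10.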
Here are some more benchmarks. Basically, I wanted to test both distinct and non-distinct input against a variety of solutions. In the non-distinct version, I had to call Distinct where needed on the final output.
Mode : Release (64Bit)
Test Framework : .NET Framework 4.7.1
Operating System : Microsoft Windows 10 Pro
Version : 10.0.17134
CPU Name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description : Intel64 Family 6 Model 58 Stepping 9
Cores (Threads) : 4 (8) : Architecture : x64
Clock Speed : 3901 MHz : Bus Speed : 100 MHz
L2Cache : 1 MB : L3Cache : 8 MB
Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000
Results Distinct input
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs | 0.008 ms | 0.007 ms | 31,362 | 8.000 KB | Pass | 68.34 % |
| ListUnsafe | 0.009 ms | 0.008 ms | 35,487 | 8.000 KB | Pass | 63.45 % |
| HasSet | 0.012 ms | 0.011 ms | 46,840 | 8.000 KB | Pass | 50.03 % |
| ArrayUnsafe | 0.013 ms | 0.011 ms | 49,388 | 8.000 KB | Pass | 47.75 % |
| HashSetUnsafe | 0.018 ms | 0.013 ms | 66,866 | 16.000 KB | Pass | 26.62 % |
| Todd | 0.024 ms | 0.019 ms | 90,763 | 16.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs | 0.070 ms | 0.060 ms | 249,374 | 28.977 KB | Pass | 57.56 % |
| ListUnsafe | 0.078 ms | 0.067 ms | 277,080 | 28.977 KB | Pass | 52.67 % |
| HasSet | 0.093 ms | 0.083 ms | 329,686 | 28.977 KB | Pass | 43.61 % |
| ArrayUnsafe | 0.096 ms | 0.082 ms | 340,154 | 36.977 KB | Pass | 41.72 % |
| HashSetUnsafe | 0.103 ms | 0.085 ms | 367,681 | 55.797 KB | Pass | 37.07 % |
| Todd | 0.164 ms | 0.151 ms | 578,933 | 112.664 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe | 0.706 ms | 0.611 ms | 2,467,327 | 258.516 KB | Pass | 48.60 % |
| Backs | 0.758 ms | 0.654 ms | 2,656,610 | 180.297 KB | Pass | 44.81 % |
| ArrayUnsafe | 0.783 ms | 0.696 ms | 2,739,156 | 276.281 KB | Pass | 43.02 % |
| HasSet | 0.859 ms | 0.752 ms | 2,999,230 | 198.063 KB | Pass | 37.47 % |
| HashSetUnsafe | 0.864 ms | 0.783 ms | 3,029,086 | 332.273 KB | Pass | 37.07 % |
| Todd | 1.373 ms | 1.251 ms | 4,795,929 | 604.742 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe | 5.624 ms | 4.874 ms | 19,658,154 | 2.926 MB | Pass | 40.36 % |
| HasSet | 7.574 ms | 6.548 ms | 26,446,193 | 2.820 MB | Pass | 19.68 % |
| Backs | 7.585 ms | 5.634 ms | 26,303,794 | 2.009 MB | Pass | 19.57 % |
| ArrayUnsafe | 8.287 ms | 6.219 ms | 28,923,797 | 3.583 MB | Pass | 12.12 % |
| Todd | 9.430 ms | 7.326 ms | 32,880,985 | 2.144 MB | Base | 0.00 % |
| HashSetUnsafe | 9.601 ms | 7.859 ms | 32,845,228 | 5.197 MB | Pass | -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe | 57.751 ms | 44.734 ms | 201,477,028 | 29.309 MB | Pass | 22.14 % |
| Backs | 65.567 ms | 49.023 ms | 228,772,283 | 21.526 MB | Pass | 11.61 % |
| HasSet | 73.163 ms | 56.799 ms | 254,703,994 | 25.904 MB | Pass | 1.36 % |
| Todd | 74.175 ms | 53.739 ms | 258,760,390 | 9.144 MB | Base | 0.00 % |
| ArrayUnsafe | 86.530 ms | 67.803 ms | 300,374,535 | 13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 | 39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------
Results Random List using Distinct on results where needed
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs | 0.007 ms | 0.006 ms | 28,449 | 8.000 KB | Pass | 72.96 % |
| HasSet | 0.010 ms | 0.009 ms | 38,222 | 8.000 KB | Pass | 62.05 % |
| HashSetUnsafe | 0.014 ms | 0.010 ms | 51,816 | 16.000 KB | Pass | 47.52 % |
| ListUnsafe | 0.017 ms | 0.014 ms | 64,333 | 16.000 KB | Pass | 33.84 % |
| ArrayUnsafe | 0.020 ms | 0.015 ms | 72,468 | 16.000 KB | Pass | 24.70 % |
| Todd | 0.026 ms | 0.021 ms | 95,500 | 24.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs | 0.061 ms | 0.053 ms | 219,141 | 28.977 KB | Pass | 70.46 % |
| HasSet | 0.092 ms | 0.080 ms | 325,353 | 28.977 KB | Pass | 55.78 % |
| HashSetUnsafe | 0.093 ms | 0.079 ms | 331,390 | 55.797 KB | Pass | 55.03 % |
| ListUnsafe | 0.122 ms | 0.101 ms | 432,029 | 73.016 KB | Pass | 41.19 % |
| ArrayUnsafe | 0.133 ms | 0.113 ms | 469,560 | 73.016 KB | Pass | 35.88 % |
| Todd | 0.208 ms | 0.173 ms | 730,661 | 148.703 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs | 0.620 ms | 0.579 ms | 2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet | 0.696 ms | 0.635 ms | 2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe | 0.731 ms | 0.679 ms | 2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe | 0.804 ms | 0.761 ms | 2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe | 0.810 ms | 0.751 ms | 2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd | 1.388 ms | 1.271 ms | 4,863,651 | 736.953 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs | 5.604 ms | 4.710 ms | 19,600,934 | 2.009 MB | Pass | 62.92 % |
| HasSet | 6.607 ms | 5.847 ms | 23,093,963 | 2.820 MB | Pass | 56.29 % |
| HashSetUnsafe | 8.565 ms | 7.465 ms | 29,239,067 | 5.197 MB | Pass | 43.34 % |
| ListUnsafe | 11.447 ms | 9.543 ms | 39,452,865 | 5.101 MB | Pass | 24.28 % |
| ArrayUnsafe | 11.517 ms | 9.841 ms | 39,731,502 | 5.483 MB | Pass | 23.81 % |
| Todd | 15.116 ms | 11.369 ms | 51,963,309 | 3.427 MB | Base | 0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs | 53.766 ms | 44.321 ms | 187,905,335 | 21.526 MB | Pass | 51.32 % |
| HasSet | 60.759 ms | 50.742 ms | 212,409,649 | 25.904 MB | Pass | 44.99 % |
| HashSetUnsafe | 79.248 ms | 67.130 ms | 275,455,545 | 39.527 MB | Pass | 28.25 % |
| ListUnsafe | 106.527 ms | 90.159 ms | 370,838,650 | 39.153 MB | Pass | 3.55 % |
| Todd | 110.444 ms | 93.225 ms | 384,636,081 | 22.676 MB | Base | 0.00 % |
| ArrayUnsafe | 114.548 ms | 98.033 ms | 398,219,513 | 38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------
Data
private Tuple<List<int>, List<int>> GenerateData(int scale)
{
return new Tuple<List<int>, List<int>>(
Enumerable.Range(0, scale)
.Select(x => x)
.ToList(),
Enumerable.Range(0, scale)
.Select(x => Rand.Next(10000))
.ToList());
}
Code
public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var hashSet = new HashSet<int>(Input.Item1);
hashSet.ExceptWith(Input.Item2);
return hashSet.ToList();
}
}
public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var hashSet = new HashSet<int>(Input.Item2);
return Input.Item1.Where(y => !hashSet.Contains(y)).ToList();
}
}
public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var referenceHashSet = Input.Item2.Distinct()
.ToDictionary(x => x, x => x);
return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
}
}
public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new HashSet<int>();
fixed (int* pAry = Input.Item1.ToArray())
{
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
result.Add(*p);
}
}
return result.ToList();
}
}
public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new List<int>(Input.Item2.Count);
fixed (int* pAry = Input.Item1.ToArray())
{
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
result.Add(*p);
}
}
return result.ToList();
}
}
public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new int[Input.Item1.Count];
fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
{
var j = 0;
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
*(pRes+j++) = *p;
}
return result.Take(j).ToList();
}
}
}
Summary
No surprises here, really: if you start with a distinct list, some solutions do better; if not, the simplest HashSet version is the best.
Single loop Dual-Index
As recommended by @PepitoSh in the Question comments:
I think HashSet is a very generic solution to a rather specific
problem. If your lists are ordered, scanning them parallel and compare
the current items is the fastest
This is very different from having two nested loops. Instead, there is a single loop, and the two indexes are advanced in ascending order in parallel, depending on the difference between the current values. The difference is essentially the output of any normal comparison function: { negative, 0, positive }
static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
int i1 = 0;
int i2 = 0;
int thisValue = Set[i1];
int thisReference = Reference[i2];
while (true)
{
var difference = thisReference - thisValue;
if (difference < 0)
{
i2++; //Compare side is too low, there might be an equal value to be found
if (i2 == Reference.Count)
break;
thisReference = Reference[i2];
continue;
}
if (difference > 0) //Reference has moved past this value, so it is not a duplicate
yield return thisValue;
GoFurther:
i1++;
if (i1 == Set.Count)
break;
if (Set[i1] == thisValue) //Eliminates duplicates
goto GoFurther; //I rarely use goto statements, but this is a good situation
thisValue = Set[i1];
}
}
How to call this function, if the lists aren't yet sorted:
Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.
return deduplicationFunction4(Set, Reference);
This gave me the best performance in my benchmarking. This could probably also be tried with unsafe code for more of a speedup in some scenarios. In scenarios where the data is already sorted, this is by far the best. A faster sorting algorithm might also be selected, but that is not the subject of this question.
Note: This method deduplicates as it goes.
I have actually coded such a single loop pattern before when finalising text-search results, except I had N arrays to check for "closeness". So I had an array of indexes - array[index[i]]. So I'm sure having a single loop with controlled index incrementing isn't a new concept, but it's certainly a great solution here.
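For comparison, here is a sketch of the same single-pass idea written without goto and without the sentinel item appended to Reference. It is a variant, not the code benchmarked above; the SortedDifference class and Except method names are made up for illustration, and both inputs are assumed to be sorted ascending already:

```csharp
using System.Collections.Generic;

public static class SortedDifference
{
    // Yields each distinct value of 'set' that does not occur in 'reference'.
    // Assumption: both lists are sorted ascending before the call.
    public static IEnumerable<int> Except(IReadOnlyList<int> set, IReadOnlyList<int> reference)
    {
        int r = 0; // index into reference; it only ever moves forward
        for (int s = 0; s < set.Count; s++)
        {
            int value = set[s];

            // Skip repeats within the set itself.
            if (s > 0 && set[s - 1] == value)
                continue;

            // Advance reference to the first item that is not smaller than value.
            while (r < reference.Count && reference[r] < value)
                r++;

            // No match if reference is exhausted or has already moved past value.
            if (r == reference.Count || reference[r] != value)
                yield return value;
        }
    }
}
```

Comparing with < and != instead of subtracting also avoids integer overflow when the values are far apart. If the lists are not sorted yet, call Sort() on both first, as shown above.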
HashSet and Where
You must use a HashSet (or Dictionary) for speed:
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = Reference
.Distinct() //Inserting duplicate keys in a dictionary will cause an exception
.ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
int throwAway;
return Set.Where(y => !ReferenceHashSet.TryGetValue(y, out throwAway)); //Keep only the values that are NOT in the reference
}
That's a lambda expression version. It uses Dictionary which provides adaptability for varying the value if needed. Literal for-loops could be used and perhaps some more incremental performance improvement gained, but relative to having two-nested-loops, this is already an amazing improvement.
Learning a few things while looking at other answers, here is a faster implementation:
static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = new HashSet<int>(Reference);
return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}
Importantly, this approach (while a tiny bit slower than @Backs's answer) is still versatile enough to use for database entities, AND other types can easily be used on the duplicate check field.
Here's an example of how the code is easily adjusted for use with a Person kind of database entity list.
static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
.GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}
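On .NET 6 or later, the GroupBy/First pair can be replaced with the built-in DistinctBy. A minimal sketch, assuming a hypothetical Person record where only the ID property comes from the example above (Name is made up so there is something to inspect):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity for illustration only.
public record Person(int ID, string Name);

public static class PeopleDeduplication
{
    public static IEnumerable<Person> DeduplicatePeople(
        IEnumerable<Person> set, IEnumerable<Person> reference)
    {
        // Hash the reference IDs once so each lookup is cheap.
        var referenceIds = new HashSet<int>(reference.Select(p => p.ID));

        return set
            .Where(p => !referenceIds.Contains(p.ID))
            .DistinctBy(p => p.ID); // keeps the first Person seen for each ID
    }
}
```

On older frameworks, where DistinctBy is not available, the GroupBy version above does the same job.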
This is almost an impossible question to ask, but any advice on the algorithm would be greatly appreciated (I will explain the best I can);
I have an array of size ~4000 bytes which contains data in byte format.
For this demonstration, I am going to simplify things a bit; say it's size 7 (to represent 'blocks' of data, not single values!);
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
I am adding a value at position 0, with the rest of the array being '0'
key: N = newest, O = oldest, X = Filled
| N | | | | | | |
I now need to add another value. this will be entered at the next available position.
| O | N | | | | | |
So now position [0] is now the 'oldest' part of the array, and position [1] is the newest.
This has been (currently) worked out by looking all the way right, seeing no values, and then starting from position [0] until it sees a value.
Let's add another:
| O | X | N | | | | |
Note, the oldest value hasn't changed position, as it is still the oldest part of the array.
I am now going to 'clear' the oldest part of the array (in this example it is currently pos [0]). this makes 'O' move over to the next position.
| | O | N | | | | |
Let's add another value. Since it will go to the first 'empty' space, it will go to position [0]; this means the newest value is now at position [0].
| N | O | X | | | | |
I'm going to clear another one now; so again, by looking to the right of the newest value, I see a value at position 1, so I'm going to clear it.
| | | O/N | | | | |
this means position [2] is now both the newest and oldest value available.
Adding another makes;
| N | | O | | | | |
Adding another;
| X | N | O | | | | |
and adding another;
| X | X | O | N | | | |
I am looking to delete the oldest value now. So by looking right from position of the 'newest' variable, I see pos[0] has a value, so that must be it. UH-OH that's not the oldest value!
As you can (hopefully) tell, I am unable to reliably get the oldest ticket by looking to the right for my next value. This problem only occurs every so often, and it has been hard to find a solution.
I only know the index of the most recent value added, which makes it very hard to find a solution (lots of scribbling and diagrams have been attempted, and lots of scrumpled up paper).
So if anyone had any ideas as to how I could ALWAYS find the oldest value's index, I would be greatly appreciative! (I also know this is quite a complex question, so if anyone wants/needs clarification, I'll be happy to edit/explain further!) I have tagged c#, but realistically I only need a BASIC algorithm for any progress to be honest!!!
====================================================================================
EDIT
Answers have suggested to allocate to the right of the 'newest' position;
like:
| | | O | N | | | |
| | | O | X | N | | |
| | | O | X | X | N | |
| | | O | X | X | X | N |
| N | | O | X | X | X | X |
| X | N | O | X | X | X | X |
Which I think COULD work, but does anyone know if this would fail (say, if I removed a value at a certain time, etc.)?
I guess you are forced to use an array; if not then you should consider switching to an adequate data structure such as a Queue.
If you are indeed forced to use an array, and can only keep a pointer to the latest block, then I would recommend always adding new blocks to the right of the latest block, with an index wrapping back to zero at the array size.
This lets you determine what the oldest block is by looking to the blocks right of the latest block, until you find a non empty block: this is your oldest block. Null it to remove it from the array and carry on :)
Let's illustrate:
| N | | | | | | | // newBlockIndex at 0, adding, newBlockIndex becomes 1
| X | N | | | | | | // newBlockIndex at 1, adding, newBlockIndex becomes 2
| X | X | N | | | | | // newBlockIndex at 2, adding, newBlockIndex becomes 3
| | X | N | | | | | // newBlockIndex at 3, removing, no item before index 0, we delete it
| | X | X | N | | | | // newBlockIndex at 3, adding, newBlockIndex becomes 4
...
EDIT TO ADD
Regarding your edit, I think the mechanism is quite robust. Even if you were to remove an item (any item, even the latest one) by mistake, the next operation can still succeed, because oldest and newest are defined by their position relative to the current index: the newest item is the first one to the left of the index, and the oldest is the first one to the right.
Even if you don't check your array size and fill it completely (which I don't recommend), the algorithm will overwrite the oldest item with the newest. That may not be good, but it is consistent with the notion of a queue. Of course, if the array fills up, you can always allocate a larger one and copy the current contents into it.
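A rough sketch of this mechanism, under a few assumptions of mine: blocks are stored as byte[] slots, null marks an empty slot, and the class and member names (ScanningBlockBuffer, Add, FindOldestIndex, RemoveOldest) are made up for illustration. Unlike the overwrite behaviour described above, this version throws when the buffer is full:

```csharp
using System;

public class ScanningBlockBuffer
{
    private readonly byte[][] _slots; // null means "empty slot"
    private int _next;                // index where the next block will be written

    public ScanningBlockBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _slots = new byte[capacity][];
    }

    // Always writes to the right of the latest block, wrapping at the array size.
    public void Add(byte[] block)
    {
        if (block == null) throw new ArgumentNullException(nameof(block));
        if (_slots[_next] != null) throw new InvalidOperationException("Buffer is full.");
        _slots[_next] = block;
        _next = (_next + 1) % _slots.Length;
    }

    // Scans to the right of the write index (with wrap-around); the first
    // non-empty slot found is the oldest block. Returns -1 when empty.
    public int FindOldestIndex()
    {
        for (int step = 0; step < _slots.Length; step++)
        {
            int i = (_next + step) % _slots.Length;
            if (_slots[i] != null) return i;
        }
        return -1;
    }

    public byte[] RemoveOldest()
    {
        int i = FindOldestIndex();
        if (i < 0) throw new InvalidOperationException("Buffer is empty.");
        byte[] block = _slots[i];
        _slots[i] = null; // "null it" to free the slot
        return block;
    }
}
```

The scan is O(n) in the worst case; keeping a second index for the oldest block, as a normal queue does, avoids it.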
What you are looking for is a queue data structure.
Queues can be conveniently implemented with a circular buffer where you have a head index and a tail index.
Head and Tail are both initially set to zero.
Add a new element by writing it to where Tail points, then increment Tail. Wrap as needed if incrementing makes it go off the end of the array.
Delete an old element by incrementing Head. Again, wrap as needed if incrementing makes it go off the end of the array.
Head always points at the oldest element.
Tail always points to the right of the newest element.
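A minimal sketch of such a circular buffer. The CircularQueue name is made up, and the _count field is my addition: with only Head and Tail, a full buffer and an empty buffer both have Head == Tail, so something extra is needed to tell them apart.

```csharp
using System;

public class CircularQueue<T>
{
    private readonly T[] _items;
    private int _head;   // index of the oldest element
    private int _tail;   // index just to the right of the newest element
    private int _count;  // number of stored elements (distinguishes full from empty)

    public CircularQueue(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
    }

    public int Count => _count;

    // Write where Tail points, then advance Tail, wrapping at the end of the array.
    public void Enqueue(T item)
    {
        if (_count == _items.Length) throw new InvalidOperationException("Queue is full.");
        _items[_tail] = item;
        _tail = (_tail + 1) % _items.Length;
        _count++;
    }

    // Read where Head points, then advance Head, wrapping at the end of the array.
    public T Dequeue()
    {
        if (_count == 0) throw new InvalidOperationException("Queue is empty.");
        T item = _items[_head];
        _items[_head] = default(T); // clear the slot so it no longer holds the old value
        _head = (_head + 1) % _items.Length;
        _count--;
        return item;
    }
}
```

Both operations are O(1), and no scanning is needed to find the oldest element.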
Use a System.Collections.Generic.Queue<T>, where T is a byte block.
Queue<byte[]> queue = new Queue<byte[]>();
byte[] block;
queue.Enqueue(new byte[] { 10, 11, 12, 13 });
queue.Enqueue(new byte[] { 20, 21, 22, 23 });
queue.Enqueue(new byte[] { 30, 31, 32, 33 });
block = queue.Dequeue();
queue.Enqueue(new byte[] { 40, 41, 42, 43 });
block = queue.Dequeue();
block = queue.Dequeue();
queue.Enqueue(new byte[] { 50, 51, 52, 53 });
queue.Enqueue(new byte[] { 60, 61, 62, 63 });
queue.Enqueue(new byte[] { 70, 71, 72, 73 });
block = queue.Dequeue();
// ...
Dequeue always removes the oldest element!
Since you have clarified in comments that it must be an array, here is a solution that encapsulates an array-queue in a class. It treats consecutive elements as a data block of a defined size. It also allows you to access array elements by index, as well as the array itself. This is not typical for queues, but since you need the array...
public class ArrayBlocksQueue<T>
{
private T[] _array;
private int _in, _out, _count, _length, _blockSize;
public ArrayBlocksQueue(int maxBlocks, int blockSize)
{
_length = maxBlocks * blockSize;
_blockSize = blockSize;
_array = new T[_length];
}
public void Enqueue(params T[] block)
{
if (block == null) {
throw new ArgumentNullException();
}
if (block.Length != _blockSize) {
throw new ArgumentException("Data does not have required block size.");
}
if (_count + _blockSize > _length) {
throw new ApplicationException("Queue is full");
}
block.CopyTo(_array, _in);
_in = (_in + _blockSize) % _length;
_count += _blockSize;
}
public T[] Dequeue()
{
if (_count == 0) {
throw new ApplicationException("Queue is empty");
}
T[] temp = new T[_blockSize];
System.Array.Copy(_array, _out, temp, 0, _blockSize);
_out = (_out + _blockSize) % _length;
_count -= _blockSize;
return temp;
}
public int Count { get { return _count; } }
public int BlockCount { get { return _count / _blockSize; } }
public T[] Array { get { return _array; } }
public T this[int index]
{
get
{
if (!IsIndexValid(index)) {
throw new IndexOutOfRangeException();
}
return _array[index];
}
set
{
if (!IsIndexValid(index)) {
throw new IndexOutOfRangeException();
}
_array[index] = value;
}
}
public bool IsIndexValid(int index)
{
if (index < 0 || index >= _length) {
return false;
}
if (_count == _length) {
return true;
}
return _out > _in
? index < _in || index >= _out
: index >= _out && index < _in;
}
}
I am practising implementing some basic layer 7 protocols, but I am unsure of the best way of serialising and deserialising bits in the .NET Framework.
According to the MSDN Data Type Summary, there is no bit data type. I have no idea how I would go about creating such a data type or even if it's possible so I am left with serialising/deserialising to a byte / byte array.
Given the following example from the top of an NTP packet:
0-1 LeapIndicator (LI) 2 bits
2-4 VersionNumber (VN) 3 bits
5-7 Mode 3 bits
8-15 Stratum 8 bits
I would like to encode into 2 bytes so I can send via the socket.
Also, I am currently using ints to represent the bits in enums. Is it possible to use bits/hex or something better than ints? For example, the Mode enum is defined as follows:
public enum Mode
{
/*
+-------+--------------------------+
| Value | Meaning |
+-------+--------------------------+
| 0 | reserved |
| 1 | symmetric active |
| 2 | symmetric passive |
| 3 | client |
| 4 | server |
| 5 | broadcast |
| 6 | NTP control message |
| 7 | reserved for private use |
+-------+--------------------------+
*/
Reserved = 0,
SymmetricActive = 1,
SymmetricPassive = 2,
Client = 3,
Server = 4,
Broadcast = 5,
ControlMessage = 6,
PrivateUse = 7
}
Side Note: The code for this project will eventually be open sourced, so please bear that in mind if you answer. If you do not wish for the code to be shared, please say :) A link will be placed in the code back to this question.
Thanks in advance :)
Update: In case people are wondering what the NTP packet structure looks like, taken directly from RFC 5905, page 18
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|LI | VN  |Mode |    Stratum    |     Poll      |   Precision   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Root Delay                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Root Dispersion                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Reference ID                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                   Reference Timestamp (64)                    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                     Origin Timestamp (64)                     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                    Receive Timestamp (64)                     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                    Transmit Timestamp (64)                    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
.                                                               .
.                 Extension Field 1 (variable)                  .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
.                                                               .
.                 Extension Field 2 (variable)                  .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Key Identifier                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                          dgst (128)                           |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
I don't think I'd use an enum here at all. I'd probably create a struct to represent the packet header, storing the data in a ushort (16 bits):
public struct NtpHeader
{
private readonly ushort bits;
// Creates a header from a portion of a byte array, e.g.
// given a complete packet and the index within it
public NtpHeader(byte[] data, int index)
{
// Low byte comes from data[index], high byte from data[index + 1]
bits = (ushort) (data[index] + (data[index + 1] << 8));
}
public NtpHeader(int leapIndicator, int versionNumber,
int mode, int stratum)
{
// TODO: Validation
bits = (ushort) (leapIndicator |
(versionNumber << 2) |
(mode << 5) |
(stratum << 8));
}
public int LeapIndicator { get { return bits & 3; } }
public int VersionNumber { get { return (bits >> 2) & 7; } }
public int Mode { get { return (bits >> 5) & 7; } }
public int Stratum { get { return bits >> 8; } }
}
You'll want to check this though - it's not immediately clear what bit arrangement is really represented in the RFC. If you have sample packets with expected values, that would make things much clearer.
FYI, there is a struct that represents a bit in .NET: System.Boolean. As mentioned by Marc, the protocol is in even bytes, so you could use an int (with each int holding 32 bits), or use enums in a bitmask style. Either way, you can use System.BitConverter's static methods to convert to and from byte arrays.
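As a rough sketch of the BitConverter route, the helper below converts a 16-bit value to and from two bytes. The class and method names (`UInt16Wire`, `ToNetworkBytes`, `FromNetworkBytes`) are hypothetical. Note that BitConverter uses the machine's own byte order, so the sketch checks `BitConverter.IsLittleEndian` and reverses the bytes when needed, on the assumption that the wire format is big-endian (network order).

```csharp
using System;

// Hypothetical helper; names are illustrative only.
public static class UInt16Wire
{
    // Returns the two bytes of value in big-endian (network) order.
    public static byte[] ToNetworkBytes(ushort value)
    {
        byte[] bytes = BitConverter.GetBytes(value); // machine byte order
        if (BitConverter.IsLittleEndian) Array.Reverse(bytes);
        return bytes;
    }

    // Reads a big-endian 16-bit value starting at data[index].
    public static ushort FromNetworkBytes(byte[] data, int index)
    {
        byte[] copy = { data[index], data[index + 1] };
        if (BitConverter.IsLittleEndian) Array.Reverse(copy);
        return BitConverter.ToUInt16(copy, 0);
    }
}
```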
Have you considered using the Flags attribute? It allows you to treat enumerated type values as bits instead of ints:
http://msdn.microsoft.com/en-us/library/system.flagsattribute.aspx
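For illustration, a minimal sketch of a [Flags] enum backed by a byte follows. The enum `PacketOptions` and its members are made up for this example and are not part of NTP. Flags suits independent single-bit options; multi-bit numeric fields such as VN or Mode still need shifting and masking.

```csharp
using System;

// Hypothetical single-bit options, not taken from any real protocol.
[Flags]
public enum PacketOptions : byte
{
    None = 0,
    Authenticated = 1 << 0,
    Extended = 1 << 1,
    Urgent = 1 << 2
}

public static class FlagsDemo
{
    // The enum's underlying byte is the packed bit pattern.
    public static byte Pack(PacketOptions options)
    {
        return (byte)options;
    }

    // Tests whether every bit of flag is set in the raw byte.
    public static bool IsSet(byte raw, PacketOptions flag)
    {
        return ((PacketOptions)raw & flag) == flag;
    }
}
```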
The smallest underlying type of an enum in C# is byte (the other types available are explained here http://msdn.microsoft.com/en-us/library/sbbt4032.aspx).
Define an enum of type byte:
enum Name:byte{}
in your example:
public enum Mode:byte
{
/*
+-------+--------------------------+
| Value | Meaning |
+-------+--------------------------+
| 0 | reserved |
| 1 | symmetric active |
| 2 | symmetric passive |
| 3 | client |
| 4 | server |
| 5 | broadcast |
| 6 | NTP control message |
| 7 | reserved for private use |
+-------+--------------------------+
*/
Reserved = 0,
SymmetricActive = 1,
SymmetricPassive = 2,
Client = 3,
Server = 4,
Broadcast = 5,
ControlMessage = 6,
PrivateUse = 7
}
If we are willing to trade some readability for space, note that LeapIndicator (2 bits) + VersionNumber (3 bits) + Mode (3 bits) = 8 bits = 1 byte, and Stratum = 8 bits = 1 byte, so the four fields fit in two bytes.
Serialize: To put your packet fields into the result, shift each field left by the appropriate number of bits (each one-bit left shift multiplies by 2), then OR it with the cumulative result so far.
Deserialize: To extract your packet fields from the result, apply an AND bitmask to isolate the field, then shift right by the appropriate number of bits (each one-bit right shift divides by 2).
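A minimal sketch of both directions for the first two bytes of the packet might look like this. The class name `NtpHeaderCodec` and its method signatures are hypothetical. It assumes that bit 0 in the RFC diagram is the most significant bit of the first byte, so LI occupies the top two bits; it is worth verifying against a captured packet.

```csharp
using System;

// Hypothetical codec for the LI/VN/Mode/Stratum fields; names are illustrative.
public static class NtpHeaderCodec
{
    // Serialize: shift each field into place and OR the results together.
    public static byte[] Encode(int leapIndicator, int versionNumber, int mode, int stratum)
    {
        if (leapIndicator < 0 || leapIndicator > 3) throw new ArgumentOutOfRangeException(nameof(leapIndicator));
        if (versionNumber < 0 || versionNumber > 7) throw new ArgumentOutOfRangeException(nameof(versionNumber));
        if (mode < 0 || mode > 7) throw new ArgumentOutOfRangeException(nameof(mode));
        if (stratum < 0 || stratum > 255) throw new ArgumentOutOfRangeException(nameof(stratum));

        // Assumed layout of the first byte: LLVVVMMM (most significant bit first).
        byte first = (byte)((leapIndicator << 6) | (versionNumber << 3) | mode);
        return new byte[] { first, (byte)stratum };
    }

    // Deserialize: mask out each field, then shift it down to bit 0.
    public static void Decode(byte[] data, int index,
        out int leapIndicator, out int versionNumber, out int mode, out int stratum)
    {
        leapIndicator = (data[index] & 0xC0) >> 6; // top 2 bits
        versionNumber = (data[index] & 0x38) >> 3; // middle 3 bits
        mode = data[index] & 0x07;                 // low 3 bits, no shift needed
        stratum = data[index + 1];                 // second byte is a whole field
    }
}
```

Under that assumption, Encode(0, 4, 3, 2) produces a first byte of 0x23 (version 4, client mode) and a second byte of 2.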