I've been doing some reading on the generic Dictionary class, and the general advice is to use a Dictionary if you need really fast access to an item matching a specific key. This is because a dictionary uses a type-safe hash table under the hood. When accessing items, the lookup complexity is O(1) for a dictionary, whereas with a List we would need to loop through every single item until we find a match, making the complexity O(n).
I wrote a little console app to see just how significant the difference between the two would be. The app stores 10 million items in each collection and attempts to access the second-last item. The time difference between the List and the Dictionary<TKey,TValue> is only one second, making the dictionary a winner, but only just.
Question: can you provide an example (verbal is fine) where using a Dictionary vs. a List would yield significant performance improvements?
class Program
{
    static void Main(string[] args)
    {
        var iterations = 10000000; // 10 million
        var sw = new Stopwatch();

        sw.Start();
        var value1 = GetSecondLastFromDictionary(iterations);
        sw.Stop();
        var t1 = sw.Elapsed.ToString();

        sw.Restart();
        var value2 = GetSecondLastFromList(iterations);
        sw.Stop();
        var t2 = sw.Elapsed.ToString();

        Console.WriteLine($"Dictionary - {t1}\nList - {t2}");
        Console.ReadKey();
    }

    private static string GetSecondLastFromList(int iterations)
    {
        var collection = new List<Test>();
        for (var i = 0; i < iterations; i++)
            collection.Add(new Test { Key = i, Value = $"#{i}" });
        return collection.Where(e => e.Key == iterations - 1).First().Value;
    }

    private static string GetSecondLastFromDictionary(int iterations)
    {
        var collection = new Dictionary<int, string>();
        for (var i = 0; i < iterations; i++)
            collection.Add(i, $"#{i}");
        return collection[iterations - 1];
    }
}

class Test
{
    public int Key { get; set; }
    public string Value { get; set; }
}
Your own example is fine to show where using a Dictionary yields significant performance improvements. The problem is you're not looking at the right thing. Your code spends a lot of time creating the dictionary or list and then does just one access of it. You need to separate out the collection creation and time multiple accesses of the item.
The code below does this. For me, the repeated dictionary accesses take 0.001 s in total, whereas the same number of list accesses takes 2 minutes 32 seconds. Assuming I've done that right, I think it shows dictionaries are faster for access.
static void Main(string[] args)
{
    var iterations = 100000;
    var sw = new Stopwatch();
    var dict = CreateDict(iterations);
    var list = CreateList(iterations);

    sw.Start();
    GetSecondLastFromDictionary(iterations, dict);
    sw.Stop();
    var t1 = sw.Elapsed.ToString();

    sw.Restart();
    GetSecondLastFromList(iterations, list);
    sw.Stop();
    var t2 = sw.Elapsed.ToString();

    Console.WriteLine($"Dictionary - {t1}\nList - {t2}");
    Console.ReadKey();
}

private static Dictionary<int, string> CreateDict(int iterations)
{
    var collection = new Dictionary<int, string>();
    for (var i = 0; i < iterations; i++)
        collection.Add(i, $"#{i}");
    return collection;
}

private static List<Test> CreateList(int iterations)
{
    var collection = new List<Test>();
    for (var i = 0; i < iterations; i++)
        collection.Add(new Test { Key = i, Value = $"#{i}" });
    return collection;
}

private static void GetSecondLastFromList(int iterations, List<Test> collection)
{
    string test;
    for (var i = 0; i < iterations; i++)
        test = collection.Where(e => e.Key == iterations - 1).First().Value;
}

private static void GetSecondLastFromDictionary(int iterations, Dictionary<int, string> collection)
{
    string test;
    for (var i = 0; i < iterations; i++)
        test = collection[iterations - 1];
}
Related
I have an application which uses about 5 different "lists of lists", but I am only using index 0 or index 1.
Is this bad practice, or will it lead to poor performance?
Here is an example I made, similar to what I'm doing:
internal class Program
{
    private const int Count = 64;
    private static int _index;
    private static List<List<int>> _data = new List<List<int>>();
    private static List<List<int>> _dataprevious = new List<List<int>>();
    private static List<List<double>> _datacalculated = new List<List<double>>();
    private static Random _rand = new Random();

    private static void GetData(object o)
    {
        // Clear list and add new data
        _data[_index].Clear();
        _datacalculated[_index].Clear();
        for (var i = 0; i < Count; i++)
        {
            _data[_index].Add(_rand.Next(4500, 5500));
        }
        for (var i = 0; i < Count; i++)
        {
            _datacalculated[_index].Add(_data[_index][i] / 4.78);
        }

        // Output data to console
        Console.WriteLine(_index + ":");
        Console.WriteLine(string.Join(":", _data[_index]));
        Console.WriteLine();

        // Switch between index 0 and 1
        _index = 1 - _index;
    }

    private static void Main()
    {
        // Setup lists
        for (var i = 0; i < 2; i++)
        {
            _data.Add(new List<int>());
            _dataprevious.Add(new List<int>());
            _datacalculated.Add(new List<double>());
        }

        // Get new data every 5 seconds
        new Timer(GetData, null, 0, 5000);
        Console.ReadLine();
    }
}
will it lead to poor performance
Performance is relative. Here, the other operations that you are doing totally dominate the (few) list accesses. If you cache _data[_index] (and the others) in local variables, then you are down to one outer-list access per 64 iterations, which is nothing.
You can definitively answer this question by profiling the code. But back-of-the-envelope calculations such as the one in the previous paragraph are also valid and can save some time.
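To illustrate, this is roughly what that caching looks like applied to the question's GetData (a sketch only, not a measured optimisation; the console output and the unused _dataprevious list are omitted):
private static void GetData(object o)
{
    // Cache the inner lists in locals once, instead of indexing
    // _data[_index] and _datacalculated[_index] on every loop iteration.
    var data = _data[_index];
    var calculated = _datacalculated[_index];

    data.Clear();
    calculated.Clear();
    for (var i = 0; i < Count; i++)
    {
        data.Add(_rand.Next(4500, 5500));
    }
    for (var i = 0; i < Count; i++)
    {
        calculated.Add(data[i] / 4.78);
    }

    // Output and index switching stay exactly as in the question.
    _index = 1 - _index;
}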
I have something like this:
List<string> listUser = new List<string>();
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("user1other");
List<string> key_blacklist = new List<string>();
key_blacklist.Add("hacker");
key_blacklist.Add("other");
foreach (string user in listUser)
{
foreach (string key in key_blacklist)
{
if (user.Contains(key))
{
// remove it in listUser
}
}
}
The expected result in listUser is: user1, user2.
The problem is that I have a huge listUser (more than 10 million entries) and a huge key_blacklist (100,000 entries). That code is very, very slow.
Is there any way to make it faster?
UPDATE: I found a new solution here:
http://cc.davelozinski.com/c-sharp/fastest-way-to-check-if-a-string-occurs-within-a-string
Hope this helps someone who ends up here! :)
If you don't have much control over how the list of users is constructed, you can at least test each item in the list in parallel, which on modern machines with multiple cores will speed up the checking a fair bit.
var allowedUsers = listUser.AsParallel().Where(
s =>
{
foreach (var key in key_blacklist)
{
if (s.Contains(key))
{
return false; //Not to be included
}
}
return true; //To be included, as no match with the blacklist
}).ToList();
Also - do you have to use .Contains? .Equals is going to be much much quicker, because in almost all cases a non-match will be determined when the HashCodes differ, which can be found only by an integer comparison. Super quick.
If you do need .Contains, you may want to think about restructuring the app. What do these strings in the list really represent? Separate sub-groups of users? Can I test each string, at the time it's added, for whether it represents a user on the blacklist?
UPDATE: In response to #Rawling's comment below - If you know that there is a finite set of usernames which have, say, "hacker" as a substring, that set would have to be pretty large before running a .Equals test of each username against a candidate would be slower than running .Contains on the candidate. This is because HashCode is really quick.
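For example, if exact matches are enough (rather than substring checks), the test at add time can be a single O(1) lookup. This is only a sketch of that idea; the blockedUsers set and CanAdd helper are made-up names, not from the question:
// Exact-match blacklist: HashSet<string>.Contains is a hash lookup,
// not a substring scan over every blacklisted key.
var blockedUsers = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "userhacker",
    "user1other"
};

bool CanAdd(string user) => !blockedUsers.Contains(user);

// Usage: only add users that pass the check.
if (CanAdd("user1")) { /* add to listUser */ }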
If you are using Entity Framework or LINQ to SQL, then building the query in LINQ and sending it to the server can improve the performance.
Then, instead of removing the items, you actually query for the items that fulfil the requirements, i.e. the users whose names don't contain a banned expression:
listUser.Where(u => !key_blacklist.Any(u.Contains)).ToList();
A possible solution is to use a tree-like data structure.
The basic idea is to have the blacklisted words organised like this:
+ h
| + ha
| + hac
| - hacker
| - [other words beginning with hac]
|
+ f
| + fu
| + fuk
| - fukoff
| - [other words beginning with fuk]
Then, when you check for blacklisted words, you avoid searching the whole list of words beginning with "hac" if you find out that your user string does not even contain "h".
In the example I provided, with your sample data, this of course makes no difference, but with real data sets it should significantly reduce the number of Contains calls, since you don't check against the full list of blacklisted words every time.
Here is a code example (please note that the code is pretty bad, this is just to illustrate my idea)
using System;
using System.Collections.Generic;
using System.Linq;
class Program {
class Blacklist {
public string Start;
public int Level;
const int MaxLevel = 3;
public Dictionary<string, Blacklist> SubBlacklists = new Dictionary<string, Blacklist>();
public List<string> BlacklistedWords = new List<string>();
public Blacklist() {
Start = string.Empty;
Level = 0;
}
Blacklist(string start, int level) {
Start = start;
Level = level;
}
public void AddBlacklistedWord(string word) {
if (word.Length > Level && Level < MaxLevel) {
string index = word.Substring(0, Level + 1);
Blacklist sublist = null;
if (!SubBlacklists.TryGetValue(index, out sublist)) {
sublist = new Blacklist(index, Level + 1);
SubBlacklists[index] = sublist;
}
sublist.AddBlacklistedWord(word);
} else {
BlacklistedWords.Add(word);
}
}
public bool ContainsBlacklistedWord(string wordToCheck) {
if (wordToCheck.Length > Level && Level < MaxLevel) {
foreach (var sublist in SubBlacklists.Values) {
if (wordToCheck.Contains(sublist.Start)) {
return sublist.ContainsBlacklistedWord(wordToCheck);
}
}
}
return BlacklistedWords.Any(x => wordToCheck.Contains(x));
}
}
static void Main(string[] args) {
List<string> listUser = new List<string>();
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("userfukoff1");
Blacklist blacklist = new Blacklist();
blacklist.AddBlacklistedWord("hacker");
blacklist.AddBlacklistedWord("fukoff");
foreach (string user in listUser) {
if (blacklist.ContainsBlacklistedWord(user)) {
Console.WriteLine("Contains blacklisted word: {0}", user);
}
}
}
}
You are using the wrong thing. If you have a lot of data, you should be using either HashSet<T> or SortedSet<T>. If you don't need the data sorted, go with HashSet<T>. Here is a program I wrote to demonstrate the time differences:
class Program
{
private static readonly Random random = new Random((int)DateTime.Now.Ticks);
static void Main(string[] args)
{
Console.WriteLine("Creating Lists...");
var stringList = new List<string>();
var hashList = new HashSet<string>();
var sortedList = new SortedSet<string>();
var searchWords1 = new string[3];
int ndx = 0;
for (int x = 0; x < 1000000; x++)
{
string str = RandomString(10);
if (x == 5 || x == 500000 || x == 999999)
{
str = "Z" + str;
searchWords1[ndx] = str;
ndx++;
}
stringList.Add(str);
hashList.Add(str);
sortedList.Add(str);
}
Console.WriteLine("Lists created!");
var sw = new Stopwatch();
sw.Start();
bool search1 = stringList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("List<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
search1 = hashList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("HashSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
search1 = sortedList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("SortedSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
}
private static string RandomString(int size)
{
var builder = new StringBuilder();
char ch;
for (int i = 0; i < size; i++)
{
ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
builder.Append(ch);
}
return builder.ToString();
}
}
On my machine, I got the following results:
Creating Lists...
Lists created!
List<T> True ==> 15ms
HashSet<T> True ==> 0ms
SortedSet<T> True ==> 0ms
As you can see, List<T> was extremely slow compared to HashSet<T> and SortedSet<T>. Those were almost instantaneous.
For some reason it seems like my LinkedList is outperforming my List. I have a LinkedList because there is part of the code where I have to rearrange the children a lot. Everything after that bit of rearranging is simply looping over the data and performing calculations. I was previously just iterating over the LinkedList using an Iterator, but then I thought that I should simultaneously store a List of the same elements so I can iterate over them faster. Somehow, adding the List and iterating over that was significantly slower for the exact same thing. Not sure how this could be, so any info is helpful.
Here is roughly the code that I would EXPECT to be faster:
class Program {
public static void Main(string[] args) {
var originalList = new List<Thing>();
for (var i = 0; i < 1000; i++) {
var t = new Thing();
t.x = 0d;
t.y = 0d;
originalList.Add(t);
}
var something = new Something(originalList);
for (var i = 0; i < 1000; i++) {
var start = DateTime.Now.Ticks / TimeSpan.TicksPerMillisecond;
something.iterate();
time += (DateTime.Now.Ticks / TimeSpan.TicksPerMillisecond) - start;
Console.Out.WriteLine(time / (i + 1));
}
}
class Thing {
public double x {get; set;}
public double y {get; set;}
}
class Something {
private List<Thing> things;
private LinkedList<Thing> things1 = new LinkedList<Thing>();
private List<Thing> things2 = new List<Thing>();
public Class(List<Thing> things) {
this.things = things;
for (var i = 0; i < things.Count; i++) {
things1.AddLast(things[i]);
things2.Add(things[i]);
}
}
public void iterate() {
//loops like this happen a few times, but the list is never changed, only the
//objects properties in the list
for (var i = 0; i < things2.Count; i++) {
var thing = things2[i];
thing.x += someDouble;
thing.y += someOtherDouble;
}
}
}
}
This was what I was doing first, which I think SHOULD be SLOWER:
class Program {
public static void Main(string[] args) {
var originalList = new List<Thing>();
for (var i = 0; i < 1000; i++) {
var t = new Thing();
t.x = 0d;
t.y = 0d;
originalList.Add(t);
}
var something = new Something(originalList);
for (var i = 0; i < 1000; i++) {
var start = DateTime.Now.Ticks / TimeSpan.TicksPerMillisecond;
something.iterate();
time += (DateTime.Now.Ticks / TimeSpan.TicksPerMillisecond) - start;
Console.Out.WriteLine(time / (i + 1));
}
}
class Thing {
public double x {get; set;}
public double y {get; set;}
}
class Something {
private List<Thing> things;
private LinkedList<Thing> things1 = new LinkedList<Thing>();
public Class(List<Thing> things) {
this.things = things;
for (var i = 0; i < things.Count; i++) {
things1.AddLast(things[i]);
}
}
public void iterate() {
//loops like this happen a few times, but the list is never changed, only the
//objects properties in the list
var iterator = things1.First;
while (iterator != null) {
var value = iterator.Value;
value.x += someDouble;
value.y += someOtherDouble;
iterator = iterator.Next;
}
}
}
}
It's difficult to verify anything since it doesn't compile, so I can't run it on my machine, but still there are some big problems:
You should not be using DateTime.Now to measure performance. Instead use Stopwatch, which is capable of high fidelity time measurements, e.g.:
var stopwatch = Stopwatch.StartNew();
//do stuff here
stopwatch.Stop();
double timeInSeconds = stopwatch.Elapsed.TotalSeconds;
Your arithmetic is fundamentally flawed in the following line:
DateTime.Now.Ticks / TimeSpan.TicksPerMillisecond
Ticks are represented as integral numbers (long here), and the result of the division is not a real number but is truncated. E.g. 55 / 7 = 7, not 7.857. Therefore it is definitely not a stable way of benchmarking.
Additionally, run the benchmark with more elements and make sure you do it in Release mode.
Based on your code, the cause of the slow performance is not the iteration; it is the act of adding elements to the List<T>.
This post does a very good job of comparing List and LinkedList. List<T> is backed by an array allocated in one contiguous block, so as you add more elements the array has to be resized, which causes the slow performance in your code.
If you come from the C++ world, List<T> in C# is equivalent to std::vector<T>, while LinkedList<T> is equivalent to std::list<T>.
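If the resizing is the culprit, one cheap mitigation (a sketch based on the question's constructor, not something benchmarked here) is to create the List with its final capacity, so the backing array is allocated once:
public Something(List<Thing> things)
{
    this.things = things;
    // Pre-allocating the backing array avoids the repeated grow-and-copy
    // that List<T> performs as elements are added one at a time.
    things2 = new List<Thing>(things.Count);
    for (var i = 0; i < things.Count; i++)
    {
        things1.AddLast(things[i]);
        things2.Add(things[i]);
    }
}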
Let's say I have two sequences returning integers 1 to 5.
The first returns 1, 2 and 3 very fast, but 4 and 5 take 200ms each.
public static IEnumerable<int> FastFirst()
{
for (int i = 1; i < 6; i++)
{
if (i > 3) Thread.Sleep(200);
yield return i;
}
}
The second returns 1, 2 and 3 with a 200ms delay, but 4 and 5 are returned fast.
public static IEnumerable<int> SlowFirst()
{
for (int i = 1; i < 6; i++)
{
if (i < 4) Thread.Sleep(200);
yield return i;
}
}
Unioning both these sequences gives me just the numbers 1 to 5.
FastFirst().Union(SlowFirst());
I cannot guarantee which of the two methods has delays at which point, so relying on the order of execution cannot guarantee a solution for me. Therefore, I would like to parallelise the union, in order to minimise the (artificial) delay in my example.
A real-world scenario: I have a cache that returns some entities, and a datasource that returns all entities. I'd like to be able to return an iterator from a method that internally parallelises the request to both the cache and the datasource so that the cached results yield as fast as possible.
Note 1: I realise this is still wasting CPU cycles; I'm not asking how I can prevent the sequences from iterating over their slow elements, just how I can union them as fast as possible.
Update 1: I've tailored achitaka-san's great response to accept multiple producers, and to use ContinueWhenAll to call the BlockingCollection's CompleteAdding just once. I've put it here since it would get lost in the comments, which have no formatting. Any further feedback would be great!
public static IEnumerable<TResult> SelectAsync<TResult>(
params IEnumerable<TResult>[] producer)
{
var resultsQueue = new BlockingCollection<TResult>();
var taskList = new HashSet<Task>();
foreach (var result in producer)
{
taskList.Add(
Task.Factory.StartNew(
() =>
{
foreach (var product in result)
{
resultsQueue.Add(product);
}
}));
}
Task.Factory.ContinueWhenAll(taskList.ToArray(), x => resultsQueue.CompleteAdding());
return resultsQueue.GetConsumingEnumerable();
}
Take a look at this.
The first method just returns everything in the order the results arrive.
The second checks uniqueness. If you chain them, you will get the result you want, I think (see the usage example after the code).
public static class Class1
{
public static IEnumerable<TResult> SelectAsync<TResult>(
IEnumerable<TResult> producer1,
IEnumerable<TResult> producer2,
int capacity)
{
var resultsQueue = new BlockingCollection<TResult>(capacity);
var producer1Done = false;
var producer2Done = false;
Task.Factory.StartNew(() =>
{
foreach (var product in producer1)
{
resultsQueue.Add(product);
}
producer1Done = true;
if (producer1Done && producer2Done) { resultsQueue.CompleteAdding(); }
});
Task.Factory.StartNew(() =>
{
foreach (var product in producer2)
{
resultsQueue.Add(product);
}
producer2Done = true;
if (producer1Done && producer2Done) { resultsQueue.CompleteAdding(); }
});
return resultsQueue.GetConsumingEnumerable();
}
public static IEnumerable<TResult> SelectAsyncUnique<TResult>(this IEnumerable<TResult> source)
{
HashSet<TResult> knownResults = new HashSet<TResult>();
foreach (TResult result in source)
{
if (knownResults.Contains(result)) {continue;}
knownResults.Add(result);
yield return result;
}
}
}
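For example, chaining the two methods over the question's sequences could look like this (the capacity of 10 is an arbitrary choice):
// Merges both producers as their items arrive, then filters out duplicates.
foreach (var value in Class1.SelectAsync(FastFirst(), SlowFirst(), 10).SelectAsyncUnique())
{
    Console.WriteLine(value);
}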
The cache would be nearly instant compared to fetching from the database, so you could read from the cache first and return those items, then read from the database and return the items except those that were found in the cache.
If you try to parallelise this, you will add a lot of complexity but get quite a small gain.
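A minimal sketch of that cache-first approach, assuming each entity has a key; GetFromCache, GetFromDatabase and the Entity type are hypothetical placeholders, not part of the question:
static IEnumerable<Entity> GetEntities()
{
    // Yield the cached items immediately, remembering their keys.
    var seen = new HashSet<int>();
    foreach (var entity in GetFromCache())      // hypothetical fast source
    {
        seen.Add(entity.Id);
        yield return entity;
    }

    // Then read the slower source, skipping what the cache already returned.
    foreach (var entity in GetFromDatabase())   // hypothetical slow source
    {
        if (seen.Add(entity.Id))
            yield return entity;
    }
}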
Edit:
If there is no predictable difference in the speed of the sources, you could run them in threads and use a synchronised hash set to keep track of which items you have already got, put the new items in a queue, and let the main thread read from the queue:
public static IEnumerable<TItem> GetParallel<TItem, TKey>(Func<TItem, TKey> getKey, params IEnumerable<TItem>[] sources) {
HashSet<TKey> found = new HashSet<TKey>();
List<TItem> queue = new List<TItem>();
object sync = new object();
int alive = 0;
object aliveSync = new object();
foreach (IEnumerable<TItem> source in sources) {
lock (aliveSync) {
alive++;
}
new Thread(s => {
foreach (TItem item in s as IEnumerable<TItem>) {
TKey key = getKey(item);
lock (sync) {
if (found.Add(key)) {
queue.Add(item);
}
}
}
lock (aliveSync) {
alive--;
}
}).Start(source);
}
while (true) {
lock (sync) {
if (queue.Count > 0) {
foreach (TItem item in queue) {
yield return item;
}
queue.Clear();
}
}
lock (aliveSync) {
if (alive == 0) break;
}
Thread.Sleep(100);
}
}
Test stream:
public static IEnumerable<int> SlowRandomFeed(Random rnd) {
int[] values = new int[100];
for (int i = 0; i < 100; i++) {
int pos = rnd.Next(i + 1);
values[i] = i;
int temp = values[pos];
values[pos] = values[i];
values[i] = temp;
}
foreach (int value in values) {
yield return value;
Thread.Sleep(rnd.Next(200));
}
}
Test:
Random rnd = new Random();
foreach (int item in GetParallel(n => n, SlowRandomFeed(rnd), SlowRandomFeed(rnd), SlowRandomFeed(rnd), SlowRandomFeed(rnd))) {
Console.Write("{0:0000 }", item);
}
I am looking for a way to quickly remove items from a C# List<T>. The documentation states that the List.Remove() and List.RemoveAt() operations are both O(n)
List.Remove
List.RemoveAt
This is severely affecting my application.
I wrote a few different remove methods and tested them all on a List<String> with 500,000 items. The test cases are shown below...
Overview
I wrote a method that would generate a list of strings that simply contains string representations of each number ("1", "2", "3", ...). I then attempted to remove every 5th item in the list. Here is the method used to generate the list:
private List<String> GetList(int size)
{
List<String> myList = new List<String>();
for (int i = 0; i < size; i++)
myList.Add(i.ToString());
return myList;
}
Test 1: RemoveAt()
Here is the test I used to test the RemoveAt() method.
private void RemoveTest1(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list.RemoveAt(i);
}
Test 2: Remove()
Here is the test I used to test the Remove() method.
private void RemoveTest2(ref List<String> list)
{
List<int> itemsToRemove = new List<int>();
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list.Remove(list[i]);
}
Test 3: Set to null, sort, then RemoveRange
In this test, I looped through the list one time and set the to-be-removed items to null. Then, I sorted the list (so null would be at the top), and removed all the items at the top that were set to null.
NOTE: This reordered my list, so I may have to go put it back in the correct order.
private void RemoveTest3(ref List<String> list)
{
int numToRemove = 0;
for (int i = 0; i < list.Count; i++)
{
if (i % 5 == 0)
{
list[i] = null;
numToRemove++;
}
}
list.Sort();
list.RemoveRange(0, numToRemove);
// Now they're out of order...
}
Test 4: Create a new list, and add all of the "good" values to the new list
In this test, I created a new list, and added all of my keep-items to the new list. Then, I put all of these items into the original list.
private void RemoveTest4(ref List<String> list)
{
List<String> newList = new List<String>();
for (int i = 0; i < list.Count; i++)
{
if (i % 5 == 0)
continue;
else
newList.Add(list[i]);
}
list.RemoveRange(0, list.Count);
list.AddRange(newList);
}
Test 5: Set to null and then FindAll()
In this test, I set all the to-be-deleted items to null, then used the FindAll() feature to find all the items that are not null
private void RemoveTest5(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list[i] = null;
list = list.FindAll(x => x != null);
}
Test 6: Set to null and then RemoveAll()
In this test, I set all the to-be-deleted items to null, then used the RemoveAll() feature to remove all the items that are null.
private void RemoveTest6(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list[i] = null;
list.RemoveAll(x => x == null);
}
Client Application and Outputs
int numItems = 500000;
Stopwatch watch = new Stopwatch();
// List 1...
watch.Start();
List<String> list1 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest1(ref list1);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 2...
watch.Start();
List<String> list2 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest2(ref list2);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 3...
watch.Reset(); watch.Start();
List<String> list3 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest3(ref list3);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 4...
watch.Reset(); watch.Start();
List<String> list4 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest4(ref list4);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 5...
watch.Reset(); watch.Start();
List<String> list5 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest5(ref list5);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 6...
watch.Reset(); watch.Start();
List<String> list6 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest6(ref list6);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
Results
00:00:00.1433089 // Create list
00:00:32.8031420 // RemoveAt()
00:00:32.9612512 // Forgot to reset stopwatch :(
00:04:40.3633045 // Remove()
00:00:00.2405003 // Create list
00:00:01.1054731 // Null, Sort(), RemoveRange()
00:00:00.1796988 // Create list
00:00:00.0166984 // Add good values to new list
00:00:00.2115022 // Create list
00:00:00.0194616 // FindAll()
00:00:00.3064646 // Create list
00:00:00.0167236 // RemoveAll()
Notes And Comments
The first two tests do not actually remove every 5th item from the list, because the remaining items shift down after each removal, so the indices no longer line up. In fact, out of 500,000 items, only 83,334 were removed (it should have been 100,000). I am okay with this - clearly the Remove()/RemoveAt() methods are not a good idea anyway (see the sketch after these notes for an index-safe variant).
Although I tried to remove every 5th item from the list, in reality there will not be such a pattern; the entries to be removed will be random.
Although I used a List<String> in this example, that will not always be the case. It could be a List<Anything>.
Not putting the items in the list to begin with is not an option.
The other methods (3 - 6) all performed much better, comparatively, however I am a little concerned -- In 3, 5, and 6 I was forced to set a value to null, and then remove all the items according to this sentinel. I don't like that approach because I can envision a scenario where one of the items in the list might be null and it would get removed unintentionally.
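As an aside (my illustration, not one of the original tests), iterating from the end keeps the indices valid, so every 5th item really is removed, although each RemoveAt call is still O(n):
private void RemoveTest1Backwards(ref List<String> list)
{
    // Walking backwards means a removal never shifts the indices we have yet to visit.
    for (int i = list.Count - 1; i >= 0; i--)
        if (i % 5 == 0)
            list.RemoveAt(i);
}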
My question is: What is the best way to quickly remove many items from a List<T>? Most of the approaches I've tried look really ugly, and potentially dangerous, to me. Is a List the wrong data structure?
Right now, I am leaning towards creating a new list and adding the good items to the new list, but it seems like there should be a better way.
List isn't an efficient data structure when it comes to removal. You would do better to use a doubly linked list (LinkedList), as removal simply requires reference updates in the adjacent entries.
If the order does not matter then there is a simple O(1) List.Remove method.
public static class ListExt
{
// O(1)
public static void RemoveBySwap<T>(this List<T> list, int index)
{
list[index] = list[list.Count - 1];
list.RemoveAt(list.Count - 1);
}
// O(n)
public static void RemoveBySwap<T>(this List<T> list, T item)
{
int index = list.IndexOf(item);
RemoveBySwap(list, index);
}
// O(n)
public static void RemoveBySwap<T>(this List<T> list, Predicate<T> predicate)
{
int index = list.FindIndex(predicate);
RemoveBySwap(list, index);
}
}
This solution is friendly to memory traversal, so even if you need to find the index first, it will be very fast.
Notes:
Finding the index of an item is O(n), since the list cannot be kept sorted with this approach.
Linked lists are slow to traverse, especially for large collections with long life spans.
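A quick usage sketch of the extension (my example, not from the answer), showing that the element order is not preserved:
var list = new List<int> { 10, 20, 30, 40, 50 };

// 30 is replaced by the last element (50), then the last slot is dropped.
list.RemoveBySwap(item => item == 30);

// list is now { 10, 20, 50, 40 } - cheap removal, but the order changed.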
If you're happy creating a new list, you don't have to go through setting items to null. For example:
// This overload of Where provides the index as well as the value. Unless
// you need the index, use the simpler overload which just provides the value.
List<string> newList = oldList.Where((value, index) => index % 5 != 0)
.ToList();
However, you might want to look at alternative data structures, such as LinkedList<T> or HashSet<T>. It really depends on what features you need from your data structure.
I feel a HashSet, LinkedList or Dictionary would serve you much better.
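For instance (a sketch, assuming the values to drop are known up front; itemsToRemove and list are placeholder names), a HashSet turns each membership test into an O(1) lookup, so the whole removal is a single RemoveAll pass:
// toRemove.Contains is O(1), so removing all matches is one O(n) pass over the list.
var toRemove = new HashSet<string>(itemsToRemove);
list.RemoveAll(toRemove.Contains);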
You could always remove the items from the end of the list. List removal is O(1) when performed on the last element, since all it does is decrement the count; there is no shifting of subsequent elements involved (which is the reason why list removal is O(n) in general).
for (int i = list.Count - 1; i >= 0; --i)
list.RemoveAt(i);
Or you could do this:
List<int> listA;
List<int> listB;
...
List<int> resultingList = listA.Except(listB).ToList();
OK, try RemoveAll used like this:
static void Main(string[] args)
{
Stopwatch watch = new Stopwatch();
watch.Start();
List<Int32> test = GetList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
test.RemoveAll( t=> t % 5 == 0);
List<String> test2 = test.ConvertAll(delegate(int i) { return i.ToString(); });
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test.Count).ToString());
Console.ReadLine();
}
static private List<Int32> GetList(int size)
{
List<Int32> test = new List<Int32>();
for (int i = 0; i < size; i++)
test.Add(i);
return test;
}
This only loops twice and removes exactly 100,000 items.
My output for this code:
00:00:00.0099495
00:00:00.1945987
1000000
Updated to try a HashSet
static void Main(string[] args)
{
Stopwatch watch = new Stopwatch();
do
{
// Test with list
watch.Reset(); watch.Start();
List<Int32> test = GetList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
List<String> myList = RemoveTest(test);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test.Count).ToString());
Console.WriteLine();
// Test with HashSet
watch.Reset(); watch.Start();
HashSet<String> test2 = GetStringList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
HashSet<String> myList2 = RemoveTest(test2);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test2.Count).ToString());
Console.WriteLine();
} while (Console.ReadKey().Key != ConsoleKey.Escape);
}
static private List<Int32> GetList(int size)
{
List<Int32> test = new List<Int32>();
for (int i = 0; i < size; i++)
test.Add(i);
return test;
}
static private HashSet<String> GetStringList(int size)
{
HashSet<String> test = new HashSet<String>();
for (int i = 0; i < size; i++)
test.Add(i.ToString());
return test;
}
static private List<String> RemoveTest(List<Int32> list)
{
list.RemoveAll(t => t % 5 == 0);
return list.ConvertAll(delegate(int i) { return i.ToString(); });
}
static private HashSet<String> RemoveTest(HashSet<String> list)
{
list.RemoveWhere(t => Convert.ToInt32(t) % 5 == 0);
return list;
}
This gives me:
00:00:00.0131586
00:00:00.1454723
100000
00:00:00.3459420
00:00:00.2122574
100000
I've found that when dealing with large lists, this is often faster. The speed of the Remove, and of finding the right item in the dictionary to remove, more than makes up for the cost of creating the dictionary. A couple of things, though: the original list has to have unique values, and I don't think the order is guaranteed once you are done.
List<long> hundredThousandItemsInOrignalList;
List<long> fiftyThousandItemsToRemove;
// populate lists...
Dictionary<long, long> originalItems = hundredThousandItemsInOrignalList.ToDictionary(i => i);
foreach (long i in fiftyThousandItemsToRemove)
{
originalItems.Remove(i);
}
List<long> newList = originalItems.Select(i => i.Key).ToList();
Lists are faster than LinkedLists until n gets really big. The reason is that so-called cache misses occur much more frequently with LinkedLists than with Lists, and memory lookups are quite expensive. Because a List is implemented as an array, the CPU can load a chunk of data at once, since it knows the required data is stored next to each other. A linked list, however, gives the CPU no hint about which data is required next, which forces it to do many more memory lookups. (By memory I mean RAM.)
For further details take a look at: https://jackmott.github.io/programming/2016/08/20/when-bigo-foolsya.html
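A rough sketch of the kind of comparison that shows this effect (my example; timings will vary by machine and this is not a rigorous benchmark):
using System;
using System.Collections.Generic;
using System.Diagnostics;

class CacheLocalityDemo
{
    static void Main()
    {
        const int n = 10_000_000;
        var list = new List<int>(n);
        var linked = new LinkedList<int>();
        for (int i = 0; i < n; i++) { list.Add(i); linked.AddLast(i); }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < list.Count; i++) sum += list[i];   // contiguous array, cache friendly
        sw.Stop();
        Console.WriteLine($"List<int>:       {sw.ElapsedMilliseconds} ms (sum {sum})");

        sum = 0;
        sw.Restart();
        for (var node = linked.First; node != null; node = node.Next) sum += node.Value; // pointer chasing
        sw.Stop();
        Console.WriteLine($"LinkedList<int>: {sw.ElapsedMilliseconds} ms (sum {sum})");
    }
}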
The other answers (and the question itself) offer various ways of dealing with this "slug" (slowness bug) using the built-in .NET Framework classes.
But if you're willing to switch to a third-party library, you can get better performance simply by changing the data structure, and leaving your code unchanged except for the list type.
The Loyc Core libraries include two types that work the same way as List<T> but can remove items faster:
DList<T> is a simple data structure that gives you a 2x speedup over List<T> when removing items from random locations
AList<T> is a sophisticated data structure that gives you a large speedup over List<T> when your lists are very long (but may be slower when the list is short).
If you still want to use a List as an underlying structure, you can use the following extension method, which does the heavy lifting for you.
using System.Collections.Generic;
using System.Linq;
namespace Library.Extensions
{
public static class ListExtensions
{
public static IEnumerable<T> RemoveRange<T>(this List<T> list, IEnumerable<T> range)
{
var removed = list.Intersect(range).ToArray();
if (!removed.Any())
{
return Enumerable.Empty<T>();
}
var remaining = list.Except(removed).ToArray();
list.Clear();
list.AddRange(remaining);
return removed;
}
}
}
A simple stopwatch test gives results of about 200 ms for the removal. Keep in mind this is not a real benchmark.
public class Program
{
static void Main(string[] args)
{
var list = Enumerable
.Range(0, 500_000)
.Select(x => x.ToString())
.ToList();
var allFifthItems = list.Where((_, index) => index % 5 == 0).ToArray();
var sw = Stopwatch.StartNew();
list.RemoveRange(allFifthItems);
sw.Stop();
var message = $"{allFifthItems.Length} elements removed in {sw.Elapsed}";
Console.WriteLine(message);
}
}
Output:
100000 elements removed in 00:00:00.2291337