I am looking for a way to quickly remove items from a C# List<T>. The documentation states that the List.Remove() and List.RemoveAt() operations are both O(n).
This is severely affecting my application.
I wrote a few different remove methods and tested them all on a List<String> with 500,000 items. The test cases are shown below...
Overview
I wrote a method that would generate a list of strings that simply contains string representations of each number ("1", "2", "3", ...). I then attempted to remove every 5th item in the list. Here is the method used to generate the list:
private List<String> GetList(int size)
{
List<String> myList = new List<String>();
for (int i = 0; i < size; i++)
myList.Add(i.ToString());
return myList;
}
Test 1: RemoveAt()
Here is the test I used to test the RemoveAt() method.
private void RemoveTest1(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list.RemoveAt(i);
}
Test 2: Remove()
Here is the test I used to test the Remove() method.
private void RemoveTest2(ref List<String> list)
{
List<int> itemsToRemove = new List<int>();
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list.Remove(list[i]);
}
Test 3: Set to null, sort, then RemoveRange
In this test, I looped through the list one time and set the to-be-removed items to null. Then, I sorted the list (so null would be at the top), and removed all the items at the top that were set to null.
NOTE: This reordered my list, so I may have to go put it back in the correct order.
private void RemoveTest3(ref List<String> list)
{
int numToRemove = 0;
for (int i = 0; i < list.Count; i++)
{
if (i % 5 == 0)
{
list[i] = null;
numToRemove++;
}
}
list.Sort();
list.RemoveRange(0, numToRemove);
// Now they're out of order...
}
Test 4: Create a new list, and add all of the "good" values to the new list
In this test, I created a new list, and added all of my keep-items to the new list. Then, I put all of these items into the original list.
private void RemoveTest4(ref List<String> list)
{
List<String> newList = new List<String>();
for (int i = 0; i < list.Count; i++)
{
if (i % 5 == 0)
continue;
else
newList.Add(list[i]);
}
list.RemoveRange(0, list.Count);
list.AddRange(newList);
}
Test 5: Set to null and then FindAll()
In this test, I set all the to-be-deleted items to null, then used the FindAll() feature to find all the items that are not null
private void RemoveTest5(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list[i] = null;
list = list.FindAll(x => x != null);
}
Test 6: Set to null and then RemoveAll()
In this test, I set all the to-be-deleted items to null, then used the RemoveAll() feature to remove all the items that are null.
private void RemoveTest6(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list[i] = null;
list.RemoveAll(x => x == null);
}
Client Application and Outputs
int numItems = 500000;
Stopwatch watch = new Stopwatch();
// List 1...
watch.Start();
List<String> list1 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest1(ref list1);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 2...
watch.Start();
List<String> list2 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest2(ref list2);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 3...
watch.Reset(); watch.Start();
List<String> list3 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest3(ref list3);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 4...
watch.Reset(); watch.Start();
List<String> list4 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest4(ref list4);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 5...
watch.Reset(); watch.Start();
List<String> list5 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest5(ref list5);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 6...
watch.Reset(); watch.Start();
List<String> list6 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest6(ref list6);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
Results
00:00:00.1433089 // Create list
00:00:32.8031420 // RemoveAt()
00:00:32.9612512 // Forgot to reset stopwatch :(
00:04:40.3633045 // Remove()
00:00:00.2405003 // Create list
00:00:01.1054731 // Null, Sort(), RemoveRange()
00:00:00.1796988 // Create list
00:00:00.0166984 // Add good values to new list
00:00:00.2115022 // Create list
00:00:00.0194616 // FindAll()
00:00:00.3064646 // Create list
00:00:00.0167236 // RemoveAll()
Notes And Comments
The first two tests do not actually remove every 5th item from the list, because the remaining items shift down after each removal, so the indices no longer line up. In fact, out of 500,000 items, only 83,334 were removed (it should have been 100,000). I am okay with this - clearly the Remove()/RemoveAt() methods are not a good idea anyway.
Although I tried to remove every 5th item from the list, in reality there will not be such a pattern; the entries to be removed will be random.
Although I used a List<String> in this example, that will not always be the case. It could be a List<Anything>.
Not putting the items in the list to begin with is not an option.
The other methods (3 - 6) all performed much better, comparatively. However, I am a little concerned: in 3, 5, and 6 I was forced to set a value to null and then remove the items according to this sentinel. I don't like that approach, because I can envision a scenario where one of the items in the list is legitimately null and would get removed unintentionally.
My question is: What is the best way to quickly remove many items from a List<T>? Most of the approaches I've tried look really ugly, and potentially dangerous, to me. Is a List the wrong data structure?
Right now, I am leaning towards creating a new list and adding the good items to the new list, but it seems like there should be a better way.
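For reference, one sentinel-free option worth benchmarking alongside the tests above (a sketch; RemoveTest7 is a hypothetical name, not one of the original tests) is to let RemoveAll decide by position rather than by value. List<T>.RemoveAll invokes its predicate once per element, in order, so a captured counter can stand in for the original index:
private void RemoveTest7(ref List<String> list)
{
    // Remove every 5th item without writing null sentinels into the list.
    int index = 0;
    list.RemoveAll(item => index++ % 5 == 0);
}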
List isn't an efficient data structure when it comes to removal. You would do better to use a doubly linked list (LinkedList<T>), as removal simply requires updating the references in the adjacent nodes.
If the order does not matter then there is a simple O(1) List.Remove method.
public static class ListExt
{
// O(1)
public static void RemoveBySwap<T>(this List<T> list, int index)
{
list[index] = list[list.Count - 1];
list.RemoveAt(list.Count - 1);
}
// O(n)
public static void RemoveBySwap<T>(this List<T> list, T item)
{
int index = list.IndexOf(item);
RemoveBySwap(list, index);
}
// O(n)
public static void RemoveBySwap<T>(this List<T> list, Predicate<T> predicate)
{
int index = list.FindIndex(predicate);
RemoveBySwap(list, index);
}
}
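For example, usage might look like this (a sketch; note that the swap deliberately breaks the original ordering, and the value/predicate overloads assume a matching item exists, since an index of -1 would throw):
var words = new List<string> { "a", "b", "c", "d", "e" };

// Remove by index: the last element is swapped into slot 1.
words.RemoveBySwap(1);             // { "a", "e", "c", "d" }

// Remove by value or by predicate (both pay O(n) to find the index first).
words.RemoveBySwap("c");           // { "a", "e", "d" }
words.RemoveBySwap(w => w == "d"); // { "a", "e" }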
This solution is friendly to sequential memory traversal, so even if you need to find the index first, it will be very fast.
Notes:
Finding the index of an item is O(n), since the list cannot be kept sorted (the swap reorders it).
Linked lists are slow on traversal, especially for large collections with long life spans.
If you're happy creating a new list, you don't have to go through setting items to null. For example:
// This overload of Where provides the index as well as the value. Unless
// you need the index, use the simpler overload which just provides the value.
List<string> newList = oldList.Where((value, index) => index % 5 != 0)
.ToList();
However, you might want to look at alternative data structures, such as LinkedList<T> or HashSet<T>. It really depends on what features you need from your data structure.
I feel a HashSet, LinkedList or Dictionary will do you much better.
You could always remove the items from the end of the list. List removal is O(1) when performed on the last element, since all it does is decrement the count; there is no shifting of subsequent elements involved (which is the reason list removal is O(n) in general).
for (int i = list.Count - 1; i >= 0; --i)
list.RemoveAt(i);
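Applied to the original every-5th-item scenario, iterating backwards also sidesteps the index-shifting problem from Test 1 (a sketch, not one of the timed tests above):
// Walk from the end so removals never shift the indices still to be visited.
for (int i = list.Count - 1; i >= 0; --i)
{
    if (i % 5 == 0)
        list.RemoveAt(i); // still O(n) per removal unless i is near the end
}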
Or you could do this:
List<int> listA;
List<int> listB;
...
List<int> resultingList = listA.Except(listB).ToList();
OK, try RemoveAll used like this:
static void Main(string[] args)
{
Stopwatch watch = new Stopwatch();
watch.Start();
List<Int32> test = GetList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
test.RemoveAll( t=> t % 5 == 0);
List<String> test2 = test.ConvertAll(delegate(int i) { return i.ToString(); });
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test.Count).ToString());
Console.ReadLine();
}
static private List<Int32> GetList(int size)
{
List<Int32> test = new List<Int32>();
for (int i = 0; i < size; i++)
test.Add(i);
return test;
}
This only loops twice and removes exactly 100,000 items.
My output for this code:
00:00:00.0099495
00:00:00.1945987
100000
Updated to try a HashSet
static void Main(string[] args)
{
Stopwatch watch = new Stopwatch();
do
{
// Test with list
watch.Reset(); watch.Start();
List<Int32> test = GetList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
List<String> myList = RemoveTest(test);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test.Count).ToString());
Console.WriteLine();
// Test with HashSet
watch.Reset(); watch.Start();
HashSet<String> test2 = GetStringList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
HashSet<String> myList2 = RemoveTest(test2);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test2.Count).ToString());
Console.WriteLine();
} while (Console.ReadKey().Key != ConsoleKey.Escape);
}
static private List<Int32> GetList(int size)
{
List<Int32> test = new List<Int32>();
for (int i = 0; i < size; i++)
test.Add(i);
return test;
}
static private HashSet<String> GetStringList(int size)
{
HashSet<String> test = new HashSet<String>();
for (int i = 0; i < size; i++)
test.Add(i.ToString());
return test;
}
static private List<String> RemoveTest(List<Int32> list)
{
list.RemoveAll(t => t % 5 == 0);
return list.ConvertAll(delegate(int i) { return i.ToString(); });
}
static private HashSet<String> RemoveTest(HashSet<String> list)
{
list.RemoveWhere(t => Convert.ToInt32(t) % 5 == 0);
return list;
}
This gives me:
00:00:00.0131586
00:00:00.1454723
100000
00:00:00.3459420
00:00:00.2122574
100000
I've found that when dealing with large lists, this is often faster. The speed of the dictionary's Remove and of finding the right item to remove more than makes up for the cost of creating the dictionary. A couple of things, though: the original list has to have unique values, and I don't think the order is guaranteed once you are done.
List<long> hundredThousandItemsInOriginalList;
List<long> fiftyThousandItemsToRemove;
// populate lists...
Dictionary<long, long> originalItems = hundredThousandItemsInOriginalList.ToDictionary(i => i);
foreach (long i in fiftyThousandItemsToRemove)
{
originalItems.Remove(i);
}
List<long> newList = originalItems.Select(i => i.Key).ToList();
Lists are faster than LinkedLists until n gets really big. The reason is that so-called cache misses occur much more frequently with a LinkedList than with a List. Memory lookups are expensive. Because a List is implemented as an array, the CPU can load a chunk of data at once, since it knows the required data is stored contiguously. A LinkedList, however, gives the CPU no hint about which data is required next, which forces it to do many more memory lookups. (By memory I mean RAM.)
For further details take a look at: https://jackmott.github.io/programming/2016/08/20/when-bigo-foolsya.html
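A rough way to see the effect yourself is a traversal micro-benchmark along these lines (a sketch only; exact timings depend on the machine and runtime):
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class TraversalBenchmark
{
    static void Main()
    {
        var data = Enumerable.Range(0, 10_000_000).ToArray();
        var list = new List<int>(data);          // contiguous storage
        var linked = new LinkedList<int>(data);  // one heap node per element

        var sw = Stopwatch.StartNew();
        long listSum = 0;
        foreach (var x in list) listSum += x;
        sw.Stop();
        Console.WriteLine($"List<int>:       {sw.ElapsedMilliseconds} ms (sum {listSum})");

        sw.Restart();
        long linkedSum = 0;
        foreach (var x in linked) linkedSum += x;
        sw.Stop();
        Console.WriteLine($"LinkedList<int>: {sw.ElapsedMilliseconds} ms (sum {linkedSum})");
    }
}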
The other answers (and the question itself) offer various ways of dealing with this "slug" (slowness bug) using the built-in .NET Framework classes.
But if you're willing to switch to a third-party library, you can get better performance simply by changing the data structure, and leaving your code unchanged except for the list type.
The Loyc Core libraries include two types that work the same way as List<T> but can remove items faster:
DList<T> is a simple data structure that gives you a 2x speedup over List<T> when removing items from random locations
AList<T> is a sophisticated data structure that gives you a large speedup over List<T> when your lists are very long (but may be slower when the list is short).
If you still want to use a List as an underlying structure, you can use the following extension method, which does the heavy lifting for you.
using System.Collections.Generic;
using System.Linq;
namespace Library.Extensions
{
public static class ListExtensions
{
public static IEnumerable<T> RemoveRange<T>(this List<T> list, IEnumerable<T> range)
{
var removed = list.Intersect(range).ToArray();
if (!removed.Any())
{
return Enumerable.Empty<T>();
}
var remaining = list.Except(removed).ToArray();
list.Clear();
list.AddRange(remaining);
return removed;
}
}
}
A simple stopwatch test gives results in about 200ms for removal. Keep in mind this is not a real benchmark usage.
public class Program
{
static void Main(string[] args)
{
var list = Enumerable
.Range(0, 500_000)
.Select(x => x.ToString())
.ToList();
var allFifthItems = list.Where((_, index) => index % 5 == 0).ToArray();
var sw = Stopwatch.StartNew();
list.RemoveRange(allFifthItems);
sw.Stop();
var message = $"{allFifthItems.Length} elements removed in {sw.Elapsed}";
Console.WriteLine(message);
}
}
Output:
100000 elements removed in 00:00:00.2291337
Related
I've been doing some reading on the generic Dictionary class, and the general advice is to use a Dictionary if you need really fast access to an item matching a specific key. This is because a dictionary uses a type-safe hash table under the hood. When accessing items, the lookup complexity is O(1) for a dictionary, whereas with a List we would need to loop through EVERY SINGLE item until we find a match, making the complexity O(n).
I wrote a little console app to see just how significant the difference between the two would be. The app stores 10 million items in each collection and attempts to access the second last item. The time difference between the List and Dictionary<TKey,TValue> is only one second, making the dictionary a winner but only just.
Question - can you provide an example(verbal is fine) where using a Dictionary vs a List would yield significant performance improvements?
class Program
{
static void Main(string[] args)
{
var iterations = 10000000;//10 million
var sw = new Stopwatch();
sw.Start();
var value1 = GetSecondLastFromDictionary(iterations);
sw.Stop();
var t1 = sw.Elapsed.ToString();
sw.Restart();
var value2 = GetSecondLastFromList(iterations);
sw.Stop();
var t2 = sw.Elapsed.ToString();
Console.WriteLine($"Dictionary - {t1}\nList - {t2}");
Console.ReadKey();
}
private static string GetSecondLastFromList(int iterations)
{
var collection = new List<Test>();
for (var i = 0; i < iterations; i++)
collection.Add(new Test { Key = i, Value = $"#{i}" });
return collection.Where(e => e.Key == iterations - 1).First().Value;
}
private static string GetSecondLastFromDictionary(int iterations)
{
var collection = new Dictionary<int, string>();
for (var i = 0; i < iterations; i++)
collection.Add(i, $"#{i}");
return collection[iterations - 1];
}
}
class Test
{
public int Key { get; set; }
public string Value { get; set; }
}
Your own example is fine to show where using a Dictionary yields significant performance improvements. The problem is you're not looking at the right thing. Your code spends a lot of time creating the dictionary or list and then does just one access of it. You need to separate out the collection creation and time multiple accesses of the item.
The code below does this. I find that multiple accesses of the dictionary take 0.001s, whereas the same number of accesses of the list takes 2 minutes 32 seconds. Assuming I've done that right, I think it shows dictionaries are faster for access.
static void Main(string[] args)
{
var iterations = 100000;
var sw = new Stopwatch();
var dict = CreateDict(iterations);
var list = CreateList(iterations);
sw.Start();
GetSecondLastFromDictionary(iterations, dict);
sw.Stop();
var t1 = sw.Elapsed.ToString();
sw.Restart();
GetSecondLastFromList(iterations, list);
sw.Stop();
var t2 = sw.Elapsed.ToString();
Console.WriteLine($"Dictionary - {t1}\nList - {t2}");
Console.ReadKey();
}
private static Dictionary<int, string> CreateDict(int iterations)
{
var collection = new Dictionary<int, string>();
for (var i = 0; i < iterations; i++)
collection.Add(i, $"#{i}");
return collection;
}
private static List<Test> CreateList(int iterations)
{
var collection = new List<Test>();
for (var i = 0; i < iterations; i++)
collection.Add(new Test { Key = i, Value = $"#{i}" });
return collection;
}
private static void GetSecondLastFromList(int iterations, List<Test> collection)
{
string test;
for (var i = 0; i < iterations; i++)
test = collection.Where(e => e.Key == iterations - 1).First().Value;
}
private static void GetSecondLastFromDictionary(int iterations, Dictionary<int, string> collection)
{
string test;
for (var i = 0; i < iterations; i++)
test = collection[iterations - 1];
}
}
I am working on an application whose bottleneck is list1.Except(list2). From this post: 'Should I use Except or Contains when dealing with HashSet or so in Linq', the complexity of Except is O(m+n) (m and n standing for the sizes of the lists). However, my lists are sorted. Can this help?
The first implementation I can think of:
foreach element in list2 (m operations)
  look it up in list1 with binary search (log(n) operations)
  if present
    set it to null (O(1); actually removing it would be O(n))
  else continue
It has complexity O(m*log(n)), which is very interesting when m is small and n is large (this is exactly the case with my data sets: m is around 50, n is around 1,000,000). However, the fact that it produces nulls may have a lot of implications for the functions using the list... Is there any way to keep this complexity without having to write nulls (and then keep track of them)?
Any help would be greatly appreciated!
using System;
using System.Collections.Generic;
public class Test
{
public static void Main()
{
var listM = new List<int>();
var listN = new List<int>();
for(int i = 0, x = 0; x < 50; i+=13, x++) {
listM.Add(i);
}
for(int i = 0, x = 0; x < 10000; i+=7, x++) {
listN.Add(i);
}
Console.WriteLine(SortedExcept(listM, listN).Count);
}
public static List<T> SortedExcept<T>(List<T> m, List<T> n) {
var result = new List<T>();
foreach(var itm in m) {
var index = n.BinarySearch(itm);
if(index < 0) {
result.Add(itm);
}
}
return result;
}
}
EDIT: Here's the O(M + N) version as well.
public static List<T> SortedExcept2<T>(List<T> m, List<T> n) where T : IComparable<T> {
var result = new List<T>();
int i = 0, j = 0;
if(n.Count == 0) {
result.AddRange(m);
return result;
}
while(i < m.Count) {
if(m[i].CompareTo(n[j]) < 0) {
result.Add(m[i]);
i++;
} else if(m[i].CompareTo(n[j]) > 0) {
j++;
} else {
i++;
}
if(j >= n.Count) {
for(; i < m.Count; i++) {
result.Add(m[i]);
}
break;
}
}
return result;
}
In a quick & dirty benchmark http://ideone.com/Y2oEQD, M + N is always faster, even when N is 10 million. BinarySearch suffers penalties because it accesses array memory in a non-linear fashion; this causes cache misses, which slow down the algorithm, so the larger N gets, the more memory-access penalties BinarySearch pays.
If both lists are sorted, then you can implement your own solution easily:
The listA-except-listB algorithm then works as follows:
1. Start from the beginning of both lists
2. If listA element is smaller than the listB element,
then include the listA element in the output and advance listA
3. If listB element is smaller than the listA element, advance listB
4. If listA and listB elements are equal,
advance both lists and do not push the element to the output
Repeat until listA is exhausted. Take special care that listB might be exhausted before listA.
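A sketch of that merge walk, written directly from the steps above (the method name SortedListExcept and the IComparable<T> constraint are my additions):
public static List<T> SortedListExcept<T>(IReadOnlyList<T> listA, IReadOnlyList<T> listB)
    where T : IComparable<T>
{
    var output = new List<T>();
    int a = 0, b = 0;
    while (a < listA.Count)
    {
        if (b >= listB.Count)
        {
            output.Add(listA[a++]);              // listB exhausted: keep the rest of listA
        }
        else
        {
            int cmp = listA[a].CompareTo(listB[b]);
            if (cmp < 0) output.Add(listA[a++]); // step 2: keep and advance listA
            else if (cmp > 0) b++;               // step 3: advance listB
            else { a++; b++; }                   // step 4: equal, drop it, advance both
        }
    }
    return output;
}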
I have somethings like this:
List<string> listUser = new List<string>();
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("user1other");
List<string> key_blacklist = new List<string>();
key_blacklist.Add("hacker");
key_blacklist.Add("other");
foreach (string user in listUser)
{
foreach (string key in key_blacklist)
{
if (user.Contains(key))
{
// remove it in listUser
}
}
}
The result of listUser is: user1, user2.
The problem is that if I have a huge listUser (more than 10 million items) and a huge key_blacklist (100,000 items), that code is very, very slow.
Is there any way to make it faster?
UPDATE: I found a new solution here:
http://cc.davelozinski.com/c-sharp/fastest-way-to-check-if-a-string-occurs-within-a-string
Hope that will help someone who ends up here! :)
If you don't have much control over how the list of users is constructed, you can at least test each item in the list in parallel, which on modern machines with multiple cores will speed up the checking a fair bit.
var allowedUsers = listUser.AsParallel().Where(
    s =>
    {
        foreach (var key in key_blacklist)
        {
            if (s.Contains(key))
            {
                return false; // not to be included
            }
        }
        return true; // to be included, as no match with the blacklist
    }).ToList();
Also - do you have to use .Contains? .Equals is going to be much much quicker, because in almost all cases a non-match will be determined when the HashCodes differ, which can be found only by an integer comparison. Super quick.
If you do need .Contains, you may want to think about restructuring the app. What do these strings in the list really represent? Separate sub-groups of users? Can I test each string, at the time it's added, for whether it represents a user on the blacklist?
UPDATE: In response to #Rawling's comment below - If you know that there is a finite set of usernames which have, say, "hacker" as a substring, that set would have to be pretty large before running a .Equals test of each username against a candidate would be slower than running .Contains on the candidate. This is because HashCode is really quick.
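If exact matching is acceptable (a user is blocked only when the whole name is on the blacklist, not when it merely contains a blacklisted word), a sketch of that restructuring could use a HashSet, reusing the listUser and key_blacklist names from the question:
// Build the set once: each lookup is then O(1) on average instead of
// scanning the whole blacklist per user.
var blacklist = new HashSet<string>(key_blacklist, StringComparer.OrdinalIgnoreCase);

List<string> allowedUsers = listUser
    .Where(user => !blacklist.Contains(user))
    .ToList();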
If you are using entity framework or linq to sql then using linq and sending the query to a server can improve the performance.
Then, instead of removing items, you actually query for the items that fulfil the requirement, i.e. users whose names don't contain a banned expression:
var allowedUsers = listUser.Where(u => !key_blacklist.Any(key => u.Contains(key))).ToList();
A possible solution is to use a tree-like data structure.
The basic idea is to have the blacklisted words organised like this:
+ h
| + ha
| + hac
| - hacker
| - [other words beginning with hac]
|
+ f
| + fu
| + fuk
| - fukoff
| - [other words beginning with fuk]
Then, when you check for blacklisted words, you avoid searching the whole list of words beginning with "hac" if you find out that your user string does not even contain "h".
In the example I provided, with your sample data, this of course does not make any difference, but with real data sets it should significantly reduce the number of Contains calls, since you don't check against the full list of blacklisted words every time.
Here is a code example (please note that the code is pretty bad, this is just to illustrate my idea)
using System;
using System.Collections.Generic;
using System.Linq;
class Program {
class Blacklist {
public string Start;
public int Level;
const int MaxLevel = 3;
public Dictionary<string, Blacklist> SubBlacklists = new Dictionary<string, Blacklist>();
public List<string> BlacklistedWords = new List<string>();
public Blacklist() {
Start = string.Empty;
Level = 0;
}
Blacklist(string start, int level) {
Start = start;
Level = level;
}
public void AddBlacklistedWord(string word) {
if (word.Length > Level && Level < MaxLevel) {
string index = word.Substring(0, Level + 1);
Blacklist sublist = null;
if (!SubBlacklists.TryGetValue(index, out sublist)) {
sublist = new Blacklist(index, Level + 1);
SubBlacklists[index] = sublist;
}
sublist.AddBlacklistedWord(word);
} else {
BlacklistedWords.Add(word);
}
}
public bool ContainsBlacklistedWord(string wordToCheck) {
if (wordToCheck.Length > Level && Level < MaxLevel) {
foreach (var sublist in SubBlacklists.Values) {
if (wordToCheck.Contains(sublist.Start)) {
return sublist.ContainsBlacklistedWord(wordToCheck);
}
}
}
return BlacklistedWords.Any(x => wordToCheck.Contains(x));
}
}
static void Main(string[] args) {
List<string> listUser = new List<string>();
listUser.Add("user1");
listUser.Add("user2");
listUser.Add("userhacker");
listUser.Add("userfukoff1");
Blacklist blacklist = new Blacklist();
blacklist.AddBlacklistedWord("hacker");
blacklist.AddBlacklistedWord("fukoff");
foreach (string user in listUser) {
if (blacklist.ContainsBlacklistedWord(user)) {
Console.WriteLine("Contains blacklisted word: {0}", user);
}
}
}
}
You are using the wrong thing. If you have a lot of data, you should be using either HashSet<T> or SortedSet<T>. If you don't need the data sorted, go with HashSet<T>. Here is a program I wrote to demonstrate the time differences:
class Program
{
private static readonly Random random = new Random((int)DateTime.Now.Ticks);
static void Main(string[] args)
{
Console.WriteLine("Creating Lists...");
var stringList = new List<string>();
var hashList = new HashSet<string>();
var sortedList = new SortedSet<string>();
var searchWords1 = new string[3];
int ndx = 0;
for (int x = 0; x < 1000000; x++)
{
string str = RandomString(10);
if (x == 5 || x == 500000 || x == 999999)
{
str = "Z" + str;
searchWords1[ndx] = str;
ndx++;
}
stringList.Add(str);
hashList.Add(str);
sortedList.Add(str);
}
Console.WriteLine("Lists created!");
var sw = new Stopwatch();
sw.Start();
bool search1 = stringList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("List<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
search1 = hashList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("HashSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
search1 = sortedList.Contains(searchWords1[2]);
sw.Stop();
Console.WriteLine("SortedSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
}
private static string RandomString(int size)
{
var builder = new StringBuilder();
char ch;
for (int i = 0; i < size; i++)
{
ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
builder.Append(ch);
}
return builder.ToString();
}
}
On my machine, I got the following results:
Creating Lists...
Lists created!
List<T> True ==> 15ms
HashSet<T> True ==> 0ms
SortedSet<T> True ==> 0ms
As you can see, List<T> was extremely slow compared to HashSet<T> and SortedSet<T>. Those were almost instantaneous.
Let's say I have a collection of some type, e.g.
IEnumerable<double> values;
Now I need to extract the k highest values from that collection, for some parameter k. This is a very simple way to do this:
values.OrderByDescending(x => x).Take(k)
However, this (if I understand it correctly) first sorts the entire list, then picks the first k elements. But if the list is very large, and k is comparatively small (smaller than log n), this is not very efficient - the list is sorted in O(n log n), but I figure selecting the k highest values from a list should be more like O(nk).
So, does anyone have any suggestion for a better, more efficient way to do this?
This gives a bit of a performance increase. Note that it's ascending rather than descending but you should be able to repurpose it (see comments):
static IEnumerable<double> TopNSorted(this IEnumerable<double> source, int n)
{
List<double> top = new List<double>(n + 1);
using (var e = source.GetEnumerator())
{
for (int i = 0; i < n; i++)
{
if (e.MoveNext())
top.Add(e.Current);
else
throw new InvalidOperationException("Not enough elements");
}
top.Sort();
while (e.MoveNext())
{
double c = e.Current;
int index = top.BinarySearch(c);
if (index < 0) index = ~index;
if (index < n) // if (index != 0)
{
top.Insert(index, c);
top.RemoveAt(n); // top.RemoveAt(0)
}
}
}
return top; // return ((IEnumerable<double>)top).Reverse();
}
Consider the below method:
static IEnumerable<double> GetTopValues(this IEnumerable<double> values, int count)
{
var maxSet = new List<double>(Enumerable.Repeat(double.MinValue, count));
var currentMin = double.MinValue;
foreach (var t in values)
{
if (t <= currentMin) continue;
maxSet.Remove(currentMin);
maxSet.Add(t);
currentMin = maxSet.Min();
}
return maxSet.OrderByDescending(i => i);
}
And the test program:
static void Main()
{
const int SIZE = 1000000;
const int K = 10;
var random = new Random();
var values = new double[SIZE];
for (var i = 0; i < SIZE; i++)
values[i] = random.NextDouble();
// Test values
values[SIZE/2] = 2.0;
values[SIZE/4] = 3.0;
values[SIZE/8] = 4.0;
IEnumerable<double> result;
var stopwatch = new Stopwatch();
stopwatch.Start();
result = values.OrderByDescending(x => x).Take(K).ToArray();
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds);
stopwatch.Restart();
result = values.GetTopValues(K).ToArray();
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds);
}
On my machine results are 1002 and 14.
Another way of doing this (haven't been around C# for years, so pseudo-code it is, sorry) would be:
highestList = []            // kept sorted, highest values retained
lowestValueOfHigh = -infinity
for every item in the list
    if (item > lowestValueOfHigh or highestList.length < k) {
        insert item into highestList with binary search
        if (highestList.length > k)
            delete highestList[highestList.length - 1]   // drop the smallest kept value
        lowestValueOfHigh = highestList[highestList.length - 1]
    }
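A rough C# reading of that pseudo-code (TopK is a name introduced here; the running list is kept sorted ascending, so the smallest retained value sits at index 0):
static List<double> TopK(IEnumerable<double> values, int k)
{
    var top = new List<double>(k + 1); // ascending: top[0] is the smallest kept value
    foreach (var item in values)
    {
        if (top.Count == k && item <= top[0])
            continue;                  // cannot beat the current k-th highest value

        int index = top.BinarySearch(item);
        if (index < 0) index = ~index;
        top.Insert(index, item);       // insert at the sorted position
        if (top.Count > k)
            top.RemoveAt(0);           // drop the smallest
    }
    top.Reverse();                     // highest first
    return top;
}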
I wouldn't state anything about performance without profiling. In this answer I'll just try to implement the O(n*k) take-one-enumeration-per-max-value approach. Personally, I think the ordering approach is superior. Anyway:
public static IEnumerable<double> GetMaxElements(this IEnumerable<double> source)
{
var usedIndices = new HashSet<int>();
while (true)
{
var enumerator = source.GetEnumerator();
int index = 0;
int maxIndex = 0;
double? maxValue = null;
while(enumerator.MoveNext())
{
if((!maxValue.HasValue||enumerator.Current>maxValue)&&!usedIndices.Contains(index))
{
maxValue = enumerator.Current;
maxIndex = index;
}
index++;
}
usedIndices.Add(maxIndex);
if (!maxValue.HasValue) break;
yield return maxValue.Value;
}
}
Usage:
var biggestElements = values.GetMaxElements().Take(3);
Downsides:
The method assumes that the source IEnumerable has a stable order.
The method uses additional memory and operations to save the used indices.
Advantage:
You can be sure that it takes one enumeration to get the next max value.
See it running
Here is a Linqy TopN operator for enumerable sequences, based on the PriorityQueue<TElement, TPriority> collection:
/// <summary>
/// Selects the top N elements from the source sequence. The selected elements
/// are returned in descending order.
/// </summary>
public static IEnumerable<T> TopN<T>(this IEnumerable<T> source, int n,
IComparer<T> comparer = default)
{
ArgumentNullException.ThrowIfNull(source);
if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
PriorityQueue<bool, T> top = new(comparer);
foreach (var item in source)
{
if (top.Count < n)
top.Enqueue(default, item);
else
top.EnqueueDequeue(default, item);
}
List<T> topList = new(top.Count);
while (top.TryDequeue(out _, out var item)) topList.Add(item);
for (int i = topList.Count - 1; i >= 0; i--) yield return topList[i];
}
Usage example:
IEnumerable<double> topValues = values.TopN(k);
The topValues sequence contains the k maximum values in the values, in descending order. In case there are duplicate values in the topValues, the order of the equal values is undefined (non-stable sort).
For a SortedSet<T>-based implementation that compiles on .NET versions earlier than .NET 6, you could look at the 5th revision of this answer.
An operator PartialSort with similar functionality exists in the MoreLinq package. It's not implemented optimally though (source code). It performs invariably a binary search for each item, instead of comparing it with the smallest item in the top list, resulting in many more comparisons than necessary.
Surprisingly the LINQ itself is well optimized for the OrderByDescending+Take combination, resulting in excellent performance. It's only slightly slower than the TopN operator above. This applies to all versions of the .NET Core and later (.NET 5 and .NET 6). It doesn't apply to the .NET Framework platform, where the complexity is O(n*log n) as expected.
A demo that compares 4 different approaches can be found here. It compares:
values.OrderByDescending(x => x).Take(k).
values.OrderByDescending(x => x).HideIdentity().Take(k), where HideIdentity is a trivial LINQ propagator that hides the identity of the underlying enumerable, and so it effectively disables the LINQ optimizations.
values.PartialSort(k, MoreLinq.OrderByDirection.Descending) (MoreLinq).
values.TopN(k)
Below is a typical output of the demo, running in Release mode on .NET 6:
.NET 6.0.0-rtm.21522.10
Extract the 100 maximum elements from 2,000,000 random values, and calculate the sum.
OrderByDescending+Take Duration: 156 msec, Comparisons: 3,129,640, Sum: 99.997344
OrderByDescending+HideIdentity+Take Duration: 1,415 msec, Comparisons: 48,602,298, Sum: 99.997344
MoreLinq.PartialSort Duration: 277 msec, Comparisons: 13,999,582, Sum: 99.997344
TopN Duration: 62 msec, Comparisons: 2,013,207, Sum: 99.997344
I can't figure out a discrepancy between the time it takes for the Contains method to find an element in an ArrayList and the time it takes for a small function that I wrote to do the same thing. The documentation states that Contains performs a linear search, so it's supposed to be in O(n) and not any other faster method. However, while the exact values may not be relevant, the Contains method returns in 00:00:00.1087087 seconds while my function takes 00:00:00.1876165. It might not be much, but this difference becomes more evident when dealing with even larger arrays. What am I missing and how should I write my function to match Contains's performances?
I'm using C# on .NET 3.5.
public partial class Window1 : Window
{
public bool DoesContain(ArrayList list, object element)
{
for (int i = 0; i < list.Count; i++)
if (list[i].Equals(element)) return true;
return false;
}
public Window1()
{
InitializeComponent();
ArrayList list = new ArrayList();
for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);
Stopwatch sw = new Stopwatch();
sw.Start();
//Console.Out.WriteLine(list.Contains("zzz 9000000") + " " + sw.Elapsed);
Console.Out.WriteLine(DoesContain(list, "zzz 9000000") + " " + sw.Elapsed);
}
}
EDIT:
Okay, now, lads, look:
public partial class Window1 : Window
{
public bool DoesContain(ArrayList list, object element)
{
int count = list.Count;
for (int i = count - 1; i >= 0; i--)
if (element.Equals(list[i])) return true;
return false;
}
public bool DoesContain1(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
if (element.Equals(list[i])) return true;
return false;
}
public Window1()
{
InitializeComponent();
ArrayList list = new ArrayList();
for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);
Stopwatch sw = new Stopwatch();
long total = 0;
int nr = 100;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
DoesContain(list,"zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
total = 0;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
DoesContain1(list, "zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
total = 0;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
list.Contains("zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
}
}
I made an average of 100 runs for two versions of my function (forward and backward loop) and for the default Contains function. The times I got are 136 and 133 milliseconds for my functions, and a distant winner of 87 for the Contains version. Well now, if before you could argue that the data was scarce and I based my conclusions on a first, isolated run, what do you say about this test? Not only does Contains perform better on average, it achieves consistently better results in each run. So, is there some kind of disadvantage here for 3rd-party functions, or what?
First, you're not running it many times and comparing averages.
Second, your method isn't being jitted until it actually runs. So the just in time compile time is added into its execution time.
A true test would run each multiple times and average the results (any number of things could cause one or the other to be slower for run X out of a total of Y), and your assemblies should be pre-jitted using ngen.exe.
As you're using .NET 3.5, why are you using ArrayList to start with, rather than List<string>?
A few things to try:
You could see whether using foreach instead of a for loop helps
You could cache the count:
public bool DoesContain(ArrayList list, object element)
{
    int count = list.Count;
    for (int i = 0; i < count; i++)
    {
        if (list[i].Equals(element))
        {
            return true;
        }
    }
    return false;
}
You could reverse the comparison:
if (element.Equals(list[i]))
While I don't expect any of these to make a significant (positive) difference, they're the next things I'd try.
Do you need to do this containment test more than once? If so, you might want to build a HashSet<T> and use that repeatedly.
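For that repeated-lookup case, a minimal sketch (assuming the same "zzz i" test data as above) would be:
// Build once: O(n). Each subsequent lookup is O(1) on average.
var lookup = new HashSet<string>();
for (int i = 0; i < 10000000; i++)
    lookup.Add("zzz " + i);

bool found = lookup.Contains("zzz 9000000"); // near-instant compared to a linear scan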
I'm not sure if you're allowed to post Reflector code, but if you open the method using Reflector, you can see that's it's essentially the same (there are some optimizations for null values, but your test harness doesn't include nulls).
The only difference that I can see is that calling list[i] does bounds checking on i whereas the Contains method does not.
Using the code below I was able to get the following timings relatively consistently (within a few ms):
1: 190ms DoesContainRev
2: 198ms DoesContainRev1
3: 188ms DoesContainFwd
4: 203ms DoesContainFwd1
5: 199ms Contains
Several things to notice here.
This is run with release-compiled code from the command line. Many people make the mistake of benchmarking code inside the Visual Studio debugging environment; not to say anyone here did, but it's something to be careful of.
The list[i].Equals(element) appears to be just a bit slower than element.Equals(list[i]).
using System;
using System.Diagnostics;
using System.Collections;
namespace ArrayListBenchmark
{
class Program
{
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
const int arrayCount = 10000000;
ArrayList list = new ArrayList(arrayCount);
for (int i = 0; i < arrayCount; i++) list.Add("zzz " + i);
sw.Start();
DoesContainRev(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("1: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
DoesContainRev1(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("2: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
DoesContainFwd(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("3: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
DoesContainFwd1(list, "zzz");
sw.Stop();
Console.WriteLine(String.Format("4: {0}", sw.ElapsedMilliseconds));
sw.Reset();
sw.Start();
list.Contains("zzz");
sw.Stop();
Console.WriteLine(String.Format("5: {0}", sw.ElapsedMilliseconds));
sw.Reset();
Console.ReadKey();
}
public static bool DoesContainRev(ArrayList list, object element)
{
int count = list.Count;
for (int i = count - 1; i >= 0; i--)
if (element.Equals(list[i])) return true;
return false;
}
public static bool DoesContainFwd(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
if (element.Equals(list[i])) return true;
return false;
}
public static bool DoesContainRev1(ArrayList list, object element)
{
int count = list.Count;
for (int i = count - 1; i >= 0; i--)
if (list[i].Equals(element)) return true;
return false;
}
public static bool DoesContainFwd1(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
if (list[i].Equals(element)) return true;
return false;
}
}
}
With a really good optimizer there should be no difference at all, because the semantics seem to be the same. However, the existing optimizer cannot optimize your function as well as the hard-coded Contains is optimized. Some points for optimization:
comparing against a property each time can be slower than counting downwards and comparing against 0
the function call itself has a performance penalty
using iterators instead of explicit indexing can be faster (a foreach loop instead of a plain for loop)
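For the last point, a foreach version of the helper would look roughly like this (whether it actually beats the indexed loop is something to measure rather than assume):
public bool DoesContain(ArrayList list, object element)
{
    // foreach uses the ArrayList's enumerator instead of repeated indexer calls.
    foreach (object item in list)
    {
        if (element.Equals(item))
            return true;
    }
    return false;
}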
First, if you are using types you know ahead of time, I'd suggest using generics - so List<string> instead of ArrayList. Under the hood, ArrayList.Contains actually does a bit more than what you are doing. The following is from Reflector:
public virtual bool Contains(object item)
{
if (item == null)
{
for (int j = 0; j < this._size; j++)
{
if (this._items[j] == null)
{
return true;
}
}
return false;
}
for (int i = 0; i < this._size; i++)
{
if ((this._items[i] != null) && this._items[i].Equals(item))
{
return true;
}
}
return false;
}
Notice that it forks itself on being passed a null value for item. However, since all the values in your example are not null, the additional check on null at the beginning and in the second loop should in theory take longer.
Are you positive you are dealing with fully compiled code? I.e., when your code runs the first time it gets JIT-compiled, whereas the framework is obviously already compiled.
After your Edit, I copied the code and made a few improvements to it.
The difference was not reproducible; it turns out to be a measuring/rounding issue.
To see that, change your runs to this form:
sw.Reset();
sw.Start();
for (int i = 0; i < nr; i++)
{
DoesContain(list,"zzz");
}
total += sw.ElapsedMilliseconds;
Console.WriteLine(total / nr);
I just moved some lines. The JIT issue was insignificant with this number of repetitions.
My guess would be that ArrayList is written in C++ and could be taking advantage of some micro-optimizations (note: this is a guess).
For instance, in C++ you can use pointer arithmetic (specifically incrementing a pointer to iterate an array) to be faster than using an index.
Using an array structure, you can't search faster than O(n) without any additional information.
If you know that the array is sorted, then you can use a binary search algorithm and spend only O(log(n)).
Otherwise you should use a set.
Revised after reading comments:
An ArrayList (or List<T>) does not use any hash algorithm to enable fast lookup.
Use SortedList<TKey,TValue>, Dictionary<TKey, TValue> or System.Collections.ObjectModel.KeyedCollection<TKey, TValue> for fast access based on a key.
var list = new List<myObject>(); // Search is sequential
var dictionary = new Dictionary<myObject, myObject>(); // key based lookup, but no sequential lookup, Contains fast
var sortedList = new SortedList<myObject, myObject>(); // key based and sequential lookup, Contains fast
KeyedCollection<TKey, TValue> is also fast and allows indexed lookup; however, it needs to be inherited, as it is abstract, so you need a specific collection. With the following, though, you can create a generic KeyedCollection.
public class GenericKeyedCollection<TKey, TValue> : KeyedCollection<TKey, TValue> {
public GenericKeyedCollection(Func<TValue, TKey> keyExtractor) {
this.keyExtractor = keyExtractor;
}
private Func<TValue, TKey> keyExtractor;
protected override TKey GetKeyForItem(TValue value) {
return this.keyExtractor(value);
}
}
The advantage of using the KeyedCollection is that the Add method does not require that a key is specified.
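Usage could look roughly like this (Person is an illustrative type with a Name property, in the same spirit as myObject above; it is not part of the answer):
// The key is extracted from the item itself, so Add takes only the value.
var people = new GenericKeyedCollection<string, Person>(p => p.Name);
people.Add(new Person { Name = "Ada" });
people.Add(new Person { Name = "Grace" });

Person ada = people["Ada"]; // fast key-based lookup by the extracted key
Person first = people[0];   // indexed access is still available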