Fast lookup in a multidimensional System.Collections.IList - c#

I need your advice on the following.
I have a multi-dimensional IList containing items which have an index, Id and Text. Normally I know the value of Id and based on that I need to get the Text. Both Id and Text values are read from a database.
What we are currently using to get the value of Text field is:
foreach (Object myObj in List)
{
if (((MessageType)myObj).Id == id)
{
return ((MessageType)myObj).Text;
}
}
When count in IList becomes large (more than 32K), it takes some time to process.
Question: Is there a way to efficiently get the Text value without iterating through the IList?
Things I tried without success:
Use List.IndexOf(Id) - did not work because IndexOf applies to text only.
Converting List to multi-dimensional array - failed on List.CopyTo(array,0) my guess because it is multi-dimensional:
string[] array=new string[List.Count,List.Count];
List.CopyTo(array,0);
I cannot use an AJAX/jQuery solution because this is an existing (live) project and it would take too much effort to re-code.
Thanks

If you want fast searching by some identifier in a collection with 32k elements, you should use Dictionary<K,V> as your collection.
var dict = new Dictionary<IDType, MessageType>();
A Dictionary is a hash table: the hash code of the key determines the bucket an element is stored in, so an element with a specific key (in your case Id) can be found without looking at all elements. For more information see MSDN.
If you cannot refactor the collection to be a dictionary, you may initially fill the dictionary (slow) and then search in the dictionary (fast). This will only be faster if you do multiple searches before you fill the dictionary again, i.e. if your list does not change often.
foreach(object o in List)
{
var msg = (MessageType)o;
dict.Add(msg.Id, msg);
}
Searching then is easy:
MessageType msg = dict[id];
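One caveat with the indexer: dict[id] throws a KeyNotFoundException when the Id is missing, whereas the original loop simply fell through. Below is a minimal sketch of a lookup that tolerates missing Ids. It assumes MessageType has string Id and Text members as in the question; the MessageIndex class and its method names are made up for illustration.

```csharp
using System.Collections;
using System.Collections.Generic;

// Stand-in for the type from the question; the real one comes from the project.
public class MessageType
{
    public string Id;
    public string Text;
}

// Hypothetical helper, not part of any library.
public static class MessageIndex
{
    // Builds the dictionary once from the non-generic IList (one pass, O(n)).
    public static Dictionary<string, MessageType> Build(IList list)
    {
        var dict = new Dictionary<string, MessageType>(list.Count);
        foreach (object o in list)
        {
            var msg = (MessageType)o;
            dict[msg.Id] = msg; // the indexer overwrites duplicate Ids; Add would throw on them
        }
        return dict;
    }

    // Returns the Text for an Id, or null when the Id is unknown.
    public static string FindText(Dictionary<string, MessageType> dict, string id)
    {
        MessageType msg;
        return dict.TryGetValue(id, out msg) ? msg.Text : null;
    }
}
```

Whether null is the right answer for an unknown Id depends on the caller; throwing, as the indexer does, may be preferable when a missing Id indicates a bug.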
EDIT: Well, I was curious and wrote a test routine which compares the linear search and the dictionary approach. Here's what I used:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
namespace ConsoleApplication1
{
class MessageType
{
public string Id;
public string Text;
}
class Program
{
static void Main(string[] args)
{
var rand = new Random ();
// filling a list with random text messages
List<MessageType> list = new List<MessageType>();
for (int i = 0; i < 32000; i++)
{
string txt = rand.NextDouble().ToString();
var msg = new MessageType() {Id = i.ToString(), Text = txt };
list.Add(msg);
}
IList List = (IList)list;
// doing some random searches
foreach (int some in new int[] { 2, 10, 100, 1000 })
{
var watch1 = new Stopwatch();
var watch2 = new Stopwatch();
Dictionary<string, MessageType> dict = null;
for (int i = 0; i < some; i++)
{
string id = rand.Next(32000).ToString();
watch1.Start();
LinearLookup(List, id);
watch1.Stop();
watch2.Start();
// fill once
if (dict == null)
{
dict = new Dictionary<string, MessageType>();
foreach (object o in List)
{
var msg = (MessageType)o;
dict.Add(msg.Id, msg);
}
}
// lookup
DictionaryLookup(dict, id);
watch2.Stop();
}
Console.WriteLine(some + " x LinearLookup took "
+ watch1.Elapsed.TotalSeconds + "s");
Console.WriteLine("Dictionary fill and " + some
+ " x DictionaryLookup took "
+ watch2.Elapsed.TotalSeconds + "s");
}
}
static string LinearLookup(IList List, string id)
{
foreach (object myObj in List)
{
if (((MessageType)myObj).Id == id)
{
return ((MessageType)myObj).Text;
}
}
throw new Exception();
}
static string DictionaryLookup(Dictionary<string, MessageType> dict,
string id)
{
return dict[id].Text;
}
}
}
The results I got in Release / x86:
Number of | Time [ms] with | Time[ms] with | Speedup (approx.)
searches | linear search | dictionary(*) | with dictionary
----------+----------------+---------------+-----------------
2 | 1.161 | 2.006 | 0.6
----------+----------------+---------------+-----------------
10 | 2.834 | 2.060 | 1.4
----------+----------------+---------------+-----------------
100 | 25.39 | 1.973 | 13
----------+----------------+---------------+-----------------
1000 | 261.4 | 5.836 | 45
----------+----------------+---------------+-----------------
(*) including filling the dictionary once.
So, I was a bit optimistic to say that searching twice would already pay off. In my test application I have to search 10 times for the dictionary to be faster.
I'm sorry I could not make a more realistic example, my Ids are all sorted. Feel free to try modifying and experimenting though ;-)

From the looks of it you have a List<MessageType> here, which is not multi-dimensional. Rather the objects inside the list have multiple properties.
You could easily get them out with LINQ, which is more concise than a loop:
var text = (from MessageType msgType in myList
where msgType.Id == id
select msgType.Text).FirstOrDefault();
Or even easier with an inline LINQ statement:
var text = myList.Where(s => s.Id == id).Select(s => s.Text).FirstOrDefault();
NOTE: As mentioned in comments above, the speed of these LINQ statements is only as good as the object's position in the List. If it is the last object in the list, you will likely see the same performance problem. A Dictionary<Index, MessageType> is going to be much more performant.

A better way is to use ILookup.
For example:
var look = query.ToLookup(x => x.SomeID, y => y.Name);
and use:
if (look.Contains(myID)){
var name = look[myID].First();
}
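Unlike a dictionary, a lookup can hold several values per key, and indexing it with a missing key yields an empty sequence rather than throwing. A small sketch of that behavior, assuming items with a string Id and Text (the Message class and TextLookup helper are hypothetical names, not from the question):

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in for the question's MessageType.
public class Message
{
    public string Id;
    public string Text;
}

public static class TextLookup
{
    // ToLookup groups the texts by Id in a single pass over the source.
    public static ILookup<string, string> Build(IEnumerable<Message> messages)
    {
        return messages.ToLookup(m => m.Id, m => m.Text);
    }

    // Indexing a lookup with an unknown key returns an empty sequence,
    // so FirstOrDefault() gives null instead of throwing.
    public static string FirstText(ILookup<string, string> lookup, string id)
    {
        return lookup[id].FirstOrDefault();
    }
}
```

If every Id is unique, a Dictionary is the more direct fit; the lookup mainly helps when duplicate keys are possible.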

Related

Merge data from two arrays or something else

How to combine Id from the list I get from file /test.json and id from list ourOrders[i].id?
Or if there is another way?
private RegionModel FilterByOurOrders(RegionModel region, List<OurOrderModel> ourOrders, MarketSettings market, bool byOurOrders)
{
var result = new RegionModel
{
updatedTs = region.updatedTs,
orders = new List<OrderModel>(region.orders.Count)
};
var json = File.ReadAllText("/test.json");
var otherBotOrders = JsonSerializer.Deserialize<OrdersTimesModel>(json);
OtherBotOrders = new Dictionary<string, OrderTimesInfoModel>();
foreach (var otherBotOrder in otherBotOrders.OrdersTimesInfo)
{
//OtherBotOrders.Add(otherBotOrder.Id, otherBotOrder);
BotController.WriteLine($"{otherBotOrder.Id}"); //Output ID orders to the console works
}
foreach (var order in region.orders)
{
if (ConvertToDecimal(order.price) < 1 || !byOurOrders)
{
int i = 0;
var isOurOrder = false;
while (i < ourOrders.Count && !isOurOrder)
{
if (ourOrders[i].id.Equals(order.id, StringComparison.InvariantCultureIgnoreCase))
{
isOurOrder = true;
}
++i;
}
if (!isOurOrder)
{
result.orders.Add(order);
}
}
}
return result;
}
OrdersTimesModel Looks like that:
public class OrdersTimesModel
{
public List<OrderTimesInfoModel> OrdersTimesInfo { get; set; }
}
test.json:
{"OrdersTimesInfo":[{"Id":"1"},{"Id":"2"}]}
Added:
I'll try to clarify the question:
There are three lists with ID:
First (all orders): region.orders, as order.id
Second (our orders): ourOrders, as ourOrders[i].id in a while loop
Third (our orders 2): from the /test.json file, as an array {"Orders":[{"Id":"12345..."...},{"Id":"12345..." ...}...]}
There is a foreach in which there is a while, where the First (all orders) list and the Second (our orders) list are compared. If the id's match, then these are our orders: isOurOrder = true;
Accordingly, those orders that isOurOrder = false; will be added to the result: result.orders.Add(order)
I need:
So that if (ourOrders[i].id.Equals(order.id, StringComparison.InvariantCultureIgnoreCase)) would include more Id's from the Third (our orders 2) list.
Or any other way to do it?
You should be able to completely avoid writing loops if you use LINQ (there will be loops running in the background, but it's way easier to read)
You can access some documentation here: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/introduction-to-linq-queries
and you have some pretty cool extension methods for arrays: https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable?view=net-6.0 (these are great to get your code easy to read)
Solution
using System.Linq;
private RegionModel FilterByOurOrders(RegionModel region, List<OurOrderModel> ourOrders, MarketSettings market, bool byOurOrders)
{
var result = new RegionModel
{
updatedTs = region.updatedTs,
orders = new List<OrderModel>(region.orders.Count)
};
var json = File.ReadAllText("/test.json");
var otherBotOrders = JsonSerializer.Deserialize<OrdersTimesModel>(json);
// This line should get you a sequence containing
// JUST the ids in the JSON file
var idsFromJsonFile = otherBotOrders.OrdersTimesInfo.Select(x => x.Id);
// Here you'll get a sequence with the ids for your orders
var idsFromOurOrders = ourOrders.Select(x => x.id);
// Union will only take unique values,
// so you avoid repetition. ToList() runs the query once
// instead of on every Contains call below.
var mergedArrays = idsFromJsonFile.Union(idsFromOurOrders).ToList();
// Now we just need to query the region orders
// We'll keep every element whose id is NOT contained in the list we created earlier
var filteredRegionOrders = region.orders.Where(x => !mergedArrays.Contains(x.id));
result.orders.AddRange(filteredRegionOrders );
return result;
}
You can add conditions to any of those actions (like checking for order price or the boolean flag you get as a parameter), and of course you can do it without assigning so many variables, I did it that way just to make it easier to explain.
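One thing to watch: the loop in the question compared ids case-insensitively, while Union and Contains use ordinal string equality by default. Here is a sketch that keeps the original comparison and uses a HashSet so that each membership test is roughly constant time. It works on plain id strings for brevity; OrderFilter and its parameter names are illustrative, not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper working on id strings only.
public static class OrderFilter
{
    // Returns the ids from allIds that appear in neither of the "ours" sequences.
    public static List<string> ExcludeOurs(
        IEnumerable<string> allIds,
        IEnumerable<string> ourIds,
        IEnumerable<string> otherBotIds)
    {
        // Same comparison rule as the Equals call in the question.
        var ours = new HashSet<string>(ourIds, StringComparer.InvariantCultureIgnoreCase);
        ours.UnionWith(otherBotIds); // duplicates are ignored by the set

        return allIds.Where(id => !ours.Contains(id)).ToList();
    }
}
```

In the real method the same set could be built from ourOrders and the deserialized JSON, and the Where clause applied to region.orders.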

I want to have data structure with (start,end) as key and then be able to search integer in whole data structure and get corresponding value

I'd like to have a
data structure with <Key, Value> where
Key = (start, end),
Value = string
After which I should be able to search an integer optimally in the data structure and get corresponding value.
Example:
var lookup = new Something<(int, int), string>()
{
{(1,100),"In 100s"},
{(101,200),"In 100-200"},
}
var value1 = lookup[10]; //value1 = "In 100s"
var value2 = lookup[110]; //value2 = "In 100-200"
Could anyone suggest?
If you want to be able to use something like lookup[10] as you mentioned, you can create your own class that implements some sort of key/value data type. Which underlying data type you ultimately decide to use really depends on what your data looks like.
Here's a simple example of doing this by inheriting from Dictionary<>:
public class RangeLookup : Dictionary<(int Min, int Max), string>
{
public string this[int index] => this.Single(x => x.Key.Min <= index && index <= x.Key.Max).Value;
}
This allows you to define a custom indexer on top of the dictionary to encapsulate your range lookup. A usage of this class would look like:
var lookup = new RangeLookup
{
{ (1, 100), "In 100s" },
{ (101, 200), "In 101-200s" },
};
Console.WriteLine($"50: {lookup[50]}");
Which produces output as:
50: In 100s
In terms of performance with this approach, the following is an example of some tests (using Win10 with an Intel i7-4770 CPU) retrieving a value from a dictionary with 10,000,000 records:
var lookup = new RangeLookup();
for (var i = 1; i <= 10000000; i++)
{
var max = i * 100;
var min = max - 99;
lookup.Add((min, max), $"In {min}-{max}s");
}
var stopwatch = new Stopwatch();
stopwatch.Start();
Console.WriteLine($"50: {lookup[50]} (TimeToLookup: {stopwatch.ElapsedMilliseconds})");
stopwatch.Restart();
Console.WriteLine($"5,000: {lookup[5000]} (TimeToLookup: {stopwatch.ElapsedMilliseconds})");
stopwatch.Restart();
Console.WriteLine($"1,000,000,000: {lookup[1000000000]} (TimeToLookup: {stopwatch.ElapsedMilliseconds})");
Which gives the following results:
So unless you plan on working with more than tens of millions of records inside of this data set, an approach like this should be satisfactory in terms of performance.
You basically have a Dictionary<> structure here, for example:
var lookup = new Dictionary<(int, int), string>()
{
{(1,100),"In 100s"},
{(101,200),"In 100-200"},
};
You can use some basic Linq queries to search that container, for example:
var searchValue = 10;
var value1 = lookup.First(l => l.Key.Item1 <= searchValue && l.Key.Item2 >= searchValue).Value;
searchValue = 110;
var value2 = lookup.First(l => l.Key.Item1 <= searchValue && l.Key.Item2 >= searchValue).Value;
But as Lee suggested in the comments, you might get better performance using a SortedDictionary. Your mileage may vary, so test the performance of both.
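Both approaches above scan the entries one by one on every lookup. If the ranges never overlap, a sketch of an alternative is to keep them sorted by start value and binary-search for the candidate range, which takes O(log n) per lookup. The RangeTable class below is a hypothetical illustration that assumes inclusive, non-overlapping ranges.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical range table; assumes inclusive ranges that do not overlap.
public class RangeTable
{
    private readonly int[] starts;
    private readonly int[] ends;
    private readonly string[] values;

    public RangeTable(IEnumerable<(int Start, int End, string Value)> ranges)
    {
        // Sort once so that lookups can use binary search.
        var sorted = ranges.OrderBy(r => r.Start).ToArray();
        starts = sorted.Select(r => r.Start).ToArray();
        ends = sorted.Select(r => r.End).ToArray();
        values = sorted.Select(r => r.Value).ToArray();
    }

    public bool TryGet(int key, out string value)
    {
        int i = Array.BinarySearch(starts, key);
        if (i < 0)
            i = ~i - 1; // index of the last range whose start is below key
        if (i >= 0 && key <= ends[i])
        {
            value = values[i];
            return true;
        }
        value = null; // key falls before the first range or in a gap
        return false;
    }
}
```

A TryGet method is used instead of an indexer so that keys outside every range do not need an exception; an indexer could be layered on top if the lookup[10] syntax matters.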

Appropriate datastructure for key.contains(x) Map/Dictionary

I am somewhat struggling with the terminology and complexity of my explanations here, feel free to edit it.
I have 1,000 to 20,000 objects. Each one can contain several name words (first, second, middle, last, title...) and normalized numbers (home, business...), email addresses or even physical addresses and spouse names.
I want to implement a search that enables users to freely combine word parts and number parts. When I search for "LL 676" I want to find all objects that contain any String with "LL" AND "676".
Currently I am iterating over every object and every objects property, split the searchString on " " and do a stringInstance.Contains(searchword).
This is too slow, so I am looking for a better solution.
What is the appropriate language agnostic data structure for this?
In my case I need it for C#.
Is the following data structure a good solution?
It's based on a HashMap/Dictionary.
At first I create a String that contains all name parts and phone numbers I want to look through, one example would be: "William Bill Henry Gates III 3. +436760000 billgatesstreet 12":
Then I split on " " and for every word x I create all possible substrings y that fulfill x.contains(y). I put each of those substrings inside the hashmap/dictionary.
On lookup/search I just need to call the search for every searchword and then join the results. Naturally, the lookup speed is blazingly fast (native Hashmap/Dictionary speed).
EDIT: Inserts are very fast as well (insignificant time) now that I use a smarter algorithm to get the substrings.
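The idea described in the question can be sketched roughly as follows. A word of length n has n(n+1)/2 substrings (counting repeats), so "676" contributes 6 keys. This is only an illustration under some assumptions: matching is made case-insensitive by upper-casing, words are split on single spaces, and the SubstringIndex class name is invented here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical index: maps every substring of every word to the items containing it.
public class SubstringIndex<T>
{
    private static readonly char[] Separators = { ' ' };

    private readonly Dictionary<string, HashSet<T>> index =
        new Dictionary<string, HashSet<T>>();

    // Registers every substring of every word in text as a key for item.
    public void Add(string text, T item)
    {
        foreach (string word in Words(text))
        {
            for (int start = 0; start < word.Length; start++)
            {
                for (int length = 1; start + length <= word.Length; length++)
                {
                    string key = word.Substring(start, length);
                    HashSet<T> items;
                    if (!index.TryGetValue(key, out items))
                    {
                        items = new HashSet<T>();
                        index[key] = items;
                    }
                    items.Add(item);
                }
            }
        }
    }

    // Returns the items that contain every search word (logical AND).
    public List<T> Search(string query)
    {
        HashSet<T> result = null;
        foreach (string word in Words(query))
        {
            HashSet<T> items;
            if (!index.TryGetValue(word, out items))
                return new List<T>(); // one word matches nothing, so the AND is empty
            if (result == null)
                result = new HashSet<T>(items); // copy so the index itself is not modified
            else
                result.IntersectWith(items);
        }
        return result == null ? new List<T>() : result.ToList();
    }

    private static string[] Words(string text)
    {
        return text.ToUpperInvariant().Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The trade-off is memory: every substring becomes a key, so long words in a large data set can make the index considerably bigger than the data itself.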
It's possible I've misunderstood your algorithm or requirement, but this seems like it could be a potential performance improvement:
foreach (string arg in searchWords)
{
if (String.IsNullOrEmpty(arg))
continue;
tempList = new List<T>();
if (dictionary.ContainsKey(arg))
foreach (T obj in dictionary[arg])
if (list.Contains(obj))
tempList.Add(obj);
list = new List<T>(tempList);
}
The idea is that you do the first search word separately before this, and only put all the subsequent words into the searchWords list.
That should allow you to remove your final foreach loop entirely. Results only stay in your list as long as they keep matching every searchWord, rather than initially having to pile everything that matches a single word in then filter them back out at the end.
In case anyone cares for my solution:
Disclaimer:
This is only a rough draft.
I have only done some synthetic testing and I have written a lot of it without testing it again. I have revised my code: inserts are now ((n^2)/2)+(n/2), i.e. n(n+1)/2, substrings per word of length n instead of 2^n-1, which is far faster for long words. Word length is now irrelevant.
namespace MegaHash
{
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
public class GenericConcurrentMegaHash<T>
{
// After doing a bulk add, call AwaitAll() to ensure all data was added!
private ConcurrentBag<Task> bag = new ConcurrentBag<Task>();
private ConcurrentDictionary<string, List<T>> dictionary = new ConcurrentDictionary<string, List<T>>();
// consider changing this to include for example '-'
public char[] splitChars;
public GenericConcurrentMegaHash()
: this(new char[] { ' ' })
{
}
public GenericConcurrentMegaHash(char[] splitChars)
{
this.splitChars = splitChars;
}
public void Add(string keyWords, T o)
{
keyWords = keyWords.ToUpper();
foreach (string keyWord in keyWords.Split(splitChars))
{
if (keyWord == null || keyWord.Length < 1)
continue; // skip empty tokens (e.g. from double spaces) instead of aborting the whole Add
this.bag.Add(Task.Factory.StartNew(() => { AddInternal(keyWord, o); }));
}
}
public void AwaitAll()
{
lock (this.bag)
{
foreach (Task t in bag)
t.Wait();
this.bag = new ConcurrentBag<Task>();
}
}
private void AddInternal(string key, T y)
{
for (int i = 0; i < key.Length; i++)
{
for (int i2 = 0; i2 < i + 1; i2++)
{
string desire = key.Substring(i2, key.Length - i);
// GetOrAdd is atomic, so two tasks cannot replace each other's list for the same key
List<T> l = dictionary.GetOrAdd(desire, _ => new List<T>());
lock (l)
{
if (!l.Contains(y))
l.Add(y);
}
}
}
}
public IList<T> FulltextSearch(string searchString)
{
searchString = searchString.ToUpper();
List<T> list = new List<T>();
string[] searchWords = searchString.Split(splitChars);
foreach (string arg in searchWords)
{
if (arg == null || arg.Length < 1)
continue;
if (dictionary.ContainsKey(arg))
foreach (T obj in dictionary[arg])
if (!list.Contains(obj))
list.Add(obj);
}
List<T> returnList = new List<T>();
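foreach (T o in list)
{
bool matchesAll = true;
foreach (string arg in searchWords)
{
List<T> matches;
// skip empty words; TryGetValue avoids a KeyNotFoundException for unknown words
if (arg.Length > 0 && (!dictionary.TryGetValue(arg, out matches) || !matches.Contains(o)))
{
matchesAll = false;
break;
}
}
if (matchesAll)
returnList.Add(o);
}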
return returnList;
}
}
}

List only contains first two items

Ok, I'm pretty sure isNumber is finally working. Thanks to everyone for the help. I think I'm almost ready to start working on this project for real. I'm just trying to wrap my head around lists.
What I'm doing is trying to check a bunch of inputs at once if they are numbers and store the results in a list. That way, to find out if one of them is a number, I can just check the corresponding value in the second list to find out.
So, my problem is that I'm clearly putting 3 things in my list, but when I have it print out the count of items it always displays 2. What the heck is wrong with this? Specifically, why does areNumbers always return a list of length 2 when I am obviously making it at least as long as numberOfNumbers?
PS I know my code doesn't look very nice yet. I want to get the basics right before I learn about style.
static void Main(string[] args)
{
var maybe = new ArrayList(3);
maybe.Add(100f);
maybe.Add("not a number");
maybe.Add(1000);
Console.WriteLine(areNumbers(maybe).Count);
Console.ReadLine();
}
static ArrayList areNumbers(ArrayList maybeNumbers)
{
var theResults = new ArrayList(0);
var numbersEnumerator = maybeNumbers.GetEnumerator();
var numberOfNumbers = 0;
try
{
for (; ; )
{
numberOfNumbers = numberOfNumbers + 1;
numbersEnumerator.MoveNext();
var myIsNumber = isNumber(numbersEnumerator.Current);
var myAreNumbers = new ArrayList(numberOfNumbers);
myAreNumbers.Add(theResults);
myAreNumbers.Add(myIsNumber);
theResults = myAreNumbers;
}
}
catch (InvalidOperationException)
{
return theResults;
}
}
static bool isNumber(object theObject)
{
var s = theObject.GetType().ToString().ToUpper();
Console.WriteLine(s);
return theObject is int || theObject is Int64 || theObject is float || theObject is double;
}
Like the commenters stated, the return value areNumbers will at most ever be an ArrayList with 2 items (first item would be an ArrayList of booleans for items 0 thru N-2; the second item would be a boolean value for the (N-1)th value). If I stepped through the code in my head correctly, you would get an empty ArrayList if you sent it an empty ArrayList.
After one item:
areNumbers[0]: [] // empty ArrayList
areNumbers[1]: true
After two items:
areNumbers[0]: [[], true] // after first item
areNumbers[1]: false
After three items
areNumbers[0]: [[[], true], false] // after second item
areNumbers[1]: true
If you were to call with a 4th value that was numeric:
areNumbers[0]: [[[[], true], false], true]
areNumbers[1]: true
Now hopefully you aren't stuck in the pre-generics & pre-LINQ world...
Where will filter based on your isNumber function:
var maybeNumbers = new List<object>{ 100f, "not a number", 1000 };
var areNumbers = maybeNumbers.Where(isNumber).ToList();
Assert.AreEqual(2, areNumbers.Count()); //passes!
If you're pre-LINQ, try this:
List<object> maybeNumbers = new List<object>();
maybeNumbers.Add(100f);
maybeNumbers.Add("not a number");
maybeNumbers.Add(1000);
List<object> areNumbers = new List<object>();
foreach(object maybe in maybeNumbers)
{
if (isNumber(maybe))
areNumbers.Add(maybe);
}
Pre-generics (may not compile...)
ArrayList maybeNumbers = new ArrayList();
maybeNumbers.Add(100f);
maybeNumbers.Add("not a number");
maybeNumbers.Add(1000);
ArrayList areNumbers = new ArrayList();
foreach(object maybe in maybeNumbers)
{
if (isNumber(maybe))
areNumbers.Add(maybe);
}
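The snippets above filter the list down to the numeric items. If what is wanted is the parallel list of flags described in the question (one bool per input, in the same order), a sketch could look like this. It reuses the type checks from the question's isNumber; the NumberFlags class name is hypothetical.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical helper producing one flag per input element.
public static class NumberFlags
{
    // Same type checks as the question's isNumber.
    public static bool IsNumber(object o)
    {
        return o is int || o is long || o is float || o is double;
    }

    // Returns one flag per input, in the same order as the inputs.
    public static List<bool> AreNumbers(IEnumerable maybeNumbers)
    {
        var flags = new List<bool>();
        foreach (object o in maybeNumbers)
            flags.Add(IsNumber(o)); // append to the same list instead of nesting lists
        return flags;
    }
}
```

With the three inputs from the question this gives a list of length 3, so the flag at index i describes the input at index i.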
This will loop through a list of objects and give you a boolean response letting you know if they are numeric, which I think is what your code is doing in the end.
var testNumbers = new List<object>();
testNumbers.Add(15);
testNumbers.Add("AUUUGHH");
testNumbers.Add(42);
foreach (var i in testNumbers)
Console.WriteLine(Microsoft.VisualBasic.Information.IsNumeric(i));
Make sure you add a reference to the Microsoft.VisualBasic assembly in order to use IsNumeric().
1. Don't rely on try/catch for normal code flow. try/catch is for catching exceptional situations.
2. Why do you need to build an isNumber method? double.TryParse or Convert.ToDouble() will do similar things (google to find the difference).
3. No idea what myAreNumbers is supposed to do, but you are basically adding a bool and a list to a new list on every iteration.
static ArrayList areNumbers(ArrayList maybeNumbers)
{
var theResults = new ArrayList(0);
foreach(var possibleNumber in maybeNumbers)
{
double myDouble;
if (double.TryParse(Convert.ToString(possibleNumber), out myDouble))
theResults.Add(possibleNumber);// OR theResults.Add(myDouble); //depending on what you want
}
return theResults;
}
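On point 2, the practical difference is that double.TryParse reports failure through its return value, while Convert.ToDouble throws a FormatException for text that is not a number. Note that both work on text, so the string "42" counts as numeric here, unlike with the type check in the question. A small sketch, assuming the invariant culture is acceptable (the ParseDemo class and its method names are made up for illustration):

```csharp
using System;
using System.Globalization;

// Hypothetical demo class contrasting the two parsing styles.
public static class ParseDemo
{
    // Returns true when the object's text form parses as a double.
    public static bool LooksNumeric(object value)
    {
        double ignored;
        string text = Convert.ToString(value, CultureInfo.InvariantCulture);
        return double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out ignored);
    }

    // Convert.ToDouble signals bad input with an exception instead of a return value.
    public static bool ConvertThrows(string text)
    {
        try
        {
            Convert.ToDouble(text, CultureInfo.InvariantCulture);
            return false;
        }
        catch (FormatException)
        {
            return true;
        }
    }
}
```

TryParse is usually the better fit when invalid input is expected, since it avoids using exceptions for normal control flow.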
Get rid of the endless loop.
Iterate through your maybe-numbers.
Surround only the isNumber call with try ... catch.
If an exception is thrown, do not increase the number of numbers.
And do not return a collection for this, because what you really need to return is just an integer.
You need something like (pseudo code):
numberOfNumbers = 0;
while ( there is something to handle )
{
take element to handle
try
{
check it
numberOfNumbers++;
}
catch ( )
{
// not a number
}
go to the next element
}
return numberOfNumbers
Assuming of course that isNumber throws some exception when your maybe-number is not a number.
Try this:
static void Main(string[] args)
{
var maybe = new ArrayList(3);
maybe.Add(100f);
maybe.Add("not a number");
maybe.Add(1000);
foreach (var item in maybe)
{
Console.WriteLine(item);
}
ArrayList res = new ArrayList(maybe.ToArray().Where((o) => o.IsNumber()).ToArray());
foreach (var item in res)
{
Console.WriteLine(item);
}
}
public static bool IsNumber(this object item)
{
Type t = item.GetType();
if (!t.IsPrimitive)
return false;
// TypeCode is not a [Flags] enum, so its values cannot be combined with |; use a switch instead
switch (System.Type.GetTypeCode(t))
{
case TypeCode.Double:
case TypeCode.Single:
case TypeCode.Int16:
case TypeCode.Int32:
case TypeCode.Int64:
case TypeCode.UInt16:
case TypeCode.UInt32:
case TypeCode.UInt64:
return true;
}
return false;
}

Compare adjacent list items

I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:
public void GetCRCs(List<DupInfo> dupInfos)
{
var crc = new Crc32();
for (int i = 0; i < dupInfos.Count(); i++)
{
if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
{
dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
}
}
}
My question is:
How can I compare each entry to its neighbors without the out of bounds error?
Should I be using a loop for this, or is there a better LINQ or other function?
Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.
Compute the Crcs first:
// It is assumed that DupInfo.CheckSum is nullable
public void GetCRCs(List<DupInfo> dupInfos)
{
if (dupInfos.Count == 0) return; // nothing to do for an empty list
var crc = new Crc32(); // same CRC helper as in the question
dupInfos[0].CheckSum = null ;
for (int i = 1; i < dupInfos.Count; i++)
{
dupInfos[i].CheckSum = null ;
if (dupInfos[i].Size == dupInfos[i - 1].Size)
{
// compute the previous file's checksum only if it was not computed already
if (dupInfos[i-1].CheckSum==null) dupInfos[i-1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i-1].FullName));
dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
}
}
}
After having sorted your files by size and crc, identify duplicates:
public void GetDuplicates(List<DupInfo> dupInfos)
{
for (int i = dupInfos.Count - 1; i > 0; i--)
{ // loop is inverted to allow list items deletion
if (dupInfos[i].Size == dupInfos[i - 1].Size &&
dupInfos[i].CheckSum != null &&
dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
{ // i is duplicated with i-1
... // your code here
... // eventually, dupInfos.RemoveAt(i) ;
}
}
}
I have sorted my list of files by size, and am looping through to
compare each element to the ones above and below it.
The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.
I suggest taking this approach
Use LINQ's .GroupBy to create a collection of file sizes. Then .Where to only keep the groups with more than one file.
Within those groups, calculate the CRC32 checksum and add it to a collection of known checksums. Compare with previously calculated checksums. If you need to know which files specifically are duplicates you could use a dictionary keyed by this checksum (you can achieve this with another GroupBy). Otherwise a simple list will suffice to detect any duplicates.
The code might look something like this:
var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
.Where(group => group.Count() > 1);
foreach (var grp in filesSetsWithPossibleDupes)
{
var checksums = new List<CRC32CheckSum>(); //or whatever type
foreach (var file in grp)
{
var currentCheckSum = crc.ComputeChecksum(file);
if (checksums.Contains(currentCheckSum))
{
//Found a duplicate
}
else
{
checksums.Add(currentCheckSum);
}
}
}
Or if you need the specific objects that could be duplicates, the whole thing might look like
var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
.Where(grp => grp.Count() > 1);
var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
//A dictionary keyed by the basic duplicate stats
//, and whose value is a collection of the possible duplicates
foreach (var grp in filesSetsWithPossibleDupes)
{
var likelyDuplicates = grp.GroupBy(dup => dup.Checksum)
.Where(g => g.Count() > 1);
//Same GroupBy logic, but applied to the checksum (instead of file size)
foreach(var dupGrp in likelyDuplicates)
{
//Create the key for the dictionary (your code is likely different)
var sample = dupGrp.First();
var key = new DupStats() {FileSize = sample.FileSize, Checksum = sample.Checksum};
masterDuplicateDict.Add(key, dupGrp);
}
}
A demo of this idea.
I think the for loop should be : for (int i = 1; i < dupInfos.Count()-1; i++)
var grps= dupInfos.GroupBy(d=>d.Size);
grps.Where(g=>g.Count()>1).ToList().ForEach(g=>
{
...
});
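Note that running the index loop from 1 to Count - 1 skips the first and last elements entirely, even though the first file can still match the second. A sketch that keeps the full range and guards each neighbor access instead is shown below. It works on a sorted list of sizes for brevity; the NeighborCheck class name is hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical helper; the real code would work on DupInfo objects.
public static class NeighborCheck
{
    // sizes must be sorted; returns the indexes whose size equals a neighbor's size.
    public static List<int> IndexesWithEqualNeighbor(IList<long> sizes)
    {
        var result = new List<int>();
        for (int i = 0; i < sizes.Count; i++)
        {
            // The bounds checks short-circuit, so the indexer is never
            // called with -1 or Count.
            bool sameAsPrevious = i > 0 && sizes[i] == sizes[i - 1];
            bool sameAsNext = i < sizes.Count - 1 && sizes[i] == sizes[i + 1];
            if (sameAsPrevious || sameAsNext)
                result.Add(i);
        }
        return result;
    }
}
```

Only the returned indexes would need a checksum, which matches the goal of avoiding the expensive CRC for files with a unique size.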
Can you do an intersection between your two lists? If you have two lists of filenames and intersect them (a union would return everything from both lists), it should result in only a list of the overlapping files. I can write out an example if you want but this link should give you the general idea.
https://stackoverflow.com/a/13505715/1856992
Edit: Sorry for some reason I thought you were comparing file name not size.
So here is an actual answer for you.
using System;
using System.Collections.Generic;
using System.Linq;
public class ObjectWithSize
{
public int Size {get; set;}
public ObjectWithSize(int size)
{
Size = size;
}
}
public class Program
{
public static void Main()
{
Console.WriteLine("start");
var list = new List<ObjectWithSize>();
list.Add(new ObjectWithSize(12));
list.Add(new ObjectWithSize(13));
list.Add(new ObjectWithSize(14));
list.Add(new ObjectWithSize(14));
list.Add(new ObjectWithSize(18));
list.Add(new ObjectWithSize(15));
list.Add(new ObjectWithSize(15));
var duplicates = list.GroupBy(x=>x.Size)
.Where(g=>g.Count()>1);
foreach (var dup in duplicates)
foreach (var objWithSize in dup)
Console.WriteLine(objWithSize.Size);
}
}
This will print out
14
14
15
15
Here is a netFiddle for that.
https://dotnetfiddle.net/0ub6Bs
Final note. I actually think your answer looks better and will run faster. This was just an implementation in Linq.
