Compare adjacent list items - C#

I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    for (int i = 0; i < dupInfos.Count(); i++)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
        {
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}
My question is:
How can I compare each entry to its neighbors without the out of bounds error?
Should I be using a loop for this, or is there a better LINQ or other function?
Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.
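One direct way to avoid the out-of-bounds error is to short-circuit the index checks so a neighbor is only read when it exists; a minimal sketch against the loop above:
for (int i = 0; i < dupInfos.Count; i++)
{
    bool matchesPrev = i > 0 && dupInfos[i].Size == dupInfos[i - 1].Size;
    bool matchesNext = i < dupInfos.Count - 1 && dupInfos[i].Size == dupInfos[i + 1].Size;
    if (matchesPrev || matchesNext)
    {
        dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
    }
}
Because && short-circuits, the out-of-range index is never evaluated, so no special-casing of the first and last elements is needed.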

Compute the CRCs first:
// It is assumed that DupInfo.CheckSum is nullable
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    dupInfos[0].CheckSum = null;
    for (int i = 1; i < dupInfos.Count(); i++)
    {
        dupInfos[i].CheckSum = null;
        if (dupInfos[i].Size == dupInfos[i - 1].Size)
        {
            if (dupInfos[i - 1].CheckSum == null)
                dupInfos[i - 1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i - 1].FullName));
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}
After having sorted your files by size and crc, identify duplicates:
public void GetDuplicates(List<DupInfo> dupInfos)
{
    // the loop runs backwards to allow list item deletion
    for (int i = dupInfos.Count() - 1; i > 0; i--)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size &&
            dupInfos[i].CheckSum != null &&
            dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
        {
            // i is a duplicate of i - 1
            ... // your code here
            ... // eventually, dupInfos.RemoveAt(i);
        }
    }
}

I have sorted my list of files by size, and am looping through to
compare each element to the ones above and below it.
The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.
I suggest taking this approach:
Use LINQ's .GroupBy to create groups of files by size, then .Where to keep only the groups with more than one file.
Within those groups, calculate the CRC32 checksum and add it to a collection of known checksums, comparing against the previously calculated ones. If you need to know which files specifically are duplicates, you can use a dictionary keyed by this checksum (you can build it with another GroupBy); otherwise a simple list will suffice to detect any duplicates.
The code might look something like this:
var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
                                      .Where(group => group.Count() > 1);

foreach (var grp in filesSetsWithPossibleDupes)
{
    var checksums = new List<CRC32CheckSum>(); //or whatever type
    foreach (var file in grp)
    {
        var currentCheckSum = crc.ComputeChecksum(file);
        if (checksums.Contains(currentCheckSum))
        {
            //Found a duplicate
        }
        else
        {
            checksums.Add(currentCheckSum);
        }
    }
}
Or if you need the specific objects that could be duplicates, the inner foreach loop might look like
var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
                                      .Where(grp => grp.Count() > 1);

var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
//A dictionary keyed by the basic duplicate stats,
//whose value is a collection of the possible duplicates

foreach (var grp in filesSetsWithPossibleDupes)
{
    var likelyDuplicates = grp.GroupBy(dup => dup.Checksum)
                              .Where(g => g.Count() > 1);
    //Same GroupBy logic, but applied to the checksum (instead of file size)
    foreach (var dupGrp in likelyDuplicates)
    {
        //Create the key for the dictionary (your code is likely different)
        var sample = dupGrp.First();
        var key = new DupStats() { FileSize = sample.FileSize, Checksum = sample.Checksum };
        masterDuplicateDict.Add(key, dupGrp);
    }
}
A demo of this idea.
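Note that Crc32 in these snippets is not a built-in BCL type; the answers assume a third-party implementation. On modern .NET, the System.IO.Hashing NuGet package provides one, and a checksum helper might look like this (a sketch, assuming that package is referenced):
using System;
using System.IO;
using System.IO.Hashing; // NuGet: System.IO.Hashing

static class ChecksumHelper
{
    // Returns the CRC32 of a file's contents as a uint.
    public static uint ComputeCrc32(string path)
    {
        byte[] hash = Crc32.Hash(File.ReadAllBytes(path)); // 4-byte CRC, little-endian
        return BitConverter.ToUInt32(hash, 0);
    }
}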

I think the for loop should be: for (int i = 1; i < dupInfos.Count() - 1; i++)
var grps = dupInfos.GroupBy(d => d.Size);
grps.Where(g => g.Count() > 1).ToList().ForEach(g =>
{
    ...
});
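One possible body for that ForEach, assuming the same Crc32 helper and DupInfo members as in the question:
var crc = new Crc32();
dupInfos.GroupBy(d => d.Size)
        .Where(g => g.Count() > 1)
        .ToList()
        .ForEach(g =>
{
    foreach (var dup in g)
        dup.CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dup.FullName));
});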

Can you do a union between your two lists? If you have a list of filenames and do a union it should result in only a list of the overlapping files. I can write out an example if you want but this link should give you the general idea.
https://stackoverflow.com/a/13505715/1856992
Edit: Sorry for some reason I thought you were comparing file name not size.
So here is an actual answer for you.
using System;
using System.Collections.Generic;
using System.Linq;

public class ObjectWithSize
{
    public int Size { get; set; }

    public ObjectWithSize(int size)
    {
        Size = size;
    }
}

public class Program
{
    public static void Main()
    {
        Console.WriteLine("start");
        var list = new List<ObjectWithSize>();
        list.Add(new ObjectWithSize(12));
        list.Add(new ObjectWithSize(13));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(18));
        list.Add(new ObjectWithSize(15));
        list.Add(new ObjectWithSize(15));

        var duplicates = list.GroupBy(x => x.Size)
                             .Where(g => g.Count() > 1);
        foreach (var dup in duplicates)
            foreach (var objWithSize in dup)
                Console.WriteLine(objWithSize.Size);
    }
}
This will print out
14
14
15
15
Here is a .NET Fiddle for that.
https://dotnetfiddle.net/0ub6Bs
Final note: I actually think your answer looks better and will run faster. This was just an implementation in LINQ.

Related

Merge data from two arrays or something else

How can I combine the Id values from the list I read from the /test.json file with the id values from the ourOrders[i].id list?
Or is there another way to do this?
private RegionModel FilterByOurOrders(RegionModel region, List<OurOrderModel> ourOrders, MarketSettings market, bool byOurOrders)
{
    var result = new RegionModel
    {
        updatedTs = region.updatedTs,
        orders = new List<OrderModel>(region.orders.Count)
    };

    var json = File.ReadAllText("/test.json");
    var otherBotOrders = JsonSerializer.Deserialize<OrdersTimesModel>(json);

    OtherBotOrders = new Dictionary<string, OrderTimesInfoModel>();
    foreach (var otherBotOrder in otherBotOrders.OrdersTimesInfo)
    {
        //OtherBotOrders.Add(otherBotOrder.Id, otherBotOrder);
        BotController.WriteLine($"{otherBotOrder.Id}"); //Outputting the order IDs to the console works
    }

    foreach (var order in region.orders)
    {
        if (ConvertToDecimal(order.price) < 1 || !byOurOrders)
        {
            int i = 0;
            var isOurOrder = false;
            while (i < ourOrders.Count && !isOurOrder)
            {
                if (ourOrders[i].id.Equals(order.id, StringComparison.InvariantCultureIgnoreCase))
                {
                    isOurOrder = true;
                }
                ++i;
            }
            if (!isOurOrder)
            {
                result.orders.Add(order);
            }
        }
    }
    return result;
}
OrdersTimesModel looks like this:
public class OrdersTimesModel
{
    public List<OrderTimesInfoModel> OrdersTimesInfo { get; set; }
}
test.json:
{"OrdersTimesInfo":[{"Id":"1"},{"Id":"2"}]}
Added:
I'll try to clarify the question. There are three lists of IDs:
First (all orders): region.orders, as order.id
Second (our orders): ourOrders, as ourOrders[i].id in the while loop
Third (our orders 2): from the /test.json file, as an array {"Orders":[{"Id":"12345..."...},{"Id":"12345..." ...}...]}
There is a foreach containing a while loop, in which the First (all orders) list is compared against the Second (our orders) list. If the ids match, the order is ours: isOurOrder = true;
Accordingly, the orders with isOurOrder = false are added to the result: result.orders.Add(order)
I need the check if (ourOrders[i].id.Equals(order.id, StringComparison.InvariantCultureIgnoreCase)) to also match the Ids from the Third (our orders 2) list.
Or is there another way to do it?
You should be able to completely avoid writing loops if you use LINQ (there will be loops running in the background, but it's way easier to read)
You can access some documentation here: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/introduction-to-linq-queries
and you have some pretty cool extension methods for arrays: https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable?view=net-6.0 (these are great for keeping your code easy to read)
Solution
using System.Linq;
private RegionModel FilterByOurOrders(RegionModel region, List<OurOrderModel> ourOrders, MarketSettings market, bool byOurOrders)
{
    var result = new RegionModel
    {
        updatedTs = region.updatedTs,
        orders = new List<OrderModel>(region.orders.Count)
    };

    var json = File.ReadAllText("/test.json");
    var otherBotOrders = JsonSerializer.Deserialize<OrdersTimesModel>(json);

    // This should get you a sequence containing
    // JUST the Ids in the JSON file
    var idsFromJsonFile = otherBotOrders.OrdersTimesInfo.Select(x => x.Id);

    // Here you'll get a sequence with the ids of your orders
    var idsFromOurOrders = ourOrders.Select(x => x.id);

    // Union will only take unique values,
    // so you avoid repetition.
    var mergedArrays = idsFromJsonFile.Union(idsFromOurOrders);

    // Now we just need to query the region orders:
    // keep every element whose id is NOT contained in the merged id sequence
    var filteredRegionOrders = region.orders.Where(x => !mergedArrays.Contains(x.id));

    result.orders.AddRange(filteredRegionOrders);
    return result;
}
You can add conditions to any of those actions (like checking for order price or the boolean flag you get as a parameter), and of course you can do it without assigning so many variables, I did it that way just to make it easier to explain.
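For instance, the price/flag condition and the case-insensitive id comparison from the original code could be folded in like this (a sketch, assuming the same ConvertToDecimal helper):
var mergedIds = idsFromJsonFile.Union(idsFromOurOrders, StringComparer.InvariantCultureIgnoreCase);
var filteredRegionOrders = region.orders
    .Where(x => ConvertToDecimal(x.price) < 1 || !byOurOrders)
    .Where(x => !mergedIds.Contains(x.id, StringComparer.InvariantCultureIgnoreCase));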

Loop through list of strings to find different values

I have a list that is populated with different values, e.g.:
{GBP, GBP, GBP, USD}
so far I have this:
List<string> currencyTypes = new List<string>();
for (int i = 0; i < currencyTypes.Count; i++)
{
    if currencyTypes[i] != [i]
        console.writeline("currencies are different");
}
So if the list has all the same entries, the if statement shouldn't fire,
e.g. {GBP, GBP, GBP, GBP}.
However, if any of the values are different from the rest, then the if statement should notice the difference and fire.
This doesn't work, however.
Any ideas?
You could use LINQ to test whether all entries are the same
if (currencyTypes.Distinct().Count() > 1) {
    Console.WriteLine("currencies are different");
}
Slightly more efficient for long lists:
if (currencyTypes.Count > 1 && currencyTypes.Distinct().Skip(1).Any()) {
    Console.WriteLine("currencies are different");
}
This is more efficient because Any inspects at most one element, unlike Count, which iterates the whole list.
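With the sample data from the question, either version fires as expected:
var currencyTypes = new List<string> { "GBP", "GBP", "GBP", "USD" };
if (currencyTypes.Count > 1 && currencyTypes.Distinct().Skip(1).Any())
    Console.WriteLine("currencies are different"); // fires: USD breaks the run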
First of all, your list is empty. Maybe it's for the sake of the example; if not, initialize it with data. Then modify lines 3 and 5 as follows to fix the problem:
for (int i = 1; i < currencyTypes.Count; i++)
{
    if (currencyTypes[i] != currencyTypes[i-1])
        ....
}
You should first group your data and then derive your result from the groups, e.g.:
List<string> currencyTypes = new List<string>() { "USD", "GBP", "GBP", "GBP" };

// group list items
var typeGroup = currencyTypes.GroupBy(t => t);
if (typeGroup.Count() > 1)
    Console.WriteLine("currencies are different");
// .
// .
// you can also check which item is unique
foreach (var t in typeGroup.Where(g => g.Count() == 1))
{
    Console.WriteLine($"{t.Single()} is different");
}
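With the sample list above, this prints "currencies are different" followed by "USD is different".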

Fastest equivalent of comparing all elements of array

What is the fastest equivalent in C#/LINQ to compare all combinations of elements of an array, as below, and add them to a bucket if they are not already in one? In other words, how could I optimize this piece of code in C#?
// pseudocode
List<T> elements = { .... }
HashSet<T> bucket = {}

foreach (T element in elements)
{
    foreach (var myElement in elements.Where(e => e.id != element.id))
    {
        if (!element.notInTheList)
        {
            _elementIsTheSame = element.Equals(myElement);
            if (_elementIsTheSame)
            {
                // append element to the bucket
                if (!elementIsInTheBucket(bucket, element))
                {
                    element.notInTheList = true;
                    addToBucket(bucket, element);
                }
            }
        }
    }
}
// takes about 150ms on a fast workstation with only 300 elements in the LIST!
The final order of the elements in the bucket is important
elements.GroupBy(x=>x).SelectMany(x=>x);
https://dotnetfiddle.net/yZ9JDp
This works because GroupBy preserves order.
Note that this puts the first element of each equivalence class first. Your code puts the first element last, and skips classes with just a single element.
Skipping the classes with just a single element can be done with a where before the SelectMany.
elements.GroupBy(x=>x).Where(x=>x.Skip(1).Any()).SelectMany(x=>x);
Getting the first element last is a bit more tricky, but I suspect it's a bug in your code so I will not try to write it out.
Depending on how you use the result you might want to throw a ToList() at the end.
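A quick illustration of that order-preserving behavior:
var elements = new[] { "b", "a", "b", "c", "a" };
var grouped = elements.GroupBy(x => x).SelectMany(x => x);
Console.WriteLine(string.Join(" ", grouped)); // b b a a c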
It sounds like you're effectively after DistinctBy, which can be simulated with something like:
var list = new List<MainType>();
var known = new HashSet<PropertyType>();
foreach (var item in source)
{
    if (known.Add(item.TheProperty))
        list.Add(item);
}
You now have a list of the items taking the first only when there are duplicates via the selected property, preserving order.
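Since .NET 6 this is built in as Enumerable.DistinctBy, which likewise keeps the first item per key and preserves encounter order:
var list = source.DistinctBy(item => item.TheProperty).ToList();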
If the intent is to find the fastest solution, as the Stack Overflow title suggests, then I would consider using C#'s Parallel.ForEach to perform a map and reduce.
For example:
var resultsCache = new IRecord[_allRecords.Length];
var resultsCount = 0;
var parallelOptions = new ParallelOptions
{
    MaxDegreeOfParallelism = 1 // use an appropriate value
};
Parallel.ForEach(
    _allRecords,
    parallelOptions,
    // Part 1: initialize thread-local storage
    () => { return new FilterMapReduceState(); },
    // Part 2: define the task to perform
    (record, parallelLoopState, index, results) =>
    {
        if (_abortFilterOperation)
        {
            parallelLoopState.Break();
        }
        if (strategy.CanKeep(record))
        {
            resultsCache[index] = record;
            results.Count++;
        }
        else
        {
            resultsCache[index] = null;
        }
        return results;
    },
    // Part 3: merge the results
    (results) =>
    {
        Interlocked.Add(ref resultsCount, results.Count);
    }
);
where
class FilterMapReduceState
{
    public FilterMapReduceState()
    {
        this.Count = 0;
    }

    /// <summary>
    /// Represents the number of records that meet the search criteria.
    /// </summary>
    internal int Count { get; set; }
}
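After the loop completes, resultsCache holds each kept record at its original index (and null where a record was filtered out), so the original ordering can be recovered by walking the array and skipping nulls; resultsCount tells you how many records survived.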
As I understand it, these are the pieces you need:
var multipleIds = elements.GroupBy(x => x.id)
                          .Where(g => g.Count() > 1)
                          .Select(y => y.Key);
var distinctIds = elements.Select(x => x.id).Distinct();
var distinctElements = elements.Where(x => distinctIds.Contains(x.id));

Appropriate datastructure for key.contains(x) Map/Dictionary

I am somewhat struggling with the terminology and complexity of my explanations here, feel free to edit it.
I have 1,000 - 20,000 objects. Each one can contain several name words (first, second, middle, last, title...), normalized numbers (home, business...), email addresses, or even physical addresses and spouse names.
I want to implement a search that enables users to freely combine word parts and number parts. When I search for "LL 676" I want to find all objects that contain any string with "LL" AND "676".
Currently I am iterating over every object and every object's property, splitting the search string on " ", and doing a stringInstance.Contains(searchWord).
This is too slow, so I am looking for a better solution.
What is the appropriate language agnostic data structure for this?
In my case I need it for C#.
Is the following data structure a good solution?
It's based on a HashMap/Dictionary.
At first I create a string that contains all the name parts and phone numbers I want to look through; one example would be: "William Bill Henry Gates III 3. +436760000 billgatesstreet 12".
Then I split on " ", and for every word x I create all possible substrings y that fulfill x.Contains(y). I put every one of those substrings into the hashmap/dictionary.
On lookup/search I just need to run the search for every search word and then join the results. Naturally, the lookup speed is blazingly fast (native hashmap/dictionary speed).
EDIT: Inserts are very fast as well (insignificant time) now that I use a smarter algorithm to get the substrings.
It's possible I've misunderstood your algorithm or requirement, but this seems like it could be a potential performance improvement:
foreach (string arg in searchWords)
{
    if (String.IsNullOrEmpty(arg))
        continue;

    tempList = new List<T>();
    if (dictionary.ContainsKey(arg))
        foreach (T obj in dictionary[arg])
            if (list.Contains(obj))
                tempList.Add(obj);

    list = new List<T>(tempList);
}
The idea is that you do the first search word separately before this, and only put all the subsequent words into the searchWords list.
That should allow you to remove your final foreach loop entirely. Results only stay in your list as long as they keep matching every searchWord, rather than initially piling in everything that matches a single word and then filtering it back out at the end.
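The same incremental narrowing can also be phrased as a set intersection; a rough sketch assuming the Dictionary<string, List<T>> from the question (note that a HashSet does not preserve the original insertion order):
HashSet<T> results = null;
foreach (string word in searchWords)
{
    if (string.IsNullOrEmpty(word))
        continue;
    if (!dictionary.TryGetValue(word, out List<T> matches))
        return new List<T>(); // a word with no hits empties the result
    if (results == null)
        results = new HashSet<T>(matches);   // seed with the first word's hits
    else
        results.IntersectWith(matches);      // keep only objects matching every word
}
return results == null ? new List<T>() : results.ToList();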
In case anyone cares for my solution:
Disclaimer:
This is only a rough draft. I have only done some synthetic testing, and I have written a lot of it without testing it again. I have revised my code: inserts are now ((n^2)/2)+(n/2) instead of 2^n-1, which is vastly faster, and word length is now irrelevant (for example, a 10-character keyword now produces 55 entries rather than 1,023).
namespace MegaHash
{
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    public class GenericConcurrentMegaHash<T>
    {
        // After doing a bulk add, call AwaitAll() to ensure all data was added!
        private ConcurrentBag<Task> bag = new ConcurrentBag<Task>();

        private ConcurrentDictionary<string, List<T>> dictionary = new ConcurrentDictionary<string, List<T>>();

        // consider changing this to include for example '-'
        public char[] splitChars;

        public GenericConcurrentMegaHash()
            : this(new char[] { ' ' })
        {
        }

        public GenericConcurrentMegaHash(char[] splitChars)
        {
            this.splitChars = splitChars;
        }

        public void Add(string keyWords, T o)
        {
            keyWords = keyWords.ToUpper();
            foreach (string keyWord in keyWords.Split(splitChars))
            {
                if (keyWord == null || keyWord.Length < 1)
                    continue; // skip empty tokens rather than aborting the whole add

                this.bag.Add(Task.Factory.StartNew(() => { AddInternal(keyWord, o); }));
            }
        }

        public void AwaitAll()
        {
            lock (this.bag)
            {
                foreach (Task t in bag)
                    t.Wait();
                this.bag = new ConcurrentBag<Task>();
            }
        }

        private void AddInternal(string key, T y)
        {
            for (int i = 0; i < key.Length; i++)
            {
                for (int i2 = 0; i2 < i + 1; i2++)
                {
                    string desire = key.Substring(i2, key.Length - i);
                    if (dictionary.ContainsKey(desire))
                    {
                        List<T> l = dictionary[desire];
                        lock (l)
                        {
                            try
                            {
                                if (!l.Contains(y))
                                    l.Add(y);
                            }
                            catch (Exception ex)
                            {
                                ex.ToString();
                            }
                        }
                    }
                    else
                    {
                        List<T> l = new List<T>();
                        l.Add(y);
                        dictionary[desire] = l;
                    }
                }
            }
        }

        public IList<T> FulltextSearch(string searchString)
        {
            searchString = searchString.ToUpper();
            List<T> list = new List<T>();
            string[] searchWords = searchString.Split(splitChars);
            foreach (string arg in searchWords)
            {
                if (arg == null || arg.Length < 1)
                    continue;
                if (dictionary.ContainsKey(arg))
                    foreach (T obj in dictionary[arg])
                        if (!list.Contains(obj))
                            list.Add(obj);
            }

            List<T> returnList = new List<T>();
            foreach (T o in list)
            {
                foreach (string arg in searchWords)
                {
                    if (arg == null || arg.Length < 1)
                        continue; // skip empty tokens here as well
                    if (!dictionary.ContainsKey(arg) || !dictionary[arg].Contains(o))
                        goto BREAK;
                }
                returnList.Add(o);
            BREAK:
                continue;
            }
            return returnList;
        }
    }
}
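Usage might look something like this; Person and billGates are hypothetical stand-ins just to show the call pattern:
var index = new GenericConcurrentMegaHash<Person>();
index.Add("William Bill Henry Gates III 3. +436760000 billgatesstreet 12", billGates);
index.AwaitAll(); // make sure the background insert tasks have finished

IList<Person> hits = index.FulltextSearch("LL 676");
// matches: "BILL"/"WILLIAM" contain "LL", and "+436760000" contains "676"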

Concurrent foreach iteration of two lists of strings

Let's say I have two List<string>. These are populated from the results of reading a text file.
List owner contains:
cross
jhill
bbroms
List assignee contains:
Chris Cross
Jack Hill
Bryan Broms
During the read from a SQL source (the SQL statement contains a join)... I would perform
if (sqlReader["projects.owner"] == "something in owner list" || sqlReader["assign.assignee"] == "something in assignee list")
{
    // add this project's information to the primary results LIST
    list_by_owner.Add(sqlReader["projects.owner"], sqlReader["projects.project_date_created"], sqlReader["projects.project_name"], sqlReader["projects.project_status"]);
    // if the assignee is not null, also add to the secondary results LIST
    // logic to determine if assign.assignee is null goes here
    list_by_assignee.Add(sqlReader["assign.assignee"], sqlReader["projects.owner"], sqlReader["projects.project_date_created"], sqlReader["projects.project_name"], sqlReader["projects.project_status"]);
}
I do not want to end up using nested foreach loops.
A plain for loop would probably suffice. Someone had mentioned Zip to me, but I wasn't sure if that would be a preferable route in my situation.
One loop to iterate through both lists (assuming both have the same count):
for (int i = 0; i < alpha.Count; i++)
{
    var itemAlpha = alpha[i]; // <= your object from list alpha
    var itemBeta = beta[i];   // <= your object from list beta
    //write your code here
}
From what you describe, you don't need to iterate at all.
This is what you need:
http://msdn.microsoft.com/en-us/library/bhkz42b3.aspx
Usage:
if (listAlpha.Contains(resultA) || listBeta.Contains(resultA)) {
    // do your operation
}
List iteration will happen implicitly inside the Contains method. And that's 2n comparisons, vs. n*n for nested iteration.
If you do need to iterate, you would be better off with sequential iteration over each list, one after the other, rather than nesting.
The data might be better represented as a List<KeyValuePair<string, string>>, which would pair the two lists' values together in a single list.
There are several options for this. The least "painful" would be a plain old for loop:
for (var index = 0; index < alpha.Count; index++)
{
    var alphaItem = alpha[index];
    var betaItem = beta[index];
    // Do something.
}
Another interesting approach is using the indexed LINQ methods (but remember that they are evaluated lazily; you have to consume the resulting enumerable), for example:
alpha.Select((alphaItem, index) =>
{
    var betaItem = beta[index];
    // Do something, then return a value (a lambda passed to Select must project something)
    return alphaItem;
})
Or you can enumerate both collections if you use the enumerators directly:
using (var alphaEnumerator = alpha.GetEnumerator())
using (var betaEnumerator = beta.GetEnumerator())
{
    while (alphaEnumerator.MoveNext() && betaEnumerator.MoveNext())
    {
        var alphaItem = alphaEnumerator.Current;
        var betaItem = betaEnumerator.Current;
        // Do something
    }
}
Zip (if you need pairs) or Concat (if you need a combined list) are possible options to iterate two lists at the same time.
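For example, Zip pairs the owner and assignee lists from the question positionally (the result-selector overload has been available since .NET 4):
foreach (var pair in owner.Zip(assignee, (o, a) => new { Owner = o, Assignee = a }))
{
    Console.WriteLine($"{pair.Owner} -> {pair.Assignee}"); // e.g. "cross -> Chris Cross"
}
Zip stops at the end of the shorter list, so unmatched tail items are silently dropped.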
I like doing something like this to enumerate over parallel lists:
int alphaCount = alpha.Count;
int betaCount = beta.Count;
int i = 0;
while (i < alphaCount && i < betaCount)
{
    var a = alpha[i];
    var b = beta[i];
    // handle matched alpha/beta pairs
    ++i;
}
while (i < alphaCount)
{
    var a = alpha[i];
    // handle unmatched alphas
    ++i;
}
while (i < betaCount)
{
    var b = beta[i];
    // handle unmatched betas
    ++i;
}
