Linq Except ignoring Custom comparer? - c#

In this toy code:
void Main()
{
var x = new string[] {"abc", "DEF"};
var y = new string[] {"ABC", "def"};
var c = new CompareCI();
var z = x.Except(y, c);
foreach (var s in z) Console.WriteLine(s);
}
private class CompareCI : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return string.Equals(x, y, StringComparison.OrdinalIgnoreCase);
}
public int GetHashCode(string obj)
{
return obj.GetHashCode();
}
}
It seems like the Except method is ignoring my custom comparer. I get these results:
abc
DEF
This looks like the case is not being ignored. Also, when I ran it under the debugger with a breakpoint at the call to string.Equals in the custom comparer, the breakpoint was never hit, although the code ran and produced the result I posted. I expected no results, since the sequences are equal if case is ignored.
I guess I'm doing something wrong, but I need a second pair of eyes to spot it.

Debugging your code shows that GetHashCode() is called but Equals() is not.
I think this is because two equal objects must have equal hash codes AND return true from Equals(). If the hash codes differ, the objects cannot be equal, so there is no need to run Equals().
Your code would work if the hashing function were case-insensitive, for example obj.ToUpper().GetHashCode().
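A minimal sketch of that fix, assuming the comparer from the question. The names CaseInsensitiveComparer and ExceptDemo are made up for illustration, and the hash is taken from StringComparer.OrdinalIgnoreCase rather than ToUpper() so that it follows the same ordinal rules as the Equals call:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical name; a variant of the question's CompareCI.
public sealed class CaseInsensitiveComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.OrdinalIgnoreCase);
    }

    // Must agree with Equals: strings that differ only by case
    // have to produce the same hash code, otherwise Equals is never consulted.
    public int GetHashCode(string obj)
    {
        return obj == null ? 0 : StringComparer.OrdinalIgnoreCase.GetHashCode(obj);
    }
}

public static class ExceptDemo
{
    public static List<string> Run()
    {
        var x = new[] { "abc", "DEF" };
        var y = new[] { "ABC", "def" };
        return x.Except(y, new CaseInsensitiveComparer()).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(Run().Count); // prints 0
    }
}
```

With matching hash codes, Except looks in the right bucket, calls Equals, and removes both strings.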

Rui Jarimba's suggestion to use StringComparer.OrdinalIgnoreCase works.

Modify your comparer:
public int GetHashCode(string obj)
{
return 0;
}
Now all items will have the same hash, 0, which means items x and y may be equal, so calling Equals is required.
However, this is not recommended, because returning a constant from GetHashCode puts every item in the same bucket and causes performance problems.
The best option is to use the built-in StringComparer.OrdinalIgnoreCase equality comparer.

The .NET Framework already provides a StringComparer class that uses specific case and culture-based or ordinal comparison rules, so in this case there is no need to create a custom comparer.
This will work:
var x = new string[] { "abc", "DEF" };
var y = new string[] { "ABC", "def" };
var z = x.Except(y, StringComparer.OrdinalIgnoreCase);

Related

How to check whether two lists items have value equality using EqualityComparer? [duplicate]

Before marking this as duplicate because of its title please consider the following short program:
static void Main()
{
var expected = new List<long[]> { new[] { Convert.ToInt64(1), Convert.ToInt64(999999) } };
var actual = DoSomething();
if (!actual.SequenceEqual(expected)) throw new Exception();
}
static IEnumerable<long[]> DoSomething()
{
yield return new[] { Convert.ToInt64(1), Convert.ToInt64(999999) };
}
I have a method which returns a sequence of arrays of type long. To test it I wrote some test code similar to the code in Main.
However, I get the exception and I don't know why. Shouldn't the expected sequence be comparable to the one actually returned, or did I miss something?
To me it looks as if both the method's result and expected contain exactly one element, an array of type long. Don't they?
EDIT: So how do I avoid the exception, that is, how do I compare the elements within the enumeration so that they are reported as equal?
The actual problem is that you're comparing two long[] instances. Enumerable.SequenceEqual uses an ObjectEqualityComparer<Int64[]> (you can see this by examining EqualityComparer<long[]>.Default, which is what Enumerable.SequenceEqual uses internally). It compares the references of the two arrays rather than the values stored inside them, and the references obviously aren't the same.
To get around this, you could write a custom EqualityComparer<long[]>:
static void Main()
{
var expected = new List<long[]>
{ new[] { Convert.ToInt64(1), Convert.ToInt64(999999) } };
var actual = DoSomething();
if (!actual.SequenceEqual(expected, new LongArrayComparer()))
throw new Exception();
}
public class LongArrayComparer : EqualityComparer<long[]>
{
public override bool Equals(long[] first, long[] second)
{
return first.SequenceEqual(second);
}
// GetHashCode implementation courtesy of @JonSkeet
// from http://stackoverflow.com/questions/7244699/gethashcode-on-byte-array
public override int GetHashCode(long[] arr)
{
unchecked
{
if (arr == null)
{
return 0;
}
int hash = 17;
foreach (long element in arr)
{
hash = hash * 31 + element.GetHashCode();
}
return hash;
}
}
}
No, your sequences are not equal!
Let's remove the sequence bit and just take the first element of each sequence:
var firstExpected = new[] { Convert.ToInt64(1), Convert.ToInt64(999999) };
var firstActual = new[] { Convert.ToInt64(1), Convert.ToInt64(999999) };
Console.WriteLine(firstExpected == firstActual); // writes "false"
The code above compares two separate arrays for equality. Equality does not check the contents of the arrays; it checks whether the references are equal.
Your code using SequenceEqual is, essentially, doing the same thing. It checks the references of each pair of elements in the enumerables.
SequenceEqual tests whether the elements within the sequences are identical. The elements within the enumerations are of type long[], so we actually compare two different arrays (containing the same elements, however) against each other, which is obviously done by comparing their references instead of their actual values.
So what we actually check here is expected[0] == actual[0] instead of expected[0].SequenceEqual(actual[0]).
This obviously returns false, as the two arrays have different references.
If we flatten the hierarchy using SelectMany we get what we want:
if (!actual.SelectMany(x => x).SequenceEqual(expected.SelectMany(x => x))) throw new Exception();
EDIT:
Based on this approach I found another elegant way to check if all the elements from expected are contained in actual also:
if (!expected.All(x => actual.Any(y => y.SequenceEqual(x)))) throw new Exception();
This checks whether, for every sub-list within expected, there is a list within actual that is sequentially identical to it. This seems much smarter to me, as we need neither a custom EqualityComparer nor a weird hash code implementation.
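One caveat with the SelectMany approach is worth a quick sketch: flattening discards the boundaries between the inner arrays, so two differently grouped sequences compare as equal. The helper names FlattenedEqual and NestedEqual below are made up for illustration, and the sketch assumes no inner array is null:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NestedSequenceDemo
{
    // Compares the concatenated contents only; inner array boundaries are lost.
    public static bool FlattenedEqual(List<long[]> a, List<long[]> b)
    {
        return a.SelectMany(x => x).SequenceEqual(b.SelectMany(x => x));
    }

    // Compares pairwise: same number of arrays, and each pair equal by content.
    public static bool NestedEqual(List<long[]> a, List<long[]> b)
    {
        return a.Count == b.Count
            && a.Zip(b, (x, y) => x.SequenceEqual(y)).All(same => same);
    }

    public static void Main()
    {
        var left = new List<long[]> { new long[] { 1, 2 }, new long[] { 3 } };
        var right = new List<long[]> { new long[] { 1 }, new long[] { 2, 3 } };

        Console.WriteLine(FlattenedEqual(left, right)); // prints True
        Console.WriteLine(NestedEqual(left, right));    // prints False
    }
}
```

If the grouping into arrays matters to the test, the pairwise comparison (or a custom comparer) is the safer choice.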

Custom List<string[]> Sort

I have a list of string[].
List<string[]> cardDataBase;
I need to sort that list by each list-item's second string value (item[1]) in custom order.
The custom order is a bit complicated, order by those starting characters:
"MW1"
"FW"
"DN"
"MWSTX1CK"
"MWSTX2FF"
then order by these letters following above starting letters:
"A"
"Q"
"J"
"C"
"E"
"I"
"A"
and then by the numbers following above.
a sample, unordered list left, ordered right:
MW1E10 MW1Q04
MWSTX2FFI06 MW1Q05
FWQ02 MW1E10
MW1Q04 MW1I06
MW1Q05 FWQ02
FWI01 FWI01
MWSTX2FFA01 DNC03
DNC03 MWSTX1CKC02
MWSTX1CKC02 MWSTX2FFI03
MWSTX2FFI03 MWSTX2FFI06
MW1I06 MWSTX2FFA01
I tried Linq but I am not that good in it right now and cannot solve this on my own. Do I need a dictionary, regex or a dictionary with regex in it? What would be the best approach?
I think you're approaching this incorrectly. You're not sorting strings, you're sorting structured objects that are misrepresented as strings (somebody aptly named this antipattern "stringly typed"). Your requirements show that you know this structure, yet it's not represented in the datastructure List<string[]>, and that's making your life hard. You should parse that structure into a real type (struct or class), and then sort that.
enum PrefixCode { MW1, FW, DN, MWSTX1CK, MWSTX2FF, }
enum TheseLetters { Q, J, C, E, I, A, }
struct CardRecord : IComparable<CardRecord> {
public readonly PrefixCode Code;
public readonly TheseLetters Letter;
public readonly uint Number;
public CardRecord(string input) {
Code = ParseEnum<PrefixCode>(ref input);
Letter = ParseEnum<TheseLetters>(ref input);
Number = uint.Parse(input);
}
static T ParseEnum<T>(ref string input) { //assumes non-overlapping prefixes
foreach(T val in Enum.GetValues(typeof(T))) {
if(input.StartsWith(val.ToString())) {
input = input.Substring(val.ToString().Length);
return val;
}
}
throw new InvalidOperationException("Failed to parse: "+input);
}
public int CompareTo(CardRecord other) {
var codeCmp = Code.CompareTo(other.Code);
if (codeCmp!=0) return codeCmp;
var letterCmp = Letter.CompareTo(other.Letter);
if (letterCmp!=0) return letterCmp;
return Number.CompareTo(other.Number);
}
public override string ToString() {
return Code.ToString() + Letter + Number.ToString("00");
}
}
A program using the above to process your example might then be:
static class Program {
static void Main() {
var inputStrings = new []{ "MW1E10", "MWSTX2FFI06", "FWQ02", "MW1Q04", "MW1Q05",
"FWI01", "MWSTX2FFA01", "DNC03", "MWSTX1CKC02", "MWSTX2FFI03", "MW1I06" };
var outputStrings = inputStrings
.Select(s => new CardRecord(s))
.OrderBy(c => c)
.Select(c => c.ToString());
Console.WriteLine(string.Join("\n", outputStrings));
}
}
This generates the same ordering as in your example. In real code, I'd recommend you name the types according to what they represent, and not, for example, TheseLetters.
This solution - with a real parse step - is superior because it's almost certain that you'll want to do more with this data at some point, and this allows you to actually access the components of the data easily. Furthermore, it's comprehensible to a future maintainer since the reason behind the ordering is somewhat clear. By contrast, if you chose to do complex string-based processing it's often very hard to understand what's going on (especially if it's part of a larger program, and not a tiny example as here).
Making new types is cheap. If your method's return value doesn't quite "fit" in an existing type, just make a new one, even if that means 1000's of types.
A bit spoonfeeding, but I found this question pretty interesting and perhaps it will be useful for others, also added some comments to explain:
void Main()
{
var cardDatabase = new List<string>{
"MW1E10",
"MWSTX2FFI06",
"FWQ02",
"MW1Q04",
"MW1Q05",
"FWI01",
"MWSTX2FFA01",
"DNC03",
"MWSTX1CKC02",
"MWSTX2FFI03",
"MW1I06",
};
var orderTable = new List<string>[]{
new List<string>
{
"MW1",
"FW",
"DN",
"MWSTX1CK",
"MWSTX2FF"
},
new List<string>
{
"Q",
"J",
"C",
"E",
"I",
"A"
}
};
var test = cardDatabase.Select(input => {
var r = Regex.Match(input, "^(MW1|FW|DN|MWSTX1CK|MWSTX2FF)(A|Q|J|C|E|I|A)([0-9]+)$");
if(!r.Success) throw new Exception("Invalid data!");
// for each input string,
// we are going to split it into "substrings",
// eg: MWSTX1CKC02 will be
// [MWSTX1CK, C, 02]
// after that, we use IndexOf on each component
// to calculate "real" order,
// note that thirdComponent(aka number component)
// does not need IndexOf because it is already representing the real order,
// we still want to convert string to integer though, because we don't like
// "string ordering" for numbers.
return new
{
input = input,
firstComponent = orderTable[0].IndexOf(r.Groups[1].Value),
secondComponent = orderTable[1].IndexOf(r.Groups[2].Value),
thirdComponent = int.Parse(r.Groups[3].Value)
};
// and after it's done,
// we start using LINQ OrderBy and ThenBy functions
// to have our custom sorting.
})
.OrderBy(calculatedInput => calculatedInput.firstComponent)
.ThenBy(calculatedInput => calculatedInput.secondComponent)
.ThenBy(calculatedInput => calculatedInput.thirdComponent)
.Select(calculatedInput => calculatedInput.input)
.ToList();
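// Console.WriteLine(test) would only print the list's type name,
// so join the items to print the sorted values themselves.
Console.WriteLine(string.Join("\n", test));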
}
You can use the Array.Sort() method, where the first parameter is the string[] you're sorting and the second parameter is a comparer containing the logic that determines the order.
You can use the Enumerable.OrderBy extension method provided by the System.Linq namespace.
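As a rough sketch of the comparison-based approach applied to the List<string[]> from the question, the code below sorts in place by item[1]. The names CardSortDemo, Key and SortByCode are hypothetical, and the letter order Q, J, C, E, I, A is an assumption taken from the sample output, since the question lists "A" twice:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CardSortDemo
{
    // Assumed custom orders, taken from the question's lists and sample output.
    static readonly string[] Prefixes = { "MW1", "FW", "DN", "MWSTX1CK", "MWSTX2FF" };
    static readonly string[] Letters = { "Q", "J", "C", "E", "I", "A" };
    static readonly Regex Pattern =
        new Regex("^(MW1|FW|DN|MWSTX1CK|MWSTX2FF)([QJCEIA])([0-9]+)$");

    // Turns a code such as "MWSTX1CKC02" into a (prefix rank, letter rank, number) key.
    static (int Prefix, int Letter, int Number) Key(string code)
    {
        var m = Pattern.Match(code);
        if (!m.Success) throw new FormatException("Unrecognised card code: " + code);
        return (Array.IndexOf(Prefixes, m.Groups[1].Value),
                Array.IndexOf(Letters, m.Groups[2].Value),
                int.Parse(m.Groups[3].Value));
    }

    // Sorts the list in place by the second string of each entry.
    public static void SortByCode(List<string[]> cards)
    {
        cards.Sort((a, b) => Key(a[1]).CompareTo(Key(b[1])));
    }

    public static void Main()
    {
        var cards = new List<string[]>
        {
            new[] { "card 1", "MWSTX2FFA01" },
            new[] { "card 2", "FWQ02" },
            new[] { "card 3", "MW1E10" },
            new[] { "card 4", "MW1Q04" },
        };
        SortByCode(cards);
        // prints: MW1Q04 MW1E10 FWQ02 MWSTX2FFA01
        Console.WriteLine(string.Join(" ", cards.Select(c => c[1])));
    }
}
```

The key is recomputed on every comparison, which keeps the sketch short; for a large list it would probably be cheaper to parse each code once and sort the parsed keys.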

Using Linq Except not Working as I Thought

List1 contains items { A, B } and List2 contains items { A, B, C }.
What I need is to be returned { C } when I use Except Linq extension. Instead I get returned { A, B } and if I flip the lists around in my expression the result is { A, B, C }.
Am I misunderstanding the point of Except? Is there another extension I am not seeing to use?
I have looked through and tried a number of different posts on this matter with no success thus far.
var except = List1.Except(List2); //This is the line I have thus far
EDIT: Yes I was comparing simple objects. I have never used IEqualityComparer, it was interesting to learn about.
Thanks all for the help. The problem was that I had not implemented a comparer. The linked blog post and the example below were helpful.
If you are storing reference types in your list, you have to make sure there is a way to compare the objects for equality. Otherwise they will be checked by comparing if they refer to same address.
You can implement IEqualityComparer<T> and send it as a parameter to Except() function. Here's a blog post you may find helpful.
edit: the original blog post link was broken and has been replaced above
So just for completeness...
// Except gives you the items in the first set but not the second
var InList1ButNotList2 = List1.Except(List2);
var InList2ButNotList1 = List2.Except(List1);
// Intersect gives you the items that are common to both lists
var InBothLists = List1.Intersect(List2);
Edit: Since your lists contain objects you need to pass in an IEqualityComparer for your class... Here is what your except will look like with a sample IEqualityComparer based on made up objects... :)
// Except gives you the items in the first set but not the second
var equalityComparer = new MyClassEqualityComparer();
var InList1ButNotList2 = List1.Except(List2, equalityComparer);
var InList2ButNotList1 = List2.Except(List1, equalityComparer);
// Intersect gives you the items that are common to both lists
var InBothLists = List1.Intersect(List2, equalityComparer);
public class MyClass
{
public int i;
public int j;
}
class MyClassEqualityComparer : IEqualityComparer<MyClass>
{
public bool Equals(MyClass x, MyClass y)
{
return x.i == y.i &&
x.j == y.j;
}
public int GetHashCode(MyClass obj)
{
unchecked
{
if (obj == null)
return 0;
int hashCode = obj.i.GetHashCode();
hashCode = (hashCode * 397) ^ obj.j.GetHashCode();
return hashCode;
}
}
}
You simply confused the order of arguments. I can see where this confusion arose, because the official documentation isn't as helpful as it could be:
Produces the set difference of two sequences by using the default equality comparer to compare values.
Unless you're versed in set theory, it may not be clear what a set difference actually is—it's not simply what's different between the sets. In reality, Except returns the list of elements in the first set that are not in the second set.
Try this:
var except = List2.Except(List1); // { C }
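To make the direction of Except concrete, here is a small sketch with plain strings, which already compare by value, so no comparer is involved. The class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptDirectionDemo
{
    static readonly List<string> List1 = new List<string> { "A", "B" };
    static readonly List<string> List2 = new List<string> { "A", "B", "C" };

    // Items of List1 that do not appear in List2.
    public static List<string> InFirstOnly()
    {
        return List1.Except(List2).ToList();
    }

    // Items of List2 that do not appear in List1.
    public static List<string> InSecondOnly()
    {
        return List2.Except(List1).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(InFirstOnly().Count);            // prints 0
        Console.WriteLine(string.Join(",", InSecondOnly())); // prints C
    }
}
```

If the same calls return { A, B } and { A, B, C } for a custom class, the elements are being compared by reference, and the comparer-based answers apply instead.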
Writing a custom comparer does seem to solve the problem, but I think https://stackoverflow.com/a/12988312/10042740 is a much more simple and elegant solution.
It overrides the GetHashCode() and Equals() methods in the class that defines your object, and then the default comparer does its magic without extra code cluttering up the place.
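A sketch of what that might look like, using a hypothetical Item class with two int properties (not the asker's actual type). Once Equals and GetHashCode are overridden consistently, Except needs no comparer argument:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class for illustration only.
public sealed class Item
{
    public int I { get; }
    public int J { get; }

    public Item(int i, int j) { I = i; J = j; }

    // Value equality: two items are equal when both fields match.
    public override bool Equals(object obj)
    {
        var other = obj as Item;
        return other != null && I == other.I && J == other.J;
    }

    // Built from the same fields as Equals, so equal items hash alike.
    public override int GetHashCode()
    {
        unchecked { return (I * 397) ^ J; }
    }
}

public static class OverrideDemo
{
    public static List<Item> Run()
    {
        var list1 = new List<Item> { new Item(1, 1), new Item(2, 2) };
        var list2 = new List<Item> { new Item(1, 1), new Item(2, 2), new Item(3, 3) };
        return list2.Except(list1).ToList(); // default comparer uses the overrides
    }

    public static void Main()
    {
        var result = Run();
        Console.WriteLine(result.Count); // prints 1
        Console.WriteLine(result[0].I);  // prints 3
    }
}
```

The trade-off is that the equality rule becomes part of the class itself, so every collection and LINQ operator sees it, whereas a separate IEqualityComparer<T> applies only where it is passed in.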
Just for Ref:
I wanted to compare USB Drives connected and available to the system.
So this is the class which implements interface IEqualityComparer
public class DriveInfoEqualityComparer : IEqualityComparer<DriveInfo>
{
public bool Equals(DriveInfo x, DriveInfo y)
{
if (object.ReferenceEquals(x, y))
return true;
if (x == null || y == null)
return false;
// compare with Drive Level
return x.VolumeLabel.Equals(y.VolumeLabel);
}
public int GetHashCode(DriveInfo obj)
{
return obj.VolumeLabel.GetHashCode();
}
}
and you can use it like this
var newDeviceLst = DriveInfo.GetDrives()
.ToList()
.Except(inMemoryDrives, new DriveInfoEqualityComparer())
.ToList();

Finding duplicates in List<string>

In a list with some hundred thousand entries, how does one go about comparing each entry with the rest of the list for duplicates?
For example, the list fileNames contains both "00012345.pdf" and "12345.pdf", and these are considered duplicates. What is the best strategy for flagging this kind of duplicate?
Thanks
Update: The naming of files is restricted to numbers. They are padded with zeros. Duplicates are where the padding is missing. Thus, "123.pdf" & "000123.pdf" are duplicates.
You probably want to implement your own substring comparer to test equality based on whether a substring is contained within another string.
This isn't necessarily optimised, but it will work. You could also possibly consider using Parallel Linq if you are using .NET 4.0.
EDIT: Answer updated to reflect refined question after it was edited
void Main()
{
List<string> stringList = new List<string> { "00012345.pdf","12345.pdf","notaduplicate.jpg","3453456363234.jpg"};
IEqualityComparer<string> comparer = new NumericFilenameEqualityComparer ();
var duplicates = stringList.GroupBy (s => s, comparer).Where(grp => grp.Count() > 1);
// do something with grouped duplicates...
}
// Not safe for nulls!
// NB: do your own parameter / null checks / string-case options etc.!
public class NumericFilenameEqualityComparer : IEqualityComparer<string> {
private static Regex digitFilenameRegex = new Regex(@"\d+", RegexOptions.Compiled);
public bool Equals(string left, string right) {
Match leftDigitsMatch = digitFilenameRegex.Match(left);
Match rightDigitsMatch = digitFilenameRegex.Match(right);
long leftValue = leftDigitsMatch.Success ? long.Parse(leftDigitsMatch.Value) : long.MaxValue;
long rightValue = rightDigitsMatch.Success ? long.Parse(rightDigitsMatch.Value) : long.MaxValue;
return leftValue == rightValue;
}
public int GetHashCode(string value) {
return base.GetHashCode();
}
}
I understand you are looking for duplicates in order to remove them?
One way to go about it could be the following:
Create a class MyString which takes care of the duplication rules. That is, it overrides Equals and GetHashCode to implement exactly the duplication rules you are considering. (I understand from your question that 00012345.pdf and 12345.pdf should be considered duplicates?)
Make this class explicitly or implicitly convertible to string (or override ToString() for that matter).
Create a HashSet<MyString> and fill it by iterating through your original List<String>, checking for duplicates as you go.
Might be dirty but it will do the trick. The only "hard" part here is correctly implementing your duplication rules.
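A lighter variant of the same idea, sketched under the assumption stated in the question's update (names are digits plus an extension, and duplicates differ only in zero padding): reduce each name to a normalized key and group on it. The names DuplicateFileNames, Normalize and FindDuplicates are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateFileNames
{
    // Strips the zero padding: "000123.pdf" -> "123.pdf".
    public static string Normalize(string fileName)
    {
        var trimmed = fileName.TrimStart('0');
        // Keep one zero for all-zero names such as "000.pdf".
        if (trimmed.Length == 0 || trimmed.StartsWith(".", StringComparison.Ordinal))
        {
            return "0" + trimmed;
        }
        return trimmed;
    }

    // Returns each group of names that share a normalized key.
    public static List<List<string>> FindDuplicates(IEnumerable<string> names)
    {
        return names
            .GroupBy(n => Normalize(n), StringComparer.OrdinalIgnoreCase)
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }

    public static void Main()
    {
        var names = new[] { "00012345.pdf", "12345.pdf", "123.pdf", "000123.pdf", "777.pdf" };
        foreach (var group in FindDuplicates(names))
        {
            Console.WriteLine(string.Join(", ", group));
        }
        // prints:
        // 00012345.pdf, 12345.pdf
        // 123.pdf, 000123.pdf
    }
}
```

Because GroupBy hashes the normalized key, this makes a single pass over the list instead of comparing every entry with every other, which matters at a few hundred thousand entries.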
I have a simple solution for finding duplicate words and characters in a string.
For word
public class Test {
public static void main(String[] args) {
findDuplicateWords("i am am a a learner learner learner");
}
private static void findDuplicateWords(String string) {
HashMap<String,Integer> hm=new HashMap<>();
String[] s=string.split(" ");
for(String tempString:s){
if(hm.get(tempString)!=null){
hm.put(tempString, hm.get(tempString)+1);
}
else{
hm.put(tempString,1);
}
}
System.out.println(hm);
}
}
for character use for loop, get array length and use charAt()
Maybe something like this:
List<string> theList = new List<string>() { "00012345.pdf", "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };
theList.GroupBy(txt => txt)
.Where(grouping => grouping.Count() > 1)
.ToList()
.ForEach(groupItem => Console.WriteLine("{0} duplicated {1} times with these values {2}",
groupItem.Key,
groupItem.Count(),
string.Join(" ", groupItem.ToArray())));

Why do 2 delegate instances return the same hashcode?

Take the following:
var x = new Action(() => { Console.Write("") ; });
var y = new Action(() => { });
var a = x.GetHashCode();
var b = y.GetHashCode();
Console.WriteLine(a == b);
Console.WriteLine(x == y);
This will print:
True
False
Why is the hashcode the same?
It is kinda surprising, and will make using delegates in a Dictionary as slow as a List (aka O(n) for lookups).
Update:
The question is why. IOW who made such a (silly) decision?
A better hashcode implementation would have been:
return Method ^ (Target == null ? 0 : Target.GetHashCode());
// where Method is IntPtr
Easy! Here is the implementation of GetHashCode (sitting on the base class Delegate):
public override int GetHashCode()
{
return base.GetType().GetHashCode();
}
(sitting on the base class MulticastDelegate which will call above):
public sealed override int GetHashCode()
{
if (this.IsUnmanagedFunctionPtr())
{
return ValueType.GetHashCodeOfPtr(base._methodPtr);
}
object[] objArray = this._invocationList as object[];
if (objArray == null)
{
return base.GetHashCode();
}
int num = 0;
for (int i = 0; i < ((int) this._invocationCount); i++)
{
num = (num * 0x21) + objArray[i].GetHashCode();
}
return num;
}
Using tools such as Reflector, we can see the code and it seems like the default implementation is as strange as we see above.
The type value here will be Action. Hence the result above is correct.
UPDATE
My first attempt of a better implementation:
public class DelegateEqualityComparer:IEqualityComparer<Delegate>
{
public bool Equals(Delegate del1,Delegate del2)
{
return (del1 != null) && del1.Equals(del2);
}
public int GetHashCode(Delegate obj)
{
if(obj==null)
return 0;
int result = obj.Method.GetHashCode() ^ obj.GetType().GetHashCode();
if(obj.Target != null)
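// Hash the target, not the delegate itself, so that equal delegates
// (same method, same target) produce the same hash code.
result ^= RuntimeHelpers.GetHashCode(obj.Target);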
return result;
}
}
The quality of this should be good for single cast delegates, but not so much for multicast delegates (If I recall correctly Target/Method return the values of the last element delegate).
But I'm not really sure if it fulfills the contract in all corner cases.
Hmm, it looks like equality requires referential equality of the targets.
This smells like some of the cases mentioned in this thread, maybe it will give you some pointers on this behaviour. else, you could log it there :-)
What's the strangest corner case you've seen in C# or .NET?
Rgds GJ
From MSDN:
The default implementation of GetHashCode does not guarantee uniqueness or consistency; therefore, it must not be used as a unique object identifier for hashing purposes. Derived classes must override GetHashCode with an implementation that returns a unique hash code. For best results, the hash code must be based on the value of an instance field or property, instead of a static field or property.
So if you have not overridden the GetHashCode method, it may return the same value. I suspect this is because it generates the hash from the definition, not the instance.
