Relax C# LINQ String Comparison (Trim, Case Insensitive, ??) - c#

Problem
Background story: I am rewriting all the SQL queries of a legacy system in LINQ.
The database is not as clean as I expected: many records contain extra spaces or different letter case, and the SQL queries treat them as the same value.
SELECT *
FROM fruit
WHERE name = @fruitname;
Provided @fruitname is apple, this query will match records such as apple, _apple and APPLE_ (where _ is a whitespace character).
This is, however, the expected behavior in my use cases.
LINQ string comparison, on the other hand, is more precise, which annoys me because these mismatches keep surfacing.
Setup
FruitTableAdapter fruitsAdapter = new FruitTableAdapter();
MyGardenDataSet.FruitDataTable fruitsTable = fruitsAdapter.GetData();
Approaches
// Issue 1: Does not match, '_apple' or 'APPLE_'
var fruits1 = fruitsTable.Where(row=>row.name == fruitname);
// Issue 2: String Comparison with case insensitive (does not match 'APPLE')
var fruits2 = fruitsTable.Where(
row=>row.name.Equals(fruitname, StringComparison.OrdinalIgnoreCase));
// Issue 3: Trailing space with case insensitive
var fruits2 = fruitsTable.Where(
row=>row.name.Trim().Equals(fruitname.Trim(),
StringComparison.OrdinalIgnoreCase));
I'm not sure, but there may be many other cases where a SQL query behaves differently from a C# string comparison.
Is there any SQL-aware StringComparison? How can I achieve the same string comparison as SQL in LINQ?

Here's a nice string extension method that builds on the solutions from a similar question about casing on StackOverflow.
Keep in mind that we want to allow for null strings in our trim scenarios, so this extension checks for null values first and then does a case-insensitive compare on the trimmed strings.
public static class StringExtension
{
// Trim strings and compare values without casing
public static bool SqlCompare(this string source, string value)
{
// Handle nulls before trimming
if (!string.IsNullOrEmpty(source))
source = source.Trim();
if (!string.IsNullOrEmpty(value))
value = value.Trim();
// Compare strings (case insensitive)
return string.Equals(source, value, StringComparison.CurrentCultureIgnoreCase);
}
}
Here's how to use the Extension in your LINQ statement:
(SysUserDisplayFavorites table is composed of char() fields with space filled results. These will get trimmed and compared (case insensitive) to the user provided values in displayFavorite object)
var defaultFavorite = _context.SysUserDisplayFavorites
.Where(x => x.UserId.SqlCompare(displayFavorite.UserId))
.Where(x => x.ModuleCode.SqlCompare(displayFavorite.ModuleCode))
.Where(x => x.ActivityCode.SqlCompare(displayFavorite.ActivityCode))
.Where(x => x.ActivityItemCode.SqlCompare(displayFavorite.ActivityItemCode))
.Where(x => x.IsDefault);
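The extension above trims both ends of each string. If the aim is to stay closer to what the original SQL did, here is a minimal sketch under two assumptions that are not stated in the question: the database is SQL Server, where = ignores trailing spaces only, and the column uses a case-insensitive collation. SqlLikeComparison and SqlEquals are made-up names, and OrdinalIgnoreCase is only an approximation of a real collation.

```csharp
using System;

public static class SqlLikeComparison
{
    // Hypothetical helper: ignores trailing spaces and letter case,
    // but keeps leading spaces significant.
    public static bool SqlEquals(string left, string right)
    {
        // A SQL comparison involving NULL never evaluates to true, so mirror that here.
        if (left == null || right == null)
            return false;

        return string.Equals(
            left.TrimEnd(' '),
            right.TrimEnd(' '),
            StringComparison.OrdinalIgnoreCase); // assumption: case-insensitive collation
    }
}
```

With LINQ to Objects or a typed DataTable this can be called directly inside Where, e.g. fruitsTable.Where(row => SqlLikeComparison.SqlEquals(row.name, fruitname)). A provider that translates the query to SQL will most likely not be able to translate a custom method like this one.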

This is a very late answer, but you can use Regex to solve your problem.
Here's what I tried; I hope it helps.
I created a sample class
public class SampleTable
{
public string Name { get; set; }
public SampleTable(string name)
{
Name = name;
}
}
Populated sample data
List<SampleTable> sampleTblList = new List<SampleTable>();
sampleTblList.Add(new SampleTable(" Apple"));
sampleTblList.Add(new SampleTable(" APPLE"));
sampleTblList.Add(new SampleTable("Apple"));
sampleTblList.Add(new SampleTable("apple"));
sampleTblList.Add(new SampleTable("apple "));
sampleTblList.Add(new SampleTable("apmangple"));
Solution:-
string fruitName = "apple";
List<SampleTable> sortedSampleTblList = sampleTblList.Where(x =>
Regex.IsMatch(fruitName, x.Name, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase)).ToList();
Output:-
string ans = String.Join(",", sortedSampleTblList.Select(x => x.Name.Replace(" ","_")).ToArray());
Console.Write(ans);
_Apple,_APPLE,Apple,apple,apple_

fruitsTable.Where(row => row.name.Trim().Equals(fruitname, StringComparison.OrdinalIgnoreCase)); should do what you need, but I'm confused because you've listed almost the same under Issue 3. Were you not realising it was working because you are reusing fruits2?
This little NUnit test is passing
[Test]
public void FruitTest()
{
var fruitsTable = new List<string> { " Apple", " APPLE", "Apple", "apple", "apple ", " apple", "APPLE " };
var fruitname = "apple ".Trim();
var fruits = fruitsTable.Where(row => row.Trim().Equals(fruitname, StringComparison.OrdinalIgnoreCase));
Assert.AreEqual(fruitsTable.Count(), fruits.Count());
}

Related

LINQ filter string with value string

I have a string str = "abc,def,ghi". The length of string will vary. There could be one or more values separated by comma.
I have an object that has a property Code that is string and can contain values such as - "abc, stu, xyz"
I'm trying to filter objects from a collection that will return only those that contain a string in str
So, if object.Code = "abc, stu, xyz" and string str = "abc,def,ghi" then return the object.
objects.Where( x => x.Code.Split(',').Any(s => (???)) );
where ??? is where my string str values will come in.
Thanks,
var result = objects.Where(x => x.Code.Split(',').Any(s => (str.Split(',').Any(f => f.Equals(s)))));
Conversion of the str to a HashSet will improve the testing speed and simplify the query, but perhaps is overkill if your objects only have a few entries. I assume the Code property does not have spaces after each comma.
var strHash = str.Split(',').ToHashSet();
var ans = objects.Where(o => o.Code.Split(',', StringSplitOptions.RemoveEmptyEntries).Any(c1 => strHash.Contains(c1)));
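The example value "abc, stu, xyz" does have a space after each comma, so the tokens probably need trimming before they are compared. A small sketch of the HashSet idea with trimming added; CodeFilter, ToTokenSet and SharesAnyToken are hypothetical names, and the comparison is assumed to be case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CodeFilter
{
    // Splits a comma-separated string into trimmed, non-empty tokens.
    private static IEnumerable<string> Tokens(string csv)
    {
        return (csv ?? string.Empty)
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.Trim())
            .Where(t => t.Length > 0);
    }

    // Builds the lookup set once, so each object is checked with O(1) lookups.
    public static HashSet<string> ToTokenSet(string csv)
    {
        return new HashSet<string>(Tokens(csv), StringComparer.Ordinal);
    }

    // True when csvCode shares at least one token with the wanted set.
    public static bool SharesAnyToken(string csvCode, ISet<string> wanted)
    {
        return Tokens(csvCode).Any(t => wanted.Contains(t));
    }
}
```

Usage would look like: var wanted = CodeFilter.ToTokenSet(str); var result = objects.Where(o => CodeFilter.SharesAnyToken(o.Code, wanted));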

How to find the nearest string in a List in LINQ?

I want to find the exact match, or the next nearest match, for a string.
Using SQL, I can do:
SELECT TOP 1 *
FROM table
WHERE Code >= @searchcode
ORDER BY Code
How might I achieve this using LINQ and a List of the records?
I was expecting to be able to do something like:
var find = ListDS.Where(c => c.Code >= searchcode).First();
but you can't compare strings that way.
Note that Code is an alpha string, letters, numbers, symbols, whatever..
Nearest means if you have a list containing "England", "France", "Spain", and you search for "France" then you get "France". If you search for "Germany" you get "Spain".
Here is some simple code that may help you:
List<string> ls = new List<string>();
ls.Add("ddd");
ls.Add("adb");
var vv = from p in ls where p.StartsWith("a") select p;
This selects all elements that start with the string "a".
If Code is an int this might work:
var find = ListDS.Where(c => c.Code >= searchcode).OrderBy(c => c.Code).First();
otherwise you need to convert it to one:
int code = int.Parse(searchcode);
var find = ListDS.Where(c => Convert.ToInt32(c.Code) >= code).OrderBy(c => Convert.ToInt32(c.Code)).First();
Try this solution:
class Something
{
public string Code;
public Something(string code)
{
this.Code = code;
}
}
class Program
{
static void Main(string[] args)
{
List<Something> ListDS = new List<Something>();
ListDS.Add(new Something("test1"));
ListDS.Add(new Something("searchword1"));
ListDS.Add(new Something("test2"));
ListDS.Add(new Something("searchword2"));
string searchcode = "searchword";
var find = ListDS.First(x => x.Code.Contains(searchcode));
Console.WriteLine(find.Code);
Console.ReadKey();
}
}
I replaced your >= with .Contains. You can also add the action into First, no need for Where.
It will not find the "nearest", just the first word containing your search parameter.
You can compare strings in C# with CompareTo, which uses alphabetical order:
var find = ListDS.Where(c => c.Code.CompareTo(searchcode) >= 0)
.OrderBy(c => c.Code) // order by Code so that First() returns the closest match
.First();
See the CompareTo docs.
Note that with this method, "10" > "2".
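For reference, a self-contained sketch of the same ">= then take the first" idea over plain strings. NearestLookup and FirstAtOrAfter are made-up names. It assumes an ordinal (character code) ordering is acceptable, which will not necessarily match the collation the SQL query used, and it returns null instead of throwing when nothing qualifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NearestLookup
{
    // Returns the smallest value that is >= searchCode, or null when none exists.
    public static string FirstAtOrAfter(IEnumerable<string> codes, string searchCode)
    {
        return codes
            .Where(c => string.CompareOrdinal(c, searchCode) >= 0)
            .OrderBy(c => c, StringComparer.Ordinal) // same ordering as the filter
            .FirstOrDefault();
    }
}
```

With the list "England", "France", "Spain" from the question, searching for "France" returns "France" and searching for "Germany" returns "Spain". Using the same comparer for both the filter and the sort matters, otherwise the two steps can disagree about which value comes first.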

How to get distinct value of Upper and lower case?

I have this function and I want to get distinct values in Data. My problem is that if two values have the same characters but differ in case (e.g. Comedy and comedy), both values still end up in Data, so when I bind to Data it shows both.
My function is:
public void LoadBookGenre(Book abc)
{
var loadbook = from Book s in BookDB.Books where s.Genre == abc.Genre select s;
BookAttribute.Clear();
foreach (Book m in loadbook) BookAttribute.Add(m);
List<Book> distinct = BookAttribute.GroupBy(a => a.Genre).Select(g => g.First()).ToList();
Data.Clear();
foreach (Book s in distinct) Data.Add(s);
}
You can use the GroupBy overload that allows you to specify a case-insensitive comparer:
List<Book> distinct =
BookAttribute.GroupBy(a => a.Genre, StringComparer.OrdinalIgnoreCase)
.Select(g => g.First())
.ToList();
Depending on your scenario, you might also be able to use Distinct:
List<string> distinctGenres =
BookAttribute.Select(a => a.Genre)
.Distinct(StringComparer.OrdinalIgnoreCase)
.ToList();
Edit: You also need to alter the equality check in your initial query:
var loadbook = from Book s in BookDB.Books
where s.Genre.Equals(abc.Genre, StringComparison.OrdinalIgnoreCase)
select s;
The common solution is to maintain a version of the string that is forced to upper or lower case with upper() or lower(), and use that internal string for comparisons and the original string as the "display" version.
Replace
Data.Add(s);
by
var found = Data.SingleOrDefault(x => x.Genre.ToUpperInvariant() == s.Genre.ToUpperInvariant());
if (found == null)
{
Data.Add(s);
}
This way, you avoid adding the same name twice, while keeping the casing of the first one you find.
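Another way to get the "keep the first casing you find" behaviour is to track what has been seen in a HashSet with a case-insensitive comparer. This is only a sketch over plain strings; GenreDedup and DistinctIgnoringCase are hypothetical names, not part of the code in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GenreDedup
{
    // Keeps the first spelling of each value and treats case differences as equal.
    public static List<string> DistinctIgnoringCase(IEnumerable<string> values)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        // HashSet.Add returns false when an equal item is already present,
        // so only the first occurrence of each value passes the filter.
        return values.Where(v => seen.Add(v)).ToList();
    }
}
```

For the input "Comedy", "comedy", "Drama", "COMEDY" this yields "Comedy", "Drama". The same pattern works for Book objects by calling seen.Add(book.Genre) inside the loop that fills Data.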

Retrieve all values that partly contains values from a list

I have a table in my database called Cities. I want to retrieve all cities whose name contains any of the values from the strings list.
List<string> strings = new List<string>(new string[] {"burg", "wood", "town"} );
I tried this, but it only matches the exact values from the strings list. I need to find names that contain e.g. town, like cape town and townsend.
List<City> cities = db.Cities.Where(c => strings.Contains(c.name));
EDIT
I'm using LINQ to SQL and Any() doesn't seem to be supported here:
Local sequence cannot be used in LINQ to SQL implementations of query
operators except the Contains operator.
This will do what you need, assuming your LINQ provider supports it - since you did not mention what are you using, we can't test it.
List<City> cities = db.Cities.Where(c => strings.Any(s => c.name.Contains(s)));
In detail: for a single value (like Capetown) you would write
strings.Any(s => "Capetown".Contains(s))
Then you just apply this expression inside your current Where condition as shown in the initial code example.
Since you mention that your LINQ provider does not support .Any() in this context, here is a much more complicated code that builds the query expression dynamically.
var strings = new [] { "burg", "wood", "town" };
// just some sample data
var cities = new[] { new City("Capetown"), new City("Hamburg"), new City("New York"), new City("Farwood") };
var param = Expression.Parameter(typeof(City));
var cityName = Expression.PropertyOrField(param, "Name"); // change the property name
Expression condition = Expression.Constant(false);
foreach (var s in strings)
{
var expr = Expression.Call(cityName, "Contains", Type.EmptyTypes, Expression.Constant(s));
condition = Expression.OrElse(condition, expr);
}
// you can apply the .Where call to any query. In the debugger view you can see that
// the actual expression applied is just a bunch of OR statements.
var query = cities.AsQueryable().Where(Expression.Lambda<Func<City, bool>>(condition, param));
var results = query.ToList();
// the class used in the test
private class City
{
public City(string name) { this.Name = name; }
public string Name;
}
But note that since you mentioned in other comments that the strings collection is rather large, you should really look into building a stored procedure and pass the values as XML parameter to that procedure (then load the XML as table and join it in the query) because this approach of building the query will probably soon run into some sort of "query has too many operands" exception.
I'm not sure if it is supported by your LINQ-provider, but at least in LINQ-To-Objects this works:
List<City> cities = db.Cities.Where(c => strings.Any(s=> c.Name.Contains(s)));
You need to check if the City name contains any of the string in the list, not the other way around:
protected bool ContainsSubstring(string cityName, List<string> strings)
{
foreach(string subString in strings)
{
if (cityName.Contains(subString)) return true;
}
return false;
}
...
List<City> cities = db.Cities.Where(c => this.ContainsSubstring(c.name, strings));
If you end up with a lot of loops, you can wrap the check in a Func<> so that it can be reused (this mainly helps readability; it is not inherently faster than a regular method). I have a sample for that:
List<string> _lookup = new List<string>() { "dE", "SE","yu" };
IEnumerable<string> _src = new List<string> { "DER","SER","YUR" };
Func<string, List<string>, bool> test = (i,lookup) =>
{
bool ispassed = false;
foreach (string lkstring in lookup)
{
ispassed = i.Contains(lkstring, StringComparison.OrdinalIgnoreCase);
if (ispassed) break;
}
return ispassed;
};
var passedCities = _src.Where(i => test(i, _lookup));
var cities = from c in db.Cities.AsEnumerable()
from s in strings
where c.name.ToLower().Contains(s.ToLower())
select c.name;
Do you have the ability to call a stored proc or sql? - you could use SQL fulltextsearch, especially if you're searching multiple terms. It'd probably be a lot quicker than doing string comparisons in SQL.
http://technet.microsoft.com/en-us/library/ms142583.aspx
You could create your search terms by doing string.Join(" ", strings)

Detecting "near duplicates" using a LINQ/C# query

I'm using the following queries to detect duplicates in a database.
Using a LINQ join doesn't work very well because Company X may also be listed as CompanyX, therefore I'd like to amend this to detect "near duplicates".
var results = result
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
Could anybody suggest a way in which I have better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation)
My main goals are:
To list the first record with a subset of all duplicates (or "near duplicates")
To have some flexibility over the fields and text comparisons I use for my duplicates.
For your explicit "ignoring spaces" case, you can simply call
var results = result.GroupBy(c => c.Name.Replace(" ", ""))...
However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company> classes to use in your groupings. For example, this should do the same in your "ignore space" case:
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
public bool Equals(Company x, Company y)
{
return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
}
public int GetHashCode(Company obj)
{
return obj.Name.Replace(" ", "").GetHashCode();
}
}
which you could use as
var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...
It's pretty straightforward to do similar things containing multiple fields, or other definitions of similarity, etc.
Just note that your definition of "similar" must be transitive, e.g. if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)
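One way to get transitivity for free is to derive both Equals and GetHashCode from a single normalized key, so two names are "similar" exactly when their keys are identical. A minimal sketch, assuming the grouping runs in memory (LINQ to Objects) and that "near duplicate" means "same text after removing whitespace and ignoring case"; NormalizedNameComparer is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two names are equal when they match after
// removing all whitespace and ignoring case.
public sealed class NormalizedNameComparer : IEqualityComparer<string>
{
    private static string Normalize(string name)
    {
        if (name == null) return string.Empty;
        var withoutSpaces = new string(name.Where(ch => !char.IsWhiteSpace(ch)).ToArray());
        return withoutSpaces.ToUpperInvariant();
    }

    public bool Equals(string x, string y)
    {
        return Normalize(x) == Normalize(y);
    }

    public int GetHashCode(string obj)
    {
        // Must agree with Equals: equal names always produce equal hash codes.
        return Normalize(obj).GetHashCode();
    }
}
```

It could then be passed to the grouping, e.g. result.GroupBy(c => c.CompanyName, new NormalizedNameComparer()), after the data has been pulled into memory. Other definitions of similarity can be added by extending Normalize, as long as everything still goes through the one key.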
Okay, so since you're looking for different permutations you could do something like this:
Bear in mind this was written in the answer so it may not fully compile, but you get the idea.
var results = result
.Where(g => CompanyNamePermutations(g.Key.CompanyName).Contains(g.Key.CompanyName))
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
private static List<string> CompanyNamePermutations(string companyName)
{
// build your permutations here
// so to build the one in your example
return new List<string>
{
companyName,
string.Join("", companyName.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
};
}
In this case you need to define where the work is going to take place i.e. fully on the server, in local memory or a mixture of both.
In local memory:
In this case we have two routes: pull back all the data and do the logic in local memory, or stream the data and apply the logic piecewise. To pull all the data, just call ToList() or ToArray() on the base table. To stream the data, I would suggest using ToLookup() with a custom IEqualityComparer, e.g.
public class CustomEqualityComparer: IEqualityComparer<String>
{
public bool Equals(String str1, String str2)
{
//custom logic
}
public int GetHashCode(String str)
{
// custom logic
}
}
//result
var results = result.ToLookup(r => r.Name,
new CustomEqualityComparer())
.Select(r => ....)
Fully on the server:
Depends on your provider and what it can successfully map. E.g. if we define a near duplicate as one with an alternative delimiter one could do something like this:
private char[] delimiters = new char[]{' ','-','*'};
var results = result.GroupBy(r => delimiters.Aggregate(r.Name, (name, d) => name.Replace(d.ToString(), "")))...
Mixture:
In this case we are splitting the work between the two. Unless you come up with a nice scheme this route is most likely to be inefficient. E.g. if we keep the logic on the local side, build groupings as a mapping from a name into a key and just query the resulting groupings we can do something like this:
var groupings = result.Select(r => r.Name)
//pull into local memory
.ToArray()
//do local grouping logic...
//Query results
var results = result.GroupBy(r => groupings[r]).....
Personally I usually go with the first option, pulling all the data for small data sets and streaming large data sets (empirically I found streaming with logic between each pull takes a lot longer than pulling all the data then doing all the logic)
Note: depending on the provider, ToLookup() usually executes immediately and applies its logic piecewise while it is being constructed.
