I have a list of strings, IEnumerable<string> companies, containing company names, e.g. "mcdonalds", "sony inc".
I want to compare them with values in the database. I fetch the list from the database and compare each name in a foreach loop:
if (companies.Any(c => c.Contains(name.ToLower())))
{...}
In the database I have companies such as "mcdonalds inc" and "sony".
When searching for "sony" it finds a match, but for "mcdonalds inc" it doesn't, because of the additional word "inc": I end up evaluating companies.Any(c => c.Contains("mcdonalds inc")), and no entry in the list contains that string.
Any suggestions on how I can extend the if condition to also compare the other way around?
Why not compare your list of strings with the database, rather than the database with your list?
foreach(var name in list)
{
if (table.Any(t => SqlMethods.Like(name, string.Format("%{0}%", t.Column))))
{ ...
}
}
That will capture cases where the names are broadly similar but not exactly the same.
You could call Split before comparing and compare only the first element of the result; that way you always discard the 'inc'.
This will of course only work if your mismatches are caused solely by suffixes that can be discarded.
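A quick Python sketch of that suggestion (the name is hypothetical): splitting on whitespace and keeping only the first token discards any trailing suffix.

```python
# Keep only the first whitespace-separated token, discarding suffixes like "inc".
name = "mcdonalds inc"
first_token = name.split()[0]
print(first_token)  # mcdonalds
```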
It depends on how fuzzy you need your string matching to be. You probably can't just remove all occurrences of inc from the string, for example, because a company named "The inc company" or "Inception" will potentially get affected.
For your specific example, you basically need to strip some terms (let's say inc, incorporated, and llc) from the end of the string. Let's also say that there might be more than a single term at the end of the string. You can potentially do this with a regex, something like
Regex termRemover = new Regex("^(?<companyName>.*?)(\\s+(inc|incorporated|llc))*$");
which could then be used in your example like
string scrubbedName = termRemover.Match(name.ToLower()).Groups["companyName"].Value;
if (companies.Any(c => c.Contains(scrubbedName)))
{...}
(plus error checking, etc, dropped here for brevity). The companies list should be scrubbed in the same way prior to use; otherwise your sony inc will never match anything that scrubs to just sony.
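The same scrubbing idea can be sketched in Python (the company names and suffix list are just illustrative):

```python
import re

# Lazy prefix capture, then zero or more trailing corporate suffixes.
term_remover = re.compile(r"^(?P<companyName>.*?)(\s+(inc|incorporated|llc))*$")

def scrub(name):
    """Strip trailing corporate suffixes from a lowercased name."""
    return term_remover.match(name.lower()).group("companyName")

# Scrub both sides before comparing, as recommended above.
companies = [scrub(c) for c in ["mcdonalds", "sony inc"]]
matched = scrub("mcdonalds inc") in companies
print(matched)  # True
```

Because both the incoming name and the stored list are scrubbed the same way, "mcdonalds inc" and "mcdonalds" reduce to the same key.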
I have this case:
I create an array from a search string like this:
String[] parameters = stringParametersToSearch.Split(' ');
The number of parameters can vary from 1 to n, and I have to search for objects whose description field contains all of the parameters.
List<LookUpObject> result =
components.Where(o => o.LongDescription.Contains(parameters[0])).ToList<LookUpObject>();
This works when there is a single parameter, but what if there are two or more?
Currently, to handle this situation, I use an IF in which I build the LINQ expression for each case up to five parameters (the maximum that occurs in practice).
Can I resolve this situation dynamically using LINQ?
You either want to use Any or All, depending on whether you want to find objects where all of the parameters match or any of them. So something like:
var result = components
.Where(o => parameters.Any(p => o.LongDescription.Contains(p)))
.ToList();
... but change Any to All if you need to.
It's always worth trying to describe a query in words, and then look at the words you've used. If you use the word "any" or "all" that's a good hint that you might want to use it in the query.
Having said that, given the example you posted (in a now-deleted comment), it's not clear that you really want to use string operations for this. If the long description is:
KW=50 CO2=69 KG=100
... then you'd end up matching on "G=100" or "KG=1" neither of which is what you really want, I suspect. You should probably parse the long description and parameters into name/value pairs, and look for those in the query.
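The Any/All distinction can be sketched in Python (the descriptions and parameters here are hypothetical):

```python
# "any" keeps items matching at least one parameter; "all" requires every one.
components = ["KW=50 CO2=69 KG=100", "KW=50 KG=5"]
parameters = ["KW=50", "KG=100"]

any_match = [d for d in components if any(p in d for p in parameters)]
all_match = [d for d in components if all(p in d for p in parameters)]

print(len(any_match), len(all_match))  # 2 1
```

Note the second description matches `any` (it contains "KW=50") but not `all` (it lacks "KG=100"), which is exactly the choice the question needs to make.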
I am trying to find a way to use Regex in .NET to efficiently determine which of several patterns a string matches. If my tokens were of fixed text, I would use a Dictionary<> and simply look them up. However, the tokens may have one or more sequences of digits embedded in them to represent indices. I have several dozen to ~100 such tokens. For a small example, I would like to match one of the following tokens:
ORDERID
PRICE(\d+)
QUANTITY(\d+)
DESCRIPTION(\d+)
WEIGHT(\d+)_(\d+)
(The imagined use case is that I have a set of name-value pairs and the names use embedded integers to permit repetition. In this example, imagine an order that has multiple lines, and PRICE is the price for the nth line. WEIGHT_ is the weight of the mth individual object of the nth line (imagine the lineitem is a kit of some sort)).
Note that composition of these tokens is outside of my control.
I can efficiently recognize these tokens with something like
^((?<oid>ORDERID)|(?<prc>PRICE(\d+))|(?<qty>QUANTITY(\d+))|(?<dsc>DESCRIPTION(\d+))|(?<wght>WEIGHT(\d+)_(\d+)))$
Note that regular expression matching for a given regular expression is linear in the size of the string being matched, and it shouldn't get more than a factor of log n less efficient as I add more tokens.
Now do a match:
Match m = r.Match("PRICE44");
Unfortunately, as far as I can tell, to determine which token was matched from the Regex.Match object, I have to iterate through all of the possibilities:
m.Groups["oid"].Success
m.Groups["prc"].Success
m.Groups["qty"].Success
m.Groups["dsc"].Success
m.Groups["wght"].Success
The cost grows linearly (or more likely, n log n) with an increased number of tokens. If there were, say, a SuccessGroups collection, I could iterate through that, where it would generally (in my usage) have a single element: the particular group that was matched.
I could write my own parsing algorithm, creating a trie or similar data structure, but I am loath to reimplement something that Regex already implements yet doesn't appear to give me efficient access to.
Any ideas or suggestions?
Maybe use groups: the match records which subexpression succeeded, and you can iterate through the matched groups instead of through all the tokens.
http://msdn.microsoft.com/en-us/library/bs2twtah%28v=vs.110%29.aspx#matched_subexpression
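As a point of comparison, Python's regex engine exposes this directly: Match.lastgroup reports which named alternative matched, with no need to probe every group. A sketch using the token set from the question:

```python
import re

# One alternation of named groups; lastgroup names the branch that matched.
token = re.compile(
    r"^(?:(?P<oid>ORDERID)|(?P<prc>PRICE\d+)|(?P<qty>QUANTITY\d+)"
    r"|(?P<dsc>DESCRIPTION\d+)|(?P<wght>WEIGHT\d+_\d+))$"
)

m = token.match("PRICE44")
print(m.lastgroup)  # prc
```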
I have an ENUM defined as follows:
public enum TextType {
JOIN,
SUBSCRIBE,
STOP,
HELP,
CANCEL,
UNSUBSCRIBE,
UPGRADE,
BALANCE,
USAGE
}
I would like to prioritize this and filter it as follows:
1. YES (if the phone is not yet verified to receive texts, accept this text; else ignore it and go to the next one)
2. STOP, CANCEL, UNSUBSCRIBE
3. HELP
4. BALANCE or USAGE
5. UPGRADE
So basically, when the user sends a text such as "YES BALANCE", I first check whether the phone number is registered. If it is registered, I should use the text "BALANCE" and return the balance. But if the phone is unregistered, I should use the text "YES" to register the phone first and ignore the rest.
My issue is that I am currently using Enum.IsDefined to find out whether the incoming text is a valid TextType, and since we have two different texts combined, it fails right away.
if (Enum.IsDefined(typeof(TextType), VtextType))
So how do I rewrite the above if condition in C# so that I am able to accept both "YES" and "BALANCE"? Should I change the definition of my enum, or should I use Contains, or regex? I am using .NET 4.5.
Please help.
After Michael's reply, can I loop through the array of strings like this:
foreach (string s in Tokens)
{
Console.WriteLine(s);
}
Will this work?
It sounds like you are receiving a plain text message that you need to parse for a set of instructions. Once you've parsed the instructions, you can then traverse a representative data structure such as an array or abstract syntax tree and make decisions.
Without knowing the full syntax of the messages you're receiving, I can only guess at the best way to parse them. Some options are:
Split the message by whitespace into an array of tokens and loop through the tokens
Use a more sophisticated grammar parsing library such as Irony
Enums may come in handy when defining the set of tokens you're able to parse.
Update: If you're just looking to split up the string and look at each word (or token), you can use something like:
var Tokens = Regex.Split(myString, @"\s+");
Now you have an array of strings, and you can look at each string in the array individually. You could see if the first string is "YES", try to parse each string as your Enum, etc.
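The same tokenize-then-recognize approach can be sketched in Python (the command set mirrors the TextType enum, plus "YES" from the question):

```python
import re

# Hypothetical command vocabulary based on the question's enum.
TEXT_TYPES = {"JOIN", "SUBSCRIBE", "STOP", "HELP", "CANCEL",
              "UNSUBSCRIBE", "UPGRADE", "BALANCE", "USAGE", "YES"}

def parse_message(message):
    """Split on whitespace and keep only tokens that are known commands."""
    tokens = re.split(r"\s+", message.strip().upper())
    return [t for t in tokens if t in TEXT_TYPES]

commands = parse_message("YES BALANCE")
print(commands)  # ['YES', 'BALANCE']
```

This makes "YES BALANCE" yield both commands, so the registered/unregistered decision can then pick which one to act on.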
I'm using c# and a database in SQL Server.
I have an autocomplete field that works fine with normal characters. I would like to add the functionality of autocomplete special characters too, as ö, Ä, é, è, ...
I would also like to add the possibility of identifying characters that may sound similar in some languages, such as 'b' and 'v', so that typing 'boor' would find 'voor' as a possible suggestion.
Any ideas?
Thanks
Edit: The autocomplete textboxes are used for names and surnames (one for each). They are made with AutoCompleteStringCollection. They search in the database for names or surnames that already exist.
This part of the application basically gives to the user the possibility to add new persons in the application (name,surname,etc).
The goal is that when the user is creating a new person in the application, he/she will get a list with the persons with a similar name or surname to the one he/she is typing in.
So if we already have 'James Smith' in the database, when the user is typing Smyth, he/she should get the possibility to change to Smith (as a autocomplete, maybe), saying "hey, do you mean 'Smith'?" So we avoid that the user creates the same person with wrong names.
Because we are working with names and surnames from people from all over the world, the errors in the creation of a new person can come from any language.
PS:
Would it be a good idea to create my own autocomplete by hiding/showing a listbox right under the textbox?
From what I have tried, the SOUNDEX function works really well for characters like ö, Ä, é, è, ... But I can't call the database for every single name or surname, so I am not sure how best to use it.
I am not sure what you mean by autocompletion. Regarding the second part of your question, you probably need the SQL Server feature SOUNDEX. It returns a four-character (SOUNDEX) code that can be used to evaluate the similarity of two strings.
Use it like:
SELECT SOUNDEX ('Smith'), SOUNDEX ('Smythe');
The words above are spelled almost identically, so they get the same soundex code, S530.
I think the soundex may be used with various languages, though I am not totally sure.
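For experimenting without a database round-trip, here is a minimal Python sketch of the classic Soundex algorithm (the one SQL Server's SOUNDEX is based on), simplified in that 'h' and 'w' are treated like vowels:

```python
def soundex(word):
    """Classic Soundex, simplified: first letter kept, rest mapped to digits."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    codes = {ch: d for letters, d in groups.items() for ch in letters}
    word = word.lower()
    first = word[0].upper()
    result = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:  # skip vowels and collapse adjacent duplicates
            result.append(code)
        prev = code
    return (first + "".join(result) + "000")[:4]

print(soundex("Smith"), soundex("Smythe"))  # S530 S530
```

Both spellings reduce to S530, matching the SQL example above.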
Unfortunately, you can only use an AutoCompleteStringCollection as the autocomplete source.
The logic for what is presented to the user (the box with the matching items below the textbox itself) is fully controlled by the TextBox and can't be influenced in any way.
So even if you use something like Soundex() or Levenshtein distance, you can't tell the TextBox about it, because it always performs a String.StartsWith() against the given collection, and on selection it replaces the whole content with the selected value from the source.
That's something that has already driven me crazy. You simply can't influence which items from the list are presented to the user, nor what happens when an item from the box is selected.
I would look into Levenshtein distance.
Soundex is rather primitive; it was originally designed to be calculated by hand. It produces a key and works well with Western names and surnames.
Levenshtein distance compares two string values and produces a value based on their similarity. It looks for missing or substituted letters (no phonetic comparison, unlike Soundex).
Wikipedia reference: http://en.wikipedia.org/wiki/Levenstein_distance
Website for testing two string values using Levenshtein distance: http://gtools.org/levenshtein-calculate.php
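For reference, the distance itself is short to implement; here is a compact Python sketch using two-row dynamic programming:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Smith", "Smyth"))  # 1
```

A distance of 1 between "Smith" and "Smyth" is exactly the kind of near-miss the "do you mean Smith?" suggestion should catch.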
There is a list of banned words (or strings, to be more general) and another list containing, let's say, users' mails. I would like to excise all banned words from all mails.
Trivial example:
foreach (string word in wordsList)
{
    for (int i = 0; i < mailList.Count; i++)
    {
        // String.Replace returns a new string, so the result must be stored
        mailList[i] = mailList[i].Replace(word, String.Empty);
    }
}
How I can improve this algorithm?
Thanks for the advice. I voted a few answers up, but I didn't mark any as the answer, since this was more of a discussion than a single solution. Some people confused banned words with bad words; in my case I don't have to worry about recognizing 'sh1t' or anything like that.
Simple approaches to profanity filtering won't work - complex approaches don't work, for the most part, either.
What happens when you get a word like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead? The intent is still clear, right?
See How do you implement a good profanity filter? for extensive discussion.
You could use Regex to make things a little cleaner:
var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";
foreach (var mail in mailList)
{
    var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
}
Even that, though, is far from perfect since people will always figure out a way around any type of filter.
You'll get best performance by drawing up a finite state machine (FSM) (or generate one) and then parsing your input 1 character at a time and walking through the states.
You can do this pretty easily with a function that takes your next input character and your current state and returns the next state; you produce output as you walk through the mail message's characters. You can draw the FSM on paper.
Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.
In that way you only need to walk each message a single time.
Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).
In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.
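The combined-pattern idea can be sketched in Python (the banned words here are placeholders); building one alternation means each mail is scanned only once:

```python
import re

# One word-boundary-anchored alternation built from the banned list.
banned = ["list", "of", "banned", "words"]
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, banned)) + r")\b",
                     re.IGNORECASE)

clean = pattern.sub("", "A list of perfectly normal words.")
print(clean)
```

re.escape guards against banned entries that contain regex metacharacters, and \b ensures only complete words are removed.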
A general algorithm would be to:
Generate a list of tokens based on the input string (ie. by treating whitespace as token separators)
Compare each token against a list of banned words
Replace matched tokens
A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.
HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
"bad",
};
string Input = "this is some bad text.";
string Output = Regex.Replace(Input, @"\b\w+\b", (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);
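The same callback-replace technique works in Python: re.sub accepts a function, and a set gives O(1) banned-word lookups.

```python
import re

# Hypothetical banned-word set, mirroring the HashSet example above.
banned_words = {"bad"}

def censor(match):
    """Mask a token with x's if it is banned; otherwise keep it."""
    word = match.group(0)
    return "x" * len(word) if word.lower() in banned_words else word

censored = re.sub(r"\b\w+\b", censor, "this is some bad text.")
print(censored)
# this is some xxx text.
```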
Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.
Don't try replacing words with more innocuous words unless it's funny. People get the joke on 4chan, but Yahoo groups about history had confused people because the "medireview" and "mediareview" periods were being discussed: "eval" (not profanity, but used in some XSS attacks that Yahoo had been hit by) was replaced with "review" inside "medieval" and "mediaeval" (apparently, "medireview" is the American spelling of "mediareview"!).
In some circumstances it's possible to improve on it.
Just for fun: you can use a SortedList. If your mailing list is really delimited (e.g. by ";"), you can do the following.
First, work out the running time of the current algorithm:
Words: n items (each of O(1) length).
Mailing list: K items.
Each item in the mailing list has an average length of Z.
Each sub-item in a mailing-list item has an average length of Y, so the average number of sub-items per mailing-list item is m = Z/Y.
The current algorithm takes O(n*K*Z) (the best achievable, using Knuth's string-matching algorithm).
1. Sort the words list in O(n log n).
2.1. Use mailingListItem.Split(";".ToCharArray()) on each mailing-list item: O(Z).
2.2. Sort the items in each mailing-list item: O(m * log m). The total sorting takes O(K * Z) in the worst case, given that m log m << Z.
3. Use a merge algorithm to merge the banned words with each mailing-list item: O((m + n) * K).
The total time is O((m+n)*K + m*Z + n^2); given that m << n, the total running time is O(n^2 + Z*K) in the worst case, which is smaller than O(n*K*Z) if n < K * Z (I think).
So if performance is very, very important, you can do this.
You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to assure you are only getting full words that match. You could use a pattern like this:
@"\bBADWORD\b"
Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.
Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipients are made more aware of what happened, rather than getting nonsensical sentences with missing words.
Well, you certainly don't want to make the clbuttic mistake of a naive string.Replace(). The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if, or how much, that would slow your operation down, particularly for a large list of banned words). You could always just... not do it, since it's entirely futile no matter what: there are ways to make your intended words quite clear even without using the exact letters.
That, and it's ridiculous to have a list of words that "people find offensive" in the first place; there's someone who will be offended by pretty much any word.
/censorship is bullshit rant
I assume that you want to detect only complete words (separated by non-letter characters) and ignore words that merely contain a filter word as a substring (like the p[ass]word example). In that case you should build a HashSet of filter words, scan the text for words, and for each word check its existence in the HashSet. If it's a filter word, build the resulting StringBuilder output without it (or with an equal number of asterisks).
I had great results using this algorithm from codeproject.com; it worked better than brute-force text replacements.