match multiple terms in a string to determine topic - c#

I am trying to search a user-entered string against a list of known terms to determine a topic. That is, I maintain my own list of topics and related keywords, and want to match against the user-entered string to determine the topic(s) it relates to. However, I want to make sure multiple terms are "hit" to avoid false-positives.
e.g. based on the code:
//create a list of topic keywords
List<string> CivilWar = new List<string>()
{
"Confederacy", "Union", "Civil War", "Lincoln", "Stonewall Jackson"
};
//does the source string contain any of the keywords?
bool isTopic = CivilWar.Exists(x => source.Contains(x));
return isTopic;
the string "Stonewall Jackson fought for the Confederacy" returns a correct positive / true result, but the string "John Kennedy Toole wrote A Confederacy of Dunces" returns a false positive / true result.
How can I make sure multiple terms are required to score a positive?

bool isTopic = CivilWar.Where(x => source.Contains(x)).Count() > 1;

Use Count instead of Exists, and make sure it is greater than 1 (multi-term):
//create a list of topic keywords
List<string> CivilWar = new List<string>()
{
"Confederacy", "Union", "Civil War", "Lincoln", "Stonewall Jackson"
};
//does the source string contain more than one of the keywords?
return CivilWar.Count(x => source.Contains(x)) > 1; //must be greater than 1
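The counting approach above can be generalized a little. The following is only a sketch under a few assumptions: the names TopicMatcher, CountHits, IsTopic and the minHits parameter are hypothetical, and the case-insensitive comparison is an addition that the original answer does not use. It is still plain substring matching, so a keyword such as "Union" would also be counted inside "reunion".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopicMatcher
{
    // Counts how many of the keywords occur somewhere in source
    // (case-insensitive substring match).
    public static int CountHits(string source, IEnumerable<string> keywords)
    {
        return keywords.Count(k => source.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    // A topic is only reported when at least minHits keywords were found.
    public static bool IsTopic(string source, IEnumerable<string> keywords, int minHits = 2)
    {
        return CountHits(source, keywords) >= minHits;
    }
}
```

Calling TopicMatcher.IsTopic(source, CivilWar) with the default threshold behaves like the Count(...) > 1 check, and raising minHits makes the match stricter.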

Related

How to version string elements?

I have the list of strings below:
List<string> available = new List<string> { "C1", "C2", "C3", "C1_V1" };
My input parameter is C1, and I have to match it against the strings in available. When my input is C1, the matching elements in available are C1 and C1_V1, so the version should increase by 1 to give C1_V2. I have laid this out in the table below.
As per the above table,
case 1 - available is C1,C2,C3 and input is C1, so my destination should be C1_V1
case 2 - available is C1,C2,C3,C1_V1 and input is C1, but my destination cannot be C1_V1 because it already exists in available, so the next version should be C1_V2, and so on.
I am trying to implement this logic in C#. I have started, but couldn't get it done:
List<string> available = new List<string> { "C1", "C2", "C3", "C1_V1" };
string input = "C1";
string destination = string.Empty;
foreach(var data in available)
{
destination = $"{input}_V{initialVersion}";
}
Can someone help me complete this? Any help would be appreciated.
You could count the number of items in available that match the given input, and use that count to determine the version.
A simple matching algorithm might be where a string in available is equal to input or where a string in available starts with "{input}_".
In order to handle cases where input is given with the version part, such as C1_V1 you need to split the input on the version separator, '_', and just look at the "key" part of the input.
public string NextVersion(string input, List<string> available)
{
// Argument validation omitted.
if (input.Contains('_'))
{
input = input.Split('_')[0];
}
int count = available.Count(a => string.Equals(a, input) || a.StartsWith($"{input}_"));
if (count == 0)
{
// The given input doesn't exist in available, so we can just return it as
// the "next version".
return input;
}
// Otherwise, the next version is the count of items that we found.
return $"{input}_V{count}";
}
This assumes that '_' is only valid as a version separator. If you can have strings in available such as "C1_AUX" then you'll run into issues trying to get the next version of "C1".
It also assumes that you want to increment to the next version, even if input is a version that doesn't exist. For example, if available is ["C1", "C1_V1"] and input is "C1_V123" then the return value will be "C1_V2".
Poul Bak raises another caveat: you may end up in a situation where available is missing a version. For example, if available is ["C1", "C1_V2"] and input is "C1" then the result of this function is "C1_V2", leading to a duplicate version. In this case, you'd probably have to find every item in available whose "key" part matches input's key part, then parse each one to find the next version.
It isn't clear what the constraints or requirements are exactly, so these caveats may or may not be an issue. But they're certainly worth keeping in mind.
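One way to deal with the missing-version caveat is to parse the version numbers rather than count the matches. This is a minimal sketch, not a definitive implementation: the class name VersionNaming is hypothetical, and it assumes versions always look like "{key}_V{number}" with an uppercase V and that '_' only ever separates the key from the version.

```csharp
using System;
using System.Collections.Generic;

public static class VersionNaming
{
    public static string NextVersion(string input, IEnumerable<string> available)
    {
        // Only the "key" part of the input matters, e.g. "C1" for "C1_V123".
        string key = input.Split('_')[0];
        string prefix = key + "_V";
        bool keyExists = false;
        int highest = 0;

        foreach (string item in available)
        {
            if (string.Equals(item, key, StringComparison.Ordinal))
            {
                // The unversioned key itself is present.
                keyExists = true;
            }
            else if (item.StartsWith(prefix, StringComparison.Ordinal)
                     && int.TryParse(item.Substring(prefix.Length), out int version))
            {
                // Remember the highest version number seen so far.
                keyExists = true;
                highest = Math.Max(highest, version);
            }
        }

        // If the key is unused, return it as is; otherwise go one past the highest version.
        return keyExists ? $"{key}_V{highest + 1}" : key;
    }
}
```

With available = ["C1", "C1_V2"] and input "C1", this sketch returns "C1_V3" instead of the duplicate "C1_V2".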

Efficiently Sort Collection in C# by Substring and Index

I'm fetching some records from my database using entity framework as the user types into a searchbox and need to sort the items as they are fetched. I'll try to simplify the problem with the below.
Say I have a random list like the below that I would like to sort in place according to the occurrence of a substring
var randomList = new List<string> { "corona", "corolla", "pecoroll", "copper", "capsicum", "because", "cobra" };
var searchText = "cor";
Sort:
var sortedList = randomList.OrderBy(x => x.IndexOf(searchText));
Output:
copper -> capsicum -> because -> cobra -> corona -> corolla -> pecoroll
I understand the code works as expected since the list is sorted by the index of the substring which is -1 for the first 4 items in the output, 0 for the 5th and 6th, and 2 for the 7th item.
Problem:
I'm actually trying to sort by the index of the searchText and its closest matches, to provide the user with suggestions of similar items. The expected result would be something like
corolla -> corona -> pecoroll -> cobra -> copper -> capsicum -> because
where the items with lower indexes of the matching searchText appear first, and the rest of the list is sorted recursively using one character fewer from the searchText until no characters remain, i.e. priority is given to the index of "cor", then "co", then "c".
I can probably write a for loop or recursive method for this, but is there a built-in LINQ method to achieve this on a collection, or a library that handles searches this way? My code fetches records from a database, so performance should be considered. Thanks for your help in advance.
To strictly address your question: "is there a built in LINQ method to achieve this(?)", I believe the answer is no. This type of "best match" search is very subjective; for example it could be argued that "cobra" is a better match than "pecoroll" since the user is more likely to have missed a "b" before the required "r", rather than excluding the first two letters, "pe" of the word "pecoroll". I believe that "proper" implementations of this behavior consider key proximity, common misspellings, and any number of other metrics to best auto-complete the entry. There may well be some established libraries available rather than developing your own method.
However, assuming you did want the exact behavior you requested, and whilst it sounds as if you were happy to do this yourself, here is my two cents:
static List<string> SortedList(List<string> baseList, string searchString)
{
// Take a modifiable copy of the base list
List<string> sourceList = new List<string>(baseList);
// Sort it first alphabetically to resolve tie-breakers
sourceList.Sort();
// Create an instance of our list to be returned
List<string> resultList = new List<string>();
while(
// While there are still elements to be sorted
(resultList.Count != baseList.Count) &&
// And there are characters remaining to be searched for
(searchString.Length > 0))
{
// Order the list elements, that contain the full search string,
// by the index of that search string.
var sortedElements = from item in sourceList
where item.Contains(searchString)
orderby item.IndexOf(searchString)
select item;
// For each of the ordered elements, remove it from the source list
// and add it to the result
foreach(var sortedElement in sortedElements)
{
sourceList.Remove(sortedElement);
resultList.Add(sortedElement);
}
// Remove one character from the search to be used against remaining elements
searchString = searchString.Remove(searchString.Length - 1, 1);
}
return resultList;
}
Testing with:
var randomList = new List<string> { "corona", "corolla", "pecoroll", "copper", "capsicum", "because", "cobra" };
var searchText = "cor";
var sortedList = SortedList(randomList, searchText);
foreach(string entry in sortedList)
{
Console.Write(entry + ", ");
}
I get:
corolla, corona, pecoroll, cobra, copper, capsicum, because,
I hope this helps.
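The same ordering can also be written as a single LINQ ordering with a small helper. This is a sketch only: the names PrefixSearchSort, BestPrefixMatch and SortByPrefixMatch are hypothetical, and the ordinal alphabetical tie-break is an assumption. It ranks items in memory, so it would not be translated into a database query. Unlike the loop above, items that match no prefix at all are kept and placed at the end.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixSearchSort
{
    // Finds the longest prefix of searchText that occurs in item and
    // returns its length together with the index where it occurs.
    private static (int Length, int Index) BestPrefixMatch(string item, string searchText)
    {
        for (int length = searchText.Length; length > 0; length--)
        {
            int index = item.IndexOf(searchText.Substring(0, length), StringComparison.Ordinal);
            if (index >= 0)
            {
                return (length, index);
            }
        }
        // No prefix matched at all.
        return (0, int.MaxValue);
    }

    public static List<string> SortByPrefixMatch(IEnumerable<string> items, string searchText)
    {
        return items
            .Select(item => new { Item = item, Match = BestPrefixMatch(item, searchText) })
            .OrderByDescending(x => x.Match.Length)        // longer matched prefix first
            .ThenBy(x => x.Match.Index)                    // then earlier position in the item
            .ThenBy(x => x.Item, StringComparer.Ordinal)   // then alphabetically as a tie-break
            .Select(x => x.Item)
            .ToList();
    }
}
```

Calling PrefixSearchSort.SortByPrefixMatch(randomList, searchText) on the test data above gives the same order as the loop-based version.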

How to exclude unwanted matches from randomly matched strings

For example, I have code like this:
string[] person = new string[] { "Son", "Father", "Grandpa" };
string[] age = new string[] { "Young", "In his 40-s", "Old" };
string[] unwanted = new string[] { "Old Son", "Young GrandPa" };
Random X = new Random();
string Who = person[X.Next(0, person.Length)];
string HowOld = age[X.Next(0, age.Length)];
Console.WriteLine("{0} {1}", Who, HowOld);
I want to get all random matches BUT THEN exclude the two variants in the "unwanted" array (of course this is just an example; there can be many more arrays and possible bad matches).
What is a good way to do it? The key point is that I want to keep the possibility of getting ALL results anyway. So I want the option to exclude things AFTER generation, without making it impossible to generate "old son".
First, define a class that holds both values from the arrays:
class PersonWithAge
{
public string Person { get; set; }
public string Age { get; set; }
}
Next, use LINQ to generate all possible combinations of Person and Age:
// Create cross product
var results = (from x in person
from y in age
select new PersonWithAge{Person=x, Age=y}).ToList();
Now (if desired) remove the exceptions:
results.RemoveAll(n => n.Person == "Son" && n.Age == "Old"
|| n.Person == "Grandpa" && n.Age == "Young");
If you want to prevent some combinations, I believe you could keep 'rules' of pairs or groups that cannot be matched together, for instance a string[][] blocked or int[][] blocked where, when accessing blocked[i][j], i is the current word and the array blocked[i] holds the indexes of the words (or the words themselves) it cannot be matched with. This assumes you might have more than one word you don't want to match with the current one; if there is just one, a simple array will suffice. Then you just have to make sure the random value you use is not one of those 'forbidden' ones. Hope this helps.
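Filtering after generation can also be driven directly by an array of unwanted strings like the one in the question. This is a sketch under assumptions: the names CombinationFilter, AllCombinations and ExcludeUnwanted are hypothetical, the unwanted entries are assumed to have the form "Age Person", and the comparison is case-insensitive so that casing differences between the arrays do not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CombinationFilter
{
    // Builds every (age, person) pair, so nothing is impossible to generate.
    public static List<(string Age, string Person)> AllCombinations(string[] ages, string[] persons)
    {
        return (from age in ages
                from person in persons
                select (Age: age, Person: person)).ToList();
    }

    // Removes the pairs listed in unwanted, leaving the full list untouched.
    public static List<(string Age, string Person)> ExcludeUnwanted(
        IEnumerable<(string Age, string Person)> combinations,
        IEnumerable<string> unwanted)
    {
        var blocked = new HashSet<string>(unwanted, StringComparer.OrdinalIgnoreCase);
        return combinations
            .Where(c => !blocked.Contains(c.Age + " " + c.Person))
            .ToList();
    }
}
```

Because AllCombinations and ExcludeUnwanted are separate steps, the unfiltered list stays available and the exclusion remains optional.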

Check if Characters in ArrayList C# exist - C# (2.0)

I was wondering if there is a way to search an ArrayList to see whether a record contains certain characters and, if so, grab the entire entry and put it into a string. For example:
list[0] = @"C:\Test3\One_Title_Here.pdf";
list[1] = @"D:\Two_Here.pdf";
list[2] = @"C:\Test\Hmmm_Joke.pdf";
list[3] = @"C:\Test2\Testing.pdf";
Looking for: "Hmmm_Joke.pdf"
Want to get: "C:\Test\Hmmm_Joke.pdf" and put it in the Remove()
protected void RemoveOther(ArrayList list, string Field)
{
string removeStr;
-- Put code in here to search for part of a string which is Field --
-- Grab that string here and put it into a new variable --
list.Contains();
list.Remove(removeStr);
}
Hope this makes sense. Thanks.
Loop through each string in the array list and if the string does not contain the search term then add it to new list, like this:
string searchString = "Hmmm_Joke.pdf";
ArrayList newList = new ArrayList();
foreach(string item in list)
{
if(!item.ToLower().Contains(searchString.ToLower()))
{
newList.Add(item);
}
}
Now you can work with the new list that has excluded any matches of the search string value.
Note: Both strings are lowercased for the comparison to avoid casing issues.
In order to remove a value from your ArrayList you'll need to loop through the values and check each one to see if it contains the desired value. Keep track of that index, or indexes if there are many.
Then after you have found all of the values you wish to remove, you can call ArrayList.RemoveAt to remove the values you want. If you are removing multiple values, start with the largest index and then process the smaller indexes, otherwise, the indexes will be off if you remove the smallest first.
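The index-based approach described above can be sketched as a single backwards loop, which removes each match as soon as it is found. The class and method names (ArrayListCleanup, RemoveContaining) are hypothetical, and the case-insensitive comparison is an assumption. Iterating from the last index down means a removal never shifts the indexes that are still to be checked.

```csharp
using System;
using System.Collections;

public static class ArrayListCleanup
{
    // Removes every string entry that contains searchString and
    // returns how many entries were removed.
    public static int RemoveContaining(ArrayList list, string searchString)
    {
        int removed = 0;

        // Walk backwards so RemoveAt does not disturb the remaining indexes.
        for (int i = list.Count - 1; i >= 0; i--)
        {
            string item = list[i] as string;
            if (item != null && item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                list.RemoveAt(i);
                removed++;
            }
        }

        return removed;
    }
}
```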
This will do the job without raising an InvalidOperationException:
string searchString = "Hmmm_Joke.pdf";
foreach (string item in list.ToArray())
{
if (item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
{
list.Remove(item);
}
}
I also made it case insensitive.
Good luck with your task.
I would rather use LINQ to solve this. Since a collection cannot be modified while it is being enumerated, we should first collect what we want removed and then remove it.
var toDelete = Array.FindAll(list.ToArray(), s =>
s.ToString().IndexOf("Hmmm_Joke.pdf", StringComparison.OrdinalIgnoreCase) >= 0
).ToList();
toDelete.ForEach(item => list.Remove(item));
Of course, use a variable in place of the hardcoded string.
I would also recommend reading this question: Case insensitive 'Contains(string)'
It discusses the proper way to compare strings, since converting to upper or lower case costs performance and may produce unexpected behaviour with file names like: 文書.pdf

C# Array contains partial

How to find whether a string array contains some part of string?
I have array like this
String[] stringArray = new [] { "abc#gmail.com", "cde#yahoo.com", "#gmail.com" };
string str = "coure06#gmail.com";
if (stringArray.Any(x => x.Contains(str)))
{
//this if condition is never true
}
I want this if block to run when str contains a string that is all or part of any of the array's items.
Assuming you've got LINQ available:
bool partialMatch = stringArray.Any(x => str.Contains(x));
Even without LINQ it's easy:
bool partialMatch = Array.Exists(stringArray, x => str.Contains(x));
or using C# 2:
bool partialMatch = Array.Exists(stringArray,
delegate(string x) { return str.Contains(x); });
If you're using C# 1 then you probably have to do it the hard way :)
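For completeness, "the hard way" could look something like the plain loop below. This is only a sketch: the class and method names (PartialMatch, ContainsAny) are hypothetical, and it uses IndexOf on the assumption that string.Contains is not available on the older framework versions that C# 1 targets.

```csharp
using System;

public sealed class PartialMatch
{
    // Not meant to be instantiated; it only holds the static helper below.
    private PartialMatch() { }

    // Returns true as soon as text contains any one of the candidates.
    public static bool ContainsAny(string text, string[] candidates)
    {
        foreach (string candidate in candidates)
        {
            if (text.IndexOf(candidate) >= 0)
            {
                return true;
            }
        }
        return false;
    }
}
```

With the arrays from the question, PartialMatch.ContainsAny(str, stringArray) plays the same role as the Any and Array.Exists calls above.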
If you're looking for if a particular string in your array contains just "#gmail.com" instead of "abc#gmail.com" you have a couple of options.
On the input side, there are a variety of questions here on SO which will point you in the direction you need to go to validate that your input is a valid email address.
If you can only check on the back end, I'd do something like:
string emailStr = "#gmail.com";
if(str.Contains(emailStr) && str.Length == emailStr.Length)
{
//your processing here
}
You can also use Regex matching, but I'm not nearly familiar enough with that to tell you what pattern you'd need.
If you're looking for just anything containing "#gmail.com", Jon's answer is your best bet.
