How to version string elements? - c#

I have the below list of strings:
List<string> available = new List<string> { "C1", "C2", "C3", "C1_V1" };
I have an input parameter, C1. I have to match it against the strings in the available list. Whenever my input is C1, the matching elements in available are C1 and C1_V1, so my result should increase the version by 1 to get C1_V2. I have described the cases below.
Case 1 - available is C1,C2,C3 and input is C1, so my destination should be C1_V1.
Case 2 - available is C1,C2,C3,C1_V1 and input is C1, but my destination cannot be C1_V1 because it already exists in available, so the next version should be C1_V2, and so on.
I am trying to implement this logic in C#. I have started, but couldn't get it done:
List<string> available = new List<string> { "C1", "C2", "C3", "C1_V1" };
string input = "C1";
string destination = string.Empty;
int initialVersion = 1;
foreach (var data in available)
{
    destination = $"{input}_V{initialVersion}";
}
Can someone help me complete this? Any help would be appreciated.

You could count the number of items in available that match the given input, and use that count to determine the version.
A simple matching algorithm might be where a string in available is equal to input or where a string in available starts with "{input}_".
In order to handle cases where input is given with the version part, such as C1_V1, you need to split the input on the version separator, '_', and just look at the "key" part of the input.
public string NextVersion(string input, List<string> available)
{
// Argument validation omitted.
if (input.Contains('_'))
{
input = input.Split('_')[0];
}
int count = available.Count(a => string.Equals(a, input) || a.StartsWith($"{input}_"));
if (count == 0)
{
// The given input doesn't exist in available, so we can just return it as
// the "next version".
return input;
}
// Otherwise, the next version is the count of items that we found.
return $"{input}_V{count}";
}
This assumes that '_' is only valid as a version separator. If you can have strings in available such as "C1_AUX" then you'll run into issues trying to get the next version of "C1".
It also assumes that you want to increment to the next version, even if input is a version that doesn't exist. For example, if available is ["C1", "C1_V1"] and input is "C1_V123", then the return value will be "C1_V2".
Poul Bak raises another caveat: you could end up with a situation where available is missing a version. For example, if available is ["C1", "C1_V2"] and input is "C1", then the result of this function is "C1_V2", leading to a duplicate version. In that case, you'd have to find every item in available whose "key" part matches input's key part, then parse each one to find the next version.
It isn't clear exactly what the constraints and requirements are, so these caveats may or may not be an issue. But they're certainly worth keeping in mind.
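To address that missing-version caveat, here's a minimal sketch that parses the existing version numbers and returns the highest one plus one. The method name NextVersionSafe is illustrative, and it assumes using System.Linq for Max():
public string NextVersionSafe(string input, List<string> available)
{
    // Keep only the "key" part of the input.
    string key = input.Split('_')[0];

    // Collect every version number already used for this key.
    // The bare key (e.g. "C1") counts as version 0.
    var usedVersions = new List<int>();
    foreach (var item in available)
    {
        if (string.Equals(item, key))
        {
            usedVersions.Add(0);
        }
        else if (item.StartsWith($"{key}_V")
            && int.TryParse(item.Substring(key.Length + 2), out int version))
        {
            usedVersions.Add(version);
        }
    }

    if (usedVersions.Count == 0)
    {
        // The key isn't present at all, so return it as-is.
        return key;
    }

    // Max() + 1 skips over any gaps, so ["C1", "C1_V2"] yields "C1_V3".
    return $"{key}_V{usedVersions.Max() + 1}";
}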

Related

Efficiently Sort Collection in C# by Substring and Index

I'm fetching some records from my database using entity framework as the user types into a searchbox and need to sort the items as they are fetched. I'll try to simplify the problem with the below.
Say I have a random list like the below that I would like to sort in place according to the occurrence of a substring
var randomList = new List<string> { "corona", "corolla", "pecoroll", "copper", "capsicum", "because", "cobra" };
var searchText = "cor";
Sort:
var sortedList = randomList.OrderBy(x => x.IndexOf(searchText));
Output:
copper -> capsicum -> because -> cobra -> corona -> corolla -> pecoroll
I understand the code works as expected since the list is sorted by the index of the substring which is -1 for the first 4 items in the output, 0 for the 5th and 6th, and 2 for the 7th item.
Problem:
I'm actually trying to sort by the index of the searchString and its closest match, to provide the user with suggestions of similar items. The expected result would be something like
corolla -> corona -> pecoroll -> cobra -> copper -> capsicum -> because
where the items containing lower indexes of the matching searchText appear first, then the list is recursively sorted using 1 less character from the searchText until no characters remain, i.e. priority is given to the index of "cor", then "co", then "c".
I can probably write a for loop or a recursive method for this, but is there a built-in LINQ method to achieve this on a collection, or a library that handles searches this way? My code fetches records from a database, so performance should be considered. Thanks for your help in advance.
To strictly address your question: "is there a built in LINQ method to achieve this(?)", I believe the answer is no. This type of "best match" search is very subjective; for example it could be argued that "cobra" is a better match than "pecoroll" since the user is more likely to have missed a "b" before the required "r", rather than excluding the first two letters, "pe" of the word "pecoroll". I believe that "proper" implementations of this behavior consider key proximity, common misspellings, and any number of other metrics to best auto-complete the entry. There may well be some established libraries available rather than developing your own method.
However, assuming you did want the exact behavior you requested, and whilst it sounds as if you were happy to do this yourself, here is my two cents:
static List<string> SortedList(List<string> baseList, string searchString)
{
// Take a modifiable copy of the base list
List<string> sourceList = new List<string>(baseList);
// Sort it first alphabetically to resolve tie-breakers
sourceList.Sort();
// Create an instance of our list to be returned
List<string> resultList = new List<string>();
while(
// While there are still elements to be sorted
(resultList.Count != baseList.Count) &&
// And there are characters remaining to be searched for
(searchString.Length > 0))
{
// Order the list elements that contain the full search string
// by the index of that search string. ToList() materializes the query
// so the removals below cannot interfere with deferred execution.
var sortedElements = (from item in sourceList
where item.Contains(searchString)
orderby item.IndexOf(searchString)
select item).ToList();
// For each of the ordered elements, remove it from the source list
// and add it to the result
foreach(var sortedElement in sortedElements)
{
sourceList.Remove(sortedElement);
resultList.Add(sortedElement);
}
// Remove one character from the search to be used against remaining elements
searchString = searchString.Remove(searchString.Length - 1, 1);
}
return resultList;
}
Testing with:
var randomList = new List<string> { "corona", "corolla", "pecoroll", "copper", "capsicum", "because", "cobra" };
var searchText = "cor";
var sortedList = SortedList(randomList, searchText);
foreach(string entry in sortedList)
{
Console.Write(entry + ", ");
}
I get:
corolla, corona, pecoroll, cobra, copper, capsicum, because,
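If you'd still prefer to stay within a single LINQ expression, here's a hedged alternative sketch: compute a rank per item (how many characters of the search text had to be dropped before a match, then the index of that match) and order by that tuple. The MatchRank helper name is mine, not from the question:
static (int dropped, int index) MatchRank(string item, string searchText)
{
    // Try the full search text first, then progressively shorter prefixes.
    for (int length = searchText.Length; length > 0; length--)
    {
        string prefix = searchText.Substring(0, length);
        int index = item.IndexOf(prefix, StringComparison.Ordinal);
        if (index >= 0)
            return (searchText.Length - length, index);
    }
    // No character of the search text matches at all.
    return (int.MaxValue, int.MaxValue);
}

var sortedList = randomList
    .OrderBy(x => MatchRank(x, searchText)) // fewer dropped characters first, then lower index
    .ThenBy(x => x)                         // alphabetical tie-breaker
    .ToList();
With the sample data this yields the same order as the method above.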
I hope this helps.

Fastest way to find which is the longest string from a set contained in an input string

I have a very large List<MyClass> (approximately 600,000+ records) from which I need to extract the record where MyClass.Property1 is the exact match, or the closest match, to my input string. However, even if it seems like it, this is not a fuzzy string matching problem, so I can't use the Levenshtein distance. To clarify things a bit, I'll give you an example.
Suppose that the following is my data set (listing only MyClass.Property1):
242
2421
2422
24220
24221
24222
24223
24224
Now what I expect is: if my input is 2422, I expect the third record in output. If my input is 24210, I expect the second record in output, which is the longest string contained in my input. To make things faster, when I fill the List<MyClass> I save in a Dictionary<int,int> the index at which the first digit of the string changes (for example from 19999 to 20000), so I can reduce the size of the dataset I'm going to search for the match. What I wonder is: which is the fastest way to reach my goal?
The only thing I can think of is something like this:
Since I'm sure that the List<MyClass> is ordered by MyClass.Property1 like in the example, and supposing that I have extracted a List<MyClass> called SubSet based on the dictionary I mentioned before, I would do
MyClass result = null;
foreach (MyClass m in SubSet)
{
if (input.Contains(m.Property1))
{
// if the 2 strings are equal, I've found the exact match
if (input == m.Property1)
return m;
else
result = m;
}
else
return result;
}
The most obvious problem I can see here is that if the desired result is at the end of the SubSet, I need to loop over thousands of records. Can you think of any better way to reach my goal, or a way to improve my current code?
Maybe you can use a LINQ method in a recursive function, like:
public string Test(string input)
{
// Guard against running out of characters without finding a match.
if (string.IsNullOrEmpty(input))
return null;
// SubSet is assumed to hold the Property1 strings.
string result = SubSet.FirstOrDefault(a => a == input);
if (result == null)
return Test(input.Substring(0, input.Length - 1)); // drop one character and retry
else
return result;
}
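If SubSet is large, a variation on the same idea is to build a HashSet<string> of the Property1 values once and probe progressively shorter prefixes of the input, so each probe is O(1). This is only a sketch under the assumption that Property1 values are unique; allRecords stands for your full List<MyClass>, and the names prefixLookup/FindLongestPrefix are mine (it also assumes using System.Linq):
// Built once from the full data set, e.g. as a field.
HashSet<string> prefixLookup = new HashSet<string>(allRecords.Select(m => m.Property1));

public string FindLongestPrefix(string input)
{
    // Test the input itself, then progressively shorter prefixes.
    for (int length = input.Length; length > 0; length--)
    {
        string candidate = input.Substring(0, length);
        if (prefixLookup.Contains(candidate))
            return candidate;
    }
    return null; // no record is a prefix of the input
}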

Check if Characters in ArrayList C# exist - C# (2.0)

I was wondering if there is a way to search an ArrayList to see if a record contains certain characters. If so, then grab the whole string and put it into a variable. For example:
list[0] = @"C:\Test3\One_Title_Here.pdf";
list[1] = @"D:\Two_Here.pdf";
list[2] = @"C:\Test\Hmmm_Joke.pdf";
list[3] = @"C:\Test2\Testing.pdf";
Looking for: "Hmmm_Joke.pdf"
Want to get: "C:\Test\Hmmm_Joke.pdf" and put it in the Remove()
protected void RemoveOther(ArrayList list, string Field)
{
string removeStr;
// Put code in here to search for part of a string which is Field
// Grab that string here and put it into a new variable
list.Contains();
list.Remove(removeStr);
}
Hope this makes sense. Thanks.
Loop through each string in the array list and, if the string does not contain the search term, add it to a new list, like this:
string searchString = "Hmmm_Joke.pdf";
ArrayList newList = new ArrayList();
foreach(string item in list)
{
if(!item.ToLower().Contains(searchString.ToLower()))
{
newList.Add(item);
}
}
Now you can work with the new list that has excluded any matches of the search string value.
Note: Made string be lowercase for comparison to avoid casing issues.
In order to remove a value from your ArrayList you'll need to loop through the values and check each one to see if it contains the desired value. Keep track of that index, or indexes if there are many.
Then after you have found all of the values you wish to remove, you can call ArrayList.RemoveAt to remove the values you want. If you are removing multiple values, start with the largest index and then process the smaller indexes, otherwise, the indexes will be off if you remove the smallest first.
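A minimal sketch of that index-tracking approach (the variable names are just illustrative):
string searchString = "Hmmm_Joke.pdf";

// Collect the indexes of every matching entry first.
List<int> matchIndexes = new List<int>();
for (int i = 0; i < list.Count; i++)
{
    string item = (string)list[i];
    if (item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
        matchIndexes.Add(i);
}

// Remove from the largest index down so the remaining indexes stay valid.
for (int i = matchIndexes.Count - 1; i >= 0; i--)
{
    list.RemoveAt(matchIndexes[i]);
}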
This will do the job without raising an InvalidOperationException:
string searchString = "Hmmm_Joke.pdf";
foreach (string item in list.ToArray())
{
if (item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
{
list.Remove(item);
}
}
I also made it case insensitive.
Good luck with your task.
I would rather use LINQ to solve this. Since we can't remove items from a collection while enumerating it, we should first collect what we want removed and then remove it.
var toDelete = Array.FindAll(list.ToArray(), s =>
s.ToString().IndexOf("Hmmm_Joke.pdf", StringComparison.OrdinalIgnoreCase) >= 0
).ToList();
toDelete.ForEach(item => list.Remove(item));
Of course, use a variable where the value is hardcoded.
I would also recommend reading this question: Case insensitive 'Contains(string)'
It discusses the proper way to compare strings: converting to upper/lower case costs a lot of performance and may result in unexpected behaviour when dealing with file names like 文書.pdf
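For reference, a small sketch of the kind of case-insensitive check that question recommends, using IndexOf with a StringComparison instead of ToLower/ToUpper (the extension method name is mine):
public static class StringExtensions
{
    // True if source contains value, ignoring case, without allocating lowered copies.
    public static bool ContainsIgnoreCase(this string source, string value)
    {
        return source.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}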

C# Efficient Substring with many inputs

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be a subset of the first collection of strings, all of which had at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solution.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to Contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of an approach that probably is asymptotically better than standard Contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check if the 1,000,000-character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is to delimit on spaces and use hashes.
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search (if you are always searching for "words" - and not phrases that could contain spaces) would be to build a search index from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and because a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparisons you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords", when one is found, add it to a dictionary as they (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
string[] positiveKeywords,
string[] negativeKeywords )
{
Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
foreach( string searchIn in stringsToSearch )
{
// tokenize and sort the input to make searches faster
string[] tokenizedList = searchIn.Split( ' ' );
Array.Sort( tokenizedList );
// if any negative keywords exist, skip to the next search string...
// (a flag is needed here: a bare 'continue' would only continue the inner loop)
bool foundNegative = false;
foreach( string negKeyword in negativeKeywords )
{
if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
{
foundNegative = true;
break;
}
}
if( foundNegative )
continue; // skip to next search string...
// for each positive keyword, add to dictionary to keep track of it
// we could have also used a SortedList, but the dictionary is easier
foreach( string posKeyword in positiveKeywords )
if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
foundKeywords[posKeyword] = 1;
}
// convert the Keys in the dictionary (our found keywords) to an array...
string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
IEnumerable<string> positiveKeywords,
IEnumerable<string> negativeKeywords )
{
var foundKeywordsDict = new Dictionary<string, int>();
foreach( var searchIn in searchStrings )
{
// tokenize the search string (Distinct() avoids duplicate-key exceptions)...
var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
// skip if any negative keywords exist...
if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
continue;
// merge found positive keywords into dictionary...
// an example of where Enumerable.ForEach() would be nice...
var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey);
foreach (var keyword in found)
foundKeywordsDict[keyword] = 1;
}
return foundKeywordsDict.Keys;
}
If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.
Seems to me that the best way to do this is to take your match strings (both positive and negative) and compute a hash of them. Then march through your million-character string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
foreach (Bucket b in buckets) {
if (i + b.Length > inputString.Length)
continue;
string candidate = inputString.Substring(i, b.Length);
int hash = ComputeHash(candidate);
foreach (MatchString match in b.MatchStrings) {
if (hash != match.Hash)
continue;
if (candidate == match.String) {
if (match.IsPositive) {
// positive case
}
else {
// negative case
}
}
}
}
}
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.
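Here's a hedged sketch of that trie idea: build one trie over all keywords, then walk the input from every starting position. Worst case is closer to O(n·L), where L is the longest keyword, but each walk stops as soon as no keyword can continue. The class and method names are mine:
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWordEnd;
}

static TrieNode BuildTrie(IEnumerable<string> keywords)
{
    var root = new TrieNode();
    foreach (var word in keywords)
    {
        var node = root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsWordEnd = true;
    }
    return root;
}

static bool ContainsAnyKeyword(string text, TrieNode root)
{
    for (int start = 0; start < text.Length; start++)
    {
        var node = root;
        for (int i = start; i < text.Length; i++)
        {
            if (!node.Children.TryGetValue(text[i], out node))
                break; // no keyword continues with this character
            if (node.IsWordEnd)
                return true; // a keyword begins at 'start'
        }
    }
    return false;
}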

Recursive woes - reducing an input string

I'm working on a portion of code that is essentially trying to reduce a list of strings down to a single string recursively.
I have an internal database built up of matching string arrays of varying length (say array lengths of 2-4).
An example input string array would be:
{"The", "dog", "ran", "away"}
And for further example, my database could be made up of string arrays in this manner:
(length 2) {{"The", "dog"},{"dog", "ran"}, {"ran", "away"}}
(length 3) {{"The", "dog", "ran"}.... and so on
So, what I am attempting to do is recursively reduce my input string array down to a single token. So ideally it would parse something like this:
1) {"The", "dog", "ran", "away"}
Say that (seq1) = {"The", "dog"} and (seq2) = {"ran", "away"}
2) { (seq1), "ran", "away"}
3) { (seq1), (seq2)}
In my sequence database I know that, for instance, seq3 = {(seq1), (seq2)}
4) { (seq3) }
So, when it is down to a single token, I'm happy and the function would end.
Here is an outline of my current program logic:
public void Tokenize(Arraylist<T> string_array, int current_size)
{
// retrieve all known sequences of length [current_size] (from global list array)
loc_sequences_by_length = sequences_by_length[current_size-min_size]; // sequences of length 2 are stored in position 0 and so on
// escape cases
if (string_array.Count == 1)
{
// finished successfully
return;
}
else if (string_array.Count < current_size)
{
// checking sequences of greater length than input string, bail
return;
}
else
{
// split input string into chunks of size [current_size] and compare to local database
// of known sequences
// (splitting code works fine)
foreach (comparison)
{
if (match_found)
{
// update input string and recall function to find other matches
string_array[found_array_position] = new_sequence;
string_array.Removerange[found_array_position+1, new_sequence.Length-1];
Tokenize(string_array, current_size)
}
}
}
// ran through unsuccessfully, increment length and try again for new sequence group
current_size++;
if (current_size > MAX_SIZE)
return;
else
Tokenize(string_array, current_size);
}
I thought it was straightforward enough, but have been getting some strange results.
Generally it appears to work, but upon further review of my output data I'm seeing some issues. Mainly, it appears to work up to a certain point...and at that point my 'curr_size' counter resets to the minimum value.
So it is called with a size of 2, then 3, then 4, then resets to 2.
My assumption was that it would run up to my predetermined max size, and then bail completely.
I tried to simplify my code as much as possible, so there are probably some simple syntax errors in transcribing. If there is any other detail that may help an eagle-eyed SO user, please let me know and I'll edit.
Thanks in advance
One bug is:
string_array[found_array_position] = new_sequence;
I don't know where this is defined, and as far as I can tell if it was defined, it is never changed.
In your if statement, when is match_found ever set to true?
Also, it appears you have an extra close brace here, but you may want the last block of code to be outside of the function:
}
}
}
It would help if you cleaned up the code, to make it easier to read. Once we get past the syntactic errors it will be easier to see what is going on, I think.
Not sure what all the issues are, but the first thing I'd do is have your "catch-all" exit block right at the beginning of your method.
public void Tokenize(Arraylist<T> string_array, int current_size)
{
if (current_size > MAX_SIZE)
return;
// Guts go here
Tokenize(string_array, ++current_size);
}
A couple things:
Your tokens are not clearly separated from your input string values. This makes it more difficult to handle, and to see what's going on.
It looks like you're writing pseudo-code:
loc_sequences_by_length is not used
found_array_position is not defined
Arraylist should be ArrayList.
etc.
Overall I agree with James' statement:
It would help if you cleaned up the
code, to make it easier to read.
-Doug
