Find most accurate match in strings

Find most accurate match in strings - c#

I'm developing a tool which fixes incorrect filenames by searching the correct names on a YouTube playlist. This tool gets the YouTube playlist videos' titles and stores them in a List:
static List<string> tracksList = new List<string>();
After storing all correct names in this List, the tool performs a search in a folder, it will only search on files with '.mp3' extension:
DirectoryInfo dir = new DirectoryInfo(#"C:\folder");
FileInfo[] files = musicDir.GetFiles("*.mp3", SearchOption.TopDirectoryOnly);
After storing all MP3 files in a FileInfo array, it loops through all of them. This loop will go file by file and, with the filename of each file, will check which is the most similar value that is in the trackList List. I have already tried with this, but it did return an empty array:
var trackMatch = tracksList.Where(track => track.Contains(file.Name.Replace(".mp3", "")))
.ToArray();
Is there any way I could do that?

String comparisons can be performed by using Levenshtein's algorithm (more information). The implementations for this algorithm can be found here.
The function (that will count how many characters have to be changed to have the other string) is the following (taken from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C.23):
public static int LevenshteinDistance(string source, string target)
{
if (String.IsNullOrEmpty(source))
{
if (String.IsNullOrEmpty(target)) return 0;
return target.Length;
}
if (String.IsNullOrEmpty(target)) return source.Length;
if (source.Length > target.Length)
{
var temp = target;
target = source;
source = temp;
}
var m = target.Length;
var n = source.Length;
var distance = new int[2, m + 1];
// Initialize the distance 'matrix'
for (var j = 1; j <= m; j++) distance[0, j] = j;
var currentRow = 0;
for (var i = 1; i <= n; ++i)
{
currentRow = i & 1;
distance[currentRow, 0] = i;
var previousRow = currentRow ^ 1;
for (var j = 1; j <= m; j++)
{
var cost = (target[j - 1] == source[i - 1] ? 0 : 1);
distance[currentRow, j] = Math.Min(Math.Min(
distance[previousRow, j] + 1,
distance[currentRow, j - 1] + 1),
distance[previousRow, j - 1] + cost);
}
}
return distance[currentRow, m];
}
Therefore, if use the previous function for comparing an input string with every string stored in tracksList, we will get Levenshtein value: the lowest one will mean that it's the most similar:
static List<int> matchList = new List<int>();
foreach (string Track in tracksList)
{
matchList.Add(LevenshteinDistance(Track, "Dailucia Where My Heart Matches The Beat (Ft Poprebel) [FULL HQ + HD]"));
}
string match = tracksList.ElementAt(matchList.IndexOf(matchList.Min()));

This is a non-trivial task.
The problem of course is that the errors in the filenames can be anything, from spelling errors to left out words to added spaces..
This means that any character can be affected in any way.
Therefore neither a simplistic Contains nor even a smart RegEx will work reliably.
I would split the filename into words and do a count of how many of the word I find in the list titles. The one with the highest count has the best chance to be the right one.
I would also try to go for a semi-automatic program, where I get offered the choices ordered by hit count and then can confirm, correct or pass..

Related

Adding a break to a bubble sort in case the array is already sorted

I currently have a (somewhat messy) bubble sort of an object array called "sorted", the code is as follows
object storage = 0;
for (int i = 0; i < sorted.Length; i++)
{
for (int c = 0; c < sorted.Length - 1; c++)
{
if (sorted[c].ToString().CompareTo(sorted[c + 1].ToString()) > 0)
{
storage = sorted[c + 1];
sorted[c + 1] = sorted[c];
sorted[c] = storage;
}
}
return sorted;
Problem is that this function always loops through the array , no matter what. Hypothetically speaking the "sorted" array could be a large array and just so happen to be sorted already, in which case the function would still scan the array and work for some time, which I want to prevent.
So the question is, how do I stop the loop properly in case the array is sorted already?

A bubble sort bubbles the largest (smallest) element of an array towards the end of an array. This is what your inner loop does.
First of all you can take advantage of the knowledge that after n iterations, the last n elements are sorted already, which means that your inner loop doesn't need to check the last n elements in the (n+1)th iteration.
Secondly, if the inner loop doesn't change anything, the elements must be in sequence already, which is a good point for a break of the outer loop.
Since you're doing this as a practising exercise, I'll leave the coding up to you :-)

Why don't you use OrderBy instead of sorting it yourself?
sorted = sorted.OrderBy(s=> s).ToArray();
If you insist to use the bubble sort, you can do this:
bool changed;
for (int i = 0; i < sorted.Length; i++)
{
changed = false;
for (int c = 0; c < sorted.Length - 1; c++)
{
if (sorted[c].ToString().CompareTo(sorted[c + 1].ToString()) > 0)
{
changed = true;
storage = sorted[c + 1];
sorted[c + 1] = sorted[c];
sorted[c] = storage;
}
if(!changed) break;
}
I'm setting changed to false each time in the first loop. If there was no changes to the end, then the array is already sorted.

Try this:
object storage = 0;
for (int i = 0; i < sorted.Length; i++)
{
bool swapped = false;
for (int c = 0; c < sorted.Length - 1; c++)
{
if (sorted[c].ToString().CompareTo(sorted[c + 1].ToString()) > 0)
{
storage = sorted[c + 1];
sorted[c + 1] = sorted[c];
sorted[c] = storage;
swapped = true;
}
}
if (!swapped)
{
break;
}
}
If it gets through a pass without swapping then the array is ordered so it will break.

Counting sort in singly-linked list C#

Is there any way to make a counting sort in singly-linked list? I haven't seen any examples and it's quite hard to make it without them. I have example of it in array and would like to do it in singly-linked list.
Has anybody did it in singly-linked list?
public static int[] CountingSortArray(int[] array)
{
int[] aux = new int[array.Length];
// find the smallest and the largest value
int min = array[0];
int max = array[0];
for (int i = 1; i < array.Length; i++)
{
if (array[i] < min) min = array[i];
else if (array[i] > max) max = array[i];
}
int[] counts = new int[max - min + 1];
for (int i = 0; i < array.Length; i++)
{
counts[array[i] - min]++;
}
counts[0]--;
for (int i = 1; i < counts.Length; i++)
{
counts[i] = counts[i] + counts[i - 1];
}
for (int i = array.Length - 1; i >= 0; i--)
{
aux[counts[array[i] - min]--] = array[i];
}
return aux;
}

I found one that works on an array at: http://www.geeksforgeeks.org/counting-sort/
I think with minimal effort it could be changed to a linked list, the only problem is that you'll end up traversing the linked list many many times since you don't have random access eg.[] making it rather inefficient. Since you seem to have found the same thing i did before I could finish typing I think my answer is kinda pointless. However, I'm still a bit curious as to where you're having problems.
Heres a hint if figuring out where to start is the problem: Every time you see array[i] used, you will need to traverse your linked list first instead to get the i'th item first.
Edit: The only reason you would need to create a 2nd linked list of frequencies is if you needed to actually do work on the resulting linked list. If you just need a sorted list of the values inside the linked list for display purposes an array holding the frequencies would work (i suppose at the same time you could just create an array of all the values then do the counting sort you already have on it). I apologize if i have confused my c, c++, c++/cx, somewhere along the way (i don't have a compiler handy right now), but this should give you a good idea of how to do it.
public static node* FindMin(node* root){ //FindMax would be nearly identical
node* minValue = root;
while(node->Next){
if(node->Value < minValue->Value)
minValue = node;
}
return minValue;
}
public static node* CountingSortArray(node* linklist){
node* root = linkedlist
node* min = FindMin(linklist);
node* max = FindMax(linklist);
int[] counts = new int[max->Value - min->Value + 1];
while(root != NULL){
counts[root->Value] += 1;
root = root->Next;
}
int i = 0;
root = linkedlist;
while(ptr != NULL){
if(counts[i] == 0)
++i;
else{
root->Value = i;
--count[i];
root = root->Next;
}
}
}
void push(node** head, int new_data){
node* newNode = new node();
newNode->Value = new_data;
newNode->Next = (*head);
(*head) = newNode;
}
void printList(node* root){
while(root != NULL){
printf(%d ", root->Value);
root = root->Next;
}
printf("\n");
}
int main(void){
node* myLinkedList = NULL;
push(&head, 0);
push(&head, 1);
push(&head, 0);
push(&head, 2);
push(&head, 0);
push(&head, 2);
printList(myLinkedList);
CountingSortArray(myLinkedList);
printList(myLinkedList);
}

The example code is more like a radix sort with base (max-min+1). Usually a counting sort looks like the code below. Make a pass over the list to get min and max. Make a second pass to generate the counts. Make a pass over the counts to generate a new array based on the counts (instead of copying data). Example code fragment:
for (size_t i = 0; i < array.Length; i++)
counts[array[i]-min]++;
size_t i = 0;
for(size_t j = 0; j < counts.Length); j++){
for(size_t n = counts[j]; n; n--){
aux[i++] = j+min;
}
}

Find all possible combinations of word with and without hyphens

For a string that may have zero or more hyphens in it, I need to extract all the different possibilities with and without hyphens.
For example, the string "A-B" would result in "A-B" and "AB" (two possibilities).
The string "A-B-C" would result in "A-B-C", "AB-C", "A-BC" and "ABC" (four possibilities).
The string "A-B-C-D" would result in "A-B-C-D", "AB-C-D", "A-BC-D", "A-B-CD", "AB-CD", "ABC-D", "A-BCD" and "ABCD" (eight possibilities).
...etc, etc.
I've experimented with some nested loops but haven't been able to get anywhere near the desired result. I suspect I need something recursive unless there is some simple solution I am overlooking.
NB. This is to build a SQL query (shame that SQL Server does't have MySQL's REGEXP pattern matching).
Here is one attempt I was working on. This might work if I do this recursively.
string keyword = "A-B-C-D";
List<int> hyphens = new List<int>();
int pos = keyword.IndexOf('-');
while (pos != -1)
{
hyphens.Add(pos);
pos = keyword.IndexOf('-', pos + 1);
}
for (int i = 0; i < hyphens.Count(); i++)
{
string result = keyword.Substring(0, hyphens[i]) + keyword.Substring(hyphens[i] + 1);
Response.Write("<p>" + result);
}
A B C D are words of varying length.

Take a look at your sample cases. Have you noticed a pattern?
With 1 hyphen there are 2 possibilities.
With 2 hyphens there are 4 possibilities.
With 3 hyphens there are 8 possibilities.
The number of possibilities is 2n.
This is literally exponential growth, so if there are too many hyphens in the string, it will quickly become infeasible to print them all. (With just 30 hyphens there are over a billion combinations!)
That said, for smaller numbers of hyphens it might be interesting to generate a list. To do this, you can think of each hyphen as a bit in a binary number. If the bit is 1, the hyphen is present, otherwise it is not. So this suggests a fairly straightforward solution:
Split the original string on the hyphens
Let n = the number of hyphens
Count from 2n - 1 down to 0. Treat this counter as a bitmask.
For each count begin building a string starting with the first part.
Concatenate each of the remaining parts to the string in order, preceded by a hyphen only if the corresponding bit in the bitmask is set.
Add the resulting string to the output and continue until the counter is exhausted.
Translated to code we have:
public static IEnumerable<string> EnumerateHyphenatedStrings(string s)
{
string[] parts = s.Split('-');
int n = parts.Length - 1;
if (n > 30) throw new Exception("too many hyphens");
for (int m = (1 << n) - 1; m >= 0; m--)
{
StringBuilder sb = new StringBuilder(parts[0]);
for (int i = 1; i <= n; i++)
{
if ((m & (1 << (i - 1))) > 0) sb.Append('-');
sb.Append(parts[i]);
}
yield return sb.ToString();
}
}
Fiddle: https://dotnetfiddle.net/ne3N8f

You should be able to track each hyphen position, and basically say its either there or not there. Loop through all the combinations, and you got all your strings. I found the easiest way to track it was using a binary, since its easy to add those with Convert.ToInt32
I came up with this:
string keyword = "A-B-C-D";
string[] keywordSplit = keyword.Split('-');
int combinations = Convert.ToInt32(Math.Pow(2.0, keywordSplit.Length - 1.0));
List<string> results = new List<string>();
for (int j = 0; j < combinations; j++)
{
string result = "";
string hyphenAdded = Convert.ToString(j, 2).PadLeft(keywordSplit.Length - 1, '0');
// Generate string
for (int i = 0; i < keywordSplit.Length; i++)
{
result += keywordSplit[i] +
((i < keywordSplit.Length - 1) && (hyphenAdded[i].Equals('1')) ? "-" : "");
}
results.Add(result);
}

This works for me:
Func<IEnumerable<string>, IEnumerable<string>> expand = null;
expand = xs =>
{
if (xs != null && xs.Any())
{
var head = xs.First();
if (xs.Skip(1).Any())
{
return expand(xs.Skip(1)).SelectMany(tail => new []
{
head + tail,
head + "-" + tail
});
}
else
{
return new [] { head };
}
}
else
{
return Enumerable.Empty<string>();
}
};
var keyword = "A-B-C-D";
var parts = keyword.Split('-');
var results = expand(parts);
I get:
ABCD
A-BCD
AB-CD
A-B-CD
ABC-D
A-BC-D
AB-C-D
A-B-C-D

I've tested this code and it is working as specified in the question. I stored the strings in a List<string>.
string str = "AB-C-D-EF-G-HI";
string[] splitted = str.Split('-');
List<string> finalList = new List<string>();
string temp = "";
for (int i = 0; i < splitted.Length; i++)
{
temp += splitted[i];
}
finalList.Add(temp);
temp = "";
for (int diff = 0; diff < splitted.Length-1; diff++)
{
for (int start = 1, limit = start + diff; limit < splitted.Length; start++, limit++)
{
int i = 0;
while (i < start)
{
temp += splitted[i++];
}
while (i <= limit)
{
temp += "-";
temp += splitted[i++];
}
while (i < splitted.Length)
{
temp += splitted[i++];
}
finalList.Add(temp);
temp = "";
}
}

I'm not sure your question is entirely well defined (i.e. could you have something like A-BCD-EF-G-H?). For "fully" hyphenated strings (A-B-C-D-...-Z), something like this should do:
string toParse = "A-B-C-D";
char[] toParseChars = toPase.toCharArray();
string result = "";
string binary;
for(int i = 0; i < (int)Math.pow(2, toParse.Length/2); i++) { // Number of subsets of an n-elt set is 2^n
binary = Convert.ToString(i, 2);
while (binary.Length < toParse.Length/2) {
binary = "0" + binary;
}
char[] binChars = binary.ToCharArray();
for (int k = 0; k < binChars.Length; k++) {
result += toParseChars[k*2].ToString();
if (binChars[k] == '1') {
result += "-";
}
}
result += toParseChars[toParseChars.Length-1];
Console.WriteLine(result);
}
The idea here is that we want to create a binary word for each possible hyphen. So, if we have A-B-C-D (three hyphens), we create binary words 000, 001, 010, 011, 100, 101, 110, and 111. Note that if we have n hyphens, we need 2^n binary words.
Then each word maps to the output you desire by inserting the hyphen where we have a '1' in our word (000 -> ABCD, 001 -> ABC-D, 010 -> AB-CD, etc). I didn't test the code above, but this is at least one way to solve the problem for fully hyphenated words.
Disclaimer: I didn't actually test the code

how to optimize FirstOrDefault Statement in linq

I have a two linq statements that the first one takes 25 ms and the second one takes 1100 ms second in a loop of 100,000.
I have replaced FirstAll with ElementAt and even used foreach to get the first element, but still takes the same time.
Is there any faster way to get the first element ?
I have considered few other questions but still couldn't find any solution to solve this problem.
var matches = (from subset in MyExtensions.SubSetsOf(List1)
where subset.Sum() <= target
select subset).OrderByDescending(i => i.Sum());
var match = matches.FirstOrDefault(0);
Also tried:
foreach (var match in matches)
{
break;
}
Or even:
var match = matches.ElementAt(0);
Any comments would be appreciated.
EDIT: here is the code for SubSetOf
public static class MyExtensions
{
public static IEnumerable<IEnumerable<T>> SubSetsOf<T>(this IEnumerable<T> source)
{
// Deal with the case of an empty source (simply return an enumerable containing a single, empty enumerable)
if (!source.Any())
return Enumerable.Repeat(Enumerable.Empty<T>(), 1);
// Grab the first element off of the list
var element = source.Take(1);
// Recurse, to get all subsets of the source, ignoring the first item
var haveNots = SubSetsOf(source.Skip(1));
// Get all those subsets and add the element we removed to them
var haves = haveNots.Select(set => element.Concat(set));
// Finally combine the subsets that didn't include the first item, with those that did.
return haves.Concat(haveNots);
}
}

You call Sum twice - it's bad. Precalc it:
var matches = MyExtensions.SubSetsOf(List1)
.Select(subset => new { subset, Sum = subset.Sum() })
.Where(o => o.Sum < target).OrderByDescending(i => i.Sum);
var match = matches.FirstOrDefault();
var subset = match != null ? match.subset : null;

As Jason said, it's a subset sum problem - the option of Knapsack Problem where weight is equal to value. The simplest solution - generate all subsets and check thiers sum, but this algorithm has horrible complexity. So, our optimization does not matter.
You shoul use dynamic programmig to solve this problem:
Let assume a two-dimensional array D(i, c) - maximal sum of i elements that is less or equal to c. N - is amount of elements (list size). W - max sum (your target).
D(0,c) = 0 for every c, because you have no elements :)
And changing c from 1 to W and i from 1 to N let's compute
D(i,c) = max(D(i-1,c),D(i-1,c-list[i])+list[i]).
To restore subset, we must store array of parents and set them during calculations.
Another examples are here.
Whole code:
class Program
{
static void Main(string[] args)
{
var list = new[] { 11, 2, 4, 6 };
var target = 13;
var n = list.Length;
var result = KnapSack(target, list, n);
foreach (var item in result)
{
Console.Write(item + " ");
}
}
private static List<int> KnapSack(int target, int[] val, int n)
{
var d = new int[n + 1, target + 1];
var p = new int[n + 1, target + 1];
for (var i = 0; i <= n; i++)
{
for (var c = 0; c <= target; c++)
{
p[i, c] = -1;
}
}
for (int i = 0; i <= n; i++)
{
for (int c = 0; c <= target; c++)
{
if (i == 0 || c == 0)
{
d[i, c] = 0;
}
else
{
var a = d[i - 1, c];
if (val[i - 1] <= c)
{
var b = val[i - 1] + d[i - 1, c - val[i - 1]];
if (a > b)
{
d[i, c] = a;
p[i, c] = p[i - 1, c];
}
else
{
d[i, c] = b;
p[i, c] = i - 1;
}
}
else
{
d[i, c] = a;
p[i, c] = p[i - 1, c];
}
}
}
}
//sum
//Console.WriteLine(d[n, target);
//restore set
var resultSet = new List<int>();
var m = n;
var s = d[n, target];
var t = p[m, s];
while (t != -1)
{
var item = val[t];
resultSet.Add(item);
m--;
s -= item;
t = p[m, s];
}
return resultSet;
}
}

It looks like the general problem you are trying solve is to find the subset of numbers with the largest sum less than target. The execution time of your linq function is a symptom of your solution. That is a well known and much researched problem called the 'knapsack problem'. I believe your specific variant would fall into the 'bounded knapsack problem' class with the weight being equal to the value. I would start by researching that. The solution you have implemented, brute forcing every possible subset, is known as the 'naive' solution. I am pretty sure it is the worst performing of all possible solutions.

More efficient way to get all indexes of a character in a string

Instead of looping through each character to see if it's the one you want then adding the index your on to a list like so:
var foundIndexes = new List<int>();
for (int i = 0; i < myStr.Length; i++)
{
if (myStr[i] == 'a')
foundIndexes.Add(i);
}

You can use String.IndexOf, see example below:
string s = "abcabcabcabcabc";
var foundIndexes = new List<int>();
long t1 = DateTime.Now.Ticks;
for (int i = s.IndexOf('a'); i > -1; i = s.IndexOf('a', i + 1))
{
// for loop end when i=-1 ('a' not found)
foundIndexes.Add(i);
}
long t2 = DateTime.Now.Ticks - t1; // read this value to see the run time

I use the following extension method to yield all results:
public static IEnumerable<int> AllIndexesOf(this string str, string searchstring)
{
int minIndex = str.IndexOf(searchstring);
while (minIndex != -1)
{
yield return minIndex;
minIndex = str.IndexOf(searchstring, minIndex + searchstring.Length);
}
}
usage:
IEnumerable<int> result = "foobar".AllIndexesOf("o"); // [1,2]
Side note to a edge case: This is a string approach which works for one or more characters. In case of "fooo".AllIndexesOf("oo") the result is just 1 https://dotnetfiddle.net/CPC7D2

How about
string xx = "The quick brown fox jumps over the lazy dog";
char search = 'f';
var result = xx.Select((b, i) => b.Equals(search) ? i : -1).Where(i => i != -1);

The raw iteration is always better & most optimized.
Unless it's a bit complex task, you never really need to seek for a better optimized solution...
So I would suggest to continue with :
var foundIndexes = new List<int>();
for (int i = 0; i < myStr.Length; i++)
if (myStr[i] == 'a') foundIndexes.Add(i);

If the string is short, it may be more efficient to search the string once and count up the number of times the character appears, then allocate an array of that size and search the string a second time, recording the indexes in the array. This will skip any list re-allocations.
What it comes down to is how long the string is and how many times the character appears. If the string is long and the character appears few times, searching it once and appending indicies to a List<int> will be faster. If the character appears many times, then searching the string twice (once to count, and once to fill an array) may be faster. Exactly where the tipping point is depends on many factors that can't be deduced from your question.
If you need to search the string for multiple different characters and get a list of indexes for those characters separately, it may be faster to search through the string once and build a Dictionary<char, List<int>> (or a List<List<int>> using character offsets from \0 as the indicies into the outer array).
Ultimately, you should benchmark your application to find bottlenecks. Often the code that we think will perform slowly is actually very fast, and we spend most of our time blocking on I/O or user input.

public static List<int> GetSubstringLocations(string text, string searchsequence)
{
try
{
List<int> foundIndexes = new List<int> { };
int i = 0;
while (i < text.Length)
{
int cindex = text.IndexOf(searchsequence, i);
if (cindex >= 0)
{
foundIndexes.Add(cindex);
i = cindex;
}
i++;
}
return foundIndexes;
}
catch (Exception ex) { }
return new List<int> { };
}

public static String[] Split(this string s,char c = '\t')
{
if (s == null) return null;
var a = new List<int>();
int i = s.IndexOf(c);
if (i < 0) return new string[] { s };
a.Add(i);
for (i = i+1; i < s.Length; i++) if (s[i] == c) a.Add(i);
var result = new string[a.Count +1];
int startIndex = 0;
result[0] = s.Remove(a[0]);
for(i=0;i<a.Count-1;i++)
{
result[i + 1] = s.Substring(a[i] + 1, a[i + 1] - a[i] - 1);
}
result[a.Count] = s.Substring(a[a.Count - 1] + 1);
return result;
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Find most accurate match in strings - c#

Related

Adding a break to a bubble sort in case the array is already sorted

Counting sort in singly-linked list C#

Find all possible combinations of word with and without hyphens

how to optimize FirstOrDefault Statement in linq

More efficient way to get all indexes of a character in a string

Categories

Resources