Question:
I have an array of strings and I am trying to find the closest match to a provided string. I have made a few attempts below, and I have also looked into other solutions such as Levenshtein distance, which seems to work only if all the strings are of similar length.
Expectation:
If I use "two are better" as the string to match, it should match "Two are better than one".
Thought:
Would it help to split stringToMatch on spaces and then check whether each of those parts is found in the current element of the array (arrayOfStrings[i])?
// Test array and string to search
string[] arrayOfStrings = new string[] { "A hot potato", "Two are better than one", "Best of both worlds", "Curiosity killed the cat", "Devil's Advocate", "It takes two to tango", "a twofer" };
string stringToMatch = "two are better";
// Contains attempt
List<string> likeNames = new List<string>();
for (int i = 0; i < arrayOfStrings.Count(); i++)
{
if (arrayOfStrings[i].Contains(stringToMatch))
{
Console.WriteLine("Hit1");
likeNames.Add(arrayOfStrings[i]);
}
if (stringToMatch.Contains(arrayOfStrings[i]))
{
Console.WriteLine("Hit2");
likeNames.Add(arrayOfStrings[i]);
}
}
// StringComparison attempt
var matches = arrayOfStrings.Where(s => s.Equals(stringToMatch, StringComparison.InvariantCultureIgnoreCase)).ToList();
// Display matched array items
Console.WriteLine("List likeNames");
likeNames.ForEach(Console.WriteLine);
Console.WriteLine("\n");
Console.WriteLine("var matches");
matches.ForEach(Console.WriteLine);
You can try the code below.
I split your stringToMatch into a List<string> and checked whether each string in the array contains every word in toMatch; if it does, that string is selected into match.
List<string> toMatch = stringToMatch.Split(' ').ToList();
List<string> match = arrayOfStrings.Where(x =>
!toMatch.Any(ele => !x.ToLower()
.Contains(ele.ToLower())))
.ToList();
For your implementation, I split stringToMatch and counted how many of its words each string contains.
The code below gives you a list ordered by match count, highest first.
string[] arrayOfStrings = new string[] { "A hot potato", "Two are better than one", "Best of both worlds", "Curiosity killed the cat", "Devil's Advocate", "It takes two to tango", "a twofer" };
string stringToMatch = "two are better";
var matches = arrayOfStrings
.Select(s =>
{
int count = 0;
foreach (var item in stringToMatch.Split(' '))
{
if (s.IndexOf(item, StringComparison.OrdinalIgnoreCase) >= 0) // case-insensitive, so "two" also matches "Two"
count++;
}
return new { count, s };
}).OrderByDescending(d => d.count);
I have used a very simple string comparison here. The algorithm can vary according to the exact requirement (for example, the order of the matched words).
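One possible variation, as a minimal sketch rather than a definitive answer: compare whole words case-insensitively, so that "two" matches "Two" but not "twofer". The helper name WordOverlap is made up for this illustration, and splitting on single spaces is an assumption about the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: counts how many distinct words of `query`
// occur as whole words in `candidate`, ignoring case.
int WordOverlap(string candidate, string query)
{
    var separators = new[] { ' ' };
    var candidateWords = new HashSet<string>(
        candidate.Split(separators, StringSplitOptions.RemoveEmptyEntries),
        StringComparer.OrdinalIgnoreCase);

    return query
        .Split(separators, StringSplitOptions.RemoveEmptyEntries)
        .Distinct(StringComparer.OrdinalIgnoreCase)
        .Count(word => candidateWords.Contains(word));
}

string[] arrayOfStrings =
{
    "A hot potato", "Two are better than one", "Best of both worlds",
    "Curiosity killed the cat", "Devil's Advocate", "It takes two to tango", "a twofer"
};
string stringToMatch = "two are better";

// Pick the candidate that shares the most words with the query.
string best = arrayOfStrings
    .OrderByDescending(s => WordOverlap(s, stringToMatch))
    .First();

Console.WriteLine(best); // Two are better than one
```

Ties are resolved by array order here, because OrderByDescending is a stable sort.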
Related
I have a substring
string subString = "ABC";
Every time all three chars appear in an input, you get one point.
for example, if input is:
"AABKM" = 0 points
"AAKLMBDC" = 1 point
"ABCC" = 1 point because all three occurs once
"AAZBBCC" = 2 points because ABC is repeated twice;
etc..
The only solution I could come up with is
Regex.Matches(input, "[ABC]").Count
But it does not give me what I'm looking for.
Thanks
You could use a ternary operation, where first we determine that all the characters are present in the string (else we return 0), and then select only those characters, group by each character, and return the minimum count from the groups:
For example:
string subString = "ABC";
var inputStrings = new[] {"AABKM", "AAKLMBDC", "ABCC", "AAZBBCC"};
foreach (var input in inputStrings)
{
var result = subString.All(input.Contains)
? input
.Where(subString.Contains)
.GroupBy(c => c)
.Min(g => g.Count())
: 0;
Console.WriteLine($"{input}: {result}");
}
Output
AABKM: 0
AAKLMBDC: 1
ABCC: 1
AAZBBCC: 2
It could be done with a single LINQ expression. However, I am not very confident that this is a good solution.
string subString = "ABC";
string input = "AAZBBBCCC";
var arr = input.ToCharArray()
.Where(x => subString.Contains(x))
.GroupBy(x => x)
.OrderBy(a => a.Count())
.First()
.Count();
The result is 2 because the letter A is present only two times.
Let's try to explain the LINQ expression.
First, transform the input string into a sequence of chars, then take only the chars that are contained in the substring. Now group these chars and order the groups by number of occurrences. At this point, take the first group and read the count of chars in that group. Note that this assumes every character of the substring occurs in the input at least once; a missing character produces no group at all, so it is not counted as zero.
Let's see if someone has a better solution.
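Here is a small sketch of one alternative, assuming the substring is non-empty: count each required character separately and take the minimum, so a character that never appears counts as zero. The name CountCompleteSets is invented for this example.

```csharp
using System;
using System.Linq;

// Hypothetical helper: how many complete sets of the characters in
// `required` can be formed from `input`. Assumes `required` is non-empty
// (Min throws on an empty sequence).
int CountCompleteSets(string input, string required)
{
    return required
        .Distinct()
        .Min(c => input.Count(ch => ch == c));
}

foreach (var input in new[] { "AABKM", "AAKLMBDC", "ABCC", "AAZBBCC" })
{
    Console.WriteLine($"{input}: {CountCompleteSets(input, "ABC")}");
}
```

For the four inputs from the question this gives 0, 1, 1 and 2 points respectively.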
try this code :
string subString = "ABC";
var input = new[] { "AABKM", "AAKLMBDC", "ABCC", "AAZBBCC" };
foreach (var item in input)
{
List<int> a = new List<int>();
for (int i = 0; i < subString.Length; i++)
{
a.Add(Regex.Matches(item, Regex.Escape(subString[i].ToString())).Count); // escape so special characters are matched literally
}
Console.WriteLine($"{item} : {a.Min()}");
}
I'm trying to remove all conjunctions and pronouns from an array of strings (let's call it array A). The words to be removed are read from a text file and converted into an array of strings (let's call it array B).
What I need is to take each element of array A, starting with the first, and compare it to every word in array B; if the word matches, I want to delete it from array A.
For example:
array A = [0]I [1]want [2]to [3]go [4]home [5]and [6]sleep
array B = [0]I [1]and [2]go [3]to
Result= array A = [0]want [1]home [2]sleep
//remove any duplicates,conjunctions and Pronouns
public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
{
//get words to be removed
string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
//split word into array of strings
string[] wordsToBeRemoved = text.Split(',');
//all articles
foreach (var article in myArticles)
{
//split articles into words
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
//loop through array of articles words
foreach (var y in articleSplit)
{
//loop through words to be removed from articleSplit
foreach (var x in wordsToBeRemoved)
{
//if word of articles matches word to be removed, remove word from article
if (y == x)
{
//get index of element in array to be removed
int g = Array.IndexOf(articleSplit,y);
//assign elemnt to ""
articleSplit[g] = "";
}
}
}
//re-assign splitted article to string
article.ArticleContent = articleSplit.ToString();
}
return myArticles;
}
If possible, I would also like array A to contain only distinct values (no duplicates).
You are looking for Enumerable.Except, which returns every element of the input sequence that is not present in the sequence passed as the parameter, and returns each such element only once.
For example
string inputText = "I want this string to be returned without some words , but words should have only one occurence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};
var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
Console.WriteLine(s);
// ---- Output ----
want
this
string
returned
without
words <<-- This appears only once
only
occurence
And applied to your code is:
string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
var result = articleSplit.Except(wordsToBeRemoved);
article.ArticleContent = string.Join(" ", result);
}
You may already have your answer in your code, although it could be cleaned up a bit, as all our code could be. You loop through articleSplit and pull out each word, then compare that word to the words in the wordsToBeRemoved array one by one. When the comparison is true you remove the item from your original array, or at least try to.
I would build another collection for the results and then display or use that collection, which is the original minus the words to exclude.
Loop through articleSplit:
foreach x in articleSplit
    found = false
    foreach y in wordsToBeRemoved
        if x == y then found = true
    if not found then newArray.Add(x)
However, this is quite a bit of work. You may want to filter the array instead, for example with Array.FindAll or LINQ's Where. There are a hundred ways to achieve this.
Here are some helpful articles:
filter an array in C#
https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx
These will save you from all that looping.
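As a rough illustration of the filtering idea, here is a minimal sketch using Array.FindAll on the example arrays from the question (lower-cased, as in the asker's code). The variable name kept is made up for this example.

```csharp
using System;

string[] articleSplit = { "i", "want", "to", "go", "home", "and", "sleep" };
string[] wordsToBeRemoved = { "i", "and", "go", "to" };

// Keep only the words that do not appear in the removal list.
string[] kept = Array.FindAll(
    articleSplit,
    word => Array.IndexOf(wordsToBeRemoved, word) < 0);

Console.WriteLine(string.Join(" ", kept)); // want home sleep
```

Unlike Except, this keeps duplicate words in the article; chain a Distinct() call if duplicates should go as well.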
You want to remove stop words. You can do it with a help of Linq:
...
string filePath = @"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions & ProNouns.txt";
// HashSet is much more efficient than an array in this context
HashSet<string> stopWords = new HashSet<string>(File
.ReadLines(filePath), StringComparer.OrdinalIgnoreCase);
foreach (var article in myArticles) {
// read article, split into words, filter out stop words...
var cleared = article
.ArticleContent
.Split(' ')
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
...
Please note that I've kept the Split() call from your code, so this is still a toy implementation. In real life you have to take at least punctuation into consideration, which is why better code uses regular expressions:
foreach (var article in myArticles) {
// read article, extract words, filter out stop words...
var cleared = Regex
.Matches(article.ArticleContent, @"\w+") // <- extract words
.OfType<Match>()
.Select(match => match.Value)
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
Currently I have a large array of entries in Pinyin tone notation. Some strings are combined, for example Diànnǎo = Diàn + nǎo.
The problem is that I want to replace a string in which two or more entries match, for example:
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
for (int i = 0; i < Py.Length; i++)
if (Input.Contains(Py[i]))
Input = Input.Replace(Py[i], Km[i]);
The code above has a problem caused by the loop order: "xiaguo" contains "xi", so the first replacement produces "shiaguo" instead of "shieguo", because xi is tried before xia.
How do I achieve this and make sure xia is matched instead of xi?
Full code I posted on GitHub: https://github.com/Anime4000/py2km/blob/beta/py2km.api/Converter.cs#L15
Assuming longer tokens take precedence over shorter tokens, the 2 arrays can be converted to a dictionary and then sorted by the length of the key:
var dic = new Dictionary<string, string>
{
{"xi","shi"},
{"xia","shie"},
{"xian","shien"},
}.OrderByDescending(x => x.Key.Length)
.ThenBy(x => x.Key)
.ToDictionary(x => x.Key, x => x.Value);
string input = "xiaguo";
foreach(var d in dic)
input = input.Replace(d.Key, d.Value);
Console.WriteLine(input);
The above example sorts the dictionary:
by the length of the key, longest first
then alphabetically by key
Finally, the LINQ query is converted back to a dictionary.
From there, just iterate over the dictionary and replace all the tokens; there's no need to check to see if the key/token exists.
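One caveat, offered as a hedged note: Dictionary<TKey, TValue> documents its enumeration order as undefined, so relying on ToDictionary to preserve the sort depends on an implementation detail. A minimal sketch that keeps the ordering in a list instead (the names map and ordered are made up for this example):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var map = new Dictionary<string, string>
{
    { "xi", "shi" },
    { "xia", "shie" },
    { "xian", "shien" },
};

// Materialise the ordering as a list, which does preserve order.
List<KeyValuePair<string, string>> ordered = map
    .OrderByDescending(p => p.Key.Length)
    .ThenBy(p => p.Key, StringComparer.Ordinal)
    .ToList();

string input = "xiaguo";
foreach (var pair in ordered)
    input = input.Replace(pair.Key, pair.Value);

Console.WriteLine(input); // shieguo
```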
You could use a regular expression for this.
I modified your code so that the regex matches only xi and not xia.
The regex "xi\b" matches xi followed by a word boundary (\b), so it matches only when xi ends the word. Write it as a verbatim string (@"xi\b"), because in a regular C# string literal "\b" is a backspace character.
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
string pattern = @"xi\b";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
for (int i = 0; i < Py.Length; i++)
{
MatchCollection matches = rgx.Matches(Py[i]);
if (matches.Count > 0)
{
// Note: String.Replace still replaces "xi" wherever it occurs in Input, including inside "xia"
Input = Input.Replace(Py[i], Km[i]);
}
}
Tone and language specifics may not follow a simple structure, so you may assume some pattern and find out later that it is wrong for some 'word'.
Anyway, to handle the scenario described, you must order the target tones by descending length and then perform only a single replacement for each 'word' (this avoids replacing xi or xia when processing xian).
The steps would be:
for each replacement ordered by length descending
try to find tone
if found: replace and mark as done (jump to next 'word')
The idea is the same as when replacing two numbers in a list, say 2 with 1 and 3 with 2. The order really matters: if you replace 3 with 2 first and then 2 with 1, both the original 3 and the original 2 end up as 1.
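The steps above can be sketched with a single regex pass: an alternation built from the keys, longest first, where each matched piece is replaced exactly once and the replaced text is never scanned again. This is only an illustration under the assumption that the longest key should always win; the names map, pattern and ConvertPinyin are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var map = new Dictionary<string, string>
{
    { "xi", "shi" },
    { "xia", "shie" },
    { "xian", "shien" },
};

// Longest keys first, so the alternation prefers "xian" over "xia" over "xi".
string pattern = string.Join("|", map.Keys
    .OrderByDescending(k => k.Length)
    .Select(Regex.Escape));

// Each match is looked up once; already-replaced text is not rescanned.
string ConvertPinyin(string text) =>
    Regex.Replace(text, pattern, m => map[m.Value]);

Console.WriteLine(ConvertPinyin("xiaguo")); // shieguo
```

Whether the longest match is always the right syllable split is a language question that this sketch does not try to answer.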
I am trying to use regex to split the string into 2 arrays to turn out like this.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
How do I split str1 to break off into 2 arrays that look like this:
ary1 = ['First Second','Third Forth','Fifth'];
ary2 = ['insideFirst','insideSecond'];
here is my solution
string str = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
MatchCollection matches = Regex.Matches(str, @"\[.*?\]");
string[] arr = matches.Cast<Match>()
.Select(m => m.Groups[0].Value.Trim(new char[]{'[',']'}))
.ToArray();
foreach (string s in arr)
{
Console.WriteLine(s);
}
string[] arr1 = Regex.Split(str, @"\[.*?\]")
.Select(x => x.Trim())
.ToArray();
foreach (string s in arr1)
{
Console.WriteLine(s);
}
Output
insideFirst
insideSecond
First Second
Third Forth
Fifth
Please try the code below. It works fine for me.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var output = String.Join(";", Regex.Matches(str1, @"\[(.+?)\]")
.Cast<Match>()
.Select(m => m.Groups[1].Value));
string[] strInsideBreacket = output.Split(';');
for (int i = 0; i < strInsideBreacket.Count(); i++)
{
str1 = str1.Replace("[", ";");
str1 = str1.Replace("]", "");
str1 = str1.Replace(strInsideBreacket[i], "");
}
string[] strRemaining = str1.Split(';');
Here, strInsideBreacket is the array of bracketed values (insideFirst and insideSecond), and strRemaining is the array of the remaining parts (First Second, Third Forth and Fifth).
Thanks
Try this solution,
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var allWords = str1.Split(new char[] { '[', ']' }, StringSplitOptions.RemoveEmptyEntries);
var result = allWords.GroupBy(x => x.Contains("inside")).ToArray();
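The idea is to first split the string into all of its parts and then group them. Note that the grouping relies on the bracketed values containing the text "inside", as they do in the example.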
It seems to me that "user2828970" asked a question with an example, not with literal text he wanted to parse. In my mind, he could very well have asked this question:
I am trying to use regex to split a string like so.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"\d+");
The result of regexSplit is: I had, birds but, of them flew away.
However, I also want to know the value which resulted in the second string splitting away from its preceding text, and the value which resulted in the third string splitting away from its preceding text. i.e.: I want to know about 185 and 20.
The string could be anything, and the pattern to split by could be anything. The answer should not have hard-coded values.
Well, this simple function will perform that task. The code can be optimized to compile the regex, or re-organized to return multiple collections or different objects. But this is (nearly) the way I use it in production code.
public static List<Tuple<string, string>> RegexSplitDetail(this string text, string pattern)
{
var splitAreas = new List<Tuple<string, string>>();
var regexResult = Regex.Matches(text, pattern);
var regexSplit = Regex.Split(text, pattern);
for (var i = 0; i < regexSplit.Length; i++)
splitAreas.Add(new Tuple<string, string>(i == 0 ? null : regexResult[i - 1].Value, regexSplit[i]));
return splitAreas;
}
...
var result = exampleSentence.RegexSplitDetail(@"\d+");
This would return a single collection which looks like this:
{ null, "I had "}, // First value, had no value splitting it from a predecessor
{"185", " birds but "}, // Second value, split from the preceding string by "185"
{ "20", " of them flew away"} // Third value, split from the preceding string by "20"
Since this is a .NET question, and apart from the approach I prefer in my other answer, you can also capture the split values in another very simple way. You then just need to write a function that uses the results as you see fit.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"(\d+)");
The result of regexSplit is: I had, 185, birds but, 20, of them flew away. As you can see, the split values exist within the split results.
Note the subtle difference compared to my other answer: in this Regex.Split call I put a capture group around the entire pattern, (\d+). You can't do that!? Can you? You can.
Using a capture group in a split places the captured text of each separator in the result array, between the pieces it separated. This can get messy, so I don't suggest doing it. It also forces anybody using your function(s) to know that they have to wrap their regexes in a capture group.
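To illustrate how those interleaved results could be consumed, here is a minimal sketch. It assumes the pattern is wrapped in exactly one capture group, so the array alternates text, separator, text, and so on. The tuple list pieces is a made-up name for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
#nullable enable

string exampleSentence = "I had 185 birds but 20 of them flew away";

// With one capture group around the whole pattern, even indexes hold the
// text pieces and odd indexes hold the separators that split them.
string[] parts = Regex.Split(exampleSentence, @"(\d+)");

var pieces = new List<(string? Separator, string Text)>();
for (int i = 0; i < parts.Length; i += 2)
{
    // The first piece has no preceding separator.
    string? separator = i == 0 ? null : parts[i - 1];
    pieces.Add((separator, parts[i]));
}

foreach (var piece in pieces)
{
    string shown = piece.Separator ?? "null";
    Console.WriteLine($"{shown} -> '{piece.Text}'");
}
```

This yields the same three pairs as the RegexSplitDetail approach: no separator with "I had ", then "185" with " birds but ", then "20" with " of them flew away".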
I'm wondering how I can replace (remove) multiple words (like 500+) from a string. I know I can use the replace function to do this for a single word, but what if I want to replace 500+ words? I'm interested in removing all generic keywords from an article (such as "and", "I", "you" etc).
Here is the code for 1 replacement.. I'm looking to do 500+..
string a = "why and you it";
string b = a.Replace("why", "");
MessageBox.Show(b);
Thanks
@Sergey Kucher: Text size will vary between a few hundred words and a few thousand. I am replacing these words in random articles.
I would normally do something like:
// If you want the search/replace to be case sensitive, remove the
// StringComparer.OrdinalIgnoreCase
Dictionary<string, string> replaces = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {
// The format is word to be searched, word that should replace it
// or String.Empty to simply remove the offending word
{ "why", "xxx" },
{ "you", "yyy" },
};
void Main()
{
string a = "why and you it and You it";
// This will search for blocks of letters and numbers (abc/abcd/ab1234)
// and pass it to the replacer
string b = Regex.Replace(a, @"\w+", Replacer);
}
string Replacer(Match m)
{
string found = m.ToString();
string replace;
// If the word found is in the dictionary then it's placed in the
// replace variable by the TryGetValue
if (!replaces.TryGetValue(found, out replace))
{
// otherwise replace the word with the same word (so do nothing)
replace = found;
}
else
{
// The word is in the dictionary. replace now contains the
// word that will substitute it.
// At this point you could add some code to maintain upper/lower
// case between the words (so that if you -> xxx then You becomes Xxx
// and YOU becomes XXX)
}
return replace;
}
As someone else wrote, but without the problems with substrings (the ass principle... you don't want to remove asses from classes :-) ), and working only if you just need to remove words:
var escapedStrings = yourReplaces.Select(Regex.Escape);
string result = Regex.Replace(yourInput, @"\b(" + string.Join("|", escapedStrings) + @")\b", string.Empty);
I use the \b word boundary. It's a little complex to explain exactly what it is, but it matches the position between a word character and a non-word character, so only whole words are matched :-)
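A short sketch of what the word boundary buys you, using a made-up word list and sentence. It adds RegexOptions.IgnoreCase and a whitespace clean-up step, neither of which is in the snippet above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] wordsToRemove = { "and", "I", "you" };
string pattern = @"\b(" + string.Join("|", wordsToRemove.Select(Regex.Escape)) + @")\b";

string input = "You and I understand Indiana";

// "and" inside "understand" and "I" at the start of "Indiana" are not
// whole words, so \b prevents them from being removed.
string removed = Regex.Replace(input, pattern, string.Empty, RegexOptions.IgnoreCase);

// Collapse the gaps left behind by the removed words.
string result = Regex.Replace(removed, @"\s+", " ").Trim();

Console.WriteLine(result); // understand Indiana
```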
Load all the words you want to replace into a list. You can do this fairly simply or make it very complex. A trivial example would be:
var sentence = "mysentence hi";
var words = File.ReadAllText("pathtowordlist.txt").Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
    sentence = sentence.Replace(word, "x"); // strings are immutable, so assign the result back
You could create two lists if you wanted a dual mapping scheme.
Try this:
string text = "word1 word2 you it";
List<string> words = new System.Collections.Generic.List<string>();
words.Add("word1");
words.Add("word2");
words.ForEach(w => text = text.Replace(w, ""));
Edit
If you want to replace text with another text, you can create class Word:
public class Word
{
public string SearchWord { get; set; }
public string ReplaceWord { get; set; }
}
And change above code to this:
string text = "word1 word2 you it";
List<Word> words = new System.Collections.Generic.List<Word>();
words.Add(new Word() { SearchWord = "word1", ReplaceWord = "replaced" });
words.Add(new Word() { SearchWord = "word2", ReplaceWord = "replaced" });
words.ForEach(w => text = text.Replace(w.SearchWord, w.ReplaceWord));
If you are talking about a single string, the solution is to remove them all with a simple call to the Replace method. As the documentation says:
"Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string".
You may need to replace several words, so you can make a list of these words:
List<string> wordsToRemove = new List<string>();
wordsToRemove.Add("why");
wordsToRemove.Add("how");
and so on,
and then remove them from the string:
foreach(string curr in wordsToRemove)
a = a.ToLower().Replace(curr, "");
Important
If you want to keep your string as it was, without lowercasing it and without struggling with upper and lower case, use:
foreach(string curr in wordsToRemove)
{
// You can reuse this object; Regex.Escape makes the word match as literal text
Regex regex = new Regex(Regex.Escape(curr), RegexOptions.IgnoreCase);
myString = regex.Replace(myString, "");
}
It depends on the situation, of course, but if your text is long, you have many words, and you want to optimize performance, you should build a trie from the words and search the trie for a match.
It won't lower the order of complexity, which is still O(nm), but for large groups of words it can check multiple words against each character instead of one by one.
I assume a couple of hundred words should be enough for this to be faster.
This is the fastest method in my opinion, and I have written a function for you to start with:
public struct FindRecord
{
public int WordIndex;
public int PositionInString;
}
public static FindRecord[] FindAll(string input, string[] words)
{
LinkedList<FindRecord> result = new LinkedList<FindRecord>();
int[] matchs = new int[words.Length];
for (int i = 0; i < input.Length; i++)
{
for (int j = 0; j < words.Length; j++)
{
if (input[i] == words[j][matchs[j]])
{
matchs[j]++;
if(matchs[j] == words[j].Length)
{
FindRecord findRecord = new FindRecord {WordIndex = j, PositionInString = i - matchs[j] + 1};
result.AddLast(findRecord);
matchs[j] = 0;
}
}
else
matchs[j] = 0;
}
}
return result.ToArray();
}
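The function above still compares the input against each word in turn. For comparison, here is a minimal sketch of the trie idea itself, under my own assumptions: nodes are stored in parallel lists instead of a node class, the longest match at each position wins, and all names (children, isWordEnd, AddWord, LongestMatchAt) are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Each node maps a character to the index of its child node; node 0 is the root.
var children = new List<Dictionary<char, int>> { new Dictionary<char, int>() };
var isWordEnd = new List<bool> { false };

void AddWord(string word)
{
    int node = 0;
    foreach (char c in word)
    {
        if (!children[node].TryGetValue(c, out int next))
        {
            // No child for this character yet: create one.
            next = children.Count;
            children.Add(new Dictionary<char, int>());
            isWordEnd.Add(false);
            children[node][c] = next;
        }
        node = next;
    }
    isWordEnd[node] = true;
}

// Returns the length of the longest stored word starting at `start`, or 0 if none.
int LongestMatchAt(string input, int start)
{
    int node = 0, best = 0;
    for (int i = start; i < input.Length; i++)
    {
        if (!children[node].TryGetValue(input[i], out node)) break;
        if (isWordEnd[node]) best = i - start + 1;
    }
    return best;
}

foreach (var w in new[] { "why", "you", "your" }) AddWord(w);

string text = "why and you it";
var found = new List<(int Position, int Length)>();
for (int pos = 0; pos < text.Length; pos++)
{
    int length = LongestMatchAt(text, pos);
    if (length > 0)
    {
        found.Add((pos, length));
        pos += length - 1; // continue after the match
    }
}

foreach (var match in found)
    Console.WriteLine($"{match.Position}: {text.Substring(match.Position, match.Length)}");
```

All words are checked together while walking the trie from each position, which is the point made above; it does not check word boundaries, so "you" would also be found inside "young".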
Another option:
This might be the rare case where a regex is faster than hand-written code.
Try using:
public static string ReplaceAll(string input, string[] words)
{
string wordlist = string.Join("|", words.Select(Regex.Escape)); // escape each word so it is matched literally
Regex rx = new Regex(wordlist, RegexOptions.Compiled);
return rx.Replace(input, m => "");
}
Regex can do this better. You just need all the words to replace in a list, and then:
var escapedStrings = yourReplaces.Select(PadAndEscape);
string result = Regex.Replace(yourInput, string.Join("|", escapedStrings), " "); // a single space keeps the neighbouring words apart
This requires a function that pads each string with spaces before escaping it:
public string PadAndEscape(string s)
{
return Regex.Escape(" " + s + " ");
}