C# Accurately Replace String/SubString - c#

Currently I have a large array of Pinyin tone notation entries. Some strings are combined, for example Diànnǎo = Diàn + nǎo.
The problem is that I want to replace text that can match two or more entries, for example:
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
for (int i = 0; i < Py.Length; i++)
if (Input.Contains(Py[i]))
Input = Input.Replace(Py[i], Km[i]);
The code above has a problem caused by the loop order: xiaguo contains xi, so the first check is true and the result is shiaguo instead of shieguo, because xi is tried before xia.
How do I achieve this and make sure xia is matched instead of xi?
Full code I posted on GitHub: https://github.com/Anime4000/py2km/blob/beta/py2km.api/Converter.cs#L15

Assuming longer tokens take precedence over shorter tokens, the 2 arrays can be converted to a dictionary and then sorted by the length of the key:
var dic = new Dictionary<string, string>
{
{"xi","shi"},
{"xia","shie"},
{"xian","shien"},
}.OrderByDescending(x => x.Key.Length)
.ThenBy(x => x.Key)
.ToDictionary(x => x.Key, x => x.Value);
string input = "xiaguo";
foreach(var d in dic)
input = input.Replace(d.Key, d.Value);
Console.WriteLine(input);
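The above example sorts the dictionary:
by the length of the key, longest first
then alphabetically by key
Finally, the LINQ query is converted back to a dictionary.
From there, just iterate over the dictionary and replace all the tokens; there's no need to check whether the key/token exists. Note that Dictionary<TKey, TValue> documents its enumeration order as undefined, so iterating over the ordered query directly (without the final ToDictionary) is the safer choice.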
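A related way to apply the same "longest token first" idea is a single regex pass, sketched below. This is not the answer's code: the class and method names are hypothetical, and it assumes the tokens should be matched case-sensitively. Because every match is replaced exactly once, text produced by one replacement is never scanned again by a later one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PinyinReplacer
{
    // Hypothetical helper: replaces every token of the map in one left-to-right pass.
    public static string ReplaceTokens(string input, IReadOnlyDictionary<string, string> map)
    {
        // An empty pattern would match everywhere, so return early.
        if (map.Count == 0) return input;

        // Longest keys first, so "xian" is tried before "xia" and "xi".
        var alternatives = map.Keys
            .OrderByDescending(k => k.Length)
            .Select(Regex.Escape);
        var pattern = string.Join("|", alternatives);

        // The matched text is always one of the keys, so the lookup cannot fail.
        return Regex.Replace(input, pattern, m => map[m.Value]);
    }
}
```

Called with the three pairs from the question, ReplaceTokens("xiaguo", map) picks xia rather than xi.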

You could use a regular expression for this.
I modified your code so the regex will only match xi and not xia.
The regex xi\b matches xi followed by a word boundary (\b), so it matches xi only when no further letter follows, which excludes the xi inside xia.
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
// Verbatim string: in a regular string literal "\b" would be a backspace character.
string pattern = @"xi\b";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
for (int i = 0; i < Py.Length; i++)
{
    // Note: the pattern is tested against the token Py[i], not against Input.
    MatchCollection matches = rgx.Matches(Py[i]);
    if (matches.Count > 0)
    {
        Input = Input.Replace(Py[i], Km[i]);
    }
}

The tone/language specifics may not have an easy structure, so you may assume some pattern and find out later that it is not right for some 'word'.
Anyway, to handle the scenario described, you must order the target tones by descending length and then perform only a single replacement for each 'word' (this avoids replacing xi or xia when processing xian).
The steps would be:
for each replacement, ordered by length descending
try to find the tone
if found: replace and mark as done (jump to the next 'word')
The idea here is the same as when replacing two numbers in a list, say 2 with 1 and 3 with 2. The order really matters: if you replace 3 with 2 first and then 2 with 1, both the original 3s and the original 2s end up as 1.
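The number example can be checked with a few lines of code. The sketch below is only an illustration (the class and method names are made up): one method applies the rules one after another, the other looks up each original value exactly once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplacementOrderDemo
{
    // Applies the rules one after another: a value rewritten by an
    // earlier rule can be rewritten again by a later rule.
    public static List<int> ApplySequentially(
        IEnumerable<int> values, IEnumerable<KeyValuePair<int, int>> rules)
    {
        var result = values.ToList();
        foreach (var rule in rules)
            result = result.Select(v => v == rule.Key ? rule.Value : v).ToList();
        return result;
    }

    // Looks up each original value exactly once, so results are never re-processed.
    public static List<int> ApplyOnce(
        IEnumerable<int> values, IReadOnlyDictionary<int, int> rules)
    {
        return values.Select(v => rules.TryGetValue(v, out var r) ? r : v).ToList();
    }
}
```

For the list 1, 2, 3 with the rules 3 to 2 and then 2 to 1, ApplySequentially ends with 1, 1, 1, while ApplyOnce gives 1, 1, 2.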

Related

find repeated substring in a string

I have a substring
string subString = "ABC";
Every time all three chars appear in an input, you get one point.
for example, if input is:
"AABKM" = 0 points
"AAKLMBDC" = 1 point
"ABCC" = 1 point because all three occurs once
"AAZBBCC" = 2 points because ABC is repeated twice;
etc..
The only solution I could come up with is
Regex.Matches(input, "[ABC]").Count
But does not give me what I'm looking for.
Thanks
You could use a ternary operation, where first we determine that all the characters are present in the string (else we return 0), and then select only those characters, group by each character, and return the minimum count from the groups:
For example:
string subString = "ABC";
var inputStrings = new[] {"AABKM", "AAKLMBDC", "ABCC", "AAZBBCC"};
foreach (var input in inputStrings)
{
var result = subString.All(input.Contains)
? input
.Where(subString.Contains)
.GroupBy(c => c)
.Min(g => g.Count())
: 0;
Console.WriteLine($"{input}: {result}");
}
Output
AABKM: 0
AAKLMBDC: 1
ABCC: 1
AAZBBCC: 2
It could be done with a single statement using LINQ. However, I am not very confident that this is a good solution.
string subString = "ABC";
string input = "AAZBBBCCC";
var arr = input.ToCharArray()
.Where(x => subString.Contains(x))
.GroupBy(x => x)
.OrderBy(a => a.Count())
.First()
.Count();
The result is 2 because the letter A is present only two times.
Let's try to explain the LINQ expression.
First transform the input string into a sequence of chars, then take only the chars that are contained in the substring. Now group these chars and order the groups by their number of occurrences. At this point take the first group and read the count of chars in that group. Note that a letter that is missing from the input entirely forms no group, so it is not counted as zero.
Let's see if someone has a better solution.
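Try this code:
string subString = "ABC";
var input = new[] { "AABKM", "AAKLMBDC", "ABCC", "AAZBBCC" };
foreach (var item in input)
{
    // Number of occurrences of each required character in the current item
    List<int> a = new List<int>();
    for (int i = 0; i < subString.Length; i++)
    {
        // Escape the character in case it has a special meaning in a regex
        a.Add(Regex.Matches(item, Regex.Escape(subString[i].ToString())).Count);
    }
    // The rarest character limits how many complete sets exist
    Console.WriteLine($"{item} : {a.Min()}");
}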
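All three answers rely on the same observation: the score is the count of the rarest required character. A possible variant that scans the input only once and also handles a missing character is sketched here. The names are hypothetical, and it assumes repeated characters in the required set should be ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SetCounter
{
    // Hypothetical helper: how many complete sets of the required characters the input contains.
    public static int CountCompleteSets(string input, string required)
    {
        // Start every required character at zero so a missing one yields 0.
        var counts = required.Distinct().ToDictionary(c => c, _ => 0);

        foreach (var c in input)
        {
            if (counts.ContainsKey(c))
                counts[c]++;
        }

        // The rarest required character limits the number of complete sets.
        return counts.Count == 0 ? 0 : counts.Values.Min();
    }
}
```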

Get a number and string from string

I have a fairly simple problem, but I want to solve it in the best way possible. Basically, I have a string in this kind of format: <some letters><some numbers>, e.g. q1 or qwe12. What I want to do is get two strings from that (then I can convert the number part to an integer, or not, whatever). The first one is the "string part" of the given string, e.g. qwe, and the second one is the "number part", e.g. 12. There won't be a situation where the numbers and letters are mixed up, like qw1e2.
Of course, I know that I can use a StringBuilder, go through the string with a for loop and check whether each character is a digit or a letter. Easy. But I don't think that is a really clean solution, so I am asking: is there a way, a built-in method or something like that, to do this in 1-3 lines? Or just without using a loop?
You can use a regular expression with named groups to identify the different parts of the string you are interested in.
For example:
string input = "qew123";
var match = Regex.Match(input, "(?<letters>[a-zA-Z]+)(?<numbers>[0-9]+)");
if (match.Success)
{
Console.WriteLine(match.Groups["letters"]);
Console.WriteLine(match.Groups["numbers"]);
}
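If the whole string must have the letters-then-digits shape, the pattern can be anchored and the number converted in the same step. The following is only a sketch under that assumption; the class and method names are invented for illustration, and only ASCII letters and digits are accepted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterDigitSplitter
{
    // Anchored so that inputs such as "qw1e2" are rejected rather than partially matched.
    private static readonly Regex Pattern =
        new Regex(@"^(?<letters>[a-zA-Z]+)(?<numbers>[0-9]+)$");

    // Hypothetical helper: returns false if the input does not have the expected
    // shape or the digits do not fit into an int.
    public static bool TrySplit(string input, out string letters, out int number)
    {
        letters = string.Empty;
        number = 0;

        var match = Pattern.Match(input);
        if (!match.Success)
            return false;

        letters = match.Groups["letters"].Value;
        return int.TryParse(match.Groups["numbers"].Value, out number);
    }
}
```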
You can try Linq as an alternative to regular expressions:
string source = "qwe12";
string letters = string.Concat(source.TakeWhile(c => c < '0' || c > '9'));
string digits = string.Concat(source.SkipWhile(c => c < '0' || c > '9'));
You can use the Where() extension method from the System.Linq library (https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable.where) to keep only the chars that are digits, convert the resulting IEnumerable<char> to an array of chars, and use that array to create a new string:
string source = "qwe12";
string stringPart = new string(source.Where(c => !Char.IsDigit(c)).ToArray());
string numberPart = new string(source.Where(Char.IsDigit).ToArray());
MessageBox.Show($"String part: '{stringPart}', Number part: '{numberPart}'");
Source:
https://stackoverflow.com/a/15669520/8133067
If possible, add a space between the letters and numbers (q 3, zet 64, etc.) and use string.Split.
Otherwise, use the for loop; it isn't that hard.
You can test as part of an aggregation:
var z = "qwe12345";
var b = z.Aggregate(new []{"", ""}, (acc, s) => {
if (Char.IsDigit(s)) {
acc[1] += s;
} else {
acc[0] += s;
}
return acc;
});
Assert.Equal(new [] {"qwe", "12345"}, b);

Loop through string and remove any occurrence of specified word

I'm trying to remove all conjunctions and pronouns from an array of strings (let's call that array A). The words to be removed are read from a text file and converted into an array of strings (let's call that array B).
What I need is to take the first element of array A and compare it to every word in array B; if the word matches, I want to delete it from array A, and then do the same for the remaining elements.
For example:
array A = [0]I [1]want [2]to [3]go [4]home [5]and [6]sleep
array B = [0]I [1]and [2]go [3]to
Result= array A = [0]want [1]home [2]sleep
//remove any duplicates,conjunctions and Pronouns
public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
{
//get words to be removed
string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
//split word into array of strings
string[] wordsToBeRemoved = text.Split(',');
//all articles
foreach (var article in myArticles)
{
//split articles into words
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
//loop through array of articles words
foreach (var y in articleSplit)
{
//loop through words to be removed from articleSplit
foreach (var x in wordsToBeRemoved)
{
//if word of articles matches word to be removed, remove word from article
if (y == x)
{
//get index of element in array to be removed
int g = Array.IndexOf(articleSplit,y);
//assign elemnt to ""
articleSplit[g] = "";
}
}
}
//re-assign splitted article to string
article.ArticleContent = articleSplit.ToString();
}
return myArticles;
}
If it is possible as well, I need array A to have no duplicates/distinct values.
You are looking for Enumerable.Except, which returns the set difference of two sequences: every element of the input sequence that is not present in the sequence passed as the parameter is returned, and only once.
For example
string inputText = "I want this string to be returned without some words , but words should have only one occurence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};
var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
Console.WriteLine(s);
// ---- Output ----
want
this
string
returned
without
words <<-- This appears only once
only
occurence
And applied to your code is:
string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
var result = articleSplit.Except(wordsToBeRemoved);
article.ArticleContent = string.Join(" ", result);
}
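To make the difference between Except and a plain filter concrete, here is a small side-by-side sketch. The class and method names are made up, and the case-insensitive comparison is an assumption that stands in for the ToLower() calls above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Set difference: drops the stop words AND collapses repeated words.
    public static string[] RemoveDistinct(
        IEnumerable<string> words, IEnumerable<string> stopWords)
    {
        return words.Except(stopWords, StringComparer.OrdinalIgnoreCase).ToArray();
    }

    // Plain filter: drops the stop words but keeps repeated words.
    public static string[] RemoveKeepingDuplicates(
        IEnumerable<string> words, IEnumerable<string> stopWords)
    {
        var stop = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);
        return words.Where(w => !stop.Contains(w)).ToArray();
    }
}
```

Since the question also asks for distinct values, Except covers both requirements at once; the filter variant is the one to pick when the article text has to stay readable.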
You may have your answer already in your code. I am sure it could be cleaned up a bit, as all our code could be. You loop through articleSplit and pull out each word, then compare that word to the words in the wordsToBeRemoved array one by one. When the comparison is true, you try to remove the item from your original array.
I would instead build another collection for the results and then display or use it however you'd like; it holds the article minus the words to exclude.
Loop through articleSplit:
foreach x in articleSplit
    if no y in wordsToBeRemoved equals x
        newList.Add(x)
However, this is quite a bit of work. You may want to use a built-in filtering method instead (for example Array.FindAll or LINQ's Where). There are a hundred ways to achieve this.
Here are some helpful articles:
filter an array in C#
https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx
These will save you from all that looping.
You want to remove stop words. You can do it with the help of LINQ:
...
string filePath = @"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions & ProNouns.txt";
// Hashset is much more efficient than array in the context
HashSet<string> stopWords = new HashSet<string>(File
.ReadLines(filePath), StringComparer.OrdinalIgnoreCase);
foreach (var article in myArticles) {
// read article, split into words, filter out stop words...
var cleared = article
.ArticleContent
.Split(' ')
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
...
Please notice that I've preserved the Split() which you've used in your code, so this is a toy implementation. In real life you have to take at least punctuation into consideration, and that's why better code uses regular expressions:
foreach (var article in myArticles) {
// read article, extract words, filter out stop words...
var cleared = Regex
.Matches(article.ArticleContent, @"\w+") // <- extract words
.OfType<Match>()
.Select(match => match.Value)
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}

Regex Help String Matching

I've got a long string in the format of:
WORD_1#WORD_3#WORD_5#CAT_DOG_FISH#WORD_2#WORD_3#CAT_DOG_FISH_2#WORD_7
I'm trying to dynamically match a string so I can return its position within the string.
I know the string will start with CAT_DOG_ but the FISH is dynamic and could be anything. It's also important not to match on the CAT_DOG_FISH_2(int)
Basically, I need to get back a match on any word starting with [CAT_DOG_] but not ending in [_(int)]
I've tried a few different things and I don't seem to be getting anywhere; any help appreciated.
Once I have the regex to match, I'll be able to get the index of the match, then work out when the next #(delimiter) is , which will get me the start/end position of the word, I can then substring it out to return the full word.
I hope that makes sense?
Personally I avoid Regex whenever possible as I find them hard to read and maintain unless you use them a lot, so here is a non-regex solution:
string words = "WORD_1#WORD_3#WORD_5#CAT_DOG_FISH#WORD_2#WORD_3#CAT_DOG_FISH_2#WORD_7";
var result = words.Split('#')
.Select((w,p) => new { WholeWord = w, SplitWord = w.Split('_'), Position = p, Dynamic = w.Split('_').Last() })
.FirstOrDefault(
x => x.SplitWord.Length == 3 &&
x.SplitWord[0] == "CAT" &&
x.SplitWord[1] == "DOG");
That gives you the whole word, the dynamic part and the position. It does assume the dynamic part doesn't have underscores.
You can use the following regex:
\bCAT_DOG_[a-zA-Z]+(?!_\d)\b
Or (if the FISH is really anything, but not _ or #):
\bCAT_DOG_[^_#]+(?!_\d)\b
The word boundaries \b with the look-ahead (?!_\d) (meaning that there must be no _ and a digit) help us return only the required strings. The [^_#] character class matches any character but a _ or #.
You can get the indices using LINQ:
var s = "WORD_1#WORD_3#WORD_5#CAT_DOG_FISH#WORD_2#WORD_3#CAT_DOG_FISH_2#WORD_7";
var rx1 = new Regex(@"\bCAT_DOG_[^_#]+(?!_\d)\b");
var indices = rx1.Matches(s).Cast<Match>().Select(p => p.Index).ToList();
Values can be obtained like this:
var values = rx1.Matches(s).Cast<Match>().Select(p => p.Value).ToList();
Or together:
var values = rx1.Matches(s).OfType<Match>().Select(p => new { p.Index, p.Value }).ToList();
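Since the question asks for the start and end position of the word, the match can supply both directly through Match.Index and Match.Length, without searching for the next # by hand. A minimal sketch using the second pattern above; the class name, method name and tuple field names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordLocator
{
    // Hypothetical helper: returns the start index, the exclusive end index
    // and the text of every matching word.
    public static List<(int Start, int End, string Word)> Locate(string text)
    {
        var rx = new Regex(@"\bCAT_DOG_[^_#]+(?!_\d)\b");
        return rx.Matches(text)
            .Cast<Match>()
            .Select(m => (Start: m.Index, End: m.Index + m.Length, Word: m.Value))
            .ToList();
    }
}
```

With these values, text.Substring(Start, End - Start) returns the word itself.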
Thanks for the help guys. Since I know the int the string will end with, I've settled on this:
int i = 0;
string[] words = textBox1.Text.Split('#');
foreach (string word in words)
{
if (word.StartsWith("CAT_DOG_") && (!word.EndsWith(i.ToString())) )
{
//process here
MessageBox.Show("match is: " + word);
}
}
Thanks to Eser for pointing me towards String.Split()

Regular Expression split string and get whats in brackets [ ] put into array

I am trying to use regex to split the string into 2 arrays to turn out like this.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
How do I split str1 to break off into 2 arrays that look like this:
ary1 = ['First Second','Third Forth','Fifth'];
ary2 = ['insideFirst','insideSecond'];
Here is my solution:
string str = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
MatchCollection matches = Regex.Matches(str, @"\[.*?\]");
string[] arr = matches.Cast<Match>()
.Select(m => m.Groups[0].Value.Trim(new char[]{'[',']'}))
.ToArray();
foreach (string s in arr)
{
Console.WriteLine(s);
}
string[] arr1 = Regex.Split(str, @"\[.*?\]")
.Select(x => x.Trim())
.ToArray();
foreach (string s in arr1)
{
Console.WriteLine(s);
}
Output
insideFirst
insideSecond
First Second
Third Forth
Fifth
Please try the code below. It's working fine for me.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var output = String.Join(";", Regex.Matches(str1, @"\[(.+?)\]")
.Cast<Match>()
.Select(m => m.Groups[1].Value));
string[] strInsideBreacket = output.Split(';');
for (int i = 0; i < strInsideBreacket.Count(); i++)
{
str1 = str1.Replace("[", ";");
str1 = str1.Replace("]", "");
str1 = str1.Replace(strInsideBreacket[i], "");
}
string[] strRemaining = str1.Split(';');
Here, strInsideBreacket is the array of bracketed values (insideFirst and insideSecond), and strRemaining is the array of the remaining parts (First Second, Third Forth and Fifth, with their surrounding spaces).
Thanks
Try this solution,
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var allWords = str1.Split(new char[] { '[', ']' }, StringSplitOptions.RemoveEmptyEntries);
var result = allWords.GroupBy(x => x.Contains("inside")).ToArray();
The idea is to first get all the parts and then group them.
It seems to me that "user2828970" asked a question with an example, not with literal text he wanted to parse. In my mind, he could very well have asked this question:
I am trying to use regex to split a string like so.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"\d+");
The result of regexSplit is: I had, birds but, of them flew away.
However, I also want to know the value which resulted in the second string splitting away from its preceding text, and the value which resulted in the third string splitting away from its preceding text. i.e.: I want to know about 185 and 20.
The string could be anything, and the pattern to split by could be anything. The answer should not have hard-coded values.
Well, this simple function will perform that task. The code can be optimized to compile the regex, or re-organized to return multiple collections or different objects. But this is (nearly) the way I use it in production code.
public static List<Tuple<string, string>> RegexSplitDetail(this string text, string pattern)
{
var splitAreas = new List<Tuple<string, string>>();
var regexResult = Regex.Matches(text, pattern);
var regexSplit = Regex.Split(text, pattern);
for (var i = 0; i < regexSplit.Length; i++)
splitAreas.Add(new Tuple<string, string>(i == 0 ? null : regexResult[i - 1].Value, regexSplit[i]));
return splitAreas;
}
...
var result = exampleSentence.RegexSplitDetail(@"\d+");
This would return a single collection which looks like this:
{ null, "I had "}, // First value, had no value splitting it from a predecessor
{"185", " birds but "}, // Second value, split from the preceding string by "185"
{ "20", " of them flew away"} // Third value, split from the preceding string by "20"
Since this is a .NET question, and apart from the approach I favour in my other answer, you can also capture the split values in another very simple way. You then just need to create a function to use the results as you see fit.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"(\d+)");
The result of regexSplit is: I had, 185, birds but, 20, of them flew away. As you can see, the split values exist within the split results.
Note the subtle difference compared to my other answer: in this regex split, I used a capture group around the entire pattern, (\d+). You can't do that!!!?.. can you?
When the pattern contains capture groups, Regex.Split includes the captured text in the result array, between the pieces it separated. This can get messy, so I don't suggest doing it. It also forces anybody using your function(s) to know that they have to wrap their regexes in a capture group.
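Applied to the original bracket question, that behaviour can produce both arrays from one call. The sketch below is only an illustration and not code from any of the answers: the names are hypothetical, and it assumes the pattern has exactly one capture group, so the pieces outside the brackets land at even indices and the captured contents at odd indices.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketSplitter
{
    // Hypothetical helper: splits text into the parts outside and inside square brackets.
    public static (string[] Outside, string[] Inside) Split(string text)
    {
        // The single capture group makes Regex.Split keep the bracket contents.
        var parts = Regex.Split(text, @"\[(.*?)\]");

        // Even indices: text between brackets. Odd indices: captured contents.
        var outside = parts.Where((_, i) => i % 2 == 0).Select(p => p.Trim()).ToArray();
        var inside = parts.Where((_, i) => i % 2 == 1).ToArray();
        return (outside, inside);
    }
}
```

If the text starts or ends with a bracketed part, the Outside array contains an empty string at that position, which the caller may want to filter out.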
