find repeated substring in a string - c#

I have a substring
string subString = "ABC";
Every time all three chars appear in an input, you get one point.
For example, if the input is:
"AABKM" = 0 points
"AAKLMBDC" = 1 point
"ABCC" = 1 point, because all three occur once
"AAZBBCC" = 2 points, because ABC is repeated twice
etc.
The only solution I could come up with is
Regex.Matches(input, "[ABC]").Count
But it does not give me what I'm looking for.
Thanks

You could use a ternary operation, where first we determine that all the characters are present in the string (else we return 0), and then select only those characters, group by each character, and return the minimum count from the groups:
For example:
string subString = "ABC";
var inputStrings = new[] {"AABKM", "AAKLMBDC", "ABCC", "AAZBBCC"};
foreach (var input in inputStrings)
{
    var result = subString.All(input.Contains)
        ? input
            .Where(subString.Contains)
            .GroupBy(c => c)
            .Min(g => g.Count())
        : 0;
    Console.WriteLine($"{input}: {result}");
}
Output:
AABKM: 0
AAKLMBDC: 1
ABCC: 1
AAZBBCC: 2

It can be done in a single statement using LINQ, though I am not confident that this is a good solution:
string subString = "ABC";
string input = "AAZBBBCCC";
var arr = input.ToCharArray()
    .Where(x => subString.Contains(x))
    .GroupBy(x => x)
    .OrderBy(a => a.Count())
    .First()
    .Count();
The result is 2 because the letter A is present only two times.
Let's explain the LINQ expression.
First, transform the input string into a sequence of chars, then keep only the chars that are contained in the substring. Next, group these chars and order the groups by number of occurrences. Finally, take the first group and read the count of chars in that group.
Note that a character that never appears in the input produces no group at all, so this expression reports 1 for "AABKM" instead of 0, and First() throws if none of the characters appear.
Let's see if someone has a better solution.
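A possible variation, offered as a sketch rather than a definitive answer: iterate over the required letters instead of grouping the input, so a letter that never occurs contributes a count of 0. The class and method names (SubstringScore, CountCompleteSets) are made up for illustration, and the sketch assumes the letter set is non-empty and contains no duplicates.

```csharp
using System;
using System.Linq;

public static class SubstringScore
{
    // Counts how many complete sets of the characters in `letters`
    // can be formed from `input`.
    // Assumption: `letters` is non-empty and has no duplicate characters.
    public static int CountCompleteSets(string input, string letters)
    {
        // For each required letter, count its occurrences in the input.
        // A letter that never occurs yields 0, so the minimum becomes 0.
        return letters.Min(letter => input.Count(ch => ch == letter));
    }
}
```

For example, SubstringScore.CountCompleteSets("AAZBBCC", "ABC") gives 2 and SubstringScore.CountCompleteSets("AABKM", "ABC") gives 0.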

Try this code:
string subString = "ABC";
var input = new[] { "AABKM", "AAKLMBDC", "ABCC", "AAZBBCC" };
foreach (var item in input)
{
    // Count the occurrences of each required character separately.
    List<int> a = new List<int>();
    for (int i = 0; i < subString.Length; i++)
    {
        // Regex.Escape keeps characters such as '.' or '+' literal.
        a.Add(Regex.Matches(item, Regex.Escape(subString[i].ToString())).Count);
    }
    Console.WriteLine($"{item} : {a.Min()}");
}

Related

Find uncommon characters between two strings

I have following code:
public static void Main (string[] args) {
    string word1 = "AN";
    string word2 = "ANN";
    //First try:
    var intersect = word1.Intersect(word2);
    var unCommon1 = word1.Except(intersect).Union(word2.Except(intersect));
    //Second try:
    var unCommon = word1.Except(word2).Union(word2.Except(word1));
}
The result I am trying to get is N. I tried a few approaches from online posts, but I am unable to figure it out. Is there a way to get the uncommon characters between two strings using LINQ?
The order of characters in the string does not matter.
Here are a few more scenarios:
FOO & BAR will result in F,O,O,B,A,R.
ANN & NAN will result in empty string.

Here's a straightforward LINQ query.
string word1 = "AN";
string word2 = "ANN";
//get all the characters in both strings
var group = string.Concat(word1, word2)
    //remove duplicates
    .Distinct()
    //count the times each character appears in word1 and word2, find the
    //difference, and repeat the character difference times
    .SelectMany(i => Enumerable.Repeat(i, Math.Abs(
        word1.Count(j => j == i) -
        word2.Count(j => j == i))));
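The query above recounts each character in both words for every distinct character. As a hedged alternative, the same per-character difference can be tracked in a dictionary in one pass per word. The names UncommonChars and Between are hypothetical, and the order of the returned characters is not specified, which matches the stated requirement that order does not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UncommonChars
{
    public static string Between(string word1, string word2)
    {
        // Net count per character: +1 per occurrence in word1, -1 per occurrence in word2.
        var balance = new Dictionary<char, int>();
        foreach (char c in word1)
        {
            balance.TryGetValue(c, out int n); // n is 0 when c is not present yet
            balance[c] = n + 1;
        }
        foreach (char c in word2)
        {
            balance.TryGetValue(c, out int n);
            balance[c] = n - 1;
        }

        // Repeat each character by the absolute difference of its counts.
        return new string(balance
            .SelectMany(kv => Enumerable.Repeat(kv.Key, Math.Abs(kv.Value)))
            .ToArray());
    }
}
```

With this sketch, UncommonChars.Between("AN", "ANN") gives "N" and UncommonChars.Between("ANN", "NAN") gives an empty string.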

Extract integer from the end of a string

I have multiple IDs in a List<string>()
List<string> IDList = new List<string>() {
    "ID101", //101
    "I2D102", //102
    "103", //103
    "I124D104", //104
    "ID-105", //105
    "-1006" }; //1006
Rule: The string always ends with the id which has length 1 to n and is int only
I need to extract them to int values, but my solution doesn't work:
List<int> intList = IDList.Select(x => int.Parse(Regex.Match(x, @".*\d*").Value)).ToList();

If the ID is always at the end, you could use a LINQ solution instead of Regex:
var query = IDList.Select(id =>
    int.Parse(new string(id.Reverse()
        .TakeWhile(x => char.IsNumber(x))
        .Reverse().ToArray())));
The idea is to take characters from the end of the string until a non-digit is found, then convert what was collected into an int. The good thing about this solution is that it directly expresses the rule you specified.

Well, according to
Rule: The string always ends with the id which has length 1 to n and
is int only
the pattern is nothing but
[0-9]{1,n}$
[0-9] - digits only
{1,n} - from 1 to n (both 1 and n are included)
$ - string always ends with
and a possible implementation could be something like this
int n = 5; //TODO: put actual value
String pattern = "[0-9]{1," + n.ToString() + "}$";
List<int> intList = IDList
    .Select(line => int.Parse(Regex.Match(line, pattern).Value))
    .ToList();
In case there are some broken lines, say "abc" (and you want to filter them out):
List<int> intList = IDList
    .Select(line => Regex.Match(line, pattern))
    .Where(match => match.Success)
    .Select(match => int.Parse(match.Value))
    .ToList();

Here's another LINQ approach which works if the number is always at the end and negative values aren't possible. Skips invalid strings:
List<int> intList = IDList
    .Select(s => s.Reverse().TakeWhile(Char.IsDigit))
    .Where(digits => digits.Any())
    .Select(digits => int.Parse(String.Concat(digits.Reverse())))
    .ToList();
( Edit: similar to Ian's approach )

The code below extracts the trailing ID as an integer from each item and ignores items that do not end with an integer:
List<int> intList = IDList.Where(a => Regex.IsMatch(a, @"\d+$"))
    .Select(x => int.Parse(Regex.Match(x, @"\d+$").Value)).ToList();

I assume you want the trailing digits:
var res = IDList.Select(x => int.Parse(Regex.Match(x, @"\d+$").Value)).ToList();
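One caveat with calling int.Parse directly on the match value: it throws when a string has no trailing digits (the match value is empty) or when the digits overflow an int. A minimal sketch that skips such entries instead, assuming that skipping is the desired behavior; TrailingId and Extract are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrailingId
{
    // [0-9]+$ : one or more ASCII digits anchored at the end of the string.
    private static readonly Regex TrailingDigits = new Regex(@"[0-9]+$");

    public static List<int> Extract(IEnumerable<string> ids)
    {
        var result = new List<int>();
        foreach (string id in ids)
        {
            Match match = TrailingDigits.Match(id);
            // Skip entries without trailing digits or with a value too large for int.
            if (match.Success && int.TryParse(match.Value, out int value))
            {
                result.Add(value);
            }
        }
        return result;
    }
}
```

Calling TrailingId.Extract(IDList) on the list from the question yields 101, 102, 103, 104, 105 and 1006.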

Regex Help String Matching

I've got a long string in the format of:
WORD_1#WORD_3#WORD_5#CAT_DOG_FISH#WORD_2#WORD_3#CAT_DOG_FISH_2#WORD_7
I'm trying to dynamically match a string so I can return its position within the string.
I know the string will start with CAT_DOG_ but the FISH is dynamic and could be anything. It's also important not to match on CAT_DOG_FISH_2 (the variant ending in an int).
Basically, I need to get back a match on any word starting with [CAT_DOG_] but not ending in [_(int)].
I've tried a few different things and I don't seem to be getting anywhere; any help is appreciated.
Once I have the regex to match, I'll be able to get the index of the match, then work out where the next # (delimiter) is, which will give me the start/end position of the word. I can then substring it out to return the full word.
I hope that makes sense?

Personally I avoid Regex whenever possible as I find them hard to read and maintain unless you use them a lot, so here is a non-regex solution:
string words = "WORD_1#WORD_3#WORD_5#CAT_DOG_FISH#WORD_2#WORD_3#CAT_DOG_FISH_2#WORD_7";
var result = words.Split('#')
    .Select((w,p) => new { WholeWord = w, SplitWord = w.Split('_'), Position = p, Dynamic = w.Split('_').Last() })
    .FirstOrDefault(
        x => x.SplitWord.Length == 3 &&
             x.SplitWord[0] == "CAT" &&
             x.SplitWord[1] == "DOG");
That gives you the whole word, the dynamic part and the position (the index of the word in the split list, not the character offset). It does assume the dynamic part doesn't have underscores.

You can use the following regex:
\bCAT_DOG_[a-zA-Z]+(?!_\d)\b
Or (if the FISH is really anything, but not _ or #):
\bCAT_DOG_[^_#]+(?!_\d)\b
The word boundaries \b with the look-ahead (?!_\d) (meaning that the match must not be followed by _ and a digit) help us return only the required strings. The [^_#] character class matches any character but a _ or #.
You can get the indices using LINQ:
var s = "WORD_1#WORD_3#WORD_5#CAT_DOG_FISH#WORD_2#WORD_3#CAT_DOG_FISH_2#WORD_7";
var rx1 = new Regex(@"\bCAT_DOG_[^_#]+(?!_\d)\b");
var indices = rx1.Matches(s).Cast<Match>().Select(p => p.Index).ToList();
Values can be obtained like this:
var values = rx1.Matches(s).Cast<Match>().Select(p => p.Value).ToList();
Or together:
var values = rx1.Matches(s).OfType<Match>().Select(p => new { p.Index, p.Value }).ToList();
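Since the question also asks for the start and end position of the whole word, here is a hedged sketch that anchors the match on the # delimiters, so Match.Index and Match.Length describe the full word directly. The pattern is a variation on the one above, not the same regex, and DelimitedWordFinder and Find are hypothetical names. It assumes the dynamic part contains no underscore.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedWordFinder
{
    // (?<=^|#) : the word starts at the beginning of the input or right after '#'
    // [^_#]+   : the dynamic part, assumed to contain no '_' and no delimiter
    // (?=#|$)  : the word ends at the next '#' or at the end of the input
    private static readonly Regex Pattern =
        new Regex(@"(?<=^|#)CAT_DOG_[^_#]+(?=#|$)");

    // Returns the start index, the exclusive end index and the text of each matching word.
    public static List<(int Start, int End, string Word)> Find(string input)
    {
        var found = new List<(int Start, int End, string Word)>();
        foreach (Match m in Pattern.Matches(input))
        {
            found.Add((m.Index, m.Index + m.Length, m.Value));
        }
        return found;
    }
}
```

For the sample string this finds a single word, CAT_DOG_FISH, starting at index 21 and ending just before index 33, where the next # sits.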

Thanks for the help, guys. Since I know the int the string will end with, I've settled on this:
int i = 0;
string[] words = textBox1.Text.Split('#');
foreach (string word in words)
{
    if (word.StartsWith("CAT_DOG_") && (!word.EndsWith(i.ToString())) )
    {
        //process here
        MessageBox.Show("match is: " + word);
    }
}
Thanks to Eser for pointing me towards String.Split()

C# Accurately Replace String/SubString

Currently I have a large array of Pinyin tone notation entries. Some strings are combined, for example Diànnǎo = Diàn + nǎo.
The problem is that I want to replace within a string that contains 2 or more of them, for example:
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
for (int i = 0; i < Py.Length; i++)
    if (Input.Contains(Py[i]))
        Input = Input.Replace(Py[i], Km[i]);
The code above has a problem due to the loop order: since xiaguo contains xi, and xi is checked before xia, the result is shiaguo instead of shieguo.
How do I achieve this and make sure xia is matched instead of xi?
Full code I posted on GitHub: https://github.com/Anime4000/py2km/blob/beta/py2km.api/Converter.cs#L15

Assuming longer tokens take precedence over shorter tokens, the 2 arrays can be converted to a dictionary and then sorted by the length of the key:
var ordered = new Dictionary<string, string>
{
    {"xi","shi"},
    {"xia","shie"},
    {"xian","shien"},
}.OrderByDescending(x => x.Key.Length)
 .ThenBy(x => x.Key)
 .ToList();
string input = "xiaguo";
foreach(var d in ordered)
    input = input.Replace(d.Key, d.Value);
Console.WriteLine(input);
The above example sorts the entries:
by the length of the key (longest first)
then alphabetically by key
finally, the LINQ query is materialized as a list rather than converted back to a dictionary, because Dictionary does not guarantee enumeration order.
From there, just iterate over the list and replace all the tokens; there's no need to check whether the key/token exists.
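Sequential Replace calls rescan text that an earlier replacement already produced, which can misfire if a replacement value happens to contain another key. A possible single-pass variation, sketched under the assumption that the map is non-empty: build one regex alternation with the longest keys first and replace each match once. TokenReplacer and ReplaceLongestFirst are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenReplacer
{
    // Assumption: `map` contains at least one non-empty key.
    public static string ReplaceLongestFirst(string input, IDictionary<string, string> map)
    {
        // Builds "xian|xia|xi": alternation tries alternatives left to right,
        // so listing longer keys first lets them win over their own prefixes.
        string pattern = string.Join("|",
            map.Keys.OrderByDescending(k => k.Length).Select(Regex.Escape));

        // Every match is replaced exactly once; replaced text is never rescanned.
        return Regex.Replace(input, pattern, m => map[m.Value]);
    }
}
```

With the three entries from the answer, ReplaceLongestFirst("xiaguo", map) gives shieguo.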

You could use a regular expression for this.
I modified your code so the regex will only match xi and not xia.
The regex "xi\b" matches xi followed by a word boundary (\b), so it matches only that exact word. The pattern must be a verbatim string (@"..."), because in a regular C# string literal "\b" is a backspace character. Note that a word boundary only separates syllables delimited by non-word characters, so this does not split a combined string such as xiaguo.
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
string pattern = @"xi\b";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
for (int i = 0; i < Py.Length; i++)
{
    MatchCollection matches = rgx.Matches(Py[i]);
    if (matches.Count > 0)
    {
        Input = Input.Replace(Py[i], Km[i]);
    }
}

The tone/language specifics may not have a simple structure, so you may assume some pattern and find out later that it's not right for some 'word'.
Anyway, to handle the scenario described, you must order the target tones by descending length, and then perform only a single replacement for each 'word' (this avoids replacing xi or xia when processing xian).
The steps would be:
for each replacement, ordered by length descending
try to find the tone
if found: replace and mark as done (jump to the next 'word')
The idea here is the same as when replacing two numbers in a list, say 2 with 1 and 3 with 2. The order really matters: if you replace 3 with 2 first, you will end up replacing both the original 3 and 2 with 1.
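The steps above can be sketched roughly as follows, reusing the parallel-array shape of Py and Km from the question. This is only one reading of the steps: it assumes each 'word' needs at most one replacement and that the two arrays have the same length. SingleReplacement and ConvertWord are made-up names.

```csharp
using System;
using System.Linq;

public static class SingleReplacement
{
    // Assumption: `from` and `to` are parallel arrays of equal length.
    public static string ConvertWord(string word, string[] from, string[] to)
    {
        // Visit the rules longest-first so "xian" is tried before "xia" and "xi".
        foreach (int i in Enumerable.Range(0, from.Length)
                                    .OrderByDescending(i => from[i].Length))
        {
            int index = word.IndexOf(from[i], StringComparison.Ordinal);
            if (index >= 0)
            {
                // Replace this one occurrence, then stop: the word is marked as done.
                return word.Substring(0, index) + to[i]
                     + word.Substring(index + from[i].Length);
            }
        }
        return word; // no rule applied
    }
}
```

For example, ConvertWord("xiaguo", Py, Km) gives shieguo, because xia is found before xi is ever tried.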

Using regular expressions to calculate Unique users?

I want to define a regular expression pattern that gives the count of unique users in a file. I also want to apply a length check so that a user value does not exceed 15 characters.
My code should return 2 for the logs provided below, as it should discard user values longer than 15 characters.
Log file format:
User:fd441f1f-22c0-45d2-b020-32e1e6a15a73
User:fd441f1f-22c0-45d2-b020-32e1e6a15f43
User:fd441f1f-24g0-45d2-b050-32e1e6a15a73
User: karansha
User: gulanand
Code I tried:
Regex regex = new Regex(@"User:\s*(?<username>.*?)\s");
MatchCollection matches = regex.Matches(x);
foreach (Match match in matches)
{
    var user = match.Groups["username"].Value;
    if (!users.Contains(user)) users.Add(user);
}
int numberOfUsers = users.Count;

You can do that with LINQ:
int numberOfUsers = regex.Matches(x)
    .Cast<Match>()
    .Select(m => m.Groups["username"].Value)
    .Distinct() // pick only unique names
    .Count(name => name.Length < 15); // calculate count
Or without regular expressions:
int numberOfUsers = File.ReadLines("log.txt")
    .Select(line => line.Replace("User:", "").Trim())
    .Distinct()
    .Count(name => name.Length < 15);

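I wouldn't use a Regex for this.
Try using string.Split() and Distinct instead. The entries need to be trimmed, since each one still carries its leading space and trailing line break.
int numberOfUsers = x.Split(new string[] { "User:" }, StringSplitOptions.RemoveEmptyEntries)
    .Select(name => name.Trim())
    .Distinct()
    .Count(name => name.Length < 15);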
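As a hedged sketch combining the ideas above: a line-anchored regex captures each user name, and a HashSet keeps only the unique ones. It reads "does not exceed 15 characters" as a length of at most 15 (the snippets above use a strict < 15), which is an assumption, as are the names UserCounter and CountUniqueUsers.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UserCounter
{
    // With RegexOptions.Multiline, ^ matches at the start of every line.
    // The user name is taken to be the run of non-whitespace after "User:".
    private static readonly Regex UserLine =
        new Regex(@"^User:[ \t]*(?<username>\S+)", RegexOptions.Multiline);

    // Assumption: names of up to `maxLength` characters are kept (inclusive).
    public static int CountUniqueUsers(string log, int maxLength = 15)
    {
        var users = new HashSet<string>();
        foreach (Match match in UserLine.Matches(log))
        {
            string name = match.Groups["username"].Value;
            if (name.Length <= maxLength)
            {
                users.Add(name); // HashSet ignores duplicates
            }
        }
        return users.Count;
    }
}
```

For the five log lines in the question this returns 2, since the three GUID-style values are 36 characters long and are discarded.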
