How do I find a pattern in a string?

Picture a string like this:
o7o7o7o7o7o
There is a clear pattern of o7o.
My approach was to find the second o after the first one, treat everything up to it as the pattern, and then check whether it matches throughout the string. How do I get the index of the second o?
I tried this:
var input = "o7o7o7o7o7o";
var index = input.IndexOf("o");
But that obviously returns the index of the first o it finds; I want the second one.
How do I do that?

You can do this many ways, the fastest way would be a loop:
string pattern = "o7o7o7o7o7o";
int count = 0;
int index = 0;
while(index < pattern.Length)
{
if(pattern[index] == 'o') count++;
if(count == 2) break;
index++;
}
and index is what you want.
Linq:
int index = pattern.Select((x, i) => new { x, i })
.Where(a => a.x == 'o').Skip(1)
.FirstOrDefault().i;
string.IndexOf():
int count = 0, index = -1;
do
{
    // search after the previous match, so index stays on the 'o' that was found
    index = pattern.IndexOf('o', index + 1);
    if (index != -1) count++;
} while (index != -1 && count < 2);
// index is now the position of the second 'o', or -1 if there is none
There are plenty of other ways, but the three examples above should be enough; the alternatives I can think of are slower.

Also could use Regex like so:
var pattern = "o7o7o7o7o7o";
var regex = new Regex("7(o)");
var matches = regex.Matches(pattern);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups[1].Index);
}

Build the prefix function and look for a compressed representation, as described in this excerpt:
Given a string s of length n. We want to find the shortest
"compressed" representation of the string, i.e. we want to find a
string t of smallest length such that s can be represented as a
concatenation of one or more copies of t.
It is clear, that we only need to find the length of t. Knowing the
length, the answer to the problem will be the prefix of s with this
length.
Let us compute the prefix function for s. Using the last value of it
we define the value k=n−π[n−1]. We will show, that if k divides n,
then k will be the answer, otherwise there doesn't exist an effective
compression and the answer is n.
But your string is not representable as (u)^w because it has an extra character at the end. In this case, check for each index i whether (i+1) is divisible by k[i] = (i+1) - p[i].
For s = '1231231231' we get the representation (123)^3 + 1, because the last (i+1) divisible by k[i] = 3 is 9:
i+1: p[i] k[i]
1 : 0 1
2 : 0 2
3 : 0 3
4 : 1 3
5 : 2 3
6 : 3 3
7 : 4 3
8 : 5 3
9 : 6 3
10 : 7 3
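The table above can be reproduced with a short sketch. This is only an illustration, written as C# 9 top-level statements; the helper names PrefixFunction and Compress are made up for this example and do not come from the quoted article.

```csharp
using System;

// Prefix function: pi[i] is the length of the longest proper prefix of
// s[0..i] that is also a suffix of s[0..i].
static int[] PrefixFunction(string s)
{
    var pi = new int[s.Length];
    for (int i = 1; i < s.Length; i++)
    {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) j = pi[j - 1];
        if (s[i] == s[j]) j++;
        pi[i] = j;
    }
    return pi;
}

// Hypothetical helper: finds the last prefix length (i + 1) that is divisible
// by k[i] = (i + 1) - pi[i] and returns the repeated unit, the number of
// copies, and whatever is left over at the end of the string.
static (string Unit, int Copies, string Rest) Compress(string s)
{
    int[] pi = PrefixFunction(s);
    int bestLength = 0, bestK = 0;
    for (int i = 0; i < s.Length; i++)
    {
        int k = (i + 1) - pi[i];
        if ((i + 1) % k == 0) { bestLength = i + 1; bestK = k; }
    }
    if (bestLength == 0) return ("", 0, s); // only happens for the empty string
    return (s.Substring(0, bestK), bestLength / bestK, s.Substring(bestLength));
}

var (unit, copies, rest) = Compress("1231231231");
Console.WriteLine($"{unit} x {copies}, rest: {rest}"); // 123 x 3, rest: 1
```

For the string from the question, o7o7o7o7o7o, the same call gives the unit o7 repeated 5 times with a trailing o.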

To get the index of the second occurrence of o when there should be at least 1 time not an o in between you might use a regex using a capturing group and get the index of that group:
^[^o]*o[^o]+(o)
That would match:
^ Assert the start of the string
[^o]* Match 0+ times not an o using a negated character class
o Match o literally
[^o]+ Match 1+ times not an o using a negated character class (use [^o]* if there can also be 2 consecutive o's).
(o) Capture o in a group
string pattern = @"^[^o]*o[^o]+(o)";
string input = @"o7o7o7o7o7o";
Match m = Regex.Match(input, pattern);
Console.WriteLine(m.Groups[1].Index); // 2

Related

Algorithm to find max occurrences of a substring with value of a given function

I have to find max(s.length * s.count) for any substring s of a given string t, where s.length is the length of the substring and s.count is the number of times s occurs within t. Substrings may overlap within t.
Example:
For the string aaaaaa, the substring aaa has the max (occurrences * length), substrings and occurrences are:
a: 6
aa: 5
aaa: 4
aaaa : 3
aaaaa: 2
aaaaaa: 1
So aaa is our winner with 4 occurrences * length 3 = 12. Yes, aaaa also has a score of 12, but aaa comes first.
I have tried the only means I know or can figure out, but I have an input string of 100,000 length, and just finding all the substrings is O(n^2), and this hangs my program:
var theSet = new HashSet<string>();
// i is the substring length; <= so that the full string is included as well
for (int i = 1; i <= source.Length; i++)
{
    for (int start = 0; start <= source.Length - i; start++)
    {
        // HashSet.Add ignores duplicates, so no Contains check is needed
        theSet.Add(source.Substring(start, i));
    }
}
...
// Some not-noteworthy benchmark related code
...
int maxVal = 0;
foreach (var sub in subs)
{
var count = 0;
for (var i = 0; i < source.Length - sub.Length + 1; i++)
{
if (source.Substring(i, sub.Length).Equals(sub)) count++;
}
if (sub.Length * count > maxVal)
{
maxVal = sub.Length * count;
}
}
I know I am looking for a relatively unknown algorithm and/or data structure here, as Google yields no results that closely match the problem. In fact, Google is where I found the costly algorithms I have attempted to use in the above code.

Edit: Just realized that the problem has a solution on GFG: https://www.geeksforgeeks.org/substring-highest-frequency-length-product/
This can be solved in O(n) time by applying three well-known algorithms: Suffix Array, LCP Array and Largest Rectangular Area in a Histogram.
I will not provide any code as implementations of these algorithms can easily be found on the Internet. I will assume the input string is "banana" and try to explain the steps and how they work.
1. Run Suffix Array - O(n)
The Suffix Array algorithm sorts the suffixes of the string alphabetically. For the input "banana", the output is going to be the array [5, 3, 1, 0, 4, 2], where 5 corresponds to the suffix starting at position 5 ("a"), 3 corresponds to the suffix starting at position 3 ("ana"), 1 corresponds to the suffix starting at position 1 ("anana"), etc. After we compute this array, it becomes much easier to count the occurrences of a substring because the equal substrings are placed consecutively:
a
ana
anana
banana
na
nana
For example, we can immediately see that the substring "ana" occurs twice by looking at the 2nd and the 3rd suffixes in the above list. Similarly, we can say the substring "n" also occurs twice by looking at the 5th and the 6th.
2. Run LCP Array - O(n)
The LCP algorithm computes the length of the longest common prefix between every consecutive pair of suffixes in the suffix array. The output is going to be [1, 3, 0, 0, 2] for "banana":
a
ana // "a" and "ana" share the prefix "a", which is of length 1
anana // "ana" and "anana" share the prefix "ana", which is of length 3
banana // "anana" and "banana" share no prefix, so 0
na // "banana" and "na" share no prefix, so 0
nana // "na" and "nana" share the prefix "na", which is of length 2
Now if we plot the output of the LCP algorithm as a histogram:
 x
 x  x
xx  x
-----
01234
-----
aaabnn
 nnaaa
 aan n
  na a
  an
   a
Now, here is the main observation: every rectangle in the histogram that rests on the x axis corresponds to a substring and its occurrences: the rectangle's width is equal to s.count - 1 and its height is equal to s.length.
For example consider this rectangle in the lower left corner, that corresponds to the substring "a".
xx
--
01
The rectangle is of height 1, which is "a".length and of width 2, which is "a".count - 1. And the value we need (s.count * s.length) is almost the area of the rectangle.
3. Find the largest rectangle in the histogram - O(n)
Now all we need to do is to find the largest rectangle in the histogram to find the answer to the problem, with the simple nuance that while calculating the area of the rectangle we need to add 1 to its width. This can be done by simply adding a + 1 in the area calculation logic in the algorithm.
For the "banana" example, the largest rectangle is the following (considering we added +1 to every rectangle's width):
x
x
x
-
1
We add one to its width and calculate its area as 2 * 3 = 6, which equals the number of times the substring "ana" occurs multiplied by its length.
Each of the 3 steps takes O(n) time, for an overall time complexity of O(n).
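A compact sketch of the three steps, written as C# 9 top-level statements, might look like the following. Note the assumptions: the suffix array is built by plain sorting and the LCP array by direct comparison, so this version is simpler but slower than the O(n) algorithms named above. The function name MaxLengthTimesCount is made up for this example. The result starts at t.Length because the whole string always occurs once, a case the histogram (which only covers substrings occurring at least twice) does not include.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static int MaxLengthTimesCount(string t)
{
    int n = t.Length;
    if (n == 0) return 0;

    // Step 1: suffix array via ordinary sorting (NOT the O(n) construction).
    int[] sa = Enumerable.Range(0, n).ToArray();
    Array.Sort(sa, (a, b) => string.CompareOrdinal(t, a, t, b, n));

    // Step 2: LCP of each pair of neighbouring suffixes, compared directly.
    var lcp = new int[n - 1];
    for (int i = 0; i + 1 < n; i++)
    {
        int a = sa[i], b = sa[i + 1], len = 0;
        while (a + len < n && b + len < n && t[a + len] == t[b + len]) len++;
        lcp[i] = len;
    }

    // Step 3: largest rectangle in the histogram (stack-based),
    // with 1 added to the width when computing the area.
    int best = n; // the whole string: length n, one occurrence
    var stack = new Stack<int>(); // indices of bars with non-decreasing heights
    for (int i = 0; i <= lcp.Length; i++)
    {
        int h = i < lcp.Length ? lcp[i] : 0; // a final bar of height 0 empties the stack
        while (stack.Count > 0 && lcp[stack.Peek()] >= h)
        {
            int height = lcp[stack.Pop()];
            int left = stack.Count > 0 ? stack.Peek() + 1 : 0;
            int width = i - left;
            best = Math.Max(best, height * (width + 1));
        }
        stack.Push(i);
    }
    return best;
}

Console.WriteLine(MaxLengthTimesCount("banana")); // 6
Console.WriteLine(MaxLengthTimesCount("aaaaaa")); // 12
```

For inputs of length 100,000 the sorting step would need to be replaced by a real suffix array construction; the histogram step can stay as it is.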

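This does the trick, although it is not very efficient: counting one needle takes O(n * m) for a content of length n and a needle of length m, and it has to be repeated for every candidate substring. I can't think of a more efficient way, though...
static void TestRegexes()
{
    var n = CountSubs("aaaaaa", "a");           // 6
    var nn = CountSubs("aaaaaa", "aa");         // 5
    var nnn = CountSubs("aaaaaa", "aaa");       // 4
    var nnnn = CountSubs("aaaaaa", "aaaa");     // 3
    var nnnnn = CountSubs("aaaaaa", "aaaaa");   // 2
    var nnnnnn = CountSubs("aaaaaa", "aaaaaa"); // 1
}
// Counts the occurrences of needle in content, including overlapping ones.
private static int CountSubs(string content, string needle)
{
    int count = 0;
    // Compare in place at every start position instead of allocating
    // a new string with Substring(1) on each step.
    for (int i = 0; i + needle.Length <= content.Length; i++)
    {
        if (string.CompareOrdinal(content, i, needle, 0, needle.Length) == 0)
        {
            count++;
        }
    }
    return count;
}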

C# locating where the * is in a string separated by pipes

I have to find where a * is in a string. It can be absent, or in the 1st, 2nd, or 3rd position.
The positions are separated by pipes |
Thus
No * wildcard would be
`ABC|DEF|GHI`
However, while that could be 1 scenario, the other 3 are
string testPosition1 = "*|DEF|GHI";
string testPosition2 = "ABC|*|GHI";
string testPosition3 = "ABC|DEF|*";
I gather that I should use IndexOf, but it seems I should take the | (pipe) into account to know the position, not just the character index, as the values in each of the 3 places could be long or short. So I just want to end up knowing whether * is in the first, second, or third position (or not there at all).
This is what I was doing, but it doesn't tell me whether the * is before the 1st or 2nd pipe:
if(testPosition1.IndexOf("*") > 0)
{
// Look for pipes?
}

There are lots of ways you could approach this. The most readable might actually just be to do it the hard way (i.e. scan the string to find the first '*' character, keeping track of how many '|' characters you see along the way).
That said, this could be similarly readable and more concise:
int wildcardPosition = Array.IndexOf(testPosition1.Split('|'), "*");
Returns -1 if not found, otherwise 0-based index for which segment of the '|' delimited string contains the wildcard string.
This only works if the wildcard is exactly the one-character string "*". If you need to support other variations on that, you will still want to split the string, but then you can loop over the array looking for whatever criteria you need.
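As a sketch of that last suggestion, the loop can be handed to Array.FindIndex with whatever predicate is needed. The example below (C# 9 top-level statements) assumes the criterion "the segment contains a * anywhere"; the helper name WildcardSegment is made up for illustration.

```csharp
using System;

// Hypothetical helper: 0-based index of the first '|'-separated segment that
// contains a '*' anywhere in it, or -1 if no segment does.
static int WildcardSegment(string input)
{
    string[] segments = input.Split('|');
    return Array.FindIndex(segments, s => s.IndexOf('*') >= 0);
}

Console.WriteLine(WildcardSegment("ABC|DEF|GHI")); // -1
Console.WriteLine(WildcardSegment("*|DEF|GHI"));   // 0
Console.WriteLine(WildcardSegment("ABC|D*F|GHI")); // 1
```

Swapping the lambda for a different test (for example s == "*" or s.StartsWith("*")) changes the criterion without touching the rest.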

You can try with linq splitting the string at the pipe character and then getting the index of the element that contains just a *
var x = testPosition2.Split('|').Select((k, i) => new { text = k, index = i}).FirstOrDefault(p => p.text == "*" );
if(x != null) Console.WriteLine(x.index);
The first line splits the string at the pipe, creating an array of strings. This sequence is passed to the Select extension, which enumerates it and passes along each string (k) and its index (i). With these two parameters we build a sequence of anonymous objects with two properties (text and index). FirstOrDefault extracts from this sequence the first object whose text equals *, and we can print the index property of that object.

The other answers are fine (and likely better), however here is another approach, the good old fashioned for loop and the try-get pattern
public bool TryGetStar(string input, out int index)
{
var split = input.Split('|');
for (index = 0; index < split.Length; index++)
if (split[index] == "*")
return true;
return false;
}
Or, if you were dealing with large strings and trying to save allocations, you could remove the Split entirely and use a single O(n) pass:
public bool TryGetStar(string input, out int index)
{
index = 0;
for (var i = 0; i < input.Length; i++)
if (input[i] == '|') index++;
else if (input[i] == '*') return true;
return false;
}
Note : if performance was a consideration, you could also use unsafe and pointers, or Span<Char> which would afford a small amount of efficiency.

Try DotNETFiddle:
testPosition.IndexOf("*") - testPosition.Replace("|","").IndexOf("*")
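Find the index of the wildcard ("*") and see how far it moves if you remove the pipe ("|") characters. The result is a zero-based index. Note that when there is no wildcard at all, both IndexOf calls return -1 and the result is 0, the same as for the first position, so check testPosition.IndexOf("*") >= 0 first.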

From the question you have the following code segment:
if(testPosition1.IndexOf("*") > 0)
{
}
If you're now inside the if statement, you're sure the asterisk exists.
From that point, an efficient solution could be to check the first two chars, and the last two chars.
if (testPosition1.IndexOf("*") >= 0) // >= 0, because the asterisk may be the very first character
{
    if (testPosition1[0] == '*' && testPosition1[1] == '|')
    {
        // First position.
    }
    else if (testPosition1[testPosition1.Length - 1] == '*' && testPosition1[testPosition1.Length - 2] == '|')
    {
        // Third (last) position.
    }
    else
    {
        // Second position.
    }
}
This assumes that no more than one * can exist, and also that if an * exists, it can only be surrounded by pipes. For example, I assume an input like ABC|DEF|G*H is invalid.
If you want to remove these assumptions, you could do a one-pass loop over the string, keeping track of the necessary information.

How to treat integers from a string as multi-digit numbers and not individual digits?

My input is a string of integers, which I have to check whether they are even and display them on the console, if they are. The problem is that what I wrote checks only the individual digits and not the numbers.
string even = "";
while (true)
{
string inputData = Console.ReadLine();
if (inputData.Equals("x", StringComparison.OrdinalIgnoreCase))
{
break;
}
for (int i = 0; i < inputData.Length; i++)
{
if (inputData[i] % 2 == 0)
{
even +=inputData[i];
}
}
}
foreach (var e in even)
Console.WriteLine(e);
bool something = string.IsNullOrEmpty(even);
if( something == true)
{
Console.WriteLine("N/A");
}
For example, if the input is:
12
34
56
my output is going to be
2
4
6 (every number needs to be displayed on a new line).
What am I doing wrong? Any help is appreciated.

Use string.Split to get the independent sections and then int.TryParse to check if it is a number (check Parse v. TryParse). Then take only even numbers:
var evenNumbers = new List<int>();
foreach(var s in inputData.Split(" "))
{
if(int.TryParse(s, out var num) && num % 2 == 0)
evenNumbers.Add(num); // If can't use collections: Console.WriteLine(num);
}
(notice the use of out vars introduced in C# 7.0)
If you can use linq then similar to this answer:
var evenNumbers = inputData.Split(" ")
    .Select(s => (int.TryParse(s, out var value), value))
    .Where(pair => pair.Item1 && pair.value % 2 == 0) // parsed successfully and even
    .Select(pair => pair.value);

I think you do too many things here at once. Instead of already checking if the number is even, it is better to solve one problem at a time.
First we can make substrings by splitting the string into "words". Next we convert every substring to an int, and finally we filter on even numbers, like:
var words = inputData.Split(' ');                 // split the words by a space
var intwords = words.Select(int.Parse);           // convert these to ints
var evenwords = intwords.Where(x => x % 2 == 0);  // check if these are even
foreach (var even in evenwords)                   // print the even numbers
{
    Console.WriteLine(even);
}
Here it can still happen that some "words" are not integers, for example "12 foo 34". So you will need to implement some extra filtering between splitting and converting.
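One possible way to add that extra filtering is to use int.TryParse in the loop, so that non-numeric words are skipped instead of throwing. The sketch below uses C# 9 top-level statements; the helper name EvenNumbers is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns the even integers found in a line of
// space-separated words, ignoring anything that is not a valid int.
static List<int> EvenNumbers(string line)
{
    var result = new List<int>();
    foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        if (int.TryParse(word, out int number) && number % 2 == 0)
            result.Add(number);
    }
    return result;
}

foreach (int n in EvenNumbers("12 foo 34 7 56"))
    Console.WriteLine(n); // prints 12, 34 and 56, each on its own line
```

Because each whole word is parsed, 12 is treated as one number rather than as the digits 1 and 2, which was the problem in the question.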

How do I initialise a string array in c# with values from "AAAAAA" to "ZZZZZZ" in order

I want to easily pre-populate a single dimensional string array which I am calling "letters" with the values:
AAAAAA
AAAAAB
AAAAAC
AAAAAD
..
..
ZZZZZX
ZZZZZY
ZZZZZZ
That's about 309 million combinations (26^6 = 308,915,776) in order.
The idea being I need to then be able to ask for any particular combination of 6 characters such as BBCHHJ and use Array.IndexOf to return the index of the array element it is in.
I have the second bit fine:
String searchFor;
Console.Write("Enter a string value to search for: ");
searchFor = Console.ReadLine();
int indexValue = Array.IndexOf(letters, searchFor);
Console.WriteLine("The value you are after is in element index: " + indexValue);
Console.ReadLine();
But I have no idea how to easily initialise the letters array with all those combinations, in order!

A variation on Jakub's answer which should be a bit more efficient:
int result = s
.Select(c => c - 'A') // map 'A'-'Z' to 0-25
.Aggregate(0, (total, next) => total * 26 + next); // calculate the base 26 value
This has the advantage of avoiding the Reverse and the separate Sum, and the powers of 26 don't have to be calculated from scratch in each iteration.

Storing 308 million elements in array and searching them is not the best solution, rather calculate the index at runtime. I have created a code sample:
string input = "ZZZZZZ";
//default values
string alphabets_s = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
char[] alphabets = alphabets_s.ToCharArray();
int result = 0;
//calculating index
for (int i = 0; i < input.Length; i++)
{
//get character index: A -> 0, B -> 1, ..., Z -> 25
int index = Array.IndexOf(alphabets, input[i]);
//shift the current result by one base-26 digit and add the new digit
//(multiplying the character indexes together would give the same result for e.g. "AAAAAB" and "AAAABA")
result = result * 26 + index;
}
PS: I did just basic testing, please inform me if you find anything wrong in it.

As I wrote in a comment, what you're trying to achieve is basically conversion from base 26 number.
The first step is to convert the string to a list of digits. Then just multiply by powers of 26 and add together:
var s = "AAAABB";
var result = s
.Select(c => c - 'A') //map characters to numbers: A -> 0, B -> 1 etc
.Reverse() //reverse the sequence to have the least significant digit first
.Select((d, i) => d * Math.Pow(26, i))
.Sum();
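Since the conversion works in both directions, the array is not needed at all. Here is a minimal sketch of both directions using plain integer arithmetic, written as C# 9 top-level statements. The names ToIndex and FromIndex are made up for this example, and the input is assumed to consist only of the upper-case letters A to Z.

```csharp
using System;

// Hypothetical helper: reads the string as a base-26 number (A = 0 ... Z = 25).
static int ToIndex(string s)
{
    int index = 0;
    foreach (char c in s)
        index = index * 26 + (c - 'A');
    return index;
}

// Hypothetical helper: the inverse, i.e. the string that would be stored at
// the given position of the array, padded with leading 'A's to the given length.
static string FromIndex(int index, int length = 6)
{
    var chars = new char[length];
    for (int pos = length - 1; pos >= 0; pos--)
    {
        chars[pos] = (char)('A' + index % 26);
        index /= 26;
    }
    return new string(chars);
}

Console.WriteLine(ToIndex("AAAAAB")); // 1
Console.WriteLine(ToIndex("AAAABA")); // 26
Console.WriteLine(FromIndex(26));     // AAAABA
```

The largest value, ToIndex("ZZZZZZ"), is 26^6 - 1 = 308,915,775, which still fits in an int.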

What's the best way to split a list of strings to match first and last letters?

I have a long list of words in C#, and I want to find all the words within that list that have the same first and last letters and that have a length of between, say, 5 and 7 characters. For example, the list might have:
"wasted was washed washing was washes watched watches wilts with wastes wits washings"
It would return
Length: 5-7, First letter: w, Last letter: d, "wasted, washed, watched"
Length: 5-7, First letter: w, Last letter: s, "washes, watches, wilts, wastes"
Then I might change the specification for a length of 3-4 characters which would return
Length: 3-4, First letter: w, Last letter: s, "was, wits"
I found this method of splitting which is really fast, made each item unique, used the length and gave an excellent start:
Spliting string into words length-based lists c#
Is there a way to modify/use that to take account of first and last letters?
EDIT
I originally asked about the 'fastest' way because I usually solve problems like this with lots of string arrays (which are slow and involve a lot of code). LINQ and lookups are new to me, but I can see that the ILookup used in the solution I linked to is amazing in its simplicity and is very fast. I don't actually need the minimum processor time. Any approach that avoids me creating separate arrays for this information would be fantastic.

this one liner will give you groups with same first/last letter in your range
int min = 5;
int max = 7;
var results = str.Split()
.Where(s => s.Length >= min && s.Length <= max)
.GroupBy(s => new { First = s.First(), Last = s.Last()});

var minLength = 5;
var maxLength = 7;
var firstPart = "w";
var lastPart = "d";
var words = new List<string> { "washed", "wash" }; // so on
var matches = words.Where(w => w.Length >= minLength && w.Length <= maxLength &&
w.StartsWith(firstPart) && w.EndsWith(lastPart))
.ToList();
For the most part, this should be fast enough, unless you're dealing with tens of thousands of words and worrying about milliseconds; then we can look further.

Just in LINQPad I created this:
void Main()
{
var words = new []{"wasted", "was", "washed", "washing", "was", "washes", "watched", "watches", "wilts", "with", "wastes", "wits", "washings"};
var firstLetter = "w";
var lastLetter = "d";
var minimumLength = 5;
var maximumLength = 7;
var sortedWords = words.Where(w => w.StartsWith(firstLetter) && w.EndsWith(lastLetter) && w.Length >= minimumLength && w.Length <= maximumLength);
sortedWords.Dump();
}
If that isn't fast enough, I would create a lookup table:
Dictionary<char, Dictionary<char, List<string>>> lookupTable;
and do:
lookupTable[firstLetter][lastLetter].Where(<check length>)
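Filling such a lookup table could look roughly like this. It is a sketch in C# 9 top-level statements under a few assumptions: the helper name BuildLookup is made up, empty strings are skipped, and duplicates are removed with Distinct.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups words first by first letter, then by last letter.
static Dictionary<char, Dictionary<char, List<string>>> BuildLookup(IEnumerable<string> source)
{
    var table = new Dictionary<char, Dictionary<char, List<string>>>();
    foreach (string word in source.Where(w => w.Length > 0).Distinct())
    {
        char first = word[0], last = word[word.Length - 1];
        if (!table.TryGetValue(first, out var byLast))
            table[first] = byLast = new Dictionary<char, List<string>>();
        if (!byLast.TryGetValue(last, out var list))
            byLast[last] = list = new List<string>();
        list.Add(word);
    }
    return table;
}

var wordList = "wasted was washed washing was washes watched watches wilts with wastes wits washings".Split(' ');
var lookup = BuildLookup(wordList);
var found = lookup['w']['d'].Where(w => w.Length >= 5 && w.Length <= 7);
Console.WriteLine(string.Join(", ", found)); // wasted, washed, watched
```

Indexing with a letter pair that never occurs throws a KeyNotFoundException, so real code would probably use TryGetValue for the query as well.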

Here's a method that does exactly what you want. You are only given a list of strings and the min/max length, correct? You aren't given the first and last letters to filter on. This method processes all the first/last letters in the strings.
private static void ProcessInput(string[] words, int minLength, int maxLength)
{
var groups = from word in words
where word.Length > 0 && word.Length >= minLength && word.Length <= maxLength
let key = new Tuple<char, char>(word.First(), word.Last())
group word by key into @group
orderby Char.ToLowerInvariant(@group.Key.Item1), @group.Key.Item1, Char.ToLowerInvariant(@group.Key.Item2), @group.Key.Item2
select @group;
Console.WriteLine("Length: {0}-{1}", minLength, maxLength);
foreach (var group in groups)
{
Console.WriteLine("First letter: {0}, Last letter: {1}", group.Key.Item1, group.Key.Item2);
foreach (var word in group)
Console.WriteLine("\t{0}", word);
}
}

Just as a quick thought, I have no clue if this would be faster or more efficient than the linq solutions posted, but this could also be done fairly easily with regular expressions.
For example, if you wanted to get 5-7 letter length words that begin with "w" and end with "s", you could use a pattern along the lines of:
\bw[A-Za-z]{3,5}s\b
(and this could fairly easily be made to be more variable driven - For example, have a variable for first letter, min length, max length, last letter and plug them in to the pattern to replace w, 3, 5 & s)
Then, using the Regex library, you could just take your matches to be your list.
Again, I don't know how this compares efficiency-wise to linq, but I thought it might deserve mention.
Hope this helps!!
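A variable-driven version of that pattern might be sketched as follows (C# 9 top-level statements). The helper name FindWords is hypothetical, words are assumed to consist of ASCII letters only, and minLength is assumed to be at least 2 because the first and last letters are matched separately.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds the pattern from the four parameters and
// returns the distinct matching words in order of appearance.
static string[] FindWords(string text, char first, char last, int minLength, int maxLength)
{
    // The first and last letters take up two characters, so the middle
    // part has between minLength - 2 and maxLength - 2 letters.
    string pattern = @"\b" + Regex.Escape(first.ToString())
        + "[A-Za-z]{" + (minLength - 2) + "," + (maxLength - 2) + "}"
        + Regex.Escape(last.ToString()) + @"\b";

    return Regex.Matches(text, pattern)
        .Cast<Match>()
        .Select(m => m.Value)
        .Distinct()
        .ToArray();
}

string input = "wasted was washed washing was washes watched watches wilts with wastes wits washings";
Console.WriteLine(string.Join(", ", FindWords(input, 'w', 's', 5, 7)));
// washes, watches, wilts, wastes
```

With the parameters 'w', 's', 3 and 4 the same call returns was and wits, matching the second example in the question.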
