regular expression for finding parts of a string within another - c#

I have two strings: the first's value is "catdog" and the second's is "got".
I'm trying to find a regex that tells me if the letters for "got" are in "catdog". I'm particularly looking to avoid the case where there are duplicate letters. For example, I know "got" is a match, however "gott" is not a match because there are not two "t" in "catdog".
EDIT:
Based on Adam's response below this is the C# code I got to work in my solution. Thanks to all those that responded.
Note: I had to convert the char to int and subtract 97 to get the appropriate index for the array. In my case the letters are always lower case.
private bool CompareParts(string a, string b)
{
    int[] count1 = new int[26];
    int[] count2 = new int[26];
    foreach (var item in a.ToCharArray())
        count1[(int)item - 97]++;
    foreach (var item in b.ToCharArray())
        count2[(int)item - 97]++;
    for (int i = 0; i < count1.Length; i++)
        if (count2[i] > count1[i])
            return false;
    return true;
}

You're using the wrong tool for the job. This is not something regular expressions are capable of handling easily. Fortunately, it's relatively easy to do this without regular expressions. You just count up the number of occurrences of each letter within both strings, and compare the counts between the two strings - if for each letter of the alphabet, the count in the first string is at least as large as the count in the second string, then your criteria are satisfied. Since you didn't specify a language, here's an answer in pseudocode that should be easily translatable into your language:
bool containsParts(string1, string2)
{
    count1 = array of 26 0's
    count2 = array of 26 0's
    // Note: be sure to check for and ignore non-alphabetic characters,
    // and do case conversion if you want to do it case-insensitively
    for each character c in string1:
        count1[c]++
    for each character c in string2:
        count2[c]++
    for each character c in 'a'...'z':
        if count1[c] < count2[c]:
            return false
    return true
}

Previous answers have already suggested that regex isn't the best way to do this, and I agree. However, the accepted answer is a little verbose considering what you're trying to achieve, which is to test whether one set of letters is a subset of another set of letters.
Consider the following code, which achieves this in a single line:
MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
Which can be used as follows:
public bool IsSubSetOf(string InputString, string MatchString)
{
    var InputChars = InputString.ToList();
    MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
    return InputChars.Count == 0;
}
You can then just call this method to verify if it's a subset or not.
What is interesting here is that "got" will return a list with no items because each item in the match string only appears once, but "gott" will return a list with a single item because there would only be a single call to remove the "t" from the list. Consequently you would have an item left in the list. That is, "gott" is not a subset of "catdog" but "got" is.
You could take it one step further and put the method into a static class:
using System;
using System.Linq;
using System.Runtime.CompilerServices;
static class extensions
{
    public static bool IsSubSetOf(this string InputString, string MatchString)
    {
        var InputChars = InputString.ToList();
        MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
        return InputChars.Count == 0;
    }
}
This turns your method into an extension of the string type, which makes things a lot easier in the long run, because you can now make your calls like so:
Console.WriteLine("gott".IsSubSetOf("catdog"));

I don't think there is a sane way to do this with regular expressions. The insane way would be to write out all the permutations:
/^(c?a?t?d?o?g?|c?a?t?d?g?o?| ... )$/
Now, with a little trickery you could do this with a few regexps (example in Perl, untested):
$foo = 'got';
$foo =~ s/c//;
$foo =~ s/a//;
...
$foo =~ s/d//;
# if $foo is now empty, it passes the test.
Sane people would use a loop, of course:
$foo = 'got';
foreach $l (split(//, 'catdog')) {
    $foo =~ s/$l//;
}
# if $foo is now empty, it passes the test.
There are much better performing ways to pull this off, of course, but they don't use regexps. And there are no doubt ways to do it if e.g., you can use Perl's extended regexp features like embedded code.

You want a string that matches exactly those letters, exactly once. It depends on what you're writing the regex in, but it's going to be something like
^[^got]*(g|o|t)[^got]$
If you've got an operator for "exactly one match" that will help.

Charlie Martin almost has it right, but you have to do a complete pass for each letter. You can do that with a single regex by using lookaheads for all but the last pass:
/^
(?=[^got]*g[^got]*$)
(?=[^got]*o[^got]*$)
[^got]*t[^got]*
$/x
This makes a nice exercise for honing your regex skills, but if I had to do this in real-life, I wouldn't do it this way. A non-regex approach will require a lot more typing, but any minimally competent programmer will be able to understand and maintain it. If you use a regex, that hypothetical maintainer will also have to be more-than-minimally competent at regexes.
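As an illustration only, the lookahead idea can be extended to cope with repeated letters by generating one lookahead per distinct letter. The Python sketch below is not taken from the answer above: `build_pattern` and `contains_letters` are hypothetical names, and it assumes the goal stated in the question (every letter of the second string must be available in the first, counting duplicates).

```python
import re
from collections import Counter


def build_pattern(needle: str) -> str:
    """Build a regex with one lookahead per distinct letter of `needle`
    (hypothetical helper, not part of any library)."""
    lookaheads = []
    for letter, n in sorted(Counter(needle).items()):
        esc = re.escape(letter)
        # (?=(?:[^x]*x){n}) succeeds only if 'x' occurs at least n times
        lookaheads.append(f"(?=(?:[^{esc}]*{esc}){{{n}}})")
    return "^" + "".join(lookaheads)


def contains_letters(haystack: str, needle: str) -> bool:
    """True if `haystack` has every letter of `needle`, duplicates included."""
    return re.search(build_pattern(needle), haystack) is not None


print(contains_letters("catdog", "got"))   # True
print(contains_letters("catdog", "gott"))  # False
```

The generated pattern grows with the number of distinct letters, so this is probably best treated as an exercise rather than a replacement for plain counting.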

@Adam Rosenfield's solution in Python:
from collections import defaultdict

def count(iterable):
    c = defaultdict(int)
    for hashable in iterable:
        c[hashable] += 1
    return c

def can_spell(word, astring):
    """Whether `word` can be spelled using `astring`'s characters."""
    count_string = count(astring)
    count_word = count(word)
    return all(count_string[c] >= count_word[c] for c in word)
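A shorter variant of the same counting idea, sketched with `collections.Counter` from the standard library. The function name `can_spell_counter` is just an illustrative choice; the behaviour is assumed to match the counting approach described in this thread.

```python
from collections import Counter


def can_spell_counter(word: str, letters: str) -> bool:
    """True if `word` can be built from the characters in `letters`."""
    # Counter subtraction keeps only positive counts, so the result is
    # empty exactly when `letters` covers every character of `word`.
    return not (Counter(word) - Counter(letters))


print(can_spell_counter("got", "catdog"))   # True
print(can_spell_counter("gott", "catdog"))  # False
```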

The best way to do it with regular expressions is, IMO:
A. Sort the characters in the large string (search space)
Thus: turn "catdog" into "acdgot"
B. Do the same with the string whose characters you are searching for: "gott" becomes, eh, "gott"...
C. Insert ".*" between each of these characters.
D. Use the latter as the regular expression to search in the former.
For example, some Perl code (if you don't mind):
$main = "catdog"; $search = "gott";
# break into individual characters, sort, and reconcatenate
$main = join '', sort split //, $main;
$regexp = join ".*", sort split //, $search;
print "Debug info: search in '$main' for /$regexp/ \n";
if ($main =~ /$regexp/) {
    print "Found a match!\n";
} else {
    print "Sorry, no match...\n";
}
This prints:
Debug info: search in 'acdgot' for /g.*o.*t.*t/
Sorry, no match...
Drop one "t" and you get a match.
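The same sort-then-match steps can be sketched in Python. This is only an illustration: `sorted_match` is a hypothetical name, and `re.escape` is added on the assumption that the search string might contain regex metacharacters, which the Perl snippet does not guard against.

```python
import re


def sorted_match(main: str, search: str) -> bool:
    """Sort both strings, join the search letters with '.*', then match."""
    main_sorted = "".join(sorted(main))
    # Escape each character so that e.g. '.' or '*' is taken literally
    pattern = ".*".join(re.escape(ch) for ch in sorted(search))
    return re.search(pattern, main_sorted, re.DOTALL) is not None


print(sorted_match("catdog", "gott"))  # False
print(sorted_match("catdog", "got"))   # True
```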

Related

How do I get rid of double spaces without regex or any external methods?

This is an extra exercise given to us on our Uni course, in which we need to find whether a sentence contains a palindrome or not. Whereas finding if a word is a palindrome or not is fairly easy, there could be a situation where the given sentence looks like this: "Dog cat - kajak house". My logic is to, using functions I already wrote, first determine if a character is a letter or not, if not delete it. Then count number of spaces+1 to find out how many words there are in a sentence, prepare an array of those words and then cast a function that checks if a word is palindrome on every element of an array. However, the double space would mess everything up on a "counting" phase. I've spent around an hour fiddling with code to do this, however I can't wrap my head around this. Could anyone help me? Note that I'm not supposed to use any external methods or libraries. I've done this using RegEx and it was fairly easy, however I'd like to do this "legally". Thanks in advance!
Just split on spaces; the trick is to remove the empty entries (look up StringSplitOptions.RemoveEmptyEntries),
then join the pieces back together with one clean space.
You could copy all the characters you are interested in to a new string:
var copy = new StringBuilder();
// Indicates whether the previous character was whitespace
var whitespace = true;
foreach (var character in originalString)
{
    if (char.IsLetter(character))
    {
        copy.Append(character);
        whitespace = false;
    }
    else if (char.IsWhiteSpace(character) && !whitespace)
    {
        copy.Append(character);
        whitespace = true;
    }
    else
    {
        // Ignore other characters
    }
}
If you're looking for the most rudimentary/olde-worlde option, this will work. But no-one would actually do this, other than in an illustrative manner.
string test = "Dog cat  kajak house"; // Your string minus any non-letters
int prevLen;
do
{
    prevLen = test.Length;
    test = test.Replace("  ", " "); // Repeat-replace double spaces until none left
} while (prevLen > test.Length);
Personally, I'd probably do the following to end up with an array of words to check:
string[] words = test.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
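For comparison, here is a rough Python sketch of the copy-only-what-you-need approach followed by the palindrome check the question describes. The names `normalize` and `palindromes` are made up for this example, and treating single-letter words as non-palindromes is an assumption, not something the question specifies.

```python
def normalize(sentence: str) -> str:
    """Keep letters, drop other symbols, collapse runs of whitespace."""
    out = []
    prev_space = True  # start as "whitespace" so leading gaps are dropped
    for ch in sentence:
        if ch.isalpha():
            out.append(ch)
            prev_space = False
        elif ch.isspace() and not prev_space:
            out.append(" ")
            prev_space = True
        # any other character (e.g. '-') is ignored
    return "".join(out).rstrip()


def palindromes(sentence: str) -> list:
    """Return the words of `sentence` that read the same both ways."""
    words = normalize(sentence).split(" ")
    return [w for w in words if len(w) > 1 and w.lower() == w.lower()[::-1]]


print(normalize("Dog  cat -  kajak house"))    # Dog cat kajak house
print(palindromes("Dog  cat -  kajak house"))  # ['kajak']
```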

Check array for string that starts with given one (ignoring case)

I am trying to see if my string starts with a string in an array of strings I've created. Here is my code:
string x = "Table a";
string y = "a table";
string[] arr = new string[] { "table", "chair", "plate" };
if (arr.Contains(x.ToLower())) {
    // this should be true
}
if (arr.Contains(y.ToLower())) {
    // this should be false
}
How can I make it so my if statement comes up true? I'd like to just match the beginning of string x to the contents of the array while ignoring the case and the following characters. I thought I needed regex to do this but I could be mistaken. I'm a bit of a newbie with regex.
It seems you want to check if your string contains an element from your list, so this should be what you are looking for:
if (arr.Any(c => x.ToLower().Contains(c)))
Or simpler:
if (arr.Any(x.ToLower().Contains))
Or based on your comments you may use this:
if (arr.Any(x.ToLower().Split(' ')[0].Contains))
Because you said you want regex...
you can set up a regex with var regex = new Regex("(table|plate|fork)");
and check it with if (regex.IsMatch(myString)) { ... }
But for the issue at hand you don't have to use Regex, as you are searching for an exact substring. You can use
(as @S.Akbari mentioned): if (arr.Any(c => x.ToLower().Contains(c))) { ... }
Enumerable.Contains matches exact values (and there is no built-in comparer that checks for "starts with"), so you need Any, which takes a predicate that receives each array element as a parameter and performs the check. The first step is to turn "contains" the other way around, so that the given string contains an element from the array:
var myString = "some string";
if (arr.Any(arrayItem => myString.Contains(arrayItem)))...
Now, you are actually asking for "string starts with given word" and not just "contains", so you obviously need StartsWith (which, unlike Contains, conveniently lets you specify case sensitivity; see Case insensitive 'Contains(string)'):
if (arr.Any(arrayItem => myString.StartsWith(
arrayItem, StringComparison.CurrentCultureIgnoreCase))) ...
Note that this code will accept "tableAAA bob". If you really need to break on a word boundary, a regular expression may be a better choice. Building regular expressions dynamically is trivial as long as you properly escape all the values.
Regex should be
beginning of string - ^
properly escaped word you are searching for - Escape Special Character in Regex
word break - \b
if (arr.Any(arrayItem => Regex.IsMatch(myString,
    String.Format(@"^{0}\b", Regex.Escape(arrayItem)),
    RegexOptions.IgnoreCase))) ...
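To make the "start of string, escaped word, word break" recipe concrete, here is a small Python sketch of the same check. `starts_with_word` is a hypothetical helper; it assumes every array item ends in a word character, since `\b` behaves differently after punctuation.

```python
import re


def starts_with_word(text: str, words) -> bool:
    """True if `text` begins with one of `words` as a whole word,
    ignoring case."""
    return any(
        # re.match anchors at the start of the string, like ^ in the pattern
        re.match(re.escape(word) + r"\b", text, re.IGNORECASE) is not None
        for word in words
    )


arr = ["table", "chair", "plate"]
print(starts_with_word("Table a", arr))       # True
print(starts_with_word("a table", arr))       # False
print(starts_with_word("tableAAA bob", arr))  # False
```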
You can do something like the following using TypeScript. Instead of startsWith you can also use includes, an equality check, etc.
public namesList: Array<string> = ['name1','name2','name3','name4','name5'];
// SomeString = 'name1, Hello there';
private isNamePresent(SomeString : string):boolean{
    if (this.namesList.find(name => SomeString.startsWith(name)))
        return true;
    return false;
}
I think I understand what you are trying to say here, although there is still some ambiguity. Are you trying to see if one word in your string (which is a sentence) exists in your array?
@Amy is correct, this might not have to do with Regex at all.
I think this segment of code will do what you want in Java (which can easily be translated to C#):
Java:
x = x.toLowerCase();
String[] words = x.split("\\s+");
for (String word : words) {
    for (String element : arr) {
        if (element.equals(word)) {
            return true;
        }
    }
}
return false;
You can also use a Set to store the elements in your array, which can make lookup more efficient.
Java:
x = x.toLowerCase();
String[] words = x.split("\\s+");
Set<String> set = new HashSet<>(Arrays.asList(arr));
for (String word : words) {
    if (set.contains(word)) {
        return true;
    }
}
return false;
Edit: (12/22, 11:05am)
I rewrote my solution in C#, thanks to reminders by @Amy and @JohnyL. Since the author only wants to match the first word of the string, this edited code should work :)
C#:
static bool Contains(string x, string[] arr)
{
    // Only the first word of the lower-cased string is compared
    string[] words = x.ToLower().Split(' ');
    var set = new HashSet<string>(arr);
    return set.Contains(words[0]);
}
Sorry my question was so vague but here is the solution thanks to some help from a few people that answered.
var regex = new Regex("^(table|chair|plate) *.*");
if (regex.IsMatch(x.ToLower())){}

How do you remove repeated characters in a string

I have a website which allows users to comment on photos.
Of course, users leave comments like:
'OMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG!!!!!!!!!!!!!!!'
or
'YOU SUCCCCCCCCCCCCCCCCCKKKKKKKKKKKKKKKKKK'
You get it.
Basically, I want to shorten those comments by removing at least most of those excess repeated characters.
I'm sure there's a way to do it with Regex... I just can't figure it out.
Any ideas?
Keeping in mind that the English language uses double letters often you probably don't want to blindly eliminate them. Here is a regex that will get rid of anything beyond a double.
Regex r = new Regex("(.)(?<=\\1\\1\\1)", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
var x = r.Replace("YOU SUCCCCCCCCCCCCCCCCCKKKKKKKKKKKKKKKKKK", String.Empty);
// x = "YOU SUCCKK"
var y = r.Replace("OMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG!!!!!!!!!!!!!!!", String.Empty);
// y = "OMGG!!"
Do you specifically want to shorten the strings in the code, or would it be enough to simply fail validation and present the form to the user again with a validation error? Something like "Too many repeated characters."
If the latter is acceptable, @"(\w)\1{2}" should match characters of 3 or more (interpreted as "repeated" two or more times).
Edit: As @Piskvor pointed out, this will match on exactly 3 characters. It works fine for matching, but not for replacing. His version, @"(\w)\1{2,}", would work better for replacing. However, I'd like to point out that I think replacing wouldn't be the best practice here. Better to just have the form fail validation than to try to scrub the text being submitted, because there likely will be edge cases where you turn otherwise readable (even if unreasonable) text into nonsense.
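As a sketch of how the `(\w)\1{2,}` pattern behaves for both validation and replacement, here is a short Python example. The helper names are invented for illustration, and keeping two characters per run is an assumption borrowed from the "anything beyond a double" idea earlier in the thread.

```python
import re

# A word character followed by two or more copies of itself
REPEATS = re.compile(r"(\w)\1{2,}")


def has_excess_repeats(text: str) -> bool:
    """Validation style: report whether any run of 3+ exists."""
    return REPEATS.search(text) is not None


def squeeze_repeats(text: str) -> str:
    """Replacement style: shrink every run of 3+ down to a double."""
    return REPEATS.sub(r"\1\1", text)


print(has_excess_repeats("YOU SUCCCCCKKKKK"))  # True
print(squeeze_repeats("YOU SUCCCCCKKKKK"))     # YOU SUCCKK
print(squeeze_repeats("OMGGGG!!!!"))           # OMGG!!!!
```

Note that `\w` does not match punctuation, so the exclamation marks are left alone; using `(.)` instead would cover them as well.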
var nonRepeatedChars = new string(myString.Distinct().Where(c => !char.IsWhiteSpace(c)).ToArray());
Regex would be overkill.
Try this:
public static string RemoveRepeatedChars(string input, int maxRepeat)
{
    if (input.Length == 0) return input;
    var b = new StringBuilder();
    char lastChar = input[0];
    int repeat = 1; // length of the current run of identical characters
    b.Append(lastChar);
    for (int i = 1; i < input.Length; i++)
    {
        if (input[i] == lastChar)
        {
            // Same run: keep the character only while under the limit
            if (++repeat <= maxRepeat)
                b.Append(input[i]);
        }
        else
        {
            // A new run starts here
            b.Append(input[i]);
            repeat = 1;
            lastChar = input[i];
        }
    }
    return b.ToString();
}
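The same run-limiting idea can be sketched without regex in Python using `itertools.groupby`. The function name mirrors the C# one but is otherwise an illustrative assumption.

```python
from itertools import groupby, islice


def remove_repeated_chars(text: str, max_repeat: int) -> str:
    """Keep at most `max_repeat` characters from each run of equal chars."""
    return "".join(
        # groupby yields one iterator per run; islice truncates the run
        "".join(islice(run, max_repeat))
        for _, run in groupby(text)
    )


print(remove_repeated_chars("YOU SUCCCCKKKK", 2))  # YOU SUCCKK
print(remove_repeated_chars("OMGGG!!!", 1))        # OMG!
```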
Distinct() will remove all duplicates, however it will not see "A" and "a" as the same, obviously.
Console.WriteLine(new string("Asdfasdf".Distinct().ToArray()));
Outputs "Asdfa"
var test = "OMMMMMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGMMM";
test.Distinct().Select(c => c.ToString()).ToList()
    .ForEach(c =>
    {
        while (test.Contains(c + c))
            test = test.Replace(c + c, c);
    }
);
Edit : awful suggestion, please don't read, I truly deserve my -1 :)
I found here on technical nuggets something like what you're looking for.
There's nothing to do except a very long regex, because I've never heard about a regex sign for repetition ...
It's a total example, I won't paste it here but I think this will totally answer your question.

Regex pattern for checking if a string starts with a certain substring?

What's the regular expression to check if a string starts with "mailto" or "ftp" or "joe" or...
Now I am using C# and code like this in a big if with many ors:
String.StartsWith("mailto:")
String.StartsWith("ftp")
It looks like a regex would be better for this. Or is there a C# way I am missing here?
You could use:
^(mailto|ftp|joe)
But to be honest, StartsWith is perfectly fine here. You could rewrite it as follows:
string[] prefixes = { "http", "mailto", "joe" };
string s = "joe:bloggs";
bool result = prefixes.Any(prefix => s.StartsWith(prefix));
You could also look at the System.Uri class if you are parsing URIs.
The following will match on any string that starts with mailto, ftp or http:
Regex reg = new Regex("^(mailto|ftp|http)");
To break it down:
^ matches start of line
(mailto|ftp|http) matches any of the items separated by a |
I would find StartsWith to be more readable in this case.
The StartsWith method will be faster, as there is no overhead of interpreting a regular expression, but here is how you do it:
if (Regex.IsMatch(theString, "^(mailto|ftp|joe):")) ...
The ^ matches the start of the string. You can put any protocols between the parentheses, separated by | characters.
edit:
Another approach that is much faster, is to get the start of the string and use in a switch. The switch sets up a hash table with the strings, so it's faster than comparing all the strings:
int index = theString.IndexOf(':');
if (index != -1) {
    switch (theString.Substring(0, index)) {
        case "mailto":
        case "ftp":
        case "joe":
            // do something
            break;
    }
}
For the extension method fans:
public static bool RegexStartsWith(this string str, params string[] patterns)
{
    return patterns.Any(pattern =>
        Regex.Match(str, "^(" + pattern + ")").Success);
}
Usage
var answer = str.RegexStartsWith("mailto","ftp","joe");
//or
var answer2 = str.RegexStartsWith("mailto|ftp|joe");
//or
bool startsWithWhiteSpace = " does this start with space or tab?".RegexStartsWith(@"\s");
I really recommend using the String.StartsWith method over the Regex.IsMatch if you only plan to check the beginning of a string.
Firstly, a regular expression in C# is a language within a language, which does not help understanding and code maintenance. Regular expressions are a kind of DSL.
Secondly, many developers do not understand regular expressions: they are simply not readable for many humans.
Thirdly, the StartsWith method gives you culture-dependent comparison, which regular expressions are not aware of.
In your case you should use regular expressions only if you plan implementing more complex string comparison in the future.
You can get the substring before ':' using a range (s[..index]) and the String.IndexOf method, which returns -1 if the searched substring does not exist. Then you can compare the result against constants combined with logical patterns (C# 9.0+) to check that the string really starts with one of the defined prefixes.
string s = "ftp:custom";
int index = s.IndexOf(':');
bool result = index > 0 && s[..index] is "mailto" or "ftp" or "joe";
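For readers comparing languages, the two non-regex strategies from this thread (check a list of prefixes, or cut at the first ':' and compare) look roughly like this in Python. Both function names and the default scheme list are assumptions made for the example.

```python
def starts_with_any(s: str, prefixes) -> bool:
    """Prefix check: str.startswith accepts a tuple of candidates."""
    return s.startswith(tuple(prefixes))


def has_known_scheme(s: str, schemes=("mailto", "ftp", "joe")) -> bool:
    """Cut at the first ':' and compare the part before it exactly."""
    head, sep, _ = s.partition(":")
    # sep is empty when there is no ':' in the string at all
    return sep == ":" and head in schemes


print(starts_with_any("joe:bloggs", ["http", "mailto", "joe"]))  # True
print(has_known_scheme("ftp:custom"))                            # True
print(has_known_scheme("ftps:custom"))                           # False
```

The second form is stricter: "ftps:custom" starts with "ftp" but its scheme is not "ftp", so only the prefix check would accept it.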

Regex to match multiple strings

I need to create a regex that can match multiple strings. For example, I want to find all the instances of "good" or "great". I found some examples, but what I came up with doesn't seem to work:
\b(good|great)\w*\b
Can anyone point me in the right direction?
Edit: I should note that I don't want to just match whole words. For example, I may want to match "ood" or "reat" as well (parts of the words).
Edit 2: Here is some sample text: "This is a really great story."
I might want to match "this" or "really", or I might want to match "eall" or "reat".
If you can guarantee that there are no reserved regex characters in your word list (or if you escape them), you could just use this code to make a big word list into @"(a|big|word|list)". There's nothing wrong with the | operator as you're using it, as long as those () surround it. It sounds like the \w* and the \b patterns are what are interfering with your matches.
String[] pattern_list = whatever;
String regex = String.Format("({0})", String.Join("|", pattern_list));
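A minimal Python sketch of the same "join the word list with |" idea, with escaping added for the case where the list may contain reserved characters. `build_alternation` is a hypothetical helper; sorting longest-first is an extra assumption so that a word is preferred over its own prefix.

```python
import re


def build_alternation(words):
    """Compile a pattern that matches any of `words` anywhere in a text."""
    # Longest first, so "goodness" would win over "good" at the same spot
    escaped = [re.escape(w) for w in sorted(words, key=len, reverse=True)]
    return re.compile("(?:" + "|".join(escaped) + ")")


text = "This is a really great story."
print(build_alternation(["eall", "reat"]).findall(text))  # ['eall', 'reat']
print(build_alternation(["a.b"]).findall("a.b axb"))      # ['a.b']
```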
(good)*(great)*
after your edit:
\b(g*o*o*d*)*(g*r*e*a*t*)*\b
I think you are asking for something you don't really mean.
If you want to search for any part of the word, you are literally searching for letters.
e.g. Search {Jack, Jim} in "John and Shelly are cool"
is searching all letters in the names {J,a,c,k,i,m}
*J*ohn *a*nd Shelly *a*re
and for that you don't need REG-EX :)
in my opinion,
A Suffix Tree can help you with that
http://en.wikipedia.org/wiki/Suffix_tree#Functionality
enjoy.
I'm not sure I understand the problem correctly:
If you want to match "great" or "reat" you can express this by a pattern like:
"g?reat"
This simply says that the "reat"-part must exist and the "g" is optional.
This would match "reat" and "great" but not "eat", because the first "r" in "reat" is required.
If you have the two words "great" and "good" and you want to match them both with an optional "g", you can write it like this:
(g?reat|g?ood)
And if you want to include a word-boundary like:
\b(g?reat|g?ood)
You should be aware that this would not match anything like "breat" because you have the "reat" but the "r" is not at the word boundary because of the "b".
So if you want to match whole words that contain a substring like "reat" or "ood" then you should try:
"\b\w*?(reat|ood)\w*\b"
This reads:
1. Beginning at a word boundary, match any number of word characters, but don't be greedy.
2. Matching "reat" or "ood" ensures that only those words are matched that contain one of them.
3. Match any number of word characters following "reat" or "ood" until the next word boundary is reached.
This will match:
"goodness", "good", "ood" (if a complete word)
It can be read as: Give me all complete words that contain "ood" or "reat".
Is that what you are looking for?
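Here is a quick Python sketch of the "give me all complete words that contain ood or reat" reading. The sample sentence and the name `words_containing` are made up for illustration, and the pattern uses `\w*` on both sides so the substring may also sit at the very end of a word.

```python
import re

# Whole words that contain "reat" or "ood" somewhere inside them
WORDS_CONTAINING = re.compile(r"\b\w*(?:reat|ood)\w*\b")


def words_containing(text: str) -> list:
    return WORDS_CONTAINING.findall(text)


print(words_containing("Goodness, what a great mood: ood."))
# ['Goodness', 'great', 'mood', 'ood']
```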
I'm not entirely sure that regex alone offers a solution for what you're trying to do. You could, however, use the following code to create a regular expression for a given word. Note that the resulting pattern has the potential to become very long and slow:
function wordPermutations( $word, $minLength = 2 )
{
    $perms = array( );
    for ($start = 0; $start < strlen( $word ); $start++)
    {
        for ($end = strlen( $word ); $end > $start; $end--)
        {
            $perm = substr( $word, $start, ($end - $start));
            if (strlen( $perm ) >= $minLength)
            {
                $perms[] = $perm;
            }
        }
    }
    return $perms;
}
Test Code:
$perms = wordPermutations( 'great', 3 ); // get all permutations of "great" that are 3 or more chars in length
var_dump( $perms );
echo ( '/\b('.implode( '|', $perms ).')\b/' );
Example Output:
array
0 => string 'great' (length=5)
1 => string 'grea' (length=4)
2 => string 'gre' (length=3)
3 => string 'reat' (length=4)
4 => string 'rea' (length=3)
5 => string 'eat' (length=3)
/\b(great|grea|gre|reat|rea|eat)\b/
Just check the boolean that Regex.IsMatch() returns.
if (Regex.IsMatch(line, "condition") && Regex.IsMatch(line, "condition2"))
This is true only when the line matches both patterns.
