I have a website which allows users to comment on photos.
Of course, users leave comments like:
'OMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG!!!!!!!!!!!!!!!'
or
'YOU SUCCCCCCCCCCCCCCCCCKKKKKKKKKKKKKKKKKK'
You get it.
Basically, I want to shorten those comments by removing at least most of those excess repeated characters.
I'm sure there's a way to do it with Regex; I just can't figure it out.
Any ideas?
Keeping in mind that the English language uses double letters often, you probably don't want to blindly eliminate them. Here is a regex that removes anything beyond a double.
Regex r = new Regex("(.)(?<=\\1\\1\\1)", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
var x = r.Replace("YOU SUCCCCCCCCCCCCCCCCCKKKKKKKKKKKKKKKKKK", String.Empty);
// x = "YOU SUCCKK"
var y = r.Replace("OMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG!!!!!!!!!!!!!!!", String.Empty);
// y = "OMGG!!"
Do you specifically want to shorten the strings in the code, or would it be enough to simply fail validation and present the form to the user again with a validation error? Something like "Too many repeated characters."
If the latter is acceptable, @"(\w)\1{2}" should match a run of three or more identical characters (that is, a character "repeated" two or more times).
Edit: As @Piskvor pointed out, this matches exactly three characters at a time. That is fine for matching, but not for replacing. His version, @"(\w)\1{2,}", would work better for replacing. However, I don't think replacing is the best practice here. It is better to have the form fail validation than to try to scrub the submitted text, because there will likely be edge cases where you turn otherwise readable (even if unreasonable) text into nonsense.
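To make the difference between the two routes concrete, here is a minimal sketch. The class and method names (RepeatFilter, HasExcessiveRepeats, CollapseRepeats) are made up for illustration; only the pattern comes from the answer above. Note that \w does not cover punctuation, so runs of "!" are left alone in this version.

```csharp
using System;
using System.Text.RegularExpressions;

static class RepeatFilter
{
    // Three or more of the same word character in a row (pattern from the answer above)
    private static readonly Regex Excess = new Regex(@"(\w)\1{2,}");

    // Validation route: report the problem instead of changing the text
    public static bool HasExcessiveRepeats(string comment)
    {
        return Excess.IsMatch(comment);
    }

    // Replacement route: shrink every long run to a double of the captured character
    public static string CollapseRepeats(string comment)
    {
        return Excess.Replace(comment, "$1$1");
    }

    static void Main()
    {
        Console.WriteLine(HasExcessiveRepeats("Wheee, nice book"));  // True ("eee")
        Console.WriteLine(HasExcessiveRepeats("nice book"));         // False
        Console.WriteLine(CollapseRepeats("YOU SUCCCCCKKKKK"));      // YOU SUCCKK
    }
}
```

The validation method could back a "Too many repeated characters." form error, while the replacement method shows what scrubbing would do to the same input.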
// Distinct non-whitespace characters, joined back into a string
var nonRepeatedChars = new string(myString.Distinct().Where(c => !char.IsWhiteSpace(c)).ToArray());
Regex would be overkill.
Try this:
public static string RemoveRepeatedChars(string input, int maxRepeat)
{
    if (input.Length == 0) return input;
    var b = new StringBuilder();
    char lastChar = input[0];
    int repeat = 1;            // length of the current run, including lastChar
    b.Append(lastChar);        // the first character is always kept
    for (int i = 1; i < input.Length; i++)
    {
        if (input[i] == lastChar)
        {
            // Same run: keep the character only while the run is within the limit
            if (++repeat <= maxRepeat)
                b.Append(input[i]);
        }
        else
        {
            // A new run starts here
            b.Append(input[i]);
            repeat = 1;
            lastChar = input[i];
        }
    }
    return b.ToString();
}
Distinct() will remove all duplicates, however it will not see "A" and "a" as the same, obviously.
Console.WriteLine(new string("Asdfasdf".Distinct().ToArray()));
Outputs "Asdfa"
var test = "OMMMMMGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGMMM";
test.Distinct().Select(c => c.ToString()).ToList()
    .ForEach(c =>
    {
        while (test.Contains(c + c))
            test = test.Replace(c + c, c);
    }
);
Edit : awful suggestion, please don't read, I truly deserve my -1 :)
I found here on technical nuggets something like what you're looking for.
There's nothing to do except a very long regex, because I've never heard about a regex sign for repetition ...
It's a total example, I won't paste it here but I think this will totally answer your question.
This is an extra exercise from our university course, in which we need to find out whether a sentence contains a palindrome. Checking whether a single word is a palindrome is fairly easy, but the given sentence might look like this: "Dog cat - kajak house". My plan is to use functions I have already written: first determine whether each character is a letter and delete it if it is not, then count the number of spaces + 1 to find out how many words the sentence has, build an array of those words, and call a palindrome-checking function on every element of the array. However, a double space would break the counting step. I've spent around an hour fiddling with the code, but I can't wrap my head around it. Could anyone help me? Note that I'm not supposed to use any external methods or libraries. I've done this using RegEx and it was fairly easy, but I'd like to do it "legally". Thanks in advance!
Just split on spaces; the trick is to remove the empty entries. Look up StringSplitOptions.RemoveEmptyEntries,
then join the words back together with one clean space.
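The split-then-join idea could look roughly like this. SpaceCleaner and CollapseSpaces are hypothetical names chosen for the sketch; Split, StringSplitOptions.RemoveEmptyEntries and string.Join are the standard .NET calls the answer refers to.

```csharp
using System;

static class SpaceCleaner
{
    public static string CollapseSpaces(string sentence)
    {
        // Splitting on ' ' yields an empty entry between consecutive spaces;
        // RemoveEmptyEntries drops those, leaving only the words.
        string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    static void Main()
    {
        Console.WriteLine(CollapseSpaces("Dog cat   kajak  house"));  // Dog cat kajak house
    }
}
```

If only the word array is needed for the palindrome check, the Join step can be skipped.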
You could copy all the characters you are interested in to a new string:
var copy = new StringBuilder();
// Indicates whether the previous character was whitespace
var whitespace = true;
foreach (var character in originalString)
{
    if (char.IsLetter(character))
    {
        copy.Append(character);
        whitespace = false;
    }
    else if (char.IsWhiteSpace(character) && !whitespace)
    {
        copy.Append(character);
        whitespace = true;
    }
    else
    {
        // Ignore other characters
    }
}
If you're looking for the most rudimentary/olde-worlde option, this will work. But no-one would actually do this, other than in an illustrative manner.
string test = "Dog cat  kajak house"; // Your string minus any non-letters (note the double space)
int prevLen;
do
{
    prevLen = test.Length;
    test = test.Replace("  ", " "); // Repeat-replace double spaces until none left
} while (prevLen > test.Length);
Personally, I'd probably do the following to end up with an array of words to check:
string[] words = test.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
I am trying to see if my string starts with a string in an array of strings I've created. Here is my code:
string x = "Table a";
string y = "a table";
string[] arr = { "table", "chair", "plate" };
if (arr.Contains(x.ToLower())) {
    // this should be true
}
if (arr.Contains(y.ToLower())) {
    // this should be false
}
How can I make it so my if statement comes up true? I'd like to just match the beginning of string x against the contents of the array while ignoring the case and the following characters. I thought I needed regex to do this, but I could be mistaken. I'm a bit of a newbie with regex.
It seems you want to check if your string contains an element from your list, so this should be what you are looking for:
if (arr.Any(c => x.ToLower().Contains(c)))
Or simpler:
if (arr.Any(x.ToLower().Contains))
Or based on your comments you may use this:
if (arr.Any(x.ToLower().Split(' ')[0].Contains))
Because you said you want regex:
you can set up a regex with var regex = new Regex("(table|plate|fork)");
and check it with if (regex.IsMatch(myString)) { ... }
But for the issue at hand you don't have to use Regex, as you are searching for an exact substring. You can use
(as @S.Akbari mentioned): if (arr.Any(c => x.ToLower().Contains(c))) { ... }
Enumerable.Contains matches exact values (and there is no built-in comparer that checks for "starts with"), so you need Any, which takes a predicate that receives each array element as its parameter and performs the check. The first step is to turn "contains" the other way around, so that the given string is checked for containing an element of the array:
var myString = "some string";
if (arr.Any(arrayItem => myString.Contains(arrayItem)))...
Now, you are actually asking for "string starts with a given word" and not just "contains", so you need StartsWith (which conveniently lets you specify case sensitivity, unlike Contains; see "Case insensitive 'Contains(string)'"):
if (arr.Any(arrayItem => myString.StartsWith(
arrayItem, StringComparison.CurrentCultureIgnoreCase))) ...
Note that this code will accept "tableAAA bob". If you really need to break on a word boundary, a regular expression may be the better choice. Building regular expressions dynamically is trivial as long as you properly escape all the values.
The regex should be:
beginning of string - ^
properly escaped word you are searching for - see "Escape Special Character in Regex"
word break - \b
if (arr.Any(arrayItem => Regex.IsMatch(myString,
    String.Format(@"^{0}\b", Regex.Escape(arrayItem)),
    RegexOptions.IgnoreCase))) ...
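Put together as a complete program, the word-boundary check might look like the sketch below. PrefixMatcher and StartsWithAnyWord are illustrative names, not part of any library, and the sample array is the one from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PrefixMatcher
{
    public static bool StartsWithAnyWord(string text, string[] words)
    {
        // ^ anchors at the start, Regex.Escape neutralises special characters,
        // \b requires the word to end right after the array item.
        return words.Any(word => Regex.IsMatch(
            text,
            "^" + Regex.Escape(word) + @"\b",
            RegexOptions.IgnoreCase));
    }

    static void Main()
    {
        string[] arr = { "table", "chair", "plate" };
        Console.WriteLine(StartsWithAnyWord("Table a", arr));       // True
        Console.WriteLine(StartsWithAnyWord("a table", arr));       // False
        Console.WriteLine(StartsWithAnyWord("tableAAA bob", arr));  // False
    }
}
```

One caveat of this sketch: \b only behaves as a "word end" when the array item itself ends in a word character, which holds for the items shown here.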
You can do something like the following in TypeScript. Instead of startsWith, you can also use includes, an equality check, etc.
public namesList: Array<string> = ['name1','name2','name3','name4','name5'];
// SomeString = 'name1, Hello there';
private isNamePresent(SomeString: string): boolean {
    if (this.namesList.find(name => SomeString.startsWith(name)))
        return true;
    return false;
}
I think I understand what you are trying to say here, although there is still some ambiguity. Are you trying to see if one word in your string (which is a sentence) exists in your array?
@Amy is correct, this might not have to do with Regex at all.
I think this segment of code will do what you want in Java (which can easily be translated to C#):
Java:
x = x.toLowerCase();
String[] words = x.split("\\s+");
// Compare every word of the sentence with every element of the array
for (String word : words) {
    for (String element : arr) {
        if (element.equals(word)) {
            return true;
        }
    }
}
return false;
You can also use a Set to store the elements in your array, which can make look up more efficient.
Java:
// needs java.util.Arrays, java.util.HashSet and java.util.Set
x = x.toLowerCase();
String[] words = x.split("\\s+");
Set<String> set = new HashSet<>(Arrays.asList(arr));
for (String word : words) {
    if (set.contains(word)) {
        return true;
    }
}
return false;
Edit: (12/22, 11:05am)
I rewrote my solution in C#, thanks to reminders by @Amy and @JohnyL. Since the author only wants to match the first word of the string, this edited code should work :)
C#:
static bool Contains(string x, string[] arr)
{
    // Only the first word of the lower-cased input is checked
    string[] words = x.ToLower().Split(' ');
    var set = new HashSet<string>(arr);
    return set.Contains(words[0]);
}
Sorry my question was so vague but here is the solution thanks to some help from a few people that answered.
var regex = new Regex("^(table|chair|plate) *.*");
if (regex.IsMatch(x.ToLower())){}
I've looked around this site for a good PO Box regex and didn't find any that I liked or worked consistently, so I tried my hand at making my own... I feel pretty good about it, but I'm sure the kind folks here on SO can poke some holes in it :) So... what problems do you see with this and what false-positives/false-negatives can you think up that would get through?
One caveat that I can see is that the PO Box pattern has to be at the start of the string, but what else is wrong with it?
public bool AddressContainsPOB(string Addr)
{
    string input = Addr.Trim().ToLower();
    bool Result = false;
    Regex regexObj1 = new Regex(@"^p(ost){0,1}(\.){0,1}(\s){0,2}o(ffice){0,1}(\.){0,1}((\s){1}|b{1}|[1-9]{1})");
    Regex regexObj2 = new Regex(@"^pob((\s){1}|[0-9]{1})");
    Regex regexObj3 = new Regex(@"^box((\s){1}|[0-9]{1})");
    Match match1 = regexObj1.Match(input);
    if (match1.Success)
    { Result = true; }
    Match match2 = regexObj2.Match(input);
    if (match2.Success)
    { Result = true; }
    Match match3 = regexObj3.Match(input);
    if (match3.Success)
    { Result = true; }
    return Result;
}
What do you expect from us? You don't even give us valid/invalid strings. Have you tested your regexes somehow?
What I see at first glance, without knowing anything about the valid input:
You wrote: "One caveat that I can see is that the PO Box pattern has to be at the start of the string."
Do you want to match it only at the start of the string or not? You need to decide that and define it in your pattern. If you don't, remove the start-of-string anchor ^ and replace it with a word boundary \b.
{1} is superfluous; you can just remove it.
For {0,1} there is the short form ?, which I like better because it is shorter.
^box((\s){1}|[0-9]{1}) matches "box" followed by either a whitespace character or a digit. Is this really what you want to match?
(\.) in the first regex: why do you group a single dot?
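Applying only the simplifications listed above ({1} removed, {0,1} written as ?, single characters ungrouped) gives something like the following sketch. It keeps the start-of-string anchor and the asker's three patterns otherwise unchanged, so it is meant to behave the same as the original, not to be a better PO Box detector. The class name PoBoxCheck is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class PoBoxCheck
{
    // Same three patterns as in the question, with the redundant quantifiers and groups removed
    private static readonly Regex[] Patterns =
    {
        new Regex(@"^p(ost)?\.?\s{0,2}o(ffice)?\.?(\s|b|[1-9])"),
        new Regex(@"^pob(\s|[0-9])"),
        new Regex(@"^box(\s|[0-9])")
    };

    public static bool AddressContainsPOB(string addr)
    {
        string input = addr.Trim().ToLower();
        foreach (Regex pattern in Patterns)
        {
            if (pattern.IsMatch(input)) return true;
        }
        return false;
    }

    static void Main()
    {
        Console.WriteLine(AddressContainsPOB("P.O. Box 123"));       // True
        Console.WriteLine(AddressContainsPOB("Post Office Box 5"));  // True
        Console.WriteLine(AddressContainsPOB("123 Main St"));        // False
    }
}
```

Whether "box" followed by a digit should count, and whether the anchor should become \b, are still open questions that only real sample addresses can answer.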
I am hopeless with regex (C#), so I would appreciate some help:
Basically, I need to parse a text and find the following information inside it:
Sample text:
KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.
I need to find the word(s) after a certain keyword which may end with a “:”.
[UPDATE]
Thanks Andrew and Alan. Sorry for reopening the question, but there is quite an important thing missing in that regex. As I wrote in my last comment: is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?
Or I could have a different regex for each keyword (there will only be a handful). But I still don't know how to have the "words to look for" as a constant inside the regex.
The basic regex is this:
var pattern = @"KeywordB:\s*(\w*)";
\s* = any number of spaces
\w* = 0 or more word characters (non-space, basically)
() = make a group, so you can extract the part that matched
var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
    Console.Write("Value found = {0}", match.Groups[1]);
}
If you have more than one of these on a line, you can use this:
var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches) {
    Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}
Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.
(FYI) The "@" before a string means that \ no longer has a special meaning, so you can type @"c:\fun.txt" instead of "c:\\fun.txt".
Let me know if I should delete the old post, but perhaps someone wants to read it.
The way to do a "words to look for" inside the regex is like this:
regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"
What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3: Something";
FindKeys(keys, source);
private void FindKeys(IEnumerable<string> keywords, string source) {
    var found = new Dictionary<string, string>(10);
    var keys = string.Join("|", keywords.ToArray());
    var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
        RegexOptions.IgnoreCase);
    foreach (Match m in matches) {
        var key = m.Groups["key"].ToString();
        // The value runs from the end of this key to the start of the next one
        var start = m.Index + m.Length;
        var nx = m.NextMatch();
        var end = (nx.Success ? nx.Index : source.Length);
        found.Add(key, source.Substring(start, end - start));
    }
    foreach (var n in found) {
        Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
    }
}
And the output from this is:
Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test
Key=Key3, Value= Something
/KeywordB\: (\w+)/
This matches the word that comes right after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.
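For the "how many words to look for" part of the updated question, one possible approach is to build the pattern from the keyword and a word count. The sketch below is only an assumption about what is wanted: KeywordReader, WordsAfter and the wordCount parameter are hypothetical names, the colon is treated as optional, and "word" means a run of \w characters.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeywordReader
{
    // Returns up to wordCount words that follow the keyword, or null when nothing matches.
    public static string WordsAfter(string text, string keyword, int wordCount)
    {
        if (wordCount < 1) throw new ArgumentOutOfRangeException("wordCount");

        // keyword, optional ':', optional spaces, then one word plus up to wordCount-1 more
        string pattern = Regex.Escape(keyword)
            + @":?\s*(\w+(?:\s+\w+){0," + (wordCount - 1) + "})";
        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        string text = "KeywordB: Text ToFindB and then some more text.";
        Console.WriteLine(WordsAfter(text, "KeywordB", 1));  // Text
        Console.WriteLine(WordsAfter(text, "KeywordB", 2));  // Text ToFindB
    }
}
```

A per-keyword word count could then live in a small dictionary and be passed in, instead of maintaining a separate hand-written regex for each keyword.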
I have two strings: the first's value is "catdog" and the second's is "got".
I'm trying to find a regex that tells me if the letters for "got" are in "catdog". I'm particularly looking to avoid the case where there are duplicate letters. For example, I know "got" is a match, however "gott" is not a match because there are not two "t" in "catdog".
EDIT:
Based on Adam's response below this is the C# code I got to work in my solution. Thanks to all those that responded.
Note: I had to convert the char to int and subtract 97 to get the appropriate index for the array. In my case the letters are always lower case.
private bool CompareParts(string a, string b)
{
    int[] count1 = new int[26];
    int[] count2 = new int[26];
    foreach (var item in a.ToCharArray())
        count1[(int)item - 97]++;
    foreach (var item in b.ToCharArray())
        count2[(int)item - 97]++;
    for (int i = 0; i < count1.Length; i++)
        if (count2[i] > count1[i])
            return false;
    return true;
}
You're using the wrong tool for the job. This is not something regular expressions are capable of handling easily. Fortunately, it's relatively easy to do this without regular expressions. You just count up the number of occurrences of each letter within both strings, and compare the counts between the two strings - if for each letter of the alphabet, the count in the first string is at least as large as the count in the second string, then your criteria are satisfied. Since you didn't specify a language, here's an answer in pseudocode that should be easily translatable into your language:
bool containsParts(string1, string2)
{
    count1 = array of 26 0's
    count2 = array of 26 0's
    // Note: be sure to check for and ignore non-alphabetic characters,
    // and do case conversion if you want to do it case-insensitively
    for each character c in string1:
        count1[c]++
    for each character c in string2:
        count2[c]++
    for each character c in 'a'...'z':
        if count1[c] < count2[c]:
            return false
    return true
}
Previous answers have already suggested that regex isn't the best way to do this, and I agree. However, your accepted answer is a little verbose considering what you're trying to achieve, which is to test whether one set of letters is a subset of another set of letters.
Consider the following code which achieves this in a single line of code:
MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
Which can be used as follows:
public bool IsSubSetOf(string InputString, string MatchString)
{
    var InputChars = InputString.ToList();
    // Each character of MatchString removes at most one occurrence from InputChars
    MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
    return InputChars.Count == 0;
}
You can then just call this method to verify if it's a subset or not.
What is interesting here is that "got" will return a list with no items because each item in the match string only appears once, but "gott" will return a list with a single item because there would only be a single call to remove the "t" from the list. Consequently you would have an item left in the list. That is, "gott" is not a subset of "catdog" but "got" is.
You could take it one step further and put the method into a static class:
using System;
using System.Linq;
using System.Runtime.CompilerServices;
static class extensions
{
    public static bool IsSubSetOf(this string InputString, string MatchString)
    {
        var InputChars = InputString.ToList();
        MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
        return InputChars.Count == 0;
    }
}
which makes your method into an extension of the string object. This actually makes things a lot easier in the long run, because you can now make your calls like so:
Console.WriteLine("gott".IsSubSetOf("catdog"));
I don't think there is a sane way to do this with regular expressions. The insane way would be to write out all the permutations:
/^(c?a?t?d?o?g?|c?a?t?d?g?o?| ... )$/
Now, with a little trickery you could do this with a few regexps (example in Perl, untested):
$foo = 'got';
$foo =~ s/c//;
$foo =~ s/a//;
...
$foo =~ s/d//;
# if $foo is now empty, it passes the test.
Sane people would use a loop, of course:
$foo = 'got';
foreach $l (split(//, 'catdog')) {
    $foo =~ s/$l//;
}
# if $foo is now empty, it passes the test.
There are much better performing ways to pull this off, of course, but they don't use regexps. And there are no doubt ways to do it if e.g., you can use Perl's extended regexp features like embedded code.
You want a string that matches exactly those letters, exactly once. It depends what you're writing the regex in, but it's going to be something like
^[^got]*(g|o|t)[^got]$
If you've got an operator for "exactly one match" that will help.
Charlie Martin almost has it right, but you have to do a complete pass for each letter. You can do that with a single regex by using lookaheads for all but the last pass:
/^
(?=[^g]*g[^g]*$)
(?=[^o]*o[^o]*$)
[^t]*t[^t]*
$/x
This makes a nice exercise for honing your regex skills, but if I had to do this in real-life, I wouldn't do it this way. A non-regex approach will require a lot more typing, but any minimally competent programmer will be able to understand and maintain it. If you use a regex, that hypothetical maintainer will also have to be more-than-minimally competent at regexes.
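As a small illustration of the one-pass-per-letter idea in C#, the sketch below uses a separate negated class for each letter ([^g], [^o], [^t]), so that each pass only constrains its own letter. That per-letter form is an assumption about the intent, and LookaheadDemo and EachLetterOnce are made-up names. The pattern accepts a string only if it contains each of g, o and t exactly once.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // One pass per letter: two lookaheads plus the consuming pass for 't'.
    // IgnorePatternWhitespace plays the role of the /x flag.
    private static readonly Regex EachLetterOnce = new Regex(@"^
        (?=[^g]*g[^g]*$)
        (?=[^o]*o[^o]*$)
        [^t]*t[^t]*
        $", RegexOptions.IgnorePatternWhitespace);

    public static bool HasEachOnce(string input)
    {
        return EachLetterOnce.IsMatch(input);
    }

    static void Main()
    {
        Console.WriteLine(HasEachOnce("catdog"));   // True
        Console.WriteLine(HasEachOnce("catdogt"));  // False (two 't')
        Console.WriteLine(HasEachOnce("cad"));      // False (no 'g', 'o' or 't')
    }
}
```

Because the test is "exactly once", a search space with a spare letter (two "t"s, say) is rejected as well, which is stricter than the original question requires.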
@Adam Rosenfield's solution in Python:
from collections import defaultdict

def count(iterable):
    c = defaultdict(int)
    for hashable in iterable:
        c[hashable] += 1
    return c

def can_spell(word, astring):
    """Whether `word` can be spelled using `astring`'s characters."""
    count_string = count(astring)
    count_word = count(word)
    return all(count_string[c] >= count_word[c] for c in word)
The best way to do it with regular expressions is, IMO:
A. Sort the characters in the large string (the search space).
   Thus: turn "catdog" into "acdgot".
B. Do the same with the string whose characters you are searching for: "gott" becomes, eh, "gott"...
C. Insert ".*" between each of these characters.
D. Use the latter as the regular expression to search in the former.
For example, some Perl code (if you don't mind):
$main = "catdog"; $search = "gott";
# break into individual characters, sort, and reconcatenate
$main = join '', sort split //, $main;
$regexp = join ".*", sort split //, $search;
print "Debug info: search in '$main' for /$regexp/ \n";
if($main =~ /$regexp/) {
    print "Found a match!\n";
} else {
    print "Sorry, no match...\n";
}
This prints:
Debug info: search in 'acdgot' for /g.*o.*t.*t/
Sorry, no match...
Drop one "t" and you get a match.
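Since the original question ended up in C#, here is a rough port of the same sort-and-join idea. SortedSearch and ContainsLetters are illustrative names. Each character is passed through Regex.Escape so that input containing regex metacharacters does not break the generated pattern, which the Perl version does not do.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SortedSearch
{
    public static bool ContainsLetters(string main, string search)
    {
        // Sort both strings with the same (ordinal) character ordering
        string sortedMain = new string(main.OrderBy(c => c).ToArray());

        // "gott" -> "g.*o.*t.*t"
        string pattern = string.Join(".*",
            search.OrderBy(c => c).Select(c => Regex.Escape(c.ToString())));

        return Regex.IsMatch(sortedMain, pattern, RegexOptions.Singleline);
    }

    static void Main()
    {
        Console.WriteLine(ContainsLetters("catdog", "gott"));  // False
        Console.WriteLine(ContainsLetters("catdog", "got"));   // True
    }
}
```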