regex match partial or whole word - c#

I am trying to figure out a regular expression which can match either the whole word or a (predefined in length, e.g first 4 chars) part of the word.
For example, if I am trying to match the word between and my offset is set to 4, then
between betwee betwe betw
are matches, but not the
bet betweenx bet12 betw123 beta
I have created an example in regex101, where I am trying (with no luck) a combination of positive lookahead (?=) and a non-word boundary \B.
I found a similar question which proposes a workaround in its accepted answer. As I understand it, the workaround overrides the matcher so that it tries all the possible regular expressions derived from the word and an offset.
My code has to be written in C#, so I am trying to convert the aforementioned code. As I see Regex.Replace (and I assume Regex.Match also) can accept delegates to override the default functionality, but I can not make it work.

You could take the first 4 characters, and make the remaining ones optional.
Then wrap these in word boundaries and parentheses.
So in the case of "between", it would be
@"\b(betw)(?:(e|ee|een)?)\b"
The code to achieve that would be:
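// requires: using System.Linq; using System.Text.RegularExpressions;
public string createRegex(string word, int count)
{
    var mandatory = Regex.Escape(word.Substring(0, count));
    // One alternative per possible tail length: e, ee, een for ("between", 4).
    // The range must cover word.Length - count tails, not count - 1.
    var optional = "(" + String.Join("|", Enumerable.Range(1, word.Length - count).Select(i => Regex.Escape(word.Substring(count, i)))) + ")?";
    var regex = @"\b(" + mandatory + ")(?:" + optional + @")\b";
    return regex;
}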

The code in the answer you mentioned simply builds up this:
betw|betwe|betwee|between
So all you need is a function that builds a string from every prefix of the given word that has at least the given minimum length.
static String BuildRegex(String word, int min_len)
{
String toReturn = "";
for(int i = 0; i < word.Length - min_len +1; i++)
{
toReturn += word.Substring(0, min_len+i);
toReturn += "|";
}
toReturn = toReturn.Substring(0, toReturn.Length-1);
return toReturn;
}
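The alternation above has no word boundaries, so on its own it would also match inside longer words such as betweenx. A minimal sketch of one way to handle that, assuming the goal is the exact behaviour from the question: wrap the alternatives in \b and list the longest prefix first. The class name PartialWordPattern and its methods are hypothetical and not part of any answer here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any answer on this page.
public static class PartialWordPattern
{
    // For ("between", 4) this returns \b(?:between|betwee|betwe|betw)\b
    public static string Build(string word, int minLen)
    {
        if (minLen < 1 || minLen > word.Length)
            throw new ArgumentOutOfRangeException(nameof(minLen));

        // Prefix lengths minLen..word.Length, longest first.
        var prefixes = Enumerable.Range(minLen, word.Length - minLen + 1)
            .Reverse()
            .Select(len => Regex.Escape(word.Substring(0, len)));

        return @"\b(?:" + string.Join("|", prefixes) + @")\b";
    }

    // True when the candidate contains the word or an allowed prefix as a whole word.
    public static bool IsMatch(string candidate, string word, int minLen)
    {
        return Regex.IsMatch(candidate, Build(word, minLen));
    }
}
```

With this sketch, PartialWordPattern.IsMatch("betwe", "between", 4) is true, while betweenx, bet12, betw123 and beta are rejected because the trailing \b cannot be satisfied.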

You can use this regex
\b(bet(?:[^\s]){1,4})\b
And replace bet and the 4 dynamically like this:
public static string CreateRegex(string word, int minLen)
{
string token = word.Substring(0, minLen - 1);
string pattern = @"\b(" + token + @"(?:[^\s]){1," + minLen + @"})\b";
return pattern;
}
Here's a demo: https://regex101.com/r/lH0oL2/1
EDIT: as for the bet1122 match, you can edit the pattern this way:
\b(bet(?:[^\s0-9]){1,4})\b
If you don't want to match some characters, just add them to the negated [^...] character class.
Demo: https://regex101.com/r/lH0oL2/2
For more info, see http://www.regular-expressions.info/charclass.html

Related

Extract specific number from string with fixed pattern in C#

This might sound like a very basic question, but it's one that's given me quite a lot of trouble in C#.
Assume I have, for example, the following Strings known as my chosenTarget.titles:
2008/SD128934 - Wordz aaaaand more words (1233-26-21)
20998/AD1234 - Wordz and less words (1263-21-21)
208/ASD12345 - Wordz and more words (1833-21-21)
Now as you can see, all three Strings are different in some ways.
What I need is to extract a very specific part of these Strings, but getting the subtleties right is what confuses me, and I was wondering if some of you knew better than I.
What I know is that the Strings will always come in the following pattern:
yearNumber + "/" + aFewLetters + theDesiredNumber + " - " + descriptiveText + " (" + someDate + ")"
In the above example, what I would want to return to me would be:
128934
1234
12345
I need to extract theDesiredNumber.
Now, I'm not (that) lazy so I have made a few attempts myself:
var a = chosenTarget.title.Substring(chosenTarget.title.IndexOf("/") + 1, chosenTarget.title.Length - chosenTarget.title.IndexOf("/"));
What this has done is sliced out yearNumber and the /, leaving me with aFewLetters before theDesiredNumber.
I have a hard time properly removing the rest however, and I was wondering if any of you could aid me in the matter?
It sounds as if you only need to extract the number behind the first / which ends at -. You could use a combination of string methods and LINQ:
int startIndex = str.IndexOf("/");
string number = null;
if (startIndex >= 0 )
{
int endIndex = str.IndexOf(" - ", startIndex);
if (endIndex >= 0)
{
startIndex++;
string token = str.Substring(startIndex, endIndex - startIndex); // SD128934
number = String.Concat(token.Where(char.IsDigit)); // 128934
}
}
Another mainly LINQ approach using String.Split:
number = String.Concat(
str.Split(new[] { " - " }, StringSplitOptions.None)[0]
.Split('/')
.Last()
.Where(char.IsDigit));
Try this:
int indexSlash = chosenTarget.title.IndexOf("/");
int indexDash = chosenTarget.title.IndexOf("-");
// "out" is a reserved keyword in C#, so the variable needs another name
string digits = new string(chosenTarget.title.Substring(indexSlash,indexDash-indexSlash).Where(c => Char.IsDigit(c)).ToArray());
You can use a regex:
var pattern = @"(?<=[0-9]+/[A-Za-z]+)[0-9]+"; // digits preceded by "year/letters"
var matcher = new Regex(pattern);
var result = matcher.Matches(yourEntireSetOfLinesInAString); // each Match.Value is one number
Or you can loop over the lines and use Match instead of Matches. In that case, build the "matcher" once outside the loop rather than in every iteration.
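The per-line variant could be sketched as follows. This is only an illustration: the class name TitleNumbers, the method Extract and the capture-group pattern are assumptions of mine, chosen to fit the title format described in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any answer on this page.
public static class TitleNumbers
{
    // Built once and reused for every line.
    // Shape assumed from the question: digits, "/", letters, then the wanted digits.
    private static readonly Regex NumberPattern =
        new Regex(@"^\d+/[A-Za-z]+(\d+)");

    public static List<string> Extract(IEnumerable<string> titles)
    {
        var numbers = new List<string>();
        foreach (var title in titles)
        {
            Match m = NumberPattern.Match(title);
            if (m.Success)
                numbers.Add(m.Groups[1].Value); // group 1 holds theDesiredNumber
        }
        return numbers;
    }
}
```

Titles that do not fit the assumed shape are skipped rather than reported, which may or may not be what you want.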
Regex is your friend:
(new [] {"2008/SD128934 - Wordz aaaaand more words (1233-26-21)",
"20998/AD1234 - Wordz and less words (1263-21-21)",
"208/ASD12345 - Wordz and more words (1833-21-21)"})
.Select(x => new Regex(@"\d+/[A-Z]+(\d+)").Match(x).Groups[1].Value)
The pattern you had recognized is very important, here is the solution:
const string pattern = @"\d+\/[a-zA-Z]+(\d+).*$";
string s1 = @"2008/SD128934 - Wordz aaaaand more words(1233-26-21)";
string s2 = @"20998/AD1234 - Wordz and less words(1263-21-21)";
string s3 = @"208/ASD12345 - Wordz and more words(1833-21-21)";
var strings = new List<string> { s1, s2, s3 };
var desiredNumber = string.Empty;
foreach (var s in strings)
{
var match = Regex.Match(s, pattern);
if (match.Success)
{
desiredNumber = match.Groups[1].Value;
}
}
I would use a RegEx for this, the string you're looking for is in Match.Groups[1]
string composite = "2008/SD128934 - Wordz aaaaand more words (1233-26-21)";
Match m = Regex.Match(composite, @"^\d+\/[a-zA-Z]+(\d+)");
if (m.Success) Console.WriteLine(m.Groups[1]);
The breakdown of the RegEx is as follows
"^\d+\/[a-zA-Z]+(\d+)"
^ - Indicates that it's the beginning of the string
\d+ - One or more digits (the sample years have three to five digits, so a fixed \d{4} would miss some of them)
\/ - /
[a-zA-Z]+ - One or more letters
(\d+) - One or more digits (the parentheses indicate that this part is captured as a group - in this case group 1)

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this and replace each match with an empty string. The negative lookahead checks that the quoted substring does not contain the word dark. See the demo:
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
Something like this would work. You can add all the strings you want to keep to the excludedString array.
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
    var endIndex = test.IndexOf('\'', startIndex + 1);
    if (endIndex < 0)
    {
        break; // unmatched opening quote: nothing left to remove
    }
    var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
    if (!excludedString.Contains(subString.Replace("'", "")))
    {
        // remove the quoted part; startIndex now points at the text that followed it
        test = test.Remove(startIndex, (endIndex - startIndex) + 1);
    }
    else
    {
        // keep it and continue searching after the closing quote
        startIndex = endIndex + 1;
    }
}
Another method uses the regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace each match with $1.
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
Explanation:
(...) is a capturing group.
'[^']*\bdark\b[^']*' matches any single-quoted string that contains the whole word dark. [^']* matches zero or more characters other than '.
('[^']*\bdark\b[^']*') stores that match in group 1, because the pattern sits inside a capturing group.
| is the regex alternation operator.
'[^']*' matches every remaining single-quoted string. It never gets the ones containing dark, because the alternative before | is tried first and has already matched them.
Replacing each match with $1 therefore keeps the quoted strings that contain dark (group 1 holds them) and deletes the others (group 1 is empty for them).
I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); // trim leading and trailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); // trim the trailing space again

string.IndexOf search for whole word match

I am seeking a way to search a string for an exact match or whole word match. RegEx.Match and RegEx.IsMatch don't seem to get me where I want to be. Consider the following scenario:
namespace test
{
class Program
{
static void Main(string[] args)
{
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
int indx = str.IndexOf("TOTAL");
string amount = str.Substring(indx + "TOTAL".Length, 10);
string strAmount = Regex.Replace(amount, "[^.0-9]", "");
Console.WriteLine(strAmount);
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
}
}
The output of the above code is:
// 34.37
// Press any key to continue...
The problem is, I don't want SUBTOTAL, but IndexOf finds the first occurrence of the word TOTAL which is in SUBTOTAL which then yields the incorrect value of 34.37.
So the question is, is there a way to force IndexOf to find only an exact match or is there another way to force that exact whole word match so that I can find the index of that exact match and then perform some useful function with it. RegEx.IsMatch and RegEx.Match are, as far as I can tell, simply boolean searches. In this case, it isn't enough to just know the exact match exists. I need to know where it exists in the string.
Any advice would be appreciated.
You can use Regex
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var indx = Regex.Match(str, @"\WTOTAL\W").Index; // will be 18, the index of the space matched by the leading \W; TOTAL itself starts at 19
My method is faster than the accepted answer because it does not use Regex.
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var indx = str.IndexOfWholeWord("TOTAL");
public static int IndexOfWholeWord(this string str, string word)
{
for (int j = 0; j < str.Length &&
(j = str.IndexOf(word, j, StringComparison.Ordinal)) >= 0; j++)
if ((j == 0 || !char.IsLetterOrDigit(str, j - 1)) &&
(j + word.Length == str.Length || !char.IsLetterOrDigit(str, j + word.Length)))
return j;
return -1;
}
You can use word boundaries, \b, and the Match.Index property:
var text = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var idx = Regex.Match(text, @"\bTOTAL\b").Index;
// => 19
The \bTOTAL\b matches TOTAL when it is not enclosed with any other letters, digits or underscores.
If a word enclosed with underscores should also count as a whole word, use
var idx = Regex.Match(text, @"(?<![^\W_])TOTAL(?![^\W_])").Index;
where (?<![^\W_]) is a negative lookbehind that fails the match if a letter or digit sits immediately to the left of the current location (so the left side must be the start of the string or a character that is neither a letter nor a digit), and (?![^\W_]) is the matching negative lookahead, which succeeds only at the end of the string or before a character that is neither a letter nor a digit.
If the boundaries are whitespace or the start/end of the string, use
var idx = Regex.Match(text, @"(?<!\S)TOTAL(?!\S)").Index;
where (?<!\S) requires the start of the string or a whitespace character immediately on the left, and (?!\S) requires the end of the string or a whitespace character on the right.
NOTE: \b, (?<!...) and (?!...) are non-consuming patterns, that is the regex index does not advance when matching these patterns, thus, you get the exact positions of the word you search for.
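Since the original goal was the amount that follows TOTAL, the word-boundary match and the number can also be captured in one step. The following is a rough sketch under assumptions: the class ReceiptParser and the method AmountAfter are hypothetical names, and the amount is assumed to be a plain decimal number separated from the label by optional whitespace.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any answer on this page.
public static class ReceiptParser
{
    // Returns the amount that follows the whole word `label`, or null when it is absent.
    public static decimal? AmountAfter(string text, string label)
    {
        // \b...\b keeps TOTAL from matching inside SUBTOTAL;
        // group 1 captures digits with an optional fractional part.
        string pattern = @"\b" + Regex.Escape(label) + @"\b\s*(\d+(?:\.\d+)?)";
        Match m = Regex.Match(text, pattern);
        if (!m.Success)
            return null;
        return decimal.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
    }
}
```

For the string from the question, AmountAfter(str, "TOTAL") yields 37.43 and AmountAfter(str, "SUBTOTAL") yields 34.37, without any Substring arithmetic.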
To make the accepted answer a little safer, so that it returns -1 when there is no match (as IndexOf does) and treats the search text literally:
string pattern = String.Format(@"\b{0}\b", Regex.Escape(findTxt));
Match mtc = Regex.Match(queryTxt, pattern);
if (mtc.Success)
{
return mtc.Index;
}
else
return -1;
While this may be a hack that works only for your example, try searching for TOTAL with a leading space:
int indx = str.IndexOf(" TOTAL");
string amount = str.Substring(indx + " TOTAL".Length); // rest of the string; a fixed length of 10 would run past the end here
As " TOTAL" does not occur inside SUBTOTAL, the search skips the word you don't want and finds only the isolated TOTAL.
I'd recommend the Regex solution from L.B. too, but if you can't use Regex, then you could use String.LastIndexOf("TOTAL"). Assuming the TOTAL always comes after SUBTOTAL?
http://msdn.microsoft.com/en-us/library/system.string.lastindexof(v=vs.110).aspx

c# Regex question

I have a problem dealing with the @ symbol in Regex. I am trying to remove @sometext from a text string, but I can't find an example anywhere that uses @ as a literal. I have tried it myself, but my code doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "@text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurrences of "@sometext" from the string test with
Regex.Replace(test, "@sometext", "")
or, for any word starting with "@", with
Regex.Replace(test, "@\\w+", "")
If you need it to be a separate word (i.e. nothing like @comp within tom@comp.com), you can precede the regex with an explicit boundary check (\b does not work here, because @ is not a word character) and put the captured boundary character back with $1:
Regex.Replace(test, "(^|\\W)@\\w+", "$1")
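As an alternative sketch, assuming the symbol in question is the at sign (@), a negative lookbehind does the same boundary check without consuming the preceding character, so nothing has to be put back. The class MentionStripper and its method Strip are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any answer on this page.
public static class MentionStripper
{
    // Removes @word tokens unless the @ directly follows a word character,
    // which leaves e-mail addresses such as tom@comp.com untouched.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"(?<!\w)@\w+", "");
    }
}
```

Note that this removes only the token itself; the whitespace around it stays, so "ping @bob now" becomes "ping  now" with two spaces.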
You can use:
^\s@([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove @something from this string: I want to remove @something from this string.
var regex = new Regex("@\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B@\w+
This will match any string starting with an @ character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
@"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
@"\B@\w+",
@"redacted");
is the following string:
redacted redacted this2@3that redacted redacted@beta@gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
@"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
@"(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}",
@"$1redacted",
RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.

c# Best way to break up a long string

This question is not related to:
Best way to break long strings in C# source code
Which is about source, this is about processing long outputs. If someone enters:
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
As a comment, it breaks the container and makes the entire page really wide. Is there any clever regexp that can say, define a maximum word length of 20 chars and then force a whitespace character?
Thanks for any help!
There's probably no need to involve regexes in something this simple. Take this extension method:
public static string Abbreviate(this string text, int length) {
if (text.Length <= length) {
return text;
}
char[] delimiters = new char[] { ' ', '.', ',', ':', ';' };
int index = text.LastIndexOfAny(delimiters, length - 3);
if (index > (length / 2)) {
return text.Substring(0, index) + "...";
}
else {
return text.Substring(0, length - 3) + "...";
}
}
If the string is short enough, it's returned as-is. Otherwise, if a "word boundary" is found in the second half of the string, it's "gracefully" cut off at that point. If not, it's cut off the hard way at just under the desired length.
If the string is cut off at all, an ellipsis ("...") is appended to it.
If you expect the string to contain non-natural-language constructs (such as URLs), you'd need to tweak this to ensure nice behavior in all circumstances. In that case working with a regex might be better.
You could try using a regular expression that uses a positive look-ahead like this:
string outputStr = Regex.Replace(inputStr, @"([\S]{20}(?=\S+))", "$1\n");
This should "insert" a line break into all words that are longer than 20 characters.
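If a space is preferred over a line break and the limit should be configurable, the same look-ahead idea can be wrapped in a small helper. This is a sketch only; the class LongWordBreaker and the method Break are hypothetical names, and inserting a plain space is an assumption about what the container needs.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any answer on this page.
public static class LongWordBreaker
{
    // Inserts a space after every run of maxLen non-whitespace characters
    // that is directly followed by another non-whitespace character.
    public static string Break(string input, int maxLen)
    {
        if (maxLen < 1)
            throw new ArgumentOutOfRangeException(nameof(maxLen));

        string pattern = @"\S{" + maxLen + @"}(?=\S)";
        return Regex.Replace(input, pattern, "$0 "); // $0 is the whole match
    }
}
```

A word of exactly maxLen characters is left alone, because the look-ahead requires at least one more non-whitespace character after the run.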
Yes, you can use this regex:
string pattern = @"^([\w]{1,20})$";
It accepts input of at most 20 characters:
string strRegex = @"^([\w]{1,20})$";
string strTargetString = @"asdfasfasfasdffffff";
if(Regex.IsMatch(strTargetString, strRegex))
{
//do something
}
If you only need the length constraint, use this regex instead:
^(.{1,20})$
because \w matches only alphanumeric characters and the underscore.
